Methodology
How we calculate scores, weight benchmarks, and measure perception.
Score Calculation
Each model's composite score is a weighted average of benchmark scores and user perception. Weights vary by task to reflect what matters most for each use case.
Composite Score = Σ(benchmark_score × weight) / Σ(weights)
Only benchmarks with available data contribute to a model's score. Missing benchmarks are excluded from both the numerator and the denominator, so a model is never penalized for a benchmark it has no published result on.
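As a minimal sketch of this rule (the `composite_score` helper and the example numbers are illustrative, not NerfWatch's actual code), missing benchmarks simply drop out of the weighted average:

```python
def composite_score(scores: dict[str, float], weights: dict[str, float]) -> float | None:
    """Weighted average over the benchmarks that have data.

    `scores` maps benchmark name -> normalized 0-100 score; benchmarks with
    no published result are simply absent from the dict. Absent benchmarks
    drop out of both the numerator and the denominator.
    """
    available = {name: w for name, w in weights.items() if name in scores}
    if not available:
        return None  # no overlapping data: the composite is undefined
    total_weight = sum(available.values())
    return sum(scores[name] * w for name, w in available.items()) / total_weight

# A model missing LiveCodeBench is averaged over the remaining 70% of weight:
coding = {"SWE-Bench Verified": 0.50, "LiveCodeBench": 0.30, "NerfWatch Perception": 0.20}
print(composite_score({"SWE-Bench Verified": 65.0, "NerfWatch Perception": 72.0}, coding))
# (65 * 0.50 + 72 * 0.20) / 0.70 = 67.0
```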
Task-Specific Weights
Different tasks prioritize different benchmarks. Here's how weights are distributed (a code sketch of these tables follows the lists):
Coding
- SWE-Bench Verified: 50%
- LiveCodeBench: 30%
- NerfWatch Perception: 20%
Math
- AIME 2025: 40%
- MATH-500: 40%
- GSM8K: 10%
- NerfWatch Perception: 10%
General Knowledge
- GPQA Diamond: 30%
- MMLU-Pro: 30%
- Humanity's Last Exam: 20%
- ARC-AGI 2: 10%
- NerfWatch Perception: 10%
Conversation
- Chatbot Arena: 50%
- AlpacaEval 2.0: 20%
- MT-Bench: 10%
- NerfWatch Perception: 20%
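For reference, here are the four tables above as one data structure. This is a hypothetical encoding (the keys and constant name are ours, not NerfWatch's config), shown so the renormalization in `composite_score` is concrete:

```python
# The task -> benchmark weight tables above, as one structure.
# Each task's weights sum to 1.0; when a benchmark is missing, it is
# dropped and the remainder renormalized by composite_score above.
TASK_WEIGHTS: dict[str, dict[str, float]] = {
    "coding": {
        "SWE-Bench Verified": 0.50,
        "LiveCodeBench": 0.30,
        "NerfWatch Perception": 0.20,
    },
    "math": {
        "AIME 2025": 0.40,
        "MATH-500": 0.40,
        "GSM8K": 0.10,
        "NerfWatch Perception": 0.10,
    },
    "general_knowledge": {
        "GPQA Diamond": 0.30,
        "MMLU-Pro": 0.30,
        "Humanity's Last Exam": 0.20,
        "ARC-AGI 2": 0.10,
        "NerfWatch Perception": 0.10,
    },
    "conversation": {
        "Chatbot Arena": 0.50,
        "AlpacaEval 2.0": 0.20,
        "MT-Bench": 0.10,
        "NerfWatch Perception": 0.20,
    },
}

# Sanity check: every task's weights sum to 1.0.
assert all(abs(sum(w.values()) - 1.0) < 1e-9 for w in TASK_WEIGHTS.values())
```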
NerfWatch Perception
Our perception score captures how users actually experience AI models in practice. It combines two signals:
User Votes
Direct votes from users rating models as "Sharp," "Normal," or "Nerfed." Each vote is weighted: Sharp = 100, Normal = 50, Nerfed = 0. Only votes from the last 30 days are counted.
Social Signals
Mentions from Reddit and Twitter are analyzed for sentiment. Positive mentions score 100, neutral 50, negative 0. A rolling 30-day window captures recent model updates.
Perception = (Vote Score × 0.6) + (Social Score × 0.4)
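Putting both signals together, a minimal sketch (the helper names are hypothetical, and how models with zero recent votes or mentions are handled is our assumption, not stated policy):

```python
def vote_score(sharp: int, normal: int, nerfed: int) -> float:
    """Average vote value: Sharp=100, Normal=50, Nerfed=0 (last 30 days only)."""
    total = sharp + normal + nerfed
    return (sharp * 100 + normal * 50) / total if total else 50.0  # no-data fallback assumed

def social_score(positive: int, neutral: int, negative: int) -> float:
    """Average sentiment value over a rolling 30-day window of mentions."""
    total = positive + neutral + negative
    return (positive * 100 + neutral * 50) / total if total else 50.0  # no-data fallback assumed

def perception(votes: tuple[int, int, int], mentions: tuple[int, int, int]) -> float:
    """Blend: 60% user votes, 40% social sentiment."""
    return vote_score(*votes) * 0.6 + social_score(*mentions) * 0.4

# Example: 120 Sharp / 60 Normal / 20 Nerfed votes; 30 / 50 / 20 mentions.
print(perception((120, 60, 20), (30, 50, 20)))  # 75.0 * 0.6 + 55.0 * 0.4 = 67.0
```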
Benchmark Sources
All benchmark scores are pulled from official leaderboards and research papers. We update scores regularly and link directly to sources for transparency.
SWE-Bench Verified
Real GitHub issues for coding evaluation. Industry-standard for agent coding ability.
LiveCodeBench
Contamination-free coding benchmark with continuously updated problems.
Chatbot Arena
Human preference voting with Elo ratings. Gold standard for conversation quality.
GPQA Diamond
Graduate-level science questions requiring expert reasoning.
MMLU-Pro
Enhanced MMLU with harder questions and reduced noise.
Humanity's Last Exam
Frontier benchmark with expert-level questions; the current SOTA is 46%.
ARC-AGI 2
Abstract reasoning corpus testing generalization beyond training data.
AIME 2025
American Invitational Mathematics Examination problems.
MATH-500
Competition mathematics problems across difficulty levels.
AlpacaEval 2.0
Automated evaluation of instruction-following ability.
MT-Bench
Multi-turn conversation benchmark with GPT-4 as judge.
Score Normalization
Some benchmarks have different scales or saturated performance levels. We normalize scores to make them comparable:
Standard Benchmarks
Most benchmarks (HumanEval, MMLU, etc.) use raw percentage scores directly. A score of 85% means the model answered 85% correctly.
Frontier Benchmarks
For benchmarks where SOTA is far below 100% (like HLE at 46%), we scale relative to top performance. This prevents frontier benchmarks from artificially dragging down composite scores.
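The page doesn't spell out the scaling formula, so treat this as one plausible reading: divide by the current SOTA so the best-known result maps to 100.

```python
def normalize_frontier(raw: float, sota: float) -> float:
    """Scale a frontier-benchmark score relative to the state of the art.

    E.g. with HLE SOTA at 46%, a raw 23% maps to 50 rather than 23, so the
    benchmark's headroom doesn't drag down composite scores.
    """
    return min(100.0, raw / sota * 100.0)

print(normalize_frontier(23.0, sota=46.0))  # 50.0
```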
Arena Ratings
Chatbot Arena Elo ratings are converted to percentiles based on the current leaderboard distribution, then scaled to 0-100.
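A sketch of that conversion as an empirical percentile over the current leaderboard (the exact distribution handling is an assumption):

```python
def elo_to_percentile(elo: float, leaderboard: list[float]) -> float:
    """Share of current leaderboard Elos at or below this one, scaled to 0-100."""
    if not leaderboard:
        raise ValueError("leaderboard is empty")
    rank = sum(1 for e in leaderboard if e <= elo)
    return rank / len(leaderboard) * 100.0

# Example against a toy five-model leaderboard.
print(elo_to_percentile(1310, [1360, 1340, 1310, 1275, 1240]))  # 60.0
```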
Transparency
Every score on NerfWatch links to its source. Click any benchmark score to see where the data came from. We believe in verifiable metrics over black-box ratings.
- All benchmark data is publicly sourced and linked
- Vote counts are visible when you expand model details
- Social signal counts show volume of mentions analyzed
- Weights are displayed on every task view
- No hidden adjustments or manual overrides