Methodology

How we calculate scores, weight benchmarks, and measure perception.

Score Calculation

Each model's composite score is a weighted average of benchmark scores and user perception. Weights vary by task to reflect what matters most for each use case.

Composite Score = Σ(benchmark_score × weight) / Σ(weights)

Only benchmarks with available data contribute to a model's score. Missing benchmarks are excluded from both the numerator and the denominator, so a model is not penalized for benchmarks it has not been evaluated on.
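As a minimal sketch of the calculation above (the function and dictionary names are hypothetical, not NerfWatch's actual code), with missing benchmarks dropped from both sides of the division:

```python
def composite_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average over benchmarks that have data.

    `scores` maps benchmark name -> score (0-100); `weights` maps
    benchmark name -> weight. Benchmarks absent from `scores` are
    excluded from both numerator and denominator.
    """
    available = [name for name in weights if name in scores]
    if not available:
        raise ValueError("no benchmark data available for this model")
    total_weight = sum(weights[name] for name in available)
    return sum(scores[name] * weights[name] for name in available) / total_weight
```

For example, a coding model with SWE-Bench Verified = 70 and perception = 60 but no LiveCodeBench result would score (70 × 0.5 + 60 × 0.2) / 0.7 ≈ 67.1, rather than being dragged down by the missing benchmark.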

Task-Specific Weights

Different tasks prioritize different benchmarks. Here's how weights are distributed:

Coding

  • SWE-Bench Verified: 50%
  • LiveCodeBench: 30%
  • NerfWatch Perception: 20%

Math

  • AIME 2025: 40%
  • MATH-500: 40%
  • GSM8K: 10%
  • NerfWatch Perception: 10%

General Knowledge

  • GPQA Diamond: 30%
  • MMLU-Pro: 30%
  • Humanity's Last Exam: 20%
  • ARC-AGI 2: 10%
  • NerfWatch Perception: 10%

Conversation

  • Chatbot Arena: 50%
  • AlpacaEval 2.0: 20%
  • MT-Bench: 10%
  • NerfWatch Perception: 20%
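The four weight tables above can be collected into a single mapping. This is an illustrative encoding (the `TASK_WEIGHTS` name and structure are assumptions, not the site's actual source), but it makes the per-task sum-to-one invariant easy to check:

```python
# Hypothetical encoding of the per-task weight tables above.
TASK_WEIGHTS: dict[str, dict[str, float]] = {
    "coding": {
        "SWE-Bench Verified": 0.50,
        "LiveCodeBench": 0.30,
        "NerfWatch Perception": 0.20,
    },
    "math": {
        "AIME 2025": 0.40,
        "MATH-500": 0.40,
        "GSM8K": 0.10,
        "NerfWatch Perception": 0.10,
    },
    "general_knowledge": {
        "GPQA Diamond": 0.30,
        "MMLU-Pro": 0.30,
        "Humanity's Last Exam": 0.20,
        "ARC-AGI 2": 0.10,
        "NerfWatch Perception": 0.10,
    },
    "conversation": {
        "Chatbot Arena": 0.50,
        "AlpacaEval 2.0": 0.20,
        "MT-Bench": 0.10,
        "NerfWatch Perception": 0.20,
    },
}

# Every task's weights should sum to 1.0.
assert all(abs(sum(w.values()) - 1.0) < 1e-9 for w in TASK_WEIGHTS.values())
```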

NerfWatch Perception

Our perception score captures how users actually experience AI models in practice. It combines two signals:

User Votes (60%)

Direct votes from users rating models as "Sharp," "Normal," or "Nerfed." Each vote is weighted: Sharp = 100, Normal = 50, Nerfed = 0. Only votes from the last 30 days are counted.

Social Signals (40%)

Mentions from Reddit and Twitter, analyzed for sentiment. Positive mentions score 100, neutral score 50, negative score 0. A rolling 30-day window captures recent model updates.

Perception = (Vote Score × 0.6) + (Social Score × 0.4)
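A minimal sketch of the perception formula, assuming both inputs have already been filtered to the rolling 30-day window (the function and constant names here are illustrative):

```python
# Per-vote and per-mention point values from the scales above.
VOTE_VALUES = {"Sharp": 100, "Normal": 50, "Nerfed": 0}
SENTIMENT_VALUES = {"positive": 100, "neutral": 50, "negative": 0}

def perception_score(votes: list[str], mentions: list[str]) -> float:
    """Combine 30-day user votes (60%) with social sentiment (40%).

    `votes` holds "Sharp"/"Normal"/"Nerfed" labels; `mentions` holds
    "positive"/"neutral"/"negative" sentiment labels.
    """
    vote_score = sum(VOTE_VALUES[v] for v in votes) / len(votes)
    social_score = sum(SENTIMENT_VALUES[m] for m in mentions) / len(mentions)
    return vote_score * 0.6 + social_score * 0.4
```

For example, votes of [Sharp, Normal, Nerfed, Normal] average to 50, and an even positive/negative sentiment split also averages to 50, giving a perception score of 50.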

Benchmark Sources

All benchmark scores are pulled from official leaderboards and research papers. We update scores regularly and link directly to sources for transparency.

Score Normalization

Some benchmarks have different scales or saturated performance levels. We normalize scores to make them comparable:

Standard Benchmarks

Most benchmarks (HumanEval, MMLU, etc.) use raw percentage scores directly. A score of 85% means the model answered 85% correctly.

Frontier Benchmarks

For benchmarks where state-of-the-art performance is far below 100% (like Humanity's Last Exam, where SOTA is around 46%), we scale scores relative to the top performance on that benchmark. This prevents frontier benchmarks from dragging down composite scores simply because they are hard.
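One plausible linear rescaling for this (the exact transform NerfWatch uses may differ; this sketch just pins the current SOTA to 100):

```python
def scale_to_sota(raw_score: float, sota_score: float) -> float:
    """Rescale a frontier-benchmark score so that matching SOTA maps to 100.

    E.g. with SOTA at 46% on a hard benchmark, a raw 23% becomes 50.
    Capped at 100 in case the stored SOTA value is stale.
    """
    return min(100.0, raw_score / sota_score * 100.0)
```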

Arena Ratings

Chatbot Arena Elo ratings are converted to percentiles based on the current leaderboard distribution, then scaled to 0-100.
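One simple way to implement that conversion is an empirical percentile against the leaderboard's Elo ratings; this is an assumed method for illustration, not necessarily the exact one used:

```python
from bisect import bisect_left

def elo_to_percentile(elo: float, leaderboard_elos: list[float]) -> float:
    """Convert an Arena Elo rating to a 0-100 score via the empirical
    percentile: the share of leaderboard models rated strictly below it."""
    ranked = sorted(leaderboard_elos)
    below = bisect_left(ranked, elo)  # count of ratings strictly below `elo`
    return below / len(ranked) * 100.0
```

For example, an Elo of 1300 against a leaderboard of [1200, 1250, 1300, 1350] sits above two of the four models, scoring 50.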

Transparency

Every score on NerfWatch links to its source. Click any benchmark score to see where the data came from. We believe in verifiable metrics over black-box ratings.

  • All benchmark data is publicly sourced and linked
  • Vote counts are visible when you expand model details
  • Social signal counts show volume of mentions analyzed
  • Weights are displayed on every task view
  • No hidden adjustments or manual overrides