Methodology

How NerfWatch ranks AI models — the formula, the weights, and the data.

How scores are calculated

Each model gets a composite score per task (Coding, Math, General, Conversation). The score is a weighted average of benchmark results and NerfWatch community perception:

Score = Sum(benchmark_score × weight) ÷ Sum(weights)

Only benchmarks with available data for a given model count — missing benchmarks are excluded from both numerator and denominator so models with fewer data points aren't unfairly penalized.
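
As a rough illustration of the formula above, the following Python sketch averages whichever scores are present and skips the missing ones. The function name `composite_score`, the use of `None` to mark a missing benchmark, and the sample values are assumptions made for this example, not NerfWatch's actual code.

```python
from typing import Mapping, Optional


def composite_score(
    scores: Mapping[str, Optional[float]],
    weights: Mapping[str, float],
) -> Optional[float]:
    """Weighted average over the items that have data (None = missing)."""
    numerator = 0.0
    denominator = 0.0
    for name, score in scores.items():
        if score is None:
            continue  # missing: left out of numerator and denominator
        weight = weights.get(name, 1.0)  # assumed default weight of 1
        numerator += score * weight
        denominator += weight
    if denominator == 0:
        return None  # no data at all for this model and task
    return numerator / denominator


# Made-up values: one benchmark has no data for this model.
example = {"SWE-Bench Verified": 70.0, "LiveCodeBench": 80.0, "SciCode": None}
equal_weights = {name: 1.0 for name in example}
result = composite_score(example, equal_weights)  # (70 + 80) / 2
```

Because the missing entry is dropped from the denominator as well, the example model is averaged over two items rather than being treated as if it had scored zero on the third.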

The weight formula

NerfWatch contributes via two separate signals — direct user votes and aggregated social sentiment from Reddit and Hacker News. Each NerfWatch signal and each outside benchmark is weighted equally. Combined, the two NerfWatch signals give community perception twice the influence of any single benchmark source while still grounding scores in real benchmark data.

Every item gets weight = 1
Each item % = 1 ÷ N × 100
Combined NerfWatch % = 2 ÷ N × 100

N is the total number of items in a task (outside benchmarks + 2 NerfWatch signals). Adding a new benchmark automatically dilutes all existing weights — the formula stays balanced without manual tuning.
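
The percentages listed under "Weights by task" can be reproduced from the benchmark count alone. This is a minimal sketch; the function name `weight_percentages` and the rounding to one decimal place are assumptions chosen to match how the figures are displayed on this page.

```python
from typing import Tuple


def weight_percentages(
    num_benchmarks: int, num_nerfwatch_signals: int = 2
) -> Tuple[float, float]:
    """Return (per-item %, combined NerfWatch %) for one task."""
    n = num_benchmarks + num_nerfwatch_signals  # N in the formula above
    per_item = 100.0 / n
    combined = num_nerfwatch_signals * per_item
    return round(per_item, 1), round(combined, 1)


coding = weight_percentages(5)   # 5 outside benchmarks, N = 7
math_task = weight_percentages(3)  # 3 outside benchmarks, N = 5
general = weight_percentages(7)  # 7 outside benchmarks, N = 9
```

Calling `weight_percentages(6)` shows the dilution effect: with a sixth Coding benchmark, N becomes 8 and every item drops from 14.3% to 12.5%.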

Weights by task

Coding

NerfWatch combined = 28.6%

  • NerfWatch User Votes: 14.3%
  • NerfWatch Social Signals: 14.3%
  • SWE-Bench Verified: 14.3%
  • LiveCodeBench: 14.3%
  • AA Coding Index: 14.3%
  • SciCode: 14.3%
  • Terminal-Bench Hard: 14.3%

Math

NerfWatch combined = 40.0%

  • NerfWatch User Votes: 20.0%
  • NerfWatch Social Signals: 20.0%
  • AIME 2025: 20.0%
  • MATH-500: 20.0%
  • GSM8K: 20.0%

General Knowledge

NerfWatch combined = 22.2%

  • NerfWatch User Votes: 11.1%
  • NerfWatch Social Signals: 11.1%
  • GPQA Diamond: 11.1%
  • MMLU-Pro: 11.1%
  • Humanity's Last Exam: 11.1%
  • AA Intelligence Index: 11.1%
  • AA Agentic Index: 11.1%
  • ARC-AGI 2: 11.1%
  • SimpleBench: 11.1%

Conversation

NerfWatch combined = 40.0%

  • NerfWatch User Votes: 20.0%
  • NerfWatch Social Signals: 20.0%
  • Chatbot Arena: 20.0%
  • AlpacaEval 2.0: 20.0%
  • MT-Bench: 20.0%

The two NerfWatch signals

Each NerfWatch signal contributes independently to a model's composite score, at the same weight as any single outside benchmark. They are not blended into one number — User Votes and Social Signals each get their own column in the breakdown.

User Votes

Direct votes are scoped to a task: voting Nerfed on Gemini 3 in the Math tab affects only gemini-3 + math, not coding, general, or conversation. Each vote is scored: Sharp = 100, Normal = 50, Nerfed = 0.
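
The sketch below shows one way task-scoped vote scoring could work. The point values come from the text above; averaging the votes per (model, task) pair is an assumption, since this page does not state how individual votes are combined, and the names `VOTE_POINTS` and `user_vote_scores` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Point values as described above.
VOTE_POINTS = {"sharp": 100, "normal": 50, "nerfed": 0}


def user_vote_scores(
    votes: Iterable[Tuple[str, str, str]]
) -> Dict[Tuple[str, str], float]:
    """votes: (model, task, label) tuples. Returns mean points per (model, task)."""
    totals = defaultdict(lambda: [0, 0])  # (model, task) -> [points, vote count]
    for model, task, label in votes:
        bucket = totals[(model, task)]
        bucket[0] += VOTE_POINTS[label]
        bucket[1] += 1
    # Assumption: the signal is the plain mean of the vote points.
    return {key: points / count for key, (points, count) in totals.items()}


votes = [
    ("gemini-3", "math", "nerfed"),
    ("gemini-3", "math", "normal"),
    ("gemini-3", "coding", "sharp"),
]
scores = user_vote_scores(votes)
```

Keying the totals by (model, task) is what keeps a vote in one tab from leaking into another: the two math votes never touch the coding entry.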

Social Signals

Mentions from Reddit (11 subreddits) are analyzed for sentiment using word-boundary keyword matching and aggregated per model family. Each mention is scored: positive = 100, neutral = 50, negative = 0. Scores cover a rolling 30-day window, and mention counts are shown alongside them.
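
To make "word-boundary keyword matching" concrete, here is a small Python sketch. The keyword lists, the tie-breaking rule (more positive than negative hits means positive), the plain averaging, and the function names are all assumptions for illustration; the real keyword set and aggregation are not described on this page, and the 30-day window filter is omitted.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical keyword lists, not NerfWatch's actual vocabulary.
POSITIVE = ["sharp", "improved", "great"]
NEGATIVE = ["nerfed", "worse", "lazy"]


def _pattern(words: List[str]) -> "re.Pattern[str]":
    # \b anchors each keyword at word boundaries, so "sharp" will not
    # match inside "Sharpie".
    joined = "|".join(re.escape(word) for word in words)
    return re.compile(r"\b(?:" + joined + r")\b", re.IGNORECASE)


POSITIVE_RE = _pattern(POSITIVE)
NEGATIVE_RE = _pattern(NEGATIVE)


def mention_score(text: str) -> int:
    """Score one mention: positive = 100, neutral = 50, negative = 0."""
    positive_hits = len(POSITIVE_RE.findall(text))
    negative_hits = len(NEGATIVE_RE.findall(text))
    if positive_hits > negative_hits:
        return 100
    if negative_hits > positive_hits:
        return 0
    return 50


def social_signal(mentions: List[str]) -> Tuple[Optional[float], int]:
    """Return (mean score, mention count) for one model family."""
    if not mentions:
        return None, 0
    scores = [mention_score(text) for text in mentions]
    return sum(scores) / len(scores), len(scores)
```

Returning the count together with the score mirrors the page's practice of showing mention volumes next to the signal, so a score built on three mentions is not mistaken for one built on three thousand.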

Benchmark sources

All benchmark scores are pulled from official leaderboards and research papers. Scores are refreshed daily via automated scrapers and linked directly to their sources for transparency.

Score normalization

Some benchmarks have different scales or saturated top-end performance. We normalize to make them comparable:

Standard benchmarks

Most benchmarks use raw percentage scores directly. A score of 85% means the model solved 85% of problems.

Frontier benchmarks

For benchmarks where SOTA is far below 100%, namely Humanity's Last Exam (top: 46%) and ARC-AGI 2 (top: 69%), we scale relative to top performance (score ÷ top × 95). The leading model therefore maps to 95, and these benchmarks don't drag down composites.
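
The frontier scaling above is a one-line calculation. In this sketch the function name `normalize_frontier` and the example raw scores are assumptions; the top values (46 and 69) and the factor of 95 are taken from the text.

```python
def normalize_frontier(score: float, top: float, ceiling: float = 95.0) -> float:
    """Scale a raw percentage relative to the current top score."""
    if top <= 0:
        raise ValueError("top score must be positive")
    return score / top * ceiling


# A model at half the top Humanity's Last Exam score (23 of 46).
hle_half = normalize_frontier(23.0, 46.0)
# The top ARC-AGI 2 model itself.
arc_top = normalize_frontier(69.0, 69.0)
```

A model scoring half of the current top lands at 47.5, whatever the benchmark's absolute difficulty.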

Arena Elo ratings

Chatbot Arena Elo ratings are linearly mapped from the observed range: 1000 → 0, 1600 → 100. This puts flagship models in the 70–95 band instead of all clamping at 100.
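
The linear mapping can be sketched as follows. The endpoints 1000 and 1600 come from the text; the function name `normalize_elo` and the clamping of out-of-range ratings to 0 and 100 are assumptions.

```python
def normalize_elo(elo: float, low: float = 1000.0, high: float = 1600.0) -> float:
    """Map an Elo rating linearly so that low -> 0 and high -> 100."""
    scaled = (elo - low) / (high - low) * 100.0
    # Assumption: ratings outside the observed range are clamped.
    return max(0.0, min(100.0, scaled))


mid_range = normalize_elo(1300.0)
flagship = normalize_elo(1450.0)
```

Under this mapping the 70–95 band corresponds to Elo ratings between 1420 and 1570.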

Transparency

Every score on NerfWatch links to its source. Click any benchmark value to see where the data came from. We believe in verifiable metrics over black-box ratings.

  • All benchmark data is publicly sourced and linked
  • Vote counts and breakdowns are visible per model
  • Social signal volumes are displayed alongside scores
  • Weights are shown on every task view and on this page
  • No hidden adjustments or manual overrides