Methodology
How NerfWatch ranks AI models — the formula, the weights, and the data.
How scores are calculated
Each model gets a composite score per task (Coding, Math, General Knowledge, Conversation). The score is a weighted average of benchmark results and NerfWatch community perception:
Score = Sum(benchmark_score × weight) ÷ Sum(weights)
Only benchmarks with available data for a given model count. Missing benchmarks are excluded from both the numerator and the denominator, so models with fewer data points aren't unfairly penalized.
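As a rough illustration of the formula above, the sketch below averages only the items that have data for a model. The function name, the dictionary layout, and the sample values are assumptions made for this example, not NerfWatch's actual code or data.

```python
from typing import Mapping, Optional


def composite_score(scores: Mapping[str, Optional[float]],
                    weights: Mapping[str, float]) -> Optional[float]:
    """Weighted average over the items that have a score for this model."""
    numerator = 0.0
    denominator = 0.0
    for item, weight in weights.items():
        value = scores.get(item)
        if value is None:
            # Missing data: leave the item out of both sums.
            continue
        numerator += value * weight
        denominator += weight
    if denominator == 0:
        return None  # no data at all for this model and task
    return numerator / denominator


# Illustrative values only: one benchmark has no data for this model.
weights = {"SWE-Bench Verified": 1.0, "LiveCodeBench": 1.0, "NerfWatch User Votes": 1.0}
scores = {"SWE-Bench Verified": 70.0, "LiveCodeBench": None, "NerfWatch User Votes": 50.0}
print(composite_score(scores, weights))  # (70 + 50) / 2
```

Because the missing item is dropped from the denominator as well, the model is averaged over two items rather than being scored as if it had earned a zero on the third.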
The weight formula
NerfWatch contributes via two separate signals — direct user votes and aggregated social sentiment from Reddit and Hacker News. Each NerfWatch signal and each outside benchmark is weighted equally. Combined, the two NerfWatch signals give community perception twice the influence of any single benchmark source while still grounding scores in real benchmark data.
Every item gets weight = 1
Each item % = 1 ÷ N × 100 • Combined NerfWatch % = 2 ÷ N × 100
N is the total number of items in a task (outside benchmarks + 2 NerfWatch signals). Adding a new benchmark automatically dilutes all existing weights, so the formula stays balanced without manual tuning.
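The percentages follow directly from N. Here is a minimal sketch, assuming a hypothetical helper named `item_weights` and rounding to one decimal place for display:

```python
def item_weights(outside_benchmarks: int, nerfwatch_signals: int = 2) -> tuple[float, float]:
    """Return (percent per item, combined NerfWatch percent) for one task."""
    n = outside_benchmarks + nerfwatch_signals  # N: total items in the task
    per_item = 100.0 / n
    combined = nerfwatch_signals * 100.0 / n
    return round(per_item, 1), round(combined, 1)


# Outside benchmark counts taken from the task lists on this page.
for task, count in {"Coding": 5, "Math": 3, "General Knowledge": 7, "Conversation": 3}.items():
    print(task, item_weights(count))
```

Adding a sixth outside benchmark to Coding would raise N from 7 to 8, dropping every item to 12.5% and the combined NerfWatch share to 25.0%.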
Weights by task
Coding
NerfWatch combined = 28.6%
- NerfWatch User Votes: 14.3%
- NerfWatch Social Signals: 14.3%
- SWE-Bench Verified: 14.3%
- LiveCodeBench: 14.3%
- AA Coding Index: 14.3%
- SciCode: 14.3%
- Terminal-Bench Hard: 14.3%
Math
NerfWatch combined = 40.0%
- NerfWatch User Votes: 20.0%
- NerfWatch Social Signals: 20.0%
- AIME 2025: 20.0%
- MATH-500: 20.0%
- GSM8K: 20.0%
General Knowledge
NerfWatch combined = 22.2%
- NerfWatch User Votes: 11.1%
- NerfWatch Social Signals: 11.1%
- GPQA Diamond: 11.1%
- MMLU-Pro: 11.1%
- Humanity's Last Exam: 11.1%
- AA Intelligence Index: 11.1%
- AA Agentic Index: 11.1%
- ARC-AGI 2: 11.1%
- SimpleBench: 11.1%
Conversation
NerfWatch combined = 40.0%
- NerfWatch User Votes: 20.0%
- NerfWatch Social Signals: 20.0%
- Chatbot Arena: 20.0%
- AlpacaEval 2.0: 20.0%
- MT-Bench: 20.0%
The two NerfWatch signals
Each NerfWatch signal contributes independently to a model's composite score, at the same weight as any single outside benchmark. They are not blended into one number — User Votes and Social Signals each get their own column in the breakdown.
User Votes
Direct votes are scoped to a task: voting Nerfed on Gemini 3 in the Math tab affects only gemini-3 + math, not coding, general knowledge, or conversation. Each vote is scored: Sharp = 100, Normal = 50, Nerfed = 0.
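A minimal sketch of task-scoped vote scoring follows. The point values come from the text above. Aggregating votes with a simple mean is an assumption for illustration, since this page does not state how individual votes are combined, and the function and variable names are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Point values as stated on this page.
VOTE_POINTS = {"sharp": 100, "normal": 50, "nerfed": 0}


def vote_scores(votes):
    """votes: iterable of (model, task, label) tuples.

    Returns the mean vote score per (model, task) pair.
    Using a mean is an assumption; the page does not specify the aggregate.
    """
    buckets = defaultdict(list)
    for model, task, label in votes:
        # A vote only ever lands in the bucket for its own model and task.
        buckets[(model, task)].append(VOTE_POINTS[label])
    return {key: mean(points) for key, points in buckets.items()}


votes = [
    ("gemini-3", "math", "nerfed"),
    ("gemini-3", "math", "sharp"),
    ("gemini-3", "coding", "sharp"),
]
print(vote_scores(votes))
```

Keying the buckets on the (model, task) pair is what keeps a Math vote from touching the Coding score.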
Social Signals
Mentions from Reddit (11 subreddits) are analyzed for sentiment via word-boundary keyword matching and aggregated per model family. Each mention is scored: positive = 100, neutral = 50, negative = 0. Scores use a rolling 30-day window, and mention counts are shown alongside scores.
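The sketch below shows one way word-boundary keyword matching and a rolling window could fit together. The keyword lists, the tie-breaking rule (more positive hits than negative means positive), and the use of a mean are all assumptions for illustration; the real keyword set and aggregation are not published on this page.

```python
import re
from datetime import datetime, timedelta, timezone

# Hypothetical keyword lists, for illustration only.
POSITIVE = ["sharp", "improved", "great"]
NEGATIVE = ["nerfed", "worse", "lazy"]


def _compile(words):
    # \b anchors each keyword to whole words, so "sharp" does not match "unsharpened".
    return re.compile(r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b", re.IGNORECASE)


POSITIVE_RE = _compile(POSITIVE)
NEGATIVE_RE = _compile(NEGATIVE)


def mention_score(text: str) -> int:
    """Score one mention: positive = 100, neutral = 50, negative = 0."""
    positive_hits = len(POSITIVE_RE.findall(text))
    negative_hits = len(NEGATIVE_RE.findall(text))
    if positive_hits > negative_hits:
        return 100
    if negative_hits > positive_hits:
        return 0
    return 50


def family_signal(mentions, now, window_days=30):
    """mentions: iterable of (posted_at, text). Returns (mean score, count) in the window."""
    cutoff = now - timedelta(days=window_days)
    recent = [mention_score(text) for posted_at, text in mentions if posted_at >= cutoff]
    if not recent:
        return None, 0
    return sum(recent) / len(recent), len(recent)


now = datetime(2025, 6, 30, tzinfo=timezone.utc)
mentions = [
    (datetime(2025, 6, 20, tzinfo=timezone.utc), "Feels nerfed lately"),
    (datetime(2025, 6, 25, tzinfo=timezone.utc), "Great at refactoring"),
    (datetime(2025, 4, 1, tzinfo=timezone.utc), "So much worse"),  # outside the window
]
print(family_signal(mentions, now))
```

Returning the count alongside the score mirrors the way mention volumes are displayed next to each social signal.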
Benchmark sources
All benchmark scores are pulled from official leaderboards and research papers. Scores are refreshed daily via automated scrapers and linked directly to their sources for transparency.
SWE-Bench Verified
Real GitHub issues for coding evaluation. Industry-standard for agent coding ability.
LiveCodeBench
Contamination-free coding benchmark with continuously updated problems.
Aider Polyglot
225 multi-language coding exercises run through a real agentic loop.
HumanEval+ / MBPP+
Rigorous code-completion variants with ~80x more test cases per problem.
Artificial Analysis
Composite intelligence, coding, and agentic indices plus SciCode, Terminal-Bench Hard, MMMU Pro, and Tau-Bench 2.
Chatbot Arena
Human preference voting with Elo ratings. Gold standard for conversation quality.
SimpleBench
Spatio-temporal reasoning easy for humans (~83%) but hard for LLMs.
GPQA Diamond
Graduate-level science questions requiring expert reasoning.
MMLU-Pro
Enhanced MMLU with harder questions and reduced noise.
Humanity's Last Exam
Frontier benchmark with expert-level questions. 46% is current SOTA.
ARC-AGI 2
Abstract reasoning corpus testing generalization beyond training data.
AIME 2025
American Invitational Mathematics Examination problems.
MATH-500
Competition math problems across difficulty levels.
AlpacaEval 2.0
Automated evaluation of instruction-following ability.
MT-Bench
Multi-turn conversation benchmark with GPT-4 as judge.
Vellum AI
Cross-validation source for AIME, GPQA, HLE, MMLU, ARC-AGI, and SWE-Bench.
Score normalization
Some benchmarks have different scales or saturated top-end performance. We normalize to make them comparable:
Standard benchmarks
Most benchmarks use raw percentage scores directly. A score of 85% means the model solved 85% of problems.
Frontier benchmarks
For benchmarks where SOTA is far below 100%, such as Humanity's Last Exam (top: 46%) and ARC-AGI 2 (top: 69%), we scale relative to top performance (score ÷ top × 95) so they don't drag down composites. The top-scoring model therefore normalizes to 95.
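The frontier scaling is a one-line calculation. A minimal sketch, with a hypothetical function name and an illustrative raw score:

```python
def normalize_frontier(score: float, top: float, ceiling: float = 95.0) -> float:
    """Scale a raw frontier-benchmark score relative to the current top score."""
    if top <= 0:
        raise ValueError("top score must be positive")
    return score / top * ceiling


# Illustrative: a model at 23% on Humanity's Last Exam, where the top score is 46%.
print(normalize_frontier(23.0, 46.0))  # half of the top score maps to half of 95
print(normalize_frontier(46.0, 46.0))  # the top performer maps to the ceiling
```

Because the scale is anchored to the current top score, every model's normalized value shifts whenever a new top score is recorded.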
Arena Elo ratings
Chatbot Arena Elo ratings are linearly mapped from the observed range: 1000 → 0, 1600 → 100. This puts flagship models in the 70–95 band instead of all clamping at 100.
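The linear mapping can be sketched as below. The function name is hypothetical, and clamping ratings outside the 1000 to 1600 range is an assumption, since this page only gives the two endpoints.

```python
def normalize_elo(elo: float, low: float = 1000.0, high: float = 1600.0) -> float:
    """Linearly map an Arena Elo rating onto 0-100 using the stated endpoints."""
    scaled = (elo - low) / (high - low) * 100.0
    # Assumption: ratings outside [low, high] are clamped to the 0-100 range.
    return max(0.0, min(100.0, scaled))


for rating in (1000, 1420, 1570, 1600):
    print(rating, normalize_elo(rating))
```

Under this mapping the 70–95 band corresponds to Elo ratings between 1420 and 1570.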
Transparency
Every score on NerfWatch links to its source. Click any benchmark value to see where the data came from. We believe in verifiable metrics over black-box ratings.
- All benchmark data is publicly sourced and linked
- Vote counts and breakdowns are visible per model
- Social signal volumes are displayed alongside scores
- Weights are shown on every task view and on this page
- No hidden adjustments or manual overrides