Methodology
How we calculate scores, weight benchmarks, and measure perception.
Score Calculation
Each model's composite score is a weighted average of benchmark scores and user perception. Weights vary by task to reflect what matters most for each use case.
Composite Score = Σ(benchmark_score × weight) / Σ(weights)
Only benchmarks with available data contribute to a model's score. Missing benchmarks are excluded from both the numerator and the denominator, so a model is never penalized for a benchmark it has no published result on.
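As a minimal sketch of this rule (the `composite_score` helper and the example numbers are illustrative, not NerfWatch's actual code), missing benchmarks simply drop out of the weighted average:

```python
def composite_score(scores: dict[str, float], weights: dict[str, float]) -> float | None:
    """Weighted average over the benchmarks that have data.

    `scores` maps benchmark name -> normalized 0-100 score; benchmarks with
    no published result are simply absent from the dict. Absent benchmarks
    drop out of both the numerator and the denominator.
    """
    available = {name: w for name, w in weights.items() if name in scores}
    if not available:
        return None  # no overlapping data: the composite is undefined
    total_weight = sum(available.values())
    return sum(scores[name] * w for name, w in available.items()) / total_weight

# A model missing LiveCodeBench is averaged over the remaining 70% of weight:
coding = {"SWE-Bench Verified": 0.50, "LiveCodeBench": 0.30, "NerfWatch Perception": 0.20}
print(composite_score({"SWE-Bench Verified": 65.0, "NerfWatch Perception": 72.0}, coding))
# (65 * 0.50 + 72 * 0.20) / 0.70 = 67.0
```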
Task-Specific Weights
Different tasks prioritize different benchmarks. Here's how weights are distributed (a code sketch of these tables follows the lists):
Coding
- SWE-Bench Verified: 50%
- LiveCodeBench: 30%
- NerfWatch Perception: 20%
Math
- AIME 2025: 40%
- MATH-500: 40%
- GSM8K: 10%
- NerfWatch Perception: 10%
General Knowledge
- GPQA Diamond: 30%
- MMLU-Pro: 30%
- Humanity's Last Exam: 20%
- ARC-AGI 2: 10%
- NerfWatch Perception: 10%
Conversation
- Chatbot Arena: 50%
- AlpacaEval 2.0: 20%
- MT-Bench: 10%
- NerfWatch Perception: 20%
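For reference, here are the four tables above as one data structure. This is a hypothetical encoding (the keys and constant name are ours, not NerfWatch's config), shown so the renormalization in `composite_score` is concrete:

```python
# The task -> benchmark weight tables above, as one structure.
# Each task's weights sum to 1.0; when a benchmark is missing, it is
# dropped and the remainder renormalized by composite_score above.
TASK_WEIGHTS: dict[str, dict[str, float]] = {
    "coding": {
        "SWE-Bench Verified": 0.50,
        "LiveCodeBench": 0.30,
        "NerfWatch Perception": 0.20,
    },
    "math": {
        "AIME 2025": 0.40,
        "MATH-500": 0.40,
        "GSM8K": 0.10,
        "NerfWatch Perception": 0.10,
    },
    "general_knowledge": {
        "GPQA Diamond": 0.30,
        "MMLU-Pro": 0.30,
        "Humanity's Last Exam": 0.20,
        "ARC-AGI 2": 0.10,
        "NerfWatch Perception": 0.10,
    },
    "conversation": {
        "Chatbot Arena": 0.50,
        "AlpacaEval 2.0": 0.20,
        "MT-Bench": 0.10,
        "NerfWatch Perception": 0.20,
    },
}

# Sanity check: every task's weights sum to 1.0.
assert all(abs(sum(w.values()) - 1.0) < 1e-9 for w in TASK_WEIGHTS.values())
```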
NerfWatch Perception
Our perception score captures how users actually experience AI models in practice. It combines two signals:
User Votes
Direct votes from users rating models as "Sharp," "Normal," or "Nerfed." Each vote is weighted: Sharp = 100, Normal = 50, Nerfed = 0. Only votes from the last 30 days are counted.
Social Signals
Mentions from Reddit and Twitter are analyzed for sentiment. Positive mentions score 100, neutral 50, negative 0. A rolling 30-day window captures recent model updates.
Perception = (Vote Score × 0.6) + (Social Score × 0.4)
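Putting both signals together, a minimal sketch (the helper names are hypothetical, and how models with zero recent votes or mentions are handled is our assumption, not stated policy):

```python
def vote_score(sharp: int, normal: int, nerfed: int) -> float:
    """Average vote value: Sharp=100, Normal=50, Nerfed=0 (last 30 days only)."""
    total = sharp + normal + nerfed
    return (sharp * 100 + normal * 50) / total if total else 50.0  # no-data fallback assumed

def social_score(positive: int, neutral: int, negative: int) -> float:
    """Average sentiment value over a rolling 30-day window of mentions."""
    total = positive + neutral + negative
    return (positive * 100 + neutral * 50) / total if total else 50.0  # no-data fallback assumed

def perception(votes: tuple[int, int, int], mentions: tuple[int, int, int]) -> float:
    """Blend: 60% user votes, 40% social sentiment."""
    return vote_score(*votes) * 0.6 + social_score(*mentions) * 0.4

# Example: 120 Sharp / 60 Normal / 20 Nerfed votes; 30 / 50 / 20 mentions.
print(perception((120, 60, 20), (30, 50, 20)))  # 75.0 * 0.6 + 55.0 * 0.4 = 67.0
```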
Benchmark Sources
All benchmark scores are pulled from official leaderboards and research papers. We update scores regularly and link directly to sources for transparency.
SWE-Bench Verified
Real GitHub issues for coding evaluation. Industry-standard for agent coding ability.
LiveCodeBench
Contamination-free coding benchmark with continuously updated problems.
Chatbot Arena
Human preference voting with Elo ratings. Gold standard for conversation quality.
GPQA Diamond
Graduate-level science questions requiring expert reasoning.
MMLU-Pro
Enhanced MMLU with harder questions and reduced noise.
Humanity's Last Exam
Frontier benchmark with expert-level questions; the current SOTA is 46%.
ARC-AGI 2
Abstract reasoning corpus testing generalization beyond training data.
AIME 2025
American Invitational Mathematics Examination problems.
MATH-500
Competition mathematics problems across difficulty levels.
AlpacaEval 2.0
Automated evaluation of instruction-following ability.
MT-Bench
Multi-turn conversation benchmark with GPT-4 as judge.
Score Normalization
Some benchmarks have different scales or saturated performance levels. We normalize scores to make them comparable:
Standard Benchmarks
Most benchmarks (HumanEval, MMLU, etc.) use raw percentage scores directly. A score of 85% means the model answered 85% correctly.
Frontier Benchmarks
For benchmarks where SOTA is far below 100% (like HLE at 46%), we scale relative to top performance. This prevents frontier benchmarks from artificially dragging down composite scores.
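The page doesn't spell out the scaling formula, so treat this as one plausible reading: divide by the current SOTA so the best-known result maps to 100.

```python
def normalize_frontier(raw: float, sota: float) -> float:
    """Scale a frontier-benchmark score relative to the state of the art.

    E.g. with HLE SOTA at 46%, a raw 23% maps to 50 rather than 23, so the
    benchmark's headroom doesn't drag down composite scores.
    """
    return min(100.0, raw / sota * 100.0)

print(normalize_frontier(23.0, sota=46.0))  # 50.0
```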
Arena Ratings
Chatbot Arena Elo ratings are converted to percentiles based on the current leaderboard distribution, then scaled to 0-100.
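A sketch of that conversion as an empirical percentile over the current leaderboard (the exact distribution handling is an assumption):

```python
def elo_to_percentile(elo: float, leaderboard: list[float]) -> float:
    """Share of current leaderboard Elos at or below this one, scaled to 0-100."""
    if not leaderboard:
        raise ValueError("leaderboard is empty")
    rank = sum(1 for e in leaderboard if e <= elo)
    return rank / len(leaderboard) * 100.0

# Example against a toy five-model leaderboard.
print(elo_to_percentile(1310, [1360, 1340, 1310, 1275, 1240]))  # 60.0
```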
Transparency
Every score on NerfWatch links to its source. Click any benchmark score to see where the data came from. We believe in verifiable metrics over black-box ratings.
- All benchmark data is publicly sourced and linked
- Vote counts are visible when you expand model details
- Social signal counts show volume of mentions analyzed
- Weights are displayed on every task view
- No hidden adjustments or manual overrides