Ranking Methodology

Data sources

The leaderboard aggregates public data from four sources, each measuring a different dimension:

Artificial Analysis: Intelligence Index v4, coding & math scores, speed, latency, price.
METR: 50% task-completion time horizon (the task length a model completes 50% of the time).
Aider Polyglot: Polyglot code-editing benchmark (pass rate %).
OpenRouter: Model catalog, context window, modalities, live pricing.

METR has no API, so its data is pulled periodically from the public eval-analysis-public repo and the time horizon is computed via logistic regression. METR data is used for reference and display only and is never used for training.
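To make the horizon computation concrete, here is a minimal sketch of how a 50% time horizon can be read off a logistic fit. It assumes success is modeled as a logistic function of log2 task length in minutes and fitted with Newton's method; the function names, the fitting details, and the trial data are illustrative assumptions, not the leaderboard's or METR's actual code.

```python
import math

def fit_logistic(xs, ys, iterations=25):
    """Fit P(success) = sigmoid(a + b * x) by Newton's method; returns (a, b)."""
    a = b = 0.0
    for _ in range(iterations):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            w = p * (1.0 - p)
            g_a += y - p            # gradient of the log-likelihood
            g_b += (y - p) * x
            h_aa += w               # negated Hessian entries
            h_ab += w * x
            h_bb += w * x * x
        det = h_aa * h_bb - h_ab * h_ab
        if det <= 1e-12:
            raise ValueError("fit is degenerate (e.g. perfectly separable data)")
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    return a, b

def horizon_minutes(task_minutes, successes):
    """Task length (minutes) at which the fitted success probability is 50%."""
    xs = [math.log2(m) for m in task_minutes]
    a, b = fit_logistic(xs, successes)
    if b >= 0:
        raise ValueError("success does not decrease with task length")
    return 2 ** (-a / b)  # sigmoid(a + b * x) = 0.5 exactly when x = -a / b

# Made-up trials: four attempts at each task length, fewer successes as tasks grow.
lengths = [2, 4, 8, 16, 32]
wins = [4, 3, 2, 1, 0]
task_minutes, outcomes = [], []
for minutes, k in zip(lengths, wins):
    task_minutes += [minutes] * 4
    outcomes += [1] * k + [0] * (4 - k)

print(round(horizon_minutes(task_minutes, outcomes), 2))  # about 8 minutes
```

The made-up data is symmetric around the 8-minute tasks (success rates of 100%, 75%, 50%, 25%, 0%), which is why the fitted horizon lands at about 8 minutes.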

Note: crowdsourced Elo sources (LMArena, Design Arena) currently offer no public API, so they are not integrated yet; they can be added if/when access is granted.

What the metrics mean

Score: DataCore composite score, a weighted average of Intelligence (55%), Coding (30%) and Math (15%) on a 0-100 scale.
Intelligence: Artificial Analysis Intelligence Index (v4): a composite of many hard evals, scored 0 to 100. Higher is more capable.
Coding: Artificial Analysis coding score (SciCode, LiveCodeBench, etc.), 0-100.
Code (Aider): Aider Polyglot benchmark: % of multi-language code-editing tasks the model solves.
Math: Artificial Analysis math score (AIME, MATH-500, etc.), 0-100.
Time Horizon: METR: the task length (in human time) a model completes 50% of the time. Longer = handles longer, more complex tasks.
Speed: Median output generation speed (tokens/sec), measured by Artificial Analysis. Higher = faster.
Price: Blended price in USD per 1M tokens, weighting input and output prices 3:1. Lower = cheaper.
Context: Context window: the max tokens the model can process at once.
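As a worked example of the Price metric, a 3:1 blend can be read as three parts input price to one part output price. The sketch below assumes that reading; the function name and the prices are hypothetical.

```python
def blended_price(input_usd_per_m: float, output_usd_per_m: float) -> float:
    """Blend per-1M-token prices 3:1 (three parts input, one part output)."""
    return (3 * input_usd_per_m + 1 * output_usd_per_m) / 4

# Hypothetical prices: $2.00 per 1M input tokens, $10.00 per 1M output tokens.
print(blended_price(2.00, 10.00))  # 4.0
```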

Model merging

Each source names models differently (e.g. “openai/gpt-5.4”, “GPT-5.4”, “gpt-5.4-high”). We normalize names to a canonical key (stripping provider prefixes, date stamps and variant suffixes) to merge metrics from multiple sources onto one row. Every metric is attributed to its source (colored dot).
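The normalization and merge steps might look roughly like the sketch below. The suffix list, the date-stamp pattern, the function names, and the metric values are assumptions made for illustration; the actual rules used by the leaderboard are not spelled out here.

```python
import re

# Hypothetical variant suffixes; the real list is not given in the text.
VARIANT_SUFFIXES = ("high", "medium", "low", "preview", "latest")
# Trailing date stamps such as -2026-01-15 or -20260115.
DATE_STAMP = re.compile(r"-(\d{4}-\d{2}-\d{2}|\d{8})$")

def canonical_key(raw_name: str) -> str:
    """Reduce a source-specific model name to a shared lookup key."""
    name = raw_name.strip().lower()
    name = name.rsplit("/", 1)[-1]           # drop provider prefix ("openai/")
    name = re.sub(r"[\s_]+", "-", name)      # unify separators
    changed = True
    while changed:                           # strip stacked dates and suffixes
        changed = False
        stripped = DATE_STAMP.sub("", name)
        for suffix in VARIANT_SUFFIXES:
            if stripped.endswith("-" + suffix):
                stripped = stripped[: -len(suffix) - 1]
        if stripped != name:
            name, changed = stripped, True
    return name

def merge_sources(records):
    """Merge (source, raw_name, metrics) records into one row per model.

    Each metric keeps its source so the table can attribute it.
    """
    rows = {}
    for source, raw_name, metrics in records:
        row = rows.setdefault(canonical_key(raw_name), {})
        for metric, value in metrics.items():
            row[metric] = {"value": value, "source": source}
    return rows

# Placeholder metric values, for illustration only.
rows = merge_sources([
    ("OpenRouter", "openai/gpt-5.4", {"context": 100000}),
    ("Artificial Analysis", "GPT-5.4", {"intelligence": 50.0}),
])
print(sorted(rows["gpt-5.4"]))  # ['context', 'intelligence']
```

Keeping the source next to each value, rather than only the merged number, is what allows every cell to carry its own attribution.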

DataCore composite score

The “Overall” category uses a weighted average of the available normalized signals. A model missing a signal is scored only on the signals it has, so the missing one is not counted as zero. The weights are:

Intelligence Index: 55%
Coding score: 30%
Math score: 15%
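One way to read "scored on what it has" is to renormalize the weights over the signals that are present. The sketch below assumes that interpretation; the function name and the input scores are hypothetical.

```python
WEIGHTS = {"intelligence": 0.55, "coding": 0.30, "math": 0.15}

def composite_score(signals):
    """Weighted average over the signals present, each on a 0-100 scale.

    Assumption: weights are renormalized over the available signals, so a
    missing signal neither counts as zero nor lowers the score.
    Returns None when no signal is available.
    """
    available = {
        name: value
        for name, value in signals.items()
        if name in WEIGHTS and value is not None
    }
    if not available:
        return None
    total_weight = sum(WEIGHTS[name] for name in available)
    weighted_sum = sum(WEIGHTS[name] * value for name, value in available.items())
    return weighted_sum / total_weight

# Hypothetical scores.
full = composite_score({"intelligence": 60, "coding": 50, "math": 40})
partial = composite_score({"intelligence": 60, "coding": 50, "math": None})
print(round(full, 1))     # 54.0
print(round(partial, 1))  # 56.5
```

In the second call the math score is missing, so the result is (0.55 × 60 + 0.30 × 50) / 0.85 rather than a sum that treats math as 0.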

Update cadence

Data is cached server-side and refreshed periodically (default every 10 minutes). The browser also refreshes the table every 60 seconds. DevOps can force an immediate refresh via the /api/revalidate endpoint.
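The server-side caching behavior can be pictured with a small time-to-live cache. This is only a sketch under assumptions: the class name, the `force` flag, and the injectable clock are hypothetical, and it does not describe how the /api/revalidate endpoint is actually implemented.

```python
import time

class TimedCache:
    """Cache one fetched value and refetch once it is older than the TTL."""

    def __init__(self, fetch, ttl_seconds=600.0, clock=time.monotonic):
        self._fetch = fetch          # callable that loads fresh data
        self._ttl = ttl_seconds      # 600 s matches the 10-minute default
        self._clock = clock          # injectable so the logic can be tested
        self._value = None
        self._fetched_at = None

    def get(self, force=False):
        """Return cached data, refetching when stale or when force is set."""
        now = self._clock()
        stale = self._fetched_at is None or now - self._fetched_at >= self._ttl
        if force or stale:
            self._value = self._fetch()
            self._fetched_at = now
        return self._value

# Usage with a stand-in fetch function that counts how often it runs.
fetch_count = 0

def load_leaderboard():
    global fetch_count
    fetch_count += 1
    return {"fetch_number": fetch_count}

cache = TimedCache(load_leaderboard)
cache.get()            # first call fetches
cache.get()            # served from cache
cache.get(force=True)  # forced refresh, like a manual revalidation
print(fetch_count)     # 2
```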

Caveats

When a source is not connected (missing API key), DataCore baseline values are used temporarily and are replaced as soon as live data is available. All benchmarks belong to their respective creators.