Methodology

How StackAIQ collects, normalizes, and displays data. No hidden logic.

1. What We Measure

Pricing

Input, output, and cached input prices per 1M tokens. All prices are in USD, sourced from official provider pricing pages, and include a verified date and source URL.

Cost calculations are deterministic. The formulas exist in code, are unit tested, and produce the same result every time for the same inputs. No LLM-generated estimates.
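The cost arithmetic described above can be sketched as a pure function. This is an illustrative sketch, not the actual lib/pricing.ts implementation; the `Pricing` shape and field names are assumptions.

```typescript
// Illustrative sketch of a deterministic cost formula: prices are quoted
// per 1M tokens, so cost scales linearly with token counts.
interface Pricing {
  inputPerM: number;        // USD per 1M input tokens
  outputPerM: number;       // USD per 1M output tokens
  cachedInputPerM?: number; // USD per 1M cached input tokens, if offered
}

function costUsd(
  p: Pricing,
  inputTokens: number,
  outputTokens: number,
  cachedInputTokens = 0
): number {
  const M = 1_000_000;
  // Assumption: if no cached rate is published, cached input bills at the input rate.
  const cachedRate = p.cachedInputPerM ?? p.inputPerM;
  return (
    (inputTokens / M) * p.inputPerM +
    (outputTokens / M) * p.outputPerM +
    (cachedInputTokens / M) * cachedRate
  );
}
```

For example, at $3/1M input and $15/1M output, 1M input tokens plus 500k output tokens costs $3.00 + $7.50 = $10.50, every time.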

Task-Specific Benchmark Performance

We display benchmark scores from established, independent evaluation sources. Each score is linked to its source, includes the evaluation date, and is normalized to a 0-100 scale for cross-model comparison within the same task.

We do not run our own benchmarks. We aggregate and display results from third-party sources.

2. What We Do Not Measure

  • No global “intelligence” ranking. There is no single number that captures overall model quality. Every score on StackAIQ is task-specific.
  • No subjective “best model” claims. We show data. We do not make recommendations based on opinion. The “Best For” page uses rule-based scoring on price, not quality judgment.
  • No speculative latency or throughput. We do not estimate response times or tokens-per-second unless we have a verified, sourced measurement. Currently, we have none.
  • No guessed or inferred benchmark data. If a model has no SWE-bench score, it shows “No data” — not an estimate. Scores for one model version are never applied to another version.
  • No hallucination rate comparisons. We do not evaluate or compare model accuracy, safety, or hallucination tendencies.

3. Benchmark Tasks (MVP)

Each task maps to one or more independent benchmark sources. For MVP, each task uses a single benchmark.

Chat

Conversational quality measured by human preference

Benchmark: LMSYS Chatbot Arena Elo
Score type: elo
Normalization: Min-max (see Section 4)
Staleness: 30 days
Source: https://chat.lmsys.org/?leaderboard
Notes: Uses overall Elo, not category-specific

Coding

Code generation and software engineering capability

Benchmark: SWE-bench Verified
Score type: percentage
Normalization: Direct (score is already 0-100)
Staleness: 60 days
Source: https://www.swebench.com/
Notes: Verified subset only

General Knowledge

Broad reasoning and knowledge capability across domains

Benchmark: MMLU-Pro
Score type: percentage
Normalization: Direct (score is already 0-100)
Staleness: 90 days
Source: https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro
Notes: Harder MMLU variant with 12k professional-level questions. Scores sourced from provider model cards and third-party evaluators (vals.ai, Artificial Analysis).

4. Normalization Method

Direct

Used when the raw score is already a percentage (0-100). The raw score is used as-is, clamped to the 0-100 range.

normalized_score = clamp(raw_score, 0, 100)

Applies to: SWE-bench Verified (pass rate %) and MMLU-Pro (accuracy %).
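A minimal sketch of the clamp step, assuming a standalone helper (not the exact lib/benchmarks.ts code):

```typescript
// Direct normalization: the raw score is already on a 0-100 scale,
// so we only clamp it into range.
function normalizeDirect(rawScore: number): number {
  return Math.min(100, Math.max(0, rawScore));
}
```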

Min-Max

Used for unbounded score types like Elo ratings. The score is normalized relative to the minimum and maximum scores across all tracked models that have data for that benchmark.

normalized_score = (raw_score - min) / (max - min) × 100

Where min and max are the lowest and highest raw scores in the current dataset for that benchmark. These values are recomputed whenever the dataset changes.

Applies to: LMSYS Chatbot Arena Elo.
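The min-max formula above can be sketched as follows. This is an illustrative version; the degenerate-dataset behavior (all scores equal) is an assumption, not documented behavior.

```typescript
// Min-max normalization for unbounded score types such as Elo ratings.
// min and max are taken over all tracked models with data for the benchmark,
// and are recomputed whenever the dataset changes.
function normalizeMinMax(rawScore: number, allScores: number[]): number {
  const min = Math.min(...allScores);
  const max = Math.max(...allScores);
  if (max === min) return 0; // assumption: no spread means no relative position
  return ((rawScore - min) / (max - min)) * 100;
}
```

With Elo ratings of 1200, 1300, and 1400 in the dataset, the middle model normalizes to 50 and the top model to 100.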

Important limitation

Normalized scores are comparable only within the same task. A score of 80 on Chat and a score of 80 on Coding do not mean the same thing; each indicates relative position within its own task's benchmark.

5. Value Index

The Value Index is a ratio that combines benchmark performance with cost. It answers: “How much benchmark performance do I get per dollar?”

value_index = normalized_score / cost_denominator

The default cost denominator is output price per 1M tokens. This can be configured per task.

  • Higher value = more performance per dollar.
  • If a model has no benchmark data for a task, Value Index shows “N/A”.
  • If the cost denominator is $0, Value Index shows “N/A”.
  • The Value Index is a mathematical ratio, not a quality judgment.

In the Calculator, a separate cost-based value is computed as normalized_score / estimated_monthly_cost to show performance relative to your actual projected spend.
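The Value Index rules above, including the "N/A" cases, can be sketched in one function. This is an assumed shape, with `null` standing in for "N/A":

```typescript
// Value Index sketch: normalized score divided by a cost denominator
// (default: output price per 1M tokens). Returns null where the site
// would display "N/A".
function valueIndex(
  normalizedScore: number | null,
  costDenominator: number
): number | null {
  if (normalizedScore === null) return null; // no benchmark data for this task
  if (costDenominator === 0) return null;    // avoid division by zero
  return normalizedScore / costDenominator;
}
```

For example, a normalized score of 80 at $16 per 1M output tokens yields a Value Index of 5.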

6. Freshness Policy

Pricing

Each pricing entry includes a verified_at date. A site-wide banner appears when any model’s pricing data is older than 7 days. After 14 days, the banner escalates to a critical warning.
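The 7-day and 14-day banner tiers can be sketched as below. The tier names and function shape are illustrative assumptions; only the thresholds come from the policy above.

```typescript
// Pricing freshness banner: site-wide warning after 7 days,
// escalating to critical after 14.
type Banner = "none" | "warning" | "critical";

const MS_PER_DAY = 86_400_000;

function pricingBanner(verifiedAt: Date, now: Date): Banner {
  const ageDays = (now.getTime() - verifiedAt.getTime()) / MS_PER_DAY;
  if (ageDays > 14) return "critical";
  if (ageDays > 7) return "warning";
  return "none";
}
```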

Benchmarks

Each benchmark source has its own staleness threshold, configured in data/tasks.json. Thresholds differ because benchmarks update at different cadences.

LMSYS Chatbot Arena Elo: 30 days (updates frequently, roughly weekly)
SWE-bench Verified: 60 days (updates irregularly, roughly monthly)
MMLU-Pro: 90 days (updates infrequently, roughly quarterly)

Coverage Badges

When a benchmark task is selected, each model displays a coverage badge:

Score shown: Valid, fresh benchmark data exists.
Stale: Data older than the source's staleness threshold.
Partial: Some contributing benchmarks are missing.
No data: No benchmark data available for this task.
Unranked: Model was excluded or removed from the benchmark.
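Badge selection can be sketched as a simple precedence check. The `BenchmarkEntry` shape, field names, and the exact precedence order are assumptions for illustration, not the real schema:

```typescript
// Coverage badge selection, sketched from the badge definitions above.
type Badge = "score" | "stale" | "partial" | "no-data" | "unranked";

interface BenchmarkEntry {
  score: number | null;        // normalized score, null if absent
  ageDays: number;             // days since the source evaluation
  thresholdDays: number;       // per-source staleness threshold (data/tasks.json)
  unranked: boolean;           // excluded or removed from the benchmark
  missingContributors: boolean; // some contributing benchmarks absent
}

function coverageBadge(entry: BenchmarkEntry | null): Badge {
  if (entry === null || entry.score === null) return "no-data";
  if (entry.unranked) return "unranked";
  if (entry.ageDays > entry.thresholdDays) return "stale";
  if (entry.missingContributors) return "partial";
  return "score";
}
```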

7. Sources & Citations

Benchmark Sources

Every benchmark entry in our dataset includes a direct link to the source. We only use official or primary sources.

Pricing Sources

Every pricing field includes a source_url linking to the official provider pricing page and a verified_at date. Source links are displayed in the Compare table and Cost Calculator results.

If a price cannot be verified against its official source, it is not published. We prefer no data over incorrect data.

This methodology is versioned alongside the codebase. All formulas referenced on this page are implemented in lib/pricing.ts and lib/benchmarks.ts, with unit tests in tests/. Source configuration lives in data/tasks.json.