Methodology

How StackAIQ collects, normalizes, and displays data. No hidden logic.

1. What We Measure

Pricing

Input, output, and cached input prices per 1M tokens. All prices are in USD, sourced from official provider pricing pages, and include a verified date and source URL.

Cost calculations are deterministic. The formulas exist in code, are unit tested, and produce the same result every time for the same inputs. No LLM-generated estimates.
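The cost arithmetic described above can be sketched as a pure function. This is an illustrative sketch, not the actual lib/pricing.ts implementation; the `Pricing` shape and field names are assumptions.

```typescript
// Illustrative sketch of a deterministic cost formula: prices are quoted
// per 1M tokens, so cost scales linearly with token counts.
interface Pricing {
  inputPerM: number;        // USD per 1M input tokens
  outputPerM: number;       // USD per 1M output tokens
  cachedInputPerM?: number; // USD per 1M cached input tokens, if offered
}

function costUsd(
  p: Pricing,
  inputTokens: number,
  outputTokens: number,
  cachedInputTokens = 0
): number {
  const M = 1_000_000;
  // Assumption: if no cached rate is published, cached input bills at the input rate.
  const cachedRate = p.cachedInputPerM ?? p.inputPerM;
  return (
    (inputTokens / M) * p.inputPerM +
    (outputTokens / M) * p.outputPerM +
    (cachedInputTokens / M) * cachedRate
  );
}
```

For example, at $3/1M input and $15/1M output, 1M input tokens plus 500k output tokens costs $3.00 + $7.50 = $10.50, every time.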

Task-Specific Benchmark Performance

We display benchmark scores from established, independent evaluation sources. Each score is linked to its source, includes the evaluation date, and is normalized to a 0-100 scale for cross-model comparison within the same task.

We do not run our own benchmarks. We aggregate and display results from third-party sources.

2. What We Do Not Measure

  • No global “intelligence” ranking. There is no single number that captures overall model quality. Every score on StackAIQ is task-specific.
  • No subjective “best model” claims. We show data. We do not make recommendations based on opinion. The “Best For” page uses rule-based scoring on price, not quality judgment.
  • No speculative latency or throughput. We do not estimate response times or tokens-per-second unless we have a verified, sourced measurement. Currently, we have none.
  • No guessed or inferred benchmark data. If a model has no SWE-bench score, it shows “No data” — not an estimate. Scores for one model version are never applied to another version.
  • No hallucination rate comparisons. We do not evaluate or compare model accuracy, safety, or hallucination tendencies.

3. Benchmark Tasks (MVP)

Each task maps to one or more independent benchmark sources. For MVP, each task uses a single benchmark.

Chat

Conversational quality measured by human preference

Benchmark: LMSYS Chatbot Arena Elo
Score type: elo
Normalization: Min-max (see Section 4)
Staleness: 30 days
Source: https://chat.lmsys.org/?leaderboard
Notes: Uses overall Elo, not category-specific

Coding

Code generation and software engineering capability

Benchmark: SWE-bench Verified
Score type: percentage
Normalization: Direct (score is already 0-100)
Staleness: 60 days
Source: https://www.swebench.com/
Notes: Verified subset only

General Knowledge

Broad reasoning and knowledge capability across domains

Benchmark: MMLU-Pro
Score type: percentage
Normalization: Direct (score is already 0-100)
Staleness: 90 days
Source: https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro
Notes: Harder MMLU variant with 12k professional-level questions. Scores sourced from provider model cards and third-party evaluators (vals.ai, Artificial Analysis).

4. Normalization Method

Direct

Used when the raw score is already a percentage (0-100). The raw score is used as-is, clamped to the 0-100 range.

normalized_score = clamp(raw_score, 0, 100)

Applies to: SWE-bench Verified (pass rate %) and MMLU-Pro (accuracy %).
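A minimal sketch of the clamp step, assuming a standalone helper (not the exact lib/benchmarks.ts code):

```typescript
// Direct normalization: the raw score is already on a 0-100 scale,
// so we only clamp it into range.
function normalizeDirect(rawScore: number): number {
  return Math.min(100, Math.max(0, rawScore));
}
```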

Min-Max

Used for unbounded score types like Elo ratings. The score is normalized relative to the minimum and maximum scores across all tracked models that have data for that benchmark.

normalized_score = (raw_score - min) / (max - min) × 100

Where min and max are the lowest and highest raw scores in the current dataset for that benchmark. These values are recomputed whenever the dataset changes.

Applies to: LMSYS Chatbot Arena Elo.
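The min-max formula above can be sketched as follows. This is an illustrative version; the degenerate-dataset behavior (all scores equal) is an assumption, not documented behavior.

```typescript
// Min-max normalization for unbounded score types such as Elo ratings.
// min and max are taken over all tracked models with data for the benchmark,
// and are recomputed whenever the dataset changes.
function normalizeMinMax(rawScore: number, allScores: number[]): number {
  const min = Math.min(...allScores);
  const max = Math.max(...allScores);
  if (max === min) return 0; // assumption: no spread means no relative position
  return ((rawScore - min) / (max - min)) * 100;
}
```

With Elo ratings of 1200, 1300, and 1400 in the dataset, the middle model normalizes to 50 and the top model to 100.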

Important limitation

Normalized scores are comparable only within the same task. A score of 80 on Chat and a score of 80 on Coding do not mean the same thing; each indicates relative position within its own task's benchmark.

5. Value Index

The Value Index is a ratio that combines benchmark performance with cost. It answers: “How much benchmark performance do I get per dollar?”

value_index = normalized_score / cost_denominator

The default cost denominator is output price per 1M tokens. This can be configured per task.

  • Higher value = more performance per dollar.
  • If a model has no benchmark data for a task, Value Index shows “N/A”.
  • If the cost denominator is $0, Value Index shows “N/A”.
  • The Value Index is a mathematical ratio, not a quality judgment.

In the Calculator, a separate cost-based value is computed as normalized_score / estimated_monthly_cost to show performance relative to your actual projected spend.
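The Value Index rules above, including the "N/A" cases, can be sketched in one function. This is an assumed shape, with `null` standing in for "N/A":

```typescript
// Value Index sketch: normalized score divided by a cost denominator
// (default: output price per 1M tokens). Returns null where the site
// would display "N/A".
function valueIndex(
  normalizedScore: number | null,
  costDenominator: number
): number | null {
  if (normalizedScore === null) return null; // no benchmark data for this task
  if (costDenominator === 0) return null;    // avoid division by zero
  return normalizedScore / costDenominator;
}
```

For example, a normalized score of 80 at $16 per 1M output tokens yields a Value Index of 5.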

6. Freshness Policy

Pricing

Each pricing entry includes a verified_at date. A site-wide banner appears when any model’s pricing data is older than 7 days. After 14 days, the banner escalates to a critical warning.
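The 7-day and 14-day banner tiers can be sketched as below. The tier names and function shape are illustrative assumptions; only the thresholds come from the policy above.

```typescript
// Pricing freshness banner: site-wide warning after 7 days,
// escalating to critical after 14.
type Banner = "none" | "warning" | "critical";

const MS_PER_DAY = 86_400_000;

function pricingBanner(verifiedAt: Date, now: Date): Banner {
  const ageDays = (now.getTime() - verifiedAt.getTime()) / MS_PER_DAY;
  if (ageDays > 14) return "critical";
  if (ageDays > 7) return "warning";
  return "none";
}
```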

Benchmarks

Each benchmark source has its own staleness threshold, configured in data/tasks.json. Thresholds differ because benchmarks update at different cadences.

LMSYS Chatbot Arena Elo: 30 days (updates frequently, roughly weekly)
SWE-bench Verified: 60 days (updates irregularly, roughly monthly)
MMLU-Pro: 90 days (updates infrequently, roughly quarterly)

Coverage Badges

When a benchmark task is selected, each model displays a coverage badge:

Score shown: Valid, fresh benchmark data exists.
Stale: Data older than the source's staleness threshold.
Partial: Some contributing benchmarks are missing.
No data: No benchmark data available for this task.
Unranked: Model was excluded or removed from the benchmark.
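Badge selection can be sketched as a simple precedence check. The `BenchmarkEntry` shape, field names, and the exact precedence order are assumptions for illustration, not the real schema:

```typescript
// Coverage badge selection, sketched from the badge definitions above.
type Badge = "score" | "stale" | "partial" | "no-data" | "unranked";

interface BenchmarkEntry {
  score: number | null;        // normalized score, null if absent
  ageDays: number;             // days since the source evaluation
  thresholdDays: number;       // per-source staleness threshold (data/tasks.json)
  unranked: boolean;           // excluded or removed from the benchmark
  missingContributors: boolean; // some contributing benchmarks absent
}

function coverageBadge(entry: BenchmarkEntry | null): Badge {
  if (entry === null || entry.score === null) return "no-data";
  if (entry.unranked) return "unranked";
  if (entry.ageDays > entry.thresholdDays) return "stale";
  if (entry.missingContributors) return "partial";
  return "score";
}
```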

7. Sources & Citations

Benchmark Sources

Every benchmark entry in our dataset includes a direct link to the source. We only use official or primary sources.

Pricing Sources

Every pricing field includes a source_url linking to the official provider pricing page and a verified_at date. Source links are displayed in the Compare table and Cost Calculator results.

If a price cannot be verified against its official source, it is not published. We prefer no data over incorrect data.

This methodology is versioned alongside the codebase. All formulas referenced on this page are implemented in lib/pricing.ts and lib/benchmarks.ts, with unit tests in tests/. Source configuration lives in data/tasks.json.