Methodology
How StackAIQ collects, normalizes, and displays data. No hidden logic.
1. What We Measure
Pricing
Input, output, and cached input prices per 1M tokens. All prices are in USD, sourced from official provider pricing pages, and include a verified date and source URL.
Cost calculations are deterministic. The formulas exist in code, are unit tested, and produce the same result every time for the same inputs. No LLM-generated estimates.
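As a sketch of what such a deterministic cost formula looks like (the names and interface here are illustrative assumptions, not the actual lib/pricing.ts API):

```typescript
// Hypothetical sketch of a deterministic cost calculation.
// All prices are USD per 1M tokens, as stated above.
interface Pricing {
  inputPerM: number;       // USD per 1M input tokens
  cachedInputPerM: number; // USD per 1M cached input tokens
  outputPerM: number;      // USD per 1M output tokens
}

function cost(
  p: Pricing,
  inputTokens: number,
  cachedInputTokens: number,
  outputTokens: number
): number {
  // Pure arithmetic: the same inputs always produce the same result.
  return (
    (inputTokens / 1_000_000) * p.inputPerM +
    (cachedInputTokens / 1_000_000) * p.cachedInputPerM +
    (outputTokens / 1_000_000) * p.outputPerM
  );
}
```

Because the formula is a pure function of its inputs, it can be unit tested exhaustively, which is what makes "no LLM-generated estimates" verifiable.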
Task-Specific Benchmark Performance
We display benchmark scores from established, independent evaluation sources. Each score is linked to its source, includes the evaluation date, and is normalized to a 0-100 scale for cross-model comparison within the same task.
We do not run our own benchmarks. We aggregate and display results from third-party sources.
2. What We Do Not Measure
- ✗ No global “intelligence” ranking. There is no single number that captures overall model quality. Every score on StackAIQ is task-specific.
- ✗ No subjective “best model” claims. We show data. We do not make recommendations based on opinion. The “Best For” page uses rule-based scoring on price, not quality judgment.
- ✗ No speculative latency or throughput. We do not estimate response times or tokens-per-second unless we have a verified, sourced measurement. Currently, we have none.
- ✗ No guessed or inferred benchmark data. If a model has no SWE-bench score, it shows “No data” — not an estimate. Scores for one model version are never applied to another version.
- ✗ No hallucination rate comparisons. We do not evaluate or compare model accuracy, safety, or hallucination tendencies.
3. Benchmark Tasks (MVP)
Each task maps to one or more independent benchmark sources. For MVP, each task uses a single benchmark.
Chat
Conversational quality measured by human preference
| Benchmark | LMSYS Chatbot Arena Elo |
|---|---|
| Score type | elo |
| Normalization | Min-max (see Section 4) |
| Staleness | 30 days |
| Source | https://chat.lmsys.org/?leaderboard |
| Notes | Uses overall Elo, not category-specific |
Coding
Code generation and software engineering capability
| Benchmark | SWE-bench Verified |
|---|---|
| Score type | percentage |
| Normalization | Direct (score is already 0-100) |
| Staleness | 60 days |
| Source | https://www.swebench.com/ |
| Notes | Verified subset only |
General Knowledge
Broad reasoning and knowledge capability across domains
| Benchmark | MMLU-Pro |
|---|---|
| Score type | percentage |
| Normalization | Direct (score is already 0-100) |
| Staleness | 90 days |
| Source | https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro |
| Notes | Harder MMLU variant with 12k professional-level questions. Scores sourced from provider model cards and third-party evaluators (vals.ai, Artificial Analysis). |
4. Normalization Method
Direct
Used when the raw score is already a percentage (0-100). The raw score is used as-is, clamped to the 0-100 range.
Applies to: SWE-bench Verified (pass rate %) and MMLU-Pro (accuracy %).
Min-Max
Used for unbounded score types such as Elo ratings. The score is normalized relative to the minimum and maximum scores across all tracked models that have data for that benchmark:

normalized = 100 × (raw − min) / (max − min)

where min and max are the lowest and highest raw scores in the current dataset for that benchmark. These values are recomputed whenever the dataset changes.
Applies to: LMSYS Chatbot Arena Elo.
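The two modes can be sketched as follows (function names are assumptions for illustration; the real implementations live in lib/benchmarks.ts):

```typescript
// Direct: the raw score is already a percentage; clamp it to [0, 100].
function normalizeDirect(raw: number): number {
  return Math.min(100, Math.max(0, raw));
}

// Min-max: position of an unbounded score (e.g. an Elo rating) relative
// to the lowest and highest raw scores currently in the dataset.
function normalizeMinMax(raw: number, min: number, max: number): number {
  // Degenerate case: only one distinct score tracked. Returning 100 here
  // is an assumption, not documented behavior.
  if (max === min) return 100;
  return (100 * (raw - min)) / (max - min);
}
```

Note that min-max scores shift whenever a model is added or removed from the dataset, which is why they are recomputed on every dataset change.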
Important limitation
Normalized scores are comparable only within the same task. An 80 on Chat and an 80 on Coding do not mean the same thing; each indicates relative position within that task’s benchmark.
5. Value Index
The Value Index is a ratio that combines benchmark performance with cost. It answers: “How much benchmark performance do I get per dollar?”
The default cost denominator is output price per 1M tokens. This can be configured per task.
- Higher value = more performance per dollar.
- If a model has no benchmark data for a task, Value Index shows “N/A”.
- If the cost denominator is $0, Value Index shows “N/A”.
- The Value Index is a mathematical ratio, not a quality judgment.
In the Calculator, a separate cost-based value is computed as normalized_score / estimated_monthly_cost to show performance relative to your actual projected spend.
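The Value Index rules above can be sketched like this (a minimal illustration, assuming output price per 1M tokens as the denominator; names are not the actual API):

```typescript
// Hypothetical Value Index: benchmark performance per dollar.
// Returns "N/A" when there is no benchmark score for the task,
// or when the cost denominator is zero, per the rules above.
function valueIndex(
  normalizedScore: number | null,
  outputPricePerM: number
): number | "N/A" {
  if (normalizedScore === null || outputPricePerM === 0) return "N/A";
  return normalizedScore / outputPricePerM;
}
```

Returning "N/A" rather than 0 keeps missing data distinguishable from genuinely poor value, consistent with the "no guessed data" policy.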
6. Freshness Policy
Pricing
Each pricing entry includes a verified_at date. A site-wide banner appears when any model’s pricing data is older than 7 days. After 14 days, the banner escalates to a critical warning.
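The banner escalation rule (warn after 7 days, critical after 14) is simple enough to sketch directly; the function name and return values below are assumptions, not the site's real API:

```typescript
// Hypothetical pricing-freshness check based on the verified_at date.
function pricingBanner(
  verifiedAt: Date,
  now: Date
): "none" | "warning" | "critical" {
  const days = (now.getTime() - verifiedAt.getTime()) / 86_400_000; // ms per day
  if (days > 14) return "critical";
  if (days > 7) return "warning";
  return "none";
}
```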
Benchmarks
Each benchmark source has its own staleness threshold, configured in data/tasks.json. Thresholds differ because benchmarks update at different cadences.
| Benchmark | Staleness Threshold | Rationale |
|---|---|---|
| LMSYS Chatbot Arena Elo | 30 days | Updates frequently (weekly) |
| SWE-bench Verified | 60 days | Updates irregularly (monthly) |
| MMLU-Pro | 90 days | Updates infrequently (quarterly) |
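A per-benchmark staleness check driven by thresholds like those above might look like this (the config shape and keys are illustrative assumptions, not the actual data/tasks.json format):

```typescript
// Staleness thresholds in days, mirroring the table above.
const stalenessDays: Record<string, number> = {
  "lmsys-arena-elo": 30,
  "swe-bench-verified": 60,
  "mmlu-pro": 90,
};

// A score is stale when its age exceeds the benchmark's threshold.
function isStale(benchmark: string, evaluatedAt: Date, now: Date): boolean {
  const threshold = stalenessDays[benchmark];
  // Unknown benchmark: treat as fresh (an assumption for this sketch).
  if (threshold === undefined) return false;
  const ageDays = (now.getTime() - evaluatedAt.getTime()) / 86_400_000;
  return ageDays > threshold;
}
```

Keeping the thresholds in configuration rather than code means cadence changes (e.g. a leaderboard updating more often) don't require a code change.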
Coverage Badges
When a benchmark task is selected, each model displays a coverage badge.
7. Sources & Citations
Benchmark Sources
Every benchmark entry in our dataset includes a direct link to the source. We only use official or primary sources.
Pricing Sources
Every pricing field includes a source_url linking to the official provider pricing page and a verified_at date. Source links are displayed in the Compare table and Cost Calculator results.
If a price cannot be verified against its official source, it is not published. We prefer no data over incorrect data.
This methodology is versioned alongside the codebase. All formulas referenced on this page are implemented in lib/pricing.ts and lib/benchmarks.ts, with unit tests in tests/. Source configuration lives in data/tasks.json.