Research Report

Which LLM Should You Trust With Your Tools?

9 models. 12 benchmarks. One answer: you're probably overpaying.

Coming Soon

What You'll Learn

Key findings from the benchmark.

1

The best-performing model costs $0.00 per million tokens -- and outperforms models costing 300x more

2

Beyond the free tier, quality improvement drops below measurable thresholds

3

The most expensive model tested ($120/M output tokens) had the lowest pass rate of any model in the study

4

6 of 9 models failed the ambiguity benchmark, regardless of price tier

What's Included

Everything you need to make better model decisions.

  • 42-page PDF report with full methodology, per-benchmark breakdowns, and recommendations
  • 108 raw prompt-response pairs (JSON) -- every model's actual output for every benchmark
  • Scoring dataset (CSV) with quality scores, latency, cost, and pass/fail for all 108 evaluations
  • Cost-vs-quality scatter plot data for your own analysis

Sample Insight

GPT-5 Pro at $120/M output tokens scored 70.8% quality. A free NVIDIA model scored 83.3%. The 171x price premium bought negative 12.5 percentage points of accuracy.

This is one data point from the full report. The complete analysis covers all 9 models across all 12 benchmarks.

Report Contents

42 pages of independent analysis.

  1. Executive Summary
  2. Methodology (model selection, benchmark suite, scoring rubric, DIAL orchestration)
  3. Model Roster (9 models across 5 cost tiers)
  4. Results by Benchmark (12 detailed breakdowns with per-model pass/fail)
  5. Three-Dimensional Analysis (quality, cost, latency + inflection point)
  6. Alignment & Confusion Analysis (inter-model agreement, Shannon entropy)
  7. Recommendations (best overall, best value, best budget, situational guidance)
Coming Soon

Get notified when this report is released

We tested 9 large language models across 12 function-calling benchmarks to find the inflection point where paying more stops buying better tool-use accuracy. The results challenge conventional wisdom about model pricing.