Research Report
Which LLM Should You Trust With Your Tools?
9 models. 12 benchmarks. One answer: you're probably overpaying.
Coming Soon
What You'll Learn
Key findings from the benchmark.
The best-performing model costs $0.00 per million tokens -- and beats every paid model in the study
Beyond the free tier, extra spend buys quality gains too small to measure reliably
The most expensive model tested ($120/M output tokens) had the lowest pass rate of any model in the study
6 of 9 models failed the ambiguity benchmark, regardless of price tier
What's Included
Everything you need to make better model decisions.
- 42-page PDF report with full methodology, per-benchmark breakdowns, and recommendations
- 108 raw prompt-response pairs (JSON) -- every model's actual output for every benchmark
- Scoring dataset (CSV) with quality scores, latency, cost, and pass/fail for all 108 evaluations
- Cost-vs-quality scatter plot data for your own analysis (see the loading sketch below)
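To show how little code it takes to start working with the deliverables, here is a minimal sketch that loads the scoring CSV and reproduces a cost-vs-quality scatter. The file name and column names (`model`, `quality`, `cost_per_m_output`) are assumptions for illustration; check them against the shipped dataset.

```python
# Minimal sketch: cost-vs-quality scatter from the scoring CSV.
# File name and column names (model, quality, cost_per_m_output)
# are assumptions -- adjust to match the actual dataset.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("scoring.csv")

# One point per model: mean quality across all 12 benchmarks.
per_model = df.groupby("model").agg(
    quality=("quality", "mean"),
    cost=("cost_per_m_output", "first"),  # price is constant per model
)

fig, ax = plt.subplots()
ax.scatter(per_model["cost"], per_model["quality"])
for name, row in per_model.iterrows():
    ax.annotate(name, (row["cost"], row["quality"]))
ax.set_xlabel("Output cost ($/M tokens)")
ax.set_ylabel("Mean quality score")
ax.set_title("Cost vs. quality, 9 models x 12 benchmarks")
plt.show()
```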
Sample Insight
GPT-5 Pro at $120/M output tokens scored 70.8% on quality. A free NVIDIA model scored 83.3%. The 171x price premium bought a 12.5-percentage-point drop in accuracy.
This is one data point from the full report. The complete analysis covers all 9 models across all 12 benchmarks.
Report Contents
42 pages of independent analysis.
- Executive Summary
- Methodology (model selection, benchmark suite, scoring rubric, DIAL orchestration)
- Model Roster (9 models across 5 cost tiers)
- Results by Benchmark (12 detailed breakdowns with per-model pass/fail)
- Three-Dimensional Analysis (quality, cost, latency + inflection point)
- Alignment & Confusion Analysis (inter-model agreement, Shannon entropy -- see the sketch below this list)
- Recommendations (best overall, best value, best budget, situational guidance)
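The confusion metric is straightforward to reproduce: for each benchmark, take the answers the 9 models returned and compute the Shannon entropy of that answer distribution. Zero entropy means full agreement; high entropy flags a benchmark the models find genuinely confusing. The sketch below illustrates the general technique, not the report's exact implementation; the example answers are hypothetical.

```python
# Sketch of the confusion metric: Shannon entropy over the answers
# the 9 models returned for one benchmark. High entropy = models
# disagree; zero entropy = full agreement. This illustrates the
# general technique, not the report's exact code.
from collections import Counter
import math

def answer_entropy(answers: list[str]) -> float:
    """Shannon entropy (bits) of the distribution of distinct answers."""
    counts = Counter(answers)
    total = len(answers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical example: 9 models split across 3 distinct tool-call choices.
answers = ["tool_a"] * 5 + ["tool_b"] * 3 + ["none"]
print(f"entropy = {answer_entropy(answers):.2f} bits")  # ~1.35 bits
```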
Get notified when this report is released
We tested 9 large language models across 12 function-calling benchmarks to find the inflection point where paying more stops buying better tool-use accuracy. The results challenge conventional wisdom about model pricing.
No spam. Unsubscribe anytime.