Research Report
Which LLM Should You Trust With Your Tools?
9 models. 12 benchmarks. One answer: you're probably overpaying.
Coming Soon
What You'll Learn
Key findings from the benchmark.
The best-performing model costs $0.00 per million tokens -- and beats every paid model in the study
Beyond the free tier, extra spend buys quality gains too small to measure reliably
The most expensive model tested ($120/M output tokens) had the lowest pass rate of any model in the study
6 of 9 models failed the ambiguity benchmark, regardless of price tier
What's Included
Everything you need to make better model decisions.
- 42-page PDF report with full methodology, per-benchmark breakdowns, and recommendations
- 108 raw prompt-response pairs (JSON) -- every model's actual output for every benchmark
- Scoring dataset (CSV) with quality scores, latency, cost, and pass/fail for all 108 evaluations
- Cost-vs-quality scatter plot data for your own analysis (see the loading sketch below)
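To show how little code it takes to start working with the deliverables, here is a minimal sketch that loads the scoring CSV and reproduces a cost-vs-quality scatter. The file name and column names (`model`, `quality`, `cost_per_m_output`) are assumptions for illustration; check them against the shipped dataset.

```python
# Minimal sketch: cost-vs-quality scatter from the scoring CSV.
# File name and column names (model, quality, cost_per_m_output)
# are assumptions -- adjust to match the actual dataset.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("scoring.csv")

# One point per model: mean quality across all 12 benchmarks.
per_model = df.groupby("model").agg(
    quality=("quality", "mean"),
    cost=("cost_per_m_output", "first"),  # price is constant per model
)

fig, ax = plt.subplots()
ax.scatter(per_model["cost"], per_model["quality"])
for name, row in per_model.iterrows():
    ax.annotate(name, (row["cost"], row["quality"]))
ax.set_xlabel("Output cost ($/M tokens)")
ax.set_ylabel("Mean quality score")
ax.set_title("Cost vs. quality, 9 models x 12 benchmarks")
plt.show()
```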
Sample Insight
GPT-5 Pro at $120/M output tokens scored 70.8% on quality. A free NVIDIA model scored 83.3%. The 171x price premium bought a 12.5-percentage-point drop in accuracy.
This is one data point from the full report. The complete analysis covers all 9 models across all 12 benchmarks.
Report Contents
42 pages of independent analysis.
- Executive Summary
- Methodology (model selection, benchmark suite, scoring rubric, DIAL orchestration)
- Model Roster (9 models across 5 cost tiers)
- Results by Benchmark (12 detailed breakdowns with per-model pass/fail)
- Three-Dimensional Analysis (quality, cost, latency + inflection point)
- Alignment & Confusion Analysis (inter-model agreement, Shannon entropy -- see the sketch below this list)
- Recommendations (best overall, best value, best budget, situational guidance)
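The confusion metric is straightforward to reproduce: for each benchmark, take the answers the 9 models returned and compute the Shannon entropy of that answer distribution. Zero entropy means full agreement; high entropy flags a benchmark the models find genuinely confusing. The sketch below illustrates the general technique, not the report's exact implementation; the example answers are hypothetical.

```python
# Sketch of the confusion metric: Shannon entropy over the answers
# the 9 models returned for one benchmark. High entropy = models
# disagree; zero entropy = full agreement. This illustrates the
# general technique, not the report's exact code.
from collections import Counter
import math

def answer_entropy(answers: list[str]) -> float:
    """Shannon entropy (bits) of the distribution of distinct answers."""
    counts = Counter(answers)
    total = len(answers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical example: 9 models split across 3 distinct tool-call choices.
answers = ["tool_a"] * 5 + ["tool_b"] * 3 + ["none"]
print(f"entropy = {answer_entropy(answers):.2f} bits")  # ~1.35 bits
```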
Get notified when this report is released
We tested 9 large language models across 12 function-calling benchmarks to find the inflection point where paying more stops buying better tool-use accuracy. The results challenge conventional wisdom about model pricing.
No spam. Unsubscribe anytime.