The Case for Smaller Models: Why Frontier AI Is Not Always the Answer
Frontier models are incredible. They are also overkill for 80% of what enterprise teams use them for.
There’s a default assumption in enterprise AI right now: when in doubt, choose the biggest model. It feels safe. More parameters, more capability, fewer complaints from users. But this instinct is often wrong and almost always expensive.
Most enterprise AI workloads do not need frontier-scale reasoning. Summarization, extraction, classification, routing, rewriting: these tasks succeed reliably on models that cost a fraction of what the most capable options do. The gap between adequate and overkill is where AI budgets go to die.
This isn’t about being cheap. It’s about good engineering.
The math that matters
Teams often anchor on token price alone, but the real cost comes from how frontier models amplify inefficiencies across the entire workflow. Three forces combine to make oversized model choices disproportionately expensive.
- Latency compounds across chains: Large models are slower. In agentic systems where one LLM call triggers the next, a small delay becomes a long wait. A two-second response can stretch into twenty seconds once it ripples across a ten-step chain. Longer chains also lead to more retries and more resource consumption, and both drive spend upward.
- Behavioral variance increases failure rates: Frontier models excel at open-ended reasoning, but that can introduce unnecessary creativity on simple, deterministic tasks. This increases retries and validation failures. Every retry on an expensive model multiplies your bill.
- Failures are more expensive: A flaky prompt hitting a lightweight model is inconvenient. The same prompt hitting a premium-tier model is a budget problem, especially at scale. Large models make small issues financially loud.
Taken together, these forces turn oversized model usage into a system-wide tax, not just a token-price issue. The back-of-the-envelope sketch below makes the compounding concrete.
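A rough way to see the compounding is to model a chain as a sequence of calls, each with a per-call latency, a retry probability, and a per-call price. The numbers in the sketch below are assumptions chosen for illustration, not benchmarks or real prices, but they show how the gap widens once chain length and retries multiply everything.

```python
# Back-of-the-envelope model of latency and cost compounding across a chain.
# Every number here is an illustrative assumption, not a benchmark or a price list.

def chain_totals(steps: int, latency_s: float, retry_rate: float, usd_per_call: float):
    """Expected end-to-end latency and cost for a sequential chain with retries."""
    expected_calls_per_step = 1 / (1 - retry_rate)      # geometric retry model
    total_calls = steps * expected_calls_per_step
    return total_calls * latency_s, total_calls * usd_per_call

# Hypothetical tiers for a ten-step agentic chain.
tiers = {
    "compact":  {"latency_s": 0.4, "retry_rate": 0.02, "usd_per_call": 0.0005},
    "frontier": {"latency_s": 2.0, "retry_rate": 0.05, "usd_per_call": 0.0150},
}

for name, t in tiers.items():
    latency, cost = chain_totals(steps=10, **t)
    print(f"{name:9s} ~{latency:5.1f}s end-to-end  ~${cost:.4f} per request")
```

With these assumed figures, the frontier chain comes out roughly five times slower and about thirty times more expensive per request. The exact ratios matter less than the shape: every extra second and every extra retry gets multiplied by chain length and then by request volume.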
How to choose the right model for the right task
Most workloads fall cleanly into two categories: tasks where small models excel and tasks where scale matters. Treating everything like a hard problem is how AI budgets get distorted.
Where smaller models excel
Lightweight models are ideal when tasks are narrow, well-scoped, and have clear success criteria. Examples include:
- Structured extraction from documents or receipts
- Ticket routing and classification
- Summaries of bounded content
- Format and style transformations
- First-pass filtering before a more capable model
These tasks depend on pattern recognition and constraint handling, and their success criteria can be checked mechanically, as in the sketch below. Bigger models do not make them better and often introduce variation that teams do not want.
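One way to see why these tasks suit compact models is that success can be written down as a fixed prompt plus a hard validation gate. The sketch below illustrates structured receipt extraction in that style; the prompt wording, field names, and the `complete_fn` callable are illustrative assumptions, not any specific vendor's API.

```python
# Sketch of structured extraction as a narrow, well-scoped task: one fixed
# prompt, one fixed schema, and a mechanical pass/fail check. The field names
# and the complete_fn callable are placeholders for illustration.
import json

EXTRACTION_PROMPT = """Extract these fields from the receipt text and return
ONLY a JSON object with the keys: vendor, date, total_usd.

Receipt:
{receipt_text}
"""

REQUIRED_KEYS = {"vendor", "date", "total_usd"}

def extract_receipt(receipt_text: str, complete_fn) -> dict:
    """complete_fn(prompt) -> str is whichever lightweight model you route to."""
    raw = complete_fn(EXTRACTION_PROMPT.format(receipt_text=receipt_text))
    data = json.loads(raw)                     # malformed output fails loudly
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"extraction missing fields: {missing}")
    return data
```

Because success here is binary and checkable, a retry or an escalation to a bigger model can be triggered automatically on failure rather than debated per request.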
Where frontier models earn their cost
Premium models are justified when the task is genuinely complex. Examples include:
- Multi-step reasoning across ambiguous or incomplete inputs
- Synthesis that requires broad world knowledge
- Highly constrained instruction following
- Creative generation where novelty matters
- Problems where correctness cannot be defined up front
Most enterprises have a mix of both types. The inefficiency comes from routing all of them to the same model tier by default.
The governance gap and how teams fix it
The core issue isn’t model choice. It’s a lack of visibility and policy enforcement around how models are actually being used.
A developer builds an internal tool, hardcodes a model, verifies that it works, and moves on. Six months later the workflow is handling 50k requests a day on a model that costs 20x more than necessary. No one intended this. It is simply the absence of guardrails.
The teams that manage AI economics successfully do two things: they measure everything, and they control everything that matters.
- Task-level instrumentation: You cannot optimize what you cannot see. High-performing teams measure task type, latency, retries, and cost for every model call.
- Tiered routing policies: Simple tasks should go to lightweight models by default. Complex tasks should escalate to frontier models only when needed. This must be automated, not left to memory or preference; a minimal sketch of what that can look like follows this list.
- Continuous evaluation loops: Model capabilities and pricing shift constantly. Workflows that required a frontier model last year may run perfectly well on a compact model today. Evaluation should be ongoing.
- Guardrails and anomaly detection: Policies need enforcement. High-volume workloads should not silently drift into more expensive model tiers. Anomalies should trigger alerts long before they become a month-end invoice surprise.
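To make the instrumentation, routing, and guardrail points concrete, here is a minimal sketch of an automated tiering policy. Everything in it is an assumption for illustration: the model names, prices, task labels, and the `call_model` and `validate` helpers are hypothetical placeholders, not any vendor's API and not Cake's product interface.

```python
# Minimal sketch of tiered routing with per-call telemetry, escalation, and a
# spend guardrail. Model names, prices, and the call_model / validate helpers
# are hypothetical placeholders -- not a real vendor API and not Cake's product.
import time
from dataclasses import dataclass

TIERS = {
    "light":    {"model": "small-model-v1",    "usd_per_call": 0.0005},  # assumed
    "frontier": {"model": "frontier-model-v1", "usd_per_call": 0.0150},  # assumed
}

# Default policy: narrow, well-scoped tasks stay on the light tier.
ROUTING = {
    "extraction": "light",
    "classification": "light",
    "summarization": "light",
    "open_ended_reasoning": "frontier",
}

DAILY_BUDGET_USD = 50.0   # assumed cap, for illustration

@dataclass
class CallRecord:         # task-level telemetry captured for every model call
    task_type: str
    tier: str
    latency_s: float
    retries: int
    cost_usd: float

TELEMETRY: list[CallRecord] = []

def call_model(model: str, prompt: str) -> str:
    """Placeholder for a real model call; returns a canned response here."""
    return f"[{model}] response to: {prompt[:40]}"

def validate(task_type: str, output: str) -> bool:
    """Placeholder validator; in practice a schema or format check per task."""
    return bool(output)

def over_budget() -> bool:
    """Guardrail hook: flag spend long before it becomes an invoice surprise."""
    return sum(r.cost_usd for r in TELEMETRY) > DAILY_BUDGET_USD

def route_and_call(task_type: str, prompt: str, max_retries: int = 2) -> str:
    tier = ROUTING.get(task_type, "frontier")   # unknown tasks escalate by default
    for attempt in range(max_retries + 1):
        start = time.time()
        output = call_model(TIERS[tier]["model"], prompt)
        TELEMETRY.append(CallRecord(task_type, tier, time.time() - start,
                                    attempt, TIERS[tier]["usd_per_call"]))
        if validate(task_type, output):
            return output
        tier = "frontier"   # escalate only after a cheap attempt fails validation
    raise RuntimeError(f"{task_type}: output failed validation after escalation")

print(route_and_call("classification", "Route this support ticket: printer offline"))
```

The details would differ in any real system, but the shape is the point: task type determines the default tier, escalation is an explicit and logged event rather than a habit, and the same telemetry that drives routing also feeds evaluation loops, spend caps, and anomaly alerts.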
Where Cake fits in
This is exactly the layer that Cake provides. Cake gives organizations the visibility, routing logic, and enforcement controls that make intentional model selection possible.
- You get workflow-level telemetry across every LLM call, embedding, tool invocation, and vector operation
- You can define routing policies that map task types to model tiers with automatic escalation paths
- You can isolate workloads with per-agent API keys for clean cost attribution
- You can set spend caps, block unauthorized model usage, and prevent drift toward premium tiers
- You can detect runaway chains and behavioral anomalies in real time
- You can discover where frontier models are not required and downgrade safely without breaking production workloads
Cake turns model selection into an engineering practice driven by measurement, control, and continuous improvement. Teams ship faster, spend less, and run more predictable systems because every task hits the model that matches its true complexity.
The bottom line
The biggest model is rarely the best model for any given task. Enterprise value comes from matching model capability to task complexity and having the visibility and guardrails to enforce that match at scale.
The real question is not whether you can afford frontier models. It is whether you can afford not to know when you are using them unnecessarily.
About the Authors
Skyler Thomas & Carolyn Newmark
SKYLER THOMAS is Cake's CTO and co-founder. He is an expert in the architecture and design of AI, ML, and Big Data infrastructure on Kubernetes. He has over 15 years of experience building massively scaled ML systems for Fortune 100 enterprises as a CTO, Chief Architect, and Distinguished Engineer at HPE, MapR, and IBM. He is a frequently requested speaker and has presented at numerous AI and industry conferences, including KubeCon, Scale by the Bay, O’Reilly AI, Strata Data, Strat AI, OOPSLA, and JavaOne.
CAROLYN NEWMARK (Head of Product, Cake) is a seasoned product leader who is helping to spearhead the development of secure AI infrastructure for ML-driven applications.