The Case for Smaller Models: Why Frontier AI Is Not Always the Answer
Frontier models are incredible. They are also overkill for 80% of what enterprise teams use them for.
There’s a default assumption in enterprise AI right now: when in doubt, choose the biggest model. It feels safe. More parameters, more capability, fewer complaints from users. But this instinct is often wrong and almost always expensive.
Most enterprise AI workloads do not need frontier-scale reasoning. Summarization, extraction, classification, routing, rewriting: these tasks succeed reliably on models that cost a fraction of what the most capable options do. The gap between adequate and overkill is where AI budgets go to die.
This isn’t about being cheap. It’s about good engineering.
The math that matters
Teams often anchor on token price alone, but the real cost comes from how frontier models amplify inefficiencies across the entire workflow. Three forces combine to make oversized model choices disproportionately expensive.
- Latency compounds across chains: Large models are slower. In agentic systems where one LLM call triggers the next, a small delay becomes a long wait. A two-second response can stretch into twenty seconds once it ripples across a ten-step chain. Longer chains also lead to more retries and more resource consumption, and both drive spend upward.
- Behavioral variance increases failure rates: Frontier models excel at open-ended reasoning, but that can introduce unnecessary creativity on simple, deterministic tasks. This increases retries and validation failures. Every retry on an expensive model multiplies your bill.
- Failures are more expensive: A flaky prompt hitting a lightweight model is inconvenient. The same prompt hitting a premium-tier model is a budget problem, especially at scale. Large models make small issues financially loud.
Taken together, these forces turn oversized model usage into a system-wide tax, not just a token-price issue. The back-of-the-envelope sketch below makes the compounding concrete.
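A rough way to see the compounding is to model a chain as a sequence of calls, each with a per-call latency, a retry probability, and a per-call price. The numbers in the sketch below are assumptions chosen for illustration, not benchmarks or real prices, but they show how the gap widens once chain length and retries multiply everything.

```python
# Back-of-the-envelope model of latency and cost compounding across a chain.
# Every number here is an illustrative assumption, not a benchmark or a price list.

def chain_totals(steps: int, latency_s: float, retry_rate: float, usd_per_call: float):
    """Expected end-to-end latency and cost for a sequential chain with retries."""
    expected_calls_per_step = 1 / (1 - retry_rate)      # geometric retry model
    total_calls = steps * expected_calls_per_step
    return total_calls * latency_s, total_calls * usd_per_call

# Hypothetical tiers for a ten-step agentic chain.
tiers = {
    "compact":  {"latency_s": 0.4, "retry_rate": 0.02, "usd_per_call": 0.0005},
    "frontier": {"latency_s": 2.0, "retry_rate": 0.05, "usd_per_call": 0.0150},
}

for name, t in tiers.items():
    latency, cost = chain_totals(steps=10, **t)
    print(f"{name:9s} ~{latency:5.1f}s end-to-end  ~${cost:.4f} per request")
```

With these assumed figures, the frontier chain comes out roughly five times slower and about thirty times more expensive per request. The exact ratios matter less than the shape: every extra second and every extra retry gets multiplied by chain length and then by request volume.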
How to choose the right model for the right task
Most workloads fall cleanly into two categories: tasks where small models excel and tasks where scale matters. Treating everything like a hard problem is how AI budgets get distorted.
Where smaller models excel
Lightweight models are ideal when tasks are narrow, well-scoped, and have clear success criteria. Examples include:
- Structured extraction from documents or receipts
- Ticket routing and classification
- Summaries of bounded content
- Format and style transformations
- First-pass filtering before a more capable model
These tasks depend on pattern recognition and constraint handling, and their success criteria can be checked mechanically, as in the sketch below. Bigger models do not make them better and often introduce variation that teams do not want.
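One way to see why these tasks suit compact models is that success can be written down as a fixed prompt plus a hard validation gate. The sketch below illustrates structured receipt extraction in that style; the prompt wording, field names, and the `complete_fn` callable are illustrative assumptions, not any specific vendor's API.

```python
# Sketch of structured extraction as a narrow, well-scoped task: one fixed
# prompt, one fixed schema, and a mechanical pass/fail check. The field names
# and the complete_fn callable are placeholders for illustration.
import json

EXTRACTION_PROMPT = """Extract these fields from the receipt text and return
ONLY a JSON object with the keys: vendor, date, total_usd.

Receipt:
{receipt_text}
"""

REQUIRED_KEYS = {"vendor", "date", "total_usd"}

def extract_receipt(receipt_text: str, complete_fn) -> dict:
    """complete_fn(prompt) -> str is whichever lightweight model you route to."""
    raw = complete_fn(EXTRACTION_PROMPT.format(receipt_text=receipt_text))
    data = json.loads(raw)                     # malformed output fails loudly
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"extraction missing fields: {missing}")
    return data
```

Because success here is binary and checkable, a retry or an escalation to a bigger model can be triggered automatically on failure rather than debated per request.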
Where frontier models earn their cost
Premium models are justified when the task is genuinely complex. Examples include:
- Multi-step reasoning across ambiguous or incomplete inputs
- Synthesis that requires broad world knowledge
- Highly constrained instruction following
- Creative generation where novelty matters
- Problems where correctness cannot be defined up front
Most enterprises have a mix of both types. The inefficiency comes from routing all of them to the same model tier by default.
The governance gap and how teams fix it
The core issue isn’t model choice. It’s a lack of visibility and policy enforcement around how models are actually being used.
A developer builds an internal tool, hardcodes a model, verifies that it works, and moves on. Six months later the workflow is handling 50k requests a day on a model that costs 20x more than necessary. No one intended this. It is simply the absence of guardrails.
The teams that manage AI economics successfully do two things: they measure everything, and they control everything that matters.
- Task-level instrumentation: You cannot optimize what you cannot see. High-performing teams measure task type, latency, retries, and cost for every model call.
- Tiered routing policies: Simple tasks should go to lightweight models by default. Complex tasks should escalate to frontier models only when needed. This must be automated, not left to memory or preference; a minimal sketch of what that can look like follows this list.
- Continuous evaluation loops: Model capabilities and pricing shift constantly. Workflows that required a frontier model last year may run perfectly well on a compact model today. Evaluation should be ongoing.
- Guardrails and anomaly detection: Policies need enforcement. High-volume workloads should not silently drift into more expensive model tiers. Anomalies should trigger alerts long before they become a month-end invoice surprise.
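To make the instrumentation, routing, and guardrail points concrete, here is a minimal sketch of an automated tiering policy. Everything in it is an assumption for illustration: the model names, prices, task labels, and the `call_model` and `validate` helpers are hypothetical placeholders, not any vendor's API and not Cake's product interface.

```python
# Minimal sketch of tiered routing with per-call telemetry, escalation, and a
# spend guardrail. Model names, prices, and the call_model / validate helpers
# are hypothetical placeholders -- not a real vendor API and not Cake's product.
import time
from dataclasses import dataclass

TIERS = {
    "light":    {"model": "small-model-v1",    "usd_per_call": 0.0005},  # assumed
    "frontier": {"model": "frontier-model-v1", "usd_per_call": 0.0150},  # assumed
}

# Default policy: narrow, well-scoped tasks stay on the light tier.
ROUTING = {
    "extraction": "light",
    "classification": "light",
    "summarization": "light",
    "open_ended_reasoning": "frontier",
}

DAILY_BUDGET_USD = 50.0   # assumed cap, for illustration

@dataclass
class CallRecord:         # task-level telemetry captured for every model call
    task_type: str
    tier: str
    latency_s: float
    retries: int
    cost_usd: float

TELEMETRY: list[CallRecord] = []

def call_model(model: str, prompt: str) -> str:
    """Placeholder for a real model call; returns a canned response here."""
    return f"[{model}] response to: {prompt[:40]}"

def validate(task_type: str, output: str) -> bool:
    """Placeholder validator; in practice a schema or format check per task."""
    return bool(output)

def over_budget() -> bool:
    """Guardrail hook: flag spend long before it becomes an invoice surprise."""
    return sum(r.cost_usd for r in TELEMETRY) > DAILY_BUDGET_USD

def route_and_call(task_type: str, prompt: str, max_retries: int = 2) -> str:
    tier = ROUTING.get(task_type, "frontier")   # unknown tasks escalate by default
    for attempt in range(max_retries + 1):
        start = time.time()
        output = call_model(TIERS[tier]["model"], prompt)
        TELEMETRY.append(CallRecord(task_type, tier, time.time() - start,
                                    attempt, TIERS[tier]["usd_per_call"]))
        if validate(task_type, output):
            return output
        tier = "frontier"   # escalate only after a cheap attempt fails validation
    raise RuntimeError(f"{task_type}: output failed validation after escalation")

print(route_and_call("classification", "Route this support ticket: printer offline"))
```

The details would differ in any real system, but the shape is the point: task type determines the default tier, escalation is an explicit and logged event rather than a habit, and the same telemetry that drives routing also feeds evaluation loops, spend caps, and anomaly alerts.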
Where Cake fits in
This is exactly the layer that Cake provides. Cake gives organizations the visibility, routing logic, and enforcement controls that make intentional model selection possible.
- You get workflow-level telemetry across every LLM call, embedding, tool invocation, and vector operation
- You can define routing policies that map task types to model tiers with automatic escalation paths
- You can isolate workloads with per-agent API keys for clean cost attribution
- You can set spend caps, block unauthorized model usage, and prevent drift toward premium tiers
- You can detect runaway chains and behavioral anomalies in real time
- You can discover where frontier models are not required and downgrade safely without breaking production workloads
Cake turns model selection into an engineering practice driven by measurement, control, and continuous improvement. Teams ship faster, spend less, and run more predictable systems because every task hits the model that matches its true complexity.
The bottom line
The biggest model is rarely the best model for any given task. Enterprise value comes from matching model capability to task complexity and having the visibility and guardrails to enforce that match at scale.
The real question is not whether you can afford frontier models. It is whether you can afford not to know when you are using them unnecessarily.
About the Authors
Skyler Thomas & Carolyn Newmark
SKYLER THOMAS is Cake's CTO and co-founder. He is an expert in the architecture and design of AI, ML, and Big Data infrastructure on Kubernetes. He has over 15 years of experience building massively scaled ML systems for Fortune 100 enterprises as a CTO, Chief Architect, and Distinguished Engineer at HPE, MapR, and IBM. He is a frequently requested speaker and has presented at numerous AI and industry conferences, including KubeCon, Scale by the Bay, O’Reilly AI, Strata Data, Strat AI, OOPSLA, and JavaOne.
CAROLYN NEWMARK (Head of Product, Cake) is a seasoned product leader who is helping to spearhead the development of secure AI infrastructure for ML-driven applications.