
You Can't Forecast AI Costs in Tokens

Published: 12/2025
9 minute read
[Illustration: stacks of money in an AI system]

Here's a prediction: your team will miss its AI cost forecast this quarter.

Not because you didn't do the math. You estimated the request volume, multiplied by average tokens, and applied the vendor rate. The model was reasonable. The inputs, however, only reflected token pricing, not the full execution path.
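For concreteness, here is roughly what that token-only forecast looks like. Every number below is hypothetical; the point is what the calculation includes, and what it leaves out.

```python
# A minimal sketch of the usual forecast, with hypothetical numbers.
# Every figure here is illustrative, not a real vendor rate.

monthly_requests = 2_000_000
avg_input_tokens = 1_200
avg_output_tokens = 400

# Hypothetical blended rates, in dollars per 1M tokens
input_rate_per_m = 3.00
output_rate_per_m = 15.00

llm_forecast = monthly_requests * (
    avg_input_tokens * input_rate_per_m
    + avg_output_tokens * output_rate_per_m
) / 1_000_000

print(f"Forecasted LLM spend: ${llm_forecast:,.0f}/month")  # $19,200/month
# This number covers LLM inference only. Embeddings, vector DB operations,
# compute, and supporting services are not in it.
```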

The problem is that LLM inference is only one slice of your AI cost stack (and often not the largest one). A production AI system also incurs costs from embedding generation, vector database operations, GPU compute for fine-tuned models, orchestration infrastructure, guardrails, logging, and the internal services that glue it all together.

These costs are billed by different vendors, measured in different units, and logged in different systems. Correlating them is manual at best, impossible at worst. And until you can observe and attribute costs across the full stack, forecasting is guesswork.

The AI bill you didn’t expect

When leadership asks "What are we spending on AI?", they usually mean LLM API costs. That's the visible part. Here's what the full cost stack actually looks like:

  • LLM inference. The obvious one. Pay-per-token from OpenAI, Anthropic, Google, or others. But also: self-hosted inference costs if you're running open-weight models on your own GPUs, which shifts cost from API fees to compute infrastructure.

  • Embedding models. Every RAG system, semantic search feature, and retrieval pipeline runs embedding calls. These are cheap per-call but high-volume—often orders of magnitude more calls than generation requests. Billed separately from LLM inference, sometimes from different vendors.

  • Vector databases. Pinecone, Weaviate, Qdrant, Chroma, pgvector. Costs include storage, read operations, write operations, and index rebuilding. High-traffic applications with frequently updated indices can see vector DB costs grow faster than expected.

  • Compute infrastructure. GPU instances for fine-tuning jobs. Serverless functions for orchestration. Kubernetes clusters for model serving. Spot vs. on-demand pricing decisions. This layer is often managed by a different team than the one making LLM calls.

  • Supporting model services. Rerankers for retrieval quality. Classification models for routing. Guardrails and content filtering. Speech-to-text and vision models for multimodal pipelines. Each has its own pricing model and billing cycle.

  • Operational overhead. Observability infrastructure to store traces and logs (which grow linearly with AI traffic). Gateway and proxy services. Rate limiting and caching layers.

And the stack doesn’t stop at systems. Prompt engineering hours. Labeling for fine-tuning. Eval dataset curation. These are all part of the bill you didn’t expect. They aren’t infrastructure costs, but they scale with AI usage and show up as real line items in the AI budget.

When teams forecast AI costs based only on LLM token pricing, they routinely capture just 30–50% of actual spend. The rest stays invisible until invoices arrive.


The unit mismatch

Here's the deeper problem: even if you tracked all of these systems, they don't share a common unit.

  • LLM APIs charge per token
  • Embedding APIs charge per token (but with different tokenization)
  • Vector databases charge per operation, per GB stored, or per "read unit"
  • Compute charges per hour, per vCPU, or per GPU-second
  • Serverless charges per invocation and per GB-second of memory

What you actually care about is: what does it cost to serve this request? To run this workflow? To support this customer?

That answer requires converting all of these units into dollars and summing them per request. That requires knowing which resources each request touched. And that requires distributed tracing across systems that weren't designed to talk to each other.
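If every layer could report usage against a shared identifier, the conversion itself would be simple. The sketch below assumes hypothetical unit prices and a shared request_id field that, in most stacks, does not exist across these systems.

```python
# A sketch of the unit-conversion problem, assuming every system could
# report usage tagged with a shared request_id. All prices are hypothetical.

from collections import defaultdict

# Hypothetical unit prices, in dollars per unit
UNIT_PRICES = {
    "llm_token": 0.000003,         # blended input/output token rate
    "embedding_token": 0.00000002,
    "vector_read_unit": 0.0000004,
    "gb_second": 0.0000166,        # serverless memory
}

# Usage events as they might arrive from different systems
events = [
    {"request_id": "req-42", "unit": "llm_token", "quantity": 1_600},
    {"request_id": "req-42", "unit": "embedding_token", "quantity": 9_000},
    {"request_id": "req-42", "unit": "vector_read_unit", "quantity": 35},
    {"request_id": "req-42", "unit": "gb_second", "quantity": 210},
]

# Normalize everything to dollars and roll up per request
cost_per_request = defaultdict(float)
for event in events:
    cost_per_request[event["request_id"]] += (
        event["quantity"] * UNIT_PRICES[event["unit"]]
    )

print(dict(cost_per_request))  # roughly {'req-42': 0.0085}
```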

Most teams don't have this. So they manage each cost center independently and hope the totals come out reasonable.


Why agents make this worse

Traditional AI applications had relatively fixed execution paths. A classification endpoint uses roughly the same resources every time. A summarization pipeline scales predictably with input length.

Agentic systems break this. 

An agent decides its own execution path based on intermediate results. That's what makes agents useful—and what makes their costs non-deterministic.

A research agent might:

  • Make three retrieval calls and synthesize: touches the embedding API, the vector DB, and one LLM call

  • Make 12 retrieval calls, loop for clarification, re-retrieve, and synthesize: 4x the embedding calls, 4x the vector reads, 5 LLM calls

  • Hit an edge case and loop until the step limit: 20x the expected resource consumption across every layer

All three are the "same" request from the user's perspective. They're radically different across every cost dimension.

And the costs fan out across the entire stack. An agent loop doesn't just burn LLM tokens—it burns embedding calls, vector reads, compute cycles, logging storage. The multiplier effect hits every layer.
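To see how wide that spread can get, here is a rough calculation over the three paths above. The per-unit costs are hypothetical; only the relative spread matters.

```python
# A rough illustration of cost variance across the three paths above.
# Per-unit costs are hypothetical; the point is the spread, not the values.

COSTS = {"llm_call": 0.020, "embedding_call": 0.0002, "vector_read": 0.0004}

paths = {
    "happy path":   {"llm_call": 1,  "embedding_call": 3,  "vector_read": 3},
    "clarify loop": {"llm_call": 5,  "embedding_call": 12, "vector_read": 12},
    "edge case":    {"llm_call": 20, "embedding_call": 60, "vector_read": 60},
}

for name, usage in paths.items():
    total = sum(COSTS[unit] * count for unit, count in usage.items())
    print(f"{name}: ${total:.4f}")

# happy path:   $0.0218
# clarify loop: $0.1072
# edge case:    $0.4360  -> roughly a 20x spread for the "same" request
```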

The more autonomous your AI systems become, the wider your cost variance. The industry is moving toward more autonomy, not less.

The attribution gap

The core problem isn’t that AI costs are high. It’s that they’re not debuggable.

In a modern AI stack, there is no equivalent of a stack trace for cost. When spend spikes, teams can see the totals, but not the cause. Did an agent change its retrieval behavior? Did a routing rule start sending traffic to a more expensive model? Did a loop trigger more embedding calls? The data to answer these questions exists, but it lives in systems that don’t share identifiers or timelines.

Each layer logs in isolation. LLM providers log requests and tokens. Vector databases log queries and reads. Cloud platforms report aggregate compute usage. Orchestration layers capture execution traces. None of them agree on what constitutes a single request, workflow, or user action. There is no shared request ID linking these events into a coherent execution path.

Without that linkage, attribution breaks down.
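For illustration, here is roughly what that linkage could look like if every layer emitted spans carrying a shared workflow identifier and a cost attribute. This is a sketch using the OpenTelemetry API; the cost.* and workflow.id attribute names are an assumption, not an established convention.

```python
# A sketch of propagating one workflow_id through every layer and attaching
# cost as a span attribute. The cost.* attribute names are an assumption,
# not an established OpenTelemetry semantic convention.

from opentelemetry import trace

tracer = trace.get_tracer("ai.cost.sketch")

def retrieve(query: str, workflow_id: str) -> list[str]:
    with tracer.start_as_current_span("vector.query") as span:
        span.set_attribute("workflow.id", workflow_id)
        span.set_attribute("cost.usd", 0.0004)   # hypothetical read cost
        span.set_attribute("cost.unit", "read_unit")
        return ["doc-1", "doc-2"]                # placeholder results

def generate(prompt: str, workflow_id: str) -> str:
    with tracer.start_as_current_span("llm.generate") as span:
        span.set_attribute("workflow.id", workflow_id)
        span.set_attribute("cost.usd", 0.021)    # hypothetical token cost
        span.set_attribute("cost.unit", "token")
        return "answer"                          # placeholder output

docs = retrieve("quarterly revenue", workflow_id="wf-001")
answer = generate(f"Summarize: {docs}", workflow_id="wf-001")
```

With every event keyed to the same workflow identifier, the costs above could be joined into one execution path. Without it, they remain separate line items in separate systems.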

Ask your team questions like:

  • Which workflow is most expensive when fully loaded across all infrastructure?

  • What is the true cost per customer or per feature, including retrieval, compute, and orchestration?

  • When AI spend jumped last month, which system actually drove it?

Most organizations can’t answer with evidence. They infer from vendor invoices, run partial analyses, or rely on intuition. Cost incidents become investigations instead of diagnoses.

The impact shows up everywhere. Engineering can’t optimize because there’s no way to trace cost back to specific execution paths or agent behaviors. Finance can’t allocate spend cleanly across teams or products because costs only exist as vendor-level totals. Product teams struggle to price AI features sustainably because they don’t know the fully loaded cost to serve different usage patterns.

Until AI cost can be traced end-to-end at the request and workflow level, predictability is impossible. You can’t fix what you can’t trace. And today, most AI stacks are effectively flying without instruments.

Cheaper models won’t fix broken cost visibility

LLM prices are falling. New models ship with lower per-token rates, vendors offer deeper volume discounts, and open-weight models continue to drive inference costs down. On paper, this looks like progress. If tokens get cheaper, AI should be easier to budget.

In practice, it does not work that way.

Lower token prices reduce one line item. They do nothing to address how AI systems actually execute. They do not make costs more traceable, more attributable, or more predictable across the stack. As AI systems become more agentic and more distributed, the dominant cost problem shifts from price to visibility.

Predictable AI cost does not come from cheaper models. It comes from observability.

  • Unified telemetry across the stack. LLM calls, embedding requests, vector operations, compute usage, and internal services all need to emit cost-attributed events with shared identifiers. You need to trace a request through every system it touches.

  • Workflow-level attribution. Costs should roll up to meaningful units: this agent, this workflow, this team, this customer. Not "here's your OpenAI bill and here's your Pinecone bill and here's your AWS bill"—but "here's what this workflow costs, broken down by resource type."

  • Execution visibility. Seeing aggregate numbers isn't enough. You need to see why costs accrue—which branches an agent took, where loops occurred, what triggered fallbacks. The execution graph, not just the summary.

  • Anomaly detection that spans systems. A cost spike might show up in your vector DB bill but be caused by an agent behavior change that increased retrieval calls. Detection needs to correlate across layers, not alert on each silo independently.

  • Proactive controls. Budgets should be enforcement mechanisms, not reporting artifacts. Per-workflow spend caps, per-agent limits, and automatic circuit breakers turn cost from something you discover into something you manage.
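As a rough sketch of that last point, a per-workflow cap can be as simple as a counter that trips before the invoice does. The names below are illustrative, not any particular product's API.

```python
# A minimal sketch of a per-workflow spend cap acting as a circuit breaker.
# Class and threshold names are illustrative, not from any particular product.

class BudgetExceeded(RuntimeError):
    pass

class WorkflowBudget:
    def __init__(self, limit_usd: float):
        self.limit_usd = limit_usd
        self.spent_usd = 0.0

    def charge(self, cost_usd: float) -> None:
        """Record attributed cost and trip the breaker if the cap is hit."""
        self.spent_usd += cost_usd
        if self.spent_usd > self.limit_usd:
            raise BudgetExceeded(
                f"workflow spent ${self.spent_usd:.2f}, cap is ${self.limit_usd:.2f}"
            )

budget = WorkflowBudget(limit_usd=5.00)
budget.charge(0.021)    # an LLM call
budget.charge(0.0004)   # a vector read
# A runaway agent loop would keep calling charge() until BudgetExceeded stops it.
```

Doing this consistently, across every workflow and every layer, is the hard part.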

This is the layer Cake is built for.

Cake provides unified cost telemetry across the entire AI stack, including LLM providers, embedding services, vector databases, compute infrastructure, and orchestration layers. A single request is traced end to end through every system it touches, with cost attribution captured at each step.

Costs roll up to agents, workflows, teams, environments, and customers. Engineering can see exactly how execution paths translate into spend. Finance can attribute cost with confidence. Product teams can understand the fully loaded cost to serve different usage patterns and price accordingly.

Most importantly, Cake turns cost into an operational signal. Teams can set per-workflow budgets, enforce per-agent limits, and apply circuit breakers before runaway behavior turns into a surprise invoice. Cost becomes something you manage in real time, not something you explain after the fact.

LLM prices will continue to fall. AI systems will continue to grow more autonomous. Stacks will continue to get more complex.

The organizations that scale AI successfully will not be the ones chasing the cheapest tokens. They will be the ones that treat cost as a first-class observable across the entire stack, with the same rigor they apply to performance and security.

You cannot forecast what you cannot trace. And today, most AI cost is still invisible.