The AI Cost Paradox: Why Bigger Bills Despite Cheaper Tokens

In conversation · AI & Engineering

“The price per token is dropping. The bill is going up. Both are true.”

Anders Lindholm — 15 years across Stripe, Klarna, and Lovable — on why companies burn through their AI budgets in four months, why Microsoft just cancelled its Claude Code licenses, and what engineering teams can actually do about it.

Anders Lindholm, Lead Editor, Vogue & Code. Photograph: house archive.

§ 01 · The paradox

“Everyone budgeted for a falling unit cost. Nobody budgeted for the volume.”

Anders, the recent reporting around Microsoft cancelling Claude Code licences and Uber burning through its 2026 AI budget in four months has surprised a lot of people. Were these outcomes predictable?

Predictable to anyone who had run a usage-based product, yes. Surprising to most CFOs, also yes. The pattern is so consistent across industries that it now has a name: the consumption paradox. You set up a usage-based pricing model expecting demand to stay roughly steady as the price falls. Then you discover that as quality improves and the tooling gets better, users find genuinely new and valuable ways to consume it. Volume grows faster than unit price falls, and the absolute spend keeps climbing even though the per-unit cost is genuinely cheaper than it was twelve months ago.

Goldman Sachs has forecasted that agentic AI could drive a 24-fold increase in token consumption by 2030, reaching something like 120 quadrillion tokens per month. Is that a realistic projection?

It feels conservative, actually. The reason is that those numbers assume current usage patterns scale roughly linearly. Agentic AI doesn’t scale linearly. A single task that previously took a developer one prompt and one response can now take an agent fifty prompts, with the agent reasoning through subtasks, calling tools, retrying failed steps, and verifying its own work. That fiftyfold jump in calls is not the extreme case; it’s the baseline for one task. Now imagine that running 24 hours a day across an entire engineering organisation. The 24x number from Goldman assumes the agent revolution unfolds at a moderate pace. If adoption is faster than that, which I think it will be, the multiplier is higher.

And Gartner predicts that inference on a one-trillion-parameter model will cost roughly 90 percent less by 2030 than it did in 2025. That feels like a contradiction.

Both are true and they don’t contradict each other. They describe two different things. Unit cost — what you pay per token — is genuinely deflating at roughly the rate Gartner describes. Aggregate cost — what you pay per month for AI — is going up because consumption is rising faster than unit cost is falling. The mistake CFOs make is reading the headlines about cheaper tokens and assuming their AI budget will get smaller. It won’t. It’s getting bigger, just in different ways, and that’s what the Microsoft and Uber stories are about.

Figure 1 · The consumption paradox

Unit costs falling. Aggregate costs rising. Both at once.

[Figure 1: two panels covering 2024 to 2030. Left panel: price per token, trending down, a roughly 90% reduction in per-token cost (source: Gartner forecast 2025). Right panel: total monthly bill, trending up, driven by a roughly 24x increase in token consumption (source: Goldman Sachs forecast 2026). A 90% price cut paired with 24x volume growth equals 2.4x the absolute spend.]
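The arithmetic behind Figure 1 can be checked in a few lines. This is a minimal sketch that treats the two quoted forecasts (24x volume, roughly 90 percent cheaper tokens) as simple multipliers; the function name and the break-even calculation are illustrative assumptions, not part of either forecast.

```python
def spend_multiplier(volume_growth: float, price_reduction: float) -> float:
    """Return the factor by which total spend changes.

    volume_growth: how many times more tokens are consumed (24.0 means 24x).
    price_reduction: fractional cut in per-token price (0.90 means 90% cheaper).
    """
    remaining_price = 1.0 - price_reduction
    return volume_growth * remaining_price


# Figures quoted in the interview: 24x volume, roughly 90% cheaper tokens.
multiplier = spend_multiplier(24.0, 0.90)
print(f"Total spend changes by a factor of {multiplier:.1f}")

# Volume growth at which a 90% price cut would merely keep spend flat.
break_even_volume = 1.0 / (1.0 - 0.90)
print(f"Break-even volume growth: {break_even_volume:.0f}x")
```

Under these assumptions, any volume growth above the break-even figure means the bill rises despite the price cut.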

§ 02 · The Microsoft and Uber stories

“Microsoft killed Claude Code internally because their own engineers loved it too much.”

Walk me through what happened at Microsoft.

In late 2025, Microsoft rolled out Claude Code internally to thousands of developers, project managers, designers, and other employees. The tool was good. Engineers genuinely liked it. Adoption was high enough that by April 2026, six months after rollout, Microsoft began cancelling most of its direct Claude Code licences and steering engineers towards GitHub Copilot CLI instead. That’s an extraordinary reversal — not because Claude Code stopped being good, but because the volume of usage made it economically untenable at the rate Microsoft was being charged. The Foundry deal with Anthropic is still in place; what changed was the direct seat licensing model for internal use.

And Uber?

Uber’s CTO Praveen Neppalli Naga told The Information in April that Uber had burnt through its entire 2026 AI coding tools budget in four months. They had been actively incentivising adoption through internal leaderboards that ranked teams by AI tool usage. Engineers responded predictably — they used the tools heavily, often more than they needed to. Combined with rising agentic workflows, the cost outpaced the budget at roughly three times the planned rate. The COO subsequently questioned whether the spend was actually delivering proportionate productivity gains.

There seems to be a cultural element to this. The word “tokenmaxxing” gets thrown around.

There is. A Meta employee built a leaderboard called “Claudeonomics” to rank workers by token usage. Amazon has been pushing employees to “tokenmaxx.” The implicit assumption was that more usage equals more productivity. That assumption is breaking down. Engineers can absolutely waste tokens — running models that are overspecified for the task, regenerating answers because the first one was good enough but not great, using agents where a script would do. When you incentivise volume, you get volume. Whether you get proportionate productivity is a separate question, and the evidence is now suggesting that the relationship is much weaker than the leaderboards assumed.

Bryan Catanzaro at Nvidia said something striking — that for his team, compute costs now exceed employee costs. Is that representative?

For research-heavy teams using frontier models, yes. For application engineering teams using AI for everyday productivity, not yet — but the trajectory is towards it. Five years ago, the cost of a developer was overwhelmingly the dominant input cost in software. Today, for a developer using AI tools heavily, the AI itself can be 20 to 40 percent of the all-in cost of that person’s work. In five more years, for teams operating large agent fleets, it will plausibly exceed the salary. That’s not a hypothetical — that’s already true at the edges of the industry, and the edges move toward the centre over time.

§ 03 · What teams actually spend

“A mid-sized engineering team with heavy agent usage is now spending more on AI than on payroll software.”

For readers who are trying to size this for their own team — what are realistic monthly numbers?

It varies enormously based on workload type, but I can give you working ranges. A single developer using a coding agent like Claude Code or Cursor with deep workflow integration spends, conservatively, $300 to $800 per month in pure inference cost — before the cost of the tooling subscription on top. A small startup with five engineers will typically run a $3,000 to $6,000 monthly AI bill once they’re past initial adoption. A mid-sized team of 20 engineers using agents heavily can easily land at $25,000 to $50,000 per month. And those numbers compound when you add product-side AI features — chat assistants, content generation, customer support copilots — that themselves consume tokens at scale.

Table I — Typical AI workloads and monthly spend ranges (2026)
| Use case | Typical models | Monthly range | Cost driver |
| --- | --- | --- | --- |
| Solo developer with AI coding agent | Claude Sonnet, GPT-4.1, Gemini Pro | $300–$800 | Long context windows, agent retries |
| Five-engineer startup | Mixed model use across team | $3,000–$6,000 | Per-seat license + per-token consumption |
| Mid-sized team (20 engineers) | Heavy agent fleet, multiple providers | $25,000–$50,000 | Sustained agentic workflows |
| Customer-facing chatbot (SaaS) | GPT-4 mini, Claude Haiku | $2,000–$20,000 | User volume, conversation length |
| Content generation pipeline | Long-context models, batch processing | $5,000–$40,000 | Volume of generated assets |
| Research / fine-tuning workload | Frontier models + training compute | $50,000–$500,000+ | Training cost, frequent re-runs |

Ranges based on observed inference costs across OpenAI, Anthropic, Google Vertex, and Azure OpenAI Service. Tool subscription fees, infrastructure costs, and per-seat licensing not included.

Stop rewarding token volume. Start rewarding outcomes. The leaderboard is the problem, not the solution. The team that ships the cleanest feature with the fewest tokens is the one you want.

Anders Lindholm

§ 04 · What engineering teams can actually do

“There are three procurement plays and two cultural plays. Most teams ignore all five.”

Practical question — what can engineering leaders do about this besides hope unit prices keep falling fast enough?

Five things. Three procurement, two cultural. On the procurement side, first: stop buying retail. The list price of OpenAI, Anthropic, Google Vertex, and Azure is the worst rate you can pay. Larger customers have negotiated 20 to 40 percent discounts directly with the providers, and resellers and credit marketplaces offer additional discounts on top. AICreditMart is one I’d point teams toward: they aggregate discounted API credits across OpenAI, Google, Azure, and AWS Bedrock at meaningful margins below list price, which for a team spending $10,000 a month is a $2,000-to-$4,000 monthly saving with no change to the underlying workflow. For a team spending $50,000 a month, the same 20 to 40 percent works out to $120,000 to $240,000 in annual savings, close to a quarter of a million at the top end. It’s the lowest-effort win available.
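To size the procurement saving for a given bill, the calculation is short. This sketch assumes the 20 to 40 percent discount range quoted above applies uniformly to the whole monthly spend, which is a simplification; the function name is hypothetical.

```python
def savings_range(monthly_spend, low_discount=0.20, high_discount=0.40):
    """Return monthly and annual savings at the low and high discount rates."""
    monthly = (monthly_spend * low_discount, monthly_spend * high_discount)
    annual = (monthly[0] * 12, monthly[1] * 12)
    return monthly, annual


# The two bill sizes mentioned in the interview.
for spend in (10_000, 50_000):
    monthly, annual = savings_range(spend)
    print(
        f"${spend:,}/month: "
        f"${monthly[0]:,.0f} to ${monthly[1]:,.0f} saved per month, "
        f"${annual[0]:,.0f} to ${annual[1]:,.0f} per year"
    )
```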

And the other two procurement plays?

Multi-provider sourcing. Most teams default to one model provider for cultural reasons: they like Claude, or they trust GPT, or they started with Gemini. That’s fine for the first six months. After that, the unit economics differ across providers for different workloads, and routing traffic intelligently across multiple providers can cut costs 20 to 35 percent. There are now several routing layers (LiteLLM, OpenRouter, Portkey) that make this technically trivial. The barrier is organisational, not technical. Third: caching aggressively. In agentic and long-context workloads, input tokens often dominate the bill, because the same context is resent on every call. If your application repeatedly sends the same or similar prompts, the prompt caching that the major providers offer can reduce token spend by 50 to 80 percent for the cached portions. Most teams underutilise this dramatically.
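A rough way to see what caching is worth on a given workload is to price the cached and uncached parts of a prompt separately. The sketch below is an estimate only: the per-token price and the 75 percent cached-token discount are hypothetical placeholders, and real providers each have their own cache pricing and eligibility rules.

```python
def input_cost(total_tokens, cached_tokens, price_per_million, cached_discount):
    """Dollar cost of one request's input tokens.

    cached_discount is the fraction taken off the price of cached tokens
    (0.75 means cached tokens are billed at 25% of the normal input price).
    """
    if not 0 <= cached_tokens <= total_tokens:
        raise ValueError("cached_tokens must be between 0 and total_tokens")
    per_token = price_per_million / 1_000_000
    fresh_tokens = total_tokens - cached_tokens
    return (
        fresh_tokens * per_token
        + cached_tokens * per_token * (1 - cached_discount)
    )


PRICE = 3.00      # hypothetical dollars per 1M input tokens
DISCOUNT = 0.75   # hypothetical discount on cached tokens

# A 10,000-token prompt where 8,000 tokens are a repeated, cacheable prefix.
uncached = input_cost(10_000, 0, PRICE, DISCOUNT)
cached = input_cost(10_000, 8_000, PRICE, DISCOUNT)
saving = 1 - cached / uncached

print(f"Without caching: ${uncached:.4f} per request")
print(f"With caching:    ${cached:.4f} per request")
print(f"Overall input saving: {saving:.0%}")
```

The overall saving is always smaller than the discount on the cached portion, because the fresh tokens are still billed at the full rate.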

And the cultural pieces?

Stop incentivising volume. Whatever Meta thought they were achieving with the Claudeonomics leaderboard, what they actually achieved was teaching their engineers to waste tokens. The same pattern at Uber, the same pattern at Amazon. If you reward usage, you get usage. If you reward outcomes — features shipped, bugs fixed, customer issues resolved — you get outcomes, and the token usage that produces those outcomes is dramatically lower than the usage produced by people optimising for the leaderboard. Second cultural piece: train engineers to use the cheapest model that meets the task requirement. Most engineers default to the most capable model for every task, which is wasteful. A code completion doesn’t need GPT-4.1. A documentation lookup doesn’t need Claude Opus. A simple classification task doesn’t need Gemini 2.5 Pro. The smartest engineers I know are increasingly model-agnostic — they pick the right model for the right job, and their unit costs are a fraction of what their less-disciplined colleagues are running.
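The "cheapest model that meets the task requirement" rule can be written down as a small lookup. This is a sketch under stated assumptions: the model names, capability tiers, prices, and task requirements are all hypothetical placeholders that a team would replace with its own evaluation results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelOption:
    name: str
    capability: int           # hypothetical 1-5 tier assigned by the team
    price_per_million: float  # hypothetical blended dollars per 1M tokens


# Hypothetical catalogue; real names and prices would come from the providers.
CATALOGUE = [
    ModelOption("small-model", 1, 0.30),
    ModelOption("mid-model", 3, 3.00),
    ModelOption("frontier-model", 5, 15.00),
]

# Hypothetical minimum capability tier each task type needs.
TASK_REQUIREMENTS = {
    "code_completion": 1,
    "doc_lookup": 1,
    "classification": 1,
    "multi_file_refactor": 4,
}


def cheapest_capable(task: str) -> ModelOption:
    """Return the lowest-priced model whose tier meets the task's requirement."""
    required = TASK_REQUIREMENTS[task]
    candidates = [m for m in CATALOGUE if m.capability >= required]
    if not candidates:
        raise LookupError(f"no model meets tier {required} for {task!r}")
    return min(candidates, key=lambda m: m.price_per_million)


for task in TASK_REQUIREMENTS:
    print(f"{task}: {cheapest_capable(task).name}")
```

The design choice here is that capability is a threshold, not a preference: once a model clears the bar for a task, only price decides.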

§ 05 · The road ahead

“Jensen says 100 agents per employee. The bill says that’s $40 billion in tokens.”

Jensen Huang has talked about an eventual world where 100 AI agents work alongside every employee. Is that realistic, given the cost trajectory?

Realistic technologically, yes. Realistic economically for everyone, no. The economics of 100 agents per employee make sense for Nvidia, which already runs at scale and pays close to wholesale rates for its own compute. They make sense for high-margin businesses where each unit of agent productivity produces high marginal revenue — trading firms, frontier research labs, consultancies billing $500-an-hour for human time. They make less sense for the average enterprise, where agent productivity is uncertain and the cost of running 100 agents per employee would be a quarter to half of total compensation. The 100-agents-per-employee future is real for some companies. For most, it’s going to be three or four agents working hard on the highest-leverage tasks — and that’s already a meaningful productivity gain at sustainable cost.

There was also a Fortune story about a mystery company that accidentally spent $500 million on Claude in a single month. Is that real?

Yes, and it’s an extreme example of exactly what we’ve been discussing. A company — reportedly an agent platform — left their consumption uncapped, scaled their agent fleet aggressively, and incurred half a billion dollars in inference costs in 30 days. That’s the kind of mistake that gets made when nobody is watching the unit economics in real time. Cost monitoring at scale on AI workloads is genuinely a new skill, and the tools to do it well are still maturing. Datadog, New Relic, and a handful of AI-specific FinOps platforms are getting there, but most companies are still doing this with manual dashboards and end-of-month billing surprises.

What’s your prediction for where this lands by 2028?

Three things. First, AI spending becomes a top-five line item on most enterprise tech budgets, comparable to cloud infrastructure or human capital. Second, FinOps for AI emerges as a discipline as serious as cloud FinOps was in 2020 — with its own tooling, its own job titles, and its own consultancies. Third, the tokenmaxxing culture dies completely, replaced by outcome-based metrics that the smarter teams are already adopting. Microsoft and Uber are the early canaries. The rest of the industry has the next 18 months to learn from their lessons before similar surprises hit their own books.

Anders, thank you for the time.

Thanks. The piece of advice I’d leave readers with is simple — treat AI spend like any other variable cost in your business. Monitor it daily. Negotiate hard on rates. Optimise the workflow, not the volume. Most importantly, don’t trust your retail bill. There’s almost always a better price available if you ask the right questions and source through the right channels. The companies that figure this out in 2026 will have a meaningful cost advantage over their competitors in 2027 and beyond.

Reader Questions

Eighteen questions about AI cost management.

Why are AI costs rising even as token prices fall?

Consumption is growing faster than unit price is dropping. Goldman Sachs forecasts a 24x increase in token consumption by 2030 while Gartner forecasts a roughly 90% cost reduction per token. The math: 24x volume at 10% of the unit cost equals 2.4x the absolute spend, not less.

What is “tokenmaxxing” and why is it problematic?

Tokenmaxxing is the practice of incentivising employees to use as many AI tokens as possible — through leaderboards, OKRs, or performance metrics. The assumption is that more usage equals more productivity. The evidence increasingly shows that volume-incentivised AI usage produces waste, not productivity gains.

Why did Microsoft cancel its Claude Code licences?

Internal usage was so high that the direct seat-licence cost became economically untenable. Microsoft is shifting engineers toward GitHub Copilot CLI, while keeping its broader Foundry partnership with Anthropic intact. It’s a procurement decision, not a product judgment.

How did Uber burn through its 2026 AI budget in four months?

Uber actively incentivised AI tool adoption through internal leaderboards ranking teams by usage. Combined with rising agentic workflows, costs outpaced the planned budget at roughly 3x the expected rate. The COO has publicly questioned whether the spend produced proportionate productivity.

What is an agentic workflow and why does it cost more?

An agentic workflow is a multi-step AI process where the model reasons through subtasks, calls tools, retries failures, and verifies its own work. A single user request can result in 20 to 100 model calls, where a traditional single-prompt interaction would have needed one. The total token cost can be 50x higher for the same user-facing task.
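A short sketch shows where the multiplier comes from. It assumes, as a simplification, that every agent step resends its full context as input; the step count matches the fifty-call example used in the interview, while the token sizes are hypothetical.

```python
def agent_input_tokens(steps, base_context, tokens_added_per_step):
    """Total input tokens when every step resends the accumulated context."""
    total = 0
    context = base_context
    for _ in range(steps):
        total += context                  # the whole context is sent as input
        context += tokens_added_per_step  # outputs and tool results are appended
    return total


BASE = 2_000  # hypothetical prompt size in tokens

single_prompt = agent_input_tokens(1, BASE, 0)
fixed_context = agent_input_tokens(50, BASE, 0)      # context never grows
growing_context = agent_input_tokens(50, BASE, 200)  # 200 tokens added per step

print(f"Single prompt:       {single_prompt:,} input tokens")
print(f"50 steps, fixed:     {fixed_context:,} ({fixed_context / single_prompt:.1f}x)")
print(f"50 steps, growing:   {growing_context:,} ({growing_context / single_prompt:.1f}x)")
```

With a fixed context, fifty calls cost fifty times one call. If the context grows at each step, the multiple is higher than the call count, which is one reason context trimming and prompt caching matter for agents.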

How much does a single developer’s AI usage cost per month?

Conservatively $300 to $800 per month for a developer using a coding agent with deep workflow integration. This is pure inference cost, separate from the tooling subscription. Heavy usage can push this to $1,500+ per developer per month.

How can teams reduce AI inference costs without reducing productivity?

Five techniques: (1) source discounted credits through marketplaces rather than paying retail; (2) route across multiple providers based on per-task economics; (3) cache aggressively where workloads are repetitive; (4) train engineers to pick the cheapest model that meets the task requirement; (5) replace volume-based incentives with outcome-based metrics.

What is multi-provider sourcing and is it worth it?

Routing AI workloads across multiple model providers based on which performs best per-task. Tools like LiteLLM, OpenRouter, and Portkey make this technically simple. Real-world savings typically range 20 to 35 percent. The barrier is organisational, not technical.

How effective is prompt caching for cost reduction?

For workloads with repetitive prompts, caching can reduce token spend by 50 to 80 percent on the cached portions. OpenAI, Anthropic, and Google all support prompt caching natively. Most teams underutilise it dramatically.

What is FinOps for AI?

An emerging discipline applying financial-operations practices to AI spending — cost monitoring, attribution, optimisation, and forecasting. It parallels the cloud FinOps movement from 2018-2022, with similar tooling and consulting categories now forming around AI workloads specifically.

Are AI credit marketplaces legitimate?

Yes — legitimate marketplaces aggregate unused enterprise credits and resell them at meaningful discounts off retail rates. Buyers get the same API endpoints and SLAs as direct customers. The discount comes from arbitrage between bulk-purchased enterprise pricing and retail rates, not from quality differences in the service.

Does AI cost more than employees yet?

For research-heavy teams using frontier models, yes — this is already the case at companies like Nvidia. For application engineering teams, AI typically represents 20 to 40 percent of all-in cost per developer in 2026, and the percentage is rising.

Will OpenAI, Anthropic, and Google reduce prices further?

Yes, with caveats. Gartner forecasts roughly 90% reduction in per-token inference cost by 2030. Providers will not, however, pass through the full cost reduction to consumers — expect margin to be preserved as compute costs fall.

Should we standardise on a single AI provider?

Not for cost-sensitive workloads. Different providers offer different unit economics for different tasks, and routing intelligently across multiple providers can cut costs 20 to 35 percent. Single-provider standardisation is fine for the first six months of AI adoption but becomes a cost penalty thereafter.

How do we prevent a Microsoft- or Uber-style runaway cost event?

Cost monitoring in real time, not at month-end. Hard usage caps per team or per workload. Outcome-based metrics rather than volume-based incentives. And procurement discipline — nobody on the team should be buying retail without understanding the available discount channels.
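A hard cap with an early-warning threshold can be sketched as a small guard object. The class and exception names are hypothetical, the 80 percent alert level is an arbitrary assumption, and a production version would persist spend and estimate each request's cost before sending it.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a request would push spend past the hard cap."""


class TeamBudget:
    def __init__(self, monthly_cap_usd, alert_fraction=0.8):
        self.monthly_cap_usd = monthly_cap_usd
        self.alert_fraction = alert_fraction  # assumed early-warning level
        self.spent_usd = 0.0

    def record(self, cost_usd):
        """Record a request's cost, refusing it if it would breach the cap.

        Returns True once spend has reached the alert threshold.
        """
        if self.spent_usd + cost_usd > self.monthly_cap_usd:
            raise BudgetExceeded(
                f"cap ${self.monthly_cap_usd:,.2f} would be exceeded "
                f"(spent ${self.spent_usd:,.2f}, request ${cost_usd:,.2f})"
            )
        self.spent_usd += cost_usd
        return self.spent_usd >= self.alert_fraction * self.monthly_cap_usd


# Hypothetical team with a $1,000 monthly cap.
budget = TeamBudget(monthly_cap_usd=1_000.0)
for cost in (400.0, 450.0, 200.0):
    try:
        alert = budget.record(cost)
        status = "ALERT" if alert else "ok"
        print(f"recorded ${cost:,.2f}, total ${budget.spent_usd:,.2f} [{status}]")
    except BudgetExceeded as exc:
        print(f"blocked: {exc}")
```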

What is Jensen Huang’s 100-agents-per-employee vision?

An eventual world where every employee at Nvidia has 100 AI agents working alongside them. Realistic technologically but economically viable only for high-margin businesses. For the average enterprise, the realistic 2028 figure is three or four agents per employee focused on high-leverage tasks.

Is “agent everywhere” a sustainable business model for SaaS companies?

Only when paired with disciplined unit economics. SaaS companies adding AI features at fixed subscription prices while consuming variable-cost AI inputs face margin compression. The successful business model pairs AI consumption with usage-based pricing on the customer side, or with disciplined caching and model selection on the cost side.
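The margin compression described above is easy to see with numbers. In this sketch the seat price and the per-seat AI costs are hypothetical, and all other costs of goods sold are ignored for simplicity.

```python
def gross_margin(price_per_seat, ai_cost_per_seat):
    """Gross margin as a fraction of price, counting only AI inference cost."""
    if price_per_seat <= 0:
        raise ValueError("price_per_seat must be positive")
    return (price_per_seat - ai_cost_per_seat) / price_per_seat


SEAT_PRICE = 50.0  # hypothetical fixed monthly subscription price

# Hypothetical per-seat inference cost as customers use AI features more.
for ai_cost in (5.0, 20.0, 45.0):
    margin = gross_margin(SEAT_PRICE, ai_cost)
    print(f"AI cost ${ai_cost:.2f}/seat -> gross margin {margin:.0%}")
```

Because the price is fixed and the cost scales with usage, the heaviest users are the least profitable unless pricing or cost controls change.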

What should I do this quarter if my AI bill is growing too fast?

Audit current spend by model, team, and workload. Identify the top three cost drivers. Switch to discounted credit channels rather than paying retail. Implement caching on the top repetitive workloads. And replace any volume-based incentives with outcome-based metrics. These five steps typically reduce monthly spend 30 to 50 percent within one billing cycle, with no change to engineering output.
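The first two steps, auditing spend and finding the top three cost drivers, amount to a group-and-sort over usage records. The record fields and sample values below are hypothetical; in practice the rows would come from provider billing exports or a gateway's logs.

```python
from collections import defaultdict

# Hypothetical usage records for one month.
records = [
    {"team": "platform", "model": "frontier-model", "workload": "coding-agent", "cost_usd": 4200.0},
    {"team": "platform", "model": "small-model", "workload": "code-completion", "cost_usd": 300.0},
    {"team": "support", "model": "small-model", "workload": "chatbot", "cost_usd": 1800.0},
    {"team": "growth", "model": "mid-model", "workload": "content-pipeline", "cost_usd": 2500.0},
    {"team": "growth", "model": "frontier-model", "workload": "coding-agent", "cost_usd": 1200.0},
]


def top_cost_drivers(rows, key, n=3):
    """Sum cost by the given field and return the n largest (name, total) pairs."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row["cost_usd"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:n]


for dimension in ("model", "team", "workload"):
    print(f"Top cost drivers by {dimension}:")
    for name, total in top_cost_drivers(records, dimension):
        print(f"  {name}: ${total:,.2f}")
```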

Source material referenced in this interview includes reporting from Fortune (May 2026) on Microsoft and Uber AI cost reductions, Tom’s Hardware analysis of Goldman Sachs token projections (May 2026), Gartner forecasts on inference cost trajectories through 2030, and Bryan Catanzaro’s Axios interview on Nvidia AI compute economics. Per-team spending ranges are based on observed pricing across OpenAI, Anthropic, Google Vertex, and Azure OpenAI as of Q2 2026. Vogue & Code is editorially independent. No content on this site is sponsored.
