The True Cost of AI Tokens (And Why Current Prices Can't Last)

AI tokens are underpriced. Not by a little. By roughly an order of magnitude. I ran the numbers myself, and the gap between what providers charge and what it actually costs to serve inference is staggering.

Here’s the full breakdown.

Hardware: What a GPU-Hour Actually Costs

Take an 8×H200 server. Street price is around $308K–$315K for an SXM board [1]. I’ll use the lower end: $300K. Amortize each GPU over 5 years (generous, since most operators refresh faster):

Per GPU per year: $7,500
Per GPU per hour: ~$0.86

Add power, cooling, networking, and basic support. Standard 3–5 year support plans run $15K–$40K per server [2], and total rack-level costs (networking, storage, cooling) can push well beyond that [3]. I’m using ~$60K for a 3-year contract on an 8-GPU rack as a reasonable midpoint.

That’s ~$0.29/hr per GPU.

Total bare-metal cost per GPU-hour: ~$1.15

Not bad. But this is just the floor.

Running a Real Model

I used Kimi-K2 as the benchmark: the largest open-source model available. 1 trillion parameters, MoE architecture with 32 billion active parameters, 384 experts with 8 active per token [4]. It needs ~640GB of VRAM. With H200s at 141GB HBM3e each [5] (vs. 80GB HBM3 on H100s [6]), that’s 5 GPUs minimum.

5 GPUs × $1.15/hr = $5.75/hr for raw inference.

Still under minimum wage. But we haven’t accounted for the real cost multipliers.

The Training Tax (Dario Amodei’s 1:1 Ratio)

Here’s something most token-cost analyses miss entirely: you don’t just run inference servers. For every GPU serving requests, you need GPUs for training, fine-tuning, RLHF, red-teaming, and experimentation.

In his Dwarkesh Patel interview [7], Dario Amodei used a toy example with a 1:1 compute split between training and inference. Let’s use the same toy example here: for every dollar spent serving tokens, assume roughly a dollar spent on the research and training that produced the model.

This ratio will shift toward inference-heavy as models mature and scale. But as a simplifying assumption for back-of-envelope math today? Double the cost.

$5.75 → $11.50/hr

Utilization Reality

No cluster runs at 100%. You need headroom for traffic spikes, geographic load imbalance (US/EU peak hours vs. dead zones), and queuing buffers. Realistic utilization: 60–70%. Even being generous and assuming 80%, you need to spread that 20% idle cost across the productive hours.

$11.50 × 1.25 = ~$14.40/hr

Token Throughput → Price Per Million

Now, I should be honest: I did not personally benchmark Kimi-K2 on a rack of H200s. Shockingly, I don’t have $300K worth of NVIDIA hardware lying around. But I was able to find people who did. Reported benchmarks show 40–52 tokens/second on 8×H200 setups [8], though results vary wildly depending on the serving stack; some vLLM configs report much higher aggregate throughput [9]. Taking the optimistic single-request end at 50 tok/s:

1M tokens ÷ 50 tok/s = ~5.56 hours
5.56 hours × $14.40/hr = ~$80 per million tokens

And this is with a generous setup: lower-end hardware costs, optimistic utilization, best-case throughput, and a model that’s smaller than what OpenAI and Anthropic serve.

What’s Still Missing

That $80 doesn’t include: - Data center construction and lease - Physical security and compliance - Engineering salaries ($500K–$700K/yr for top ML researchers, Jensen Huang’s own ballpark [10], and that’s before the elite packages that go into millions [11]) - Corporate overhead, legal, taxes - Profit margin

Add 25% for overhead, then price for a 30% margin:

~$143 per million tokens at break-even profitability.

Compare to What You’re Paying

Model	Current Price (per 1M output tokens)
GPT-5.4	$15 [12]
Claude Opus 4.6	$25 [13]
Estimated real cost (Kimi-K2 class)	~$143

And frontier models like Opus or GPT-5.4 are larger than Kimi-K2. Even a conservative 2x multiplier puts their real cost at ~$285/M tokens, over 10x what Anthropic charges.

This tracks. Anthropic’s Claude Code Max plan ($200/month) can consume up to ~$5,000 in compute at retail API prices for heavy users [14]. That’s a 25:1 subsidy ratio at the extreme end.

What Happens Next

Current token prices are venture-subsidized. This era ends. When it does, expect:

Direct price increases. The obvious one, likely 5–10x for frontier models
Shrinkflation. Same price, smaller model behind the API (higher quantization, fewer parameters, worse quality)
Token quotas. Hard caps on usage tiers, especially for reasoning-heavy models
Tiered access. Pay more for the real model, get the distilled version by default

The companies running AI inference today are burning cash to capture market share. That’s a strategy, not a business model.

So What?

If you’re building on top of AI APIs, architect for 10x token costs. Not because it’ll happen overnight, but because the current pricing is structurally unsustainable.

The builders who survive the pricing correction will be the ones who already optimized their token economics: CLI over MCP where possible, aggressive caching, smart routing between model tiers, and knowing exactly which tasks need frontier intelligence vs. which can run on smaller models.

The cheap-token era is a gift. Use it to build. But don’t depend on it.