AI tokens are underpriced. Not by a little — by roughly an order of magnitude. I ran the numbers myself, and the gap between what providers charge and what it actually costs to serve inference is staggering.

Here’s the full breakdown.

Hardware: What a GPU-Hour Actually Costs

Take an 8×H200 server. Street price for an SXM system is around $308K–$315K [1]. I'll use the lower end, $300K, which works out to $37,500 per GPU. Amortize each GPU's share over 5 years (generous; most operators refresh faster):

  • Per GPU per year: $7,500
  • Per GPU per hour: ~$0.86

Add power, cooling, networking, and basic support. Standard 3–5 year support plans run $15K–$40K per server [2], and total rack-level costs (networking, storage, cooling) can push well beyond that [3]. I’m using ~$60K for a 3-year contract on an 8-GPU rack as a reasonable midpoint.

That’s ~$0.29/hr per GPU.

Total bare-metal cost per GPU-hour: ~$1.15
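The amortization math above, as a quick sketch. All inputs are the article's own assumptions (lower-end $300K server price, 5-year life, ~$60K over a 3-year support contract):

```python
# Bare-metal cost per GPU-hour, using the article's assumed inputs.
SERVER_PRICE_USD = 300_000
GPUS_PER_SERVER = 8
AMORTIZATION_YEARS = 5
HOURS_PER_YEAR = 24 * 365  # 8,760

# Capex share: $37,500 per GPU, spread over 5 years of hours
gpu_capex_hourly = SERVER_PRICE_USD / GPUS_PER_SERVER / AMORTIZATION_YEARS / HOURS_PER_YEAR

# Support, networking, cooling: $60K over a 3-year contract on 8 GPUs
SUPPORT_3YR_USD = 60_000
support_hourly = SUPPORT_3YR_USD / 3 / GPUS_PER_SERVER / HOURS_PER_YEAR

# Rounding each term separately, as the text does: $0.86 + $0.29 = $1.15
bare_metal_hourly = round(gpu_capex_hourly, 2) + round(support_hourly, 2)
print(f"capex ${gpu_capex_hourly:.2f}/hr + support ${support_hourly:.2f}/hr "
      f"= ${bare_metal_hourly:.2f}/GPU-hr")
```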

Not bad. But this is just the floor.

Running a Real Model

I used Kimi-K2 as the benchmark, the largest open-weight model available: 1 trillion parameters, MoE architecture with 32 billion active parameters, 384 experts with 8 routed per token [4]. The weights alone need ~640GB of VRAM. With H200s at 141GB of HBM3e each [5] (vs. 80GB of HBM3 on the H100 [6]), that's 5 GPUs minimum, before accounting for KV cache.

5 GPUs × $1.15/hr = $5.75/hr for raw inference.
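The sizing step, sketched out. The 640GB figure is the article's weight-footprint estimate and ignores KV cache and activation memory, so 5 GPUs is a floor:

```python
import math

# Minimum GPU count to hold the model weights, per the article's estimate.
MODEL_WEIGHTS_GB = 640   # article's VRAM estimate for Kimi-K2 weights
H200_VRAM_GB = 141       # HBM3e per H200

gpus_needed = math.ceil(MODEL_WEIGHTS_GB / H200_VRAM_GB)
raw_inference_hourly = gpus_needed * 1.15  # bare-metal $/GPU-hr from above
print(f"{gpus_needed} GPUs → ${raw_inference_hourly:.2f}/hr")
```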

Still under minimum wage. But we haven’t accounted for the real cost multipliers.

The Training Tax (Dario Amodei’s 1:1 Ratio)

Here’s something most token-cost analyses miss entirely: you don’t just run inference servers. For every GPU serving requests, you need GPUs for training, fine-tuning, RLHF, red-teaming, and experimentation.

In his Dwarkesh Patel interview [7], Dario Amodei used a toy example with a 1:1 compute split between training and inference. Let’s use the same toy example here — for every dollar spent serving tokens, assume roughly a dollar spent on the research and training that produced the model.

This ratio will shift toward inference-heavy as models mature and scale. But as a simplifying assumption for back-of-envelope math today? Double the cost.

$5.75 → $11.50/hr

Utilization Reality

No cluster runs at 100%. You need headroom for traffic spikes, geographic load imbalance (US/EU peak hours vs. dead zones), and queuing buffers. Realistic utilization: 60–70%. Even being generous and assuming 80%, you need to spread that 20% idle cost across the productive hours.

$11.50 × 1.25 = ~$14.40/hr
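The two multipliers, combined in one sketch. Both are stated assumptions: a 1:1 training-to-inference compute split (Amodei's toy example) and a generous 80% utilization, which spreads the idle 20% across productive hours:

```python
# Effective $/hr after the training tax and utilization adjustment.
raw_inference_hourly = 5.75
TRAINING_MULTIPLIER = 2.0   # 1:1 training:inference split doubles the cost
UTILIZATION = 0.80          # dividing by utilization spreads idle cost

effective_hourly = raw_inference_hourly * TRAINING_MULTIPLIER / UTILIZATION
print(f"${effective_hourly:.2f}/hr")  # $14.38, ~$14.40 in the text's rounding
```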

Token Throughput → Price Per Million

Now, I should be honest — I did not personally benchmark Kimi-K2 on a rack of H200s. Shockingly, I don’t have $300K worth of NVIDIA hardware lying around. But I was able to find people who did. Reported benchmarks show 40–52 tokens/second on 8×H200 setups [8], though results vary wildly depending on the serving stack — some vLLM configs report much higher aggregate throughput [9]. Taking the optimistic single-request end at 50 tok/s:

  • 1M tokens ÷ 50 tok/s = ~5.56 hours
  • 5.56 hours × $14.40/hr = ~$80 per million tokens
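The throughput-to-price conversion above, as a sketch, using the optimistic single-request figure of 50 tok/s from the cited benchmarks:

```python
# Hours to emit 1M tokens at a given speed, times fully loaded $/hr.
TOKENS_PER_SEC = 50           # optimistic end of the reported 40-52 range
EFFECTIVE_HOURLY_USD = 14.40  # fully loaded cost from the steps above

hours_per_million = 1_000_000 / TOKENS_PER_SEC / 3600
cost_per_million = hours_per_million * EFFECTIVE_HOURLY_USD
print(f"{hours_per_million:.2f} h → ${cost_per_million:.0f}/M tokens")
```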

And this is with a generous setup — lower-end hardware costs, optimistic utilization, best-case throughput, and a model that’s smaller than what OpenAI and Anthropic serve.

What’s Still Missing

That $80 doesn't include:

  • Data center construction and lease
  • Physical security and compliance
  • Engineering salaries ($500K–$700K/yr for top ML researchers — Jensen Huang's own ballpark [10], and that's before the elite packages that run into the millions [11])
  • Corporate overhead, legal, taxes
  • Profit margin

Add 25% for overhead ($80 → $100), then price so that 30% of revenue is margin ($100 ÷ 0.70):

~$143 per million tokens.
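The last step, sketched. The 25% overhead load and 30% margin are the article's round numbers; margin here is taken as a share of revenue, so price = cost / (1 − margin), not cost × 1.3:

```python
# Fully loaded price per million tokens with overhead and target margin.
cost_per_million = 80.0
OVERHEAD = 0.25   # +25% for facilities, salaries, legal, etc.
MARGIN = 0.30     # target margin as a fraction of revenue

loaded = cost_per_million * (1 + OVERHEAD)  # $100 fully loaded
price = loaded / (1 - MARGIN)               # ~$142.86
print(f"${price:.0f}/M tokens")
```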

Compare to What You’re Paying

  Model                                  Current price (per 1M output tokens)
  GPT-5.4                                $15 [12]
  Claude Opus 4.6                        $25 [13]
  Estimated real cost (Kimi-K2 class)    ~$143

And frontier models like Opus or GPT-5.4 are larger than Kimi-K2. Even a conservative 2x multiplier puts their real cost at ~$286/M tokens — over 10x what Anthropic charges.

This tracks. Anthropic’s Claude Code Max plan ($200/month) can consume up to ~$5,000 in compute at retail API prices for heavy users [14]. That’s a 25:1 subsidy ratio at the extreme end.

What Happens Next

Current token prices are venture-subsidized. This era ends. When it does, expect:

  1. Direct price increases — the obvious one, likely 5–10x for frontier models
  2. Shrinkflation — same price, smaller model behind the API (higher quantization, fewer parameters, worse quality)
  3. Token quotas — hard caps on usage tiers, especially for reasoning-heavy models
  4. Tiered access — pay more for the real model, get the distilled version by default

The companies running AI inference today are burning cash to capture market share. That’s a strategy, not a business model.

So What?

If you’re building on top of AI APIs, architect for 10x token costs. Not because it’ll happen overnight — but because the current pricing is structurally unsustainable.

The builders who survive the pricing correction will be the ones who already optimized their token economics — CLI over MCP where possible, aggressive caching, smart routing between model tiers, and knowing exactly which tasks need frontier intelligence vs. which can run on smaller models.

The cheap-token era is a gift. Use it to build. But don’t depend on it.


References

[1] IntuitionLabs, “NVIDIA AI GPU Pricing Guide,” updated March 2026.

[2] Uvation, “Breaking Down the AI Server Data Center Cost.”

[3] TRG Datacenters, “NVIDIA H200 Price Guide.”

[4] Moonshot AI, “Kimi-K2 Model Card.”

[5] PNY / NVIDIA, “H200 NVL Datasheet.”

[6] TechPowerUp, “NVIDIA H100 PCIe 80 GB Specs.”

[7] D. Amodei, interview with Dwarkesh Patel, Feb 2026.

[8] HuggingFace Community, “Kimi-K2-Instruct Discussions — Inference Benchmarks.”

[9] vLLM Documentation, “Kimi-K2 Deployment Recipe.”

[10] J. Huang (NVIDIA CEO), remarks on AI engineer compensation, via Tom’s Hardware.

[11] Levels.fyi, “OpenAI Software Engineer Compensation.”

[12] PricePerToken, “OpenAI GPT-5.4 Pricing.”

[13] PricePerToken, “Anthropic Claude Opus 4.6 Pricing.”

[14] M. Alderson, “No, It Doesn’t Cost Anthropic $5K Per Claude Code User.”