The True Cost of AI Tokens (And Why Current Prices Can't Last)
I did the math on H200 inference costs. The real price per million tokens should be ~10x what you're paying today.
AI tokens are underpriced. Not by a little — by roughly an order of magnitude. I ran the numbers myself, and the gap between what providers charge and what it actually costs to serve inference is staggering.
Here’s the full breakdown.
Hardware: What a GPU-Hour Actually Costs
Take an 8×H200 server. Street price is around $308K–$315K for an SXM system [1]; I'll round down to $300K. Amortize each GPU over 5 years (generous; most operators refresh faster):
- Per GPU: $300K ÷ 8 = $37,500
- Per GPU per year: $7,500
- Per GPU per hour: ~$0.86
Add power, cooling, networking, and basic support. Standard 3–5 year support plans run $15K–$40K per server [2], and total rack-level costs (networking, storage, cooling) can push well beyond that [3]. I’m using ~$60K for a 3-year contract on an 8-GPU rack as a reasonable midpoint.
That’s ~$0.29/hr per GPU.
Total bare-metal cost per GPU-hour: ~$1.15
Not bad. But this is just the floor.
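The amortization arithmetic above can be sanity-checked in a few lines. This is a sketch using the assumptions stated in the text (server price, support contract, amortization windows), not measured operator costs:

```python
# Bare-metal cost per GPU-hour, using the article's assumptions.
HOURS_PER_YEAR = 24 * 365  # 8,760

server_price = 300_000      # 8x H200 server, rounded down from street price
gpus = 8
capex_years = 5             # amortization window

capex_per_gpu_hour = server_price / gpus / capex_years / HOURS_PER_YEAR

support_price = 60_000      # 3-year support + rack-level costs, midpoint
support_years = 3
opex_per_gpu_hour = support_price / support_years / gpus / HOURS_PER_YEAR

# Summing the rounded components, as the text does, gives the floor figure.
floor_per_gpu_hour = round(capex_per_gpu_hour, 2) + round(opex_per_gpu_hour, 2)
print(f"capex ${capex_per_gpu_hour:.2f} + opex ${opex_per_gpu_hour:.2f} "
      f"= ${floor_per_gpu_hour:.2f} per GPU-hour")
```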
Running a Real Model
I used Kimi-K2 as the benchmark — the largest open-source model available. 1 trillion parameters, MoE architecture with 32 billion active parameters, 384 experts with 8 active per token [4]. It needs ~640GB of VRAM. With H200s at 141GB HBM3e each [5] (vs. 80GB HBM3 on H100s [6]), that’s 5 GPUs minimum.
5 GPUs × $1.15/hr = $5.75/hr for raw inference.
Still under minimum wage. But we haven’t accounted for the real cost multipliers.
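The GPU count and raw hourly rate fall out directly. The ~640GB figure is the text's serving footprint for the weights; a real deployment also needs headroom for KV cache, so treat 5 cards as a floor:

```python
import math

model_vram_gb = 640        # Kimi-K2 serving footprint, per the text
h200_hbm_gb = 141          # HBM3e capacity per H200
cost_per_gpu_hour = 1.15   # bare-metal floor from the previous section

gpus_needed = math.ceil(model_vram_gb / h200_hbm_gb)  # 640 / 141 -> 5 cards
raw_hourly = gpus_needed * cost_per_gpu_hour
print(f"{gpus_needed} GPUs -> ${raw_hourly:.2f}/hr raw inference")
```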
The Training Tax (Dario Amodei’s 1:1 Ratio)
Here’s something most token-cost analyses miss entirely: you don’t just run inference servers. For every GPU serving requests, you need GPUs for training, fine-tuning, RLHF, red-teaming, and experimentation.
In his Dwarkesh Patel interview [7], Dario Amodei used a toy example with a 1:1 compute split between training and inference. Let’s use the same toy example here — for every dollar spent serving tokens, assume roughly a dollar spent on the research and training that produced the model.
This ratio will shift toward inference-heavy as models mature and scale. But as a simplifying assumption for back-of-envelope math today? Double the cost.
$5.75 → $11.50/hr
Utilization Reality
No cluster runs at 100%. You need headroom for traffic spikes, geographic load imbalance (US/EU peak hours vs. dead zones), and queuing buffers. Realistic utilization: 60–70%. Even being generous and assuming 80%, you need to spread that 20% idle cost across the productive hours.
$11.50 × 1.25 = ~$14.40/hr
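The two multipliers compose as a doubling for the training tax and a division by utilization (dividing by 0.8 is the same operation as the text's ×1.25):

```python
raw_hourly = 5.75       # 5 GPUs x $1.15/hr
training_ratio = 2.0    # Amodei's toy 1:1 training:inference split
utilization = 0.80      # generous; realistic clusters run 60-70%

# Idle capacity doesn't disappear; it gets spread over productive hours.
effective_hourly = raw_hourly * training_ratio / utilization
print(f"${effective_hourly:.2f}/hr effective")
```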
Token Throughput → Price Per Million
Now, I should be honest — I did not personally benchmark Kimi-K2 on a rack of H200s. Shockingly, I don’t have $300K worth of NVIDIA hardware lying around. But I was able to find people who did. Reported benchmarks show 40–52 tokens/second on 8×H200 setups [8], though results vary wildly depending on the serving stack — some vLLM configs report much higher aggregate throughput [9]. Taking the optimistic single-request end at 50 tok/s:
- 1M tokens ÷ 50 tok/s = ~5.56 hours
- 5.56 hours × $14.40/hr = ~$80 per million tokens
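At that throughput, serving a million tokens ties up the whole 5-GPU slice for hours, which is where the per-million figure comes from:

```python
effective_hourly = 14.375   # hourly rate after training tax and utilization
tokens_per_second = 50      # optimistic single-request throughput

hours_per_million = 1_000_000 / tokens_per_second / 3600
cost_per_million = hours_per_million * effective_hourly
print(f"{hours_per_million:.2f} hours -> ${cost_per_million:.0f} per million tokens")
```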
And this is with a generous setup — lower-end hardware costs, optimistic utilization, best-case throughput, and a model that’s smaller than what OpenAI and Anthropic serve.
What’s Still Missing
That $80 doesn’t include:
- Data center construction and lease
- Physical security and compliance
- Engineering salaries ($500K–$700K/yr for top ML researchers — Jensen Huang’s own ballpark [10], and that’s before the elite packages that run into the millions [11])
- Corporate overhead, legal, taxes
- Profit margin
Add 25% for overhead, then price for a 30% margin:
~$143 per million tokens as a sustainable price.
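The loading works like this (the 25% overhead and 30% margin are the assumed figures above, not audited numbers):

```python
base_cost = 80.0                   # raw serving cost per million tokens
with_overhead = base_cost * 1.25   # +25%: facilities, salaries, legal, taxes
price = with_overhead / 0.70       # price so that 30% of revenue is margin
print(f"${price:.0f} per million tokens")
```

Note that a 30% margin means dividing by 0.70, not multiplying by 1.30; pricing off revenue rather than cost is the standard convention.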
Compare to What You’re Paying
| Model | Current Price (per 1M output tokens) |
|---|---|
| GPT-5.4 | $15 [12] |
| Claude Opus 4.6 | $25 [13] |
| Estimated real cost (Kimi-K2 class) | ~$143 |
And frontier models like Opus or GPT-5.4 are larger than Kimi-K2. Even a conservative 2x multiplier puts their real cost at ~$286/M tokens, more than 11x what Anthropic charges.
This tracks. Anthropic’s Claude Code Max plan ($200/month) can consume up to ~$5,000 in compute at retail API prices for heavy users [14]. That’s a 25:1 subsidy ratio at the extreme end.
What Happens Next
Current token prices are venture-subsidized. This era ends. When it does, expect:
- Direct price increases — the obvious one, likely 5–10x for frontier models
- Shrinkflation — same price, smaller model behind the API (higher quantization, fewer parameters, worse quality)
- Token quotas — hard caps on usage tiers, especially for reasoning-heavy models
- Tiered access — pay more for the real model, get the distilled version by default
The companies running AI inference today are burning cash to capture market share. That’s a strategy, not a business model.
So What?
If you’re building on top of AI APIs, architect for 10x token costs. Not because it’ll happen overnight — but because the current pricing is structurally unsustainable.
The builders who survive the pricing correction will be the ones who already optimized their token economics — CLI over MCP where possible, aggressive caching, smart routing between model tiers, and knowing exactly which tasks need frontier intelligence vs. which can run on smaller models.
The cheap-token era is a gift. Use it to build. But don’t depend on it.
References
[1] IntuitionLabs, “NVIDIA AI GPU Pricing Guide,” updated March 2026.
[2] Uvation, “Breaking Down the AI Server Data Center Cost.”
[3] TRG Datacenters, “NVIDIA H200 Price Guide.”
[4] Moonshot AI, “Kimi-K2 Model Card.”
[5] PNY / NVIDIA, “H200 NVL Datasheet.”
[6] TechPowerUp, “NVIDIA H100 PCIe 80 GB Specs.”
[7] D. Amodei, interview with Dwarkesh Patel, Feb 2026.
[8] HuggingFace Community, “Kimi-K2-Instruct Discussions — Inference Benchmarks.”
[9] vLLM Documentation, “Kimi-K2 Deployment Recipe.”
[10] J. Huang (NVIDIA CEO), remarks on AI engineer compensation, via Tom’s Hardware.
[11] Levels.fyi, “OpenAI Software Engineer Compensation.”
[12] PricePerToken, “OpenAI GPT-5.4 Pricing.”
[13] PricePerToken, “Anthropic Claude Opus 4.6 Pricing.”
[14] M. Alderson, “No, It Doesn’t Cost Anthropic $5K Per Claude Code User.”