Who will have the largest single AI datacenter clusters over the next 2 years?

Based on my research, it’s currently difficult to do large unsupervised training runs across multiple data centers because part of the process is latency-sensitive (there’s an “all-reduce”/“all-gather” between steps). See: NVIDIA on long-haul training.
Because of that constraint, it does seem to matter how big your single largest data center cluster is for training. It’s also been widely rumored that GPT-5 took ~50k H100s to train (unconfirmed; you’ll see it repeated a lot). However, we do know that Grok 4 was trained in a single run on 150k H100s: xAI confirmed a total cluster size of 200k H100s, and Epoch AI estimated it cost about $490M to train Grok 4 (see also: xAI notes using 150k GPUs to train).
So the biggest unsupervised runs for current SOTA models seem to be somewhere between ~50k and 150k H100 equivalents. That’s actually a lot smaller than many people might have guessed based on model performance and the build-outs of next-gen clusters with next-gen chips.
NOTE: single-site training isn’t a hard constraint, and it will probably be overcome soon with various techniques, but let’s just say that under optimal current circumstances this is how people seem to be doing it now…
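To make the constraint concrete, here’s a minimal sketch (my own illustration, not any lab’s actual training stack) of the latency-sensitive synchronization step in plain data-parallel PyTorch. It assumes `torch.distributed` has already been initialized on every worker:

```python
import torch
import torch.distributed as dist

# Minimal sketch of the latency-sensitive step in data-parallel training.
# Assumes dist.init_process_group(...) has already been called on every worker.

def training_step(model: torch.nn.Module, batch: torch.Tensor,
                  optimizer: torch.optim.Optimizer) -> None:
    loss = model(batch).mean()
    loss.backward()
    # Every worker blocks here until gradients from all other workers arrive.
    # Inside one datacenter this synchronization rides on NVLink/InfiniBand;
    # across long-haul links, the added round-trip latency stalls every step,
    # which is why the size of a single site still matters for big runs.
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad.div_(world_size)
    optimizer.step()
    optimizer.zero_grad()
```

Frameworks hide this behind DDP, FSDP, ZeRO, etc., but the synchronization point is still there on every step, which is why long-haul latency hurts so much.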
NVIDIA SOTA chip roadmap
Relative to H100-equivalents, newer chips deliver more training throughput. Here are rough timelines and performance differences.
| Generation | Chip | Announced | First Ship | Volume Ramp | At Scale in DCs |
|---|---|---|---|---|---|
| Current: Blackwell | B200/GB200 | GTC Mar 2024 | Q4 2024 (Dec) | Q2-Q3 2025 | Mid-2025 |
| Blackwell Ultra | B300/GB300 | GTC Mar 2025 | H2 2025 | Q4 2025 | Late 2025/Early 2026 |
| Rubin | R100 | Computex Jun 2024 | H1 2026 | H2 2026 | Mid-to-Late 2026 |
| Rubin Ultra | R300 | — | H2 2027 | Late 2027 | 2028 |
And here are rough performance comparisons:
| Chip | vs H100 (FP8 Training) | Key Notes |
|---|---|---|
| H100 | 1x (baseline) | 80GB HBM3, 3.9 TB/s |
| B200 | ~2.5-3x | 192GB HBM3e, 8 TB/s, NVLink 5.0 |
| GB200 (2x B200 + Grace) | ~5-6x | 384GB combined, liquid-cooled racks |
| B300 | ~1.5x vs B200 | HBM3e refresh, higher clocks |
| R100 | ~2x vs B200 | New architecture, HBM4, TSMC 3nm |
| R300 | ~1.5x vs R100 | HBM4 refresh |
So roughly:
- B200: ~2.5–3× H100
- R100: ~5–6× H100
- R300: ~7–9× H100
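To see how these multipliers turn into the “H100e” figures used later, here’s a rough back-of-the-envelope conversion; the exact factors and the chip mix are assumed midpoints for illustration only, not official numbers:

```python
# Rough H100-equivalent (H100e) conversion using the multipliers above.
# The factors are assumed midpoints of the ranges, purely for illustration.
H100E_PER_CHIP = {
    "H100": 1.0,
    "B200": 2.75,  # ~2.5-3x H100
    "R100": 5.5,   # ~5-6x H100
    "R300": 8.0,   # ~7-9x H100
}

def cluster_h100e(chip_counts: dict[str, int]) -> float:
    """Convert a hypothetical mix of chips into H100-equivalents."""
    return sum(H100E_PER_CHIP[chip] * count for chip, count in chip_counts.items())

# Example: ~500k Blackwell-class chips works out to roughly 1.4M H100e,
# the ballpark of the largest clusters projected for H1 2026 below.
print(f"{cluster_h100e({'B200': 500_000}):,.0f}")  # ~1,375,000
```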
Algorithmic breakthroughs
I think there are a lot of different algorithmic breakthroughs going on—many of which we can’t see from the outside—but one thing we do know is that training precision has dropped dramatically. Models used to primarily be trained in 32-bit floating point (FP32), and now frontier training is commonly done in 8-bit floating point (FP8), with inference potentially moving toward FP4.
Training precision evolution:
- Pre-2017: FP32 (full precision, 32-bit)
- 2017-2020: Mixed precision—FP16 compute with FP32 accumulation (big speedup)
- 2020+: BF16 became the standard (A100 era)—better dynamic range than FP16, fewer overflow headaches
- 2022+: FP8 training viable with H100’s Transformer Engine
- 2024+: FP8 becoming default for frontier labs, FP4 emerging for inference
We also know B200 has special optimizations around FP8 and FP4.
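For reference, here’s what the 2020+ mixed-precision pattern looks like in PyTorch: a minimal BF16 autocast sketch with FP32 parameters and optimizer state. (FP8 training additionally needs per-tensor scaling, e.g. via NVIDIA’s Transformer Engine, which I’m not showing here; the model size and learning rate below are arbitrary placeholders.)

```python
import torch

# Minimal sketch of BF16 mixed-precision training: compute runs in BF16
# under autocast, while parameters and optimizer state stay in FP32.
model = torch.nn.Linear(4096, 4096).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # FP32 optimizer state

x = torch.randn(32, 4096, device="cuda")
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    out = model(x)                    # matmul runs in BF16 under autocast
    loss = out.float().pow(2).mean()
loss.backward()                       # gradients land back on FP32 parameters
optimizer.step()
optimizer.zero_grad()
```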
Who has the largest datacenter every 6 months for the next 2 years?
FYI: this is based on Epoch AI research (a nonprofit that compiles this data), and I put it together with Claude Opus 4.5. Also keep in mind what Epoch AI says about their data: estimates are accurate to within a factor of ~1.5 (80% confidence).
Baseline Reference
- GPT-5 training baseline: ~50,000 H100s (maximum).
- Current global AI compute stock: ~15 million H100-equivalents (H100e) delivered as of late 2025
- Epoch AI tracks: 2.5 million H100e across 13 major US facilities (~15% of global stock)
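For clarity, the “vs GPT-5 (50k)” columns in the tables below are just cluster capacity in H100e divided by that assumed ~50k-H100 baseline:

```python
# How the "vs GPT-5 (50k)" columns are derived: cluster capacity in H100e
# divided by the (rumored, unconfirmed) ~50k-H100 GPT-5 run.
GPT5_BASELINE_H100E = 50_000

def gpt5_multiple(cluster_capacity_h100e: float) -> float:
    return cluster_capacity_h100e / GPT5_BASELINE_H100E

print(gpt5_multiple(1_400_000))  # 28.0 -> the "28x" for Colossus 2 in H1 2026
```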
6-Month Incremental Projections
H1 2026 (Jan - Jun 2026)
| Facility | Owner/User | H100e Capacity | vs GPT-5 (50k) |
|---|---|---|---|
| Colossus 2 | xAI | 1,400,000 | 28x |
| Anthropic-Amazon New Carlisle | Anthropic | ~500,000 | 10x |
| OpenAI Stargate Abilene (Phase 1-2) | OpenAI | ~400,000 | 8x |
| Microsoft Fayetteville | Microsoft | ~350,000 | 7x |
| Meta Prometheus | Meta | ~300,000 | 6x |
| Google TPU clusters | Google/DeepMind | 1,000,000+ | 20x |
Largest single cluster: xAI Colossus 2 @ 1.4M H100e
Potential training run: 28x GPT-5 scale
H2 2026 (Jul - Dec 2026)
| Facility | Owner/User | H100e Capacity | vs GPT-5 (50k) |
|---|---|---|---|
| xAI Colossus (expanded) | xAI | 1,800,000+ | 36x |
| OpenAI Stargate Abilene (Phase 3-4) | OpenAI | ~800,000 | 16x |
| Microsoft Fairwater (Phase 1) | Microsoft/OpenAI | ~600,000 | 12x |
| Anthropic-Amazon (expanded) | Anthropic | ~700,000 | 14x |
| Meta Prometheus (expanded) | Meta | ~600,000 | 12x |
| CoreWeave clusters | Various | ~500,000 | 10x |
Largest single cluster: xAI @ 1.8M H100e
Potential training run: 36x GPT-5 scale
H1 2027 (Jan - Jun 2027)
| Facility | Owner/User | H100e Capacity | vs GPT-5 (50k) |
|---|---|---|---|
| Microsoft Fairwater Wisconsin | Microsoft/OpenAI | 2,500,000 | 50x |
| xAI Colossus (+ expansions) | xAI | 2,200,000 | 44x |
| OpenAI Stargate (6 buildings) | OpenAI | 1,500,000 | 30x |
| Meta clusters combined | Meta | 1,200,000 | 24x |
| Google TPU v6/v7 | Google/DeepMind | 1,500,000+ | 30x |
Largest single cluster: Microsoft Fairwater @ 2.5M H100e
Potential training run: 50x GPT-5 scale
H2 2027 (Jul - Dec 2027)
| Facility | Owner/User | H100e Capacity | vs GPT-5 (50k) |
|---|---|---|---|
| Microsoft Fairwater (full) | Microsoft/OpenAI | 5,000,000 | 100x |
| xAI (Memphis + expansions) | xAI | 3,000,000+ | 60x |
| OpenAI Stargate (8 buildings) | OpenAI | 2,500,000 | 50x |
| Meta Hyperion (Phase 1) | Meta | 2,000,000 | 40x |
| Amazon/Anthropic combined | Anthropic | 1,500,000 | 30x |
Largest single cluster: Microsoft Fairwater @ 5M H100e
Potential training run: 100x GPT-5 scale
H1 2028 (Jan - Jun 2028)
| Facility | Owner/User | H100e Capacity | vs GPT-5 (50k) |
|---|---|---|---|
| Meta Hyperion | Meta | 5,000,000 | 100x |
| Microsoft Fairwater (full) | Microsoft/OpenAI | 5,000,000 | 100x |
| xAI (1M GPU target) | xAI | 4,000,000+ | 80x |
| OpenAI Stargate (full) | OpenAI | 3,500,000 | 70x |
Multiple players at 5M H100e scale
Potential training run: 100x GPT-5 scale (multiple labs)
H2 2028 (Jul - Dec 2028)
| Facility | Owner/User | H100e Capacity | vs GPT-5 (50k) |
|---|---|---|---|
| Meta Hyperion (full) | Meta | 5,000,000+ | 100x+ |
| Microsoft (multi-site) | Microsoft/OpenAI | 6,000,000+ | 120x |
| xAI (approaching 1M GPUs) | xAI | 5,000,000+ | 100x |
Total global AI compute: ~100M H100e (per Epoch AI projections)
Multiple 100x+ GPT-5 scale runs possible
Summary: Largest Cluster by Period
| Period | Leader | Capacity (H100e) | Training Scale vs GPT-5 |
|---|---|---|---|
| H1 2026 | xAI Colossus 2 | 1.4M | 28x |
| H2 2026 | xAI (expanded) | 1.8M | 36x |
| H1 2027 | Microsoft Fairwater | 2.5M | 50x |
| H2 2027 | Microsoft Fairwater | 5.0M | 100x |
| H1 2028 | Meta Hyperion / MSFT | 5.0M | 100x |
| H2 2028 | Multiple | 5-6M+ | 100-120x |
Key Observations
- xAI leads through 2026 with Colossus 2 (1.4M H100e in early 2026)
- Microsoft/OpenAI takes lead H1 2027 as Fairwater ramps
- By late 2027, multiple players at 5M H100e (100x GPT-5)
- Total global stock: 15M H100e today → ~100M H100e by end 2028 (~6.6x growth)
- Power constraints: 1 GW datacenters arriving early 2026, 3+ GW by 2027
Training Run Implications
If GPT-5 was trained on 50k H100s and represents current SOTA:
- Q2 2026: 10-20x GPT-5 scale runs possible
- Q4 2026: 30-40x GPT-5 scale runs possible
- Q2 2027: 50x GPT-5 scale runs possible
- Q4 2027: 100x GPT-5 scale runs possible
- 2028: Multiple labs can do 100x+ simultaneously
Conclusions
From what we can see:
- xAI will have the largest single data center cluster for much of 2026.
- Then Microsoft/OpenAI will begin to dominate.
Unknowns: what is Google doing? They’re relatively secretive, and we don’t know how their TPU capacity compares exactly.
And finally, we don’t know how the following things will evolve over the next two years:
- Delays / shortages
- Algorithmic advances
- New projects being planned out
But if scaling laws hold, we can probably expect that by the end of 2026, AI trained on clusters roughly 10x the size of this year’s will be readily available to everyone.
To put a finer point on it: 2026 is the year of the last readily available 10x jump in compute scale. After that, the jump from 2026 to 2027 will only be about 2x, and the jump from 2027 to 2028 will only be about 2x (maybe). Serious financial and material constraints will begin to cap how big a cluster anyone can afford to build, except maybe governments.
So 2026-2027 is the critical period where we should expect massive capability gains from these 10x larger training runs. After that, progress will not be driven so much by scale: it will be algorithmic breakthroughs and research breakthroughs that truly matter. The implication is that progress could slow after 2027, since it won’t be as straightforward as simply building bigger datacenters.
We can still expect:
- Algorithmic breakthroughs
- Efficiency breakthroughs at hardware level
- Improved ability to network multiple clusters together
But the 10x single-year jumps in scale are probably over for a while after the 2026-2027 period.