
Aaron Decker - Engineer & Entrepreneur

A practical forecast (with caveats) of which orgs are most likely to run the biggest single-site AI clusters by end of 2026 and end of 2027.

Who will have the largest single AI datacenter clusters over the next 2 years?

Tags: ai, datacenters, infrastructure, cloud

[Image: LLM diagram]

Based on my research, it’s currently difficult to do large unsupervised training runs across multiple data centers because part of the process is latency-sensitive (there’s an “all-reduce”/“all-gather” between steps). See: NVIDIA on long-haul training.
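
To make that concrete, here is a minimal sketch of a hand-rolled data-parallel training step (the kind of thing frameworks like PyTorch DDP do for you automatically): every rank blocks on a gradient all-reduce before it can take the optimizer step, so cross-datacenter latency lands on the critical path of every single step. The model/optimizer/batch structure here is hypothetical.

```python
# Minimal data-parallel step: each GPU computes gradients on its own shard,
# then all GPUs synchronize via a blocking all-reduce before stepping.
# Assumes torch.distributed has been initialized (e.g. via torchrun + NCCL).
import torch
import torch.distributed as dist

def training_step(model, optimizer, batch, loss_fn):
    optimizer.zero_grad()
    loss = loss_fn(model(batch["inputs"]), batch["targets"])
    loss.backward()

    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            # Blocking collective: every GPU in the run must exchange this
            # tensor before anyone can take the optimizer step, which is why
            # inter-site latency hurts so much.
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size

    optimizer.step()
    return loss.item()
```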

Because of that constraint, it does seem to matter how big your single largest data center cluster is for training. It’s also been widely rumored that GPT-5 took ~50k H100s to train (unconfirmed; you’ll see it repeated a lot). However, we do know that Grok-4 was trained in a single run on 150k H100s; xAI confirmed the cluster size of 200k H100s, and Epoch AI estimated it cost about $490M to train Grok 4. See also: xAI notes using 150k GPUs to train.

So the biggest unsupervised runs for current SOTA models seem to be somewhere between ~50k and 150k H100 equivalents. That’s actually a lot smaller than many people might have guessed based on model performance and the build-outs of next-gen clusters with next-gen chips.

NOTE: single-site training isn’t a hard constraint, and it will probably be overcome soon with various techniques, but let’s just say that under optimal current circumstances this is how people seem to be doing it now…

NVIDIA SOTA chip roadmap

Newer chips deliver more training throughput per chip than the H100 (the baseline for the “H100-equivalents” used below). Here are rough timelines and performance differences.

| Generation | Chip | Announced | First Ship | Volume Ramp | At Scale in DCs |
|---|---|---|---|---|---|
| Current: Blackwell | B200/GB200 | GTC Mar 2024 | Q4 2024 (Dec) | Q2-Q3 2025 | Mid-2025 |
| Blackwell Ultra | B300/GB300 | GTC Mar 2025 | H2 2025 | Q4 2025 | Late 2025/Early 2026 |
| Rubin | R100 | Computex Jun 2024 | H1 2026 | H2 2026 | Mid-to-Late 2026 |
| Rubin Ultra | R300 | | H2 2027 | Late 2027 | 2028 |

And here are rough performance comparisons:

| Chip | vs H100 (FP8 Training) | Key Notes |
|---|---|---|
| H100 | 1x (baseline) | 80GB HBM3, 3.9 TB/s |
| B200 | ~2.5-3x | 192GB HBM3e, 8 TB/s, NVLink 5.0 |
| GB200 (2x B200 + Grace) | ~5-6x | 384GB combined, liquid-cooled racks |
| B300 | ~1.5x vs B200 | HBM3e refresh, higher clocks |
| R100 | ~2x vs B200 | New architecture, HBM4, TSMC 3nm |
| R300 | ~1.5x vs R100 | HBM4 refresh |

So roughly:

  • B200: ~2.5–3× H100
  • R100: ~5–6× H100
  • R300: ~7–9× H100
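
For back-of-the-envelope math, here is a tiny helper that converts a chip count into rough H100-equivalents (H100e) using midpoints of the multipliers above. The factors are this article’s rough estimates, not vendor benchmarks.

```python
# Rough H100-equivalent (H100e) multipliers: midpoints of the ranges above.
# These are the article's rough estimates, not vendor-published numbers.
H100E_PER_CHIP = {
    "H100": 1.0,
    "B200": 2.75,   # ~2.5-3x H100
    "B300": 4.1,    # ~1.5x B200
    "R100": 5.5,    # ~2x B200, i.e. ~5-6x H100
    "R300": 8.25,   # ~1.5x R100, i.e. ~7-9x H100
}

def h100_equivalents(chip: str, count: int) -> float:
    """Convert a physical chip count into rough H100-equivalents."""
    return count * H100E_PER_CHIP[chip]

# Example: a 100k-GPU Rubin (R100) cluster is roughly a 550k H100e cluster.
print(h100_equivalents("R100", 100_000))
```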

Algorithmic breakthroughs

I think there are a lot of different algorithmic breakthroughs going on—many of which we can’t see from the outside—but one thing we do know is that training precision has dropped dramatically. Models used to primarily be trained in 32-bit floating point (FP32), and now frontier training is commonly done in 8-bit floating point (FP8), with inference potentially moving toward FP4.

Training precision evolution:

  • Pre-2017: FP32 (full precision, 32-bit)
  • 2017-2020: Mixed precision—FP16 compute with FP32 accumulation (big speedup)
  • 2020+: BF16 became the standard (A100 era)—better dynamic range than FP16, less overflow headaches
  • 2022+: FP8 training viable with H100’s Transformer Engine
  • 2024+: FP8 becoming default for frontier labs, FP4 emerging for inference

We also know B200 has special optimizations around FP8 and FP4.
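
For reference, the mixed-precision pattern that replaced pure FP32 looks roughly like this in PyTorch: the forward/backward math runs in BF16 under autocast while master weights and accumulations stay in higher precision. FP8 training follows the same shape but typically goes through vendor libraries (e.g. NVIDIA’s Transformer Engine) rather than plain autocast; the training-step function here is a hypothetical sketch.

```python
# Sketch of a BF16 mixed-precision training step in PyTorch. Only the
# forward/backward compute is downcast; optimizer state and master weights
# remain in higher precision.
import torch

def mixed_precision_step(model, optimizer, batch, loss_fn):
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = loss_fn(model(batch["inputs"]), batch["targets"])
    loss.backward()
    optimizer.step()
    return loss.item()
```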

Who has the largest datacenter every 6 months for the next 2 years?

FYI: this is based on Epoch AI research (a nonprofit that compiles this data), and I put it together with Claude Opus 4.5. Also remember Epoch AI’s caveat about their data: estimates are accurate to within a factor of ~1.5 (80% confidence).

Baseline Reference

  • GPT-5 training baseline: ~50,000 H100s (maximum); the “vs GPT-5” multiples below are computed against this number (see the sketch after this list).
  • Current global AI compute stock: ~15 million H100-equivalents (H100e) delivered as of late 2025
  • Epoch AI tracks: 2.5 million H100e across 13 major US facilities (~15% of global stock)
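
All of the “vs GPT-5” multiples in the tables that follow are just cluster H100e capacity divided by that ~50k baseline, e.g.:

```python
# "vs GPT-5" multiple = cluster H100e capacity / assumed GPT-5 baseline.
GPT5_BASELINE_H100E = 50_000

def gpt5_multiple(cluster_h100e: int) -> float:
    return cluster_h100e / GPT5_BASELINE_H100E

print(gpt5_multiple(1_400_000))  # xAI Colossus 2 in H1 2026 -> 28.0
```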

6-Month Incremental Projections

H1 2026 (Jan - Jun 2026)

| Facility | Owner/User | H100e Capacity | vs GPT-5 (50k) |
|---|---|---|---|
| Colossus 2 | xAI | 1,400,000 | 28x |
| Anthropic-Amazon New Carlisle | Anthropic | ~500,000 | 10x |
| OpenAI Stargate Abilene (Phase 1-2) | OpenAI | ~400,000 | 8x |
| Microsoft Fayetteville | Microsoft | ~350,000 | 7x |
| Meta Prometheus | Meta | ~300,000 | 6x |
| Google TPU clusters | Google/DeepMind | 1,000,000+ | 20x |

Largest single cluster: xAI Colossus 2 @ 1.4M H100e
Potential training run: 28x GPT-5 scale


H2 2026 (Jul - Dec 2026)

| Facility | Owner/User | H100e Capacity | vs GPT-5 (50k) |
|---|---|---|---|
| xAI Colossus (expanded) | xAI | 1,800,000+ | 36x |
| OpenAI Stargate Abilene (Phase 3-4) | OpenAI | ~800,000 | 16x |
| Microsoft Fairwater (Phase 1) | Microsoft/OpenAI | ~600,000 | 12x |
| Anthropic-Amazon (expanded) | Anthropic | ~700,000 | 14x |
| Meta Prometheus (expanded) | Meta | ~600,000 | 12x |
| CoreWeave clusters | Various | ~500,000 | 10x |

Largest single cluster: xAI @ 1.8M H100e
Potential training run: 36x GPT-5 scale


H1 2027 (Jan - Jun 2027)

| Facility | Owner/User | H100e Capacity | vs GPT-5 (50k) |
|---|---|---|---|
| Microsoft Fairwater Wisconsin | Microsoft/OpenAI | 2,500,000 | 50x |
| xAI Colossus (+ expansions) | xAI | 2,200,000 | 44x |
| OpenAI Stargate (6 buildings) | OpenAI | 1,500,000 | 30x |
| Meta clusters combined | Meta | 1,200,000 | 24x |
| Google TPU v6/v7 | Google/DeepMind | 1,500,000+ | 30x |

Largest single cluster: Microsoft Fairwater @ 2.5M H100e
Potential training run: 50x GPT-5 scale


H2 2027 (Jul - Dec 2027)

| Facility | Owner/User | H100e Capacity | vs GPT-5 (50k) |
|---|---|---|---|
| Microsoft Fairwater (full) | Microsoft/OpenAI | 5,000,000 | 100x |
| xAI (Memphis + expansions) | xAI | 3,000,000+ | 60x |
| OpenAI Stargate (8 buildings) | OpenAI | 2,500,000 | 50x |
| Meta Hyperion (Phase 1) | Meta | 2,000,000 | 40x |
| Amazon/Anthropic combined | Anthropic | 1,500,000 | 30x |

Largest single cluster: Microsoft Fairwater @ 5M H100e
Potential training run: 100x GPT-5 scale


H1 2028 (Jan - Jun 2028)

| Facility | Owner/User | H100e Capacity | vs GPT-5 (50k) |
|---|---|---|---|
| Meta Hyperion | Meta | 5,000,000 | 100x |
| Microsoft Fairwater (full) | Microsoft/OpenAI | 5,000,000 | 100x |
| xAI (1M GPU target) | xAI | 4,000,000+ | 80x |
| OpenAI Stargate (full) | OpenAI | 3,500,000 | 70x |

Multiple players at 5M H100e scale
Potential training run: 100x GPT-5 scale (multiple labs)


H2 2028 (Jul - Dec 2028)

| Facility | Owner/User | H100e Capacity | vs GPT-5 (50k) |
|---|---|---|---|
| Meta Hyperion (full) | Meta | 5,000,000+ | 100x+ |
| Microsoft (multi-site) | Microsoft/OpenAI | 6,000,000+ | 120x |
| xAI (approaching 1M GPUs) | xAI | 5,000,000+ | 100x |

Total global AI compute: ~100M H100e (per Epoch AI projections)
Multiple 100x+ GPT-5 scale runs possible


Summary: Largest Cluster by Period

| Period | Leader | Capacity (H100e) | Training Scale vs GPT-5 |
|---|---|---|---|
| H1 2026 | xAI Colossus 2 | 1.4M | 28x |
| H2 2026 | xAI (expanded) | 1.8M | 36x |
| H1 2027 | Microsoft Fairwater | 2.5M | 50x |
| H2 2027 | Microsoft Fairwater | 5.0M | 100x |
| H1 2028 | Meta Hyperion / MSFT | 5.0M | 100x |
| H2 2028 | Multiple | 5-6M+ | 100-120x |

Key Observations

  1. xAI leads through 2026 with Colossus 2 (1.4M H100e in early 2026)
  2. Microsoft/OpenAI takes lead H1 2027 as Fairwater ramps
  3. By late 2027, multiple players at 5M H100e (100x GPT-5)
  4. Total global stock: 15M H100e today → 100M H100e by end 2027 (6.6x growth)
  5. Power constraints: 1 GW datacenters arriving early 2026, 3+ GW by 2027
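
As a rough sanity check on that last point, here is a back-of-the-envelope sketch of how many accelerators a site of a given power budget can host. The ~1.4 kW per GPU all-in figure (chip plus host, networking, and cooling overhead) is my assumption for illustration, not a number from Epoch AI or a vendor.

```python
# Back-of-the-envelope: accelerators supported by a site's power budget.
# ASSUMPTION: ~1.4 kW per GPU all-in (chip + host + networking + cooling).
KW_PER_GPU_ALL_IN = 1.4

def gpus_supported(site_gigawatts: float) -> int:
    kilowatts = site_gigawatts * 1_000_000  # 1 GW = 1,000,000 kW
    return int(kilowatts / KW_PER_GPU_ALL_IN)

print(gpus_supported(1.0))  # ~714,000 GPUs for a 1 GW site
print(gpus_supported(3.0))  # ~2,140,000 GPUs for a 3 GW site
```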

Training Run Implications

If GPT-5 was trained on 50k H100s and represents current SOTA:

  • Q2 2026: 10-20x GPT-5 scale runs possible
  • Q4 2026: 30-40x GPT-5 scale runs possible
  • Q2 2027: 50x GPT-5 scale runs possible
  • Q4 2027: 100x GPT-5 scale runs possible
  • 2028: Multiple labs can do 100x+ simultaneously

Conclusions

From what we can see:

  • xAI will have the largest data center for much of next year.
  • Then Microsoft/OpenAI will begin to dominate.

Unknowns: what is Google doing? They’re relatively secretive, and we don’t know how their TPU capacity compares exactly.

And finally, we don’t know how the following things will evolve over the next two years:

  • Delays / shortages
  • Algorithmic advances
  • New projects being planned out

But if scaling laws hold, we can probably expect that by the end of 2026 we will have AI readily available to everyone that was trained on clusters 10x the size of what we have this year.

To put a finer point on it: 2026 is the year of the last massive, readily available ~10x jump in compute scale. After that, the jump from 2026 to 2027 will only be about 2x, and the jump from 2027 to 2028 maybe another 2x. Serious financial and material constraints on how big a cluster you can build will start to bite, and nobody except perhaps governments will be able to afford to push past them.
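
Reading that claim straight off the summary table (and using the confirmed ~150k-H100 Grok-4 run as the reference point for this year), the rough arithmetic looks like this:

```python
# Year-over-year growth of the largest single cluster, per the summary table.
# 2025 uses the ~150k-H100 Grok-4 run as the current reference point.
largest_cluster_h100e = {
    2025: 150_000,
    2026: 1_800_000,   # xAI (expanded), H2 2026
    2027: 5_000_000,   # Microsoft Fairwater, H2 2027
    2028: 6_000_000,   # Microsoft multi-site, H2 2028
}

for year in (2026, 2027, 2028):
    growth = largest_cluster_h100e[year] / largest_cluster_h100e[year - 1]
    print(f"{year}: ~{growth:.1f}x the previous year")
# Output: 2026: ~12.0x, 2027: ~2.8x, 2028: ~1.2x
```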

So 2026-2027 is the critical period in which we should expect massive capability gains from these 10x larger training runs. After that, progress will not be so much scale-based: it will be algorithmic breakthroughs and research breakthroughs that truly matter. The implication is that progress could slow after 2027, since it won’t be as straightforward as simply building bigger datacenters.

We can still expect:

  • Algorithmic breakthroughs
  • Efficiency breakthroughs at hardware level
  • Improved ability to network multiple clusters together

But single-year 10x jumps in scale are probably over for a while after the 2026-2027 period.