Aaron Decker - Engineer & Entrepreneur

A look at the current landscape of local AI inference: the hardware, the software, and whether running models at home is actually practical in 2026.

The State of AI Inference and Running Models at Home

Tags: ai, inference, hardware

Quick Summary

Running frontier LLMs at home is not practical today. Even the largest single GPUs don’t have enough memory to hold a state-of-the-art model, so serving a single model takes several of the best GPUs. Bandwidth is a bigger bottleneck than most people realize. The real action is in the infrastructure: memory, interconnects, and photonics. If you want exposure to AI inference scaling, the investable bottlenecks are HBM memory (Samsung, SK Hynix, Micron) and optical interconnects (Lumentum, Coherent).


A few weeks ago I threw OpenClaw on an old computer and I’ve been doing little experiments with it, as have most people in tech. It got me thinking a lot about local models.

This particular computer was an old desktop: it has a mid-tier GPU, 32 GB of memory, and it actually has a newer upgraded NVMe drive, but the motherboard is probably 10 years old at this point.

Still, it should be good enough to run some local models, right? Wrong.

I got some tiny models working: the open-source Whisper model for speech-to-text, and the Coqui model for text-to-speech. This lets me send voice messages to OpenClaw and have it understand what I’m saying, and it also lets me send it articles that it downloads and converts to audio files for me to listen to while I run.

But these are tiny models. A small GPU handles them adequately, but a higher-quality TTS model needs around 12 GB of VRAM (many consumer GPUs have only 4 GB).

And if you want to run an LLM locally that has any genuinely useful capabilities, you need hundreds of gigabytes of memory with a lot of bandwidth.
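As a quick sanity check on that claim, weight memory is roughly parameter count times bytes per parameter. Here's a back-of-envelope sketch (a helper I wrote for illustration, not from any library) that ignores KV cache and activation overhead:

```python
def weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate memory needed just to hold model weights.

    params_billion: parameter count in billions (e.g. 70 for a 70B model)
    bits_per_param: numeric precision (16 = FP16, 8 = FP8/INT8, 4 = FP4/INT4)
    """
    return params_billion * 1e9 * (bits_per_param / 8) / 1e9  # GB

# A 7B model quantized to 4 bits squeezes into a consumer GPU...
print(weight_memory_gb(7, 4))    # 3.5 GB
# ...but a 671B model at FP8 does not fit on any single consumer machine.
print(weight_memory_gb(671, 8))  # 671.0 GB
```

Real serving needs meaningfully more than this (KV cache, activations, runtime overhead), so treat these as floors, not totals.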

Why Mac Studios?

When I started looking into this, I found that people were suggesting Mac Studios. As of right now you can get an M3 Mac Studio with 512 GB of memory for around $9,500.

Now, shockingly, 512 GB of memory still isn’t quite enough to run the full Kimi K2.5 or GLM 4.7 unless you quantize it (which reduces quality). So even then you are not really going to run a full-size SOTA open-source LLM on a single Mac Studio.

But why are people doing this instead of buying a cheap blade server and stuffing it full of DDR5 RAM?

Memory Bandwidth.

Apple solders the Mac Studio’s RAM directly onto the package to make it very fast, which allows it to reach 800 GB/s of memory bandwidth. That is already on the slow side for an LLM, but it’s usable.

On a server motherboard with pluggable memory slots, bandwidth is limited to more like 150-200 GB/s. Token generation would be painfully slow.
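A rough way to see why bandwidth dominates: during autoregressive decode, every generated token has to stream all active weights out of memory, so bandwidth divided by active weight size puts a hard ceiling on single-stream tokens per second. A sketch (ignoring compute, KV cache reads, and batching; the 70 GB figure assumes a hypothetical 70B model at FP8):

```python
def max_tokens_per_sec(bandwidth_gb_s: float, active_weight_gb: float) -> float:
    """Upper bound on single-stream decode speed: each token must read
    every active weight byte from memory at least once."""
    return bandwidth_gb_s / active_weight_gb

# A 70B model at FP8 (~70 GB of active weights):
print(max_tokens_per_sec(800, 70))  # Mac Studio-class bandwidth: ~11 tok/s
print(max_tokens_per_sec(175, 70))  # DDR5 server-class bandwidth: 2.5 tok/s
```

This is also why MoE models are attractive for serving: only the active parameters (tens of billions, not the full trillion) need to be read per token.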

How much bandwidth do the SOTA Nvidia and AMD GPUs have?

NVIDIA & AMD: A100 through announced 2027 SOTA (state of the art) products.

Notice the bandwidth is very good (because they use HBM - high bandwidth memory), but the actual memory capacity is not that high.

NVIDIA

| GPU | Year | Memory | Capacity | Bandwidth | TDP | BW vs A100 |
|---|---|---|---|---|---|---|
| A100 SXM | 2020 | HBM2e | 80 GB | 2.0 TB/s | 400W | 1× |
| H100 SXM | 2022 | HBM3 | 80 GB | 3.35 TB/s | 700W | 1.7× |
| H200 SXM | 2024 | HBM3e | 141 GB | 4.8 TB/s | 700W | 2.4× |
| B200 SXM | 2024 | HBM3e | 192 GB | 8.0 TB/s | 1,000W | 4× |
| B300 Ultra | 2025 | HBM3e | 288 GB | 8.0 TB/s | 1,400W | 4× |
| *R200 (Rubin)* | *H2 2026* | *HBM4* | *288 GB* | *22.2 TB/s* | *~2,300W* | *11.1×* |
| *Rubin Ultra* | *2027* | *HBM4E* | *1,024 GB* | *~32 TB/s* | *~3,600W* | *16×* |

AMD

| GPU | Year | Memory | Capacity | Bandwidth | TDP | BW vs A100 |
|---|---|---|---|---|---|---|
| MI250X | 2021 | HBM2e | 128 GB | 3.2 TB/s | 560W | 1.6× |
| MI300X | 2023 | HBM3 | 192 GB | 5.3 TB/s | 750W | 2.7× |
| MI325X | 2024 | HBM3e | 256 GB | 6.0 TB/s | 750W | 3× |
| MI350X | 2025 | HBM3e | 288 GB | 8.0 TB/s | 750W | 4× |
| *MI455X* | *H2 2026* | *HBM4* | *432 GB* | *19.6 TB/s* | *~2,500W* | *9.8×* |

Italic rows = announced, not yet shipping. B300 shipped Jan 2026. R200 and MI455X target H2 2026.

TDP values for unshipped products are estimates from leaks/rumors.

Sources: NVIDIA datasheets, AMD product pages, Tom’s Hardware, VideoCardz, SemiAnalysis, TweakTown. Feb 2026.

Still not enough memory!

Did you notice that even the current B300 and MI350X top out at only 288 GB of memory?

You cannot fit a full copy of a SOTA model into a single SOTA GPU!

I will say it again: you need multiple GPUs to serve a model.

There are 4 strategies to do this:

  1. Tensor Parallelism (TP): split each layer’s weight matrices across GPUs. When a token comes through, all GPUs in the TP group compute their slice simultaneously, then do an all-reduce over NVLink to combine the results.

  2. Pipeline Parallelism (PP): slice the model vertically by layers. GPU group 1 handles layers 1-15, group 2 handles layers 16-30, etc. Each group only needs to pass activations (much smaller than weights) to the next group. Downside: some GPUs are idle at different steps.

  3. Expert Parallelism (EP): for Mixture-of-Experts (MoE) models. For a model like Kimi K2.5 with 384 experts, you distribute different experts to different GPUs. When a token is routed to experts 47 and 183, it gets sent (via all-to-all communication) to whichever GPUs hold those experts, processed, and the results returned.

  4. Data Parallelism (DP): multiple replicas of the model (each using TP/PP/EP internally) handle different user requests simultaneously for throughput.

Example: for a trillion-parameter MoE model, NVIDIA describes configurations like TP2 EP16 PP2, meaning tensor parallelism across 2 GPUs, expert parallelism across 16 GPUs, and pipeline parallelism across 2 stages (NVIDIA Developer). That’s 2 × 16 × 2 = 64 GPUs minimum for a single model replica. Then you add data parallelism replicas for throughput.
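The arithmetic behind that GPU count is just multiplication of the parallelism degrees. A sketch (the per-GPU split assumes a perfectly even shard of the weights, which real deployments don’t quite achieve):

```python
def gpus_per_replica(tp: int, ep: int, pp: int) -> int:
    """Total GPUs for one model replica: the product of the
    tensor-, expert-, and pipeline-parallel degrees."""
    return tp * ep * pp

def per_gpu_weight_share_gb(total_weights_gb: float, tp: int, ep: int, pp: int) -> float:
    """Idealized even split of weights across the replica (ignores
    layers replicated on every GPU, like embeddings and attention)."""
    return total_weights_gb / gpus_per_replica(tp, ep, pp)

# NVIDIA's trillion-parameter MoE reference config: TP2 EP16 PP2
print(gpus_per_replica(2, 16, 2))              # 64 GPUs
print(per_gpu_weight_share_gb(1000, 2, 16, 2)) # 15.625 GB of FP8 weights per GPU
```

Note the per-GPU weight share is small; most of each GPU’s remaining memory goes to KV cache and batching headroom, which is why the configs below look "worse" than raw weight math suggests.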

64 SOTA GPUs for a single model.

Tensor Parallelism diagram

Pipeline Parallelism diagram

Expert Parallelism diagram

How many GPUs per model do you need?

Here are some open source models we know about. For all we know though, Grok 4.2, Gemini 3.1, Opus 4.6 and GPT 5.3 thinking could be 5x or 10x larger or more complex in ways that are not public. Multiply in your mind.

Model Weight Sizes

| Model | Type | Total Params | Active Params | FP16 Weights | FP8 Weights | FP4 Weights |
|---|---|---|---|---|---|---|
| Llama 3.1 405B | Dense | 405B | 405B | 810 GB | 405 GB | 203 GB |
| GLM-4 9B→355B | Dense | 355B | 355B | 710 GB | 355 GB | 178 GB |
| DeepSeek R1 | MoE | 671B | 37B | 1,342 GB | 671 GB | 336 GB |
| Qwen3-235B | MoE | 235B | 22B | 470 GB | 235 GB | 118 GB |
| Kimi K2.5 | MoE | ~1,000B | 32B | ~2,000 GB | ~1,000 GB | ~500 GB |

Note: MoE models must load ALL parameters into memory even though only a fraction are active per token and the router decides which experts to use at runtime.
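The weight columns above are easy to reproduce: parameters times bytes per parameter, with MoE vs dense making no difference because all experts must be resident. A sketch assuming exactly 16/8/4 bits per parameter (real checkpoints carry some extra metadata):

```python
def weights_gb(total_params_b: float, bits: int) -> float:
    """Weight footprint in GB: parameter count (billions) × bytes per
    parameter. MoE or dense makes no difference here; ALL experts
    must be resident even though few are active per token."""
    return total_params_b * bits / 8

# DeepSeek R1: 671B total parameters, only 37B active per token,
# but the full 671B must sit in memory.
print(weights_gb(671, 16))  # 1342.0 GB at FP16
print(weights_gb(671, 8))   # 671.0 GB at FP8
print(weights_gb(671, 4))   # 335.5 GB at FP4
```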

Typical Production Serving Configurations

Note that due to KV cache and batching the real systems look even worse…

| Model | Hardware | Config | GPUs per Replica | Why |
|---|---|---|---|---|
| Llama 3.1 405B | 1× DGX H100 | TP8 | 8 | Fits on one 8-GPU node with room for KV cache |
| Llama 3.1 405B | DGX B300 | TP2 | 2 | FP8 fits in 2 GPUs, rest is KV cache headroom |
| DeepSeek R1 671B | 2× DGX H100 | TP8 PP2 | 16 | Too big for one node, pipeline across two |
| DeepSeek R1 671B | DGX B300 | TP4 or TP2 EP2 | 4 | Fits in one node, use remaining GPUs for replicas |
| Kimi K2.5 ~1T | DGX H100 cluster | TP2 EP16 PP2 | 64 | NVIDIA’s reference config for trillion-param MoE |
| Kimi K2.5 ~1T | Rubin NVL72 | TP2 EP8 | ~16 | 3× bandwidth per GPU means fewer GPUs needed |

The Punchline

| | H100 era (2023) | B300 era (2025) | Rubin era (2027) |
|---|---|---|---|
| Biggest model on 1 GPU | ~40B (FP8) | ~140B (FP8) | ~500B (FP4) |
| 405B dense model | 8 GPUs (full node) | 2 GPUs | 2 GPUs |
| 671B MoE | 16 GPUs (2 nodes) | 4 GPUs | 3-4 GPUs |
| 1T MoE | 64 GPUs (8 nodes) | 8 GPUs | 4-8 GPUs |
| Cost per replica | $2M+ | ~$400K | TBD |

GPUs are gaining memory with every generation, but models are growing even faster: every generation still needs multiple GPUs to serve a frontier model.

The Current Inference Bottlenecks: memory capacity and memory bandwidth

It’s fairly simple to understand memory capacity alone as a bottleneck. But bandwidth is a lot more complex because you have multiple types of bandwidth to consider.

We will just use the NVIDIA chips as an example, since NVLink is clearly documented in a number of places.

Per-GPU Bandwidth at Each Tier

| Tier | What ↔ What | B300 (Blackwell Ultra) | R200 (Rubin) | Drop from Tier Above |
|---|---|---|---|---|
| 1. Memory bandwidth | HBM ↔ GPU compute cores | 8,000 GB/s | 22,200 GB/s | — |
| 2. GPU-to-GPU (scale-up) | GPU ↔ GPU via NVLink | 1,800 GB/s | 3,600 GB/s | ~4-6× drop |
| 3. CPU-GPU coherent | GPU ↔ host CPU via NVLink-C2C | 900 GB/s | 1,800 GB/s | ~2× drop |
| 4. Network (scale-out) | Node ↔ Node via InfiniBand/Ethernet | 100 GB/s | 200 GB/s | ~9-18× drop |

All values are bidirectional per GPU. Scale-out is per NIC (ConnectX-8 at 800 Gb/s for Blackwell, ConnectX-9 at 1.6 Tb/s for Rubin).

What Lives at Each Tier

| Tier | Traffic Type | Why It Matters |
|---|---|---|
| 1. Memory BW | Reading model weights for every token generated | This is THE bottleneck for autoregressive decode; every token requires reading all active weights from HBM |
| 2. NVLink | Tensor parallelism all-reduce, expert parallelism all-to-all | Happens every layer, every token when the model is sharded across GPUs; effectively an extension of the memory system |
| 3. CPU-GPU | KV cache overflow, weight offloading, CPU-side preprocessing | Rubin unifies CPU LPDDR5x + GPU HBM4 into one address space; cold experts and KV cache can spill here |
| 4. Scale-out | Pipeline parallelism between nodes, data parallel gradient sync | Only activations cross this boundary (much smaller than weights), but still the weakest link for multi-node models |

Why This Matters: The 80x Cliff

From HBM to scale-out networking, bandwidth falls by 80×. This is why model placement strategy is so critical.

  • Weights that are accessed every token → must be in HBM (Tier 1)
  • Sharded layers that sync every token → must be within NVLink domain (Tier 2)
  • Cold MoE experts, KV cache overflow → can live in CPU memory (Tier 3)
  • Separate model replicas, pipeline stages → can cross the network (Tier 4)
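The 80× figure falls straight out of the tier table. A sketch using the B300 numbers above:

```python
# Per-GPU bidirectional bandwidth at each tier for B300 (GB/s),
# taken from the tier table above.
tiers = {
    "1. HBM":       8000,
    "2. NVLink":    1800,
    "3. CPU-GPU":    900,
    "4. Scale-out":  100,
}

# Ratio of each tier to the slowest tier (scale-out networking).
for name, bw in tiers.items():
    print(f"{name}: {bw / tiers['4. Scale-out']:.0f}x the network tier")
# HBM comes out 80x faster than crossing the node boundary, which is
# why weights read every token must never live past Tier 2.
```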

Bandwidth scales across memory tiers

Model design & Deployment

From the tables above you can see that how models are deployed, from an ops and engineering perspective, is very important to pay attention to.

It’s also true that not all models can be served the same way: not every model is a mixture-of-experts construction. And we don’t know what algorithmic advances will occur or how future models will need to be served. But it is very striking today how limited bandwidth is, and how it falls off a cliff as you scale up and out.

However there may be a solution…

Solving Bandwidth with Photonics

Photonics: basically I’m talking about using lasers and fiber optics instead of copper cabling to transmit data. Right now you may be surprised to learn that all of the NVLink stuff is done with copper cabling. That might change.

Photonics is pretty complex. You have to think about the laser source (which is heat-sensitive), modulating it to create signals, and converting between electrical signals on silicon and light and back again. Compare this to copper, which is dead simple.

Photonic Interconnects: What’s Coming at Each Bandwidth Tier

Based on Rubin R200 (H2 2026) as current baseline.

| Tier | What ↔ What | Current Tech | Current BW (per GPU) | Photonic Alternative | Potential BW | Improvement | Key Players | Timeline |
|---|---|---|---|---|---|---|---|---|
| 4. Scale-out | Node ↔ Node | Pluggable optics (800G NICs) | ~200 GB/s | CPO on switch ASICs: optical conversion moves from board edge onto the switch package, eliminating 22 dB electrical loss | ~400-800 GB/s | 2-4× | NVIDIA Spectrum-X, Broadcom Bailly, Ayar Labs | 2026-2027 |
| 2-3. Scale-up | GPU ↔ GPU (cross-node) | Copper NVLink (rack-local only) | 3,600 GB/s (within NVL72 rack); 0 GB/s across racks at NVLink speed | 3D CPO on GPU package: optical I/O co-packaged with GPU die, extends NVLink-class BW beyond rack boundaries | 4,000-8,000 GB/s per GPU, across racks | new capability | Lightmatter L200/L200X, Ayar Labs TeraPHY | 2027-2028 |
| 2-4. Fabric | GPU ↔ any GPU (datacenter-wide) | NVLink within rack + InfiniBand between racks (two separate domains, 18× cliff between them) | 3,600 GB/s local / 200 GB/s remote | Photonic interconnect fabric: 3D photonic interposer under GPU, single optical domain across 1000s of GPUs | 114+ Tbps (~14,000 GB/s) total per package | Collapses 18× cliff to ~2-3× | Lightmatter Passage M1000 | 2028-2029 |
| 1→2. Memory | Remote memory ↔ Compute | HBM on-package only (no remote memory at HBM speed) | 22,200 GB/s on-package; 0 GB/s to off-package memory at HBM speed | Photonic memory fabric: disaggregated memory pools connected optically at HBM-equivalent bandwidth | 7,200 GB/s per link to 32 TB shared pool | new capability | Celestial AI / Marvell | 2027-2028 |
| Optical switching | — | Electrical packet switches (high power, latency) | N/A | Optical circuit switches (OCS): MEMS mirrors route light directly between any two ports, no electrical conversion needed | 100+ Tbps per switch at <150W | 80% power reduction vs electrical switches | Lumentum R300/R64 | 2025-2026 (shipping now) |

The Supply Chain: Who Makes It All Work

The companies above design the systems, but every photonic system needs lasers, modulators, and optical components. Two public companies dominate this supply chain and are arguably the most direct beneficiaries of the photonics buildout:

| Company | Ticker | Role | Key Products for AI | Revenue Trajectory | Why They Matter |
|---|---|---|---|---|---|
| Lumentum | LITE | Laser source supplier + OCS | Ultra-high-power CW lasers for CPO (sole source for NVIDIA’s CPO switches); EMLs for 800G/1.6T transceivers; R300/R64 optical circuit switches (MEMS-based) | $480M→$600M+/quarter by mid-2026; OCS targeting $100M/quarter by Dec 2026; “largest single purchase commitment in company history” for CPO lasers | NVIDIA’s named CPO laser partner. Every CPO switch needs an external laser source; Lumentum’s UHP lasers are the light that makes CPO work. Also pioneering OCS, which Google has already deployed to replace top-of-rack electrical switches. Demand outpacing supply through FY2026. |
| Coherent | COHR | Vertically integrated photonics (lasers + transceivers + components) | 400mW CW lasers for CPO/silicon photonics; 1.6T transceivers (silicon photonics + EML-based); 2D VCSEL arrays for near-packaged optics; optical circuit switches (liquid crystal-based, 7 customers evaluating) | $1.58B quarterly revenue (+17% YoY); record bookings extending 1yr+; datacenter/comms segment up 26% YoY | Also named as an NVIDIA CPO ecosystem collaborator. More vertically integrated than Lumentum; makes lasers, transceivers, modulators, and silicon photonics all in-house. Ramping 6-inch InP wafer production (4× more devices per wafer). The broadest photonics supplier to the entire AI optical stack. |

How It Fits Together

The photonics supply chain for AI has a clear structure. Check out this diagram:

photonics supply chain diagram

Every photonic link in an AI datacenter (whether it’s a CPO switch, a 3D photonic interposer, or an optical circuit switch) starts with a laser. Lumentum and Coherent are the indium phosphide laser suppliers to essentially the entire ecosystem. This is a genuine bottleneck: InP laser fabrication is a specialized, low-yield process that can’t be ramped quickly. Both companies are expanding capacity aggressively (Lumentum expanding its San Jose fab, Coherent transitioning to 6-inch InP wafers) but demand is still outstripping supply.

Think of it like HBM and TSMC CoWoS packaging in 2023: the designs exist, the demand exists, but the physical manufacturing capacity is the constraint. For photonics, InP lasers are that constraint.

Lumentum (LITE) and Coherent (COHR) are both public. Lightmatter ($4.4B valuation) and Ayar Labs are private. Celestial AI was acquired by Marvell (MRVL).

Investment implication

Let’s just summarize: if you want to think about investing in scaling bottlenecks there are a couple of places.

  1. Compute (GPUs)
  2. Memory capacity
  3. Memory Bandwidth

There are more bottlenecks than these (power, for example), but let’s focus on what we’ve covered. Today inference is not really compute-constrained. That could change if the algorithms change, but we don’t know yet. Currently GPU compute capacity is mostly a bottleneck on the training side of the equation.

What is clear is that memory capacity and memory bandwidth are inference bottlenecks, and it’s very likely the models keep getting larger.

So there are a couple of major players here we can invest in.

HBM capacity - who makes HBM memory?

There are really only three major players here:

  1. Samsung (S.Korea - 005930:KRX or BC94:LON)
  2. SK Hynix (S.Korea - 000660:KRX or HY9H:FRA)
  3. Micron ($MU US)

Memory Bandwidth

This is more interesting. Currently you can buy the copper cable suppliers:

  1. Credo ($CRDO US)

OR you can buy the laser source suppliers:

  1. Lumentum ($LITE US)
  2. Coherent ($COHR US)

OR you can invest in some of the fabs:

  1. Tower Semi ($TSEM US)
  2. Global Foundries ($GFS US)

OR you can invest in the packagers:

  1. Amkor ($AMKR US)
  2. ASE Technology Holding Co Ltd ($ASX US)

Note: Amkor and ASE are also involved in HBM packaging.

Also, Lightmatter seems like one of those private companies whose IPO, if it ever comes, should be hotly anticipated. Probably not at the scale of SpaceX or Anthropic, but maybe more like Stripe-level.

Conclusion

We started out talking about at-home inference, and as you can see it’s not all that practical. The SOTA models are memory-hungry monsters, and you can buy access to them today for $20 a month on a subscription.

At this point it’s going to be years until you run really good models very fast at your home with any cost efficiency unless something drastically changes. But in the meantime you can invest in the bottlenecks and pay Anthropic and OpenAI a monthly subscription for the most powerful technology on the planet!

As always, this is not financial advice, it’s just my personal research. I personally own stock in $AMD, $NVDA, SK Hynix, $TSEM, $ASX and $MU. I would own $LITE and $COHR but kind of missed the train on them.