MoE ≠ Less RAM – But More Speed ⚡️
There’s a persistent misconception that Mixture-of-Experts (MoE) reduces memory usage on end devices. In reality, during inference serving, all expert weights are loaded. The MoE trick: only a few experts (e.g., Top-2) are computed per token. This saves FLOPs and increases throughput – especially for large providers with many GPUs – but it doesn’t save on weights. 💾
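To make the routing concrete, here is a minimal NumPy sketch of a Top-2 MoE layer for a single token. All names and dimensions are illustrative, not from any real model: every expert's weight matrix is allocated up front (the memory cost), but only the two selected experts actually run (the FLOPs saving).

```python
import numpy as np

rng = np.random.default_rng(0)

D, N_EXPERTS, TOP_K = 64, 8, 2  # hidden dim, expert count, experts per token

# ALL expert weights are allocated up front -- this is the memory cost.
expert_weights = [rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(N_EXPERTS)]
router_weights = rng.standard_normal((D, N_EXPERTS)) / np.sqrt(D)

def moe_layer(x: np.ndarray) -> np.ndarray:
    """One token through a Top-K MoE layer: route, run K experts, mix."""
    logits = x @ router_weights
    top_k = np.argsort(logits)[-TOP_K:]   # indices of the K highest-scoring experts
    gates = np.exp(logits[top_k] - logits[top_k].max())
    gates /= gates.sum()                  # softmax over the chosen experts only
    # Only TOP_K of N_EXPERTS matmuls execute -- this is the FLOPs saving.
    return sum(g * (x @ expert_weights[i]) for g, i in zip(gates, top_k))

y = moe_layer(rng.standard_normal(D))
print(y.shape)  # (64,)
```

Note that `expert_weights` holds all 8 matrices regardless of routing: the unchosen experts sit idle in memory, exactly the point of the paragraph above.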
📊 Numbers for Intuition
| Model | FP16 weights | 4-bit weights |
|---|---|---|
| Dense 7B | ≈ 14 GB | ≈ 4–5 GB (+ KV cache) |
| Dense 70B | ≈ 140 GB | ≈ 35–45 GB |
| MoE 8x7B (Top-2) | ≈ 112 GB (total ≈ 56 B params) | ≈ 28–35 GB |
| MoE 16x8B (Top-2) | ≈ 256 GB (total ≈ 128 B) | ≈ 64–80 GB |
With MoE 8x7B, only ≈ 14 B parameters are active per token – but ≈ 56 B remain loaded. (These are naive 8 × 7 B figures; real models such as Mixtral 8x7B share the attention layers across experts, so the actual totals are somewhat lower: ≈ 47 B total, ≈ 13 B active.)
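The table's figures come straight from parameters × bits per parameter. A small estimator (ignoring KV cache, activations, and framework overhead; GiB rather than vendor GB, hence slightly smaller numbers than the table) makes the arithmetic explicit:

```python
def weight_gib(n_params_b: float, bits_per_param: float) -> float:
    """Approximate weight memory in GiB for n_params_b billion parameters."""
    return n_params_b * 1e9 * bits_per_param / 8 / 2**30

# Dense 7B vs. a naive 8x7B MoE (56 B total, 14 B active with Top-2)
print(f"Dense 7B,   FP16: {weight_gib(7, 16):6.1f} GiB")   # ~13.0 GiB
print(f"MoE 8x7B,   FP16: {weight_gib(56, 16):6.1f} GiB")  # ~104.3 GiB resident
print(f"MoE active, FP16: {weight_gib(14, 16):6.1f} GiB")  # ~26.1 GiB *computed*, not saved
print(f"MoE 8x7B,  4-bit: {weight_gib(56, 4):6.1f} GiB")   # ~26.1 GiB
```

The third line is the crux: 26 GiB worth of parameters do the work per token, but the full 104 GiB must still be resident.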
🚀 Why Providers Love MoE
- Higher throughput: Only 2 of 8 (or 16) experts compute → more tokens/s per GPU budget.
- Better specialization: Experts learn niches, quality improves with the same active parameters.
🙃 Why End Users Rarely Save RAM
- All experts must remain resident (GPU/CPU). An 8x7B MoE in FP16 only fits on ≥2x80 GB GPUs or with heavy quantization/offload.
- Additional memory is consumed by KV cache (batching, context length!). Paged attention helps with KV cache, not with weights.
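The KV cache cost mentioned above is easy to estimate: two tensors (K and V) per layer, per KV head, per cached token. The sketch below uses the standard formula with the published Llama-2-7B geometry (32 layers, 32 KV heads, head_dim 128) as a worked example; the function itself is a hypothetical helper, not a library API:

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 seq_len: int, batch: int, bytes_per_elem: int = 2) -> float:
    """KV cache size: 2 (K and V) x layers x kv_heads x head_dim x cached tokens."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem / 2**30

# Llama-2-7B-like geometry at FP16
print(f"{kv_cache_gib(32, 32, 128, seq_len=4096, batch=1):.2f} GiB")   # 2.00 GiB
print(f"{kv_cache_gib(32, 32, 128, seq_len=4096, batch=16):.2f} GiB")  # 32.00 GiB
```

The batch-16 line shows why serving workloads blow past weight memory: the cache grows linearly in both batch size and context length, which is exactly the fragmentation problem paged attention addresses.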
⭐ Exceptions (With a Big Asterisk)
There are setups that “swap” experts:
- CPU/NVMe offload: Only active experts move to the GPU. This often requires 256–512 GB system RAM or very fast NVMe arrays (20–40 GB/s) – and comes with latency spikes (+50–300 ms/token) and complexity.
- On-demand loading/expert paging: Research-stage, fragile, low throughput. Works, but not “free.”
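To see why expert paging causes the latency spikes mentioned above, consider a toy LRU cache that keeps only a few experts "on GPU" at a time. This is a hypothetical sketch of the idea, not any real framework's implementation; each cache miss stands in for a slow CPU/NVMe → GPU transfer:

```python
from collections import OrderedDict

class ExpertCache:
    """Toy LRU cache holding at most `capacity` experts resident at once."""

    def __init__(self, capacity: int, load_fn):
        self.capacity = capacity
        self.load_fn = load_fn          # stand-in for a CPU/NVMe -> GPU transfer
        self.resident = OrderedDict()   # expert_id -> weights "on GPU"
        self.misses = 0

    def get(self, expert_id):
        if expert_id in self.resident:
            self.resident.move_to_end(expert_id)   # mark as recently used
        else:
            self.misses += 1                       # slow path: page the expert in
            if len(self.resident) >= self.capacity:
                self.resident.popitem(last=False)  # evict least recently used
            self.resident[expert_id] = self.load_fn(expert_id)
        return self.resident[expert_id]

cache = ExpertCache(capacity=2, load_fn=lambda i: f"weights[{i}]")
for expert_id in [0, 1, 0, 2, 3, 0]:   # router decisions over a few tokens
    cache.get(expert_id)
print(cache.misses)  # 5
```

Five misses in six lookups: because routing jumps between experts from token to token, a small resident set thrashes, and every miss stalls generation for the duration of one expert transfer. That is the "not free" part.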
🧠 Conclusion
MoE is primarily a throughput/efficiency lever for providers, not a magic RAM saver for home PCs. If the goal is RAM reduction: opt for small dense models, aggressive quantization (e.g., 4-bit), and clever KV cache strategies. If the goal is cost per token: MoE shines. ✨
Ready for the next step?
Tell us about your project – we'll find the right AI solution for your business together.
Request a consultation