UFM ● research
run models larger than your VRAM
UFM treats GPU VRAM and CPU RAM as one elastic pool. It keeps the hot parts of a model on the card, prefetches what's about to be used, and evicts least-recently-used sub-modules when memory gets tight — so a single 4090 can run a model whose footprint exceeds 24 GB.
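The keep-hot and evict-LRU policy described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not UFM's actual code: `LRUModuleCache`, `touch`, and the 1 GB expert sizes are hypothetical names and values, and the host-to-device copy is reduced to a counter so the example runs without a GPU.

```python
from collections import OrderedDict


class LRUModuleCache:
    """Hypothetical helper: tracks which sub-modules are resident under a byte budget."""

    def __init__(self, budget_bytes):
        self.budget_bytes = budget_bytes
        self.used_bytes = 0
        self.resident = OrderedDict()  # name -> size in bytes, least recently used first
        self.transfers = 0             # stands in for host-to-device copies

    def touch(self, name, size_bytes):
        """Mark `name` as used, loading it if absent. Returns the names evicted."""
        if size_bytes > self.budget_bytes:
            raise ValueError(f"{name} is larger than the whole budget")
        if name in self.resident:
            # Hit: refresh recency, no transfer needed.
            self.resident.move_to_end(name)
            return []
        evicted = []
        # Miss: evict least-recently-used entries until the newcomer fits.
        while self.used_bytes + size_bytes > self.budget_bytes:
            victim, victim_size = self.resident.popitem(last=False)
            self.used_bytes -= victim_size
            evicted.append(victim)
        self.resident[name] = size_bytes
        self.used_bytes += size_bytes
        self.transfers += 1  # a real system would copy weights to the GPU here
        return evicted


GB = 1024 ** 3
cache = LRUModuleCache(budget_bytes=3 * GB)   # assumed budget: room for three experts
for name in ("expert_0", "expert_1", "expert_2"):
    cache.touch(name, GB)
cache.touch("expert_0", GB)                   # hit: expert_0 becomes most recent
evicted = cache.touch("expert_3", GB)         # miss: expert_1 is now the oldest
print(evicted, list(cache.resident))
```

In this sketch a prefetch is simply a `touch` issued before the compute step needs the module; a real implementation would overlap that copy with computation.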
It's the first piece of the research program to get a formal, reproducible benchmark. On a routed Mixture-of-Experts, the standard all-on-GPU approach OOMs at a 24 GB expert bank; UFM runs the same model while holding VRAM at 19.6 GB. When the active working set fits the budget, UFM runs within ~1% of full-GPU throughput and ~240× faster than naive CPU offloading.
I also published the case where it doesn't help: touch every expert every step and you're transfer-bound, so UFM only ties dumb streaming. It's a bet on routing locality, not magic memory, and saying so plainly is the point.
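Why locality matters can be seen in a toy simulation. The sketch below is an illustration only, not the published benchmark: the expert count, cache capacity, and access patterns are all assumed values, and `hit_rate` is a hypothetical helper that counts how often an LRU cache already holds the requested expert.

```python
import random
from collections import OrderedDict


def hit_rate(accesses, capacity):
    """Fraction of accesses served from an LRU cache holding `capacity` experts."""
    cache = OrderedDict()
    hits = 0
    for expert in accesses:
        if expert in cache:
            hits += 1
            cache.move_to_end(expert)      # refresh recency on a hit
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used expert
            cache[expert] = True
    return hits / len(accesses)


NUM_EXPERTS = 16  # assumed size of the expert bank
CAPACITY = 12     # assumed number of experts that fit in the VRAM budget
STEPS = 200

# Failure case: every step touches every expert, in a fixed order.
sweep = [e for _ in range(STEPS) for e in range(NUM_EXPERTS)]

# Locality case: routing only ever selects 8 of the 16 experts.
rng = random.Random(0)
local = [rng.randrange(8) for _ in range(STEPS * NUM_EXPERTS)]

sweep_rate = hit_rate(sweep, CAPACITY)  # 0.0: each expert is evicted before it is reused
local_rate = hit_rate(local, CAPACITY)  # near 1.0: only first touches miss
print(f"sweep: {sweep_rate:.4f}  local: {local_rate:.4f}")
```

A fixed-order sweep over more experts than fit is the worst case for LRU: every access misses, so every access pays for a transfer. With local routing the working set fits and only the first touch of each expert misses.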
// highlights
- Runs a 24 GB expert bank on a 23.5 GB RTX 4090 (baseline OOMs)
- Within ~1% of baseline throughput when the working set fits the budget
- ~240× faster than naive CPU offload, via LRU keep-hot caching
- One-command reproducible benchmark + honest failure case
- Open source (MIT) — github.com/Linutesto/ufm
// stack
- PyTorch
- CUDA
- memory paging
- MoE