UFM ● research

run models larger than your VRAM

UFM treats GPU VRAM and CPU RAM as one elastic pool. It keeps the hot parts of a model on the card, prefetches what's about to be used, and evicts least-recently-used sub-modules when memory gets tight, so a single RTX 4090 can run a model whose footprint exceeds 24 GB.
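To make the keep-hot / evict-LRU idea concrete, here is a minimal sketch of budgeted LRU residency tracking in plain Python. It is not UFM's actual code: the class name `LRUResidency`, the `touch` method, and the unit sizes are hypothetical, and the real host-to-device copies are only indicated in comments (in PyTorch they might be `module.to("cuda")` and `module.to("cpu")`, which is an assumption about how one could wire it up).

```python
from collections import OrderedDict


class LRUResidency:
    """Tracks which sub-modules count as GPU-resident under a byte budget.

    Hypothetical illustration only; sizes are abstract units, not real VRAM.
    """

    def __init__(self, budget):
        self.budget = budget
        self.used = 0
        self.resident = OrderedDict()  # name -> size, least recently used first
        self.evicted = []              # eviction log, oldest eviction first

    def touch(self, name, size):
        """Mark a sub-module as used. Returns True on a hit, False on a load."""
        if name in self.resident:
            self.resident.move_to_end(name)  # now the most recently used
            return True
        if size > self.budget:
            raise ValueError(f"{name} ({size}) exceeds the whole budget")
        # Evict least-recently-used entries until the newcomer fits.
        while self.used + size > self.budget:
            old_name, old_size = self.resident.popitem(last=False)
            self.used -= old_size
            self.evicted.append(old_name)  # a real system would copy it to CPU here
        self.resident[name] = size         # a real system would copy it to GPU here
        self.used += size
        return False


# Budget of 3 units, four experts of 1 unit each.
pool = LRUResidency(budget=3)
results = [pool.touch(name, 1) for name in ["e0", "e1", "e2", "e0", "e3"]]
print(results)              # [False, False, False, True, False]
print(pool.evicted)         # ['e1']
print(list(pool.resident))  # ['e2', 'e0', 'e3']
```

Re-touching `e0` before `e3` arrives is what saves it: `e1` becomes the least recently used entry and is the one evicted. Prefetching is left out of this sketch; it would amount to calling `touch` ahead of time for the experts the router is expected to pick next.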

It's the first piece of the research program to get a formal, reproducible benchmark. On a routed Mixture-of-Experts model, the standard all-on-GPU approach runs out of memory (OOM) at a 24 GB expert bank; UFM runs the same model while holding VRAM at 19.6 GB. When the active working set fits the budget, UFM stays within ~1% of full-GPU throughput and is ~240× faster than naive CPU offloading.

I also published the case where it doesn't help: touch every expert on every step and the run becomes transfer-bound, so UFM only ties dumb streaming. It's a bet on routing locality, not magic memory, and saying so plainly is the point.
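The failure case can be shown with a toy simulation. This is a sketch under stated assumptions, not the published benchmark: the expert count, cache capacity, and access patterns below are made up, and `lru_hit_rate` is a hypothetical helper. It counts how often an LRU cache already holds the expert being requested.

```python
from collections import OrderedDict


def lru_hit_rate(accesses, capacity):
    """Fraction of accesses served from an LRU cache holding `capacity` items."""
    cache = OrderedDict()
    hits = 0
    for key in accesses:
        if key in cache:
            hits += 1
            cache.move_to_end(key)         # refresh recency on a hit
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used
            cache[key] = True
    return hits / len(accesses)


NUM_EXPERTS = 8  # illustrative values, not taken from the benchmark
CAPACITY = 6     # how many experts fit in the budget
STEPS = 100

# Worst case: every step touches every expert, in order.
touch_all = [e for _ in range(STEPS) for e in range(NUM_EXPERTS)]
# Routing locality: every step touches the same four experts.
local = [e for _ in range(STEPS) for e in (0, 1, 2, 3)]

touch_all_rate = lru_hit_rate(touch_all, CAPACITY)
local_rate = lru_hit_rate(local, CAPACITY)
print(touch_all_rate)  # 0.0
print(local_rate)      # 0.99
```

When the per-step working set is larger than the cache and is swept in order, LRU always evicts the expert that will be needed soonest, so every access turns into a transfer. When the working set fits, only the first touch of each expert misses.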

// highlights

Runs a 24 GB expert bank on a 23.5 GB RTX 4090 (baseline OOMs)
Within ~1% of baseline throughput when the working set fits the budget
~240× faster than naive CPU offload (LRU keep-hot caching)
One-command reproducible benchmark + honest failure case
Open source (MIT) — github.com/Linutesto/ufm

// stack

PyTorch
CUDA
memory paging
MoE

Follow this work.