Running Local LLMs at Home: RTX 3060 vs 4090 vs 5090 on Qwen3 and Gemma 4
A handful of consumer GPUs now sit at the center of serious homelab inference. The interesting question is no longer “can I run a 30B-class model” but “which card should I buy for the workload I actually have.” Below is a method-grounded look at three tiers (the RTX 3060 12GB, RTX 4090 24GB, and RTX 5090 32GB) running today’s small-active-parameter mixture-of-experts models, Qwen3.5-35B-A3B, its newer sibling Qwen3.6-35B-A3B, and Gemma 4 26B-A4B, alongside the dense Gemma 4 31B.
The hardware ladder, and why VRAM beats FLOPS
Three specs define the deck. The RTX 3060 12GB is Ampere with a 192-bit GDDR6 bus, roughly 360 GB/s of memory bandwidth, 170 W TDP, and 3,584 CUDA cores. The RTX 4090 24GB is Ada Lovelace with 384-bit GDDR6X at about 1,008 GB/s, 450 W TDP, and 16,384 cores. The RTX 5090 32GB is Blackwell with 512-bit GDDR7 at 1,792 GB/s, 575 W TDP, and 21,760 cores — roughly a 78% memory-bandwidth uplift over the 4090.
The naive read is “5090 has the most compute, 3060 the least.” For local LLM inference that read is wrong. Memory bandwidth, not raw TFLOPS, is the binding constraint on token generation, because every decoded token must stream the model’s active weights out of VRAM. The 5090’s 1,792 GB/s is the reason its generation numbers scale the way they do — not the core count.
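The bandwidth argument above can be turned into a rough ceiling: if each decoded token has to read the active weights from VRAM once, tokens/sec cannot exceed bandwidth divided by the bytes read per token. The sketch below uses the bandwidth figures quoted earlier; the 2.0 GB of weights read per token is a purely illustrative assumption, and the function name is hypothetical. Real throughput lands below this bound because of KV-cache reads, kernel overhead, and sampling.

```python
# Bandwidth-bound ceiling on decode speed: each generated token must read the
# active weights from VRAM once, so tok/s <= bandwidth / bytes_read_per_token.

# Memory bandwidth figures quoted in the text, in GB/s.
BANDWIDTH_GB_S = {
    "RTX 3060 12GB": 360.0,
    "RTX 4090 24GB": 1008.0,
    "RTX 5090 32GB": 1792.0,
}


def decode_ceiling_tok_s(bandwidth_gb_s: float, active_weights_gb: float) -> float:
    """Upper bound on tokens/sec when decoding is purely bandwidth-limited."""
    if active_weights_gb <= 0:
        raise ValueError("active_weights_gb must be positive")
    return bandwidth_gb_s / active_weights_gb


if __name__ == "__main__":
    # ASSUMPTION: 2.0 GB of weights read per token (illustrative, not measured).
    active_gb = 2.0
    for card, bandwidth in BANDWIDTH_GB_S.items():
        ceiling = decode_ceiling_tok_s(bandwidth, active_gb)
        print(f"{card}: at most {ceiling:.0f} tok/s")
```

Whatever value is plugged in for the per-token read, the ratio between cards is fixed by bandwidth alone, which is why the 5090's ceiling sits about 78% above the 4090's.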
VRAM capacity is the second constraint, and often the decisive one. A model that doesn’t fit in a card’s memory doesn’t run on it, full stop; a model that spills over into system RAM collapses to a fraction of GPU throughput. The 12 / 24 / 32 GB ladder is the real product segmentation for this class of work.
MoE changes the math: total params set VRAM, active params set speed
This is the part most homelab writeups get muddled. Qwen3.5-35B-A3B and Qwen3.6-35B-A3B both advertise “35B” but activate only ~3B parameters per token. Gemma 4’s 26B-A4B is the same pattern at 26B total / ~4B active — the “A4B” suffix is Google’s notation for the active-parameter count. Gemma 4 31B is the dense outlier: all 31B parameters run on every token.
The implication is asymmetric:
- Resident VRAM scales with total parameters. Every expert’s weights must live in memory, even though only a few fire per token. A 35B-total MoE at Q4_K_M still needs roughly 18–20 GB just to host the weights, before KV cache and context.
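- Per-token compute and weight-read bandwidth scale with active parameters. A 3B-active MoE generates at speeds that look impossible for a 35B dense model on the same card. (KV-cache traffic is separate: it grows with context length, not with the active-parameter count.)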
So when sizing hardware, use total parameters to decide whether a model fits, and active parameters to estimate how fast it will run. Conflating the two is the single most common source of misleading claims in this space — including the “your $800 GPU is now a frontier workstation” genre, which is really just describing a model that feels like a 3B workload in throughput while carrying a much larger model’s quality.
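The total-versus-active split can be made concrete with a little arithmetic. The sketch below backs an effective bits-per-weight out of the 20 GB upper end of the Q4_K_M range quoted above, then applies it to the ~3B active parameters. It assumes every weight is stored at the same average precision, which real GGUF quants do not strictly do, and the helper names are hypothetical.

```python
def effective_bits_per_weight(file_gb: float, total_params_b: float) -> float:
    """Average bits per weight implied by a quantized file size.

    file_gb is in GB (1e9 bytes); total_params_b is in billions of parameters.
    """
    return file_gb * 8.0 / total_params_b


def weights_gb(params_b: float, bits_per_weight: float) -> float:
    """Size in GB of `params_b` billion weights at the given average precision."""
    return params_b * bits_per_weight / 8.0


if __name__ == "__main__":
    total_b, active_b = 35.0, 3.0   # 35B total, ~3B active per token
    file_gb = 20.0                  # upper end of the 18-20 GB figure in the text

    bpw = effective_bits_per_weight(file_gb, total_b)
    print(f"effective bits/weight: {bpw:.2f}")
    print(f"resident weights (decides fit): {weights_gb(total_b, bpw):.1f} GB")
    print(f"weights read per token (decides speed): {weights_gb(active_b, bpw):.2f} GB")
```

The first number decides which card can host the model; the second is what the memory bus has to move for every generated token.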
The three test subjects and what fits where
Qwen3.5/3.6-35B-A3B ships in community GGUF quants (Q4_K_M, Q5_K_M, Q8_0) from Unsloth and similar repos. A Q4_K_M build lands near 19–20 GB, so it fits the 4090 at usable context, the 5090 with headroom, and the 3060 only with a much more aggressive quant or CPU/expert offload — a 20 GB Q4 build does not fit 12 GB. This is worth stating plainly, because it reconciles an apparent contradiction: a community LocalLLaMA benchmark (via Startup Fortune) reports Qwen3.6-35B-A3B at roughly 110 tok/s on an RTX 4070 Super with 12 GB VRAM. That’s a real and useful data point for the 12 GB tier, but it can only hold on 12 GB at a tighter quant or with the MoE’s cold experts offloaded — not the full Q4_K_M load. Treat it as “what a 12 GB card can do with the model squeezed to fit,” not “Q4 on 12 GB.”
Gemma 4 26B-A4B (26B total / ~4B active) sits in a similar VRAM band and is the only model here that Google explicitly positions as running on consumer GPUs. Gemma 4 31B is dense, with no expert-skipping trick to lean on; depending on quant it lands roughly in the low-to-high 20s of GB, which is why the 5090’s 32 GB is the first tier that holds it comfortably with real context length. Google’s own numbers put the 31B at #3 and the 26B at #6 on the Arena AI text leaderboard at release, “outcompeting models 20x its size” — impressive, though as always, leaderboard standing and your specific workload are different things.
Runtime stack and quantization choices
For a single-user homelab with one or two cards, the realistic runtimes are llama.cpp (with its llama-bench utility) and Ollama (which wraps llama.cpp). vLLM matters mainly once you serve concurrent requests — which, notably, is the regime where the 5090’s extra VRAM pulls decisively ahead: 32 GB lets you hold two models resident at once and route between them, or run continuous batching that a 24 GB card can’t sustain.
Quantization picks per tier, in practice:
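- RTX 3060 12GB: 12 GB spread across 35B weights leaves under 3 bits per weight before any KV cache, so a Q2-class quant or expert offload is the ceiling for the 35B MoE, and only at short context. This is the “make it fit and accept the tradeoffs” tier.
- RTX 4090 24GB: Q4_K_M is the comfortable fit for the Qwen MoE, and Q5_K_M is viable for Gemma 4 26B-A4B; Q8_0 of either needs at least one byte per weight (26–35 GB), so it only runs here with expert offload. The dense 31B fits around Q4, with Q5 tight. The homelab sweet spot for this model generation.
- RTX 5090 32GB: the dense 31B at Q5–Q6 with 8k+ context becomes realistic (Q8 needs about 31 GB for weights alone), and the MoE models gain room for higher quants and longer context.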
Exact per-quant VRAM figures for Gemma 4 aren’t published yet, so treat the Gemma bands above as GGUF community estimates and verify against whatever quant you actually pull.
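A quick way to sanity-check a tier pick is to add the KV cache and a runtime overhead to the weight size and compare against the card's VRAM. The sketch below uses the standard KV-cache size for plain attention (one K and one V vector per layer per token). The layer count, KV-head count, and head dimension are hypothetical placeholders, not published figures for any of the models above, and the 1 GB overhead is likewise an assumption. Architectures with sliding-window or hybrid attention will use less than this estimate.

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: int = 2) -> float:
    """KV cache size in GB: one K and one V vector per layer per token."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value
    return total_bytes / 1e9


def fits(vram_gb: float, weights_gb: float, kv_gb: float,
         overhead_gb: float = 1.0) -> bool:
    """True if weights + KV cache + runtime overhead fit within VRAM."""
    return weights_gb + kv_gb + overhead_gb <= vram_gb


if __name__ == "__main__":
    # HYPOTHETICAL architecture values; read the real ones from the model's
    # config or GGUF metadata before trusting the result.
    kv = kv_cache_gb(n_layers=48, n_kv_heads=4, head_dim=128, context_tokens=8192)
    weights = 20.0  # GB, e.g. a Q4_K_M build of the 35B MoE per the text

    for card, vram in [("RTX 3060", 12), ("RTX 4090", 24), ("RTX 5090", 32)]:
        verdict = "fits" if fits(vram, weights, kv) else "does not fit"
        print(f"{card} ({vram} GB): {verdict} with {kv:.2f} GB of KV cache")
```

The point of the exercise is the shape of the calculation rather than the specific verdicts: context length enters through the KV term, quantization through the weights term.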
What to actually benchmark, and how
The dimensions that matter for a homelab decision are: tokens/sec for steady-state generation, time-to-first-token for interactive feel, max context length before VRAM spills (or before you have to drop a quant level), and tokens/watt for anything left running. Phoronix’s llama.cpp RTX 5090 review is the cleanest public methodology reference for llama.cpp throughput, and there are community head-to-heads (e.g. a Tech-Practice YouTube comparison of Qwen3.6-35B-A3B across the 3090/4090/5090) — but their per-GPU numbers aren’t reliably extractable, so treat them as method references and re-measure rather than quoting figures.
That’s the honest core of this piece: no published benchmark covers the exact (3060, 4090, 5090) × (Qwen 35B-A3B, Gemma 4 26B-A4B, Gemma 4 31B) matrix. Anything claiming those specific numbers without a citation is guessing. The reproducible path is to publish the commands and run them yourself.
For tokens/sec on a GGUF:
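llama-bench -m model.gguf -ngl 99 -p 512 -n 128
The -ngl 99 flag offloads every layer to the GPU, which is what you want for a clean per-card comparison (drop it lower on the 3060 when a model won’t fully fit). llama-bench does not take llama-cli’s -c context flag; the tested context is set by the prompt length (-p) and generation length (-n), so raise -p to probe longer contexts. For power and efficiency, sample the draw once per second during a generation run and divide tokens/sec by average watts, which gives tokens per joule: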
nvidia-smi --query-gpu=power.draw --format=csv,noheader,nounits -l 1
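Turning those power readings into an efficiency figure takes only a few lines. The sketch below parses the one-number-per-line output of the nvidia-smi query above and divides a measured tokens/sec by the mean wattage, which yields tokens per joule. The function names are hypothetical, and the helper that shells out assumes nvidia-smi is on the PATH; on a multi-GPU machine the query prints one line per GPU, so filter to the card under test first.

```python
import subprocess
from statistics import mean


def parse_power_samples(csv_text: str) -> list[float]:
    """Parse nvidia-smi output with one wattage per line (no header, no units)."""
    samples = []
    for line in csv_text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            samples.append(float(line))
        except ValueError:
            # Skip non-numeric readings such as "[N/A]".
            continue
    return samples


def tokens_per_joule(tokens_per_sec: float, avg_watts: float) -> float:
    """tokens/sec divided by watts (joules/sec) gives tokens per joule."""
    if avg_watts <= 0:
        raise ValueError("avg_watts must be positive")
    return tokens_per_sec / avg_watts


def sample_power_once() -> str:
    """Take a single reading by running the query shown in the text."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=power.draw", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Placeholder log text, NOT a measurement; replace with your captured output.
    log_text = "300.0\n310.0\n[N/A]\n290.0\n"
    watts = mean(parse_power_samples(log_text))
    # Placeholder tokens/sec; take the real value from your llama-bench run.
    print(f"avg draw {watts:.1f} W, {tokens_per_joule(150.0, watts):.3f} tokens/joule")
```

Sampling during generation only (not model load or idle) keeps the average honest, since load and idle draw differ sharply from steady-state decode.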
What to do
- RTX 3060 12GB: the always-on, lightweight-MoE card. Qwen3.5/3.6-35B-A3B squeezed to fit, at short context, is the realistic workload; the 170 W TDP makes it painless to leave running.
- RTX 4090 24GB: the sweet spot for this model generation. It holds Gemma 4 26B-A4B at Q5 and the dense 31B around Q4, runs the Qwen MoE at Q4_K_M with healthy context, and does it at well under the 5090’s power draw.
- RTX 5090 32GB: earns its premium when you need long context on the dense 31B, want to host two models concurrently, or care about absolute tokens/sec over tokens/watt.
The right way to decide is empirical: pick the model you actually want to run, pick the quant that fits the card you’re weighing, run llama-bench at the context length you care about, and read the resulting tokens/sec and VRAM-used numbers off your own machine. The vendor benchmarks won’t tell you how your workload behaves. Your own nvidia-smi log will.
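For reading the numbers off your own machine, llama-bench can emit machine-readable results with -o json. The sketch below summarizes such output into prompt-processing and generation throughput. It assumes each result row carries n_prompt, n_gen, and avg_ts fields, which is worth confirming against the build you are running; the function name is hypothetical and the sample values are placeholders, not measurements.

```python
import json


def summarize(bench_json: str) -> dict[str, float]:
    """Map each llama-bench test to its average tokens/sec.

    Rows with n_gen == 0 are prompt-processing runs (pp); rows with
    n_prompt == 0 are token-generation runs (tg).
    """
    summary = {}
    for row in json.loads(bench_json):
        if row["n_gen"] == 0:
            label = f"pp{row['n_prompt']}"
        elif row["n_prompt"] == 0:
            label = f"tg{row['n_gen']}"
        else:
            label = f"pp{row['n_prompt']}+tg{row['n_gen']}"
        summary[label] = row["avg_ts"]
    return summary


if __name__ == "__main__":
    # Placeholder JSON in the assumed shape; NOT real benchmark results.
    sample = json.dumps([
        {"n_prompt": 512, "n_gen": 0, "avg_ts": 1000.0},
        {"n_prompt": 0, "n_gen": 128, "avg_ts": 50.0},
    ])
    for test, tok_s in summarize(sample).items():
        print(f"{test}: {tok_s:.1f} tok/s")
```

Saving one JSON file per card and quant makes the comparison matrix reproducible, and the tg figure is the one to feed into any tokens-per-joule calculation.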