
Running 32B AI Models at Home: How I Built My Local LLM Workstation

· David Steeman · AI

There is something philosophically satisfying about running AI on your own hardware. No subscription, no rate limits, no terms-of-service clause about what you can ask. Your prompts stay on your network. The model runs in your house, on your electricity, and answers to no one but you.

That was the appeal, anyway. The practical questions were harder: what hardware do you actually need, what does it cost, and does the end result perform well enough to be useful? This post answers all three.

Why run locally?

The case for local LLMs (Large Language Models) comes down to a few things that matter differently to different people.

Privacy is the obvious one. When you query a cloud API, your prompt leaves your machine and passes through someone else’s infrastructure. For most queries that is fine. For anything sensitive — internal project details, code with proprietary logic, personal research — it is not. A local model cannot leak what it never receives.

Cost is less obvious until you actually use AI heavily. Cloud APIs charge per token (the chunks of text the model processes and generates). A 32B parameter model via API can cost several euros per hour of active use. Running locally, the marginal cost of a query is a fraction of a cent of electricity.
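To put "a fraction of a cent" in numbers, here is a back-of-envelope sketch using this build's own figures from later in the post (~35 tok/s generation, ~0.45 kW system draw under load, €0.25/kWh); the 500-token reply length is an arbitrary example:

```shell
# Marginal electricity cost of one local query. Figures are this build's
# measured numbers; the 500-token reply length is an arbitrary example.
tokens=500
awk -v t="$tokens" 'BEGIN {
  seconds = t / 35                # generation time at ~35 tok/s
  kwh     = 0.45 * seconds / 3600 # energy used at ~0.45 kW under load
  printf "~€%.4f per %d-token reply\n", kwh * 0.25, t
}'
```

Even generating continuously, the electricity bill stays far below per-token API pricing.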

Availability matters too. A local model is always on, works without internet, and is not subject to outages, rate limits, or the provider deciding to deprecate the model you’ve built a workflow around.

And then there is the hobbyist angle. I enjoy building things. The question of whether you should just pay for an API is separate from whether building your own is interesting. It is.

The hardware decision — VRAM is everything

The first thing to understand about LLM inference is that the GPU’s VRAM (Video RAM — the dedicated memory on the graphics card) is the single most important constraint in the system. Not the CPU, not system RAM, not the SSD. VRAM.

Here is why: a language model is, at its core, a large set of numbers — billions of parameters — that need to be loaded and accessed at high speed during inference. If the entire model fits in VRAM, the GPU handles everything in-memory at maximum speed. If it does not fit, the model spills into system RAM, and the GPU spends time waiting for data over a much slower bus. The performance penalty is severe: 3–5× slower, sometimes more.

Model size and VRAM requirements are roughly predictable. The key variables are parameter count and quantization. Quantization reduces the precision of each stored number — a full-precision (f16) 32B model needs roughly 64 GB of VRAM, which is currently only available on enterprise hardware. But quantized models trade a small amount of quality for a dramatic reduction in size. A 32B model at Q4_K_M quantization (4-bit with a moderate quality optimisation) fits in about 20 GB — within reach of consumer hardware.
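As a rough sanity check, weight size scales with parameter count times bits per weight. The 4.85 bits/weight figure below is an approximation for Q4_K_M; actual GGUF file sizes vary by quantization scheme and architecture, and inference needs additional VRAM for the KV cache on top:

```shell
# Rough weight-size estimate: parameters × effective bits per weight ÷ 8.
# 4.85 bits/weight is an approximate figure for Q4_K_M; real file sizes
# vary, and the KV cache needs extra VRAM on top of the weights.
params_b=32      # billions of parameters
bits=4.85
awk -v p="$params_b" -v b="$bits" \
  'BEGIN { printf "~%.1f GB of weights\n", p * b / 8 }'
```

Swapping in 70 for `params_b` shows why that class of model needs two 24 GB cards.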

This narrows the GPU choice considerably. With 16 GB of VRAM, you are limited to 13B models. With 24 GB, 32B models fit cleanly. With less than 16 GB, you are looking at 7–8B models, which are capable but noticeably weaker on complex tasks.

The NVIDIA RTX 3090 has 24 GB of VRAM and, at the time of writing, can be found secondhand for around €600–700. Newer cards in the same price bracket typically have less VRAM (the RTX 4080 has 16 GB, for example), because NVIDIA has largely reserved high-VRAM configurations for their professional lines. For LLM use specifically, the 3090 remains a strong choice: more VRAM per euro than almost anything else on the consumer market.

The CUDA ecosystem also matters. NVIDIA’s CUDA (Compute Unified Device Architecture) is the dominant platform for GPU computing, and most LLM tooling — including Ollama, which I use — has better and more mature CUDA support than the AMD alternative (ROCm). If you are serious about running local models, NVIDIA is still the pragmatic choice.

The full build

The rest of the machine matters less than the GPU, but it still matters. Here is what I ended up with, and what I paid:

| Component | Model | Price |
|---|---|---|
| GPU | NVIDIA RTX 3090 FE 24 GB | €700 |
| CPU | AMD Ryzen 9 5950X (16 cores / 32 threads) | (bundle) |
| Motherboard | ASUS ROG Crosshair VIII Hero X570 | (bundle) |
| CPU cooler | ASUS ROG Ryujin 360 mm AIO | (bundle) |
| PSU | ASUS ROG Thor 1,200 W 80+ Platinum | (bundle) |
| SSD | Crucial P310 2 TB NVMe PCIe 4.0 | (bundle) |
| Case | be quiet! Pure Base 500 | (bundle) |
| RAM | 64 GB G.Skill Ripjaws V DDR4-3600 CL16 (4× 16 GB) | €538 |
| Case fans | 3× Arctic P14 PWM PST | ~€30 |
| **Total** | | **~€2,138** |

The CPU, motherboard, cooler, PSU, SSD, and case came as a secondhand bundle for €900 — that deal is what made the overall build affordable. Individually, those components would have been considerably more.

One note on the RAM: €538 for 64 GB is significantly above what this memory cost a year prior. RAM prices were at a historic peak when I bought, driven by industry-wide shortages. If you are building now, check current prices — 64 GB DDR4 may be much cheaper. It is also worth noting that 64 GB is more than strictly necessary: since the GPU handles inference, system RAM is only used for the OS, software stack, and any CPU offloading when a model does not fit entirely in VRAM.

The AMD Ryzen 9 5950X has 16 cores and 32 threads. For LLM inference, you do not need this many cores — the GPU does the work. But those cores are useful for everything else running on the machine: the web UI, Docker containers, other services.

The 1,200 W PSU is sized for the RTX 3090’s appetite. Under full inference load the card draws around 350 W. The system as a whole sits closer to 400–450 W under load, and at around 90 W at idle. A 1,200 W PSU gives comfortable headroom and runs efficiently at these loads.

Fast NVMe storage (the Crucial P310 reads at ~7,100 MB/s) matters primarily for model loading times. A 19 GB model file loads from a fast NVMe in a few seconds. From a slower drive it can take noticeably longer. Once loaded, models stay in VRAM until evicted, so loading speed is a one-time startup cost per session.
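The load-time difference is easy to estimate from sequential read speed alone. The 550 MB/s SATA figure below is a generic assumption for comparison, not a measured one:

```shell
# Ideal sequential load time for a 19 GB model file at two read speeds.
# 550 MB/s is a generic SATA SSD figure, used here only for comparison;
# real load times also include some per-layer setup overhead.
awk 'BEGIN {
  size_mb = 19 * 1024
  printf "NVMe (7100 MB/s): ~%.1f s\n", size_mb / 7100
  printf "SATA  (550 MB/s): ~%.1f s\n", size_mb / 550
}'
```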

OS and driver setup

I chose Ubuntu over Windows. The AI/ML toolchain is more mature on Linux, Docker containers are simpler to manage, and there is no overhead from a desktop GUI when the machine is running as a server. The trade-off is that NVIDIA driver management on Linux can be fiddly.

It was, in fact, fiddly. After an apt upgrade I hit a version mismatch between the NVIDIA kernel driver and the CUDA libraries — a classic Linux NVIDIA headache. The fix involved identifying the conflicting package versions, removing the old driver, and reinstalling the correct combination. I ended up on driver 580.126.16 with CUDA 13.0, which has been stable since. A clean nvidia-smi confirms things are working:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.126.16             Driver Version: 580.126.16     CUDA Version: 13.0     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|=========================================================================================|
|   0  NVIDIA GeForce RTX 3090         Off |   00000000:08:00.0 Off |                  N/A |
|  0%   34C    P8             18W /  350W |       1MiB /  24576MiB |      0%      Default |
+-----------------------------------------------------------------------------------------+

Budget time for this if you go the Ubuntu route. It will probably work on the first try. But if it does not, the diagnostic path is not obvious unless you have done it before.
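For reference, the recovery path looks roughly like this on Ubuntu. The package names are examples for the 580 series and will differ by driver generation, so treat this as a sketch rather than a recipe:

```shell
# Sketch of recovering from a driver/CUDA version mismatch on Ubuntu.
# Package names are examples for the 580 series; adjust to your setup.
nvidia-smi                                   # confirms the mismatch error
dpkg -l | grep -E 'nvidia|cuda'              # see which versions are installed
sudo apt purge 'nvidia-*' 'libnvidia-*'      # remove the conflicting driver
sudo apt autoremove
sudo apt install nvidia-driver-580           # reinstall one matched series
sudo reboot
```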

Installing Ollama

Ollama is a local model runner that wraps llama.cpp and presents a clean API. The reason I chose it over the alternatives — LM Studio (which is GUI-first and primarily targets Windows and Mac) and llama.cpp directly (which gives more control but requires more configuration) — is that Ollama installs as a systemd service, auto-detects the GPU, and exposes an OpenAI-compatible API endpoint.

That last point matters a lot. Any tool that knows how to talk to OpenAI’s API — including Claude Code, custom scripts, and most AI-enabled applications — can point at your local Ollama instance instead with a single URL change. No code modifications required.

Installation is a one-liner:

curl -fsSL https://ollama.com/install.sh | sh

Ollama starts automatically as a service after installation. Pulling a model is equally simple:

ollama pull qwen2.5:32b-instruct-q4_K_M
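The OpenAI-compatible endpoint mentioned above can be exercised directly with curl. Ollama listens on port 11434 by default, and the model name must match one you have pulled:

```shell
# Chat completion against the local Ollama OpenAI-compatible endpoint.
# Assumes Ollama is running on its default port, 11434.
curl -s http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen2.5:32b-instruct-q4_K_M",
        "messages": [{"role": "user", "content": "Hello"}]
      }'
```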

A useful tool before committing to a model is LLMfit, which shows you which models will fit in your VRAM and how much headroom you have left. With 24 GB, 32B Q4_K_M models fit with a few gigabytes to spare — but only if you are careful about context window size.

Performance tuning — from 5 tok/s to 35 tok/s

This is the part I spent the most time on, and where the payoff was largest.

Fresh out of the box, with default settings, qwen2.5:32b-instruct-q4_K_M running through Ollama produced about 5 tokens per second. That is technically usable but painfully slow — like watching someone type one character at a time.

Checking GPU utilisation during inference showed the problem immediately:

watch -n 1 nvidia-smi

The GPU was at 76% utilisation, with 24% of the work happening on the CPU. The model was spilling out of VRAM.

The culprit was the context window. By default, Ollama uses a 32,768-token context window — enough to hold roughly 100 pages of text in memory at once. With a 32B Q4_K_M model already using ~20 GB of VRAM, the KV cache (the memory structure that holds context during generation) for a 32k context adds roughly 3–4 GB more. That pushes the total over 24 GB, and the overflow lands on system RAM.

The fix is a custom Modelfile that caps context to 8,192 tokens:

FROM qwen2.5:32b-instruct-q4_K_M
PARAMETER num_ctx 8192

Then build the capped variant from it:

ollama create qwen2.5-32b-8k -f Modelfile

The second optimisation was KV cache quantization — reducing the precision of the context cache entries, which halves their VRAM footprint with minimal impact on output quality. This is set via an environment variable in the Ollama systemd service:

sudo systemctl edit ollama

In the drop-in file this opens, add:

[Service]
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"

Then restart the service:

sudo systemctl restart ollama

After both changes and a service restart, the numbers looked very different:

| Configuration | Context | GPU util | Speed |
|---|---|---|---|
| Default (32k context) | 32,768 tokens | 76% GPU / 24% CPU | ~5 tok/s |
| Tuned (8k context + KV q8_0) | 8,192 tokens | 100% GPU | ~35.5 tok/s |

A 7× improvement. At 35 tokens per second, the model generates text faster than I can comfortably read it. The trade-off — a context window of 8k rather than 32k — is real: very long documents need to be chunked. For day-to-day conversational use and code generation, 8k is more than sufficient.

The software stack

Open WebUI provides a ChatGPT-like browser interface on top of Ollama. It runs in Docker, exposes a clean web UI on port 3000, and is accessible from any device on the home network. Model switching, conversation history, and file uploads are all included. I use it from the browser on any machine in the house.

docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main

Tailscale handles remote access. It creates a private overlay network between all your devices — phone, laptop, work PC, the LLM workstation — without opening any firewall ports or configuring a VPN. The Open WebUI is accessible from anywhere via Tailscale’s private IP, as if the workstation were on the local network. Installation takes about two minutes and requires no network configuration changes.

SearXNG (optional) is a self-hosted meta-search engine that can give the model web search capability when connected to Open WebUI. Useful for queries that require current information outside the model’s training data.

Models in use

The current model roster on the machine:

qwen2.5-32b-8k — My general-purpose daily driver. Qwen 2.5 from Alibaba’s research lab has strong instruction-following, good multilingual capability (helpful for my Belgian context), and performs well on reasoning tasks. At 32B parameters and Q4_K_M quantization with 8k context, it runs fully in VRAM at 35+ tok/s.

qwen2.5-coder-32b-8k — The code-focused variant of the same model. Stronger on Python, Bash, and network automation scripts. I use this one when the task is primarily technical.

qwen3-coder:30b-q4_k_m — The newest addition, pulling from Qwen’s third generation. Qwen3-Coder is explicitly trained for agentic coding tasks — writing and executing multi-step code, tool use, repository-level understanding. At 30B Q4_K_M, it fits comfortably in 24 GB and is becoming my default for anything code-related. The q5_k_m quantisation variant is also installed for tasks where quality matters more than raw speed.

llama3:8b and llama3.2:latest — Meta’s Llama models in their smaller sizes. Useful when you want a fast answer to a simple question and the full 32B is unnecessary overhead. Llama 3.2 is particularly quick for short-context summarisation and classification tasks.

Quality-wise, these models sit somewhere between mid-tier and strong cloud offerings for most tasks. They are not on par with GPT-4o or Claude Sonnet for complex multi-step reasoning, but they are fully capable for drafting, coding, research Q&A, and explanation tasks — which covers the majority of what I actually use AI for day to day.

How I actually use it

The machine runs 24/7. Ollama starts automatically on boot, models load on first request, and the whole stack is available on the network without me thinking about it.

Day-to-day uses:

  • Writing assistance and drafting, including posts for this website
  • Code generation and review, primarily with the Qwen Coder models
  • Technical research and Q&A
  • Network automation scripting at work (connecting via Tailscale)

Claude Code, the AI coding assistant I use to manage this website, can be pointed at the local Ollama endpoint as an alternative to Anthropic’s cloud API. For tasks that do not require the absolute best reasoning quality, running everything locally is a real option.
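Most tools built on the OpenAI SDKs pick up the standard environment variables, so redirecting one is often just a matter of two exports. The hostname `llm-box` is a placeholder for the workstation's Tailscale name or IP, and Ollama ignores the API key's value, though many clients require one to be set:

```shell
# Point an OpenAI-compatible tool at the local Ollama instance.
# "llm-box" is a placeholder hostname; Ollama does not validate the
# key, but many clients refuse to start without one.
export OPENAI_BASE_URL="http://llm-box:11434/v1"
export OPENAI_API_KEY="ollama"
```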

The Tailscale integration is particularly useful. My phone and laptop have Tailscale installed, so the local Open WebUI is accessible from anywhere — on the train, at work, without ever sending a query through an external server.

Honest assessment

What works well: Privacy is complete. Performance after tuning is genuinely fast and responsive. Zero per-query cost. The always-on service model means it is just there when you need it.

What does not: The 8k context limitation is a real constraint for long documents or extended conversations. The quality gap versus the top cloud models is real on complex reasoning tasks — do not expect it to match GPT-4o on hard problems.

When I still reach for a cloud API: Complex multi-step reasoning where quality matters most, very long-context tasks, anything where I need the absolute best answer rather than a good one.

Power consumption: ~350 W GPU draw under inference load (400–450 W for the system as a whole), ~90 W at idle. At the Belgian average of €0.25/kWh, the machine sitting idle 24/7 costs about €16/month in electricity (0.09 kW × 720 h × €0.25). Active inference sessions push that higher, but since the machine spends most of its time waiting rather than generating, €16–20/month is a realistic figure for typical use.
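The arithmetic from the paragraph above, for checking against your own tariff:

```shell
# Idle electricity cost per month: 90 W, running 24/7, at €0.25/kWh.
awk 'BEGIN {
  kwh_month = 0.09 * 720          # 720 hours in a 30-day month
  printf "~€%.2f/month at idle\n", kwh_month * 0.25
}'
```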

What I overpaid for: RAM. €538 for 64 GB was significantly above the pre-shortage price, and 64 GB is more than this workload strictly needs. Buy RAM when prices are down; 32 GB would work fine.

Would I do it again? Yes. The total cost (~€2,138) is comparable to about a year of moderate API usage, and the machine will keep running for years. The privacy and availability benefits are real. And the performance, after tuning, is genuinely impressive for hardware that sits on a desk at home.

What is next

The immediate next step is spending more time with qwen3-coder for agentic coding tasks — using it with tools that allow the model to read files, run commands, and iterate on code rather than just answering questions. I am still working out where it fits relative to the 2.5 generation.

Longer term, I am curious about multi-model setups — routing different types of queries to different models based on the task — and about whether a second GPU would be worth adding to run 70B models fully in VRAM. For now, the 32B configuration is doing everything I need.

If you are considering building something similar, the short version is: buy a secondhand RTX 3090, install Ubuntu and Ollama, tune the context window, and prepare to be surprised by what a desktop machine can now do.