Hardware for running local AI

Hardware 7 min read Updated Jun 13, 2026

Because Tholos AI runs models on your own machine, your hardware sets the ceiling on which models you can run and how fast they respond. The good news: it runs on ordinary laptops and desktops — a GPU is an enhancement, not a requirement. This guide covers what Tholos AI needs, and how to think about hardware for running local LLMs more generally.

What Tholos AI needs

The single most important factor is memory: the model has to fit. 16 GB of RAM is the minimum, and it’s also the sweet spot. A 16 GB machine, whether a Windows PC with integrated graphics or a modest GPU (e.g. an RTX 3060 6 GB) or an Apple Silicon Mac (e.g. M1 Pro), comfortably runs a Balanced 7–8B model, the right default for most work. The larger tiers need more memory.

Tier	Example model	Memory needed	Best for
Light	Llama 3.2 3B	16 GB RAM (lightest footprint)	Maximum speed: fast summaries, simple Q&A, extraction.
Balanced	Qwen 2.5 7B	16 GB RAM, or 4–6 GB VRAM	The default — strong instruction-following and multilingual work.
Power	Qwen 2.5 14B	24–32 GB+ RAM, or 12 GB VRAM	Highest-quality reasoning — contract review, complex analysis.
Workstation	gpt-oss 120B (MoE)	80 GB+ RAM and 24 GB+ VRAM, or 96 GB+ unified memory	Frontier-class reasoning for regulated-industry workloads (opt-in).

On first run, Tholos AI inspects your hardware and pre-selects a sensible tier; you can change it any time. See Choosing the right AI model.
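
To make the table concrete, here is a minimal sketch of how a tier could be picked from the memory thresholds above. It is not Tholos AI's actual detection logic: the `Machine` class, the `suggest_tier` function, and the `include_opt_in` flag are hypothetical names, and the sketch assumes the lower bound of each range in the table is the cut-off.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Machine:
    ram_gb: float          # system RAM, or total unified memory on Apple Silicon
    vram_gb: float = 0.0   # dedicated GPU memory; 0 for integrated graphics or CPU-only
    unified: bool = False  # True when CPU and GPU share one memory pool


def suggest_tier(m: Machine, include_opt_in: bool = False) -> Optional[str]:
    """Return the largest tier whose table thresholds the machine meets.

    Hypothetical helper: thresholds are copied from the tier table,
    using the lower bound of each range.
    """
    if m.ram_gb < 16:
        return None  # below the stated 16 GB minimum

    # Workstation is opt-in, so it is only considered when asked for.
    if include_opt_in:
        if m.unified and m.ram_gb >= 96:
            return "Workstation"
        if not m.unified and m.ram_gb >= 80 and m.vram_gb >= 24:
            return "Workstation"

    # Power: 24 GB+ RAM, or 12 GB VRAM.
    if m.ram_gb >= 24 or m.vram_gb >= 12:
        return "Power"

    # At 16 GB the table points to Balanced as the default.
    return "Balanced"


print(suggest_tier(Machine(ram_gb=16)))              # 16 GB laptop
print(suggest_tier(Machine(ram_gb=32, vram_gb=12)))  # desktop with a 12 GB GPU
```

The sketch never returns Light, because Light and Balanced share the same 16 GB floor; in this reading, Light is something you pick for speed, not something the memory check forces on you.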

GPU acceleration (optional)

A GPU speeds up generation substantially, but Tholos AI runs on CPU alone if you don’t have one. When a supported GPU is present, it is detected and used automatically. Support by platform:

Platform	GPU backend	Notes
Windows (NVIDIA / AMD / Intel)	Vulkan (language model) + DirectML (speech & search)	Auto-detected and vendor-agnostic. Keep your GPU drivers current.
macOS (Apple Silicon)	Metal	Auto-activated. Unified memory is a real advantage (see below).
macOS (Intel)	CPU only	No GPU acceleration.

Setup is essentially automatic — for details and troubleshooting, see Setting up GPU acceleration.

Understanding the hardware: RAM vs VRAM

Running an LLM is mostly a memory problem. Two numbers matter:

VRAM (GPU memory) determines how much of the model can live on the fast GPU. If the whole model fits in VRAM, it runs fastest.
System RAM holds whatever doesn’t fit in VRAM (and the entire model on a CPU-only machine). It’s slower than VRAM but lets you run larger models than your GPU alone could hold.
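
The VRAM/RAM split described above can be pictured as placing as many model layers as possible on the GPU and leaving the rest in system RAM. The sketch below is only an illustration under simplifying assumptions: `split_layers` is a hypothetical helper, all layers are treated as equally sized, and `reserve_gb` is an assumed allowance for context and buffers. It does not describe how Tholos AI itself places layers.

```python
def split_layers(model_gb: float, n_layers: int, free_vram_gb: float,
                 reserve_gb: float = 1.0) -> tuple[int, int]:
    """Return (layers_on_gpu, layers_in_ram), assuming equally sized layers.

    reserve_gb is VRAM held back for context and runtime buffers
    (an assumed figure, not a measured one).
    """
    per_layer_gb = model_gb / n_layers
    usable_gb = max(free_vram_gb - reserve_gb, 0.0)
    on_gpu = min(n_layers, int(usable_gb // per_layer_gb))
    return on_gpu, n_layers - on_gpu


# Illustrative numbers: a 9 GB model with 48 layers.
print(split_layers(9.0, 48, free_vram_gb=6.0))   # partial fit: some layers stay in RAM
print(split_layers(9.0, 48, free_vram_gb=12.0))  # whole model fits on the GPU
print(split_layers(9.0, 48, free_vram_gb=0.0))   # CPU-only: everything in RAM
```

The more layers that land in RAM, the more each generated token depends on the slower memory, which is why a model that fits entirely in VRAM runs fastest.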

A useful rule of thumb: a model’s memory footprint is close to its file size, plus working memory for the context (the KV cache) and runtime buffers, so a ~5 GB quantized 7B model wants roughly 6–8 GB free to run well. Quantization (see AI models in Tholos AI) is what makes big models fit. A fast SSD also helps, because models load from disk into memory.
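
The rule of thumb can be worked through as simple arithmetic. This is a rough sketch, not a measurement: the function names are hypothetical, the bits-per-weight figure is an illustrative value for a quantized model, and the 1–3 GB overhead is simply the gap between the ~5 GB file and the 6–8 GB free memory quoted above.

```python
def file_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights: parameters x bits, converted to GB."""
    return params_billions * bits_per_weight / 8


def memory_needed_gb(size_gb: float,
                     overhead_gb: tuple[float, float] = (1.0, 3.0)) -> tuple[float, float]:
    """Low/high estimate of free memory needed: weights plus working overhead.

    The default overhead range is an assumption taken from the article's
    own example (about 5 GB of weights wanting 6-8 GB free).
    """
    low, high = overhead_gb
    return size_gb + low, size_gb + high


size = file_size_gb(params_billions=7, bits_per_weight=5.5)  # illustrative quantization level
print(f"weights: ~{size:.1f} GB")
print("free memory wanted:", memory_needed_gb(5.0))
```

Halving the bits per weight roughly halves the weight size, which is the sense in which quantization makes big models fit.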

Hardware for running LLMs, by ambition

Laptop / everyday desktop

16 GB RAM — the minimum — runs Light and Balanced models well; an 8 GB+ discrete GPU makes them snappy. This covers the large majority of professional use.

Prosumer workstation

32–64 GB RAM with a 16–24 GB GPU (e.g. a high-end consumer card) comfortably runs Power-tier 14B models on the GPU and leaves headroom for documents and other apps. VRAM is the limiting factor for keeping a model fully on the GPU, so prioritize it.

Apple Silicon

Apple’s unified memory is shared between CPU and GPU, so a Mac with 32–128 GB can run models that would need an expensive multi-GPU rig on a PC — a cost-effective route to the Power tier and beyond. Use your total RAM as the budget when judging which tier fits.

Regulated-industry / frontier workstation

To run 100B-class Mixture-of-Experts models (the Workstation tier), you want 128 GB+ system RAM and 24–48 GB+ VRAM (a high-VRAM professional card or dual GPUs), or 128 GB+ unified memory on Apple Silicon. That leaves more headroom than the minimums listed in the tier table (80 GB+ RAM and 24 GB+ VRAM, or 96 GB+ unified memory). MoE models activate only a slice of their parameters per token, so they run faster than their total size suggests, but every parameter still has to fit in memory, which is why RAM capacity dominates here. The download alone is 60–100+ GB.
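
The gap between what a MoE model must hold in memory and what it computes per token can be shown with a short calculation. The figures below are illustrative round numbers (about 120B total parameters, about 5B active per token, roughly 4.25 bits per weight), not specifications quoted from the article, and `moe_summary` is a hypothetical helper; check the model card for real values.

```python
def moe_summary(total_params_b: float, active_params_b: float,
                bits_per_weight: float) -> tuple[float, float]:
    """Return (weights_gb, active_fraction) for a Mixture-of-Experts model.

    weights_gb depends on the TOTAL parameter count, because every expert
    must be resident in memory. active_fraction is the share of parameters
    actually used for each token, which is what drives speed.
    """
    weights_gb = total_params_b * bits_per_weight / 8
    active_fraction = active_params_b / total_params_b
    return weights_gb, active_fraction


weights_gb, active = moe_summary(total_params_b=120, active_params_b=5,
                                 bits_per_weight=4.25)
print(f"weights to hold in memory: ~{weights_gb:.0f} GB")
print(f"parameters used per token: ~{active:.0%}")
```

Under these assumptions the weights land in the same range as the download size mentioned above, while only a few percent of the parameters are used for any one token.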

You don’t have to get this exactly right up front. Start with the tier Tholos AI suggests for your machine; if answers feel shallow, step up a tier; if responses feel slow, step down. Adding RAM is usually the cheapest way to unlock a bigger model.
