AI models in Tholos AI & the open-weight landscape
Tholos AI runs open-weight AI models entirely on your own machine. Nothing is sent to a hosted API — the model lives on your disk and runs on your hardware. This article explains the kinds of models Tholos AI uses, the open-weight landscape they come from, and how to think about model choice. For step-by-step help picking one for your machine, see Choosing the right AI model.
Our approach
The catalog is deliberately small and curated, on a few principles:
- Quality over quantity. Every recommended model passes an internal benchmark of real workflow tasks — summarizing a contract, answering cited questions, spotting asymmetric clauses — not just public leaderboards.
- No vendor lock-in. Each tier has at least two alternatives from different model families, so nothing depends on a single vendor.
- Transparency. Real model names are shown in the app; Tholos AI never rebrands open-weight models. The value is the product, not a secret model.
- You can bring your own. Drop any GGUF language model into the models folder and the app detects and validates it. This is also the install path for air-gapped machines. See Bring your own GGUF model.
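To give a feel for what "detects and validates" can mean at the simplest level, here is a minimal sketch of a header check. It relies on one fact about the format: a GGUF file starts with the four bytes `GGUF`, followed by a little-endian 32-bit version number. The function names and the folder scan are hypothetical, and this is not a description of the checks Tholos AI actually performs.

```python
import struct
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # the first four bytes of a GGUF file


def read_gguf_version(path):
    """Return the GGUF format version, or None if the header is not GGUF."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        return None
    # The version is an unsigned 32-bit little-endian integer after the magic.
    (version,) = struct.unpack("<I", header[4:8])
    return version


def find_gguf_models(models_dir):
    """Hypothetical folder scan: list (filename, version) for files that pass the check."""
    found = []
    for candidate in sorted(Path(models_dir).glob("*.gguf")):
        version = read_gguf_version(candidate)
        if version is not None:
            found.append((candidate.name, version))
    return found
```

A header check like this only rules out files that are obviously not GGUF, such as a renamed or truncated download. It says nothing about whether the model is complete or will fit in memory.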
The four kinds of model Tholos AI uses
| Role | What it does |
|---|---|
| Language model (LLM) | The engine behind every text workflow — summarization, Q&A, writing, redaction, extraction, contract review, infographic data. These are GGUF models, the same format you can bring yourself. |
| Embeddings | Converts text into vectors for document search and the Knowledge Base. Small (~30–100 MB) and bundled with the installer. |
| Speech-to-text | Transcription and dictation — covering 99 languages, with a fast option tuned for major European languages. |
| OCR | Reads text from scanned PDFs and images so other workflows can use them. |
In most cases the only model you choose is the language model. The specialist models are handled for you: they are either bundled or offered as small downloads when a workflow needs them.
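The embeddings row is the least intuitive of the four, so here is a small sketch of how vectors can drive document search. It assumes cosine similarity as the comparison, which is a common choice but not something this article confirms for Tholos AI. The three-number vectors and passage labels are made up for illustration; real embedding vectors have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; values near 1.0 mean very similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_passages(query_vec, passage_vecs):
    """Return passage labels ordered from most to least similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), label)
              for label, vec in passage_vecs.items()]
    scored.sort(reverse=True)
    return [label for _, label in scored]


# Toy vectors (hypothetical): each passage has already been embedded.
passages = {
    "termination clause": [0.9, 0.1, 0.0],
    "payment terms": [0.1, 0.9, 0.2],
    "governing law": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # stands in for the embedded question
ranking = rank_passages(query, passages)
```

With these toy values the termination clause ranks first, because its vector points in nearly the same direction as the query.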
The open-weight model landscape
“Open-weight” means the trained model parameters are published, so anyone can download and run the model locally — the foundation of private, on-device AI. A handful of families dominate professional use, and Tholos AI draws from them:
- Llama (Meta) — broad, well-supported general-purpose models; the small Llama 3.2 3B is a good lightweight default.
- Qwen (Alibaba) — strong instruction-following and multilingual quality across sizes; Qwen 2.5 at 7B and 14B anchors the Balanced and Power tiers.
- Mistral — efficient, capable models that punch above their size; Mistral 7B is a solid Balanced alternative.
- Phi (Microsoft) — small models tuned for reasoning quality per parameter; useful on modest hardware.
- Gemma (Google) — another widely used small-to-mid open-weight family.
- Frontier open-weight / Mixture-of-Experts (MoE) — very large models such as gpt-oss 120B and DeepSeek V4 Flash that activate only a fraction of their parameters per token, giving frontier-class reasoning on workstation hardware. These power the opt-in Workstation tier.
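The last item, models that "activate only a fraction of their parameters per token", can be illustrated with a simplified top-k router. This is a generic sketch of Mixture-of-Experts routing, not the routing code of any model named above. The expert count, the scores, and the function names are hypothetical. Real MoE models also contain shared layers that run for every token, so the true active fraction is higher than the expert-only figure computed here.

```python
import math


def softmax(scores):
    """Turn raw router scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_scores, top_k):
    """Pick the top_k experts for one token and renormalize their weights."""
    probs = softmax(router_scores)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    total = sum(probs[i] for i in chosen)
    return {i: probs[i] / total for i in chosen}


def active_expert_fraction(num_experts, top_k):
    """Share of expert parameters used per token, ignoring shared layers."""
    return top_k / num_experts


# One token, four hypothetical experts, two of them activated.
weights = route_token([0.1, 2.0, -1.0, 1.5], top_k=2)
```

Only the selected experts run for that token, which is why a model with a very large total parameter count can still produce each token with a much smaller amount of computation.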
Why models come in different sizes
A model’s parameter count (3B, 7B, 14B, 120B…) roughly tracks how capable, and how demanding, it is. To make large models practical on normal hardware, they’re quantized: the weights are stored at lower precision, which shrinks the file and memory footprint, usually with only a small loss in quality. Tholos AI prefers Q4_K_M as the best size/quality tradeoff, steps up to Q5_K_M for the Power tier, and uses aggressive quantization for the giant MoE Workstation models (which tolerate it well). You never have to pick a quantization by hand: the catalog presents models as simple tiers (Light, Balanced, Power, Workstation) with plain quality labels.
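The effect of quantization on file size is simple arithmetic: parameters times bits per weight, divided by eight to get bytes. The sketch below works through it. The bits-per-weight figures for Q4_K_M and Q5_K_M are rough approximations assumed for illustration, since these formats mix precisions internally, and the function name is hypothetical.

```python
# Approximate bits stored per weight. The K-quant values are rough assumptions.
BITS_PER_WEIGHT = {
    "F16": 16.0,    # unquantized half precision
    "Q5_K_M": 5.7,  # assumed approximate average
    "Q4_K_M": 4.8,  # assumed approximate average
}


def estimate_size_gb(params_billions, quant):
    """Estimate model file size in gigabytes (1 GB = 1e9 bytes)."""
    total_bits = params_billions * 1e9 * BITS_PER_WEIGHT[quant]
    return total_bits / 8 / 1e9


for quant in ("F16", "Q5_K_M", "Q4_K_M"):
    print(f"7B at {quant}: about {estimate_size_gb(7, quant):.1f} GB")
```

Under these assumptions a 7B model comes to about 14 GB at 16-bit precision and about 4.2 GB at Q4_K_M. Memory use while the model is running is higher than the file size, because the context the model is working on also needs memory.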
How this maps to your machine
Bigger models need more memory and run slower; smaller models are fast but less capable. Which tier suits you depends on your hardware — see Hardware for running local AI for the memory each tier needs, and Choosing the right AI model for how to pick and switch tiers in the app.