AI models in Tholos AI & the open-weight landscape

Models 7 min read Updated Jun 13, 2026

Tholos AI runs open-weight AI models entirely on your own machine. Nothing is sent to a hosted API — the model lives on your disk and runs on your hardware. This article explains the kinds of models Tholos AI uses, the open-weight landscape they come from, and how to think about model choice. For step-by-step help picking one for your machine, see Choosing the right AI model.

Our approach

The catalog is deliberately small and curated according to a few principles:

Quality over quantity. Every recommended model passes an internal benchmark of real workflow tasks — summarizing a contract, answering cited questions, spotting asymmetric clauses — not just public leaderboards.
No vendor lock-in. Each tier has at least two alternatives from different model families, so nothing depends on a single vendor.
Transparency. The app shows real model names; Tholos AI never rebrands open-weight models. The value is the product, not a secret model.
You can bring your own. Drop any GGUF language model into the models folder and the app detects and validates it — also the install path for air-gapped machines. See Bring your own GGUF model.
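As a rough illustration of what a first-pass check on a dropped-in file could look like, the sketch below reads the GGUF header: GGUF files begin with the four ASCII bytes `GGUF`, followed by a little-endian 32-bit format version. This is not Tholos AI's actual validation, which is not documented here and presumably checks much more; the function name `read_gguf_version` is hypothetical.

```python
import struct

GGUF_MAGIC = b"GGUF"  # the first four bytes of a GGUF file


def read_gguf_version(path):
    """Return the GGUF format version, or None if the file is not GGUF.

    Hypothetical helper for illustration only; it inspects the first
    eight bytes and nothing else.
    """
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        return None
    # The version is an unsigned 32-bit little-endian integer after the magic.
    (version,) = struct.unpack("<I", header[4:8])
    return version
```

A file that passes this check is only known to carry a GGUF header; whether it is a language model the app can load is a separate question.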

The four kinds of model Tholos AI uses

Role	What it does
Language model (LLM)	The engine behind every text workflow — summarization, Q&A, writing, redaction, extraction, contract review, infographic data. These are `GGUF` models, the same format you can bring yourself.
Embeddings	Converts text into vectors for document search and the Knowledge Base. Small (~30–100 MB) and bundled with the installer.
Speech-to-text	Transcription and dictation — covering 99 languages, with a fast option tuned for major European languages.
OCR	Reads text from scanned PDFs and images so other workflows can use them.

In most cases the language model is the only one you choose; the specialist models are handled for you (bundled, or offered as small downloads when a workflow needs them).
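To make the embeddings row more concrete: once text has been converted to vectors, document search typically comes down to comparing a query vector against passage vectors. The sketch below ranks passages by cosine similarity, a common choice for this. The vectors are made-up three-dimensional toys, the function names are hypothetical, and nothing here is taken from Tholos AI's implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as matching nothing
    return dot / (norm_a * norm_b)


def rank_passages(query_vec, passage_vecs):
    """Return passage indices ordered from most to least similar."""
    scores = [cosine_similarity(query_vec, v) for v in passage_vecs]
    return sorted(range(len(passage_vecs)), key=lambda i: scores[i], reverse=True)


# Toy vectors standing in for real embeddings (assumed values).
query = [1.0, 0.0, 0.0]
passages = [
    [0.0, 1.0, 0.0],  # unrelated direction
    [1.0, 0.1, 0.0],  # nearly the same direction as the query
    [0.5, 0.5, 0.0],  # partly related
]
ranking = rank_passages(query, passages)
```

Real embedding vectors have hundreds of dimensions, but the comparison works the same way.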

The open-weight model landscape

“Open-weight” means the trained model parameters are published, so anyone can download and run the model locally — the foundation of private, on-device AI. A handful of families dominate professional use, and Tholos AI draws from them:

Llama (Meta) — broad, well-supported general-purpose models; the small Llama 3.2 3B is a good lightweight default.
Qwen (Alibaba) — strong instruction-following and multilingual quality across sizes; Qwen 2.5 at 7B and 14B anchors the Balanced and Power tiers.
Mistral — efficient, capable models that punch above their size; Mistral 7B is a solid Balanced alternative.
Phi (Microsoft) — small models tuned for reasoning quality per parameter; useful on modest hardware.
Gemma (Google) — another widely used small-to-mid open-weight family.
Frontier open-weight / Mixture-of-Experts (MoE) — very large models such as gpt-oss 120B and DeepSeek V4 Flash that activate only a fraction of their parameters per token, giving frontier-class reasoning on workstation hardware. These power the opt-in Workstation tier.
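The idea that a Mixture-of-Experts model activates only a fraction of its parameters per token can be sketched as top-k routing: a small router scores every expert for the current token, and only the highest-scoring few run. The code below is a simplified illustration under assumed values (four experts, two active); real MoE models differ in details such as shared experts and how weights are normalized, and none of the names come from a specific model.

```python
import math


def softmax(scores):
    """Convert raw router scores into probabilities."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_scores, top_k=2):
    """Pick the top_k experts for one token and renormalize their weights.

    Returns a dict mapping expert index to mixing weight; experts that
    are not in the dict do no work for this token.
    """
    probs = softmax(router_scores)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    total = sum(probs[i] for i in chosen)
    return {i: probs[i] / total for i in chosen}


def active_expert_fraction(num_experts, top_k):
    """Share of experts that run for each token."""
    return top_k / num_experts


# Assumed router scores for one token across four experts.
weights = route_token([0.1, 2.0, -1.0, 1.5], top_k=2)
```

Because only the chosen experts compute, the cost per token follows the active parameters rather than the full parameter count, although the whole model still has to be stored.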

Open-weight is not one single licence. Some families ship under permissive licences (Apache 2.0, MIT); others use a community licence with conditions on commercial use or scale. If you plan to redistribute outputs or embed a model in a product, check the specific model’s licence.

Why models come in different sizes

A model’s parameter count (3B, 7B, 14B, 120B…) roughly tracks how capable, and how demanding, it is. To make large models practical on normal hardware, they’re quantized: the weights are stored at lower precision, which shrinks the file and memory footprint, usually with only a small loss in quality. Tholos AI prefers Q4_K_M as the best size/quality tradeoff, steps up to Q5_K_M for the Power tier, and uses more aggressive quantization for the giant MoE Workstation models (which tolerate it well). You never have to pick a quantization by hand: the catalog presents models as simple tiers (Light, Balanced, Power, Workstation) with plain quality labels.
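The effect of quantization on size is simple arithmetic: weight storage is roughly the parameter count times the average bits per weight, divided by eight. The sketch below applies that formula. The bits-per-weight figures are approximate, assumed values for illustration (actual averages vary by model), and the function name is hypothetical.

```python
# Approximate average bits per weight (assumed, illustrative values).
APPROX_BITS_PER_WEIGHT = {
    "F16": 16.0,    # unquantized half precision
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.8,
}


def estimate_weights_gb(params_billions, quant):
    """Rough size of the model weights in decimal gigabytes."""
    bits = APPROX_BITS_PER_WEIGHT[quant]
    total_bytes = params_billions * 1e9 * bits / 8
    return total_bytes / 1e9


# Compare one 7B model at three precisions.
for quant in ("F16", "Q5_K_M", "Q4_K_M"):
    print(f"7B at {quant}: about {estimate_weights_gb(7, quant):.1f} GB")
```

Under these assumptions a 7B model drops from about 14 GB at half precision to a little over 4 GB at Q4_K_M. Running a model needs some memory beyond the weights themselves, so treat the result as a lower bound.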

How this maps to your machine

Bigger models need more memory and run slower; smaller models are fast but less capable. Which tier suits you depends on your hardware — see Hardware for running local AI for the memory each tier needs, and Choosing the right AI model for how to pick and switch tiers in the app.
