Bring your own GGUF model

Tholos AI ships a small, curated catalog of tested models, but you’re not limited to it. To run a specific open-weight model (a newer release, a domain-tuned variant, or simply one you prefer), you can drop it in as a GGUF file. It runs locally with the same privacy guarantees as everything else, and this is also how you install models on an air-gapped machine.

What you need

The model has to be something the built-in engine (llama.cpp) can load:

  • GGUF format. Tholos AI’s language engine runs GGUF files. (PyTorch, Safetensors, or GGML files won’t load — look for a GGUF build.)
  • An instruct / chat model. Pick the Instruct or Chat variant, not a raw “base” completion model — the workflows rely on a system/user/assistant chat format.
  • A quantization that fits your hardware. Q4_K_M is a good all-round choice; go higher (Q5/Q6/Q8) for more quality if you have the memory, or lower (Q3/Q2) to squeeze a bigger model into less memory at some cost in quality. See Hardware for running local AI for what your machine can hold.

Where to find GGUF models

Most open-weight models have community GGUF builds on Hugging Face — the same place Tholos AI’s own catalog pulls from (repackagers such as bartowski and unsloth are reliable starting points). Download the single .gguf file for the quantization you want.

How to add it

  1. Download the .gguf file.
  2. Put it in your models folder:
    • Windows: C:\ProgramData\TholosAi\models
    • macOS: ~/Documents/OfflineTranscriber/models
  3. Open the Models view (restart Tholos AI if the model doesn’t appear right away). The app detects the file, validates that it’s a real GGUF, and lists it alongside the catalog models.
  4. Select it and run a workflow as usual.

Air-gapped install: the steps are identical. Copy the .gguf onto the machine by USB or network share and place it in the models folder. No internet is ever required to add or run a model this way.
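
The copy step above can also be scripted, which may be handy when preparing several machines. The following is a minimal sketch, not part of Tholos AI: the function names default_models_dir and install_model are hypothetical, and the folder paths are simply the ones listed in step 2.

```python
import shutil
import sys
from pathlib import Path

# Folder locations as listed in step 2 of the instructions.
WINDOWS_MODELS_DIR = Path(r"C:\ProgramData\TholosAi\models")
MACOS_MODELS_DIR = Path.home() / "Documents" / "OfflineTranscriber" / "models"


def default_models_dir(platform: str = sys.platform) -> Path:
    """Return the documented models folder for a sys.platform-style string."""
    if platform.startswith("win"):
        return WINDOWS_MODELS_DIR
    if platform == "darwin":
        return MACOS_MODELS_DIR
    raise RuntimeError(f"No models folder documented for platform {platform!r}")


def install_model(source: Path, models_dir: Path) -> Path:
    """Copy a .gguf file into models_dir and return the path of the copy."""
    if source.suffix.lower() != ".gguf":
        raise ValueError(f"{source.name} does not have a .gguf extension")
    models_dir.mkdir(parents=True, exist_ok=True)
    target = models_dir / source.name
    shutil.copy2(source, target)  # copies file contents and timestamps
    return target
```

A call such as install_model(Path("my-model.gguf"), default_models_dir()) would place the file where the Models view looks for it; the file name here is only an example.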

Tips

  • Match the model to your memory. A model’s file size is a rough lower bound on its memory footprint, and the context window adds more on top. If the file is larger than your RAM (or VRAM), it will be slow or fail to load; step down a quantization or a model size.
  • Prefer a context window of 8192 tokens or more for document-heavy work if your hardware allows; a larger context uses more memory.
  • Test it on a real task. Unlike the catalog models (which pass an internal workflow benchmark), a model you bring is unvetted — check it on a representative document before relying on it for, say, contract review.
  • Trust your source. Download GGUFs from reputable repackagers, and prefer ones that publish checksums.
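
Two of these tips, checking size against memory and verifying a published checksum, can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are hypothetical, the 0.8 headroom factor is an assumed safety margin rather than a Tholos AI figure, and the caller is expected to supply the available memory in bytes and the SHA-256 value published by the repackager.

```python
import hashlib
from pathlib import Path


def fits_in_memory(model_path: Path, memory_bytes: int, headroom: float = 0.8) -> bool:
    """Rule of thumb: the file should fit within a fraction of available memory.

    headroom is an assumed margin that leaves room for context and the OS.
    """
    return model_path.stat().st_size <= memory_bytes * headroom


def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash the file in chunks so multi-gigabyte models are not read at once."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: Path, expected_hex: str) -> bool:
    """Compare the file's SHA-256 with a published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, fits_in_memory(path, 16 * 1024**3) asks whether the file fits within 80% of 16 GiB. A failed checksum usually means a corrupted or altered download, so fetch the file again before using it.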

Your installed models count against your edition’s model slots (Standard 2, Professional 6, Business unlimited). A model you add yourself counts the same as a catalog one. See Which edition is right for you?

If it won’t load

  • Confirm the file really is a .gguf (not a renamed Safetensors/PyTorch file, and not a multi-part download you forgot to merge).
  • A brand-new model architecture may not yet be supported by the bundled engine — try a well-established family (Llama, Qwen, Mistral, Phi, Gemma).
  • If loading is extremely slow or crashes, the model is likely too large for your memory — try a smaller quantization. See Hardware for running local AI.
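
The first check above can be done without opening the app. A GGUF file begins with the four ASCII bytes "GGUF" followed by a 32-bit version number, so a renamed file of another format is easy to spot. The sketch below is an independent illustration, not the validation Tholos AI performs: inspect_gguf is a hypothetical name, the version is read as little-endian, and the split-file name pattern (such as "model-00001-of-00003.gguf") is an assumption about how multi-part downloads are commonly named.

```python
import re
import struct
from pathlib import Path

GGUF_MAGIC = b"GGUF"
# Assumed naming pattern for multi-part files, e.g. "model-00001-of-00003.gguf".
SPLIT_NAME = re.compile(r"-(\d{5})-of-(\d{5})\.gguf$", re.IGNORECASE)


def inspect_gguf(path: Path) -> dict:
    """Read the first 8 bytes and report whether the file looks like GGUF."""
    with path.open("rb") as f:
        header = f.read(8)
    is_gguf = len(header) == 8 and header[:4] == GGUF_MAGIC
    # The version field follows the magic; little-endian is assumed here.
    version = struct.unpack("<I", header[4:8])[0] if is_gguf else None
    return {
        "is_gguf": is_gguf,
        "version": version,
        "looks_split": bool(SPLIT_NAME.search(path.name)),
    }
```

If is_gguf is False, the file is probably another format with a .gguf extension. If looks_split is True, the download may be one part of several and need the remaining parts.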
