Working with long documents: context windows explained

Cloud AI services advertise million-token context windows. Tholos AI runs on your machine — so its context window is sized by your hardware. This article explains what the context window actually is, why local AI can't simply match the cloud number, and how Tholos AI gets the most out of whatever machine you have.

What the context window is

The context window is the total amount of text a model can "hold in mind" at once — your conversation history, the document excerpts being discussed, and the answer being generated, all counted together in tokens. When a conversation outgrows it, earlier content has to fall away.
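As a rough illustration of "earlier content has to fall away", the sketch below trims the oldest messages until the rest fits next to a budget held back for the answer. The function names, the whitespace-based token count, and the sample numbers are all assumptions made for illustration; a real tokenizer counts differently, and this is not a description of Tholos AI's internal logic.

```python
from collections import deque


def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: one token per whitespace-separated word.
    return len(text.split())


def trim_history(messages: list[str], context_tokens: int,
                 reserved_for_answer: int) -> list[str]:
    """Drop the oldest messages until the rest fits alongside the answer budget."""
    budget = context_tokens - reserved_for_answer
    kept = deque(messages)
    used = sum(count_tokens(m) for m in kept)
    while kept and used > budget:
        # The earliest content falls away first.
        used -= count_tokens(kept.popleft())
    return list(kept)


history = ["one two three four", "five six seven", "eight nine"]
print(trim_history(history, context_tokens=10, reserved_for_answer=4))
```

With a 10-token window and 4 tokens held back for the answer, only the two most recent messages survive.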

Why local can't do a million tokens

Cloud providers run datacenter GPU fleets purpose-built for long context. On a local machine, the binding constraint is the model's working memory (the KV cache), which grows linearly with context size. For a typical 9B-class model, the cache takes roughly 0.66 MB per token:

Context size      Working memory needed
8K tokens         ~5 GB
128K tokens       ~85 GB
1M tokens         ~650 GB
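The linear scaling behind these figures can be reproduced with a few lines of arithmetic. This is a minimal sketch that takes the 0.66 MB-per-token figure at face value and assumes decimal units (1K = 1,000 tokens, 1 GB = 1,000 MB); with those assumptions the results land close to the rounded figures in the table.

```python
MB_PER_TOKEN = 0.66  # rough per-token figure quoted above for a 9B-class model


def kv_cache_gb(context_tokens: int, mb_per_token: float = MB_PER_TOKEN) -> float:
    """Working memory in GB, assuming 1 GB = 1000 MB."""
    return context_tokens * mb_per_token / 1000


for label, tokens in [("8K", 8_000), ("128K", 128_000), ("1M", 1_000_000)]:
    print(f"{label:>4} tokens: ~{kv_cache_gb(tokens):.0f} GB")
```

Doubling the context doubles the memory, which is why a window that is comfortable at 8K becomes impossible on consumer hardware long before it reaches a million tokens.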

That memory comes out of the same RAM or VRAM that holds the model itself. There's also a second ceiling: every model has a trained context length — pushing beyond it degrades answer quality. So the realistic goal is the largest context the model supports that also fits your memory, not a cloud-style headline number.

What Tholos AI does automatically

  • Hardware-aware sizing. At model load, Tholos AI measures your free RAM/VRAM, accounts for the model's own size plus a safety reserve, and sets the largest context window that fits — instead of using a fixed, one-size-fits-all number.
  • Respects the model's training limit. The context is never pushed past what the model was trained to handle.
  • Long-document summarization just works. When a document exceeds the context window, Summarization switches to chunked processing automatically — you'll see "summarizing chunk i / N" — and produces one coherent summary. You never split files by hand.
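The behaviors listed above can be pictured with a small sketch: subtract the model size and a safety reserve from free memory, cap the result at the trained limit, then split a long document into chunks that fit. Every name and number here (fit_context, plan_chunks, the 2 GB reserve, the 1,000-token reserved budget) is a hypothetical chosen for illustration, not Tholos AI's actual implementation. The article does not describe how per-chunk summaries are merged, so the sketch stops at planning the chunks.

```python
def fit_context(free_mb: float, model_mb: float, reserve_mb: float,
                mb_per_token: float, trained_limit: int) -> int:
    """Largest context (in tokens) that fits in memory, capped at the trained limit."""
    usable_mb = free_mb - model_mb - reserve_mb
    if usable_mb <= 0:
        return 0  # the model does not fit once the reserve is set aside
    fits = int(usable_mb / mb_per_token)
    return min(fits, trained_limit)


def plan_chunks(doc_tokens: int, context_tokens: int,
                reserved_tokens: int) -> list[tuple[int, int]]:
    """Split a document into (start, end) token ranges that each fit the window."""
    chunk_size = context_tokens - reserved_tokens
    if chunk_size <= 0:
        raise ValueError("context window too small for the reserved budget")
    return [(start, min(start + chunk_size, doc_tokens))
            for start in range(0, doc_tokens, chunk_size)]


# Hypothetical machine: 16 GB free, 6 GB model, 2 GB safety reserve.
context = fit_context(free_mb=16_000, model_mb=6_000, reserve_mb=2_000,
                      mb_per_token=0.66, trained_limit=32_768)
chunks = plan_chunks(doc_tokens=50_000, context_tokens=context, reserved_tokens=1_000)
for i, (start, end) in enumerate(chunks, start=1):
    print(f"summarizing chunk {i} / {len(chunks)}: tokens {start}-{end}")
```

On this hypothetical machine memory is the binding limit, so the window comes out well below the trained limit; with far more free memory the same function would return the trained limit instead.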

Choosing the right tool for the job

  • One long document, a handful of questions: Document Q&A — it retrieves only the relevant passages into the context, with citations, so even a 200-page filing works on modest hardware.
  • A corpus you query repeatedly: the Local Knowledge Base (Professional+). It indexes documents once, persists across sessions, and pulls only what each question needs — the right way to "load" thousands of pages.
  • Condensing, not querying: Summarization with its automatic chunking.
Rule of thumb: don't try to stuff everything into the chat. Retrieval (Q&A, Knowledge Base) scales to corpus sizes no context window ever will — and gives you citations on top.
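To see why retrieval scales where a context window cannot, consider this toy sketch: it scores each passage by the words it shares with the question and passes only the best matches on, together with a passage number that can serve as a citation. The scoring method, function names, and sample passages are made up for illustration; the article does not say how Tholos AI ranks passages, and production systems generally use more capable ranking than word overlap.

```python
import re


def tokenize(text: str) -> set[str]:
    # Lowercase word set; punctuation is ignored.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, passages: list[str],
             top_k: int = 2) -> list[tuple[int, str]]:
    """Return up to top_k passages sharing the most words with the question.

    Each result carries the passage index so the answer can cite its source.
    """
    q = tokenize(question)
    scored = [(len(q & tokenize(p)), i) for i, p in enumerate(passages)]
    scored = [(score, i) for score, i in scored if score > 0]
    # Highest overlap first; ties keep document order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(i, passages[i]) for _, i in scored[:top_k]]


# Made-up sample passages standing in for an indexed document.
passages = [
    "Revenue grew 12 percent year over year.",
    "The board approved a new share buyback.",
    "Operating costs rose because of higher energy prices.",
]
for idx, text in retrieve("Why did operating costs rise?", passages, top_k=1):
    print(f"[passage {idx}] {text}")
```

However large the document collection grows, only the top_k passages ever enter the context, so the memory cost per question stays roughly constant.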

If you need a bigger window

  • More RAM, or a GPU with more VRAM, directly buys you a larger context.
  • Closing memory-hungry applications before loading a model helps — the sizing is based on memory that's actually free.
  • Tier choice matters too: a smaller model leaves more memory for context on the same machine. See Choosing the right AI model for your hardware.
