Local API Server reference
The Local API Server exposes an OpenAI-compatible HTTP API on 127.0.0.1, so most existing OpenAI client libraries work once you point them at your local base URL and key. This page is the reference for its endpoints, parameters, and limits. To turn it on and get your key, see Set up the Local API Server.
Base URL & authentication
- Base URL: `http://127.0.0.1:8756/v1` (confirm the port in Settings → Local API Server).
- Auth: every request needs `Authorization: Bearer YOUR_API_KEY`.
Endpoints
| Method & path | Purpose |
|---|---|
| `GET /health` | Liveness check: confirms the server is up. |
| `GET /v1/models` | List the installed models you can target, in OpenAI list format. |
| `POST /v1/chat/completions` | Generate a chat completion (streaming and non-streaming). |
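The two GET endpoints can be exercised with nothing but the Python standard library. The following is a minimal sketch, not an official client: `build_request`, `model_ids`, and `list_models` are illustrative names, the body of the `/health` response is not specified on this page (so only the status code is checked), and `model_ids` assumes the usual OpenAI list shape, `{"object": "list", "data": [{"id": ...}]}`.

```python
import json
import urllib.request

HOST = "http://127.0.0.1:8756"  # confirm the port in Settings → Local API Server
API_KEY = "YOUR_API_KEY"


def build_request(path: str, api_key: str = API_KEY) -> urllib.request.Request:
    """Build an authenticated GET request for a server path such as /v1/models."""
    return urllib.request.Request(
        HOST + path,
        headers={"Authorization": f"Bearer {api_key}"},
        method="GET",
    )


def model_ids(payload: dict) -> list[str]:
    """Extract model ids from an OpenAI-style list response (assumed shape)."""
    return [item["id"] for item in payload.get("data", [])]


def list_models() -> list[str]:
    """Check /health, then return the ids reported by /v1/models."""
    # urlopen raises HTTPError for 4xx/5xx, so reaching the check means a 2xx/3xx reply.
    with urllib.request.urlopen(build_request("/health"), timeout=5) as resp:
        if resp.status != 200:
            raise RuntimeError(f"health check failed: HTTP {resp.status}")
    with urllib.request.urlopen(build_request("/v1/models"), timeout=5) as resp:
        return model_ids(json.load(resp))
```

With the server running, `print(list_models())` should show the ids you can pass as `model`. Note that `/health` sits at the server root rather than under `/v1`, as in the table above.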
Chat completions
Supported request fields:
| Field | Notes |
|---|---|
| `messages` | Required. Array of `{"role": "...", "content": "..."}` turns. |
| `model` | Optional. Omit to use the model currently loaded in the app. Name an installed model to switch to it (see model tiers); `GET /v1/models` lists valid ids. |
| `temperature`, `top_p` | Optional sampling controls. |
| `max_tokens` | Optional cap on the response length. |
| `stop` | Optional array of stop strings. |
| `stream` | Optional. `true` streams tokens as server-sent events (SSE). |
curl
curl http://127.0.0.1:8756/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Summarize: ..."}],
    "temperature": 0.7,
    "stream": false
  }'
Python (official OpenAI SDK)
from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:8756/v1",
    api_key="YOUR_API_KEY",
)
resp = client.chat.completions.create(
    model="",  # empty = use the model loaded in the app
    messages=[{"role": "user", "content": "Summarize: ..."}],
)
print(resp.choices[0].message.content)
JavaScript (official OpenAI SDK)
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://127.0.0.1:8756/v1",
  apiKey: "YOUR_API_KEY",
});
const resp = await client.chat.completions.create({
  model: "", // empty = use the model loaded in the app
  messages: [{ role: "user", content: "Summarize: ..." }],
});
console.log(resp.choices[0].message.content);
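The examples above are all non-streaming. Below is a hedged sketch of consuming `"stream": true` with the Python standard library. It assumes the server emits OpenAI-style SSE chunks: `data: {json}` lines carrying text under `choices[0].delta.content`, followed by a final `data: [DONE]`. That chunk format is inferred from the OpenAI-compatibility claim and is not spelled out on this page. `delta_text` and `stream_chat` are illustrative names.

```python
import json
import urllib.request
from typing import Iterator, Optional

BASE_URL = "http://127.0.0.1:8756/v1"
API_KEY = "YOUR_API_KEY"


def delta_text(line: str) -> Optional[str]:
    """Return the text carried by one SSE line, or None if it carries none.

    Assumes OpenAI-style chunks. Blank lines, SSE comments, and the
    [DONE] marker all yield None.
    """
    line = line.strip()
    if not line.startswith("data:"):
        return None
    data = line[len("data:"):].strip()
    if not data or data == "[DONE]":
        return None
    choices = json.loads(data).get("choices") or []
    if not choices:
        return None
    return (choices[0].get("delta") or {}).get("content")


def stream_chat(prompt: str) -> Iterator[str]:
    """POST a streaming chat completion and yield text pieces as they arrive."""
    body = json.dumps({
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,  # "model" is omitted: use the model loaded in the app
    }).encode("utf-8")
    req = urllib.request.Request(
        BASE_URL + "/chat/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        for raw in resp:  # the HTTP response is iterated line by line
            text = delta_text(raw.decode("utf-8"))
            if text:
                yield text
```

A typical use would be `for piece in stream_chat("Summarize: ..."): print(piece, end="", flush=True)`. The official OpenAI SDKs also accept a streaming flag and do this parsing for you; the manual version is shown only to make the wire format visible.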
Behavior & limits
- One request at a time. There’s a single on-device engine, so requests are queued. Each waits up to ~60 seconds for its turn; if it can’t start in time, you get a `503` (with a `Retry-After` hint).
- Queue cap. When too many requests are already waiting, new ones get `429` (“server busy”). Design your client to retry with backoff: this is a personal/automation endpoint, not a high-throughput service.
- Model loading. If nothing is loaded and you don’t name a model, you get a `503` asking you to load one. Naming an installed model triggers a switch before the request runs.
- Errors are returned in OpenAI’s error shape (`invalid_request_error` / `rate_limit_error` / `service_unavailable`) with standard HTTP status codes (`400`, `429`, `503`).
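One way to follow the retry-with-backoff advice is sketched below using the Python standard library. It is an illustration under stated assumptions, not a prescribed client: `retry_delay` and `send_with_retry` are hypothetical helpers, the 1-second base and 30-second cap are arbitrary, and the `Retry-After` value is assumed to be a number of seconds (anything else falls back to exponential backoff).

```python
import time
import urllib.error
import urllib.request
from typing import Optional

RETRYABLE = {429, 503}  # the two "try again later" statuses described above


def retry_delay(attempt: int, retry_after: Optional[str] = None,
                base: float = 1.0, cap: float = 30.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based).

    Prefers a numeric Retry-After value; otherwise uses capped
    exponential backoff (base, 2*base, 4*base, ...).
    """
    if retry_after is not None:
        try:
            return min(max(float(retry_after), 0.0), cap)
        except ValueError:
            pass  # e.g. an HTTP-date; fall back to backoff
    return min(base * (2 ** attempt), cap)


def send_with_retry(req: urllib.request.Request, max_attempts: int = 5) -> bytes:
    """Send `req`, retrying on 429/503, and return the response body."""
    for attempt in range(max_attempts):
        try:
            with urllib.request.urlopen(req) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            # Give up on non-retryable statuses or once attempts are exhausted.
            if err.code not in RETRYABLE or attempt == max_attempts - 1:
                raise
            time.sleep(retry_delay(attempt, err.headers.get("Retry-After")))
    raise RuntimeError("max_attempts must be at least 1")
```

Retrying will not help with the `503` that asks you to load a model; a more careful client would inspect the error body and stop early in that case.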
What it’s not
- Not remote-accessible. It binds to `127.0.0.1` by design, so it can’t serve other machines on your network; that’s intentional for privacy.
- Not a multi-user server. The single-engine queue means it’s built for your own scripts and tools, not for serving many concurrent users.
Like the rest of Tholos AI, the API runs entirely on your machine with no outbound path. Anything you send through it stays on your device.