Local API Server reference

Updated Jun 14, 2026

The Local API Server exposes an OpenAI-compatible HTTP API on 127.0.0.1, so most existing OpenAI client libraries work once you point them at your local base URL and API key. This is the reference for its endpoints, parameters, and limits. To turn it on and get your key, see Set up the Local API Server.

Base URL & authentication

Base URL: http://127.0.0.1:8756/v1 (confirm the port in Settings → Local API Server).
Auth: every request needs Authorization: Bearer YOUR_API_KEY.
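
If you are not using an OpenAI SDK, the header can be attached by hand. The following is a minimal sketch using only the Python standard library. The environment variable names `LOCAL_API_BASE_URL` and `LOCAL_API_KEY` and the helper `build_request` are hypothetical, chosen for illustration; the app does not define them.

```python
import os
import urllib.request

# Hypothetical environment variable names; the app does not define them.
BASE_URL = os.environ.get("LOCAL_API_BASE_URL", "http://127.0.0.1:8756/v1")
API_KEY = os.environ.get("LOCAL_API_KEY", "YOUR_API_KEY")


def build_request(path: str, api_key: str = API_KEY,
                  base_url: str = BASE_URL) -> urllib.request.Request:
    """Return a Request for base_url + path that carries the Bearer token."""
    url = base_url.rstrip("/") + "/" + path.lstrip("/")
    return urllib.request.Request(
        url, headers={"Authorization": f"Bearer {api_key}"}
    )


req = build_request("/models")
print(req.full_url)  # the URL the request would be sent to
```

Passing `req` to `urllib.request.urlopen` sends it; nothing is sent by the sketch itself.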

Endpoints

Method & path	Purpose
`GET /health`	Liveness check — confirms the server is up.
`GET /v1/models`	List the installed models you can target, in OpenAI list format.
`POST /v1/chat/completions`	Generate a chat completion (streaming and non-streaming).
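
A quick way to exercise the two `GET` endpoints is sketched below with the Python standard library. It rests on two assumptions: that `/health` is served at the server root (the table lists it without the `/v1` prefix), and that "OpenAI list format" means a JSON object whose `data` array holds entries with an `id`. The helper names `get`, `model_ids`, and `check_server` are illustrative only.

```python
import json
import urllib.request

ROOT = "http://127.0.0.1:8756"  # confirm the port in Settings
API_KEY = "YOUR_API_KEY"


def get(url: str, timeout: float = 5.0):
    """GET url with the Bearer token and return (status, body_bytes)."""
    req = urllib.request.Request(
        url, headers={"Authorization": f"Bearer {API_KEY}"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status, resp.read()


def model_ids(payload: dict) -> list[str]:
    """Extract ids from an OpenAI-style list: {"object": "list", "data": [...]}."""
    return [item["id"] for item in payload.get("data", [])]


def check_server() -> list[str]:
    """Confirm the server is up, then return the ids of the installed models."""
    # Assumption: /health sits at the server root, not under /v1.
    status, _ = get(f"{ROOT}/health")
    if status != 200:
        raise RuntimeError(f"unexpected health status: {status}")
    _, body = get(f"{ROOT}/v1/models")
    return model_ids(json.loads(body))
```

With the server running, `check_server()` should return the ids you can pass as `model` in a chat completion request.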

Chat completions

Supported request fields:

Field	Notes
`messages`	Required. Array of `{"role": "...", "content": "..."}` turns.
`model`	Optional. Omit to use the model currently loaded in the app. Name an installed model to switch to it (see model tiers); `GET /v1/models` lists valid ids.
`temperature`, `top_p`	Optional sampling controls.
`max_tokens`	Optional cap on the response length.
`stop`	Optional array of stop strings.
`stream`	Optional. `true` streams tokens as server-sent events (SSE).

curl

curl http://127.0.0.1:8756/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Summarize: ..."}],
    "temperature": 0.7,
    "stream": false
  }'

Python (official OpenAI SDK)

from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:8756/v1",
    api_key="YOUR_API_KEY",
)

resp = client.chat.completions.create(
    model="",  # empty = use the model loaded in the app
    messages=[{"role": "user", "content": "Summarize: ..."}],
)
print(resp.choices[0].message.content)

JavaScript (official OpenAI SDK)

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://127.0.0.1:8756/v1",
  apiKey: "YOUR_API_KEY",
});

const resp = await client.chat.completions.create({
  model: "", // empty = use the model loaded in the app
  messages: [{ role: "user", content: "Summarize: ..." }],
});
console.log(resp.choices[0].message.content);
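
Streaming (Python, standard library)

The examples above are non-streaming. With `"stream": true` the response arrives as server-sent events. The sketch below assumes the stream follows OpenAI's chunk format, meaning `data: {...}` lines that carry `choices[].delta.content` and a final `data: [DONE]` line; that is an inference from the server being OpenAI-compatible, not something this reference spells out. The function names are illustrative.

```python
import json
import urllib.request
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from an OpenAI-style SSE chat stream."""
    for raw in lines:
        line = raw.decode("utf-8").strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # assumed end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text


def stream_chat(prompt: str,
                base_url: str = "http://127.0.0.1:8756/v1",
                api_key: str = "YOUR_API_KEY") -> Iterator[str]:
    """POST a streaming chat completion and yield text as it arrives."""
    body = json.dumps({
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        # The response object iterates line by line as bytes.
        yield from iter_stream_text(resp)
```

A typical use is `for piece in stream_chat("Summarize: ..."): print(piece, end="", flush=True)`. Keeping the parser separate from the HTTP call makes it easy to test against recorded lines.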

Behavior & limits

One request at a time. There’s a single on-device engine, so requests are queued. Each waits up to ~60 seconds for its turn; if it can’t start in time you get a 503 (with a Retry-After hint).
Queue cap. When too many requests are already waiting, new ones get 429 (“server busy”). Design your client to retry with backoff — this is a personal/automation endpoint, not a high-throughput service.
Model loading. If nothing is loaded and you don’t name a model, you get a 503 asking you to load one. Naming an installed model triggers a switch before the request runs.
Errors are returned in OpenAI’s error shape (invalid_request_error / rate_limit_error / service_unavailable) with standard HTTP status codes (400, 429, 503).
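
One way to handle the 429 and 503 responses described above is to retry with backoff, as in this sketch. It prefers the `Retry-After` value when it is a plain number of seconds and falls back to capped exponential backoff otherwise. The attempt count, base delay, cap, and jitter are illustrative defaults, not values the server prescribes, and `retry_delay` and `post_with_retry` are hypothetical helper names.

```python
import json
import random
import time
import urllib.error
import urllib.request

RETRYABLE = {429, 503}  # queue full / could not start in time


def retry_delay(attempt: int, retry_after, base: float = 1.0,
                cap: float = 30.0) -> float:
    """Seconds to wait before the next try (attempt is 0-based)."""
    if retry_after is not None:
        try:
            return max(0.0, float(retry_after))
        except ValueError:
            pass  # not a number of seconds (e.g. an HTTP date); use backoff
    return min(cap, base * 2 ** attempt)


def post_with_retry(req, max_attempts: int = 5, sleep=time.sleep,
                    opener=urllib.request.urlopen):
    """Send req and return the parsed JSON body, retrying on 429 and 503."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            with opener(req) as resp:
                return json.loads(resp.read())
        except urllib.error.HTTPError as err:
            # Give up on non-retryable codes (such as 400) or the last attempt.
            if err.code not in RETRYABLE or attempt == max_attempts - 1:
                raise
            delay = retry_delay(attempt, err.headers.get("Retry-After"))
            sleep(delay + random.uniform(0, 0.25))  # small jitter
```

Keep the attempt limit low: a 503 returned because no model is loaded will not clear by waiting, so the final error should surface to the caller rather than loop indefinitely.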

What it’s not

Not remote-accessible. It binds to 127.0.0.1 by design — it can’t serve other machines on your network, and that’s intentional for privacy.
Not a multi-user server. The single-engine queue means it’s built for your own scripts and tools, not for serving many concurrent users.

Like the rest of Tholos AI, the API runs entirely on your machine with no outbound path. Anything you send through it stays on your device.
