POST https://api.anthropic.com/v1/messages — the single endpoint everything
runs through. Tools, structured outputs, thinking, and caching are all features
of this endpoint, not separate APIs.
| Header | Value |
|---|---|
x-api-key |
Your API key (sk-ant-...) |
anthropic-version |
2023-06-01 |
content-type |
application/json |
anthropic-beta |
Comma-separated beta IDs, only for beta features |
OAuth bearer tokens go on Authorization: Bearer <token> instead of
x-api-key (plus anthropic-beta: oauth-2025-04-20). Setting both
ANTHROPIC_API_KEY and ANTHROPIC_AUTH_TOKEN makes the SDK send both headers
and the API rejects the request.
| Param | Type | Required | Notes |
|---|---|---|---|
model |
string | yes | Exact alias ID, e.g. claude-opus-4-8 — no date suffixes |
max_tokens |
int | yes | Hard output cap. Default sensibly: ~16000 non-streaming, ~64000 streaming, ~256 classification |
messages |
array | yes | Alternating user/assistant turns; first must be user. Consecutive same-role messages are merged |
system |
string | block[] | no | System prompt. Block-list form required for cache_control |
tools |
array | no | Custom + server tool definitions (see tool-use.md) |
tool_choice |
object | no | auto (default) / any / tool / none |
thinking |
object | no | {"type": "adaptive"} on 4.6+; {"type": "enabled", "budget_tokens": N} legacy models only |
output_config |
object | no | {"effort": "...", "format": {...}, "task_budget": {...}} |
stop_sequences |
string[] | no | Custom stop strings |
stream |
bool | no | SSE streaming |
metadata |
object | no | {"user_id": "..."} — opaque end-user id for abuse detection |
temperature / top_p / top_k |
number | no | Removed on Fable 5 / Opus 4.8 / 4.7 (400). On other 4.x: at most one of temperature/top_p |
cache_control |
object | no | Top-level auto-caching: caches the last cacheable block |
container |
string | no | Reuse a code-execution container id |
mcp_servers |
array | no | Remote MCP connector (beta mcp-client-2025-11-20) |
content is either a plain string or an array of blocks:
{"role": "user", "content": [
{"type": "text", "text": "What's in this image?"},
{"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": "<b64>"}},
{"type": "image", "source": {"type": "url", "url": "https://example.com/img.png"}},
{"type": "document", "source": {"type": "base64", "media_type": "application/pdf", "data": "<b64>"}},
{"type": "tool_result", "tool_use_id": "toolu_...", "content": "..."}
]}
{
"id": "msg_01...",
"type": "message",
"role": "assistant",
"model": "claude-opus-4-8",
"content": [
{"type": "thinking", "thinking": "...", "signature": "..."},
{"type": "text", "text": "Hello!"},
{"type": "tool_use", "id": "toolu_01...", "name": "get_weather", "input": {"location": "Paris"}}
],
"stop_reason": "end_turn",
"stop_sequence": null,
"usage": {
"input_tokens": 1024,
"output_tokens": 256,
"cache_creation_input_tokens": 0,
"cache_read_input_tokens": 0
}
}
content is a list of typed blocks — never index content[0].text blindly
(a thinking block may come first). Filter by .type.
stop_reason |
Meaning | What to do |
|---|---|---|
end_turn |
Finished naturally | Done |
max_tokens |
Hit the max_tokens cap |
Raise the cap or stream; output may be truncated mid-thought |
stop_sequence |
Hit a custom stop string | stop_sequence field has which one |
tool_use |
Claude wants tool(s) executed | Execute each tool_use block, send tool_result(s), re-request |
pause_turn |
Server-side tool loop hit its iteration limit | Append the assistant turn and re-send unchanged — server resumes; do NOT add a "continue" user message |
refusal |
Safety refusal | Check stop_details (category: "cyber"/"bio"/"reasoning_extraction" (Fable 5)/null, explanation); don't retry same prompt |
model_context_window_exceeded |
Context window exhausted (distinct from max_tokens) | Compact, truncate, or split the conversation |
if response.stop_reason == "refusal" and response.stop_details:
print(response.stop_details.category, response.stop_details.explanation)
The API is stateless — send the full history every request:
messages = []
def chat(user_msg: str) -> str:
messages.append({"role": "user", "content": user_msg})
r = client.messages.create(model="claude-opus-4-8", max_tokens=16000, messages=messages)
# Append the FULL content list (preserves tool_use/thinking/compaction blocks)
messages.append({"role": "assistant", "content": r.content})
return next(b.text for b in r.content if b.type == "text")
For conversations that may exceed context: server-side compaction (beta
header compact-2026-01-12, context_management: {"edits": [{"type":
"compact_20260112"}]} on client.beta.messages.create). Critical: append
response.content back verbatim — compaction blocks must be preserved or
state is silently lost.
with client.messages.stream(model="claude-opus-4-8", max_tokens=64000,
messages=[...]) as stream:
for text in stream.text_stream:
print(text, end="", flush=True)
final = stream.get_final_message()
print(final.usage.output_tokens)
const stream = client.messages.stream({ model: "claude-opus-4-8", max_tokens: 64000, messages });
stream.on("text", (delta) => process.stdout.write(delta));
const final = await stream.finalMessage(); // never wrap .on() in new Promise()
message_start → message metadata (id, model, usage so far)
content_block_start → index + block type (text / thinking / tool_use)
content_block_delta → text_delta | thinking_delta | input_json_delta
content_block_stop → block finished
message_delta → stop_reason + final usage
message_stop → stream done
Tool inputs stream as input_json_delta (partial JSON strings) — accumulate
and parse at content_block_stop, or just use get_final_message() /
finalMessage() which assembles parsed blocks for you.
Why stream: non-streaming requests with large max_tokens exceed HTTP
timeouts (the Python SDK raises ValueError for non-streaming requests it
estimates will run >~10 min). Default to streaming for anything long.
| HTTP | error.type |
Retryable | Typical cause |
|---|---|---|---|
| 400 | invalid_request_error |
no | Bad params: removed sampling params, budget_tokens on 4.7+, prefill on 4.6+, role ordering |
| 401 | authentication_error |
no | Missing/invalid key; both key + token set |
| 403 | permission_error |
no | Key lacks model/feature access |
| 404 | not_found_error |
no | Bad model ID (date-suffix mistake) or endpoint |
| 413 | request_too_large |
no | Body over size limit — shrink images/history |
| 429 | rate_limit_error |
yes | RPM/ITPM/OTPM exceeded — honor retry-after |
| 500 | api_error |
yes | Transient server issue |
| 529 | overloaded_error |
yes | Capacity — backoff; consider another model |
Error envelope:
{"type": "error",
"error": {"type": "rate_limit_error", "message": "..."},
"request_id": "req_011CSH..."}
Log request_id (also response._request_id on SDK success objects) when
reporting issues to Anthropic.
import anthropic
try:
r = client.messages.create(...)
except anthropic.BadRequestError as e: # 400
raise # don't retry client errors
except anthropic.RateLimitError as e: # 429
wait = int(e.response.headers.get("retry-after", "60"))
except anthropic.APIStatusError as e: # catch-all with .status_code / .type
if e.status_code >= 500: ... # retryable
except anthropic.APIConnectionError:
... # network — retryable
try {
await client.messages.create({...});
} catch (err) {
if (err instanceof Anthropic.RateLimitError) { /* backoff */ }
else if (err instanceof Anthropic.APIError) { console.error(err.status, err.message); }
}
All subclasses expose .type (e.g. "overloaded_error") for finer
classification than the status code (e.g. billing_error vs
permission_error, both 403).
The official SDKs auto-retry connection errors, 408/409/429 and >=500 with
exponential backoff — default max_retries=2. Configure per client
(anthropic.Anthropic(max_retries=5)) or per call
(client.with_options(max_retries=5, timeout=20.0).messages.create(...)).
Only hand-roll retry logic when you need behavior beyond that (e.g. queue +
jitter across many workers):
import random, time
def call_with_retry(client, max_retries=5, base=1.0, cap=60.0, **kwargs):
last = None
for attempt in range(max_retries):
try:
return client.messages.create(**kwargs)
except anthropic.RateLimitError as e:
last = e
except anthropic.APIStatusError as e:
if e.status_code < 500:
raise # 4xx (except 429) is not retryable
last = e
time.sleep(min(base * 2 ** attempt + random.random(), cap))
raise last
Default request timeout is 10 minutes (timeout= on the client or
with_options). On timeout: anthropic.APITimeoutError, retried per
max_retries.
Limits are per-organization, per-model-class, measured three ways:
Tiers scale with cumulative spend (Tier 1-4, then custom/scale). Check live limits in Console or response headers:
| Header | Meaning |
|---|---|
retry-after |
Seconds to wait (on 429) |
anthropic-ratelimit-requests-limit / -remaining / -reset |
RPM state |
anthropic-ratelimit-input-tokens-* / -output-tokens-* |
ITPM / OTPM state |
Practical guidance:
retry-after, add jitter, cap concurrency.overloaded_error is capacity, not your quota — backoff and/or fail over
to a different model tier.POST /v1/messages/count_tokens — free, model-specific, counts a request
without running it:
n = client.messages.count_tokens(
model="claude-opus-4-8",
system=system, tools=tools,
messages=[{"role": "user", "content": text}],
).input_tokens
Never estimate with tiktoken (OpenAI tokenizer; 15-20% undercount on prose,
worse on code). Token counts differ between Claude models too — count
against the model you'll run.
{"type": "image", "source": {...}} blocks — base64, URL, or Files
API {"type": "file", "file_id": ...}. Opus 4.7+ supports high-res input
(up to 2576px long edge, pixel-accurate coordinates; up to ~3x image tokens).{"type": "document", "source": {...}} — base64, URL, plain text, or
file_id. Optional citations: {"enabled": true}.files-api-2025-04-14): upload once
(client.beta.files.upload(...)), reference by file_id across requests.
500 MB/file, 100 GB/org.