Messages API Reference

POST https://api.anthropic.com/v1/messages — the single endpoint everything runs through. Tools, structured outputs, thinking, and caching are all features of this endpoint, not separate APIs.

Required Headers

Header	Value
`x-api-key`	Your API key (`sk-ant-...`)
`anthropic-version`	`2023-06-01`
`content-type`	`application/json`
`anthropic-beta`	Comma-separated beta IDs, only for beta features

OAuth bearer tokens go on Authorization: Bearer <token> instead of x-api-key (plus anthropic-beta: oauth-2025-04-20). Setting both ANTHROPIC_API_KEY and ANTHROPIC_AUTH_TOKEN makes the SDK send both headers and the API rejects the request.
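The header rules above can be sketched as a small builder that accepts exactly one credential type. This is an illustrative helper, not part of any SDK: `build_headers`, `ANTHROPIC_VERSION`, and `OAUTH_BETA` are hypothetical names, and the header values are taken from the table and paragraph above.

```python
from typing import Optional

ANTHROPIC_VERSION = "2023-06-01"
OAUTH_BETA = "oauth-2025-04-20"


def build_headers(api_key: Optional[str] = None,
                  auth_token: Optional[str] = None,
                  betas: tuple = ()) -> dict:
    """Build request headers for exactly one credential type (hypothetical helper)."""
    # Sending both credentials (or neither) is rejected locally instead of by the API.
    if bool(api_key) == bool(auth_token):
        raise ValueError("set exactly one of api_key or auth_token")
    headers = {
        "anthropic-version": ANTHROPIC_VERSION,
        "content-type": "application/json",
    }
    beta_ids = list(betas)
    if api_key:
        headers["x-api-key"] = api_key
    else:
        headers["Authorization"] = f"Bearer {auth_token}"
        # OAuth requests also need the OAuth beta ID.
        if OAUTH_BETA not in beta_ids:
            beta_ids.append(OAUTH_BETA)
    if beta_ids:
        # Beta IDs are sent as one comma-separated header value.
        headers["anthropic-beta"] = ",".join(beta_ids)
    return headers
```

Failing fast on a double credential surfaces the misconfiguration before any request is sent.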

Request Parameters

Param	Type	Required	Notes
`model`	string	yes	Exact alias ID, e.g. `claude-opus-4-8` — no date suffixes
`max_tokens`	int	yes	Hard output cap. Default sensibly: ~16000 non-streaming, ~64000 streaming, ~256 classification
`messages`	array	yes	Alternating `user`/`assistant` turns; first must be `user`. Consecutive same-role messages are merged
`system`	string | block[]	no	System prompt. Block-list form required for `cache_control`
`tools`	array	no	Custom + server tool definitions (see tool-use.md)
`tool_choice`	object	no	`auto` (default) / `any` / `tool` / `none`
`thinking`	object	no	`{"type": "adaptive"}` on 4.6+; `{"type": "enabled", "budget_tokens": N}` legacy models only
`output_config`	object	no	`{"effort": "...", "format": {...}, "task_budget": {...}}`
`stop_sequences`	string[]	no	Custom stop strings
`stream`	bool	no	SSE streaming
`metadata`	object	no	`{"user_id": "..."}` — opaque end-user id for abuse detection
`temperature` / `top_p` / `top_k`	number	no	Removed on Fable 5 / Opus 4.8 / 4.7 (400). On other 4.x: at most one of temperature/top_p
`cache_control`	object	no	Top-level auto-caching: caches the last cacheable block
`container`	string	no	Reuse a code-execution container id
`mcp_servers`	array	no	Remote MCP connector (beta `mcp-client-2025-11-20`)
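As a rough illustration of the three required parameters and the first-turn rule from the table, the sketch below serializes a request body after a few local checks. `build_request` is a hypothetical helper; the checks are a convenience for catching obvious mistakes early and do not replace the API's own validation.

```python
import json


def build_request(model: str, max_tokens: int, messages: list, **optional) -> str:
    """Serialize a Messages API request body after basic local checks (hypothetical helper)."""
    if not model:
        raise ValueError("model is required")
    if max_tokens <= 0:
        raise ValueError("max_tokens must be a positive integer")
    if not messages or messages[0].get("role") != "user":
        raise ValueError("messages must be non-empty and start with a user turn")
    for m in messages:
        if m.get("role") not in ("user", "assistant"):
            raise ValueError(f"unexpected role: {m.get('role')!r}")
    body = {"model": model, "max_tokens": max_tokens, "messages": messages}
    # Optional params (system, tools, stream, ...) pass through; None values are dropped.
    body.update({k: v for k, v in optional.items() if v is not None})
    return json.dumps(body)
```

For example, `build_request("claude-opus-4-8", 256, [{"role": "user", "content": "Hi"}], system="Be brief.")` yields a JSON string ready to POST.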

Message content blocks

content is either a plain string or an array of blocks:

{"role": "user", "content": [
  {"type": "text", "text": "What's in this image?"},
  {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": "<b64>"}},
  {"type": "image", "source": {"type": "url", "url": "https://example.com/img.png"}},
  {"type": "document", "source": {"type": "base64", "media_type": "application/pdf", "data": "<b64>"}},
  {"type": "tool_result", "tool_use_id": "toolu_...", "content": "..."}
]}
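A minimal sketch of assembling such a message from raw image bytes, assuming base64 sources as in the example above. `image_block` and `user_message` are hypothetical helper names.

```python
import base64


def image_block(data: bytes, media_type: str = "image/png") -> dict:
    """Wrap raw image bytes in a base64 image content block."""
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": media_type,
            # The API expects base64 text, not raw bytes.
            "data": base64.b64encode(data).decode("ascii"),
        },
    }


def user_message(text: str, *images: bytes) -> dict:
    """Build a user turn with one text block followed by any image blocks."""
    content = [{"type": "text", "text": text}]
    content += [image_block(img) for img in images]
    return {"role": "user", "content": content}
```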

Response Shape

{
  "id": "msg_01...",
  "type": "message",
  "role": "assistant",
  "model": "claude-opus-4-8",
  "content": [
    {"type": "thinking", "thinking": "...", "signature": "..."},
    {"type": "text", "text": "Hello!"},
    {"type": "tool_use", "id": "toolu_01...", "name": "get_weather", "input": {"location": "Paris"}}
  ],
  "stop_reason": "end_turn",
  "stop_sequence": null,
  "usage": {
    "input_tokens": 1024,
    "output_tokens": 256,
    "cache_creation_input_tokens": 0,
    "cache_read_input_tokens": 0
  }
}

content is a list of typed blocks — never index content[0].text blindly (a thinking block may come first). Filter by .type.
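Filtering by type might look like the sketch below. It works on the parsed JSON (plain dicts); with SDK objects the same idea applies using attribute access. `blocks_of_type` and `response_text` are hypothetical helper names.

```python
def blocks_of_type(content: list, block_type: str) -> list:
    """Return every content block whose type matches, in order."""
    return [b for b in content if b.get("type") == block_type]


def response_text(content: list) -> str:
    """Concatenate all text blocks; returns "" when the reply has none."""
    return "".join(b["text"] for b in blocks_of_type(content, "text"))
```

Applied to the response shown above, `response_text` returns `"Hello!"` and `blocks_of_type(content, "tool_use")` returns the single `get_weather` call, regardless of the leading thinking block.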

Stop Reasons

`stop_reason`	Meaning	What to do
`end_turn`	Finished naturally	Done
`max_tokens`	Hit the `max_tokens` cap	Raise the cap or stream; output may be truncated mid-thought
`stop_sequence`	Hit a custom stop string	`stop_sequence` field has which one
`tool_use`	Claude wants tool(s) executed	Execute each `tool_use` block, send `tool_result`(s), re-request
`pause_turn`	Server-side tool loop hit its iteration limit	Append the assistant turn and re-send unchanged — server resumes; do NOT add a "continue" user message
`refusal`	Safety refusal	Check `stop_details` (`category`: "cyber"/"bio"/"reasoning_extraction" (Fable 5)/null, `explanation`); don't retry same prompt
`model_context_window_exceeded`	Context window exhausted (distinct from max_tokens)	Compact, truncate, or split the conversation

if response.stop_reason == "refusal" and response.stop_details:
    print(response.stop_details.category, response.stop_details.explanation)
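The two stop reasons that call for a follow-up request (`tool_use` and `pause_turn`) can be handled by one function that computes the next message list. This is a sketch over parsed JSON dicts; `next_messages` and the `run_tool` callback are hypothetical, and the caller is assumed to deal with the remaining stop reasons.

```python
from typing import Callable, Optional


def next_messages(messages: list, response: dict,
                  run_tool: Callable[[str, dict], str]) -> Optional[list]:
    """Return the message list for the follow-up request, or None if none is needed."""
    reason = response["stop_reason"]
    assistant_turn = {"role": "assistant", "content": response["content"]}
    if reason == "tool_use":
        # One tool_result per tool_use block, matched by id.
        results = [
            {"type": "tool_result", "tool_use_id": b["id"],
             "content": run_tool(b["name"], b["input"])}
            for b in response["content"] if b["type"] == "tool_use"
        ]
        return messages + [assistant_turn, {"role": "user", "content": results}]
    if reason == "pause_turn":
        # Re-send with the assistant turn appended and no extra user message.
        return messages + [assistant_turn]
    return None  # end_turn, stop_sequence, max_tokens, refusal, ...: left to the caller
```

The function builds a new list rather than mutating `messages`, so a failed follow-up request leaves the stored history untouched.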

Multi-Turn Conversations

The API is stateless — send the full history every request:

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
messages = []

def chat(user_msg: str) -> str:
    messages.append({"role": "user", "content": user_msg})
    r = client.messages.create(model="claude-opus-4-8", max_tokens=16000, messages=messages)
    # Append the FULL content list (preserves tool_use/thinking/compaction blocks)
    messages.append({"role": "assistant", "content": r.content})
    # Join every text block; a reply may contain none (e.g. tool_use only)
    return "".join(b.text for b in r.content if b.type == "text")

For conversations that may exceed the context window, use server-side compaction: send the beta header compact-2026-01-12 and pass context_management: {"edits": [{"type": "compact_20260112"}]} to client.beta.messages.create. Critical: append response.content back verbatim. Compaction blocks must be preserved or state is silently lost.

Streaming

with client.messages.stream(model="claude-opus-4-8", max_tokens=64000,
                            messages=[...]) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
    final = stream.get_final_message()
print(final.usage.output_tokens)

const stream = client.messages.stream({ model: "claude-opus-4-8", max_tokens: 64000, messages });
stream.on("text", (delta) => process.stdout.write(delta));
const final = await stream.finalMessage();   // never wrap .on() in new Promise()

SSE event sequence

message_start          → message metadata (id, model, usage so far)
content_block_start    → index + block type (text / thinking / tool_use)
content_block_delta    → text_delta | thinking_delta | input_json_delta
content_block_stop     → block finished
message_delta          → stop_reason + final usage
message_stop           → stream done

Tool inputs stream as input_json_delta (partial JSON strings) — accumulate and parse at content_block_stop, or just use get_final_message() / finalMessage() which assembles parsed blocks for you.
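For readers handling raw events, the accumulate-then-parse approach could be sketched as follows. The event names come from the sequence above; the payload field names (`index`, `content_block`, `delta`, `partial_json`) are assumptions about the wire format that should be checked against the streaming reference, and `assemble_blocks` is a hypothetical helper.

```python
import json


def assemble_blocks(events: list) -> list:
    """Rebuild content blocks from a list of already-decoded SSE event dicts."""
    blocks = {}        # index -> block under construction
    partial_json = {}  # index -> accumulated tool-input JSON fragments
    for ev in events:
        kind = ev["type"]
        if kind == "content_block_start":
            blocks[ev["index"]] = dict(ev["content_block"])
            partial_json[ev["index"]] = ""
        elif kind == "content_block_delta":
            block, delta = blocks[ev["index"]], ev["delta"]
            if delta["type"] == "text_delta":
                block["text"] = block.get("text", "") + delta["text"]
            elif delta["type"] == "thinking_delta":
                block["thinking"] = block.get("thinking", "") + delta["thinking"]
            elif delta["type"] == "input_json_delta":
                partial_json[ev["index"]] += delta["partial_json"]
        elif kind == "content_block_stop":
            block = blocks[ev["index"]]
            if block["type"] == "tool_use":
                # Fragments are only guaranteed to be valid JSON once the block has finished.
                block["input"] = json.loads(partial_json[ev["index"]] or "{}")
    return [blocks[i] for i in sorted(blocks)]
```

Parsing only at `content_block_stop` avoids `json.JSONDecodeError` on half-received input; other event types (`message_start`, `message_delta`, `message_stop`) are ignored here.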

Why stream: non-streaming requests with large max_tokens can exceed HTTP timeouts (the Python SDK raises ValueError for non-streaming requests it estimates will run longer than about 10 minutes). Default to streaming for anything long.

Error Handling

HTTP	`error.type`	Retryable	Typical cause
400	`invalid_request_error`	no	Bad params: removed sampling params, `budget_tokens` on 4.7+, prefill on 4.6+, role ordering
401	`authentication_error`	no	Missing/invalid key; both key + token set
403	`permission_error`	no	Key lacks model/feature access
404	`not_found_error`	no	Bad model ID (date-suffix mistake) or endpoint
413	`request_too_large`	no	Body over size limit — shrink images/history
429	`rate_limit_error`	yes	RPM/ITPM/OTPM exceeded — honor `retry-after`
500	`api_error`	yes	Transient server issue
529	`overloaded_error`	yes	Capacity — backoff; consider another model

Error envelope:

{"type": "error",
 "error": {"type": "rate_limit_error", "message": "..."},
 "request_id": "req_011CSH..."}

Log request_id (also response._request_id on SDK success objects) when reporting issues to Anthropic.
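When working with raw HTTP responses rather than the SDK, the envelope can be parsed and classified along the lines of the table above. A minimal sketch: `classify_error` and `RETRYABLE_TYPES` are hypothetical names, and the retryable set simply mirrors the "Retryable" column.

```python
import json

# Error types marked retryable in the table above.
RETRYABLE_TYPES = {"rate_limit_error", "api_error", "overloaded_error"}


def classify_error(status: int, body: str) -> dict:
    """Extract the fields worth logging from an error envelope."""
    envelope = json.loads(body)
    err = envelope.get("error", {})
    return {
        "status": status,
        "type": err.get("type"),
        "message": err.get("message"),
        "request_id": envelope.get("request_id"),  # include when reporting issues
        "retryable": err.get("type") in RETRYABLE_TYPES,
    }
```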

Typed exceptions — never string-match messages

import anthropic

client = anthropic.Anthropic()
try:
    r = client.messages.create(...)
except anthropic.BadRequestError:            # 400
    raise                                    # don't retry client errors
except anthropic.RateLimitError as e:        # 429
    wait = float(e.response.headers.get("retry-after", "60"))  # seconds to sleep before retrying
except anthropic.APIStatusError as e:        # catch-all with .status_code / .type
    if e.status_code >= 500: ...             # retryable
except anthropic.APIConnectionError:
    ...                                      # network error: retryable

try {
  await client.messages.create({...});
} catch (err) {
  if (err instanceof Anthropic.RateLimitError) { /* backoff */ }
  else if (err instanceof Anthropic.APIError) { console.error(err.status, err.message); }
}

All subclasses expose .type (e.g. "overloaded_error") for finer classification than the status code (e.g. billing_error vs permission_error, both 403).

Retries

The official SDKs auto-retry connection errors, 408/409/429 and >=500 with exponential backoff — default max_retries=2. Configure per client (anthropic.Anthropic(max_retries=5)) or per call (client.with_options(max_retries=5, timeout=20.0).messages.create(...)). Only hand-roll retry logic when you need behavior beyond that (e.g. queue + jitter across many workers):

import random, time

import anthropic

def call_with_retry(client, max_retries=5, base=1.0, cap=60.0, **kwargs):
    last = None
    for attempt in range(max_retries):
        try:
            return client.messages.create(**kwargs)
        except (anthropic.RateLimitError, anthropic.APIConnectionError) as e:
            last = e           # 429 and network failures are retryable
        except anthropic.APIStatusError as e:
            if e.status_code < 500:
                raise          # 4xx (except 429) is not retryable
            last = e
        if attempt < max_retries - 1:   # no point sleeping after the final attempt
            time.sleep(min(base * 2 ** attempt + random.random(), cap))
    raise last

Default request timeout is 10 minutes (timeout= on the client or with_options). On timeout: anthropic.APITimeoutError, retried per max_retries.

Rate Limits

Limits are per-organization, per-model-class, measured three ways:

RPM — requests per minute
ITPM — input tokens per minute (cache reads often discounted/exempt — check headers)
OTPM — output tokens per minute

Tiers scale with cumulative spend (Tier 1-4, then custom/scale). Check live limits in Console or response headers:

Header	Meaning
`retry-after`	Seconds to wait (on 429)
`anthropic-ratelimit-requests-limit` / `-remaining` / `-reset`	RPM state
`anthropic-ratelimit-input-tokens-*` / `anthropic-ratelimit-output-tokens-*`	ITPM / OTPM state (same `-limit` / `-remaining` / `-reset` suffixes)
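One way to read these headers into a structure is sketched below. `rate_limit_state` is a hypothetical helper; it assumes `limit` and `remaining` values are integers and keeps `reset` as the raw string, since its exact format should be confirmed against the rate-limit documentation.

```python
def rate_limit_state(headers: dict) -> dict:
    """Group anthropic-ratelimit-* headers by bucket, e.g. {"requests": {"limit": 50}}."""
    prefix = "anthropic-ratelimit-"
    state = {}
    for name, value in headers.items():
        name = name.lower()  # HTTP header names are case-insensitive
        if not name.startswith(prefix):
            continue
        # "input-tokens-remaining" -> bucket "input-tokens", field "remaining"
        bucket, _, field = name[len(prefix):].rpartition("-")
        parsed = int(value) if field in ("limit", "remaining") else value
        state.setdefault(bucket, {})[field] = parsed
    return state
```

A worker pool could check `state["output-tokens"]["remaining"]` before dispatching the next long generation.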

Practical guidance:

Treat 429 as backpressure: honor retry-after, add jitter, cap concurrency.
Long-running agent fleets: budget OTPM, not just RPM — output is usually the binding constraint.
Batches API has separate, much higher throughput and doesn't draw from interactive rate limits — move bulk traffic there.
529 overloaded_error is capacity, not your quota — backoff and/or fail over to a different model tier.
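The "honor retry-after, add jitter" advice can be reduced to a single delay calculation. This is a sketch under assumptions: `backoff_delay` is a hypothetical helper, the `base` and `cap` defaults are illustrative, and the jitter source is injectable so the behavior can be tested deterministically.

```python
import random
from typing import Callable, Optional


def backoff_delay(attempt: int, retry_after: Optional[str] = None,
                  base: float = 1.0, cap: float = 60.0,
                  jitter: Callable[[], float] = random.random) -> float:
    """Seconds to sleep before retry number `attempt` (0-based)."""
    if retry_after is not None:
        try:
            # The server hint wins; jitter keeps a fleet from retrying in lockstep.
            return float(retry_after) + jitter()
        except ValueError:
            pass  # unparseable header: fall back to exponential backoff
    # Exponential backoff capped at `cap`, plus up to one second of jitter.
    return min(base * 2 ** attempt, cap) + jitter()
```

Capping concurrency is a separate concern; a bounded semaphore around the request call is one common way to do it.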

Token Counting

POST /v1/messages/count_tokens — free, model-specific, counts a request without running it:

n = client.messages.count_tokens(
    model="claude-opus-4-8",
    system=system, tools=tools,
    messages=[{"role": "user", "content": text}],
).input_tokens

Never estimate with tiktoken (OpenAI tokenizer; 15-20% undercount on prose, worse on code). Token counts differ between Claude models too — count against the model you'll run.

Vision & Documents

Images: {"type": "image", "source": {...}} blocks — base64, URL, or Files API {"type": "file", "file_id": ...}. Opus 4.7+ supports high-res input (up to 2576px long edge, pixel-accurate coordinates; up to ~3x image tokens).
PDFs: {"type": "document", "source": {...}} — base64, URL, plain text, or file_id. Optional citations: {"enabled": true}.
Files API (beta files-api-2025-04-14): upload once (client.beta.files.upload(...)), reference by file_id across requests. 500 MB/file, 100 GB/org.
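The document and Files API shapes described above might be built like this. Both helpers are hypothetical, and placing `citations` directly on the document block is an assumption based on the description above.

```python
import base64


def pdf_block(data: bytes, citations: bool = False) -> dict:
    """Wrap raw PDF bytes in a base64 document block, optionally enabling citations."""
    block = {
        "type": "document",
        "source": {
            "type": "base64",
            "media_type": "application/pdf",
            "data": base64.b64encode(data).decode("ascii"),
        },
    }
    if citations:
        block["citations"] = {"enabled": True}
    return block


def file_block(file_id: str, kind: str = "document") -> dict:
    """Reference an already-uploaded file by id; kind is "document" or "image"."""
    return {"type": kind, "source": {"type": "file", "file_id": file_id}}
```

Referencing an upload by `file_id` keeps large PDFs out of every request body, which also helps stay under the 413 size limit.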
