Practical Guide to Extended Thinking: How to Use Thinking Budgets in the Opus 4.7 Era
Most Chinese-language material about Extended Thinking still teaches
thinking: {type: "enabled", budget_tokens: 10000}. On Opus 4.7, that syntax fails outright. This article explains the correct usage for Opus 4.7, Opus 4.6, and Sonnet 4.6 based on actual API behavior as of 2026-05, with an effort selection table, a complete interleaved thinking loop for tool use, and the two prompt caching pitfalls developers most often hit.
1. The Thinking Interface Changed Once in Six Months
If your codebase still has a call like this:
response = client.messages.create(
model="claude-opus-4-7", # This line
max_tokens=16000,
thinking={"type": "enabled", "budget_tokens": 10000}, # This line
messages=[...]
)
response = client.messages.create(
model="claude-opus-4-7", # This line
max_tokens=16000,
thinking={"type": "enabled", "budget_tokens": 10000}, # This line
messages=[...]
)
After changing the model to Opus 4.7, the entire request will fail. The reason is that Anthropic changed the thinking interface in the 4.6/4.7 generation from “explicit budget” to “adaptive + effort”. Opus 4.7 no longer accepts the old syntax.
Change summary:
| Model | Old syntax (type: enabled + budget_tokens) |
New syntax (type: adaptive + effort) |
|---|---|---|
| Opus 4.7 | ❌ No longer accepted | ✅ The only supported approach |
| Opus 4.6 | ⚠️ Still works but deprecated | ✅ Recommended |
| Sonnet 4.6 | ⚠️ Still works but deprecated | ✅ Recommended |
| Sonnet 4.6 + interleaved thinking with tools | Requires the interleaved-thinking-2025-05-14 beta header |
Enabled automatically in adaptive mode |
This means that if you plan to integrate Opus 4.7 into an existing agent pipeline, the first step is to change the thinking configuration to adaptive instead of simply swapping the model ID.
2. What Is Adaptive Thinking?
The old design worked like this: you told the model, “Spend up to 10,000 tokens thinking.” The model might use the whole budget, or only part of it. The problem is:
- Simple questions can be forced to waste tokens on unnecessary thinking
- Complex questions may not get enough thinking if the budget is set too low

Adaptive thinking gives that decision to the model, while the effort parameter controls how aggressively it thinks:
| effort | Behavior | Best for |
|---|---|---|
low |
The model often skips thinking entirely for simple questions | Lightweight calls with low cost and low latency |
medium |
Balanced: thinks only when it decides thinking is needed | General production tasks |
high default |
Almost always thinks first | Reasoning tasks in production |
max Opus 4.6/4.7 only |
Maximum thinking intensity | AIME-level math, long-horizon agents, the hardest bugs |
Anthropic’s internal evaluations show that adaptive uses tokens more efficiently than fixed budgets across many tasks, because it does not waste reasoning on easy questions and does not stop too early on hard ones.
3. Complete Python Example
Environment setup:
pip install anthropic
pip install anthropic
3.1 Opus 4.7 Call Adaptive Only
import anthropic
client = anthropic.Anthropic(
api_key="sk-your-ClaudeAPI-key",
base_url="https://gw.claudeapi.com"
)
response = client.messages.create(
model="claude-opus-4-7",
max_tokens=16000,
thinking={"type": "adaptive"},
output_config={"effort": "high"},
messages=[
{"role": "user", "content": "Prove that there are infinitely many primes congruent to 3 modulo 4."}
]
)
for block in response.content:
if block.type == "thinking":
print(f"[Thinking] {block.thinking[:200]}...")
elif block.type == "text":
print(f"[Answer] {block.text}")
import anthropic
client = anthropic.Anthropic(
api_key="sk-your-ClaudeAPI-key",
base_url="https://gw.claudeapi.com"
)
response = client.messages.create(
model="claude-opus-4-7",
max_tokens=16000,
thinking={"type": "adaptive"},
output_config={"effort": "high"},
messages=[
{"role": "user", "content": "Prove that there are infinitely many primes congruent to 3 modulo 4."}
]
)
for block in response.content:
if block.type == "thinking":
print(f"[Thinking] {block.thinking[:200]}...")
elif block.type == "text":
print(f"[Answer] {block.text}")
The returned content is an ordered array of blocks: first a thinking block containing the model’s reasoning summary, then a text block containing the final answer.
3.2 The Same Syntax on Sonnet 4.6
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=16000,
thinking={"type": "adaptive"},
output_config={"effort": "medium"}, # Sonnet does not support max
messages=[{"role": "user", "content": "..."}]
)
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=16000,
thinking={"type": "adaptive"},
output_config={"effort": "medium"}, # Sonnet does not support max
messages=[{"role": "user", "content": "..."}]
)
Note: effort: max is not accepted on Sonnet 4.6. max is exclusive to the Opus family. Use medium for everyday production tasks, and only move to high for important reasoning tasks.
3.3 Opus 4.6 Supports Both Syntaxes, but You Should Use Adaptive
# ✅ Recommended
response = client.messages.create(
model="claude-opus-4-6",
max_tokens=16000,
thinking={"type": "adaptive"},
output_config={"effort": "high"},
messages=[...]
)
# ⚠️ Still works but deprecated. Migrate as soon as possible.
response = client.messages.create(
model="claude-opus-4-6",
max_tokens=16000,
thinking={"type": "enabled", "budget_tokens": 10000},
messages=[...]
)
# ✅ Recommended
response = client.messages.create(
model="claude-opus-4-6",
max_tokens=16000,
thinking={"type": "adaptive"},
output_config={"effort": "high"},
messages=[...]
)
# ⚠️ Still works but deprecated. Migrate as soon as possible.
response = client.messages.create(
model="claude-opus-4-6",
max_tokens=16000,
thinking={"type": "enabled", "budget_tokens": 10000},
messages=[...]
)
4. Node.js / TypeScript Version
import Anthropic from "@anthropic-ai/sdk";
const client = new Anthropic({
apiKey: "sk-your-ClaudeAPI-key",
baseURL: "https://gw.claudeapi.com",
});
const response = await client.messages.create({
model: "claude-opus-4-7",
max_tokens: 16000,
thinking: { type: "adaptive" },
output_config: { effort: "high" },
messages: [{ role: "user", content: "Your complex reasoning task" }],
});
for (const block of response.content) {
if (block.type === "thinking") {
console.log("[Thinking]", block.thinking);
} else if (block.type === "text") {
console.log("[Answer]", block.text);
}
}
import Anthropic from "@anthropic-ai/sdk";
const client = new Anthropic({
apiKey: "sk-your-ClaudeAPI-key",
baseURL: "https://gw.claudeapi.com",
});
const response = await client.messages.create({
model: "claude-opus-4-7",
max_tokens: 16000,
thinking: { type: "adaptive" },
output_config: { effort: "high" },
messages: [{ role: "user", content: "Your complex reasoning task" }],
});
for (const block of response.content) {
if (block.type === "thinking") {
console.log("[Thinking]", block.thinking);
} else if (block.type === "text") {
console.log("[Answer]", block.text);
}
}
5. Interleaved Thinking: The Key Change in Tool Loops
The real power of Extended Thinking in agent scenarios is that it lets the model think again after each tool call. This is interleaved thinking.
With the old syntax on Sonnet 4.6, you had to manually add a beta header:
client = anthropic.Anthropic(
api_key="sk-your-ClaudeAPI-key",
base_url="https://gw.claudeapi.com",
default_headers={"anthropic-beta": "interleaved-thinking-2025-05-14"}
)
client = anthropic.Anthropic(
api_key="sk-your-ClaudeAPI-key",
base_url="https://gw.claudeapi.com",
default_headers={"anthropic-beta": "interleaved-thinking-2025-05-14"}
)
In adaptive mode on Opus 4.7 and 4.6, interleaved thinking is enabled by default, with no beta header required. This is a hidden benefit of migrating to 4.7: agent pipelines no longer need to manage thinking flags.
A complete tool-use loop. Key point: the previous round’s thinking block must be passed back exactly as-is.
weather_tool = {
"name": "get_weather",
"description": "Get the current weather for a specified city",
"input_schema": {
"type": "object",
"properties": {"city": {"type": "string"}},
"required": ["city"]
}
}
# Round 1: the model thinks first, then decides whether to call a tool
response_1 = client.messages.create(
model="claude-opus-4-7",
max_tokens=8000,
thinking={"type": "adaptive"},
output_config={"effort": "high"},
tools=[weather_tool],
messages=[
{"role": "user", "content": "What should I wear in Paris right now? Give specific advice."}
]
)
# Round 2: pass back the thinking block + tool_use block exactly, then add tool_result
thinking_block = next(b for b in response_1.content if b.type == "thinking")
tool_use_block = next(b for b in response_1.content if b.type == "tool_use")
response_2 = client.messages.create(
model="claude-opus-4-7",
max_tokens=8000,
thinking={"type": "adaptive"},
output_config={"effort": "high"},
tools=[weather_tool],
messages=[
{"role": "user", "content": "What should I wear in Paris right now? Give specific advice."},
{
"role": "assistant",
"content": [thinking_block, tool_use_block] # ← Key point
},
{
"role": "user",
"content": [{
"type": "tool_result",
"tool_use_id": tool_use_block.id,
"content": "Paris is 22°C, sunny, with a light breeze."
}]
}
]
)
weather_tool = {
"name": "get_weather",
"description": "Get the current weather for a specified city",
"input_schema": {
"type": "object",
"properties": {"city": {"type": "string"}},
"required": ["city"]
}
}
# Round 1: the model thinks first, then decides whether to call a tool
response_1 = client.messages.create(
model="claude-opus-4-7",
max_tokens=8000,
thinking={"type": "adaptive"},
output_config={"effort": "high"},
tools=[weather_tool],
messages=[
{"role": "user", "content": "What should I wear in Paris right now? Give specific advice."}
]
)
# Round 2: pass back the thinking block + tool_use block exactly, then add tool_result
thinking_block = next(b for b in response_1.content if b.type == "thinking")
tool_use_block = next(b for b in response_1.content if b.type == "tool_use")
response_2 = client.messages.create(
model="claude-opus-4-7",
max_tokens=8000,
thinking={"type": "adaptive"},
output_config={"effort": "high"},
tools=[weather_tool],
messages=[
{"role": "user", "content": "What should I wear in Paris right now? Give specific advice."},
{
"role": "assistant",
"content": [thinking_block, tool_use_block] # ← Key point
},
{
"role": "user",
"content": [{
"type": "tool_result",
"tool_use_id": tool_use_block.id,
"content": "Paris is 22°C, sunny, with a light breeze."
}]
}
]
)
What happens if you do not pass back the thinking block: the model “forgets” and loses the reasoning process it just built, which noticeably reduces second-round decision quality. This is one of the most common mistakes when wiring up agent pipelines.
6. Choosing effort: Which Level Should You Use?
Choose by business task type. Do not default to high:
| Task | Recommended effort | Model | Notes |
|---|---|---|---|
| Simple classification, information extraction | low or no thinking |
Haiku 4.5 / Sonnet 4.6 | Thinking may only add cost and latency |
| General Q&A, document generation | medium |
Sonnet 4.6 | The cost-performance sweet spot |
| Code review, PR analysis | medium or high |
Sonnet 4.6 / Opus 4.6 | Multi-step reasoning, but usually not max |
| Complex algorithms, architecture design | high |
Opus 4.7 | Requires long-chain reasoning |
| AIME-level math, the hardest debugging tasks | max |
Opus 4.7 / 4.6 | Extreme cases, with clearly higher per-call cost |
A useful rule of thumb: run the prompt once without thinking before enabling thinking. If the normal mode answers well, leave thinking off. If the normal mode has obvious issues, such as missing constraints, skipping logic, or answering the wrong question, then turn on thinking. Blindly defaulting to high is one reason teams end up with per-person monthly bills around $2,000.
7. Common Pitfalls
Pitfall 1: The Hidden Coupling Between Thinking and Prompt Caching
The prompt caching cache key includes the thinking configuration. That means:
# First call: creates the cache
client.messages.create(
...,
thinking={"type": "adaptive"},
output_config={"effort": "medium"},
system=[{"type": "text", "text": LONG_SYSTEM, "cache_control": {"type": "ephemeral"}}],
...
)
# Second call: change effort to high → cache invalidated and billed again
client.messages.create(
...,
thinking={"type": "adaptive"},
output_config={"effort": "high"}, # Changed
system=[{"type": "text", "text": LONG_SYSTEM, "cache_control": {"type": "ephemeral"}}],
...
)
# First call: creates the cache
client.messages.create(
...,
thinking={"type": "adaptive"},
output_config={"effort": "medium"},
system=[{"type": "text", "text": LONG_SYSTEM, "cache_control": {"type": "ephemeral"}}],
...
)
# Second call: change effort to high → cache invalidated and billed again
client.messages.create(
...,
thinking={"type": "adaptive"},
output_config={"effort": "high"}, # Changed
system=[{"type": "text", "text": LONG_SYSTEM, "cache_control": {"type": "ephemeral"}}],
...
)
Practical advice: keep effort fixed within the same workflow. Do not change effort during A/B tests and still expect cache hits.
Pitfall 2: Thinking Blocks Are Not Preserved Across Turns Automatically
The model returns a thinking block each time, but in the next request those blocks are not in context by default unless you manually pass them back, as shown in section 5. If you ignore this in multi-turn conversations, the model will repeatedly “think through” things it already reasoned about, wasting a lot of tokens.
Pitfall 3: Thinking Is Billed for the Full Process, Not the Visible Summary
The block.thinking you receive is a thinking summary generated by the model, but billing is based on the full internal thinking process, which may be 3-5x larger than the summary. A thinking block that looks like 200 words on screen may have consumed 5,000 tokens behind the scenes. This is often why the month-end bill does not match your visual estimate.
Pitfall 4: Thinking Conflicts with Several Sampling Parameters
When thinking is enabled, you cannot use:
temperatureother than 1, ortop_ktop_pbelow 0.95- Forced tool use, where
tool_choicespecifies a concrete tool name - Response prefill, where an assistant message is used to prefill the model’s reply
If your existing code uses any of these parameters, remove them before migrating to thinking.
8. Migration Checklist: From Old Code to Adaptive
Code level:
- [ ] Change
thinking: {"type": "enabled", "budget_tokens": N}tothinking: {"type": "adaptive"} - [ ] Add
output_config: {"effort": "..."}and choose the level from the table above - [ ] Remove the
interleaved-thinking-2025-05-14beta header, because adaptive enables it automatically - [ ] Check whether
temperature,top_k,tool_choice, or prefill conflicts with thinking
Engineering level:
- [ ] Add logic to pass thinking blocks back in multi-turn conversations
- [ ] Keep
effortconsistent for prompt caching - [ ] Add billing alerts that monitor thinking tokens separately from visible output tokens
Testing level:
- [ ] Run the same prompt with
effort: low,medium, andhigh, then record quality and token usage - [ ] In the agent pipeline, verify that second-round thinking remains reasonable after passing back
tool_result
9. Summary
Getting one thing right is enough: use adaptive by default, choose effort by task type, and do not start with high.
claudeapi.com provides full access to Claude Opus 4.7, Opus 4.6, Sonnet 4.6, and Haiku 4.5, and is compatible with the Anthropic SDK format. You only need to replace base_url to integrate it smoothly into existing projects. The console supports viewing thinking token and regular token usage separately by API key, making it easier to find cases where the model is “thinking too hard.”
Try it now: claudeapi.com · Full model list and pricing are available at console.claudeapi.com.



