Practical Guide to Extended Thinking: How to Use Thinking Budgets in the Opus 4.7 Era

Most Chinese-language material about Extended Thinking still teaches thinking: {type: "enabled", budget_tokens: 10000}. On Opus 4.7, that syntax fails outright. This article explains the correct usage for Opus 4.7, Opus 4.6, and Sonnet 4.6 based on actual API behavior as of 2026-05, with an effort selection table, a complete interleaved thinking loop for tool use, and the two prompt caching pitfalls developers most often hit.

1. The Thinking Interface Changed Once in Six Months

If your codebase still has a call like this:

response = client.messages.create(
    model="claude-opus-4-7",  # Opus 4.7 rejects the old thinking syntax below
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 10000},  # Old explicit-budget syntax
    messages=[...]
)

Once the model is switched to Opus 4.7, the entire request fails. Anthropic changed the thinking interface in the 4.6/4.7 generation from an explicit budget to adaptive thinking plus an effort level, and Opus 4.7 no longer accepts the old syntax.

Change summary:

Model	Old syntax (`type: enabled` + `budget_tokens`)	New syntax (`type: adaptive` + `effort`)
Opus 4.7	❌ No longer accepted	✅ The only supported approach
Opus 4.6	⚠️ Still works but deprecated	✅ Recommended
Sonnet 4.6	⚠️ Still works but deprecated	✅ Recommended
Sonnet 4.6 + interleaved thinking with tools	Requires the `interleaved-thinking-2025-05-14` beta header	Enabled automatically in adaptive mode

This means that if you plan to integrate Opus 4.7 into an existing agent pipeline, the first step is to change the thinking configuration to adaptive instead of simply swapping the model ID.
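
The rewrite described above can be sketched as a small helper that converts old-style request arguments before they reach the SDK. This is a minimal sketch, not an official migration tool: the function name `migrate_thinking_kwargs` and the fallback effort of `medium` are assumptions made for illustration.

```python
from copy import deepcopy


def migrate_thinking_kwargs(kwargs: dict, effort: str = "medium") -> dict:
    """Return a copy of kwargs with old-style thinking replaced by adaptive + effort."""
    migrated = deepcopy(kwargs)
    thinking = migrated.get("thinking")
    if isinstance(thinking, dict) and thinking.get("type") == "enabled":
        # Adaptive mode has no explicit budget, so budget_tokens is dropped.
        migrated["thinking"] = {"type": "adaptive"}
        # Keep an effort the caller already set; otherwise use the fallback.
        output_config = dict(migrated.get("output_config") or {})
        output_config.setdefault("effort", effort)
        migrated["output_config"] = output_config
    return migrated


old_call = {
    "model": "claude-opus-4-7",
    "max_tokens": 16000,
    "thinking": {"type": "enabled", "budget_tokens": 10000},
    "messages": [{"role": "user", "content": "Hello"}],
}
new_call = migrate_thinking_kwargs(old_call, effort="high")
print(new_call["thinking"], new_call["output_config"])
```

The helper returns a copy and leaves the input untouched, so it can be dropped in front of existing call sites while the migration is still in progress.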

2. What Is Adaptive Thinking?

The old design worked like this: you told the model, “Spend up to 10,000 tokens thinking.” The model might use the whole budget, or only part of it. That created two problems:

Simple questions can be forced to waste tokens on unnecessary thinking
Complex questions may not get enough thinking if the budget is set too low

Adaptive thinking gives that decision to the model, while the effort parameter controls how aggressively it thinks:

effort	Behavior	Best for
`low`	The model often skips thinking entirely for simple questions	Lightweight calls with low cost and low latency
`medium`	Balanced: thinks only when it decides thinking is needed	General production tasks
`high` (default)	Almost always thinks first	Reasoning tasks in production
`max` (Opus 4.6/4.7 only)	Maximum thinking intensity	AIME-level math, long-horizon agents, the hardest bugs

Anthropic’s internal evaluations show that adaptive uses tokens more efficiently than fixed budgets across many tasks, because it does not waste reasoning on easy questions and does not stop too early on hard ones.

3. Complete Python Example

Environment setup:

pip install anthropic

3.1 Opus 4.7 Call (Adaptive Only)

import anthropic

client = anthropic.Anthropic(
    api_key="sk-your-ClaudeAPI-key",
    base_url="https://gw.claudeapi.com"
)

response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=16000,
    thinking={"type": "adaptive"},
    output_config={"effort": "high"},
    messages=[
        {"role": "user", "content": "Prove that there are infinitely many primes congruent to 3 modulo 4."}
    ]
)

for block in response.content:
    if block.type == "thinking":
        print(f"[Thinking] {block.thinking[:200]}...")
    elif block.type == "text":
        print(f"[Answer] {block.text}")

The returned content is an ordered array of blocks: first a thinking block containing the model’s reasoning summary, then a text block containing the final answer.
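
Because the model may skip thinking at lower effort levels, code that consumes this array is safer if it does not assume a thinking block is present. The sketch below shows one way to do that. The `SimpleNamespace` objects are stand-ins shaped like the SDK's blocks (`type`, `thinking`, `text` attributes) and exist only for illustration; `split_blocks` is a hypothetical helper name.

```python
from types import SimpleNamespace


def split_blocks(content):
    """Return (list of thinking summaries, joined answer text) from content blocks."""
    thinking = [b.thinking for b in content if b.type == "thinking"]
    answer = "".join(b.text for b in content if b.type == "text")
    return thinking, answer


# Stand-in blocks for illustration; a real call would pass response.content.
with_thinking = [
    SimpleNamespace(type="thinking", thinking="Assume finitely many such primes."),
    SimpleNamespace(type="text", text="There are infinitely many."),
]
without_thinking = [SimpleNamespace(type="text", text="4")]

print(split_blocks(with_thinking))
print(split_blocks(without_thinking))
```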

3.2 The Same Syntax on Sonnet 4.6

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=16000,
    thinking={"type": "adaptive"},
    output_config={"effort": "medium"},  # Sonnet does not support max
    messages=[{"role": "user", "content": "..."}]
)

Note: effort: max is not accepted on Sonnet 4.6. max is exclusive to the Opus family. Use medium for everyday production tasks, and only move to high for important reasoning tasks.
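
A guard like the following can catch an unsupported combination before the request is sent. It is a sketch under stated assumptions: the four effort names come from the table in section 2, while the check that Opus model IDs start with `claude-opus-` is inferred from the IDs used in this article, and `validate_effort` is a hypothetical name.

```python
EFFORT_LEVELS = ("low", "medium", "high", "max")


def validate_effort(model: str, effort: str) -> str:
    """Return effort unchanged, or raise ValueError for an unsupported combination."""
    if effort not in EFFORT_LEVELS:
        raise ValueError(f"Unknown effort {effort!r}; expected one of {EFFORT_LEVELS}")
    # Assumption: Opus model IDs share the "claude-opus-" prefix.
    if effort == "max" and not model.startswith("claude-opus-"):
        raise ValueError(f"effort 'max' is Opus-only, got model {model!r}")
    return effort


print(validate_effort("claude-sonnet-4-6", "medium"))
try:
    validate_effort("claude-sonnet-4-6", "max")
except ValueError as exc:
    print(f"Rejected: {exc}")
```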

3.3 Opus 4.6 Supports Both Syntaxes, but You Should Use Adaptive

# ✅ Recommended
response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=16000,
    thinking={"type": "adaptive"},
    output_config={"effort": "high"},
    messages=[...]
)

# ⚠️ Still works but deprecated. Migrate as soon as possible.
response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 10000},
    messages=[...]
)

4. Node.js / TypeScript Version

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic({
  apiKey: "sk-your-ClaudeAPI-key",
  baseURL: "https://gw.claudeapi.com",
});

const response = await client.messages.create({
  model: "claude-opus-4-7",
  max_tokens: 16000,
  thinking: { type: "adaptive" },
  output_config: { effort: "high" },
  messages: [{ role: "user", content: "Your complex reasoning task" }],
});

for (const block of response.content) {
  if (block.type === "thinking") {
    console.log("[Thinking]", block.thinking);
  } else if (block.type === "text") {
    console.log("[Answer]", block.text);
  }
}

5. Interleaved Thinking: The Key Change in Tool Loops

The real power of Extended Thinking in agent scenarios is that it lets the model think again after each tool call. This is interleaved thinking.

With the old syntax on Sonnet 4.6, you had to manually add a beta header:

client = anthropic.Anthropic(
    api_key="sk-your-ClaudeAPI-key",
    base_url="https://gw.claudeapi.com",
    default_headers={"anthropic-beta": "interleaved-thinking-2025-05-14"}
)

In adaptive mode on Opus 4.7 and 4.6, interleaved thinking is enabled by default, with no beta header required. This is a hidden benefit of migrating to 4.7: agent pipelines no longer need to manage thinking flags.
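
During a gradual migration, some call sites may still use the old syntax while others are already adaptive. A small helper can decide per request whether the beta header is still needed. This is a sketch that follows the rule as described in this article; `beta_headers_for` is a hypothetical name, and the returned dict is meant to be merged into whatever per-request headers your code already sends.

```python
INTERLEAVED_BETA = "interleaved-thinking-2025-05-14"


def beta_headers_for(thinking: dict) -> dict:
    """Return the beta header only for the old explicit-budget syntax."""
    if thinking.get("type") == "enabled":
        return {"anthropic-beta": INTERLEAVED_BETA}
    # Adaptive mode enables interleaved thinking without a header.
    return {}


print(beta_headers_for({"type": "enabled", "budget_tokens": 10000}))
print(beta_headers_for({"type": "adaptive"}))
```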

Below is a complete tool-use loop. The key point: the previous round’s thinking block must be passed back exactly as-is.

weather_tool = {
    "name": "get_weather",
    "description": "Get the current weather for a specified city",
    "input_schema": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
    }
}

# Round 1: the model thinks first, then decides whether to call a tool
response_1 = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=8000,
    thinking={"type": "adaptive"},
    output_config={"effort": "high"},
    tools=[weather_tool],
    messages=[
        {"role": "user", "content": "What should I wear in Paris right now? Give specific advice."}
    ]
)

# Round 2: pass back the assistant content exactly as returned, then add tool_result.
# Adaptive thinking may skip the thinking block, so pass back whatever came back
# instead of assuming a thinking block exists.
tool_use_block = next((b for b in response_1.content if b.type == "tool_use"), None)
if tool_use_block is None:
    raise RuntimeError("The model answered directly without calling get_weather")

response_2 = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=8000,
    thinking={"type": "adaptive"},
    output_config={"effort": "high"},
    tools=[weather_tool],
    messages=[
        {"role": "user", "content": "What should I wear in Paris right now? Give specific advice."},
        {
            "role": "assistant",
            "content": response_1.content  # ← Key point: thinking + tool_use blocks, unmodified and in order
        },
        {
            "role": "user",
            "content": [{
                "type": "tool_result",
                "tool_use_id": tool_use_block.id,
                "content": "Paris is 22°C, sunny, with a light breeze."
            }]
        }
    ]
)

If you do not pass back the thinking block, the model loses the reasoning it just built, which noticeably reduces the quality of its second-round decisions. This is one of the most common mistakes when wiring up agent pipelines.
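
To make this mistake harder to commit, the message construction can live in one helper that always forwards the assistant content untouched. The following is a minimal sketch: `build_tool_followup` is a hypothetical name, the blocks are assumed to expose `type` and `id` attributes like the SDK's objects, and the `SimpleNamespace` values are stand-ins for a real response.

```python
from types import SimpleNamespace


def build_tool_followup(messages, assistant_content, tool_results):
    """Return a new message list: history + unmodified assistant turn + tool results.

    tool_results maps each tool_use id to the string result of running that tool.
    """
    tool_uses = [b for b in assistant_content if b.type == "tool_use"]
    if not tool_uses:
        raise ValueError("assistant_content contains no tool_use block")
    followup = list(messages)
    # Forward every block (thinking included) exactly as returned, in order.
    followup.append({"role": "assistant", "content": assistant_content})
    followup.append({
        "role": "user",
        "content": [
            {"type": "tool_result", "tool_use_id": b.id, "content": tool_results[b.id]}
            for b in tool_uses
        ],
    })
    return followup


# Stand-in blocks for illustration; a real call would pass response_1.content.
content = [
    SimpleNamespace(type="thinking", thinking="Need the current weather first."),
    SimpleNamespace(type="tool_use", id="toolu_01", name="get_weather", input={"city": "Paris"}),
]
history = [{"role": "user", "content": "What should I wear in Paris right now?"}]
next_messages = build_tool_followup(history, content, {"toolu_01": "22°C, sunny"})
print(len(next_messages), next_messages[2]["content"][0]["tool_use_id"])
```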

6. Choosing effort: Which Level Should You Use?

Choose by business task type. Do not default to high:

Task	Recommended effort	Model	Notes
Simple classification, information extraction	`low` or no thinking	Haiku 4.5 / Sonnet 4.6	Thinking may only add cost and latency
General Q&A, document generation	`medium`	Sonnet 4.6	The cost-performance sweet spot
Code review, PR analysis	`medium` or `high`	Sonnet 4.6 / Opus 4.6	Multi-step reasoning, but usually not `max`
Complex algorithms, architecture design	`high`	Opus 4.7	Requires long-chain reasoning
AIME-level math, the hardest debugging tasks	`max`	Opus 4.7 / 4.6	Extreme cases, with clearly higher per-call cost

A useful rule of thumb: run the prompt once without thinking before enabling thinking. If the normal mode answers well, leave thinking off. If the normal mode has obvious issues, such as missing constraints, skipping logic, or answering the wrong question, then turn on thinking. Blindly defaulting to high is one reason teams end up with per-person monthly bills around $2,000.
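
The table and the rule of thumb can be combined into a small lookup so that the choice is made in one place instead of being hard-coded at every call site. This is a sketch only: the task labels are hypothetical keys invented for this example, and falling back to `medium` for unknown tasks is an assumption in the spirit of not defaulting to `high`.

```python
# Effort per task type, following the table above. Keys are hypothetical labels.
EFFORT_BY_TASK = {
    "classification": "low",
    "general_qa": "medium",
    "code_review": "medium",
    "architecture": "high",
    "hardest_math_or_debugging": "max",
}


def choose_effort(task_type: str, baseline_ok: bool = False):
    """Return an effort level, or None to leave thinking off.

    baseline_ok is True when a run without thinking already answered well.
    """
    if baseline_ok:
        return None
    return EFFORT_BY_TASK.get(task_type, "medium")


print(choose_effort("architecture"))
print(choose_effort("general_qa", baseline_ok=True))
```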

7. Common Pitfalls

Pitfall 1: The Hidden Coupling Between Thinking and Prompt Caching

The prompt cache key includes the thinking configuration, so changing effort between calls invalidates the cache:

# First call: creates the cache
client.messages.create(
    ...,
    thinking={"type": "adaptive"},
    output_config={"effort": "medium"},
    system=[{"type": "text", "text": LONG_SYSTEM, "cache_control": {"type": "ephemeral"}}],
    ...
)

# Second call: change effort to high → cache invalidated and billed again
client.messages.create(
    ...,
    thinking={"type": "adaptive"},
    output_config={"effort": "high"},  # Changed
    system=[{"type": "text", "text": LONG_SYSTEM, "cache_control": {"type": "ephemeral"}}],
    ...
)

Practical advice: keep effort fixed within the same workflow. Do not change effort during A/B tests and still expect cache hits.
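
One way to enforce this is a client-side check that compares the thinking-related settings of consecutive requests in a workflow. The sketch below does not reproduce how the API computes its cache key; it only flags the kind of change this pitfall describes. `thinking_fingerprint` and `config_drifted` are hypothetical names.

```python
import hashlib
import json


def thinking_fingerprint(kwargs: dict) -> str:
    """Hash only the settings this pitfall is about: thinking and output_config."""
    relevant = {
        "thinking": kwargs.get("thinking"),
        "output_config": kwargs.get("output_config"),
    }
    payload = json.dumps(relevant, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def config_drifted(previous: dict, current: dict) -> bool:
    """Return True when the thinking settings differ between two requests."""
    return thinking_fingerprint(previous) != thinking_fingerprint(current)


first = {"thinking": {"type": "adaptive"}, "output_config": {"effort": "medium"}}
second = {"thinking": {"type": "adaptive"}, "output_config": {"effort": "high"}}
print(config_drifted(first, second))
```

Logging a warning whenever `config_drifted` returns True makes an unexpected drop in cache hits easier to trace back to its cause.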

Pitfall 2: Thinking Blocks Are Not Preserved Across Turns Automatically

The model returns a thinking block each time, but in the next request those blocks are not in context by default unless you manually pass them back, as shown in section 5. If you ignore this in multi-turn conversations, the model will repeatedly “think through” things it already reasoned about, wasting a lot of tokens.

Pitfall 3: Thinking Is Billed for the Full Process, Not the Visible Summary

The block.thinking you receive is a thinking summary generated by the model, but billing is based on the full internal thinking process, which may be 3-5x larger than the summary. A thinking block that looks like 200 words on screen may have consumed 5,000 tokens behind the scenes. This is often why the month-end bill does not match your visual estimate.
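
To see the gap for your own traffic, compare the text you can see with the output token count reported in the response's `usage.output_tokens` field. The sketch below is a rough estimate only: the figure of about 4 characters per token is a heuristic assumption that varies by language and content, and `billed_to_visible_ratio` is a hypothetical name.

```python
def billed_to_visible_ratio(visible_thinking: str, answer: str,
                            output_tokens: int, chars_per_token: float = 4.0) -> float:
    """Estimate how many billed output tokens correspond to each visible token."""
    # Heuristic: roughly chars_per_token characters per token (an assumption).
    visible_tokens = (len(visible_thinking) + len(answer)) / chars_per_token
    return output_tokens / max(visible_tokens, 1.0)


# Illustrative numbers, not measurements: 800 visible characters, 1000 billed tokens.
ratio = billed_to_visible_ratio("t" * 400, "a" * 400, output_tokens=1000)
print(ratio)
```

A ratio well above 1 suggests that most of the billed output was thinking that never appeared on screen.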

Pitfall 4: Thinking Conflicts with Several Sampling Parameters

When thinking is enabled, you cannot use:

temperature other than 1, or top_k
top_p below 0.95
Forced tool use, where tool_choice specifies a concrete tool name
Response prefill, where an assistant message is used to prefill the model’s reply

If your existing code uses any of these parameters, remove them before migrating to thinking.
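
A pre-flight check can find these conflicts in existing request arguments before migration. This is a sketch based only on the four restrictions listed above; `find_thinking_conflicts` is a hypothetical name, and the prefill check assumes prefill appears as a trailing assistant message.

```python
def find_thinking_conflicts(kwargs: dict) -> list:
    """Return human-readable problems for request kwargs that enable thinking."""
    if not kwargs.get("thinking"):
        return []
    problems = []
    if kwargs.get("temperature", 1) != 1:
        problems.append("temperature must be 1")
    if "top_k" in kwargs:
        problems.append("top_k is not allowed")
    if kwargs.get("top_p", 1.0) < 0.95:
        problems.append("top_p must be at least 0.95")
    tool_choice = kwargs.get("tool_choice") or {}
    if tool_choice.get("type") == "tool":
        problems.append("tool_choice forces a specific tool")
    messages = kwargs.get("messages") or []
    if messages and messages[-1].get("role") == "assistant":
        problems.append("last message is an assistant prefill")
    return problems


legacy_call = {
    "thinking": {"type": "adaptive"},
    "temperature": 0.2,
    "top_k": 40,
    "tool_choice": {"type": "tool", "name": "get_weather"},
    "messages": [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "{"}],
}
for problem in find_thinking_conflicts(legacy_call):
    print(problem)
```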

8. Migration Checklist: From Old Code to Adaptive

Code level:

[ ] Change thinking: {"type": "enabled", "budget_tokens": N} to thinking: {"type": "adaptive"}
[ ] Add output_config: {"effort": "..."} and choose the level from the table above
[ ] Remove the interleaved-thinking-2025-05-14 beta header, because adaptive enables it automatically
[ ] Check whether temperature, top_k, top_p, tool_choice, or prefill conflicts with thinking

Engineering level:

[ ] Add logic to pass thinking blocks back in multi-turn conversations
[ ] Keep effort consistent for prompt caching
[ ] Add billing alerts that monitor thinking tokens separately from visible output tokens

Testing level:

[ ] Run the same prompt with effort: low, medium, and high, then record quality and token usage
[ ] In the agent pipeline, verify that second-round thinking remains reasonable after passing back tool_result
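
The first testing item can be sketched as a tiny harness that runs one prompt at each effort level and records the results side by side. It is a sketch under assumptions: `call_model` is a hypothetical callable you supply that sends the request and returns the answer text together with the output token count, and `fake_call_model` below is a stand-in with invented numbers so the example runs offline.

```python
def compare_efforts(call_model, prompt, efforts=("low", "medium", "high")):
    """Run the same prompt at each effort level and collect one row per level."""
    rows = []
    for effort in efforts:
        answer, output_tokens = call_model(prompt, effort)
        rows.append({"effort": effort, "output_tokens": output_tokens, "answer": answer})
    return rows


def fake_call_model(prompt, effort):
    """Offline stand-in; the token counts are invented for illustration."""
    tokens = {"low": 120, "medium": 900, "high": 2400}[effort]
    return f"answer at {effort}", tokens


rows = compare_efforts(fake_call_model, "Explain the bug in this function.")
for row in rows:
    print(row["effort"], row["output_tokens"])
```

In a real run, `call_model` would wrap `client.messages.create` and read the token count from the response's usage data; answer quality still has to be judged separately.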

9. Summary

Getting one thing right is enough: use adaptive by default, choose effort by task type, and do not start with high.

claudeapi.com provides full access to Claude Opus 4.7, Opus 4.6, Sonnet 4.6, and Haiku 4.5, and is compatible with the Anthropic SDK format. You only need to replace base_url to integrate it smoothly into existing projects. The console supports viewing thinking token and regular token usage separately by API key, making it easier to find cases where the model is “thinking too hard.”

