Skip to main content

Advisor Tool

Pair a faster executor model with a higher-intelligence advisor model that provides strategic guidance mid-generation.

The advisor tool lets a fast, lower-cost executor model (Sonnet or Haiku) consult a high-intelligence advisor model (Opus 4.6) mid-generation. The advisor reads the full conversation and produces a plan or course correction — typically 400–700 text tokens — and the executor continues with the task.

This pattern is well-suited for long-horizon agentic workloads (coding agents, computer use, multi-step research) where most turns are mechanical but having an excellent plan is crucial. You get close to advisor-solo quality while the bulk of token generation happens at executor-model rates.

Beta

The advisor tool is in beta. Include anthropic-beta: advisor-tool-2026-03-01 in your requests — LiteLLM adds this automatically when it detects the advisor tool in your tools array.

Supported Providers

ProviderChat Completions APIMessages API
Anthropic API
Azure Anthropic❌ (coming soon)❌ (coming soon)
Google Cloud Vertex AI❌ (coming soon)❌ (coming soon)
Amazon Bedrock❌ (coming soon)❌ (coming soon)

Model Compatibility

The executor and advisor models must form a valid pair. Currently the only supported advisor model is claude-opus-4-6.

ExecutorAdvisor
claude-haiku-4-5-20251001claude-opus-4-6
claude-sonnet-4-6claude-opus-4-6
claude-opus-4-6claude-opus-4-6

Chat Completions API

SDK Usage

Basic Example

Advisor Tool — litellm.completion()
import litellm

response = litellm.completion(
model="anthropic/claude-sonnet-4-6",
messages=[
{"role": "user", "content": "Build a concurrent worker pool in Go with graceful shutdown."}
],
tools=[
{
"type": "advisor_20260301",
"name": "advisor",
"model": "claude-opus-4-6",
}
],
max_tokens=4096,
)

print(response.choices[0].message.content)

With Optional Parameters

Advisor Tool with max_uses and caching
import litellm

response = litellm.completion(
model="anthropic/claude-sonnet-4-6",
messages=[
{"role": "user", "content": "Build a REST API with authentication in Python."}
],
tools=[
{
"type": "advisor_20260301",
"name": "advisor",
"model": "claude-opus-4-6",
"max_uses": 3, # cap advisor calls per request
"caching": {"type": "ephemeral", "ttl": "5m"}, # enable for 3+ calls per conversation
}
],
max_tokens=4096,
)

Streaming

Streaming with Advisor Tool
import litellm

response = litellm.completion(
model="anthropic/claude-sonnet-4-6",
messages=[
{"role": "user", "content": "Implement a distributed rate limiter."}
],
tools=[
{
"type": "advisor_20260301",
"name": "advisor",
"model": "claude-opus-4-6",
}
],
max_tokens=4096,
stream=True,
)

for chunk in response:
if chunk.choices[0].delta.content:
print(chunk.choices[0].delta.content, end="")
Streaming behavior

The advisor sub-inference does not stream. The executor's stream pauses while the advisor runs, then the full advisor result arrives in a single event. Executor output resumes streaming afterward.

Multi-Turn Conversation

Multi-Turn with Advisor Tool
import litellm

tools = [
{
"type": "advisor_20260301",
"name": "advisor",
"model": "claude-opus-4-6",
}
]

messages = [
{"role": "user", "content": "Build a concurrent worker pool in Go with graceful shutdown."}
]

response = litellm.completion(
model="anthropic/claude-sonnet-4-6",
messages=messages,
tools=tools,
max_tokens=4096,
)

# Append the full response (includes server_tool_use + advisor_tool_result blocks)
messages.append({"role": "assistant", "content": response.choices[0].message.content})

# Continue the conversation — keep the same tools array
messages.append({"role": "user", "content": "Now add a max-in-flight limit of 10."})

response2 = litellm.completion(
model="anthropic/claude-sonnet-4-6",
messages=messages,
tools=tools,
max_tokens=4096,
)
Auto-strip on follow-up turns

LiteLLM automatically strips advisor_tool_result blocks from message history when the advisor tool is not present in the current request. This prevents the Anthropic 400 error that would otherwise occur.

AI Gateway Usage

Proxy Configuration

config.yaml
model_list:
- model_name: claude-sonnet
litellm_params:
model: anthropic/claude-sonnet-4-6
api_key: os.environ/ANTHROPIC_API_KEY

Client Request via Proxy

Advisor Tool via AI Gateway
from openai import OpenAI

client = OpenAI(
api_key="your-litellm-proxy-key",
base_url="http://0.0.0.0:4000/v1"
)

response = client.chat.completions.create(
model="claude-sonnet",
messages=[
{"role": "user", "content": "Implement a distributed rate limiter in Python."}
],
tools=[
{
"type": "advisor_20260301",
"name": "advisor",
"model": "claude-opus-4-6",
}
],
max_tokens=4096,
)

Messages API

SDK Usage

Basic Example

Advisor Tool — litellm.anthropic.messages
import asyncio
import litellm

async def main():
response = await litellm.anthropic.messages.acreate(
model="anthropic/claude-sonnet-4-6",
messages=[
{"role": "user", "content": "Build a concurrent worker pool in Go with graceful shutdown."}
],
tools=[
{
"type": "advisor_20260301",
"name": "advisor",
"model": "claude-opus-4-6",
}
],
max_tokens=4096,
)
print(response)

asyncio.run(main())

Streaming

Messages API Streaming with Advisor Tool
import asyncio
import json
import litellm

async def main():
response = await litellm.anthropic.messages.acreate(
model="anthropic/claude-sonnet-4-6",
messages=[
{"role": "user", "content": "Implement a distributed rate limiter."}
],
tools=[
{
"type": "advisor_20260301",
"name": "advisor",
"model": "claude-opus-4-6",
}
],
max_tokens=4096,
stream=True,
)

async for chunk in response:
if isinstance(chunk, bytes):
for line in chunk.decode("utf-8").split("\n"):
if line.startswith("data: "):
try:
print(json.loads(line[6:]))
except json.JSONDecodeError:
pass

asyncio.run(main())

AI Gateway Usage

Proxy Configuration

config.yaml
model_list:
- model_name: claude-sonnet
litellm_params:
model: anthropic/claude-sonnet-4-6
api_key: os.environ/ANTHROPIC_API_KEY

Client Request via Proxy (Anthropic SDK)

Advisor Tool via AI Gateway (Anthropic SDK)
import anthropic

client = anthropic.Anthropic(
api_key="your-litellm-proxy-key",
base_url="http://0.0.0.0:4000"
)

response = client.beta.messages.create(
model="claude-sonnet",
max_tokens=4096,
betas=["advisor-tool-2026-03-01"],
messages=[
{"role": "user", "content": "Build a concurrent worker pool in Go with graceful shutdown."}
],
tools=[
{
"type": "advisor_20260301",
"name": "advisor",
"model": "claude-opus-4-6",
}
],
)
print(response)

Response Structure

A successful advisor call returns server_tool_use and advisor_tool_result blocks in the assistant content:

Response with advisor blocks
{
"role": "assistant",
"content": [
{
"type": "text",
"text": "Let me consult the advisor on this."
},
{
"type": "server_tool_use",
"id": "srvtoolu_abc123",
"name": "advisor",
"input": {}
},
{
"type": "advisor_tool_result",
"tool_use_id": "srvtoolu_abc123",
"content": {
"type": "advisor_result",
"text": "Use a channel-based coordination pattern. The tricky part is draining in-flight work during shutdown: close the input channel first, then wait on a WaitGroup..."
}
},
{
"type": "text",
"text": "Here's the implementation using a channel-based coordination pattern..."
}
]
}

Pass the full assistant content, including advisor blocks, back on subsequent turns. LiteLLM handles this automatically through provider_specific_fields.


Cost Control

Advisor calls run as a separate sub-inference billed at the advisor model's rates. Usage is reported in usage.iterations[]:

Usage with advisor sub-inference
{
"usage": {
"input_tokens": 412,
"output_tokens": 531,
"iterations": [
{
"type": "message",
"input_tokens": 412,
"output_tokens": 89
},
{
"type": "advisor_message",
"model": "claude-opus-4-6",
"input_tokens": 823,
"output_tokens": 1612
},
{
"type": "message",
"input_tokens": 1348,
"output_tokens": 442
}
]
}
}

Top-level usage reflects executor tokens only. Advisor tokens appear in iterations entries with type: "advisor_message" and are billed at Opus rates.

Tips:

  • Enable caching on the tool definition only when you expect 3+ advisor calls per conversation; it costs more than it saves below that threshold.
  • Use max_uses to cap advisor calls per request. Once reached, the executor continues without further advice.
  • For conversation-level caps, count advisor calls client-side. When you reach your limit, remove the advisor tool from tools.

For coding and agent tasks, Anthropic recommends prepending these blocks to your system prompt for consistent advisor timing and optimal cost/quality:

Timing guidance (prepend to system prompt)
You have access to an `advisor` tool backed by a stronger reviewer model. It takes NO parameters — when you call advisor(), your entire conversation history is automatically forwarded. They see the task, every tool call you've made, every result you've seen.

Call advisor BEFORE substantive work — before writing, before committing to an interpretation, before building on an assumption. If the task requires orientation first (finding files, fetching a source, seeing what's there), do that, then call advisor. Orientation is not substantive work. Writing, editing, and declaring an answer are.

Also call advisor:
- When you believe the task is complete. BEFORE this call, make your deliverable durable: write the file, save the result, commit the change.
- When stuck — errors recurring, approach not converging, results that don't fit.
- When considering a change of approach.

On tasks longer than a few steps, call advisor at least once before committing to an approach and once before declaring done. On short reactive tasks where the next action is dictated by tool output you just read, you don't need to keep calling.
Advice weight guidance (add after timing block)
Give the advice serious weight. If you follow a step and it fails empirically, or you have primary-source evidence that contradicts a specific claim, adapt. A passing self-test is not evidence the advice is wrong.

If you've already retrieved data pointing one way and the advisor points another: don't silently switch. Surface the conflict in one more advisor call — "I found X, you suggest Y, which constraint breaks the tie?"

To reduce advisor output length by 35–45% without losing quality, add:

Cost reduction (optional, add before timing block)
The advisor should respond in under 100 words and use enumerated steps, not explanations.

Additional Resources