Tool-output offloading
A single oversized tool output — 500 SQL rows, a 3 MB API response,
a 400-page PDF run through document_conversion — can dwarf the
rest of the conversation and burn a turn's worth of context window
in one shot. Tool-output offloading is the platform's defense
against that.
It's on by default for every agent. When a tool returns more
than the agent can comfortably absorb, the platform either parks
the full output in a session artifact and hands the LLM a compact
reference (artifact mode), or shortens the output in place,
keeping the head and tail and dropping a notice into the middle
(truncate mode). Mode auto-selects: artifact if the agent has
any of artifact_read, artifact_grep, or artifact_jq
configured, truncate otherwise — so an agent without artifact
tools is never left chasing references it can't follow.
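The selection rule is simple enough to sketch. A minimal illustration (the function and constant names here are assumptions for the sketch, not platform internals):

```python
# Tools that make artifact references followable.
ARTIFACT_TOOLS = {"artifact_read", "artifact_grep", "artifact_jq"}

def select_mode(configured_tools, mode="auto"):
    """Return the effective offloading mode for an agent."""
    if mode != "auto":
        return mode  # an explicit mode setting wins
    # Artifact mode only if the agent can actually follow references.
    return "artifact" if configured_tools & ARTIFACT_TOOLS else "truncate"
```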
Two gates decide what's "too big"
A tool output is offloaded when either of these fires:
- Per-output gate — the output on its own is large enough to dominate the prompt. Threshold is context_percentage of the model's context window (default 0.25), clamped between min_threshold_bytes and max_threshold_bytes.
- Headroom gate — the output is small in absolute terms, but adding it would push cumulative input tokens past headroom_percentage of the context window (default 0.70). This catches the slow-burn case: many medium-sized outputs piling up across a session even though no single one is obviously huge.
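The two gates can be sketched as a single decision function. This is a simplified illustration (it assumes token counts are tracked separately from byte sizes; the names are not the platform's internals):

```python
def should_offload(output_bytes, threshold_bytes,
                   used_input_tokens, output_tokens,
                   context_window_tokens, headroom_percentage=0.70):
    # Per-output gate: the output alone is large enough to dominate the prompt.
    if output_bytes >= threshold_bytes:
        return True
    # Headroom gate: adding this output would push cumulative input tokens
    # past the headroom fraction of the context window.
    return used_input_tokens + output_tokens > headroom_percentage * context_window_tokens
```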
Anything below min_threshold_bytes always passes through inline.
A small output isn't worth an artifact round-trip and isn't worth
truncating.
headroom_percentage is independent of the
compaction threshold and both can fire
on the same agent. Set it below the compaction threshold so large
tool outputs are offloaded before compaction runs — otherwise a
fresh tool result can be summarized into oblivion the moment it
arrives.
What the agent sees
In artifact mode, the LLM receives a small envelope in place of
the original output:
{
"artifact_id": "art_tool_output_9k2x",
"size_bytes": 2457600,
"line_count": 14020,
"shape": { "rows": "array(500) of object(6 keys)", "next_cursor": "string" },
"how_to_access": {
"artifact_read": "Read full content or a line range (start_line/end_line)",
"artifact_grep": "Search for patterns with grep",
"artifact_jq": "Query with jq expressions, e.g. '.results[0]', '.[] | select(.score > 0.5)'"
}
}
shape is a hint for the LLM to eyeball, not a JSON-schema
document. how_to_access only lists the artifact tools the agent
actually has configured. Artifact tools themselves are never
offloaded — their outputs come back inline so the agent doesn't
end up chasing references to references.
In truncate mode, the head and tail of the original output are
preserved with a notice marking what was dropped:
{
  "rows": [
    { "id": 1, ... },
    { "id": 2, ... },
    ... [truncated 1.8 MB; head and tail preserved] ...
    { "id": 4998, ... },
    { "id": 4999, ... }
  ]
}
There is no follow-up tool call — the truncated form is final. Zero extra LLM turns, no way to recover detail from the middle.
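Truncate mode amounts to a head-and-tail cut with a notice in the middle. A minimal sketch, assuming a byte budget split evenly between head and tail (the platform's exact budgeting and notice format may differ):

```python
def truncate_middle(text, max_bytes, keep_ratio=0.5):
    """Keep the head and tail of an oversized output, dropping the middle."""
    data = text.encode("utf-8")
    if len(data) <= max_bytes:
        return text  # small outputs pass through untouched
    head_bytes = int(max_bytes * keep_ratio)
    tail_bytes = max_bytes - head_bytes
    # errors="ignore" guards against cutting a multi-byte character in half.
    head = data[:head_bytes].decode("utf-8", errors="ignore")
    tail = data[-tail_bytes:].decode("utf-8", errors="ignore")
    dropped = len(data) - head_bytes - tail_bytes
    notice = f"\n... [truncated {dropped} bytes; head and tail preserved] ...\n"
    return head + notice + tail
```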
Configure artifact tools to unlock artifact mode
Adding any of the artifact tools is enough to flip the agent into
artifact mode and give the LLM something to do with the
references it now receives. Configure all three when you can; each
hits a different access pattern.
Three artifact tools, all access patterns covered.
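One way the three tools might appear in an agent's tool list — a hedged reconstruction; the exact configuration schema shown here is an assumption, not documented syntax:

```json
{
  "tools": [
    { "type": "artifact_read" },
    { "type": "artifact_grep" },
    { "type": "artifact_jq" }
  ]
}
```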
Pair offloading with refinable tools
Offloading caps how much the agent ingests in one turn. What it ingests on the next turn is up to your tools. Vectara's built-ins are designed for this — every retrieval and listing tool exposes at least one parameter the agent can use to come back with a narrower question:
- corpora_search takes a query and a full search configuration — metadata filters, result limits, reranking — so the agent can re-query a corpus with a filter or a tighter limit instead of re-reading the offloaded result.
- list_documents takes a corpus_key, a metadata_filter string, a limit, and a page key.
- web_search takes a query, a limit, and include_domains/exclude_domains.
- sql_query runs whatever SQL the agent writes, so refinement is implicit — but a custom wrapper that pins the database and only exposes a WHERE-clause parameter follows the same pattern.
- artifact_read, artifact_grep, and artifact_jq are the reference implementation: line ranges, regex patterns, jq expressions. Custom tools that operate on large content should match this shape.
When you build custom tools, follow the same instinct — expose a
filter, a limit, a cursor, or a structured query parameter rather
than always returning a fixed shape. This matters most under
truncate mode, where the dropped middle is gone and the agent's
only recovery path is a narrower re-call. It pays under artifact
mode too: a refinable tool lets the agent skip the artifact
round-trip entirely when it can describe up front what it actually
needs.
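As an illustration of that shape, here is a hypothetical listing tool with a filter, a limit, and a cursor (a plain offset, for simplicity); every name in it is invented for the example:

```python
# In-memory stand-in for a real data source.
ORDERS = [{"id": i, "status": "open" if i % 2 else "closed"}
          for i in range(1, 201)]

def search_orders(status_filter=None, limit=50, offset=0):
    """Refinable listing tool: filter + limit + cursor, not a fixed dump.

    The agent can re-call with a narrower filter or a smaller limit
    instead of re-reading a large offloaded result.
    """
    rows = [o for o in ORDERS
            if status_filter is None or o["status"] == status_filter]
    page = rows[offset:offset + limit]
    next_offset = offset + limit if offset + limit < len(rows) else None
    return {"rows": page, "next_offset": next_offset}
```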
The knobs
The tool_output_offloading config exposes six fields. Defaults
are tuned for typical agents — most setups never need to touch
them.
| Field | Default | What it does |
|---|---|---|
| enabled | true | Master switch. Set to false only to force pass-through. |
| mode | auto | artifact if any artifact tool is configured, otherwise truncate. Set explicitly to override. |
| context_percentage | 0.25 | Per-output threshold as a fraction of the model's context window. |
| min_threshold_bytes | 4096 | Floor — outputs below this always pass through. |
| max_threshold_bytes | 1048576 | Ceiling — outputs above this always offload, even on huge-context models. |
| headroom_percentage | 0.70 | Cumulative-context gate. Set to 1.0 to disable. |
The effective per-output threshold scales with the model's context
window, clamped between the floor and ceiling. On large-context
models, max_threshold_bytes is usually the binding constraint.
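Putting numbers to the clamp (assuming, for illustration only, a rough 4-bytes-per-token conversion — not a documented constant):

```python
def effective_threshold(context_window_tokens, context_percentage=0.25,
                        min_threshold_bytes=4096,
                        max_threshold_bytes=1_048_576,
                        bytes_per_token=4):
    """Per-output threshold: a fraction of the context window, clamped."""
    raw = context_percentage * context_window_tokens * bytes_per_token
    return max(min_threshold_bytes, min(max_threshold_bytes, int(raw)))
```

On a 2M-token model the ceiling binds; on a tiny 2048-token model the floor binds; in between, the percentage itself sets the threshold.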
Trade-offs
- artifact mode costs an LLM turn per drill-in. That's the whole point — but it's why min_threshold_bytes exists. Borderline outputs are often better inline.
- Unstructured blobs are awkward to navigate. A 2 MB JSON document with a clean shape is easy for artifact_jq. A 2 MB free-form text dump is not. If you control a custom tool, prefer structured output.
- Artifacts live for the session. Offloaded outputs share storage with user-uploaded files and clean up when the session ends. Persist them elsewhere if they need to outlive the session.
- Truncation is lossy and not reversible. Use artifact mode whenever the agent might need the middle of a large output.
Related
- Context engineering overview
- Artifacts — where offloaded outputs live.
- Built-in tools — artifact_read, artifact_grep, artifact_jq, and the rest.