Version: 2.0

Tool-output offloading

A single oversized tool output — 500 SQL rows, a 3 MB API response, a 400-page PDF run through document_conversion — can dwarf the rest of the conversation and burn a turn's worth of context window in one shot. Tool-output offloading is the platform's defense against that.

It's on by default for every agent. When a tool returns more than the agent can comfortably absorb, the platform either parks the full output in a session artifact and hands the LLM a compact reference (artifact mode), or shortens the output in place, keeping the head and tail and dropping a notice into the middle (truncate mode). Mode auto-selects: artifact if the agent has any of artifact_read, artifact_grep, or artifact_jq configured, truncate otherwise — so an agent without artifact tools is never left chasing references it can't follow.

Two gates decide what's "too big"

A tool output is offloaded when either of these fires:

  • Per-output gate — the output on its own is large enough to dominate the prompt. Threshold is context_percentage of the model's context window (default 0.25), clamped between min_threshold_bytes and max_threshold_bytes.
  • Headroom gate — the output is small in absolute terms, but adding it would push cumulative input tokens past headroom_percentage of the context window (default 0.70). This catches the slow-burn case: many medium-sized outputs piling up across a session even though no single one is obviously huge.

Anything below min_threshold_bytes always passes through inline. A small output isn't worth an artifact round-trip and isn't worth truncating.

headroom_percentage is independent of the compaction threshold and both can fire on the same agent. Set it below the compaction threshold so large tool outputs are offloaded before compaction runs — otherwise a fresh tool result can be summarized into oblivion the moment it arrives.
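The two gates compose roughly like this sketch (illustrative only, not the platform's implementation; the should_offload helper and the 4-bytes-per-token conversion are assumptions):

```python
def should_offload(output_bytes, cumulative_input_tokens, context_window_tokens,
                   per_output_threshold_bytes, headroom_percentage=0.70,
                   min_threshold_bytes=4096, bytes_per_token=4):
    # Outputs below the floor always pass through inline.
    if output_bytes < min_threshold_bytes:
        return False
    # Per-output gate: the output on its own is large enough to dominate the prompt.
    if output_bytes >= per_output_threshold_bytes:
        return True
    # Headroom gate: adding this output would push cumulative input tokens
    # past headroom_percentage of the context window (the slow-burn case).
    projected_tokens = cumulative_input_tokens + output_bytes / bytes_per_token
    return projected_tokens > headroom_percentage * context_window_tokens
```

Note that the headroom gate can fire on an output the per-output gate would wave through — that is exactly the pile-up case it exists to catch.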

What the agent sees

In artifact mode, the LLM receives a small envelope in place of the original output:

{
"artifact_id": "art_tool_output_9k2x",
"size_bytes": 2457600,
"line_count": 14020,
"shape": { "rows": "array(500) of object(6 keys)", "next_cursor": "string" },
"how_to_access": {
"artifact_read": "Read full content or a line range (start_line/end_line)",
"artifact_grep": "Search for patterns with grep",
"artifact_jq": "Query with jq expressions, e.g. '.results[0]', '.[] | select(.score > 0.5)'"
}
}

shape is a hint for the LLM to eyeball, not a JSON-schema document. how_to_access only lists the artifact tools the agent actually has configured. Artifact tools themselves are never offloaded — their outputs come back inline so the agent doesn't end up chasing references to references.
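A drill-in against the envelope above might look like this hypothetical artifact_jq call (the arguments follow the envelope's shape hint; the exact tool-call wire format is an assumption, not shown on this page):

```json
{
  "tool": "artifact_jq",
  "arguments": {
    "artifact_id": "art_tool_output_9k2x",
    "expression": ".rows[] | select(.score > 0.5) | {id, score}"
  }
}
```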

In truncate mode, the head and tail of the original output are preserved with a notice marking what was dropped:

{ "rows": [
{ "id": 1, ... },
{ "id": 2, ... },
... [truncated 1.8 MB; head and tail preserved] ...
{ "id": 4998, ... },
{ "id": 4999, ... }
]
}

There is no follow-up tool call — the truncated form is final. Zero extra LLM turns, no way to recover detail from the middle.
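The head-and-tail shape can be sketched in a few lines (a simplified byte-budget version; the platform's actual truncation logic is not shown here):

```python
def truncate_middle(text, max_bytes):
    # Keep roughly equal head and tail slices and drop the middle,
    # leaving a notice that marks how much was removed.
    data = text.encode("utf-8")
    if len(data) <= max_bytes:
        return text
    keep = max_bytes // 2
    head = data[:keep].decode("utf-8", errors="ignore")
    tail = data[-keep:].decode("utf-8", errors="ignore")
    dropped = len(data) - 2 * keep
    return f"{head}\n... [truncated {dropped} bytes; head and tail preserved] ...\n{tail}"
```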

Configure artifact tools to unlock artifact mode

Adding any of the artifact tools is enough to flip the agent into artifact mode and give the LLM something to do with the references it now receives. Configure all three when you can; each hits a different access pattern.

THREE ARTIFACT TOOLS, ALL ACCESS PATTERNS COVERED
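A minimal sketch of that configuration, assuming a tools array in the agent config (the exact schema and field names are assumptions, not shown on this page):

```json
{
  "tools": [
    { "type": "artifact_read" },
    { "type": "artifact_grep" },
    { "type": "artifact_jq" }
  ]
}
```

With any one of these present, mode: auto resolves to artifact; with all three, the agent can read ranges, grep patterns, and run jq queries against the same offloaded output.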

Pair offloading with refinable tools

Offloading caps how much the agent ingests in one turn. What it ingests on the next turn is up to your tools. Vectara's built-ins are designed for this — every retrieval and listing tool exposes at least one parameter the agent can use to come back with a narrower question:

  • corpora_search takes a query and a full search configuration — metadata filters, result limits, reranking — so the agent can re-query a corpus with a filter or a tighter limit instead of re-reading the offloaded result.
  • list_documents takes a corpus_key, a metadata_filter string, a limit, and a page key.
  • web_search takes a query, a limit, and include_domains / exclude_domains.
  • sql_query runs whatever SQL the agent writes, so refinement is implicit — but a custom wrapper that pins the database and only exposes a WHERE-clause parameter follows the same pattern.
  • artifact_read, artifact_grep, and artifact_jq are the reference implementation: line ranges, regex patterns, jq expressions. Custom tools that operate on large content should match this shape.

When you build custom tools, follow the same instinct — expose a filter, a limit, a cursor, or a structured query parameter rather than always returning a fixed shape. This matters most under truncate mode, where the dropped middle is gone and the agent's only recovery path is a narrower re-call. It pays under artifact mode too: a refinable tool lets the agent skip the artifact round-trip entirely when it can describe up front what it actually needs.
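A custom wrapper in the SQL spirit above might look like this sketch (table, columns, and the orders_query name are illustrative, not a platform API; a real tool would also validate the clause rather than interpolate it blindly):

```python
def orders_query(where: str = "TRUE", limit: int = 100) -> str:
    # A refinable tool: the agent supplies only a WHERE clause and a row
    # limit, so every re-call can come back with a narrower question.
    limit = max(1, min(limit, 1000))  # clamp to keep outputs bounded
    return f"SELECT id, customer, total FROM orders WHERE {where} LIMIT {limit}"
```

Under truncate mode, a narrower re-call through a parameter like this is the agent's only way to recover detail that the dropped middle took with it.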

The knobs

The tool_output_offloading config exposes six fields. Defaults are tuned for typical agents — most setups never need to touch them.

| Field | Default | What it does |
| --- | --- | --- |
| enabled | true | Master switch. Set to false only to force pass-through. |
| mode | auto | artifact if any artifact tool is configured, otherwise truncate. Set explicitly to override. |
| context_percentage | 0.25 | Per-output threshold as a fraction of the model's context window. |
| min_threshold_bytes | 4096 | Floor — outputs below this always pass through. |
| max_threshold_bytes | 1048576 | Ceiling — outputs above this always offload, even on huge-context models. |
| headroom_percentage | 0.70 | Cumulative-context gate. Set to 1.0 to disable. |

The effective per-output threshold scales with the model's context window, clamped between the floor and ceiling. On large-context models, max_threshold_bytes is usually the binding constraint.
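The scaling and clamping can be illustrated with a small sketch (assuming, for illustration only, roughly 4 bytes per token to convert the context window into bytes; the platform's actual conversion is not specified here):

```python
def effective_threshold_bytes(context_window_tokens,
                              context_percentage=0.25,
                              min_threshold_bytes=4096,
                              max_threshold_bytes=1048576,
                              bytes_per_token=4):
    # Scale with the model's context window, then clamp to [floor, ceiling].
    raw = context_percentage * context_window_tokens * bytes_per_token
    return int(min(max(raw, min_threshold_bytes), max_threshold_bytes))
```

Under these assumptions, a tiny 2k-token model hits the floor, a 128k-token model lands between the bounds, and anything past roughly a million tokens of context is pinned at the 1 MiB ceiling — which is why max_threshold_bytes tends to be the binding constraint on large-context models.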

Trade-offs

  • artifact mode costs an LLM turn per drill-in. That's the whole point — but it's why min_threshold_bytes exists. Borderline outputs are often better inline.
  • Unstructured blobs are awkward to navigate. A 2 MB JSON document with a clean shape is easy for artifact_jq. A 2 MB free-form text dump is not. If you control a custom tool, prefer structured output.
  • Artifacts live for the session. Offloaded outputs share storage with user-uploaded files and clean up when the session ends. Persist anywhere you need them to outlive the session.
  • Truncation is lossy and not reversible. Use artifact mode whenever the agent might need the middle of a large output.