Compaction-Friendly Search Output: A Practical Playbook

TL;DR

The single dominant token-saving lever for agent search is to cut output at the source. In measurement, rg | head -50 shrinks output ~40x on high-frequency queries (47k → 1.2k tokens); --vimgrep does not save tokens and is slightly more expensive than the default; -C N inflates output 2-6x depending on match density. The practical playbook: a two-phase strategy (rg -l for discovery → narrowed Read), a single-line file:line: match_line output format on the first pass, a session-scoped file cache with dedup, a budget-aware wrapper, and pre-compaction triggers (tool result > 15-20% of the search budget, same file read 3+ times, fill above 70%). Compaction is not a recovery strategy; it is a failure signal.

Compaction for an agent means “context loss”. Search is the most frequent trigger for that loss. Any wrapper that does not cut the tool result at the source eventually invites compaction.

In Code Search for AI Agents I noted that “the tool result returned to the agent must always be compressed”, and in Token Budget Arithmetic for Agent Search I unpacked which flags this compression can be measured against. This article combines the two sides into a concrete playbook: in what order, in what format, and with what cache to compress the output.

The Dominant Lever: Cut Output at the Source

The clearest takeaway from the token budget article’s measurement: shortening output after it has landed in the agent’s context is a late intervention. The cost is already paid. The real savings happen inside the tool call’s command itself, not after.

The measurement table for a high-frequency query is striking (measured on the 298-file ceaksan-v4.0 src/ tree with tiktoken cl100k_base; Spoke 2 notebook):

| Variant | Tokens | Ratio to default |
| --- | --- | --- |
| `rg "import" src` | ~47,000 | 1.0x (baseline) |
| `rg --vimgrep ...` | ~57,000 | 1.22x |
| `rg -C 2 ...` | ~285,000 | 6.06x |
| `rg ... \| head -50` | ~1,200 | 0.025x (40x smaller) |
| `rg -l ...` | ~7,800 | 0.17x |

40x savings matter even on mid-to-large context windows. A single rg "import" call eats 23% of a 200k budget; with | head -50 it drops to 0.6%.
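
The percentages above follow directly from the table. As a quick check, here is a minimal sketch that recomputes them, assuming the approximate token counts from the table and a 200k window. The names are illustrative, and because the inputs are rounded, the ratios can differ from the table in the last digit.

```python
# Approximate token counts taken from the measurement table above.
measured = {
    "rg default": 47_000,
    "rg --vimgrep": 57_000,
    "rg -C 2": 285_000,
    "rg | head -50": 1_200,
    "rg -l": 7_800,
}
CONTEXT_WINDOW = 200_000  # assumed window size, as in the text

baseline = measured["rg default"]


def ratio_to_default(tokens: int) -> float:
    """How many times larger (or smaller) a variant is than default rg."""
    return tokens / baseline


def share_of_window(tokens: int, window: int = CONTEXT_WINDOW) -> float:
    """Percentage of the context window one tool result consumes."""
    return 100 * tokens / window


for name, tokens in measured.items():
    print(f"{name:15s} {ratio_to_default(tokens):6.3f}x  "
          f"{share_of_window(tokens):6.1f}% of window")
```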

Three points stand out:

Default rg + | head -50 is mandatory on high-frequency queries, not optional.
The --vimgrep flag does not save tokens; it is slightly more expensive, because default rg already returns file:line:.
When using -C N, estimate match density in advance. On a query with 100+ matches, -C 2 can grow output up to 6x and consume the search budget in a single call.
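
When the tool call is issued from Python rather than a shell, the same cap can be applied while reading the process output. This is a minimal sketch of that idea; `run_capped` is a hypothetical helper name, not part of any library, and it roughly mirrors what `| head -N` does in a shell.

```python
import subprocess


def run_capped(cmd: list[str], max_lines: int = 50) -> tuple[list[str], bool]:
    """Run cmd, keep at most max_lines of stdout, and stop the process early.

    Returns (lines, truncated), where truncated is True if more output existed.
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)
    if proc.stdout is None:
        raise RuntimeError("stdout pipe was not created")

    lines: list[str] = []
    truncated = False
    for line in proc.stdout:
        if len(lines) >= max_lines:
            truncated = True  # at least one more line was available
            break
        lines.append(line.rstrip("\n"))

    proc.stdout.close()
    if truncated:
        proc.kill()  # no need to let the search run to completion
    proc.wait()
    return lines, truncated


# Example, assuming rg is installed and on PATH:
# lines, truncated = run_capped(["rg", "import", "src"], max_lines=50)
```

Returning the `truncated` flag lets the wrapper tell the agent that the list is incomplete, which a bare `head` does not do.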

Two-Phase Strategy: Discovery → Targeted

A single rg call should not deliver everything. The pair below is both cheap and reasoning-friendly:

# Phase 1: Discovery (file list)
rg -l "newsletterCluster" src

# Phase 2: Targeted (only the 2-3 files that look relevant)
rg "newsletterCluster" src/content/posts/2026/05/06.*/tr.mdx -C 1

Phase 1’s cost is fixed and low. In the Spoke 2 measurement on the 298-file ceaksan-v4.0 src/ folder, this value came out to ~5.8k tokens; on similarly sized repos it falls in the ~5-8k band. The agent looks at the file list, decides which 2-3 to enter, and phase 2 only does -C 1 or Read on those files. Total cost comes out around a tenth of a single-shot rg -C 2.

This pattern is the same as Anthropic’s “list-then-read” approach in Claude Code, just moved to the search side.

Tool Result Format: file:line + Snippet

The file:line + 2 lines context format from the pillar is rich but expensive. Measured on the same result set, the variants compare as follows:

| Format | Typical tok/match |
| --- | --- |
| `file:line` (location only) | ~20 |
| `file:line: match_line` | ~40-60 |
| `file:line + 1 before/after` | ~120-180 |
| `file:line + 2 lines context` | ~200-300 |

For the first pass, file:line: match_line (default rg behavior) is enough. If the agent’s reasoning says “I need to look around this match”, that is when it goes to the specific file with Read. Most front-loaded context is likely wasted.

The practical rule: the tool result should always be in the shortest informative form; leave deepening to the agent’s decision.
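
The shortest-informative-form rule can be sketched as a small formatter. It assumes rg was invoked so that every match line has the shape `path:line:content`. When output is piped, that generally means passing `-n`, since rg prints line numbers by default only when writing to a terminal. `compact_matches` and its parameters are illustrative names.

```python
import re

# Matches rg output that carries both file name and line number:
#   path:line:content
_MATCH = re.compile(r"^(?P<path>.+?):(?P<line>\d+):(?P<text>.*)$")


def compact_matches(rg_output: str, max_snippet: int = 80,
                    locations_only: bool = False) -> str:
    """Reduce rg match lines to `file:line` or `file:line | snippet`."""
    out = []
    for raw in rg_output.splitlines():
        m = _MATCH.match(raw)
        if m is None:
            continue  # skip separators and anything that is not a match line
        loc = f"{m['path']}:{m['line']}"
        if locations_only:
            out.append(loc)
            continue
        # Collapse indentation and runs of whitespace, then cap the length.
        snippet = " ".join(m["text"].split())
        if len(snippet) > max_snippet:
            snippet = snippet[: max_snippet - 3] + "..."
        out.append(f"{loc} | {snippet}")
    return "\n".join(out)
```

Stripping indentation and capping snippet length are small per-line savings, but on a 50-line result they add up without removing the location the agent needs for a follow-up Read.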

File-Cache: Don’t Read the Same File Twice

In Spoke 2’s Read distribution measurement the tail mattered: 6% of files exceed 5k tokens, 2% exceed 10k, and the largest is 51k. If those files fall into the agent’s “let me read again, I am not sure” loop, the context can be exhausted in one shot.

A simple in-memory cache at session scope solves the problem:

from pathlib import Path
import tiktoken

class FileCache:
    """Session-scope tool-side cache. Does not send the same file twice."""

    def __init__(self, enc=None):
        # Resolve the encoder lazily; a default argument would load it at import time.
        self.enc = enc or tiktoken.get_encoding('cl100k_base')
        self._store = {}  # key: (path, mtime_ns, line_range) -> str

    def read(self, path: str, line_range: tuple[int, int] | None = None) -> str:
        p = Path(path)
        mtime = p.stat().st_mtime_ns
        key = (str(p), mtime, line_range)

        if key in self._store:
            return self._store[key]

        text = p.read_text(errors='ignore')
        if line_range:
            lines = text.splitlines()
            a, b = line_range
            text = '\n'.join(lines[a-1:b])

        self._store[key] = text
        return text

    def tokens_saved(self, hits: int, avg_size: int) -> int:
        return hits * avg_size

Three design decisions are intentional:

The cache key includes mtime, so a file that changes mid-session misses the cache and is re-read. Cross-session persistence is skipped because code changes between sessions and a persisted cache would go stale.
line_range is optional. If the agent asks for only part of a file, that part is cached; if the full file is requested later, it gets a separate entry. This is not over-optimization, just correctness.
There is a tokens_saved metric. After 10 tasks in a day the wrapper has a numerical answer to “how much did I save?”, which makes budget discipline visible.
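
One thing to note about the cache as sketched: a hit still returns the full text, so tokens are only saved if the caller sends something shorter on a repeat read. Below is a minimal variant that returns a one-line reference on a hit and counts hits itself. `DedupFileCache`, the stub wording, the `force` flag, and the 4-characters-per-token estimate are all assumptions for illustration, not the author's implementation.

```python
from pathlib import Path


class DedupFileCache:
    """Hypothetical session cache that answers repeat reads with a short stub."""

    def __init__(self, count_tokens=None):
        # Assumption: ~4 characters per token as a stand-in for a real tokenizer.
        self.count_tokens = count_tokens or (lambda s: len(s) // 4)
        self._store = {}  # key: (path, mtime_ns) -> str
        self.hits = 0
        self.misses = 0
        self.tokens_saved = 0

    def read(self, path: str, force: bool = False) -> str:
        p = Path(path)
        key = (str(p), p.stat().st_mtime_ns)
        if key in self._store and not force:
            self.hits += 1
            self.tokens_saved += self.count_tokens(self._store[key])
            return f"[unchanged since last read: {p}; see earlier tool result]"
        # First read, changed file, or an explicit re-read request.
        self.misses += 1
        text = p.read_text(errors="ignore")
        self._store[key] = text
        return text

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

The `force` flag is there because the earlier tool result may no longer be in context, for example after a compaction; in that case the agent needs a way to get the full text again.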

Budget-Aware Wrapper: Summarize, Then Truncate

The cache does not shorten the response; it only prevents repetition. The new output itself also has to live under a budget. A single wrapper for search and Read:

class BudgetedToolWrapper:
    """Shortens tool output against budget thresholds."""

    def __init__(self, context_window: int, search_ratio: float = 0.15, enc=None):
        self.enc = enc or tiktoken.get_encoding('cl100k_base')
        self.search_budget = int(context_window * search_ratio)
        self.tool_result_cap = int(self.search_budget * 0.20)  # max 20% per call
        self.used = 0

    def _tokens(self, s: str) -> int:
        return len(self.enc.encode(s))

    def _summarize(self, text: str, target_tokens: int) -> str:
        lines = text.splitlines()
        kept = []
        budget = target_tokens
        for line in lines:
            t = self._tokens(line) + 1
            if budget - t < 0:
                kept.append(f'... [+{len(lines) - len(kept)} lines, summarized]')
                break
            kept.append(line)
            budget -= t
        return '\n'.join(kept)

    def wrap(self, raw_result: str) -> str:
        tokens = self._tokens(raw_result)
        if tokens <= self.tool_result_cap:
            self.used += tokens
            return raw_result

        target = int(self.tool_result_cap * 0.6)  # 60% of capacity, headroom
        summarized = self._summarize(raw_result, target)
        self.used += self._tokens(summarized)
        return summarized

This wrapper follows the same principle as the search-side BudgetedSearchAgent in Spoke 2; here it is applied to Read and rg results. The constructor takes context_window at runtime, so no model capacity is hardcoded.

One caveat: in the form above, _summarize does a plain line truncation rather than a real summary. In production it is worth evolving it, for rg output, into “remaining match count + per-file grouping”. The principle is the same.
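
The “remaining match count + per-file grouping” idea could look roughly like this. It is a sketch under assumptions: input lines have the shape `path:line:text`, paths contain no colon, and `summarize_matches` with its two limits is a hypothetical name rather than part of the wrapper above.

```python
def summarize_matches(rg_output: str, max_per_file: int = 3,
                      max_files: int = 10) -> str:
    """Group match lines by file, keep a few per file, report what was dropped."""
    by_file: dict[str, list[str]] = {}
    for raw in rg_output.splitlines():
        parts = raw.split(":", 2)
        if len(parts) != 3 or not parts[1].isdigit():
            continue  # not a path:line:text match line
        path, line, text = parts
        by_file.setdefault(path, []).append(f"  {line}: {text.strip()}")

    out = []
    for i, (path, hits) in enumerate(by_file.items()):
        if i >= max_files:
            break
        out.append(f"{path} ({len(hits)} matches)")
        out.extend(hits[:max_per_file])
        if len(hits) > max_per_file:
            out.append(f"  ... +{len(hits) - max_per_file} more in this file")

    # Files beyond max_files are only counted, never listed.
    hidden_files = list(by_file)[max_files:]
    if hidden_files:
        hidden_matches = sum(len(by_file[p]) for p in hidden_files)
        out.append(f"... +{len(hidden_files)} more files, "
                   f"{hidden_matches} matches not shown")
    return "\n".join(out)
```

Unlike a plain cut, the counts tell the agent how much was left out and where, so it can decide whether a narrower query is worth another call.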

Pre-Compaction Triggers: Three Early Warnings

If the wrapper is well built, compaction should not fire. If it fires, one of three things is true. All three are measurable:

A single tool result exceeds 15-20% of the search budget. The wrapper’s tool_result_cap is not set right. Lower the cap or summarize more aggressively
The same file is read 3+ times. File-cache is not in play, or the cache key is wrong (without an mtime check every read may open a new entry). Did you add a cache hit-rate metric?
Context window fill passes 70% and 5+ tool calls are still needed. Task planning is wrong; the agent should summarize early and hand off to a fresh turn (handoff pattern)

The first two signals are easy to catch inside the wrapper; the third requires control in the agent’s own prompt. The recommended control line:

Before any tool call, estimate context fill: if > 70%, summarize work-in-progress
and request a fresh turn. Do not continue accumulating tool results.

This line looks simple, but in my own trials it noticeably reduced compaction events; I don’t have a specific N yet.
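
The wrapper-side signals can be tracked with a few counters. In the sketch below, `CompactionWatch` and its field names are hypothetical, and the thresholds are the ones quoted above. It only sees tool-result tokens, so its fill estimate is a lower bound on real context usage, and it cannot know how many tool calls are still needed.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CompactionWatch:
    context_window: int
    search_ratio: float = 0.15   # share of the window reserved for search
    result_share: float = 0.20   # single-result limit as a share of the search budget
    max_reads: int = 3           # warn when the same file is read this many times
    fill_limit: float = 0.70     # warn above this fraction of the window
    used_tokens: int = 0
    reads: Counter = field(default_factory=Counter)

    def record(self, tokens: int, path: Optional[str] = None) -> list[str]:
        """Record one tool result and return any early warnings it raises."""
        warnings = []
        self.used_tokens += tokens

        search_budget = self.context_window * self.search_ratio
        if tokens > search_budget * self.result_share:
            warnings.append("oversized_result")

        if path is not None:
            self.reads[path] += 1
            if self.reads[path] >= self.max_reads:
                warnings.append("repeated_read")

        if self.used_tokens / self.context_window > self.fill_limit:
            warnings.append("context_fill")
        return warnings
```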

Measurement: What to Look at One Week Later

Three metrics, three thresholds:

| Metric | Good | Borderline | Intervene |
| --- | --- | --- | --- |
| Avg tokens per search task | < 8k | 8-15k | > 15k |
| File-cache hit rate | > 40% | 20-40% | < 20% |
| Compaction events per task | 0 | 0.1-0.3 | > 0.3 |

Producing these numbers does not require a fancy observability tool. Add a simple log to the wrapper and summarize with 5 lines of SQL at the end of the day. The main thing is to see the number, not measure it perfectly.
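
A minimal sketch of such a log, using the standard-library sqlite3 module. The table layout, the `kind` values, and the function names are assumptions chosen for illustration; the query averages search tokens over every task that appears in the log.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # a file path would persist across runs
conn.execute("""
    CREATE TABLE tool_log (
        task_id   TEXT,
        kind      TEXT,     -- 'search', 'read', or 'compaction'
        tokens    INTEGER,
        cache_hit INTEGER
    )
""")


def log_event(task_id: str, kind: str, tokens: int = 0,
              cache_hit: bool = False) -> None:
    """Append one wrapper event to the log."""
    conn.execute("INSERT INTO tool_log VALUES (?, ?, ?, ?)",
                 (task_id, kind, tokens, int(cache_hit)))


def daily_summary() -> dict:
    """Compute the three metrics from the table above in one query."""
    row = conn.execute("""
        SELECT
          1.0 * SUM(CASE WHEN kind = 'search' THEN tokens ELSE 0 END)
              / COUNT(DISTINCT task_id),
          1.0 * SUM(CASE WHEN kind = 'read' THEN cache_hit ELSE 0 END)
              / MAX(SUM(CASE WHEN kind = 'read' THEN 1 ELSE 0 END), 1),
          1.0 * SUM(CASE WHEN kind = 'compaction' THEN 1 ELSE 0 END)
              / COUNT(DISTINCT task_id)
        FROM tool_log
    """).fetchone()
    return {
        "avg_search_tokens_per_task": row[0],
        "cache_hit_rate": row[1],
        "compactions_per_task": row[2],
    }
```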

Thinking Together with the Pillar Wrapper

In the pillar I proposed three-layer search (lexical → structural → semantic). In Spoke 2 I added budget discipline; in this article I added format and cache discipline. When all three come together the agent code search wrapper looks like this:

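import subprocess

def run(cmd: list[str]) -> str:
    """Assumed helper: run a command and return its stdout as text."""
    # rg exits with status 1 when nothing matches, so the exit code is not checked.
    return subprocess.run(cmd, capture_output=True, text=True).stdout

class AgentSearchStack: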
    def __init__(self, context_window: int):
        self.cache = FileCache()
        self.budget = BudgetedToolWrapper(context_window, search_ratio=0.15)

    def search(self, query: str, lang: str = 'auto') -> str:
        # Phase 1: discovery
        files = run(['rg', '-l', query, 'src']).splitlines()
        if not files:
            return 'no matches'

        if len(files) > 20:
            # Too wide; return the file list to the agent first
            return self.budget.wrap('\n'.join(files[:50]))

        # Phase 2: targeted match
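        # -n and -H force line numbers and file names, which rg can omit
        # when its output is piped or only one file is searched.
        out = run(['rg', '-n', '-H', query, *files])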
        return self.budget.wrap(out)

    def read(self, path: str, line_range=None) -> str:
        raw = self.cache.read(path, line_range)
        return self.budget.wrap(raw)

Not production-grade, just principle-bearing. The main message: search and read run under the same budget, share the same cache, and the agent never sees the raw version of either.

Next Step

The fourth article in this cluster, LLM-free SpecAgent: AST-based forecasting, will take this budget further by predicting the agent’s next query in advance and caching the result. The role of ast-grep in this wrapper: not cheap, but the only candidate where structural patterns are required for forecasting.

Key Takeaways

01 Cut output at the source: rg | head -50 is the dominant lever. Without capping high-frequency queries the agent context does not last.
02 --vimgrep does not save tokens; in measurement it costs 1.15-1.22x the default. -C N inflates output 2-6x depending on match density.
03 Two-phase strategy: first rg -l for the file set, then targeted rg or Read. Dumping every match in one shot is wrong.
04 The tool result wrapper must always be budget-aware. An output exceeding the threshold is summarized first, truncated second; raw calls never go straight to the agent.
05 Pre-compaction triggers: single tool result exceeds 15-20% of search budget, same file read 3+ times, fill above 70%. All three are early-intervention signals.

Frequently Asked Questions (FAQ)

+ If compaction fires automatically, why do I need manual compression?

Automatic compaction is a recovery mechanism, not a fix. Per Anthropic's compaction behavior documentation and my own observation, when compaction happens the agent can lose details it previously saw: which line of which file it looked at, which result it ruled out. Manual compression is about never triggering compaction in the first place, so reasoning continuity is preserved.

+ Is `rg --vimgrep` not more structured? Why avoid it?

It looks structured but adds file:line:column: metadata to every line. In measurement it produced 15-22% more tokens than default rg, which already returns file:line:. The third column (column) is not extra information for the agent; it only eats bytes. Default + | head -50 is cheaper.

+ What is the most practical template for a tool result format?

A single-line file:line | code_snippet format. The file:line + 2 lines context version given in the pillar is richer but 3x more expensive. In the first pass use single line; if the agent needs to, it can Read the specific file. This is in line with Anthropic's tool result clearing strategy in Claude Code.

+ At what level should file-cache live?

Session scope. An in-memory dict shared across agent turns is enough. Cross-session persistence is not needed (code changes, cache ages). Using the (file_path, mtime, line_range) triple as the cache key prevents collisions. [SWEzze](https://arxiv.org/abs/2603.28119) reports a 51-71% inference-time compression rate; the bulk of that gain comes from session cache + dedup, not magic.

+ What happens if the tool result wrapper is not this disciplined?

Three typical failures: (1) a single rg call eats the search budget in one shot, (2) the agent reads the same file 4-5 times, writing the full content into context each time, (3) compaction fires and which reasoning step depended on which piece is lost. All three are why the Tier B → Tier C drift described in the pillar happens.
