Token Budget Arithmetic for Agent Search

TL;DR

An AI coding agent’s context window is a fixed budget; search is a sub-budget of it. A typical allocation: 30% system + project context, 25% conversation history, 15% search results, 20% reasoning headroom, 10% reserve. The 15% search ratio is a good starting heuristic for ~100-200k windows and aligns with the ARCS budget allocation work. At the extremes the framework shifts: a very small window triggers an absolute floor (~4k tokens minimum), and a very large window triggers an absolute ceiling (~50k tokens, beyond which more search is waste). Overflow signals: a single tool result exceeding 15-20% of the search budget, the same file being read 3 or more times, context window fill passing 70%. Numerical walkthroughs for the three tiers, CLAUDE.md snippets, and measurement practices are included.

When the agent’s context window fills up, model performance drops. Search is the single biggest trigger for that filling. You cannot understand agent behavior without thinking about the budget numerically.

In code search for AI agents I noted that “search is a sub-budget, default upper bound is 15% of the context window”. In this article I am unpacking where the formula comes from, how it scales for different context windows, and how to catch overflow signals, numerically.

Context Window Is a Budget, Search Is a Sub-budget

The model’s context window is a fixed resource. Capacities in the AI/LLM ecosystem keep expanding (today’s common values may look small tomorrow, million-token windows are on the way), but every tool call, every file read, every reasoning round draws from this resource. Whatever the absolute number, budget discipline does not change; only the base each sub-budget scales on does.

The 2025 ARCS paper turned this reality into an architectural principle: agentic retrieval runs as a budgeted synthesize-execute-repair loop, and the accuracy/cost trade-off must be optimized explicitly.¹

The practical version: split the context window into five sub-budgets and know each as a ratio.

A typical distribution I’ve derived from my own agentic workflows and the ARCS principle:

Sub-budget	Typical share	What it covers
System + project context	30%	CLAUDE.md, .cursorrules, system prompt, file tree, project memory
Conversation history	25%	Prior user-assistant turns, the agent’s own notes
Search + tool results	15%	rg/ast-grep/repo-map/Read outputs, MCP tool returns
Reasoning headroom	20%	The empty space the model needs to think
Reserve	10%	Unexpected tool calls, error recovery, and a margin that keeps compaction from firing

As an example, a numerical mapping for a mid-to-upper context window (a common reference point at the time of writing):

Total                    200,000 tokens (reference example)
System + project          60,000 tokens
History                   50,000 tokens
Search results            30,000 tokens   <-- the topic of this article
Reasoning headroom        40,000 tokens
Reserve                   20,000 tokens
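The mapping above is plain multiplication, so it can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `SHARES_PCT` keys and the `allocate` helper are names invented for this example, and the percentages are the ones from the table.

```python
# Shares in percent, copied from the allocation table; key names are illustrative.
SHARES_PCT = {
    "system_project": 30,
    "history": 25,
    "search": 15,
    "reasoning": 20,
    "reserve": 10,
}


def allocate(context_window: int, shares_pct: dict = SHARES_PCT) -> dict:
    """Split a context window (in tokens) into sub-budgets (in tokens)."""
    if sum(shares_pct.values()) != 100:
        raise ValueError("shares must sum to 100%")
    # Integer arithmetic keeps the result exact for round window sizes.
    return {name: context_window * pct // 100 for name, pct in shares_pct.items()}


budgets = allocate(200_000)
print(budgets["search"])  # 30000
```

Passing a different `shares_pct` is how a per-agent-type shift would be expressed, as long as the shares still add up to 100.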

15% is not a fixed law of nature but a starting heuristic for typical mid-to-upper context windows (~100-200k). At the extremes the ratio does not apply as-is, for three reasons:

Fixed costs do not scale. A single Read call on a typical source file returns about 750 tokens, but the distribution has a long tail: in this project’s measurement the median was 740, p75 1,516, p95 5,911, p99 19,213, and the max ~52k tokens. So a “typical” Read is under 1k, but ~6% of files exceed 5k. This fixed cost is the same whether the context window is 32k or 1M: at 32k, 15% is only 4.8k tokens, and two mid-sized Reads end the budget; at 1M, 15% is 150k tokens, and even half of that is waste for a typical task.
Task need is independent of context size. A bug fix needs N file searches + M file reads; the need does not grow proportionally just because the context window did.
At the extremes strategy changes, not just the ratio. In a small window the move is not “raise the ratio” but “tighten the strategy” (disable semantic, switch to surgical lexical). In a large window the move is not the ratio but the “absolute ceiling”.

The correct three-layer framing:

Practical rule	Typical range	Note
Starting ratio 15%	100-200k window	A healthy default for typical IDE integrations
Absolute floor ~4k tokens	< 50k window	Guarantee at least 1 healthy search + 2 file reads; if the ratio cannot reach this, change strategy
Absolute ceiling ~50k tokens	> 500k window	If the ratio crosses this, lower the number; the surplus is better spent on reasoning or history

In this frame, 15% is a starting point, not a fossilized rule. As the window grows the ratio drops (the absolute ceiling kicks in); as it shrinks the ratio either rises or the strategy changes entirely. The underlying principle: as much search as is needed, more is budget waste, less is task failure.
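The three-layer rule amounts to clamping the 15% ratio between the floor and the ceiling. A minimal sketch follows, assuming the ~4k and ~50k figures from the table as hard constants; `search_budget` and its return labels are hypothetical names for this example.

```python
SEARCH_RATIO_PCT = 15     # starting heuristic from the text
FLOOR_TOKENS = 4_000      # absolute floor (approximate figure from the table)
CEILING_TOKENS = 50_000   # absolute ceiling (approximate figure from the table)


def search_budget(context_window: int) -> tuple:
    """Return (budget_in_tokens, which_rule_applied)."""
    raw = context_window * SEARCH_RATIO_PCT // 100
    if raw < FLOOR_TOKENS:
        # The ratio cannot reach the floor: this is also the cue to tighten strategy.
        return FLOOR_TOKENS, "floor"
    if raw > CEILING_TOKENS:
        # The surplus is better spent on reasoning or history.
        return CEILING_TOKENS, "ceiling"
    return raw, "ratio"


for window in (16_000, 32_000, 200_000, 1_000_000):
    print(window, search_budget(window))
```

With these exact constants the floor only binds below a window of roughly 27k tokens and the ceiling above roughly 333k; the ranges in the table are where it makes sense to start watching for them.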

The allocation is not rigid; it shifts by agent type. In a multi-step refactor task, history’s share can rise to 40%; in a one-shot Q&A task it can drop to 10%. But “15-20% for search” is an empirical starting point that leaves enough room for reasoning + history on typical code-search tasks.

Measurement: Which Flag Actually Saves?

For this article, three query types were run on the ceaksan-v4.0 codebase (298 files): a rare symbol (validateTurnstileToken, ~6 hits), a medium-frequency identifier (newsletterCluster, ~306 hits), and a pathological case (import, ~1547 hits). Taking the rg default as the baseline, the variants’ token ratios were:

Variant	Symbol (~6 hits)	Concept (~306 hits)	High-freq (~1547 hits)
`rg` (baseline)	1.00x (111 tok)	1.00x (9k tok)	1.00x (47k tok)
`rg --vimgrep`	1.22x	1.15x	1.15x
`rg -C 2`	3.81x	5.67x	2.75x
`rg -m 5`	1.00x	1.00x	0.81x
`rg | head -50`	1.00x	0.17x	0.02x
`rg -l` (files)	0.18x	0.76x	0.16x

Three takeaways:

The dominant lever is not a formatting flag; it is cutting output, with | head -N or -l. On the high-frequency query, rg | head -50 cuts the baseline by roughly 40x (47k → 1.2k tok). No formatting or context flag comes close.
--vimgrep is not cheaper, it is more expensive (15-22% higher on every query). The parseable single-line format does not mean fewer tokens; it adds per-line metadata. If you want compact output, prefer rg --no-heading -n or pipe to head.
-C 2 inflation depends on match density: 3.8x on sparse queries, between 2.75x and 5.67x on dense ones. Instead of a fixed “3-5x” ratio, think “match density × context factor”.
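One way to see why -C inflation moves with match density: grep-style tools merge overlapping context windows, so isolated matches pay for the full window while clustered matches share it. The sketch below counts printed lines for a single file. It ignores file headings and the separators between groups, so it models lines rather than the token ratios measured above, and `context_output_lines` is a hypothetical helper, not part of rg.

```python
def context_output_lines(match_lines, context, total_lines):
    """Count lines printed for one file by a grep-style search with -C <context>.

    match_lines: 1-indexed line numbers that match.
    Overlapping context windows are merged, so no line is counted twice.
    """
    printed = 0
    last_end = 0  # last line number already printed (0 = nothing yet)
    for line in sorted(set(match_lines)):
        start = max(1, line - context, last_end + 1)
        end = min(total_lines, line + context)
        if end >= start:
            printed += end - start + 1
            last_end = end
    return printed


sparse = [10, 50, 90]          # isolated matches
dense = list(range(10, 20))    # ten consecutive matching lines

print(context_output_lines(sparse, 2, 100) / len(sparse))  # 5.0 lines per match
print(context_output_lines(dense, 2, 100) / len(dense))    # 1.4 lines per match
```

In this model the per-match cost of -C 2 tops out at five lines when matches are isolated and shrinks as matches cluster, which fits the "match density × context factor" framing.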

ast-grep is a different class of tool, not a token alternative. On the same corpus, for the export const $N pattern, ast-grep default returned 11.9k tokens and --json=compact returned 24.9k tokens; the comparable rg -C 2 was 2.4k. ast-grep is not a savings tool; it is the tool for the structural distinction that regex cannot make (function declaration vs call, typed const vs untyped, JSX component vs HTML element). If there is no other way to make that distinction, you accept the cost; otherwise rg + filter is enough.

Numerical Walkthrough Across Three Tiers

Tier A: Wide context window (reference 200k+ range)

Total context window      200,000 tokens (example)
Search budget (15%)        30,000 tokens
Per-tool-result target      1,000 tokens (roughly 30 result lines)
Max tool calls per turn          30

A typical task: “find all middleware functions and refactor to use the new auth pattern”. This task takes roughly 8-12 turns.

Turn 1: rg "middleware" → 12 files, ~150 tokens
Turn 2: filter to real middlewares with ast-grep → 3 functions, ~400 tokens
Turns 3-5: read the 3 files → 3 × 800 tokens = 2400 tokens
Turns 6-8: refactor, write back, verify → minimal search

Total search spend: ~3000 tokens. About 10% of the budget. Healthy.

Counter-example (bad pattern): for the same task the agent calls semantic search directly → 200 results, 8000 tokens. Then to decide which one is the actual middleware it reads every file → 12 × 1200 tokens = 14,400 tokens. Total search spend: 22,400 tokens. 75% of the budget. No reasoning headroom left, compaction fires, context loss begins.

The gap between the two approaches: 7x. Same task, same model, only the search policy differs.
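The arithmetic behind the two Tier A scenarios can be replayed directly. The sketch below uses only the per-step token figures quoted in the walkthrough; the dictionary keys and helper names are illustrative.

```python
SEARCH_BUDGET = 30_000  # Tier A: 15% of a 200k window

# Per-step search spend in tokens, as quoted in the walkthrough.
LEXICAL_FIRST = {
    "rg middleware": 150,
    "ast-grep filter": 400,
    "read 3 files": 3 * 800,
}
SEMANTIC_FIRST = {
    "semantic search": 8_000,
    "read 12 files": 12 * 1_200,
}


def spend(plan: dict) -> int:
    """Total search tokens for a plan."""
    return sum(plan.values())


def share(plan: dict) -> float:
    """Fraction of the search budget a plan consumes."""
    return spend(plan) / SEARCH_BUDGET


for name, plan in (("lexical-first", LEXICAL_FIRST), ("semantic-first", SEMANTIC_FIRST)):
    print(f"{name}: {spend(plan)} tokens, {share(plan):.0%} of search budget")

print(f"gap: {spend(SEMANTIC_FIRST) / spend(LEXICAL_FIRST):.1f}x")
```

The exact totals are 2,950 and 22,400 tokens, which round to the ~3000 tokens, 10%, 75%, and ~7x figures used in the text.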

Tier B: Medium context window (reference 100-200k range; most IDE integrations live here)

Total context window      128,000 tokens (example)
Search budget (15%)        19,200 tokens
Per-tool-result target        750 tokens (roughly 20 result lines)
Max tool calls per turn          25

Smaller window, tighter discipline. Built-in semantic searches typically return top-K vector hits + a snippet per hit; with a common configuration like K=10 and snippet=100 tokens, a single call can easily reach a few thousand tokens. The exact number depends on the IDE version and embedding chunk size, but the order of magnitude is enough signal for the agent to plan: two semantic calls back-to-back in a mid-tier window can eat half the search budget.

Practical advice: disable IDE semantic search in settings, or switch it to Manual mode, and make ripgrep the default. In Cursor the relevant setting in recent versions lives under Settings → Models → Search (location may shift across versions).

Tier C: Small context window (reference 32k and below)

Total context window       32,000 tokens (example)
Search budget (15%)         4,800 tokens
Per-tool-result target        200 tokens (roughly 5-7 result lines)
Max tool calls per turn          15

Aggressive compression is mandatory in a small window. Semantic search barely fits. Stay tightly bound to lexical + structural:

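# Typical search invocation for Tier C
rg -C 1 -m 10 --no-heading -n "pattern" --type ts | head -20  # max 10 matches per file, 1 line of context, 20 lines overall
ast-grep run -p 'pattern' -l ts --json=stream | head -20  # one JSON object per match, max 20 matches

The -m 10 cap and the head -20 pipes are critical. -m limits matches per file, not per search, so the head pipe is what bounds the total. --vimgrep is left out because it measured 15-22% more expensive than the default. Without these caps a single search call drains the budget.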
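The three tier tables can be collected into one small config, which makes it easy to check that the per-result target times the call cap stays inside the search budget. This is a sketch under the article's example numbers; the `Tier` class and its field names are invented for illustration.

```python
from dataclasses import dataclass

SEARCH_RATIO_PCT = 15


@dataclass(frozen=True)
class Tier:
    name: str
    context_window: int      # tokens
    per_result_target: int   # tokens per tool result
    max_tool_calls: int      # call cap from the tier table

    @property
    def search_budget(self) -> int:
        return self.context_window * SEARCH_RATIO_PCT // 100

    @property
    def planned_spend(self) -> int:
        # Worst case if every allowed call hits the per-result target.
        return self.per_result_target * self.max_tool_calls


TIERS = [
    Tier("A", 200_000, 1_000, 30),
    Tier("B", 128_000, 750, 25),
    Tier("C", 32_000, 200, 15),
]

for tier in TIERS:
    print(tier.name, tier.search_budget, tier.planned_spend,
          tier.planned_spend <= tier.search_budget)
```

All three tiers pass the check: Tier A spends exactly its 30,000-token budget in the worst case, while Tiers B and C leave some slack.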

Budget Overflow Signals

Three red flags:

1. A single tool result exceeds 15-20% of the search budget

If a single search call or Read by itself consumes 15-20% of the search budget, the agent is querying wrong. In Tier A this is ~5000 tokens; in Tier C it is ~800 tokens. Intervention:

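// Wrapper sketch; backend.exec(query) stands for whatever runs rg/ast-grep and returns a string.
// Rough heuristic, not a real tokenizer: ~4 characters per token.
const estimateTokens = (text) => Math.ceil(text.length / 4);

// Placeholder summary: keep only the head of the output that fits maxTokens.
// A real "first-N-files" strategy would group lines by file before cutting.
function summarize(text, { maxTokens }) {
  return text.slice(0, maxTokens * 4);
}

function searchWrapper(query, backend, searchBudget) {
  const result = backend.exec(query);
  const tokens = estimateTokens(result);
  const cap = Math.floor(searchBudget * 0.2); // single tool result cap = 20% of the search budget

  if (tokens > cap) {
    // Over the cap: return a summary sized at 30% of the cap instead of the raw dump.
    return summarize(result, {
      maxTokens: Math.floor(cap * 0.3),
      strategy: "first-N-files",
    });
  }
  return result;
}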

2. The same file is read 3+ times

If the agent Reads the same file repeatedly, a caching layer is missing. This is both budget waste and a direct invitation to context rot, which I described in behavioral failure modes of LLMs. The fix:

Hold a session-level file content cache in the skill or MCP wrapper
Return from cache when the same file:line range is requested again
Cache key: sha256(file_path + line_range + mtime)

In my practical observation, on typical refactor tasks this yields around 30-40% search token savings, because the agent naturally reopens files it has already referenced; the exact ratio varies with the cache hit rate.
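The cache described above can be sketched as a small in-memory class. This is a minimal illustration, assuming 1-indexed inclusive line ranges and UTF-8 files; `SessionFileCache` and its methods are hypothetical names, and only the key recipe (path + line range + mtime, hashed with SHA-256) comes from the text.

```python
import hashlib
import os


class SessionFileCache:
    """In-memory cache of file slices, keyed by path, line range, and mtime."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(path: str, start: int, end: int) -> str:
        # Including mtime means an edited file produces a new key automatically.
        mtime = os.stat(path).st_mtime_ns
        raw = f"{os.path.abspath(path)}:{start}-{end}:{mtime}"
        return hashlib.sha256(raw.encode("utf-8")).hexdigest()

    def read(self, path: str, start: int, end: int) -> str:
        """Return lines start..end (1-indexed, inclusive), from cache if possible."""
        key = self._key(path, start, end)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        with open(path, encoding="utf-8") as fh:
            lines = fh.readlines()
        text = "".join(lines[start - 1:end])
        self._store[key] = text
        return text
```

The `hits` and `misses` counters give the cache hit rate, which is the number that determines how much of the saving is actually realized. Entries for older versions of a file are never evicted here; a longer-lived session would want a size limit.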

3. Context window fill above 70%

As my own heuristic, I treat 70% as the early-warning threshold; it’s not a formal threshold. There is still enough room for reasoning at this point, but it shrinks fast. Intervention strategy:

Clear old tool results (Anthropic’s tool result clearing pattern)
Summarize conversation history (do not leave it to auto-compaction; you take control)
Finish the active task, then start a new session

If you cross 85%, compaction is about to fire. At that point, to avoid losing nuance, write a manual notes-to-self (the structured note-taking pattern I adapted from Anthropic’s context management guide).
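The two thresholds can be turned into a simple status check that a wrapper runs before each tool call. This is a sketch, assuming the 70% and 85% figures are treated as inclusive cut-offs; `fill_status` and the `ACTIONS` mapping are illustrative names, and the actions paraphrase the interventions listed above.

```python
WARN_PCT = 70      # early-warning heuristic from the text, not a formal threshold
CRITICAL_PCT = 85  # compaction is about to fire

ACTIONS = {
    "ok": "continue",
    "warn": "clear old tool results, summarize history, or finish and start a new session",
    "critical": "write structured notes-to-self before compaction fires",
}


def fill_status(used_tokens: int, context_window: int) -> str:
    """Classify context window fill as 'ok', 'warn', or 'critical'."""
    if context_window <= 0:
        raise ValueError("context_window must be positive")
    # Integer comparison avoids floating-point edge cases at the thresholds.
    if used_tokens * 100 >= CRITICAL_PCT * context_window:
        return "critical"
    if used_tokens * 100 >= WARN_PCT * context_window:
        return "warn"
    return "ok"


status = fill_status(150_000, 200_000)
print(status, "->", ACTIONS[status])  # warn -> clear old tool results, ...
```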

CLAUDE.md / .cursorrules Snippets

CLAUDE.md (for Claude Code)

## Search and Tool Budget

- Total search budget per task: ~15% of available context window
- Single tool result cap: ~20% of search budget (scope down if exceeded)
- File read cap: keep proportional to context window; use line range when possible
- Start with rg, escalate to ast-grep only for structural patterns regex cannot express
- For high-frequency queries, always cap output with `| head -50` (dominant cost lever)
- Use `rg -l` for discovery, then targeted `rg` or Read on the shortlist
- Avoid `--vimgrep` (measurably more tokens than default); avoid `-C N` unless context truly needed

## Budget overflow recovery

When a tool result exceeds the cap:

1. Summarize to top-3 files
2. Re-query with narrower scope (--type ts, path filter)
3. Do not feed the full result back to next turn

.cursorrules (for Cursor)

Search budget per task: 15% of available context window.
Disable IDE semantic search by default. Use ripgrep first, ast-grep for structure.
Tool result cap: 20% of search budget. If exceeded, summarize before returning.
File read: use line range when known; avoid reading full large files.
Never re-read the same file in a session without a cache check.

Generic MCP wrapper (token-aware backend)

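import hashlib
import subprocess

import tiktoken


class BudgetAwareSearch:
    def __init__(self, context_window: int, search_ratio: float = 0.15):
        self.budget = int(context_window * search_ratio)
        self.tool_result_cap = int(self.budget * 0.2)  # single-result cap: 20% of the search budget
        self.used = 0
        # cl100k_base is an OpenAI encoding; for other models treat the count as an estimate.
        self.encoder = tiktoken.get_encoding("cl100k_base")
        self.file_cache = {}

    def search(self, query: str) -> str:
        # Stop issuing new searches once 80% of the search budget is spent.
        if self.used >= self.budget * 0.8:
            return "[budget warning: 80% of search budget used, summarize and stop]"

        cache_key = self._cache_key(query)
        if cache_key in self.file_cache:
            return self.file_cache[cache_key]

        raw_result = self._run_backend(query)
        tokens = len(self.encoder.encode(raw_result))

        # Oversized results are cut down to 30% of the cap instead of being returned raw.
        if tokens > self.tool_result_cap:
            result = self._summarize(raw_result, int(self.tool_result_cap * 0.3))
        else:
            result = raw_result

        self.used += len(self.encoder.encode(result))
        self.file_cache[cache_key] = result
        return result

    def _cache_key(self, query: str) -> str:
        return hashlib.sha256(query.encode("utf-8")).hexdigest()

    def _run_backend(self, query: str) -> str:
        # Placeholder backend: plain ripgrep over the current directory.
        # rg exits with status 1 when nothing matches, so the exit code is not checked.
        proc = subprocess.run(
            ["rg", "--no-heading", "-n", "-e", query],
            capture_output=True, text=True, check=False,
        )
        return proc.stdout

    def _summarize(self, text: str, max_tokens: int) -> str:
        # Placeholder "summary": keep only the first max_tokens tokens.
        return self.encoder.decode(self.encoder.encode(text)[:max_tokens])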

This class is not production-grade, just principle-bearing. The point: the backend wrapper must always be budget-aware; never expose the raw call directly to the agent. Passing context_window as a constructor parameter is also critical; the number is not hardcoded, it shifts at runtime depending on which model you are running against.

What to Measure One Week Later

Track three metrics:

Average tokens per search task: average by task type. As a practical starting target, I aim to keep this under 80% of the search budget; that is a personal target, not a benchmark
Tool call count per resolved task: a low number is good (effective); too low (below 5) is bad (insufficient exploration). On my own agentic refactor tasks, 8-15 tool calls has been the healthy range; it varies with task complexity
Compaction trigger rate: target 0 compactions per task. If 1+, budget discipline failed; intervene

For these metrics you can use the Anthropic Claude Code admin dashboard, Cursor’s usage tab, or your own MCP wrapper logging. Even without those three, a simple daily note is enough: “out of 10 tasks today, how many triggered compaction?”
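If the source is your own wrapper logging, the three metrics reduce to a few aggregations. The sketch below assumes one log record per resolved task; `TaskRecord`, its fields, and `weekly_metrics` are hypothetical names, and the sample numbers are made up for illustration.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TaskRecord:
    task_type: str       # e.g. "refactor", "qa"
    search_tokens: int   # tokens spent on search + tool results
    tool_calls: int
    compactions: int     # how many times compaction fired during the task


def weekly_metrics(records: list) -> dict:
    """Aggregate the three metrics over a list of TaskRecord entries."""
    if not records:
        raise ValueError("no records to aggregate")
    tokens_by_type = {}
    for rec in records:
        tokens_by_type.setdefault(rec.task_type, []).append(rec.search_tokens)
    return {
        "avg_search_tokens_by_type": {t: mean(v) for t, v in tokens_by_type.items()},
        "avg_tool_calls": mean(rec.tool_calls for rec in records),
        # Share of tasks in which compaction fired at least once.
        "compaction_rate": sum(1 for rec in records if rec.compactions > 0) / len(records),
    }


# Made-up sample data, for illustration only.
sample = [
    TaskRecord("refactor", 3_000, 10, 0),
    TaskRecord("refactor", 5_000, 12, 1),
    TaskRecord("qa", 1_000, 4, 0),
]
print(weekly_metrics(sample))
```

The same structure works for the daily-note version: one record per task, with the compaction count as the only field that really has to be filled in.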

Next Step

The third article in this cluster, Compaction-friendly search output: a practical playbook, will cover tool result compression patterns with code examples once it is published. The fourth article, LLM-free SpecAgent: AST-based forecasting, takes this budget further by predicting the agent’s next query in advance and caching the result.

Footnotes

¹ ARCS. Agentic Retrieval-Augmented Code Synthesis with Iterative Refinement. arXiv:2504.20434. “Budgeted synthesize-execute-repair loop targeting predictable accuracy-latency trade-offs under fixed iteration and retrieval budgets.” https://arxiv.org/html/2504.20434

Key Takeaways

01 Context window is a fixed budget. Search is a sub-budget; default upper bound 15%.
02 Typical allocation: 30% system+project, 25% history, 15% search, 20% reasoning headroom, 10% reserve. Shift by agent type.
03 If a single tool result exceeds 15-20% of the search budget, that is a red flag. Truncate or narrow scope.
04 The 15% ratio is a starting heuristic, not a law of nature. Small windows hit an absolute floor; large windows hit an absolute ceiling.
05 Reading the same file 3+ times = missing caching layer. Your skill needs dedup and a session cache.

Frequently Asked Questions (FAQ)

Do I really have to count tokens manually?

No. Manual counting per tool call is impractical. Instead, cap tool results against an upper bound (for rg the strongest lever is piping through | head -50; add -C N only when needed, since it inflated output roughly 3-6x in the measurements above), compress backend output in a post-processing step, and give the agent a reminder prompt: 'Each tool call must stay under N tokens'.

Where does the 15% search budget rule come from?

It is empirical. The ARCS paper makes the case academically for explicit budget allocation, and Anthropic's context engineering blog suggests the 15-20% range as a compaction default. In practice this ratio leaves enough room for reasoning + history on typical code-search tasks.

Does this rule hold for models with a small context window?

It does, but more aggressively. In a small window 15% becomes a tiny absolute number, which means a single semantic search call can eat the whole budget. In that tier stay tightly bound to lexical + structural and treat semantic as a last resort.

If context auto-compaction exists, do I still need to track the budget?

Yes. Auto-compaction is a recovery mechanism, not a fix. When compaction fires you lose detail and the agent's chance of reaching the right answer drops. Budget discipline is about never needing compaction in the first place.

What should the agent do when the budget is exceeded?

Three steps: first summarize from existing results (top 3 files instead of 10), then clear old tool results from context (Anthropic's tool result clearing pattern), and as a last resort hand off to a fresh agent turn. In every case, stop raw dumping.
