Code Search for AI Agents: ripgrep, ast-grep, or Semantic?

TL;DR

When an AI coding agent searches a codebase, there are three layers: lexical (ripgrep), structural (ast-grep), and semantic (repo-map, embeddings). The right question is not which is better, but in what order. Start with ripgrep, escalate to ast-grep when the pattern is structural, jump to repo-map only when the query is conceptual. Embeddings are a last resort. Track the token budget at every step and compress results to file:line + 2 lines of context. The CoREB benchmark shows short keyword queries collapse every semantic model. The academic literature points the same way through ARCS and SpecAgent: budget-aware and forecasting-aware orchestration.

ripgrep is fast, ast-grep is structural, semantic search is conceptual. All three are good tools. But for an AI agent the wrong question is “which one is better”. The right question is: for this agent, this task, this context window, which backend in what order?

If an agent fills its context window faster than expected, the problem is not the model. The problem is the tokens it spends finding the right code for you. In a typical scenario, routing a query that ripgrep should handle through semantic search can turn a clean dozen-file answer into hundreds of noisy results (illustrative numbers, not measurements). On top of that, the agent burns another round of reasoning to read and summarize that noise. Double loss.

This article is the natural continuation of grep, ripgrep and text search in the age of AI. There I introduced the three layers (lexical, structural, semantic) separately. Similarly, pre-injection vs MCP tool loop covered how to deliver context to the agent. This article covers which backend to use, in which order, on the inside. Delivery and retrieval are two sides of the same problem.

The Wrong Question: “Which One Is Better”

A developer can say “I’ll skim 10 results and pick the right one”. An agent does not work that way. For an agent every result is a token cost, every extra tool call another reasoning round, every compaction event a context loss. The decision matrix works differently from a human’s (assumption: token-priced reasoning does not resemble human eye-skim).

Quick Recap of the Three Layers

Lexical (rg, ugrep, grep): exact match, fast, context-blind. Parallel scanning with SIMD (Single Instruction, Multiple Data), gitignore awareness, millisecond results.
Structural (ast-grep): pattern matching on top of an AST (Abstract Syntax Tree). Syntax-aware. Language-specific rules. Resolves shape-based queries that regex cannot.
Semantic (mgrep, embeddings, repo-map): conceptual matching. Natural language queries, embedding similarity, file centrality via PageRank. Expensive, with high setup cost.

The Agent Variable Breaks Everything

The CoREB benchmark published in May 2026 shared a striking finding: short keyword queries, the format closest to real developer search, collapse nearly every semantic model tested to near-zero nDCG@10.¹ In other words, short queries like “auth flow”, “user service”, and “handle error” are semantic search’s weakest spot. Yet most queries coming out of agent prompts are in exactly that format.

What does nDCG@10 mean? Normalized Discounted Cumulative Gain at 10. An information retrieval metric. It looks at the top 10 results, scores each by relevance, rewards higher-ranked relevant results more, and scales the outcome between 0 and 1. 1.0 = perfect ranking, 0.0 = no relevant result in the top 10 or very deep in the list. A near-zero value means in practice: the model fills the agent’s context with 10 irrelevant results, either failing to return the right code or ranking it so low it cannot be used.
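To make the metric concrete, here is a minimal sketch of nDCG@k using the common log2 rank discount. It assumes `relevances` holds the graded relevance of every candidate in the order the model ranked them, so the ideal ranking is just the same list sorted; the function names are mine, not taken from the CoREB paper.

```python
import math
from typing import Sequence


def dcg(relevances: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain over the top-k results (rank 1 is not discounted)."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """DCG of the model's ranking divided by the DCG of the ideal ranking."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0
```

A single relevant result at rank 1 scores 1.0, the same result at rank 10 scores about 0.29, and at rank 11 it scores 0.0 because it falls outside the top 10.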

This finding does not stand alone. In behavioral failure modes of LLMs I noted context rot and instruction attenuation; this is exactly where they kick in. If the agent is fed too many results, it loses track of which one matters and forgets which instruction it was following. So semantic search’s “90% recall” claim can turn into “90% poison” in the agent ecosystem (analytical inference, not a direct measurement).

New Criteria

The comparison needs to be redone from the agent’s perspective:

Backend	Latency	Recall	Precision	Tokens/result	Setup	Determinism	Update
ripgrep	1-5ms	Low*	High	Very low	None	Full	Instant
ast-grep	10-50ms	Medium	High	Low	Low	Full	Instant
repo-map	50-200ms	Medium-High	Medium	Medium	Medium	Full	Index
Embeddings	100ms-1s	High*	Low**	High	High	None	Recompute

*recall when the query is written well. **on short keyword queries per the CoREB finding.

Latency and token values in the table reflect observations on typical repo sizes; they vary with hardware, repo size, and index state.

The Decision Tree

There is one question at the heart of the tree: what type is the query? Classify the query type correctly and the backend choice follows.

Classify the Query Type

Query type              -> Default backend
Exact symbol/string     -> rg
Pattern + scope         -> rg + path filter
Syntactic shape         -> ast-grep
Semantic intent         -> repo-map first, embeddings last resort
Cross-file refactor     -> ast-grep + repo-map
"Where is X used?"      -> rg -> ast-grep escalation

This table looks simple, but classifying the raw query coming out of an agent prompt is a separate job. A practical rule: if there is an exact quoted string in the agent query it is lexical; if there are structural words (function, class, async, return) it is structural; if the form is a question sentence it is semantic.
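The practical rule above can be sketched as a small classifier. This is a rough heuristic under stated assumptions, not a tested router: the list of question words and the fallback to lexical (on the "start with ripgrep" principle) are my additions, and only double quotes and backticks count as an exact quoted string.

```python
import re

STRUCTURAL_WORDS = {"function", "class", "async", "return"}
# Assumption: a small set of English question openers, extend as needed.
QUESTION_STARTERS = {"where", "what", "which", "how", "why", "who", "when"}


def classify_query(query: str) -> str:
    """Return 'lexical', 'structural', or 'semantic' for a raw agent query."""
    text = query.strip()
    # Rule 1: an exact quoted string means the agent already knows the literal.
    if re.search(r'"[^"]+"|`[^`]+`', text):
        return "lexical"
    words = re.findall(r"[a-z_]+", text.lower())
    # Rule 2: structural vocabulary points at a syntactic shape.
    if STRUCTURAL_WORDS.intersection(words):
        return "structural"
    # Rule 3: a question sentence signals semantic intent.
    if text.endswith("?") or (words and words[0] in QUESTION_STARTERS):
        return "semantic"
    # Fallback (assumption): start with the cheapest layer.
    return "lexical"
```

The rule order matters: a quoted literal wins over structural words, so a query like find "async" in the docs still goes to ripgrep.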

Backend Escalation Policy

Classification is not enough. If recall is low and the budget allows, escalate one layer up. Escalation works like this:

[query] -> classify
   |-- exact?      -> rg -> done (large majority of cases)
   |-- structural? -> ast-grep -> rg fallback
   `-- semantic?   -> repo-map (PageRank) -> rg expansion -> embeddings (last)

[every step] -> budget check (tokens used vs cap)
[every step] -> recall check (results count, file diversity)

There are three checkpoints.

If the result count is zero, escalate immediately. Do not insist.
If results cluster in a single file, diversity is low; expand.
If tokens used pass 50% of the cap, stop and continue with current results.
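The diagram and the three checkpoints can be sketched as one loop. Treat this as an illustration under assumptions: the `backends` mapping of adapter callables is hypothetical, token cost is estimated at roughly 4 characters per token, and the 50% stop is measured against whatever cap the caller passes in.

```python
from typing import Callable, Dict, List, Tuple

Hit = Tuple[str, int, str]            # (file, line, text)
Backend = Callable[[str], List[Hit]]  # hypothetical adapter around rg, ast-grep, ...

# Escalation order per query type, copied from the diagram above.
ESCALATION: Dict[str, List[str]] = {
    "exact": ["rg"],
    "structural": ["ast-grep", "rg"],
    "semantic": ["repo-map", "rg", "embeddings"],
}


def estimate_tokens(hits: List[Hit]) -> int:
    """Rough estimate, assuming about 4 characters per token."""
    return sum(len(f"{path}:{line} {text}") for path, line, text in hits) // 4


def search(query: str, query_type: str, backends: Dict[str, Backend],
           budget_cap: int) -> Tuple[List[Hit], int]:
    """Walk the escalation chain, applying the three checkpoints after each step."""
    results: List[Hit] = []
    used = 0
    for name in ESCALATION[query_type]:
        for hit in backends[name](query):
            if hit not in results:
                results.append(hit)
        used = estimate_tokens(results)
        if used > budget_cap * 0.5:
            break      # budget checkpoint: stop and keep what we have
        if not results:
            continue   # zero results: escalate immediately
        if len({path for path, _, _ in results}) == 1:
            continue   # everything in one file: low diversity, expand
        break          # enough diverse results: done
    return results, used
```

With this shape, embeddings are only reached when both repo-map and the ripgrep expansion fail to produce diverse results within budget.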

The structured documentation approach I covered in living architecture documentation for AI coding agents is a lifesaver here, because it gives the agent a substrate for architecture decisions. For repo-map to work, code symbols must be clearly named, and ADRs (Architecture Decision Records) and module boundaries must be defined.

Two Real Examples

Example 1: The agent says “find all auth middleware”.

Naive approach: semantic search directly. 200 results return. Every file containing auth, middleware, session, login shows up. The agent has to read an 8k-token wall of text. Then a second tool call decides which one is the actual middleware. More tokens. Another round.
Right approach:

rg "middleware" --type ts -l   # 12 files, ~120 tokens
ast-grep --pattern 'export function $NAME($_, $_, next) { $$$ }' --lang ts  # 3 real middlewares

In this example, the two steps cost roughly 400 tokens; against the naive approach's 8k-token wall that is about a 20x saving, before counting the extra reasoning round (illustrative arithmetic, varies by repo).

Example 2: “where is user session refreshed?”

Naive approach: semantic directly. Every file containing session and refresh. The important ones and the irrelevant ones together.
Right approach: first, use repo-map to find which files the session symbol is central to (PageRank-ranked list, top 3). Then run rg "refresh" on those three files. Result: 2 hits, the correct function.

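The shared idea across both: every step exists to narrow scope for the next one, so only a handful of results ever reach the agent. In Example 1 the cheap lexical pass narrows the file set before ast-grep confirms the shape; in Example 2 repo-map narrows to three files and the cheap layer takes over. What you avoid is running the expensive layer unscoped.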

Three Invisible Dimensions for an Agent

The documentation of existing search tools does not address three dimensions. They are the ones that matter most for an agent.

1. Token Budget Arithmetic

The context window is a budget. Search is a sub-budget of that budget. The 2025 ARCS paper makes this explicit: agentic retrieval runs as a budgeted synthesize-execute-repair loop, and the accuracy/cost trade-off must be optimized explicitly.²

As a starting heuristic I use search_budget ≈ context_window * 0.15 (not a canonical formula, my own default); calibrate to your workload. In a 200k-token context window allocate ~30k tokens for search. This is an upper bound; monitor it continuously. The context rot I described in behavioral failure modes of LLMs is directly tied to overshooting this budget: the agent starts forgetting things it read in the middle of the context. Some studies report degradation around half-fill in 50k-class windows; the exact threshold varies by model and task.

The practical way to track the budget is simple: measure every tool result by character or token count first; if it exceeds the threshold, summarize or truncate. The next article in this cluster, Token budget arithmetic for agent search, walks through this calculation with numerical examples.
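A minimal sketch of that tracking, using the 15% starting heuristic from above. The `SearchBudget` class and the 4-characters-per-token estimate are my assumptions; swap in a real tokenizer if you need exact counts, and summarize instead of truncating if the tail of the result matters.

```python
from dataclasses import dataclass

CHARS_PER_TOKEN = 4  # rough assumption, not a tokenizer


@dataclass
class SearchBudget:
    context_window: int
    share: float = 0.15  # starting heuristic, calibrate to your workload
    used: int = 0

    @property
    def cap(self) -> int:
        return round(self.context_window * self.share)

    @property
    def remaining(self) -> int:
        return max(self.cap - self.used, 0)

    def admit(self, tool_result: str) -> str:
        """Charge a tool result against the budget, truncating it if it would overshoot."""
        max_chars = self.remaining * CHARS_PER_TOKEN
        if len(tool_result) > max_chars:
            tool_result = tool_result[:max_chars]
        # Ceiling division so short results still cost at least one token.
        self.used += -(-len(tool_result) // CHARS_PER_TOKEN)
        return tool_result
```

For a 200k-token window, `SearchBudget(200_000).cap` is 30,000 tokens, matching the arithmetic above; every tool result passes through `admit` before it reaches the agent.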

2. Result Compression at the Tool Boundary

The SWEzze study reports that code context can be compressed about 6x at inference time, cutting the total token budget by 51.8 to 71.3 percent.³ The academic claim is bold. A solo dev does not need to go that deep, but the core principle applies: always compress the tool result returned to the agent.

The practical format is this:

file:line + at most 2 lines of context
Do not return the same result from different lines repeatedly (dedup)
gitignore-aware (auto-exclude boilerplate files)
Strip blank lines, excess whitespace, decorative comments

This format aligns naturally with Anthropic’s tool result clearing strategy in Claude Code. If the agent needs more, it can open the specific line again. The fourth article in the cluster, Compaction-friendly search output: a practical playbook, covers this format with code examples.
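As an illustration of the format, here is a small sketch that turns raw matches into compact `file:line` lines. It assumes the matches already come from a gitignore-aware backend as `(file, line, text)` tuples, keeps only the matched line (zero context lines, within the 2-line limit), and treats "the same text in the same file" as a duplicate; the function name and the width cap are hypothetical choices.

```python
from typing import Iterable, List, Tuple

Match = Tuple[str, int, str]  # (file, 1-based line number, matched line text)


def compress_matches(matches: Iterable[Match], max_results: int = 50,
                     max_width: int = 160) -> List[str]:
    """Dedup, strip whitespace noise, and cap the number of results."""
    seen = set()
    out: List[str] = []
    for path, line_no, text in matches:
        snippet = " ".join(text.split())  # collapse runs of whitespace
        if not snippet:
            continue                      # drop blank lines
        key = (path, snippet)
        if key in seen:
            continue                      # same content repeated on another line
        seen.add(key)
        out.append(f"{path}:{line_no} {snippet[:max_width]}")
        if len(out) >= max_results:
            break
    return out
```

The agent gets one short line per distinct hit and can reopen a specific `file:line` when it needs the surrounding code.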

3. Speculative Pre-fetching

The SpecAgent paper proposes an interesting idea: predict the agent’s likely next query in advance and cache the result.⁴ The academic version uses LLM-based forecasting, which is expensive.

For a solo dev, the LLM-free version is much more practical. Extract the active file’s imports and top-level symbols via AST traversal, pre-grep the referenced symbols in the background. No embeddings, no LLM calls, just Tree-sitter + ripgrep. In a typical Tree-sitter + ripgrep combination, overhead is on the order of 50ms (in my own test; varies by hardware). By the time the agent asks the second query, the result is already ready.
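A minimal sketch of the idea follows. To stay dependency-free it swaps Tree-sitter for Python's standard `ast` module, so it only handles Python source; the function names are hypothetical, and it only builds the ripgrep argument lists rather than running them in the background.

```python
import ast
from typing import List


def symbols_to_prefetch(source: str) -> List[str]:
    """Collect imported names and top-level definitions from Python source."""
    names: List[str] = []
    for node in ast.parse(source).body:  # top-level statements only
        if isinstance(node, ast.Import):
            # "import os.path" -> "os"
            names += [alias.name.split(".")[0] for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # Use the original name, not the local alias, since that is what
            # the rest of the repo references.
            names += [alias.name for alias in node.names if alias.name != "*"]
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            names.append(node.name)
    return list(dict.fromkeys(names))  # dedup while keeping order


def prefetch_commands(source: str, max_count: int = 20) -> List[List[str]]:
    """Build one ripgrep invocation per symbol (literal, whole-word, capped per file)."""
    return [
        ["rg", "--fixed-strings", "--word-regexp", "--line-number",
         "--max-count", str(max_count), "-e", name]
        for name in symbols_to_prefetch(source)
    ]
```

A caller would run these commands in a background worker when the active file changes and cache the output keyed by symbol, so the agent's next lookup is a cache hit.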

This approach pairs nicely with a local MCP architecture. A local MCP is already a process with file system access. Background prefetch needs no extra permissions. The third article in the cluster, LLM-free SpecAgent: AST-based forecasting, builds this mini pipeline.

Where Existing Tools Fit in This Frame

It is useful to re-evaluate existing tools against this frame. Not a competitor list, but an agent-perspective fit analysis:

Tool	Fit	Gap
`ripgrep` raw	Foundation of the lexical layer, ideally fast	Raw for an agent by itself, no policy
`ast-grep`	Standard of the structural layer	No standalone agent policy, no multi-backend orchestration
Aider repo-map	Embedding-free graph rank, deterministic	Hard to use outside Aider
Claude Code Context Engine	Built-in, exposed via MCP	Black box, you do not see which policy runs
Sourcegraph Amp, DeepContext	Code graph + semantic, enterprise grade	Overkill for a solo dev, cost is high
Smithery, netresearch skills	Wrapper, free, fast setup	Single backend, no policy layer
`mgrep`, embedding-based tools	Real implementation of the semantic layer	Short query problem (CoREB), heavy setup

The clear message: there is no shortage of tools. There is a shortage of policy. Existing tools think one layer at a time. The orchestration layer that combines all three layers with escalation and budget control is the real gap in the ecosystem.

A Practical Start: What to Do This Week

All of the above theory has a 4-line practical version. You can apply it in your agent setup this week.

Four Changes

CLAUDE.md or .cursorrules: “Always start search with ripgrep. Escalate to ast-grep only if the query is structural. Use semantic search only when symbol or exact string is unknown.” This single paragraph fixes a large part of agent behavior (observation-based estimate, no measurement).
Tool result format: always cap high-frequency queries with | head -50 (the dominant savings lever); first find the file set with rg -l, then run a targeted rg or Read. The --vimgrep flag does not save anything; -C N inflates output 2-6x depending on match density, so add it only when needed. For ast-grep, use --json + post-process. The output returned to the agent must always be compressed.
Default budget: do not allocate more than 15% of the context window to search. If exceeded, stop and continue with what you have.
gitignore aware everything: rg respects gitignore by default, and ast-grep does too unless you pass --no-ignore, so leave that flag off. node_modules, dist, and build should never reach the agent.
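The "--json + post-process" step for ast-grep might look like the sketch below. It assumes the JSON output is an array of match objects with `file`, `lines`, and a zero-based `range.start.line`; verify those field names against the ast-grep version you run, since I am treating them as an assumption here.

```python
import json
from typing import List


def compact_ast_grep(json_text: str, limit: int = 50) -> List[str]:
    """Reduce ast-grep JSON matches to 'file:line first-line-of-match' strings."""
    out: List[str] = []
    seen = set()
    for match in json.loads(json_text):
        # Assumed field layout; line numbers are converted to 1-based.
        loc = f'{match["file"]}:{match["range"]["start"]["line"] + 1}'
        if loc in seen:
            continue
        seen.add(loc)
        text = match.get("lines") or ""
        first_line = text.splitlines()[0].strip() if text.strip() else ""
        out.append(f"{loc} {first_line}".rstrip())
        if len(out) >= limit:
            break
    return out
```

The agent then sees one line per match instead of the full JSON payload with ranges, byte offsets, and multi-line bodies.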

Measure One Week Later

Track three metrics:

Average token consumption per search task
Tool call count per resolved task
How many times each backend was chosen (rg / ast-grep / semantic distribution)

If the semantic ratio is above 20%, you are likely misclassifying queries (heuristic threshold, not a hard rule). You may be routing queries that ripgrep + ast-grep could solve to semantic search.
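The three metrics and the 20% check can be computed from a simple task log. This is a sketch with a hypothetical `SearchTask` record; how you capture backend calls and token counts depends on your agent setup.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class SearchTask:
    backend_calls: List[str]  # e.g. ["rg", "ast-grep"], in call order
    tokens: int               # tokens spent on search for this task
    resolved: bool


def weekly_report(tasks: Iterable[SearchTask]) -> Dict[str, object]:
    """Summarize token use, tool calls, and backend distribution."""
    tasks = list(tasks)
    calls = Counter(name for task in tasks for name in task.backend_calls)
    total_calls = sum(calls.values())
    resolved = [task for task in tasks if task.resolved]
    semantic_ratio = calls["semantic"] / total_calls if total_calls else 0.0
    return {
        "avg_tokens_per_task": sum(t.tokens for t in tasks) / len(tasks) if tasks else 0.0,
        "calls_per_resolved_task": (
            sum(len(t.backend_calls) for t in resolved) / len(resolved) if resolved else 0.0
        ),
        "backend_share": {name: n / total_calls for name, n in calls.items()},
        # Heuristic threshold from the text, not a hard rule.
        "semantic_over_threshold": semantic_ratio > 0.20,
    }
```

If `semantic_over_threshold` comes back true, inspect the queries that reached the semantic backend before changing anything else.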

Next Step

I am deepening this frame into a cluster. The next three articles:

Token budget arithmetic for agent search: numerical examples, formulas, measurement notebook
LLM-free SpecAgent: AST-based forecasting: an LLM-free prefetch pipeline
Compaction-friendly search output: a practical playbook: result format snippets and dedup strategies

Each will be linked from this pillar as it goes live.

Footnotes

1. CoREB benchmark. Beyond Retrieval: A Multitask Benchmark and Model for Code Search. arXiv:2605.04615. Finding: “Short keyword queries, the format closest to real developer search, collapse every model to near-zero nDCG@10.” https://arxiv.org/abs/2605.04615
2. ARCS. Agentic Retrieval-Augmented Code Synthesis with Iterative Refinement. arXiv:2504.20434. “Budgeted synthesize-execute-repair loop targeting predictable accuracy-latency trade-offs under fixed iteration and retrieval budgets.” https://arxiv.org/html/2504.20434
3. SWEzze / Oracle-guided Code Distillation. Compressing Code Context for LLM-based Issue Resolution. arXiv:2603.28119. “Maintaining a stable compression rate of about 6 times across models and reducing the total token budget by 51.8 to 71.3 percent.” https://arxiv.org/abs/2603.28119
4. SpecAgent. A Speculative Retrieval and Forecasting Agent for Code Completion. arXiv:2510.17925. “Code language models face efficiency constraints from limited context windows and latency budgets, with RAG being used to dynamically fetch relevant snippets.” https://arxiv.org/pdf/2510.17925

Key Takeaways

01 The right question is not which tool is best. For an agent it is: which backend in which order, under what token budget?
02 ripgrep default, ast-grep escalation, repo-map for conceptual queries, embeddings last resort. In practice this order solves the large majority of cases (observation-based estimate).
03 Short keyword queries (auth flow, user service) collapse semantic models to near-zero nDCG@10, as the CoREB benchmark reports.
04 Search is a sub-budget. Starting heuristic: search_budget ≈ context_window * 0.15. Check at every step.
05 Result format matters. file:line + ≤2 lines of context + dedup saves notable tokens compared to raw output (in my own measurements ~30-50% range, varies by repo and query type).

Frequently Asked Questions (FAQ)

When does it make sense to use semantic search instead of ripgrep?

When the query is natural language and you do not know the exact symbol. But try repo-map first and reach for embeddings last. Embeddings are both expensive and, per the CoREB benchmark, return near-zero nDCG@10 on short queries.

When should I pick ast-grep over ripgrep?

Whenever the query is about syntactic shape. Examples: find every async function, catch every rethrow inside try-catch blocks. In short, anything a regex can approximate but an AST match answers much more accurately.

Is it worth learning this when Claude Code has its own context engine?

Yes. The built-in engine is a black box. You do not know which backend runs, with what budget, in what order. You cannot optimize a policy you do not understand. Also, if you are writing a custom skill or MCP server, this decision is yours.

Do I really need to count tokens?

You do not need to count by hand. Cap the search backend output instead: in ripgrep use -C 2 -m 50 (note that -m caps matching lines per file, not in total), in ast-grep use --json and post-process. Keep the budget in mind at every step and never dump raw output.

Will embeddings never be used?

They are used, but rarely. Cross-repo search, code lookup from a natural language prompt, documentation search. For everyday in-codebase search, the large majority of cases are solved with ripgrep + ast-grep (observation-based estimate).
