I was browsing Hacker News last week when this post stopped me:
“I built AgentBudget after an AI agent loop cost me $187 in 10 minutes. GPT-4o retrying a failed analysis over and over.”
I’ve been there. Not $187, but I’ve watched a LangChain agent burn through tokens on a recursive loop that should have been caught in seconds. By the time I noticed, the damage was done. No alert, no warning, just a surprisingly high bill at the end of the month.
Then I saw another thread, “Ask HN: How are you monitoring AI agents in production?”, and the top comment said something that stuck with me:
“Most tools record what happened… but not why the agent deviated from the plan. That’s the gap that actually hurts during post-mortems.”
I started digging. And the more I looked, the more I realized: we have a systemic blind spot.
We can trace everything but understand nothing
57% of teams now have AI agents in production (LangChain State of Agent Engineering, 2025). But Gartner predicts that over 40% of agentic AI projects will be canceled by the end of 2027, citing escalating costs, unclear business value, and inadequate risk controls. The way I read it: not because the AI isn’t good enough, but because teams can’t trust it in production.
The tools exist. LangSmith, Langfuse, Arize, Helicone. They all show you traces. Beautiful, detailed traces. But here’s the problem: a trace tells you what happened, not whether it worked.
Think about it. Your agent processed a customer query in 2.3 seconds, made 4 tool calls, used 3,200 tokens. Great. But:
- Did the customer actually buy something after that response?
- Did the agent hallucinate a product feature that doesn’t exist?
- Is the agent’s accuracy drifting compared to last week?
No trace will tell you that.
Three blind spots I keep seeing
Cost is tracked after the fact, not prevented
The $187 agent loop is not an isolated case. ZenML’s LangSmith alternatives analysis found that busy LangSmith workspaces hit five-figure monthly bills. On the Langfuse launch thread (215 points, 61 comments), one team mentioned managing $60K+ monthly LLM spend and finding existing solutions inadequate.
The issue isn’t tracking cost. LangSmith and Langfuse both show token usage. The issue is that cost calculations are often wrong (one issue reporting this has 19 thumbs up), especially for cached tokens, vision models, and multi-provider setups. And nobody stops the runaway loop before it drains your budget.
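To make "prevented, not tracked" concrete, here is a minimal sketch of a pre-call budget guard. Everything in it is an assumption for illustration: `BudgetGuard`, `check_and_charge`, the per-call cost estimate, and the "same call key repeated" heuristic are hypothetical names and choices, not how AgentBudget or any existing tool works.

```python
from dataclasses import dataclass
from typing import Optional


class BudgetExceeded(RuntimeError):
    """Raised before a call that would break the spend cap or repeat limit."""


@dataclass
class BudgetGuard:
    limit_usd: float            # hard cap for this agent run (assumed value)
    max_repeats: int = 3        # identical consecutive calls allowed
    spent_usd: float = 0.0
    _last_key: Optional[str] = None
    _repeats: int = 0

    def check_and_charge(self, call_key: str, estimated_cost_usd: float) -> None:
        """Call this before the LLM request; it raises instead of letting it through."""
        if call_key == self._last_key:
            self._repeats += 1
        else:
            self._last_key, self._repeats = call_key, 1
        if self._repeats > self.max_repeats:
            raise BudgetExceeded(f"{call_key!r} attempted {self._repeats} times in a row")
        if self.spent_usd + estimated_cost_usd > self.limit_usd:
            raise BudgetExceeded(f"next call would exceed ${self.limit_usd:.2f} cap")
        self.spent_usd += estimated_cost_usd


# Simulated runaway loop: the same failed step retried forever.
guard = BudgetGuard(limit_usd=1.00)
calls_made = 0
try:
    while True:
        guard.check_and_charge("analyze_report", estimated_cost_usd=0.05)
        calls_made += 1  # the real LLM call would go here
except BudgetExceeded as exc:
    stop_reason = str(exc)

print(calls_made, stop_reason)
```

In this simulation the loop is cut off after three calls, long before the $1.00 cap is reached. The check runs before the request, so it relies on an estimated cost; a real setup would have to reconcile that estimate against the provider's reported usage afterwards.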
The monitoring tool itself becomes a risk
Here’s one that surprised me: LangSmith’s @traceable decorator crashed production applications during a LangSmith outage (5 thumbs up). The monitoring layer brought down the actual app. AgentOps deadlocks when you record 100+ events. Langfuse’s self-hosted dashboard times out under production load.
The irony: the tool you added for reliability becomes a reliability risk.
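One way to think about the fix is "fail open": telemetry errors get logged and swallowed, while errors in the actual business logic still propagate. The sketch below is a hypothetical decorator of my own (`fail_open`, `broken_exporter`, and the event shape are all made up for illustration); it is not LangSmith's `@traceable` or any vendor's API.

```python
import functools
import logging

logger = logging.getLogger("telemetry")


def fail_open(send_trace):
    """Decorator factory: report each call via send_trace, never letting it break the app."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)  # business-logic errors still propagate
            try:
                send_trace({"name": func.__name__, "ok": True})
            except Exception:
                # Telemetry is best-effort: log the failure and move on.
                logger.warning("trace export failed; continuing", exc_info=True)
            return result
        return wrapper
    return decorator


def broken_exporter(event):
    """Stands in for a tracing backend that is down."""
    raise ConnectionError("tracing backend unreachable")


@fail_open(broken_exporter)
def answer(question: str) -> str:
    return f"echo: {question}"


reply = answer("hello")  # still returns even though the exporter raised
print(reply)
```

This only covers exporters that raise. A backend that hangs would still block the request, so a real version would also need a timeout or a background queue for the export.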
Nobody connects agent performance to business outcomes
This is the blind spot nobody talks about. One LangChain issue (9 thumbs up, February 2026) asks for exporting agent metrics to customer-facing dashboards. It’s still open. Langfuse’s alerting feature request has been open since December 2023, over two years.
As one HN commenter put it: “Your audit trail is completely fractured. You can’t confidently tell a compliance officer what your synthetic workforce is doing.”
We’re building agent traces in one system, business metrics in another, and the gap between them is where trust dies.
What would actually help?
I’ve been thinking about what a different approach would look like. Not another trace viewer. Something that answers three questions:
Is my agent reliable? A single score, 0-100, updated in real time. Combining output validation, hallucination checks, latency anomalies, error rates. At a glance: this agent is healthy, this one isn’t.
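As a rough sketch of what such a score could look like, assume each signal is counted over a recent window of runs and combined with fixed weights. The `WindowStats` fields, the weights, and the linear penalty are all assumptions I picked for illustration, not a validated formula.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    runs: int                  # agent runs in the scoring window
    validation_failures: int
    hallucination_flags: int
    latency_anomalies: int
    errors: int


# Assumed weights; they sum to 1.0 so the score stays within 0-100.
WEIGHTS = {
    "validation_failures": 0.35,
    "hallucination_flags": 0.30,
    "latency_anomalies": 0.10,
    "errors": 0.25,
}


def reliability_score(stats: WindowStats) -> int:
    """Return 100 minus the weighted sum of per-signal failure rates, as a 0-100 integer."""
    if stats.runs == 0:
        raise ValueError("no runs in window")
    penalty = sum(
        weight * min(getattr(stats, name) / stats.runs, 1.0)
        for name, weight in WEIGHTS.items()
    )
    return round(100 * (1 - penalty))


healthy = reliability_score(WindowStats(200, 0, 0, 0, 0))
degraded = reliability_score(WindowStats(100, 20, 10, 5, 10))
print(healthy, degraded)  # 100 87
```

The weights are the real design decision here: weighting hallucinations above latency anomalies says a wrong answer hurts more than a slow one, and that trade-off will differ per agent.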
Is my agent making money? Connecting the trace to the business event. The agent recommended a product, did the customer buy it? The agent wrote a response, did the customer leave or stay? Not just “the agent ran,” but “the agent produced value.”
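In its simplest form this is a join between two datasets on a shared identifier. The sketch below assumes the trace ID gets propagated into the business event (for example, attached to the order record); the field names `trace_id`, `agent`, and `type` are a hypothetical schema, not one any tool defines.

```python
from collections import defaultdict


def conversion_by_agent(traces, events, outcome="purchase"):
    """Share of each agent's traces that were followed by the given business outcome."""
    converted = {e["trace_id"] for e in events if e["type"] == outcome}
    totals, wins = defaultdict(int), defaultdict(int)
    for trace in traces:
        totals[trace["agent"]] += 1
        if trace["trace_id"] in converted:
            wins[trace["agent"]] += 1
    return {agent: wins[agent] / totals[agent] for agent in totals}


traces = [
    {"trace_id": "t1", "agent": "product-recommender"},
    {"trace_id": "t2", "agent": "product-recommender"},
    {"trace_id": "t3", "agent": "product-recommender"},
    {"trace_id": "t4", "agent": "product-recommender"},
    {"trace_id": "t5", "agent": "support"},
]
events = [
    {"trace_id": "t1", "type": "purchase"},
    {"trace_id": "t3", "type": "purchase"},
    {"trace_id": "t5", "type": "page_view"},
]

rates = conversion_by_agent(traces, events)
print(rates)  # {'product-recommender': 0.5, 'support': 0.0}
```

The hard part isn't the join, it's getting the trace ID into the business system in the first place. Without that shared key, the two datasets stay in separate worlds.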
Will I know when it breaks? A Slack message that says: “Your product-recommender agent’s reliability dropped from 91% to 67% in the last 2 hours. Top error: output validation failure at step 3. [See trace].” Not a PagerDuty-grade enterprise alerting system. Just a webhook that tells you something went wrong.
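A sketch of that webhook, under a few assumptions: the alert fires on a fixed drop in score, and the target is a Slack incoming webhook, which accepts a JSON body with a `text` field. The function names, the 10-point threshold, and the trace URL are hypothetical, and `post_alert` is defined but not called here since it needs a real webhook URL.

```python
import json
import urllib.request


def should_alert(previous: int, current: int, min_drop: int = 10) -> bool:
    """Alert only when the score fell by at least min_drop points (assumed threshold)."""
    return previous - current >= min_drop


def build_alert(agent, previous, current, window, top_error, trace_url):
    """Build the JSON payload for a Slack incoming webhook."""
    text = (
        f"Your {agent} agent's reliability dropped from {previous}% to {current}% "
        f"in the last {window}. Top error: {top_error}. <{trace_url}|See trace>"
    )
    return {"text": text}


def post_alert(webhook_url: str, payload: dict, timeout: float = 5.0) -> int:
    """POST the payload to the webhook and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


payload = None
if should_alert(previous=91, current=67):
    payload = build_alert(
        agent="product-recommender",
        previous=91,
        current=67,
        window="2 hours",
        top_error="output validation failure at step 3",
        trace_url="https://example.com/traces/abc123",  # placeholder URL
    )
    print(payload["text"])
```

Keeping the threshold check separate from the delivery makes it easy to swap Slack for any other webhook receiver later.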
I’m researching this. Can you help?
I’m exploring whether this is a real problem worth solving or just my personal frustration.
If you run AI agents in production (or plan to), I’d love 2 minutes of your time:
5 questions. No signup. I’ll share the results publicly.
Whether this becomes a tool, an open-source project, or just a blog post with interesting data, the findings will be useful either way.
- 01 Traces tell you what happened, not whether it worked
- 02 Cost is tracked after the fact, not prevented: $187 agent loops, five-figure monthly bills
- 03 The monitoring tool itself can become a reliability risk (LangSmith decorator crashing production)
- 04 Nobody connects agent performance to business outcomes
- 05 Current tools are flight recorders, not collision avoidance systems
Why don't existing observability tools catch agent failures?
Tools like LangSmith and Langfuse are designed for debugging after the fact. They show traces, latency, and token counts, but don't provide real-time reliability scoring, cost prevention, or business outcome correlation. They're flight recorders, not collision avoidance systems.
What is the $187 problem?
A developer's GPT-4o agent got stuck retrying a failed analysis over and over, costing $187 in 10 minutes. No monitoring tool alerted them. This pattern (agent loops burning through API credits without warning) is common in production deployments.
Can monitoring tools themselves cause production issues?
Yes. LangSmith's @traceable decorator crashed production applications during a LangSmith outage. AgentOps deadlocks when recording 100+ events. Langfuse's self-hosted dashboard times out under production load. The monitoring layer can become a reliability risk.
What would better agent monitoring look like?
Three capabilities: (1) Real-time reliability scoring (single 0-100 score combining output validation, hallucination checks, anomalies), (2) Business outcome connection (did the agent's action produce value?), (3) Simple alerting via webhook when metrics cross thresholds.