What Brownfield Engineering Teaches You About Reliable AI Agents

Michael Huang
May 3

Updated: May 5

This week at the Stage, we invited Hrishi, CEO at Southbridge.ai, to share his learnings on deploying agents for brownfield engineering work.

Hrishi opened the session with a contrast.

Building most AI demos is like cooking a souffle for friends. If things go wrong, you can swap it for cookie dough halfway through and no one really minds.

But brownfield engineering, the bottom-of-the-stack data work his team at Southbridge does, is different. It is a professional kitchen. The work has to land exactly where it needs to land, or the business stalls.

For those who prefer video, the presentation is on YouTube.

The bottleneck is what the model can see

Hrishi points out that while models have gone from acting like junior interns to being intimidatingly smart, the amount of information they can actually see has not really gone up.

"At any point in time, no matter what you're using, the model sitting on the other side is only ever seeing, like, a megabyte worth of data at most," he says.

Past that megabyte, performance degrades. It reminds him of early computer systems with barely enough working memory to hold the active problem.

Someone from the floor pushes back on this. What about the million-token context windows we see advertised? What about the research showing models fail needle-in-a-haystack tests at just a hundred thousand tokens?

Hrishi makes a useful distinction between causal continuity and random retrieval.

"LLMs tend to encode information very sparsely," he explains. If you are working on the same unfolding problem, a frontier model can comfortably hold 400K to 500K tokens. The information follows a causal chain.

But if you stuff that same window with completely unrelated facts and expect intelligence to emerge, the model will struggle much earlier.

My takeaway: we are no longer constrained by reasoning capacity. We are constrained by architecture. We have to architect around this hard memory limit just like software engineers did decades ago.
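To make "architecting around the limit" concrete, here is a minimal sketch of a context budget. It is not from the talk: the names `ContextItem` and `fit_to_budget` are made up for illustration, and the 1 MB default simply borrows Hrishi's rough figure. The idea is to pin what must always be visible, then keep the most recent contiguous slice of the work so the causal chain stays intact.

```python
from dataclasses import dataclass


@dataclass
class ContextItem:
    label: str             # hypothetical name for a chunk of context
    text: str
    pinned: bool = False   # pinned items (e.g. the task statement) are never dropped


def fit_to_budget(items: list[ContextItem], budget_bytes: int = 1_000_000) -> list[ContextItem]:
    """Keep pinned items plus the newest unpinned items that fit in the budget."""

    def size(item: ContextItem) -> int:
        return len(item.text.encode("utf-8"))

    remaining = budget_bytes - sum(size(item) for item in items if item.pinned)
    if remaining < 0:
        raise ValueError("pinned items alone exceed the context budget")

    keep = {index for index, item in enumerate(items) if item.pinned}
    # Walk newest to oldest and stop at the first item that does not fit,
    # so the history that survives is one unbroken recent stretch.
    for index in range(len(items) - 1, -1, -1):
        item = items[index]
        if item.pinned:
            continue
        if size(item) > remaining:
            break
        keep.add(index)
        remaining -= size(item)

    return [item for index, item in enumerate(items) if index in keep]


history = [
    ContextItem("task", "x" * 10, pinned=True),
    ContextItem("old", "a" * 50),
    ContextItem("mid", "b" * 30),
    ContextItem("new", "c" * 30),
]
kept = [item.label for item in fit_to_budget(history, budget_bytes=80)]
```

With an 80-byte budget, the pinned task and the two newest items survive while the oldest is dropped. Measuring in bytes rather than tokens is a simplification; a real system would likely count tokens with the model's own tokenizer.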

How to stay stuck in greenfield forever

The funniest part of the session is when Hrishi structures his advice as an anti-guide. He channels a CIA sabotage manual to explain exactly how to ensure your AI projects never leave the toy phase. Here's a summary for y'all:

1. Never delete anything from context. Throw in every skill, every instruction, and every summary. Compact it often so the team has no shared mental model of what the agent actually knows.

2. Build everything from scratch. Do not reuse dependencies. Build it so complicated that if you leave it for two weeks, you have to rebuild it.

3. Pass organisational knowledge through model after model. Let coworkers vibe a process, have a model summarise it, and feed that to another assistant. Your signal degrades into slop.

4. Mix your control flow, model instructions, deterministic code, and execution environments into one giant blob.

5. Work in silos so every engineer operates like a beautiful butterfly.

The room laughs when he describes how a model will happily create a beautiful two-file thing where no one can tell what anything is. But the underlying truth is sharp. If your prompt, your business logic, and your tool orchestration all blur together, you do not have a system.

Reuse is how reliability compounds

Southbridge's answer to this mess is an abstraction layer they call Hankweave, which they have open-sourced. They started building it at the bottom of the stack because low-level systems need time to accumulate scars and get stable.

Hrishi says they open-sourced it to get more diversity and more people pointing out where it fails. You cannot simulate the real world internally.

Their operating pattern is simple. Here are some principles to chew over:

1. Separate control flow from model instructions and code.

2. Break work into individually sequenced units.

3. Reuse these units instead of rebuilding them.

4. Let fixes in a reused unit benefit multiple workflows.

Hrishi admits this is harder than just building whatever you want. But reuse gives you the benefit of the last person who encountered the edge case.
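The four principles can be sketched in a few lines. To be clear, this is not Hankweave's API, which I have not reproduced here; `Unit`, `run_workflow`, and the fake model are all hypothetical stand-ins. The point is only the shape: instructions live in data, checks live in deterministic code, and the control flow is a small loop that knows nothing about either.

```python
from dataclasses import dataclass
from typing import Callable

# Assumption: a "model" is any callable from prompt text to response text.
Model = Callable[[str], str]


@dataclass(frozen=True)
class Unit:
    name: str
    instructions: str              # model instructions, kept as plain data
    check: Callable[[str], bool]   # deterministic code that validates the output


def run_workflow(units: list[Unit], model: Model, payload: str) -> str:
    """Control flow only: run units in sequence, stop at the first failed check."""
    for unit in units:
        payload = model(f"{unit.instructions}\n\n{payload}")
        if not unit.check(payload):
            raise RuntimeError(f"unit {unit.name!r} failed its check")
    return payload


# Units are defined once and reused across workflows.
TRIM = Unit("trim", "Strip surrounding whitespace.", lambda out: out == out.strip())
NORMALISE = Unit("normalise", "Lowercase the input.", lambda out: out == out.lower())

INGEST = [TRIM, NORMALISE]   # two workflows sharing the NORMALISE unit
REPORT = [NORMALISE]


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call, so the sketch runs offline."""
    instructions, _, payload = prompt.partition("\n\n")
    if instructions.startswith("Lowercase"):
        return payload.lower()
    return payload.strip()


ingested = run_workflow(INGEST, fake_model, "  Hello World ")
reported = run_workflow(REPORT, fake_model, "ABC")
```

Because `NORMALISE` is shared, tightening its check or rewording its instructions improves both `INGEST` and `REPORT` at once, which is the compounding Hrishi describes.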

Sentinels are quiet and cheap

The most practical pattern for me was the concept of sentinels.

These are lightweight watchdog agents that sit alongside the main loop. They do not take over the task. They just observe it.

When the main agent kept hand-waving tests away, we used to try stuffing the prompt with emojis and promises of a hundred-dollar tip.

Or we would bring in a reviewer agent after the fact to fix what got broken.

Sentinels offer a third way. They run on cheaper models like Gemini Flash, watch for specific triggers, and simply take notes when the main agent veers off track. Those notes surface at decision breakpoints without interrupting the main workflow.
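Here is a rough sketch of that pattern as I understood it, not Southbridge's implementation. The `Event` and `Sentinel` classes are hypothetical, and the trigger is a plain string match standing in for what would be a call to a cheap model. The sentinel only observes and records; notes are collected when the workflow reaches a breakpoint.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Event:
    kind: str     # hypothetical event kinds, e.g. "message" or "tool_call"
    detail: str


@dataclass
class Sentinel:
    name: str
    trigger: Callable[[Event], bool]   # in practice this might ask a cheap model
    notes: list[str] = field(default_factory=list)

    def observe(self, event: Event) -> None:
        # Observe only: record a note, never alter or stop the main agent.
        if self.trigger(event):
            self.notes.append(f"[{self.name}] {event.kind}: {event.detail}")

    def drain(self) -> list[str]:
        """Hand over accumulated notes at a decision breakpoint and reset."""
        notes, self.notes = self.notes, []
        return notes


skipped_tests = Sentinel(
    "skipped-tests",
    trigger=lambda e: e.kind == "message" and "skip the test" in e.detail.lower(),
)

main_loop_events = [
    Event("tool_call", "run pytest tests/test_orders.py"),
    Event("message", "That failure looks flaky, I will skip the test for now."),
    Event("tool_call", "git commit -m 'fix orders'"),
]

for event in main_loop_events:
    skipped_tests.observe(event)   # the main loop carries on regardless

breakpoint_notes = skipped_tests.drain()
```

The main loop never blocks on the sentinel. Whoever owns the breakpoint, a human or an orchestrating step, decides what to do with the notes.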

What was super interesting was the philosophy behind the design of the sentinel, which Hrishi shared. In particular: how do you measure the performance of a smart AI system using a dumber one? How do you break that loop?

Hrishi's answer is grounded in human behaviour. You cannot always measure pure intelligence directly, but you can look for behavioural traces. "With a co-worker, you can always tell when someone's life isn't necessarily going on track," he says. You don't have to be smarter than them to tell.

With an agent, you look at its behaviour in the same way. Is it reading outside the right boundaries? Are the database calls distributed normally? You do not need a massive model to spot when the shape of the work looks wrong.
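Both of those questions can be asked with no model at all. The sketch below is my own reading, with hypothetical function names: one check flags file reads outside an allowed root, the other flags a database call count that sits far from a baseline of earlier runs. I am interpreting "distributed normally" loosely as "in line with what past runs looked like", and the three-standard-deviation threshold is an arbitrary assumption.

```python
import posixpath
from pathlib import PurePosixPath
from statistics import mean, stdev


def reads_outside(paths: list[str], allowed_root: str) -> list[str]:
    """Return the paths that resolve to somewhere outside allowed_root."""
    root = PurePosixPath(posixpath.normpath(allowed_root))
    outside = []
    for path in paths:
        # normpath collapses ".." segments so "/repo/../etc" is not mistaken for "/repo".
        resolved = PurePosixPath(posixpath.normpath(path))
        if not resolved.is_relative_to(root):   # requires Python 3.9+
            outside.append(path)
    return outside


def looks_unusual(count: int, baseline: list[int], threshold: float = 3.0) -> bool:
    """Flag a count more than `threshold` standard deviations from the baseline mean.

    The baseline needs at least two earlier runs for stdev to be defined.
    """
    centre = mean(baseline)
    spread = stdev(baseline)
    if spread == 0:
        return count != centre
    return abs(count - centre) / spread > threshold


stray_reads = reads_outside(
    ["/repo/src/app.py", "/repo/../etc/passwd", "/tmp/scratch.txt"],
    allowed_root="/repo",
)
db_calls_in_past_runs = [10, 12, 11, 9, 13]
normal_run_flagged = looks_unusual(12, db_calls_in_past_runs)
runaway_run_flagged = looks_unusual(40, db_calls_in_past_runs)
```

Neither check knows whether the agent's output is correct. They only say the shape of the work looks off, which is exactly the kind of signal a cheap watcher can raise.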

Evals have to follow behaviour

This brings up a larger shift in how we evaluate AI. The old evaluation model is broken. When an agentic loop makes a hundred tool calls over six hours, twenty-five of those calls can fail, but the overall run can still achieve the correct outcome.

You cannot judge a complex agent system the same way you judge a single chatbot answer.

Hrishi asks the room how to define and test for successful behaviour rather than step-by-step success or failure. It is a question of measuring the shape of good behaviour instead of demanding perfect execution on every try.
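One way to picture the difference, as a sketch rather than an answer to Hrishi's question: score the whole run instead of each step. Everything here is an assumption for illustration, including the `ToolCall` and `RunReport` names and the choice of "longest streak of consecutive failures" as a behavioural signal for thrashing.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    name: str
    ok: bool


@dataclass
class RunReport:
    outcome_ok: bool
    failure_rate: float
    longest_failure_streak: int
    passed: bool


def evaluate_run(calls: list[ToolCall], outcome_ok: bool, max_streak: int = 5) -> RunReport:
    """Judge a run by its outcome and its shape, not by every individual call."""
    failures = sum(1 for call in calls if not call.ok)
    streak = longest = 0
    for call in calls:
        streak = 0 if call.ok else streak + 1
        longest = max(longest, streak)
    rate = failures / len(calls) if calls else 0.0
    # Scattered failures are tolerated; a long unbroken streak suggests thrashing.
    passed = outcome_ok and longest <= max_streak
    return RunReport(outcome_ok, rate, longest, passed)


# The scenario from the talk: 100 calls, 25 of them fail, the outcome is still right.
recovering_run = [ToolCall("query", ok=(i % 4 != 0)) for i in range(100)]
recovering = evaluate_run(recovering_run, outcome_ok=True)

# Same outcome, but the agent got stuck failing six times in a row along the way.
thrashing_run = [ToolCall("query", ok=True)] * 10 + [ToolCall("query", ok=False)] * 6
thrashing = evaluate_run(thrashing_run, outcome_ok=True)
```

A step-level check that demands every call succeed would fail the first run, even though it recovered each time and landed the right result. The run-level view passes it and instead flags the second run, where the behaviour looks wrong despite the correct outcome.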

Questions from the floor

Matthew jumps in to observe that everything Hrishi is describing sounds exactly like managing a team of excitable junior developers for the first time. Hrishi immediately agrees.

Shaun asks about the constant shifts in model capabilities. With jumps from 3.5 Sonnet enabling true agentic loops to Opus 4 shifting from defined problems to open-ended objectives, does everything get swept away every few months?

Hrishi is very open with his answer. It is hard to measure the differences because it is impossible to quantify intelligence. It is difficult to quantify how a ten percent increase in a model's intelligence will affect the systems you build today.

But he leaves us with a quiet reassurance. The teams that survive those shifts are the ones who refuse to start from zero every month. They are the ones who build abstraction layers, enforce reuse, and figure out how to keep the kitchen running.

Around the room, the questions remain sharp. People are comparing agent harnesses, debating evaluation methods, and discussing local models. You can feel what Coworking Fridays is good at in moments like this - builders taking each other's work seriously enough to push on the weak points.

Missed out last week? Don't worry, these conversations happen every Friday at SQ Collective.

Usually over laptops. Sometimes over pizza.

Join the next one.