Three Years of Building Agents in Production (Part 1)

Written by Lauren Leidal | Jul 1, 2026 4:55:03 PM

What does it actually take to build autonomous agents for a discipline that depends on exact, repeatable outcomes? mabl spent three years finding out. From early PaLM experiments to the Agent Framework running in production today, mabl engineer and tech lead Lauren Leidal shares what building at the edge of generative AI and software testing taught her team.

The Deterministic Wall

In the summer of 2023, the generative AI hype was deafening. Every SaaS company was rushing to bolt a chat window onto their product and declare themselves "AI-native." At mabl, we knew the underlying technology was a massive paradigm shift, but we also knew our domain. Software testing is inherently deterministic. It requires exact inputs, expected states, and binary pass/fail outcomes. LLMs, by design, are probabilistic guessers.

Still, we had to start learning. So we set an ambitious goal: build an autonomous testing agent using Google's PaLM model.

The reality check hit fast. For all the hype about super-intelligence, we could barely get our first agent to reliably navigate a simple login screen on a basic sandbox app. It would hallucinate DOM elements, lose the thread of what it was doing, and fail at tasks a human could do in three seconds.

It was a hilarious, frustrating disaster, but it gave us our first major structural insight: LLMs weren't ready to drive the whole car yet.

Codifying the Trust Gap

To succeed, we had to stop looking at AI as a magic bullet and start analyzing it as a tool with highly specific constraints. In August 2023, we published a framework outlining exactly where generative AI fit into the software testing lifecycle.

At the time, the industry was ignoring a massive trust gap. Surveys showed high interest in AI testing, but fewer than 3% of developers actually trusted AI tools. We broke down the testing lifecycle and realized why: generative AI was incredible at high-volume tasks like brainstorming test requirements or generating mock test data, but it fell apart at complex, constraint-based reasoning like test planning and execution.

That framework became our blueprint. If we couldn't build a fully autonomous execution agent yet, we needed to pivot and use the models to solve highly specific, contained problems where their strengths shone.

Working With the Model, Not Against It

This meant learning to treat LLMs like teammates. Just like human engineers, these models have distinct strengths and frustrating blind spots. You don't build a great engineering team by forcing people to work against their natural grain, and the same principle applies to AI. Work with what the model is naturally good at.

Our first real pivot was applying PaLM to advanced auto-healing in the fall of 2023. The initial plan was straightforward: ask the model to judge word similarity and pick out the words in the DOM that were similar to a given target.

It failed completely. It just couldn't reliably pick out individual similar words.

But as we looked at the outputs, we noticed something. The model couldn't do what we asked, but it could clearly do the underlying task. It just needed to come at it a different way than the one we had prescribed. It was terrible at picking out individual similar words, but it was surprisingly good at grouping words by semantic meaning. So we stopped arguing with it. We changed our auto-healing architecture to work with that behavior instead of against it, and what started as a model limitation became a reliable, deterministic feature for our users.
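
As a rough illustration of that shift (a sketch under stated assumptions, not mabl's actual implementation), the idea can be expressed as: ask the model to group every label at once, then pick the target's group with ordinary deterministic code. The names heal_candidates, GroupFn, and fake_group_fn are hypothetical, and fake_group_fn is a hard-coded stand-in for a real model call.

```python
from typing import Callable, Dict, List

# Hypothetical signature: takes a list of labels and returns groups of
# labels that the model considers semantically related.
GroupFn = Callable[[List[str]], List[List[str]]]


def heal_candidates(target: str, candidates: List[str], group_fn: GroupFn) -> List[str]:
    """Return the candidates that land in the same semantic group as the target."""
    groups = group_fn([target] + candidates)
    for group in groups:
        if target in group:
            members = set(group)
            # Deterministic post-processing: preserve the original candidate order.
            return [c for c in candidates if c in members]
    return []


def fake_group_fn(labels: List[str]) -> List[List[str]]:
    """Stand-in for a model call; groups labels using a hard-coded lookup."""
    buckets = {"auth": {"Sign in", "Log in", "Login"}, "nav": {"Home", "Back"}}
    groups: Dict[str, List[str]] = {}
    for label in labels:
        # Labels with no known bucket end up in a group of their own.
        key = next((name for name, words in buckets.items() if label in words), label)
        groups.setdefault(key, []).append(label)
    return list(groups.values())


healed = heal_candidates("Sign in", ["Home", "Log in", "Checkout", "Login"], fake_group_fn)
print(healed)
```

The model only does the part it is good at (grouping); selecting the match is plain code, which is what keeps the result repeatable for a given grouping.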

LLMs as Teammates: The "Judge" Paradigm

Around the same time we were wrestling with those early PaLM experiments, a seminal research paper dropped in June 2023: Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.

The broader AI community was converging on LLM-as-a-judge as a way to grade chatbot outputs. A colleague and I saw a different application entirely. If an LLM could reliably evaluate semantic quality, why couldn't it evaluate the complex, non-deterministic state of a web application?

The internal reaction was skeptical, and understandably so. Intentionally introducing non-determinism into a testing tool feels like a category violation. Testing is supposed to remove uncertainty, not embrace it. But that framing was exactly wrong. The real question was whether we could fundamentally expand the set of things automated testing could cover. We believed we could.

Productizing the Paper: GenAI Assertions

Our initial goal, building a fully autonomous agent that could do everything, was trying to force the model into a role it wasn't ready for. So we leaned into its strengths. What if, instead of asking the model to act, we just asked it to judge?

We took the LLM-as-a-judge concept and engineered a bridge to production, beginning development on GenAI Assertions in the winter of 2024.

Traditional testing assertions are notoriously brittle. With GenAI Assertions, users could simply use natural language to evaluate a visual or contextual state, like asking, "Does the shopping cart accurately reflect the bulk discount applied?" The LLM, using its multimodal capabilities, acts as the judge of that state.
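
To make the pattern concrete, here is a minimal sketch of a natural-language assertion backed by a judge. It is an assumption-laden illustration, not the GenAI Assertions implementation: genai_assert, Verdict, JudgeFn, and the PASS/FAIL reply format are all hypothetical, the page state is plain text rather than a screenshot, and fake_judge stands in for a real multimodal model call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    passed: bool
    rationale: str


# Hypothetical judge signature: (question, page_state) -> raw model reply.
# The reply is assumed to start with PASS or FAIL on its own line.
JudgeFn = Callable[[str, str], str]


def genai_assert(question: str, page_state: str, judge: JudgeFn) -> Verdict:
    """Ask the judge a natural-language question and reduce its reply to pass/fail."""
    reply = judge(question, page_state).strip()
    first_line, _, rest = reply.partition("\n")
    answer = first_line.strip().upper()
    if answer not in ("PASS", "FAIL"):
        # An unparseable reply counts as a failure, never as a silent pass.
        return Verdict(False, f"unparseable judge reply: {reply!r}")
    return Verdict(answer == "PASS", rest.strip())


def fake_judge(question: str, page_state: str) -> str:
    """Stand-in for a model call, keyed on a phrase in the page state."""
    if "bulk discount" in page_state.lower():
        return "PASS\nThe cart shows a bulk discount line."
    return "FAIL\nNo discount line was found in the cart."


verdict = genai_assert(
    "Does the shopping cart accurately reflect the bulk discount applied?",
    "Cart: 12 items, 10% bulk discount applied, total $108.00",
    fake_judge,
)
print(verdict.passed, "-", verdict.rationale)
```

The probabilistic part is confined to the judge; the test runner still receives a binary result plus a rationale it can log.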

When we released GenAI Assertions in the summer of 2024 (going GA with Gemini 1.5 in the fall), it fundamentally changed the way our users interacted with mabl. It removed the friction of writing complex, nested assertion logic. It became the fastest-growing feature in our company's history.

But there was a catch. While we were successfully building this "judge" concept for our users, the paradigm was still too new for us to fully rely on it internally. GenAI Assertions was our first truly composable AI primitive, and the LLM-as-a-judge pattern would eventually become the foundational mechanism we used to evaluate and course-correct our own internal autonomous agents. But in early 2024, our internal infrastructure hadn't caught up yet.

The Hackathon Spark and the Limits of Traditional Testing

While GenAI Assertions was being developed, we were also pushing forward on other fronts. In March 2024, we released GenAI snippet generation, taking our first steps into chat-based interfaces. A few months later, a July hackathon built on that early framework to give birth to automated failure summaries.

This essentially formed version 1.0 of our agent infrastructure. It proved that bringing AI directly into the user's workflow could massively reduce the cognitive load of debugging failed tests.

But testing these new capabilities exposed a glaring weakness in our process. Because we hadn't yet formalized our internal LLM-as-a-judge pipelines, our testing strategy was a patchwork. For conversational features, we relied on curating small sets of manual test cases, running them through the model, and spot-checking the accuracy. For earlier features like advanced auto-healing, we leaned on traditional labeled datasets.

As usage exploded, the reality set in. You cannot scale an AI product on static datasets and manual spot-checks. Traditional labeled datasets simply don't scale for complex, non-deterministic contexts. Small test suites left us blind to edge cases, silent failures, and model regressions.

We had successfully navigated the initial friction of AI and found real product-market fit. But to cross the chasm from "smart features" to the fully autonomous, multi-platform agents we wanted to build next, our entire underlying architecture, and more importantly how we measured trust, had to fundamentally change. We needed a unified foundation, measurable safety, and a way to give our agents something they fundamentally lacked: memory.

The Dual Bottleneck

We ended 2024 riding high on a suite of successful AI features: advanced auto-healing, GenAI Assertions, and automated failure summaries. But as usage exploded, two glaring realities set in behind the scenes.

First, the individual, disconnected nature of these features wasn't built to last. You cannot build a cohesive agent out of siloed "smart features." Second, our testing methodology couldn't scale. You can't test highly probabilistic workflows with static datasets and manual spot-checks.

To build truly autonomous workflows that users would trust, we needed a unified architecture. Specifically, we needed to give our agents two things they fundamentally lacked: continuous memory and measurable safety.

Giving the Agent a Memory: RAG and State Management

A smart model is useless if it doesn't understand the specific, quirky application it's trying to test. Even worse is an agent that forgets what it's doing halfway through a task.

In late 2024 and early 2025, we rolled out our Retrieval-Augmented Generation (RAG) infrastructure with Gemini 2. Instead of passing blank-slate models into a test, RAG allowed us to inject real-time, user-specific environment data directly into the agent's workflow.
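
The retrieve-then-inject step can be sketched in a few lines. This is a simplified illustration under loud assumptions: retrieval here is naive keyword overlap rather than whatever a production RAG system uses (typically embeddings and a vector store), and tokenize, retrieve, build_prompt, and the sample notes are all hypothetical.

```python
import re
from typing import List, Set

# Small stopword list so common words do not count as matches (illustrative only).
STOPWORDS = {"the", "a", "and", "in", "with", "to", "is", "by"}


def tokenize(text: str) -> Set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS


def retrieve(query: str, documents: List[str], k: int = 2) -> List[str]:
    """Return up to k documents sharing the most keywords with the query."""
    query_tokens = tokenize(query)
    scored = [(len(query_tokens & tokenize(doc)), i, doc) for i, doc in enumerate(documents)]
    # Highest overlap first; ties keep the original document order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [doc for score, _, doc in scored[:k] if score > 0]


def build_prompt(task: str, documents: List[str], k: int = 2) -> str:
    """Inject the retrieved, application-specific notes ahead of the task."""
    context = retrieve(task, documents, k)
    lines = ["Context about this application:"]
    lines += [f"- {note}" for note in context] or ["- (none found)"]
    lines += ["", f"Task: {task}"]
    return "\n".join(lines)


notes = [
    "The login form uses a one-time passcode sent by email.",
    "Checkout requires a saved shipping address.",
    "The admin dashboard is only visible to owner accounts.",
]
task = "Complete login and checkout with the saved address"
print(build_prompt(task, notes))
```

Only the notes relevant to the task reach the prompt, so the model starts with knowledge of this particular application instead of a blank slate.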

But context in a vacuum isn't enough. As the models stabilized, we faced a massive architectural headache: spanning the gap between our API and our desktop app. Building an agent that operates in a single, controlled environment is hard enough. Building one that can seamlessly transition its state and context between entirely different execution environments is a monumental challenge. We had to build robust state management pipelines alongside our RAG infrastructure to ensure the agent didn't suffer amnesia the moment a test required it to jump from an API validation to a desktop UI interaction.
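
One simple way to picture that handoff is a state object that is serialized in one environment and restored in another. The sketch below is illustrative only: AgentState, its fields, and the order example are hypothetical and say nothing about how mabl's pipelines are actually built.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class AgentState:
    goal: str
    completed_steps: List[str] = field(default_factory=list)
    variables: Dict[str, str] = field(default_factory=dict)

    def to_json(self) -> str:
        """Serialize to a plain JSON string that any environment can read."""
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, payload: str) -> "AgentState":
        """Rebuild the state from a serialized handoff."""
        return cls(**json.loads(payload))


# API side: the agent validates an endpoint and records what it learned.
api_state = AgentState(goal="Verify the new order appears in the desktop client")
api_state.completed_steps.append("Created order through the API")
api_state.variables["order_id"] = "A-1001"
handoff = api_state.to_json()

# Desktop side: the agent resumes with the same goal, history, and variables.
ui_state = AgentState.from_json(handoff)
ui_state.completed_steps.append("Opened the order list")
print(ui_state.variables["order_id"], len(ui_state.completed_steps))
```

Because the handoff is plain data, the desktop side does not need to share a process, or even a programming language, with the API side.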

The Safety Net: From "Vibes" to Data

Giving an agent shared memory and cross-platform capabilities was incredibly risky without being able to accurately measure how it reasoned. True agentic workflows require robust, data-driven pipelines.

Crucially, we didn't halt feature development to build this safety net. Instead, we recognized this was the moment to tap into our seven years of built-up testing expertise and historical data. We moved away from manual spot-checks and transitioned to automated, quantifiable metrics for model performance. We built dedicated evaluator suites leveraging the very same "LLM-as-a-judge" concept we productized in GenAI Assertions to rigorously test our own AI features behind the scenes.
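
A stripped-down version of such an evaluator suite might look like the following. It is a sketch under assumptions, not mabl's pipeline: EvalCase, run_suite, gate, the tolerance value, and both stub functions are hypothetical, and the stub judge is a substring check standing in for a real LLM judge.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalCase:
    prompt: str
    rubric: str


FeatureFn = Callable[[str], str]      # the AI feature under test
JudgeFn = Callable[[str, str], bool]  # (output, rubric) -> acceptable?


def run_suite(cases: List[EvalCase], feature: FeatureFn, judge: JudgeFn) -> float:
    """Run every case through the feature and return the judged pass rate."""
    if not cases:
        raise ValueError("an eval suite needs at least one case")
    passed = sum(1 for case in cases if judge(feature(case.prompt), case.rubric))
    return passed / len(cases)


def gate(pass_rate: float, baseline: float, tolerance: float = 0.02) -> bool:
    """Accept a change only if quality stays within tolerance of the baseline."""
    return pass_rate >= baseline - tolerance


def fake_feature(prompt: str) -> str:
    """Stand-in for a feature such as failure summaries."""
    return f"Summary: the test hit a {prompt}"


def fake_judge(output: str, rubric: str) -> bool:
    """Stand-in for an LLM judge; checks that the rubric keyword is present."""
    return rubric.lower() in output.lower()


cases = [
    EvalCase("login timeout", "timeout"),
    EvalCase("checkout 500 error", "500"),
    EvalCase("missing button", "button"),
    EvalCase("slow page", "latency"),
]
rate = run_suite(cases, fake_feature, fake_judge)
print(rate, gate(rate, baseline=0.8))
```

Turning judged outputs into a single number is what allows a model upgrade or prompt change to be compared against a baseline instead of being eyeballed.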

The "Duct Tape" Era and Model Whiplash

We put this new infrastructure to the test in the summer of 2025 when we attempted our first real generative test creation with Gemini 2.

To be radically transparent: it barely worked for our initial demos. Left to its own devices, the model would hallucinate actions, endlessly click the same non-interactive elements, or completely skip crucial steps. To force reliability, we engineered a manual Chain of Thought (CoT). We constrained the model with rigid, step-by-step reasoning prompts. These were, essentially, hard-coded guardrails that forced the model to explicitly state its observations, planned actions, and validations before executing anything. It was the duct tape keeping the demo together.
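
A guardrail of this kind can be pictured as a fixed reply format plus a validator that refuses to execute anything until every section is present. The sketch below is an assumption about what such a structure could look like, not the prompts mabl used; the section names, STEP_PROMPT, and parse_step are hypothetical.

```python
import re
from typing import Dict

REQUIRED_SECTIONS = ("OBSERVATION", "PLAN", "VALIDATION", "ACTION")

# Hypothetical prompt fragment that pins the model to a rigid reply format.
STEP_PROMPT = (
    "Reply with exactly four lines:\n"
    "OBSERVATION: what you see on the page\n"
    "PLAN: what you intend to do and why\n"
    "VALIDATION: how you will confirm it worked\n"
    "ACTION: the single action to execute"
)


def parse_step(reply: str) -> Dict[str, str]:
    """Parse a structured reply and reject it if any section is missing or empty."""
    sections: Dict[str, str] = {}
    for line in reply.splitlines():
        match = re.match(r"^([A-Z]+):\s*(.*)$", line.strip())
        if match and match.group(1) in REQUIRED_SECTIONS:
            sections[match.group(1)] = match.group(2).strip()
    missing = [name for name in REQUIRED_SECTIONS if not sections.get(name)]
    if missing:
        raise ValueError(f"reply missing sections: {', '.join(missing)}")
    return sections


good_reply = (
    "OBSERVATION: A login form with email and password fields is visible.\n"
    "PLAN: Submit the form using the test credentials.\n"
    "VALIDATION: The dashboard heading appears after submitting.\n"
    'ACTION: click "Sign in"'
)
step = parse_step(good_reply)
print(step["ACTION"])
```

Only the ACTION line would be handed to the executor; a reply that skips its observation or validation is rejected before anything is clicked.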

But AI engineering moves fast, and that duct tape almost backfired.

The transition from Gemini 2 to Gemini 2.5 is where things got genuinely interesting, and nearly went sideways. On the surface, the model upgrade looked like a win. The agent was smarter, and early results were better. But I started noticing subtle logic errors that didn't fit the pattern. The agent would occasionally skip conclusions that should have been obvious, or reason in circles on simple decisions.

I pulled the thinking logs, and they showed the issue: the model was using its built-in reasoning capacity to think about our Chain of Thought structure rather than the actual problem. We had told a model that already knew how to think exactly how to think, and it was dutifully following our instructions instead of just solving the problem.

The fix required a real change in philosophy. We moved from enforcing rigid reasoning steps to providing high-level guidance about thought process, and began trusting the model's native reasoning, shaping it rather than overriding it.

This was the first of what would eventually become five complete rewrites of the authoring agent. In traditional software engineering, rewriting a core feature five times would be a red flag. In AI engineering circa 2025, it meant you were paying attention.

This volatility proved exactly why our shift to data-driven measurement was so critical. The rigorous evaluator suite we built was the only thing keeping features like GenAI Assertions stable while the underlying models fundamentally changed how they reasoned.

The Breaking Point

We had successfully built the deep engine (RAG, cross-platform state management, and rigorous evals), and we were starting to see the massive potential of agentic workflows in both our product and our internal development.

But a new problem was emerging. Building and managing this increasingly complex architecture relied heavily on the ingrained expertise of a few team members, while requiring others to constantly step out of their comfort zones to contribute. This wasn't a sustainable process for building software at scale, especially as the need for deeper context and cohesiveness across these features started to become a glaring issue.

To keep growing, and to match the speed of the broader generative AI market, we couldn't just build isolated features anymore. We had to take everything we had learned and codify it into an agent framework. One that the rest of the engineering organization could use to build the future.

In Part 2, we'll pick up with the demo that forced our hand, the winter we spent building a factory instead of another one-off agent, and what happened when we asked two-thirds of the engineering team to start thinking in prompts instead of pull requests.
