The mabl blog: Testing in DevOps

Rebuilding an AI Agent the Right Way: Measurement, Not Guesswork | mabl

Written by Lauren Clayberg Leidal | Feb 25, 2026 6:22:13 PM

We refactored mabl's test creation agent after nine months in production. But we didn't rebuild on gut feel. We analyzed 2,404 real customer sessions with an AI-powered review agent to measure exactly what was breaking and what to fix. Here's what production data at scale teaches you about building agents that actually work.

The hardest part of building production AI agents isn't shipping them; it's understanding whether they are getting smarter as they get used.

At mabl, we shipped our first agentic test creation system in spring 2025, well before most testing tools had AI on their roadmap. After nine months in production, we rebuilt it from the ground up. It wasn’t broken (in fact, customers were using it with great success), but the original architecture was ready for an evolution.

The decision to do a refactor wasn't based on anecdotal feedback or internal hunches. We built an AI-powered review agent to analyze 2,404 real customer sessions across 439 accounts, measuring behavioral quality that traditional metrics can't capture: looping patterns, error recovery, assertion quality, decision-making consistency. The data told us exactly what to rebuild and, once we shipped the updates, proved the refactor worked.

What Production Usage Taught Us

Our first-generation agent worked. It could interpret screenshots, decide on next steps, interact with page elements, and generate assertions. Customers were creating tests with it daily.

But "working" and "working well at scale" aren’t always the same thing. As usage grew, we started seeing certain patterns:

  • Sessions where the agent looped through the same failed action repeatedly
  • Click interactions that should have succeeded but didn't, with no recovery path
  • Tests that worked in demos but struggled with real enterprise UIs where you see nested iframes, custom components, and legacy authentication flows

Traditional metrics like “test save rate” couldn't tell us why these patterns emerged or how to fix them. The behavioral analysis we gleaned from production usage held the answers.

The V1 Architecture: Built for a Different Era

The original agent was built around a single-shot pattern: client sends state, server returns action, client executes, repeat. Every API call was stateless. The model never saw its own history. It had limited memory of what it had tried, what failed, or what it had already considered.
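The single-shot pattern can be sketched in a few lines. This is an illustrative reconstruction, not mabl's actual code: the names (`PageState`, `decide_next_action`, `run_session`) are assumptions, and the model call is stubbed out.

```python
# Hypothetical sketch of the V1 single-shot loop: client sends state,
# server returns one action, client executes, repeat. Every call is
# stateless; the decision function never sees its own history.
from dataclasses import dataclass

@dataclass
class PageState:
    url: str
    screenshot_b64: str

def decide_next_action(state: PageState) -> dict:
    # Stateless: each call sees only the current state.
    # (A real implementation would invoke a model here.)
    return {"tool": "click", "target": "#submit"}

def run_session(get_state, execute, done) -> None:
    while not done():
        state = get_state()                 # client sends state
        action = decide_next_action(state)  # server returns an action
        execute(action)                     # client executes, then repeat
```

Because `decide_next_action` takes only the current state, a failed click looks identical on every iteration, which is exactly the looping failure mode described below.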

This was the pragmatic decision for early 2025, but as model capabilities improved and production usage revealed edge cases, the constraints began to hurt. The monolithic design meant that adding new context or tools required changes across multiple files. Meanwhile, stateless execution meant the agent approached every decision in isolation, unable to learn from its own mistakes within a single session.

The Refactored Architecture: Composable by Design

The new system replaced the monolith with a composable framework built around a few core principles.

A Shared Agent Framework

The test creation agent is now much smaller, acting as a thin layer on top of a shared base framework that handles the complete agent lifecycle: initializing context, registering tools, generating prompts, invoking the model through multi-round function calling loops, and returning results. A new agent only needs to define its system prompt, default capabilities, and any custom logic. Everything else is inherited.
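The "agent as configuration" idea might look something like the sketch below. All class and method names here are invented for illustration; mabl's actual framework surface is not public.

```python
# Hedged sketch: a base class owns the lifecycle, a concrete agent is
# mostly configuration (system prompt + default tools).
class BaseAgent:
    system_prompt: str = ""
    default_tools: list = []

    def __init__(self):
        # Tool registration is inherited, keyed by function name.
        self.tools = {t.__name__: t for t in self.default_tools}

    def build_prompt(self, context: dict) -> str:
        # Prompt assembly is inherited too.
        parts = [self.system_prompt]
        parts += [f"{k}: {v}" for k, v in context.items()]
        return "\n".join(parts)

    def run(self, context: dict) -> str:
        # In a real framework this would drive the multi-round
        # function-calling loop; here we just return the prompt.
        return self.build_prompt(context)

def list_flows() -> list:
    return ["login", "checkout"]

class TestCreationAgent(BaseAgent):
    # A new agent defines only its prompt and capabilities.
    system_prompt = "You create end-to-end UI tests."
    default_tools = [list_flows]
```

A future debugging or maintenance agent would be another small subclass, which is the "configuration layer rather than another monolith" point made below.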

This isn't just cleaner code. It's a bet on the future. We know we'll need agents for test debugging, maintenance, and planning. The framework means each new agent is a configuration layer rather than another monolith to maintain.

Artifacts Replace Monolithic State

The sprawling state object was replaced with focused, schema-validated artifacts. These are self-contained pieces of context like the test outline, execution state, application state (screenshot, URL, tabs), and workspace context. Each artifact knows how to describe itself for the system prompt and how to contribute to the overall context.

This makes prompt construction automatic. When a new artifact is registered, the prompt updates to include it. Schema validation catches poorly-formed context at runtime, avoiding mysterious model behavior downstream.
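A minimal sketch of the artifact idea, with invented field names and a deliberately simple stand-in for schema validation: each artifact checks its own shape and renders its own contribution to the prompt.

```python
# Illustrative artifact: validates its own "schema" and describes itself
# for the system prompt. Field and method names are assumptions.
from dataclasses import dataclass

@dataclass
class ApplicationState:
    url: str
    tab_count: int

    def validate(self) -> None:
        # Catch malformed context at registration time, instead of
        # debugging mysterious model behavior downstream.
        if not self.url.startswith("http"):
            raise ValueError("url must be absolute")
        if self.tab_count < 1:
            raise ValueError("tab_count must be >= 1")

    def describe(self) -> str:
        return f"Application state: {self.url} ({self.tab_count} tab(s))"

def build_context(artifacts) -> str:
    # Registering an artifact automatically adds it to the prompt.
    for a in artifacts:
        a.validate()
    return "\n".join(a.describe() for a in artifacts)
```

In a production system the validation would come from a real schema library rather than hand-written checks, but the division of labor is the same: the artifact owns both its shape and its prompt text.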

Provider-Agnostic LLM Integration

We replaced direct SDK calls with a provider abstraction. Each provider implements a small interface – convert the request, generate a response, convert back to a common message format. A factory wraps these into a unified provider with automatic message normalization, built-in function calling loops, streaming support, and retry logic.

Adding a new LLM provider means implementing a handful of functions. Switching between providers is a routing decision, not a rewrite. This has already paid off: we can test the same agent behavior across different models to evaluate which performs best for specific sub-tasks.
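The shape of such an abstraction might look like this sketch, with fake providers standing in for real SDK integrations; the interface, registry, and provider names are all assumptions.

```python
# Sketch of a provider abstraction: each provider implements one small
# interface, and a factory makes model choice a routing decision.
from typing import Protocol

class Provider(Protocol):
    def generate(self, messages: list[dict]) -> dict: ...

class FakeProviderA:
    def generate(self, messages):
        # A real provider would convert to the vendor's request format,
        # call its SDK, and convert back to a common message shape.
        return {"role": "assistant", "content": "A:" + messages[-1]["content"]}

class FakeProviderB:
    def generate(self, messages):
        return {"role": "assistant", "content": "B:" + messages[-1]["content"]}

def make_provider(name: str) -> Provider:
    # Switching models is routing, not a rewrite.
    registry = {"a": FakeProviderA, "b": FakeProviderB}
    return registry[name]()
```

Running the same agent against `make_provider("a")` and `make_provider("b")` is what enables the per-sub-task model comparisons mentioned above.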

Smart Tool Execution

The original system ran every tool on the client. The new architecture separates client-side and server-side tool execution, which unlocked significant capabilities.

Server-side tools (like fetching available flows or analyzing workspace context) resolve without any client round-trip. Client-side tools handle actions that must happen in the browser: clicks, text entry, waits. This separation allows us to build intelligence into agents and share it across different agent types without customers having to update their clients.

The practical impact goes beyond performance. We've built a server-side library of internal tools that make agents more aware of mabl-specific data and patterns. When we improve agent intelligence or add new capabilities, those upgrades happen immediately on the server. The agent gets smarter without any client-side deployment.
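The routing logic behind this split can be sketched simply. Tool names and the dispatcher are hypothetical; the point is that server-side tools resolve locally while client-side tools go over the wire.

```python
# Illustrative tool routing: server-side tools resolve with no client
# round-trip; client-side tools must execute in the browser.
SERVER_TOOLS = {
    "fetch_flows": lambda args: ["login", "checkout"],  # runs server-side
}
CLIENT_TOOLS = {"click", "type_text", "wait"}           # run in the browser

def dispatch(tool: str, args: dict, send_to_client):
    if tool in SERVER_TOOLS:
        # Upgrades to these tools ship without any client deployment.
        return SERVER_TOOLS[tool](args)
    if tool in CLIENT_TOOLS:
        return send_to_client(tool, args)
    raise KeyError(f"unknown tool: {tool}")
```

Adding a new entry to `SERVER_TOOLS` makes every agent built on the shared framework smarter at once, which is the "no client-side deployment" benefit described above.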

Conversation Memory as Architecture

The most consequential change is that the agent now maintains conversation history. Instead of treating each action as an isolated request, the model can see what it's tried, what worked, and what failed.

This architectural shift directly addresses the looping problem we saw in production data. When the agent clicks an element and nothing happens, it can now reason, "I already tried this approach; let me try a different strategy," rather than repeating the same failed action over and over. The conversation becomes the state that enables learning within a session.
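A minimal sketch of how in-session memory breaks the loop, assuming a history of `(action, outcome)` records; the function names are illustrative.

```python
# With conversation memory, a known-failed action triggers a new
# strategy instead of a blind retry. History is just the record of
# what was tried and whether it worked.
def already_failed(history: list[dict], action: dict) -> bool:
    return any(h["action"] == action and not h["ok"] for h in history)

def choose(history: list[dict], preferred: dict, fallback: dict) -> dict:
    # "I already tried this approach; let me try a different strategy."
    return fallback if already_failed(history, preferred) else preferred
```

The stateless V1 loop had no `history` to consult, so the same `preferred` action came back on every iteration.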

Measuring What Matters: The Review Agent Framework

Most vendors can't tell you how their agents behave within sessions: Are they making good initial decisions? Are they recovering from errors effectively? Are the outputs high-quality? Instead, they continue to ship features and, at best, measure usage.

We needed more than that. Traditional metrics hide the behavioral patterns that actually matter for agent quality. We wanted a system that would let us continuously measure and improve agent behavior at scale, so we built a review agent. This AI system is designed to analyze agent sessions with the same rigor a human expert would apply, but across thousands of sessions simultaneously.

How the Review Agent Works

We designed a structured analysis framework that evaluates each test creation session across multiple dimensions:

  • Click and element interaction failures: Did the agent successfully interact with page elements, or did it struggle with targeting or execution?
  • Delete step usage and correctness: When the agent removed steps, was it making good corrections or creating new problems?
  • Assertion quality: Are assertions meaningful verifications, or overly specific checks tied to implementation details?
  • Looping behavior: Is the agent making productive retries with different strategies, or unproductively spiraling through the same failed approach?
  • Missing capabilities: Are there patterns where the agent lacks the tools it needs to complete the task?

We ran this analysis across hundreds of sessions from both the old and refactored agents, excluding internal usage to ensure results reflected real customer patterns.

What the Numbers Revealed

The data told an exciting story: clean sessions (where a test was saved successfully with no looping or interaction issues) increased roughly 4.5x; severely problematic sessions (where issues are compounding across looping, click failures, and problematic deletes) dropped by over 90%.

Even more interesting is what we saw in the behavioral patterns:

Looping became predictive: In the old agent, looping occurred at roughly the same rate regardless of whether the test was ultimately saved. In the refactored agent, saved tests showed meaningfully fewer loops than unsaved tests. The agent was now recovering from issues rather than getting stuck in them.

Click issues became correlated with outcome: The old agent had similar click issue rates whether the test was saved or not. It couldn't distinguish between interaction problems that were recoverable and ones that were fatal. The refactored agent showed a clear gap between saved and unsaved sessions, indicating better error recovery.

Assertions got smarter, not just fewer: Assertions per session dropped significantly, but assertion quality improved by several percentage points. The agent's verifications became more focused and meaningful, rather than spraying assertions to see what sticks. Specific categories of bad assertions saw dramatic improvements: errors where the agent confused an element's display text with its underlying value dropped by over 80%.

Delete behavior told a story about decision quality: The old agent used delete steps in over half of sessions, and more than half of those deletes caused more problems. The refactored agent needed deletes far less often, and when it did delete, the vast majority of those deletes were correct. The agent was simply making better initial decisions.

This level of behavioral analysis isn't standard practice in AI development, and it's impossible without structured evaluation at scale. With our baseline behavioral metrics from thousands of sessions, we know that when architectural changes happen, we don't have to guess at whether they worked. We measure agent behavior the same way we'd measure any production system: with data, not demos.

Why Traditional Metrics Can Mislead

One metric initially looked concerning: test save rate dipped after the refactor. The review agent was critical to telling the more complete story.

Sessions with zero assertions increased substantially. These were incomplete or abandoned sessions where users disengaged before reaching the assertion phase; these were not agent failures. Meanwhile, among sessions that progressed to assertions, save rate improved and assertion quality increased significantly.

This is a great example of why behavioral analysis matters: simplified, aggregate outcome data doesn't account for user engagement patterns. Understanding how the agent performs within completed sessions gives you the signal you actually need to improve the system.
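The arithmetic behind this is worth making concrete. The numbers below are invented for illustration (the post does not publish the underlying counts), but they show how abandoned zero-assertion sessions can drag an aggregate save rate down even while the engaged-session save rate is high.

```python
# Toy illustration with invented numbers: aggregate save rate dips
# because abandoned (zero-assertion) sessions count against it,
# masking a strong save rate among sessions that reached assertions.
def save_rate(sessions):
    return sum(s["saved"] for s in sessions) / len(sessions)

sessions = (
    [{"assertions": 0, "saved": False}] * 40   # abandoned before assertions
    + [{"assertions": 3, "saved": True}] * 54  # engaged and saved
    + [{"assertions": 3, "saved": False}] * 6  # engaged, not saved
)

overall = save_rate(sessions)                                      # 0.54
engaged = save_rate([s for s in sessions if s["assertions"] > 0])  # 0.90
```

Segmenting by engagement is the same move the review agent made: the 54% aggregate looks alarming, while the 90% engaged-session rate tells the real story.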

Identifying What to Build Next

The review agent also gave us a prioritized roadmap. By examining sessions where the agent lacked the tools it needed, we could see the gaps and build fixes for them: tab switching (so agents can handle multi-tab workflows), access to ARIA snapshots for better accessibility tree understanding, and improved element interaction via the experimental model options in mabl Labs all came from these sessions.

These aren't guesses at what customers want; they come from structured analysis of real session failures, ranked by the number of sessions each missing capability affected. The review agent continues to surface new capability gaps as we analyze more sessions, giving us a data-driven backlog.

What We Learned

Tech debt in AI compounds faster than traditional software. The patterns and capabilities available to agent builders are evolving on a timeline of months. An architecture that was pragmatic in spring 2025 was constraining by early 2026. Planning for refactoring isn't a sign of poor initial design. It's a recognition of how fast the ground is shifting.

Agent observability requires more than metrics dashboards. Session-level success/failure rates hide the behavioral patterns that actually matter: whether the agent is making good initial decisions, recovering from errors effectively, and generating quality outputs. We found that a structured AI-powered review process revealed insights that no amount of aggregate data could surface.

Composability is the hedge against uncertainty. We don't know what the best model for each agent sub-task will be in six months, or what new capabilities we'll want to add. The refactored architecture (with its provider abstraction, artifact system, and shared framework) means we can adapt to whatever comes next without another ground-up rewrite.

Measure behavior, not just outcomes. The refactored agent's test save rate dipped slightly, which would have been alarming in isolation. Understanding the why through nuanced session analysis revealed that the actual agent behavior improved dramatically across every quality dimension. The outcome data made more sense when accounting for the behavioral implications.

Why This Refactor Was an Investment, Not a Reset

In traditional software, a nine-month rebuild signals architectural failure. In AI agent development in 2025, it signals discipline.

The model capabilities, function calling patterns, and production learnings available today didn't exist when we shipped V1. Companies that aren't refactoring their early AI systems are either not learning from production usage, or they're locked into architectures that can't evolve, and both of those are problems.

The composable framework, provider abstraction, and artifact system mean future improvements can be additive: new agents, new tools, new models, all without touching core architecture. This refactor bought us an architectural runway that should carry us through the next phase of AI evolution.

More importantly, we now have the measurement infrastructure to validate those improvements. The review agent framework isn't going away; it will continue to be how we evaluate every architectural change going forward.

What This Means for Testing Strategy

If you're evaluating AI-powered testing tools, the refactor story should actually be reassuring. It means we're learning in production, measuring rigorously, and willing to invest in long-term architecture.

Ask any AI testing vendor these questions:

  • Can they explain how they measure agent quality beyond pass/fail rates?
  • Do they have production behavioral data at scale, or just demo videos?
  • Is their architecture locked to a single model provider?
  • Can they show you structured failure mode analysis?

For mabl, the answers are yes. We built the measurement systems to make them yes.

The review agent framework continues to evaluate new production sessions as they happen, analyzing behavioral patterns to understand how the agent performs. This informs our architectural decisions without training on customer data: we measure agent behavior, not test content.

The composable architecture means we can adapt to new model capabilities without another rebuild. Together, these capabilities (rigorous measurement and architectural flexibility) are the foundation for building AI agents that get measurably better over time, not just different.

The AI agent space is moving fast enough that the systems we build today will need to evolve tomorrow. The goal isn't to build the perfect architecture on the first try. It's to build one that can grow, and to have the observability to know whether the changes you're making are actually working. For mabl's test creation agent, the combination of a composable architecture and AI-powered behavioral analysis gave us both.