Generative AI unlocks many new and powerful use cases, but the non-deterministic nature of the technology comes with additional challenges during the development process. Hours of modifying configurations, prompt engineering, and hunting for edge cases can culminate in inconsistent performance and vibe-based quality. Without a clear strategy for discovering where the weak points are, migrating to a different model or model version, or even simply updating a prompt, can lead to regressions. These challenges become increasingly significant as models are released and retired faster and faster.
mabl understands that every single customer test must execute reliably, every single time, to ensure the quality of your application. We are committed to maintaining a high bar for quality to ensure our AI-powered features—such as GenAI Assertions, Test Creation, Auto-Healing, and more—support fast and complex testing capabilities.
We recently upgraded the model version that powers GenAI Assertions, using a thorough, data-driven strategy designed specifically for evaluating the performance of generative features. Unlike traditional software, where changes often have predictable outcomes, even a minor tweak to an AI model or a prompt can have unexpected and widespread effects. Testing needs to be tailored to these kinds of features, accounting for their variability and far-reaching effects. Our testing strategy allowed us to ensure reliability with a high degree of confidence, and we believe it will be helpful for other teams working on similarly complex generative AI features.
In a small test suite, hand-labeling ground truth results can be tedious but is doable, and it provides highly accurate measurements of performance. Scale the testing pool up by a few orders of magnitude, however, and manually reviewing every single test case is no longer a reasonable option.
A better way to understand the risks of a change, without spending valuable engineering hours on excessive labeling, is to look at the cases where outcomes differ from the baseline. For GenAI Assertions, this meant focusing on test cases where the result flipped from pass to fail, or vice versa. While improvements are generally positive, if gains in one area lead to a loss of accuracy in another, that may not be a change we want to release.
The goals of a change should be agreed on before the change and testing processes begin. For a metric like this, for example, you could set a hard threshold on the percentage of results that must change for the better. This not only allows for an accurate comparison between a potential change and the previous version, but when testing on larger sets of data, it also focuses attention on the most important cases.
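As a rough sketch of what this looks like in practice (the record shape, helper names, and 80% bar below are illustrative assumptions, not mabl's internal tooling), you only need to label the cases that flipped, then check them against the agreed-upon threshold:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AssertionOutcome:
    case_id: str
    baseline_passed: bool                 # result from the current production model
    candidate_passed: bool                # result from the candidate model or prompt
    ground_truth: Optional[bool] = None   # labeled only for cases that flipped

def flipped_cases(outcomes):
    """Only disagreements between baseline and candidate need human labeling."""
    return [o for o in outcomes if o.baseline_passed != o.candidate_passed]

def meets_release_bar(labeled_flips, min_improved_ratio=0.8):
    """After labeling just the flips, require that most of them changed for the better."""
    improved = sum(1 for o in labeled_flips if o.candidate_passed == o.ground_truth)
    regressed = len(labeled_flips) - improved
    ratio = improved / len(labeled_flips) if labeled_flips else 1.0
    return ratio >= min_improved_ratio, {"improved": improved, "regressed": regressed}
```

The important point is that the threshold is decided up front, so the comparison is made against a pre-agreed bar rather than a post-hoc judgment call.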
Of course, standard accuracy-related metrics are still important. Accuracy, precision, recall, and F1 (the harmonic mean of precision and recall, a more balanced summary than accuracy alone) are extremely useful for quickly summarizing performance. They just aren't enough to fully validate an entire change on their own.
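For completeness, these summary metrics are cheap to compute once labels exist; a minimal sketch using scikit-learn (the labels here are made up):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical labels: True means the assertion should pass / was judged to pass.
expected  = [True, True, False, True, False, True]
predicted = [True, False, False, True, False, True]

print("accuracy :", accuracy_score(expected, predicted))
print("precision:", precision_score(expected, predicted))
print("recall   :", recall_score(expected, predicted))
print("f1       :", f1_score(expected, predicted))
```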
Unlike traditional software, where success might be a simple binary, GenAI features demand a more nuanced testing approach. An important aspect of GenAI Assertions is that the model returns an explanation of how it came to its conclusion, which makes debugging much easier. We wanted to make sure the quality of these explanations remained high with any changes we made.
For shorter or more predictable outputs, metrics like ROUGE or BERTScore, originally designed for evaluating text summarization, can suffice. While these metrics have the benefit of being more standardized and reliable, they generally require human-created references, and the length and variation of GenAI Assertion responses made them less than ideal for our use case. So, we turned to LLM self-evaluation. Turtles all the way down.
Our setup involved providing a model with rubrics defining different levels of quality, along with the original generated response, and asking it to rate the output. There is always some risk in evaluation by LLM, especially when the same model is judging its own output (e.g., Gemini on Gemini), but this served as a satisfactory first-pass filter, giving a general idea of performance and even pointing us toward the cases that needed closer inspection. And while LLM evaluation can be pricey, especially as the test set grows, having an LLM parse the responses and extract a much smaller subset of poorly performing cases significantly reduces the time and energy required for human analysis.
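A minimal sketch of this kind of rubric-based grading is below. The rubric text, score threshold, and `call_llm` helper are placeholders (swap in whatever client and prompt you actually use); the idea is simply to let the LLM do the first pass and surface only the weak responses for human review:

```python
# Illustrative rubric; mabl's actual rubrics and prompts differ.
RUBRIC = """Rate the explanation from 1 to 5:
5 - cites specific evidence from the page and reasons clearly to the verdict
3 - verdict is supported, but the reasoning is vague or incomplete
1 - reasoning is missing, contradictory, or unrelated to the verdict
Respond with only the number."""

def grade_explanation(call_llm, assertion, explanation):
    """call_llm is any function that sends a prompt to your model and returns its text."""
    prompt = f"{RUBRIC}\n\nAssertion: {assertion}\nExplanation: {explanation}"
    return int(call_llm(prompt).strip())

def cases_needing_human_review(call_llm, cases, threshold=3):
    """First-pass filter: only low-scoring explanations go to a person."""
    return [
        case for case in cases
        if grade_explanation(call_llm, case["assertion"], case["explanation"]) < threshold
    ]
```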
We used the technique described above to move our GenAI Assertions feature to a more recent Gemini model. The new model provided “thinking” capabilities.
The framework above enabled us to tackle differences in prompt interpretation, but with the addition of thinking, our prompt was no longer the most efficient. Our original prompt contained detailed instructions for evaluating an assertion. With thinking, what if those instructions were unnecessary?
To investigate this, we revised the prompt by removing some of the detailed instructions. Instead of asking the model to give us its thoughts at each step along the way—which was essentially recreating the thinking it had already done—we let it think through the assertion first, and only asked for a summary to be returned.
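To make the shape of that change concrete, here is an illustrative before/after (simplified and hypothetical, not our production prompt):

```python
# Before: the prompt walks the model through each step and asks it to narrate them.
DETAILED_PROMPT = """Evaluate whether the assertion holds for the page content below.
Step 1: List the elements on the page that are relevant to the assertion.
Step 2: For each element, explain how it supports or contradicts the assertion.
Step 3: State an intermediate conclusion after each step.
Finally, answer PASS or FAIL with your full reasoning."""

# After: with a thinking model, let it reason internally and return only a summary.
SIMPLIFIED_PROMPT = """Evaluate whether the assertion holds for the page content below.
Think it through first, then return only:
- verdict: PASS or FAIL
- summary: a brief explanation of the key evidence."""
```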
This modification improved accuracy even further while also decreasing costs. The savings allowed us to increase our per-test GenAI Assertion limit from 6 to 30, which will let GenAI Assertions be used more extensively and unlock additional use cases.
Building reliable GenAI features is a unique challenge, and maintaining consistency and quality across frequent changes from providers is even tougher. Through upgrading the model behind GenAI Assertions, we developed an improved testing strategy applicable to many kinds of generative AI-powered features, and came away with a few primary takeaways: focus analysis on the cases where outcomes actually change, agree on success criteria before the change begins, use LLM evaluation to scale review where standard metrics and hand-labeling fall short, and revisit your prompts when a model's capabilities change.
With this strategy, we were able to increase the accuracy of GenAI Assertions while also decreasing cost, contributing to better performance and increased value for mabl users.