Generative AI is reshaping the landscape of software development, empowering us to build applications with capabilities that once seemed out of reach. From chatbots that understand nuanced conversations to personalized product recommendations that anticipate exactly what you need, these advancements are transforming user experiences. As engineering leaders and teams race to integrate this groundbreaking technology, though, a new set of challenges is emerging: generative AI testing is going to be paramount to ensuring the quality and reliability of these AI-powered applications.

The inherent complexities of generative AI models, particularly large language models (LLMs), introduce a host of unique quality concerns. Hallucinations (inaccurate or nonsensical outputs), unpredictable behavior, latency, errors, and the difficulty of explaining an AI's decision-making process all demand a thoughtful and comprehensive approach to building out testing steps and plans.

At mabl, we've recognized these challenges and continue to work on empowering teams to navigate this new generative AI testing landscape. In this post, we dive into these quality concerns and offer strategies for mitigating risks, focusing on common LLM APIs like Google Gemini, Anthropic Claude, and OpenAI ChatGPT. These considerations apply to other gen AI services as well, so this should give you a starting point for navigating the complexities of generative AI frameworks and a deeper understanding of the quality concerns involved.

When AI Takes Creative License: Addressing the Risks of Hallucination

Hallucination, the phenomenon where AI models generate inaccurate or nonsensical responses, is a critical concern for anyone building applications with large language models. Misleading or nonsensical information erodes user trust, and that's a risk no company wants to take! In our own benchmarking at mabl, we've found that hallucination rates vary depending on the model and task. While some LLMs exhibit limited hallucination for straightforward tasks, more creative tasks often lead to higher rates of inaccuracy. Unfortunately, because LLMs generate responses probabilistically, avoiding hallucination altogether isn't possible.

The good news is that since you know to expect it, you can significantly reduce hallucination by carefully selecting your models, writing detailed and targeted prompts, and fine-tuning your sampling parameters (Vibudh Singh's guide to controlling LLM model outputs is a great place to start). These parameters can be adjusted to strike the right balance between creativity and accuracy. In a future post, we'll talk more about how you can leverage mabl's AI-powered tools to proactively detect and manage hallucinations in your genAI-powered apps.
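To make the parameter tuning concrete, here's a minimal sketch using the OpenAI Python SDK (Gemini and Claude expose similar controls under slightly different names). The model name, system instructions, and parameter values are illustrative, not recommendations:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Lower temperature and top_p bias the model toward more predictable,
# grounded completions at the cost of some creativity.
response = client.chat.completions.create(
    model="gpt-4-turbo",  # illustrative; use the model you've benchmarked
    temperature=0.2,      # lower values reduce randomness
    top_p=0.9,            # nucleus sampling cutoff
    messages=[
        {
            "role": "system",
            "content": (
                "You are a travel assistant. Recommend only destinations "
                "you can support with facts, and say so if you are unsure."
            ),
        },
        {"role": "user", "content": "Where can I ski in July?"},
    ],
)

print(response.choices[0].message.content)
```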

Rethinking "Correctness" in the Age of AI: Generative AI Testing for the Right Outcomes

One of the most intriguing and challenging aspects of generative AI is its non-deterministic nature. For example, imagine you're asking an AI-powered travel chatbot like Priceline's Penny where to ski in July. One day, it might suggest Zermatt, Switzerland, renowned for its summer skiing. The next, it might recommend Portillo, Chile, another great option for winter fun in the Southern Hemisphere. Both answers are valid, but the variability challenges the traditional testing mindset, which relies on predictable, deterministic behavior.

In the realm of generative AI, "correctness" is more subjective; it's not about expecting the same answer every time. Instead, it’s about ensuring that the responses are appropriate and relevant to the context that’s being provided. To do this, we have to move away from strict comparisons and towards evaluating whether the output aligns with the user's intent and the overall goals of the application. We’re working towards a solution to testing unpredictable outputs with the mabl platform, but the universal truth here is that “correct” has many meanings when it comes to AI.
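One way to put this looser definition of "correct" into an automated check is to assert on properties that any acceptable answer should share, rather than on an exact string. Below is a hypothetical pytest-style sketch; the ask_chatbot helper and the list of acceptable destinations are placeholders for your own application and acceptance criteria:

```python
# Intent-based check: any valid answer should name a place where July
# skiing is actually possible and should stay on topic.
ACCEPTABLE_JULY_SKI_SPOTS = ["zermatt", "portillo", "valle nevado", "perisher"]


def ask_chatbot(question: str) -> str:
    """Placeholder for your application's chatbot call."""
    raise NotImplementedError


def test_july_ski_recommendation_is_plausible():
    answer = ask_chatbot("Where can I ski in July?").lower()

    # The exact destination may vary from run to run; what matters is that
    # the recommendation is appropriate for the user's request.
    assert any(spot in answer for spot in ACCEPTABLE_JULY_SKI_SPOTS)
    assert "ski" in answer
```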

Performance and Reliability Hurdles: Taming Latency and API Instability

Beyond output quality, LLM APIs bring performance and reliability concerns of their own. Responses from hosted models can take several seconds, latency can vary widely from request to request, and provider outages or server-side errors can leave your application waiting on answers that never arrive. Before wiring an LLM into a production workflow, it's worth asking:

  • How much latency is acceptable in this specific use case?
  • Can the app tolerate periods when the LLM provider is unavailable?
  • Does it make sense to implement redundancy, using multiple providers?

While some issues can be mitigated with client-side retries and frequent monitoring, others require deeper architectural considerations. Careful model selection is also key, as we saw significant speed differences between providers. In our benchmarking at mabl, we noted that Google’s Gemini 1.0 was not only 30% faster than OpenAI’s GPT 4 Turbo (for our specific multi-modal use case), but it also demonstrated impressive consistency with 27% lower latency variability and only a single server-side error across over 1,000 tests. Our initial tests suggest that Google’s Gemini 1.5 will continue this trend, outperforming both Claude 3 Opus and OpenAI’s GPT 4 Turbo. Understanding these nuances will help you choose the right LLM for your specific needs and optimize both the performance and reliability of your AI-powered apps.
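As a concrete illustration of the "retries and redundancy" idea, here's a minimal sketch of a client-side wrapper with exponential backoff and provider fallback. The ProviderError type and the provider callables are placeholders for thin wrappers around whichever SDKs (Gemini, Claude, OpenAI) you actually use:

```python
import time


class ProviderError(Exception):
    """Raised by a provider wrapper when a call fails or times out."""


def call_with_retries(call, attempts=3, backoff_seconds=1.0):
    """Retry a single provider call with exponential backoff."""
    for attempt in range(attempts):
        try:
            return call()
        except ProviderError:
            if attempt == attempts - 1:
                raise
            time.sleep(backoff_seconds * (2 ** attempt))


def generate(prompt, providers):
    """Try each provider in preference order until one succeeds."""
    last_error = None
    for provider in providers:
        try:
            return call_with_retries(lambda: provider(prompt))
        except ProviderError as error:
            last_error = error  # fall through to the next provider
    raise last_error
```

Whether this kind of fallback is worth the added complexity depends on your answers to the questions above; for some apps, graceful degradation or a cached response is the better trade-off.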

"Prompts are Code": Building Trust and Consistency Through Explainability and Prompt Engineering

One of the biggest challenges with generative AI testing lies in the "black box" nature of LLMs. Understanding why an LLM arrives at a particular response can be difficult, so troubleshooting issues and fine-tuning behavior isn’t exactly easy. This lack of explainability becomes even more critical when the model produces outputs that are unexpected or flat-out incorrect. Explainability in AI is vital for building trust in your AI-powered application and empowering your team. If you can understand why the model responded in a certain way, you can uncover biases, identify areas that need improvement, and ultimately deliver a more reliable and user-friendly experience.

Prompt engineering is a key tool in achieving both explainability and consistency in your app's behavior. Think of prompts as code: small changes can have significant, and sometimes unpredictable, effects on the output (missing comma, anyone?). Even slight variations in wording, like asking an LLM whether a translation "means the same thing" versus whether it's "accurate," can produce drastically different results. The same goes for the parameters you pass to the LLM: adjusting values like temperature, top_p, or top_k can dramatically alter its behavior. Because of this, it's crucial to treat prompt engineering with the same rigor as traditional code, incorporating change control and version control, establishing standards, and conducting thorough reviews and testing of your prompts.
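One lightweight way to apply that rigor is to keep prompts and their sampling parameters in versioned, reviewable artifacts rather than inline strings scattered through the codebase. A hypothetical sketch (the names and fields are illustrative):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptTemplate:
    """A prompt treated as code: versioned, reviewed, and testable."""
    name: str
    version: str
    template: str
    temperature: float


# Lives in the repository, so every wording or parameter change
# shows up in a diff and goes through review like any other change.
TRANSLATION_CHECK = PromptTemplate(
    name="translation-check",
    version="1.3.0",
    template=(
        "Does the following translation accurately convey the meaning of the "
        "original text? Answer 'yes' or 'no', then explain briefly.\n\n"
        "Original: {original}\nTranslation: {translation}"
    ),
    temperature=0.0,
)


def render(prompt: PromptTemplate, **values: str) -> str:
    """Fill in the template's placeholders for a specific request."""
    return prompt.template.format(**values)
```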

Beyond simply prompting for desired outputs, you can also use prompts to elicit explanations directly from the LLM and ensure consistent formatting. For example, when building a travel chatbot, you can add the following to any user prompt:

"Please provide your response in the following structured format:
Destination: [Name of destination]
Description: [Brief description highlighting the appeal to the user based on their input]
Potential Follow-up Questions:
[Question 1]
[Question 2]
[Question 3]"

This approach accomplishes several things:

  • Explainability: The prompt explicitly asks the LLM to justify its recommendation and give insight into how it arrived at that output.
  • Consistent Formatting: The prompt dictates a clear structure for the output, making it easier to parse and display within your application.
  • Enhanced User Experience: The suggested follow-up questions can help guide the conversation and encourage more interaction with your product.

By incorporating this type of structure, developers can glean more information about the factors influencing the LLM's responses, which promotes transparency and enables more consistent and informative interactions. This isn’t just great for the user experience but also helps to build trust in the AI-powered chatbot.
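To show how that consistent formatting pays off downstream, here's a sketch of how an application might parse the structured response. The field names follow the prompt above; the parsing assumes the model honors the requested format, which is itself something worth testing:

```python
import re
from dataclasses import dataclass, field


@dataclass
class Recommendation:
    destination: str
    description: str
    follow_up_questions: list = field(default_factory=list)


def parse_recommendation(text: str) -> Recommendation:
    """Parse the structured format requested in the prompt above."""
    destination = re.search(r"Destination:\s*(.+)", text)
    description = re.search(r"Description:\s*(.+)", text)
    tail = text.split("Potential Follow-up Questions:")[-1]
    questions = re.findall(r"^\s*(?:\d+\.\s*)?(.+\?)\s*$", tail, flags=re.MULTILINE)

    if not destination or not description:
        # The model ignored the format; surface this as a failure (or retry).
        raise ValueError("Response did not follow the requested structure")

    return Recommendation(
        destination=destination.group(1).strip(),
        description=description.group(1).strip(),
        follow_up_questions=[q.strip() for q in questions],
    )
```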

Tread Carefully: The Risks and Rewards of Upgrading Your LLM Model

The promise of improved performance and new features makes upgrading your LLM model tempting, but it's not without risks. Our own benchmarks have shown that even seemingly minor version changes can lead to significantly different responses to the same prompts. These unintended changes can easily disrupt your app's functionality if the new model version hasn't been thoroughly tested before you make the switch.

It's crucial to approach LLM upgrades with the same caution as any other major software change you would make. Rigorous testing before and after the upgrade is essential in identifying and addressing any unexpected behavior changes. By carefully evaluating the impact on your specific use cases, you can weigh the potential benefits of new models while also assessing the risks and ensuring the ongoing quality of your AI-powered features.
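A before-and-after upgrade check can be as simple as replaying a fixed set of prompts against the current and candidate model versions and comparing how many pass your acceptance criteria. The sketch below assumes a generic generate(model, prompt) client call and an application-specific meets_expectations check, both of which stand in for your own code:

```python
# Replay a golden prompt suite against both model versions and compare results.
GOLDEN_PROMPTS = [
    "Where can I ski in July?",
    "Summarize our refund policy in two sentences.",
    # ...the prompts that matter most to your application
]


def generate(model: str, prompt: str) -> str:
    """Placeholder for your LLM client call."""
    raise NotImplementedError


def meets_expectations(prompt: str, response: str) -> bool:
    """Placeholder for your application-specific acceptance criteria."""
    raise NotImplementedError


def compare_models(current: str, candidate: str) -> None:
    for model in (current, candidate):
        passed = sum(
            meets_expectations(p, generate(model, p)) for p in GOLDEN_PROMPTS
        )
        print(f"{model}: {passed}/{len(GOLDEN_PROMPTS)} prompts passed")
```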

Building a Better Future with AI: The Importance of Robust Testing Steps for Generative AI Applications

As we've explored, integrating generative AI frameworks into software applications opens up exciting new possibilities for user experiences and product innovation. It also introduces a unique set of challenges that demand a thoughtful and comprehensive approach to testing as you weigh where and how to use these models.

From hallucinations and unpredictable behavior to latency, errors, and the intricacies of explainability and prompt engineering, ensuring the quality and reliability of AI-powered features requires a well-planned strategy. By embracing the insights and techniques discussed here, you can address these challenges before they become problematic, build trust with your users, and confidently deliver AI-powered applications that truly live up to their potential.

At mabl, we're committed to empowering teams with the AI test automation tools and knowledge they need to navigate the ever-evolving landscape of generative AI testing. We recognize that this field is still maturing rapidly, and we're actively researching and developing innovative solutions to help you overcome these challenges. We encourage you to share your own experiences and insights as we collectively build a better future with AI. If you'd like to explore some of mabl's existing AI and genAI testing capabilities, you can start a free 14-day trial. Together, we can harness the transformative power of generative AI while maintaining the highest standards of quality and reliability.