Something shifted over Thanksgiving weekend.
On November 24, 2025, Anthropic released Claude Opus 4.5 alongside major updates to Claude Code; for those of us who've spent decades building software, the ground moved. Not in a hype-cycle, keynote-presentation kind of way. In a sit-down-and-build-something-you-couldn't-build-last-week kind of way.
I've been in this industry long enough to be skeptical of "everything has changed" narratives, so I'll be direct: this is the most exciting moment in software history. It's not because AI can write code, though. It's because, for the first time, continuous, autonomous development has become real. It’s not a demo. It’s not a prototype. It’s a way of working. And that changes everything about how we think about quality.
The End of "Vibe Coding" as an Insult
For the past couple of years, "vibe coding" has been shorthand for a real problem: people leaning on AI assistants to generate code that looks right but doesn't hold up under pressure. It was fast, and it was often sloppy. Fair criticism.
Here's what's different now: the code coming out of these agentic workflows isn't slop. Engineers working with tools like Claude Code are producing production-quality software – code that's well-structured, that handles edge cases, that you'd be comfortable shipping. A year ago, the best AI models scored 35% on SWE-bench Verified, the most widely cited software engineering benchmark. Today, the leading models are clearing 70%, and Opus 4.5 scores 80.9%. That's a fundamentally different capability.
But – and this is the part that matters – it doesn't just happen. You don't point AI to a problem and walk away. The teams getting real value from autonomous development are the ones investing in the scaffolding: clear instructions, well-designed tools, structured pipelines, and, most importantly, testing. The emerging best practices for autonomous development look a lot like the best practices for any high-performing engineering team, just at a pace and scale that would have been unimaginable two years ago.
What Agentic Development Means for the World
If software teams can ship high-quality features continuously (not in two-week sprints, but as an actual, constant stream), the potential to accelerate what software can do across the most important parts of our daily lives is enormous. Every business, every public institution, every personal tool you rely on could improve faster. That means healthcare systems that adapt faster, financial platforms that respond to new regulations in days instead of quarters, and small businesses getting access to the kind of software that used to require a dedicated engineering team.
That's the upside. And it's real.
But we have to do it safely. This is particularly true in enterprise software, where the cost of mistakes can be catastrophic. A broken image confirmation in an e-commerce checkout flow is a poor user experience. A broken workflow in a financial services platform, a healthcare system, or a compliance pipeline? That's a different magnitude of consequence entirely. Look at the recent issues Amazon faced, losing 6.3 million orders to poorly tested AI-generated code. Ask whether quality practices are keeping up with this risk, and what we've seen so far is that they are not.
The Quality Gap Is Becoming a Quality Chasm
Here's the tension at the center of this moment: development velocity is accelerating dramatically while testing maturity lags behind. Manual QA can't scale to the velocity of AI-generated code; there aren't enough humans to handle that speed. And the problem compounds as more code is created faster and faster. Eventually, development environments will be so far out of sync with testing and production environments that there is no catching up.
That was already unsustainable. Now add continuous agentic development to the picture, and the math breaks completely.
Quality Investments Help Us Move Fast Safely
The bulk of investment in agentic work so far has gone to what we call the inner loop: the place where developers work directly with Claude Code or a similar tool. If we want to move quickly and safely, we need matching testing investments in the outer loop, where that same code leaves developers' hands and moves into staging, production, and the hands of real users. The outer loop needs reviews, checks, testing, and monitoring at the same pace the inner loop creates code.
In the agentic inner loop, the feedback mechanisms are familiar: code reviews, unit tests, linting, type checking. The agent writes code, the guardrails catch problems, the developer reviews and iterates. That loop is getting really good, really fast. So fast, in fact, that many developers are moving away from even reviewing their own code extensively. Whether that's because they believe the AI and agentic tools are sufficient, because review would slow the agents down, or because the cognitive load of reviewing that much code is too high, it's hard to say.
But the inner loop isn't enough.
In the outer loop, we need something fundamentally different. We need massive test coverage that can keep pace with a constant stream of change. This is the type of coverage that can quickly identify issues, diagnose root causes, and tell you whether the thing your customer actually cares about still works. This is the type of testing that requires system context, not just code context, and coverage that’s persistent across the system, learning and changing as your application does. And because the whole team is involved in the outer loop, the whole team is accountable for its success.
Traditional approaches can't do this. Manual testing was never going to scale for continuous delivery. Brittle, script-based automation breaks the moment the UI shifts, and in a world where the UI is changing constantly, that means broken tests every day. Constantly failing tests train teams to ignore the failures, which could turn out worse than having no tests at all.
The Same Technologies Powering Agentic Development Are Powering Agentic Testing
Here's where the industry's most exciting moment lies: the same capabilities that make continuous agentic development possible are exactly what's powering continuous agentic testing. Reasoning, memory, context, tools, orchestration – these are the building blocks of a fundamentally new approach to quality.
A developer agent understands code context really well. It can read a codebase, understand the architecture, and reason about how a change will ripple through the system. It operates in the world of functions, dependencies, and data structures.
Now think about what a tester agent needs to do: it needs to understand user and product context. It needs to know what a customer is trying to accomplish, how they navigate the application, what "working correctly" looks like from the outside in, not from the code base on up. It needs to remember what the application looked like yesterday, notice what changed today, and reason about whether that change broke something that matters. It also needs to be good at things that regular LLMs struggle with, like deep integrations with enterprise testing environments and ecosystems, capturing rich evidence to support its decisions, and storing historical data and metrics to ensure its long-term stability.
These are different orientations, but they draw on the same underlying capabilities. The developer agent is focused on the code. The tester agent is focused on the user and the product. Together, they form the complete picture.
What Continuous Agentic Testing Actually Looks Like
When I talk about continuous agentic testing, I'm not talking about adding AI to your existing test scripts. I'm talking about a fundamentally different model, shifting from managing tests to managing intent.
Instead of maintaining thousands of brittle scripts, you declare what matters: "The guest checkout must always work." "A user searching for a product must see relevant results." "The lead-to-cash workflow must complete without errors." These goals are expressions of intent, each anchored to the experience the customer expects.
An agentic testing platform takes those goals and does what a really good human tester would do, continuously: it explores the application, builds the necessary coverage, identifies when something drifts from the desired state, diagnoses why, and tells you what matters. It remembers context across runs, adapts when the UI changes, and reasons about whether a failure is a real problem or just noise.
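The goal-driven model above can be sketched in a few lines. This is a toy illustration, not a real platform API: the goal strings come from the examples in the text, but the `goals` mapping and the check functions are entirely hypothetical stand-ins for the exploration and verification an agentic platform would do on its own.

```python
# Toy sketch (hypothetical, not a real testing API): goals declared as
# intent rather than as step-by-step scripts. The lambdas stand in for
# the exploration and verification an agentic platform would do itself.

goals = {
    "guest checkout always works": lambda app: app["checkout_ok"],
    "search returns relevant results": lambda app: app["search_relevance"] > 0.8,
}

def evaluate(app_state):
    """Return the declared goals that currently fail; no selectors involved."""
    return [name for name, check in goals.items() if not check(app_state)]

# One snapshot of application state: checkout works, search has drifted.
app_state = {"checkout_ok": True, "search_relevance": 0.6}
print(evaluate(app_state))  # ['search returns relevant results']
```

The point of the sketch is the shape of the interface: the team owns the goal statements, and everything below them is the platform's job.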
What Enterprises Should Be Thinking About
If you're an engineering leader evaluating how to bring agentic testing into your organization, there are a few things worth considering.
First is the “fragmented agent” trap: multiple tools, each purporting to do one aspect of your testing process well, whether that’s test creation, execution, triage, or reporting. That works for experimentation, but it doesn’t scale. You need a unified system that maintains context across the full lifecycle. Consider the complex systems behind something like booking a flight, where one broken connection can ripple across hundreds of third-party APIs. Your tooling needs a complete understanding of the application to avoid something catastrophic.
Second, you want to look for platforms that operate on intent, not scripts. The shift from "check if this selector exists" to "verify that this user journey works" is the difference between automation and intelligence. Script-based tests are inherently brittle, in that you always have to convert intent into scripts with specific selectors, while agentic systems can work from the original intent directly. Script-based testing asks narrow questions. Agentic testing reasons about outcomes.
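That difference can be made concrete with a toy example; the pages here are simulated as plain dictionaries, with no real browser or testing library involved. A selector-bound check breaks when the UI is refactored, while a check written against the original intent survives.

```python
# Toy sketch: a brittle selector-based check vs. an intent-based check.
# Pages are simulated as lists of element dictionaries, not real DOMs.

# Yesterday's page, and today's page after a refactor renamed the button.
page_v1 = [{"id": "btn-checkout", "role": "button", "label": "Checkout"}]
page_v2 = [{"id": "cta-purchase", "role": "button", "label": "Proceed to checkout"}]

def script_check(page):
    # Brittle: hard-codes the selector the original intent was compiled into.
    return any(el["id"] == "btn-checkout" for el in page)

def intent_check(page):
    # Resilient: works from the intent itself ("a checkout button exists").
    return any(el["role"] == "button" and "checkout" in el["label"].lower()
               for el in page)

print(script_check(page_v1), script_check(page_v2))  # True False
print(intent_check(page_v1), intent_check(page_v2))  # True True
```

The rename breaks the scripted check even though the user journey still works; the intent check asks the question the customer actually cares about.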
Third, demand explainability. Autonomous doesn't mean opaque. Every action an agentic tester takes should be traceable, auditable, and tied back to a human-defined goal. The teams I've seen succeed with this approach are the ones who maintain clear governance, including the ability to exercise a hard veto on high-risk changes, regardless of what the AI recommends.
And finally, think about this as a shift in your team's role, not a replacement of your team. Quality engineers aren't going away, but their work is evolving from executing test cases to defining what "done" means, from triaging failures to auditing agent reasoning, from writing scripts to setting strategy. That's a more interesting job, and a more impactful one.
Looking Forward
We're at the beginning of something genuinely transformative. The combination of continuous agentic development and continuous agentic testing has the potential to close the quality gap that's been widening for years — and to do it in a way that actually makes engineering teams more productive, not just faster.
But it won't happen automatically. It requires intentional investment in the quality layer. It requires new ways of thinking about what testing means when code is being written and shipped continuously. And it requires trusting — and verifying — that the agents we're building alongside are doing work we can stand behind.
The promise of what agentic development can do in the future will be undermined if investment in quality isn’t a priority today. With this speed comes a responsibility to ensure we aren’t just building quickly, but building safely. Because the most exciting thing about this moment isn't how much code we can write. It's that we're finally building the infrastructure to trust it.
Try mabl Free for 14 Days!
Our AI-powered testing platform can transform your software quality, integrating automated end-to-end testing into the entire development lifecycle.
