Key Insight: This is Part 2 of a two-part series on how mabl built a system for AI agents to ship real code across 100+ repositories. Part 1 covered the three infrastructure layers we built to give agents the context, tools, and guardrails they need to work safely at scale. Here, we get into how agents actually use that system, and what we've learned in production.

Letting the agents use the system

Agents have changed the assumptions that previously drove prioritization. Small bug fixes and minor improvements that historically struggled to get prioritized can now be taken from concept to PR by technical support staff and product owners, given the right process and guardrails. In addition to enabling the dev team's agentic workflows, we wanted to leverage the components we built to fully automate bug fixes and smaller tickets with clear designs.

What We Built: We brought together the skills and setup above to create a ticket-working skill that uses a 4-phase process and Jira integration to take Jira tickets all the way to PR. Each phase posts its output back to Jira for feedback and uses ticket labels to track progress.

Phase 1: Analysis — An agent reads the ticket, queries relevant repos, checks git history, and outputs a technical summary: the key repos and classes, a high-level summary of the pieces needed, open questions, a complexity estimate, and a confidence assessment. This is all posted back to Jira, where users can respond with feedback and answer the open questions.

Phase 2: Planning — The agent generates an implementation plan with specific file paths, API endpoints, and test cases. The plan is a detailed overview of the approach the agent should take to achieve the goal of the ticket, including a validation plan for how we ensure the changes meet the target outcome.

Phase 3: Implementation — The agent writes code, runs tests locally, and submits PRs. Each PR links back to the original ticket and planning doc. The agent uses the cross-repo base skills, such as Git worktrees, to maintain clean isolation. A pre-push validation gate runs per-repo checks (prettier, eslint, tsc, spotlessCheck, compileJava) as a hard blocker before any code is pushed.
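
The pre-push gate can be sketched as a small wrapper that refuses the push if any per-repo check fails. The repo-type-to-command mapping below is a hypothetical illustration using the checks named above, not mabl's actual configuration:

```python
"""Sketch of a pre-push validation gate. The CHECKS mapping is
hypothetical; real repos would supply their own command lists."""
import subprocess
import sys

# Hypothetical mapping from repo type to the checks the post names.
CHECKS = {
    "typescript": [["npx", "prettier", "--check", "."],
                   ["npx", "eslint", "."],
                   ["npx", "tsc", "--noEmit"]],
    "java": [["./gradlew", "spotlessCheck"],
             ["./gradlew", "compileJava"]],
}

def run_gate(repo_type: str) -> bool:
    """Run every check for this repo type; any failure blocks the push."""
    for cmd in CHECKS.get(repo_type, []):
        if subprocess.run(cmd).returncode != 0:
            print(f"BLOCKED: {' '.join(cmd)} failed", file=sys.stderr)
            return False
    return True

if __name__ == "__main__" and len(sys.argv) > 1:
    # Exit non-zero so a git pre-push hook treats failure as a hard stop.
    sys.exit(0 if run_gate(sys.argv[1]) else 1)
```

Wired in as a git pre-push hook, a non-zero exit makes the gate a hard blocker rather than an advisory warning.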

Phase 4: Review — The code review and auto-fix agents review and clean up the PRs. Engineers review and approve+merge when ready.
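
The label-driven progression between phases can be sketched as a tiny state machine: a phase may only start once the previous phase's completion label is on the ticket. The label names below are hypothetical, not mabl's real labels:

```python
"""Minimal sketch of Jira-label phase gating. Label names like
'claude-analysis-done' are invented for illustration."""

# Ordered phases of the ticket-working skill.
PHASES = ["analysis", "planning", "implementation", "review"]

def may_start(phase: str, labels: set[str]) -> bool:
    """A phase may start only if the previous phase's done-label is set."""
    idx = PHASES.index(phase)
    if idx == 0:
        return True
    return f"claude-{PHASES[idx - 1]}-done" in labels

labels = {"claude-analysis-done"}
may_start("planning", labels)        # True: analysis is complete
may_start("implementation", labels)  # False: planning is not done yet
```

Because the labels live on the ticket itself, they act as hard state triggers: an agent restarted mid-pipeline re-reads the labels and cannot skip ahead.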

Confidence-Based Gating

Each of the first two phases has a confidence gate that uses an AI assessment to decide whether to halt or proceed. The skill also supports both interactive and headless modes.
In interactive mode (the default for developer-managed work), the pipeline always pauses between phases to wait for feedback. Confidence determines the recommendation ("proceed" vs "review carefully").

In headless mode (designed for full autonomy), confidence acts as an automated gate:

  • High confidence → auto-proceed: the analysis had 0 open questions, specific file references, and contained scope

  • Caution signals → block and stop: post the findings to Jira, add the claude-blocked label, and wait for a human

The confidence criteria are specific and measurable:

  • High confidence requires: 0 open questions, specific file/line references for bugs, small/medium scope for features, low/medium risk and contained blast radius for tech debt

  • Caution signals include: workaround/defensive fixes, breaking changes, missing META fields, high difficulty, large scope, architectural review needed

The Implementation phase is always treated as low confidence regardless of other signals — humans always see the PR before anything merges. There is no scenario where code auto-merges without human approval.
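
The headless gate can be sketched as a pure function over the analysis summary. The summary shape below is hypothetical, a simplified stand-in for the real assessment:

```python
"""Sketch of the headless confidence gate. The Analysis fields are an
invented, simplified shape; the real assessment has more signals."""
from dataclasses import dataclass, field

@dataclass
class Analysis:
    open_questions: int
    has_file_refs: bool
    scope: str                                  # "small" | "medium" | "large"
    caution_signals: list = field(default_factory=list)

def headless_gate(a: Analysis) -> str:
    """Return 'proceed' or 'block'. Implementation output is always
    human-reviewed regardless of this gate."""
    if a.caution_signals:
        return "block"  # e.g. breaking change: post to Jira, add claude-blocked
    high = (a.open_questions == 0
            and a.has_file_refs
            and a.scope in ("small", "medium"))
    return "proceed" if high else "block"

headless_gate(Analysis(0, True, "small"))                       # "proceed"
headless_gate(Analysis(2, True, "small"))                       # "block"
headless_gate(Analysis(0, True, "large", ["breaking change"]))  # "block"
```

Note the asymmetry: any single caution signal blocks, while proceeding requires every high-confidence criterion at once.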

What This Solved: The Planning phase reduced our implementation failures by ~60%. By forcing agents to declare their approach before coding, we surface misunderstandings early. The Jira label transitions act as hard state triggers. An agent can't skip to Implementation without completing Planning.

The Results

We built measurement and velocity reporting as part of the core system. Here's what changed from October 2025 to March 2026:

PR Velocity

  • October 2025: 291 PRs (12.6 PRs/working day)

  • March 2026: 732 PRs (33.3 PRs/working day)

  • +152% increase (~2.5x)

AI-Assisted Commits

  • October 2025: 17% AI-assisted commits

  • March 2026: 70% AI-assisted commits

Production releases

  • Oct 2025-Dec 2025: ~30-60 releases/month across all products

  • March 2026: 120+ releases

The Inflection Point: January 2026

Our data shows a clear tipping point in January 2026:

  • Application repo PRs jumped from ~200/month to 330 (+65%)

  • AI-assisted commit share tripled from 5% to 15%

  • DevOps AI-assisted commits went from near-zero to 25%

What happened? Two things. The agents repo launched (initialized Dec 5, 2025), centralizing all our agent orchestration code and making it easier for engineers to adopt agentic workflows, and Claude Code adoption reached critical mass. The combination of CLAUDE.md files + MCP servers + the 4-phase pipeline meant agents could finally operate autonomously without constant human hand-holding.

By February, agentic development became the default workflow for many engineers at mabl, not a side experiment.

Stable Headcount, Per-developer Gains

Our unique active contributors held steady at ~25/month throughout the period. The entire velocity gain came from individual developer throughput plus enabling non-developers to contribute, not from hiring. This is the key signal: agents didn't replace developers at mabl, they made each developer more productive. Several engineers saw significant velocity increases, with multipliers ranging from 4x to over 5x in monthly PR output.

Lessons Learned: Where Agents Still Break

This system isn't magic. Agents still fail at mabl. Here's where humans stay in the loop:

1. Context Drift Beyond ~3 Repos

Our Repo Coordination Graph works well for features spanning 2-3 repos. Beyond that, agents lose track of dependencies. We've seen cases where an agent correctly identifies repos A, B, and C need updates but forgets that repo D also imports the shared library. Human review catches these. Our cross-repo reviewer agent helps. It specifically checks changes spanning multiple repos for API contract compatibility and version coordination, but it's not a complete solution for complex multi-repo features.
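
The "forgot repo D" failure mode is exactly what a reverse-dependency lookup catches. The graph below is a hypothetical illustration, not mabl's real Repo Coordination Graph:

```python
"""Sketch of catching a missed consumer of a changed shared library.
The IMPORTS graph and repo names are invented for illustration."""

# repo -> shared libraries it imports (hypothetical data)
IMPORTS = {
    "repo-a": {"shared-lib"},
    "repo-b": {"shared-lib"},
    "repo-c": {"other-lib"},
    "repo-d": {"shared-lib"},
}

def consumers(lib: str) -> set[str]:
    """Every repo that imports the given library."""
    return {repo for repo, libs in IMPORTS.items() if lib in libs}

# The agent planned changes to A, B, and C but shared-lib also feeds D.
planned = {"repo-a", "repo-b", "repo-c"}
missed = consumers("shared-lib") - planned  # {'repo-d'}
```

A cross-repo reviewer can run this kind of check mechanically; judging whether the missed repo actually needs a change still falls to humans.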

2. Out-of-Date Dependencies

Agents occasionally invent API endpoints that don't exist and often pull in old versions of libraries and GitHub Actions. We've built mandatory code grounding into the system and use Context7 MCP to verify library APIs against current documentation before citing them. Our Planning phase catches most of these. Despite these safeguards, we still see ~5% of PRs where the agent misunderstands API contracts. Human reviewers spot these during the Review phase.
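
The stale-pin check portion of this grounding can be sketched as a lookup against a table of known-current versions. The package names and versions below are made up for illustration:

```python
"""Sketch of flagging stale dependency pins. CURRENT_VERSIONS is a
hypothetical stand-in for data fetched from current documentation."""

# Hypothetical known-current versions (e.g. resolved via a docs lookup).
CURRENT_VERSIONS = {"example-lib": "1.3.0", "example-action": "v4"}

def is_stale(name: str, pinned: str) -> bool:
    # Unknown packages can't be judged here, so treat them as not stale.
    return name in CURRENT_VERSIONS and CURRENT_VERSIONS[name] != pinned

is_stale("example-lib", "1.0.0")  # True: the agent pulled an old version
is_stale("example-lib", "1.3.0")  # False
```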

3. Agents Still Ignore Validation Plans

Claude will come up with thorough validation plans that include running unit tests, running a few key mabl tests, and exploratory validation. But if Claude hits an issue launching the app, it will often just quietly choose to skip large parts of the validation plan unless redirected. We’re continuing to iterate on making the app launching skills more reliable while actively investigating ways to leverage hooks and subagents to enforce the validation gates more strictly.
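
One shape the enforcement could take is a checklist gate: compare the declared validation plan against the steps actually completed and refuse hand-off while anything is outstanding. The step names below are hypothetical; this is a sketch, not our implementation:

```python
"""Sketch of a validation-plan enforcement gate (e.g. run from a hook).
Step names are invented; the real plan comes from the Planning phase."""

def unfinished_steps(plan: list[str], completed: set[str]) -> list[str]:
    """Steps the agent skipped; a non-empty result should block hand-off."""
    return [step for step in plan if step not in completed]

plan = ["unit tests", "key mabl tests", "exploratory validation"]
unfinished_steps(plan, {"unit tests"})
# ['key mabl tests', 'exploratory validation'] -> redirect the agent
```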

What We're Working On Next

One of the biggest benefits of rolling the system out has been getting it into everyone's hands with a framework that can be iterated on. Some current ideas and focus areas:

Customized Review Agents — Updating review agents for specific repos and feature types. For example, a custom review agent with instructions for identifying issues in our own agentic development.

Quality Verification Subagent — We believe moving quality verification into a dedicated subagent will help ensure that Claude sticks to validation plans and has a higher success rate on achieving high-level goals.

Operations and Release Management

Building out the infrastructure to chain dependency updates automatically between related repos via GitHub Actions, plus more structured management of production deployments, dev holds, and release ordering.

Takeaway: From Individual Success to Team-Scale Infrastructure

Individual developers are already succeeding with AI agents, and their productivity gains are real. Today’s challenge isn't proving that agents work; it's building an infrastructure that lets entire engineering teams use agents safely in production.

For us at mabl, the leap from "using AI agents" to "operationalizing AI agents" wasn't about writing better prompts. It was about architecture. Context management, validation gates, shared skill systems, and layered governance turned agents from individual productivity tools into team-scale infrastructure.

We went from 17% AI-assisted commits to 70% in six months. We more than doubled PR velocity. We're shipping 120+ releases/month (up from 30-60). And we did it with the same headcount.

The friction points we initially hit (context drift, validation failures, review bottlenecks, cross-repo coordination) were solvable with the right infrastructure. This is just one example of a reference architecture. It’s not the only way, but a way that's working for us.

The system isn't perfect. Agents still break. Humans are still essential, and they always approve the merge. But we've moved agents from individual chat sessions into production pipelines that serve 25 engineers across 100+ repos, and it's changed how we build software.

Try mabl Free for 14 Days!

Our AI-powered testing platform can transform your software quality, integrating automated end-to-end testing into the entire development lifecycle.