The mabl blog: Testing in DevOps

How We Built a System for AI Agents to Ship Real Code Across 75+ Repos [Part 1 of 2]

Written by Geoff Cooney | Apr 8, 2026 6:26:42 PM

Key Insight: Individual developers using AI agents see 2-5x gains, but scaling across teams requires infrastructure. mabl's engineering team built a four-layer system: per-repo operating manuals, validation pipelines, shared agent/human tools, and layered governance where AI blocks bad code and humans always approve merges. This architecture delivered significant efficiency gains and is offered as a reference for teams looking to operationalize agentic development at production scale.

Coming into 2026, we at mabl could feel a shift happening. Our engineers were seeing efficiency improvements by leveraging agentic coding tools. A handful of our engineers with deep understanding of mabl systems were seeing even more dramatic gains. PMs and other stakeholders were starting to be able to contribute to the mabl codebase. Features that followed existing patterns could be done almost entirely hands-off.

Our experience aligned with industry trends. Anthropic published a report in early December showing that employees who used Claude on 60% of their work reported productivity gains of 50%. At the same time, Claude users reported being able to fully delegate less than 20% of their work to LLMs.

The Problem We Hit: Scaling Agentic Development Beyond Individual Contributors

mabl has 25 engineers managing 100+ repositories. The team pushes 200+ PRs and 40+ production releases a month. This was before acceleration from agentic development.

As we looked into increasing our own adoption of agentic software development processes, there were some key scaling challenges:

Cross-repo coordination. When a feature spans three repositories (an API change, a frontend update, and a shared library), how does an agent know which repos depend on each other? How does it determine merge order so the API doesn't break when the library ships first? Who maintains the dependency graph, and how do agents access it? This is compounded by the fact that most of the existing out-of-the-box tooling in this area assumes you are starting from a single mono-repo.

Development is a lot more than writing code. mabl developers interact with customer support on Jira tickets, build the app locally to validate their changes, check PRs for feedback, and much more. These activities are even more important for agents as we scale because they ground the code changes in measurable behaviors.

Human code review at AI velocity. If AI increases coding throughput by 3-5x across our team, human review becomes the constraint. We couldn't have engineers reviewing 300 PRs/week when many were low-risk changes (dependency bumps, lint fixes, test updates). We needed a way to let AI catch obvious issues before human reviewers spent time on them, freeing our developers to focus on the PRs that genuinely need human judgment.

Operations at scale. An increase in PRs by an order of magnitude means we hit scaling and coordination challenges in our build and release pipelines much faster than before.

Quality needs to move both up and down the stack. Increased velocity combined with coding and PR reviews moving to higher-level specifications means we need to rely even more heavily on our automation. At the same time, enabling targeted local validation allows the coding agents to better ground themselves.

By mid-2025, we were at a crossroads. In August 2025, roughly 10% of our commits were AI-assisted, entirely from individual contributors using tools ad-hoc. By February 2026, that number hit 39% overall and 60% in infrastructure repos. These numbers don’t even include code that was co-written by code assistants but committed by developers.

The challenge wasn't getting agents to write code. It was building the infrastructure to let 25 engineers use agents safely across 75+ repos.

Here's the architecture we built.

The Architecture

The architecture we built evolved layer by layer as we encountered friction points, resulting in three distinct layers that turn agents into autonomous contributors in our ecosystem:

Cross Repo Base (Rules + Dependency Graphs) — A pseudo mono-repo that sits above all our individual repos. Gives agents the same context any developer at mabl would have: repo conventions, dependencies, architectural patterns, security constraints. No tribal knowledge bottleneck.

Skill System (MCP Servers + Reusable Skills) — Agents and humans get identical capabilities: query Jira tickets, run tests, verify UI changes visually, and manage environments from one unified interface.

Operations & Governance — AI code review blocks problematic PRs but explicitly cannot approve them. Auto-fix agents handle trivial CI failures with circuit breakers to prevent runaway loops. Humans always make the final merge decision.

Then we brought this all together into an automated workflow that follows a three-stage state machine to design, plan, and implement individual tickets. It leverages the same rules and skills an engineer working with Claude does.

Layer 1: Cross Repo Base

The friction we hit: Agents don't have tribal knowledge. Especially on an established codebase, they often make locally optimized but architecturally poor design decisions. Tell Claude Code to create a new report in the web app and it will happily hack something together, oblivious to all the API endpoints and back-end PDF generation services it could leverage. That's a relatively simple example. We were spending more time providing context and rejecting bad approaches in PR review than we saved from agent-generated code.

What we built: A centralized "cross repo base" repository that sits at the root of all our downstream repos.

The repo base contains a set of common rules and skills:

  • Engineering guides

  • Java and TypeScript general best practices

  • Design patterns

  • Standard workflows

  • Google Cloud and internal tooling

  • A variety of skills we will detail in the next section

Every repo maintains its own CLAUDE.md file containing repo-specific rules and exceptions. These files vary in complexity depending on the repo's scope, complexity, and deviation from standards:

  • Repo purpose: What this repo does and how it fits into our system

  • Architecture: The core components of the repo

  • Dependencies: Which repos import this one (or are imported by it). Example: our shared-node-utils is consumed by API, UI, and mabl-cli. Any breaking change requires coordinated PRs.

  • Development conventions: Commit flags ([skip ci], [bump:patch]), Terraform patterns, test sharding config, build commands

  • Step-by-step workflows: "How to add a new API endpoint" as a 10-step guide agents can follow
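A skeletal CLAUDE.md following this shape might look like the sketch below; the content is illustrative, not an actual mabl file:

```markdown
# shared-node-utils

## Purpose
Shared TypeScript utilities consumed by API, UI, and mabl-cli.

## Dependencies
Breaking changes require coordinated PRs in api, ui, and mabl-cli;
merge this repo first.

## Conventions
- Commit flags: [skip ci] for docs-only changes, [bump:patch] to release
- Build: npm run build && npm test

## Workflows
- Adding a utility: see docs/add-utility.md
```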

Alongside this, we maintain a Repo Coordination Graph, a central registry mapping dependencies across all repos in our GitHub org. At 850+ lines, it covers 79 repositories with detailed dependency graphs, Pub/Sub topic maps, database table ownership, and prescribed release ordering. When an agent works on a cross-repo feature, it queries this graph to determine:

  • Which repos need updates

  • Merge order (upstream libraries before consumers)

  • Which reviewers to tag based on CODEOWNERS and dependency chains

  • Pub/Sub topic maps and database table ownership
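Determining merge order from a coordination graph like this amounts to a topological sort over repo dependencies. A minimal sketch in Python, where the repo names and dependency edges are illustrative rather than mabl's actual graph:

```python
from graphlib import TopologicalSorter

# Illustrative dependency graph: each repo maps to the repos it depends on.
# Upstream libraries must merge before their consumers.
REPO_DEPS = {
    "shared-node-utils": set(),
    "api": {"shared-node-utils"},
    "ui": {"shared-node-utils", "api"},
    "mabl-cli": {"shared-node-utils"},
}

def merge_order(touched_repos):
    """Return the touched repos in dependency order, upstream first."""
    order = list(TopologicalSorter(REPO_DEPS).static_order())
    return [repo for repo in order if repo in touched_repos]

# A cross-repo feature touching the shared library, the API, and the UI
# merges the shared library first, then the API, then the UI.
print(merge_order({"ui", "shared-node-utils", "api"}))
```

The same filtered ordering can also drive which reviewers to tag, since the graph already encodes who owns each downstream consumer.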

We use Git Worktrees for isolated parallel execution. If an agent works on three repos simultaneously, each gets its own worktree so there's no state contamination risk. We built worktree-related hooks that actively prevent agents from editing files in the wrong workspace. If an agent is operating in a worktree and tries to edit a file in the main workspace (or a different worktree), the hook blocks the edit and suggests the correct path. This solves a real failure mode in multi-repo agent systems: accidentally editing the wrong copy of a file.
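The worktree guard described above can be reduced to a path-containment check. A minimal sketch, assuming a hook that receives the active worktree root and the target path of an attempted edit (both paths here are hypothetical):

```python
from pathlib import Path

def check_edit(worktree_root: str, target: str) -> tuple[bool, str]:
    """Allow an edit only if target lives inside worktree_root.

    Returns (allowed, message); on a block, suggests the equivalent
    path inside the correct worktree.
    """
    root = Path(worktree_root).resolve()
    path = Path(target).resolve()
    try:
        # relative_to raises ValueError when path is outside root.
        path.relative_to(root)
        return True, "ok"
    except ValueError:
        suggested = root / path.name
        return False, f"Blocked: {path} is outside {root}. Did you mean {suggested}?"
```

A hook like this fires before every file write, so a wrong-workspace edit is rejected with a corrective suggestion instead of silently landing in the main checkout.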

We also enforce a mandatory code grounding rule that requires agents to verify all file paths and code references before citing them. No fabricated references allowed. Agents must use Glob or Read to confirm a file exists and quote specific line numbers from code they've actually read. If code is inaccessible, the agent must state that clearly rather than guessing.
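A grounding rule like this is enforceable mechanically: before a citation like `file.py:42` is emitted, confirm the file exists and the line is real. A minimal sketch of that check (function and message formats are hypothetical):

```python
from pathlib import Path

def ground_reference(path: str, line_no: int) -> str:
    """Verify a file:line citation; report the gap instead of guessing."""
    p = Path(path)
    if not p.is_file():
        return f"UNVERIFIED: {path} does not exist"
    lines = p.read_text().splitlines()
    if line_no < 1 or line_no > len(lines):
        return f"UNVERIFIED: {path} has only {len(lines)} lines"
    # Quote the actual line so the citation is grounded in read code.
    return f"{path}:{line_no}: {lines[line_no - 1].strip()}"
```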

What this solved: Without the ecosystem context, every agent invocation required a human to provide context ("this repo uses Terraform module naming pattern X, depends on Y, and PR reviews need Z"). With it, agents operate with the same context a senior engineer at mabl would have. Context drift (where an agent loses track of conventions mid-task) dropped from ~40% of our failures to <5%.

Layer 2: Skill System (MCP Servers as Shared Building Blocks)

The friction we hit: Developers naturally incorporate enormous amounts of context when working on any individual bug or feature: they read Jira ticket details and screenshots, launch the app, work in Slack, and use internal tools to check data. Without access to those same tools and capabilities, Claude was often working off guesses based solely on code analysis. We were handicapping it.

What we built: We integrated a handful of MCP servers and close to 40 unique skills that allow Claude to investigate the way a developer would.

A few key servers that give agents and humans equivalent capabilities:

  1. Atlassian MCP Server — Agents and developers can query tickets, update statuses, add comments, search Confluence, and transition labels all from Claude Code. An agent working on a ticket can automatically move it from planning-needed to implementation-needed when it submits a PR.

  2. mabl Testing MCP Server — Agents can trigger mabl test runs, parse results, and determine if a change broke existing tests. When an auto-fix agent encounters a CI failure, it queries this server to see if it's a flaky test (ignore) or a real regression (fix). The server also manages test environments and creates new mabl tests.

  3. Desktop Automation MCP Server — Agents can launch the mabl Desktop Trainer (our Electron app), execute UI interactions via Playwright, and capture screenshots. This solved the "Teaching Agents to See" problem for us. When the validation router determines that desktop changes were made, the implementer agent can visually verify the result using this MCP server, not just run unit tests.

These are configured at the workspace level using .mcp.json, so every developer using Claude Code or Cursor gets the same tool access out of the box.
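For reference, a workspace-level .mcp.json in Claude Code takes the shape below. This is a minimal sketch; the server names, commands, and URLs are illustrative, not mabl's actual configuration:

```json
{
  "mcpServers": {
    "atlassian": {
      "command": "npx",
      "args": ["-y", "mcp-remote", "https://mcp.example.com/v1/sse"]
    },
    "mabl-testing": {
      "command": "node",
      "args": ["./tools/mabl-mcp-server.js"],
      "env": { "MABL_API_KEY": "${MABL_API_KEY}" }
    }
  }
}
```

Because the file is checked into the workspace, every developer (and every agent launched from that workspace) inherits the same server list without per-machine setup.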

Beyond the three core MCP servers, we've built 36+ reusable Skills, specialized knowledge modules that agents and developers invoke via /command:

  • Workflow skills: Worktree management, repo setup, build-and-wait-for-dependency, deploy previews, changeset creation

  • Cross repo build skills: Run the API, web UI, and desktop app all connected locally

  • Validation skills: Per-repo test runners (Java/Gradle, React/Vitest, TypeScript/Jest, Playwright/Electron), E2E test execution via mabl MCP, a validation router that automatically selects the right test suite based on which files changed

  • Analysis & Planning skills: Type-specific analysis (bug/feature/debt), type-specific planning, difficulty estimation, ticket classification

  • Operations skills: Rerun tests with custom runtime images, fetch execution logs, create mabl tests
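The validation router mentioned above can be sketched as a mapping from changed files to test suites. The suffix-to-suite table and the desktop-path heuristic here are hypothetical, not mabl's actual configuration:

```python
from pathlib import PurePosixPath

# Illustrative mapping from file suffix to the suite that exercises it.
SUITES = {
    ".java": "gradle-test",
    ".ts": "jest",
    ".tsx": "vitest",
}

def route(changed_files):
    """Return the set of suites to run for a change.

    Adds a Playwright/Electron visual-verification pass whenever
    desktop app code is touched.
    """
    suites = set()
    for f in changed_files:
        p = PurePosixPath(f)
        if "desktop" in p.parts:
            suites.add("playwright-electron")
        suites.add(SUITES.get(p.suffix, "default-lint"))
    return suites
```

Routing on the diff rather than running everything keeps validation fast enough that agents can afford to run it before every push.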

Beyond the initial infrastructure, we've created a baseline that all developers can keep contributing new skills and rules to, making us organizationally smarter every day.

What this solved: MCP servers and skills turned our agents from "code generators" into "full-stack developers." An agent can now read a Jira ticket, generate an implementation plan, write code, run tests, verify UI changes visually (when the validation router determines desktop changes were made), submit a PR, and move the ticket to review-needed, all without human intervention. The human enters the loop at the confidence gate between phases and always for the final PR approval.

Layer 3: Operations and Governance 

The friction we hit: With AI increasing our coding throughput by 3-5x, we needed to scale quality enforcement without making human review the bottleneck. More PRs also meant more conflicts, and engineers were spending more time than ever fixing build issues, resolving merge conflicts, and shepherding builds through pipelines.

What we built: We ruthlessly hunted down sources of conflict while incorporating agents into our PR review and operations processes.

AI Code Review

Every PR gets an AI code review via our claude-pr-review workflow. We pull in the relevant common rules from cross-repo-base as well as repo-specific CLAUDE.md files, then give the review agent explicit guidance, with examples, on blocking issues around security, correctness, and code standards. The reviewer can:

  • REQUEST_CHANGES (blocking): The PR cannot merge until the issues are addressed

  • COMMENT (non-blocking): Suggestions that don't block the merge

Auto-Fix Agents

When CI fails on a feature branch with an open PR, our auto-fix workflow diagnoses the failure and pushes a fix commit. We kept scope small to start, focusing on simple fixes: linter errors, missing imports, simple type errors, and unused variable warnings.

We explicitly instruct the agent to avoid known thornier issues that require more judgment and oversight, such as test assertion failures, security vulnerabilities, missing dependencies, build configuration errors, and complex bugs requiring architectural changes.

The workflow also has circuit breakers to prevent fix-fail loops and conflict between developers and the auto-fix agents. For example, if the last two commits are both tagged [auto-fix], the workflow stops. And if a PR author previously reverted an auto-fix commit, the workflow won't attempt the same fix again; it recognizes that the author made a deliberate decision that the issue requires human attention.
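Both circuit breakers described here reduce to a simple gate over recent commit history. A minimal sketch (function name and inputs are hypothetical):

```python
def should_auto_fix(recent_commit_messages, author_reverted_auto_fix):
    """Decide whether the auto-fix agent may push another fix commit.

    recent_commit_messages is ordered newest first.
    """
    last_two = recent_commit_messages[:2]
    # Breaker 1: two consecutive auto-fix commits signal a fix-fail loop.
    if len(last_two) == 2 and all("[auto-fix]" in m for m in last_two):
        return False
    # Breaker 2: the author already rejected an auto-fix on this PR;
    # respect that decision and leave the issue to a human.
    if author_reverted_auto_fix:
        return False
    return True
```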

PR Slash Commands (ChatOps)

A command handler workflow enables ChatOps-style commands via PR comments. Developers can trigger specific actions by commenting on a PR:

  • /claude-review and @claude — Trigger AI code review (available across all repos)

  • Repo-specific commands: /run-integration-tests, /run-all-tests, /deploy-preview, /run-functional-tests, /run-playwright-builds, /build-images, /run-browser-tests

Commands are acknowledged with emoji reactions and status comments. This gives developers fine-grained control over what runs without editing workflow files.
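A command handler like this is essentially a two-level lookup: global commands first, then the repo's own table. A minimal sketch using the command names from above; the handler workflow names and repo tables are hypothetical:

```python
# Commands available in every repo.
GLOBAL_COMMANDS = {"/claude-review": "ai-code-review"}

# Per-repo command tables (illustrative).
REPO_COMMANDS = {
    "api": {"/run-integration-tests": "integration-tests"},
    "ui": {"/deploy-preview": "deploy-preview"},
}

def dispatch(repo, comment):
    """Return the workflow to trigger for a PR comment, or None."""
    stripped = comment.strip()
    cmd = stripped.split()[0] if stripped else ""
    if cmd == "@claude":  # mention form of the review trigger
        return "ai-code-review"
    if cmd in GLOBAL_COMMANDS:
        return GLOBAL_COMMANDS[cmd]
    return REPO_COMMANDS.get(repo, {}).get(cmd)
```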

Bringing it together

We integrated this with GitHub's native merge queue and Terraform-managed rulesets:

  • Org-level rules: One required human approval, squash-only merges, no force pushes, no branch deletion

  • Repo-level rules: Required status checks (build, tests, security scanning), configurable concurrency limits

The flow for a typical agent-generated PR:

  1. Agent creates PR after passing pre-push validation gates

  2. AI code review runs automatically, can REQUEST_CHANGES or COMMENT

  3. If CI fails on lint/type issues, auto-fix agent attempts repair (with circuit breaker)

  4. Human reviewer evaluates the PR, informed by AI review comments

  5. Human approves (or requests changes)

  6. PR enters merge queue; full test suite runs

  7. PR merges

What this solved: In February 2026, we're pacing at 370+ PRs/month in application repos alone (up from 230/month in August 2025). The AI code review catches issues before human reviewers spend time on them. The auto-fix agent resolves trivial CI failures (missing semicolons, unused imports, type mismatches) without human intervention. Human reviewers focus on the PRs that genuinely need judgment, like architectural decisions, security implications, and complex logic.

That's the infrastructure: shared context, shared tools, and governance that keeps humans in control without making them the bottleneck. But infrastructure is only useful if agents can actually navigate it.

In Part 2, we'll walk through how we put all three layers to work. A four-phase pipeline that takes a Jira ticket from concept to merged PR, with confidence-based gating that determines when agents proceed autonomously and when they stop to ask for help. We'll also share the results, and where agents still break.