
How to Restructure Engineering Teams for Autonomous AI Agents

Izzy A
CTO @PromptMetrics

90% of teams use AI coding tools, but many see lower stability. Learn how to restructure your CI pipelines, specs, and security for autonomous AI agents.


90% of engineering teams now use AI coding tools, up from 61% in 2024 (Jellyfish, 2025). And yet the 2025 DORA Report dropped an uncomfortable finding: AI adoption currently shows a negative relationship with software delivery stability (DORA, 2025). More tools, more chaos.

The conventional playbook is seductive. Give every developer a Copilot license. Roll out Claude Code. Let people "vibe code" their way to faster shipping. But the teams actually winning aren't the ones with the most AI tooling. They're the ones who restructured their infrastructure, their development process, and their leadership workflow around autonomous agents. The technology doesn't fix a broken system. It amplifies whatever's already there.

Here's what that restructuring looks like across four fronts and what ops and tech leaders need to do about it this quarter.

Key Takeaways

  • 90% of teams use AI coding tools, but AI amplifies existing weaknesses as much as strengths (Jellyfish/DORA, 2025)

  • A 10-minute reduction in CI duration yields an estimated $1.1M/year in productivity gains for a 500-developer team (CircleCI, 2025)

  • Spec-driven development produces 2.1x productivity gains over interactive copilot use alone and turns plain-English specs into cross-departmental assets (Kinkan News, 2025)

  • 88% of organizations had an AI agent security incident in the past year; only 14.4% get full IT approval before go-live (Gravitee, 2026)

Video: Spec-Driven Development with AI Agents by Devoxx. A conference talk covering the full spec-driven pipeline from requirements.md to agent execution.

Is Your CI Pipeline Capping Your AI Velocity?

A 10-minute reduction in CI/CD workflow duration yields an estimated $1.1 million in annual productivity gains for a 500-developer team, and improving the pipeline success rate from 75% to 90% reclaims roughly 78,750 engineering hours per year (CircleCI, 2025). Those numbers are from the pre-agent era. With autonomous coding agents in the mix, slow CI doesn't just cost money. It puts a hard mathematical ceiling on what your agents can ship.

Think about it. If your CI pipeline takes an hour to return results, every autonomous agent you spin up spends a full hour parked, doing nothing, waiting for a signal. Cut that to five minutes, and the same agent can iterate a dozen times in the same window. As Ryan Nystrom, engineering manager at Notion, put it on the How I AI podcast: "If I've got a CI loop that takes an hour to run, your agent's just going to sit there and spin for an hour waiting for results. If it takes three minutes to run, holy crap, how much more stuff are you, as a human, and your little swarm of agents, going to be able to get done?"

Notion's response was an internal project nicknamed "afterburner," a full-throttle initiative to cut CI times to a quarter of their previous levels. The trigger wasn't developer complaints. It was the realization that agent throughput is directly proportional to pipeline speed. Steve from Stripe, also interviewed on the podcast, reported 1,300 agent-generated PRs per week. You cannot sustain that volume if every PR waits an hour for CI.

The implication goes deeper than speed. 70% of practitioners say their pipelines are "plagued by flaky tests and deployment failures," and 69% say slow or unreliable CI/CD contributes significantly to developer burnout (Harness, 2026). Agents don't burn out, but they do amplify flakiness. An agent that hits a flaky test will either waste cycles re-running or, worse, learn to route around the failure in ways a human would catch.

There's a second piece here that most teams miss: where the agent runs matters as much as how fast CI completes. Notion built "Boxy," an internal tool that provisions background VMs with Codex and Claude Code pre-installed. Developers describe a task in a Notion page, @mention a coding agent, and the agent spins up on a remote VM, producing a pull request with screenshots of its own UI verification, usually within 10 to 15 minutes. The developer stays in flow. The agent does the grind.

If you don't have a background-VM strategy for your coding agents, you're capping throughput before you've even started. Your engineers' laptops are not the right substrate for agentic workflows. Why would you chain a tireless worker to a machine that goes to sleep when you close the lid?
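Boxy is an internal tool, so its actual interface isn't public. But the shape of the idea is easy to sketch: a task description goes in, a disposable VM with a coding agent pre-installed spins up, and a pull request comes back. A minimal, hypothetical dispatcher in TypeScript (the endpoint, image name, and payload fields below are illustrative assumptions, not Boxy's API):

```typescript
// Hypothetical sketch: hand a task to a background VM instead of a laptop.
// The endpoint, image name, and payload shape are illustrative, not Boxy's API.
interface AgentTask {
  repo: string;        // repository the agent should clone
  description: string; // the task, written the way you'd brief a teammate
  baseBranch: string;  // where the eventual pull request should target
}

async function dispatchAgentTask(task: AgentTask): Promise<string> {
  const res = await fetch("https://agents.internal.example.com/v1/tasks", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      image: "agent-runner:latest", // pre-baked VM image with the coding agent installed
      task,
      artifacts: ["pull_request", "ui_screenshots"], // what the agent reports back
    }),
  });
  if (!res.ok) {
    throw new Error(`dispatch failed: ${res.status}`);
  }
  const { taskId } = (await res.json()) as { taskId: string };
  return taskId; // the developer stays in flow and checks the PR when it lands
}
```

The important design choice is that the handle comes back immediately. Nobody babysits a terminal; the pull request shows up when the agent is done.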

According to CircleCI's 2025 analysis of 15 million workflows across 22,000 organizations, a 10-minute CI reduction delivers $1.1M in annual productivity for a 500-developer team. With agents in the loop, the same reduction compounds. Agents don't just save human time; they multiply the number of experiments the entire team can run per day.

Chart: CI Duration vs Agent Throughput. Agent throughput drops sharply as CI duration increases: at 5 minutes of CI, agents produce 24 PRs per day; at 15 minutes, 16; at 30 minutes, 10; at 60 minutes, only 6.
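Those throughput numbers fall out of a very simple model: each iteration costs one CI run plus whatever time the agent spends generating and revising code. A rough sketch, assuming an 8-hour window and about 15 minutes of agent work per iteration (both numbers are assumptions for illustration, not figures from CircleCI or Notion):

```typescript
// Rough model of agent iterations per day as a function of CI duration.
// Assumptions (not from the sources): an 8-hour window and ~15 minutes of
// agent "thinking and coding" time per iteration.
const WORKDAY_MINUTES = 8 * 60;
const AGENT_WORK_MINUTES = 15;

function iterationsPerDay(ciMinutes: number): number {
  return Math.floor(WORKDAY_MINUTES / (ciMinutes + AGENT_WORK_MINUTES));
}

for (const ci of [5, 15, 30, 60]) {
  console.log(`${ci}-minute CI -> ~${iterationsPerDay(ci)} iterations per agent per day`);
}
// 5 -> 24, 15 -> 16, 30 -> 10, 60 -> 6: the CI term dominates long before you
// hit an hour, which is why pipeline speed caps agent throughput.
```

Shrinking CI shrinks the denominator; the agent-work term is much harder to compress, which is why pipeline speed is the lever worth pulling first.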

Can Spec-Driven Development Turn Your Engineers Into Architects?

Teams using autonomous coding agents for extended or overnight sessions report 2.1x productivity gains over interactive copilot-only use (Kinkan News, 2025). But that number only holds when the agent has a rigorous specification to work from. Unsupervised agents with vague instructions produce an unsupervised mess.

Nystrom's team at Notion adopted a workflow that sounds radical until you realize it's what senior engineers have always done, just accelerated. They created an agent-specs/ folder checked into their repository, filled with detailed markdown documents describing how features should behave. The process: open Whisper and talk through the feature idea out loud. Feed the transcript to Codex. Ask it to read the existing spec library, learn the format, and produce a matching spec. Revise a couple of times. Then point Codex at the finished spec and say, "build it."

"I basically one-shotted this because the entire spec file is so comprehensive with code pointers, with verification steps at the bottom," Nystrom said. "This is now the source of truth for how this part of Notion AI works. And it's just in plain English."

That last part is the sleeper win. When the canonical description of a feature is a well-written markdown document rather than 4,000 lines of TypeScript, something interesting happens. Marketing can read it. Product can read it. GTM teams can understand exactly what shipped without having to schedule a 30-minute walkthrough with the tech lead. The spec becomes the internal API for cross-functional alignment.
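Notion hasn't published its spec template, but the quote above gives away the skeleton: plain-English behavior, code pointers, verification steps at the bottom. One cheap way to keep a growing agent-specs/ folder honest is a lint step in CI that rejects specs missing those sections. A hedged sketch (the section names are assumptions, not Notion's format):

```typescript
// Hypothetical lint for an agent-specs/ folder: every spec must carry the
// sections an agent needs to one-shot the work. Section names are assumed.
import { readdirSync, readFileSync } from "node:fs";
import { join } from "node:path";

const REQUIRED_SECTIONS = ["## Overview", "## Code pointers", "## Verification steps"];

function lintSpecs(dir: string): string[] {
  const problems: string[] = [];
  for (const file of readdirSync(dir).filter((f) => f.endsWith(".md"))) {
    const text = readFileSync(join(dir, file), "utf8");
    for (const section of REQUIRED_SECTIONS) {
      if (!text.includes(section)) {
        problems.push(`${file}: missing "${section}" section`);
      }
    }
  }
  return problems;
}

const problems = lintSpecs("agent-specs");
if (problems.length > 0) {
  console.error(problems.join("\n"));
  process.exit(1); // a spec without verification steps isn't a spec an agent can one-shot
}
```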

The deeper shift here isn't about efficiency, it's about what engineering work becomes. "I view our job as engineers evolving into systems thinkers and architects," Nystrom said. "Most importantly, it's the verification loop. Do you have a tool to let the agent run itself?" Notion built a CLI that spins up its AI, sends it test queries, toggles modes, and captures transcripts. The human's job shifts from writing the implementation to designing the test harness that proves the implementation is correct.
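The details of Notion's CLI are internal, but the pattern is general: give the agent a harness it can run itself, with known inputs, expected signals, and a transcript it can read back. A minimal sketch, assuming a hypothetical runQuery entry point into the feature under test (the checks and file names are illustrative):

```typescript
// Hypothetical verification harness: replay known queries against the feature
// the agent just built, capture a transcript, and exit nonzero on any miss.
// runQuery and the checks below are illustrative stand-ins, not Notion's CLI.
import { writeFileSync } from "node:fs";

interface Check {
  query: string;
  mode: "fast" | "thorough";
  mustContain: string;
}

async function runQuery(query: string, mode: string): Promise<string> {
  // A real harness would call the service under test here.
  return `stub response for "${query}" in ${mode} mode`;
}

async function verify(checks: Check[]): Promise<boolean> {
  const transcript: string[] = [];
  let allPassed = true;
  for (const check of checks) {
    const answer = await runQuery(check.query, check.mode);
    const passed = answer.includes(check.mustContain);
    if (!passed) allPassed = false;
    transcript.push(`[${passed ? "PASS" : "FAIL"}] (${check.mode}) ${check.query}\n${answer}`);
  }
  // The transcript is what the agent reads back to decide whether to keep iterating.
  writeFileSync("verification-transcript.txt", transcript.join("\n\n"));
  return allPassed;
}

const checks: Check[] = [
  { query: "summarize this page", mode: "fast", mustContain: "stub response" },
  { query: "list open action items", mode: "thorough", mustContain: "thorough" },
];
verify(checks).then((ok) => process.exit(ok ? 0 : 1));
```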

This is the part most "just add AI" strategies miss entirely. You can't hand your backlog to agents and walk away. You have to build the verification infrastructure first. If you can't automatically verify whether the agent's output is correct, you haven't accelerated anything. You've just moved the bottleneck from writing code to reviewing code. And who wants to review 1,300 agent PRs by hand each week?

No more waiting for the design review meeting. No more waiting for everyone's calendar to open up. Ship the spec, let the agent build, and debate the merits of something live and working rather than something theoretical sitting in a document.

Teams using autonomous agents for overnight sessions see 2.1x the productivity gains of interactive copilot-only use, according to Kinkan's 2025 survey of 500 development teams. The spec is what makes the multiplier possible. Without it, the agent generates code faster, not better.

What's the Real Cost of Making Your Best Leaders Prep for Meetings?

Standups and sync/check-in meetings account for more meeting volume than all other named types combined: 9.2 million meetings analyzed versus roughly 5.2 million for everything else (Supernormal, 2026). The average engineering manager spends a chunk of every day pulling status from Slack, GitHub, Jira, and telemetry tools just to know what to talk about at standup.

What if none of that required a human?

Nystrom's team built a custom Notion AI agent affectionately named "hot potato" that runs every morning at 9 am. It fans out across Slack channels, queries Honeycomb for the latest CI metrics via MCP, scans the task database for completed items, pulls merged pull requests, reads yesterday's meeting transcript, and compiles everything into a structured pre-read. By the time standup starts, the agenda is already written. The team spends the meeting discussing decisions, risks, and findings, rather than reciting status updates.

"I can basically work up until the minute of our meeting without having done a bunch of prep," Nystrom said. "And then we all get on a video call, and we look at the screen, and we're like, okay, here's what we need to talk about."

The time savings sound small. Twenty minutes a day. But run the numbers: 20 minutes × 5 days × 50 weeks = 83 hours per year, per manager. For a team with six engineering managers, that's roughly 500 hours reclaimed, the equivalent of three months of full-time work. And the real unlock isn't the hours saved. It's what those hours get spent on instead.

"I have done the run-a-big-engineering-group thing where you're spending half your time compiling information, synthesizing it, writing reports," Nystrom said. "I hated it. It's so draining. Now I feel like I'm in a sweet spot where I can support a team of talented individuals without doing paperwork the entire time."

There's another dimension here worth naming. The pre-read surfaces work the manager would otherwise have missed: "somebody fixed our mock server environment, and we're seeing test improvement by up to 13%. I'd miss that, that's super cool, let's talk about it." The meeting stops being a status dump and starts being a discovery mechanism. The quiet engineer who never volunteers updates gets surfaced. The optimization nobody thought to mention becomes the focus of the conversation.

According to Supernormal's analysis of 50.9 million hours of meeting data, standups and sync meetings dominate organizational time more than any other meeting category. Automating the prep for these meetings doesn't just save time, it transforms them from status-reporting exercises into actual decision forums.

And here's the detail that should make every manager exhale: "Your AI agent is never going to complain when you ask it to do this five minutes before the meeting starts."

Video: How Intercom 2X'd Engineering Velocity with Claude Code by How I AI. A deep case study of org restructuring around AI agents with 100% engineer adoption.

The Security Gap That's Eating Enterprise AI Adoption

88% of organizations confirmed or suspected an AI agent security incident in the past year. Only 14.4% report that all AI agents go live with full security or IT approval. And only 21.9% of organizations treat agents as independent identity-bearing entities (Gravitee, 2026).

So here's the bind. AI coding assistants and chatbots only deliver real value when they have deep access to internal codebases, documentation, and systems. Your Copilot needs to see your entire repo. Your support chatbot needs to search across internal docs. But that depth of access immediately triggers intense IT security and compliance scrutiny, and the numbers show those scrutiny processes aren't keeping up.

45.6% of teams still rely on shared API keys for agent-to-agent authentication. A quarter of deployed agents are already capable of creating and tasking other agents (Gravitee, 2026). You don't need to be a security architect to see where this is headed. Agent spawn chains. Credential propagation. Audit trails that stop at the first agent, since no one knows what the third agent did.

The solution isn't to slow down AI adoption until security catches up. That's the path to getting lapped by competitors who figure out both. The solution is treating agent identity and access control as infrastructure problems, not policy problems. When was the last time a policy document stopped a credential leak? Agents need their own credentials, their own scoped permissions, their own audit logs. If a human needs an Okta Verify push to access production, an agent needs an equivalent gate.
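What "agents get their own credentials" looks like varies by platform, but the minimum viable version is small. A hedged sketch (the token shape, scope strings, and audit format are assumptions, not any specific vendor's API):

```typescript
// Hypothetical scoped agent identity: each agent gets its own short-lived
// credential, an explicit permission list, and audit entries keyed to its own
// identity rather than to the human who launched it. All names are illustrative.
import { randomUUID } from "node:crypto";

interface AgentIdentity {
  agentId: string;
  issuedBy: string;   // the human or service that spawned this agent
  scopes: string[];   // e.g. ["repo:read", "ci:trigger"], never a wildcard
  expiresAt: Date;    // short-lived by default, re-issued per task
}

function issueAgentIdentity(issuedBy: string, scopes: string[], ttlMinutes = 60): AgentIdentity {
  if (scopes.includes("*")) {
    throw new Error("wildcard scopes are not allowed for agents");
  }
  return {
    agentId: `agent-${randomUUID()}`,
    issuedBy,
    scopes,
    expiresAt: new Date(Date.now() + ttlMinutes * 60_000),
  };
}

function audit(identity: AgentIdentity, action: string, target: string): void {
  // Append-only, queryable log: if this agent spawns another, the new identity
  // records issuedBy = this agentId, so the chain never goes dark at agent two.
  console.log(JSON.stringify({
    at: new Date().toISOString(),
    agent: identity.agentId,
    issuedBy: identity.issuedBy,
    action,
    target,
  }));
}

// Usage: the credential is scoped to the task, not inherited from the engineer.
const reviewer = issueAgentIdentity("izzy@example.com", ["repo:read", "pr:comment"]);
audit(reviewer, "pr.comment", "repo/pull/1234");
```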

Modern orchestration layers are filling this gap. Legacy business process automation tools, the BPM suites and low-code platforms of the last decade, weren't built for workflows that mix microservices, human-in-the-loop approvals, and autonomous agents in real time. The emerging answer is orchestration platforms that give every agent a verifiable identity, enforce scoped access, and log every action to a queryable audit trail.

The message from the data is blunt: 88% of organizations already had an agent security incident. The 12% who haven't are either not looking, not deploying, or already invested in agent-native identity infrastructure. Guess which group is shipping faster.

Chart: AI Agent Security: The Adoption-Control Gap. 92.7% of healthcare organizations and 88% of all organizations had AI agent security incidents, yet only 14.4% of agents go live with full IT approval, and only 21.9% of orgs treat agents as independent identity-bearing entities.

What Engineering Leadership Looks Like in 2027

81% of engineering professionals expect at least a quarter of development work to shift to AI within five years. Only 20% of teams currently use engineering metrics to measure AI's actual impact (Jellyfish, 2025).

There's a gap between feeling faster and being faster, and too many teams are running on vibes. But how do you measure the difference? The DORA report's central finding bears repeating: AI amplifies what's already there. Strong engineering cultures with good practices get stronger. Weak cultures with sloppy practices get sloppier, just faster.

The practical implication for leadership is uncomfortable. Directors, VPs, and CTOs need to get their hands back into code. Not as primary contributors, but as practitioners who understand what the tools can and can't do. "This is the era of the hard skill," as podcast host Claire Vo puts it. You can't make architectural decisions about agent orchestration if you've never watched an agent one-shot a feature from a spec.

The most impactful decision a tech leader makes this year isn't which AI tool to buy. It's where to invest infrastructure dollars. Fast CI. Background VMs. Verification tooling. Agent identity systems. Get those four things right, and every AI tool you add multiplies its impact. Skip them, and you're just adding noise to a system that can't handle the signal.

Video: Harness Engineering: How to Build Software When Humans Steer, Agents Execute by an AI Engineer. An OpenAI engineer's patterns for making repos agent-legible and verification-first.

Conventional vs. Restructured: Side by Side

| Dimension | Conventional Approach | Restructured Approach |
| --- | --- | --- |
| AI tools | Give everyone a license, figure it out later | Spec-first rollout with verification tooling in place |
| CI pipeline | Tolerate 30-60 min runs; "we'll optimize later" | Cut to <10 min before scaling agent usage; treat as a throughput multiplier |
| Dev environment | Agents run on engineer laptops, block local work | Background VMs provisioned on demand; agents work while devs stay in flow |
| Specs | Design docs live in Notion/Google Docs, go stale | Markdown specs version-controlled in the repo; agents read and generate from them |
| Meetings | Managers spend 20-60 min/day compiling status updates | Custom agents fan out across Slack, telemetry, and GitHub at 9 am daily |
| Security | Shared API keys; agents inherit human credentials | Scoped agent identities with their own credentials, permissions, and audit logs |

The gap isn't about having more tools. It's about whether your infrastructure can handle them.

Frequently Asked Questions

Doesn't this only work for big tech companies with dedicated platform teams?

You don't need a 20-person platform team to get started. Notion's CI initiative was driven by a single engineering manager who "wasn't a CI expert but kind of knew what he wanted." Start with one thing, whether that's CI speed, a daily standup agent, or a spec template, and build from there. The DX Q4 2025 report found that even light AI users save 2+ hours per week.

What if my team hasn't adopted AI coding tools yet? Should we start with CI or tools first?

Start with CI. 70% of teams report their pipelines are slowed by flaky tests and failures (Harness, 2026). Fixing that benefits everyone, AI users and non-users alike. Once your pipeline is reliable, introduce AI tools with a spec-first approach rather than letting everyone figure it out individually. Teams that pair AI adoption with structured specs see significantly better outcomes.

How do you prevent agents from generating unmaintainable code at scale?

The verification loop. Notion's approach, building CLI tools that let agents test their own output against real systems, creates a baseline quality gate. Pair that with spec-driven development, where every feature has a version-controlled, plain-English specification, and unmaintainable code becomes visible quickly as it diverges from the spec. 71% of teams using AI coding tools report improved test coverage (Kinkan News, 2025), but only when they invest in the verification infrastructure.

Conclusion

The engineering orgs pulling ahead right now share one pattern: they stopped treating AI as a tooling upgrade and started treating it as an infrastructure transformation. They cut CI times before scaling agent usage. They wrote specs before unshackling autonomous coding sessions. They automated leadership toil before burning out their best managers. And they built agent identity systems before their first security incident forced their hand.

DORA got it right. AI amplifies. The question is: what are you giving it to amplify?

This week, measure three things: your median CI duration, how your specs are written and stored, and how many hours your engineering managers spend compiling status updates. Those numbers tell you exactly where to start.

