AgentRoom

TL;DR

Hand a hard backend coding task to a single AI agent and, on the hardest task in our suite, it writes one stub file and quits up to half the time. The obvious fix, running a second agent in parallel and merging the outputs, makes things worse: the two agents silently overwrite each other and the merged result scores below a single agent.

AgentRoom gives concurrent agents a shared conflict-free replicated data type (CRDT) workspace plus a small coordination room. Agents claim a file, broadcast a note, and see who holds what, all exposed as Model Context Protocol (MCP) tools. At matched compute across four frontier coding command-line interfaces (CLIs):

the give-up failure mode collapses, pooled odds ratio 13.7 (95% confidence interval [3.9, 48]),
AgentRoom beats naive parallel-merge by +0.213 in large language model (LLM) judge quality,
stripping out the MCP layer shows the coordination tools carry most of that gain, not the CRDT substrate,
and the sweet spot is N = 2.

Concurrent multi-agent coding is usually framed as parallelism or merge-correctness. The active ingredient is neither. It is explicit coordination.

The problem: a lone agent gives up

Hand a hard backend task to today’s coding agents and three things go wrong:

A lone agent gives up. Facing a fifteen-file financial ledger, it performs what we call a stub-and-exit: one source skeleton, at most one test passing, then it quits within about eighty seconds, having judged the task too large to finish.
Naive parallelism makes it worse. A second uncoordinated agent overwrites the first’s entry file in a shared directory; in separate directories, a post-hoc merge amplifies rather than averages the lone-agent failure, landing below a single agent.
Prior systems sidestep coordination. Large language models are single-stream, and the multi-agent systems built on top inherit that serial limit: they either sequence agents through fixed phases, design then implement then review (Hong et al., 2024; Nguyen et al., 2024; Qian et al., 2024), or pool independent samples with no signalling between them.

The race at the top of the page replays a real run pair. On the left, one agent stubs and exits at about eighty seconds. On the right, two agents in a room claim files, broadcast, merge, and finish all twenty-eight files with every test passing.

How AgentRoom works

AgentRoom is a single integrated primitive with three co-designed parts. A shared workspace where every agent’s writes are immediately visible and concurrent edits merge automatically through a CRDT (Automerge Project, 2023; Jahns, 2024). A coordination room where an agent can claim a file, release it, broadcast a note, and query who holds what, all exposed as MCP tools (Anthropic, 2024). And a short advisory protocol: claim a file before you write it, respect files others have claimed, and broadcast your progress.
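The room operations and the advisory protocol can be sketched in a few lines. What follows is a minimal in-memory illustration, not AgentRoom's implementation: the Room class, its method names, and the file paths are hypothetical, and the real system exposes these operations as MCP tools rather than direct method calls.

```python
from dataclasses import dataclass, field


@dataclass
class Room:
    """Hypothetical in-memory coordination room (illustration only)."""
    claims: dict = field(default_factory=dict)  # file path -> agent id
    notes: list = field(default_factory=list)   # (agent id, message) pairs

    def claim(self, agent: str, path: str) -> bool:
        # First claimant wins; re-claiming your own file is a no-op success.
        holder = self.claims.setdefault(path, agent)
        return holder == agent

    def release(self, agent: str, path: str) -> bool:
        # Only the current holder can release a claim.
        if self.claims.get(path) == agent:
            del self.claims[path]
            return True
        return False

    def broadcast(self, agent: str, message: str) -> None:
        self.notes.append((agent, message))

    def state(self) -> dict:
        # Snapshot of who holds what, plus the notes broadcast so far.
        return {"claims": dict(self.claims), "notes": list(self.notes)}


# Advisory protocol: claim before writing, respect other agents' claims.
room = Room()
got_ledger = room.claim("agent-1", "src/ledger.ts")    # agent-1 may write
blocked = not room.claim("agent-2", "src/ledger.ts")   # agent-2 should pick other work
room.claim("agent-2", "src/api.ts")
room.broadcast("agent-2", "taking src/api.ts, will import from src/ledger.ts")
```

Nothing in this sketch stops agent-2 from writing to src/ledger.ts anyway. The claim only tells it that someone else is already working there, which is the sense in which the protocol is advisory.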

Figure 2. Architecture. N agents share a CRDT-merged workspace; an MCP room exposes claim / release / broadcast / state / read.

The claim is an advisory lock: a lock that agents agree to honor, not one enforced by a heavyweight consensus protocol. It does not need enforcement, because the CRDT already guarantees that the bytes converge.

Figure 3. CRDT merge. Two agents (Dev A and Dev B) insert into the same line concurrently; the CRDT interleaves both insertions deterministically, so the edits converge to one line with no byte-level conflict.
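The convergence property can be illustrated with a toy sequence CRDT. This is a sketch under simplifying assumptions, not the algorithm used by Automerge or Yjs: the Replica class is hypothetical, each insertion is stored as one indivisible text atom, atoms are never deleted, and concurrent insertions at the same spot are ordered by agent id.

```python
from fractions import Fraction


class Replica:
    """Toy state-based sequence CRDT: a grow-only set of positioned text atoms."""

    def __init__(self, agent_id: str):
        self.agent_id = agent_id
        self.atoms = set()  # {(position, agent_id, text)}

    def _ordered(self):
        # Sort by position, then agent id, so every replica agrees on order.
        return sorted(self.atoms)

    def insert(self, index: int, text: str) -> None:
        # Place the new atom halfway between its two neighbours.
        ordered = self._ordered()
        lo = ordered[index - 1][0] if index > 0 else Fraction(0)
        hi = ordered[index][0] if index < len(ordered) else Fraction(1)
        self.atoms.add(((lo + hi) / 2, self.agent_id, text))

    def merge(self, other: "Replica") -> None:
        # Set union is commutative, associative, and idempotent,
        # so replicas converge regardless of merge order.
        self.atoms |= other.atoms

    def text(self) -> str:
        return "".join(t for _, _, t in self._ordered())


dev_a, dev_b = Replica("A"), Replica("B")
dev_a.insert(0, "total(")
dev_a.insert(1, ")")
dev_b.merge(dev_a)                 # both start from "total()"

dev_a.insert(1, "debits")          # concurrent edits at the same index
dev_b.insert(1, ", credits")

dev_a.merge(dev_b)
dev_b.merge(dev_a)
print(dev_a.text())                # total(debits, credits)
```

A limitation of the toy: inserting between two atoms that share a position falls back to the agent-id tie-break, which production CRDTs handle with richer identifiers.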

Coordination, not concurrency

Naive concurrency (parallel-merge) lands below a single agent. The CRDT substrate alone helps only a little. The coordination tools carry most of the gain. That asymmetry separates AgentRoom from implicit-coordination systems that share a workspace but never signal intent (Pugachev, 2025).

Results

A lone agent abandons; two agents do not

We define abandonment as a scorer-independent binary: a sub-threshold run that leaves at most two source files or exits early. Pooling twelve model-by-task strata under a Cochran-Mantel-Haenszel (CMH) test, the common odds ratio (OR) for Solo versus AgentRoom abandonment is 13.7, with a 95% confidence interval (CI) of [3.9, 48] and every stratum that records abandonment aligned.
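For readers unfamiliar with the pooled estimate, the Mantel-Haenszel common odds ratio combines per-stratum 2x2 tables as the sum of a*d/n divided by the sum of b*c/n. The sketch below computes only that point estimate, not the CMH test statistic or the confidence interval. The function name is hypothetical, and the counts are invented placeholders for illustration, not the study's data.

```python
def mantel_haenszel_or(strata):
    """Mantel-Haenszel common odds ratio.

    Each stratum is (solo_abandoned, solo_completed,
                     room_abandoned, room_completed).
    """
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n == 0:
            continue  # an empty stratum carries no information
        numerator += a * d / n
        denominator += b * c / n
    if denominator == 0:
        raise ValueError("odds ratio undefined: no stratum has b*c > 0")
    return numerator / denominator


# Placeholder counts (NOT the paper's data): three model-by-task strata.
example_strata = [
    (5, 5, 1, 9),    # Solo abandons 5/10, AgentRoom 1/10
    (4, 6, 0, 10),   # Solo abandons 4/10, AgentRoom 0/10
    (0, 10, 0, 10),  # no abandonment in either arm
]
pooled_or = mantel_haenszel_or(example_strata)
```

A stratum with no abandonment in either arm adds zero to both sums, so the pooled estimate is driven by the strata that record abandonment.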

Figure 4. Abandonment forest plot. Per-stratum abandonment, Solo vs AgentRoom (2 agents), with the pooled CMH odds ratio, shown by task and by model.

Coordination tightens the variance

Beyond eliminating the give-up mode, a room cuts run-to-run variance by 30 to 45 percent across all three CLI-stable models.

Figure 5. Variance contraction. Run-to-run standard deviation of the quality score, Solo vs AgentRoom (2 agents), on the financial-ledger task. Sigma contracts 30 to 45 percent per model.
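The contraction figure is a ratio of standard deviations. A minimal sketch of the arithmetic follows, assuming the contraction is defined as one minus the ratio of AgentRoom sigma to Solo sigma. The function name and the scores are invented for illustration and are not measurements from the study.

```python
from statistics import stdev


def sigma_contraction(solo_scores, room_scores):
    """Fractional drop in run-to-run standard deviation, Solo -> AgentRoom."""
    return 1.0 - stdev(room_scores) / stdev(solo_scores)


# Placeholder quality scores (NOT the paper's data).
solo = [0.2, 0.4, 0.6, 0.8]
room = [0.4, 0.5, 0.6, 0.7]
contraction = sigma_contraction(solo, room)
print(f"sigma contracts by {contraction:.0%}")
```

With equal run counts in both conditions, the ratio is the same whether sample or population standard deviation is used.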

The ablation ladder

A six-condition ablation at matched compute on the headline task comes out monotonic. The two-agent parallel-merge baseline sits below a single agent: adding agents without coordination is destructive. AgentRoom sits on top, +0.213 over parallel-merge.

Figure 6. Six-condition ablation. Mean quality score for six conditions on the financial-ledger task, per model. The gap from parallel-merge to AgentRoom is +0.213.

Sequential pipelines amplify the failure

Against a same-model ChatDev-style sequential pipeline at matched compute, AgentRoom wins by +0.336 (on the six genuine ChatDev runs, after excluding seven that were orchestrator crashes). The pipeline stalls at the implementation boundary in four of six runs; AgentRoom abandons in none of its ten.

Figure 7. Paradigm contrast. Concurrent AgentRoom vs a sequential 3-phase pipeline (same model, matched 1200s budget). The pipeline runs are annotated as abandoned at the implementation boundary.

What carries the gain: the bundle probe

Strip away the MCP tools, keeping the CRDT and the collaboration prompt, and the gain over shared-only is only +0.013. Add the tools back and you recover +0.081 more, which is most of the gain. The small remainder (the prompt-only step over the bare substrate) is within noise at this sample size (n = 7), so we read the split as an ordering (the MCP layer carries most of the gain), not as a precise percentage.

Figure 8. Bundle decomposition. Decomposing the AgentRoom bundle: substrate (CRDT) vs coordination (MCP).

The 86 / 14 split is a point estimate from a small same-budget pool (n = 7), so the precise fraction is noisy. The load-bearing claim is the ordering: the coordination layer carries the larger share.
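The split can be reproduced from the two step sizes reported above, assuming each share is that step's fraction of the summed gain. The variable names are ours, and that reading of how the split was computed is an assumption.

```python
# Step sizes as reported in the bundle probe.
non_tool_step = 0.013   # CRDT + collaboration prompt over shared-only
tools_step = 0.081      # additional gain once the MCP tools are restored

total_gain = non_tool_step + tools_step
coordination_share = tools_step / total_gain
non_tool_share = non_tool_step / total_gain

print(f"{coordination_share:.0%} / {non_tool_share:.0%}")  # 86% / 14%
```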

How many agents? N = 2

Quality peaks at two agents. The number of passing tests keeps climbing up to three agents, but per-run cost grows linearly and the single broadcast channel saturates past three, so two is the operational sweet spot.

Figure 9. Agent-count scaling. Number of agents N on the financial-ledger task (Sonnet). Quality peaks at N=2; max tests peak at N=3; the broadcast channel saturates beyond N=3.

Agents start talking to each other

Given a room, agents produce coordination language no one prompted for: claiming a module before working it, announcing a planned interface, and occasionally apologizing for a cross-agent fix (“I touched your file for a one-line fix to unblock the tests, sorry”). Shared-only agents, with the same CRDT but no signalling channel, produce none of this. The channel affordance is what activates the behavior (Park et al., 2023; Wu et al., 2024).

The full collaboration taxonomy (and how we labelled it)

Across six financial-ledger Sonnet AgentRoom (2 agents) runs we observe module-claiming in 6/6, plan-adjustment in 5/6, and cross-agent bug-fix in 2/6. These are author labels over the room logs, not an automated metric; the representative apology quote is drawn from a three-agent run, where the pattern is more frequent. None of these patterns appears in shared-only runs.

What we got wrong first

The project began with a much larger claim, an emergence score of 3.82x and talk of grokking. Repeated validity audits killed it: that number was an artifact of a scorer that rewarded raw file and line count, a single bad baseline run, and a CRDT bridge bug under which the agents never actually shared edits. The surviving claim is narrower: AgentRoom buys reliability, lower abandonment and tighter variance, not a large mean-quality lift.

Limitations and conclusion

We score with an LLM-judge composite cross-validated against regex and abstract syntax tree (AST) scorers, not a held-out execution oracle (tasks ship only an agent-authored test suite), so we make no execution-verified correctness claim. The core result covers two-agent TypeScript backend tasks across three models, with small samples per group. The abandonment finding therefore rests on pooling twelve strata rather than any single run group. And the coordination-versus-substrate attribution rests on a small same-budget pool.

Within that scope the picture is consistent. Concurrent multi-agent coding has been treated as a parallelism problem or a merge-correctness problem. It is mostly a coordination problem, and a small set of state-management operations (claim files, announce progress, query the room state) is what turns a crowd of agents tripping over each other into one that ships working code.

Give two agents a room and a way to claim the work, and they stop overwriting each other and start finishing the job.

Resources

Paper and reviews: OpenReview 0aGLZqKJjt (ICML 2026 Workshop on Failure Modes of Agentic Generative AI, FAGEN).
Related systems referenced above: ChatDev (Qian et al., 2024), MetaGPT (Hong et al., 2024), AgileCoder (Nguyen et al., 2024), CodeCRDT (Pugachev, 2025), AutoGen (Wu et al., 2024), OpenHands (Wang et al., 2025), and the MCP specification (Anthropic, 2024).

Anthropic. (2024). Introducing the Model Context Protocol. https://www.anthropic.com/news/model-context-protocol
Automerge Project. (2023). Automerge: A library of data structures for building collaborative applications. https://automerge.org
Hong, S., Zhuge, M., Chen, J., Zheng, X., Cheng, Y., Zhang, C., Wang, J., Wang, Z., Yau, S. K. S., Lin, Z., Zhou, L., Ran, C., Xiao, L., Wu, C., & Schmidhuber, J. (2024). MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework. The Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=VtmBAGCN7o
Jahns, K. (2024). Yjs: A CRDT framework for shared editing. https://github.com/yjs/yjs
Nguyen, M. H., Chau, T. P., Nguyen, P. X., & Bui, N. D. Q. (2024). AgileCoder: Dynamic Collaborative Agents for Software Development based on Agile Methodology. https://arxiv.org/abs/2406.11912
Park, J. S., O’Brien, J. C., Cai, C. J., Morris, M. R., Liang, P., & Bernstein, M. S. (2023). Generative Agents: Interactive Simulacra of Human Behavior. https://arxiv.org/abs/2304.03442
Pugachev, S. (2025). CodeCRDT: Observation-Driven Coordination for Multi-Agent LLM Code Generation. https://arxiv.org/abs/2510.18893
Qian, C., Liu, W., Liu, H., Chen, N., Dang, Y., Li, J., Yang, C., Chen, W., Su, Y., Cong, X., Xu, J., Li, D., Liu, Z., & Sun, M. (2024). ChatDev: Communicative Agents for Software Development. In L.-W. Ku, A. Martins, & V. Srikumar (Eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 15174–15186). Association for Computational Linguistics. https://doi.org/10.18653/v1/2024.acl-long.810
Wang, X., Li, B., Song, Y., Xu, F. F., Tang, X., Zhuge, M., Pan, J., Song, Y., Li, B., Singh, J., Tran, H. H., Li, F., Ma, R., Zheng, M., Qian, B., Shao, Y., Muennighoff, N., Zhang, Y., Hui, B., … Neubig, G. (2025). OpenHands: An Open Platform for AI Software Developers as Generalist Agents. The Thirteenth International Conference on Learning Representations. https://openreview.net/forum?id=OJd3ayDDoF
Wu, Q., Bansal, G., Zhang, J., Wu, Y., Li, B., Zhu, E., Jiang, L., Zhang, X., Zhang, S., Liu, J., Awadallah, A. H., White, R. W., Burger, D., & Wang, C. (2024). AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversations. First Conference on Language Modeling. https://openreview.net/forum?id=BAakY1hNKS