TL;DR
Give a single AI agent the hardest backend coding task in our suite and, up to half the time, it writes one stub file and quits. The obvious fix, running a second agent in parallel and merging the outputs, makes things worse: the two agents silently overwrite each other and the merged result scores below a single agent.
AgentRoom gives concurrent agents a shared
- the give-up failure mode collapses, pooled odds ratio 13.7 (95% confidence interval [3.9, 48]),
- AgentRoom beats naive parallel-merge by +0.213 in large language model (LLM) judge quality,
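- an ablation that strips out the Model Context Protocol (MCP) tools attributes roughly 86% of the substrate-to-bundle gain to the coordination tools, not the conflict-free replicated data type (CRDT) substrate,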
- and the sweet spot is N = 2.
Concurrent multi-agent coding is usually framed as parallelism or merge-correctness. The active ingredient is neither. It is explicit coordination.
The problem: a lone agent gives up
Hand a hard backend task to today’s coding agents and three things go wrong:
- A lone agent gives up. Facing a fifteen-file financial ledger, it performs what we call a stub-and-exit: one source skeleton, at most one test passing, then it quits within about eighty seconds, having judged the task too large to finish.
- Naive parallelism makes it worse. A second uncoordinated agent overwrites the first’s entry file in a shared directory; in separate directories, a post-hoc merge amplifies rather than averages the lone-agent failure, landing below a single agent.
- Prior systems sidestep coordination. Large language models are single-stream, and the multi-agent systems built on top inherit that serial limit: they either sequence agents through fixed phases, design then implement then review (Hong et al., 2024; Nguyen et al., 2024; Qian et al., 2024), or pool independent samples with no signalling between them.
The race at the top of the page replays a real run pair. On the left, one agent stubs and exits at about eighty seconds. On the right, two agents in a room claim files, broadcast, merge, and finish all twenty-eight files with every test passing.
How AgentRoom works
AgentRoom is a single integrated primitive with three co-designed parts. A shared workspace where every agent’s writes are immediately visible and concurrent edits merge automatically through a
The claim is an
Coordination, not concurrency
Naive concurrency (parallel-merge) lands below a single agent. The CRDT substrate alone helps only a little. The coordination tools carry most of the gain. That asymmetry separates AgentRoom from implicit-coordination systems that share a workspace but never signal intent (Pugachev, 2025).
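To make "signalling intent" concrete, here is a minimal in-memory sketch of the three operations this post names: claim files, announce progress, and query the room state. It is an illustration only. The class and method names (`Room`, `claim`, `broadcast`, `state`), the agent IDs, and the file paths are all hypothetical and are not AgentRoom's actual tool names, and the sketch leaves out the CRDT workspace and the MCP transport entirely.

```python
from dataclasses import dataclass, field


@dataclass
class Room:
    """Toy in-memory room: file claims plus one broadcast log (hypothetical API)."""
    claims: dict[str, str] = field(default_factory=dict)            # path -> owning agent
    messages: list[tuple[str, str]] = field(default_factory=list)   # (agent, text)

    def claim(self, agent: str, path: str) -> bool:
        """Record intent to edit `path`; return False if another agent holds it."""
        owner = self.claims.setdefault(path, agent)
        return owner == agent

    def broadcast(self, agent: str, text: str) -> None:
        """Append a progress note that every agent can read."""
        self.messages.append((agent, text))

    def state(self) -> dict:
        """Snapshot of who owns what and what has been announced."""
        return {"claims": dict(self.claims), "messages": list(self.messages)}


room = Room()
a_got_it = room.claim("agent-a", "src/ledger.ts")
b_got_it = room.claim("agent-b", "src/ledger.ts")    # refused: already claimed
b_other = room.claim("agent-b", "src/accounts.ts")   # agent B picks other work
room.broadcast("agent-a", "ledger.ts will export postEntry(entry)")
print(a_got_it, b_got_it, b_other)  # True False True
```

The point of the sketch is the refused second claim: agent B learns before writing that the file is taken, which is exactly the information an uncoordinated shared directory never provides.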
Results
A lone agent abandons; two agents do not
We define abandonment as a scorer-independent binary: a sub-threshold run that leaves at most two source files or exits early. Pooling twelve model-by-task strata under a Cochran-Mantel-Haenszel (CMH) test, the common odds ratio (OR) for Solo versus AgentRoom abandonment is 13.7, with a 95% confidence interval (CI) of [3.9, 48]. Every stratum that records any abandonment points in the same direction.
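For readers unfamiliar with pooled odds ratios, the sketch below computes the standard Mantel-Haenszel common odds ratio from per-stratum 2x2 tables. The function name and the three example strata are hypothetical; the counts are made up for illustration and are not the study's data, and the confidence-interval calculation is omitted.

```python
def mantel_haenszel_or(strata):
    """Mantel-Haenszel common odds ratio over a list of 2x2 tables.

    Each stratum is a tuple (a, b, c, d):
      a = Solo runs abandoned,       b = Solo runs not abandoned,
      c = AgentRoom runs abandoned,  d = AgentRoom runs not abandoned.
    """
    num = 0.0
    den = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n == 0:
            continue  # an empty stratum carries no information
        num += a * d / n
        den += b * c / n
    if den == 0:
        raise ZeroDivisionError("no stratum has b*c > 0; the estimate is unbounded")
    return num / den


# Hypothetical counts for three model-by-task strata (NOT the study's data).
example = [
    (5, 5, 1, 9),
    (3, 7, 0, 10),   # zero AgentRoom abandonments: per-stratum OR is undefined
    (4, 6, 1, 9),
]
print(round(mantel_haenszel_or(example), 2))
```

Pooling matters here because a stratum with no AgentRoom abandonments has no finite odds ratio of its own, yet it still contributes to the pooled numerator.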
Coordination tightens the variance
Beyond eliminating the give-up mode, a room cuts run-to-run variance by 30 to 45 percent across all three CLI-stable models.
The ablation ladder
A six-condition ablation at matched compute on the headline task comes out monotonic. The two-agent parallel-merge baseline sits below a single agent: adding agents without coordination is destructive. AgentRoom sits on top, +0.213 over parallel-merge.
Sequential pipelines amplify the failure
Against a same-model ChatDev-style sequential pipeline at matched compute, AgentRoom wins by +0.403. The pipeline stalls at the implementation boundary in four of six runs; AgentRoom abandons in none of its ten.
What carries the gain: the bundle probe
Strip away the MCP tools, keeping the CRDT and the collaboration prompt, and the gain over shared-only is only +0.013. Add the tools back and you recover +0.081 more, roughly 86% of the substrate-to-bundle gain. The collaboration prompt on top of the CRDT substrate contributes the remaining 14%.
The 86 / 14 split is a point estimate from a small same-budget pool (n = 7), so the precise fraction is noisy. The load-bearing claim is the ordering: the coordination layer carries the larger share.
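The 86 / 14 split follows directly from the two increments quoted above. A short check of the arithmetic, with variable names chosen here for illustration:

```python
# Increments reported in this post, in LLM-judge quality points.
prompt_gain = 0.013   # CRDT + collaboration prompt over shared-only
tools_gain = 0.081    # additional gain from adding the MCP tools back

total_gain = prompt_gain + tools_gain      # substrate-to-bundle gain
tools_share = tools_gain / total_gain
prompt_share = prompt_gain / total_gain

print(f"total gain: {total_gain:.3f}")     # total gain: 0.094
print(f"tools: {tools_share:.0%}, prompt: {prompt_share:.0%}")  # tools: 86%, prompt: 14%
```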
How many agents? N = 2
Quality peaks at two agents. The total number of passing tests keeps climbing through three agents, but per-run cost grows linearly with agent count and the single broadcast channel saturates past three, so two is the operational sweet spot.
Agents start talking to each other
Given a room, agents produce coordination language no one prompted for: claiming a module before working it, announcing a planned interface, and occasionally apologizing for a cross-agent fix (“I touched your file for a one-line fix to unblock the tests, sorry”). Shared-only agents, with the same CRDT but no signalling channel, produce none of this. The channel affordance is what activates the behavior (Park et al., 2023; Wu et al., 2024).
The full collaboration taxonomy (and how we labelled it)
Across six two-agent AgentRoom runs with Sonnet on the financial-ledger task, we observe module-claiming in 6/6, plan-adjustment in 5/6, and cross-agent bug-fix in 2/6. These are author labels over the room logs, not an automated metric; the representative apology quote is drawn from a three-agent run, where the pattern is more frequent. None of the patterns appear in shared-only runs.
What we got wrong first
The project began with a much larger claim, an emergence score of 3.82x and talk of grokking. Repeated validity audits killed it: that number was an artifact of a scorer that rewarded raw file and line count, a single bad baseline run, and a CRDT bridge bug under which the agents never actually shared edits. The surviving claim is narrower: AgentRoom buys reliability, lower abandonment and tighter variance, not a large mean-quality lift.
Limitations and conclusion
We score with an LLM-judge composite cross-validated against regex and abstract syntax tree (AST) scorers, not a held-out execution oracle (tasks ship only an agent-authored test suite), so we make no execution-verified correctness claim. The core result covers two-agent TypeScript backend tasks across three models, with small samples per group. The abandonment finding therefore rests on pooling twelve strata rather than any single run group. And the coordination-versus-substrate attribution rests on a small same-budget pool.
Within that scope the picture is consistent. Concurrent multi-agent coding has been treated as a parallelism problem or a merge-correctness problem. It is mostly a coordination problem, and a small set of state-management operations (claim files, announce progress, query the room state) is what turns a crowd of agents tripping over each other into one that ships working code.
Give two agents a room and a way to claim the work, and they stop overwriting each other and start finishing the job.
Resources
- Paper and reviews: OpenReview 0aGLZqKJjt (ICML 2026 Workshop on Failure Modes of Agentic Generative AI, FAGEN).
- Related systems referenced above: ChatDev (Qian et al., 2024), MetaGPT (Hong et al., 2024), AgileCoder (Nguyen et al., 2024), CodeCRDT (Pugachev, 2025), AutoGen (Wu et al., 2024), OpenHands (Wang et al., 2025), and the MCP specification (Anthropic, 2024).
- Anthropic. (2024). Introducing the Model Context Protocol. https://www.anthropic.com/news/model-context-protocol
- Automerge Project. (2023). Automerge: A library of data structures for building collaborative applications. https://automerge.org
- Hong, S., Zhuge, M., Chen, J., Zheng, X., Cheng, Y., Zhang, C., Wang, J., Wang, Z., Yau, S. K. S., Lin, Z., Zhou, L., Ran, C., Xiao, L., Wu, C., & Schmidhuber, J. (2024). MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework. The Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=VtmBAGCN7o
- Jahns, K. (2024). Yjs: A CRDT framework for shared editing. https://github.com/yjs/yjs
- Nguyen, M. H., Chau, T. P., Nguyen, P. X., & Bui, N. D. Q. (2024). AgileCoder: Dynamic Collaborative Agents for Software Development based on Agile Methodology. https://arxiv.org/abs/2406.11912
- Park, J. S., O’Brien, J. C., Cai, C. J., Morris, M. R., Liang, P., & Bernstein, M. S. (2023). Generative Agents: Interactive Simulacra of Human Behavior. https://arxiv.org/abs/2304.03442
- Pugachev, S. (2025). CodeCRDT: Observation-Driven Coordination for Multi-Agent LLM Code Generation. https://arxiv.org/abs/2510.18893
- Qian, C., Liu, W., Liu, H., Chen, N., Dang, Y., Li, J., Yang, C., Chen, W., Su, Y., Cong, X., Xu, J., Li, D., Liu, Z., & Sun, M. (2024). ChatDev: Communicative Agents for Software Development. In L.-W. Ku, A. Martins, & V. Srikumar (Eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 15174–15186). Association for Computational Linguistics. https://doi.org/10.18653/v1/2024.acl-long.810
- Wang, X., Li, B., Song, Y., Xu, F. F., Tang, X., Zhuge, M., Pan, J., Song, Y., Li, B., Singh, J., Tran, H. H., Li, F., Ma, R., Zheng, M., Qian, B., Shao, Y., Muennighoff, N., Zhang, Y., Hui, B., … Neubig, G. (2025). OpenHands: An Open Platform for AI Software Developers as Generalist Agents. The Thirteenth International Conference on Learning Representations. https://openreview.net/forum?id=OJd3ayDDoF
- Wu, Q., Bansal, G., Zhang, J., Wu, Y., Li, B., Zhu, E., Jiang, L., Zhang, X., Zhang, S., Liu, J., Awadallah, A. H., White, R. W., Burger, D., & Wang, C. (2024). AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversations. First Conference on Language Modeling. https://openreview.net/forum?id=BAakY1hNKS