AgentRoom

Concurrent Multi-Agent Coding in a CRDT-Backed Shared Workspace

[Interactive race replay: Solo vs AgentRoom (Dev A and Dev B in a shared room with a room log), with counters for elapsed time, files written, and tests passed out of 18.]

Authors

Published

Jun. 2026

TL;DR

Give a single AI agent a hard backend coding task and it often gives up: on the hardest task in our suite, it writes one stub file and quits in up to half of its runs. The obvious fix, running a second agent in parallel and merging the outputs, makes things worse: the two agents silently overwrite each other and the merged result scores below a single agent.

AgentRoom gives concurrent agents a shared CRDT workspace plus a small coordination room. Agents claim a file, broadcast a note, and see who holds what, all exposed as MCP tools. At matched compute across four frontier coding command-line interfaces (CLIs):

The one-line claim

Concurrent multi-agent coding is usually framed as parallelism or merge-correctness. The active ingredient is neither. It is explicit coordination.

The problem: a lone agent gives up

Hand a hard backend task to today’s coding agents and three things go wrong:

The race at the top of the page replays a real run pair. On the left, one agent stubs and exits at about eighty seconds. On the right, two agents in a room claim files, broadcast, merge, and finish all twenty-eight files with every test passing.

How AgentRoom works

AgentRoom is a single integrated primitive with three co-designed parts. A shared workspace where every agent’s writes are immediately visible and concurrent edits merge automatically through a CRDT (Automerge Project, 2023; Jahns, 2024). A coordination room where an agent can claim a file, release it, broadcast a note, and query who holds what, all exposed as MCP tools (Anthropic, 2024). And a short advisory protocol: claim a file before you write it, respect files others have claimed, and broadcast your progress.
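
The room operations can be sketched as a toy in-memory class. This is an illustration under stated assumptions, not the AgentRoom server: the class name `Room`, its method signatures, the `holder` helper, and the file and function names in the usage lines are all hypothetical, and no MCP transport is shown.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Room:
    """Toy coordination room (hypothetical; not the AgentRoom implementation)."""
    claims: Dict[str, str] = field(default_factory=dict)       # path -> holder
    log: List[Tuple[str, str]] = field(default_factory=list)   # (agent, note)

    def claim(self, agent: str, path: str) -> bool:
        """Claim a file. Succeeds if it is free or already held by `agent`."""
        holder = self.claims.setdefault(path, agent)
        return holder == agent

    def release(self, agent: str, path: str) -> bool:
        """Release a file. Only the current holder can release it."""
        if self.claims.get(path) != agent:
            return False
        del self.claims[path]
        return True

    def broadcast(self, agent: str, note: str) -> None:
        """Append a progress note to the shared room log."""
        self.log.append((agent, note))

    def state(self) -> Dict[str, str]:
        """Return a copy of who holds what."""
        return dict(self.claims)

    def holder(self, path: str) -> Optional[str]:
        """Return the agent holding `path`, or None if it is free."""
        return self.claims.get(path)


# Hypothetical agents and paths, for illustration only.
room = Room()
assert room.claim("dev_a", "src/ledger.ts")        # Dev A claims first
assert not room.claim("dev_b", "src/ledger.ts")    # Dev B sees it is taken
assert room.claim("dev_b", "src/accounts.ts")      # and picks other work
room.broadcast("dev_a", "ledger.ts: exporting postEntry()")
room.release("dev_a", "src/ledger.ts")
```

Nothing in this sketch blocks a write to a claimed file. The claim only records intent, which matches the advisory protocol described above: agents are expected to check the room state and respect it.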

Architecture
Figure 2. N agents share a CRDT-merged workspace; an MCP room exposes claim / release / broadcast / state / read.

The claim is an advisory lock: a lock agents agree to honor, not a heavyweight consensus protocol. It does not need to be anything stronger, because the CRDT already guarantees the bytes converge.

CRDT merge
Figure 3. Two agents insert into the same line concurrently; the CRDT interleaves both insertions deterministically with no conflict.
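
The deterministic interleaving in Figure 3 can be illustrated with a toy sequence CRDT. This is a minimal RGA-style sketch, not the algorithm used by Automerge or Yjs; the names `Insert` and `Replica` are hypothetical, and the sketch assumes each operation arrives after the element it references.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

ElemId = Tuple[int, str]  # (logical clock, agent name): unique and totally ordered


@dataclass(frozen=True)
class Insert:
    id: ElemId
    after: Optional[ElemId]  # element to the left at insert time; None = line start
    char: str


class Replica:
    def __init__(self) -> None:
        self.elems: List[Insert] = []

    def apply(self, op: Insert) -> None:
        # Start immediately to the right of the reference element.
        pos = 0
        if op.after is not None:
            pos = next(i for i, e in enumerate(self.elems) if e.id == op.after) + 1
        # Skip elements with a larger id, so every replica breaks the tie
        # between concurrent inserts the same way, whatever the arrival order.
        while pos < len(self.elems) and self.elems[pos].id > op.id:
            pos += 1
        self.elems.insert(pos, op)

    def text(self) -> str:
        return "".join(e.char for e in self.elems)


base = [Insert((1, "base"), None, "a"), Insert((2, "base"), (1, "base"), "b")]
op_a = Insert((3, "A"), (1, "base"), "X")  # agent A inserts after "a"
op_b = Insert((3, "B"), (1, "base"), "Y")  # agent B does the same, concurrently

r1, r2 = Replica(), Replica()
for op in base + [op_a, op_b]:   # replica 1 sees A's edit first
    r1.apply(op)
for op in base + [op_b, op_a]:   # replica 2 sees B's edit first
    r2.apply(op)
assert r1.text() == r2.text()    # both replicas hold the same line
```

Both insertions survive and both replicas agree on their order. The merge guarantees convergence of the bytes only; whether the interleaved result is what either agent intended is a separate question, which is where the coordination room comes in.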

Coordination, not concurrency

Naive concurrency (parallel-merge) lands below a single agent. The CRDT substrate alone helps only a little. The coordination tools carry most of the gain. That asymmetry separates AgentRoom from implicit-coordination systems that share a workspace but never signal intent (Pugachev, 2025).

Results

A lone agent abandons; two agents do not

We define abandonment as a scorer-independent binary: a sub-threshold run that leaves at most two source files or exits early. Pooling twelve model-by-task strata under a Cochran-Mantel-Haenszel (CMH) test, the common odds ratio (OR) for Solo versus AgentRoom abandonment is 13.7, with a 95% confidence interval (CI) of [3.9, 48], and every stratum that records any abandonment points in the same direction.
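
For readers unfamiliar with the pooled estimate, the Mantel-Haenszel common odds ratio can be computed from one 2x2 table per stratum. The sketch below shows the point estimate only (no confidence interval or test statistic). The function name and tuple layout are assumptions, and the counts are hypothetical, not the study's data.

```python
from typing import Iterable, Tuple

# One 2x2 table per stratum (model x task):
# (solo_abandoned, solo_finished, room_abandoned, room_finished)
Stratum = Tuple[int, int, int, int]


def mantel_haenszel_or(strata: Iterable[Stratum]) -> float:
    """Mantel-Haenszel common odds ratio across strata."""
    num = 0.0
    den = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n == 0:
            continue  # an empty stratum carries no information
        num += a * d / n
        den += b * c / n
    if den == 0:
        raise ValueError("estimate undefined: no stratum has b*c > 0")
    return num / den


# HYPOTHETICAL counts for illustration only; these are not the paper's data.
example = [(5, 5, 1, 9), (4, 6, 0, 10)]
pooled = mantel_haenszel_or(example)
```

A stratum where the room never abandons (the second table here) still contributes to the numerator, which is one reason pooling is useful when individual strata are small.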

Abandonment forest plot
Figure 4. Per-stratum abandonment (Solo vs AgentRoom (2 agents)), with the pooled CMH odds ratio, shown by task and by model.

Coordination tightens the variance

Beyond eliminating the give-up mode, a room cuts the run-to-run standard deviation of the quality score by 30 to 45 percent across all three CLI-stable models.

Variance contraction
Figure 5. Run-to-run standard deviation of the quality score, Solo vs AgentRoom (2 agents), on the financial-ledger task. Sigma contracts 30 to 45 percent per model.

The ablation ladder

A six-condition ablation at matched compute on the headline task comes out monotonic. The two-agent parallel-merge baseline sits below a single agent: adding agents without coordination is destructive. AgentRoom sits on top, +0.213 over parallel-merge.

Six-condition ablation
Figure 6. Mean quality score for six conditions on the financial-ledger task. The gap from parallel-merge to AgentRoom is +0.213.

Sequential pipelines amplify the failure

Against a same-model ChatDev-style sequential pipeline at matched compute, AgentRoom wins by +0.403. The pipeline stalls at the implementation boundary in four of six runs; AgentRoom abandons in none of its ten.

Paradigm contrast
Figure 7. Concurrent AgentRoom vs a sequential 3-phase pipeline (same model, matched 1200s budget).

What carries the gain: the bundle probe

Strip away the MCP tools, keeping the CRDT and the collaboration prompt, and the gain over shared-only is only +0.013. Add the tools back and you recover +0.081 more, roughly 86% of the substrate-to-bundle gain. The collaboration prompt on top of the CRDT substrate contributes the remaining 14%.
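
The 86 / 14 attribution follows directly from the two reported gains. The variable names are ours; the two numbers are the ones quoted above.

```python
# Gains reported in the text, in quality-score units.
prompt_gain = 0.013   # CRDT + collaboration prompt, over shared-only
tools_gain = 0.081    # further gain from adding the MCP tools back

total = prompt_gain + tools_gain          # substrate-to-bundle gain
tools_share = tools_gain / total
prompt_share = prompt_gain / total
print(f"total {total:.3f}, tools {tools_share:.0%}, prompt {prompt_share:.0%}")
# prints: total 0.094, tools 86%, prompt 14%
```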

Bundle decomposition
Figure 8. Decomposing the AgentRoom bundle: substrate (CRDT) vs coordination (MCP).

A caveat we report honestly

The 86 / 14 split is a point estimate from a small same-budget pool (n = 7), so the precise fraction is noisy. The load-bearing claim is the ordering: the coordination layer carries the larger share.

How many agents? N = 2

Quality peaks at two agents. The number of passing tests keeps climbing up to three agents, but per-run cost grows linearly with agent count and the single broadcast channel saturates beyond three, so two is the operational sweet spot.

Agent-count scaling
Figure 9. Agent count on the financial-ledger task (Sonnet). Quality peaks at N=2; max tests peak at N=3; the broadcast channel saturates beyond N=3.

Agents start talking to each other

Given a room, agents produce coordination language no one prompted for: claiming a module before working it, announcing a planned interface, and occasionally apologizing for a cross-agent fix (“I touched your file for a one-line fix to unblock the tests, sorry”). Shared-only agents, with the same CRDT but no signalling channel, produce none of this. The channel affordance is what activates the behavior (Park et al., 2023; Wu et al., 2024).

The full collaboration taxonomy (and how we labelled it)

Across six financial-ledger Sonnet AgentRoom (2 agents) runs we observe module-claiming in 6/6, plan-adjustment in 5/6, and cross-agent bug-fix in 2/6. These are author labels over the room logs, not an automated metric; the representative apology quote is drawn from a three-agent run where the pattern is more frequent. None of the patterns appear in shared-only.

What we got wrong first

The project began with a much larger claim, an emergence score of 3.82x and talk of grokking. Repeated validity audits killed it: that number was an artifact of a scorer that rewarded raw file and line count, a single bad baseline run, and a CRDT bridge bug under which the agents never actually shared edits. The surviving claim is narrower: AgentRoom buys reliability, lower abandonment and tighter variance, not a large mean-quality lift.

Limitations and conclusion

We score with an LLM-judge composite cross-validated against regex and abstract syntax tree (AST) scorers, not a held-out execution oracle (tasks ship only an agent-authored test suite), so we make no execution-verified correctness claim. The core result covers two-agent TypeScript backend tasks across three models, with small samples per group. The abandonment finding therefore rests on pooling twelve strata rather than any single run group. And the coordination-versus-substrate attribution rests on a small same-budget pool.

Within that scope the picture is consistent. Concurrent multi-agent coding has been treated as a parallelism problem or a merge-correctness problem. It is mostly a coordination problem, and a small set of state-management operations (claim files, announce progress, query the room state) is what turns a crowd of agents tripping over each other into one that ships working code.

Give two agents a room and a way to claim the work, and they stop overwriting each other and start finishing the job.


Resources

  1. Anthropic. (2024). Introducing the Model Context Protocol. https://www.anthropic.com/news/model-context-protocol
  2. Automerge Project. (2023). Automerge: A library of data structures for building collaborative applications. https://automerge.org
  3. Hong, S., Zhuge, M., Chen, J., Zheng, X., Cheng, Y., Zhang, C., Wang, J., Wang, Z., Yau, S. K. S., Lin, Z., Zhou, L., Ran, C., Xiao, L., Wu, C., & Schmidhuber, J. (2024). MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework. The Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=VtmBAGCN7o
  4. Jahns, K. (2024). Yjs: A CRDT framework for shared editing. https://github.com/yjs/yjs
  5. Nguyen, M. H., Chau, T. P., Nguyen, P. X., & Bui, N. D. Q. (2024). AgileCoder: Dynamic Collaborative Agents for Software Development based on Agile Methodology. https://arxiv.org/abs/2406.11912
  6. Park, J. S., O’Brien, J. C., Cai, C. J., Morris, M. R., Liang, P., & Bernstein, M. S. (2023). Generative Agents: Interactive Simulacra of Human Behavior. https://arxiv.org/abs/2304.03442
  7. Pugachev, S. (2025). CodeCRDT: Observation-Driven Coordination for Multi-Agent LLM Code Generation. https://arxiv.org/abs/2510.18893
  8. Qian, C., Liu, W., Liu, H., Chen, N., Dang, Y., Li, J., Yang, C., Chen, W., Su, Y., Cong, X., Xu, J., Li, D., Liu, Z., & Sun, M. (2024). ChatDev: Communicative Agents for Software Development. In L.-W. Ku, A. Martins, & V. Srikumar (Eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 15174–15186). Association for Computational Linguistics. https://doi.org/10.18653/v1/2024.acl-long.810
  9. Wang, X., Li, B., Song, Y., Xu, F. F., Tang, X., Zhuge, M., Pan, J., Song, Y., Li, B., Singh, J., Tran, H. H., Li, F., Ma, R., Zheng, M., Qian, B., Shao, Y., Muennighoff, N., Zhang, Y., Hui, B., … Neubig, G. (2025). OpenHands: An Open Platform for AI Software Developers as Generalist Agents. The Thirteenth International Conference on Learning Representations. https://openreview.net/forum?id=OJd3ayDDoF
  10. Wu, Q., Bansal, G., Zhang, J., Wu, Y., Li, B., Zhu, E., Jiang, L., Zhang, X., Zhang, S., Liu, J., Awadallah, A. H., White, R. W., Burger, D., & Wang, C. (2024). AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversations. First Conference on Language Modeling. https://openreview.net/forum?id=BAakY1hNKS