A note on what you're reading: The architecture + design patterns described here are real and running in production. But I'm not publishing my actual code, file paths, regex patterns, or config structures. The code samples below were written specifically for this post to illustrate the concepts without exposing attack surface. Think of them as "faithful recreations" rather than copy-paste excerpts. The real implementation lives in the monorepo's ADR trail + a private Notion workspace.
TL;DR
- Running 3-7 concurrent agents (Gemini CLI + Claude Code + Cursor + Codium) on the SAME branch, SAME repo, no worktrees
- Zero file corruption across ~5 months of production use on a 150+ ADR monorepo
- Three implemented layers: OS-level file locking, tiered write protection, read guards
- A fourth layer (orchestrator-level locks with priority queuing) is designed but not yet needed
No, You Don't Need Worktrees
I get it. The instinct when you hear "multiple agents editing the same codebase" is to reach for git worktrees. Separate working directories, separate checkouts, merge when done. Clean. Safe. Boring.
Here's why that doesn't work for my setup:
| Factor | Worktrees | Single Branch |
|---|---|---|
| Context sharing | Each agent sees its own tree; changes invisible to others until merge | All agents see the same files in real-time |
| Merge conflicts | Guaranteed when agents touch overlapping areas | Prevented at write-time by locking |
| Disk space | N copies of the repo | One copy |
| Config files | Duplicated per worktree | One set of configs; agents share coordination state |
| IDE integration | Each worktree needs its own window/session | Agents operate in the same project context |
The whole point of my agentic harness is that agents share state. The task system, the handoff protocol, the event bus, the session-state directory; all of it lives in a shared coordination layer and all agents need to read + write to it. Worktrees would mean N copies of that state, and now you're solving distributed consensus instead of file locking. No thanks.
What I Built Instead
Three layers, each solving a different problem. Each one is implemented independently; you don't need all three to get value.
| Layer | What It Does | Protects Against | Status |
|---|---|---|---|
| OS-level locks | Exclusive write access to shared data files | Two agents writing the same coordination file simultaneously | ✅ Implemented |
| Tiered write guard | Block / checkpoint / warn on writes to sensitive files | Agent overwriting credentials, governance configs, etc. | ✅ Implemented |
| Read guard | Block reads of sensitive files | Agent reading credential material or key files | ✅ Implemented |
| Orchestrator locks | Centralized lock management with priority queue + TTL | High-contention scenarios with 10+ agents | ⏳ Designed, not yet needed |
Layer 1: OS-Level File Locking
This is the foundation. Every coordination script uses Python's fcntl.flock() for atomic file access.
Here's an illustrative version of the locking pattern (not my production code, but faithful to the concept):
```python
import fcntl
import json
import signal
from pathlib import Path


class CoordinationLock:
    """
    Illustrative file locking for multi-agent coordination.
    Real implementation details differ from what's shown here.
    """

    def __init__(self, coordination_file: Path, timeout_seconds: int):
        self.target = coordination_file
        self.timeout = timeout_seconds

    def _on_timeout(self, signum, frame):
        raise TimeoutError(
            f"Another agent is writing to {self.target.name}. "
            f"Retry in a few seconds."
        )

    def write_record(self, record: dict) -> bool:
        """Append a record with exclusive locking."""
        # SIGALRM interrupts the blocking flock() call if the lock is not
        # acquired in time. Unix only, and only from the main thread.
        signal.signal(signal.SIGALRM, self._on_timeout)
        signal.alarm(self.timeout)
        try:
            with open(self.target, "a") as f:
                fcntl.flock(f.fileno(), fcntl.LOCK_EX)
                signal.alarm(0)  # lock acquired; the timeout only covers the wait
                try:
                    f.write(json.dumps(record) + "\n")
                    f.flush()
                    return True
                finally:
                    fcntl.flock(f.fileno(), fcntl.LOCK_UN)
        except TimeoutError:
            return False
        finally:
            # Cancel any pending alarm on every exit path, including
            # unexpected errors from open() or write().
            signal.alarm(0)
```
The key design choices:
- Fail-fast timeout: if you can't get a lock quickly, something is wrong. Blocking an agent for 30 seconds while another agent finishes a write is worse than failing + retrying. (The exact timeout was tuned after one of the worker agents reviewed the spec and said "that's too generous." Agents reviewing each other's ADRs is genuinely useful.)
- Append mode: agents add records to a coordination file; they don't rewrite the whole thing. This keeps the locked window tiny (milliseconds, not seconds).
- OS-managed cleanup: fcntl.flock() locks release automatically when the holding process exits. No stale locks from crashed agents. No cleanup scripts. This is the killer feature.
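One caveat with the signal-based timeout above: signal.alarm() only works in the main thread of a Unix process. If that's a problem for your setup, a polling variant with LOCK_NB gets the same fail-fast behavior. This is a minimal sketch, not my production code; the function name, the default timeout, and the poll interval are all made up for illustration.

```python
import fcntl
import json
import time
from pathlib import Path


def append_with_polling_lock(path: Path, record: dict, timeout: float = 2.0,
                             poll_interval: float = 0.01) -> bool:
    """Append one JSON line, polling for an exclusive flock until timeout."""
    deadline = time.monotonic() + timeout
    with open(path, "a") as f:
        while True:
            try:
                # LOCK_NB makes flock() raise instead of blocking
                # when another open file holds the lock.
                fcntl.flock(f.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)
                break
            except BlockingIOError:
                if time.monotonic() >= deadline:
                    return False  # fail fast; the caller decides whether to retry
                time.sleep(poll_interval)
        try:
            f.write(json.dumps(record) + "\n")
            f.flush()
            return True
        finally:
            fcntl.flock(f.fileno(), fcntl.LOCK_UN)
```

The tradeoff: polling burns a few wakeups while waiting, but it needs no signal handler and is safe to call from worker threads.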
The Lifecycle
Agent wants to write a coordination record
↓
Opens the shared file in append mode
↓
Attempts exclusive lock with timeout
├── Lock available → Write record → Release → Done
└── Lock held → TimeoutError → Retry with backoff
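The "retry with backoff" branch can be sketched as a small wrapper. Again, illustrative only: retry_with_backoff, its parameters, and the delay values are hypothetical. It assumes the write function returns True on success and False on a lock timeout, which is how write_record behaves above.

```python
import random
import time
from typing import Callable


def retry_with_backoff(
    attempt_write: Callable[[], bool],
    max_attempts: int = 5,      # hypothetical default
    base_delay: float = 0.05,   # hypothetical default, in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call attempt_write until it returns True or attempts run out.

    The delay doubles after each failure, with random jitter added so
    agents that collided once are less likely to collide again.
    """
    for attempt in range(max_attempts):
        if attempt_write():
            return True
        if attempt < max_attempts - 1:
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay))
    return False
```

Usage would look like retry_with_backoff(lambda: lock.write_record(record)), where lock is a CoordinationLock instance.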
Why This Is Enough (For Now)
The limitation of OS locks is no visibility; you can't see who holds the lock or for how long. For 3-7 agents, that hasn't mattered because lock hold times are measured in milliseconds. The orchestrator layer will add visibility when scale demands it.
Layer 2: Tiered Write Guard
OS locks protect shared data files. The write guard protects everything else that agents shouldn't be modifying. Credential files, key material, governance configs, infrastructure state.
Three Tiers of Protection
| Tier | Action | Effect |
|---|---|---|
| Block | Hard stop, no exceptions | Hook returns a blocking exit code. Agent cannot proceed. |
| Checkpoint | Requires human approval | Agent stops + presents the proposed change for review. |
| Warn | Allowed but audited | Warning issued, audit trail created, agent continues. |
What falls into which tier? I'm not publishing that. (Publishing "here are the exact files we hard-block" also tells you "here are the files we DON'T hard-block.")
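What I can show is the shape of the lookup. Here's a minimal sketch of a tier classifier, assuming glob-style rules where the first match wins. The patterns below are invented for this post and are NOT my real tier assignments; Tier, EXAMPLE_RULES, and classify_write are hypothetical names.

```python
from enum import Enum
from fnmatch import fnmatch
from typing import Optional


class Tier(Enum):
    BLOCK = "block"
    CHECKPOINT = "checkpoint"
    WARN = "warn"


# HYPOTHETICAL rules for illustration only; not the real tier assignments.
# Order matters: the first matching pattern wins, so list the strictest first.
EXAMPLE_RULES = [
    ("*.pem", Tier.BLOCK),
    ("config/governance/*", Tier.CHECKPOINT),
    ("docs/*", Tier.WARN),
]


def classify_write(path: str, rules=EXAMPLE_RULES) -> Optional[Tier]:
    """Return the tier for a write to `path`, or None if it is unguarded."""
    for pattern, tier in rules:
        if fnmatch(path, pattern):
            return tier
    return None
```

Note that fnmatch's `*` also matches path separators, so "*.pem" catches key files at any depth.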
Indirect Write Detection
Direct file writes via agent tools are easy to intercept. But agents also write via shell commands: redirects, pipe-to-file, inline sed, etc. The guard parses shell commands to detect these indirect writes.
Here's an illustrative example of how you might approach shell write detection (not my production patterns):
```python
import re


def detect_shell_write_target(command: str) -> str | None:
    """
    Illustrative shell write detection.
    Production uses different patterns + additional layers.
    """
    # These are EXAMPLE patterns, not the real detection set.
    examples = [
        # redirect: echo x > file, echo x >> file (skips fd dups like 2>&1)
        r'>>?\s*([^\s&|;<>]+)',
        # tee: ... | tee file, ... | tee -a file (skips leading flags)
        r'\btee\s+(?:-\S+\s+)*(\S+)',
    ]
    for pattern in examples:
        match = re.search(pattern, command)
        if match:
            # Strip surrounding quotes from the captured target.
            return match.group(1).strip("'\"")
    return None
```
The important design principle: this detection layer is complemented by other protections. No single layer needs to be perfect.
Self-Protection
The guard config protects itself from modification. If an agent tries to "helpfully" weaken the guard to unblock itself, the edit gets blocked. (Is this paranoid? Maybe. But I've seen agents try.)
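A minimal sketch of the self-protection check, assuming the guard knows where its own config lives. The path below is a placeholder (the real one isn't published), and is_self_modification is a hypothetical name.

```python
from pathlib import Path

# HYPOTHETICAL location of the guard's own config; the real path is not published.
GUARD_CONFIG = Path("hooks/write_guard.yaml")


def is_self_modification(target: str, repo_root: Path) -> bool:
    """True if a write to `target` would touch the guard's own config file."""
    # Resolve both sides so "../" detours and symlinked parents
    # can't be used to sneak past a plain string comparison.
    resolved = (repo_root / target).resolve()
    return resolved == (repo_root / GUARD_CONFIG).resolve()
```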
Layer 3: Read Guard
The write guard blocks writes. The read guard blocks reads. You don't want agents reading credential material "just to check the format."
Two categories: hard deny (agent can't read the file, period) and soft ask (agent gets a warning + the human gets notified). I'm not publishing which files fall into which category.
The Unified Hook Executor
All three layers plug into a central hook executor. Every agent action fires through this before executing. It validates multiple categories in sequence:
| Category | Examples |
|---|---|
| Compliance | Attribution checks, changelog governance, ADR enforcement |
| Security | Write guards, read guards, CLI command scanning, injection detection |
| Quality | Context drift monitoring, notification routing |
| Orchestration | Handoff validation, anti-spiral detection |
| Budget | Per-session token consumption tracking |
If any validator in the chain returns a blocking result, the action stops. The executor runs across ALL agents (Cursor, Claude Code, Gemini CLI, Windsurf) with per-platform event mapping.
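The "first blocking result stops the action" rule can be sketched in a few lines. This is not the real executor: HookResult, run_hooks, the dict-shaped action, and the example validator (including its placeholder ADR reference) are all assumptions made for this post.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class HookResult:
    blocked: bool
    reason: str = ""
    adr_ref: Optional[str] = None  # hypothetical field: which ADR governs the rule


# A validator takes the proposed action (a plain dict here) and returns a result.
Validator = Callable[[dict], HookResult]


def run_hooks(action: dict, validators: Iterable[Validator]) -> HookResult:
    """Run validators in order; stop at the first blocking result."""
    for validate in validators:
        result = validate(action)
        if result.blocked:
            return result
    return HookResult(blocked=False)


def no_env_writes(action: dict) -> HookResult:
    # Example security validator with a made-up rule.
    if action.get("type") == "write" and action.get("path", "").endswith(".env"):
        return HookResult(True, "writes to .env files are blocked",
                          "ADR-000 (placeholder)")
    return HookResult(False)
```

Because the chain short-circuits, validators later in the list never see an action that an earlier one blocked.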
Staged Rollout
Not every hook starts in "block mode." A rollout config controls graduated enforcement:
monitor (log only) → warn (surface to agent) → enforce (block on violation)
Some hooks start at monitor and earn their way to enforce after stabilization. Others start at enforce and stay there permanently.
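Here's one way the graduated enforcement could look in code. It's a sketch under assumptions: the hook names, the ROLLOUT dict, and the choice to default unknown hooks to monitor are mine for this example, not the real config structure.

```python
from enum import Enum
from typing import Optional, Tuple


class Mode(Enum):
    MONITOR = "monitor"
    WARN = "warn"
    ENFORCE = "enforce"


# HYPOTHETICAL rollout config; hook names are invented for this example.
ROLLOUT = {
    "write_guard": Mode.ENFORCE,
    "context_drift": Mode.MONITOR,
}


def apply_rollout(hook_name: str, violation: bool, audit_log: list,
                  rollout: dict = ROLLOUT) -> Tuple[bool, Optional[str]]:
    """Translate a detected violation into an outcome for the hook's mode.

    Returns (blocked, message_for_agent).
    """
    mode = rollout.get(hook_name, Mode.MONITOR)  # assumption: unknown hooks only log
    if not violation:
        return False, None
    audit_log.append((hook_name, mode.value))  # every mode logs the violation
    if mode is Mode.MONITOR:
        return False, None  # log only; the agent never sees it
    message = f"{hook_name}: violation detected ({mode.value})"
    return mode is Mode.ENFORCE, message
```

Promoting a hook from monitor to enforce is then a one-line config change, with no change to the hook itself.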
The Planned Orchestrator Layer
For completeness: the orchestrator-level locking system has been fully designed (with its own ADR) but not implemented. It adds things the OS-level locks don't provide:
| Feature | OS Locks | Orchestrator |
|---|---|---|
| Lock types | Exclusive only | Read / Write / Intent-Write |
| Visibility | Opaque | State file shows all active locks + queue |
| Priority | None (wake-up order not guaranteed) | Critical / High / Normal / Low |
| TTL | Process lifetime | Configurable with renewal protocol |
| Starvation prevention | None | Auto-promotion after wait threshold |
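To make the priority + starvation rows concrete, here's a rough sketch of a wait queue with auto-promotion. Since this layer is designed but not implemented, treat everything here as hypothetical: LockQueue, promote_after, and the one-level-per-threshold promotion rule are my assumptions for illustration, not the ADR's actual mechanics.

```python
import itertools

# Lower number = served first.
PRIORITIES = {"critical": 0, "high": 1, "normal": 2, "low": 3}


class LockQueue:
    """Priority wait queue where long waiters get promoted one level."""

    def __init__(self, promote_after: float):
        self.promote_after = promote_after  # seconds waited before promotion
        self._waiting = []                  # entries: [priority, seq, waiting_since, agent]
        self._seq = itertools.count()       # tie-breaker: FIFO within a priority

    def enqueue(self, agent: str, priority: str, now: float) -> None:
        self._waiting.append([PRIORITIES[priority], next(self._seq), now, agent])

    def next_agent(self, now: float):
        """Promote starved waiters, then pop the best (priority, arrival) entry."""
        for entry in self._waiting:
            if entry[0] > 0 and now - entry[2] >= self.promote_after:
                entry[0] -= 1    # bump up one priority level
                entry[2] = now   # restart the wait clock at the new level
        if not self._waiting:
            return None
        best = min(self._waiting, key=lambda e: (e[0], e[1]))
        self._waiting.remove(best)
        return best[3]
```

A promoted waiter keeps its original sequence number, so it goes ahead of newer arrivals at its new priority level.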
Lock Compatibility Matrix
| Held ↓ / Request → | read | intent_write | write |
|---|---|---|---|
| read | ✅ Grant | ✅ Grant | ❌ Queue |
| intent_write | ✅ Grant | ✅ Grant | ❌ Queue |
| write | ❌ Queue | ❌ Queue | ❌ Queue |
Multiple concurrent reads are fine. Any write is exclusive. intent_write lets an agent signal "I'm about to modify this" without blocking readers yet.
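The matrix translates directly into a lookup table. A minimal sketch (COMPATIBLE and can_grant are names I made up for this post; the planned orchestrator may structure this differently):

```python
# Rows: lock currently held. Columns: lock being requested.
# True means grant immediately; False means queue. Mirrors the table above.
COMPATIBLE = {
    "read":         {"read": True,  "intent_write": True,  "write": False},
    "intent_write": {"read": True,  "intent_write": True,  "write": False},
    "write":        {"read": False, "intent_write": False, "write": False},
}


def can_grant(requested: str, held: list) -> bool:
    """Grant only if the request is compatible with every lock already held."""
    return all(COMPATIBLE[h][requested] for h in held)
```

With no locks held, all() over an empty list is True, so any request is granted.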
Why It's Not Implemented
The three current layers handle everything I've thrown at them for 3-7 concurrent agents. The complexity of orchestrator locks isn't justified at current scale. The design is ready for when it is.
How Agents Learn the Protocol
A lock protocol skill teaches agents the procedures: check lock state before writing, follow the decision matrix, always release on error, handle contention by working on something else. The skill is federated into every IDE config directory so all agents discover it automatically.
Real-World Concurrency Pattern
Terminal 1: Cursor (primary) — editing a governance handler
Terminal 2: Claude Code — writing a task coordination record
Terminal 3: Gemini CLI — reviewing an ADR + writing feedback
Terminal 4: Cursor (parallel) — updating the changelog
What happens:
- Claude Code acquires lock on the coordination file, writes, releases. ~50ms.
- Cursor edits the handler directly (single writer, no lock needed)
- Gemini CLI reads the ADR (read-only, no lock needed)
- Cursor (parallel) writes the changelog (attribution check fires, no file lock needed)
Lock contention: zero. Each agent touches different files.
In practice, coordination writes take ~50ms and the fail-fast timeout has NEVER been hit in production. The lock is insurance, not a bottleneck.
The ADR Trail
Every design decision in this system is documented in its own ADR. The file locking protocol, the task architecture, the hook enforcement framework, the tiered write guard, the cross-agent hook wiring, and the orchestrator design each have full context + decision rationale + consequence analysis.
The ADRs aren't just documentation; they're executable context. Each hook references its governing ADR, and when an agent gets blocked, the error message includes the ADR reference so the agent (or the human) can look up why the rule exists.
What's Next
- Orchestrator implementation: when scale demands it
- Contention metrics: automated monitoring to surface patterns I can't see today
- Cross-repo locking: when agents operate across multiple repos
The system is intentionally simple. fcntl.flock() is a 1970s Unix primitive. signal.alarm() is barely more sophisticated. But simple primitives composed into layered defense have kept 7 concurrent agents from destroying my monorepo for months. Sometimes the unsexy solution is the right one.
