MPMC Queues: The Final Boss
When Everyone Fights Everyone: Dual CAS Coordination

We've conquered SPSC (one-to-one) and MPSC (many-to-one). Now comes the final boss: MPMC—Multi-Producer Multi-Consumer. Everyone writes. Everyone reads. Everyone competes on both sides of the queue.
If SPSC maps neatly to dedicated pipeline stages and MPSC maps to “many sources feeding a single aggregator,” then MPMC is what you find in task execution frameworks, thread pools, web server request queues, and any system where multiple threads both produce work and consume it. This is where most naive implementations crumble. In this chapter we’ll look at why, and then build a version that doesn’t.
The double contention nightmare
MPMC is uniquely challenging because you have contention on both ends of the queue: producers compete with one another for write slots at the tail, consumers compete with one another for read slots at the head, and a naive design lets the two sides interfere with each other as well.
The naive solution is to protect both ends with a single global lock.
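A minimal sketch of that approach, using a plain ArrayDeque guarded by one ReentrantLock; the LockedQueue class here is an illustrative stand-in, not the series' actual code:

```java
import java.util.ArrayDeque;
import java.util.concurrent.locks.ReentrantLock;

// Illustrative sketch: a bounded queue guarded by one global lock.
// Every producer and every consumer must acquire the same lock.
public final class LockedQueue<T> {
    private final ArrayDeque<T> items = new ArrayDeque<>();
    private final ReentrantLock lock = new ReentrantLock();
    private final int capacity;

    public LockedQueue(int capacity) { this.capacity = capacity; }

    public boolean offer(T item) {
        lock.lock();                      // producers and consumers all serialize here
        try {
            if (items.size() == capacity) return false;   // full
            items.addLast(item);
            return true;
        } finally {
            lock.unlock();
        }
    }

    public T poll() {
        lock.lock();                      // same lock again on the consume path
        try {
            return items.pollFirst();     // null if empty
        } finally {
            lock.unlock();
        }
    }
}
```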
The result is complete serialization:
With 8 producers + 8 consumers, throughput collapses to 800K ops/sec. Adding threads makes it slower because you’ve turned a concurrent problem into a strictly serialized one guarded by a hot lock that all participants share.
The lock-free solution: dual CAS
The core insight for MPMC is that producers only need to contend with other producers, and consumers only need to contend with other consumers. There is no reason for a consumer to grab the same mutex as a producer, as long as we can safely coordinate the head and tail indices. That leads us to a dual CAS design: one CAS loop on the tail for claiming write slots, and a separate CAS loop on the head for claiming read slots.
Now producers and consumers operate independently:
No global lock. No serialization between the “send” side and the “receive” side. Just CAS-based coordination among peers on each end.
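As a rough sketch of the claiming step only, here is what the two independent CAS loops might look like. The class and method names are hypothetical, and slot-level publication is deliberately left out because it needs the per-slot sequence numbers covered below:

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of the index-claiming step: producers CAS the tail among themselves,
// consumers CAS the head among themselves. Neither side touches the other's index CAS.
public final class DualCasIndices {
    private final AtomicLong tail = new AtomicLong(0);   // contended by producers only
    private final AtomicLong head = new AtomicLong(0);   // contended by consumers only
    private final int capacity;

    public DualCasIndices(int capacity) { this.capacity = capacity; }

    /** A producer claims the next write position, or returns -1 if the ring looks full. */
    long claimWriteSlot() {
        while (true) {
            long t = tail.get();
            if (t - head.get() >= capacity) return -1;     // full
            if (tail.compareAndSet(t, t + 1)) return t;    // won the race against other producers
            // lost the race: another producer claimed position t, re-read and retry
        }
    }

    /** A consumer claims the next read position, or returns -1 if the ring looks empty. */
    long claimReadSlot() {
        while (true) {
            long h = head.get();
            if (h >= tail.get()) return -1;                // empty
            if (head.compareAndSet(h, h + 1)) return h;    // won the race against other consumers
        }
    }
}
```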
The performance revolution
Using the same benchmark harness as for MPSC, we test 4 producers + 4 consumers and 8 producers + 8 consumers, each performing 10 million operations.
Throughput
The lock-free design is 6.5× faster, but more importantly, lock-free scales while the locked version collapses as you add threads. Instead of turning additional cores into lock-waiters, the dual CAS scheme lets both producer and consumer sides exploit more parallelism.
Latency
That p99.9 is where MPMC systems traditionally die. In the locked version, you see 12 microseconds due to cascading lock waits: producers and consumers both pile up behind the same mutex, amplifying each other’s delays. In the lock-free version, those long outliers shrink to 1.2 microseconds, driven primarily by brief CAS retry bursts rather than serialized lock handoffs.
CPU efficiency
Result: 3.25× better CPU utilization (35% wasted vs 80% wasted)
Lock-free still has coordination cost—nothing is free—but it wastes 35% of CPU time instead of 80%. On a 16-thread system, that’s the difference between burning ~13 cores on overhead versus ~5.6 cores, which often decides whether you can scale on a single machine or need to shard across multiple.
The sequence number coordination
The dual CAS scheme above ensures that producers and consumers don’t step on each other’s indices, but we still need to guard against out-of-order writes and reads on individual slots.
Consider this: a producer wins the CAS on the tail and claims slot 5, but is preempted before it finishes copying its element into the slot. A consumer then observes that the tail has already moved past slot 5, reads the slot, and gets stale or half-written data. The index CAS only said the slot was claimed, not that it was ready.
The fix is the same per-slot sequence number idea we used in MPSC, applied symmetrically.
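Here is one way that can look in Java, loosely following Dmitry Vyukov's bounded MPMC queue design. The MpmcRingQueue class below is an illustrative, simplified sketch (it assumes a power-of-two capacity and leans on the volatile semantics of the atomic array classes), not the series' production implementation:

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicLongArray;
import java.util.concurrent.atomic.AtomicReferenceArray;

// Each slot carries a sequence number that says whose "turn" the slot is
// on the current lap of the ring: writable at position p when seq == p,
// readable at position p when seq == p + 1.
public final class MpmcRingQueue<T> {
    private final int capacity;                 // must be a power of two
    private final int mask;
    private final AtomicReferenceArray<T> slots;
    private final AtomicLongArray sequences;    // per-slot sequence numbers
    private final AtomicLong tail = new AtomicLong(0);  // producers contend here
    private final AtomicLong head = new AtomicLong(0);  // consumers contend here

    public MpmcRingQueue(int capacity) {
        if (Integer.bitCount(capacity) != 1)
            throw new IllegalArgumentException("capacity must be a power of two");
        this.capacity = capacity;
        this.mask = capacity - 1;
        this.slots = new AtomicReferenceArray<>(capacity);
        this.sequences = new AtomicLongArray(capacity);
        for (int i = 0; i < capacity; i++) sequences.set(i, i);  // slot i is ready for write #i
    }

    public boolean offer(T item) {
        while (true) {
            long t = tail.get();
            int idx = (int) (t & mask);
            long seq = sequences.get(idx);
            if (seq == t) {
                // Slot is free for this lap: race the other producers for it.
                if (tail.compareAndSet(t, t + 1)) {
                    slots.set(idx, item);
                    sequences.set(idx, t + 1);        // publish: now readable at position t
                    return true;
                }
            } else if (seq < t) {
                return false;                         // previous lap not consumed yet: full
            }
            // seq > t: our view of the tail is stale, re-read and retry
        }
    }

    public T poll() {
        while (true) {
            long h = head.get();
            int idx = (int) (h & mask);
            long seq = sequences.get(idx);
            if (seq == h + 1) {
                // Slot has been published for this position: race the other consumers for it.
                if (head.compareAndSet(h, h + 1)) {
                    T item = slots.get(idx);
                    slots.set(idx, null);
                    sequences.set(idx, h + capacity); // hand the slot back for the next lap
                    return item;
                }
            } else if (seq < h + 1) {
                return null;                          // producer hasn't published yet: empty
            }
            // seq > h + 1: our view of the head is stale, re-read and retry
        }
    }
}
```

The power-of-two capacity lets the slot index be computed with a cheap bit mask instead of a modulo, and writing the sequence number last is what actually publishes a slot to consumers.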
This gives us three guarantees. Producers never overwrite active slots: the sequence check stops a producer from writing to a slot until the consumer has finished reading it on the previous lap. Consumers never read incomplete data: a consumer only reads a slot after its producer has published it. And the ring can be reused lap after lap without confusion, because the sequence number encodes which round each slot belongs to, so the buffer can cycle through many complete passes without losing track of slot ownership.
When to use MPMC
MPMC queues are perfect for task execution frameworks following ThreadPoolExecutor-style patterns, where multiple threads submit tasks and multiple worker threads consume them. Web server request queues represent another ideal use case, where multiple acceptors handle incoming connections and multiple handlers process requests concurrently. Work-stealing schedulers benefit from MPMC queues, allowing multiple submitters to enqueue work while multiple workers dequeue and execute tasks. Event processing systems with multiple sources feeding multiple processors also represent a natural fit for MPMC coordination.
MPMC is not suitable when you have a single producer and single consumer, as SPSC is simpler and faster for that scenario. When you have multiple producers but only a single consumer, MPSC avoids consumer-side contention and is the better choice. For low thread counts with fewer than four total threads, locks are fine and simpler to implement and maintain. And because these queues are built on fixed-capacity ring buffers, they are a poor fit when the queue needs to grow dynamically at runtime.
The rule of thumb: If you have 4+ producers AND 4+ consumers with high throughput needs, and you're willing to pay the complexity cost, use a lock-free MPMC.
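To make the thread-pool use case concrete, here is a hypothetical toy harness built on the MpmcRingQueue sketch above: four submitter threads offer small tasks while four daemon workers poll and run them. Everything here (class name, thread counts, the spin-wait backoff) is illustrative rather than a production executor:

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public final class TinyPoolDemo {
    public static void main(String[] args) throws InterruptedException {
        MpmcRingQueue<Runnable> tasks = new MpmcRingQueue<>(1024);
        AtomicLong processed = new AtomicLong();

        // Workers: multiple consumers, contending only with each other on the head.
        for (int i = 0; i < 4; i++) {
            Thread worker = new Thread(() -> {
                while (true) {
                    Runnable task = tasks.poll();
                    if (task != null) task.run();
                    else Thread.onSpinWait();          // queue empty: back off briefly
                }
            });
            worker.setDaemon(true);
            worker.start();
        }

        // Submitters: multiple producers, contending only with each other on the tail.
        for (int i = 0; i < 4; i++) {
            new Thread(() -> {
                for (int n = 0; n < 100_000; n++) {
                    while (!tasks.offer(processed::incrementAndGet)) {
                        Thread.onSpinWait();           // queue full: retry
                    }
                }
            }).start();
        }

        TimeUnit.SECONDS.sleep(1);                     // crude demo: let it run briefly
        System.out.println("processed: " + processed.get());
    }
}
```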
Real-world impact: task execution
Here’s what happened when we rebuilt a task execution framework to use a lock-free MPMC queue instead of a global lock:
Before (Locked):
After (Lock-Free MPMC):
Result: 6.5× more throughput, 7.8× better latency, and 99% fewer timeouts. Same hardware, same workload mix, just a fundamentally better coordination mechanism at the queue.
Closing thoughts
MPMC is the hardest coordination pattern in this series because everyone competes on both sides of the queue. One global lock is simple to reason about and easy to implement, but it serializes all operations and quickly becomes catastrophically slow under load.
Dual CAS lets producers and consumers operate independently, with contention only among their own kind. The cost is complexity: CAS loops, per-slot sequence numbers, and careful attention to memory ordering are non-trivial to get right and non-trivial for teams to maintain. The reward, however, is substantial: 6.5× throughput and 10× better tail latency with 16 threads in our benchmarks, plus significantly better CPU utilization.
In our series so far, we've covered:
- Off-heap ring buffers, which eliminated GC pauses by moving data structures outside the heap
- Wait-free SPSC queues, optimal for one-to-one coordination, delivering 6.4× throughput improvements
- Lock-free MPSC queues, which use CAS for many-to-one coordination and deliver 4.9× throughput improvements
- Lock-free MPMC queues, which use dual CAS for many-to-many coordination and deliver 6.5× throughput improvements
Next: Event Pipelines—chaining multiple SPSC queues (or ring segments) for staged processing. Think LMAX Disruptor: one of the most influential patterns in high-throughput event-driven systems.
May your CAS operations succeed and your queues never block.
Further Reading
- JCTools MPMC Queue – Production-grade implementation
- Java ThreadPoolExecutor – Uses related coordination patterns under the hood
- Disruptor MPMC Mode – Alternative approach from the Disruptor library
Repository: techishthoughts-org/off_heap_algorithms
