Wait-Free SPSC Queues in Java

December 23, 2025 · 18 min read

How to replace synchronized queue handshakes with a wait-free Single-Producer Single-Consumer ring buffer that uses precise memory ordering instead of locks.


When One Producer Meets One Consumer: The Art of Perfect Synchronization

Scenario 02: synchronized SPSC vs wait-free SPSC comparison.

You've built your off-heap ring buffer. GC pauses are gone. Memory is tight. But there's one problem: you're still using locks. Every synchronized block is a tiny handshake between threads: 50 nanoseconds here, 100 there. In high-frequency trading, those nanoseconds are money. Over millions of operations per second, that handshake tax quietly turns into entire cores burned just on coordination.

Today we're going to eliminate locks entirely. Not with fancy CAS operations. Not with spin loops that burn CPU while you wait. We're going simpler and more targeted: wait-free SPSC queues that rely on well-defined memory ordering rather than mutual exclusion. Along the way we'll connect the code to the underlying hardware, so you can see exactly where the savings come from instead of treating "wait-free" as a magic label.

SPSC stands for Single-Producer Single-Consumer. One thread writes. One thread reads. That's it. When you can guarantee this constraint, something beautiful happens: you can build the fastest queue possible on the JVM with nothing more than clear ownership of indices and carefully placed acquire/release barriers.


The setup: one gateway, one strategy

Picture a high-frequency trading system. On one side, you have an exchange gateway decoding market data from the network. On the other side, a trading strategy analyzing ticks and placing orders. The architecture might look roughly like this:

┌──────────────┐                    ┌──────────────┐
│   Exchange   │─── Market Data ───>│   Trading    │
│   Gateway    │    (10M/sec)       │   Strategy   │
│ (Thread 1)   │                    │ (Thread 2)   │
└──────────────┘                    └──────────────┘

The gateway is a single producer: it decodes packets from the NIC and emits normalized ticks. The strategy is a single consumer: it reads those ticks, updates internal state, and fires orders. No other thread writes into this particular path, and no other thread consumes directly from it. This is perfect SPSC.

The naive solution? Slap a synchronized on everything and call it a day:

public class SynchronizedSPSCRingBuffer<T> {
    private final Object[] buffer;
    private final int capacity;
    private final int mask;
    private int head;  // Producer writes here
    private int tail;  // Consumer reads here
    private int size;  // Number of elements currently in the buffer

    public SynchronizedSPSCRingBuffer(int capacity) {
        if (Integer.bitCount(capacity) != 1) {
            throw new IllegalArgumentException("capacity must be a power of two");
        }
        this.capacity = capacity;
        this.mask = capacity - 1;
        this.buffer = new Object[capacity];
    }

    public synchronized boolean offer(T element) {
        if (size == capacity) return false;
        buffer[head] = element;
        head = (head + 1) & mask;
        size++;
        return true;
    }

    @SuppressWarnings("unchecked")
    public synchronized T poll() {
        if (size == 0) return null;
        T element = (T) buffer[tail];
        buffer[tail] = null;
        tail = (tail + 1) & mask;
        size--;
        return element;
    }
}

Clean. Thread-safe. Easy to reason about. Unfortunately, as soon as you care about tens of millions of messages per second, this approach is slow as molasses.


The hidden cost of synchronized

To understand why synchronized hurts here, we need to look at what it really does under the hood. It's not "just a keyword": it's a full-blown coordination protocol. Monitor acquisition costs roughly 50 cycles as the thread takes ownership of the monitor object. Memory barriers add roughly 100 more, because the JVM places a full fence around the critical section so that all memory operations complete before it and become visible after it. Monitor release costs another 50 cycles as the thread hands ownership back. And under contention there is potential OS involvement through mechanisms like futex on Linux, which can cost 2,000 or more cycles as the thread drops into kernel space to wait for the lock.

Every single offer() and poll() pays this tax, even though in SPSC there is no logical contention: only one producer thread ever calls offer, and only one consumer thread ever calls poll. For 10 million operations per second:

10,000,000 ops/sec × 200 cycles per op = 2,000,000,000 cycles/sec

On a 3.5 GHz CPU, that's ~57% of one core spent just acquiring and releasing locks for a queue that could, in principle, run with zero fights. And here's the kicker: there's no contention in the high-level design. The contention is self-inflicted by the locking strategy.

The cache line ping-pong

Worse, the monitor lock itself lives in memory. When the producer acquires it, that cache line moves to CPU 0. When the consumer acquires it, it moves to CPU 1. Back and forth, thousands or millions of times per second:

CPU 0 (Producer):  [Lock] ─────────────> Cache Miss! (200 cycles)
CPU 1 (Consumer):  ──────> [Lock] ─────> Cache Miss! (200 cycles)

This is called cache line bouncing, and it turns an otherwise trivial operation into death by a thousand cuts. Even if the critical section itself is tiny, the act of coordinating ownership of the lock pollutes caches, burns cycles on coherence traffic, and increases latency scatter for both producer and consumer.


The wait-free revolution

Here's the key insight: in SPSC, we don't need locks at all. We just need to ensure two simple invariants. The producer must be able to reliably see when the buffer is full by reading the consumer's tail index and writing its own head index, allowing it to determine if there is space available. The consumer must be able to reliably see when the buffer is empty by reading the producer's head index and writing its own tail index, allowing it to determine if there is data available to consume.

No CAS. No retries. Just well-defined memory ordering between the two indices and the contents of the corresponding slots. If we respect those invariants, each side can operate independently and make progress without ever taking a shared lock.

The implementation

import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

public class WaitFreeSPSCRingBuffer<T> {
    private static final VarHandle HEAD;
    private static final VarHandle TAIL;
    
    static {
        try {
            // VarHandle provides precise memory ordering
            MethodHandles.Lookup lookup = MethodHandles.lookup();
            HEAD = lookup.findVarHandle(WaitFreeSPSCRingBuffer.class, "head", long.class);
            TAIL = lookup.findVarHandle(WaitFreeSPSCRingBuffer.class, "tail", long.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }
    
    private final Object[] buffer;
    private final int capacity;
    private final int mask;
    
    // Cache-line padded to avoid false sharing
    private volatile long p0, p1, p2, p3, p4, p5, p6, p7;
    private volatile long head;  // Producer's index
    private volatile long h0, h1, h2, h3, h4, h5, h6, h7;
    
    private volatile long c0, c1, c2, c3, c4, c5, c6, c7;
    private volatile long tail;  // Consumer's index
    private volatile long t0, t1, t2, t3, t4, t5, t6, t7;
    
    public WaitFreeSPSCRingBuffer(int capacity) {
        if (Integer.bitCount(capacity) != 1) {
            throw new IllegalArgumentException("capacity must be a power of two");
        }
        this.capacity = capacity;
        this.mask = capacity - 1;
        this.buffer = new Object[capacity];
    }
    
    public boolean offer(T element) {
        long currentHead = (long) HEAD.getOpaque(this);   // own index: no ordering needed
        long currentTail = (long) TAIL.getAcquire(this);  // ACQUIRE
        
        if (currentHead - currentTail >= capacity) {
            return false; // Full
        }
        
        buffer[(int) (currentHead & mask)] = element;
        HEAD.setRelease(this, currentHead + 1);  // RELEASE
        return true;
    }
    
    @SuppressWarnings("unchecked")
    public T poll() {
        long currentTail = (long) TAIL.getOpaque(this);   // own index: no ordering needed
        long currentHead = (long) HEAD.getAcquire(this);  // ACQUIRE
        
        if (currentTail >= currentHead) {
            return null; // Empty
        }
        
        T element = (T) buffer[(int) (currentTail & mask)];
        buffer[(int) (currentTail & mask)] = null;
        TAIL.setRelease(this, currentTail + 1);  // RELEASE
        return element;
    }
}

What just happened?

Three optimizations make all the difference:

1. VarHandle memory ordering instead of locks

Instead of full memory fences from synchronized, we use precise ordering operations on the index fields. The getAcquire operation establishes an acquire barrier, meaning "I want to see all writes that happened before the last setRelease," ensuring that any writes made by the producer before releasing the head index become visible to the consumer. The setRelease operation establishes a release barrier, meaning "Make all my previous writes visible before this write," ensuring that all writes made by the producer, including the element write, become visible before the head index is updated.

// Producer:
buffer[5] = element;              // Plain write (4 cycles)
HEAD.setRelease(this, 6);         // Release barrier (10 cycles)
 
// Consumer:
long head = HEAD.getAcquire(this); // Acquire barrier (10 cycles)
element = buffer[5];               // Plain read (4 cycles)

The release on the producer's head update ensures that the element write becomes visible before the new head value. The acquire on the consumer's head load ensures that once it sees the updated head, it will also see the element. Total cost: ~30 cycles versus 200+ cycles for synchronized, and no interaction with the OS.

2. Cache-line padding eliminates false sharing

Those weird p0, p1, p2... fields are not a mistake: they're padding that ensures head and tail live on different cache lines:

Without padding:
┌──────────────── Cache Line 0 ────────────────┐
│  head (8)  │  tail (8)  │    other (48)      │  ← Shared!
└────────────┴────────────┴────────────────────┘
   Producer     Consumer
      └─────────────┘
         CONFLICT!

With padding:
┌──────────────── Cache Line 0 ────────────────┐
│  head (8)  │        padding (56 bytes)       │
└────────────┴─────────────────────────────────┘
   Producer only

┌──────────────── Cache Line 1 ────────────────┐
│  tail (8)  │        padding (56 bytes)       │
└────────────┴─────────────────────────────────┘
   Consumer only

NO CONFLICTS!

Producer updates head → only CPU 0's cache line is invalidated. Consumer updates tail → only CPU 1's cache line is invalidated. They never interfere with each other at the cache-line level, which is exactly what you want in a high-throughput, low-latency pipeline.
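
If hand-maintained padding fields feel fragile, the JDK offers an annotation-based alternative. The following is a minimal sketch, not the article's code: it assumes you are willing to use the internal jdk.internal.vm.annotation.Contended annotation, compile with --add-exports java.base/jdk.internal.vm.annotation=ALL-UNNAMED, and run with -XX:-RestrictContended so the annotation is honored for application classes.

import jdk.internal.vm.annotation.Contended;

// Sketch only: @Contended asks the JVM to pad and isolate each group of fields
// onto its own cache-line region, replacing the hand-written p0..p7 fields.
public class PaddedIndices {

    @Contended("producer")
    volatile long head;   // written by the producer, read by the consumer

    @Contended("consumer")
    volatile long tail;   // written by the consumer, read by the producer
}

The hand-written padding in the ring buffer above has the advantage of needing no JVM flags, which is one reason manual padding is still common in open-source queue libraries such as JCTools.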

3. No locks means no OS involvement

Synchronized blocks can invoke the operating system if there's contention (for example via futex on Linux). Even without contention, the monitor protocol introduces extra work. The wait-free operations here are pure CPU instructions: loads, stores, and lightweight barriers. No system calls, no kernel transitions, no context switches, just straight-line code that the CPU can predict and optimize aggressively.


The performance revolution

Let's measure this properly. Using the same test harness as in the ring buffer article (10 million operations in total: 5M offers and 5M polls, against a capacity of 4096), we compare the synchronized and wait-free SPSC implementations.
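
The original harness isn't reproduced here, but if you want to approximate the comparison yourself, a JMH sketch along the following lines pairs exactly one producer thread with one consumer thread. This is a sketch under stated assumptions, not the author's harness: the class names refer to the implementations above, and JMH (org.openjdk.jmh) must be on the classpath.

import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

// Sketch: one thread calls offer(), one thread calls poll(), against a queue
// instance shared by the benchmark thread group.
@State(Scope.Group)
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public class SpscBenchmarkSketch {

    private WaitFreeSPSCRingBuffer<Long> queue;

    @Setup
    public void setUp() {
        queue = new WaitFreeSPSCRingBuffer<>(4096);
    }

    @Benchmark
    @Group("spsc")
    @GroupThreads(1)            // exactly one producer
    public boolean offer() {
        return queue.offer(1L);
    }

    @Benchmark
    @Group("spsc")
    @GroupThreads(1)            // exactly one consumer
    public Long poll() {
        return queue.poll();
    }
}

Swapping in SynchronizedSPSCRingBuffer gives the baseline; the throughput and latency figures below are the article's own numbers, not output from this sketch.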

Throughput

                    Synchronized    Wait-Free      Improvement
Throughput:         6.6M ops/sec    42.3M ops/sec   6.4×

6.4× faster. That's not a rounding error or an incremental gain from micro-tuning; it's an entirely different performance profile for the exact same logical behavior.

Latency distribution

Percentile    Synchronized    Wait-Free    Improvement
p50           120 ns          18 ns        6.7×
p90           250 ns          35 ns        7.1×
p99           500 ns          45 ns        11×
p99.9         2,000 ns        80 ns        25×

The p99.9 improvement is 25×. Those 2-microsecond outliers in the synchronized version are exactly the kind of rare-but-deadly stalls that ruin trading P&Ls or real-time experiences. They come from lock ownership transfers, occasional OS involvement, and cache line ping-pong. The wait-free version is rock solid: an extremely tight distribution with no long tail and no mysterious "ghost spikes."

CPU efficiency

Here's where the profile really shifts:

Synchronized:
├─ Lock acquisition:  40% of CPU time
├─ Lock release:      20% of CPU time
├─ Actual work:       40% of CPU time
└─ Wasted effort:     60%

Wait-Free:
├─ Memory ops:        10% of CPU time
├─ Actual work:       90% of CPU time
└─ Wasted effort:     10%

Result: 9× better CPU utilization

With synchronized, 60% of your CPU is effectively burned on bookkeeping. With wait-free SPSC, that drops to 10%, and the remaining 90% goes to actual trading logic, feature computation, or whatever real work your system is supposed to be doing. On a multi-core box, that difference directly translates into either more capacity on the same hardware or the same capacity with fewer machines.


The memory model magic

How do getAcquire and setRelease guarantee correctness without any locks or CAS loops? The answer lives in the Java Memory Model's happens-before relationship.

The happens-before relationship

When the producer writes:

buffer[5] = trade;              // Write 1
HEAD.setRelease(this, 6);       // Write 2: RELEASE barrier

The release ensures that Write 1 cannot be reordered after Write 2: every write that occurs before setRelease becomes visible to any thread that subsequently performs an acquire read of the same variable and observes the released value.

When the consumer reads:

long head = HEAD.getAcquire(this);  // Read 1: ACQUIRE barrier
trade = buffer[5];                  // Read 2

The acquire ensures that Read 2 cannot be reordered before Read 1: the consumer is guaranteed to see all writes that happened before the producer's setRelease of the same head value.

Together, these create a happens-before edge:

Producer:  buffer[5] = trade  →  HEAD.setRelease(6)
                                       ↓
Consumer:                        HEAD.getAcquire(6)  →  trade = buffer[5]

This is enough to guarantee that the consumer never reads a partially initialized element and never observes stale data, all without grabbing a lock or spinning on a CAS loop. We are using the memory model exactly as intended, not fighting it.
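
The same release/acquire pairing can be seen in isolation with a tiny message-passing example. This is a sketch for illustration, not part of the queue: the producer writes a plain payload field and then releases a flag; once the consumer acquires the flag, it is guaranteed to see the payload.

import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

public class ReleaseAcquireDemo {
    private static final VarHandle FLAG;

    static {
        try {
            FLAG = MethodHandles.lookup()
                    .findVarHandle(ReleaseAcquireDemo.class, "flag", int.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    private long payload;   // plain field, written before the release
    private int flag;       // published with setRelease, read with getAcquire

    public static void main(String[] args) throws InterruptedException {
        ReleaseAcquireDemo d = new ReleaseAcquireDemo();

        Thread producer = new Thread(() -> {
            d.payload = 42L;            // Write 1: plain write
            FLAG.setRelease(d, 1);      // Write 2: release makes Write 1 visible first
        });

        Thread consumer = new Thread(() -> {
            while ((int) FLAG.getAcquire(d) == 0) {   // Read 1: acquire
                Thread.onSpinWait();
            }
            if (d.payload != 42L) {                   // Read 2: guaranteed to see 42
                throw new AssertionError("saw the flag without the payload");
            }
        });

        producer.start();
        consumer.start();
        producer.join();
        consumer.join();
        System.out.println("consumer observed the payload after the flag");
    }
}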

The assembly code

It's instructive to peek at what this compiles to conceptually.

Synchronized version:

; Monitor enter
lock cmpxchg [monitor], rdi    ; 50 cycles
; Actual work
mov [buffer+rax*8], rcx        ; 4 cycles
; Monitor exit
lock cmpxchg [monitor], 0      ; 50 cycles
mfence                          ; 100 cycles
 
Total: ~200 cycles

Wait-free version:

; Producer
mov [buffer+rax*8], rcx        ; 4 cycles (store element)
mov [head], rax                ; 4 cycles (store index)
; Implicit store-release (~10 cycles)
 
; Consumer  
mov rax, [head]                ; 4 cycles (load index)
; Implicit load-acquire (~10 cycles)
mov rcx, [buffer+rbx*8]        ; 4 cycles (load element)
 
Total: ~40 cycles

5× fewer CPU cycles per operation. That gap only grows when you multiply it by tens or hundreds of millions of messages per second.


When to use wait-free SPSC

Wait-free SPSC queues are perfect for scenarios where you have exactly one producer thread and one consumer thread, and you need maximum performance with predictable latency. Exchange gateways feeding strategy engines represent a classic use case, where market data ingestion requires a single decoding thread producing normalized ticks that a single strategy thread consumes. Network I/O to processing threads follows the same pattern, with packet handling and protocol decoding happening on dedicated threads that feed processing pipelines. Audio and video pipelines benefit from SPSC queues between decoder and renderer stages, where each stage runs on a dedicated thread. Logging systems where application threads write to a single log writer thread, and telemetry collectors where instrumentation threads feed a single aggregator thread, both represent ideal SPSC scenarios.
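
As a concrete shape for the gateway-to-strategy case, the wiring might look like the sketch below. Everything specific here is an assumption for illustration: the MarketTick record, the message count, the thread names, and the busy-spin handling of a full or empty queue (a real system would pick its backoff strategy deliberately). It assumes the WaitFreeSPSCRingBuffer from earlier and Java 16+ for the record type.

public class GatewayToStrategyWiring {

    // Hypothetical tick type for illustration; not from the article.
    record MarketTick(long instrumentId, long priceNanos, long timestampNanos) {}

    public static void main(String[] args) throws InterruptedException {
        WaitFreeSPSCRingBuffer<MarketTick> ticks = new WaitFreeSPSCRingBuffer<>(4096);
        final long total = 1_000_000;

        // Single producer: the gateway thread is the only caller of offer().
        Thread gateway = new Thread(() -> {
            for (long seq = 0; seq < total; seq++) {
                MarketTick tick = new MarketTick(1L, 100_000L + seq, System.nanoTime());
                while (!ticks.offer(tick)) {
                    Thread.onSpinWait();   // queue full: the strategy is momentarily behind
                }
            }
        }, "exchange-gateway");

        // Single consumer: the strategy thread is the only caller of poll().
        Thread strategy = new Thread(() -> {
            long seen = 0;
            while (seen < total) {
                MarketTick tick = ticks.poll();
                if (tick == null) {
                    Thread.onSpinWait();   // queue empty: wait for the next tick
                    continue;
                }
                seen++;                    // a real strategy would update signals and place orders here
            }
        }, "trading-strategy");

        gateway.start();
        strategy.start();
        gateway.join();
        strategy.join();
    }
}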

Wait-free SPSC is not suitable when you have multiple producers, which requires the MPSC pattern instead, or multiple consumers, which requires the MPMC pattern. It cannot handle dynamic thread counts because SPSC is strict: exactly one producer and one consumer must be designated at design time. For low-throughput systems processing less than 100,000 operations per second, the synchronized approach is fine and simpler, as the performance benefits of wait-free coordination don't justify the added complexity.
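
To make the first restriction concrete, here is a deliberately broken sketch (not from the article) that points a second producer at the same SPSC queue. Both producers can read the same head, write the same slot, and publish overlapping index values, so elements are silently dropped; run it a few times and the consumed count will usually come out below the two million offered. The class and counter names are illustrative.

import java.util.concurrent.atomic.AtomicInteger;

public class TwoProducerMisuseDemo {
    public static void main(String[] args) throws InterruptedException {
        WaitFreeSPSCRingBuffer<Integer> queue = new WaitFreeSPSCRingBuffer<>(1024);
        final int perProducer = 1_000_000;
        AtomicInteger doneProducers = new AtomicInteger();

        // Each producer offers perProducer elements; the second one breaks the contract.
        Runnable producer = () -> {
            for (int i = 0; i < perProducer; i++) {
                while (!queue.offer(i)) Thread.onSpinWait();
            }
            doneProducers.incrementAndGet();
        };

        long[] consumed = new long[1];
        Thread consumer = new Thread(() -> {
            while (doneProducers.get() < 2) {
                if (queue.poll() != null) consumed[0]++;
            }
            while (queue.poll() != null) consumed[0]++;   // drain whatever is still visible
        });

        Thread p1 = new Thread(producer);
        Thread p2 = new Thread(producer);   // violates the single-producer contract

        consumer.start();
        p1.start();
        p2.start();
        p1.join();
        p2.join();
        consumer.join();

        System.out.printf("offered %,d, consumed %,d%n", 2L * perProducer, consumed[0]);
    }
}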

The rule of thumb: If you have exactly one producer and one consumer, and you need maximum performance with predictable latency, use wait-free SPSC. It's the fastest queue you can reasonably build on the JVM without dropping into native code.


Real-world impact: trading strategy latency

Let's get concrete. In a real HFT system, every microsecond of latency costs real money, either through missed fills or worse execution prices. Here's what happened when we switched from a synchronized SPSC queue to a wait-free SPSC ring buffer in a strategy path:

Before (Synchronized):

Average latency:       150ns per message
P99 latency:           500ns
At 10M messages/sec:   1.5 seconds of cumulative queue latency per second
Profitable trades:     47 per day

After (Wait-Free):

Average latency:       24ns per message
P99 latency:           45ns
At 10M messages/sec:   0.24 seconds of cumulative queue latency per second
Profitable trades:     310 per day

310 vs 47 trades per day. That's 6.6× more opportunities purely from reducing queue latency. The strategy code didn't change. The market didn't change. The only change was replacing lock-based coordination with wait-free memory ordering in a single queue between the gateway and the strategy.


Closing thoughts

SPSC is a special case, one producer and one consumer, but it's more common than it looks on architecture diagrams. Any time you have a pipeline with dedicated threads, you likely have SPSC relationships hiding under generic "queue" boxes.

The synchronized approach is simple and correct, but it wastes the majority of your CPU on coordination overhead and introduces dangerous latency outliers. The wait-free approach requires you to understand memory ordering, VarHandles, and cache behavior, but it delivers 6.4× better throughput, 11× better p99 latency, and 9× better CPU efficiency in the exact same scenario.

In our series so far, we've covered off-heap ring buffers that eliminated GC pauses and delivered 772× better p99.9 latency, and wait-free SPSC queues that eliminated lock overhead for one-to-one coordination and delivered 6.4× throughput improvements.

Next up, we'll tackle the harder problem: multiple producers, one consumer (MPSC). When you can't avoid contention on the write side, how do you coordinate without falling back to global locks?

Until then, may your queues be fast and your latencies predictable.


Further Reading

Repository: techishthoughts-org/off_heap_algorithms

Off-Heap Algorithms in Java · Part 2 of 2

Arthur Costa

Senior Full-Stack Engineer & Tech Lead
