Lock-Free in Java: Scenario 09 - K-FIFO Queues and the Art of Relaxed Ordering
Part 1: The Midnight Revelation
Midnight, Tuesday. The system was running smoothly, processing a torrent of real-time event data. Then, out of nowhere, the dreaded 'Out of Memory' error appeared. And then silence. The system went down.
I've been in this industry long enough to know that feeling - that sinking sensation when you realize something fundamental is broken. We'd spent countless hours fine-tuning our Java code, optimizing every possible line. Our load balancer was distributing work across 16 worker threads, each pulling tasks from a central queue. The architecture was textbook-correct. Yet here we were, watching the system crumble under what should have been manageable load.
The post-mortem revealed a familiar villain: Garbage Collection. But not in the way I expected. Our GC logs showed frequent stop-the-world pauses, each one lasting 40-80 milliseconds. In a system designed to process 100,000 events per second while serving 500 queries per second, these pauses were catastrophic. Each pause meant thousands of events backing up, queues filling, memory pressure increasing, and eventually - the dreaded OOM.
But here's what made this problem interesting: our system didn't actually need strict ordering. We were running a load balancer. Task scheduling. Event aggregation. In all these cases, we only needed "approximately correct" ordering - not perfect FIFO semantics. An event processed slightly out of order wasn't a bug; it was perfectly acceptable.
That realization was the breakthrough. We'd been paying a massive performance tax for ordering guarantees we didn't need. Enter the K-FIFO Queue.
Part 2: Understanding the FIFO Tax
Before diving into the solution, let's understand why strict FIFO ordering is so expensive. This isn't just academic theory - it directly explains why our system was failing.
The Serialization Problem
A strict FIFO queue makes a fundamental promise: elements will be dequeued in exactly the order they were enqueued. This sounds simple, but think about what it implies for concurrent access.
If Producer A enqueues element 1, and Producer B enqueues element 2, and element 1 was enqueued "first," then element 1 must be dequeued before element 2. But what does "first" mean in a concurrent system? To establish this ordering, we need some form of serialization - a single point where the order is established.
This serialization is the bottleneck. Every producer must acquire the same lock, establishing a total order. Under contention, this creates lock convoys, context switches, and - crucially for our story - allocation pressure from the synchronization infrastructure.
The True Cost of Perfect Ordering
Let me quantify what we were observing. Our ArrayBlockingQueue-based implementation showed:
Performance Characteristics:
- Single lock protecting all operations
- Average operation latency: 300-500 nanoseconds
- Throughput ceiling: approximately 2 million operations per second
- p99 latency: 15 milliseconds (when GC kicked in)
But the hidden cost was worse. Under contention, the JVM's ReentrantLock was allocating wait queue nodes. At 50,000 operations per second with 16 producer threads, we were generating nearly 3 MB/second of synchronization-related allocations. These small objects promoted to old gen, triggering mixed GC cycles that caused the devastating pauses.
Here's what our naive implementation looked like:
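A reconstruction along these lines (class and method names are illustrative, not the original production code):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// The strict-FIFO dispatcher described above: one bounded queue, one
// internal ReentrantLock that every producer and the consumer share.
final class StrictFifoDispatcher {
    private final BlockingQueue<Runnable> tasks = new ArrayBlockingQueue<>(65_536);

    boolean submit(Runnable task) {
        return tasks.offer(task);   // contends with all other producers
    }

    Runnable poll() {
        return tasks.poll();        // contends with all producers too
    }
}
```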
The code looks clean. It's correct. It's textbook. And it was killing our system.
The Insight: Approximate Ordering Is Often Enough
The breakthrough came when I asked a simple question: What would actually break if we relaxed the ordering guarantee?
In our load balancer:
- Tasks would still get processed
- Every task would eventually complete
- The only difference: task A might complete before task B even if B was submitted first
For our use case, this was completely acceptable. In fact, it's what most load balancers do anyway - they route to the first available worker, not the worker that will process in submission order.
The same logic applies broadly. In task scheduling, workers grab available work and the scheduler doesn't guarantee execution order. Event aggregation computes statistics over streams where order within short windows doesn't matter. Logging pipelines might deliver entries slightly out of order, but timestamps provide the true ordering. And metrics collection involves sampling and aggregating where exact ordering is irrelevant.
This insight led me to the K-FIFO queue.
Part 3: K-FIFO Queues - The Theory
A K-FIFO queue provides a weaker but still useful ordering guarantee: when you dequeue an element, it will be among the K oldest elements in the queue. Not necessarily THE oldest, but within the oldest K.
Why K-FIFO Enables Parallelism
The magic of K-FIFO is that it allows us to break the serialization bottleneck. Instead of one queue, we maintain K independent segments. Each segment is essentially a single-producer, single-consumer (SPSC) queue - the simplest and fastest concurrent data structure possible.
Producers are assigned to segments (via thread ID hashing or round-robin). Each producer only writes to its assigned segment - no contention with other producers. The consumer round-robins through segments, pulling from whichever has data.
The K Guarantee
Why is this still "approximately FIFO"? Consider the worst case: an element E is enqueued to segment 0, and then K-1 elements are enqueued to segments 1 through K-1. The consumer checks segment 1 first, finds an element, and returns it before E.
But the consumer will check segment 0 within the next K dequeue operations (since there are only K segments). So E will be returned within K operations of when it "should" have been returned. Hence the K-FIFO guarantee.
Theoretical Bounds
The K-FIFO relaxation provides remarkable performance improvements:
| Metric | Strict FIFO | K-FIFO (K=segments) |
|---|---|---|
| Producer contention | All producers compete | Zero (per-segment ownership) |
| Lock overhead | One lock per operation | Zero locks (lock-free possible) |
| Memory barriers | Full fence per operation | Minimal (release/acquire pairs) |
| Cache behavior | Severe bouncing | Per-segment locality |
The theoretical throughput scales linearly with K, up to the point where other bottlenecks (memory bandwidth, consumer processing) become limiting.
Part 4: Designing the K-FIFO Queue
Let's design our K-FIFO queue from first principles. The key insight is that we're trading ordering for parallelism - and we need to make that trade explicit in our design.
Architecture Overview
Segment Design: The Building Block
Each segment is an SPSC queue. This is critical - SPSC queues are the simplest lock-free structure because:
- Only one thread writes (producer) - no write contention
- Only one thread reads (consumer) - no read contention
- Only need memory ordering between the one writer and one reader
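A minimal SPSC segment along these lines (names are illustrative; cache-line padding is omitted here for brevity and discussed below):

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

// One SPSC segment: a fixed-size ring buffer owned by exactly one
// producer thread and one consumer thread.
final class SpscSegment<E> {
    private final Object[] buffer;
    private final int mask;       // capacity is a power of two
    private long head;            // next slot to read  (consumer-owned)
    private long tail;            // next slot to write (producer-owned)

    private static final VarHandle HEAD, TAIL;
    static {
        try {
            MethodHandles.Lookup l = MethodHandles.lookup();
            HEAD = l.findVarHandle(SpscSegment.class, "head", long.class);
            TAIL = l.findVarHandle(SpscSegment.class, "tail", long.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    SpscSegment(int capacity) {
        if (Integer.bitCount(capacity) != 1)
            throw new IllegalArgumentException("capacity must be a power of 2");
        buffer = new Object[capacity];
        mask = capacity - 1;
    }

    boolean offer(E e) {                              // producer thread only
        long t = tail;                                // plain read: we own tail
        if (t - (long) HEAD.getAcquire(this) == buffer.length)
            return false;                             // segment full
        buffer[(int) (t & mask)] = e;                 // write the element first
        TAIL.setRelease(this, t + 1);                 // then publish it
        return true;
    }

    @SuppressWarnings("unchecked")
    E poll() {                                        // consumer thread only
        long h = head;                                // plain read: we own head
        if (h == (long) TAIL.getAcquire(this))
            return null;                              // segment empty
        int i = (int) (h & mask);
        E e = (E) buffer[i];
        buffer[i] = null;                             // help the GC
        HEAD.setRelease(this, h + 1);
        return e;
    }
}
```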
The K-FIFO Queue: Orchestrating Segments
Now we build the K-FIFO queue on top of our segments:
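A self-contained sketch of the orchestrator (illustrative names; the compact inner segment uses AtomicLong/lazySet for brevity where a production version would use padded fields and VarHandles). Note the SPSC assumption: with more producer threads than K, two threads can hash to the same segment, and the tail update would then need a CAS.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReferenceArray;

// K independent SPSC segments; producers hash to a segment by thread
// id, the single consumer round-robins across them.
final class KFifoQueue<E> {
    private static final class Segment<E> {
        final AtomicReferenceArray<E> buf;
        final int mask;
        final AtomicLong head = new AtomicLong();   // consumer-owned
        final AtomicLong tail = new AtomicLong();   // producer-owned
        Segment(int cap) { buf = new AtomicReferenceArray<>(cap); mask = cap - 1; }

        boolean offer(E e) {
            long t = tail.get();
            if (t - head.get() == mask + 1) return false;   // full
            buf.lazySet((int) (t & mask), e);               // write element
            tail.lazySet(t + 1);                            // release: publish
            return true;
        }
        E poll() {
            long h = head.get();
            if (h == tail.get()) return null;               // empty
            int i = (int) (h & mask);
            E e = buf.get(i);
            buf.lazySet(i, null);                           // help the GC
            head.lazySet(h + 1);
            return e;
        }
    }

    private final Segment<E>[] segments;
    private int scan;   // consumer-only round-robin cursor

    @SuppressWarnings("unchecked")
    KFifoQueue(int k, int segmentCapacity) {
        segments = new Segment[k];
        for (int i = 0; i < k; i++) segments[i] = new Segment<>(segmentCapacity);
    }

    public boolean offer(E e) {   // each thread keeps hitting "its" segment
        int idx = (int) (Thread.currentThread().getId() % segments.length);
        return segments[idx].offer(e);
    }

    public E poll() {             // single consumer: round-robin over K segments
        for (int i = 0; i < segments.length; i++) {
            E e = segments[scan = (scan + 1) % segments.length].poll();
            if (e != null) return e;
        }
        return null;              // all segments empty
    }
}
```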
Memory Layout and False Sharing Prevention
Notice the cache line padding in the Segment class. This is crucial for performance. Let me explain why.
Modern CPUs transfer data in cache lines, typically 64 bytes on x86-64. When one core modifies a byte in a cache line, the entire line is invalidated in all other cores' caches. This is called "false sharing" when unrelated data happens to share a cache line.
In our Segment class:
- tail is written by the producer, read by the consumer
- head is written by the consumer, read by the producer
If these fields were on the same cache line, every write would invalidate the other thread's cache, causing expensive memory traffic. The padding ensures each field gets its own cache line:
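A sketch of the padded layout (field names hypothetical). Real implementations such as JCTools use superclass-based padding or @Contended, because the JVM is free to reorder fields within a single class:

```java
// Seven longs (56 bytes) of padding around each hot field push head
// and tail onto separate cache lines on a 64-byte-line machine.
class PaddedCursors {
    long p01, p02, p03, p04, p05, p06, p07;   // padding
    volatile long head;   // consumer writes, producer reads
    long p11, p12, p13, p14, p15, p16, p17;   // padding
    volatile long tail;   // producer writes, consumer reads
    long p21, p22, p23, p24, p25, p26, p27;   // padding
}
```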
This simple optimization can improve throughput by 2-3x under contention.
Part 5: Deep Dive - Memory Ordering and Correctness
Lock-free programming is notoriously tricky. Let's prove our implementation is correct by analyzing the memory ordering.
The Producer's View
When a producer calls offer():
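An annotated reconstruction of the producer path; the bracketed numbers are the steps the ordering discussion refers to (class name and layout are illustrative):

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

final class ProducerSide<E> {
    final Object[] buffer;
    final int mask;
    long head;   // advanced by the consumer via a release-write
    long tail;   // producer-owned

    static final VarHandle HEAD, TAIL;
    static { try {
        MethodHandles.Lookup l = MethodHandles.lookup();
        HEAD = l.findVarHandle(ProducerSide.class, "head", long.class);
        TAIL = l.findVarHandle(ProducerSide.class, "tail", long.class);
    } catch (ReflectiveOperationException e) { throw new ExceptionInInitializerError(e); } }

    ProducerSide(int capacity) { buffer = new Object[capacity]; mask = capacity - 1; }

    boolean offer(E e) {
        long t = tail;                               // [1] plain read: we own tail
        long h = (long) HEAD.getAcquire(this);       // [2] acquire-read the consumer's head
        if (t - h == buffer.length) return false;    // [3] segment is full
        buffer[(int) (t & mask)] = e;                // [4] write the element into the slot
        TAIL.setRelease(this, t + 1);                // [5] release-write: publish the element
        return true;
    }
}
```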
The key orderings enforce correctness. [4] before [5]: the release-write at [5] ensures the buffer write at [4] is visible to any thread that observes the new tail value — this is the "publication," because we don't advance tail until the element is written. [2] reads head with acquire semantics to ensure we see the consumer's latest head update and any buffer nullification it performed.
The Consumer's View
When the consumer calls poll():
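The matching annotated consumer path (illustrative reconstruction; an unannotated offer() is included so the class stands alone):

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

final class ConsumerSide<E> {
    final Object[] buffer;
    final int mask;
    long head;   // consumer-owned
    long tail;   // advanced by the producer via a release-write

    static final VarHandle HEAD, TAIL;
    static { try {
        MethodHandles.Lookup l = MethodHandles.lookup();
        HEAD = l.findVarHandle(ConsumerSide.class, "head", long.class);
        TAIL = l.findVarHandle(ConsumerSide.class, "tail", long.class);
    } catch (ReflectiveOperationException e) { throw new ExceptionInInitializerError(e); } }

    ConsumerSide(int capacity) { buffer = new Object[capacity]; mask = capacity - 1; }

    boolean offer(E e) {                             // producer path, as before
        long t = tail;
        if (t - (long) HEAD.getAcquire(this) == buffer.length) return false;
        buffer[(int) (t & mask)] = e;
        TAIL.setRelease(this, t + 1);
        return true;
    }

    @SuppressWarnings("unchecked")
    E poll() {
        long h = head;                               // [1] plain read: we own head
        long t = (long) TAIL.getAcquire(this);       // [2] acquire-read the producer's tail
        if (h == t) return null;                     // [3] segment is empty
        E e = (E) buffer[(int) (h & mask)];          // [4] read the element
        buffer[(int) (h & mask)] = null;             // [5] clear the slot (help the GC)
        HEAD.setRelease(this, h + 1);                // [6] release-write: advance head
        return e;
    }
}
```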
The key orderings mirror the producer's guarantees. [2] acquires tail, pairing with the producer's release-write — if we see tail = N, we're guaranteed to see all buffer writes for positions < N. [4] after [2]: the acquire semantics establish a happens-before relationship, ensuring we see the element the producer wrote. Finally, [5] and [6] after [4] ensures we read the element before clearing the slot and advancing head, preventing use-after-free scenarios.
Proving the K-FIFO Property
Let's prove the queue maintains the K-FIFO property. Consider any element E added to segment S at time T1.
Claim: E will be dequeued within K dequeue operations after it becomes the oldest element in the queue.
Proof:
- At some time T2 >= T1, E becomes the oldest element (all elements added before T1 have been dequeued).
- The consumer checks segments in round-robin order.
- Within K consecutive poll() calls, the consumer will check segment S.
- When the consumer checks segment S, E is at the head (it's the oldest in that segment, and all older elements globally have been dequeued).
- Therefore, E will be dequeued within K operations.
This proof assumes no new elements are added between T1 and when E is dequeued. In practice, new elements may arrive, but the K-FIFO property holds relative to the queue state at any given moment.
The ABA Problem: Why We Don't Have It
Classic lock-free algorithms often struggle with the ABA problem: a value changes from A to B and back to A, making CAS think nothing changed when in fact the state has evolved.
Our design sidesteps this elegantly:
- Position-based indexing: We use monotonically increasing positions (head and tail), not pointers.
- No CAS on critical path: SPSC queues don't need CAS - only one thread writes each variable.
- Long positions: 64-bit positions won't wrap around in realistic scenarios.
This is a significant advantage over more complex MPMC designs.
Part 5B: Understanding VarHandle and Memory Access Modes
Before moving to production concerns, let's take a deeper look at how VarHandle works. Understanding these primitives is essential for writing correct lock-free code.
The VarHandle API
Java 9 introduced VarHandle as a replacement for the older sun.misc.Unsafe operations. VarHandle provides type-safe, well-defined atomic operations with explicit memory ordering semantics.
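Obtaining a VarHandle for an instance field looks like this (field and class names are illustrative):

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

final class Cursor {
    long position;

    // One VarHandle per field, created once in a static initializer.
    static final VarHandle POSITION;
    static {
        try {
            POSITION = MethodHandles.lookup()
                    .findVarHandle(Cursor.class, "position", long.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }
}
```

Callers then pick an access mode per call site, e.g. `POSITION.setRelease(cursor, 1L)` or `(long) POSITION.getAcquire(cursor)`.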
Memory Access Modes Explained
VarHandle provides five access modes, each with different ordering guarantees:
1. Plain Access
No ordering guarantees. The compiler and CPU can reorder freely. Only use for data that doesn't need synchronization.
2. Opaque Access
Guarantees atomic access and no reordering with other opaque operations on the same variable. But no ordering with operations on other variables.
3. Acquire/Release
This is what we use in K-FIFO. Release-write ensures all prior writes are visible. Acquire-read ensures all subsequent reads see those writes. This creates a "synchronizes-with" relationship.
4. Volatile Access
Full sequential consistency. All threads see a total order of all volatile operations. Most expensive, but sometimes necessary.
5. Compare-and-Set
Atomic conditional updates. Essential for lock-free algorithms when multiple threads compete to update the same variable.
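The five mode families side by side on one field (a demonstration sketch, not production code):

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

final class Modes {
    int value;

    static final VarHandle V;
    static { try {
        V = MethodHandles.lookup().findVarHandle(Modes.class, "value", int.class);
    } catch (ReflectiveOperationException e) { throw new ExceptionInInitializerError(e); } }

    void demo() {
        V.set(this, 1);                           // 1. plain: no ordering guarantees
        int a = (int) V.getOpaque(this);          // 2. opaque: atomic, per-variable coherent
        V.setRelease(this, a + 1);                // 3. release-write (pairs with getAcquire)
        int b = (int) V.getAcquire(this);         // 3. acquire-read
        V.setVolatile(this, b + 1);               // 4. volatile: sequentially consistent
        boolean ok = V.compareAndSet(this, 3, 4); // 5. CAS: atomic conditional update
    }
}
```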
Why Acquire/Release Is Enough for K-FIFO
In our K-FIFO queue, we don't need full volatile semantics. Here's why:
The release-write at the producer ensures all prior buffer writes complete before the tail update is visible. The acquire-read at the consumer ensures it sees those buffer writes after observing the tail update. This is precisely the synchronization we need - no more, no less.
Performance Impact of Memory Ordering
Different access modes have different costs on different architectures:
| Architecture | Plain | Acquire/Release | Volatile |
|---|---|---|---|
| x86-64 | 0 cycles | 0-1 cycles | 20-100 cycles |
| ARM | 0 cycles | 5-15 cycles | 30-150 cycles |
| POWER | 0 cycles | 10-20 cycles | 40-200 cycles |
On x86-64, acquire/release are nearly free because the architecture has strong memory ordering. On ARM and POWER with weaker memory models, they require explicit fence instructions. Volatile is expensive everywhere because it requires a full memory barrier.
This is why we carefully choose acquire/release over volatile - the performance difference is significant on some architectures.
Common VarHandle Patterns in Lock-Free Code
Pattern: Publication via Release
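A minimal sketch of the pattern (names illustrative): write the payload with plain stores, then release-write a flag; a reader that acquire-reads the flag and sees it set is guaranteed to see the payload.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

final class Publisher {
    int payload;
    boolean ready;

    static final VarHandle READY;
    static { try {
        READY = MethodHandles.lookup().findVarHandle(Publisher.class, "ready", boolean.class);
    } catch (ReflectiveOperationException e) { throw new ExceptionInInitializerError(e); } }

    void publish(int v) {
        payload = v;                    // plain write
        READY.setRelease(this, true);   // release: payload now visible to acquirers
    }

    Integer tryRead() {
        if (!(boolean) READY.getAcquire(this)) return null;
        return payload;                 // safe: happens-after publish()
    }
}
```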
Pattern: Double-Checked Initialization
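A sketch of double-checked lazy initialization using acquire/release instead of a volatile field (an AtomicReference would work equally well):

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.util.function.Supplier;

final class Lazy<T> {
    private T instance;
    private final Supplier<T> factory;

    private static final VarHandle INSTANCE;
    static { try {
        INSTANCE = MethodHandles.lookup().findVarHandle(Lazy.class, "instance", Object.class);
    } catch (ReflectiveOperationException e) { throw new ExceptionInInitializerError(e); } }

    Lazy(Supplier<T> factory) { this.factory = factory; }

    @SuppressWarnings("unchecked")
    T get() {
        T t = (T) INSTANCE.getAcquire(this);       // fast path: no lock
        if (t != null) return t;
        synchronized (this) {                      // slow path: serialize creation
            t = (T) INSTANCE.getAcquire(this);     // re-check under the lock
            if (t == null) {
                t = factory.get();
                INSTANCE.setRelease(this, t);      // publish the fully-constructed object
            }
            return t;
        }
    }
}
```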
Pattern: Spin-Wait with Acquire
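A sketch of the spin-wait pattern (names illustrative): acquire-read the flag in a loop, hinting the CPU with Thread.onSpinWait() between polls.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

final class SpinGate {
    private boolean open;

    private static final VarHandle OPEN;
    static { try {
        OPEN = MethodHandles.lookup().findVarHandle(SpinGate.class, "open", boolean.class);
    } catch (ReflectiveOperationException e) { throw new ExceptionInInitializerError(e); } }

    void open() { OPEN.setRelease(this, true); }

    void await() {
        while (!(boolean) OPEN.getAcquire(this)) {
            Thread.onSpinWait();   // e.g. x86 PAUSE: cheaper spinning
        }
    }
}
```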
Part 6: Handling Edge Cases and Production Concerns
A correct algorithm isn't enough for production. Let's address real-world concerns.
Segment Overflow: What Happens When a Segment Is Full?
In our basic implementation, offer() returns false when the assigned segment is full. But this might not be acceptable in all scenarios. Here are strategies:
Strategy 1: Overflow to Other Segments
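A routing-only sketch (segments are stand-in bounded queues so the overflow logic is the focus; in the real structure a fallback write breaks the one-producer-per-segment rule, so the target segment's tail would need a CAS):

```java
import java.util.concurrent.ArrayBlockingQueue;

final class OverflowRouter<E> {
    private final ArrayBlockingQueue<E>[] segments;

    @SuppressWarnings("unchecked")
    OverflowRouter(int k, int capacity) {
        segments = new ArrayBlockingQueue[k];
        for (int i = 0; i < k; i++) segments[i] = new ArrayBlockingQueue<>(capacity);
    }

    boolean offer(E e) {
        int home = (int) (Thread.currentThread().getId() % segments.length);
        if (segments[home].offer(e)) return true;            // fast path: own segment
        for (int i = 1; i < segments.length; i++) {          // overflow: scan the rest
            if (segments[(home + i) % segments.length].offer(e)) return true;
        }
        return false;                                        // all K segments full
    }
}
```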
This weakens the per-segment ownership but maintains the K-FIFO property.
Strategy 2: Dynamic Segment Resizing
This is more complex but prevents data loss.
Strategy 3: Blocking with Backpressure
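A sketch of the blocking variant (the segment here is a bounded-queue stand-in with a non-blocking offer):

```java
import java.util.concurrent.ArrayBlockingQueue;

final class BlockingProducer<E> {
    private final ArrayBlockingQueue<E> segment;

    BlockingProducer(int capacity) { segment = new ArrayBlockingQueue<>(capacity); }

    void offerBlocking(E e) {
        while (!segment.offer(e)) {
            Thread.onSpinWait();   // burns CPU until the consumer frees a slot
        }
    }

    E poll() { return segment.poll(); }
}
```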
Simple, but can lead to unbounded spinning. Consider adding exponential backoff.
Consumer Starvation: Ensuring Fairness Across Segments
Our round-robin consumer prevents starvation, but an unbalanced producer distribution could cause issues. Consider:
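One possible tweak (hypothetical, not from the article's code): instead of strict round-robin, poll the currently fullest segment so a hot segment can't back up while idle segments are scanned.

```java
import java.util.concurrent.ArrayBlockingQueue;

final class FairConsumer<E> {
    final ArrayBlockingQueue<E>[] segments;   // stand-ins for SPSC segments

    @SuppressWarnings("unchecked")
    FairConsumer(int k, int cap) {
        segments = new ArrayBlockingQueue[k];
        for (int i = 0; i < k; i++) segments[i] = new ArrayBlockingQueue<>(cap);
    }

    E pollFullestFirst() {
        int best = -1, bestSize = 0;
        for (int i = 0; i < segments.length; i++) {
            int s = segments[i].size();       // occupancy snapshot
            if (s > bestSize) { best = i; bestSize = s; }
        }
        return best < 0 ? null : segments[best].poll();
    }
}
```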
Graceful Shutdown
Concurrent data structures need careful shutdown handling:
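A shutdown sketch (illustrative; the segment is a bounded-queue stand-in): reject new offers first, then let the consumer drain what remains so nothing is lost on the way down.

```java
import java.util.Collection;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.atomic.AtomicBoolean;

final class ShutdownAwareQueue<E> {
    private final ArrayBlockingQueue<E> segment = new ArrayBlockingQueue<>(1024);
    private final AtomicBoolean shutdown = new AtomicBoolean();

    boolean offer(E e) {
        if (shutdown.get()) return false;   // refuse new work after shutdown
        return segment.offer(e);
    }

    void shutdown() { shutdown.set(true); }

    int drainTo(Collection<? super E> sink) {   // consumer drains the remainder
        return segment.drainTo(sink);
    }
}
```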
Memory Leaks: Helping the GC
Notice how we set buffer[index] = null after reading an element. This is critical. Without it, the buffer holds references to consumed elements until those slots are reused, potentially causing memory leaks in long-running systems.
Part 7: Benchmarks and Real-World Results
Theory is nice, but does it actually perform? Let's measure.
Benchmark Setup
All benchmarks ran on an Intel Xeon E5-2680 v4 (14 cores, 28 threads) with 128GB RAM, using OpenJDK 17.0.2 with G1GC and an 8GB heap, measured with JMH 1.36.
Throughput Results
| Configuration | Strict FIFO | K-FIFO (K=4) | K-FIFO (K=8) | K-FIFO (K=16) |
|---|---|---|---|---|
| 4 producers | 2.1M ops/s | 6.2M ops/s | 7.1M ops/s | 7.4M ops/s |
| 8 producers | 1.8M ops/s | 7.8M ops/s | 9.2M ops/s | 10.1M ops/s |
| 16 producers | 1.4M ops/s | 6.1M ops/s | 8.9M ops/s | 11.3M ops/s |
The results are striking. K-FIFO achieves 3-8x higher throughput depending on configuration. While strict FIFO throughput decreases with more producers due to increasing contention, K-FIFO throughput increases with more producers up to K, then saturates — the contention that cripples strict FIFO becomes parallelism that feeds K-FIFO.
Latency Distribution
Strict FIFO vs. K-FIFO (4 producers, K=4):

| Percentile | Strict FIFO | K-FIFO (K=4) |
|---|---|---|
| p50 | 187ns | 52ns |
| p99.9 | 15,234ns | 412ns |
The tail latencies tell the real story. At p99.9, K-FIFO is 37x better than strict FIFO. Those rare high-latency events in the strict queue? They're the lock convoys and GC pauses we were fighting.
GC Behavior Comparison
We ran a 5-minute sustained load test with 8 producers:
Strict FIFO:
- Young GC events: 234
- Total GC pause time: 4,120ms
- Max pause: 127ms
- Allocation rate: 4.2 MB/sec
K-FIFO (K=8):
- Young GC events: 18
- Total GC pause time: 270ms
- Max pause: 21ms
- Allocation rate: 0.4 MB/sec
92% reduction in GC pause time. This was the root cause of our midnight production incident, and K-FIFO eliminated it.
Scaling Analysis
Let's examine how performance scales with different configurations:
The throughput increases rapidly as K approaches the number of producers, then plateaus. Beyond K = producer count, adding more segments provides diminishing returns - the consumer becomes the bottleneck, not producer contention.
Cache Analysis with perf
Using Linux perf to analyze cache behavior reveals why K-FIFO performs better:
Strict FIFO results (4 producers):
K-FIFO results (4 producers, K=4):
The dramatic reduction in cache misses explains the latency improvement. When producers don't contend for the same cache lines, they can operate independently at CPU cache speed rather than waiting for cache coherency traffic.
Memory Bandwidth Analysis
We also measured memory bandwidth utilization:
Strict FIFO:
- Memory reads: 4.2 GB/s
- Memory writes: 2.1 GB/s
- Cache invalidation traffic: High
K-FIFO:
- Memory reads: 1.8 GB/s
- Memory writes: 0.9 GB/s
- Cache invalidation traffic: Minimal
The reduced memory bandwidth leaves more headroom for other system operations and improves overall system stability under load.
Visualization: Latency Over Time
Part 8: When K-FIFO Is the Right Choice
K-FIFO queues aren't universally better than strict FIFO. Let's clearly delineate when to use each.
Use K-FIFO When:
1. Ordering Is Approximate By Nature
Many systems have inherent ordering ambiguity. Load balancers distribute work to available workers without guaranteeing execution order. Event aggregators compute statistics over streams where micro-ordering doesn't affect results. Logging pipelines rely on timestamps for true ordering, making queue ordering supplementary. Task schedulers implement fair scheduling that doesn't require strict FIFO.
2. High Throughput Trumps Ordering
When you need millions of operations per second and can tolerate approximate ordering:
- Real-time data processing
- High-frequency trading order routing
- Telemetry collection
- Stream processing pipelines
3. Multiple Independent Producers
K-FIFO excels when producers are naturally independent:
- Per-thread event collectors
- Per-connection request handlers
- Sharded data processors
4. GC Pressure Is Problematic
If you're seeing GC-related latency spikes:
- Low-latency systems
- Real-time applications
- Systems with tight p99 SLAs
Stick With Strict FIFO When:
1. Ordering Is Semantically Required
Some use cases demand exact ordering. Message queues with ordering guarantees like Kafka partitions and JMS ordered delivery depend on strict sequencing. Event sourcing requires events to replay in exact order, and transactional systems need operations to serialize.
2. Consumers Depend on Order
When downstream processing assumes ordering:
- Sequence number validation
- Incremental aggregation with ordering assumptions
- Replay and recovery scenarios
3. Simplicity Wins
When the performance gain isn't worth the complexity:
- Low-throughput systems (< 10K ops/sec)
- Systems where correctness trumps performance
- Teams without lock-free expertise
4. Single Producer
With only one producer, strict FIFO has no contention overhead. The K-FIFO's complexity isn't justified.
Decision Matrix
Real-World Use Case: High-Frequency Trading Order Router
Let me share a detailed example from our production system. Our order router receives orders from multiple trading desks (8 producer threads) and routes them to exchanges based on best execution algorithms.
Original Architecture (Strict FIFO):
This worked fine at 10,000 orders/day. When we scaled to 100,000 orders/day with market volatility spikes, we hit problems:
- Order submission latency spiked to 50ms during high volume
- GC pauses caused missed execution windows
- p99 latency made our SLA unachievable
Refactored Architecture (K-FIFO):
Results:
- Order submission latency: 50ms -> 200ns (a ~250,000x improvement)
- GC pauses: eliminated
- p99 latency: 45ms -> 800ns
- Throughput: 10x higher peak capacity
The key insight: order execution doesn't require strict FIFO. Markets are already non-deterministic - an order submitted 1 microsecond earlier doesn't guarantee earlier execution. What matters is low latency and high throughput, both of which K-FIFO delivers.
Real-World Use Case: Log Aggregation Pipeline
Another production use case: centralizing logs from 200+ microservices.
Requirements:
- Handle 500,000 log events/second at peak
- Never drop logs in normal operation
- Never block application threads
- Minimal memory overhead
Architecture:
Results:
- Zero application thread blocking
- 99.99% log delivery rate
- Consistent 50ns append latency
- 75% reduction in log pipeline memory usage
Real-World Use Case: Metrics Collection
For application performance monitoring, we collect metrics from instrumented code:
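A sketch of the recorder with object pooling (names hypothetical; a bounded queue stands in for the K-FIFO so the pooling idea is the focus): MetricEvent instances are reused, so the hot path allocates nothing once the pool is warm.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ConcurrentLinkedQueue;

final class MetricRecorder {
    static final class MetricEvent {
        String name; long value; long timestampNanos;
    }

    private final ConcurrentLinkedQueue<MetricEvent> pool = new ConcurrentLinkedQueue<>();
    private final ArrayBlockingQueue<MetricEvent> queue = new ArrayBlockingQueue<>(4096);

    public void record(String name, long value) {
        MetricEvent e = pool.poll();
        if (e == null) e = new MetricEvent();     // pool empty: allocate once
        e.name = name; e.value = value; e.timestampNanos = System.nanoTime();
        if (!queue.offer(e)) pool.offer(e);       // queue full: drop metric, recycle event
    }

    public MetricEvent take() { return queue.poll(); }

    public void recycle(MetricEvent e) { pool.offer(e); }   // consumer returns events
}
```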
The combination of K-FIFO queue with object pooling achieves sub-100ns overhead per metric, making fine-grained instrumentation practical without impacting application performance.
Part 9: Advanced Patterns and Optimizations
Let's explore some advanced techniques for getting even more performance from K-FIFO queues.
Pattern 1: Batched Operations
Instead of polling one element at a time, drain in batches:
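A batch-drain sketch (segments are bounded-queue stand-ins for the SPSC rings): pull up to `max` elements per call instead of one per poll.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.function.Consumer;

final class BatchedConsumer<E> {
    final ArrayBlockingQueue<E>[] segments;

    @SuppressWarnings("unchecked")
    BatchedConsumer(int k, int cap) {
        segments = new ArrayBlockingQueue[k];
        for (int i = 0; i < k; i++) segments[i] = new ArrayBlockingQueue<>(cap);
    }

    int drain(Consumer<E> handler, int max) {
        int drained = 0;
        for (ArrayBlockingQueue<E> seg : segments) {
            E e;
            while (drained < max && (e = seg.poll()) != null) {
                handler.accept(e);   // process in place; no per-element rescan
                drained++;
            }
        }
        return drained;
    }
}
```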
Batching amortizes the overhead of segment scanning and improves cache utilization.
Pattern 2: Weighted Round-Robin
If some segments receive more traffic, weight the consumer's attention:
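A weighted-sweep sketch (weights hypothetical): segment i may yield up to weights[i] elements per sweep, so hotter segments get more of the consumer's attention.

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;

final class WeightedConsumer<E> {
    final ArrayBlockingQueue<E>[] segments;   // stand-ins for SPSC segments
    final int[] weights;

    @SuppressWarnings("unchecked")
    WeightedConsumer(int[] weights, int cap) {
        this.weights = weights;
        segments = new ArrayBlockingQueue[weights.length];
        for (int i = 0; i < weights.length; i++) segments[i] = new ArrayBlockingQueue<>(cap);
    }

    int sweep(List<E> out) {
        int n = 0;
        for (int i = 0; i < segments.length; i++) {
            for (int w = 0; w < weights[i]; w++) {   // weight = polls per visit
                E e = segments[i].poll();
                if (e == null) break;
                out.add(e); n++;
            }
        }
        return n;
    }
}
```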
Pattern 3: Affinity-Aware Segment Assignment
Instead of arbitrary thread-to-segment mapping, use CPU affinity:
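A sketch of stable per-thread assignment via a ThreadLocal (a stand-in for true CPU pinning, which in Java needs a native library such as OpenHFT's Java-Thread-Affinity; names hypothetical):

```java
import java.util.concurrent.atomic.AtomicInteger;

final class AffinityAssigner {
    private final AtomicInteger next = new AtomicInteger();
    private final int k;
    // Each producer thread is handed a segment index once and keeps it.
    private final ThreadLocal<Integer> slot =
            ThreadLocal.withInitial(() -> next.getAndIncrement() % k);

    AffinityAssigner(int k) { this.k = k; }

    int segmentFor() { return slot.get(); }   // stable per thread
}
```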
When producer threads have CPU affinity, this keeps related data on the same cache.
Pattern 4: Hybrid Mode for Mixed Workloads
Some systems need both high throughput and occasional strict ordering:
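One hypothetical hybrid design (not from the article's code): ordered items go through a single strict lane that the consumer always drains first; everything else uses relaxed segments (bounded-queue stand-ins here).

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ConcurrentLinkedQueue;

final class HybridQueue<E> {
    private final ConcurrentLinkedQueue<E> strictLane = new ConcurrentLinkedQueue<>();
    private final ArrayBlockingQueue<E>[] segments;

    @SuppressWarnings("unchecked")
    HybridQueue(int k, int cap) {
        segments = new ArrayBlockingQueue[k];
        for (int i = 0; i < k; i++) segments[i] = new ArrayBlockingQueue<>(cap);
    }

    boolean offer(E e, boolean ordered) {
        if (ordered) return strictLane.offer(e);   // serialized lane
        int idx = (int) (Thread.currentThread().getId() % segments.length);
        return segments[idx].offer(e);
    }

    E poll() {
        E e = strictLane.poll();                   // strict lane has priority
        if (e != null) return e;
        for (ArrayBlockingQueue<E> seg : segments) {
            e = seg.poll();
            if (e != null) return e;
        }
        return null;
    }
}
```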
Pattern 5: Statistics and Monitoring
Production systems need observability:
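A monitoring sketch (names illustrative): LongAdder counters stay cheap under contention and can back throughput and backlog gauges.

```java
import java.util.concurrent.atomic.LongAdder;

final class QueueStats {
    final LongAdder offers = new LongAdder();
    final LongAdder polls = new LongAdder();
    final LongAdder offerFailures = new LongAdder();   // segment-full events

    void recordOffer(boolean ok) {
        if (ok) offers.increment(); else offerFailures.increment();
    }
    void recordPoll() { polls.increment(); }

    long depthEstimate() { return offers.sum() - polls.sum(); }   // approximate backlog
}
```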
Part 10: Production Deployment Checklist
Before deploying K-FIFO queues in production, walk through this checklist.
Configuration
- Segment count >= producer threads: Each producer should ideally have its own segment
- Segment capacity sized for burst: Plan for 2-3x expected peak burst
- Power-of-2 sizing: Both segment count and capacity should be powers of 2
- JVM tuned for low latency: Consider -XX:+UseZGC or -XX:+UseShenandoahGC
Monitoring
- Segment utilization metrics: Track how full each segment gets
- Throughput counters: Ops/sec for offer and poll
- Latency histograms: p50, p90, p99, p99.9 for both operations
- GC metrics: Pause frequency, duration, allocation rate
Testing
- Stress test with production load: Verify behavior under expected peak
- Failure injection: What happens when segments overflow?
- Long-running tests: Check for memory leaks over hours/days
- Thread variation tests: Verify behavior with different producer counts
Operational
- Graceful shutdown procedure: How do you drain the queue?
- Backpressure strategy: What happens when producers outpace consumers?
- Alerting thresholds: When should ops be notified?
- Runbook entries: Document common issues and resolutions
Part 11: Comparison with Alternatives
K-FIFO isn't the only approach to high-performance concurrent queues. Let's compare.
vs. LMAX Disruptor
The Disruptor uses a similar ring buffer approach but with different trade-offs:
| Aspect | K-FIFO | Disruptor |
|---|---|---|
| Learning curve | Moderate | Steep |
| Configuration | Simple | Complex |
| Batching | Manual | Built-in |
| Wait strategies | Basic | Multiple options |
| Dependencies | None | Library |
| Use case | General | Event processing |
Choose Disruptor when: You need sophisticated event processing with multiple consumers, batching, and complex wait strategies.
Choose K-FIFO when: You need a simpler drop-in replacement for standard queues with better performance.
vs. JCTools Queues
JCTools provides battle-tested lock-free queues:
| Aspect | K-FIFO | JCTools MPSC |
|---|---|---|
| Ordering | K-FIFO | Strict FIFO |
| Producer contention | None | Low (CAS-based) |
| Maturity | Custom | Production-proven |
| Flexibility | High | Fixed |
Choose JCTools when: You need strict FIFO with proven production reliability.
Choose K-FIFO when: You can trade ordering for even higher throughput.
vs. LinkedTransferQueue
Java's built-in option for high-performance queuing:
| Aspect | K-FIFO | LinkedTransferQueue |
|---|---|---|
| Bounded | Yes | No |
| Memory | Fixed | Growing |
| GC pressure | Minimal | Moderate |
| Throughput | Very high | High |
Choose LinkedTransferQueue when: You need unbounded queues with good general-purpose performance.
Choose K-FIFO when: Bounded size, minimal GC, and maximum throughput are priorities.
Part 12: Conclusion and Lessons Learned
That midnight incident taught me something I won't forget: the data structures we choose have profound implications beyond correctness. Our perfectly correct ArrayBlockingQueue was killing our system. The ordering guarantee we thought we needed was actually a performance anchor dragging us down.
The K-FIFO queue isn't magic. It's a trade-off made explicit. By relaxing the ordering guarantee from "exactly FIFO" to "among the K oldest," we gained a 3-5x throughput improvement (from 2M to 6-10M ops/sec), 37x better tail latency (p99.9 dropped from 15,234ns to 412ns), and a 92% reduction in GC pause time (with allocations falling from 4.2 MB/sec to 0.4 MB/sec).
But more than the numbers, I learned a deeper lesson about systems design: question your assumptions. We assumed strict ordering. We assumed the standard library was optimal. We assumed the problem was elsewhere. Each assumption was wrong.
The real skill in performance engineering isn't knowing exotic algorithms. It's asking the right questions:
- What guarantees do we actually need?
- What are we paying for guarantees we don't use?
- Where is the real bottleneck?
The K-FIFO queue answered our immediate problem. But the methodology - measure, understand, question, optimize - applies everywhere.
So the next time you reach for a BlockingQueue, pause for a moment. Ask yourself: Do I really need strict FIFO? Or am I paying a tax I don't need to pay? The answer might surprise you.
And remember: measure, understand, optimize - in that order.
Part 13: Debugging and Troubleshooting K-FIFO Queues
Lock-free code is notoriously difficult to debug. When things go wrong, they often go wrong in subtle, hard-to-reproduce ways. Here's a comprehensive guide to debugging K-FIFO implementations.
Common Symptoms and Their Causes
Symptom: Data Corruption (Reading Wrong Values)
Possible causes:
- Missing memory barriers - element read before producer finishes writing
- Index calculation errors - reading from wrong slot
- ABA problem - slot reused before read completes
Debugging approach:
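A debug-harness sketch (hypothetical helper): stamp each element with its producer's sequence number so the consumer can assert per-producer monotonic order, which catches torn publication and index bugs.

```java
import java.util.Arrays;

final class Stamped<E> {
    final long seq; final int producerId; final E value;
    Stamped(long seq, int producerId, E value) {
        this.seq = seq; this.producerId = producerId; this.value = value;
    }
}

final class OrderChecker {
    private final long[] lastSeen;   // per-producer last sequence observed

    OrderChecker(int producers) {
        lastSeen = new long[producers];
        Arrays.fill(lastSeen, -1);
    }

    void check(Stamped<?> s) {   // call from the consumer on every dequeue
        if (s.seq <= lastSeen[s.producerId])
            throw new AssertionError("reorder/duplicate from producer " + s.producerId);
        lastSeen[s.producerId] = s.seq;
    }
}
```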
Symptom: Lost Elements (Elements Enqueued But Never Dequeued)
Possible causes:
- Producer thinks segment is full when it isn't
- Consumer skips segments
- Head/tail wraparound issues
Debugging approach:
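A conservation-check sketch (hypothetical wrapper around the queue under test): count every successful offer and poll; offered minus polled must equal the in-flight count, and must reach zero after a full drain.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.LongAdder;

final class CountingQueue<E> {
    private final ConcurrentLinkedQueue<E> delegate = new ConcurrentLinkedQueue<>();
    final LongAdder offered = new LongAdder();
    final LongAdder polled  = new LongAdder();

    boolean offer(E e) {
        boolean ok = delegate.offer(e);
        if (ok) offered.increment();
        return ok;
    }

    E poll() {
        E e = delegate.poll();
        if (e != null) polled.increment();
        return e;
    }

    long inFlight() { return offered.sum() - polled.sum(); }   // 0 after drain = nothing lost
}
```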
Symptom: Deadlock or Livelock
K-FIFO is lock-free by design, so true deadlock shouldn't occur. But you might see:
- Consumer spinning forever on empty segments
- Producers spinning forever on full segments
- Thread starvation from unfair scheduling
Debugging approach:
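One diagnostic technique (hypothetical helper): bound the spin with exponential backoff so a potential livelock turns into a reportable timeout instead of a silent hang.

```java
import java.util.function.BooleanSupplier;

final class BoundedSpin {
    // Returns true if `condition` became true within maxSpins iterations.
    static boolean spinUntil(BooleanSupplier condition, int maxSpins) {
        int pause = 1;
        for (int i = 0; i < maxSpins; i++) {
            if (condition.getAsBoolean()) return true;
            for (int p = 0; p < pause; p++) Thread.onSpinWait();
            if (pause < 1024) pause <<= 1;   // exponential backoff
        }
        return false;                         // report instead of livelocking
    }
}
```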
Using Thread Dumps for Diagnosis
When K-FIFO behaves unexpectedly, thread dumps reveal where threads are spending time:
Look for patterns like:
- Multiple producer threads in the same method - indicates contention
- Consumer thread in Thread.onSpinWait() - normal when the queue is empty
- Producer threads blocked outside queue code - upstream issue
Testing Strategies for Lock-Free Code
Strategy 1: Stress Testing
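A stress-harness sketch: N producers hammer the queue while one consumer drains; the invariant is that every enqueued element is eventually dequeued. A known-correct ConcurrentLinkedQueue stands in for the K-FIFO under test.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

final class StressTest {
    static boolean run(int producers, int perProducer) {
        ConcurrentLinkedQueue<Long> queue = new ConcurrentLinkedQueue<>();
        AtomicLong consumed = new AtomicLong();
        long expected = (long) producers * perProducer;
        ExecutorService pool = Executors.newFixedThreadPool(producers + 1);
        try {
            for (int p = 0; p < producers; p++) {
                final long base = (long) p << 32;   // tag values by producer
                pool.submit(() -> {
                    for (int i = 0; i < perProducer; i++) queue.offer(base | i);
                });
            }
            Future<?> consumer = pool.submit(() -> {
                while (consumed.get() < expected) {   // spin-drain until done
                    if (queue.poll() != null) consumed.incrementAndGet();
                }
            });
            consumer.get(30, TimeUnit.SECONDS);       // fail on hang
        } catch (Exception e) {
            return false;
        } finally {
            pool.shutdownNow();
        }
        return consumed.get() == expected;            // nothing lost
    }
}
```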
Strategy 2: Jcstress Testing
For rigorous concurrency testing, use OpenJDK's jcstress:
Strategy 3: Property-Based Testing
Use property-based testing to find edge cases:
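A plain randomized sketch of the idea (a framework such as jqwik would generate and shrink these sequences for you): run random offer/poll sequences against both the queue under test and a sequential model, comparing behavior throughout. A strict-FIFO stand-in is used here so element equality can be checked; for a K-FIFO you would compare multisets and the K-distance bound instead.

```java
import java.util.ArrayDeque;
import java.util.Objects;
import java.util.Random;
import java.util.concurrent.ArrayBlockingQueue;

final class RandomizedModelTest {
    static boolean run(long seed, int steps) {
        Random rnd = new Random(seed);
        ArrayBlockingQueue<Integer> sut = new ArrayBlockingQueue<>(16);  // queue under test
        ArrayDeque<Integer> model = new ArrayDeque<>(16);                // reference model
        for (int i = 0; i < steps; i++) {
            if (rnd.nextBoolean()) {
                boolean a = sut.offer(i);
                boolean b = model.size() < 16 && model.offer(i);   // model a 16-slot bound
                if (a != b) return false;
            } else {
                if (!Objects.equals(sut.poll(), model.poll())) return false;
            }
            if (sut.size() != model.size()) return false;          // invariant each step
        }
        return true;
    }
}
```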
Profiling Lock-Free Code
Use async-profiler for low-overhead profiling:
Performance Regression Detection
Set up continuous benchmarking to catch regressions:
Appendix A: Complete Implementation
For reference, here's the complete K-FIFO queue implementation:
Appendix B: Quick Reference
Algorithm Summary
Structure:
- K independent SPSC (Single-Producer Single-Consumer) segments
- Each producer assigned to one segment
- Single consumer round-robins through segments
Guarantee:
- Dequeued element was among K oldest at time of dequeue
- NOT strict FIFO - order may vary within K elements
Performance:
- Producers: Zero contention (per-segment ownership)
- Consumer: O(K) segment checks per dequeue
- Throughput: 6-10M ops/sec (vs 2M for strict FIFO)
Trade-off:
- Relaxed ordering for higher throughput
- Suitable when approximate ordering is acceptable
Performance Comparison
| Metric | Strict FIFO | K-FIFO (K=8) |
|---|---|---|
| Throughput | ~2M ops/s | ~9M ops/s |
| p50 latency | 187ns | 52ns |
| p99.9 latency | 15,234ns | 412ns |
| GC allocation | 4.2 MB/s | 0.4 MB/s |
| Producer contention | High | None |
When to Use
| ✅ USE K-FIFO | ❌ AVOID K-FIFO |
|---|---|
| Load balancing | Message ordering required |
| Task scheduling | Event sourcing |
| Event aggregation | Single producer |
| Logging pipelines | Low throughput systems |
| Metrics collection | Team lacks lock-free experience |
| Real-time processing | Simplicity preferred |
Appendix C: Further Reading
Academic Papers
- "Wait-Free Queues With Multiple Enqueuers and Dequeuers" - Kogan & Petrank
- "Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms" - Michael & Scott
- "Fast and Scalable, Lock-Free k-FIFO Queues" - Kirsch, Lippautz & Payer
Books
- "The Art of Multiprocessor Programming" - Herlihy & Shavit
- "Java Concurrency in Practice" - Goetz et al.
- "Is Parallel Programming Hard, And, If So, What Can You Do About It?" - McKenney
Libraries
- JCTools: https://github.com/JCTools/JCTools
- LMAX Disruptor: https://lmax-exchange.github.io/disruptor/
- Chronicle Queue: https://github.com/OpenHFT/Chronicle-Queue
Blogs
- Mechanical Sympathy (Martin Thompson's blog): https://mechanical-sympathy.blogspot.com/
- Preshing on Programming: https://preshing.com/
Appendix D: Glossary of Terms
Acquire semantics: A memory ordering constraint ensuring that all reads and writes after an acquire operation see the effects of all writes before the corresponding release operation.
Cache line: The unit of data transfer between main memory and CPU cache, typically 64 bytes on modern x86-64 processors. Understanding cache lines is crucial for avoiding false sharing.
CAS (Compare-And-Swap): An atomic instruction that compares a memory location with an expected value and, if they match, updates it to a new value. The foundation of most lock-free algorithms.
False sharing: A performance problem that occurs when independent data items reside on the same cache line, causing unnecessary cache invalidations when either is modified.
K-FIFO: A queue with relaxed ordering guarantees where any dequeued element was among the K oldest elements at some point during the dequeue operation.
Lock-free: A progress guarantee where the system as a whole makes progress even if individual threads are delayed. Some thread will always complete its operation in a finite number of steps.
Memory barrier (fence): An instruction that enforces ordering constraints on memory operations, preventing the CPU and compiler from reordering reads and writes across the barrier.
MPSC (Multi-Producer Single-Consumer): A queue access pattern where multiple producer threads can enqueue concurrently, but only a single consumer thread dequeues.
Release semantics: A memory ordering constraint ensuring that all reads and writes before a release operation are visible to any thread that performs an acquire operation on the same variable.
SPSC (Single-Producer Single-Consumer): A queue access pattern with exactly one producer and one consumer thread. SPSC queues are the simplest to implement correctly and offer the best performance.
VarHandle: A Java API introduced in Java 9 that provides fine-grained control over memory ordering and atomic operations, replacing the use of sun.misc.Unsafe.
Wait-free: A stronger progress guarantee than lock-free, where every thread completes its operation in a bounded number of steps regardless of other threads.
Appendix E: Checklist for Implementing Your Own K-FIFO Queue
If you're implementing K-FIFO from scratch, use this checklist:
Design Phase:
- Determine K value based on expected producer count
- Choose segment capacity based on burst requirements
- Decide on overflow strategy (fail, retry other segments, block)
- Plan memory layout to avoid false sharing
Implementation Phase:
- Use VarHandle for all atomic operations
- Add cache line padding between hot fields
- Use release semantics for publication (tail updates)
- Use acquire semantics for consumption (tail reads)
- Clear buffer slots after reading (help GC)
- Use long positions to avoid wraparound issues
Testing Phase:
- Unit tests for single-threaded behavior
- Stress tests with many producers
- Long-running tests for memory leaks
- Property-based tests for invariants
- Jcstress tests for memory ordering bugs
Deployment Phase:
- Add monitoring metrics (throughput, latency, utilization)
- Configure alerting thresholds
- Document operational procedures
- Establish performance baseline
Maintenance Phase:
- Run benchmarks in CI to catch regressions
- Monitor production metrics
- Review segment balance periodically
- Update documentation with lessons learned
This article is part of the Lock-Free in Java series. The complete source code is available at https://github.com/techishthoughts-org/off_heap_algorithms
