# Tech.ish Thoughts — Full Content Index

> Complete article index with descriptions, key takeaways, and metadata.
> For a concise version, see /llms.txt

---

## Polyglot GraphQL Federation: Part 5 - Observability Across the Stack

- **URL**: https://techishthoughts.com/posts/2026/05/graphql-federation-part-5-observability/
- **Published**: 2026-05-06
- **Authors**: arthur-costa
- **Reading Time**: 45 minutes
- **Series**: Polyglot GraphQL Federation (Part 5)
- **Tags**: GraphQL, Federation, OpenTelemetry, Distributed Tracing, Grafana, Tempo, Prometheus, Loki, Pyroscope, Observability, SLO
- **Categories**: Engineering

Implement end-to-end observability for polyglot GraphQL federation with OpenTelemetry, Tempo, Prometheus, Loki, Alloy, and Pyroscope, including tail sampling, SLOs, and cross-signal correlation.

### Key Takeaways

- The OTel Collector's spanmetrics connector eliminates per-language metric instrumentation entirely — it watches every span and auto-generates RED metrics (rate, errors, duration) broken down by service, HTTP route, and GraphQL operation name, so three languages get consistent metrics from zero application code.
- Tail sampling in the Collector keeps 100% of error and slow (>500ms) traces while probabilistically sampling 25% of healthy traffic, cutting storage by ~73% without losing visibility into the requests that matter most for debugging.
- Cross-signal correlation is the real payoff of the LGTM+ stack: a latency spike in Prometheus shows an exemplar dot linking to the exact trace in Tempo, which links to logs in Loki filtered by trace ID, which links to a CPU flame graph in Pyroscope — four signals, one click chain.
- Structured JSON logging to stdout with embedded traceId is the non-negotiable prerequisite that makes the entire log correlation pipeline work — Alloy parses the JSON, extracts traceId as a Loki label, and Grafana's derived fields turn it into a clickable "View Trace" link.
- SLO tracking derives entirely from spanmetrics: Prometheus recording rules compute availability SLI (1 - 5xx rate) and latency SLI (% of requests under 500ms) every 30 seconds, with error budget burn alerting that fires when 50% of the 30-day budget is consumed.

---

## Polyglot GraphQL Federation: Part 4 - Kong, Apollo Router, and Query Planning

- **URL**: https://techishthoughts.com/posts/2026/04/graphql-federation-part-4-gateway-layer/
- **Published**: 2026-04-21
- **Authors**: arthur-costa
- **Reading Time**: 22 minutes
- **Series**: Polyglot GraphQL Federation (Part 4)
- **Tags**: GraphQL, Federation, Kong, Apollo Router, API Gateway, Authentication, Rate Limiting, Query Planning
- **Categories**: Engineering

How two gateways compose into a secure, observable API layer with intelligent query planning across subgraphs.

### Key Takeaways

- The two-gateway pattern separates concerns cleanly: Kong handles protocol-agnostic API policies (auth, rate limiting, CORS, request IDs) while Apollo Router handles GraphQL-specific work (supergraph composition, query planning, parallel execution) — either can be replaced independently.
- Kong validates credentials once and propagates extracted claims as x-user-id/x-user-role headers, so subgraphs make authorization decisions without re-validating tokens — authentication happens at the edge, authorization happens at the leaf.
- The Router's query planner is where federation's intelligence lives: it sequences dependent fetches (get product ID first) and parallelizes independent entity resolution (inventory and reviews simultaneously), so total latency equals Step 1 + max(Step 2 fetches) rather than the sum of all.
- Auth endpoints route directly to the User Service bypassing Apollo Router entirely, because login/registration are simple REST calls that don't need federation — routing them through the GraphQL layer would add unnecessary latency and query planning overhead.
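The query-planner latency model in the takeaways above (total latency = Step 1 + max of the Step 2 fetches, not their sum) can be sketched with `CompletableFuture`. This is an illustrative sketch only; the service names and return values are hypothetical, not taken from the series' code.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Sketch of the router's fetch pattern: one dependent fetch, then
// independent entity fetches resolved concurrently.
public class QueryPlanSketch {

    // Stand-ins for subgraph calls; names and values are illustrative.
    static CompletableFuture<String> fetchProductId() {
        return CompletableFuture.supplyAsync(() -> "product-42");
    }

    static CompletableFuture<Integer> fetchInventory(String productId) {
        return CompletableFuture.supplyAsync(() -> 7);
    }

    static CompletableFuture<List<String>> fetchReviews(String productId) {
        return CompletableFuture.supplyAsync(() -> List.of("great", "ok"));
    }

    public static String resolve() {
        // Step 1: the dependent fetch must complete first.
        return fetchProductId().thenCompose(id -> {
            // Step 2: independent fetches run concurrently, so this stage
            // costs max(inventory, reviews) rather than their sum.
            CompletableFuture<Integer> inv = fetchInventory(id);
            CompletableFuture<List<String>> rev = fetchReviews(id);
            return inv.thenCombine(rev, (stock, reviews) ->
                id + ": stock=" + stock + ", reviews=" + reviews.size());
        }).join();
    }

    public static void main(String[] args) {
        System.out.println(resolve());
    }
}
```

The same shape generalizes to any fan-out: chain dependent steps with `thenCompose`, and join independent ones with `thenCombine` or `allOf` so they overlap in time.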
---

## Object Detection from Scratch: Part 2 - Dataset, Labels, and the Reality of Training Data

- **URL**: https://techishthoughts.com/posts/2026/04/object-detection-part-2-dataset-labels/
- **Published**: 2026-04-15
- **Authors**: arthur-costa
- **Reading Time**: 20 minutes
- **Series**: Object Detection from Scratch (Part 2)
- **Tags**: Object Detection, Datasets, YOLO, Roboflow, Computer Vision, Data Quality, Bounding Boxes
- **Categories**: Engineering

Part 2 examines the dataset behind the MTG detector: splits, YOLO labels, class design, annotation noise, and why data quality sets the real performance ceiling.

### Key Takeaways

- The dataset is not just a folder of images. Its splits, class design, and annotation consistency define what the detector can plausibly learn.
- YOLO labels are simple on disk but encode strong assumptions: normalized coordinates, one text file per image, and axis-aligned boxes.
- The gap between mAP50 and mAP50-95 in this project is a strong clue that annotation quality, especially for small regions, is part of the limiting factor.
- For this detector, data quality is a higher-leverage improvement path than endlessly scaling model size.

---

## Batch Processing: Amortized I/O for Audit Persistence

- **URL**: https://techishthoughts.com/posts/2026/01/off-heap-batch-processing-production/
- **Published**: 2026-04-09
- **Authors**: arthur
- **Reading Time**: 51 minutes
- **Series**: Off-Heap Algorithms in Java (Part 10)
- **Tags**: Java, Off-Heap, Batch Processing, Double Buffer, Disk I/O, Audit Log, Persistence, Performance
- **Categories**: Technology

Amortize disk I/O costs with double-buffered batch processing, continuous flushing, and regulatory-compliant audit logging for trading systems.
### Key Takeaways

- Batching amortizes memory barrier cost across N operations instead of paying one barrier per element — a batch size of 32 yields 2-4x throughput improvement for SPSC queues because the dominant cost in lock-free queues is fences, not data movement.
- Double-buffered batch processing decouples producers from I/O entirely: the producer fills one buffer while the flusher persists the other, making disk latency invisible to the hot path.
- Optimal batch size is workload-dependent, and adaptive tuning outperforms static configuration — HFT systems want 8-16 (latency-critical), log aggregation wants 256-512 (throughput-critical), and real-time analytics sits at 32-64.
- Regulatory audit logging in trading systems must guarantee persistence ordering and durability, making batch commit boundaries the natural unit of transactional consistency for compliance.

---

## Polyglot GraphQL Federation: Part 3 - When GraphQL Meets gRPC and REST

- **URL**: https://techishthoughts.com/posts/2026/04/graphql-federation-part-3-hybrid-protocols/
- **Published**: 2026-04-06
- **Authors**: arthur-costa
- **Reading Time**: 24 minutes
- **Series**: Polyglot GraphQL Federation (Part 3)
- **Tags**: GraphQL, gRPC, REST, Protocol Buffers, Stripe, Microservices, Java, Go, API Design
- **Categories**: Engineering

GraphQL is not the only protocol in a federated platform. This article explores how gRPC handles internal Java-to-Java communication while REST powers Stripe payment integration in Go.

### Key Takeaways

- Protocol selection follows use-case boundaries, not architectural dogma: GraphQL faces the client for flexible queries, gRPC connects internal Java services for high-frequency latency-sensitive lookups, and REST integrates Stripe because it's the only option and wrapping it would add a translation layer between error models.
- The Inventory service's dual-protocol pattern — exposing GraphQL on port 4004 for the federation router and gRPC on port 50051 for direct Product Catalog calls — proves that transport protocol is an adapter while business logic stays shared in one InventoryService class.
- Batch gRPC calls (GetInventoryBatch) are critical for the hot path: when 20 products need stock counts, one gRPC request replaces 20 individual calls, and the same batch database query serves both gRPC and federation entity resolution code paths.
- Choosing gRPC over REST between the two JVM services compounds advantages on an internal Docker network: binary protobuf is 3-10x smaller than JSON, HTTP/2 multiplexing shares a single TCP connection, and code generation from the shared .proto file eliminates serialization bugs entirely.

---

## K-FIFO Queues: Relaxed Ordering for Maximum Throughput

- **URL**: https://techishthoughts.com/posts/2026/01/off-heap-kfifo-queue-production/
- **Published**: 2026-03-26
- **Authors**: arthur
- **Reading Time**: 50 minutes
- **Series**: Off-Heap Algorithms in Java (Part 9)
- **Tags**: Java, Off-Heap, K-FIFO, Relaxed Ordering, Throughput, Metrics, Probabilistic, Performance
- **Categories**: Technology

Trade strict FIFO ordering for dramatically higher throughput with K-FIFO queues, segmented buffers, and probabilistic fairness for metrics collection.

### Key Takeaways

- K-FIFO queues trade strict ordering for throughput by allowing any element within a window of K positions to be dequeued next — this eliminates the single-point CAS contention that makes strict FIFO queues collapse under high producer counts.
- Segmented buffer design distributes contention across K independent segments so that CAS failures on one segment don't block progress on others, achieving near-linear scaling with producer count where strict FIFO hits a wall.
- K-FIFO is the right choice specifically for metrics collection, telemetry, and event processing where approximate ordering is acceptable — but ordering-dependent workloads like event sourcing or transactional replay must use strict FIFO regardless of performance cost.
- The central queue bottleneck that crashed the system under load was not a capacity problem but a contention problem — 16 threads hammering one CAS variable generated enough cache-line bouncing to trigger cascading backpressure and OOM.

---

## Polyglot GraphQL Federation: Part 2 - Three Languages, One Schema

- **URL**: https://techishthoughts.com/posts/2026/03/graphql-federation-part-2-polyglot-subgraphs/
- **Published**: 2026-03-22
- **Authors**: arthur-costa
- **Reading Time**: 28 minutes
- **Series**: Polyglot GraphQL Federation (Part 2)
- **Tags**: GraphQL, Federation, Java, Go, TypeScript, Micronaut, gqlgen, Apollo Server, Microservices
- **Categories**: Engineering

How Java, Go, and TypeScript each implement GraphQL Federation 2 subgraphs with entity resolution, schema directives, and database-per-service isolation in a real e-commerce platform.

### Key Takeaways

- The federation spec intentionally decouples the contract (entity keys, __resolveReference, _entities query) from implementation language — all three ecosystems produce identical GraphQL responses, making polyglot federation a matter of library ergonomics rather than protocol compatibility.
- TypeScript/Apollo wins on iteration speed with ~10 lines of federation overhead and sub-second hot reload, Java/Micronaut wins on explicit control with manual __typename dispatch, and Go/gqlgen wins on type-safe code generation where the developer writes just a one-line entity lookup function.
- Database-per-service isolation is enforced at the data level with separate PostgreSQL instances — cross-service data access happens exclusively through federation entity resolution, never through shared database access, which makes domain boundary violations structurally impossible.
- Entity resolution ergonomics diverge sharply: TypeScript uses resolver functions on the type, Java requires a manual switch statement inside Federation.transform(), and Go's gqlgen plugin generates the entire _entities dispatch so you only implement FindEntityByID.

---

## Object Detection from Scratch: Part 1 - Why This Project Is Worth Building

- **URL**: https://techishthoughts.com/posts/2026/03/object-detection-part-1-why-this-project-matters/
- **Published**: 2026-03-15
- **Authors**: arthur-costa
- **Reading Time**: 18 minutes
- **Series**: Object Detection from Scratch (Part 1)
- **Tags**: Object Detection, Computer Vision, YOLO, Machine Learning, Python, FastAPI, OCR, OpenAI
- **Categories**: Engineering

Why a Magic: The Gathering card detector is a serious engineering project, not a toy demo. Part 1 frames the product problem, the architecture, and the technical journey ahead.

### Key Takeaways

- This project is not just a model-training exercise. It is a product pipeline that turns raw images into usable card information through detection, OCR, lookup, and art matching.
- Object detection is the right abstraction because the system must find multiple semantic regions on the same card, not just assign one label to the whole image.
- The architecture is split into two tracks: an offline training track that produces weights and an online inference track that answers user requests in real time.
- The repo already captures the entire engineering lifecycle: dataset setup, training, validation, live detection, identification, a web application, and experiment logs.
---

## Polyglot GraphQL Federation: Part 1 - The Monolith's Breaking Point

- **URL**: https://techishthoughts.com/posts/2026/03/graphql-federation-part-1-why-federation/
- **Published**: 2026-03-07
- **Authors**: arthur-costa
- **Reading Time**: 20 minutes
- **Series**: Polyglot GraphQL Federation (Part 1)
- **Tags**: GraphQL, Federation, Microservices, API Design, Apollo Router, Distributed Systems, E-Commerce
- **Categories**: Engineering

Why monolithic GraphQL APIs collapse under the weight of growing teams and domains. An introduction to GraphQL Federation 2 and the architecture behind a polyglot e-commerce platform.

### Key Takeaways

- Monolithic GraphQL servers create ownership ambiguity, deployment coupling, and language lock-in as organizations scale — federation solves all three by giving each team an independent subgraph.
- Federation 2 entities use @key directives to span service boundaries, letting multiple subgraphs contribute fields to a single type without touching each other's code.
- A polyglot approach (Java, Go, TypeScript) is practical because the federation router hides implementation languages from clients — the browser sends one query regardless of which runtime resolves it.
- A two-layer gateway stack separates concerns: Kong handles protocol-agnostic API security (JWT, rate limiting, CORS) while Apollo Router handles GraphQL-specific orchestration (query planning, entity resolution, header propagation).
- Query plan parallelism is the key performance insight — the router fans out independent entity resolution calls concurrently, so latency equals the slowest subgraph rather than the sum of all.
---

## The Disruptor Pattern: Multi-Stage Event Processing Pipelines

- **URL**: https://techishthoughts.com/posts/2026/01/off-heap-disruptor-pattern-production/
- **Published**: 2026-02-12
- **Authors**: arthur
- **Reading Time**: 51 minutes
- **Series**: Off-Heap Algorithms in Java (Part 5)
- **Tags**: Java, Off-Heap, Disruptor, LMAX, Event Processing, Pipeline, Sequence Barrier, Performance
- **Categories**: Technology

Implement LMAX Disruptor-style event processing with sequence barriers, multi-stage pipelines, and batch processing for ultra-low latency systems.

### Key Takeaways

- The Disruptor's core trick is replacing multiple inter-stage blocking queues with a single pre-allocated ring buffer where stages track progress via padded atomic sequences — eliminating lock contention, per-event allocation, and cache-hostile memory access patterns simultaneously.
- False sharing between sequence counters is the silent performance killer: without padding each sequence to its own 64-byte cache line, updating one stage's progress invalidates neighboring sequences in other cores' caches, adding 40-100ns per iteration.
- Wait strategy selection is the primary latency-vs-CPU tradeoff knob — BusySpinWaitStrategy delivers ~10-50ns response at 100% CPU, while BlockingWaitStrategy trades 10-100us+ latency for minimal CPU, and the choice depends on whether you have dedicated cores.
- The multi-producer available-buffer pattern uses release/acquire VarHandle semantics to let consumers safely read events even when producers complete out of order — a consumer seeing a flag guarantees all prior event data writes are visible.
- Benchmark results show the Disruptor pipeline achieving 72x improvement at p99.9 latency (623ns vs 45,123ns) and 98x fewer GC events compared to LinkedBlockingQueue, because the hot path allocates zero objects and inter-stage transfer costs drop from microseconds to tens of nanoseconds.
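The padded-sequence idea from the takeaways above can be sketched in a few lines. This is an illustrative sketch, not the article's code: the class and field names are hypothetical, the padding fields assume 64-byte cache lines, and real implementations (LMAX Disruptor, JCTools) use inheritance tricks or `@Contended` because the JVM does not guarantee field layout.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

// Sketch of a Disruptor-style sequence counter padded so that two
// sequences never share a 64-byte cache line.
public class PaddedSequence {
    long p1, p2, p3, p4, p5, p6, p7;        // 56 bytes of padding before the hot field
    private long value = -1;                 // the cursor, accessed only via VarHandle
    long p9, p10, p11, p12, p13, p14, p15;   // 56 bytes of padding after

    private static final VarHandle VALUE;
    static {
        try {
            VALUE = MethodHandles.lookup()
                .findVarHandle(PaddedSequence.class, "value", long.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // Publisher side: the release store makes all prior event writes
    // visible to any thread that later acquire-loads this sequence.
    public void publish(long sequence) {
        VALUE.setRelease(this, sequence);
    }

    // Consumer side: the acquire load pairs with the release store above.
    public long get() {
        return (long) VALUE.getAcquire(this);
    }
}
```

The release/acquire pair is exactly the happens-before edge the multi-producer bullet describes: a consumer that observes a published sequence value is guaranteed to observe the event data written before it.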
---

## Lock-Free MPMC Queues: Dual Contention Mastery

- **URL**: https://techishthoughts.com/posts/2026/01/off-heap-mpmc-queue-production/
- **Published**: 2026-01-29
- **Authors**: arthur
- **Reading Time**: 50 minutes
- **Series**: Off-Heap Algorithms in Java (Part 4)
- **Tags**: Java, Off-Heap, Concurrency, MPMC, Lock-Free, Work-Stealing, Thread Pool, Performance
- **Categories**: Technology

Master the complexity of Multi-Producer Multi-Consumer lock-free queues with per-slot sequence numbers, dual CAS coordination, and work-stealing thread pool integration.

### Key Takeaways

- MPMC is fundamentally harder than MPSC because contention becomes two-dimensional — producers race for the head and consumers race for the tail, and both sides must coordinate through per-slot sequence numbers to prevent overwrites of unread data and duplicate reads.
- Per-slot sequence numbers encode a three-state protocol (empty, written, being-read) that lets producers and consumers coordinate without locks: a slot is writable when sequence equals the producer's position, readable when sequence equals position+1, and recyclable when the consumer sets sequence to position+capacity.
- Locked MPMC throughput actually decreases when moving from 4P/4C to 8P/8C due to contention overhead, while lock-free throughput stays roughly consistent — demonstrating that lock-based designs have a thread-count ceiling beyond which adding resources makes things worse.
- Lock-free MPMC delivers 27x better p99.9 latency (789ns vs 21,345ns at 8P/8C) and 95% less GC allocation, with the tail-latency improvement mattering far more than the 3x mean improvement for trading workloads where worst-case determines profitability.
- The dual-CAS protocol avoids ABA through 64-bit monotonic positions (2^63 operations before wraparound equals ~29,000 years at 10M ops/sec) combined with per-slot sequence validation as a secondary consistency check.
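The per-slot sequence protocol in the takeaways above is the classic Vyukov-style bounded MPMC design, and it can be sketched on-heap in under fifty lines. This is a minimal illustration of the protocol, not the article's off-heap implementation; class and field names are hypothetical.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicLongArray;

// Sketch of a bounded MPMC queue coordinated by per-slot sequence numbers:
// a slot is writable when seq == pos, readable when seq == pos + 1, and
// recycled by setting seq = pos + capacity.
public class MpmcQueue<E> {
    private final Object[] buffer;
    private final AtomicLongArray seq;                  // one sequence per slot
    private final AtomicLong enqPos = new AtomicLong(); // producers contend here
    private final AtomicLong deqPos = new AtomicLong(); // consumers contend here
    private final int mask;

    public MpmcQueue(int capacity) {                    // capacity must be a power of two
        buffer = new Object[capacity];
        seq = new AtomicLongArray(capacity);
        for (int i = 0; i < capacity; i++) seq.set(i, i); // slot i starts "empty"
        mask = capacity - 1;
    }

    public boolean offer(E e) {
        long pos = enqPos.get();
        for (;;) {
            int idx = (int) (pos & mask);
            long diff = seq.get(idx) - pos;
            if (diff == 0) {                            // slot empty for this lap
                if (enqPos.compareAndSet(pos, pos + 1)) {
                    buffer[idx] = e;
                    seq.lazySet(idx, pos + 1);          // release: mark readable
                    return true;
                }
                pos = enqPos.get();                     // lost the race, retry
            } else if (diff < 0) {
                return false;                           // full: slot not yet recycled
            } else {
                pos = enqPos.get();                     // another producer moved past us
            }
        }
    }

    @SuppressWarnings("unchecked")
    public E poll() {
        long pos = deqPos.get();
        for (;;) {
            int idx = (int) (pos & mask);
            long diff = seq.get(idx) - (pos + 1);
            if (diff == 0) {                            // slot written for this lap
                if (deqPos.compareAndSet(pos, pos + 1)) {
                    E e = (E) buffer[idx];
                    buffer[idx] = null;
                    seq.lazySet(idx, pos + buffer.length); // recycle for the next lap
                    return e;
                }
                pos = deqPos.get();
            } else if (diff < 0) {
                return null;                            // empty
            } else {
                pos = deqPos.get();
            }
        }
    }
}
```

Note how the two CAS variables (`enqPos`, `deqPos`) keep producer and consumer contention in separate domains, while the sequence array is the secondary consistency check the last bullet mentions.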
---

## Lock-Free MPSC Queues: Production-Grade Implementation

- **URL**: https://techishthoughts.com/posts/2026/01/off-heap-mpsc-queue-production/
- **Published**: 2026-01-15
- **Authors**: arthur
- **Reading Time**: 51 minutes
- **Series**: Off-Heap Algorithms in Java (Part 3)
- **Tags**: Java, Off-Heap, Concurrency, MPSC, Lock-Free, VarHandle, High-Frequency Trading, Performance
- **Categories**: Technology

A deep-dive into building production-grade Multi-Producer Single-Consumer lock-free queues in Java, with VarHandle, CAS operations, and real-world benchmarks.

### Key Takeaways

- The MPSC asymmetry is the key design lever: because only one consumer exists, the tail path needs no synchronization at all — producers compete via CAS on the head while the consumer reads with plain loads, cutting the coordination surface in half.
- Lock contention's real cost is not mutual exclusion but the cascade of context switches (3,000-10,000ns each), cache-line bouncing, and AQS node allocations (3.6 MB/sec under load) that dominate a 50ns critical section by two orders of magnitude.
- The locked MPSC queue showed 45x variance between p50 (187ns) and p99.9 (8,934ns) — a direct signature of lock convoy effects where threads serialize through parking and waking rather than making concurrent progress.
- VarHandle with explicit memory ordering modes (getAcquire/setRelease) replaces the blunt instrument of volatile or synchronized with targeted barriers, keeping correctness while avoiding the full-fence cost on weakly-ordered architectures.
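The MPSC asymmetry described above (producers CAS-claim slots, the lone consumer reads in order) can be sketched with on-heap atomics. This is an illustrative sketch under simplifying assumptions, not the article's off-heap code: elements must be non-null, `null` doubles as the "not yet published" marker, and names are hypothetical.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReferenceArray;

// Sketch of an MPSC ring: many producers CAS-claim write positions,
// then publish the element with a release store; the single consumer
// polls in claim order with no CAS at all.
public class MpscRing<E> {
    private final AtomicReferenceArray<E> buffer;
    private final int mask;
    private final AtomicLong tail = new AtomicLong(); // producers CAS-claim here
    private final AtomicLong head = new AtomicLong(); // advanced only by the consumer

    public MpscRing(int capacity) {                   // capacity must be a power of two
        buffer = new AtomicReferenceArray<>(capacity);
        mask = capacity - 1;
    }

    /** Multi-producer side: CAS claims an index, the element store publishes it. */
    public boolean offer(E e) {                       // e must be non-null
        long t;
        do {
            t = tail.get();
            if (t - head.get() >= buffer.length()) return false; // full (conservative)
        } while (!tail.compareAndSet(t, t + 1));
        buffer.lazySet((int) (t & mask), e);          // release store: fully written element
        return true;
    }

    /** Single-consumer side: plain ordered reads; null means either empty or a
     *  claimed-but-not-yet-published slot, so the consumer just tries again later. */
    public E poll() {
        long h = head.get();
        int idx = (int) (h & mask);
        E e = buffer.get(idx);
        if (e == null) return null;
        buffer.lazySet(idx, null);                    // free the slot before advancing head
        head.lazySet(h + 1);
        return e;
    }
}
```

The CAS loop exists only on the producer side; the consumer's path is a load, a null-out, and an ordered head advance, which is the half of the coordination surface the first bullet says you get to delete.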
---

## Sharded Processing: Per-Core Isolation for Zero Contention

- **URL**: https://techishthoughts.com/posts/2026/01/off-heap-sharded-processing-production/
- **Published**: 2026-01-04
- **Authors**: arthur-costa
- **Reading Time**: 50 minutes
- **Series**: Off-Heap Algorithms in Java (Part 7)
- **Tags**: Java, Off-Heap, Sharding, Thread Affinity, CPU Cores, Parallelism, Zero Contention, Performance
- **Categories**: Technology

Eliminate contention entirely with per-CPU-core sharded buffers, thread affinity, and isolated processing lanes for maximum parallelism.

### Key Takeaways

- Per-core sharding eliminates contention by partitioning buffers so threads never coordinate — achieving 54x throughput improvement (2M to 108M ops/s) and dropping p99.9 latency from 18us to 134ns on a 64-thread system.
- Cache-line false sharing between adjacent shards silently destroys performance; padding each shard to 64-byte boundaries and pinning threads to cores cut the L1 cache miss rate from 38% to 4%.
- The canonical signal that you need sharding is throughput decreasing as you add threads — if going from 4 to 8 threads makes things slower, a CAS retry ratio above 1.0 confirms a serialization bottleneck.
- Release-acquire VarHandle semantics are sufficient and optimal for producer-consumer publication — volatile (full sequential consistency) is unnecessarily expensive, while plain access is dangerously incorrect on ARM's weak memory model.
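The sharding idea above can be illustrated with a sharded counter: each thread hashes to its own shard, so the common case is uncontended. This is a deliberately simplified sketch, not the article's per-core buffer design: shard selection by thread ID stands in for real thread affinity, and spacing shards eight longs apart is only a rough stand-in for proper cache-line padding.

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Sketch of per-shard isolation: increments land on a per-thread shard,
// and shards are spaced 8 longs (64 bytes) apart so adjacent shards do
// not share a cache line (modulo the array header).
public class ShardedCounter {
    private static final int STRIDE = 8;      // 8 longs = 64 bytes per shard
    private final AtomicLongArray shards;
    private final int shardMask;

    public ShardedCounter(int shardCount) {   // shardCount must be a power of two
        shards = new AtomicLongArray(shardCount * STRIDE);
        shardMask = shardCount - 1;
    }

    /** Each thread lands on a stable shard, so increments rarely contend. */
    public void increment() {
        int shard = (int) (Thread.currentThread().getId() & shardMask);
        shards.getAndIncrement(shard * STRIDE);
    }

    /** Reads sum over all shards; the result may be slightly stale under writes. */
    public long sum() {
        long total = 0;
        for (int i = 0; i <= shardMask; i++) {
            total += shards.get(i * STRIDE);
        }
        return total;
    }
}
```

The same structure generalizes from counters to buffers: one lane per core, padded apart, with aggregation happening only on the (rare) read path.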
---

## Wait-Free Telemetry: Never-Blocking Observability

- **URL**: https://techishthoughts.com/posts/2026/01/off-heap-wait-free-telemetry-production/
- **Published**: 2026-01-04
- **Authors**: arthur-costa
- **Reading Time**: 50 minutes
- **Series**: Off-Heap Algorithms in Java (Part 6)
- **Tags**: Java, Off-Heap, Wait-Free, Telemetry, Observability, Metrics, Trading, Performance
- **Categories**: Technology

Build wait-free telemetry buffers that never block producers, with overwrite semantics for high-frequency trading observability that doesn't impact system performance.

### Key Takeaways

- Telemetry data has fundamentally different semantics than business data — recent statistical accuracy matters more than individual event delivery, so accepting overwrite-on-full as a feature rather than a bug enables a wait-free design that never blocks producers.
- The observability paradox mirrors Heisenberg: under peak load, synchronized telemetry buffers consume more CPU time on coordination than exists in a second, causing the monitoring layer to become the primary source of the outage it should be diagnosing.
- Replacing compareAndSet (which can loop indefinitely under contention) with getAndIncrement (which always completes in one step) is the precise boundary between lock-free and wait-free — and it matters most when contention is highest.
- Wait-free telemetry achieves 140x better p99.9 latency (89ns vs 12,456ns) and scales nearly linearly to 16 producers because each additional contender adds only a constant-time atomic increment, not another context switch in a lock convoy.
- Cache-line padding between head and tail via seven unused longs (56 bytes) prevents false sharing that would otherwise turn every producer write into a cross-core cache invalidation storm for the consumer.
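The getAndIncrement-plus-overwrite design in the takeaways above fits in a few lines. This is an illustrative sketch (hypothetical names, no cache-line padding) rather than the article's off-heap buffer; the point is that the producer path has no CAS loop and therefore completes in a bounded number of steps.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReferenceArray;

// Sketch of a wait-free telemetry ring with overwrite-on-full semantics:
// producers claim a slot with getAndIncrement (always one step, never a
// retry loop) and simply overwrite whatever sample is already there.
public class TelemetryRing<E> {
    private final AtomicReferenceArray<E> slots;
    private final AtomicLong writeSeq = new AtomicLong(); // monotonic slot claims
    private final int mask;

    public TelemetryRing(int capacity) {      // capacity must be a power of two
        slots = new AtomicReferenceArray<>(capacity);
        mask = capacity - 1;
    }

    /** Wait-free: one atomic increment, one ordered store. Never blocks, never spins. */
    public void record(E sample) {
        long seq = writeSeq.getAndIncrement();
        slots.lazySet((int) (seq & mask), sample); // overwriting old samples is by design
    }

    /** Best-effort read for the metrics consumer; may race with in-flight writers. */
    public E read(int index) {
        return slots.get(index & mask);
    }
}
```

This is exactly the trade the first bullet describes: individual samples may be lost to overwrite, but the producer's hot path cost is constant regardless of contention.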
---

## Wait-Free SPSC Queues in Java

- **URL**: https://techishthoughts.com/posts/2025/11/off-heap-spsc-queue/
- **Published**: 2025-12-23
- **Authors**: arthur-costa
- **Reading Time**: 18 minutes
- **Series**: Off-Heap Algorithms in Java (Part 2)
- **Tags**: Java, Off-Heap, Concurrency, Performance, High-Frequency Trading, VarHandle, SPSC Queue
- **Categories**: Engineering

How to replace synchronized queue handshakes with a wait-free Single-Producer Single-Consumer ring buffer that uses precise memory ordering instead of locks.

### Key Takeaways

- A wait-free SPSC queue replaces synchronized blocks with VarHandle acquire/release semantics, cutting per-operation cost from ~200 CPU cycles (monitor enter/exit + memory fence) to ~30 cycles (two lightweight barriers), yielding 6.4x throughput and 25x better p99.9 latency.
- Cache-line padding between head and tail indices eliminates false sharing — without it, producer writes to head invalidate the consumer's cache line for tail even though no data is logically shared, causing hundreds of cycles of coherence traffic per operation.
- The synchronized approach wastes 60% of CPU on lock bookkeeping versus 10% for wait-free, meaning on the same hardware 90% of cycles go to actual trading logic instead of coordination overhead.
- Wait-free correctness relies entirely on the JMM happens-before edge: setRelease on the head index guarantees all prior element writes are visible to any thread that subsequently does getAcquire on that same index — no CAS, no retries, no OS involvement.
- SPSC relationships hide in plain sight on architecture diagrams wherever dedicated threads form a pipeline — exchange gateway to strategy engine, decoder to renderer, application thread to log writer — and each one is a candidate for lock elimination.
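The setRelease/getAcquire handshake from the fourth bullet above can be shown in a minimal on-heap SPSC ring. This is an illustrative sketch (hypothetical names, no cache-line padding, heap array instead of off-heap memory), not the article's implementation.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

// Sketch of a wait-free SPSC ring: the producer plain-writes the element,
// then release-stores head; the consumer acquire-loads head, which makes
// the element write visible. No CAS, no retries, no locks.
public class SpscRing<E> {
    private final Object[] buffer;
    private final int mask;
    private long head;  // written only by the producer
    private long tail;  // written only by the consumer

    private static final VarHandle HEAD, TAIL;
    static {
        try {
            MethodHandles.Lookup l = MethodHandles.lookup();
            HEAD = l.findVarHandle(SpscRing.class, "head", long.class);
            TAIL = l.findVarHandle(SpscRing.class, "tail", long.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    public SpscRing(int capacity) {           // capacity must be a power of two
        buffer = new Object[capacity];
        mask = capacity - 1;
    }

    /** Producer only: plain element write, then a release store on head. */
    public boolean offer(E e) {
        long h = head;                                    // owner read, no barrier needed
        if (h - (long) TAIL.getAcquire(this) == buffer.length) return false; // full
        buffer[(int) (h & mask)] = e;                     // plain write, published below
        HEAD.setRelease(this, h + 1);                     // happens-before edge to consumer
        return true;
    }

    /** Consumer only: acquire load on head pairs with the producer's release store. */
    @SuppressWarnings("unchecked")
    public E poll() {
        long t = tail;
        if ((long) HEAD.getAcquire(this) == t) return null; // empty
        int idx = (int) (t & mask);
        E e = (E) buffer[idx];
        buffer[idx] = null;                               // help GC
        TAIL.setRelease(this, t + 1);                     // frees the slot for the producer
        return e;
    }
}
```

Because each index has exactly one writer, both paths are wait-free: each call completes in a bounded number of steps regardless of what the other thread is doing.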
---

## A gentle introduction to observability

- **URL**: https://techishthoughts.com/posts/2025/12/gentle-introduction-to-observability/
- **Published**: 2025-12-11
- **Authors**: gabriel-jeronimo
- **Reading Time**: 6 minutes
- **Tags**: SRE, Observability, OpenTelemetry, Monitoring
- **Categories**: Engineering

Learn the core concepts of observability: why logs, metrics, and traces matter, and how tools like OpenTelemetry, Prometheus, and Grafana improve your system's reliability.

### Key Takeaways

- Observability is not a set of tools but the ability to infer internal system state from external signals — logs, metrics, and traces each answer different questions (what happened, how much, and where in the call chain).
- OpenTelemetry has become the dominant collection standard (adopted by 71% of companies alongside Prometheus for metrics), making vendor-neutral instrumentation the de facto starting point rather than a lock-in bet.
- Without observability, the questions "where is the problem" and "how many users are affected" become slow manual investigations that directly inflate MTTR and degrade user experience during incidents.
- Traces are uniquely valuable in distributed systems because they reconstruct the full request path across microservices, exposing latency contributors that per-service logs and metrics cannot reveal on their own.

---

## Event Pipelines in Java: The LMAX Disruptor Pattern

- **URL**: https://techishthoughts.com/posts/2025/11/off-heap-event-pipeline/
- **Published**: 2025-11-19
- **Authors**: arthur-costa
- **Reading Time**: 18 minutes
- **Series**: Off-Heap Algorithms in Java (Part 5)
- **Tags**: Java, Off-Heap, Concurrency, Performance, High-Frequency Trading, Event Pipelines, LMAX Disruptor
- **Categories**: Engineering

How to chain SPSC queues into a high-throughput event pipeline, following the LMAX Disruptor pattern for multi-stage processing with sub-microsecond latency.
### Key Takeaways

- Between any two pipeline stages there is exactly one producer and one consumer, so chaining SPSC ring buffers replaces every LinkedBlockingQueue with a wait-free primitive — turning an architectural observation into a 23x throughput gain.
- The 23x total improvement far exceeds the 5.5x per-stage lock removal because lock-free coordination, zero per-event allocation, GC elimination, and cache locality compound multiplicatively rather than additively.
- At 1M events/sec through 4 stages, blocking queues allocate 56 MB/sec of throwaway node objects while SPSC ring buffers allocate exactly zero bytes, making GC pressure alone a sufficient reason to switch even if lock overhead were free.
- End-to-end p99 latency dropped from 4.8 microseconds to 420 nanoseconds — crossing the threshold where the queueing layer stops being the dominant latency contributor and business logic can operate in its native sub-microsecond regime.

---

## MPMC Queues in Java: The Final Boss

- **URL**: https://techishthoughts.com/posts/2025/11/off-heap-mpmc-queue/
- **Published**: 2025-11-18
- **Authors**: arthur-costa
- **Reading Time**: 18 minutes
- **Series**: Off-Heap Algorithms in Java (Part 4)
- **Tags**: Java, Off-Heap, Concurrency, Performance, High-Frequency Trading, MPMC Queue, CAS
- **Categories**: Engineering

How to build a dual-CAS Multi-Producer Multi-Consumer ring buffer in Java that scales on both ends without collapsing under lock contention.

### Key Takeaways

- MPMC's core insight is that producers only need to contend with other producers and consumers only with other consumers — dual CAS separates these two contention domains so neither side blocks the other, delivering 6.5x throughput over a single global lock.
- A single-lock MPMC with 8 producers and 8 consumers collapses to 800K ops/sec because all 16 threads serialize through one mutex; dual CAS sustains 5.2M ops/sec by allowing producer CAS and consumer CAS to proceed concurrently on different cache lines.
- Per-slot sequence numbers serve double duty in MPMC: they prevent producers from overwriting slots the consumer hasn't read yet, and prevent consumers from reading slots the producer hasn't finished writing, all without any lock coordination.
- The locked MPMC wastes 80% of CPU on lock waiting and context switches versus 35% for lock-free, which on a 16-thread system is the difference between burning ~13 cores on overhead versus ~5.6 — often the deciding factor for fitting on a single machine.
- MPMC lock-free queues only justify their complexity at 4+ producers AND 4+ consumers with high throughput needs; for simpler topologies, SPSC or MPSC patterns deliver better performance with less coordination machinery.

---

## Lock-Free MPSC Queues in Java

- **URL**: https://techishthoughts.com/posts/2025/11/off-heap-mpsc-queue/
- **Published**: 2025-11-17
- **Authors**: arthur-costa
- **Reading Time**: 18 minutes
- **Series**: Off-Heap Algorithms in Java (Part 3)
- **Tags**: Java, Off-Heap, Concurrency, Performance, High-Frequency Trading, VarHandle, MPSC Queue, CAS
- **Categories**: Engineering

How to replace locked many-producer queues with a lock-free Multi-Producer Single-Consumer ring buffer coordinated entirely by CAS and sequence numbers.

### Key Takeaways

- Lock-based MPSC queues exhibit negative scaling — throughput drops from 4.2M to 1.7M ops/sec as producers increase from 1 to 8 — while lock-free CAS coordination scales positively from 4.5M to 8.3M ops/sec because threads compete in parallel rather than serializing behind a mutex.
- CAS atomically claims a write slot so each producer can write independently after winning its index; per-slot sequence numbers then solve the visibility gap where a slow producer might not finish writing before the consumer tries to read that slot.
- Progressive backoff (Thread.onSpinWait under light contention, Thread.yield under heavy) prevents CAS retry storms from devolving into livelock while keeping the common-case cost near zero since most operations succeed on the first or second attempt.
- Replacing ReentrantLock with CAS-based MPSC in a production order pipeline cut missed orders from 50,000/day to 200/day — same strategies, same hardware, 99.6% fewer failures purely from eliminating lock contention spikes at the queue level.
- The practical threshold for lock-free MPSC is 4+ producers at 5M+ ops/sec; below that, ReentrantLock is simpler to maintain and the coordination overhead doesn't dominate the latency budget.

---

## Off-Heap Algorithms in Java: The Ring Buffer Foundation

- **URL**: https://techishthoughts.com/posts/2025/11/off-heap-ring-buffer/
- **Published**: 2025-11-15
- **Authors**: arthur-costa
- **Reading Time**: 20 minutes
- **Series**: Off-Heap Algorithms in Java (Part 1)
- **Tags**: Java, Off-Heap, Concurrency, Performance, High-Frequency Trading, FFM API, Ring Buffer
- **Categories**: Engineering

From a naive heap-based queue to an off-heap ring buffer with dramatically better throughput, tail latency, and GC behavior for high-frequency trading workloads.

### Key Takeaways

- Moving a ConcurrentLinkedQueue-based aggregator to an off-heap ring buffer using Java 21's FFM API delivers 58x throughput at 8 threads and 772x better p99.9 latency — the gains come from eliminating GC pauses and pointer-chasing cache misses simultaneously.
- Heap-based queues degrade with more threads (4M to 1.15M ops/sec from 1 to 8 threads) while the off-heap ring buffer scales linearly (10.78M to 67.41M), because contiguous native memory eliminates the per-object allocation churn that triggers GC storms under concurrency. - The naive approach burns 15.4% of CPU time on garbage collection (925ms of GC pauses per minute) versus 0.013% for off-heap (0.8ms per minute) — for latency-sensitive systems, no amount of business-logic optimization can recover that budget. - Sequential off-heap layout achieves ~4 cycles per access versus ~600 cycles for heap pointer-chasing, because the CPU prefetcher detects the stride pattern and loads cache lines in advance rather than stalling on three levels of indirection per entry. - Off-heap is not universally better — it trades JVM safety features (heap dumps, bounds checking, automatic GC) for performance, so the rule of thumb is to adopt it only when GC pauses appear in your p99/p99.9 latency profiles. --- ## The Actor Model on the JVM: Part 3 - The Final Chapter - **URL**: https://techishthoughts.com/posts/2025/07/actor-model-jvm-part-3-final-chapter/ - **Published**: 2025-07-19 - **Authors**: arthur-costa - **Reading Time**: 11 minutes - **Series**: Actor Model on the JVM (Part 3) - **Tags**: Actor Model, Akka, Apache Pekko, Java, Scala, Concurrency, WebSocket, Testing - **Categories**: Engineering Complete implementation guide to the Actor Model with advanced patterns, testing strategies, and real-world lessons learned from building scalable concurrent systems. ### Key Takeaways - Apache Pekko preserves full API compatibility with Akka 2.6.x under Apache 2.0 licensing, giving teams a zero-rewrite migration path away from Lightbend's commercial model. - Actors manage state by returning new behaviors with updated data on each message, making state transitions explicit and race-condition-free without any synchronization primitives.
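The behavior-returning style in the last takeaway can be shown without any actor framework. This hand-rolled `Behavior` interface is an illustration of the idea only, not Akka/Pekko's actual typed API; all names are invented here.

```java
import java.util.ArrayList;
import java.util.List;

// "State as returned behavior": the handler for one message returns the
// behavior that will handle the next message, so the running total is never
// a mutable field and there is nothing for two threads to race on.
interface Behavior<M> {
    Behavior<M> receive(M msg);
}

class CounterDemo {
    // counting(total) handles one delta, then becomes counting(total + delta)
    static Behavior<Integer> counting(int total, List<Integer> log) {
        return delta -> {
            int next = total + delta;
            log.add(next);                 // observable trace of each transition
            return counting(next, log);
        };
    }

    public static void main(String[] args) {
        List<Integer> log = new ArrayList<>();
        Behavior<Integer> b = counting(0, log);
        b = b.receive(5);
        b = b.receive(7);
        System.out.println(log);           // prints [5, 12]
    }
}
```

In a real actor runtime the framework, not the caller, feeds each message to the current behavior and installs the returned one, but the state-transition discipline is exactly this.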
- Supervision hierarchies are the actor model's killer feature for production resilience — parent actors encode failure policies (restart, resume, stop, escalate) as composable strategies rather than scattering try/catch throughout business logic. - The ask pattern introduces implicit synchronization and should be used sparingly; prefer tell with message correlation IDs to preserve the asynchronous, non-blocking nature of actor communication. - Growing mailbox sizes are the canary in the coal mine for actor systems — they signal backpressure failures that require circuit breakers or load shedding, not bigger queues. --- ## Understanding Communication Protocols: A Comprehensive Guide - **URL**: https://techishthoughts.com/posts/2025/06/understanding-communication-protocols/ - **Published**: 2025-06-13 - **Authors**: arthur-costa - **Reading Time**: 20 minutes - **Tags**: Protocols, Networking, OSI Model, HTTP, TCP, UDP, Software Architecture - **Categories**: Technology Comprehensive analysis of communication protocols from historical perspectives to modern implementation considerations across all network layers. ### Key Takeaways - The 1978 split of the original TCP specification into separate TCP and IP layers established the separation of concerns (routing vs. reliable delivery) that still underpins every modern network stack. - HTTP/2's head-of-line blocking exists because all multiplexed streams share a single TCP connection — HTTP/3 fixes this by multiplexing over independent QUIC streams on UDP, so one lost packet stalls only its own stream. - Protocol selection is a trade-off triangle between reliability, latency, and overhead: TCP guarantees order and delivery at the cost of handshake round-trips, while UDP sacrifices guarantees for speed — QUIC bridges the gap with built-in TLS 1.3 and 0-RTT resumption.
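The UDP corner of that trade-off triangle fits in a few lines of Java: no handshake before sending, and the receiver has to bound its own wait because nothing guarantees delivery. Port selection and the payload below are arbitrary demo values.

```java
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;

// Sends one datagram over loopback and waits, with an explicit timeout,
// for it to arrive. At no point is a connection established: UDP is
// fire-and-forget, and reliability (if wanted) is the application's job.
class UdpSketch {
    static String pingLoopback() throws Exception {
        try (DatagramSocket receiver = new DatagramSocket(0);   // OS-assigned port
             DatagramSocket sender = new DatagramSocket()) {
            byte[] payload = "ping".getBytes(StandardCharsets.UTF_8);
            sender.send(new DatagramPacket(payload, payload.length,
                    InetAddress.getLoopbackAddress(), receiver.getLocalPort()));

            receiver.setSoTimeout(2000);                        // UDP promises nothing; bound the wait
            DatagramPacket in = new DatagramPacket(new byte[64], 64);
            receiver.receive(in);
            return new String(in.getData(), 0, in.getLength(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(pingLoopback());
    }
}
```

On loopback the datagram essentially always arrives; across a real network, the same code silently loses, reorders, or duplicates packets, which is exactly the guarantee TCP adds round-trips to provide.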
- gRPC with Protocol Buffers is the dominant inter-service protocol for microservices because binary serialization yields smaller payloads and faster parsing than JSON, while generated stubs enforce schema contracts across language boundaries. - MQTT's publish-subscribe model with minimal packet overhead makes it the de facto standard for IoT communication where thousands of resource-constrained devices must push telemetry over unreliable, low-bandwidth links. --- ## Building an Escrow Marketplace Smart Contract: A Beginner's Guide to Solidity - **URL**: https://techishthoughts.com/posts/2025/04/building-escrow-marketplace-smart-contract/ - **Published**: 2025-04-05 - **Authors**: gabriel-jeronimo - **Reading Time**: 10 minutes - **Series**: Blockchain Foundations (Part 2) - **Tags**: Blockchain, Ethereum, Solidity, Smart Contracts, DeFi, Foundry, Web3 - **Categories**: Technology Learn how to build a secure escrow marketplace smart contract using Solidity and Foundry, enabling trustless transactions between buyers and sellers. ### Key Takeaways - The checks-effects-interactions pattern is the primary reentrancy defense in Solidity — update all contract state before making external calls so a reentrant invocation sees already-modified balances. - Smart contract escrow replaces a trusted third-party custodian with deterministic code: funds are locked on purchase, released atomically on delivery confirmation, and refundable by deadline expiry — no human intermediary needed. - Foundry's vm.prank and vm.warp enable testing caller identity and time-dependent logic (like delivery deadlines) in a single deterministic test run, which is critical since deployed contracts are immutable. - Struct packing matters for gas cost — reordering fields so uint128, uint64, uint32, and uint8 share storage slots can significantly reduce deployment and per-transaction costs on Ethereum. 
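The escrow lifecycle above can be sketched as a plain state machine, rendered here in Java rather than Solidity to match the rest of this index's examples; the state names, method names, and deadline handling are illustrative, not the article's contract. The require-then-mutate ordering mirrors the checks and effects halves of checks-effects-interactions (there are no external calls to interact with here).

```java
import java.time.Instant;

// Escrow as a state machine: funds lock on purchase, release on delivery
// confirmation, and become refundable once the deadline passes.
final class Escrow {
    enum State { AWAITING_PAYMENT, LOCKED, RELEASED, REFUNDED }

    private State state = State.AWAITING_PAYMENT;
    private final Instant deadline;

    Escrow(Instant deadline) { this.deadline = deadline; }

    void purchase() {                       // buyer locks funds
        require(state == State.AWAITING_PAYMENT, "already funded");
        state = State.LOCKED;
    }

    void confirmDelivery() {                // releases funds to the seller
        require(state == State.LOCKED, "nothing escrowed");
        state = State.RELEASED;
    }

    void refund(Instant now) {              // buyer recovers funds after expiry
        require(state == State.LOCKED, "nothing escrowed");
        require(now.isAfter(deadline), "deadline not reached");
        state = State.REFUNDED;
    }

    State state() { return state; }

    private static void require(boolean cond, String msg) {
        if (!cond) throw new IllegalStateException(msg);  // Solidity's require()
    }

    public static void main(String[] args) {
        Escrow e = new Escrow(Instant.now().plusSeconds(3600));
        e.purchase();
        e.confirmDelivery();
        System.out.println(e.state());      // prints RELEASED
    }
}
```

Every transition checks its precondition before touching state, so an out-of-order call fails loudly instead of leaving funds in an ambiguous state, which is the property the on-chain version depends on.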
--- ## The Actor Model on the JVM: Part 2 - The Pitfalls of Shared State - **URL**: https://techishthoughts.com/posts/2025/03/actor-model-jvm-part-2-pitfalls-shared-state/ - **Published**: 2025-03-02 - **Authors**: arthur-costa - **Reading Time**: 20 minutes - **Series**: Actor Model on the JVM (Part 2) - **Tags**: Java, Concurrency, Multithreading, Actor Model, Performance, Synchronization - **Categories**: Engineering Deep dive into the specific problems that arise when dealing with shared mutable state in multithreaded environments and why traditional synchronization approaches fall short. ### Key Takeaways - Race conditions, deadlocks, livelocks, and starvation are the four distinct failure modes of shared mutable state — each requires a different mitigation strategy under traditional locking, but the Actor Model eliminates all four by design. - The mental model problem is the real scalability killer: with 4 threads and 3 operations each there are 369,600 possible interleavings (12!/(3!)^4), making exhaustive reasoning about correctness practically impossible. - Java's AtomicInteger and volatile solve single-variable visibility but fall short for compound invariants — any operation spanning multiple fields still needs explicit synchronization or a fundamentally different concurrency model. - Adding a distributed cache to a multithreaded service compounds the consistency problem by introducing cross-instance staleness on top of intra-process race conditions, doubling the coordination surface area. - An actor-based ticket service processes purchase messages sequentially within the actor, eliminating the check-then-act race condition without any synchronization keywords or lock ordering concerns.
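The arithmetic behind interleaving counts like the one in the mental-model takeaway is the multinomial coefficient (T·M)! / (M!)^T for T threads running M operations each; a few lines of Java make such figures checkable (class and method names are ours).

```java
import java.math.BigInteger;

// Counts the distinct interleavings of T threads with M operations each:
// of the (T*M)! total orderings, each thread's internal order is fixed,
// which divides out (M!) per thread.
class Interleavings {
    static BigInteger factorial(int n) {
        BigInteger r = BigInteger.ONE;
        for (int i = 2; i <= n; i++) r = r.multiply(BigInteger.valueOf(i));
        return r;
    }

    static BigInteger count(int threads, int opsPerThread) {
        return factorial(threads * opsPerThread)
                .divide(factorial(opsPerThread).pow(threads));
    }

    public static void main(String[] args) {
        System.out.println(count(4, 3));   // 12!/(3!)^4 = 369600
        System.out.println(count(3, 4));   // 12!/(4!)^3 = 34650
    }
}
```

The growth is brutal: bumping either parameter by one multiplies the count by orders of magnitude, which is why exhaustive reasoning about lock-based code stops scaling almost immediately.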
--- ## The Actor Model on the JVM: Part 1 - OOP and the Rise of Concurrency Challenges - **URL**: https://techishthoughts.com/posts/2025/02/actor-model-jvm-part-1-oop-concurrency-challenges/ - **Published**: 2025-02-17 - **Authors**: arthur-costa - **Reading Time**: 23 minutes - **Series**: Actor Model on the JVM (Part 1) - **Tags**: Actor Model, Java, Concurrency, OOP, Distributed Systems, Akka, Apache Pekko, Scala - **Categories**: Engineering Explore the evolution of Object-Oriented Programming and its challenges in concurrent programming, setting the stage for understanding the Actor Model as a solution. ### Key Takeaways - OOP's encapsulation promise breaks under concurrency — private fields protect against misuse by other classes but offer zero protection against simultaneous access by multiple threads. - Asynchronous does not mean parallel: async/await on a single thread merely yields during I/O waits, while true parallelism requires multiple threads or cores executing simultaneously. - Synchronized blocks trade throughput for correctness — every lock-protected method becomes a serialization point that limits scalability proportionally to contention. - The Actor Model sidesteps shared-state hazards entirely by giving each actor exclusive ownership of its state and restricting all inter-actor communication to asynchronous message passing. --- ## What is Blockchain, How Does It Work and Why Does It Matter? - **URL**: https://techishthoughts.com/posts/2024/12/what-is-blockchain-how-does-it-work/ - **Published**: 2024-12-11 - **Authors**: gabriel-jeronimo - **Reading Time**: 3 minutes - **Series**: Blockchain Foundations (Part 1) - **Tags**: Blockchain, Cryptocurrency, Decentralization, Bitcoin, Ethereum, Smart Contracts, DeFi, Web3 - **Categories**: Technology Understanding Blockchain: The Technology Behind a Decentralised Future. Comprehensive guide to blockchain technology, its mechanisms, and real-world impact.
### Key Takeaways - Blockchain replaces centralized trust with cryptographic proof — each block embeds the previous block's hash, making historical tampering computationally infeasible because altering one block invalidates every subsequent hash. - Proof-of-Stake cuts energy consumption by ~99% versus Proof-of-Work while introducing economic security through slashing — validators risk losing staked funds for malicious behavior rather than burning electricity. - The blockchain trilemma (decentralization, security, scalability) forces explicit architectural trade-offs, which Layer 2 solutions like rollups and payment channels address by moving execution off-chain while settling on-chain. - Smart contracts turn blockchains from passive ledgers into programmable platforms — DeFi protocols recreate lending, trading, and insurance without intermediaries by encoding financial logic directly in contract code. - Asset tokenization via smart contracts enables fractional ownership of traditionally illiquid assets like real estate, lowering investment barriers and increasing liquidity through on-chain transferability.
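The tamper-evidence claim in the first takeaway can be demonstrated with nothing but SHA-256: because each block's hash covers the previous one, editing an early payload changes every later hash. The payloads and genesis placeholder below are illustrative.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HexFormat;

// Minimal hash chain: block i's hash is SHA-256(prevHash + payload_i), so a
// change anywhere upstream ripples through every downstream hash.
class HashChain {
    static String sha256(String s) throws Exception {
        byte[] d = MessageDigest.getInstance("SHA-256")
                .digest(s.getBytes(StandardCharsets.UTF_8));
        return HexFormat.of().formatHex(d);
    }

    static String[] chain(String[] payloads) throws Exception {
        String[] hashes = new String[payloads.length];
        String prev = "0".repeat(64);                  // genesis placeholder
        for (int i = 0; i < payloads.length; i++) {
            hashes[i] = sha256(prev + payloads[i]);    // commit to all history
            prev = hashes[i];
        }
        return hashes;
    }

    public static void main(String[] args) throws Exception {
        String[] honest   = chain(new String[] {"a->b: 5", "b->c: 2"});
        String[] tampered = chain(new String[] {"a->b: 500", "b->c: 2"});
        // Tampering with block 0 also changes block 1's hash:
        System.out.println(honest[1].equals(tampered[1])); // prints false
    }
}
```

Real chains add proof-of-work or stake on top so that recomputing those downstream hashes is economically infeasible, not merely detectable.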