Use cases for Disruptor
- Note that the key is BALANCED FLOW - if your flow is unbalanced, you need to weigh the cost of losing the local L3 cache against the benefit of reusing cores
RING BUFFER
The Ring Buffer is a bounded, pre-allocated data structure, and the allocated data elements will exist for the life of the Disruptor instance
Per Daniel Spiewak, a ring buffer was considered for the core data structure in Clojure before Rich Hickey selected the bit-mapped vector trie
On most processors, there is a high cost for the remainder calculation that determines a sequence number's slot in the ring, but it can be greatly reduced by making the ring size a power of 2: a bit mask of ring size minus one performs the remainder operation efficiently
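A minimal sketch of the mask trick (class and variable names are mine, not the library's):

    public class SlotMask {
        public static void main(String[] args) {
            final int ringSize = 1024;                  // must be a power of 2
            final int mask = ringSize - 1;              // 0x3FF
            long sequence = 5000L;                      // ever-increasing sequence number
            int slowSlot = (int) (sequence % ringSize); // remainder: relatively expensive
            int fastSlot = (int) (sequence & mask);     // a single AND instruction
            System.out.println(slowSlot == fastSlot);   // true whenever ringSize is 2^n
        }
    }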
The data elements are merely storage for the data to be handled, not the data itself
Since the data is allocated all at once on startup, it is highly likely that the memory will be contiguous in main memory and will support effective striding for the caches
When an entry in the ring buffer is claimed by a producer, it copies data into one of the pre-allocated elements
Sequence number % number of slots = slot in use; no pointer to the end is needed, and wrapping is safe because producers and consumers each manage their own sequences
In most Disruptor usages there is only one producer (network IO, file system reads, etc.), which means no contention on sequence entry/allocation; if there is more than one producer, they race each other for slots, using CAS on the sequence number to claim the next available slot
Producers copy data into the claimed element and make it public to consumers by "committing" the sequence
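A rough sketch of the claim/commit cycle; the names are illustrative, not the Disruptor's actual API, and a single producer could replace the CAS loop with a plain long:

    import java.util.concurrent.atomic.AtomicLong;

    class ProducerSideSketch {
        private final AtomicLong claimSequence = new AtomicLong(-1); // last claimed slot
        private final AtomicLong cursor = new AtomicLong(-1);        // last committed slot

        long claim() {
            long next;
            do {
                next = claimSequence.get() + 1;                      // multi-producer path:
            } while (!claimSequence.compareAndSet(next - 1, next));  // race via CAS
            return next;
        }

        void commit(long claimed) {
            // Called after the data has been copied into the pre-allocated entry.
            // (With multiple producers, commits must also happen in claim order;
            // see the producer wait sketch under BATCHING EFFECT.)
            cursor.set(claimed); // volatile write: the entry is now visible to consumers
        }
    }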
Consumers wait for a sequence to become available in the ring buffer before they read the entry using a WaitStrategy defined in the ConsumerBarrier; note that various strategies exist for waiting, and the choice depends on the priority of CPU resource versus latency and throughput
If conserving CPU resource matters more than latency, the consumer can wait on a condition variable, protected by a lock, that is signalled by a producer, which as mentioned before comes with a contention performance penalty
Consumers loop, checking the cursor representing the current available sequence in the ring buffer, which can be done with or without thread yield by trading CPU resource against latency - no lock or CAS to slow it down
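A sketch of that spin loop (illustrative, not the library's WaitStrategy API); the only hot-path cost is a volatile read of the cursor:

    import java.util.concurrent.atomic.AtomicLong;

    class SpinWaitSketch {
        static long waitFor(long sequence, AtomicLong cursor, boolean yieldCpu) {
            long available;
            while ((available = cursor.get()) < sequence) {
                if (yieldCpu) {
                    Thread.yield(); // give up the core: less CPU burn, more latency
                }
                // else: busy spin -- lowest latency, but pins a core at 100%
            }
            return available; // may be well past `sequence`: a whole batch is ready
        }
    }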
Consumers that represent the same dependency share a ConsumerBarrier instance, but only one consumer per ConsumerBarrier can have write access to any field in the entry
SEQUENCING
Basic counter for single producer, atomic int/long for multiple producers (using CAS to protect the counter)
When a producer finishes copying data to a ring buffer element, it "commits" the transaction by updating a separate counter used by consumers to find out the next available data to use
Consumers merely provide a BatchHandler implementation that receives callbacks when data is available for consumption
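Roughly what a handler looks like; the interface below is a stand-in defined here for illustration, since the library's BatchHandler signature may differ:

    interface BatchHandlerSketch<T> {
        void onAvailable(T entry); // called for each committed entry
        void onEndOfBatch();       // called once the current run of entries is drained
    }

    // Pre-allocated entry type: a reused container, never reallocated
    class TradeEntry {
        long price;
        int quantity;
    }

    class TradeHandler implements BatchHandlerSketch<TradeEntry> {
        private long notional;

        @Override public void onAvailable(TradeEntry entry) {
            notional += entry.price * entry.quantity; // read data out of the reused entry
        }

        @Override public void onEndOfBatch() {
            // flush aggregated state downstream once per batch, not once per entry
        }
    }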
Consumers can be constructed into a graph of dependencies representing multiple stages in a processing pipeline
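One way to picture the gating, assuming (per the LMAX writeups) that each consumer tracks its own sequence: a downstream consumer waits on the minimum of its upstream sequences rather than on the producer's cursor:

    import java.util.concurrent.atomic.AtomicLong;

    class DependencyGateSketch {
        static long availableFor(AtomicLong... upstreamSequences) {
            long min = Long.MAX_VALUE;
            for (AtomicLong seq : upstreamSequences) {
                min = Math.min(min, seq.get());
            }
            return min; // the downstream consumer may read entries up to here
        }
    }
    // e.g. business logic gated on a journaller and a replicator:
    //   long safe = DependencyGateSketch.availableFor(journalSeq, replicaSeq);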
Reads and writes of sequences are minimized due to the performance cost of the volatile memory barrier
BATCHING EFFECT
When a consumer falls behind due to latency, it has the ability to process all ring buffer elements up to the last committed by the producer, a capability not found in queues
Lagging consumers can therefore "catch up", increasing throughput and reducing/smoothing latency; near constant time for latency regardless of load, until memory subsystem is saturated, at which point the profile is linear following Little's Law
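A sketch of that catch-up loop, reusing the BatchHandlerSketch interface above (names are illustrative):

    import java.util.concurrent.atomic.AtomicLong;

    class CatchUpSketch<T> {
        private final T[] ring;
        private final int mask;          // ring length must be a power of 2
        private final AtomicLong cursor; // producer's last committed sequence
        private long lastProcessed = -1;

        CatchUpSketch(T[] ring, AtomicLong cursor) {
            this.ring = ring;
            this.mask = ring.length - 1;
            this.cursor = cursor;
        }

        void drain(BatchHandlerSketch<T> handler) {
            long available = cursor.get(); // everything up to here is committed
            while (lastProcessed < available) {
                handler.onAvailable(ring[(int) (++lastProcessed & mask)]);
            }
            handler.onEndOfBatch();        // one flush for the whole batch
        }
    }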
Producers also batch, and can write up to the point in the ring buffer where the slowest consumer is currently working
Producers also need a wait strategy when there is more than one of them: no "commit" to the ring buffer occurs until the current sequence number is the one immediately before the claimed slot
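Both producer-side waits in one illustrative sketch (again, not the real API): checking capacity against the slowest consumer, and committing in claim order:

    import java.util.concurrent.atomic.AtomicLong;

    class ProducerWaitSketch {
        private final int ringSize;
        private final AtomicLong cursor;          // last committed sequence
        private final AtomicLong slowestConsumer; // min sequence across all consumers

        ProducerWaitSketch(int ringSize, AtomicLong cursor, AtomicLong slowestConsumer) {
            this.ringSize = ringSize;
            this.cursor = cursor;
            this.slowestConsumer = slowestConsumer;
        }

        void awaitCapacity(long claimed) {
            // The slot for `claimed` still holds sequence (claimed - ringSize);
            // wait until the slowest consumer has moved past it.
            while (claimed - slowestConsumer.get() > ringSize) {
                Thread.yield();
            }
        }

        void commit(long claimed) {
            while (cursor.get() != claimed - 1) { // wait for earlier producers
                Thread.yield();
            }
            cursor.set(claimed);                  // publish in claim order
        }
    }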
Compare this to the "J" curve effect on latency observed with queues as load increases
DEPENDENCY GRAPHS
With a graph-like model of producers and consumers (such as actors), queues are required to manage the interactions between each of the elements
A single ring buffer replaces all of these queues in one data structure for all of the elements, greatly reducing the fixed costs of execution, increasing throughput and reducing latency
Care must be taken to ensure that state written by independent consumers doesn't result in the false sharing of cache lines
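The classic fix is to pad the hot field out to its own cache line; a sketch, assuming 64-byte lines:

    class PaddedSequenceSketch {
        volatile long value;
        // Seven unused longs push neighbouring fields onto another cache line,
        // so two threads writing two different sequences don't invalidate each
        // other's caches. (A real implementation must also stop the JVM from
        // eliding the apparently-dead fields, e.g. by summing them somewhere.)
        long p1, p2, p3, p4, p5, p6, p7;
    }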
To execute the performance tests, you need a machine that can execute 4 threads simultaneously (I need a box)
EVENT SOURCING
Daily snapshot and restart to clear all memory
Replay events from a snapshot to see what happened when something goes awry
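The replay mechanism in miniature, with hypothetical types:

    import java.util.List;
    import java.util.function.BiFunction;

    class ReplaySketch {
        static <S, E> S recover(S snapshot,
                                List<E> eventsSinceSnapshot,
                                BiFunction<S, E, S> apply) {
            S state = snapshot;
            for (E event : eventsSinceSnapshot) {
                state = apply.apply(state, event); // deterministic re-application
            }
            return state;
        }
    }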
Links:
Blog: Processing 1M TPS with Axon Framework and the Disruptor: http://blog.jteam.nl/2011/07/20/processing-1m-tps-with-axon-framework-and-the-disruptor/
QCon presentation: http://www.infoq.com/presentations/LMAX
Google Group: http://groups.google.com/group/lmax-disruptor
Martin Fowler's Bliki post: http://martinfowler.com/articles/lmax.html
Martin Thompson's Mechanical Sympathy blog: http://mechanical-sympathy.blogspot.com/
Trisha Gee's Mechanitis Blog: http://mechanitis.blogspot.com/
Disruptor Wizard (simplifying dependency wiring): http://github.com/ajsutton/disruptorWizard
My Scala port: http://github.com/jamie-allen/sdisruptor
Presenting at JavaOne 2011