Reconstructing the NASDAQ book: ITCH 5.0 decode and a lock-free SPSC pipeline

hft-orderbook is a C++17 engine that replays or receives a raw NASDAQ TotalView-ITCH 5.0 feed and rebuilds the limit order book for every symbol in the stream, live. This note covers the four design decisions I would defend in a code review: how the bytes are decoded, how the two transports share one decoder, how the price-level store was chosen by measurement instead of taste, and how the thread hand-off stays wait-free and provably race-free.

1. A reconstructor, not a matching engine

ITCH is a post-match feed: the exchange has already crossed the orders, and the stream tells you what happened - an add, an execute, a cancel, a delete, a replace, each carrying an order reference number. So the hot path is not price-time matching logic; it is an O(1) lookup from order reference to the order's price level, followed by a size adjustment on that level. Getting that mental model right early simplifies everything downstream: the book is a deterministic function of the message sequence, which also makes it perfectly replayable for tests and for the synthetic session that powers the browser viewer.

2. Decode by byte assembly, never by struct cast

The tempting shortcut with a binary feed is to reinterpret_cast the buffer to a packed struct. ITCH punishes that twice: the fields are big-endian, and the layouts have misaligned multi-byte fields (a 6-byte timestamp at offset 5, prices at odd offsets). A struct cast is undefined behaviour on alignment and silently wrong on a little-endian host. The whole decoder instead goes through three tiny readers:

src/itch/decoder.cpp

// Big-endian (network order) field readers. Portable: no reliance on host
// endianness, alignment, or struct packing.
inline std::uint16_t rd16(const std::uint8_t* p) {
    return static_cast<std::uint16_t>((std::uint16_t(p[0]) << 8) | p[1]);
}
inline std::uint64_t rd_be(const std::uint8_t* p, int n) {
    std::uint64_t v = 0;
    for (int i = 0; i < n; ++i) v = (v << 8) | p[i];
    return v;
}
inline std::uint64_t rd48(const std::uint8_t* p) { return rd_be(p, 6); }

Compilers see straight through this - the shifts fold into single byte-swap instructions - so the safety costs nothing. Every message type also validates its exact on-wire length before any field is read, which is what lets the libFuzzer harness hammer the decoder with garbage without ever finding an out-of-bounds read.

3. Two transports, one decoder

The same decoder serves two framings. BinaryFILE is the offline shape: captured session files where each message carries a 2-byte length prefix - ideal for deterministic replay and benchmarking. MoldUDP64 is the live shape: UDP datagrams carrying a session id, a sequence number and a block count. UDP can drop or reorder, so the Mold layer tracks the expected sequence number and surfaces gap detection the moment a packet goes missing - the difference between an honest book and a quietly wrong one. Keeping the transport concern (framing, sequencing) separate from the decode concern means the fuzzer, the replay tests and the live UDP path all exercise the identical message-parsing code.

4. The price-level store: measured, not assumed

Every book needs a structure mapping price to aggregate size, sorted, with fast best-bid and best-ask. The default answer is std::map. The repo ships three implementations behind one interface, parity-tested against each other so they cannot drift:

  • std::map - the baseline; pointer-chasing red-black tree.
  • a sorted flat vector - contiguous, cache-friendly, but O(n) inserts away from the touch.
  • a windowed array indexed by price tick around the inside - the canonical L2 structure.

On the recorded benchmark run, the windowed array came out about 24% faster than the map baseline - because order flow clusters tightly around the best prices, so almost every update lands in a small, hot, contiguous window. The honest caveat from the README applies: those are relative, same-runner figures from virtualized CI hardware - a fair A/B, not an absolute latency claim. The point of the exercise is the discipline: all three stores stay in the tree, parity-checked, so the trade-off is measured rather than folklore.

5. The hand-off: a wait-free SPSC ring

Decode and book-building run on separate pinned threads, connected by a bounded single-producer / single-consumer ring. No locks, no CAS loops - just two monotonic counters with acquire/release ordering:

include/core/spsc_ring.hpp

bool push(const T& item) {
    const std::size_t head = head_.load(std::memory_order_relaxed);
    const std::size_t next = head + 1;
    if (next - tail_cache_ > cap_) {                 // maybe full -> refresh
        tail_cache_ = tail_.load(std::memory_order_acquire);
        if (next - tail_cache_ > cap_) return false;  // genuinely full
    }
    slots_[head & mask_] = item;
    head_.store(next, std::memory_order_release);     // publish the slot
    return true;
}

The details that matter: capacity is rounded to a power of two so the wrap is a single AND; head_ and tail_ live on separate alignas(64) cache lines so the two cores never false-share; and each side keeps a cached copy of the other's counter, only paying the cross-core acquire when the ring looks full or empty. The consumer spins with a PAUSE-instruction busy-wait rather than sleeping, which is the right trade when the next message is microseconds away.

Correctness is not vibes-based either: the CI matrix runs the full pipeline under ThreadSanitizer (data races), ASan/UBSan (memory and UB), a libFuzzer target on the decoder, and a clang -Werror build. A lock-free structure without a sanitizer gate is a future incident report.

6. Watching it work

The L2 viewer renders the result: the depth ladder, the cumulative depth chart, micro-price and imbalance signals, and the trade tape, streamed as JSON snapshots over a dependency-free WebSocket. The in-browser demo replays a synthetic session (generated by gencap, run through the real engine); pass ?ws= to watch a local wsbook live. It is the same reconstruction code the benchmarks measure, replaying a synthetic session.

What I would do next

Three candidates, in order: kernel-bypass receive (AF_XDP or a userspace stack) to take the syscall out of the UDP path; a binary snapshot protocol for the viewer instead of JSON; and a nanosecond-resolution histogram of decode-to-book latency per message type, published with each CI run so regressions show up as numbers instead of feelings.