Making CRDTs Byzantine Fault Tolerant - Martin Kleppmann

matheus23 · March 4, 2022, 3:51pm

Abstract:

It is often claimed that Conflict-free Replicated Data Types
(CRDTs) ensure consistency of replicated data in peer-to-
peer systems. However, peer-to-peer systems usually con-
sist of untrusted nodes that may deviate from the specified
protocol (i.e. exhibit Byzantine faults), and most existing
CRDT algorithms cannot guarantee consistency in the pres-
ence of such faults. This paper shows how to adapt existing
non-Byzantine CRDT algorithms and make them Byzantine
fault-tolerant. The proposed scheme can tolerate any num-
ber of Byzantine nodes (making it immune to Sybil attacks),
guarantees Strong Eventual Consistency, and requires only
modest changes to existing CRDT algorithms.

PDF:

Quinn · March 5, 2022, 12:09am

Thanks for sharing this; it’s super relevant to what we’ve been thinking about!

I read through the paper earlier today and put my notes in Roam, but since not everyone has access to that, I wanted to copy them here:

Versions vectors are unsuitable for BFT because a malicious node can produce multiple distinct updates with the same sequence number

Calls out three types of problematic operation based CRDT updates:

For CRDTs that require unique identifiers on updates (logoot, Treedoc), if multiple updates have the same identifier it can lead to divergence
- Solved by using the CID of an update as its ID
- For CRDTs that require tracking causal dependencies (RGA) as metadata on updates, a malicious node can lie about this metadata and cause divergence
- Solved by referencing the last update by a node + all new events received since that update in any new updates
- Then an update is only valid if all of those dependencies have been received
- For CRDTs with semantic constraints on updates, like that a predecessor must precede a successor (as in YATA), a malicious node can violate these semantics
- Solved by the same approach as above: since updates reference their causal dependencies, you can walk backward from a successor until the predecessor is found

Basically, his solution to all of these problems is content-addressing + DAGs, so it’s all very relevant, but also very close to what we’ve already been discussing

The main difference I can see is that he’s targeting the traditional case, of needing to ensure that every replica receives the same set of updates and converges. In our case we won’t converge, since different agents will have access to different updates (encrypted tuple stores)