A collection of papers with approaches for solving “Set Reconciliation”. Feel free to comment and we’ll add your own!
The problem statement is: Two machines without shared prior context both store a set of items. After set reconciliation they should know their set difference. Commonly optimized are:
- protocol bandwidth: Most algorithms’ total bytes sent is dependent on the size of the difference between the sets, not the size of the sets themselves.
- protocol latency: Most algorithms try to find the difference after a single round-trip of communication. I.e. sending a single message should suffice for the receiver to know their difference.
What’s the difference?: efficient set reconciliation without prior context
We describe a synopsis structure, the Difference Digest, that allows
two nodes to compute the elements belonging to the set difference
in a single round with communication overhead proportional to the
size of the difference times the logarithm of the keyspace.
Set reconciliation using Invertible Bloom Filters (IBFs). Their size must be approximated using strata estimators beforehand.
Bandwidth-Efficient Transaction Relay in Bitcoin
We propose a new transaction dissemination protocol, Erlay, that not only reduces the bandwidth consumption by 40% assuming current connectivity, but also keeps the bandwidth use almost constant as the connectivity increases. In contrast, the existing protocol increases the bandwidth consumption linearly with the number of connections. By allowing more connections at a small cost, Erlay improves the security of the Bitcoin network. And, as we demonstrate, Erlay also hardens the network against attacks that attempt to learn the origin node of a transaction.
Origins in error-correction codes apparently. Similar to the paper above, which has origins in tornado codes.
Set Reconciliation with Nearly Optimal Communication Complexity
The communication complexity of these set reconciliation protocols is independent of the sizes of the hosts’ sets, and instead depends only on the size of the difference between the two sets. Moreover, under certain conditions, set reconciliation can be achieved non-interactively, with just a single message. Thus, a host A could broadcast a kb-bit message, and every host B_i, whose set differs from A’s set by at most k bit-strings (each of length b) could recover the bit-strings it is missing. This works even if each host B_i is missing a different set of bit-strings, so that the total number of distinct bit-strings recovered is much larger than k.
This paper is great. It also has a section on the information-theoretic limits of set-reconciliation efficiency, as well as a great “related works” section that goes into why the problem of set reconciliation and error correction are both closely related.
Fast Approximate Reconciliation of Set Differences
We motivate the performance tradeoffs between message size, accuracy and computation time for this problem with a straightforward approach using Bloom filters. We then introduce approximation reconciliation trees, a more computationally efficient solution that combines techniques from Patricia tries, Merkle trees, and Bloom filters. We present an analysis of approximation reconciliation trees and provide experimental results comparing the various methods proposed for approximate reconciliation
Data synchronization–long a staple in file systems–is emerging as a significant communications primitive. In a distributed system, data synchronization resolves differences among distributed sets of information. In named data networking (NDN), an information-centric communications architecture, data synchronization between multiple nodes is widely used to support basic services, such as public key distribution, file sharing, and route distribution. While existing NDN synchronization schemes are functional, their implementations rely on log-based representations of information, which creates a limitation on their performance and scalability. This paper presents iSync, a high performance synchro- nization protocol for NDN. iSync supports efficient data reconciliation by representing the synchronized datasets using a two-level invertible Bloom filter (IBF) structure. A set- differences can be found by subtracting a remote IBF from a local IBF. The protocol can obtain multiple differences from a single round of data exchange, and does not require prior context in most application scenarios. We evaluated iSync’s performance by comparing it to the CCNx synchronization protocol. Experiments show that iSync is about eight times faster across a range of network topologies and sizes, and that it reduces the number of packets sent by about 90%
We introduce the hash history mechanism for capturing dependencies among distributed replicas. Hash histories, consisting of a directed graph of version hashes, are independent of the number of active nodes but dependent on the rate and number of modifications. We present the basic hash history scheme and discuss mechanisms for trimming the history over time. We simulate the efficacy of hash histories on several large CVS traces. Our results highlight a useful property of the hash history: the ability to recognize when two different non-commutative operations produce the same output, thereby reducing false conflicts and increasing the rate of convergence. We call these events coincidental equalities and demonstrate that their recognition can greatly reduce the time to global convergence.
The Distributed Bloom Filter is a space-efficient, probabilistic data structure designed to perform more efficient set reconciliations in distributed systems. It guarantees eventual consistency of states between nodes in a system, while still keeping bloom filter sizes as compact as possible. The eventuality can be tweaked as desired, by tweaking the distributed bloom filter’s parameters. The scalability, as well as accuracy of the data structure is made possible by combining two novel ideas: The first idea introduces a new, computationally inexpensive way for populating bloom filters, making it possible to quickly compute new bloom filters when interacting with peers. The second idea introduces the concept of unique bloom filter mappings between peers. By applying these two simple ideas, one can achieve incredibly bandwidth- efficient set reconciliation in networks. Instead of trying to minimize the false positive rate of a single bloom filter, we use the unique bloom filter mappings to increase the probability for an element to propagate through a network. We compare the standard bloom filter with the distributed bloom filter and show that even with a false positive rate of 50%, i.e. even with a very small bloom filter size, the distributed bloom filter still manages to reach complete set reconciliation across the network in a highly space-efficient, as well as time-efficient way.
Byzantine Eventual Consistency and the Fundamental Limits of Peer-to-Peer Databases
Sybil attacks, in which a large number of adversary-controlled nodes join a network, are a concern for many peer-to-peer database systems, necessitating expensive countermeasures such as proof-of-work. However, there is a category of database applications that are, by design, immune to Sybil attacks because they can tolerate arbitrary numbers of Byzantine-faulty nodes. In this paper, we characterize this category of applications using a consistency model we call Byzantine Eventual Consistency (BEC). We introduce an algorithm that guarantees BEC based on Byzantine causal broadcast, prove its correctness, and demonstrate near-optimal performance in a prototype implementation.
Specifically Section 5.3 “Reducing the number of round trips”:
A downside of Algorithm 1 is that the number of round trips can
be up to the length of the longest path in the predecessor graph,
making it slow when performing reconciliation over a high-latency
network. We now show how to reduce the number of round-trips
using Bloom filters  and a small amount of additional state.
However, that protocol is only efficient in case the reconciling machines have prior context. I.e. this is the “remember what you already told someone” idea, optimized using bloom filters. Interestingly this is based not on hashes of all blocks, but only the hashes of “heads”, i.e. root CIDs. I’d need to dive deeper to really understand.