Incremental Verification and Streams

jonathan.essex · August 8, 2022, 4:14pm

IPLD Streams

Proposal:

Introduce a new DAG element kind Stream. A Stream is a list containing stream elements. A stream element is a union of:

A NonceValue
A Link

A NonceValue is a list of two elements [nonce, data]. The nonce and data values may have any type; however, for a stream to be valid all nonce elements in the stream must have the same type, and that type must be ordered according to the nonce ordering rules (to be defined). In many cases a nonce will be simple integer, and might represent a timestamp, a block height (in a blockchain application), or a byte offset, depending on the application.

All elements in the Stream must be a NonceValue, with the exception of the last element, which must be either a Link, or null. We refer to this as the terminating link.

The Link element of a stream must reference a block in which the root element is a Stream. The nonce values in the referenced precursor must have the same type as the referencing stream (aka the subsequent block).

So far, there is no particular reason why a Stream could not be represented as a struct. However, we now specify the following hard constraints:

The CID of a block which has a Stream as its root element should always be constructed with the same hash algorithm, codec, and base as the terminating link.
The Hash algorithm chosen will always have a Merkle-Damgard construction.
The hash value of the terminating link will be used as the initialization value for calculating the hash of the block. The terminating link does not otherwise contribute to the hash of the block.
The hash of a Stream block is calculated by hashing the stream elements in nonce order
Nonce values in a subsequent block must be greater than any nonce value its precursor block.

Validating these constraints will be significantly easier if the Stream has a distinct node type.

The interface of an IPFS node will be enhanced with the following methods:

Rewind (cid, nonce) -> [ Stream, cid ]
Listen (cid, timeout) -> [ [ Stream, … ], cid ]

The Rewind method implements an optimized search for the first stream block in the sequence of precursors which has a nonce value less than the specified nonce. If no such block exists, the last block in the sequence (the block containing a null precursor) will be returned. Both the matched block and its CID are returned.

The Listen method returns all known blocks subsequent to the specified block in nonce order (the block with the greatest nonce will be returned last). If no subsequent blocks are available, the call will wait for the specified timeout period. If the node discovers any new subsequent blocks before the timeout expires, those blocks will be returned. The CID of the most recent block is returned at the end of the list.

The construction has some useful properties:

The CID of the most recent block in the chain is independent of the way that data is distributed amongst is precursor blocks; a single block containing all the stream elements will have the same hash as the head of a large chain of blocks containing the same stream elements.
The sequence of blocks returned by the Listen method is incrementally verifiable; the terminating link in each subsequent Stream block can be used to verify the data in the precursor block.
Consequent to these first two properties, a call to Rewind can either walk an existing static chain of stream blocks, or dynamically create a sequence of stream blocks optimized for the nature of the request; for example, thin clients might set a low limit on the maximum block size.
Nonce information encoded in the stream enables IPFS nodes to create caches of stream data that are known to be complete for a given nonce range. Thus some searches within the stream which are also bound by a given nonce range may be conclusively answered from cached data.

expede · August 9, 2022, 6:39am

Yeah, totally makes sense

There’s also some related work has happened with Streaming CAR files

I wonder if a skip list or finger tree would also be useful, since a linked-list has relatively terrible performance unless you specifically need linear access

Is it a nonce in the truest sense, or more of a payload in the data field? Could you also have more than the one link, where the first is the pointer to the next element in the chain, but the other(s) are payload (if it won’t all fit in the data field)?

jonathan.essex · August 9, 2022, 1:10pm

I’m not sure that nonce is necessarily the right word; It’s an ‘ordered property which can (but need not) be mapped to some real-world value’. In my initial draft I used ‘timestamp’ but felt that was too specific as ‘block height’ was also very much on my mind. I deliberately left the possibility of non-primitive nonces open for the blockchain tracking use case where forked streams are a possibility.

Some kind of index for the nonce value definitely needed to avoid having to walk the linked list (indeed, that is part of the point of introducing the nonce). I was thinking a node might maintain something with a very low overhead like a BRIN index for each stream it is tracking.