Proof of Space - a Programmer's Guide

jonathan.essex · September 28, 2022, 6:26pm

Proof-of-Space - a Programmer’s Guide

Why is this Hard?

Using disk space is far too easy, as my AWS bill attests every month. Intuitively, it does not seem that proving that I have used a certain amount of disk space should be hard. Yet, I personally have found this one of the hardest things to understand about Filecoin. This article attempts to recapitulate my own journey towards understanding - hopefully without many of the dead ends and iterations. The first question I found myself asking was: why is this so hard?

Everybody’s naïve conception of a proof-of-space is the same:

Create some kind of lookup table for the results of applying an agreed hash function to a range of n values. A challenger can pick a value in that range at random and send the corresponding hash to the prover. The prover can find the hash in their lookup table, and must respond to the challenger with the original value, within some time limit.

This scheme makes the assumption that there are only two ways this is possible: either the prover has enough compute power to perform a brute-force search of the given range for a matching hash, or they have a lookup table which consumes at least as much space as the n keys stored in it.

If this assumption holds, choosing an appropriately hard hash function and an appropriately short time limit would make the brute-force strategy unviable - making a successful challenge/response into a proof that the lookup table exists.

Unfortunately, the assumption doesn’t hold.

For brevity’s sake, let’s define a function hashmod(n,v) = hash(v) mod n. The function hash might need to be instantiated with some bound parameters (such as a specific initialization vector). We’ll just wave our hands and say that’s already happened.

Firstly, we know the that there are only n values in the range. Thus, we know that any value returned from hashmod(n,v) is likely to be unique. It’s not hard to adjust the lookup table to handle the occasional conflict, and the computing power needed to resolve such conflicts (by calculating one of several possible hashes) would be minimal. Thus, the best we can say is that the lookup table is simply big enough to hold n values that are themselves less than n.

Our range size n could be quite large, however. Does this mean that we can use the naïve proof-of-space to prove we are using at least that much space?

Well, no.

We can get away with a lookup table with just m key/value pairs, where m is much less than n. Each value in the table is an index in our range, chosen at random. Each key is the result of iteratively executing v = hashmod(n,v), t times.

When the challenger sends a hash (the challenge), we can calculate challenge mod n and look up the resulting key in the table. If it is… well, does that do us any good? The value stored against this key isn’t one we we are looking for - it is some original random value that, when hashed t times, resulted in the key. However, to recover the index all we need to do is hash that value t-1 times.

If the challenger hash isn’t in the table then all we need to do is take hashmod(n, challenge mod n)… then look that key up in the table. If we find it, hash the related value t-2 times to recover the index.

This process can clearly be repeated up to t times.

If appropriate values for m and t are chosen, this process is highly likely to retrieve a good index for any given hash. (see Hellman’s attack on block cyphers). It’s a classic time/memory trade-off, where the memory requirement is steeply reduced by the additional time required to iteratively invoke the hash function. Since the prover is required to respond to a challenge inside a known time limit, an attacker can simply figure out how many times they can reliably apply the hash function within that time, and create the smallest lookup table which will allow them to fake the responses.

Verifiable Delay Functions

The naïve approach above can be summarized as ‘find a function that is hard enough to calculate that the only way to respond within a given time limit is to pre-calculate the results’. The ‘hard function’ we chose was the inverse of a hash function. By design, this inverse is so hard to calculate that the only way it can be done is to run the hash function over a known domain in order to create the lookup table. The hash function is also (by design) relatively trivial to compute. This is what makes the Hellman attack feasible.

What if, instead, we actually design a function that is hard to compute, but for which it is possible to verify that a given output is correct with relative ease?

This brings us to the world of Verifiable Delay Functions (VDFs), which are exactly that. We define two functions, eval(v) and verifiy(v,r) which need to be instantiated with some public parameters (the formal specification includes a function setup which does exactly that). We say verify(v, eval(v)) should always return true and that it should be very difficult to find any value r for which verify(v,r) returns true otherwise. We also specify that eval should be hard to compute, and verify should be easy, and that we can tune exactly how hard when we call setup.

Given the above, we can easily propose a proof-of-space where for some range sized n of values v, the prover creates a lookup table for r = eval(v) and responds to a challenge v by looking up r in the table. The verifier can then trivially check that the returned r is correct by invoking verify(v,r).

The (considerable) downside of this approach is that we have to construct the lookup table by executing our deliberately slow eval function n times. This may become unworkable if we are reserving large amounts of space - or require that other parameters are adjusted (for example, requiring that a large reservation be challenged more frequently).

Vector Commitments

What if, instead of using a VDF, the table contained genuinely randomized values (perhaps using some external source of entropy)? In principle the prover could publish, ahead of time, a vector commitment to all the random numbers in the table. Then, when challenged with an index, the prover could return the random number in that slot to the challenger, together with some additional information which the challenger can combine with the vector commitment to prove to himself that the random number returned is indeed the correct one.

I’m going to assume that most people reading this already know what a vector commitment is. By way of an inadequate explanation to those who don’t, I’m going to say that one example of such would be a Merkle tree, which is simply a tree structure where nodes are indexed by their hashes, and which has the useful property that for a given set of key/value pairs stored in the tree, the hash of the root node is deterministic and unique. Thus the hash of the root node (the commitment) and list of hashes gathered on the path through the tree to an given key/value pair (a witness) can be used to show that a given key/value pair was included in the tree when the commitment was published. The witness, while sometimes quite large, is much smaller than the entire tree. There are vector commitment schemes that have much smaller witnesses, but the Merkle tree is the easiest understand.

The problem with this simple idea is that, again, it doesn’t work. This time the issue is that we are relying on the prover to fill their table with honestly generated random numbers. Unfortunately the science of Cryptography has provided us with may ways to create tables of numbers which look random but aren’t. If the prover can chose random-looking numbers which can be re-generated very quickly, then they can defeat the algorithm, perhaps just storing a single ‘seed’ in place of the entire lookup table.

Graph Labeling

The fundamental insight behind most real proof-of-space algorithms is that, while we can’t prove that a set of numbers is random, maybe we can prove that the set of numbers was generated in a way that is sufficiently complex to defeat any attempt to regenerate that set on demand. The VDF is one provably hard way to generate numbers that look random, but there are others.

One alternative would be to require that there is some kind of mathematical relationship between the numbers (we will start to call these numbers ‘labels’) and provide an algorithm that allows us to check both that a sampling of labels is in the set, and validate that the relationships between those labels are correct.

The way this is typically realized is to convert our table of nearly-random values into a directed acyclic graph of labels. The label of any given node on this graph is the hash of the labels of its parent nodes. The effort required to generate all the labels is required is clearly related to both the size of the graph and the degree to which nodes are interlinked. Unfortunately, it is also seems possible in principle that when generating the label for an individual node, we could attempt to derive the labels of the parents of that node in parallel. If the soundness of our Proof of Space requires that an attacker cannot regenerate any substantial portion of the label table within the response time window, it seems we should properly assume that if any steps can be conducted in parallel, they will be.

Happily, the hardness of this problem is related to the hardness of a classic problem in graph theory (pebbling games). For now let’s just assert that because the label of a node cannot be calculated until all the labels of its parents have been calculated, the number of purely sequential steps required to calculate a label is related to the depth of the graph, and assume that with a graph that is sufficiently deep and interconnected, the problem is hard enough.

Algorithms of this ‘graph labeling’ type require an additional validation step. Once the prover has generated a commitment to the graph, they must provide a sampling of random node labels and the labels of their parent nodes. The challenger should check in advance that the parent/child relationships are correct (by verifying that the label of any child is indeed the hash of the labels of its parents) and that all the sampled labels are ‘in’ the commitment. We can then use the sampling as a statistical validation of the properties of the graph, hence showing that the labels were generated in a manner sufficiently complex to make recreating them infeasible in the required time window.

Once this setup verification is complete, simple challenges (requests for the label at a given index) can be responded to with a node label and a witness for that label in the commitment.

One of the key insights here is that the full labeling need only be stored for the initial setup verification, after which the challenges and responses utilize only an agreed subset of the labels. The unused labels are deleted. If the graph is sufficiently connected, regenerating any of the labels in the ‘used’ portion of the graph should require regenerating a substantial portion of the deleted labels, which makes it much harder to do.

Stacked Graphs and Bipartite Expanders

The above section skates merrily over the hard part of this, which is how to generate graphs for which the labeling problem is provably hard enough to prevent the regeneration of the graph (or large parts of it) within the response time window.

In the limit, a complete DAG (in which all n nodes are connected to all other nodes with either a parent or child relationship) would require on the order of n^2 hash operations to generate the labeling. Deleting any single label would typically require on the order of n hash operations to regenerate it. This might work better as a proof of space than the naïve implementation. Unfortunately, there’s one node in such a graph that has all the other nodes as a parent, meaning that in the worst case, the size of a proof in the validation step is essentially the complete graph.

At the other limit, we could make our graph a simple linked list (so node 1 is a parent of node 2, node 2 of node 3, and so on) which would obviously require n-1 hash operations to fully label it. The problem with this, obviously, is that any one label can be regenerated with only one hash operation and (for example) if we keep only every 10th node then we can regenerate any of the deleted labels with just 9 hash operations. So while are proofs are pretty dinky, this is no better than the naïve implementation we started with.

One of the earlier graph constructions to be investigated in the context of Proof Of Space is a stacked construction of several layers, where each layer is a ‘Bipartite Expander’ (see [6]). We can define a bipartite expander graph here as one that that connects a set of sources to a distinct set of sinks, where any subset of the sources is connected to more sinks than there are sources in the subset.

[Formally: An (n, α, β) bipartite expander (0 < α < β < 1) is a directed bipartite graph with n sources and n sinks such that any subset of αn sinks are connected to at least βn sources]

We can ‘stack’ bipartite expanders by connecting the sinks of one layer to the sources of the next layer. When applied to a Graph Labeling POS, the intuition is clearly that (by the properties of the bipartite expander) dependencies will be amplified across layers, such that a child in the final layer will be dependent on many (if not all) the labels calculated in the first layer. Thus, all those values would need to be recalculated by a dishonest prover in response to any challenge.

However, while it is easy to create a bipartite expander which has a high degree (by connecting every source to every sink) this has similar problems to the complete DAG in terms of the size of the required proofs. The question, then, is whether we can generate bipartite expanders with some constant degree while still maintaining relatively high values for α, β and thus ensuring the dependency amplification makes life hard for any dishonest prover.

Happily - There’s an algorithm for that - referred to as Chung’s construction.

function BipartiteExpander(width : integer, degree : integer) {
	sinks : [[integer]] // each sink is just a list of the source indices
	edgeCount = width * degree
	permutation = permute([1..edgeCount-1]) // create a random permutation
	for (edge = 0; edge < edgeCount; edge++) {
		sources = sinks[edge mod width] || [] // empty if no sources yet
		sources.push[permutation[edge] mod width] 
	}
	return sinks
}

The explicit creation of the permutation of edge indices in the pseudocode above is used because it makes the algorithm’s intent clear. For a large graph this would be very wasteful; the implied pseudo-random mapping of indices is more or less equivalent to an encryption operation and indeed Filecoin uses a Feistel cypher to this effect.

We can use Chung’s construction iteratively to create a graph of m layers of width n and degree d. There’s quite a bit of math we can apply to figure out the lower bounds for these properties given a desired minimum difficulty, but, from a programmer’s perspective, these are just constants which we are happy to know have some basis in theory as opposed to just being empirically generated.

Unfortunately, it can also be seen that with d much smaller than n a certain amount of parallelism can readily be applied; the parents of a node in one layer may themselves have sets of parents which are disjoint and could reasonably be computed in parallel. This complicates our requirement for a hard time-bound in which a dishonest prover cannot reasonably respond to a challenge.

Depth-Robust Graphs

Depth-Robust Graphs (DRGs), which we will intuitively define as graphs in which most subgraphs exceed some desired depth relative to the number of nodes in the subgraph, prove useful in this respect.

[ Formally: A directed acyclic graph (DAG) on n nodes with d-indegree is (n, α, β, d) depth robust graph if every subgraph of αn nodes contains a path of length at least βn. ]

Clearly it is impossible to create a label until the labels of its parents are known, so the depth of a graph can provide us with a measure of ‘hardness’ which cannot be overcome with brute parallelism. As for the bipartite expander graphs, it’s easier to create a DRG which is highly connected. But again, to constrain the maximum size of a proof, we need graphs that are both depth-robust and have a reasonable upper limit to the number of connections at each node.

There’s an algorithm for that - Filecoin uses the DRSample algorithm from [5], which is proven to produce a depth robust graph (with certain characteristics). It’s pretty simple:

function DRSample(nodeCount : integer) {
	// each node is just a list of parent indices. We start with 2 nodes with no parents.
	tree : [[integer]] = [[],[]]
	for (node = 2; node < nodeCount; node++) {
		parents = []
		for (parent : integer = 0; parent < 2; parent++) 
			parents.push(getParent(node, parent))
		tree.push(parents)
	}
	return tree
}
	
function GetParent(node: integer, parent: integer) {
	if parent == 0 
		return node - 1
	else {
		bucket = random(1, floor(log2(node) + 1))
		bucketTop = min(node-1, 2 ** bucket)
		bucketBottom = max(2, bucketTop/2)
		return node - random(bottom, top)
	}
}

The ‘bucket’ construction biases the algorithm to make random connections from closer nodes (since low-order buckets are selected as often as high-order buckets, but the high-order buckets are much larger).

In Filecoin, this algorithm is generalized to produce more connected graphs. In theory, this is done by first producing a graph using the DRSample algorithm, and then reducing the number of nodes by some factor d by defining that for nodes with indices a and b in the original graph, if a is a parent of b in the original graph, then a/d is a parent of b/d in the reduced graph. This produces a graph where most nodes have d parents. It is possible to show that the reduced graph is ‘more’ depth-robust in the sense that relatively smaller subgraphs meet the same minimum depth criterion that was met by the original.

In practice, we can simply create an inner loop over d in DRSample in order to create the final graph directly. We can also see that the algorithm is well adapted to creating the labels we need for a proof-of-space in a single pass.

We can tune the hardness of the problem by changing the total number of nodes generated, the in-degree of the graph (d above) and proportion of nodes we delete. We can crunch these parameters through math derived from pebbling games and try decide if a dishonest prover who did not store the remaining labels could generate any one of them sufficiently quickly to fool the challenger, even given massive parallel compute resources. If we can stymie dishonest prover while avoiding increasing the in-degree to the extent that the proofs in the setup verification step become unmanageably large, we have a viable proof of space.

My own opinion is that given a bilateral economic relationship (e.g. storage provider and consumer) between prover and challenger a ‘simple’ POS based on depth-robust graphs would be sufficient. However, the use of Proof of Space in the Filecoin consensus protocol creates a greater incentive to break the algorithm.

Stacked Depth-Robust Graphs

Filecoin, then, goes the extra mile. The final construction used is actually a layer cake of DRGs with Bipartite Expander filling. This can intuitively be seen as ensuring that the labels in the final layer are dependent on a tree of ancestors that is both wide and deep, and which cannot be compressed or memoized in any meaningful way. It uses degree-6 DRGs with 2^30 nodes stacked in 11 layers with degree-8 expanders in between. The hash algorithm used for generating labels is SHA256 (although it zeros the two most significant bits, for reasons which have nothing to do with the Proof of Space). The commitment is an 8-way Merkle tree.