WNFS Private filesystem structure

matheus23 · August 4, 2021, 6:05pm

Context

Brooke and I talked about reconciliation in our 1on1 again, and it came up that because of the structural differences between the public and the private filesystem, reconciliation will work differently. E.g. on the private side, usually the file on with more changes on a replica will win against the same file with fewer changes on another replica.

Thus, I’ve been thinking about making the private filesystem more “DAG-like” again.

The idea

The private filesystem is split into two sections:

an immutable block store at /ipfs/<userRootCID>/privateBlocks. It stores encrypted filesystem nodes keyed by CID as an MMPT for efficiency.
a somewhat mutable “pointer” store at /ipfs/<userRootCID>/privateRoots

The pointer store is similar to a design brooke and daniel evaluated a while ago, which brooke calls “an index” into the private filesystem. It’s (conceptually, possibly via MMPT) a map from namefilters to (possibly encrypted?) CIDs.

The pointer should be thought of similar to the root CID of the public filesystem: They’re the entry point for your filesystem. If you’ve got the root CID (i.e. know which block is supposed to be the root) and you know the key for decrypting it, you can unlock the rest of the DAG.

Now, exactly which pointers get stored isn’t obvious:

we could store a pointer to every block. (it’s not clear to me what the source of truth should be. Do we follow the links inside each block, or do we just use the index?)
we could store only the root of the file system, but then apps which don’t have root access wouldn’t be able to update parts of the private filesystem, because they can’t change the root pointer
instead, I propose apps should update the pointer for the outermost part of the filesystem they changed. If they have root access, then this will be the root of the filesystem.

If you’ve got a private filesystem and you want to walk its tree, you do this primarily via the DAG structure pointed to from the root pointer, but once you reach parts of the filesystem where there’s a newer pointer that covers the subtree, you use follow the pointer’s CID instead of the CID from the block you just looked at.

Apps should the merge in these “temporary overlays” into the root DAG and update the root pointer accordingly.

Conflict resolution

Conflict resolution should basically work exactly the same as in the public filesystem, as we’ve got multiple DAG structure representing WNFSes just like in the public case.

Once conflicts have been resolved, we’ll have a new root which has the other roots as nodes somewhere in the DAG.

Preventing arbitrary pointer changes

If we allow arbitrarily changing the pointer to the root CID, we’d allow all apps to write to rewind the state of the filesystem back destructively, even if they’ve only got APPEND access, and there would be no way to check this on a server.

To fix this, for each path (e.g. the root path), we store a list of pointers. If an app produces a new update and only has APPEND access, it’ll push this update into the list. A third party can then check that an APPEND update is valid by making sure that … the update only appends. Duh.

Semantically, the actual filesystem root is then the ‘merge’ of all pointers. We have this merge operation defined through conflict resolution.

Problems

This leads us with two things:

We generate a lot of entries for each namefilter. E.g. every private filesystem change in an app with only APPEND access is another list entry. Thus, we leak some metadata.
If we want to prune redundant pointers in the list (some of them are superseded by others because of fast-fowards), we need to log into an app that has OVERWRITE access. However, we don’t want to have apps requesting overwrite access just for this purpose. But for a third party, it’s impossible to distinguish whether someone is trying to e.g. time-travel the filesystem destructively or is just trying to prune the amount of pointers.

Original discord message for completeness:

I have an idea for a more DAG-like private filesystem (on the data layer) that I just want to run by you: We store encrypted blocks by their CID, at `/privateBlocks`. These almost look just like our current blocks, but without the namefilters. They reference each other by CID. Because we store the encrypted blocks in an MMPT by CID, not by namefilter, it's really efficient to test whether a CID is in under the root of a filesystem or not. (We kind of replicate an ipfs blockstore). When we store everything by CID, the immediate next question is: How do we know what the root block in the filesystem should be? For that, I'd propose an MMPT at `/privateRoots` with namefilters as keys, just like we have them right now, pointing at the 'root' block. Now, the `/privateBlocks` MMPT is meant to be append-only, while `/privateRoots` is mutable. So if you want to update the private filesystem, you'll add the new root block to `/privateBlocks` and then change the root CID pointer at `/privateRoots`. If you're an app that doesn't have access to the root of the private filesystem, you'll put another entry into `/privateRoots` to update the highest nodes you have access to. An app that has root access can later come and resolve these pointers by updating the actual root. When traversing the private filesystem DAG, you need to account these non-root 'overrides'.

[15:42]

Here’s an example: Let’s say you’re trying to traverse the filesystem from the root, but another app has an entry for an update at private/Apps/flatmate. Now there will be two entries in <root>/privateRoots. One for private/ and one for private/Apps/flatmate. You can start traversing the filesystem by looking at the block with the cid pointed to from the namefilter for private/ in <root>/privateRoots. You do the normal thing of decrypting the block, looking at the keys and cids inside it and decrypting the next blocks in the DAG. No need for looking revisions ahead. The CIDs inside blocks will always refer to the most recent entry. Except for (in this case) the path at private/Apps/flatmate. While traversing, you’ll need to check whether any of the namefilters in <root>/privateRoots match your next paths. If so, you’ll need to look at the CID pointed to from the there, instead of what the block tells you. If you have two replicas with two different “root CIDs” for the private filesystem and you’re trying to merge these two filesystems on an actor that doesn’t have any permissions (e.g. the fission server), you simply put all conflicting root CIDs under the root namefilter in <root>/privateRoots. At some point a client that can read the conflicting roots will be able to reconcile the changes with exactly the same algorithm as for the public side: Compare DAGs, see whether one DAG is a subDAG of the other, if so fast-forward, otherwise merge and finally replace the root CID with only one entry, the merged filesystem.

[15:42]

— The biggest problem I see is that the fission server will have no way to distinguish an update that only has permissions for APPEND from an update that has permission to OVERWRITE. One “simple” solution to this would be: Anything that’s append will only be allowed to append in <root>/privateRoots, but anything that can OVERWRITE will be allowed to actually apply the reconciliation. However, that means a (probably not insignificant) overhead if the user mostly uses apps that only have APPEND permissions and rarely uses apps which have OVERWRITE permissions: They’ll accumulate lots of entries under the namefilter, which all subsume each other (should be detected as fast-forwarded).

expede · August 4, 2021, 7:56pm

I’m super excited that other people are thinking about these things I’ve also been coming back to this space recently to help with concurrent updates, much like the public side. It suggests that there’s something in here.

During my lunch I fiddled with the ideas a bit. In good news, we can definitely make the UCAN permission stuff work in an append-only setting Also, this method with a handful of index pointers could make elliptic curve accumulator witnesses more viable, which have some nice properties over Nyberg (i.e. Bloom) accumulators.

I’m starting to remember why we didn’t go this direction previously. Not that these are impossible to tackle, but they at minimum need more cycles:

We need multiple top-level pointers, which is a lot of pointer juggling and exposes how many entry points exist. These quickly need their own versioning or merge semantics, which means persistent state which leaks more data (see #2)
The top-level index can’t be mutable, since we want to preserve the shape of historical branching for correcting the algorithm. This reveals a lot of information about what was updated together, and which nodes are the same file, much like the public side.
Delegating read access to a subnode means writing a new index pointer, which the delegator may not have (they may only have read-only access), breaking read-level attenuation
We can put branch history backpointers on the reconciled node, but now that data is either in multiple places, or mutable, which are both fragile
How do you manage multiple updates to a single
It’s unclear how this will interact with backwards secrecy

Several of these tradeoffs are perhaps equally-well handled with skeleton indexes, which we’ve dropped from the latest spec because they’re fragile.

Things that have come up from this that we should think on more:

Switch to publicly verifiable signatures for updates, like the original cryptree paper
- Only use UCANs for side effects i.e. overwriting history, account admin, buying domain names
- Downside is that the security model breaks, but we can bootstrap with a cosigner like FileCoin

matheus23 · August 5, 2021, 8:39am

Reading up on that again.

Yep. The information leaking is the biggest issue as far as I can see. But at least in what I imagine, I don’t think we’re leaking information about which nodes are the same file. However, this is the part that I think needs fixing most. We can’t just accumulate the pointers endlessly/until an app with overwrite permissions comes along.
But at the same time it seems like if we were to fix this issue, i.e. we wouldn’t have to accumulate pointers, that’d mean we’d also not need multiple pointers at which point I don’t think this can be realized with bloom filters anymore. Thus likely we’d need something else for the root pointer entirely (and that feels like going back to step 0).

I needed to wrap my head around this a little. At first I didn’t get why you wouldn’t be able to just send the CID + key to someone to give them write permissions, but of course that’d mean a third party wouldn’t be able to distinguish someone creating any new namefilter from someone with actual delegated permissions. Leaving this here as a note for myself.

In the OP I was thinking of putting branch history into the reconciled node (& having that be immutable). The index would then point at multiple filesystem nodes. Someone with the key to these nodes however would be able to figure out which one either

which is the most “up to date” node, i.e. the one which contains all other nodes somewhere in its DAG or
how to merge all of these nodes into a new DAG as per reconciliation, and put that one back, so we get to the situation in (1) again.

Then again, having a list of root pointers is definitely not ideal, because it means a lot of work for clients (e.g. new apps would need to potentially do a lot of work to merge all the filesystems). We’d rather have a well-defined, simple source of truth for the root, just like in the public filesystem.

Noting for myself again: My mental model is: If we handle updates via signing updates via the file/directory’s public key, we have to give them a persistent, public name (the public key), thus leaking lots of info. (Is there perhaps some way to do this zero-knowledge-y?).

Not that I’ve made any progress in this post, just documenting my thoughts.

expede · August 12, 2021, 4:59pm

Here is a smörgåsbord of concepts and tradeoffs. Some of these plug long-standing holes in the system, especially around concurrency, efficiency, and federation.

Our A DID of Our Own (`did:wn`)

I propose that we create a DID format of our own, based heavily on did:key. For example:

did:wn:z7YcKfiZmBynAwZ1fAuQbhF5iR5C5kZg9hBb8AiJ9jR1yA

Making a DID WNFS-aware gives us a couple advantages:

Still `did:key` Compatible

No need to drop support for DID key. This is an extension, and “uncompressed” keys are fine.

Hash-Compressed RSA

RSA public keys are enormous. Their signatures are as well, but there’s very little that we can do about those. What if we took the SHA256 of the RSA PK, and stored the full key in a well-known location in WNFS? We can even gossip these across multiple WNFS users so that there’s many replicas holding copies of the public key (so it doesn’t go missing).

This cuts down on the size of a 2048-bit RSA key by a factor of 8, but requires a lookup if you don’t have that hash in cache. A 4096-bt RSA key gets cut down by 16. It’s worth noting that the signatures stay the same size as the key.

Chained DID Rotation

Rotating keys on-platform is possible(!)

UCAN Tweaks

Posting Directly on WNFS

Rather than attaching UCANs in HTTP headers, we should move to placing them directly in the data structure. We will still want UCANs for auth for side-effectful operations, like administering the account or buying domain names.

Attaching the relevant UCAN to WNFS has a few advantages:

Doesn’t depend on a single server to validate UCANs, open up P2P use cases
Anyone can validate historical changes
HTTP server header size limit doesn’t matter
More efficient: each UCAN only uploaded once, don’t need to point it at the server DID
CID-based proofs have better data availability than hoping that this one was persisted to WNFS when they all are

The downsides are few but still need to be considered:

It does leak some more information that can be used to correlate updates (time stamps, DIDs, UCAN chains, &c)
- At some point down the road we may find a zero-knowledge construction to help, but we can rely on good credential management and pseudonymity in the meantime
Time-based expiry is meaningless (see “Ditching the Wall Clock”)

Public vs Private Differences

Public

We can attach the UCAN to the updated root. Pretty straightforward, since the root is visible.

Private

Here we want to attach UCANs to each updated node, in the clear. Due to deduping, doing this by hash-linked CID is very lightweight.

Depending on what we do about logical clock expiration, we may need to also sign in the root CHAMP CID itself to show that you had access at the time.

Efficiency

We dedup storage, and don’t need to send these massive RSA-filled headers anymore (in the pure WNFS case).

Data Leaks

Attaching UCANs to be read for all eternity does expose some correlated data. It’s… not great, but It’s that or depend on a server for all updates. We may find more zero-knowledge methods later, but this is already state of the art.

We can obfuscate some of this information by issuing multiple UCANs to multiple DIDs for the same agent (since they’re cheap to create). This makes it difficult to revoke bad actors since they now have N uncorrelated UCANs.

Ditching the Wall Clock

Wall clock time is meaningless if we attach UCANs to the nodes directly. Today we assume an updating server, but in this more P2P mode, that may not be the case. So, who has the ability to update things?

I’ve considered “time-extensions” by having a collaborating node counter sign your changes and putting it under their expiry time. This has so many edge cases (e.g. what if no one with the extra rights is around in time?) that it’s too brittle IMO. Ideally we find some kind of logical expiry based on a proof of knowledge. Perhaps some mix of external revocation (in the revocation Merkle tree) & especially Lamport OTP? Lamport OTPs in particular are quite promising, since it very directly limits the number of actions that you can take. When we have transactions, this is less useful since a writer can merge multiple writes under a single clock tick, and this save their remaining hashes for (much) later. Essentially it says “you may write X times” instead of “you can write for T minutes”.

…but maybe that’s enough?

Private QuickSync™ (AKA Skipping Head Of Line Blocking)

This is probably the bit that’s most interesting one. The core problem is how to merge multiple private branches without taking linear time in the number of updates across all branches.

Seeing the Forest from the Trees

Place the diverged trees into a forest. The tree with most nodes will be getting merged onto (because it’s more efficient than copying these nodes into a smaller tree). In case of a tie, pick the one with the lowest alphanumeric root CID.

Start by fast forwarding as far as you can in all trees. Verify (see below) and merge these nodes to the main tree while performing that search. When you get to the end of each tree, you now have the elements that you need to perform an n-way merge on, using the same algorithm as the public side.

Separating out the trees does show more directly which changes are related to an observer, but seems like a fair trade-off.

Differential Rooting… Now in the Time-Dimension

Wait, what about the rest of the nodes? They’re not merged; they may even be broken!

We already have a similar situation & solution: differential rooting. As a refresher: this is when someone doesn’t have write access to the root, so they write what they can. This leads to a temporarily fragmented materialized tree.

We now have the same situation in the time dimension. We’ve synced the highest node in each branch and reconciled them. The tree being merged onto may be completely unreadable by the merging agent (i.e. the largest tree wasn’t theirs and they only got access to a later ratchet). Further, it’s possible for a number to have been skipped, and we need to remain fault tolerant in this possibility, so working on trees with holes is fully allowed.

As this forest is passed between agents, some of them will have more read & write access, and can continue the rooting progress. This can be done in the background, and provides an overall performance improvement to all peers (since they no longer need to track all of these branches).

Forest Pruning

When a tree is fully absorbed by another in the forest, it can be removed from the forest.

I leave it as an open question about if we need to keep the entire tree around, or just the largest tree and diffs from the rest. Only keeping diffs speeds up search (after the initial diffing process is complete) and makes it easy to know what’s left to reconcile. The main question is if the diffing is worth the up-front cost, and if dropping the context will matter for some use cases. The comparison algorithm shouldn’t care what it’s scanning between, so this is purely a separate data-management question. Bad pruning can also lead to dropped leaves.

We do want to keep the trees separate (at least the unmerged nodes), so that we can find the heads of various nodes while we’re waiting to reconcile them.

There’s no additional storage overhead in either case, since IPFS will store the same blocks either way. This may be a “why not both?”, but that does have some more (small but possibly unneeded) complexity.

Mean Size Lower Bound

An interesting idea from @matheus23 to help avoid reconciliation time bombs is to require the mean average node size to be at minimum some limit. This prevents bugs from causing a reconciliation time explosion, since they can’t persist (e.g.) single-character changes. An attacker can still write garbage to a series of revisions, but that’s more expensive for them, will hit storage limits much faster, and it’s possible to write broken nodes in all cases.

If we don’t like the raw mean, we can get all statistical and pick some standard deviations and whatnot. This kind of analysis can be done without cleartext access, so it’s pretty elegant!

Artificial Revision Bottleneck

I like the size requirement above, but for completeness (or possibly for related problems)…

This is a big part of the reason why proof of work (PoW) algorithms are used in blockchains (intentionally lower block production to prevent DDoS). We could also make this computational with PoW, or the more lightweight proof of history.

Validators & Lightweight Reputation Staking

Many systems with trustless distributed consensus have a validation model, where others can validate that a change was made correctly. Here we want to validate that the contents that we (the Fission server) cannot see are in fact correct, not just written to a valid path for a particular UCAN.

Here, agents would sign the content that they’re validating. If they validate bad data, a more trusted user in their UCAN chain and revoke their access (and potentially correct the bad data if they have overwrite permissions).

This gives us an option of last resort when things go sideways.

Revision Repairs

Missing revision number or repairing bad data is possible. Insertions of missing versions are trivial: it’s an append. Correcting broken data is harder, since other data may depend on the broken node. My suggestion is as follows:

Write a new repaired node (as a multivalues — it will have the same namefilter). Mark the flawed node as being tombstoned, and point at the repaired version.

Tracking Branch IDs

When we’ve merged the forest into a single tree, we still want to be able to track which node came from which branch. This can be achieved in (at least) two ways:

Structural — Use encrypted back-pointers
Label — In the multivalue, use a hash as the key for the multivalue branch

matheus23 · August 13, 2021, 1:14pm

Would that mean that other clients need to keep a history of CHAMP CIDs?

I think generally WNFS access revocation can happen using not the expiry time, but key rotation instead.
On the public side, it would be possible to link to a grow-only revocation set from the root node.
The big issue with this is that revocation then isn’t asynchronous.

I think the rest is great. Regarding broken nodes I think it just boils down to “we don’t reconcile branches with broken nodes” (and eventually delete them), and regarding reconciliation times, as a first simple approach I’d suggest this:
Let’s say there’s two actors: A and B. A creates a huge update which would take long to reconcile and sends it to B. B will try to reconcile the change with a time limit of X seconds. The time limit is reached and it’ll just continue working on its own branch. Later it sends its own update to actor A, who then notices that their changes weren’t reconciled in, yet. Because it created the huge update, it can also reconcile these changes more quickly.
When the reconciled update is shared with B again, it should be able to quickly figure out that it can fast-forward.

expede · August 13, 2021, 5:12pm

Yeah, I had looked at this (in the thing above). As you say, how do you do this async. So the challenge is twofold:

How do we revoke them while they’re offline? Do we reject the entire update?
Who’s doing the rotation? Do they have to be online?

We can’t know that without doing a linear sync (looking at every single node in the branch before merging). The performance of that is pretty bad, and may not even fit into browser memory. It’s… super tricky.

We should dig into this more, because as I understand it they’re opposite strategies. Making sure that the entire branch is good before merging vs hopping to the head and working backwards, basically.

RE A & B — That assumes that A will be back online in a reasonable amount of time / ever. On a human side, rather than pure technology being possible, the round trip time for that could be pretty rough, no?

matheus23 · August 13, 2021, 5:34pm

Yes and yes.
And that probably ties into the “Well, how do you communicate that an update is actually replicated to the user?”. Especially in the P2P case. What should that even mean in such a case?

expede · August 13, 2021, 5:51pm

Can you expand on this? From the way I’m reading it, you can check that you either have both trees in the forest and/or all of the CIDs from Tree A are in Tree B. But I could be misunderstanding!

matheus23 · August 13, 2021, 5:57pm

In the currently implemented model, we have a guarantee saying “if you’ve got your current state on the DNSLink, fission will make sure your data is safe”.
But in a more P2P model, what even are the guarantees that we want to tell the user about? Is it “yes, we’ve synced your phone state with your laptop”, i.e. “if you open your laptop, you’ll see your changes”. Or it could be something else. So that’s the figuring out part