Compression/rabin on roadmap? Scaling?

laudiacay · February 7, 2023, 12:03am

I’m interested in possibly using this for very large scale processing of archival data before sending it off to Filecoin.

I’m definitely interested in using it with IPFS at small scales to store/share age encryption identities and file metadata between people/entities, and then I’ll just post the giant encrypted/compressed blobs to Filecoin.

I’m looking at compression and rabin chunking, and at adding both. Do you have thoughts?

Also- is there any benchmarking data, anything I should be aware of, any places where a lot of things might be all loaded into memory together in creation of a private filesystem, any places that look like a chokepoint for massive scale and parallelization?

Opening this thread as I read the spec and repository so that we can discuss

matheus23 · February 7, 2023, 11:20am

That’d be amazing

We’ve thought about doing compression before encryption. I think it’s super reasonable, but there are also a lot of different ways to ‘slice in’ the compression (before chunking? after chunking?). We’d probably just have to throw some thinking at it and we’d figure something out fairly quickly.

Regarding chunking: For encrypted data, deduplication doesn’t apply due to randomness used in encryption. What’s more interesting is efficient inserts/removes that we get if we have locally stable chunk boundaries with rabin (which would also interfere with compression).
All in all, it’s entirely possible. Brendan had some thoughts on doing this, written up in doc/cipherchunk.md in the wnfs-wg/wnfs-go repository on GitHub.
Another choice is whether to have different keys for each chunk. That’d mean needing to attach keys to all links, but would also allow rotating the key on a write without having to re-encrypt the rest of the file.
Today, we’re doing the dumb thing in both regards: fixed chunk size and the same key for all chunks. I really want to improve this, and I think we should consider advanced chunking (rabin or fixed chunks being a runtime choice) and different keys per chunk. The difficulty here lies in writing all the algorithms that do the chunking, then seeking and modifying subsequences efficiently.
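
To make the "locally stable chunk boundaries" point concrete, here is a minimal sketch of content-defined chunking. It uses a simple gear-style rolling hash, not a true Rabin fingerprint, and it is not how wnfs or iroh do it. The names `chunk_boundaries`, `gear_table`, `MASK`, `MIN_CHUNK` and `MAX_CHUNK` and all parameter values are assumptions made up for illustration.

```rust
/// Hypothetical parameters: once a chunk has reached MIN_CHUNK bytes, a boundary
/// fires with probability of about 1/64 per byte; MAX_CHUNK forces a cut.
const MASK: u64 = (1 << 6) - 1;
const MIN_CHUNK: usize = 16;
const MAX_CHUNK: usize = 256;

/// Deterministic pseudo-random 64-bit value per byte value (splitmix64 steps).
fn gear_table() -> [u64; 256] {
    let mut table = [0u64; 256];
    let mut state: u64 = 0x9E37_79B9_7F4A_7C15;
    for entry in table.iter_mut() {
        state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        *entry = z ^ (z >> 31);
    }
    table
}

/// Returns the exclusive end offset of every chunk in `data`.
fn chunk_boundaries(data: &[u8]) -> Vec<usize> {
    let table = gear_table();
    let mut boundaries = Vec::new();
    let mut start = 0;
    let mut hash: u64 = 0;
    for (i, &byte) in data.iter().enumerate() {
        // The low bits of the hash depend only on the last few bytes, so a cut
        // is decided by local content rather than by absolute position.
        hash = (hash << 1).wrapping_add(table[byte as usize]);
        let len = i + 1 - start;
        if (len >= MIN_CHUNK && hash & MASK == 0) || len >= MAX_CHUNK {
            boundaries.push(i + 1);
            start = i + 1;
            hash = 0;
        }
    }
    // Whatever is left over becomes the final, possibly short, chunk.
    if start < data.len() {
        boundaries.push(data.len());
    }
    boundaries
}
```

With a scheme like this, inserting a byte near the start of a file typically changes only the chunks around the insert. Once a cut lands on the same content again, every later chunk is byte-identical to before, so only the affected chunks would need to be re-encrypted and re-linked.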

Ooooh yes

For now, we haven’t optimized for scale. There are chokepoints because we haven’t put resources into things we don’t expect to need in the immediate future: our use case (browser-based personal data stores) isn’t huge-scale.

Here are some chokepoints I know of:

When writing private files, we don’t stream yet (read streams exist, though). This’ll probably be the first thing we tackle, since it has already become a problem for Functionland when they tried to add >200 MB files on phones. It’s also not super hard, just some implementation work.
Our directories aren’t chunked yet, so they’re effectively subject to IPFS/IPLD/bitswap block size constraints, limiting the number of directory entries to somewhere in the thousand(s?). This one requires some thinking and some spec writing (see "Directory Sharding", Issue #8 in wnfs-wg/spec on GitHub). That’s likely a more long-term thing.
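
The first chokepoint comes down to reading a file chunk by chunk instead of buffering it whole. Here is a rough sketch of that shape using only the standard library. `write_chunked`, `CHUNK_SIZE` and the `store_chunk` callback are hypothetical names, not the wnfs API, and encryption is left out entirely.

```rust
use std::io::{self, Read};

/// Hypothetical chunk size; a real value would follow from block size constraints.
const CHUNK_SIZE: usize = 256 * 1024;

/// Reads `source` one chunk at a time and hands each chunk to `store_chunk`.
/// Peak memory is a single chunk buffer, regardless of the input size.
/// Returns the number of chunks handed to the callback.
fn write_chunked<R: Read>(
    mut source: R,
    mut store_chunk: impl FnMut(&[u8]) -> io::Result<()>,
) -> io::Result<usize> {
    let mut buf = vec![0u8; CHUNK_SIZE];
    let mut chunks = 0;
    loop {
        // Fill the buffer as far as possible; `read` may return short reads.
        let mut filled = 0;
        while filled < buf.len() {
            match source.read(&mut buf[filled..]) {
                Ok(0) => break, // end of input
                Ok(n) => filled += n,
                Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
                Err(e) => return Err(e),
            }
        }
        if filled == 0 {
            return Ok(chunks);
        }
        store_chunk(&buf[..filled])?;
        chunks += 1;
        if filled < buf.len() {
            return Ok(chunks); // that was the last, partial chunk
        }
    }
}
```

In a real private-file write, the callback would be where a chunk gets encrypted and put into the block store. The point of the sketch is only that nothing larger than one chunk is held in memory at a time.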

On the parallelization side, I believe that’s absolutely possible for us, but we haven’t focused on this yet. So far, our async Rust code doesn’t use any join!s. I believe we can have a really good parallelization story in the future due to all internal data structures being immutable by default and being designed to support idempotent, associative and commutative merge algorithms.
I should also mention that we’re Rc-based right now (because JS is single-threaded), but that’s easy enough to exchange for Arc.
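
As a small illustration of why those merge properties help: if a merge is idempotent, associative and commutative, work can be split across threads in any grouping and order and still produce the same result. The sketch below uses plain std threads rather than async join!, and a set union as a stand-in merge. `merge` and `parallel_merge` are hypothetical names, not wnfs functions.

```rust
use std::collections::BTreeSet;
use std::sync::Arc;
use std::thread;

/// Stand-in for a merge of immutable structures: set union, which is
/// idempotent, associative and commutative.
fn merge(a: &BTreeSet<u32>, b: &BTreeSet<u32>) -> BTreeSet<u32> {
    a.union(b).copied().collect()
}

/// Merges the shared inputs on `workers` threads, then combines the partial results.
fn parallel_merge(inputs: Arc<Vec<BTreeSet<u32>>>, workers: usize) -> BTreeSet<u32> {
    let workers = workers.max(1);
    let handles: Vec<_> = (0..workers)
        .map(|w| {
            // `Arc` is `Send`, so a clone can move into the thread.
            // An `Rc` here would be rejected by the compiler.
            let inputs = Arc::clone(&inputs);
            thread::spawn(move || {
                // Worker `w` takes every `workers`-th input, starting at index `w`.
                inputs
                    .iter()
                    .skip(w)
                    .step_by(workers)
                    .fold(BTreeSet::new(), |acc, set| merge(&acc, set))
            })
        })
        .collect();
    // The order in which partial results are combined does not matter.
    handles
        .into_iter()
        .map(|h| h.join().expect("worker panicked"))
        .fold(BTreeSet::new(), |acc, part| merge(&acc, &part))
}
```

Calling `parallel_merge` with one worker or with several gives the same set, which is exactly what the three properties guarantee.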

laudiacay · February 7, 2023, 5:27pm

yeah- there are impls of fixed/rabin in iroh already. i’m going to ignore this for now and think about it later
keys per chunk smart- also helps you not hit AES-GCM limits …
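
For a rough sense of scale on the AES-GCM point: one commonly cited limit, which may or may not be the one meant here, is the NIST SP 800-38D guidance of at most 2^32 encryptions under one key when nonces are generated randomly. A back-of-the-envelope sketch, assuming one AES-GCM call per chunk and a hypothetical 256 KiB chunk size (`max_bytes_per_key` is a made-up helper):

```rust
/// NIST SP 800-38D caps authenticated-encryption invocations at 2^32 per key
/// when IVs are generated randomly.
const MAX_INVOCATIONS_PER_KEY: u128 = 1 << 32;

/// Upper bound on the bytes encrypted under one key, given one AES-GCM call
/// per chunk. `chunk_size` is an assumed parameter, not the wnfs chunk size.
fn max_bytes_per_key(chunk_size: u128) -> u128 {
    MAX_INVOCATIONS_PER_KEY * chunk_size
}
```

With 256 KiB chunks that works out to 2^50 bytes (1 PiB) written under a single key, counting every rewrite of a chunk as well. With a key per chunk, each key is only used for that chunk’s writes, so this bound stops being a practical concern.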

benchmarking data we’ll have something soon. i have our other devs wiring our repo up to criterion and our critical section will include the part that will eventually include wnfs.

oof re parallelization. we’ll help get us there.