Dynamic type information in IPLD?

One of the applications I see for IPLD is allowing a storage provider that knows nothing about the app to spot CIDs in the data and batch the data they link to (usually so that it can send it all in one response, reducing round trips).

The current approach standardizes on CBOR, with a type-tagging scheme that allows a host to identify CIDs.
I’m considering another approach, which would lead to slimmer written standards and greater flexibility: every piece of data starts with a type ref, which links to a type definition, which lists Wasm implementations of various methods. For the use case mentioned above, there would be a method trace_refs(self) -> stream<Ref>, which would produce an iterator over the references contained in the data.
A motivating case where IPLD standards seem inadequate for this is programming languages, or hand-written type-description languages, that contain textual content-hashes of their dependencies. To locate the content hashes perfectly, the server needs to, in a sense, know how to parse the syntax of the language. Yet if we have a prototype/dynamic type object, then the server, in another sense, doesn’t need to know the syntax: the data’s type knows, and the server can use the type to extract the nested refs. It can also be a lot more sophisticated about distinguishing the refs the recipient will need from the ones they probably won’t.
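To make the idea concrete, here’s a minimal Rust sketch of what such a method could look like. Everything here is hypothetical: Cid is a stand-in for a real content identifier, the trait name is my own, and the “parser” just scans for Qm-prefixed tokens rather than actually parsing a language’s syntax.

```rust
// Stand-in for a real content identifier; illustrative only.
#[derive(Debug, Clone, PartialEq)]
struct Cid(String);

// Each type definition would ship a Wasm implementation of something
// like this trait, which the storage host could invoke.
trait TraceRefs {
    // Yield every reference contained in the data, in order.
    fn trace_refs(&self) -> Vec<Cid>;
}

// Example: a source file whose text embeds content-hashes of its deps.
struct SourceFile {
    text: String,
}

impl TraceRefs for SourceFile {
    fn trace_refs(&self) -> Vec<Cid> {
        // A real implementation would parse the language's syntax;
        // here we just pick out tokens that look like CIDs.
        self.text
            .split_whitespace()
            .filter(|tok| tok.starts_with("Qm"))
            .map(|tok| Cid(tok.to_string()))
            .collect()
    }
}

fn main() {
    let file = SourceFile {
        text: "import QmAaa111 use QmBbb222 fn main() {}".to_string(),
    };
    let refs = file.trace_refs();
    assert_eq!(refs.len(), 2);
    println!("{:?}", refs);
}
```

The point of the dispatch is that the host never needs to know SourceFile’s syntax; it only needs to know how to call trace_refs on whatever Wasm implementation the type definition links to.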

IPVM (the integration of executable code into IPFS) is kind of the moment this becomes realizable.
So, I’m curious, is this being done? This being, I guess, a distributed type system?

I’m not sure how this relates to IPLD’s schemas. Schemas seem to be trying to do something like this without allowing schemas to specify code? Sounds perilous.

(my background: Doing preliminary research for building… um… foundations and libraries for a social, participatory computing UX. I’ll need to define an interface language, and it will have to use content-addressing from the start)


Yeah, you’re totally right that IPVM would enable this. There’s some hand-waving necessary about trust relationships & what actual integration into something like kubo would look like, but in principle you can put the pieces together more easily now that homestar (an IPVM implementation) exists.

I’ve been thinking of something similar to the trace_refs function: essentially, when you run an IPVM function that has access to a blockstore, record the blockstore access patterns.
E.g. imagine an IPVM function that looks up a bunch of records in a HAMT. Whoever runs the function first may optimistically download the whole HAMT to have it available as soon as possible, but once you generate the receipt, you have the blockstore access traces, so subsequent runs can fetch just the blocks needed to access those keys in the HAMT. So you save some bandwidth :slight_smile:
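A minimal sketch of that recording idea, with an entirely hypothetical blockstore interface (not homestar’s actual API): a wrapper store logs every block a run touches, and that log is exactly the prefetch list for the next run.

```rust
use std::cell::RefCell;
use std::collections::HashMap;

type Cid = String;

// Hypothetical blockstore interface; not homestar's actual API.
trait Blockstore {
    fn get(&self, cid: &Cid) -> Option<Vec<u8>>;
}

struct MemoryStore {
    blocks: HashMap<Cid, Vec<u8>>,
}

impl Blockstore for MemoryStore {
    fn get(&self, cid: &Cid) -> Option<Vec<u8>> {
        self.blocks.get(cid).cloned()
    }
}

// Wrapper that records which blocks a run actually touched,
// so a later run can prefetch exactly that set.
struct TracingStore<S: Blockstore> {
    inner: S,
    accessed: RefCell<Vec<Cid>>,
}

impl<S: Blockstore> Blockstore for TracingStore<S> {
    fn get(&self, cid: &Cid) -> Option<Vec<u8>> {
        self.accessed.borrow_mut().push(cid.clone());
        self.inner.get(cid)
    }
}

fn main() {
    let mut blocks = HashMap::new();
    blocks.insert("root".to_string(), b"node".to_vec());
    blocks.insert("leaf-a".to_string(), b"a".to_vec());
    blocks.insert("leaf-b".to_string(), b"b".to_vec()); // never touched

    let store = TracingStore {
        inner: MemoryStore { blocks },
        accessed: RefCell::new(Vec::new()),
    };

    // A "function run" that only follows root -> leaf-a.
    let _ = store.get(&"root".to_string());
    let _ = store.get(&"leaf-a".to_string());

    // The trace is the receipt: only 2 of the 3 blocks are needed.
    assert_eq!(store.accessed.borrow().len(), 2);
}
```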

Following links is one of the main functions of IPLD; it allows kubo and others to “pin” arbitrary data, or garbage-collect it. However, IPLD is in principle also supposed to power things like selection queries on data, where paths through blocks are translated into following links, and leaves that are expected to be huge byte arrays are translated into traversals of chunked data structures (like UnixFS files or similar).
Of these use cases, following links is just the most pervasively implemented.
So I’d say that something like trace_refs technically lives on a slightly different abstraction level.

By the way - there are some talks from IPFS Thing last year about Wasm + IPLD. E.g. autocodec goes in this direction (although it’s mostly about translating between e.g. dag-pb and dag-cbor or dag-json): https://www.youtube.com/watch?v=nCYj0LghbpI
There’s also a talk from Stebalien about what Wasm could bring to IPLD: https://www.youtube.com/watch?v=mN2iYiEyjUM

Oh yeah, and welcome @makoConstruct :wave: Nice first post.


Yeah, I thought a lot about autocodec while reading this, so good call out.

(I had to take a moment to wrap my head around HAMTs. If we’re hash-addressing, why not just use a hashmap? I’m going to guess the answer is that in a distributed storage context, no one wants to be the one who has to expand the hashmap in the process of adding one entry, or that at large scales exponential up-sizing becomes a bit ridiculous, or that block chunking makes a tree structure relatively faster, or something like that?)

Autocodecs… I see. So in rustland we’d define a conversion like so: impl Into<Json> for &Cbor { fn into(self) -> Json { ... } }. But the impl basically has to be defined either alongside the Cbor type definition or alongside the Json type definition; it can’t live anywhere else (Rust’s orphan rule). This makes it unambiguous which impl is being used, but I’ve always found it a bit limiting, because it means a third party cannot come along and define a conversion between two types that don’t know about each other.
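For illustration, here’s the shape of a conversion that the orphan rule does permit, with toy Cbor and Json placeholder types (not real codec implementations). A third crate owning neither type couldn’t write this impl directly; it would have to wrap one side in a newtype.

```rust
// Placeholder types standing in for real CBOR / JSON values.
struct Cbor(Vec<u8>);
struct Json(String);

// Legal: this impl lives in the same crate as (at least) one of the
// two types. A crate owning neither Cbor nor Json could not add it.
impl From<&Cbor> for Json {
    fn from(c: &Cbor) -> Json {
        // Toy "conversion": render the bytes as a JSON array.
        let items: Vec<String> = c.0.iter().map(|b| b.to_string()).collect();
        Json(format!("[{}]", items.join(",")))
    }
}

fn main() {
    let cbor = Cbor(vec![1, 2, 3]);
    let json = Json::from(&cbor);
    assert_eq!(json.0, "[1,2,3]");
}
```

(Idiomatically you implement From rather than Into; the Into impl comes for free via the blanket impl in std.)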
I’d probably prefer an approach where there’s a fairly open set of converter packages that anyone can contribute to, and where collisions (multiple conversions between the same pair of types) are ranked by some community code-auditing process (maybe peer review) in terms of security, efficiency, and ergonomics. The ranking potentially wouldn’t use numeric ratings but full rank recovery from pairwise comparisons (paraphrasing: finding the ‘top’ of the graph formed by submitted comparisons between nodes; a tournament is an efficient example of this being done, and something like PageRank is an approximate, good-enough algorithm for taking a graph and finding its top). That’s more suitable for situations where there’s no obvious maximum or minimum quality against which to calibrate an absolute scale, where doing comparisons is expensive, and where there’s a fair amount of respect between judges.

So, e.g., I guess when you tried to convert CBOR to JSON, your code editor would query that registry, select the top (audited) conversion, and put its content-hash import into your code. (Like import convert_cbor_json from ipfs:QmcniBv7UQ4gGPQQW2BwbD4ZZHzN3o3tPuNLZCbBchd1zh#convert_cbor_json; the name after the hash is just a check to make sure it’s the right hash and to make it clear what the hash link refers to. Eventually the editor would cache the package and show its description instead of a hash.)

I guess that’d be a specific case of a more general system: the community defines categories for libraries. Categories are assigned or removed by the votes of programmers in your preferred “is a good programmer” endorsement graph. Libraries that share a category are sometimes compared and ranked. “Conversions of CBOR to JSON” would be one such library category. If a library has higher-ranked alternatives within its primary category, the library browser shows a notice/warning/alert on its page, suggesting that you check out the alternatives.

I should ask whether Fission would like to be involved in the development of distributed cross-language type systems and development tooling, or whether y’all are just going to leave it to me. If it’s left to me, 30% chance it won’t get done! I haven’t floated this by NLnet, and they might consider it too ambitious at this stage. (But also, there are a lot of people who share the dream of a cross-language Wasm ecosystem, and the Bytecode Alliance are doing a lot of good work in that direction and probably intend to just keep going, so it probably won’t be left to me.)
Cap’n Proto is also a distributed type system and actor interface language, but it favors obtuse read-in-place APIs that people don’t seem to like; its type-description format doesn’t embrace content-addressing; and it mandates field numbers (for forward compatibility) even in initial definitions, where they’re just boilerplate. That could also be handled much more elegantly by using content-addressing to link to previous versions of the schema. So I won’t use the capnp format.
And I don’t think there have been any other attempts at this?

It’s nice to have a single CID to share the current state of your hash map.
Otherwise, to synchronize, you’d need to share your whole set of key-value pairs before you could figure out that you and another peer are in the same state.

Additionally, by hierarchically hashing the set of key-value pairs, if your state and your peer’s differ, you can find the difference much faster, since some subtrees will end up being unchanged. E.g. when the only difference is a single key-value pair, you only need to transfer an O(log N) diff of the nodes on the path from the root to the changed leaf.
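That O(log N) claim can be seen in a toy Merkle tree. This sketch uses Rust’s non-cryptographic DefaultHasher purely for illustration (a real system would use a cryptographic hash and CIDs): changing one of four leaves changes only that leaf’s hash plus its two ancestors, i.e. 3 of the 7 node hashes.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Toy hash; a real system would use a cryptographic hash.
fn h(data: &impl Hash) -> u64 {
    let mut s = DefaultHasher::new();
    data.hash(&mut s);
    s.finish()
}

// Build the levels of a binary Merkle tree over the leaves,
// returning each level's node hashes (leaves first, root last).
fn merkle_levels(leaves: &[&str]) -> Vec<Vec<u64>> {
    let mut levels = vec![leaves.iter().map(|l| h(l)).collect::<Vec<_>>()];
    while levels.last().unwrap().len() > 1 {
        let prev = levels.last().unwrap();
        let next: Vec<u64> = prev.chunks(2).map(|pair| h(&pair.to_vec())).collect();
        levels.push(next);
    }
    levels
}

fn main() {
    let a = merkle_levels(&["k1=v1", "k2=v2", "k3=v3", "k4=v4"]);
    let b = merkle_levels(&["k1=v1", "k2=XX", "k3=v3", "k4=v4"]); // one leaf changed

    // Count how many node hashes differ between the two trees.
    let mut differing = 0;
    for (la, lb) in a.iter().zip(b.iter()) {
        differing += la.iter().zip(lb.iter()).filter(|(x, y)| x != y).count();
    }
    // Only the changed leaf plus its ancestors differ: 3 of 7 nodes,
    // i.e. O(log N) rather than O(N).
    assert_eq!(differing, 3);
}
```

The diff to transfer is exactly the set of differing nodes, which is the path from the root to the changed leaf.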