@walkah & I have been talking about things that we’d like to see from an IPFS Operator’s distro. Here’s my early list (James and others, please feel free to add to this)
Shared storage (multiple ingestion nodes, single blockstore e.g. S3)
Levelled caching (e.g. Redis on each node, plus a shared Redis)
Temp prioritized connections
One simple variant would be public / private bridge nodes
Run a private network (access with a shared secret) where some nodes can talk to the private AND public networks (see the swarm key sketch after this list)
Either a shared secret, or better: credentialed access like timed UCAN
Centralized cluster supervisor
Make requests to the supervisor; it ensures that one of the nodes has ingested the DAG
Watch for stale connections, fail faster if no bytes change in n minutes
Reconnect to client on stall
P2P version is nice, but has had issues due to increased complexity
In the meantime a central controller has worked well for us in the operator use case
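On the private-network bullet: go-ipfs already supports shared-secret private networks via a swarm.key file in the repo, and as far as I know a node carrying a swarm key can only talk to peers that share it, so the public/private bridge nodes (and anything UCAN-shaped) are the parts that would need new work. A minimal sketch of generating that key, assuming the standard libp2p PSK v1 file format:

```go
// Generates a swarm.key for a go-ipfs private network. Assumes the standard
// libp2p PSK v1 format: a version line, an encoding line, and 32 random bytes
// hex-encoded. Copy the resulting file into each node's IPFS repo (e.g. ~/.ipfs).
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"log"
	"os"
)

func main() {
	key := make([]byte, 32)
	if _, err := rand.Read(key); err != nil {
		log.Fatal(err)
	}
	contents := fmt.Sprintf("/key/swarm/psk/1.0.0/\n/base16/\n%s\n", hex.EncodeToString(key))
	if err := os.WriteFile("swarm.key", []byte(contents), 0o600); err != nil {
		log.Fatal(err)
	}
	fmt.Println("wrote swarm.key; copy it to every node in the private network")
}
```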
I don’t think I have anything to add, per se, but can maybe add some colour behind our goals / thinking. From a high level, we would like to be able to horizontally scale our IPFS infrastructure quickly and easily. We are also using AWS, so we are constantly trying to balance price / performance.
Rather than maintaining nodes with enough storage for all of our users’ data, we’d like to point multiple nodes at a centralized “infinite storage” pool. Bonus: being able to quickly autoscale up additional nodes, when necessary, that come with the full data set.
This comes directly out of our experience with the first one: using the S3 datastore at any kind of scale is currently very expensive. While we want the long-term, redundant storage of blocks … let’s not hit S3 any more than we have to, right?
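To make the caching idea concrete, here’s a rough sketch of a tiered read path that consults a fast local cache (Redis, an on-disk LRU, whatever) before touching the shared S3 blockstore. The BlockGetter/BlockPutter interfaces are simplified stand-ins for illustration, not the real go-ipfs blockstore API:

```go
// Sketch of "levelled caching": check a fast local cache before falling
// through to the expensive shared S3 blockstore. Simplified interfaces only.
package cache

import "context"

type BlockGetter interface {
	Get(ctx context.Context, cid string) ([]byte, error)
}

type BlockPutter interface {
	Put(ctx context.Context, cid string, data []byte) error
}

// TieredStore reads from the cache first and only hits the backing store (S3)
// on a miss, populating the cache on the way back so repeat reads stay cheap.
type TieredStore struct {
	Cache interface {
		BlockGetter
		BlockPutter
	}
	Backing BlockGetter
}

func (t *TieredStore) Get(ctx context.Context, cid string) ([]byte, error) {
	if data, err := t.Cache.Get(ctx, cid); err == nil {
		return data, nil // cache hit: no S3 request at all
	}
	data, err := t.Backing.Get(ctx, cid)
	if err != nil {
		return nil, err
	}
	// Best-effort cache fill; a real implementation would track/log failures.
	_ = t.Cache.Put(ctx, cid, data)
	return data, nil
}
```

The point being that hot blocks get served without an S3 request at all, which is where the cost goes.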
Yes we know (and have tried) ipfs-cluster… I think this is a slightly different model.
Sometimes a node will just seem to stall while pinning. Maybe there’s a good way to detect why this is happening, but we haven’t figured it out - pointers welcome!
Yes, this actually exists, but good / best-practice dashboards (Grafana, etc.) would be amazing.
Yes, this too. We detect this today by following the Progress field across the cluster using streaming HTTP, and fail if that number does not change in 2 minutes.
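For anyone wanting to do the same, here’s a rough standalone sketch (not our production code) of that kind of stall watcher. It assumes the kubo RPC API at 127.0.0.1:5001 and that pin add with progress=true streams {"Progress": n} objects followed by a final {"Pins": [...]} object; verify the exact response shape against your version:

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"net/url"
	"time"
)

// pinWithStallTimeout streams `pin add` progress from the RPC API and fails
// if the Progress counter has not advanced within stallAfter.
func pinWithStallTimeout(api, cid string, stallAfter time.Duration) error {
	endpoint := fmt.Sprintf("%s/api/v0/pin/add?arg=%s&progress=true", api, url.QueryEscape(cid))
	req, err := http.NewRequest(http.MethodPost, endpoint, nil) // the RPC API expects POST
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("pin/add returned %s", resp.Status)
	}

	// Assumed response shape: a stream of {"Progress": n}, then {"Pins": [...]}.
	type pinMsg struct {
		Progress int
		Pins     []string
	}

	msgs := make(chan pinMsg)
	errs := make(chan error, 1)
	done := make(chan struct{})
	defer close(done)
	go func() {
		dec := json.NewDecoder(resp.Body)
		for {
			var m pinMsg
			if err := dec.Decode(&m); err != nil {
				errs <- err
				return
			}
			select {
			case msgs <- m:
			case <-done:
				return
			}
		}
	}()

	last, lastChange := -1, time.Now()
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case m := <-msgs:
			if len(m.Pins) > 0 {
				return nil // pin completed
			}
			if m.Progress != last {
				last, lastChange = m.Progress, time.Now()
			}
		case err := <-errs:
			return err // stream ended or broke before the pin finished
		case <-ticker.C:
			// Fires both when Progress repeats and when the stream goes silent.
			if time.Since(lastChange) > stallAfter {
				return fmt.Errorf("pin of %s stalled: no progress for %s", cid, stallAfter)
			}
		}
	}
}

func main() {
	// "bafy..." stands in for a real CID to pin.
	if err := pinWithStallTimeout("http://127.0.0.1:5001", "bafy...", 2*time.Minute); err != nil {
		log.Fatal(err)
	}
}
```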
What I meant is that sometimes the node just becomes totally unresponsive for a period of time, and we need to dial into the machine and restart the systemd process (sometimes it unsticks on its own). This is with a massive EC2 instance with low RAM and CPU usage. (Not sure of the underlying problem yet.)
What do you mean by “totally unresponsive”? Some examples might include (there’s a quick health-check sketch after this list):
You can’t do a libp2p connection to the node
You can’t query the data over Bitswap in a reasonable time
The HTTP API (e.g. ipfs id) does not respond
The gateway takes a long time to display content it has cached
The gateway takes a long time to display content you know is available to the gateway (e.g. a node that’s peered to it, or content that is advertised in the IPFS Public DHT by a node that is publicly reachable)
…
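If it’s useful, here’s a minimal external watchdog sketch for the two HTTP-facing checks (the RPC API and the gateway) with hard timeouts, so “totally unresponsive” turns into a per-interface signal. The ports and the probe CID are placeholders for a default local node; the libp2p and Bitswap checks would need a separate libp2p client:

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

// probe issues one HTTP request and reports whether it answered within the
// context deadline with a non-error status.
func probe(ctx context.Context, method, url string) error {
	req, err := http.NewRequestWithContext(ctx, method, url, nil)
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	resp.Body.Close()
	if resp.StatusCode >= 400 {
		return fmt.Errorf("%s returned %s", url, resp.Status)
	}
	return nil
}

func main() {
	checks := map[string]struct{ method, url string }{
		// The RPC API expects POST; /api/v0/id is the HTTP form of `ipfs id`.
		"rpc api (ipfs id)": {http.MethodPost, "http://127.0.0.1:5001/api/v0/id"},
		// Use any CID you know the node already has; this one is a placeholder.
		"gateway (cached content)": {http.MethodGet, "http://127.0.0.1:8080/ipfs/QmUNLLsPACCz1vLxQVkXqqLX5R1X345qqfHbsf67hvA3Nn"},
	}
	for name, c := range checks {
		ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
		err := probe(ctx, c.method, c.url)
		cancel()
		if err != nil {
			fmt.Printf("FAIL %s: %v\n", name, err)
		} else {
			fmt.Printf("OK   %s\n", name)
		}
	}
}
```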
Depending on what the problem is, it’s worth noting that sometimes a node can “run out of resources” without actually running out of resources on the machine. For example, in order to keep regular desktop users from having their computer overwhelmed by Bitswap requests, there are a certain number of goroutines dedicated to processing requests. However, your massive server might want to handle way more parallel requests. In this case there are some configurations exposed in go-ipfs v0.10.0 (kubo/docs/config.md at 92854db7aed4424fad117ceb4e13f64a80ff348b · ipfs/kubo · GitHub) that can help here, but over time it’s likely that more configuration (or a separate binary) designed for use by IPFS operators will be needed.
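As a concrete example, something like the following would bump those worker counts via ipfs config --json. The Internal.Bitswap key names are my reading of that config.md revision and the values are illustrative rather than tuned recommendations, so double-check both against your version; the daemon needs a restart for them to take effect:

```go
// Sketch: raise the Bitswap worker limits on a large server by shelling out
// to `ipfs config --json`. Key names and values are assumptions to verify
// against the config.md revision linked above.
package main

import (
	"log"
	"os/exec"
)

func main() {
	settings := map[string]string{
		"Internal.Bitswap.TaskWorkerCount":             "64",
		"Internal.Bitswap.EngineTaskWorkerCount":       "64",
		"Internal.Bitswap.EngineBlockstoreWorkerCount": "256",
		"Internal.Bitswap.MaxOutstandingBytesPerPeer":  "4194304",
	}
	for key, val := range settings {
		cmd := exec.Command("ipfs", "config", "--json", key, val)
		if out, err := cmd.CombinedOutput(); err != nil {
			log.Fatalf("setting %s: %v\n%s", key, err, out)
		}
	}
}
```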
Also, a “bad hash list” at the block store. We’ve had DMCA takedowns, and being able to unilaterally or collectively block a collection of known-bad hashes would be helpful.
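Roughly what I have in mind, reusing the same simplified interface as the caching sketch above (not a real go-ipfs API): a wrapper that refuses to serve any block whose CID is on a locally loaded denylist:

```go
// Sketch of a "bad hash list" at the block store: load a file of CIDs (one per
// line) and refuse to serve them. Where this hook actually lives depends on how
// the blockstore is wrapped in your deployment.
package denylist

import (
	"bufio"
	"context"
	"errors"
	"os"
)

type BlockGetter interface {
	Get(ctx context.Context, cid string) ([]byte, error)
}

var ErrDenied = errors.New("block is on the denylist")

type DenyingStore struct {
	Inner  BlockGetter
	denied map[string]struct{}
}

// Load reads one CID per line (e.g. a shared takedown list) into the denylist.
func (d *DenyingStore) Load(path string) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()
	d.denied = make(map[string]struct{})
	s := bufio.NewScanner(f)
	for s.Scan() {
		if line := s.Text(); line != "" {
			d.denied[line] = struct{}{}
		}
	}
	return s.Err()
}

func (d *DenyingStore) Get(ctx context.Context, cid string) ([]byte, error) {
	if _, bad := d.denied[cid]; bad {
		return nil, ErrDenied
	}
	return d.Inner.Get(ctx, cid)
}
```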
The node does not respond to any interaction, via CLI or otherwise. The gateway times out, connection requests time out, the HTTP API is nonresponsive, and so on. Sometimes it fixes itself, but the quickest fix is to restart the systemd process