IPFS Operator Node Wishlist

@walkah & I have been talking about things that we’d like to see from an IPFS Operator’s distro. Here’s my early list (James and others, please feel free to add to this)

  • Shared storage (multiple ingestion nodes, single blockstore e.g. S3)
  • Levelled caching (e.g. Redis on each node, plus a shared Redis)
  • Temp prioritized connections
    • One simple variant would be public / private bridge nodes
      • Run a private network (access with a shared secret) where some nodes can talk to both the private AND the public network (see the swarm.key sketch after this list)
      • Either a shared secret, or better: credentialed access such as time-limited UCANs
  • Centralized cluster supervisor
    • Make requests to the supervisor, and it ensures that one of the nodes has ingested the DAG
    • Watch for stale connections, fail faster if no bytes change in n minutes
      • Reconnect to client on stall
    • P2P version is nice, but has had issues due to increased complexity
      • In the meantime a central controller has worked well for us in the operator use case
  • Better hang detection (fewer manual restarts)
  • Nice to haves out of the box
    • HTTP caching for the HTTP gateway(s)
    • Preconfigured Prometheus stats
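
On the private-network point above: go-ipfs supports private networks via a pre-shared key placed in a `swarm.key` file in each node's repo, and nodes without the key can't connect. Here's a minimal sketch of generating that key. The file contents follow the standard libp2p PSK format; the filename and where you put it are up to you:

```go
// genswarmkey: generate a libp2p pre-shared key (swarm.key) for a private
// IPFS network. Every node that should join the private network needs this
// file in its IPFS repo directory.
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"log"
	"os"
)

func main() {
	// 32 random bytes, hex-encoded, wrapped in the libp2p PSK v1 file format.
	key := make([]byte, 32)
	if _, err := rand.Read(key); err != nil {
		log.Fatal(err)
	}
	contents := fmt.Sprintf("/key/swarm/psk/1.0.0/\n/base16/\n%s\n", hex.EncodeToString(key))
	if err := os.WriteFile("swarm.key", []byte(contents), 0o600); err != nil {
		log.Fatal(err)
	}
}
```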

I don’t think I have anything to add, per se, but can maybe add some colour behind our goals / thinking. From a high level, we would like to be able to horizontally scale our IPFS infrastructure quickly and easily. We are also using AWS, so we are constantly trying to balance price / performance.

Rather than maintaining nodes with enough storage for all of our users’ data, we’d like to point multiple nodes at a centralized “infinite storage” pool. Bonus: being able to quickly autoscale up additional nodes, if necessary, that come with the full data set.

This comes directly out of our experience with the first item: using the S3 datastore at any kind of scale is currently very expensive. While we want the long-term, redundant storage of blocks … let’s not hit S3 any more than we have to, right?
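
To make that concrete, here's a rough sketch of the leveled-caching idea from the wishlist: a write-through cache (Redis, an in-memory LRU, whatever) sitting in front of the shared S3 blockstore, so repeated reads stay off S3. The `Blockstore` interface below is a simplified stand-in for illustration, not the actual go-ipfs datastore API:

```go
// A simplified sketch of a two-level blockstore: a fast cache (e.g. Redis or
// an in-memory LRU) in front of an expensive, durable backing store (e.g. S3).
// The Blockstore interface is a stand-in, not the real go-ipfs API.
package blockcache

import "context"

type Blockstore interface {
	Get(ctx context.Context, cid string) ([]byte, error)
	Put(ctx context.Context, cid string, block []byte) error
}

type CachedBlockstore struct {
	cache   Blockstore // fast, size-limited (Redis, in-memory LRU, ...)
	backing Blockstore // authoritative, durable (S3)
}

func New(cache, backing Blockstore) *CachedBlockstore {
	return &CachedBlockstore{cache: cache, backing: backing}
}

// Get checks the cache first and only falls back to the backing store on a
// miss, populating the cache on the way out so repeated reads stay cheap.
func (c *CachedBlockstore) Get(ctx context.Context, cid string) ([]byte, error) {
	if block, err := c.cache.Get(ctx, cid); err == nil {
		return block, nil
	}
	block, err := c.backing.Get(ctx, cid)
	if err != nil {
		return nil, err
	}
	// Best effort: a cache write failure should not fail the read.
	_ = c.cache.Put(ctx, cid, block)
	return block, nil
}

// Put writes through to both levels so every node sharing the cache and the
// backing store sees the block.
func (c *CachedBlockstore) Put(ctx context.Context, cid string, block []byte) error {
	if err := c.backing.Put(ctx, cid, block); err != nil {
		return err
	}
	_ = c.cache.Put(ctx, cid, block)
	return nil
}
```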

Yes, we know about (and have tried) ipfs-cluster… I think this is a slightly different model.

Sometimes a node will just seem to stall while pinning. Maybe there’s a good way to detect why this is happening, but we haven’t figured it out yet; pointers welcome!

Yes, this actually exists, but good best-practice dashboards (Grafana, etc.) would be amazing.


Yes, this too. We detect this today by following the Progress field across the cluster using streaming HTTP, and fail if that number does not change in 2 minutes.
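
For anyone wanting to replicate this, the check is roughly the shape below: `pin/add` with `progress=true` on the HTTP API streams newline-delimited JSON carrying a `Progress` count, and we abort the request if that count stops moving. The two-minute window and local API address are our own choices; treat this as a sketch rather than our exact production code:

```go
// pinwatch: pin a CID through the go-ipfs HTTP API with progress reporting,
// and abort if the streamed Progress counter stops advancing.
package main

import (
	"bufio"
	"context"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"os"
	"time"
)

const (
	apiBase    = "http://127.0.0.1:5001/api/v0/pin/add?progress=true&arg="
	stallAfter = 2 * time.Minute // give up if Progress doesn't change for this long
)

type pinEvent struct {
	Progress int      `json:"Progress"`
	Pins     []string `json:"Pins"`
}

func pinWithStallDetection(cid string) error {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	// If the stall timer fires, cancel the request; that unblocks the body
	// read below and the function returns a "stalled" error.
	stall := time.AfterFunc(stallAfter, cancel)
	defer stall.Stop()

	req, err := http.NewRequestWithContext(ctx, http.MethodPost, apiBase+cid, nil)
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	lastProgress := -1
	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		var ev pinEvent
		if err := json.Unmarshal(scanner.Bytes(), &ev); err != nil {
			continue // skip lines we don't understand
		}
		if len(ev.Pins) > 0 {
			return nil // pin completed
		}
		if ev.Progress != lastProgress {
			lastProgress = ev.Progress
			stall.Reset(stallAfter) // progress moved: restart the stall window
		}
	}
	if ctx.Err() != nil {
		return fmt.Errorf("pin of %s stalled: no progress for %s", cid, stallAfter)
	}
	return scanner.Err()
}

func main() {
	if len(os.Args) < 2 {
		log.Fatal("usage: pinwatch <cid>")
	}
	if err := pinWithStallDetection(os.Args[1]); err != nil {
		log.Fatal(err)
	}
	fmt.Println("pinned")
}
```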

What I meant is that sometimes the node just becomes totally unresponsive for a period of time, and we need to dial into the machine and restart the systemd process (which sometimes unsticks it). This is on a massive EC2 instance with low RAM and CPU usage. (Not sure of the underlying problem yet.)


What do you mean by “totally unresponsive”? Some examples might include:

  1. You can’t do a libp2p connection to the node
  2. You can’t query the data over Bitswap in a reasonable time
  3. The HTTP API (e.g. ipfs id) does not respond (see the probe sketch after this list)
  4. The gateway takes a long time to display content it has cached
  5. The gateway takes a long time to display content you know is available to the gateway (e.g. a node that’s peered to it, or content that is advertised in the IPFS Public DHT by a node that is publicly reachable)
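
For case 3 in particular, a simple external probe makes this easy to detect automatically: hit the HTTP API’s id endpoint with a short deadline and treat a timeout as “unresponsive”. A minimal sketch, where the 10-second deadline and the local API address are arbitrary example values:

```go
// apiprobe: a liveness check against the go-ipfs HTTP API (the equivalent of
// running `ipfs id`). A timeout here is a strong hint that the daemon itself,
// not just Bitswap or the gateway, has become unresponsive.
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

// probeAPI returns an error if /api/v0/id does not answer within the deadline.
func probeAPI(ctx context.Context, apiAddr string) error {
	ctx, cancel := context.WithTimeout(ctx, 10*time.Second)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodPost, apiAddr+"/api/v0/id", nil)
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return fmt.Errorf("API did not respond: %w", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("API responded with status %s", resp.Status)
	}
	return nil
}

func main() {
	if err := probeAPI(context.Background(), "http://127.0.0.1:5001"); err != nil {
		fmt.Println("node looks unresponsive:", err)
		return
	}
	fmt.Println("node API is responding")
}
```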

Depending on what the problem is, it’s worth noting that sometimes a node can “run out of resources” without actually running out of resources on the machine. For example, in order to keep regular desktop users from having their everyday computer usage overwhelmed by Bitswap requests, only a certain number of goroutines are dedicated to processing those requests. However, your massive server might want to handle way more parallel requests. In this case there will be some configurations exposed in go-ipfs v0.10.0 (kubo/docs/config.md at 92854db7aed4424fad117ceb4e13f64a80ff348b · ipfs/kubo · GitHub) that can help here, but over time it’s likely that more configuration (or a separate binary) designed for use by IPFS operators will be needed.


Also, a “bad hash list” at the block store. We’ve had DMCA takedowns, and being able to unilaterally or collectively maintain a collection of known-bad hashes would be helpful.

The node does not respond to any interaction, via CLI or otherwise. The gateway times out, connection requests time out, the HTTP API is nonresponsive, and so on. Sometimes it fixes itself, but the quickest fix is to restart the systemd process.

That’s great to know! Thanks :raised_hands:

Yeah, agreed :100: