DistSys February 2022: Metastable Failures in Distributed Systems (2021)

Please sign up on the Luma page to get access to the Zoom

hotos21-s11-bronson.pdf (410.7 KB) (canonical paper link)


We describe metastable failures, a failure pattern in distributed systems. Currently, metastable failures manifest themselves as black swan events; they are outliers because nothing in the past points to their possibility, have a severe impact, and are much easier to explain in hindsight than to predict. Although instances of metastable failures can look different at the surface, deeper analysis shows that they can be understood within the same framework.

We introduce a framework for thinking about metastable failures, apply it to examples observed during years of operating distributed systems at scale, and survey ad-hoc techniques developed post-factum for making systems resilient to known metastable failures. A systematic approach for building systems that are robust against unknown metastable failures remains an open problem.

ACM Reference Format:

Nathan Bronson, Abutalib Aghayev, Aleksey Charapko, and Timothy Zhu. 2021. Metastable Failures in Distributed Systems. In Workshop on Hot Topics in Operating Systems (HotOS ’21), May 31-June 2, 2021, Ann Arbor, MI, USA. ACM, New York, NY, USA, 7 pages. https://doi.org/10.1145/3458336.346528


Chat Log

00:14:23 quinn: read through vs. look aside caches
00:36:40 Na: quinn in section 3 “Change of Policy during Overload”
00:36:51 quinn: Yes! Thank you
00:37:36 Philipp: Jepsen - https://github.com/jepsen-io/jepsen
00:39:42 Philipp: Roblox post mortem - Roblox Return to Service 10/28-10/31 2021 - Roblox Blog
00:39:45 Philipp: Roblox Return to Service 10/28-10/31 2021 - Roblox Blog
00:41:10 quinn: :100:
00:43:48 Na: https://github.com/testground/testground ?
00:44:08 Philipp: Yep I think so
00:55:50 Na: <3 “broken network day”
00:59:53 quinn: Where caches are a given, shipping delays are an Evergiven
01:00:03 Brooklyn Zelenka (expede): :stuck_out_tongue:
01:00:48 Na: The Simple Solution to Traffic - YouTube ← spontaneous traffic
01:00:59 Na: 2003 blackout: What Really Happened During the 2003 Blackout? - YouTube
01:04:31 quinn: I’m definitely in for more frequent meetings
01:04:39 Na: same
01:04:43 Philipp: same
01:05:02 quinn: Spoiler alert: that means lots of datalog papers
01:05:07 Clifford Fajardo: same ^
01:05:16 Na: :eyes: datalog