DSys RG July 2022: In Search of an Understandable Consensus Algorithm (Raft)

Paper: In Search of an Understandable Consensus Algorithm (Extended Version)

00:14:01	Philipp Krüger:	Hey @Swaram, just wanted to let you know that you're unmuted and we can hear some noise coming through your mic sometimes
00:18:40	Joshua Fontany:	😎
00:19:07	Sodium :	> Initially
we planned to use a ranking system: each candidate was
assigned a unique rank, which was used to select between
competing candidates. If a candidate discovered another
candidate with higher rank, it would return to follower
state so that the higher ranking candidate could more eas-
ily win the next election. We found that this approach
created subtle issues around availability (a lower-ranked
server might need to time out and become a candidate
again if a higher-ranked server fails, but if it does so too
soon, it can reset progress towards electing a leader). We
made adjustments to the algorithm several times, but after
each adjustment new corner cases appeared. Eventually
we concluded that the randomized retry approach is more
obvious and understandable
00:19:27	Sodium :	page 6, just before S5.3 Log Replication
00:20:15	Philipp Krüger:	Beat me to it Na :P
00:21:16	Joshua Fontany:	I sketched out a 2 order of magnitude randomness timer for automatic reconnecting the yjs crdt websocket connections 8 was using. That was a good explanation, thank you.
00:21:49	Quinn Wilton:	Thanks Josh!
00:22:28	João Guilherme Araújo:	Does anyone know Liskov and Oki's consensus algorithm that they cite in the first page?
00:22:47	Quinn Wilton:	https://brooker.co.za/blog/2014/05/19/vr.html
00:22:53	Quinn Wilton:	^ that's it
00:23:05	João Guilherme Araújo:	Paxos and Raft are pretty well-known, but I've only seem people talk about Viewstamped Replication like twice
00:24:42	Sodium :	*and* the leader needs to constantly talk to everyone
00:26:08	Sodium :	electing a whole cabinet of Ministers Of Data
00:26:23	Philipp Krüger:	lol
00:28:28	Philipp Krüger:	"garbage" collection is the oppositie of monotonicity
00:29:21	Quinn Wilton:	Always happy to answer but I don't want to take over the discussion 😅
00:29:25	Brent Shambaugh:	https://en.wikipedia.org/wiki/Conflict-free_replicated_data_type ?
00:29:28	Quinn Wilton:	Yup!
00:36:40	Sodium :	they talk about the election modification Quinn was just talking about on p 11, just before S7 Log Compaction, at the end of the cluster membership changes section
00:37:18	Sodium :	<3 tiddlywiki
00:37:35	Brooklyn Zelenka (@expede):	Welcome John 👋
00:39:57	Philipp Krüger:	I think Paul Borrill wanted to say something, too, sorry if we cut you off!
00:40:33	João Guilherme Araújo:	Hi John 😄
00:41:20	John Ousterhout:	Hi everyone
00:45:33	Kaleb Alves:	I think there's a Fred Schneider's paper that try to solve some kind of async config change problem, not sure if it's the same thing.
00:45:46	Eleanor:	My impression is that Spanner requires a high level of network resiliency. I took it to mean that other clouds couldn't do it because the network failures would render it moot
00:46:19	Kaleb Alves:	the article: https://www.cs.cornell.edu/fbs/publications/SMSurvey.pdf
00:47:08	Juho Myllylahti:	One thing popping to mind is also the problem of adversial nodes, that's a difficult issue if you want to trust the timestamps. Of course time is relative, so there's going to be issues (with nodes on the orbit clock drifting if nothing else :-)).
00:47:56	Juho Myllylahti:	*adversarial
00:48:01	Sodium :	it seems like the system as a whole is monotonic-ish? but individual nodes aren't necessarily
00:55:16	Quinn Wilton:	Ha this is what I was about to ask
00:57:46	Edward Sargisson:	Parthenon or Analytica? Clarke, Emerson and Sifakis won the Turing award in 2007 for model checking.
00:59:29	Brooklyn Zelenka (@expede):	LVars are in that broad general direction, for example
01:04:25	Quinn Wilton:	Brooke - can you by any chance go to page 3? There's a great quote halfway down the left page that ties to what John is saying
01:04:49	Joshua Fontany:	Korzybski's Abstraction models use reduction
01:05:49	Brooklyn Zelenka (@expede):	#whynotboth
01:05:56	Joshua Fontany:	They were physically Wiregrass hung above each other to demonstrate the concept, with the suspending lines indicating lines connecting corresponding data points in the model.
01:06:05	Joshua Fontany:	*wireframes
01:06:41	Quinn Wilton:	That's one of the reasons I'm a big fan of lightweight formal methods, in tools like Alloy. They've historically filled a niche for me between implementation and full-blown verification, where you're able to explore the problem from the perspective of building intuition
01:07:54	Kaleb Alves:	Not about understandability, so I'll put my question here. Raft has a "disadvantage" over paxos on elections. If you havea bunch equally up-to-date servers running, they might all get minority of votes, so they must wait for random time then retry. In the real world, this doesnt looks like a problem, but I never saw "how" this problem is solved behind the curtains
01:11:05	João Guilherme Araújo:	My question may be more historical, but why did you invest so much time on understanding Paxos and then creating a more understandable alternative as opposed to doing the same for Viewstamped Replication ?
01:12:26	Quinn Wilton:	^ My guess is that they didn't set out to replace Paxos, since the paper seems to suggest that Raft grew out of difficulties in fully understanding Paxos. Obviously John would be able to give a better answer though
01:13:39	John Ousterhout:	We started with Paxos, because it was considered the "gold standard". But, once we started a new design, we weren't thinking of a "better Paxos"; we were just starting from scratch to design the simplest possible consensus algorithm. We also looked at Viewstamped Replication, but it also seemed pretty complicated.
01:14:25	João Guilherme Araújo:	got it, thanks!
01:16:57	Joshua Fontany:	😄
01:19:14	Quinn Wilton:	A certain tendency of the database community
01:19:24	Brooklyn Zelenka (@expede):	That's the one!
01:19:25	Quinn Wilton:	https://arxiv.org/abs/1510.08473
01:22:19	Philipp Krüger:	Is this the point where we talk about BFT? :)
01:22:38	Brian:	This looks like the Wolfram + Metcalfe discussion: https://www.youtube.com/watch?v=Vti0sCgfXsQ
01:23:47	Quinn Wilton:	She's done some great work with Martin on CRDTs too. I believe she was on the JSON CRDT paper, which is a great read
01:25:30	Brent Shambaugh:	This is a quality meeting group. I might need to think more about notes and not rushing during reading so I can contribute more. I learned a lot about consensus anyways, so thanks .
01:25:46	Philipp Krüger:	<3
01:27:40	Nithin Thomas:	about commit stage. is there a possibility for a leader to commit a log entry, but crash before it is sent to followers? The paper explains how a new leader will not commit any pending entries of previous leader. But that may give a false response to client from previous leader?
01:28:27	Philipp Krüger:	I don't think so. The leader sends the log entry to all followers and waits for answers before committing the value.
01:29:09	Philipp Krüger:	If the leader crashes before sending log entries to followers, then the client will notice a time-out on their operation, or something like that I think
01:29:17	John Ousterhout:	If a leader crashes before sending a log entry to followers, then it will not respond to the client; the client will have to retry (from scratch) with a new leader, and that leader will respond to the *retry*, not the original.
01:29:18	Philipp Krüger:	(i.e. the value isn't committed)
01:29:39	Nithin Thomas:	got it, thank you
01:30:34	Jagan:	http://www.sigcomm.org/sites/default/files/ccr/papers/2015/August/2829988-2790010.pdf
01:32:13	Kaleb Alves:	yep, I've seen this kind of problem in MongoDb replicas/servers. Good thing is that mongo is raft-based
01:32:54	Jagan:	http://www.sigcomm.org/sites/default/files/ccr/papers/2015/August/2829988-2790010.pdf   this is the link to issue with Raft
01:32:57	Quinn Wilton:	@paul https://decentralizedthoughts.github.io/2020-12-12-raft-liveness-full-omission/
01:33:57	Philipp Krüger:	https://community.wolfram.com/groups/-/m/t/2575423
01:35:15	John Ousterhout:	I'm going to have to sign off now... thanks for the interesting discussion!
01:35:23	Brooklyn Zelenka (@expede):	Thanks for joining John!
01:35:28	Sodium :	\o thanks for coming!
01:35:29	João Guilherme Araújo:	Thanks for joining us!
01:35:34	Philipp Krüger:	Thank you so much for joining & answering questions! )
01:35:39	Eleanor:	thank you!
01:35:59	João Guilherme Araújo:	I'll also have to drop off now, awesome discussion, bye folks 🙂
01:36:03	Brent Shambaugh:	Can the leader lose data while all of the followers die or have amnesia? This will lead to a new log entry at the same index? 
01:36:27	Brent Shambaugh:	Adios John ad Joao.
01:37:28	Philipp Krüger:	Yes. Raft doesn't protect against that, it can only survive a fault in the minority of nodes (e.g. 2 faulty nodes of 5, or 1 in 3)
01:38:00	Brooklyn Zelenka (@expede):	https://discord.gg/j8P3DJ4z
1 Like

Heidi Howard’s related work was mentioned a couple times in the session. @matheus23 posted this video afterwards

1 Like