Sunday, March 1, 2015

My trick for coordinating Dropbox collaborations

There are high-tech solutions to managing the collaborative paper writing process, such as using version control systems like CVS or git. The downside to these systems is that when you are collaborating on a paper with a low-tech author, it is cumbersome to get them on board with these tools. Some of my collaborators have been from a non-CS background, and, shocking I know, but some CS people also don't want to deal with cryptic version control commands and error messages.

Dropbox provides a simple easy-to-use solution to sharing files which helps  teams to collaborate on projects, including collaborating on a paper. However, Dropbox does not have access control and cannot coordinate concurrent accesses to the same file by multiple writers. This causes overwritten/lost updates and frustrated collaborators.

I have a low-tech solution to this problem. I have used this solution successfully with multiple collaborators over the last 2-3 years. Chances are that, you have also come up with the same idea. If not, feel free to adopt this solution for your collaborations. It helps.

Here is the solution. When I share a folder to collaborate with coauthors, I create a token.txt file in that folder. This file is there to coordinate who edits which file at any given time.

Below is a sample token.txt file contents. The first part explains the guidelines for coordination, the second part lists the files to collaborate on. We generally use one file to correspond to one section of the paper.

Check with token.txt file to see who is editing which file in order to avoid conflicts.
If you want to write:
+ pick a section/file,
+ reopen token.txt to avoid seeing old-state,
+ write your name next to that section/file at token.txt to get the lock/token,
+ quickly hit save on token.txt, and
+ start editing that section/file on your own pace.

When done writing:
+ delete your name for that section/file at token.txt. (I.e., Unlock that file.)

Don't keep unlocked files open in your editor, lest you inadvertently save them over a newer version your collaborator wrote. An editor that auto-saves may also cause that problem.
eval: ...........Ailidani

ALL FILES: This file is always with Murat! Treat this as read-only, don't edit.

The token.txt file acts as a message board to show who is working on which file. Of course there can be concurrent updates to token.txt, but this is less likely because you make a tiny edit and then quickly save this file. Moreover, many editors, including Emacs, notifies you immediately if the token.txt file has been changed during the time you keep it open.

(This is in essence similar to the use of RTS-CTS messages in wireless communications. Instead of having a more costly collision for a long-size data transfer, it is an acceptable tradeoff to have occasional collided/lost RTS messages, because those are very short transfers and are low-cost.)

Yes, this is a lazy low-tech solution and requires some discipline from collaborators, but the effort needed is minimal, and I haven't got any bad feedback from any collaborator on this. During the time I used this method, we never had a problem with concurrent editing on the same file. I used this with teams of size 2-to-6, and only for collaborating on papers. One problem, however, we occasionally had was with someone forgetting to remove a lock after he is done working on that file. In this case, we resort to email: "Are you still holding the lock on this file? When do you plan to release it?"

Notice the "ALL FILES" entry at the end. That is when one author may want to lock all the files for a short duration easily. This happens infrequently, but comes handy for restructuring the paper, making fast substitution edits across all files, or spell-checking the entire paper before submission.

You may also notice a entry at the bottom. That is an org-mode file where I keep my TODO items and notes about the paper. It is my lab-notebook for this writing project. I can't function in any project without my file.

PS: Geoffrey Challen suggests: "You could also replace Murat's token.txt file with a Google Drive document." 

Friday, February 27, 2015

Salt: Combining ACID and BASE in a Distributed Database

This paper appeared in OSDI'14. The authors are Chao Xie, Chunzhi Su, Manos Kapritsos, Yang Wang, Navid Yaghmazadeh, Lorenzo Alvisi, and Prince Mahajan, all from The University of Texas at Austin. Here you can watch a video of the OSDI presentation of the paper, and also find the presentation slides and the paper available as open access.  USENIX knows how to do conferences right.

Dropping ACID

ACID (Atomicity, Consistency, Isolation, Durability) approach provides ease-of-programming through its simple transaction abstraction, but loses on the performance. BASE (Basically-Available, Soft state, Eventually consistent) approach, popularized by the NoSQL systems, provides good performance (low-latency, high-throughput, and scalability), but loses on the ease-of-programming due to increased complexity of concurrent execution in distributed systems.

This paper introduces Salt, which aims to find a best-of-both-worlds middle ground: to provide high performance like BASE with modest programming effort like ACID.

Less is more

Salt approach is motivated by the following observation (which is an instance of the Pareto principle). I am quoting from the paper:
"When an application outgrows the performance of an ACID implementation, it is often because of the needs of only a handful of transactions: most transactions never test the limits of what ACID can offer. Numerous applications [2, 4, 5, 10, 11] demonstrate this familiar lopsided pattern: few transactions are performance-critical, while many others are either lightweight or infrequent." 
Of course, a better quantification than saying "numerous" or "many" applications would make the argument for the Salt approach stronger. However, such an extensive study may take several months, and is unfair to ask for a conference paper submission.

It is tempting to increase the concurrency of those critical ACID transactions by splitting them into smaller mini ACID transactions. However, doing so may not be safe as you would lose Atomicity (which frees you from worrying about intermediate states during failures) and Isolation (which regulates which states can be accessed when transactions execute concurrently). The paper calls this a stark choice and cites the YouTube clip from Butch Cassidy to further drive this point home (I am not kidding).

The paper's proposal is this. Instead of worrying about all possible Atomicity & Isolation complications with all other ACID transactions, Salt categorizes transactions into ACID transactions and a small number of BASE transactions, and lets you worry only about checking Atomicity & Isolation complications between the BASE transactions.

Of course there is an unmentioned complication here. You also have to ensure that these BASE transactions leave the database in consistent state. This is not an easy task as well, and opens up Salt's ease-of-programming to debate. On the other hand, you would have to worry about this for all operations if you implement a full-BASE solution.

The Salt approach

Actually Salt's BASE transactions approach gets its inspiration from the ACID nested transactions. The paper says:
"Nested transactions is an abstraction originally introduced to offer, for long-running transactions, atomicity at a finer granularity than isolation. Our purpose in introducing BASE transactions is similar in spirit to that of traditional nested transactions: both abstractions aim at gently loosening the coupling between atomicity and isolation. The issue that BASE transactions address, however, is the flip side of the one tackled by nested transactions: this time, the challenge is to provide isolation at a finer granularity, without either drastically escalating the complexity of reasoning about the application, or shattering atomicity."

By providing isolation at a finer granularity, Salt allows BASE transactions to achieve high concurrency by observing each other's internal states, without affecting the isolation guarantees of ACID transactions. A BASE transaction is made up of multiple mini-transactions, called alkaline subtransactions. (Yes, the paper makes several Chemistry puns.) The committed state of an alkaline subtransaction is observable by other BASE or alkaline subtransactions. On the other hand, the committed state of an alkaline subtransaction is not observable by other ACID transactions until the parent BASE transaction commits.  As such, Salt guarantees that, when BASE and ACID transactions execute concurrently, ACID transactions retain, with respect to all other transactions (whether BASE, alkaline, or ACID), the same isolation guarantees they used to enjoy in a purely ACID environment.

The proverbial Bank example

When you talk about ACID versus BASE the classical example is the bank money transfer example. The paper gives a pure ACID, a pure BASE, and a Salt implementation of this example.

I wonder, which implementation is closer to the way the money transfers are actually handled in real banks. If someone familiar with how banks handle the money transfer can weigh in on this, that may shed a light on this debate.  My money is on the pure BASE approach.

Something about this bank example is odd though. The total-balance operation is an unrealistic big/monolithic operation. Why did that have to be a transaction? The total-balance operation could be done with a consistent/timed snapshot, no? Then why did it have to be a transactions spanning/freezing the entire system?

Now let's check the three implementations more closely. The implementation in Figure 1.a (ACID) uses only one type of locks, ACID locks. Figure 1.b (BASE) uses only one type of locks, local locks. BASE programming looks like classical distributed system programming, it tries to do everything with local asynchronous actions as much as possible. Figure 2 (Salt) uses Alkaline, Saline, and ACID locks.

Why does Salt need 3 different locks? Because the first alkaline transaction is special, as it determines whether the entire BASE transaction it is included in will succeed or not. If that first alkaline subtransaction succeeds, then comes the saline-lock to make the intermediate states in the rest of the BASE transaction available to other BASE and alkaline transactions. But then, this is something I am still not clear on. Salt attributes significance to the first alkaline subtransaction, so its design should also be special. But I am unclear about what should go in that subtransaction? Are there guidelines to design that first subtransaction so it captures the successful completion of the rest of the BASE transaction?


Like any OSDI/SOSP paper worth its salt (see, I can also make "Salt" puns :-), the paper includes a strong evaluation section. The authors implemented a Salt prototype by modifying MySQL Cluster. They evaluated the prototype using the TPC-C benchmark, which consists of 5 types of transactions: new-order (43.5%), payment (43.5%), stock-level (4.35%), order-status (4.35%), and delivery (4.35%). The results show that by BASE-ifying 2 of these transactions, Salt achieves 80% of maximum throughput of a full BASE implementation.

Notice the left side of Figure 7, which shows that ACID version faster than the Salt version under low load. The paper doesn't discuss this but, if we had both ACID and BASE version of transactions, it may be possible to hot-swap between these two versions to optimize the throughput of the system depending on the current system load.

Related work

The paper fails to cite several papers that proposed orthogonal approaches to deal with the same problem: improving the performance of ACID transactions, while keeping the ease-of-programmability.

The Red-Blue Actions paper (OSDI'12) considered the problem of reducing the granularity of ACID transactions by using a generator operation and shadow operation and categorizing the operations  as red (consistency critical) and blue (fast and eventually consistent). That paper also uses a Bank example.

Then there are work from the Berkeley group on this problem. I had summarized the Invariant-based coordination avoidance paper earlier. There is also the Highly Available Transactions paper  (VLDB'13) where they  look at ACID isolation level relaxing for improving availability.


The paper proposes the Salt approach assuming the ACID implementation of the application is already present. The application for this is if you already have an ACID system, you can Salt it to fine-tune the performance. Another approach would be to start with a BASE implementation and to increase atomicity/isolation for a couple operations. Maybe for some applications, it will be easier for going from BASE to Salt: Starting with BASE and then adding Alkaline subtransactions and ACID transactions (if necessary).

The Salt approach do require a learning curve and can be tricky. But how do we argue the complexity of the programming effort needed for one approach (say Salt) versus the other (say pure BASE) in an objective manner? One can objectively show performance improvements with evaluations, but it is harder to argue for "ease of programming" in an objective/quantitative manner.

There can also be some complications with the early-committing (after the first alkaline subtransaction) in BASE transactions. This question came up in the OSDI presentation. If deadlock occurs the MySQL Cluster approach is to use timouts and roll the transactions back. In this case, Salt throws exceptions which the developer need to address to re-achive consistency and atomicity.

Tuesday, February 17, 2015

Book report: "Good Prose" and "How to write a lot"

Notes  from the "Good prose"

This book is written by a journalist turned nonfiction author and his editor.
This is a nice book, and a good read.

"Quiet beginnings: You cannot make the reader love/trust you in the first sentence, but you can lose the reader with a grand proposition in the first sentence."

"To write is to talk to strangers. Prepare the reader, tell everything reader needs to know to read on, but no more."

"For a story to live, it is essential only that there be something important at stake, a problem that confronts the characters or reader. The unfolding of the problem and its resolutions are the real payoff."

"Revelation transforms events into a story."

"Don't mess with chronology unless you have a good reason."

"On the topic of essays:  Essayists tend to argue with themselves. Who am I to write this? Who cares to read this? If I knew my own mind, I would not make essays, I would make decisions. --Montaigne."

A main take-away from the book is the respect these writers show for their craft. They spend countless hours drafting/revising/editing with patience. They spend days discussing about improvements on a draft. The important thing is to get it right, do the right thing. At some point, over a dispute on one word with the editor in chief of Atlantic, they consider quitting their dream jobs there on principle.

Notes from "How to write a lot"

This book aims to help academicians to write a lot. It is a short book and it has a single, simple message: "schedule time for writing 2 hours daily in your calendar, and sit down to write at those times".

"Writing is a skill, not an innate gift or special talent. Like any advanced skill, writing must be developed through systematic instruction and practice. Learn the rules and do deliberate practice."

"Schedule time for writing and stick to it."

"Any action that is instrumental in completing a writing project counts as writing."

"No internet. The best kind of self control is to avoid situations that require self control."

Some recommended reading by the book are: "Writers book of hope" and "Professors as writers". I hope to check these later.

Tuesday, February 10, 2015

Paper summary: Perspectives on the CAP theorem

This is a 2012 short paper by Seth Gilbert and Nancy Lynch that appeared in a special issue commemorating the 12th anniversary of the CAP theorem. Gilbert and Lynch get to write in this special issue because they were the ones to first publish a proof the CAP conjecture put forward by Eric Brewer in PODC 2000 keynote.

In this paper, Gilbert and Lynch aim to situate the CAP theorem in the broader context of a family of results in distributed computing theory that shows impossibility of guaranteeing both safety and liveness in an unreliable distributed system. A quick refresher on specifications is in order. Safety says "nothing bad happens". (You can satisfy safety by doing nothing.) Liveness says "eventually something good happens". Unreliability (aka failure model) says "such and such faults happen".

The impossibility results surveyed in relation to CAP concern slightly different problems and slightly different fault models. While it is easy to confuse CAP with those results on a superficial look, on a closer inspection we see that the results are all distinct and none subsume CAP.

The CAP problem 

The CAP theorem does NOT consider the consensus problem, but considers an easier problem: the atomic read/write register (aka atomic storage) problem. Atomic means that the system provides linearizability, a strong type of single-copy consistency that guarantees that a read returns the most recent version of data. The specifications of this problem are as follows. (The first is the safety property, the second one liveness.)
Consistency: The system provides its clients with a single register (emulated by
multiple storage nodes), and each client can read or write from that register.
Availability: Each request for read or write eventually receives a response.

The FLP (Fisher-Lynch-Patterson) and the attacking generals impossibility results consider the consensus problem. The specifications for consensus are as follows. (The first two are safety properties, the last one a liveness property.)
Agreement: No two process can commit different decisions.
Validity (Non-triviality): If all initial values are same, nodes must commit
that value.
Termination: Nodes commit eventually.

So here is the difference between consensus and atomic storage. Consensus is supposed to dutifully remember a value that is anchored (stored by a majority number of nodes). Consensus is loyal to making that value persist as the committed decision. Atomic storage doesn't have that responsibility. The nodes don't need to commit to a decision value, so the system doesn't need to keep track of and remember whether a value is anchored. The atomic storage system as whole accepts new writes as long as the reads don't return results that betray the single register (i.e., single-copy) abstraction.

And what is the implication of this difference? FLP result declares that even under reliable channels assumption, consensus is impossible to solve in an asynchronous system with node crash failures. For example, Paxos loses liveness because it can not converge to a single leader in an asynchronous model. Did the current leader crash? The failure detector cannot be accurate. If the failure detector incorrectly says that the leader (who is supposed to ensure and remember that a value is anchored) is not crashed, liveness is violated since nodes keep waiting on a failed leader. If failure detector incorrectly says that the leader is crashed, then you have multiple leaders, and liveness is violated because of multiple leaders dueling with forever escalating ballot numbers to get the majority to accept their proposal.

On the other hand, since the atomic storage problem doesn't care about remembering whether a value is anchored, it is oblivious to the dueling leaders clients, and as such it is solvable for crashes of up to half of the nodes with the FLP model (i.e., with reliable channels in an asynchronous system). I had blogged about the Attiya, Bar-Noy, Dolev (ABD) algorithm that achieves this feat.

Now that we know atomic storage problem is solvable with reliable channels with up to minority crashes, what can we say about the atomic storage in the presence of unreliable channels? That is covered by the CAP theorem's fault model, which we discuss next.

The CAP fault model 

We discussed the specifications of the problems considered by CAP, FLP, and attacking generals, but we omitted to talk about another important part of the system specification, the unreliability/fault model.

Above I had introduced the FLP fault model when discussing solvability of consensus versus atomic storage in the FLP model. FLP fault model assumes reliable channels, asynchronous system, crash failure. Of course, by assuming reliable channels, you don't get reliable channels in your deployment. That is just wishful thinking. But since the attacking generals impossibility result proved that consensus is not achivable in the presence of unreliability channels, FLP had to consider reliable channels. Even then, we have disappointment; consensus is also impossible in the FLP model.

CAP does something courageous and considers unreliable channels again (as in the attacking generals fault model) in its fault model. Since CAP is concerned with the atomic storage problem, which is a slightly easier problem than consensus, the attacking generals impossibility result does not subsume the CAP result.

CAP result says that atomic storage problem is also impossible to solve under unreliable channels.

Recall that ABD solved the atomic storage problem in the FLP model. If we move to the CAP fault model and allow partitions, we observe from the ABD algorithm that it blocks (loses availability) for a read or write request that arrives to a node in a minority partition. Just as the CAP says, either consistency or availability has to give.

Here is the proof sketch verbatim from Gilbert-Lynch paper.

Similar to the attacking generals result, the CAP result is oblivious to whether the system is synchronous or asynchronous, and holds in both cases.

What is remaining?

Observe from the CAP proof sketch that the CAP fault model is very rough. When it says unreliable channels, it allows you to assume the worst case (i.e., no message makes it through at all), and prove the impossibility result for that worst case.

What if we quantify and limit the unreliability of the channels to more realistic scenarios. Can we prove more refined versions of CAP? What would be the consistency level a system can provide if the system model allows eventual message arrival? A recent technical report from University of Texas Austin, "Consistency availability convergence" paper, looks at that problem. We will discuss that paper next in our distributed systems seminar.

More about CAP tradeoffs

The Gilbert-Lynch paper discusses some of the practical implications of the CAP theorem and says that Consistency versus Availability should not be seen as an absolute and binary tradeoff. Instead you can consider shades of Consistency versus Availability. Also you can make different Consistency versus Availability tradeoffs at the data level, operation level, and subsystem level. These observations are very similar to the suggestions made in Eric Brewer's article in the same special issue: "CAP 12 years later, how the rules have changed".

The Gilbert-Lynch paper also mentions the scalability problems caused due to trying to enforce consistency, but leaves that discussion as future work. PACELC model by Daniel Abadi provides a more detailed explanation for Low-latency versus Consistency tradeoffs in the absence of partitions.

Saturday, February 7, 2015

How to present your work

Presentation is a very important skill. Determining how to present and communicate your results in an effective manner is  as important as doing the research and getting those results. From the same material you can get a killer job talk or a bummed dissertation talk. Presentation skills can make the difference.

Presentation is not a soft skill despite the common misconception among many technical people. It takes a lot of brains, analyzing, and synthesizing to produce a good presentation. You have to understand and internalize your content very well in order to present it clearly in the most engaging and compelling way. So if you fail to present your work clearly, that reflects poorly on you and your work. The audience will think "the work doesn't look significant or promising", or even worse "the presenter doesn't truly understand/internalize his work".

The most essential requirement for a successful presentation is practice. I observe that our graduate students are not good at presenting because they don't get enough practice and rehearsals. As with any skill, you will learn about presenting best by doing it. So don't waste any opportunity to talk about your work. Talk about your work to your relatives, who won't understand. (This is very useful, because you get to think of how to explain your contributions in the context of the real world out there.) Tell it to your friends not in the same field. Tell it to your lab mates. Give talks in the department seminars. You need to learn how to get feedback and gauge your presentation. Eventually, you will get to present your work in conferences.

The "Talk like TED" book 

I recently picked up this book from a library. (Yes, a physical library.) The book was a very easy, lightweight reading. The book's outline consists of the 9 tips it offers for giving successful TED-like talks. It argued that good talks need to be emotional, novel, and memorable, and categorized the 9 tips under those 3 headings.
+ unleash your passion
+ tell a story
+ have a conversation
+ teach something new
+ deliver WOW moments
+ use humor
+ keep it short
+ paint a mental picture
+ be authentic

This may be my own failing, but I didn't learn/benefit much from the book. (Shall I say, the book was not very emotional, novel, or memorable?) If you had done some reading about public speaking before, this book does not offer much new insights or content. This book could have been condensed to a long blog post.  For example, read this article by Chris Anderson instead.

Maybe the most interesting take away from the book is how much practice the presenters put into the talk. I knew TED speakers rehearsed a lot, but I was still surprised to learn how much. Six months before the talk, the presentation draft is there. Then it is all practice rehealsals and tweaking to refine and fine-tune. One month before the talk, the final form of the presentation is ready.

Keep your eyes on the message 
I think the best presentation advice I received was from another TED, Ted Herman. After hearing me speak as a graduate student on one of my work, he told me: "Focus on the message. Give your presentation with the sole goal to teach, and not to impress."

If you worry about impressing the audience in a presentation, then your ego gets in the way, you get self-conscious. You start questioning "Did I say that right? Was I too slow? Am I perceived as confident enough? etc." This will get you nervous and self-doubting. If you focus on the message, you will get your point across, even if it takes flapping your arms wildly. And if your content is good, it will impress people after all. As Anderson says "Presentations rise or fall on the quality of the idea, the narrative, and the passion of the speaker. It’s about substance, not speaking style or multimedia pyrotechnics."

Focusing on the message is hard to do using Powerpoint/Keynote. Powerpoint makes it easy to lose the focus on the message. By writing slide after slide and getting things stylistically right (which Powerpoint facilitates immensely), you get the false-sense of security that you are communicating the content. To avoid this and to force yourself to focus on the message/content, you should prepare the story and outline of the talk first in your mind, before sitting down to prepare the slides. How would you give this talk if you were forced not to use any slides. Thinking this way and purposefully omitting the slides as crutches will help you learn, discover, and hone your message. (More on this below, where I talk about framing the talk.)

To reiterate what I said in the beginning of the post, presentation is not a soft skill and requires a lot of brains. In order to produce a clear and condensed message, you need to learn how to abstract the most important lessons from all the work you did. This requires you to first process and internalize your work really well. You should omit a lot of incidental details, and provide a condensed message in a conference presentation or job talk. These talks are there to whet people's appetites and get them to read your papers. This does not mean that these talks should omit technical content. On the contrary they should include technical content that communicates the essence of your technical contribution in a clear and accessible manner.

Frame your talk well

And from my PhD advisor, Anish Arora, I learned how to frame a talk well. Framing means finding the right place to begin and developing the right story and context to pitch the ideas.

When we meet to discuss how to present our work, Anish would declare "let's try to understand what we did in this work". Wow! He was totally comfortable accepting that we could understand our work better, and the context at which we started and performed the work is not necessarily the best context with which we present the work. From Anish, I learned that it is OK to search for a new perspective and a better context to frame and present the work.

In fact, your best presentations are which you also learn and get a new appreciation of your work. If you get that new understanding of your work, that is a good indication that you did enough to frame your work, and you achieved a good understanding and focusing of your message.

Related links

Presentation skills is somewhat related to writing skills. So most of the advice here also applies to writing. Similarly some of the advise on writing also applies to preparing a presentation.
How to write your research paper
How I write

What is the biggest rock?
You can get the style right, and give a content-free and superficially interesting talk. But as Lincoln said: "You can fool all the people some of the time and some of the people all of the time, but you cannot fool all the people all the time."

Saturday, January 31, 2015

Dijkstra's stabilizing token ring algorithm

One of the classical algorithms I teach in my distributed systems class is Dijkstra's stabilizing token ring algorithm. This algorithm has started the self-stabilization field as a subfield of fault-tolerance. And, it still receives interest even after 40 years. There has been probably hundreds of self-stabilization papers that revisits Dijkstra's stabilizing token ring algorithm as part of a solution or as part of a case study. This algorithm never gets old for me as well. I still enjoy talking about this algorithm in class and thinking about it once in a while. I guess this is because it is a very elegant algorithm.

Linked is Dijkstra's original paper introducing the algorithm. This paper also includes two variants of the stabilizing token ring algorithm, 3-state and 4-state token ring algorithms. Dijkstra would later do a followup writeup, titled, "A belated proof of self-stabilization", where he gave a proof of stabilization for 3-state token ring program.

As a side note, when I asked my distributed class whether they heard Dijkstra's name before, almost all hands went up. And almost all of them had heard his name due to the "all pairs shortest path algorithm". Interesting. Probably, in Dijkstra's mind, that algorithm was a minor contribution. He came up with that algorithm at a cafe in 20 minutes without using pen and paper. None of the students heard the "Go-to considered harmful" paper, which may be closer to what Dijkstra was aiming with his career. Dijkstra's later work focused on predicate calculus for enabling  structured/principled verifiably-correct programming.

Stabilizing fault-tolerance

Stabilization is at the heart of the Dijkstra's token ring algorithm. Stabilization is a type of fault tolerance that advocates dealing with faults in a principled unified approach instead of on a case by case basis. Instead of trying to figure out how much faults can disrupt the system stabilization assumes arbitrary state corruption, which covers all possible worst-case collusions of faults and program actions. Stabilization advocates designing recovery actions that takes the program back to invariant states starting from any arbitrary. Anish Arora's paper "Practical Self-Stabilization for Tolerating Unanticipated Faults in Networked Systems" provides a great overview of stabilization concepts.

Token Ring stabilization

Let's start with the simple token ring concept. The processes are arranged in a ring fashion and there is a unique token circulating in this ring. (You can think of the token may be providing mutual exclusion to the processes; whichever process has the token can access the critical-section/shared-resource).

In the invariant states there is exactly one token. So in the bad states, there are two possible cases: 1) we end up with a ring with no tokens,  2) we end up with a ring with multiple tokens.

Let's consider how Dijkstra's token ring algorithm deals with case 1. Dijkstra's token ring algorithm maps the concept of a physical token to a relation between two neighboring processes. This mapping is done skillfully to limit the state space so the case 1 faults cannot possibly happen in that state space. In the mapped state space, any arbitrary state  is guaranteed to contain 1 or more tokens.

How does this work? Dijstra's algorithm encodes a token with the relation: "the x variable of a process is DIFFERENT than the x variable of its predecessor in the ring". In this case, if all x variables are the same, there won't be any token in the system. To fix this Dijkstra's algorithm encodes the token differently for process 0: "There is a token at process 0, if its x variable is the SAME as that of its predecessor, process N". Encoded this way, we cannot not have 0 tokens in any arbitrary state in the state space.

A good analogy for bounding the corruption space is the use of the carousal door. For a normal door, you have two possible states: door is open or door is closed. For a carousal door, there is only one possible space: the door is closed. (Yet you still make it through that door by revolving through it.)

In order to deal with case 2, Dijkstra's algorithm uses some token overwriting mechanisms. First let's observe that no process action introduces an extra token to the system. (Of course, a fault action can corrupt state arbitrarily, and can throw the system to a state with many extra tokens.) In the normal/fault-free case, by taking an action a process consumes a token that exists between its predecessor and itself and passes this token to its successor. In the faulty case, this is also the worst damage a process can do, because it will always consume the token at itself, and since its state change affects only its successor, it may or may not pass the token to its successor. If the process execution does not pass the token to its successor, this is actually good: an extra token is removed from the system. And if enough of these happen, the system will stabilize to exactly one token case.

But what if this leads to a system with no tokens? Recall that this cannot happen due to the discussion regarding Case 1.

Stabilizing token ring algorithm in TLA+ 

After explaining the program actions, in order to help students internalize what is going on, I call 5 students to stand up in front of the class. The students form a ring. And they use hands up or down to represent x=1 or x=0, and enact a simulation of the program starting from some arbitrarily corrupted states. The students first get to observe that there is indeed no configuration possible with 0 tokens. Then as they enact several scenarios, they observe how no new tokens are introduced by a process action, but in some cases an extra token is consumed.

Then, I challenge them to come up with a scenario where this binary token ring formed does not get to converge.  The corruption continues to live on in the system.  Here is such a run in a ring of 4:
1010 --> tokens at nodes 1,2,3
1011 --> tokens at nodes 1,2, and 0
1001 --> tokens at nodes 0,1,3
1101 --> tokens at nodes 0,2,3
0101 --> tokens at nodes 1,2,3
0100 --> tokens at nodes 1,2,0
0110 --> tokens at nodes 0,1,3
0010 --> tokens at nodes 0,2,3
1010 --> tokens at nodes 1,2,3
and this gets us to the first state. The system can repeat this loop forever without stabilizing.

Of course, when you run the model checker for this setup (where x is limited to 0 and 1) the TLA+ model checker provides you this counterexample.

The M>=N requirement 

So, what went wrong? Wasn't Dijkstra's token ring supposed to be stabilizing? The first time around we discussed the algorithm since x was an unbounded integer, the algorithm was stabilizing. When we considered the binary token ring example, x was bounded by 2, and that failed to stabilize in a ring of size 3 or more. There is a relation on M, the bound on x, and the algorithm's ability to stabilize for a ring of size N. We had seen in the counter example that for M<N there is a way to form a loop in the execution trace and keep the corruption going in the system.

For M>=N, the node 0 acts as a gate. Consider the case where node 0 never fires. In this case the other nodes fire, and eventually they all copy the x value at node 0, and in that configuration the system reduces to one token (at node 0) and stabilization is achieved. For M>=N, we can use the pigeon-hole principle to show that the system  eventually reaches the case where node 0 does not fire until all other nodes  copy its value. This is because, for M>=N, at any configuration there is a value "v" of x that is not currently present in the ring. Eventually node 0 will hit "v", and then node 0 will not be enabled until v is copied by all nodes. This is how node 0 acts as a gateway.

Concluding remarks  

I may talk about Dijkstra's 3-state and 4-state token ring algorithms in another post. Those are also interesting and elegant algorithms.  In a paper I wrote in 2002, I used Dijkstra's classic, 3-state, and 4-state token ring algorithms as case studies. If you like to learn more about this, you can read that paper here.

Related links

Crash only software 
The role of distributed state

Thursday, January 29, 2015

My Distributed Systems Seminar reading list for Spring 2015

Below is the list of papers I plan to discuss in my distributed systems seminar this semester. If you have some suggestions on other good/recent papers to cover, please let me know in the comments.
  1. Perspectives on the CAP Theorem
  2. Consistency, availability, and convergence.
  3. Salt: combining ACID and BASE in a distributed database
  4. Extracting More Concurrency from Distributed Transactions
  5. Eidetic Systems
  6. Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems
  7. The Mystery Machine: End-to-end Performance Analysis of Large-scale Internet Services
  8. All File Systems Are Not Created Equal: On the Complexity of Crafting Crash-Consistent Applications
  9. GraphX: Graph Processing in a Distributed Dataflow Framework
  10. Architecture of a database system
  11. Arrakis: The Operating System is the Control Plane
  12. Fast Databases with Fast Durability and Recovery Through Multicore Parallelism
  13. Project Adam: Building an Efficient and Scalable Deep Learning Training System

My experience with using TLA+ in distributed systems class

I used TLA+ in my distributed system class in Fall 2014. (To learn the backstory on this, read my pre-semester TLA+ post.)

In short, I loved the experience and I am hooked. Integrating TLA+ to the class gave students a way to get a hands-on experience in algorithms design and correctness verification. Even for a sophisticated algorithm, such as Paxos, I can refer the students to TLA+ to practice and play with the algorithm, so they can internalize what is going on. Throughout the semester as I had more chance to work with TLA+, I started growing a library of TLA+ modeling of distributed algorithms. Next time I teach the class, I hope to provide the students with a TLA+ model of each algorithm I cover in the class.

The student feedback was also positive. The students liked TLA+ since it gave them a way to experiment and supported them in reasoning about the algorithms. Several students complained of the learning curve. They wanted me to cover TLA+ in more detail in the class. (The TLA+ manual can be very dense for the students.) Next semester I plan to allocate more time getting the students acquainted with TLA+. (Next time around, I will also work on devising automated ways to grade the TLA+ projects. My TA had do manual testing of the projects, not a pleasant task for a class of 60 students.)

The TLA+ mini-projects I used

Throughout the semester, I assigned 3 TLA+ mini-projects with increasing complexity.

The aim of the first mini-project was to get the students started with TLA+. I asked the students to model a bean can problem (white and black beans with rules to remove beans form the can). This was simple and there was only one process. But the students got to learn how they can check safety and liveness conditions on their program.
The second mini-project required them to deal with multiple processes. I asked them to model the logical physical clocks we had introduced in our research.  Using TLA+ model checker the students found the counterexample in the naive algorithm, where the logical clock gradually but unboundedly gets ahead of the physical clock. This counterexample is hard to find and involves 30 steps in the algorithm. So this was an example were TLA+ proved its value.  (Try to figure out this counterexample with unit tests, I dare you.)
The final mini-project gave them more practice on modeling distributed algorithms. I asked them to model Dijkstra's classical stabilizing token ring algorithm as well as Dijkstra's lesser known 4-state stabilizing token ring algorithm. Students learned about self-stabilization and how it deals with arbitrary memory corruption.

Lessons learned (the good, the bad, and the ugly)

I learned that PlusCal language is the right abstraction for modeling distributed algorithms. TLA+ is too low level for writing (and reading) distributed algorithms. PlusCal allows you to include TLA+ expressions and definitions, so you benefit from all the expressive power of TLA+ without going full TLA+. PlusCal is not a separate entity, but just a more convenient way to write TLA+ code. In the TLA+ toolkit you can write the algorithm in PlusCal and hit the "translate to TLA+" to produce the corresponding TLA+ code. When I use the term "modeling an algorithm in TLA+", I really mean writing the model in PlusCal, auto-translating it to TLA+ and model checking it.

Learning to model algorithms using TLA+ takes time. The language is powerful and very general; it is Math after all. You need to study many examples to learn about the RIGHT ways of doing things. I experienced this as a novice first hand. I had written a PlusCal code for Paxos within a couple days. My code worked, but it used procedures, and it was very procedural/operational rather than declarative, the way Math should be. And so my code was very slow to model check. Model checking with 2 acceptors worked, but I gave up on model checking with 3 acceptors since it took more than an hour. Then I found Lamport's PlusCal code about Byzantine agreement. I rewrote my Paxos implementation, imitating his style. I used macros, instead of procedures. My code become more declarative than operational. And, for message passing, I used Lamport's trick of using record typed messages which are written just once to a shared message board; it was the acceptors responsibility to react to them (by just reading, not consuming them). My new refactored code translated to ~150 LOC TLA+, whereas my first version had translated to ~800 LOC TLA. And the new version model checked in a minute for 3 acceptors, leading to a couple of magnitudes of order improvement in efficiency.

While PlusCal is nice, the crutch of using labels in PlusCal algorithms (for enabling TLA+ translation) is a bit ugly. The labels in PlusCal are not just aesthetic. They determine the granularity of atomic execution of blocks of code, so it is critical to place the labels right. If you fail to do this right, you will run into deadlocks due to co-dependent await conditions in different blocks of code. Inserting a label reduces the granularity of block of code from atomic to break at the label. Labels were a little unintuitive for me. But after some practice, I was able to get a hang of them.

Another ugly part is the lack of data structure support for passing messages and parsing them. Using tuples without field names is flexible, but it becomes intractable. Using records help somewhat, but manipulating field names in records also gets complicated quickly. TLA+ doesn't have static/strict typing, and I got stuck with runtime errors several times. In one case, I realized I had to find a way to flatten out a set inside of another set; after a lot of head banging, I found about the UNION operator, which solved the problem.

Good engineers use good tools. And TLA+ is a great tool for modeling distributed algorithms. Modeling an algorithm with TLA+ is really rewarding as it makes you "grok" the algorithm.

Sunday, November 23, 2014

Google Cloud Messaging (GCM): An evaluation

I had written about our work on building a crowdsourced superplayer for the "Who wants to be a millionaire (WWTBAM)" quiz show earlier. In that work we developed an Android app that enabled the app users in Turkey to participate in WWTBAM in real time as the show was airing on TV. When a question was read by the show host, my PhD students typed the question and the multiple-choice options, which were transmitted via Google Cloud Messaging (GCM) to the app users. App users played the game, and enjoyed competing with other app users, and we got a chance to collect precious data about MCQA dynamics in crowdsourcing. Our app was downloaded 300K+ times, and at the peak of its popularity 20K participants played the game simultaneously.

We used GCM to send the questions to the participants because we wanted to keep the app simple. GCM is the default push messaging solution for the Android platform and is maintained by Google as a free service with no quotas. GCM allows app developers to send push messages to Android devices from the server. As an alternative to GCM, we could have developed our app to maintain open TCP connections with the backend servers, but this would have made our backend design complicated and we would need many machines to be able to serve the need. Using GCM, we were able to send quiz questions as push messages easily, and can get away with using only one backend server to collect the answers from the app users. Another alternative to GCM was to deploy our messaging service, say using MQTT, but since we were developing this app as an experiment/prototype, we wanted to keep things as simple as possible.

After we deployed our app, we noticed that there were no studies or analysis of GCM performance, and we wondered whether GCM was enough to satisfy the real-time requirements in our application. Since we had a lot of users, we instrumented our app to timestamp the reception time of the GCM messages and reply back to the backend server with this information.

Our evaluation was done for both offline and online mode of GCM messaging. In the online mode, we send the message to the online participants that are playing the game while WWTBAM is broadcasting. That is, in the online mode, we know that the client devices are powered on, actively in use and have network connection. In the offline mode, we send the GCM message at a random time, so the server has no knowledge of whether the client devices are powered on, are in use, or have network connection. Table2 shows the number of devices involved in each of our experiments.

We found a giant tail in the GCM delivery latencies. Figure 4 above shows the message arrival time cumulative distribution in the online and offline modes. Note that the x-axis is in logarithmic scale. Table 6 shows message arrival times broken into quartiles. The median and the average message arrival times in the table indicate that the latency in the offline experiment is significantly high compared to the online experiment.

In Figure 5, comparing the GCM delivery times under WiFi versus cellular data connection in the offline mode shows that GCM delivery latencies are lower using the cellular data connection.
To investigate the GCM arrival times in the offline scenario further, we devised a double message experiment, where we send 2 GCM messages initiated at the same time on our server. This time, instead of measuring the message arrival time, we measure the difference between the arrival times of the first message and the arrival time of the second message. We were expecting to see very low time difference between these twin messages, but Figure 9 shows that the giant tail is preserved, and that some of the devices receive the second message hours after receiving the first one. Since we know that, at the time the first of the twin messages has arrived to the device, the device is reachable by Google’s GCM servers, the latency for the second message should be due to scheduling it to be sent independently than the first message.

These results show that the GCM message delivery is unpredictable, namely having a reliable connection to Google's GCM servers on the client device does not by itself guarantee a timely message arrival. It would be worth replicating this experiment where all participants are in the US, to see how that changes the results.

Our experiments raised more questions than it answered. Delving deeper at the documentation, we found some information that may explain some of the problems with GCM in the offline mode, but we need more examples to test these and understand the behavior better. It turns out that both cellular and WiFi modes keep a long lived connection to GCM servers. In the case of deep sleep mode (where user does not turn on the screen for long), the CPU wakes up and sends heartbeat to the servers every 28 minutes on cellular, every 15 minutes on the WiFi. This causes the connection to drop in many cases, because routers and providers kill the inactive TCP connections without sending any info to the client and/or server. Not all users are affected by this problem, it depends on carrier and/or router they use.

Related links

Our GlobeCom14 paper includes more information.

Saturday, November 15, 2014

Paper Summary: Granola, Low overhead distributed transaction coordination

This paper is by Cowling and Liskov, and it appeared at Usenix ATC'12.  Most closely related papers to this paper are the Sinfonia and Calvin papers. So, it may be helpful to also read the summaries of those papers from the links above to familiarize yourself with them.

This paper looks at coordination of 1-round transactions, which are different from general transactions that involve multi-round interaction with the client. 1-round transactions execute at the participant nodes, with no communication with other nodes except for at most a single commit/abort vote.

Figure 1 shows the system architecture. The clients are the transaction managers for these 1-round transactions. They initiate transactions and evaluate the commit/conflict/abort decisions returned for the transactions. The repositories communicate with one another (for at most a single commit/abort vote) to coordinate transactions.

There are 2 types of transactions in Granola: coordinated ones, and uncoordinated ones. Actually, better/less-confusing names for these 2 types would be lock-coordinated transactions and timestamp-coordinated transactions, since the latter type of transactions still involve coordination among the repositories.

To reflect this duality in coordination mode, each repository runs in one of two modes. When there are no coordinated transactions running at a repository, it runs in timestamp mode. When a repository receives a request for a coordinated transaction, it switches to the locking mode.

Uncoordinated transactions

There are two kinds of uncoordinated transactions: single repository transactions, or independent distributed transactions. Granola uses a timestamped-based coordination to provide serializability to both kinds of uncoordinated transactions and avoids locking for them. Examples of independent distributed transactions include read-only transactions, and local-read transactions, such as give every employee a 2% raise.

(In our previous work, the slow-fast paper (2010), we had also made an analysis similar to the independent transactions idea. Our slow-fast analysis inspects the program actions, which are precondition guarded assignment statements, and determined for which actions the atomicity can be relaxed so that they can execute in an uncoordinated manner. Our finding was that if the precondition of a program action is "locally-stable" (i.e., this precondition predicate cannot get falsified by execution of other program actions), then it is safe to execute this program action in an uncoordinated/nonatomic manner.)

Repositories are assumed to have access to loosely synchronized clocks (say NTP synchronized clocks). The use of timestamps in Granola are for ordering/serializability of transactions, and logical clocks would also suffice. Granola needs time synchronization for performance/throughput not for correctness. This separation is always a welcome one. (I can't resist but refer to our hybrid logical clock, HLC, work here. I think HLC would improve exposition of Granola, and would make it tolerant to malformed clients which may increase highTS and deny service to normal clients' requests.)

Figure 5 shows the straightforward single repository execution, and Figure 6 shows the more interesting independent transaction execution. Since these are 1-round transactions, the commit/abort decisions are local and one-shot at repositories. Voting is used to notify other repositories about the local decision (a conflict vote cause a repository to abort the transaction), and also for nominating a proposed timestamp for the transaction. The transaction is assigned the highest timestamp from among the votes.

In the "pure" timestamp mode, all single repository and independent distributed transactions commit successfully, and serializability is guaranteed as they are executed in timestamp order. This provides a substantial reduction in overhead from locking and aborts, which improve the throughput.

While in the timestamp mode, coordinated transactions may also arrive which may lead to conflict decisions to be returned for the single or distributed independent transactions executing on those repositories. Those repositories will switch to coordinated mode to serve the coordinated transactions. Once all coordinated transactions have completed those repositories can again transition to the timestamp mode.

Dually, in locking mode, repositories can still serve single repository and distributed independent transactions provided they do not conflict, i.e., they do not need to update a locked item. (In effect, the timestamp mode is just a shortcut to denote the lack of any coordinated transactions in the system.)

Coordinated transactions

Figure 7 shows coordinated transaction execution. Coordination is needed for a transaction that requires a remote read for abort/commit decision. For example, the transaction to transfer $50 from Alice's account to Bob's account first needs to remote-read Alice's account to certify that it indeed has more than $50. (Recall that in contrast for an independent distributed transaction the read, or the guard of the transaction, was local: give every employe 2% raise.) Different from independent distributed transaction execution, the coordinated transaction execution involves a prepare phase to acquire required locks. Locks ensure serializability for coordinated transaction execution. So timestamp order may not be satisfied in commit of transactions, and thus, external consistency may not be provided for coordinated transaction execution.


This paper is written with a focus on describing the system in reasonable detail to potential users of the system. This diverges a bit from the academic style, which would put the focus on the novelties and "intellectual merit" (a la NSF). Maybe this is because of the style/focus of the USENIX ATC conference. The evaluation section is also really nice and detailed. (Granola performs like Sinfonia for coordinated transacations, and improve throughput for uncoordinated transactions.)

It seems that Granola does not offer much advantage over Calvin. The paper compares in the related work with Calvin and states the following. "The Calvin transaction coordination protocol was developed in parallel with Granola, and provides similar functionality. Rather than using a distributed timestamp voting scheme to determine execution order, Calvin delays read/write transactions and runs a global agreement protocol to produce a deterministic locking order."

I think Granola may have an edge over Calvin for low-latency, especially for transactions that involve a couple repositories. The best way to see why is to consider this analogy.
Granola:Calvin :: CSMA:TDMA.

Friday, November 14, 2014

Paper Summary: Calvin, Distributed transactions for database systems

Calvin is a transaction scheduling and replication management layer for distributed storage systems. By first writing transaction requests to a durable, replicated log, and then using a concurrency control mechanism that emulates a deterministic serial execution of the log's transaction requests, Calvin supports strongly consistent replication and fully ACID distributed transactions, and manages to incur lower inter-partition transaction coordination costs than traditional distributed database systems.

Calvin emphasizes modularity. The holy trinity in Calvin is: log, scheduler, executor. When a client submits a transaction request to Calvin, this is immediately appended to a durable log, before any actual execution begins. Calvin's scheduling mechanism then processes this request log, deciding when each transaction should be executed in a way that maintains an invariant slightly stronger than serializable isolation: Transaction execution may be parallelized but must be equivalent to a deterministic serial execution in log-specified order. As a result, the log of transaction requests itself serve as an ultimate "source of truth" about the state of a database, which makes the recovery very easy.

I thought Calvin resembles the Tango approach a lot. (I had discussed Tango here recently.) It is almost as if Calvin is Tango's cousin in databases domain. As such, Calvin has similar strengths and disadvantages like Tango. For the advantages: Calvin provides good throughput, but will not get stars for low-latency. Calvin provides scalable replication, and strongly-consistent replication. (After you have one authoritative log, this is not hard to provide anyways.)

The centralized log is the source of all disadvantages in Calvin as well.  The transactions need to always go through the centralized log; so there are no truly local transactions. Thus Calvin will perform worse for workloads that have local/non-coordinating workload. So, the TPC-C workload Calvin uses for evaluation is actually best workload to show Calvin's relative performance to other systems.

The Log component

Calvin uses Paxos to achieve availability of the log by replicating it consistently. A group of front-end servers collect client requests into batches. Each batch is assigned a globally unique ID and written to an independent, asynchronously replicated block storage service such as Voldemort or Cassandra. Once a batch of transaction requests is durable on enough replicas, its GUID is appended to a Paxos “MetaLog”. Readers can then reassemble the global transaction request log by concatenating batches in the order that their GUIDs appear in the Paxos MetaLog.

Batching trades off throughput with low-latency: you cannot have transaction latency lower than the batching duration (epoch). So an epoch is a guaranteed overhead on latency for every transaction.

The scheduler component

The Scheduler component (which is a centralized component) examines a transaction before it begins executing and decides when it is safe to execute the whole transaction, then hands the transaction request off to the storage backend for execution with no additional oversight. The storage backend therefore does not need to have any knowledge of the concurrency control mechanism or implementation. Once a transaction begins running, the storage backend can be sure that it can process it to completion without worrying about concurrent transactions accessing the same data. However, each storage backend must provide enough information prior to transactional execution in order for the scheduler to make a well-informed decision about when each transaction can safely (from a concurrency control perspective) execute.

For transaction execution, the scheduler still uses locks. Deterministic locking ensures concurrent execution equivalent to the serial transaction order in the log, and also makes deadlock impossible (and the associated nondeterministic transaction aborts).


I don't know why Calvin doesn't adopt Tango style log maintenance: Using chain replication to improve throughput of the centralized log. This might actually help Calvin.

Similarly Calvin should also adopt selective/custom stream replication per replica feature in Tango. That would implement the flexibility/generality of Calvin.

Related links

Saturday, November 1, 2014

Paper Summary: Coordination Avoidance in Database Systems

Serializing transactions is sufficient for correctness, but it is not necessary for all operations of all applications. The downside of serialization is that it kills scalability and is overkill in many cases.

This paper (which will appear in VLDB'15) has the following insight: Given knowledge of application transactions and correctness criteria (i.e., invariants), it is possible to avoid this over-coordination of serializability and execute some transactions without coordination while still preserving those correctness criteria (invariants).

In particular the authors propose the concept of "invariant confluence" to relax the use of serialization for some operations of a coordination-requiring application. By operating on application-level invariants over database states (e.g., integrity constraints), the invariant confluence analysis provides a necessary and sufficient condition for safe, coordination-free execution. When programmers specify application invariants, this analysis allows databases to coordinate only when concurrency may violate those application invariants.

So how do they get the application invariants? "Many production databases today already support invariants in the form of primary key, uniqueness, foreign key, and row-level check constraints. We analyze this and show that many are invariant-confluent, including forms of foreign key constraints unique value generation, and check constraints, while others, like primary key constraints are, in general, not."

They claim that many common integrity constraints found in SQL and standardized benchmarks are invariant confluent, allowing order-of-magnitude performance gains over coordinated execution. To substantiate this claim, they apply invariant confluence analysis to a database prototype and show 25-fold improvement over prior TPC-C New-Order performance on a 200 server cluster. They find that 10 out of 12 of TPC-C's invariants are invariant-confluent, under the workload transaction.

The invariant-confluence model

Invariant-confluence captures a simple (informal) rule: coordination can be avoided if and only if all local commit decisions are globally valid. (In other words, the commit decisions are composable.)

They model transactions to operate over independent logical snapshots of database state. Transaction writes are applied at one or more snapshots initially when the transaction commits and are then integrated into other snapshots asynchronously via a merge operator that incorporates those changes into the snapshot's state. "Merge" is simply the set union of versions, and is used to capture the process of reconciling divergent states.

In effect, this model states that each transaction can modify its replica state without modifying any other concurrently executing transactions' replica state. Replicas therefore provide transactions with partial snapshot views of global state. They define local validity/consistency as a safety property, but global replica consistency is not defined as a safety property. Instead it is defined as a liveness property under the name "convergence".

(Formal definition of invariant-confluent:)
A set of transactions T is I-confluent with respect to invariant I if, for all I-T-reachable states Di, Dj with a common ancestor state, Di union Dj is I-valid.

Applying the invariant-confluence concept

As the definition implies, I-confluence holds for specific combinations of invariants and transactions. Removing a user from the database is I-confluent with respect to the invariant that the user IDs are unique. However, two transactions that remove two different users from the database are not I-confluent with respect to the invariant that there exists at least one user in the database at all times. As another example, uniqueness is not I-confluent for inserts of unique values. However, reads and deletions are both I-confluent under uniqueness invariants: reading and removing items cannot introduce duplicates.

Table 3 summarizes the 12 invariants found in TPC-C benchmark as well as their I-confluence analysis results as determined by Table 2. They classify the invariants into 3 broad categories: materialized view maintenance, foreign key constraint maintenance, and unique ID assignment.

Figure 5 shows the concurrency/throughput improvement made possible by applying the invariant-confluence analysis to the TPC-C workload.

Related work

This paper is an extension of the "CALM" approach that uses monotonicity and convergence concepts to relax the coordination needs of applications.

The "Scalable commutatitivity rule" paper, which was one of the best papers in SOSP'13, is a closely related work. In order to relax serializability and boost concurrency, that work prescribes exploiting the commutativity of operations. Another related work that exploits commutativity to relax serializability is the "Making Geo-Replicated Systems Fast as Possible, Consistent when Necessary" paper. The invariant confluence analysis concept is more general (but probably harder to apply) than the commutativity rule approach, because while commutativity is sufficient for correctness it is not always necessary.

In our previous work, the slow-fast paper (2010), we had also used the concept of "invariant-relaxed serializability" in distributed systems domain, particularly in application to wireless sensor/actor network concurrency control. (Maybe somewhat of a misnomer we called a slow action as one which can be executed in a concurrent/uncoordinated/nonatomic manner, and a fast action as one which needs to be executed in an atomic/coordinated/no-conflicts manner.)

I suspect our slow-fast approach used a less aggressive optimization than invariant-confluence: Slow-fast did not require/inspect program invariants explicitly, it only required access to the program actions (i.e., transactions). The slow-fast approach inferred that the invariant holds when program actions (transactions) execute atomically. (Invariant-confluence may potentially use even a weaker invariant than what slow-fast used.)

Our slow-fast analysis inspects the program actions, which are precondition guarded assignment statements, and determined for which actions the atomicity can be relaxed so that they can execute in an uncoordinated manner. Our finding was that if the precondition of a program action is "locally-stable" (i.e., this precondition predicate cannot get falsified by execution of other program actions), then it is safe to execute this program action in an uncoordinated/nonatomic manner. (This check probably implies "mergeability" of the state.) Our analysis also prescribed ways to break a coordination requiring action into two smaller actions to make it coordination free.

The "coordination avoidance in databases" paper applies the "invariant-relaxed serializability" idea in a more restricted and more useful domain, database transactions, and demonstrates the idea in a very practical way.

Monday, October 27, 2014

Clock-SI: Snapshot Isolation for Partitioned Data Stores Using Loosely Synchronized Clocks

This paper appeared in SRDS 2013, and is concerned with the snapshot isolation problem for distributed databases/data stores.

What is snapshot isolation (SI)?

(I took these definitions almost verbatim from the paper.)
SI is a multiversion concurrency control scheme with 3 properties:
1) Each transaction reads from a consistent snapshot, taken at the start of the transaction and identified by a snapshot timestamp. A snapshot is consistent if it includes all writes of transactions committed before the snapshot timestamp, and if it does not include any writes of aborted transactions or transactions committed after the snapshot timestamp.
2) Update transactions commit in a total order. Every commit produces a new database snapshot, identified by the commit timestamp.
3) An update transaction aborts if it introduces a write-write conflict with a concurrent committed transaction. Transaction T1 is concurrent with committed update transaction T2, if T1 took its snapshot before T2 committed and T1 tries to commit after T2 committed.

When a transaction starts, its snapshot timestamp is set to the current value of the database version. All its reads are satisfied from the corresponding snapshot. To support snapshots, multiple versions of each data item are kept, each tagged with a version number equal to the commit timestamp of the transaction that creates the version. The transaction reads the version with the largest version number smaller than its snapshot timestamp. If the transaction is read-only, it always commits without further checks. If the transaction has updates, its writes are buffered in a workspace. When the update transaction requests to commit, a certification check verifies that the transaction writeset does not intersect with the writesets of concurrent committed transactions. If the certification succeeds, the database version is incremented, and the transaction commit timestamp is set to this value.

What is the innovation in the Clock-SI paper?

The conventional SI implementations use a centralized timestamp authority for consistent versioning. This is because local clocks on different nodes may differ a lot (NTP synchronization may have 10s of ms of inaccuracies), and is not suitable for consistent versioning.

Clock-SI, instead, proposes a way to use loosely synchronized clocks to assign snapshot and commit timestamps to transactions. Compared to conventional SI, Clock-SI does not have a single point of failure and a potential performance bottleneck. It saves one round-trip message for a ready-only transaction (to obtain the snapshot timestamp), and two round-trip messages for an update transaction (to obtain the snapshot timestamp and the commit timestamp). A transaction's snapshot timestamp is the value of the local clock at the partition where it starts. Similarly, the commit timestamp of a local update transaction is obtained by reading the local clock.

If you read Google's Spanner paper, you know that Google Spanner solves this problem by introducing TrueTime, which uses atomic clocks.

How does Clock-SI work?

Clock-SI essentially response-delays a read in a transaction
1) to account for clock synchronization differences (epsilon) as in Fig1, and
2) to account for the pending commit of an update transaction.

In Fig1, the read arrives at time t′ on P2's clock, before P2’s clock has reached the value t, and thus t′ < t. The snapshot with timestamp t at P2 is therefore not yet available. Another transaction on P2 could commit at time t′′, between t′ and t, and change the value of x. This new value should be included in T1's snapshot.

T2's snapshot is unavailable due to the commit in progress of transaction T1, which is assigned the value of the local clock, say t, as its commit timestamp. T1 updates item x and commits. The commit operation involves a write to stable storage and completes at time t′. Transaction T2 starts between t and t′, and gets assigned a snapshot timestamp t′′, t < t′′ < t′. If T2 issues a read for item x, we cannot return the value written by T1, because we do not yet know if the commit will succeed, but we can also not return the earlier value, because, if T1's commit succeeds, this older value will not be part of a consistent snapshot at t′′.


The paper does not include a performance comparison to Spanner. The NTP synchronized clocks in the evaluation experiments have an NTP offset/epsilon less than 0.1 msec, which is actually more precise than Spanner's atomic clock! I guess this is thanks to the Gigabit Ethernet they use in their LAN deployment.

Discussion: Use of Hybrid Logical Clocks (HLC) for the Clock-SI problem

HLC is a hybrid version of logical clocks and physical clocks, introduced by us recently, to combine the advantages of both clocks, while avoiding their disadvantages. Since HLC captures happened-before relationship and uses this extra information in ordering, it does not need to wait out uncertainty regions of physical clock synchronization. Dually, since HLC is related to physical clocks it allows querying with respect to physical time. We had shown HLC's advantages for the consistent snapshot problem in our work.

Here we find that HLC indeed improves the clock-SI problem of snapshot isolation if it is used instead of physical clocks. HLC avoids the delay in Figure 1. HLC would not incur the delay because it also uses happened-before information as encoded in HLC clocks.

Saturday, October 18, 2014

Facebook's software architecture

I had summarized/discussed a couple papers (Haystack, Memcache caching) about Facebook's architecture before.

Facebook uses simple architecture that gets things done. Papers from Facebook are refreshingly simple, and I like reading these papers.

Two more Facebook papers appeared recently, and I briefly summarize them below.

TAO: Facebook's distributed data store for the social graph (ATC'13)

A single Facebook page may aggregate and filter 100s of items from the social graph. Since Facebook presents each user with customized content (which needs to be filtered with privacy checks) an efficient, highly available, and scalable graph data store is needed to serve this dynamic read-heavy workload.

Before Tao, Facebook's web servers directly accessed MySql to read or write the social graph, aggressively using memcache as a look aside cache (as it was explained in this paper).

The Tao data store implements a graph abstraction directly. This allows Tao to avoid some of the fundamental shortcomings of a look-aside cache architecture. Tao implements an objects and associations model and continues to use MySql for persistent storage, but mediates access to the database and uses its own graph-aware cache.
To handle multi-region scalability, Tao uses replication using the per-record master idea. (This multi-region scalability idea was again presented earlier in the Facebook memcache scaling paper.)

F4: Facebook's warm BLOB storage system (OSDI'14)

Facebook uses Haystack to store all media data, which we discussed earlier here.

Facebook's new architecture splits the media into two categories:
1) hot/recently-added media, which is still stored in Haystack, and
2) warm media (still not cold), which is now stored in F4 storage and not in Haystack.

This paper discusses the motivation for this split and how this works.

Facebook has big data! (This is one of those rare cases where you can say big data and mean it.) Facebook stores over 400 billion photos.

Facebook found that there is a strong correlation between the age of a BLOB (Binary Large OBject) and its temperature. Newly created BLOBs are requested at a far higher rate than older BLOBs; they are hot! For instance, the request rate for week-old BLOBs is an order of magnitude lower than for less-than-a-day old content for eight of nine examined types. Content less than one day old receives more than 100 times the request rate of one-year old content. The request rate drops by an order of magnitude in less then a week, and for most content types, the request rate drops by 100x in less than 60 days. Similarly, there is a strong correlation between age and the deletion rate: older BLOBs see an order of magnitude less deletion rate than the new BLOBs. These older content is called warm, not seeing frequent access like hot content, but they are not completely frozen either.

They also find that warm content is a large percentage of all objects. They separate the last 9 months Facebook data under 3 intervals: 9-6 mo, 6-3 mo, 3-0 months. In the oldest interval, they find that for the data generated in that interval more than 80% of objects are warm for all types. For objects created in the most recent interval more than 89% of objects are warm for all types. That is the warm content is large and it is growing increasingly.

In light of these analysis, Facebook goes with a split design for BLOB storage. They introduce F4 as a warm BLOB storage system because the request rate for its content is lower than that for content in Haystack and thus is not as hot. Warm is also in contrast with cold storage systems that reliably store data but may take days or hours to retrieve it, which is unacceptably long for user-facing requests. The lower request rate of warm BLOBs enables them to provision a lower maximum throughput for F4 than Haystack, and the low delete rate for warm BLOBs enables them to simplify F4 by not needing to physically reclaim space quickly after deletes.

F4 provides a simple, efficient, and fault tolerant warm storage solution that reduces the effective-replication-factor from 3.6 to 2.8 and then to 2.1. F4 uses erasure coding with parity blocks and striping. Instead of maintaining 2 other replicas, it uses erasure coding to reduce this significantly.

The data and index files are the same as Haystack, the journal file is new. The journal file is a write-ahead journal with tombstones appended for tracking BLOBs that have been deleted. F4 keeps dedicated spare backoff nodes to help with BLOB online reconstruction. This is similar to the use of dedicated gutter nodes for tolerating memcached node failures in the Facebook memcache paper.
F4 has been running in production at Facebook for over 19 months. F4 currently stores over 65PB of logical data and saves over 53PB of storage.


1) Why go with a design that has a big binary divide between hot and warm storage? Would it be possible to use a system that handles hot and warm as gradual degrees in the spectrum? I guess the reason for this design is its simplicity. Maybe it is possible to optimize things by treating BLOBs differentially, but this design is simple and gets things done.

2) What are the major differences in F4 from the Haystack architecture? F4 uses erasure coding for replication: Instead of maintaining 2 other replicas, erasure coding reduces replication overhead significantly.  F4 uses write-ahead logging and is aggressively optimized for read-only workload. F4 has less throughput needs. (How is this reflected in its architecture?)

Caching is an orthogonal issue handled at another layer using memcache nodes. I wonder if the caching policies treat content cached from Haystack versus F4 differently.

3) Why is energy-efficiency of F4 not described at all? Can we use grouping tricks to get cold machines/clusters in F4 and improve energy-efficiency further as we discussed here?

4) BLOBs have large variation in size. Can this be utilized in F4 to improve access efficiency? (Maybe treat/store very small BLOBs differently, store them together, don't use erasure coding for them. How about very large BLOBs?)