Wednesday, March 25, 2015

Paper review: "Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems"

This paper appeared in OSDI'14 and is written by Ding Yuan, Yu Luo, Xin Zhuang, Guilherme Renna Rodrigues, Xu Zhao, Yongle Zhang, Pranay U. Jain, and Michael Stumm, University of Toronto.

While the title mentions "analysis of production failures", the paper is not concerned/interested in root cause analysis of failures, instead, the paper is preoccupied with problems in error handling that lead to failures.

I enjoyed this paper a lot. The paper has interesting counterintuitive results. It opens to discussion the error handling issue, and particularly the Java exception handling. Digesting and interpreting the findings in the paper will require time. To contribute to this process, here is my take on the findings ---after a quick summary of the paper.

The setup and terminology 

The paper studied 198 randomly sampled user-reported failures of 5 opensource data-intensive distributed systems: Cassandra, HBase, Hadoop Distributed File System (HDFS), Hadoop MapReduce, and Redis. Table 1 shows how the sampling is done. (As a superficial observation, it seems like HBase is very buggy where 50% of sampled failures are catastrophic, and Cassandra is well engineered.) Almost all of these systems are written in Java, so Java's exception-style error handling plays a significant role in the findings of the paper.

Definitions: A fault is the initial root cause, which could be a hardware malfunction, a software bug, or a misconfiguration. A fault can produce abnormal behaviors referred to as errors, such as Java exceptions. Some of the errors will have no user-visible side-effects or may be appropriately handled by software; other errors manifest into a failure, where the system malfunction is noticed by end users or operators.

The paper defines catastrophic failure as "failures where most or all users experience an outage or data loss". Unfortunately this is a very vague definition. Table 2 provides some example categories for catastrophic failures considered.


Table 3 shows that single event input failures are relatively low, this is probably because these systems are well-tested with unit tests as they are heavily used in production. On the other hand, the predominating case is where 2 events conspire to trigger the failures.

From Table 4, what jumps up is that starting up services is particularly problematic. Incidentally, in airplanes most fatal accidents occur during the climbing stage.

Table 5 shows that almost all (98%) of the failures are guaranteed to manifest on no more than 3 nodes, and 84% will manifest on no more than 2 nodes. For large-scale distributed systems, 3 nodes being sufficient to manifest almost all failures seems surprisingly low. Of course, this paper looks at data-intensive distributed systems, which may not be representative of general distributed systems. In any case, these numbers don't surprise me as they agree with my experience using TLA+ in verifying distributed algorithms.

Deterministic failures totaling at 74% of all failures is good news. Deterministic failures are low-hanging fruit, they are easier to fix.

Catastrophic failures

Figure 5 shows a break-down of all catastrophic failures by their error handling. Based on this figure, the paper claims that "almost all (92%) of the catastrophic system failures are the result of incorrect handling of non-fatal errors explicitly signaled in software".

But, it seems to me that this provocative statement is due to a broad/vague definition of "incorrect error handling". If you use a broad/vague definition of "incorrect landing", you can say that every catastrophic airplane failure is an incorrect landing problem. Java casts everything into an error exception, then every fault materializes/surfaces as an exception. But, does that mean if we do a good job on exception handling, there will be almost no catastrophic failures? That is an incorrect assumption. Sometimes the correction required needs to be invasive (such as resetting nodes) and the correction also counts as a catastrophic failure.

And how can we do a good job on error handling? The paper does not provide help on that. Correct error-handling is very hard: You need a lot of context and a holistic understanding of the system, and that is probably why error-handling has been done sloppily in the systems studied.

The paper also claims: "in 58% of the catastrophic failures, the underlying faults could easily have been detected through simple testing of error handling code." I can agree with the 35%, as they consist of trivial mistakes in exception handling, such as an empty error handling block. But for including the other 23% labeling them as easily detectable, I think we should exercise caution and not rule out the hindsight bias.

To prevent this 23% failures, the paper suggests 100% statement coverage testing on the error handling logic. To this end, the paper suggests that reverse engineering test cases that trigger them. At the end of the day, this boils to thinking hard to see how this can be triggered. But how do we know when to quit and when we got enough? Without a rule, this can get cumbersome.

Figure10 shows an example of the 23% easily detectable failures. That example still looks tricky to me, even after we exercise the hindsight advantage.

Speculation about the causes for poor error-handling

Maybe one reason for poor error-handling in the studied systems is that features are sexy, but fault-tolerance is not. You see features, but you don't see fault-tolerance. Particularly, if you do fault-tolerance right, nobody notices it.

But, that is a simplistic answer. After all, these systems are commonly used in production, and they are well-tested with the FindBugs tool, unit tests, and fault injection. Then, why do they still suck so badly at their exception-handling code? I speculate that maybe there is a scarcity of "expert generalists", developers that understand the project as a whole and that can write error-handling/fault-tolerance code.

Another reason of course could be that developers may think a particular exception will never arise. (Some comments in the exception handlers in the 5 systems studied hint at that.) That the developers cannot anticipate how a particular exception can be raised doesn't mean it won't be raised. But, maybe we should include this case within the above case that says there is a scarcity of fault-tolerance experts/generalists in these projects.

"Correct" error handling

While the paper pushes for 100% error handling coverage,  it doesn't offer techniques for "correct" error handling. Correct error handling requires root cause analysis, a holistic view of the system, and fault-tolerance expertise.

So there is no silver bullet. Exceptions in Java, and error handling may facilitate some things, but there is no shortcut to fault-tolerance. Fault-tolerance will always have a price.

The price of reliability is the pursuit of the utmost simplicity.
It is a price which the very rich find most hard to pay.
C.A.R. Hoare

Possible future directions

Since 2-input triggered corruptions are predominant, maybe we should consider pairwise testing in addition to unit testing to guard against them. Unit testing is good for enforcing a component's local invariants. Pairwise testing can enforce an interaction invariant between two components, and guard against 2-input triggered corruptions.

I think the main reason it is hard to write "correct error handling" code is that the exception handler doesn't have enough information to know the proper way to correct the state. This is still futuristic, but if we had an eidetic system, then when exceptions hit, we could call the corrector, which can do a root-cause analysis by doing a backwards query on the eidetic system, and after determining the problem could do a forward query to figure out what are precisely the things that need to be fixed/corrected as a result of that fault.

Thursday, March 12, 2015

My crazy proposal for achieving lightweight distributed consensus

Distributed consensus is a hard problem. See some of my previous posts for discussion about impossibility results on distributed consensus.
Paxos taught
Perspectives on the CAP theorem
Attacking Generals problem

The reason distributed consensus is hard is because the parties involved don't have access to the same knowledge (the same point of view) of the system state.  Halpern's work on knowledge and common knowledge in distributed systems provides a useful framework to explain this.  Here is a summary of common knowledge paper by Halpern, and here is a nice talk on that paper.

Of course, we have distributed consensus solutions, Paxos, ZooKeeper/ZAB, that ensure safety always and provide progress whenever the system conditions step outside the impossibility results territory (with respect to synchrony, channel reliability/reachability, and number of up nodes). But these solutions come with a performance cost, as they need serialization of all update requests from a single Paxos leader, and acknowledgement of these updates by a majority quorum of the Paxos replicas. In ZooKeeper non-leader replicas can also serve reads locally, but the updates still need to get serialized by the leader. Some Paxos variants such as Mencius, ePaxos try to relieve this problem, but the bottom-line is they are still subject to the same fundamental performance limitations due to serialization at a single leader.

I have this crazy idea for circumventing the impossibility results of distributed consensus as well as improving the performance of distributed consensus. The idea is to connect the nodes involved in consensus (let's call these nodes coordinators) with a single-collision-domain Ethernet bus in order to solve the asymmetric knowledge problem.

No, this setup does not assume reliable communication. There can be collisions in the Ethernet bus. But the collisions will be total collisions since this is a shared Ethernet bus. When there is a collision, none of the coordinators would deliver the message, including the sender. Since Ethernet on a shared bus is CSMA-CD, the transmitter also detects the collision and does not accept its own message into the consensus log if that is the case.

So, in effect, the shared Ethernet bus performs the serialization of the coordinators' proposals. As a result, a coordinator proposing an operation does not need to collect explicit acknowledgement from any other coordinator let alone from a majority quorum of coordinators. This makes this consensus algorithm very lightweight and fast.

This system is masking fault tolerant to coordinator crashes until at least one coordinator left remaining. (If we are to allow reintegrating coordinators recovering back from crash, things get complicated of course. Then we would need to assume reasonable churn to allow time for recovering coordinators to catch up before they can be integrated. This would also require fast one broadcast consensus on the reconfigurations.)

That is it. That simple. Now comes the algorithmician's apology.

On the theory side, I know I am not suggesting anything novel. The impossibility results withstand; I just changed the system conditions and stepped outside the territory of the impossibility results (that is why I used the term "circumvent"). In fact, I had noticed this idea first in the context of wireless broadcast and wireless sensor networks, when I was a post-doc at Nancy Lynch's group at MIT. We had published papers exploring the concept for wireless broadcast with total and partial collisions.

On the practical side, I know this proposal has downsides. This is not readily applicable as it requires the Ethernet driver to expose collision information to the application layer. This requires setting up an auxiliary Ethernet LAN across the coordinators. And, yes, I know this doesn't scale outside a LAN. (The coordinators should be connected by the single domain Ethernet bus, but the clients to the coordinators may communicate to the coordinators over TCP/IP and need not be in the same LAN. The coordinators can be located across different racks to guard against rack-wide crashes.)

But every system design is an exercise in determining which tradeoffs you make. The important question is: Is this tradeoff worth exploring? Would the performance improvement and simplicity stemming from this setup makes this a reasonable/feasible option for solving the distributed consensus problem at the datacenter level?

Tuesday, March 10, 2015

Eidetic systems

This paper appeared in OSDI'14. The authors are all from University of Michigan: David Devecsery, Michael Chow, Xianzheng Dou, Jason Flinn, and Peter M. Chen.

This paper presents a transformative systems work, in that it introduces a practical eidetic system implementation on a Linux computer/workstation. This paper is a tour de force: It undertakes a huge implementation effort to implement a very useful and novel eidetic memory/system service. The authors should be commended for their audaciousness.

An eidetic computer system can recall any past state that existed on that computer, including all versions of all files, the memory and register state of processes, interprocess communication, and network input. An eidetic computer system can explain the lineage of each byte of current and past state. (This is related to the concept of data provenance, which I mention briefly at the end of my review.)


One use case for an eidetic system is to track where/how erroneous information entered to the system. The paper considers tracking down a faulty bibtex reference as a case study. This is done using a backwards query. After tracking down the faulty bibtex reference you can then perform a forward query on the eidetic system, in order to figure out which documents are contaminated with this faulty information and to fix them.

Another use case for an eidetic system is to do postmortem of a hack attack and whether it leaked any important information. In the evaluation section, the paper uses as another case study the heartbleed attack, which occurred during time the authors were testing/evaluating their eidetic system implementation.

With a good GUI for querying, the eidetic system concept can enhance the Mac OSX Time Machine significantly, with data lineage/provenance, backward querying, and forward querying/correction. This can augment time travel with analytics, and you can have a time machine on steroids. (Accomplishment unlocked: +100 points for serious use of time machine and time travel in writing.)

Design and implementation

The authors develop the eidetic system, Arnold, by modifying Linux kernel to record all nondeterministic data that enters a process: the order, return values, and memory addresses modified by a system call, the timing and values of received signals, and the results of querying the system time. Arnold and accompanying eidetic system tools (for replay, etc.) are available as opensource.
The key technologies that enable Arnold to provide the properties of an eidetic system efficiently are deterministic record and replay, model-based compression, deduplicated file recording, operating system tracking of information flow between processes, and retrospective binary analysis of information flow within processes.
Arnold uses deterministic record and replay, and trades storage for recomputation whenever possible. That is, Arnold only saves nondeterministic choices or new input and can reproduce everything else by recomputation. The major space saving technique Arnold uses is model based compression: Arnold constructs a model for predictable operations and records only instances in which the returned data differs from the model. Another optimization is copy on RAW (read-after-write) recording: "To deduplicate the read file data, Arnold saves a version of a file only on the first read after the file is written. Subsequent reads log only a reference to the saved version, along with the read offset and return code." These techniques enable Arnold to fit 4 years of desktop/workstation eidetic system into 4TB of off-the-shelf hard disk (which costs $150).

Querying and Replaying

Arnold uses the replay groups abstraction to perform storing and replaying efficiently. Replay groups consist of frequently communicating processes which can be replayed independently of any other group. Arnold employs "Pin" binary instrumentation to analyze replayed executions and track the lineage of data within a replay group. Inter process communication is tracked with the help of a dependency graph which keeps track of the communications between different replay groups. Bundling frequently communicating processes into a group ensures that a large number of conversations need not be recorded to the dependency graph. As such selection of replay group (and replay group size) gives rise to a tradeoff between storage efficiency and query efficiency. It would be nice if the paper provided the replay groups it used in Arnold as a table. This information would be useful to understand the replay groups concept better.

Arnold records even user propagated lineage, such as a user reading a webpage and entering text into an editor as a result. (Of course this leads to introducing some false positives, as it needs to be done speculatively.) Tracking this actually required a lot of work: "Understanding GUI output turned out to be tricky, however, because most programs we looked at did not send text to the X server, but instead sent binary glyphs generated by translating the output characters into a particular font. Arnold identifies these glyphs as they are passed to standard X and graphical library functions. It traces the lineage backward from these glyphs using one of the above linkages (e.g., the index linkage)."

Finally, for the querying of Arnold, the paper has this to say. "A backward query proceeds in a tree-like search, fanning out from one or more target states. The search continues until it is stopped by the user or all state has been traced back to external system inputs. As the search fans out, Arnold replays multiple replay groups in parallel. In addition, if no lineage is specified, it may test multiple linkages for the same group in parallel, terminating less restrictive searches if a more restrictive search finds a linkage."

Unfortunately, user-friendly GUI-based tools for querying is not available yet. That would be asking too much from this paper which already packed a lot of contributions into a single publication. The evaluation section gives some results about backward and forward querying performance in Arnold.

Related work on data provenance

Data provenance is a topic which has been studied as part of the database field traditionally. However, recent work on data provenance started considering the problem of capturing provenance for applications performing arbitrary computations (not resricted to a small set of valid transformations in database systems). The paper "A primer on provenance" provides a nice accessible survey of data provenance work.

Future work

This paper presents an eidetic system on a single computer. An obvious future direction is to enable building an eidetic distributed system. By leveraging Arnold, such a system also seems to be in reach now. Our work on hybrid logical clocks can also help here by relating and efficiently tracking causality across distributed nodes running Arnold. Since our hybrid logical clocks can work with loosely synchronized time (a la NTP), and is resilient to uncertainty (it enables efficient tracking of causality without blocking for synchronization uncertainties), it can be adopted for implementing a distributed eidetic system in practice.

A remaining kink for a distributed eidetic system could be the cost of querying. Querying and replay is already slow and hard for a single eidetic system, and it is likely to become more complicated for a distributed system since coordination of replay is needed across the machines involved in the replay.

Thursday, March 5, 2015

Extracting more concurrency from Distributed Transactions

This paper appeared in OSDI'14. The authors are: Shuai Mu, Yang Cui, Yang Zhang, Wyatt Lloyd, Jinyang Li. The paper, presentation slides and video are accessible here.

The paper proposes a concurrency control protocol for distributed transactions, and evaluates its performance comparing with two-phase locking (2PL) and optimistic concurrency control (OCC).

The protocol

The protocol introduced, ROCOCO (ReOrdering COnflicts for COncurrency) is targeted for extracting more concurrency under heavily contended workload than 2PL and OCC can handle. In fact, ROCOCO's improvements over OCC and 2PL comes after the peak throughput point of even ROCOCO. One of the questions after the OSDI presentation was about this. "That region where the system is thrashing is not a region you want to be. Why would you not employ admission control to refuse extra workload that pushes the system past peak performance?"
ROCOCO assumes that a distributed transaction consists of a set of stored procedures called pieces. Each piece accesses one or more data items stored on a single server using user-defined logic. Thus, each piece can be executed atomically with respect to other concurrent pieces by employing proper local concurrency control.

ROCOCO achieves safe interleavings without aborting or blocking transactions using two key techniques: 1) deferred and reordered execution using dependency tracking; and 2) offline safety checking based on the theory of transaction chopping.

ROCOCO's transaction reordering idea is adopted from the ePaxos protocol introduced a couple years ago. This is a neat idea. The first phase is sort of like a dry run for the transaction. Dependencies with other concurrently executing transactions are learned. In the second phase, the dependent transactions are synchronized. They are forced to wait for each other and executed that way.

This approach is basically pipelining the transactions. This is also similar to what the Calvin does with its log-based approach. Pipelining helps for throughput of course, but it also introduces a drawback.

Unlike 2PL and OCC which executes a depended-upon transaction to completion before allowing its dependent/conflicting transactions to execute, ROCOCO is deciding on an order and pipelining the execution of these conflicting transactions in some determined order. However, if the first transaction of these pipeline-executed transactions does not complete for some reason or due to a fault and needs to be rolled back, this also requires rolling-back the remaining transactions in the pipeline that depended on this transaction. This is a problem 2PL and OCC did not have.

This basic reordering protocol is when some of the transaction pieces/fragments are deferrable. For transaction fragments that are immediate (whose outputs are inputs to other pieces in a transaction), the reordering protocol is inapplicable, and the paper uses an offline checker to avoid conflicts in such situations.  The Offline checker works in following steps:
1. It constructs an S-C graph based on transaction chopping. Each edge in the graph is either a Sibling edge (an edge formed for pieces of same transaction instance) or a Conflict edge (an edge formed by pieces which access the same database table and any one of the piece issues a write).
2. Each vertex in the graph is either tagged as immediate (I) or deferrable (D). A conflict edge  can be an I-I edge or a D-D edge.
3. The checker observes all the S-C cycles formed by the graph. SC-cycles represent potential non-serializable interleavings. However, if an SC-cycle contains at least one D-D edge, ROCOCO can reorder the execution of the D-D edge's pieces to break the cycle and ensure serializability. For an unreorderable SC-cycle with all I-I C-edges, the checker proposes to merge those pieces in the cycle belonging to the same transaction into a larger atomic piece. ROCOCO relies on traditional distributed concurrency control methods such as 2PL or OCC to execute merged pieces.


ROCOCO is  implemented as an in-memory key-value store with 20K C++ code  and evaluated using a scaled TPC-C benchmark in comparison to OCC and 2PL. Given that Calvin  is a closely related work (because it also orders transactions and pipelines their execution), it would be good to see a comparison to Calvin, but the paper does not include that.

Some miscellaneous thoughts

In the introduction, the following paragraph is provided as a motivation for ROCOCO.
Unfortunately, contention is not rare in large-scale OLTP applications. For example, consider a transaction where a customer purchases a few items from a shopping website. Concurrent purchases by different customers on the same item create conflicts. Moreover, as the system scales—i.e., the site becomes more popular and has more customers, but maintains a relatively stable set of items—concurrent purchases to the same item are more likely to happen, leading to a greater contention rate.
The OSDI presentation also includes the same example as motivation. I don't think this is the right/best way to argue that contentions will happen, because this is a faux contention, and not an inherent contention. When the number of a sale item is very high (which is almost always the case), why do we care to carefully check the number of items remaining? Conflict-Free Replicated Data Types (CRDT) approach helps here to avoid conflicts easily. (For some nice papers on this see: CRDT1, CRDT2, CRDT3) The coordination avoidance in distributed databases paper also argues for  avoiding coordination when all local commit decisions are globally valid (in other words, when the commit decisions are composable).

Sunday, March 1, 2015

My trick for coordinating Dropbox collaborations

There are high-tech solutions to managing the collaborative paper writing process, such as using version control systems like CVS or git. The downside to these systems is that when you are collaborating on a paper with a low-tech author, it is cumbersome to get them on board with these tools. Some of my collaborators have been from a non-CS background. And even some CS-background collaborators refuse to deal with cryptic version control commands and error messages.

Dropbox provides a simple easy-to-use solution to sharing files which helps  teams to collaborate on projects, including collaborating on a paper. However, Dropbox does not have access control and cannot coordinate concurrent accesses to the same file by multiple writers. This causes overwritten/lost updates and frustrated collaborators.

I have a low-tech solution to this problem. I have used this solution successfully with multiple collaborators over the last 2-3 years. Chances are that, you have also come up with the same idea. If not, feel free to adopt this solution for your collaborations. It helps.

Here is the solution. When I share a folder to collaborate with coauthors, I create a token.txt file in that folder. This file is there to coordinate who edits which file at any given time.

Below is a sample token.txt file contents. The first part explains the guidelines for coordination, the second part lists the files to collaborate on. We generally use one file to correspond to one section of the paper.

Check with token.txt file to see who is editing which file in order to avoid conflicts.
If you want to write:
+ pick a section/file,
+ reopen token.txt to avoid seeing old-state,
+ write your name next to that section/file at token.txt to get the lock/token,
+ quickly hit save on token.txt, and
+ start editing that section/file on your own pace.

When done writing:
+ delete your name for that section/file at token.txt. (I.e., Unlock that file.)

Don't keep unlocked files open in your editor, lest you inadvertently save them over a newer version your collaborator wrote. An editor that auto-saves may also cause that problem.
eval: ...........Ailidani

ALL FILES: This file is always with Murat! Treat this as read-only, don't edit.

The token.txt file acts as a message board to show who is working on which file. Of course there can be concurrent updates to token.txt, but this is less likely because you make a tiny edit and then quickly save this file. Moreover, many editors, including Emacs, notifies you immediately if the token.txt file has been changed during the time you keep it open.

(This is in essence similar to the use of RTS-CTS messages in wireless communications. Instead of having a more costly collision for a long-size data transfer, it is an acceptable tradeoff to have occasional collided/lost RTS messages, because those are very short transfers and are low-cost.)

Yes, this is a lazy low-tech solution and requires some discipline from collaborators, but the effort needed is minimal, and I haven't got any bad feedback from any collaborator on this. During the time I used this method, we never had a problem with concurrent editing on the same file. (I used this with teams of size 2-to-6, and only for collaborating on papers.) One problem, however, we occasionally had was with someone forgetting to remove a lock after he is done working on that file. In this case, we resort to email: "Are you still holding the lock on this file? When do you plan to release it?"

Notice the "ALL FILES" entry at the end. That is when one author may want to lock all the files for a short duration easily. This happens infrequently, but comes handy for restructuring the paper, making fast substitution edits across all files, or spell-checking the entire paper before submission.

You may also notice a entry at the bottom. That is an org-mode file where I keep my TODO items and notes about the paper. It is my lab-notebook for this writing project. I can't function in any project without my file.

PS: Geoffrey Challen suggests: "You could also replace Murat's token.txt file with a Google Drive document." 

Friday, February 27, 2015

Salt: Combining ACID and BASE in a Distributed Database

This paper appeared in OSDI'14. The authors are Chao Xie, Chunzhi Su, Manos Kapritsos, Yang Wang, Navid Yaghmazadeh, Lorenzo Alvisi, and Prince Mahajan, all from The University of Texas at Austin. Here you can watch a video of the OSDI presentation of the paper, and also find the presentation slides and the paper available as open access.  USENIX knows how to do conferences right.

Dropping ACID

ACID (Atomicity, Consistency, Isolation, Durability) approach provides ease-of-programming through its simple transaction abstraction, but loses on the performance. BASE (Basically-Available, Soft state, Eventually consistent) approach, popularized by the NoSQL systems, provides good performance (low-latency, high-throughput, and scalability), but loses on the ease-of-programming due to increased complexity of concurrent execution in distributed systems.

This paper introduces Salt, which aims to find a best-of-both-worlds middle ground: to provide high performance like BASE with modest programming effort like ACID.

Less is more

Salt approach is motivated by the following observation (which is an instance of the Pareto principle). I am quoting from the paper:
"When an application outgrows the performance of an ACID implementation, it is often because of the needs of only a handful of transactions: most transactions never test the limits of what ACID can offer. Numerous applications [2, 4, 5, 10, 11] demonstrate this familiar lopsided pattern: few transactions are performance-critical, while many others are either lightweight or infrequent." 
Of course, a better quantification than saying "numerous" or "many" applications would make the argument for the Salt approach stronger. However, such an extensive study may take several months, and is unfair to ask for a conference paper submission.

It is tempting to increase the concurrency of those critical ACID transactions by splitting them into smaller mini ACID transactions. However, doing so may not be safe as you would lose Atomicity (which frees you from worrying about intermediate states during failures) and Isolation (which regulates which states can be accessed when transactions execute concurrently). The paper calls this a stark choice and cites the YouTube clip from Butch Cassidy to further drive this point home (I am not kidding).

The paper's proposal is this. Instead of worrying about all possible Atomicity & Isolation complications with all other ACID transactions, Salt categorizes transactions into ACID transactions and a small number of BASE transactions, and lets you worry only about checking Atomicity & Isolation complications between the BASE transactions.

Of course there is an unmentioned complication here. You also have to ensure that these BASE transactions leave the database in consistent state. This is not an easy task as well, and opens up Salt's ease-of-programming to debate. On the other hand, you would have to worry about this for all operations if you implement a full-BASE solution.

The Salt approach

Actually Salt's BASE transactions approach gets its inspiration from the ACID nested transactions. The paper says:
"Nested transactions is an abstraction originally introduced to offer, for long-running transactions, atomicity at a finer granularity than isolation. Our purpose in introducing BASE transactions is similar in spirit to that of traditional nested transactions: both abstractions aim at gently loosening the coupling between atomicity and isolation. The issue that BASE transactions address, however, is the flip side of the one tackled by nested transactions: this time, the challenge is to provide isolation at a finer granularity, without either drastically escalating the complexity of reasoning about the application, or shattering atomicity."

By providing isolation at a finer granularity, Salt allows BASE transactions to achieve high concurrency by observing each other's internal states, without affecting the isolation guarantees of ACID transactions. A BASE transaction is made up of multiple mini-transactions, called alkaline subtransactions. (Yes, the paper makes several Chemistry puns.) The committed state of an alkaline subtransaction is observable by other BASE or alkaline subtransactions. On the other hand, the committed state of an alkaline subtransaction is not observable by other ACID transactions until the parent BASE transaction commits.  As such, Salt guarantees that, when BASE and ACID transactions execute concurrently, ACID transactions retain, with respect to all other transactions (whether BASE, alkaline, or ACID), the same isolation guarantees they used to enjoy in a purely ACID environment.

The proverbial Bank example

When you talk about ACID versus BASE the classical example is the bank money transfer example. The paper gives a pure ACID, a pure BASE, and a Salt implementation of this example.

I wonder, which implementation is closer to the way the money transfers are actually handled in real banks. If someone familiar with how banks handle the money transfer can weigh in on this, that may shed a light on this debate.  My money is on the pure BASE approach.

Something about this bank example is odd though. The total-balance operation is an unrealistic big/monolithic operation. Why did that have to be a transaction? The total-balance operation could be done with a consistent/timed snapshot, no? Then why did it have to be a transactions spanning/freezing the entire system?

Now let's check the three implementations more closely. The implementation in Figure 1.a (ACID) uses only one type of locks, ACID locks. Figure 1.b (BASE) uses only one type of locks, local locks. BASE programming looks like classical distributed system programming, it tries to do everything with local asynchronous actions as much as possible. Figure 2 (Salt) uses Alkaline, Saline, and ACID locks.

Why does Salt need 3 different locks? Because the first alkaline transaction is special, as it determines whether the entire BASE transaction it is included in will succeed or not. If that first alkaline subtransaction succeeds, then comes the saline-lock to make the intermediate states in the rest of the BASE transaction available to other BASE and alkaline transactions. But then, this is something I am still not clear on. Salt attributes significance to the first alkaline subtransaction, so its design should also be special. But I am unclear about what should go in that subtransaction? Are there guidelines to design that first subtransaction so it captures the successful completion of the rest of the BASE transaction?


Like any OSDI/SOSP paper worth its salt (see, I can also make "Salt" puns :-), the paper includes a strong evaluation section. The authors implemented a Salt prototype by modifying MySQL Cluster. They evaluated the prototype using the TPC-C benchmark, which consists of 5 types of transactions: new-order (43.5%), payment (43.5%), stock-level (4.35%), order-status (4.35%), and delivery (4.35%). The results show that by BASE-ifying 2 of these transactions, Salt achieves 80% of maximum throughput of a full BASE implementation.

Notice the left side of Figure 7, which shows that ACID version faster than the Salt version under low load. The paper doesn't discuss this but, if we had both ACID and BASE version of transactions, it may be possible to hot-swap between these two versions to optimize the throughput of the system depending on the current system load.

Related work

The paper fails to cite several papers that proposed orthogonal approaches to deal with the same problem: improving the performance of ACID transactions, while keeping the ease-of-programmability.

The Red-Blue Actions paper (OSDI'12) considered the problem of reducing the granularity of ACID transactions by using a generator operation and shadow operation and categorizing the operations  as red (consistency critical) and blue (fast and eventually consistent). That paper also uses a Bank example.

Then there are work from the Berkeley group on this problem. I had summarized the Invariant-based coordination avoidance paper earlier. There is also the Highly Available Transactions paper  (VLDB'13) where they  look at ACID isolation level relaxing for improving availability.


The paper proposes the Salt approach assuming the ACID implementation of the application is already present. The application for this is if you already have an ACID system, you can Salt it to fine-tune the performance. Another approach would be to start with a BASE implementation and to increase atomicity/isolation for a couple operations. Maybe for some applications, it will be easier for going from BASE to Salt: Starting with BASE and then adding Alkaline subtransactions and ACID transactions (if necessary).

The Salt approach do require a learning curve and can be tricky. But how do we argue the complexity of the programming effort needed for one approach (say Salt) versus the other (say pure BASE) in an objective manner? One can objectively show performance improvements with evaluations, but it is harder to argue for "ease of programming" in an objective/quantitative manner.

There can also be some complications with the early-committing (after the first alkaline subtransaction) in BASE transactions. This question came up in the OSDI presentation. If deadlock occurs the MySQL Cluster approach is to use timouts and roll the transactions back. In this case, Salt throws exceptions which the developer need to address to re-achive consistency and atomicity.

Tuesday, February 17, 2015

Book report: "Good Prose" and "How to write a lot"

Notes  from the "Good prose"

This book is written by a journalist turned nonfiction author and his editor.
This is a nice book, and a good read.

"Quiet beginnings: You cannot make the reader love/trust you in the first sentence, but you can lose the reader with a grand proposition in the first sentence."

"To write is to talk to strangers. Prepare the reader, tell everything reader needs to know to read on, but no more."

"For a story to live, it is essential only that there be something important at stake, a problem that confronts the characters or reader. The unfolding of the problem and its resolutions are the real payoff."

"Revelation transforms events into a story."

"Don't mess with chronology unless you have a good reason."

"On the topic of essays:  Essayists tend to argue with themselves. Who am I to write this? Who cares to read this? If I knew my own mind, I would not make essays, I would make decisions. --Montaigne."

A main take-away from the book is the respect these writers show for their craft. They spend countless hours drafting/revising/editing with patience. They spend days discussing about improvements on a draft. The important thing is to get it right, do the right thing. At some point, over a dispute on one word with the editor in chief of Atlantic, they consider quitting their dream jobs there on principle.

Notes from "How to write a lot"

This book aims to help academicians to write a lot. It is a short book and it has a single, simple message: "schedule time for writing 2 hours daily in your calendar, and sit down to write at those times".

"Writing is a skill, not an innate gift or special talent. Like any advanced skill, writing must be developed through systematic instruction and practice. Learn the rules and do deliberate practice."

"Schedule time for writing and stick to it."

"Any action that is instrumental in completing a writing project counts as writing."

"No internet. The best kind of self control is to avoid situations that require self control."

Some recommended reading by the book are: "Writers book of hope" and "Professors as writers". I hope to check these later.

Tuesday, February 10, 2015

Paper summary: Perspectives on the CAP theorem

This is a 2012 short paper by Seth Gilbert and Nancy Lynch that appeared in a special issue commemorating the 12th anniversary of the CAP theorem. Gilbert and Lynch get to write in this special issue because they were the ones to first publish a proof the CAP conjecture put forward by Eric Brewer in PODC 2000 keynote.

In this paper, Gilbert and Lynch aim to situate the CAP theorem in the broader context of a family of results in distributed computing theory that shows impossibility of guaranteeing both safety and liveness in an unreliable distributed system. A quick refresher on specifications is in order. Safety says "nothing bad happens". (You can satisfy safety by doing nothing.) Liveness says "eventually something good happens". Unreliability (aka failure model) says "such and such faults happen".

The impossibility results surveyed in relation to CAP concern slightly different problems and slightly different fault models. While it is easy to confuse CAP with those results on a superficial look, on a closer inspection we see that the results are all distinct and none subsume CAP.

The CAP problem 

The CAP theorem does NOT consider the consensus problem, but considers an easier problem: the atomic read/write register (aka atomic storage) problem. Atomic means that the system provides linearizability, a strong type of single-copy consistency that guarantees that a read returns the most recent version of data. The specifications of this problem are as follows. (The first is the safety property, the second one liveness.)
Consistency: The system provides its clients with a single register (emulated by
multiple storage nodes), and each client can read or write from that register.
Availability: Each request for read or write eventually receives a response.

The FLP (Fisher-Lynch-Patterson) and the attacking generals impossibility results consider the consensus problem. The specifications for consensus are as follows. (The first two are safety properties, the last one a liveness property.)
Agreement: No two process can commit different decisions.
Validity (Non-triviality): If all initial values are same, nodes must commit
that value.
Termination: Nodes commit eventually.

So here is the difference between consensus and atomic storage. Consensus is supposed to dutifully remember a value that is anchored (stored by a majority number of nodes). Consensus is loyal to making that value persist as the committed decision. Atomic storage doesn't have that responsibility. The nodes don't need to commit to a decision value, so the system doesn't need to keep track of and remember whether a value is anchored. The atomic storage system as whole accepts new writes as long as the reads don't return results that betray the single register (i.e., single-copy) abstraction.

And what is the implication of this difference? FLP result declares that even under reliable channels assumption, consensus is impossible to solve in an asynchronous system with node crash failures. For example, Paxos loses liveness because it can not converge to a single leader in an asynchronous model. Did the current leader crash? The failure detector cannot be accurate. If the failure detector incorrectly says that the leader (who is supposed to ensure and remember that a value is anchored) is not crashed, liveness is violated since nodes keep waiting on a failed leader. If failure detector incorrectly says that the leader is crashed, then you have multiple leaders, and liveness is violated because of multiple leaders dueling with forever escalating ballot numbers to get the majority to accept their proposal.

On the other hand, since the atomic storage problem doesn't care about remembering whether a value is anchored, it is oblivious to the dueling leaders clients, and as such it is solvable for crashes of up to half of the nodes with the FLP model (i.e., with reliable channels in an asynchronous system). I had blogged about the Attiya, Bar-Noy, Dolev (ABD) algorithm that achieves this feat.

Now that we know atomic storage problem is solvable with reliable channels with up to minority crashes, what can we say about the atomic storage in the presence of unreliable channels? That is covered by the CAP theorem's fault model, which we discuss next.

The CAP fault model 

We discussed the specifications of the problems considered by CAP, FLP, and attacking generals, but we omitted to talk about another important part of the system specification, the unreliability/fault model.

Above I had introduced the FLP fault model when discussing solvability of consensus versus atomic storage in the FLP model. FLP fault model assumes reliable channels, asynchronous system, crash failure. Of course, by assuming reliable channels, you don't get reliable channels in your deployment. That is just wishful thinking. But since the attacking generals impossibility result proved that consensus is not achivable in the presence of unreliability channels, FLP had to consider reliable channels. Even then, we have disappointment; consensus is also impossible in the FLP model.

CAP does something courageous and considers unreliable channels again (as in the attacking generals fault model) in its fault model. Since CAP is concerned with the atomic storage problem, which is a slightly easier problem than consensus, the attacking generals impossibility result does not subsume the CAP result.

CAP result says that atomic storage problem is also impossible to solve under unreliable channels.

Recall that ABD solved the atomic storage problem in the FLP model. If we move to the CAP fault model and allow partitions, we observe from the ABD algorithm that it blocks (loses availability) for a read or write request that arrives to a node in a minority partition. Just as the CAP says, either consistency or availability has to give.

Here is the proof sketch verbatim from Gilbert-Lynch paper.

Similar to the attacking generals result, the CAP result is oblivious to whether the system is synchronous or asynchronous, and holds in both cases.

What is remaining?

Observe from the CAP proof sketch that the CAP fault model is very rough. When it says unreliable channels, it allows you to assume the worst case (i.e., no message makes it through at all), and prove the impossibility result for that worst case.

What if we quantify and limit the unreliability of the channels to more realistic scenarios. Can we prove more refined versions of CAP? What would be the consistency level a system can provide if the system model allows eventual message arrival? A recent technical report from University of Texas Austin, "Consistency availability convergence" paper, looks at that problem. We will discuss that paper next in our distributed systems seminar.

More about CAP tradeoffs

The Gilbert-Lynch paper discusses some of the practical implications of the CAP theorem and says that Consistency versus Availability should not be seen as an absolute and binary tradeoff. Instead you can consider shades of Consistency versus Availability. Also you can make different Consistency versus Availability tradeoffs at the data level, operation level, and subsystem level. These observations are very similar to the suggestions made in Eric Brewer's article in the same special issue: "CAP 12 years later, how the rules have changed".

The Gilbert-Lynch paper also mentions the scalability problems caused due to trying to enforce consistency, but leaves that discussion as future work. PACELC model by Daniel Abadi provides a more detailed explanation for Low-latency versus Consistency tradeoffs in the absence of partitions.

Saturday, February 7, 2015

How to present your work

Presentation is a very important skill. Determining how to present and communicate your results in an effective manner is  as important as doing the research and getting those results. From the same material you can get a killer job talk or a bummed dissertation talk. Presentation skills can make the difference.

Presentation is not a soft skill despite the common misconception among many technical people. It takes a lot of brains, analyzing, and synthesizing to produce a good presentation. You have to understand and internalize your content very well in order to present it clearly in the most engaging and compelling way. So if you fail to present your work clearly, that reflects poorly on you and your work. The audience will think "the work doesn't look significant or promising", or even worse "the presenter doesn't truly understand/internalize his work".

The most essential requirement for a successful presentation is practice. I observe that our graduate students are not good at presenting because they don't get enough practice and rehearsals. As with any skill, you will learn about presenting best by doing it. So don't waste any opportunity to talk about your work. Talk about your work to your relatives, who won't understand. (This is very useful, because you get to think of how to explain your contributions in the context of the real world out there.) Tell it to your friends not in the same field. Tell it to your lab mates. Give talks in the department seminars. You need to learn how to get feedback and gauge your presentation. Eventually, you will get to present your work in conferences.

The "Talk like TED" book 

I recently picked up this book from a library. (Yes, a physical library.) The book was a very easy, lightweight reading. The book's outline consists of the 9 tips it offers for giving successful TED-like talks. It argued that good talks need to be emotional, novel, and memorable, and categorized the 9 tips under those 3 headings.
+ unleash your passion
+ tell a story
+ have a conversation
+ teach something new
+ deliver WOW moments
+ use humor
+ keep it short
+ paint a mental picture
+ be authentic

This may be my own failing, but I didn't learn/benefit much from the book. (Shall I say, the book was not very emotional, novel, or memorable?) If you had done some reading about public speaking before, this book does not offer much new insights or content. This book could have been condensed to a long blog post.  For example, read this article by Chris Anderson instead.

Maybe the most interesting take away from the book is how much practice the presenters put into the talk. I knew TED speakers rehearsed a lot, but I was still surprised to learn how much. Six months before the talk, the presentation draft is there. Then it is all practice rehealsals and tweaking to refine and fine-tune. One month before the talk, the final form of the presentation is ready.

Keep your eyes on the message 
I think the best presentation advice I received was from another TED, Ted Herman. After hearing me speak as a graduate student on one of my work, he told me: "Focus on the message. Give your presentation with the sole goal to teach, and not to impress."

If you worry about impressing the audience in a presentation, then your ego gets in the way, you get self-conscious. You start questioning "Did I say that right? Was I too slow? Am I perceived as confident enough? etc." This will get you nervous and self-doubting. If you focus on the message, you will get your point across, even if it takes flapping your arms wildly. And if your content is good, it will impress people after all. As Anderson says "Presentations rise or fall on the quality of the idea, the narrative, and the passion of the speaker. It’s about substance, not speaking style or multimedia pyrotechnics."

Focusing on the message is hard to do using Powerpoint/Keynote. Powerpoint makes it easy to lose the focus on the message. By writing slide after slide and getting things stylistically right (which Powerpoint facilitates immensely), you get the false-sense of security that you are communicating the content. To avoid this and to force yourself to focus on the message/content, you should prepare the story and outline of the talk first in your mind, before sitting down to prepare the slides. How would you give this talk if you were forced not to use any slides. Thinking this way and purposefully omitting the slides as crutches will help you learn, discover, and hone your message. (More on this below, where I talk about framing the talk.)

To reiterate what I said in the beginning of the post, presentation is not a soft skill and requires a lot of brains. In order to produce a clear and condensed message, you need to learn how to abstract the most important lessons from all the work you did. This requires you to first process and internalize your work really well. You should omit a lot of incidental details, and provide a condensed message in a conference presentation or job talk. These talks are there to whet people's appetites and get them to read your papers. This does not mean that these talks should omit technical content. On the contrary they should include technical content that communicates the essence of your technical contribution in a clear and accessible manner.

Frame your talk well

And from my PhD advisor, Anish Arora, I learned how to frame a talk well. Framing means finding the right place to begin and developing the right story and context to pitch the ideas.

When we meet to discuss how to present our work, Anish would declare "let's try to understand what we did in this work". Wow! He was totally comfortable accepting that we could understand our work better, and the context at which we started and performed the work is not necessarily the best context with which we present the work. From Anish, I learned that it is OK to search for a new perspective and a better context to frame and present the work.

In fact, your best presentations are which you also learn and get a new appreciation of your work. If you get that new understanding of your work, that is a good indication that you did enough to frame your work, and you achieved a good understanding and focusing of your message.

Related links

Presentation skills is somewhat related to writing skills. So most of the advice here also applies to writing. Similarly some of the advise on writing also applies to preparing a presentation.
How to write your research paper
How I write

What is the biggest rock?
You can get the style right, and give a content-free and superficially interesting talk. But as Lincoln said: "You can fool all the people some of the time and some of the people all of the time, but you cannot fool all the people all the time."

Saturday, January 31, 2015

Dijkstra's stabilizing token ring algorithm

One of the classical algorithms I teach in my distributed systems class is Dijkstra's stabilizing token ring algorithm. This algorithm has started the self-stabilization field as a subfield of fault-tolerance. And, it still receives interest even after 40 years. There has been probably hundreds of self-stabilization papers that revisits Dijkstra's stabilizing token ring algorithm as part of a solution or as part of a case study. This algorithm never gets old for me as well. I still enjoy talking about this algorithm in class and thinking about it once in a while. I guess this is because it is a very elegant algorithm.

Linked is Dijkstra's original paper introducing the algorithm. This paper also includes two variants of the stabilizing token ring algorithm, 3-state and 4-state token ring algorithms. Dijkstra would later do a followup writeup, titled, "A belated proof of self-stabilization", where he gave a proof of stabilization for 3-state token ring program.

As a side note, when I asked my distributed class whether they heard Dijkstra's name before, almost all hands went up. And almost all of them had heard his name due to the "all pairs shortest path algorithm". Interesting. Probably, in Dijkstra's mind, that algorithm was a minor contribution. He came up with that algorithm at a cafe in 20 minutes without using pen and paper. None of the students heard the "Go-to considered harmful" paper, which may be closer to what Dijkstra was aiming with his career. Dijkstra's later work focused on predicate calculus for enabling  structured/principled verifiably-correct programming.

Stabilizing fault-tolerance

Stabilization is at the heart of the Dijkstra's token ring algorithm. Stabilization is a type of fault tolerance that advocates dealing with faults in a principled unified approach instead of on a case by case basis. Instead of trying to figure out how much faults can disrupt the system stabilization assumes arbitrary state corruption, which covers all possible worst-case collusions of faults and program actions. Stabilization advocates designing recovery actions that takes the program back to invariant states starting from any arbitrary. Anish Arora's paper "Practical Self-Stabilization for Tolerating Unanticipated Faults in Networked Systems" provides a great overview of stabilization concepts.

Token Ring stabilization

Let's start with the simple token ring concept. The processes are arranged in a ring fashion and there is a unique token circulating in this ring. (You can think of the token may be providing mutual exclusion to the processes; whichever process has the token can access the critical-section/shared-resource).

In the invariant states there is exactly one token. So in the bad states, there are two possible cases: 1) we end up with a ring with no tokens,  2) we end up with a ring with multiple tokens.

Let's consider how Dijkstra's token ring algorithm deals with case 1. Dijkstra's token ring algorithm maps the concept of a physical token to a relation between two neighboring processes. This mapping is done skillfully to limit the state space so the case 1 faults cannot possibly happen in that state space. In the mapped state space, any arbitrary state  is guaranteed to contain 1 or more tokens.

How does this work? Dijstra's algorithm encodes a token with the relation: "the x variable of a process is DIFFERENT than the x variable of its predecessor in the ring". In this case, if all x variables are the same, there won't be any token in the system. To fix this Dijkstra's algorithm encodes the token differently for process 0: "There is a token at process 0, if its x variable is the SAME as that of its predecessor, process N". Encoded this way, we cannot not have 0 tokens in any arbitrary state in the state space.

A good analogy for bounding the corruption space is the use of the carousal door. For a normal door, you have two possible states: door is open or door is closed. For a carousal door, there is only one possible space: the door is closed. (Yet you still make it through that door by revolving through it.)

In order to deal with case 2, Dijkstra's algorithm uses some token overwriting mechanisms. First let's observe that no process action introduces an extra token to the system. (Of course, a fault action can corrupt state arbitrarily, and can throw the system to a state with many extra tokens.) In the normal/fault-free case, by taking an action a process consumes a token that exists between its predecessor and itself and passes this token to its successor. In the faulty case, this is also the worst damage a process can do, because it will always consume the token at itself, and since its state change affects only its successor, it may or may not pass the token to its successor. If the process execution does not pass the token to its successor, this is actually good: an extra token is removed from the system. And if enough of these happen, the system will stabilize to exactly one token case.

But what if this leads to a system with no tokens? Recall that this cannot happen due to the discussion regarding Case 1.

Stabilizing token ring algorithm in TLA+ 

After explaining the program actions, in order to help students internalize what is going on, I call 5 students to stand up in front of the class. The students form a ring. And they use hands up or down to represent x=1 or x=0, and enact a simulation of the program starting from some arbitrarily corrupted states. The students first get to observe that there is indeed no configuration possible with 0 tokens. Then as they enact several scenarios, they observe how no new tokens are introduced by a process action, but in some cases an extra token is consumed.

Then, I challenge them to come up with a scenario where this binary token ring formed does not get to converge.  The corruption continues to live on in the system.  Here is such a run in a ring of 4:
1010 --> tokens at nodes 1,2,3
1011 --> tokens at nodes 1,2, and 0
1001 --> tokens at nodes 0,1,3
1101 --> tokens at nodes 0,2,3
0101 --> tokens at nodes 1,2,3
0100 --> tokens at nodes 1,2,0
0110 --> tokens at nodes 0,1,3
0010 --> tokens at nodes 0,2,3
1010 --> tokens at nodes 1,2,3
and this gets us to the first state. The system can repeat this loop forever without stabilizing.

Of course, when you run the model checker for this setup (where x is limited to 0 and 1) the TLA+ model checker provides you this counterexample.

The M>=N requirement 

So, what went wrong? Wasn't Dijkstra's token ring supposed to be stabilizing? The first time around we discussed the algorithm since x was an unbounded integer, the algorithm was stabilizing. When we considered the binary token ring example, x was bounded by 2, and that failed to stabilize in a ring of size 3 or more. There is a relation on M, the bound on x, and the algorithm's ability to stabilize for a ring of size N. We had seen in the counter example that for M<N there is a way to form a loop in the execution trace and keep the corruption going in the system.

For M>=N, the node 0 acts as a gate. Consider the case where node 0 never fires. In this case the other nodes fire, and eventually they all copy the x value at node 0, and in that configuration the system reduces to one token (at node 0) and stabilization is achieved. For M>=N, we can use the pigeon-hole principle to show that the system  eventually reaches the case where node 0 does not fire until all other nodes  copy its value. This is because, for M>=N, at any configuration there is a value "v" of x that is not currently present in the ring. Eventually node 0 will hit "v", and then node 0 will not be enabled until v is copied by all nodes. This is how node 0 acts as a gateway.

Concluding remarks  

I may talk about Dijkstra's 3-state and 4-state token ring algorithms in another post. Those are also interesting and elegant algorithms.  In a paper I wrote in 2002, I used Dijkstra's classic, 3-state, and 4-state token ring algorithms as case studies. If you like to learn more about this, you can read that paper here.

Related links

Crash only software 
The role of distributed state

Thursday, January 29, 2015

My Distributed Systems Seminar reading list for Spring 2015

Below is the list of papers I plan to discuss in my distributed systems seminar this semester. If you have some suggestions on other good/recent papers to cover, please let me know in the comments.
  1. Perspectives on the CAP Theorem
  2. Consistency, availability, and convergence.
  3. Salt: combining ACID and BASE in a distributed database
  4. Extracting More Concurrency from Distributed Transactions
  5. Eidetic Systems
  6. Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems
  7. The Mystery Machine: End-to-end Performance Analysis of Large-scale Internet Services
  8. All File Systems Are Not Created Equal: On the Complexity of Crafting Crash-Consistent Applications
  9. GraphX: Graph Processing in a Distributed Dataflow Framework
  10. Architecture of a database system
  11. Arrakis: The Operating System is the Control Plane
  12. Fast Databases with Fast Durability and Recovery Through Multicore Parallelism
  13. Project Adam: Building an Efficient and Scalable Deep Learning Training System

My experience with using TLA+ in distributed systems class

I used TLA+ in my distributed system class in Fall 2014. (To learn the backstory on this, read my pre-semester TLA+ post.)

In short, I loved the experience and I am hooked. Integrating TLA+ to the class gave students a way to get a hands-on experience in algorithms design and correctness verification. Even for a sophisticated algorithm, such as Paxos, I can refer the students to TLA+ to practice and play with the algorithm, so they can internalize what is going on. Throughout the semester as I had more chance to work with TLA+, I started growing a library of TLA+ modeling of distributed algorithms. Next time I teach the class, I hope to provide the students with a TLA+ model of each algorithm I cover in the class.

The student feedback was also positive. The students liked TLA+ since it gave them a way to experiment and supported them in reasoning about the algorithms. Several students complained of the learning curve. They wanted me to cover TLA+ in more detail in the class. (The TLA+ manual can be very dense for the students.) Next semester I plan to allocate more time getting the students acquainted with TLA+. (Next time around, I will also work on devising automated ways to grade the TLA+ projects. My TA had do manual testing of the projects, not a pleasant task for a class of 60 students.)

The TLA+ mini-projects I used

Throughout the semester, I assigned 3 TLA+ mini-projects with increasing complexity.

The aim of the first mini-project was to get the students started with TLA+. I asked the students to model a bean can problem (white and black beans with rules to remove beans form the can). This was simple and there was only one process. But the students got to learn how they can check safety and liveness conditions on their program.
The second mini-project required them to deal with multiple processes. I asked them to model the logical physical clocks we had introduced in our research.  Using TLA+ model checker the students found the counterexample in the naive algorithm, where the logical clock gradually but unboundedly gets ahead of the physical clock. This counterexample is hard to find and involves 30 steps in the algorithm. So this was an example were TLA+ proved its value.  (Try to figure out this counterexample with unit tests, I dare you.)
The final mini-project gave them more practice on modeling distributed algorithms. I asked them to model Dijkstra's classical stabilizing token ring algorithm as well as Dijkstra's lesser known 4-state stabilizing token ring algorithm. Students learned about self-stabilization and how it deals with arbitrary memory corruption.

Lessons learned (the good, the bad, and the ugly)

I learned that PlusCal language is the right abstraction for modeling distributed algorithms. TLA+ is too low level for writing (and reading) distributed algorithms. PlusCal allows you to include TLA+ expressions and definitions, so you benefit from all the expressive power of TLA+ without going full TLA+. PlusCal is not a separate entity, but just a more convenient way to write TLA+ code. In the TLA+ toolkit you can write the algorithm in PlusCal and hit the "translate to TLA+" to produce the corresponding TLA+ code. When I use the term "modeling an algorithm in TLA+", I really mean writing the model in PlusCal, auto-translating it to TLA+ and model checking it.

Learning to model algorithms using TLA+ takes time. The language is powerful and very general; it is Math after all. You need to study many examples to learn about the RIGHT ways of doing things. I experienced this as a novice first hand. I had written a PlusCal code for Paxos within a couple days. My code worked, but it used procedures, and it was very procedural/operational rather than declarative, the way Math should be. And so my code was very slow to model check. Model checking with 2 acceptors worked, but I gave up on model checking with 3 acceptors since it took more than an hour. Then I found Lamport's PlusCal code about Byzantine agreement. I rewrote my Paxos implementation, imitating his style. I used macros, instead of procedures. My code become more declarative than operational. And, for message passing, I used Lamport's trick of using record typed messages which are written just once to a shared message board; it was the acceptors responsibility to react to them (by just reading, not consuming them). My new refactored code translated to ~150 LOC TLA+, whereas my first version had translated to ~800 LOC TLA. And the new version model checked in a minute for 3 acceptors, leading to a couple of magnitudes of order improvement in efficiency.

While PlusCal is nice, the crutch of using labels in PlusCal algorithms (for enabling TLA+ translation) is a bit ugly. The labels in PlusCal are not just aesthetic. They determine the granularity of atomic execution of blocks of code, so it is critical to place the labels right. If you fail to do this right, you will run into deadlocks due to co-dependent await conditions in different blocks of code. Inserting a label reduces the granularity of block of code from atomic to break at the label. Labels were a little unintuitive for me. But after some practice, I was able to get a hang of them.

Another ugly part is the lack of data structure support for passing messages and parsing them. Using tuples without field names is flexible, but it becomes intractable. Using records help somewhat, but manipulating field names in records also gets complicated quickly. TLA+ doesn't have static/strict typing, and I got stuck with runtime errors several times. In one case, I realized I had to find a way to flatten out a set inside of another set; after a lot of head banging, I found about the UNION operator, which solved the problem.

Good engineers use good tools. And TLA+ is a great tool for modeling distributed algorithms. Modeling an algorithm with TLA+ is really rewarding as it makes you "grok" the algorithm.

Sunday, November 23, 2014

Google Cloud Messaging (GCM): An evaluation

I had written about our work on building a crowdsourced superplayer for the "Who wants to be a millionaire (WWTBAM)" quiz show earlier. In that work we developed an Android app that enabled the app users in Turkey to participate in WWTBAM in real time as the show was airing on TV. When a question was read by the show host, my PhD students typed the question and the multiple-choice options, which were transmitted via Google Cloud Messaging (GCM) to the app users. App users played the game, and enjoyed competing with other app users, and we got a chance to collect precious data about MCQA dynamics in crowdsourcing. Our app was downloaded 300K+ times, and at the peak of its popularity 20K participants played the game simultaneously.

We used GCM to send the questions to the participants because we wanted to keep the app simple. GCM is the default push messaging solution for the Android platform and is maintained by Google as a free service with no quotas. GCM allows app developers to send push messages to Android devices from the server. As an alternative to GCM, we could have developed our app to maintain open TCP connections with the backend servers, but this would have made our backend design complicated and we would need many machines to be able to serve the need. Using GCM, we were able to send quiz questions as push messages easily, and can get away with using only one backend server to collect the answers from the app users. Another alternative to GCM was to deploy our messaging service, say using MQTT, but since we were developing this app as an experiment/prototype, we wanted to keep things as simple as possible.

After we deployed our app, we noticed that there were no studies or analysis of GCM performance, and we wondered whether GCM was enough to satisfy the real-time requirements in our application. Since we had a lot of users, we instrumented our app to timestamp the reception time of the GCM messages and reply back to the backend server with this information.

Our evaluation was done for both offline and online mode of GCM messaging. In the online mode, we send the message to the online participants that are playing the game while WWTBAM is broadcasting. That is, in the online mode, we know that the client devices are powered on, actively in use and have network connection. In the offline mode, we send the GCM message at a random time, so the server has no knowledge of whether the client devices are powered on, are in use, or have network connection. Table2 shows the number of devices involved in each of our experiments.

We found a giant tail in the GCM delivery latencies. Figure 4 above shows the message arrival time cumulative distribution in the online and offline modes. Note that the x-axis is in logarithmic scale. Table 6 shows message arrival times broken into quartiles. The median and the average message arrival times in the table indicate that the latency in the offline experiment is significantly high compared to the online experiment.

In Figure 5, comparing the GCM delivery times under WiFi versus cellular data connection in the offline mode shows that GCM delivery latencies are lower using the cellular data connection.
To investigate the GCM arrival times in the offline scenario further, we devised a double message experiment, where we send 2 GCM messages initiated at the same time on our server. This time, instead of measuring the message arrival time, we measure the difference between the arrival times of the first message and the arrival time of the second message. We were expecting to see very low time difference between these twin messages, but Figure 9 shows that the giant tail is preserved, and that some of the devices receive the second message hours after receiving the first one. Since we know that, at the time the first of the twin messages has arrived to the device, the device is reachable by Google’s GCM servers, the latency for the second message should be due to scheduling it to be sent independently than the first message.

These results show that the GCM message delivery is unpredictable, namely having a reliable connection to Google's GCM servers on the client device does not by itself guarantee a timely message arrival. It would be worth replicating this experiment where all participants are in the US, to see how that changes the results.

Our experiments raised more questions than it answered. Delving deeper at the documentation, we found some information that may explain some of the problems with GCM in the offline mode, but we need more examples to test these and understand the behavior better. It turns out that both cellular and WiFi modes keep a long lived connection to GCM servers. In the case of deep sleep mode (where user does not turn on the screen for long), the CPU wakes up and sends heartbeat to the servers every 28 minutes on cellular, every 15 minutes on the WiFi. This causes the connection to drop in many cases, because routers and providers kill the inactive TCP connections without sending any info to the client and/or server. Not all users are affected by this problem, it depends on carrier and/or router they use.

Related links

Our GlobeCom14 paper includes more information.

Saturday, November 15, 2014

Paper Summary: Granola, Low overhead distributed transaction coordination

This paper is by Cowling and Liskov, and it appeared at Usenix ATC'12.  Most closely related papers to this paper are the Sinfonia and Calvin papers. So, it may be helpful to also read the summaries of those papers from the links above to familiarize yourself with them.

This paper looks at coordination of 1-round transactions, which are different from general transactions that involve multi-round interaction with the client. 1-round transactions execute at the participant nodes, with no communication with other nodes except for at most a single commit/abort vote.

Figure 1 shows the system architecture. The clients are the transaction managers for these 1-round transactions. They initiate transactions and evaluate the commit/conflict/abort decisions returned for the transactions. The repositories communicate with one another (for at most a single commit/abort vote) to coordinate transactions.

There are 2 types of transactions in Granola: coordinated ones, and uncoordinated ones. Actually, better/less-confusing names for these 2 types would be lock-coordinated transactions and timestamp-coordinated transactions, since the latter type of transactions still involve coordination among the repositories.

To reflect this duality in coordination mode, each repository runs in one of two modes. When there are no coordinated transactions running at a repository, it runs in timestamp mode. When a repository receives a request for a coordinated transaction, it switches to the locking mode.

Uncoordinated transactions

There are two kinds of uncoordinated transactions: single repository transactions, or independent distributed transactions. Granola uses a timestamped-based coordination to provide serializability to both kinds of uncoordinated transactions and avoids locking for them. Examples of independent distributed transactions include read-only transactions, and local-read transactions, such as give every employee a 2% raise.

(In our previous work, the slow-fast paper (2010), we had also made an analysis similar to the independent transactions idea. Our slow-fast analysis inspects the program actions, which are precondition guarded assignment statements, and determined for which actions the atomicity can be relaxed so that they can execute in an uncoordinated manner. Our finding was that if the precondition of a program action is "locally-stable" (i.e., this precondition predicate cannot get falsified by execution of other program actions), then it is safe to execute this program action in an uncoordinated/nonatomic manner.)

Repositories are assumed to have access to loosely synchronized clocks (say NTP synchronized clocks). The use of timestamps in Granola are for ordering/serializability of transactions, and logical clocks would also suffice. Granola needs time synchronization for performance/throughput not for correctness. This separation is always a welcome one. (I can't resist but refer to our hybrid logical clock, HLC, work here. I think HLC would improve exposition of Granola, and would make it tolerant to malformed clients which may increase highTS and deny service to normal clients' requests.)

Figure 5 shows the straightforward single repository execution, and Figure 6 shows the more interesting independent transaction execution. Since these are 1-round transactions, the commit/abort decisions are local and one-shot at repositories. Voting is used to notify other repositories about the local decision (a conflict vote cause a repository to abort the transaction), and also for nominating a proposed timestamp for the transaction. The transaction is assigned the highest timestamp from among the votes.

In the "pure" timestamp mode, all single repository and independent distributed transactions commit successfully, and serializability is guaranteed as they are executed in timestamp order. This provides a substantial reduction in overhead from locking and aborts, which improve the throughput.

While in the timestamp mode, coordinated transactions may also arrive which may lead to conflict decisions to be returned for the single or distributed independent transactions executing on those repositories. Those repositories will switch to coordinated mode to serve the coordinated transactions. Once all coordinated transactions have completed those repositories can again transition to the timestamp mode.

Dually, in locking mode, repositories can still serve single repository and distributed independent transactions provided they do not conflict, i.e., they do not need to update a locked item. (In effect, the timestamp mode is just a shortcut to denote the lack of any coordinated transactions in the system.)

Coordinated transactions

Figure 7 shows coordinated transaction execution. Coordination is needed for a transaction that requires a remote read for abort/commit decision. For example, the transaction to transfer $50 from Alice's account to Bob's account first needs to remote-read Alice's account to certify that it indeed has more than $50. (Recall that in contrast for an independent distributed transaction the read, or the guard of the transaction, was local: give every employe 2% raise.) Different from independent distributed transaction execution, the coordinated transaction execution involves a prepare phase to acquire required locks. Locks ensure serializability for coordinated transaction execution. So timestamp order may not be satisfied in commit of transactions, and thus, external consistency may not be provided for coordinated transaction execution.


This paper is written with a focus on describing the system in reasonable detail to potential users of the system. This diverges a bit from the academic style, which would put the focus on the novelties and "intellectual merit" (a la NSF). Maybe this is because of the style/focus of the USENIX ATC conference. The evaluation section is also really nice and detailed. (Granola performs like Sinfonia for coordinated transacations, and improve throughput for uncoordinated transactions.)

It seems that Granola does not offer much advantage over Calvin. The paper compares in the related work with Calvin and states the following. "The Calvin transaction coordination protocol was developed in parallel with Granola, and provides similar functionality. Rather than using a distributed timestamp voting scheme to determine execution order, Calvin delays read/write transactions and runs a global agreement protocol to produce a deterministic locking order."

I think Granola may have an edge over Calvin for low-latency, especially for transactions that involve a couple repositories. The best way to see why is to consider this analogy.
Granola:Calvin :: CSMA:TDMA.