Saturday, April 18, 2015

GraphX: Graph processing in a distributed dataflow framework

This paper appeared in OSDI'14, and is authored by Joseph E. Gonzalez, University of California, Berkeley; Reynold S. Xin, University of California, Berkeley, and Databricks; Ankur Dave, Daniel Crankshaw, and Michael J. Franklin, University of California, Berkeley; Ion Stoica, University of California, Berkeley, and Databricks. This link includes video and slides which are useful to understand the paper.

This paper comes from the AMP lab at UC Berkeley. (Nice name! AMP stands for Algorithms, Machines, and People.) This lab brought to us Spark, GraphLab, PowerGraph. And this paper is a logical successor. This paper is about marrying Spark (dataflow systems) with GraphLab (graph-processing systems).


Here is the motivation for this merger. In large-scale computation, we need both dataflow processing and graph processing systems. Graph-processing systems outperform dataflow processing systems by an order of magnitude for iterative computations on graphs (e.g., connected-component analysis, PageRank analysis). Unfortunately, it is very cumbersome to use two different tools and convert data back and forth between the two. The pipeline becomes very inefficient.

The paper sees an opportunity to unify the two tools (using a narrow-waist data/graph representation in the form of mrTriplets) and provide a single system to address the entire analytics pipeline.

GraphX is actually a thin abstraction layer on top of Spark that provides a conversion from graph computation to dataflow operations (Join, Map, GroupBy). During this reduction from graph computation to dataflow patterns, GraphX applies optimizations based on lessons learned in earlier work on efficient graph-processing (e.g., GraphLab).


GraphX introduces a range of optimizations.

As the programming abstraction GraphX introduces a normalized representation of graphs logically as a pair of vertex and edge property collections. This is called the triplet view.
The GroupBy stage gathers messages destined to the same vertex, an intervening Map operation applies the message sum to update the vertex property, and the Join stage scatters the new vertex property to all adjacent vertices.  This allows GraphX to embed graphs in a distributed dataflow framework. Flexible vertex-cut partitioning is used to encode graphs as horizontally partitioned collections and match the state of the art in distributed graph partitioning.

Here vertex mirroring approach substantially reduces communication for two reasons. First, real-world graphs commonly have orders of magnitude more edges than vertices. Second, a single vertex may have many edges in the same partition, enabling substantial reuse of the vertex property.

As another optimization learned from graph-processing systems, GraphX performs active vertices tracking. In graph algorithms, as algorithm converges, the set of active vertices shrink significantly, and this optimization avoids, wasteful work. GraphX tracks active vertices by restricting the graph using the subgraph operator. The vertex predicate is pushed to the edge partitions, where it can be used to filter the triplets.

GraphX programming

While graph-processing systems, and most famously Pregel, advocated a "think like a vertex" approach to programming, the GraphX programming model is closer to thinking about transformations on data. This may require some getting used to for programmers not familiar with dataflow programming and database operations.


Comparison to Naiad

If you are familiar with the Naiad project, you might be thinking: "Well, Naiad solves the unified general purpose dataflow & graph processing problem and throws in stream-processing and dynamic graphs for good measure". (GraphX does not support dynamic graphs.) So, what are the contributions differences in GraphX over Naiad?

I am new to the dataflow systems domain, and don't know enough to give a more authoritative answer. The contributions in GraphX may be mostly in the idea and academic contributions form. I think the idea of representing graph computations back to dataflow systems is nice. Unfortunately the GraphX paper does not compare with Naiad in terms of performance. And, after the OSDI presentation, there were couple questions/complaints about this point.

GitHub page of the GraphX project

GraphX is available as opensource on GitHub.

Thursday, April 16, 2015

All file systems are not created equal: On the complexity of crafting crash-consistent applications

This paper appeared in OSDI'14 and is authored by Thanumalayan Sankaranarayana Pillai, Vijay Chidambaram, Ramnatthan Alagappan, Samer Al-Kiswany, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau at University of Wisconsin–Madison.

A previous OSDI'14 paper we discussed had said almost every failure is due to bad exception/error-handling. But this paper shows that even when you divine the correct error-handling/recovery code, it may still not work. The layering abstraction leaks, and the filesystem underneath may do funny things in a crash.

The paper considers an important and timely problem, because many important applications, including databases such as SQLite and key-value stores such as LevelDB, are currently implemented on top of file systems instead of directly on raw disks. Such data-management applications must be crash consistent, but achieving this goal atop modern file systems is challenging because the exact guarantees provided by file systems are unclear and underspecified.

The paper defines persistence (a better term would be consistent-persistence) as a combination of two properties: atomicity and ordering (external linearizability). Figure 2 gives an example of how persistence can be violated by a crash.

From Table 1, we observe that persistence properties vary widely among file systems, and even among different configurations of the same file system. The order of persistence of system calls depends upon small details like whether the calls are to the same file or whether the file was renamed. The datajournal configuration of the filesystems are pretty solid, but they incur an overhead in terms of performance as well.

In order to analyze application-level protocols and detect crash vulnerabilities, the authors build ALICE framework. (ALICE is available as opensource here.) ALICE detects 60 vulnerabilities in total for the 11 applications analyzed, with 5 resulting in silent failures, 12 in loss of durability, 25 leading to inaccessible applications, and 17 returning errors while accessing certain data. ALICE is also able to detect previously known vulnerabilities.

The paper is easy to read and follow. And the conference presentation does a good job of explaining the paper in an accessible manner.


Is this paper being too alarmist? If we allow our system to recover to an earlier state instead of the most recent state at crash time, would that enable us to circumvent these crash-consistency problems? (Let's say we define "earlier state" as occuring in the past enough to be successfully flashed to the filesystem state.) Even that approach may fail if the most recent state at the moment of crash overwrites it inconsistently, which would corrupt it. So there is a reason to be alarmed!

But if we use a journaling approach (e.g., an append-only log approach) to writing the critical recovery states, this problem can be avoided. I guess a write-once style storage for critical state can be implemented even at the application-level. But again we pay a cost for fault-tolerance. If you take this to an extreme (to be able to recover everything), you implement the datajournal configuration of the filesystem at the application level.

This paper provides some motivation for the self-stabilization approach. If it is hard to enforce consistency, then always be converging to the consistent states. That is what the stabilization approach prescribes.

Monday, March 30, 2015

Facebook's Mystery Machine: End-to-end Performance Analysis of Large-scale Internet Services

This paper appeared in OSDI'14, and is authored by Michael Chow, University of Michigan; David Meisner, Facebook, Inc.; Jason Flinn, University of Michigan; Daniel Peek, Facebook, Inc.; Thomas F. Wenisch, University of Michigan.

The goal of this paper is very similar to that of Google Dapper (you can read my summary of Google Dapper here). Both work try to figure out bottlenecks in performance in high fanout large-scale Internet services. Both work use similar methods, however this work (the mystery machine) tries to accomplish the task relying on less instrumentation than Google Dapper. The novelty of the mystery machine work is that it tries to infer the component call graph implicitly via mining the logs, where as Google Dapper instrumented each call in a meticulous manner and explicitly obtained the entire call graph.

The motivation for this approach is that comprehensive instrumentation as in Google Dapper requires standardization....and I am quoting the rest from the paper:
[Facebook systems] grow organically over time in a culture that favors innovation over standardization (e.g., "move fast and break things" is a well-known Facebook slogan). There is broad diversity in programming languages, communication middleware, execution environments, and scheduling mechanisms. Adding instrumentation retroactively to such an infrastructure is a Herculean task. Further, the end-to-end pipeline includes client software such as Web browsers, and adding detailed instrumentation to all such software is not feasible.

While the paper says it doesn't want to interfere with the instrumentation, of course it has to interfere to establish a minimum standard in the resulting collection of individual software component logs, which they call UberTrace. (Can you find a more Facebooky name than UberTrace---which the paper spells as √úberTrace, but I spare you here---?)
UberTrace requires that log messages contain at least:
1. A unique request identifier.
2. The executing computer (e.g., the client or a particular server)
3. A timestamp that uses the local clock of the executing computer
4. An event name (e.g., "start of DOM rendering").
5. A task name, where a task is defined to be a distributed thread of control.
In order not to incur a lot of overhead, UberTrace uses a low sampling rate of all requests to Facebook. But this necessitates another requirement on the logging:
UberTrace must ensure that the individual logging systems choose the same set of requests to monitor; otherwise the probability of all logging systems independently choosing to monitor the same request would be vanishingly small, making it infeasible to build a detailed picture of end-to-end latency. Therefore, UberTrace propagates the decision about whether or not to monitor a request from the initial logging component that makes such a decision through all logging systems along the path of the request, ensuring that the request is completely logged. The decision to log a request is made when the request is received at the Facebook Web server; the decision is included as part of the per-request metadata that is read by all subsequent components. UberTrace uses a global identifier to collect the individual log messages, extracts the data items enumerated above, and stores each message as a record in a relational database.

The mystery machine

To infer the call graph from the logs, the mystery machine starts with a call graph hypothesis and refines it gradually as each log trace provides some counterexample. Figure 1 and Figure 2 explain how the mystery machine generates the model via large scale mining of UberTrace.

For the analysis in the paper, they use traces of over 1.3 million requests to the Facebook home page gathered over 30 days. Was the sampling rate enough, statistically meaningful? Figure 3 says yes.

We know that for large scale Internet services, a single request may invoke 100s of (micro)services, and that many services can lead to 80K-100K relationships as shown in Figure 3. But it was still surprising to see that it took 400K traces for the call graph to start to converge to its final form. That must be one heck of a convoluted spaghetti of services.


The mystery machine analysis is performed by running parallel Hadoop jobs.

Figure 5 is why critical path identification is important. Check the ratios on the right side.

How can we use this analysis to improve Facebook's performance?

As Figure 9 showed, some users/requests have "slack" (another technical term this paper introduced). For the users/requests with slack, the server time constitutes only a very small fraction of the critical path, which the network- and client-side latencies dominate.

And there are also a category of users/requests with no slack. For those, the server time dominates the critical path, as the network- and client-side latencies are very low.

This suggests a potential performance improvement by offering differentiated service based on the predicted amount of slack available per connection:
By using predicted slack as a scheduling deadline, we can improve average response time in a manner similar to the earliest deadline first real-time scheduling algorithm. Connections with considerable slack can be given a lower priority without affecting end-to-end latency. However, connections with little slack should see an improvement in end-to-end latency because they are given scheduling priority. Therefore, average latency should improve. We have also shown that prior slack values are a good predictor of future slack [Figure 11]. When new connections are received, historical values can be retrieved and used in scheduling decisions. Since calculating slack is much less complex than servicing the actual Facebook request, it should be feasible to recalculate the slack for each user approximately once per month.

Some limitations of the mystery machine 

This approach assumes that the call graph is acyclic. With their request id based logging, they cannot handle the same event, task pair to appear multiple times for the same request trace.

This approach requires normalizing/synchronizing local clock timestamps across computers. It seems like they are doing offline post-hoc clock synchronization by leveraging the RPC calls. (Does that mean further instrumentation of the RPC calls?)
Since all log timestamps are in relation to local clocks, UberTrace translates them to estimated global clock values by compensating for clock skew. UberTrace looks for the common RPC pattern of communication in which the thread of control in an individual task passes from one computer (called the client to simplify this explanation) to another, executes on the second computer (called the server), and returns to the client. UberTrace calculates the server execution time by subtracting the latest and earliest server timestamps (according to the server's local clock) nested within the client RPC. It then calculates the client-observed execution time by subtracting the client timestamps that immediately succeed and precede the RPC. The difference between the client and server intervals is the estimated network round-trip time (RTT) between the client and server. By assuming that request and response delays are symmetric, UberTrace calculates clock skew such that, after clock-skew adjustment, the first server timestamp in the pattern is exactly 1/2 RTT after the previous client timestamp for the task.
This work also did not consider mobile users; 1.19 billion of 1.39 billion users are mobile users.

Related links

Facebook's software architecture

Scaling Memcache at Facebook

Finding a needle in Haystack: Facebook's photo storage

Google Dapper, a Large-Scale Distributed Systems Tracing Infrastructure

Wednesday, March 25, 2015

Paper review: "Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems"

This paper appeared in OSDI'14 and is written by Ding Yuan, Yu Luo, Xin Zhuang, Guilherme Renna Rodrigues, Xu Zhao, Yongle Zhang, Pranay U. Jain, and Michael Stumm, University of Toronto.

While the title mentions "analysis of production failures", the paper is not concerned/interested in root cause analysis of failures, instead, the paper is preoccupied with problems in error handling that lead to failures.

I enjoyed this paper a lot. The paper has interesting counterintuitive results. It opens to discussion the error handling issue, and particularly the Java exception handling. Digesting and interpreting the findings in the paper will require time. To contribute to this process, here is my take on the findings ---after a quick summary of the paper.

The setup and terminology 

The paper studied 198 randomly sampled user-reported failures of 5 opensource data-intensive distributed systems: Cassandra, HBase, Hadoop Distributed File System (HDFS), Hadoop MapReduce, and Redis. Table 1 shows how the sampling is done. (As a superficial observation, it seems like HBase is very buggy where 50% of sampled failures are catastrophic, and Cassandra is well engineered.) Almost all of these systems are written in Java, so Java's exception-style error handling plays a significant role in the findings of the paper.

Definitions: A fault is the initial root cause, which could be a hardware malfunction, a software bug, or a misconfiguration. A fault can produce abnormal behaviors referred to as errors, such as Java exceptions. Some of the errors will have no user-visible side-effects or may be appropriately handled by software; other errors manifest into a failure, where the system malfunction is noticed by end users or operators.

The paper defines catastrophic failure as "failures where most or all users experience an outage or data loss". Unfortunately this is a very vague definition. Table 2 provides some example categories for catastrophic failures considered.


Table 3 shows that single event input failures are relatively low, this is probably because these systems are well-tested with unit tests as they are heavily used in production. On the other hand, the predominating case is where 2 events conspire to trigger the failures.

From Table 4, what jumps up is that starting up services is particularly problematic. Incidentally, in airplanes most fatal accidents occur during the climbing stage.

Table 5 shows that almost all (98%) of the failures are guaranteed to manifest on no more than 3 nodes, and 84% will manifest on no more than 2 nodes. For large-scale distributed systems, 3 nodes being sufficient to manifest almost all failures seems surprisingly low. Of course, this paper looks at data-intensive distributed systems, which may not be representative of general distributed systems. In any case, these numbers don't surprise me as they agree with my experience using TLA+ in verifying distributed algorithms.

Deterministic failures totaling at 74% of all failures is good news. Deterministic failures are low-hanging fruit, they are easier to fix.

Catastrophic failures

Figure 5 shows a break-down of all catastrophic failures by their error handling. Based on this figure, the paper claims that "almost all (92%) of the catastrophic system failures are the result of incorrect handling of non-fatal errors explicitly signaled in software".

But, it seems to me that this provocative statement is due to a broad/vague definition of "incorrect error handling". If you use a broad/vague definition of "incorrect landing", you can say that every catastrophic airplane failure is an incorrect landing problem. Java casts everything into an error exception, then every fault materializes/surfaces as an exception. But, does that mean if we do a good job on exception handling, there will be almost no catastrophic failures? That is an incorrect assumption. Sometimes the correction required needs to be invasive (such as resetting nodes) and the correction also counts as a catastrophic failure.

And how can we do a good job on error handling? The paper does not provide help on that. Correct error-handling is very hard: You need a lot of context and a holistic understanding of the system, and that is probably why error-handling has been done sloppily in the systems studied.

The paper also claims: "in 58% of the catastrophic failures, the underlying faults could easily have been detected through simple testing of error handling code." I can agree with the 35%, as they consist of trivial mistakes in exception handling, such as an empty error handling block. But for including the other 23% labeling them as easily detectable, I think we should exercise caution and not rule out the hindsight bias.

To prevent this 23% failures, the paper suggests 100% statement coverage testing on the error handling logic. To this end, the paper suggests that reverse engineering test cases that trigger them. At the end of the day, this boils to thinking hard to see how this can be triggered. But how do we know when to quit and when we got enough? Without a rule, this can get cumbersome.

Figure10 shows an example of the 23% easily detectable failures. That example still looks tricky to me, even after we exercise the hindsight advantage.

Speculation about the causes for poor error-handling

Maybe one reason for poor error-handling in the studied systems is that features are sexy, but fault-tolerance is not. You see features, but you don't see fault-tolerance. Particularly, if you do fault-tolerance right, nobody notices it.

But, that is a simplistic answer. After all, these systems are commonly used in production, and they are well-tested with the FindBugs tool, unit tests, and fault injection. Then, why do they still suck so badly at their exception-handling code? I speculate that maybe there is a scarcity of "expert generalists", developers that understand the project as a whole and that can write error-handling/fault-tolerance code.

Another reason of course could be that developers may think a particular exception will never arise. (Some comments in the exception handlers in the 5 systems studied hint at that.) That the developers cannot anticipate how a particular exception can be raised doesn't mean it won't be raised. But, maybe we should include this case within the above case that says there is a scarcity of fault-tolerance experts/generalists in these projects.

"Correct" error handling

While the paper pushes for 100% error handling coverage,  it doesn't offer techniques for "correct" error handling. Correct error handling requires root cause analysis, a holistic view of the system, and fault-tolerance expertise.

So there is no silver bullet. Exceptions in Java, and error handling may facilitate some things, but there is no shortcut to fault-tolerance. Fault-tolerance will always have a price.

The price of reliability is the pursuit of the utmost simplicity.
It is a price which the very rich find most hard to pay.
C.A.R. Hoare

Possible future directions

Since 2-input triggered corruptions are predominant, maybe we should consider pairwise testing in addition to unit testing to guard against them. Unit testing is good for enforcing a component's local invariants. Pairwise testing can enforce an interaction invariant between two components, and guard against 2-input triggered corruptions.

I think the main reason it is hard to write "correct error handling" code is that the exception handler doesn't have enough information to know the proper way to correct the state. This is still futuristic, but if we had an eidetic system, then when exceptions hit, we could call the corrector, which can do a root-cause analysis by doing a backwards query on the eidetic system, and after determining the problem could do a forward query to figure out what are precisely the things that need to be fixed/corrected as a result of that fault.

Thursday, March 12, 2015

My crazy proposal for achieving lightweight distributed consensus

Distributed consensus is a hard problem. See some of my previous posts for discussion about impossibility results on distributed consensus.
Paxos taught
Perspectives on the CAP theorem
Attacking Generals problem

The reason distributed consensus is hard is because the parties involved don't have access to the same knowledge (the same point of view) of the system state.  Halpern's work on knowledge and common knowledge in distributed systems provides a useful framework to explain this.  Here is a summary of common knowledge paper by Halpern, and here is a nice talk on that paper.

Of course, we have distributed consensus solutions, Paxos, ZooKeeper/ZAB, that ensure safety always and provide progress whenever the system conditions step outside the impossibility results territory (with respect to synchrony, channel reliability/reachability, and number of up nodes). But these solutions come with a performance cost, as they need serialization of all update requests from a single Paxos leader, and acknowledgement of these updates by a majority quorum of the Paxos replicas. In ZooKeeper non-leader replicas can also serve reads locally, but the updates still need to get serialized by the leader. Some Paxos variants such as Mencius, ePaxos try to relieve this problem, but the bottom-line is they are still subject to the same fundamental performance limitations due to serialization at a single leader.

I have this crazy idea for circumventing the impossibility results of distributed consensus as well as improving the performance of distributed consensus. The idea is to connect the nodes involved in consensus (let's call these nodes coordinators) with a single-collision-domain Ethernet bus in order to solve the asymmetric knowledge problem.

No, this setup does not assume reliable communication. There can be collisions in the Ethernet bus. But the collisions will be total collisions since this is a shared Ethernet bus. When there is a collision, none of the coordinators would deliver the message, including the sender. Since Ethernet on a shared bus is CSMA-CD, the transmitter also detects the collision and does not accept its own message into the consensus log if that is the case.

So, in effect, the shared Ethernet bus performs the serialization of the coordinators' proposals. As a result, a coordinator proposing an operation does not need to collect explicit acknowledgement from any other coordinator let alone from a majority quorum of coordinators. This makes this consensus algorithm very lightweight and fast.

This system is masking fault tolerant to coordinator crashes until at least one coordinator left remaining. (If we are to allow reintegrating coordinators recovering back from crash, things get complicated of course. Then we would need to assume reasonable churn to allow time for recovering coordinators to catch up before they can be integrated. This would also require fast one broadcast consensus on the reconfigurations.)

That is it. That simple. Now comes the algorithmician's apology.

On the theory side, I know I am not suggesting anything novel. The impossibility results withstand; I just changed the system conditions and stepped outside the territory of the impossibility results (that is why I used the term "circumvent"). In fact, I had noticed this idea first in the context of wireless broadcast and wireless sensor networks, when I was a post-doc at Nancy Lynch's group at MIT. We had published papers exploring the concept for wireless broadcast with total and partial collisions.

On the practical side, I know this proposal has downsides. This is not readily applicable as it requires the Ethernet driver to expose collision information to the application layer. This requires setting up an auxiliary Ethernet LAN across the coordinators. And, yes, I know this doesn't scale outside a LAN. (The coordinators should be connected by the single domain Ethernet bus, but the clients to the coordinators may communicate to the coordinators over TCP/IP and need not be in the same LAN. The coordinators can be located across different racks to guard against rack-wide crashes.)

But every system design is an exercise in determining which tradeoffs you make. The important question is: Is this tradeoff worth exploring? Would the performance improvement and simplicity stemming from this setup makes this a reasonable/feasible option for solving the distributed consensus problem at the datacenter level?

Tuesday, March 10, 2015

Eidetic systems

This paper appeared in OSDI'14. The authors are all from University of Michigan: David Devecsery, Michael Chow, Xianzheng Dou, Jason Flinn, and Peter M. Chen.

This paper presents a transformative systems work, in that it introduces a practical eidetic system implementation on a Linux computer/workstation. This paper is a tour de force: It undertakes a huge implementation effort to implement a very useful and novel eidetic memory/system service. The authors should be commended for their audaciousness.

An eidetic computer system can recall any past state that existed on that computer, including all versions of all files, the memory and register state of processes, interprocess communication, and network input. An eidetic computer system can explain the lineage of each byte of current and past state. (This is related to the concept of data provenance, which I mention briefly at the end of my review.)


One use case for an eidetic system is to track where/how erroneous information entered to the system. The paper considers tracking down a faulty bibtex reference as a case study. This is done using a backwards query. After tracking down the faulty bibtex reference you can then perform a forward query on the eidetic system, in order to figure out which documents are contaminated with this faulty information and to fix them.

Another use case for an eidetic system is to do postmortem of a hack attack and whether it leaked any important information. In the evaluation section, the paper uses as another case study the heartbleed attack, which occurred during time the authors were testing/evaluating their eidetic system implementation.

With a good GUI for querying, the eidetic system concept can enhance the Mac OSX Time Machine significantly, with data lineage/provenance, backward querying, and forward querying/correction. This can augment time travel with analytics, and you can have a time machine on steroids. (Accomplishment unlocked: +100 points for serious use of time machine and time travel in writing.)

Design and implementation

The authors develop the eidetic system, Arnold, by modifying Linux kernel to record all nondeterministic data that enters a process: the order, return values, and memory addresses modified by a system call, the timing and values of received signals, and the results of querying the system time. Arnold and accompanying eidetic system tools (for replay, etc.) are available as opensource.
The key technologies that enable Arnold to provide the properties of an eidetic system efficiently are deterministic record and replay, model-based compression, deduplicated file recording, operating system tracking of information flow between processes, and retrospective binary analysis of information flow within processes.
Arnold uses deterministic record and replay, and trades storage for recomputation whenever possible. That is, Arnold only saves nondeterministic choices or new input and can reproduce everything else by recomputation. The major space saving technique Arnold uses is model based compression: Arnold constructs a model for predictable operations and records only instances in which the returned data differs from the model. Another optimization is copy on RAW (read-after-write) recording: "To deduplicate the read file data, Arnold saves a version of a file only on the first read after the file is written. Subsequent reads log only a reference to the saved version, along with the read offset and return code." These techniques enable Arnold to fit 4 years of desktop/workstation eidetic system into 4TB of off-the-shelf hard disk (which costs $150).

Querying and Replaying

Arnold uses the replay groups abstraction to perform storing and replaying efficiently. Replay groups consist of frequently communicating processes which can be replayed independently of any other group. Arnold employs "Pin" binary instrumentation to analyze replayed executions and track the lineage of data within a replay group. Inter process communication is tracked with the help of a dependency graph which keeps track of the communications between different replay groups. Bundling frequently communicating processes into a group ensures that a large number of conversations need not be recorded to the dependency graph. As such selection of replay group (and replay group size) gives rise to a tradeoff between storage efficiency and query efficiency. It would be nice if the paper provided the replay groups it used in Arnold as a table. This information would be useful to understand the replay groups concept better.

Arnold records even user propagated lineage, such as a user reading a webpage and entering text into an editor as a result. (Of course this leads to introducing some false positives, as it needs to be done speculatively.) Tracking this actually required a lot of work: "Understanding GUI output turned out to be tricky, however, because most programs we looked at did not send text to the X server, but instead sent binary glyphs generated by translating the output characters into a particular font. Arnold identifies these glyphs as they are passed to standard X and graphical library functions. It traces the lineage backward from these glyphs using one of the above linkages (e.g., the index linkage)."

Finally, for the querying of Arnold, the paper has this to say. "A backward query proceeds in a tree-like search, fanning out from one or more target states. The search continues until it is stopped by the user or all state has been traced back to external system inputs. As the search fans out, Arnold replays multiple replay groups in parallel. In addition, if no lineage is specified, it may test multiple linkages for the same group in parallel, terminating less restrictive searches if a more restrictive search finds a linkage."

Unfortunately, user-friendly GUI-based tools for querying is not available yet. That would be asking too much from this paper which already packed a lot of contributions into a single publication. The evaluation section gives some results about backward and forward querying performance in Arnold.

Related work on data provenance

Data provenance is a topic which has been studied as part of the database field traditionally. However, recent work on data provenance started considering the problem of capturing provenance for applications performing arbitrary computations (not resricted to a small set of valid transformations in database systems). The paper "A primer on provenance" provides a nice accessible survey of data provenance work.

Future work

This paper presents an eidetic system on a single computer. An obvious future direction is to enable building an eidetic distributed system. By leveraging Arnold, such a system also seems to be in reach now. Our work on hybrid logical clocks can also help here by relating and efficiently tracking causality across distributed nodes running Arnold. Since our hybrid logical clocks can work with loosely synchronized time (a la NTP), and is resilient to uncertainty (it enables efficient tracking of causality without blocking for synchronization uncertainties), it can be adopted for implementing a distributed eidetic system in practice.

A remaining kink for a distributed eidetic system could be the cost of querying. Querying and replay is already slow and hard for a single eidetic system, and it is likely to become more complicated for a distributed system since coordination of replay is needed across the machines involved in the replay.

Thursday, March 5, 2015

Extracting more concurrency from Distributed Transactions

This paper appeared in OSDI'14. The authors are: Shuai Mu, Yang Cui, Yang Zhang, Wyatt Lloyd, Jinyang Li. The paper, presentation slides and video are accessible here.

The paper proposes a concurrency control protocol for distributed transactions, and evaluates its performance comparing with two-phase locking (2PL) and optimistic concurrency control (OCC).

The protocol

The protocol introduced, ROCOCO (ReOrdering COnflicts for COncurrency) is targeted for extracting more concurrency under heavily contended workload than 2PL and OCC can handle. In fact, ROCOCO's improvements over OCC and 2PL comes after the peak throughput point of even ROCOCO. One of the questions after the OSDI presentation was about this. "That region where the system is thrashing is not a region you want to be. Why would you not employ admission control to refuse extra workload that pushes the system past peak performance?"
ROCOCO assumes that a distributed transaction consists of a set of stored procedures called pieces. Each piece accesses one or more data items stored on a single server using user-defined logic. Thus, each piece can be executed atomically with respect to other concurrent pieces by employing proper local concurrency control.

ROCOCO achieves safe interleavings without aborting or blocking transactions using two key techniques: 1) deferred and reordered execution using dependency tracking; and 2) offline safety checking based on the theory of transaction chopping.

ROCOCO's transaction reordering idea is adopted from the ePaxos protocol introduced a couple years ago. This is a neat idea. The first phase is sort of like a dry run for the transaction. Dependencies with other concurrently executing transactions are learned. In the second phase, the dependent transactions are synchronized. They are forced to wait for each other and executed that way.

This approach is basically pipelining the transactions. This is also similar to what the Calvin does with its log-based approach. Pipelining helps for throughput of course, but it also introduces a drawback.

Unlike 2PL and OCC which executes a depended-upon transaction to completion before allowing its dependent/conflicting transactions to execute, ROCOCO is deciding on an order and pipelining the execution of these conflicting transactions in some determined order. However, if the first transaction of these pipeline-executed transactions does not complete for some reason or due to a fault and needs to be rolled back, this also requires rolling-back the remaining transactions in the pipeline that depended on this transaction. This is a problem 2PL and OCC did not have.

This basic reordering protocol is when some of the transaction pieces/fragments are deferrable. For transaction fragments that are immediate (whose outputs are inputs to other pieces in a transaction), the reordering protocol is inapplicable, and the paper uses an offline checker to avoid conflicts in such situations.  The Offline checker works in following steps:
1. It constructs an S-C graph based on transaction chopping. Each edge in the graph is either a Sibling edge (an edge formed for pieces of same transaction instance) or a Conflict edge (an edge formed by pieces which access the same database table and any one of the piece issues a write).
2. Each vertex in the graph is either tagged as immediate (I) or deferrable (D). A conflict edge  can be an I-I edge or a D-D edge.
3. The checker observes all the S-C cycles formed by the graph. SC-cycles represent potential non-serializable interleavings. However, if an SC-cycle contains at least one D-D edge, ROCOCO can reorder the execution of the D-D edge's pieces to break the cycle and ensure serializability. For an unreorderable SC-cycle with all I-I C-edges, the checker proposes to merge those pieces in the cycle belonging to the same transaction into a larger atomic piece. ROCOCO relies on traditional distributed concurrency control methods such as 2PL or OCC to execute merged pieces.


ROCOCO is  implemented as an in-memory key-value store with 20K C++ code  and evaluated using a scaled TPC-C benchmark in comparison to OCC and 2PL. Given that Calvin  is a closely related work (because it also orders transactions and pipelines their execution), it would be good to see a comparison to Calvin, but the paper does not include that.

Some miscellaneous thoughts

In the introduction, the following paragraph is provided as a motivation for ROCOCO.
Unfortunately, contention is not rare in large-scale OLTP applications. For example, consider a transaction where a customer purchases a few items from a shopping website. Concurrent purchases by different customers on the same item create conflicts. Moreover, as the system scales—i.e., the site becomes more popular and has more customers, but maintains a relatively stable set of items—concurrent purchases to the same item are more likely to happen, leading to a greater contention rate.
The OSDI presentation also includes the same example as motivation. I don't think this is the right/best way to argue that contentions will happen, because this is a faux contention, and not an inherent contention. When the number of a sale item is very high (which is almost always the case), why do we care to carefully check the number of items remaining? Conflict-Free Replicated Data Types (CRDT) approach helps here to avoid conflicts easily. (For some nice papers on this see: CRDT1, CRDT2, CRDT3) The coordination avoidance in distributed databases paper also argues for  avoiding coordination when all local commit decisions are globally valid (in other words, when the commit decisions are composable).

Sunday, March 1, 2015

My trick for coordinating Dropbox collaborations

There are high-tech solutions to managing the collaborative paper writing process, such as using version control systems like CVS or git. The downside to these systems is that when you are collaborating on a paper with a low-tech author, it is cumbersome to get them on board with these tools. Some of my collaborators have been from a non-CS background. And even some CS-background collaborators refuse to deal with cryptic version control commands and error messages.

Dropbox provides a simple easy-to-use solution to sharing files which helps  teams to collaborate on projects, including collaborating on a paper. However, Dropbox does not have access control and cannot coordinate concurrent accesses to the same file by multiple writers. This causes overwritten/lost updates and frustrated collaborators.

I have a low-tech solution to this problem. I have used this solution successfully with multiple collaborators over the last 2-3 years. Chances are that, you have also come up with the same idea. If not, feel free to adopt this solution for your collaborations. It helps.

Here is the solution. When I share a folder to collaborate with coauthors, I create a token.txt file in that folder. This file is there to coordinate who edits which file at any given time.

Below is a sample token.txt file contents. The first part explains the guidelines for coordination, the second part lists the files to collaborate on. We generally use one file to correspond to one section of the paper.

Check with token.txt file to see who is editing which file in order to avoid conflicts.
If you want to write:
+ pick a section/file,
+ reopen token.txt to avoid seeing old-state,
+ write your name next to that section/file at token.txt to get the lock/token,
+ quickly hit save on token.txt, and
+ start editing that section/file on your own pace.

When done writing:
+ delete your name for that section/file at token.txt. (I.e., Unlock that file.)

Don't keep unlocked files open in your editor, lest you inadvertently save them over a newer version your collaborator wrote. An editor that auto-saves may also cause that problem.
eval: ...........Ailidani

ALL FILES: This file is always with Murat! Treat this as read-only, don't edit.

The token.txt file acts as a message board to show who is working on which file. Of course there can be concurrent updates to token.txt, but this is less likely because you make a tiny edit and then quickly save this file. Moreover, many editors, including Emacs, notifies you immediately if the token.txt file has been changed during the time you keep it open.

(This is in essence similar to the use of RTS-CTS messages in wireless communications. Instead of having a more costly collision for a long-size data transfer, it is an acceptable tradeoff to have occasional collided/lost RTS messages, because those are very short transfers and are low-cost.)

Yes, this is a lazy low-tech solution and requires some discipline from collaborators, but the effort needed is minimal, and I haven't got any bad feedback from any collaborator on this. During the time I used this method, we never had a problem with concurrent editing on the same file. (I used this with teams of size 2-to-6, and only for collaborating on papers.) One problem, however, we occasionally had was with someone forgetting to remove a lock after he is done working on that file. In this case, we resort to email: "Are you still holding the lock on this file? When do you plan to release it?"

Notice the "ALL FILES" entry at the end. That is when one author may want to lock all the files for a short duration easily. This happens infrequently, but comes handy for restructuring the paper, making fast substitution edits across all files, or spell-checking the entire paper before submission.

You may also notice a entry at the bottom. That is an org-mode file where I keep my TODO items and notes about the paper. It is my lab-notebook for this writing project. I can't function in any project without my file.

PS: Geoffrey Challen suggests: "You could also replace Murat's token.txt file with a Google Drive document." 

Friday, February 27, 2015

Salt: Combining ACID and BASE in a Distributed Database

This paper appeared in OSDI'14. The authors are Chao Xie, Chunzhi Su, Manos Kapritsos, Yang Wang, Navid Yaghmazadeh, Lorenzo Alvisi, and Prince Mahajan, all from The University of Texas at Austin. Here you can watch a video of the OSDI presentation of the paper, and also find the presentation slides and the paper available as open access.  USENIX knows how to do conferences right.

Dropping ACID

ACID (Atomicity, Consistency, Isolation, Durability) approach provides ease-of-programming through its simple transaction abstraction, but loses on the performance. BASE (Basically-Available, Soft state, Eventually consistent) approach, popularized by the NoSQL systems, provides good performance (low-latency, high-throughput, and scalability), but loses on the ease-of-programming due to increased complexity of concurrent execution in distributed systems.

This paper introduces Salt, which aims to find a best-of-both-worlds middle ground: to provide high performance like BASE with modest programming effort like ACID.

Less is more

Salt approach is motivated by the following observation (which is an instance of the Pareto principle). I am quoting from the paper:
"When an application outgrows the performance of an ACID implementation, it is often because of the needs of only a handful of transactions: most transactions never test the limits of what ACID can offer. Numerous applications [2, 4, 5, 10, 11] demonstrate this familiar lopsided pattern: few transactions are performance-critical, while many others are either lightweight or infrequent." 
Of course, a better quantification than saying "numerous" or "many" applications would make the argument for the Salt approach stronger. However, such an extensive study may take several months, and is unfair to ask for a conference paper submission.

It is tempting to increase the concurrency of those critical ACID transactions by splitting them into smaller mini ACID transactions. However, doing so may not be safe as you would lose Atomicity (which frees you from worrying about intermediate states during failures) and Isolation (which regulates which states can be accessed when transactions execute concurrently). The paper calls this a stark choice and cites the YouTube clip from Butch Cassidy to further drive this point home (I am not kidding).

The paper's proposal is this. Instead of worrying about all possible Atomicity & Isolation complications with all other ACID transactions, Salt categorizes transactions into ACID transactions and a small number of BASE transactions, and lets you worry only about checking Atomicity & Isolation complications between the BASE transactions.

Of course there is an unmentioned complication here. You also have to ensure that these BASE transactions leave the database in consistent state. This is not an easy task as well, and opens up Salt's ease-of-programming to debate. On the other hand, you would have to worry about this for all operations if you implement a full-BASE solution.

The Salt approach

Actually Salt's BASE transactions approach gets its inspiration from the ACID nested transactions. The paper says:
"Nested transactions is an abstraction originally introduced to offer, for long-running transactions, atomicity at a finer granularity than isolation. Our purpose in introducing BASE transactions is similar in spirit to that of traditional nested transactions: both abstractions aim at gently loosening the coupling between atomicity and isolation. The issue that BASE transactions address, however, is the flip side of the one tackled by nested transactions: this time, the challenge is to provide isolation at a finer granularity, without either drastically escalating the complexity of reasoning about the application, or shattering atomicity."

By providing isolation at a finer granularity, Salt allows BASE transactions to achieve high concurrency by observing each other's internal states, without affecting the isolation guarantees of ACID transactions. A BASE transaction is made up of multiple mini-transactions, called alkaline subtransactions. (Yes, the paper makes several Chemistry puns.) The committed state of an alkaline subtransaction is observable by other BASE or alkaline subtransactions. On the other hand, the committed state of an alkaline subtransaction is not observable by other ACID transactions until the parent BASE transaction commits.  As such, Salt guarantees that, when BASE and ACID transactions execute concurrently, ACID transactions retain, with respect to all other transactions (whether BASE, alkaline, or ACID), the same isolation guarantees they used to enjoy in a purely ACID environment.

The proverbial Bank example

When you talk about ACID versus BASE the classical example is the bank money transfer example. The paper gives a pure ACID, a pure BASE, and a Salt implementation of this example.

I wonder, which implementation is closer to the way the money transfers are actually handled in real banks. If someone familiar with how banks handle the money transfer can weigh in on this, that may shed a light on this debate.  My money is on the pure BASE approach.

Something about this bank example is odd though. The total-balance operation is an unrealistic big/monolithic operation. Why did that have to be a transaction? The total-balance operation could be done with a consistent/timed snapshot, no? Then why did it have to be a transactions spanning/freezing the entire system?

Now let's check the three implementations more closely. The implementation in Figure 1.a (ACID) uses only one type of locks, ACID locks. Figure 1.b (BASE) uses only one type of locks, local locks. BASE programming looks like classical distributed system programming, it tries to do everything with local asynchronous actions as much as possible. Figure 2 (Salt) uses Alkaline, Saline, and ACID locks.

Why does Salt need 3 different locks? Because the first alkaline transaction is special, as it determines whether the entire BASE transaction it is included in will succeed or not. If that first alkaline subtransaction succeeds, then comes the saline-lock to make the intermediate states in the rest of the BASE transaction available to other BASE and alkaline transactions. But then, this is something I am still not clear on. Salt attributes significance to the first alkaline subtransaction, so its design should also be special. But I am unclear about what should go in that subtransaction? Are there guidelines to design that first subtransaction so it captures the successful completion of the rest of the BASE transaction?


Like any OSDI/SOSP paper worth its salt (see, I can also make "Salt" puns :-), the paper includes a strong evaluation section. The authors implemented a Salt prototype by modifying MySQL Cluster. They evaluated the prototype using the TPC-C benchmark, which consists of 5 types of transactions: new-order (43.5%), payment (43.5%), stock-level (4.35%), order-status (4.35%), and delivery (4.35%). The results show that by BASE-ifying 2 of these transactions, Salt achieves 80% of maximum throughput of a full BASE implementation.

Notice the left side of Figure 7, which shows that ACID version faster than the Salt version under low load. The paper doesn't discuss this but, if we had both ACID and BASE version of transactions, it may be possible to hot-swap between these two versions to optimize the throughput of the system depending on the current system load.

Related work

The paper fails to cite several papers that proposed orthogonal approaches to deal with the same problem: improving the performance of ACID transactions, while keeping the ease-of-programmability.

The Red-Blue Actions paper (OSDI'12) considered the problem of reducing the granularity of ACID transactions by using a generator operation and shadow operation and categorizing the operations  as red (consistency critical) and blue (fast and eventually consistent). That paper also uses a Bank example.

Then there are work from the Berkeley group on this problem. I had summarized the Invariant-based coordination avoidance paper earlier. There is also the Highly Available Transactions paper  (VLDB'13) where they  look at ACID isolation level relaxing for improving availability.


The paper proposes the Salt approach assuming the ACID implementation of the application is already present. The application for this is if you already have an ACID system, you can Salt it to fine-tune the performance. Another approach would be to start with a BASE implementation and to increase atomicity/isolation for a couple operations. Maybe for some applications, it will be easier for going from BASE to Salt: Starting with BASE and then adding Alkaline subtransactions and ACID transactions (if necessary).

The Salt approach do require a learning curve and can be tricky. But how do we argue the complexity of the programming effort needed for one approach (say Salt) versus the other (say pure BASE) in an objective manner? One can objectively show performance improvements with evaluations, but it is harder to argue for "ease of programming" in an objective/quantitative manner.

There can also be some complications with the early-committing (after the first alkaline subtransaction) in BASE transactions. This question came up in the OSDI presentation. If deadlock occurs the MySQL Cluster approach is to use timouts and roll the transactions back. In this case, Salt throws exceptions which the developer need to address to re-achive consistency and atomicity.

Tuesday, February 17, 2015

Book report: "Good Prose" and "How to write a lot"

Notes  from the "Good prose"

This book is written by a journalist turned nonfiction author and his editor.
This is a nice book, and a good read.

"Quiet beginnings: You cannot make the reader love/trust you in the first sentence, but you can lose the reader with a grand proposition in the first sentence."

"To write is to talk to strangers. Prepare the reader, tell everything reader needs to know to read on, but no more."

"For a story to live, it is essential only that there be something important at stake, a problem that confronts the characters or reader. The unfolding of the problem and its resolutions are the real payoff."

"Revelation transforms events into a story."

"Don't mess with chronology unless you have a good reason."

"On the topic of essays:  Essayists tend to argue with themselves. Who am I to write this? Who cares to read this? If I knew my own mind, I would not make essays, I would make decisions. --Montaigne."

A main take-away from the book is the respect these writers show for their craft. They spend countless hours drafting/revising/editing with patience. They spend days discussing about improvements on a draft. The important thing is to get it right, do the right thing. At some point, over a dispute on one word with the editor in chief of Atlantic, they consider quitting their dream jobs there on principle.

Notes from "How to write a lot"

This book aims to help academicians to write a lot. It is a short book and it has a single, simple message: "schedule time for writing 2 hours daily in your calendar, and sit down to write at those times".

"Writing is a skill, not an innate gift or special talent. Like any advanced skill, writing must be developed through systematic instruction and practice. Learn the rules and do deliberate practice."

"Schedule time for writing and stick to it."

"Any action that is instrumental in completing a writing project counts as writing."

"No internet. The best kind of self control is to avoid situations that require self control."

Some recommended reading by the book are: "Writers book of hope" and "Professors as writers". I hope to check these later.

Tuesday, February 10, 2015

Paper summary: Perspectives on the CAP theorem

This is a 2012 short paper by Seth Gilbert and Nancy Lynch that appeared in a special issue commemorating the 12th anniversary of the CAP theorem. Gilbert and Lynch get to write in this special issue because they were the ones to first publish a proof the CAP conjecture put forward by Eric Brewer in PODC 2000 keynote.

In this paper, Gilbert and Lynch aim to situate the CAP theorem in the broader context of a family of results in distributed computing theory that shows impossibility of guaranteeing both safety and liveness in an unreliable distributed system. A quick refresher on specifications is in order. Safety says "nothing bad happens". (You can satisfy safety by doing nothing.) Liveness says "eventually something good happens". Unreliability (aka failure model) says "such and such faults happen".

The impossibility results surveyed in relation to CAP concern slightly different problems and slightly different fault models. While it is easy to confuse CAP with those results on a superficial look, on a closer inspection we see that the results are all distinct and none subsume CAP.

The CAP problem 

The CAP theorem does NOT consider the consensus problem, but considers an easier problem: the atomic read/write register (aka atomic storage) problem. Atomic means that the system provides linearizability, a strong type of single-copy consistency that guarantees that a read returns the most recent version of data. The specifications of this problem are as follows. (The first is the safety property, the second one liveness.)
Consistency: The system provides its clients with a single register (emulated by
multiple storage nodes), and each client can read or write from that register.
Availability: Each request for read or write eventually receives a response.

The FLP (Fisher-Lynch-Patterson) and the attacking generals impossibility results consider the consensus problem. The specifications for consensus are as follows. (The first two are safety properties, the last one a liveness property.)
Agreement: No two process can commit different decisions.
Validity (Non-triviality): If all initial values are same, nodes must commit
that value.
Termination: Nodes commit eventually.

So here is the difference between consensus and atomic storage. Consensus is supposed to dutifully remember a value that is anchored (stored by a majority number of nodes). Consensus is loyal to making that value persist as the committed decision. Atomic storage doesn't have that responsibility. The nodes don't need to commit to a decision value, so the system doesn't need to keep track of and remember whether a value is anchored. The atomic storage system as whole accepts new writes as long as the reads don't return results that betray the single register (i.e., single-copy) abstraction.

And what is the implication of this difference? FLP result declares that even under reliable channels assumption, consensus is impossible to solve in an asynchronous system with node crash failures. For example, Paxos loses liveness because it can not converge to a single leader in an asynchronous model. Did the current leader crash? The failure detector cannot be accurate. If the failure detector incorrectly says that the leader (who is supposed to ensure and remember that a value is anchored) is not crashed, liveness is violated since nodes keep waiting on a failed leader. If failure detector incorrectly says that the leader is crashed, then you have multiple leaders, and liveness is violated because of multiple leaders dueling with forever escalating ballot numbers to get the majority to accept their proposal.

On the other hand, since the atomic storage problem doesn't care about remembering whether a value is anchored, it is oblivious to the dueling leaders clients, and as such it is solvable for crashes of up to half of the nodes with the FLP model (i.e., with reliable channels in an asynchronous system). I had blogged about the Attiya, Bar-Noy, Dolev (ABD) algorithm that achieves this feat.

Now that we know atomic storage problem is solvable with reliable channels with up to minority crashes, what can we say about the atomic storage in the presence of unreliable channels? That is covered by the CAP theorem's fault model, which we discuss next.

The CAP fault model 

We discussed the specifications of the problems considered by CAP, FLP, and attacking generals, but we omitted to talk about another important part of the system specification, the unreliability/fault model.

Above I had introduced the FLP fault model when discussing solvability of consensus versus atomic storage in the FLP model. FLP fault model assumes reliable channels, asynchronous system, crash failure. Of course, by assuming reliable channels, you don't get reliable channels in your deployment. That is just wishful thinking. But since the attacking generals impossibility result proved that consensus is not achivable in the presence of unreliability channels, FLP had to consider reliable channels. Even then, we have disappointment; consensus is also impossible in the FLP model.

CAP does something courageous and considers unreliable channels again (as in the attacking generals fault model) in its fault model. Since CAP is concerned with the atomic storage problem, which is a slightly easier problem than consensus, the attacking generals impossibility result does not subsume the CAP result.

CAP result says that atomic storage problem is also impossible to solve under unreliable channels.

Recall that ABD solved the atomic storage problem in the FLP model. If we move to the CAP fault model and allow partitions, we observe from the ABD algorithm that it blocks (loses availability) for a read or write request that arrives to a node in a minority partition. Just as the CAP says, either consistency or availability has to give.

Here is the proof sketch verbatim from Gilbert-Lynch paper.

Similar to the attacking generals result, the CAP result is oblivious to whether the system is synchronous or asynchronous, and holds in both cases.

What is remaining?

Observe from the CAP proof sketch that the CAP fault model is very rough. When it says unreliable channels, it allows you to assume the worst case (i.e., no message makes it through at all), and prove the impossibility result for that worst case.

What if we quantify and limit the unreliability of the channels to more realistic scenarios. Can we prove more refined versions of CAP? What would be the consistency level a system can provide if the system model allows eventual message arrival? A recent technical report from University of Texas Austin, "Consistency availability convergence" paper, looks at that problem. We will discuss that paper next in our distributed systems seminar.

More about CAP tradeoffs

The Gilbert-Lynch paper discusses some of the practical implications of the CAP theorem and says that Consistency versus Availability should not be seen as an absolute and binary tradeoff. Instead you can consider shades of Consistency versus Availability. Also you can make different Consistency versus Availability tradeoffs at the data level, operation level, and subsystem level. These observations are very similar to the suggestions made in Eric Brewer's article in the same special issue: "CAP 12 years later, how the rules have changed".

The Gilbert-Lynch paper also mentions the scalability problems caused due to trying to enforce consistency, but leaves that discussion as future work. PACELC model by Daniel Abadi provides a more detailed explanation for Low-latency versus Consistency tradeoffs in the absence of partitions.

Saturday, February 7, 2015

How to present your work

Presentation is a very important skill. Determining how to present and communicate your results in an effective manner is  as important as doing the research and getting those results. From the same material you can get a killer job talk or a bummed dissertation talk. Presentation skills can make the difference.

Presentation is not a soft skill despite the common misconception among many technical people. It takes a lot of brains, analyzing, and synthesizing to produce a good presentation. You have to understand and internalize your content very well in order to present it clearly in the most engaging and compelling way. So if you fail to present your work clearly, that reflects poorly on you and your work. The audience will think "the work doesn't look significant or promising", or even worse "the presenter doesn't truly understand/internalize his work".

The most essential requirement for a successful presentation is practice. I observe that our graduate students are not good at presenting because they don't get enough practice and rehearsals. As with any skill, you will learn about presenting best by doing it. So don't waste any opportunity to talk about your work. Talk about your work to your relatives, who won't understand. (This is very useful, because you get to think of how to explain your contributions in the context of the real world out there.) Tell it to your friends not in the same field. Tell it to your lab mates. Give talks in the department seminars. You need to learn how to get feedback and gauge your presentation. Eventually, you will get to present your work in conferences.

The "Talk like TED" book 

I recently picked up this book from a library. (Yes, a physical library.) The book was a very easy, lightweight reading. The book's outline consists of the 9 tips it offers for giving successful TED-like talks. It argued that good talks need to be emotional, novel, and memorable, and categorized the 9 tips under those 3 headings.
+ unleash your passion
+ tell a story
+ have a conversation
+ teach something new
+ deliver WOW moments
+ use humor
+ keep it short
+ paint a mental picture
+ be authentic

This may be my own failing, but I didn't learn/benefit much from the book. (Shall I say, the book was not very emotional, novel, or memorable?) If you had done some reading about public speaking before, this book does not offer much new insights or content. This book could have been condensed to a long blog post.  For example, read this article by Chris Anderson instead.

Maybe the most interesting take away from the book is how much practice the presenters put into the talk. I knew TED speakers rehearsed a lot, but I was still surprised to learn how much. Six months before the talk, the presentation draft is there. Then it is all practice rehealsals and tweaking to refine and fine-tune. One month before the talk, the final form of the presentation is ready.

Keep your eyes on the message 
I think the best presentation advice I received was from another TED, Ted Herman. After hearing me speak as a graduate student on one of my work, he told me: "Focus on the message. Give your presentation with the sole goal to teach, and not to impress."

If you worry about impressing the audience in a presentation, then your ego gets in the way, you get self-conscious. You start questioning "Did I say that right? Was I too slow? Am I perceived as confident enough? etc." This will get you nervous and self-doubting. If you focus on the message, you will get your point across, even if it takes flapping your arms wildly. And if your content is good, it will impress people after all. As Anderson says "Presentations rise or fall on the quality of the idea, the narrative, and the passion of the speaker. It’s about substance, not speaking style or multimedia pyrotechnics."

Focusing on the message is hard to do using Powerpoint/Keynote. Powerpoint makes it easy to lose the focus on the message. By writing slide after slide and getting things stylistically right (which Powerpoint facilitates immensely), you get the false-sense of security that you are communicating the content. To avoid this and to force yourself to focus on the message/content, you should prepare the story and outline of the talk first in your mind, before sitting down to prepare the slides. How would you give this talk if you were forced not to use any slides. Thinking this way and purposefully omitting the slides as crutches will help you learn, discover, and hone your message. (More on this below, where I talk about framing the talk.)

To reiterate what I said in the beginning of the post, presentation is not a soft skill and requires a lot of brains. In order to produce a clear and condensed message, you need to learn how to abstract the most important lessons from all the work you did. This requires you to first process and internalize your work really well. You should omit a lot of incidental details, and provide a condensed message in a conference presentation or job talk. These talks are there to whet people's appetites and get them to read your papers. This does not mean that these talks should omit technical content. On the contrary they should include technical content that communicates the essence of your technical contribution in a clear and accessible manner.

Frame your talk well

And from my PhD advisor, Anish Arora, I learned how to frame a talk well. Framing means finding the right place to begin and developing the right story and context to pitch the ideas.

When we meet to discuss how to present our work, Anish would declare "let's try to understand what we did in this work". Wow! He was totally comfortable accepting that we could understand our work better, and the context at which we started and performed the work is not necessarily the best context with which we present the work. From Anish, I learned that it is OK to search for a new perspective and a better context to frame and present the work.

In fact, your best presentations are which you also learn and get a new appreciation of your work. If you get that new understanding of your work, that is a good indication that you did enough to frame your work, and you achieved a good understanding and focusing of your message.

Related links

Presentation skills is somewhat related to writing skills. So most of the advice here also applies to writing. Similarly some of the advise on writing also applies to preparing a presentation.
How to write your research paper
How I write

What is the biggest rock?
You can get the style right, and give a content-free and superficially interesting talk. But as Lincoln said: "You can fool all the people some of the time and some of the people all of the time, but you cannot fool all the people all the time."

Saturday, January 31, 2015

Dijkstra's stabilizing token ring algorithm

One of the classical algorithms I teach in my distributed systems class is Dijkstra's stabilizing token ring algorithm. This algorithm has started the self-stabilization field as a subfield of fault-tolerance. And, it still receives interest even after 40 years. There has been probably hundreds of self-stabilization papers that revisits Dijkstra's stabilizing token ring algorithm as part of a solution or as part of a case study. This algorithm never gets old for me as well. I still enjoy talking about this algorithm in class and thinking about it once in a while. I guess this is because it is a very elegant algorithm.

Linked is Dijkstra's original paper introducing the algorithm. This paper also includes two variants of the stabilizing token ring algorithm, 3-state and 4-state token ring algorithms. Dijkstra would later do a followup writeup, titled, "A belated proof of self-stabilization", where he gave a proof of stabilization for 3-state token ring program.

As a side note, when I asked my distributed class whether they heard Dijkstra's name before, almost all hands went up. And almost all of them had heard his name due to the "all pairs shortest path algorithm". Interesting. Probably, in Dijkstra's mind, that algorithm was a minor contribution. He came up with that algorithm at a cafe in 20 minutes without using pen and paper. None of the students heard the "Go-to considered harmful" paper, which may be closer to what Dijkstra was aiming with his career. Dijkstra's later work focused on predicate calculus for enabling  structured/principled verifiably-correct programming.

Stabilizing fault-tolerance

Stabilization is at the heart of the Dijkstra's token ring algorithm. Stabilization is a type of fault tolerance that advocates dealing with faults in a principled unified approach instead of on a case by case basis. Instead of trying to figure out how much faults can disrupt the system stabilization assumes arbitrary state corruption, which covers all possible worst-case collusions of faults and program actions. Stabilization advocates designing recovery actions that takes the program back to invariant states starting from any arbitrary. Anish Arora's paper "Practical Self-Stabilization for Tolerating Unanticipated Faults in Networked Systems" provides a great overview of stabilization concepts.

Token Ring stabilization

Let's start with the simple token ring concept. The processes are arranged in a ring fashion and there is a unique token circulating in this ring. (You can think of the token may be providing mutual exclusion to the processes; whichever process has the token can access the critical-section/shared-resource).

In the invariant states there is exactly one token. So in the bad states, there are two possible cases: 1) we end up with a ring with no tokens,  2) we end up with a ring with multiple tokens.

Let's consider how Dijkstra's token ring algorithm deals with case 1. Dijkstra's token ring algorithm maps the concept of a physical token to a relation between two neighboring processes. This mapping is done skillfully to limit the state space so the case 1 faults cannot possibly happen in that state space. In the mapped state space, any arbitrary state  is guaranteed to contain 1 or more tokens.

How does this work? Dijstra's algorithm encodes a token with the relation: "the x variable of a process is DIFFERENT than the x variable of its predecessor in the ring". In this case, if all x variables are the same, there won't be any token in the system. To fix this Dijkstra's algorithm encodes the token differently for process 0: "There is a token at process 0, if its x variable is the SAME as that of its predecessor, process N". Encoded this way, we cannot not have 0 tokens in any arbitrary state in the state space.

A good analogy for bounding the corruption space is the use of the carousal door. For a normal door, you have two possible states: door is open or door is closed. For a carousal door, there is only one possible space: the door is closed. (Yet you still make it through that door by revolving through it.)

In order to deal with case 2, Dijkstra's algorithm uses some token overwriting mechanisms. First let's observe that no process action introduces an extra token to the system. (Of course, a fault action can corrupt state arbitrarily, and can throw the system to a state with many extra tokens.) In the normal/fault-free case, by taking an action a process consumes a token that exists between its predecessor and itself and passes this token to its successor. In the faulty case, this is also the worst damage a process can do, because it will always consume the token at itself, and since its state change affects only its successor, it may or may not pass the token to its successor. If the process execution does not pass the token to its successor, this is actually good: an extra token is removed from the system. And if enough of these happen, the system will stabilize to exactly one token case.

But what if this leads to a system with no tokens? Recall that this cannot happen due to the discussion regarding Case 1.

Stabilizing token ring algorithm in TLA+ 

After explaining the program actions, in order to help students internalize what is going on, I call 5 students to stand up in front of the class. The students form a ring. And they use hands up or down to represent x=1 or x=0, and enact a simulation of the program starting from some arbitrarily corrupted states. The students first get to observe that there is indeed no configuration possible with 0 tokens. Then as they enact several scenarios, they observe how no new tokens are introduced by a process action, but in some cases an extra token is consumed.

Then, I challenge them to come up with a scenario where this binary token ring formed does not get to converge.  The corruption continues to live on in the system.  Here is such a run in a ring of 4:
1010 --> tokens at nodes 1,2,3
1011 --> tokens at nodes 1,2, and 0
1001 --> tokens at nodes 0,1,3
1101 --> tokens at nodes 0,2,3
0101 --> tokens at nodes 1,2,3
0100 --> tokens at nodes 1,2,0
0110 --> tokens at nodes 0,1,3
0010 --> tokens at nodes 0,2,3
1010 --> tokens at nodes 1,2,3
and this gets us to the first state. The system can repeat this loop forever without stabilizing.

Of course, when you run the model checker for this setup (where x is limited to 0 and 1) the TLA+ model checker provides you this counterexample.

The M>=N requirement 

So, what went wrong? Wasn't Dijkstra's token ring supposed to be stabilizing? The first time around we discussed the algorithm since x was an unbounded integer, the algorithm was stabilizing. When we considered the binary token ring example, x was bounded by 2, and that failed to stabilize in a ring of size 3 or more. There is a relation on M, the bound on x, and the algorithm's ability to stabilize for a ring of size N. We had seen in the counter example that for M<N there is a way to form a loop in the execution trace and keep the corruption going in the system.

For M>=N, the node 0 acts as a gate. Consider the case where node 0 never fires. In this case the other nodes fire, and eventually they all copy the x value at node 0, and in that configuration the system reduces to one token (at node 0) and stabilization is achieved. For M>=N, we can use the pigeon-hole principle to show that the system  eventually reaches the case where node 0 does not fire until all other nodes  copy its value. This is because, for M>=N, at any configuration there is a value "v" of x that is not currently present in the ring. Eventually node 0 will hit "v", and then node 0 will not be enabled until v is copied by all nodes. This is how node 0 acts as a gateway.

Concluding remarks  

I may talk about Dijkstra's 3-state and 4-state token ring algorithms in another post. Those are also interesting and elegant algorithms.  In a paper I wrote in 2002, I used Dijkstra's classic, 3-state, and 4-state token ring algorithms as case studies. If you like to learn more about this, you can read that paper here.

Related links

Crash only software 
The role of distributed state