Saturday, March 29, 2014

How to write your research paper

The legend of Köroğlu

I will start with a story of oppression and an uprising, involving a mythical horse in the process. Way to start a post about writing research papers, huh?

In 16th century Anatolia, there was a corrupt and oppressor mayor in the Bolu state. The mayor one day decided that he should find and gift the best horse in the world to the Sultan. He contacted a very skilled horse breeder. The breeder said the horse that is deserving of the Sultan should be very special, and said none of his horses is worthy of this. He went on a quest for this horse himself. One day he saw some street kids abusing a feeble and awkward-looking foal. He immediately recognized the potential in this foal and bought the foal, and headed for the mayor's palace. The mayor got outraged, being the ignorant oppressor he is, he thought the breeder is mocking him by offering this weak awkward foal. The mayor immediately ordered the breeder to be blinded.

The breeder had a young son, who became devastated by his father's situation. His father was now blind ("kör" in Turkish), and the son later got nicknamed "Köroğlu", the blind's son. The breeder, instead of worrying about his eyes was more worried about the foal, and instructed his son to build a pitch-black stable for the foal. He then instructed his son to constantly tend to the foal and fatten the foal as much as possible. For many months, the foal was made to stay in this pitch-black cave to eat and fatten up. The breeder did not start any training at all. Many months later, the breeder instructed Köroğlu to get this fat horse out and started a strict training regimen for the horse. The fat quickly turned to muscle, and the horse got very lean in a short time.

The legend is that the horse got so fast that it would run over a mud field and would not get any mud on its feet. Köroğlu used this horse to get his father's revenge from the mayor and became a Robin Hood like figure. Here is a link to the 1968 Turkish movie made for commemorating this legend.

Back to writing!

And I claim that a legend about a horse and an outlaw gives great lessons about writing your paper?! I must be nuts!

Give your idea a chance to grow and thrive 

All excellent ideas/papers/design start in a feeble fragile form. Very much like that foal. Don't judge too soon, otherwise you will misjudge. Instead if you can glimpse a sliver of potential, give your idea a chance to grow.

(With experience, you will be able to tell which feeble ideas are promising, which are not. You will get better at it, like the old breeder.)

In this initial phase (the cave phase), don't listen to any critics. Keep this feeble idea close to your chest. You would need to guard it even from your own criticisms early on. Suppress the criticisms for a while. Feed the idea to see what it can become.

Here Jony Ive talks about Steve Jobs' approach to creative work:
"And just as Steve loved ideas, and loved making stuff, he treated the process of creativity with a rare and a wonderful reverence. You see, I think he better than anyone understood that while ideas ultimately can be so powerful, they begin as fragile, barely formed thoughts, so easily missed, so easily compromised, so easily just squished."
Good thing the reviewers don't get to see the first drafts of any idea/paper, otherwise nothing would get published in the conferences or journals.

Fatten it up

In the cave phase, you need to greedily feed and build up your manuscript.

And this is how you do it: Start writing as soon as possible, and write while you do the work. That means, you keep a lab notebook. This doesn't need to be physical notebook. Open a directory for your new research idea, and create a notes.txt file to act as your lab notebook. In this lab notebook, you will be able to explore each sub idea and produce in bulk without any pressure of good/presentable writing. Since nobody is going to see this writing, you won't have restraints and you can go fast. You should come up with new tangential ideas, and explore all potential directions the original idea can grow. See my post about free writing for more information.

(I use the file as my lab notebook. Org-mode in Emacs is great to outline a project, keep track of the progress of each sub-idea, and manage and review ToDo items for the project.)

So feed it, build it up. Fatten it up. At the end of this you will have a fat mess in your hand. Don't feel ashamed about it. Instead, feel proud.

(Warning: If you have to keep twisting and spinning the same idea too many times just to squeeze out a small contribution, that is bad. There should be potential in the idea. Don't try to resuscitate an idea, if it refuses to grow despite your nourishing.)

Train hard: turn fat into muscles! 

This is the coming out of the cave phase. After finding your purpose and voice, you should now try to present it coherently and clearly. Now, you should be ruthless about getting your paper back in shape. Cut ruthlessly to make it lean. Cut the fluffy parts, the unnecessary tangents, and even the parts that can give the wrong vibe and that may lead an unsuspecting reader to a dead-end. Make it succinct and as simple as possible.

Editing is much much easier than starting with nothing and having to write from scratch, especially when the conference deadline is looming. If you haven't tried this approach to writing a paper before, you will be surprised how much easier it is to edit a fluffy mess into a coherent draft than writing from scratch. I have witnessed many times how quick a 20 pages of mess can be edited to form a 10 page good looking draft.

Read the Elements of Style to learn more about how to edit and produce a coherent presentable manuscript.


Don't take horse breeding advice from me, I haven't bred/trained any horses in my life. But you can take the writing advice. I use it every time I write, including this post.

Other related posts

Here are some of my related/nontechnical posts.
How I write 
How I read a research paper
My Advice To My Graduate Students
One Pomodoro, two pomodoro, three pomodoro, four
Black sheep 
Energy approach to life, the universe, and everything
Antifragility from an engineering perspective 

Monday, March 10, 2014

Dapper, a Large-Scale Distributed Systems Tracing Infrastructure

This paper is from Google. This is a refreshingly honest and humble paper. The paper is not pretending to be sophisticated and it doesn't have the "we have it all, we know it all" attitude. The paper presents the Dapper tool which is trying to solve a real problem, and it honestly represents how this simple straightforward solution fares and where it can be improved. This is the attitude of genuine researchers and seekers of truth.

It is sad to see that this paper did not get published in any conferences and is still listed as a Google Technical Report since April 2010. What was the problem? Not enough novelty? Not enough graphs?

Use case: Performance monitoring tail at scale

Dapper is Google's production distributed systems tracing infrastructure. The primary application for Dapper is performance monitoring to identify the sources of latency tails at scale. A front-end service may distribute a web query to many hundreds of query servers. An engineer looking only at the overall latency may know there is a problem, but may not be able to guess which of the dozens/hundreds of services is at fault, nor why it is behaving poorly. (See Jeff Dean and Barraso paper for learning more about the latency tails at scale).

It seems like performance monitoring was not the intended/primary use case for Dapper from the start though. Section 1.1 says this: The value of Dapper as a platform for development of performance analysis tools, as much as a monitoring tool in itself, is one of a few unexpected outcomes we can identify in a retrospective assessment.

Design goals and overview

Dapper has three design goals:

  • Low overhead: the tracing system should have negligible performance impact on running services. 
  • Application-level transparency: programmers should not need to be aware of (write code for /instrument for) the tracing system. 
  • Scalability: Tracing and trace collection needs to handle the size of Google's services and clusters.

Application-level transparency was achieved by restricting Dapper's core tracing instrumentation to a small corpus of ubiquitous threading, control flow, and RPC library code. In Google environment, since all applications use the same threading model, control flow and RPC system, it was possible to restrict instrumentation to a small set of common libraries, and achieve a monitoring system that is effectively transparent to application developers.

Making the system scalable and reducing performance overhead was facilitated by the use of adaptive sampling. The team found that a sample of just one out of thousands of requests provides sufficient information for many common uses of the tracing data.

Trace trees and spans

Dapper explicitly tags every record with a global identifier that links the reports for generated messages/calls back to the originating request. In a Dapper trace tree, the tree nodes are basic units of work and are referred to as spans. The edges indicate a casual relationship between a span and its parent span. Span start and end times are timestamped with physical clocks, likely NTP time (or TrueTime?).

Trace sampling and collection

The first production version of Dapper used a uniform sampling probability for all processes at Google, averaging one sampled trace for every 1024 candidates. This simple scheme was effective for our high-throughput online services since the vast majority of events of interest were still very likely to appear often enough to be captured.

Dapper performs trace logging and collection out-of-band with the request tree itself. Thus it is unintrusive on performance, and not paired to the application strongly.

The trace collection is asynchronous, and the trace is finally laid out as a single Bigtable row, with each column corresponding to a span. Bigtable's support for sparse table layouts is useful here since individual traces can have an arbitrary number of spans. In BigTable, it seems that the columns correspond to the "span names" in Figure 3, i.e., the name of the method called. The median latency for trace data collection is less than 15 seconds. The 98th percentile latency is itself bimodal over time; approximately 75% of the time, 98th percentile collection latency is less than two minutes, but the other approximately 25% of the time it can grow to be many hours. The paper does not mention about the reason of this very long tail, but this may be due to the batching fashion that the Dapper collectors work.

Experiences and Applications of Dapper in Google

Dapper's daemon is part of Google's basic machine image and so Dapper is deployed across virtually all of Google's systems, and has allowed the vast majority of our largest workloads to be traced without need for any application-level modifications, and with no noticeable performance impact.

The paper lists the following Dapper use cases in Google:

  • Using Dapper during development (for the Google AdWords system)
  • Addressing long tail latency
  • Inferring service dependencies
  • Network usage of different services
  • Layered and shared storage services  (for user billing and accounting for Google App Engine)
  • Firefighting (trying to quickly-fix a distributed system in peril) with Dapper

Dapper is not intended to catch bugs in codes and track root causes of problems. It is useful for identifying which parts of a system is experiencing slowdowns.

Tuesday, March 4, 2014

Naiad: A timely dataflow system

What is in a name?

Naiad is from Microsoft Research. Dryad, a general purpose runtime for execution of data parallel applications, was also from Microsoft Research. An application written for Dryad is modeled as a directed acyclic graph (DAG) and Dryad is the "tree nymph" in Greek mythology. Naiad is a stream processing platform and Naiad is the "stream nymph" in Greek mythology.

Naiad is an opensource project that has been receiving a lot of attention recently. I expect we will hear more about Naiad, because it is very useful for low-latency real-time querying and high-throughput incremental-processing of streaming big data. What is not to like?

Naiad is useful especially in incremental processing of graphs. As has been observed before, MapReduce is inappropriate for graph processing because of the large number of iterations needed in graph applications. MapReduce is a functional language, so using MapReduce requires passing the entire state of the graph from one stage to the next, which is inefficient. And for real-time applications batch processing delays of MapReduce becomes unacceptable.

Dataflow graph

The developer supplies a a logical graph to Naiad to describe the dataflow of computation. (Don't confuse this with the large scale input graph that Naiad computes on). The edges in this graph show dataflow. The vertices are stateful computation stages.

Figure 1 shows the overall architecture, with two main separate components: (1) incremental processing of incoming updates and (2) low-latency realtime querying of the application state. The query path is cleanly separated from the update path. Queries are done separately on a slightly stale version of the current application state. This way, the query path does not get blocked or incur delays due to the update processing. This also avoids complexity: If queries shared the same path with updates, the queries could be accessing partially-processed/incomplete/inconsistent states, and we would have to worry about how to prevent this.

With this architecture, Naiad is able to provide <100ms interactive query processing, <1s batch updates, and <1ms loop iterations.

(The separate query path is not a new idea. In big data processing, there is a batch layer that does occasional/periodic batch processing. This batch processing would output indexed state (new graph) and queries were performed over this output state in the serving layer in a read-only and quick manner.)

Loops in dataflow graph

The logical dataflow graph can have loops and even nested loops. (Note that, in contrast, a MapReduce computation dataflow does not have any loops, it is a chain of stages; at each stage you progress forward using output of previous stage and producing input for the next stage.)

The loop concept in the dataflow graph is very useful as it enables new applications that may not be possible to compute with MapReduce like frameworks (at least in a reasonably efficient manner). Loops are natural way of dealing with iterative graph processing as in PageRank and machine learning applications.

Naiad even allows nested loops. But, as useful as they are, loops complicate the job of a stream processing system significantly. How do you keep track of when data is purged out, and that data doesn't keep looping in the system? How do you keep differentiate between older data looping in the system versus new data that is just entering the loop? To deal with these issues the paper introduces an epoch based timestamp to label data that is being processed. These timestamps make the model suitable for tracking global progress in iterative algorithms in a local manner. The progress tracking looks like a deep topic, the Naiad paper refers the readers to a separate 2013 paper for the full explanation of the progress tracking algorithm.

The paper calls the resulting model, the timely dataflow model. In sum, the timely dataflow model supports:
+ structured loops allowing feedback in the dataflow,
+ stateful data flow vertices capable of consuming and producing records without global coordination, and
+ notifications for vertices once they have received all records for a given round of input or loop iteration.

Naiad runtime

The logical dataflow graph, which denotes the stages of computation and the dataflow between these stages, is mapped on the physical worker machines in many-to-1 fashion. Each worker may host several stages of the dataflow graph. The workers are data nodes. They keep a portion of the input data (usually a large-scale input graph, such as Twitter follower graph) in memory. So it makes sense to move computation (dataflow graph stages) to the data (the partitioned input graph).

Writing programs with Naiad

A vertex (of the logical dataflow graph) may invoke two system-provided methods:
this.SendBy(e : Edge, m : Message, t : Timestamp)
this.NotifyAt(t : Timestamp).

Each call to u.SendBy(e,m,t) results in a corresponding invocation of v.OnRecv(e,m,t), where e is an edge from u to v, and each call to  v.NotifyAt(t) results in a corresponding invocation of v.OnNotify(t).

The OnRecv method may send elements on the first output as soon as they are first observed, allowing for low latency, but to ensure correctness the vertex must use OnNotify to delay sending a final synopsis until all inputs have been observed. In other words, SendBy and OnRecv are more suitable for streaming, and NotifyAt and OnNotify are more suitable for batching.

As such, Naiad provides tunable consistency. The developer can use loose-consistency operation like OnReceive or a strong consistency operation (that requires waiting) like OnNotify.

A prototypical Naiad program is given in the paper as follows.

Evaluation results

The paper has extensive evaluation results. Naiad was deployed on up to 64 computers and scalability results are shown for throughput, global barrier latency, progress tracking and speedup. PageRank (on Twitter follower graph), logistic regression (as an example of batch iterative machine learning) and k-Exposure algorithms (for Twitter topics) are used as examples.


A feedback first: It would have been very useful if the paper used different words for edges/vertices in logical dataflow graph versus those in the input graph that workers compute and modify. This gets very confusing at places. (See it even became confusing as I wrote the above.)

The paper is 16 pages long, and packed with information. But several things remain unclear to me after reading.

How does Naiad do rate control? Within a loop at each epoch a larger neighborhood of a vertex may get affected/triggered (e.g., think of a PageRank like spreading application). How does this not cause an input avalanche? How does Naiad do rate control to send/initiate only as much as it can consume on the worker nodes?

It is not clear if we can implement tightly coordinated applications in Naiad. By tightly coordinated applications I mean applications that require multihop transactions on input graph, such as graph coloring and graph subcoloring.

Wednesday, February 26, 2014

Energy approach to life, the universe, and everything

I recently read Scott Adams new book titled: "How to fail at almost everything and still win big". I enjoyed reading the book a lot; Scott discusses a lot of intriguing ideas as usual. The main idea in the book was that you need systems instead of goals. Goals are for losers because motivation is ephemeral. In contrast, systems/processes are persistent, durable (by definition). As Bezos says "Good intentions don't work, mechanisms work!"

Another main idea in the book is the importance of monitoring and managing your energy for being successful. The book advises you to "watch what you eat, exercise, match mental states to activity and attack tasks when you have the appropriate energy level".

I think this energy approach to personal productivity and life is promising. So I wanted to collect my thoughts on this and see what I can add on this topic.

Things that increase energy and decrease energy

It isn't the mountains ahead to climb that wears you down.
It is the pebble in your shoe.
– Muhammad Ali
This quote from Muhammad Ali is extremely insightful, and nicely summarizes everything that needs to be said on this topic. Here are my observations on what energizes me and drains my energy.

small victoriesbig victories
tea/healthy foodeating a lot
happy moodbad mood
waking up earlywaking up late
reading a good bookbrowsing junk/news

Conserving energy

To keep your energy level high, you need to find tricks to conserve energy. Instilling useful "habits" is a great trick to conserve energy. When you make something a habit, you don't need to waste your energy for remembering to do it and more importantly for finding the willpower to do it. Habits make inertia work for you. The key to instilling habits is to start with baby steps. See tiny habits by Dr. Fogg to learn more.

Keeping things simple in your life helps conserve energy. Being an organized person (Getting Things Done) avoids ongoing chaos in your life, and conserves energy. I use Emacs Org-Mode to track all my tasks so I can clear my mind. Even by cleaning/organizing your room office you can notice a boost in your energy.

On this tangent, being a principled/virtuous/religious person can help a lot for success. Once you make your decision and commit to it, the temptations you need to fight become less, you can ignore a lot of distractions because they are out of bounds (designated as sin) for you. This is a very pragmatic (very Benjamin Franklin) way to approach the issue of character/virtue/religion, but there it is.

Growing your energy potential

Maintaining high energy levels is good, but you know what is better: Growing your energy potential.

The antifragility book tells us that to organic things grow by exploring and pushing/straining its limits occasionally. Related to this topic is the "Growth mindset" concept, which has been put forth by the Stanford psychiatrist Carol Dweck (I highly recommend her book). Growth mindset describes an antifragile approach to personality and life. Growth mindset people like challenges and failures, as these would make them learn, improve, and grow.

Following this line of thought, in order to grow your energy potential, you need to strain it and totally deplete it from time to time. Occasionally, you should take on more than you can handle, exercise vigorously and push your physical limits, pull some all-nighters working on a project, and fail at some of your projects. Pushing your limits and failing is good for you. It will make you grow. If nothing else, it will teach you some humbleness and throw you out of the fixed mindset you may have acquired (e.g., I know it all, I have it all, I am naturally successful). You need some occasional resets to be able to experience the beginner's mind.

Focusing/intensifying your energy

It is also important to learn how to control and focus your energy to get things done. I doubt there is an easy or universal way to achieve this.

I think intensifying your emotions helps for focusing your energy. Being curious and asking questions focuses your energy, because you will really want to get answers to your unresolved questions. Being very determined and motivated (you need to find ways to motivate yourself) will help in focusing your energy. Finally, even anger helps. I occasionally argue and pick fights with papers/books, in order to focus my energy to better understand them.

Wednesday, February 12, 2014

Consistency-Based Service Level Agreements for Cloud Storage

This paper is from Microsoft Research and appeared in SOSP'13. This paper is more of a position and vision paper. The paper introduces a consistency-based SLA concept, which can provide a win-win arrangement for cloud providers and developers.


For performing reads from key-value stores, currently you have two options. You can do strongly-consistent reads by increasing the size of your read replica quorum, but this will increase latency of the responses, and you don't get the flexibility to revert to a quick dirty (eventually-consistent) read if a strong-consistent read would take a long time to respond. Or you go with best effort reads (which are eventually-consistent) from the key value store because you insist on low-latency answers.

(Actually there is another option: You can use a strongly consistent multiversion data store like BigTable or Spanner, and relax it by reading slightly stale data and get flexibility. I will revisit this option in the discussion.)

| Rank | Consistency | Latency | Utility |
|   1.    | strong         | 150 ms  |     1.0  |
|   2.    | eventual      | 150 ms  |     0.5  |
|   3.    | strong         | 500 ms  |    0.25 |

Enter the consistency-based SLA concept. The SLA acts as an interface between the application developer and the inners of the cloud. The developer provides a wishlist for their get (i.e., read) operations from the key-value store as above. Here the developer says "I want a reply in under 150 ms and prefer strongly consistent data but will accept any data; if no data can be obtained quickly then I am willing to wait up to 500ms for up-to-date data." The cloud-provider backend is structured such that it keeps track of which of these reads is feasible currently for that location and it satisfies the highest ranked one it can in order to give the best utility to the developer.

Using such an SLA makes good business sense. With this SLA the developers put their money where their mouth is. They agree to pay more for better utility provided to them. The cloud-providers can use the SLAs to prioritize the read requests: they can give more priority to consistency requiring higher paying (higher utility) customers.

To illustrate, Figure 3 shows some read latencies at a given point from given locations. The developer does not have access to all per region or per client latencies like this, but in the SLA she can state her ranked preferences for latency and consistency of the reads she thinks would make most sense for her application, and through this interface she has access to dynamic tuning of performance of her application.

Pileus Architecture:

To showcase the SLA, the authors developed a replicated key-value store called Pileus. Pileus is a type of cloud formation, it is a cap cloud. (Get it? A "CAP" cloud.) Pileus dynamically selects which servers to access in order to deliver the best service given the current configuration and system conditions.

Some storage nodes are designated as primary nodes, which hold the master data, while others are secondary nodes. All Puts (i.e., writes) in Pileus are performed and strictly ordered at a primary site. Secondary nodes eventually receive from the primary core all the updated objects along with their update timestamps. Since all Put operations are assigned increasing update timestamps from the primary site and the asyncronous replication protocol transfers updated objects in timestamp order, at any point in time, each secondary node has received a prefix of the overall sequence of Put operations.

When selecting the node to which a Get operation should be sent, the desired consistency guarantee, along with the previous object versions that have been read or written in the current session and the key being read, determines the minimum acceptable read timestamp. The minimum acceptable read timestamp indicates how far a secondary node can lag behind the primary and still provide an answer to the given Get operation with the desired consistency. This is being decided by the client library of Pileus.

This architecture forces all the writes to be performed on a single primary core limits the problem space, and simplifies things for ensuring consistency for the reads in the consistency-spectrum. But this also limits the performance on reads (except for eventual-consistency reads). Moreover, with this setup you don't get to specify latency bounds for writes.

Evaluation results show that consistency-based SLAs can indeed improve application-specific levels of service (i.e., utility).


Q: How rich is the class of applications that benefit from this SLA?

A: I am still confused about this. It sounds like this can be applicable to a large class of applications, but sometimes I revert to thinking maybe not that big.

For latency-favoring (eventual-consistency happy) applications there are existing solutions: DynamoDB, and several key-value stores. And the target applications are those that tolerate relaxed consistency but, nevertheless, benefit from improved consistency. It may seem that these are already served to some extent by the eventual-consistent key-value stores. They are just best effort. You don't know what you get, but fresher more consistent data improves service the same as in Pileus. Pileus gives you tuned performance, but maybe you could have gotten that performance by probabilistic means also. (Peter Bailis has a very nice work on probabilistically bounded staleness, which is also a related approach here.)

For consistency-favoring applications, there are existing solutions like Bigtable, Spanner. And you can still do a quick dirty read from Spanner, by giving a slightly past read timestamp. This works because Spanner is a multiversion key-value store. But I guess you still need to manage when you would want to revert to the quick dirty reads.

Q: How does Pileus change the application code?

A: Yes we learn from API when we get back a consistent read and when not, but reacting on the type of reads may lead to polluting my program with a lot of branches and checks. Maybe programming languages people may have an answer to that. I guess, this way is still better than monitoring for latencies and implement these tuning in your application.

Saturday, February 8, 2014

The scalable commutativity rule: Designing scalable software for multicore processors

This paper was one of the best paper's at SOSP 13. The paper is open access since SOSP paid for the open access fees for each paper that appeared in the conference.

The scalable commutativity rule says that "Whenever interface operations commute, they can be implemented in a way that scales". In other words, whenever several operations commute at the interface-level (i.e., there's no way to distinguish their execution order using the interface), for those operations we can have implementations whose memory accesses are conflict-free.

(Scalable commutativity sounds like a misnomer, shouldn't the theorem be called "commutative scalability" theorem? The end goal is scalability, and commutativity is a tool for it. Thus "commutative" should qualify "scalability", not the other way around.)

I found several nice generalizable insights in the paper.

The first one is using commutativity as a detectable/measurable and controllable witness for concurrency. Concurrency is an abstract and specification-dependent concept. If you disregard safety (which is defined by the specification), you can boost concurrency easily by parallelizing aggressively, and you end up with a highly scalable system full of race conditions and incorrect/unsafe operations. It is hard to measure "safe concurrency" and boost it. By way of contrast, commutativity is well defined and easier to measure and control. If the outcome doesn't change when you change the order of operations then the order is not important and that means you don't need to lock anything and you can find a lock-free/wait-free/coordination-free implementation.

The second insight is to pose the rule as a "possibility result" to guide the implementation process. It is clear that the paper is motivated by practical problems and from the ground up. These guys are Linux kernel hackers, they are trying to modify Linux to run on multicore machines in a more efficient/scalable manner. As part of their work to make Linux more scalable, they work at the trenches and develop a lot of tricks and hacks. But they didn't know when to quit and when to persist: "Before the rule, we tried to determine if these operations could scale by analyzing all of the implementations we could think of. This process was difficult, unguided, and itself did not scale to complex interfaces, which motivated our goal of reasoning about scalability in terms of interfaces." When operations commute at the specification level, this rule tells them that they should persist because scalable implementations exist. When operations don't commute, this rule tells them to quit because a scalable implementation may not exist with this specifications.

Finally this rule can be used to guide the design as well. When a scalable implementation may not be available with the current specifications of the operations (i.e., the operations do not commute at the interface level), the developers may consider changing the operations slightly to find a design where operations commute (where it is guaranteed to find scalable implementations). To make the operations commute, the operations may be designed to use different parameters or relax the guarantees on the returned result. For example, in the case of a file operation, don't return the smallest unused file descriptor FD, but return an unused FD. If you change the specification this way, then you prevent the contention on the smallest unused FD. In a sense this is moving away from a queue semantic (which is order sensitive) to a set semantic (which is order insensitive, a.k.a. commutative). The essence of trick has been used to modify operations to commute and achieve conflict-free implementations.

The paper is a tour de force. The authors developed a tool called COMMUTER that accepts high-level interface models and generates tests of operations that commute and hence could scale. The tool uses a combination of symbolic and concolic execution, and generates test cases for an arbitrary implementation based on a model of that implementation's interface.

"First, ANALYZER takes a symbolic model of an interface and computes precise conditions under which that interface’s operations commute. Second, TESTGEN takes these conditions and generates concrete test cases of sets of operations that commute according to the interface model, and thus should have a conflict-free implementation according to the commutativity rule. Third, MTRACE checks whether a particular implementation is conflict-free for each test case."

The evaluation of the paper is also exhaustive. They modeled several POSIX file system and virtual memory calls in COMMUTER, then used this both to evaluate Linux's scalability and to develop a scalable file and virtual memory system for their sv6 research kernel. I loved Figure 6, it shows a lot of things without overwhelming the viewer.

To show how the results in Figure 6 translate to scalability on real hardware they evaluated with some microbenchmarks and a mail server application.


Question: The scalable commutativity rule says: If interface operations commute, then they can be implemented in a way that scales. Is there a reason why you don't claim the inverse: If interface operations don't commute, then they cannot be implemented in a way that scales. If the inverse is true, then this rule will have stronger implications for interface design.

A: The paper avoids this question and doesn't have an answer to this. My guess is the inverse would indicate only a relative unscalability, not an absolute one. And it is hard to quantify the relative unscalability. But the original theorem is clean, it provides an absolute result: an implementation with conflict-free memory accesses, i.e., full-scalability is possible.

Question: Is this rule is applicable to the general distributed systems and not just multicore systems?

A: The paper claims this is the case in the related work but doesn't elaborate more on that. Yes, this should work but there are assumptions. The operation should have invocation and response parts. The high level specification of the operations should make it clear what the operation does so you can perform the commutativity tests.

Question: What are the tradeoffs when coming up with variants of operations to make them commute?

A: Instead of a big operation, you may divide it into two small operations to improve commutativity. But did this make the job of the developers harder. Well maybe, but it is worth it if you made it conflict-free on multicores. And now instead of crossing the kernel boundary once, you may be crossing the kernel boundary twice because you have replaced one operation with two operations. So you pay overhead, but it is OK if you had improved the commutativity and scalability of the operations.


Martin Vechev emailed me about my first question, and he had this to say: " we had an earlier POPL'11 paper that shows one part of the inverse question you are asking, that if you have strong non-commutativity you need expensive instructions:
It was also profiled in Linux Weekly:"

Sunday, February 2, 2014

Black sheep

"Publish or Perish" has become a motto for academics, but it is hard not to get depressed about this rat race in academia. I see researchers who have published an implausible amount of papers in good venues, but they still remain obscure. With all those papers, they haven't even made a splash in their field of study. I then ask myself what chance I stand for making a big enough contribution to be useful. Then, I proceed to have my semi-annual existential crisis, and question the point of publishing and my career in academia. I recover in a couple days generally, because I like working on problems and I am curious about stuff. But, these questions loom: Am I wasting my time? Am I irrelevant with my publications? How should I allocate/manage my time to be useful?

Publish less? 

It is simplistic to say "It is the quality that counts. If you publish less you will publish better quality work and get recognition". Experience refutes this. If you strive to publish less, and hold your self to imaginary/unattainable standards, it will actually harm your impact. Feynman talks about the "Nobel Prize effect" in his memoir. Hamming describes the Nobel Prize effect as follows:

The day the prize was announced we all assembled in Arnold Auditorium; all three winners got up and made speeches. The third one, Brattain, practically with tears in his eyes, said, "I know about this Nobel-Prize effect and I am not going to let it affect me; I am going to remain good old Walter Brattain." Well I said to myself, "That is nice." But in a few weeks I saw it was affecting him. Now he could only work on great problems.
When you are famous it is hard to work on small problems. This is what did Shannon in. After information theory, what do you do for an encore? The great scientists often make this error. They fail to continue to plant the little acorns from which the mighty oak trees grow. They try to get the big thing right off. And that isn't the way things go. So that is another reason why you find that when you get early recognition it seems to sterilize you. In fact I will give you my favorite quotation of many years. The Institute for Advanced Study in Princeton, in my opinion, has ruined more good scientists than any institution has created, judged by what they did before they came and judged by what they did after. Not that they weren't good afterwards, but they were superb before they got there and were only good afterwards.

Baa baa black sheep

So publishing less is bad, and publishing more does not guarantee you make an impact. Then what is a good heuristic to adopt to be useful and to have an impact?

I suggest that the rule is to "be daring, original, and bold". We certainly need more of that in the academia. The academia moves more like a herd, there are flocks here and there mowing the grass together. And staying with the herd is a conservative strategy. That way you avoid becoming an outlier, and it is easier to publish and get funded because you don't need justify/defend your research direction; it is already accepted as a safe research direction by your community. (NSF will ask to see intellectual novelty in proposals, but the NSF panel reviewers will be unhappy if a proposal is out in left field and is attempting to break new ground. They will find a way to reject the proposal unless a panelist champions the proposal and challenges the other reviewers about their concocted reasons for rejecting. As a result, it is rare to see a proposal that suggests truly original/interesting ideas and directions.)

To break new ground, we need more mavericks that leave the herd and explore new territory in the jungle. Looking at the most influential names in my field of study, distributed systems, I see that Lamport, Dijkstra, Lynch, Liskov were all black sheep.

Of course, when you are daring and original, you will be wrong half (?) the time. But being wrong teaches you something, and this insight will help you to have a breakthrough eventually. This is the growth mindset thinking. (I am reading the book "Mindset: The New Psychology of Success", and it is awesome. I recommend it for everyone.)

Moreover, when you are onto something really good, others will not get or accept what you are doing for a while. That's the price you pay for daring to be original. You can read several accounts of amusing paper rejections Lamport received here. However, I still think it is better to be ignored temporarily than remain unoriginal permanently. If you are not original, you are not truly contributing. Being daring and original is the only way you have for making a genuine contribution and having impact.
Don't worry about people stealing an idea. If it's original, you will have to ram it down their throats.
Howard H. Aiken

The interesting thing is, to be the black sheep, you don't need to put on an act. If you have a no bullshit attitude about research and don't take fashionable research ideas/directions just by face value, you will soon become the black sheep. But, as Feynman used to say "what do you care what other people think?". Ignore everybody, work on what you think is the right thing.

PS: Coincidentally Daniel Lemire has recently discussed about the black sheep phenomena in "To be smarter, try being crazier?"

Saturday, January 25, 2014

Bionic Cyber Physical Systems

I have been neglecting my blog because of several paper and proposal deadlines. Rather than keeping the blog gather more dust, I decided to share some summaries from recent talks I attended and papers I read. This is a talk I attended couple months ago and enjoyed a lot. It is directly on the distributed systems topic, but in the future we may find ourselves programming these kind of distributed systems, and we will be dealing with a lot of bugs in our code for sure ;-)

Alper Bozkurt (NCSU) has been working on bionic cyber physical systems, which aim to fuse synthetic man-made systems with naturally occuring biological organisms, such as cockroaches, moths, dogs, lemurs, and eventually humans. His research is very interesting and has been featured at National Geographics, CNN, and Stephen Hawkins show "Brave New World". Here is the PhDComics coverage of his research (and this).

Alper primarily works with insects. He says insects are fabuluous because they are optimized aerodynamic systems with onboard efficient control systems and dense chemical fat stores. In other words, insects have sensors and actuators, collision avoidance, and power source. Any plane-like machine at the size of insect will not fly because of aerodynamics. At that size, you need flapping wings or helicopter blades.

He makes a good case for converting the insects to biobots. We, humans, have used ATP powered animal muscle before oil powered engines through the use domestication technology. We domesticated horses, donkeys, mules, dogs. But we couldn't domesticate the insects, because they are hard to train and teach. He argues that the biobots technology is just a domestication technology (a computer-assisted technology) applied on insects. Domesticating insects via biobot technology has applications in search and rescue, environmental sensing, industrial plant monitoring, and nuclear plant monitoring. Think of a colony of biobot insects networked together for exploration and mapping for challenging terrains. The cost of a raising insects is almost nothing. Starting with a male and female insect, you can go to a colony in two weeks. Alper's research is on developing low cost neurostimulation probes that can be easily applied to the insects. The vision is that a robot would be able to catch the insects and implant these probes to the insects with a minimally invasive way. Insects are simplistic, they only respond to reflexes, so the neuro control is much simpler to perform in insects. Insects also don't register pain, there are no pain receptors in insects, so there is no need for permissions for running scientific experiments on insects.

The payload that can be carried by the insect is not much, but now thanks to miniaturization in circuit technology with a few grams of payload, it is possible to store and carry megabytes of digital information (photos/videos). The challenge is how to implant the circuits to the insects without crippling or killing them in the process. Alper's research solves this problem by benefiting from the metamorphic development of the insect. During metamorphosis, 90% of the tissue is retained and in 1 week the terrastrial insect morphs to become a flying insect by developing wings and antenna. To explain how drastic metamorphosis, people use the analogy of transforming from wheelchair to an Apache helicopter. Alper's method performs the surgery before metamorphosis, and the wound caused for surgery is auto-recovered. The surgery does not require skill, and it will possible to automate the surgery to be performed by a robot.

They work with Carolina sphinx moth, which is 4-7 cm, with a wing spand of 10cm. This is quite a large insect and weighs 1-2 grams, and can carry a payload capacity half of their weight. They fly at 5m/s speed. With the probe installed, Alper's team can control the moth to initiate flight, stop flight, turn right, left.

For terrestrial locomotion control, the team works with pet cockroaches, called hissing cockroaches. These cockroaches have 5-7 cm size and weigh 5-10 grams. The insect looks down (doesn't need to look for predators) uses antennas for sensing like blind people with canes. If the probe gives the right antenna the signal, the insect turns to left. Thus, the team is able to make the insect follow the line very precisely controlled by a joystick.

The tissue electrode interface is a bottleneck in the implants. The team is experimenting with electroplating to optimize enough charge and charge storage. They also run some in-vivo electrochemical analysis to see if they can improve/optimize the process. Finally, habituation can become a problem: the insect may get used to the implant, and may start to ignore after a while. The control is not always very deterministic, so they try to automate and make the control system adaptive.

The team is also developing backpack implants for the insects with zigbee (and sometimes Bluetooth support). Networking the insects will enable controlling them as a swarm and move them in a coordinated manner.

Monday, December 23, 2013

Research is a capricious mistress

Research is a capricious mistress. You should be totally devoted to it, and you should be thinking about it all the time. Otherwise, it won't divulge its secrets to you.

Research has this nonlinear return on investment (ROI) function. It awards persistence and dedication by folds and punishes apathy and laziness severely.

If you give research your 100%, you will get at least 200% ROI back. Eventually, that is. It will first test you to see if you are worthy, if you really mean it, if you really have it in you. You will get several setbacks, false alarms. Yet, if you persist and overcome the failures, it will start giving back to you by folds.

If you give research your 50% you will get 20% back. You won't make the cut, and you will be operating a business with deficiency. Usually someone else (the university that employs you) pays the tab.

So what does giving 100% mean? It doesn't mean working overtime (sometimes that will be needed, and you will be doing it though). The productive focused working time (i.e., deep thinking time) is limited to 4 hours a day more or less. There seems to be a bound after which your brain refuses to make progress; it is as if your brain needs to wait to adjust for and catch up to what it had produced. It won't go further even if you push more. I liken this to climbing a mountain; you have to rest daily to let your metabolism adjust. (This 4 hour bound has been anectodally mentioned by many researchers, but I don't know of any detailed analysis of this. For example, I don't know if it is possible to fit in 8 concentrated hours as 4+4 on two different projects. I haven't been able to succeed so far.)

Let's stick with that 4 hours daily. Are you giving that focused 4 hours to your research? Only a small fraction of researchers can answer this affirmatively. How can you cultivate this deep thought? It is hard work. Be persistent and try different things until you find what works best for you. Here are some necessary conditions to get you started. First, turn all distractions off. Don't check your email, twitter, browser in those precious hours. Go offline if possible. Secondly, you should be writing and taking notes. You need to be writing to be able to think in a concentrated manner. If you work with pen and paper (or a whiteboard), you can get more visual, you can doodle, you can link concepts. If you use a good word editor (I swear by Emacs orgmode) and keep typing as you work, the bonus is that this often will be your zeroth draft for your research paper, and you won't get stuck in thinking how/where to start writing.

After paying your dues to your research in those 4 hours, you are still not off the hook. What you should do is you should think about your research in that remaining time. This is the digestion and cudding process. At the background of your brain, you should revisit your research at different times of the day. You do this to see 1) if what you produced holds water, and 2) if by approaching from different angles you can make more progress. You come back to your research again and again during the day (while driving, in the shower) to see if you can catch your research off-guard and get more secrets out. But, you should not let this thinking/reconsidering turn to worrying. Worrying is unproductive and harmful.

If you make some progress regularly and persistently, you will get to see the many folds return on your investment.

Saturday, November 9, 2013

My notes from SOSP13 welcome and awards

The ACM Symposium on Operating Systems Principles (SOSP) is arguably the top conference in the computer systems area. The SOSP conference took a start with a welcome talk from the General Chair, Michael Kaminsky (Intel Labs), and PC Chair, Mike Dahlin (Google and UT Austin).

The chairs started by thanking to the many sponsors (platinum, gold, silver, bronze level) for the conference.

This year SOSP had 628 registrations, which made it the biggest SOSP as of yet, with a 16% increase over 2011 SOSP (which was yet biggest till then). Attendance distribution to SOSP is 76% from North America, 15% Europe, and 11% Asia. Among those attending 42% is faculty, 42% students, and 15% industry. There were 7 workshops on the weekend preceding the SOSP (LADIS was one of them), and 40% of attendants also attended workshops.

This year, for the first time, SOSP had full open access conference proceedings (the cost, $1100 per paper has been paid by SIGOPS), and this announcement got a huge applause from the audience.

Among the 150+ papers submitted, 30 papers are accepted to SOSP. There was a 3 round review process, with a total of 740 reviews, and on average 5 reviews per paper. 70 papers that fell in the middle were discussed in the depth PC meeting.

Three papers have been awarded with the best paper awards:

  1. The Scalable Commutativity Rule: Designing Scalable Software for Multicore Processors, Austin T. Clements, M. Frans Kaashoek, Nickolai Zeldovich, Robert Morris (MIT CSAIL), Eddie Kohler (Harvard)
  2. Towards Optimization-Safe Systems: Analyzing the Impact of Undefined Behavior, Xi Wang, Nickolai Zeldovich, M. Frans Kaashoek, Armando Solar-Lezama (MIT CSAIL)
  3. Naiad: A Timely Dataflow System, Derek G. Murray, Frank McSherry, Rebecca Isaacs, Michael Isard, Paul Barham, Martin Abadi (Microsoft Research)

The accepted 30 papers were presented within the duration of 2.5 days as single track. Each paper got a 30 minute time slot for which 22-25 minutes is for presentation, and 5-8 minutes are dedicated to question-answering session. This year Google-moderator tool has also been employed for taking questions from the audience, in addition to walk-to-microphone questions. The tool enables the audience to vote on the questions and get the questions with most votes rise to the top.

A majority of the presentations were very good (which is not the case in a typical conference where a big majority of the presentations are dull). It was clear that presenters had practiced extensively before the presentations, which helped them to deliver polished 22-25 minutes talks. Question-answering sessions were lively and there were several insightful questions asked.

The papers and  presentation slides (and soon talk videos) are available from

Some of my top picks (reflecting my prejudicies and research interests) from the conference are as follows:

  • The scalable commutativity rule: Designing scalable software for multicore processors
  • Dandelion: a compiler and runtime for heterogenous systems
  • Sparrow: distributed, low latency scheduling
  • From L3 to seL4 what have we learnt in 20 years of L4 microkernels?
  • An analysis of facebook photo caching
  • Towards Optimization-Safe Systems: Analyzing the Impact of Undefined Behavior
  • Transactions chains: achieving serializability with low-latency in geo-distributed storage systems
  • Consistency-Based Service Level Agreements for Cloud Storage
  • Tango: Distributed Data Structures over a Shared Log
  • There Is More Consensus In Egalitarian Parliaments
  • Naiad: A Timely Dataflow System

I will try to edit/organize my notes about some of these talks and share them soon. Especially the last five papers on this list appeal a lot to me since they are more relevant and related to my research interests large-scale distributed systems.

Tuesday night there was a banquet and award ceremony. Stefan Savage has been awarded with the 2013 Mark Weiser award. He gave a humble yet powerful talk, and shared personal stories about how his career has benefited a lot from interacting with the late Mark Weiser. Two recent PhD thesis were awarded with the Dennis Ritchie award.

Finally, the following five papers have been added to SIGOPS Hall of Fame:

  1. “Tenex, A Paged Time Sharing System for the PDP-10”, Daniel G. Bobrow, Jerry D. Burchfiel, Daniel L. Murphy and Raymond S. Tomlinson, Communications of the ACM 15(3), March 1972. 
  2. “A NonStop Kernel”, Joel Bartlett,  in Proceedings of the Eighth ACM Symposium on Operating Systems Principles (SOSP’81), Pacific Grove, California, December 1981 
  3. K. Mani Chandy and Leslie Lamport, “Distributed Snapshots: Determining Global States of a Distributed System”, ACM Transactions on Computer Systems 3(1), February 1985. 
  4. Kenneth P. Birman and Thomas A. Joseph, “Exploiting Virtual Synchrony in Distributed Systems”, in Proceedings of the Eleventh ACM Symposium on Operating Systems Principles (SOSP’87), Austin, Texas, November 1987. 
  5. Eddie Kohler, Robert Morris, Benjie Chen, John Janotti and Frans Kaashoek, “The Click Modular Router”, ACM Transactions on Computer Systems (TOCS), 18(3), August 2000.