Thursday, February 4, 2016

Holistic Configuration Management at Facebook

Move fast and break things

That was Facebook's famous mantra for developers. Facebook believes in getting early feedback and iterating rapidly, so it releases software early and frequently: three times a day for frontend code, three times a day for backend code. I am amazed that Facebook is able to maintain such an agile deployment process at that scale. I have heard of other software companies, even relatively young ones, developing problems with agility, even to the point that deploying a trivial app takes a couple of months due to reviews, filing tickets, routing, permissions, etc.

Of course, when deploying that frequently at that scale, you need discipline and good processes in order to prevent chaos. At the F8 developers conference in 2014, Zuckerberg announced the new Facebook motto: "Move Fast With Stable Infra."

I think the configerator tool discussed in this paper is a big part of the "Stable Infra". (By the way, why is it configerator but not configurator? Another Facebook peculiarity like the spelling of Übertrace?)

Here is a link to the paper in SOSP'15 proceedings.
Here is a link to the conference presentation video.

Configuration management

What is even more surprising than daily Facebook code deployment is this: Facebook's various configurations are changed even more frequently, currently thousands of times a day. And hold on: every single engineer can make live configuration changes! This is certainly exceptional, especially considering that even a minor mistake could potentially cause a site-wide outage (due to complex interdependencies). How is this possible without incurring chaos?

The answer is: discipline sets you free. By being disciplined about the deployment process, and by having built the configerator, Facebook lowers the risks of deployments and can give its developers the freedom to deploy frequently.

OK, before reviewing this cool system configerator, let's get this clarified first: what does configuration management involve, and where is it needed? It turns out it is essential for a large and diverse set of systems at Facebook. These include: gating new product features, conducting experiments (A/B tests), performing application-level traffic control, performing topology setup and load balancing at TAO, performing monitoring alerts/remediation, updating machine learning models (which vary from KBs to GBs), and controlling applications' behaviors (e.g., how much memory is reserved for caching, how many writes to batch before writing to the disk, how much data to prefetch on a read).

Essentially, configuration management provides the knobs that enable tuning, adjusting, and controlling Facebook's systems. No wonder configuration changes keep growing in frequency and outnumber code changes by orders of magnitude.

Configuration as code approach

The configerator philosophy is treating configuration as code, that is, compiling and generating configs from high-level source code. Configerator stores both the config programs and the generated configs in git version control.

There can be complex dependencies across systems and services at Facebook: after one subsystem/service config is updated to enable a new feature, the configs of all other systems might need to be updated accordingly. By taking a configuration-as-code approach, configerator automatically extracts dependencies from source code without the need to manually edit a makefile. Furthermore, Configerator provides many other foundational functions, including version control, authoring, code review, automated canary testing, and config distribution. We will review these next as part of the Configerator architecture discussion.
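To make the configuration-as-code idea concrete, here is a minimal sketch in Python. The function names, config keys, and invariants are my illustrative assumptions, not Configerator's actual API: a config is authored as a small program (here, plain data plus developer-provided validators), and the "compiler" runs the validators and emits the concrete artifact that would be committed to git.

```python
import json

def validate_cache_config(cfg):
    """Developer-provided validator: invariants are checked at compile time, not in production."""
    assert 0 < cfg["cache_mb"] <= 4096, "cache size out of range"
    assert cfg["batch_writes"] >= 1, "must batch at least one write"

def compile_config(source, validators):
    """'Compile' the high-level config source into a canonical JSON artifact."""
    for validate in validators:
        validate(source)  # an invariant violation fails the compile, never the site
    return json.dumps(source, sort_keys=True)

source = {"cache_mb": 512, "batch_writes": 16}
artifact = compile_config(source, [validate_cache_config])
print(artifact)  # this JSON blob is what gets version-controlled and distributed
```

The point of the design is visible even in this toy: a bad value (say, `cache_mb = 0`) is rejected before any artifact exists to distribute.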

While configerator is the main tool, there are other configuration support tools in the suite.
Gatekeeper controls the rollouts of new product features. Moreover, it can also run A/B testing experiments to find the best config parameters. In addition to Gatekeeper, Facebook has other A/B testing tools built on top of Configerator, but we omit them in this paper due to the space limitation. PackageVessel uses peer-to-peer file transfer to assist the distribution of large configs (e.g., GBs of machine learning models), without sacrificing the consistency guarantee. Sitevars is a shim layer that provides an easy-to-use configuration API for the frontend PHP products. MobileConfig manages mobile apps' configs on Android and iOS, and bridges them to the backend systems such as Configerator and Gatekeeper. MobileConfig is not bridged to Sitevars because Sitevars is for PHP only. MobileConfig is not bridged to PackageVessel because currently there is no need to transfer very large configs to mobile devices.

The P2P file transfer mentioned as part of PackageVessel is none other than BitTorrent. Yes, BitTorrent finds many applications in the datacenter; here is an example from Twitter in 2010.

The Configerator architecture

The Configerator application is designed to defend against configuration errors in multiple phases. "First, the configuration compiler automatically runs developer-provided validators to verify invariants defined for configs. Second, a config change is treated the same as a code change and goes through the same rigorous code review process. Third, a config change that affects the frontend products automatically goes through continuous integration tests in a sandbox. Lastly, the automated canary testing tool rolls out a config change to production in a staged fashion, monitors the system health, and rolls back automatically in case of problems."

I think this architecture is actually quite simple, even though it may look complex. Both control and data flow in the same direction: top down. There are no cyclic dependencies, which could make recovery hard. This is a soft-state architecture: new, correct information pushed from the top cleans out old, bad information.

Canary testing: The proof is in the pudding

The paper has this to say on their canary testing:
The canary service automatically tests a new config on a subset of production machines that serve live traffic. It complements manual testing and automated integration tests. Manual testing can execute tests that are hard to automate, but may miss config errors due to oversight or shortcut under time pressure. Continuous integration tests in a sandbox can have broad coverage, but may miss config errors due to the small-scale setup or other environment differences. A config is associated with a canary spec that describes how to automate testing the config in production. The spec defines multiple testing phases. For example, in phase 1, test on 20 servers; in phase 2, test in a full cluster with thousands of servers. For each phase, it specifies the testing target servers, the healthcheck metrics, and the predicates that decide whether the test passes or fails. For example, the click-through rate (CTR) collected from the servers using the new config should not be more than x% lower than the CTR collected from the servers still using the old config.
Canary testing is an end-to-end test, and it somewhat overrides trying to build more exhaustive static tests on configs. Of course, the validation, review, and sandbox tests are important precautions to make sure the config is sane before it is tried on a small fraction of production. However, given that Facebook already has canary testing, which provides good end-to-end proof of a config's correctness, the need for heavyweight correctness-checking mechanisms is somewhat obviated. The paper gives a couple of examples of problems caught during canary testing.
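Based on the paper's description of canary specs, a staged rollout could be sketched as below. The phase sizes match the paper's example, but the spec format, metric plumbing, and rollback logic are my assumptions, not Facebook's implementation:

```python
# A canary spec: phases of increasing scope, each with a pass/fail predicate.
CANARY_SPEC = [
    {"phase": 1, "servers": 20,   "max_ctr_drop_pct": 5.0},
    {"phase": 2, "servers": 2000, "max_ctr_drop_pct": 2.0},
]

def ctr_healthy(new_ctr, baseline_ctr, max_drop_pct):
    """Pass if CTR on canary servers is not more than x% below the old-config baseline."""
    return new_ctr >= baseline_ctr * (1 - max_drop_pct / 100.0)

def run_canary(spec, measure_ctr):
    """Roll out phase by phase; stop and roll back (return False) on the first failure."""
    for phase in spec:
        new_ctr, baseline_ctr = measure_ctr(phase["servers"])
        if not ctr_healthy(new_ctr, baseline_ctr, phase["max_ctr_drop_pct"]):
            return False  # automatic rollback before the change reaches the fleet
    return True           # healthy in every phase: safe to push everywhere

# Toy measurement: canary CTR is 1% below baseline, within both thresholds.
print(run_canary(CANARY_SPEC, lambda n: (0.099, 0.100)))
```

The staged structure is what bounds the blast radius: a config that would hurt CTR fails on 20 servers, never on thousands.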

On the other hand, the paper does not make it clear how conclusive/exhaustive the canary tests are. What if canary tests don't catch slowly manifesting errors, like memory leaks? Also, how does Facebook detect whether there are abnormalities during a canary test? Yes, Facebook has monitoring tools (Übertrace and the Mystery Machine), but are they sufficient for abnormality detection and subtle bug detection? Maybe we don't see an adverse effect of the configuration change for this application, but what if it adversely affects other applications or backend services? It seems like exhaustive monitoring, log collection, and log analysis may need to be done to detect more subtle errors.

Performance of the Configerator

Here are approximate latencies for configerator phases:
When an engineer saves a config change, it takes about ten minutes to go through automated canary tests. This long testing time is needed in order to reliably determine whether the application is healthy under the new config. After canary tests, how long does it take to commit the change and propagate it to all servers subscribing to the config? This latency can be broken down into three parts: 1) It takes about 5 seconds to commit the change into the shared git repository, because git is slow on a large repository; 2) The git tailer (see Figure 3) takes about 5 seconds to fetch config changes from the shared git repository; 3) The git tailer writes the change to Zeus, which propagates the change to all subscribing servers through a distribution tree. The last step takes about 4.5 seconds to reach hundreds of thousands of servers distributed across multiple continents.

A figure in the paper shows that git is the bottleneck for configuration distribution. "The commit throughput is not scalable with respect to the repository size, because the execution time of many git operations increases with the number of files in the repository and the depth of the git history. Configerator is in the process of migration to multiple smaller git repositories that collectively serve a partitioned global name space."
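The partitioning fix they describe can be sketched as a stable mapping from a config's path to one of many small repositories. The shard count, repo naming, and hash choice below are illustrative assumptions on my part:

```python
import hashlib

NUM_SHARDS = 16  # assumed shard count, for illustration only

def shard_for(config_path):
    """Stable shard assignment: a config always lives in the same small repo,
    so git history depth and file count grow per-shard, not globally."""
    h = int(hashlib.sha1(config_path.encode("utf-8")).hexdigest(), 16)
    return "configs-shard-%02d" % (h % NUM_SHARDS)

print(shard_for("tao/topology/replicas"))  # same input always maps to the same shard
```

The key property is determinism: every writer and every git tailer computes the same shard for a given path, so the partitioned name space still looks global.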

Where is the research?

Configerator is an impressive engineering effort, and I want to focus on the important research takeaways from it. Going forward, what are the core research ideas and findings? How can we push the envelope for future-facing improvements?

How consistent should the configuration rollouts be?
There can be couplings/conflicts between code and configuration. Facebook solves this cleverly: they deploy the code first, much earlier than the config, and enable the hidden/latent code later with the config change. There can also be couplings/conflicts between old and new configs. The configuration change arrives at production servers at different times, albeit within 5-10 seconds of each other. Would it cause problems to have some servers run the old configuration and some the new configuration? Facebook punts this responsibility to the developers: they need to make sure that the new config can coexist with the old config in peace. After all, they use canary testing, where a fraction of the machines use the new config and the rest the old config. So, in sum, Facebook does not try to have a strongly consistent reset to the new config. I don't know the details of their system, but for backend servers config changes may need stronger consistency than that.
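The coexistence burden placed on developers often reduces to a simple discipline: servers read configs defensively, with safe defaults for keys the old version lacks, and never assume the whole fleet flipped at once. A minimal sketch, with made-up keys and defaults of my own invention:

```python
DEFAULTS = {"feature_x_enabled": False, "prefetch_kb": 64}

def effective_config(raw):
    """Merge whatever config version this server received over safe defaults,
    so old-config and new-config servers can serve traffic side by side."""
    cfg = dict(DEFAULTS)
    cfg.update(raw)
    return cfg

old = effective_config({"prefetch_kb": 128})                             # old config, no new key
new = effective_config({"prefetch_kb": 128, "feature_x_enabled": True})  # new config
# Servers on either version behave sensibly; the new feature simply defaults off.
```

This is exactly the property canary testing exercises: for those minutes (or hours), part of the fleet is on each version, and the defaults make that state safe rather than exceptional.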

Push versus Pull debate.
The paper claims push is more advantageous than pull in the datacenter for config deployment. I am not convinced because the arguments do not look strong.
Configerator uses the push model. How does it compare with the pull model? The biggest advantage of the pull model is its simplicity in implementation, because the server side can be stateless, without storing any hard state about individual clients, e.g., the set of configs needed by each client (note that different machines may run different applications and hence need different configs). However, the pull model is less efficient for two reasons. First, some polls return no new data and hence are pure overhead. It is hard to determine the optimal poll frequency. Second, since the server side is stateless, the client has to include in each poll the full list of configs needed by the client, which is not scalable as the number of configs grows. In our environment, many servers need tens of thousands of configs to run. We opt for the push model in our environment.
This may be worth revisiting and investigating in more detail. Pull is simple and stateless, as they also mention, and it is unclear why it couldn't be adopted.
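To make the trade-off the paper describes concrete, here is a toy sketch of both models (in-memory, single process; the class shapes are my illustration, not either side's real protocol). The pull server is stateless, so every poll must carry the client's full config list and pays that cost even when nothing changed; the push server remembers subscriptions and only fans out deltas:

```python
class PullServer:
    """Stateless: each poll carries the client's full wanted-config list."""
    def __init__(self, configs):
        self.configs = configs  # name -> (version, payload)

    def poll(self, wanted, known_versions):
        # Cost grows with len(wanted) on every poll, even when the answer is empty.
        return {name: self.configs[name]
                for name in wanted
                if self.configs[name][0] > known_versions.get(name, -1)}

class PushServer:
    """Stateful: remembers each client's subscriptions, pushes only deltas."""
    def __init__(self, configs):
        self.configs = configs
        self.subs = {}  # client -> set of subscribed config names

    def subscribe(self, client, names):
        self.subs[client] = set(names)

    def publish(self, name, version, payload):
        self.configs[name] = (version, payload)
        # Only clients subscribed to this config get a push.
        return [c for c, names in self.subs.items() if name in names]

pull = PullServer({"cache": (1, "512MB"), "flags": (1, "off")})
print(pull.poll({"cache", "flags"}, {"cache": 1, "flags": 1}))  # {} -- pure overhead
```

Of course, the toy omits the hard part of push, which is exactly the hard state the paper worries about: tracking tens of thousands of subscriptions per server and recovering that state after failures.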

How do we extend to WAN?
All coordination mentioned is single-master (i.e., single producer/writer). Would there be a need for a multi-master solution, with a master at each region/continent that can start a config update? Then the system would need to deal with concurrent and potentially conflicting configuration changes. However, given that canary testing is on the order of minutes, there would not be a practical need for multi-master deployment in the near future.

Reviews of other Facebook papers

Facebook's Mystery Machine: End-to-end Performance Analysis of Large-scale Internet Services

Facebook's software architecture

Scaling Memcache at Facebook

Finding a Needle in Haystack: Facebook's Photo Storage

Thursday, January 14, 2016

Fool yourself

"The first principle is that you must not fool yourself --and you are the easiest person to fool."     ---Richard Feynman
I agree with the second part. I am the easiest person to be fooled by myself, because I like to believe what is convenient, comfortable, less scary to me. For example, I am fooled to procrastinate, because I am scared of facing the task and not having a perfect complete product. Sometimes I am fooled to be overly optimistic, because I am scared of the alternative: failure and hassle. And sometimes I am fooled to be overly pessimistic, because I am scared of the long arduous fight to achieve something, so I say "No way, I won't be able to achieve it anyways".

I like to propose a change to the first part. If it is so easy to fool yourself, you should exploit it: you should fool yourself in a beneficial manner to avoid fooling yourself in the usual/default harmful manner.

For example, when you catch yourself procrastinating, you should fool yourself to get started. When you are frozen by the prospect of writing an article or blog post, you can say: "This is the zeroth draft; I can throw it away anyway. This is just a brain dump, let's see what it will look like."

Insisting on one-shot completion, waiting for inspiration to hit you so you can get it perfect and finished in one sitting, raises the bar too high. Instead, you should think of any project as iterations of attempts. You should forget about shipping, and just cajole yourself into making progress. And when you get something good enough, then you can convince yourself to overcome your fear and ship it.

Similarly, when you catch yourself being overly optimistic, you can fool yourself into considering alternatives: "Just for fun, let me take 15 minutes to write about what alternatives I can pursue if things go wrong with this. Let's brainstorm hypothetical failure scenarios." Or, when you catch yourself being overly pessimistic, you can fool yourself into being hopeful: "Why not just give this a try? Let's pretend this will work out, and not worry if it doesn't."

And how do you catch and interject when you are fooling yourself in a harmful manner? By always being on the watch. (Step 1, raise your gaze from your smartphone screen. Step 2, observe yourself and reflect.) You should always suspect that you are fooling yourself: which you are, either one way or another. You can only choose the type of stories you tell yourself. You can fool yourself to engage in unproductive harmful behavior, or fool yourself to engage in productive behavior. You get to decide what stories you tell yourself.

I also like this quote a lot:
"Give up on yourself. Begin taking action now, while being neurotic or imperfect, or a procrastinator or unhealthy or lazy or any other label by which you inaccurately describe yourself. Go ahead and be the best imperfect person you can be and get started on those things you want to accomplish before you die." -- Shoma Morita, MD

Monday, January 11, 2016

Paper review: TensorFlow, Large-Scale Machine Learning on Heterogeneous Distributed Systems

The paper is available here.

TensorFlow is Google's new framework for implementing machine learning algorithms using dataflow graphs. Nodes/vertices in the graph represent operations (i.e., mathematical operations, machine learning functions), and the edges represent the tensors (i.e., multidimensional data arrays, vectors/matrices) communicated between the nodes. Special edges, called control dependencies, can also exist in the graph to denote that the source node must finish executing before the destination node starts executing. Nodes are assigned to computational devices and execute asynchronously and in parallel once all the tensors on their incoming edges become available.

It seems like the dataflow model is getting a lot of attention recently and is emerging as a useful abstraction for large-scale distributed systems programming. I had reviewed the Naiad dataflow framework earlier. Adopting the dataflow model gives TensorFlow flexibility, and as a result the framework can be used to express a wide variety of algorithms, including training and inference algorithms for deep neural network models.
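You don't need TensorFlow itself to see the abstraction. Here is a toy dataflow graph in plain Python (my own sketch, not TensorFlow's API): data edges carry "tensors" (just numbers here) between operations, control edges impose ordering without carrying data, and a node fires once all of its predecessors are done:

```python
class Node:
    def __init__(self, name, op, inputs=(), control=()):
        self.name, self.op = name, op
        self.inputs = list(inputs)    # data edges: values flow along these
        self.control = list(control)  # control edges: ordering only, no data

def run(nodes):
    """Execute each node once all its data and control predecessors have finished."""
    done, values = set(), {}
    while len(done) < len(nodes):
        for n in nodes:
            deps = set(n.inputs) | set(n.control)
            if n.name not in done and deps <= done:
                values[n.name] = n.op(*(values[i] for i in n.inputs))
                done.add(n.name)
    return values

graph = [
    Node("a", lambda: 2.0),
    Node("b", lambda: 3.0),
    Node("mul", lambda x, y: x * y, inputs=["a", "b"]),
    # Control dependency: "add" waits for "mul" even though no data flows between them.
    Node("add", lambda x, y: x + y, inputs=["a", "b"], control=["mul"]),
]
print(run(graph)["mul"], run(graph)["add"])  # 6.0 5.0
```

The real system layers device placement, asynchrony, and distributed execution on top of exactly this readiness rule, which is what makes the same graph portable across phones, GPUs, and clusters.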

TensorFlow's heterogeneous device support

The paper makes a big deal about TensorFlow's heterogeneous device support; it is even right there in the paper title. The paper says: "A computation expressed using TensorFlow can be executed with little or no change on a wide variety of heterogeneous systems, ranging from mobile devices such as phones and tablets up to large-scale distributed systems of hundreds of machines and thousands of computational devices such as GPU cards."

Wait, what? Why does TensorFlow need to run on wimpy phones?!

The paper says the point is just portability: "Having a single system that can span such a broad range of platforms significantly simplifies the real-world use of machine learning system, as we have found that having separate systems for large-scale training and small-scale deployment leads to significant maintenance burdens and leaky abstractions. TensorFlow computations are expressed as stateful dataflow graphs."

I understand this; yes, portability is important for development. But I don't buy it as the full explanation. Again, why does TensorFlow, such a powerhouse framework, need to be shoehorned to run on a single wimpy phone?!

I think Google has designed and developed TensorFlow as a Maui-style integrated code-offloading framework for machine learning. What is Maui you ask? (Damn, I don't have a Maui summary in my blog?)

Maui is a system for offloading of smartphone code execution onto backend servers at method-granularity. The system relies on the ability of managed code environment (.NET CLR) to be run on different platforms. By introducing this automatic offloading framework, Maui enables applications that exceed memory/computation limits to run on smartphones in a battery- & bandwidth-efficient manner.

TensorFlow brings cloud backend support to the private/device-level machine learning going on in your smartphone. It doesn't make sense for an entire power-hungry TensorFlow program to run on your wimpy smartphone. Your smartphone will run only certain TensorFlow nodes and modules; the rest of the TensorFlow graph will run on the Google cloud backend. Such a setup is also great for preserving the privacy of your phone while still enabling machine-learned insights on your Android.

I will talk about applications of this, but first let me mention this other development about TensorFlow that supports my guess.

Google Opensourced the TensorFlow API in Nov 2015

Google opensourced the TensorFlow API and a limited reference implementation (the implementation runs on a single device only) under the Apache 2.0 license in November 2015. This implementation is available online and has attracted a lot of attention.

Why did Google opensource this project relatively early rather than keeping it proprietary for longer? This is their answer: "We believe that machine learning is a key ingredient to the innovative products and technologies of the future. Research in this area is global and growing fast, but lacks standard tools. By sharing what we believe to be one of the best machine learning toolboxes in the world, we hope to create an open standard for exchanging research ideas and putting machine learning in products."

This supports my guess. TensorFlow's emphasis on heterogeneity is not just for portability. Google is thinking of TensorFlow as an ecosystem. They want developers to adopt TensorFlow, so that TensorFlow is used for developing machine learning modules on Android phones and tablets. And then Google will support/enrich (and find ways to benefit from) these modules by providing backends that run TensorFlow. This is a nice strategy for Google, a machine learning company, to percolate into machine learning in the Internet of Things domain in general, and the mobile apps market in particular. Google can be the monopoly Deep learning As A Service (DAAS) provider leveraging the TensorFlow platform.

How can Google benefit from such integration? Take a look at this applications list: "TensorFlow has been used in Google for deploying many machine learning systems into production: including speech recognition, computer vision, robotics, information retrieval, natural language processing, geographic information extraction, and computational drug discovery."

With a mobile-integrated TensorFlow machine-learning system, Google can provide a better personal assistant on your smartphone. Watch out, Siri: better speech recognition, calendar/activity integration, face recognition, and computer vision are coming. Robotics applications can enable Google to penetrate the self-driving car OS and drone OS markets. And after that can come more transformative globe-spanning physical-world sensing & collaboration applications.

With this I rest my case. (I got carried away, I just intended to do a paper review.)

Related links

In the next decade we will see advances in machine learning coupled with advances in Internet Of Things.

In this talk, Jeff Dean gives a very nice motivation and introduction for Tensorflow.

Here is an annotated summary of the TensorFlow paper.

This post explains why TensorFlow framework is good news for deep learning.

Monday, January 4, 2016

Book Review: "Zero to One", Peter Thiel, 2014

I liked this book a lot. It inspired me to write about "How to go for 10X". That blog post and the "Zero to One" book I mentioned there got covered better than I could manage by Todd Hoff at his High Scalability blog. I am including my brief and unstructured notes on the Zero to One book here just for the record.

The main theme in the book is: Don't do incremental business, invent a new transformational product/approach. Technology is 0-to-1, globalization is 1-to-n. Most people think the future of the world will be defined by globalization, but the book argues that technology matters more. The book says: Globalization (copying  and incrementalism as China has been doing) doesn't scale, it is unsustainable. That's a hard argument to make, but a softer version of that argument is: "technology creates more value than globalization".

A related theme is that you should aim to become a monopoly with your transformational product/technology. If you compete, everybody loses: competitive markets destroy profits. (That is why a lot of restaurants fail.) A smarter approach is to start small and monopolize your area. Of course, over time monopolies also fade and become outdated, so you should strive to reinvent another business.

This book also has a lot of interesting counterintuitive ideas. When the book told me about these items, I thought they made sense:

  • make incremental advances
  • stay lean and flexible
  • improve on the competition
  • focus on product not sales

NO! Peter Thiel says that these lessons have become dogma in the startup world, and yet the opposite principles are probably more correct.

  • it is better to risk boldness than triviality
  • a bad plan is better than no plan
  • competitive markets destroy profits
  • sales matter just as much as product

It is easy to fool yourself into thinking you have an innovative thing. If you are defining your business in terms of intersections, it may not be that innovative, and maybe it shouldn't exist in the first place. It is hard to find something truly innovative/transformative, and you will know it when you find it. It will open a new market and will monopolize that market. If you are a monopoly, you try to downplay it and define yourself as a union of many markets. Google is a monopoly in search, so it lies to underplay this by casting itself as an IT company.

Friday, January 1, 2016

My new pomodoro workflow

Pomodoro is a timeboxing technique. You set a Pomodoro timer for 25 minutes to get a task done. Then you take a break for 5 minutes, after which you can do another Pomodoro. Since Pomodoro imposes a limit on your work time, this adds a gaming element to the task.

I have been using the Pomodoro technique for 3-4 years now and I had written about that before. Recently I changed my Pomodoro usage significantly. Now I use Pomodoro more deliberately to achieve intense mental workouts and to force myself to get more creative and obtain transformative results.

Deep work versus Shallow work

The problem I want to attack with this new deliberate Pomodoro practice is the shallow work problem. The human brain is programmed to save energy and go easy on itself, so it prefers shallow tasks (such as watching TV, web browsing, answering emails) and tries to avoid the intense thinking sessions required for creative tasks such as writing, working on a proof, or designing an algorithm. As a result, we accumulate a lot of bad studying habits: we seem to be working, but we take it slow, get distracted with other thoughts, and break our concentration. In other words, unless we are deliberate about it, it is easy to slip into a superficial shallow-work mode rather than an effective deep-work mode.

(If you would like to learn more about deep work versus shallow work, and improve your study habits to achieve deep work, I recommend my friend/colleague Cal Newport's new book: Deep Work.)

Using Pomodoro in a more deliberate/mindful way, I aim to prevent shallow work and achieve better focus and intense concentration. Why is intense concentration so important? I had made this observation before: intense concentration sessions followed by a rest period (a meal, walking, playing with the kids) help cultivate creative ideas. The more intense the study session, the better the chance you will have an epiphany/insight in your resting time. (A Coursera course titled "Learning How To Learn" also mentions this finding.)

Intense concentration builds tension about the problem in your brain. So in your resting time, your brain spawns several background processes that try to resolve this tension, and, voila, you get epiphanies about solutions. In order to preserve this tension and keep your brain obsessed with the problem, it is also helpful to focus on one problem/paper on any given day. My postdoc advisor Nancy Lynch would work on only one paper in any given week. "The way she worked with students was that she would dedicate herself solely to a student/paper for the duration of an entire week. That week, she would avoid thinking about or listening to other works/students. This is because she wanted to immerse herself in, and keep in her mind, every parameter of the paper she was working on, and grok it."

My new Pomodoro cycle

I now have a three-phase Pomodoro cycle: 4 minutes of prePomodoro, 22 minutes of Pomodoro, and 4 minutes of postPomodoro. In the prePomodoro, I plan what I will do, how I can best go about it, and how I can take an alternative, more productive route. In other words, I go meta about the task. By raising several questions, I warm up my brain into an active attention state. In the Pomodoro, I do my best to get as much done as I can in 22 minutes with intense concentration. In the postPomodoro, I first do a celebration routine: I stretch and walk around. This helps associate positive feelings of accomplishment with a hard, focused studying session. I then review my performance and critique myself. What did I accomplish? What could I have done better? I write a tweet about this at a protected Twitter account, @YazarKata, that only I follow. This way, when reading my own Twitter feed @muratdemirbas, I can see/review my Pomodoro tweets as well. The postPomodoro is extensible: with a key press, I can always add another 4 minutes to work on completing the task in case the Pomodoro needs a little more work.
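The cycle above can be written down as a tiny schedule generator. The minute values match my cycle; the function itself is just an illustrative sketch (my actual setup is an Emacs script, described next):

```python
def pomodoro_cycle(extensions=0):
    """Return the (phase, minutes) schedule for one cycle,
    with optional 4-minute extensions tacked onto the postPomodoro."""
    phases = [("prePomodoro", 4), ("Pomodoro", 22), ("postPomodoro", 4)]
    phases += [("extension", 4)] * extensions
    return phases

total = sum(minutes for _, minutes in pomodoro_cycle())
print(total)  # 30 minutes per cycle, before any extensions
```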

I also changed my Pomodoro software. I used to have a timer in the menu bar, but I replaced it with an Emacs script. I figured it makes more sense to have my Pomodoro workflow incorporated into my Emacs setup, because I live most of my productive life in Emacs and use org-mode to organize my tasks. I adopted the pomodoro.el script and modified it to use the Mac OS X "say" command to announce special motivational messages at the beginning and end of my pomodoros. I no longer disable wifi during a Pomodoro. My pomodoros are already very intense, so I don't have any urges or distractions to browse the web.

Maybe in a couple years time, I will have another update to report on this.

Related links: 

How I read a research paper
How to go for 10X
Research is a capricious mistress

Monday, December 21, 2015

What technological advancements will the next decade bring?

There has been amazing progress in technology in recent years. Just to name a few examples, I would mention deep learning, big data, ubiquitous tablets and smartphones, advances in medicine, Bitcoin, 3D printing, accurate voice recognition, Arduino, Raspberry Pi, the maker movement, drones, etc. I can't help but wonder what we will see next, in the coming decade. What discoveries will we be able to make? How will the world change? I am very curious and excited about the future.

To extrapolate to 2025, indulge me as I first go back to the year 2005. Ten years ago, the iPhone had not yet been introduced (the iPhone 1 was introduced on June 29, 2007). Smartphones were rare. Internet was not a household item. The cloud was not there. Tablets were not there. In fact, none of the technologies mentioned in the previous paragraph were there. In ten years we have come a long way.

Let me add an anecdote about how I witnessed this ride over the last 10 years. In 2005, I joined the University at Buffalo as an assistant professor and started teaching CSE 646: Wireless and Mobile Computing, a graduate-level project-based course on wireless sensor networks. In order to pique the interest of my students, I would start the class with a review assignment on a short science-fiction story by Vernor Vinge, "Synthetic Serendipity". (If you have 10-15 minutes, do read this story; it is really good.) So, every year, we would talk about which technologies mentioned in this short story might get realized soon. A funny thing started to happen around 2010: it seemed like every year we would find a couple of technologies mentioned in the story getting realized: Google Glass, delivery drones, Arduinos, robot sets for education, augmented reality, Bitcoin, holograms, etc.

This leads me to be optimistic about the kind of progress we will see in the next 10 years. (OK, maybe not as optimistic as the singularity guys and what their law of accelerating returns predicts.) Let me wear my wizard hat and look into my crystal ball. (I will stick to predicting only computing-related advances with which I have some familiarity.)

Advances in machine learning 

Machine learning is all the rage now. Thanks to big data and cloud computing, machine learning has stolen the show in recent years. However, machine learning is still just simple statistical learning today; in other words, it is still in its infancy. I think we are going to see much better technology and accelerated progress in machine learning in the coming years. Five years ago, I would not have put the singularity before a 2150-2200 timeline. Nowadays, I am not so sure. The AI winter is over, and AI is blooming. With all the attention it is receiving, we are bound to witness more breakthroughs in machine learning.

I wager that by 2025 we can expect to have AI as smart as a 3rd-grader. This AI will of course be a savant at book/Google knowledge in specialized fields, but in addition it will be able to synthesize and contextualize information and use cognition as well as a 3rd-grader can. (This article summarizes the cognitive capabilities of an 8-10 year old.)

This will of course be accessible through our smartphones. Siri will get that good. We will have a personal assistant with the intelligence of a 3rd-grader, with all the world's knowledge accessible, filtered through a 3rd-grader's synthesis and creative capability. It was already uncanny when I witnessed my 4-year-old having a conversation with Siri a couple of months ago. I wonder how this will feel to the middle-aged and elderly. It may come as a total culture shock and be an alienating experience, as predicted in the Synthetic Serendipity story.

This is the pattern of software-based innovation. Software-based innovation virtualizes a physical thing, and then improves on it every year thanks to the availability of exponentially more computing/querying power. By 2025, we will have a virtual personal assistant, virtual nanny, virtual teacher, and a virtual doctor in our smartphones.

If you have missed the movie "Her", schedule some time to watch it. I thought it was pretty good. Come to think of it, in the most recent Terminator movie SkyNet rises from smartphones. Coincidence?

Advances in Internet of Things (IoT)

The Arduino and the maker movement made a splash. The Raspberry Pi is already getting dirt cheap. Yet the Internet of Things domain is still looking for a killer application. I expect the killer application to arrive, though obviously I don't know what it will be. Maybe it will be in querying physical spaces as easily as we query cyberspace with Google: being able to query for car keys at home, or for items in schools and warehouses. I don't know. Maybe we will become a security/surveillance-obsessed community and employ IoT to record/query audio and video.

In any case, I wager that IoT will finally take off. I include 3D-printing technology as auxiliary to IoT. In 2025, you may not quite be able to download and print a car, but I believe we will at least start downloading and printing toy cars for our children. (Whatever you can't print at home will be printed at an Amazon warehouse nearby and drone-delivered to you within a couple of hours.) Of course the toy car would be steerable from the Internet and would stream videos to YouTube :-) The most amazing technologies always start as toys, then find serious use. I remember in 2005 I had an argument with a graduate student who thought "the newly released Google Maps was a silly toy idea because we already have paper maps" (I kid you not).

Printable IoT utensils/tools/gadgets may help revolutionize/democratize retail the way cloud computing revolutionized/democratized the IT startup domain. Walmart's business may be in jeopardy in 2025, because printable IoT technology will accelerate innovation in consumer items/tools and shorten the path from idea to production and delivery in retail. If you have a good idea for a smart kitchen utensil, you will be able to prototype it and get it into production and retail at large scale quickly. This will be like Kickstarter on steroids. We may see the next Instagrams, Dropboxes, and Apples(?) flourish in the domain of printable IoT.

The way cloud computing provided pay-per-use elastic scalability to software-as-a-service, an IoT printing platform can provide pay-per-use elastic scalability to retail-as-a-service. Amazon was the cloud computing platform provider that facilitated/accelerated IT-based startups. Can Amazon be the platform provider for the IoT printing business as well? Wow, I might have figured out Amazon's convergent strategy for world domination in 2025.

Software is already eating the world, and by 2025 software will be rebuilding/reintegrating into the world. Maybe for future science-fiction reading in my classes I should assign Neal Stephenson's "The Diamond Age", which is set in a future where programmable nanotechnology affects all aspects of life.

Sunday, December 20, 2015

My Distributed Systems Seminar's reading list for Spring 2016

Below is the list of papers I plan to discuss in my distributed systems seminar in Spring'16 semester. These are all very exciting papers, and I am looking forward to the Spring semester.

If you have some suggestions on other good/recent papers to cover, please let me know in the comments.

Distributed coordination

  1. No compromises: distributed transactions with consistency, availability, and performance, SOSP 15
  2. Implementing Linearizability at Large Scale and Low Latency, SOSP 15
  3. High-Performance ACID via Modular Concurrency Control, SOSP 15
  4. Existential Consistency: Measuring and Understanding Consistency at Facebook, SOSP 15
  5. Holistic Configuration Management at Facebook, SOSP 15
  6. Building Consistent Transactions with Inconsistent Replication, SOSP 15
  7. Bolt-on Causal Consistency, Sigmod 13
  8. The Design and Implementation of the Wave Transactional Filesystem

Big data

  1. Arabesque: A System for Distributed Graph Mining, SOSP 15
  2. Petuum: A New Platform for Distributed Machine Learning on Big Data, KDD 15
  3. Graphlab: A new framework for parallel machine learning
  4. Twitter Heron: Stream Processing at Scale, Sigmod 15
  5. Chimera: Large-scale classification using machine learning, rules, and crowdsourcing, VLDB 14

Monitoring

  1. The Mystery Machine: End-to-end Performance Analysis of Large-scale Internet Services, OSDI 14
  2. Pivot Tracing: Dynamic Causal Monitoring for Distributed Systems, SOSP 15

Formal methods

  1. IronFleet: Proving Practical Distributed Systems Correct, SOSP 15

Friday, December 18, 2015

Paper summary: Detecting global predicates in distributed systems with clocks

This is a 2000 paper by Scott Stoller. The paper is about detecting global predicates in distributed systems.

There has been a lot of previous work on predicate detection (e.g., Marzullo & Neiger WDAG 1991, Verissimo 1993), but those works considered vector clock (VC) timestamped events sorted via the happened-before (hb) relationship. This paper proposes a framework for predicate detection over events timestamped with approximately-synchronized (think NTP) physical-time (PT) clocks.

This was a tough/deep paper to read, and a rewarding one as well. I think this paper should receive more interest from distributed systems developers, as it has applications to cloud computing monitoring services. As you can see in the Facebook stack and the Google stack, monitoring services are an indispensable component of large-scale cloud computing systems.


Using PT for timestamping (and predicate detection) has several advantages over using VC. VC is O(N), whereas PT is a scalar. For capturing hb, VC assumes that all communication occurs within the system being monitored and that there are no backchannels, which is not practical for today's large-scale integrated system of systems. Using PT alleviates the backchannel problem: even if there is no direct communication path, if event B happened in physical time later than event A, then we can still identify that event A can affect event B through out-of-band communication. Finally, using PT for timestamping allows devops to query the system in relation to physical time, which VC fails to support.

It turns out that using PT also provides benefits over using VC in terms of the complexity of predicate detection. The worst-case complexity for predicate detection using hb captured by VC is Ω(E^N), where E is the maximum number of events executed by each process, and N is the number of processes. On the other hand, the smaller the uncertainties in events' PT timestamps, i.e., the less the events overlap, the cheaper PT-based predicate detection gets. With some assumptions on the inter-event spacing being larger than the time synchronization uncertainty, the worst-case time complexity of PT-based predicate detection can be O(3^N E N^2) --- linear in E.

I will say more on this point and how using our hybrid clocks can further reduce cost of predicate detection at the end of this post.

Computation model

A local computation has the form e1, s1, e2, s2, e3, s3, where the ei are events, and the si are local states. A global state of a distributed system is a collection of local states.

Each event e has an interval timestamp its(e), which is an interval with lower endpoint lwr(e) and upper endpoint upr(e). PT timestamping should satisfy the following conditions.
TS1: ∀ e: lwr(e) ≤ upr(e)
TS2: ∀ e1 with a succeeding event e2 on the same process: lwr(e1) ≤ lwr(e2) and upr(e1) ≤ upr(e2)
TS3: ∀ e1,e2: if e1 occurred before e2 in real time, then lwr(e1) ≤ upr(e2)

TS3 is the strongest of these assumptions. Still, this is a fairly general and practical model of approximate time synchronization: the intervals of events need not all be equal across processes, or even along the events of the same process.
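
To make these conditions concrete, here is a minimal Python sketch that checks TS1 and TS2 over a trace (the (process, lwr, upr) event representation is my own, not the paper's; TS3 involves real-time occurrence order, which is not recoverable from the timestamps alone, so it is omitted):

```python
# Illustrative check of TS1 and TS2 over interval-timestamped events.
# An event is a hypothetical (process, lwr, upr) triple; events are listed
# in their local (per-process) order.

def ts1(events):
    # TS1: every interval is well-formed: lwr(e) <= upr(e)
    return all(lwr <= upr for _, lwr, upr in events)

def ts2(events):
    # TS2: along each process, both interval endpoints are non-decreasing
    by_proc = {}
    for proc, lwr, upr in events:
        by_proc.setdefault(proc, []).append((lwr, upr))
    return all(a[0] <= b[0] and a[1] <= b[1]
               for seq in by_proc.values()
               for a, b in zip(seq, seq[1:]))

trace = [("p1", 0.0, 1.0), ("p1", 0.5, 1.5), ("p2", 0.2, 0.9)]
print(ts1(trace), ts2(trace))  # True True
```

Note that the intervals of p1 and p2 need not line up or have equal widths, matching the generality of the model.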

Generic theory of consistent global states (CGS)

A consistent global state (CGS) is a global cut that could have occurred during the computation.

The paper shows that the CGS lattice theory is not specific to hb but applies to any generic partial ordering gb on events that satisfies TS2. Two local states are concurrent if they are not related by gb. A global state is consistent with respect to gb if its constituent local states are pairwise concurrent:
consis_gb(g) ≡ ∀ i, j: i ≠ j: ¬(g(i) gb g(j))

By adopting PT instead of VC for timestamping and ordering, the paper defines two ordering relations db ("definitely occurred before") and pb ("possibly occurred before") among events as follows.

e1 db e2 ≡ upr(e1) < lwr(e2)

e1 pb e2 ≡ ¬ (e2 db e1)
Notice how pb is elegantly defined in terms of db.
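
These definitions are easy to code up. A small Python sketch, representing an event simply by its (lwr, upr) interval (my own simplification), together with the pairwise-concurrency test that feeds the CGS definition:

```python
# db ("definitely occurred before") and pb ("possibly occurred before")
# over interval timestamps, with an event modeled as a (lwr, upr) pair.

def db(e1, e2):
    # e1 db e2 iff e1's interval ends strictly before e2's begins
    return e1[1] < e2[0]

def pb(e1, e2):
    # e1 pb e2 iff e2 did not definitely occur before e1
    return not db(e2, e1)

def concurrent(e1, e2):
    # two events are concurrent under db if neither definitely precedes
    # the other -- the ingredient of the consis_gb test
    return not db(e1, e2) and not db(e2, e1)

a, b, c = (0, 2), (3, 5), (1, 4)
print(db(a, b))               # True: 2 < 3
print(db(a, c), pb(a, c))     # False True: the intervals overlap
print(pb(c, a))               # True: overlapping events are pb both ways
print(concurrent(a, c))       # True
```

Overlapping events being pb-related in both directions is exactly what removes alternate interleavings in the pb-based view.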

By substituting the db and pb orderings into the generic CGS framework above, the paper obtains definitions of 3 distinct detection modalities: Poss-db θ ("θ possibly held"), Def-db θ ("θ definitely held"), and Inst θ ("θ definitely held at a specific instant"). We discuss these detection modalities and efficient detection algorithms for them in the next two sections.

Detection based on strong event ordering, db

Substituting the db ordering into the generic theory of CGS, we get Poss-db and Def-db. A computation satisfies Poss-db Q iff there exists some interleaving of events that is consistent with db and in which the system passes through a global state satisfying predicate Q. A computation satisfies Def-db Q iff in every interleaving of events consistent with db, the system passes through a global state satisfying Q. These two are analogous to Poss-hb and Def-hb for the hb relation in the CGS framework. Figure 1 illustrates the nuance between Poss-db and Def-db. If Q holds only at CGS <3,1>, then Poss-db holds, but Def-db does not, since the computation might have taken the <2,2> path instead of the <3,1> path, and Q does not hold at <2,2>. Def-db would hold if Q held at both <3,1> and <2,2>. Alternatively, if Q holds only at CGS <2,1>, then Def-db still holds.

As we mentioned in the beginning, the quality of time synchronization affects the cost of predicate detection. If the bounds on the clock offsets are comparable to or smaller than the mean interval between events that potentially truthify or falsify θ, then the number of global states that must be checked is comparable to the number of global states that the system actually passed through during execution, which is O(NE). In contrast, the number of global states considered in the asynchronous hb case is O(E^N).

Assume the interval between consecutive events at a process is at least τ.
The worst-case time complexities for Poss-db and Def-db are:
- if τ > 2ε: O(3^N E N^2)
- if τ ≤ 2ε: O((4ε/τ + 2)^{N-1} E N^2)

This second regime is the problematic zone. For τ << ε, the time complexity of the detection algorithm grows quickly. The figure below illustrates this, showing how the number of CGSs changes with respect to E and the ratio μ/ε, where μ is the mean inter-event time.
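
A quick back-of-the-envelope calculation shows how sharply the (4ε/τ + 2)^{N-1} factor grows as τ shrinks below ε (the ε and τ values below are made up for illustration):

```python
# Growth of the (4*eps/tau + 2)**(N-1) factor from the tau <= 2*eps bound,
# for made-up values: eps = 10 ms uncertainty, N = 5 processes.

def states_factor(eps, tau, n):
    return (4 * eps / tau + 2) ** (n - 1)

eps, n = 10.0, 5
for tau in (20.0, 10.0, 2.0, 0.5):   # inter-event spacing in ms
    print(tau, states_factor(eps, tau, n))
# tau=20 gives 256, tau=10 gives 1296, tau=0.5 gives ~4.5e7
```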

Detection based on weak event ordering, pb

In contrast to db, pb is a total order and not a partial order.

Substituting the pb ordering into the generic theory of CGS, we get Poss-pb and Def-pb. In fact, pb collapses Def-pb and Poss-pb into one modality, Inst. This is because there are no alternate paths in the pb computation. pb looks at inevitable global states, i.e., SCGSs. The computation below has only two SCGSs, (1,2) and (3,2).

This is similar to the Properly detection modality of Fromentin and Raynal (1997). The computation contains O(NE) local states and there are only O(NE) SCGSs. The worst-case time complexity of the algorithm is O((N log N) E).

This is really low cost. So, where is the catch?

It turns out Inst detection is not complete: Inst may miss some CGSs, as illustrated below. (The paper doesn't explicitly discuss this, so it gave me some confusion before I could figure it out.)

How would hybrid time/clocks benefit

In Stoller's world, you need to make a binary choice beforehand: go either with VC-based or PT-based timestamping and detection.

But going with VC becomes disadvantageous in many cases, such as when there isn't enough communication to introduce enough hb restrictions. PT can also become very disadvantageous, when μ/ε << 1. (Figure 6 shows how quickly the number of CGSs to consider can grow in this region.)

Even worse, within the same system VC can become disadvantageous in some regions/phases of the computation and PT in others. (You can have bursts of communication with closely spaced events within ε. So for an ε of 10ms, you can have several regions where μ/ε << 1. This increases the complexity of Def-db and Poss-db greatly, especially for large N.)

We recently introduced hybrid clocks, and in particular hybrid vector clocks (HVC). With HVC you don't have to choose one over the other; HVC offers you the lower-cost detection of VC and PT at any point. Moreover, while VC is of O(N) size, with HVC, thanks to the loosely-synchronized clocks assumption, it is possible to keep the size of HVC to a couple of entries at each process. HVC captures the communications in the timestamps and provides the best of the VC and PT worlds.
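
To give a flavor of the idea, here is a rough toy sketch of an HVC-like structure (my illustrative simplification, not the actual algorithm from our paper): entries older than the synchronization uncertainty ε carry no information beyond physical time, so they can be pruned.

```python
# Toy hybrid-vector-clock-like structure: a vector clock whose stale entries
# (older than the clock synchronization uncertainty eps) are pruned, since
# physical time already orders them. Illustrative only.

class HVC:
    def __init__(self, pid, eps):
        self.pid, self.eps = pid, eps
        self.entries = {}                       # process id -> timestamp

    def tick(self, now):
        # local event: advance own entry, then drop stale entries
        self.entries[self.pid] = now
        self.entries = {p: t for p, t in self.entries.items()
                        if t > now - self.eps}

    def merge(self, other, now):
        # message receipt: element-wise max with sender's entries, then tick
        for p, t in other.entries.items():
            self.entries[p] = max(self.entries.get(p, t), t)
        self.tick(now)

a = HVC("a", eps=10)
a.tick(now=100)
b = HVC("b", eps=10)
b.tick(now=105)
b.merge(a, now=106)
print(sorted(b.entries))   # ['a', 'b']: a's entry is still fresh
b.tick(now=120)
print(sorted(b.entries))   # ['b']: a's entry aged out, physical time suffices
```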

We are investigating these issues with my colleague Sandeep Kulkarni at Michigan State, as part of our NSF supported project on highly auditable distributed systems.

It also looks like the db and pb relations should have applications to linearizability/serializability in distributed databases; it will be interesting to investigate this further.

Monday, November 9, 2015

Thursday, November 5, 2015

HPTS trip report (day 2)

This is a belated wrap-up of my HPTS trip report. Read the first part of the HPTS trip report here, if you haven't already.

The first day of HPTS was a very busy, productive, and tiring day. Since my body clock was still on East Coast time, I was wide awake at 6am again. I went for a longer run this time. I ran for 5 miles along the beach on the aptly named Sunset drive. Monterey Bay, Asilomar is an eerily beautiful piece of earth.

After breakfast, HPTS started with the data analytics session. That night it was announced that Chris Re (from Stanford) had won a MacArthur genius award. Chris was scheduled to talk in this first session of HPTS day 2, and it was a very engaging talk: he talked about the DeepDive macroscopic data system they have been working on.

DeepDive is a system that reads texts and research papers and collects macroscopic knowledge about a topic with a quality that exceeds that of paid human annotators and volunteers. This was surprising news to me; another victory for the singularity team. (I had written about the singularity before. It is coming.)

They demonstrated this system by extracting paleobiological facts to build a high-coverage fossil record through reading/curating research papers published on the topic. There had been human volunteers (are there any other kind of volunteers?) working on the paleo fossil project: it took 329 volunteers 13 years to go through 46K documents and produce fossil record tables. The Paleo DeepDive system processed 10x the documents and produced 100x the extractions, with better precision and recall than the humans. All this in 45 minutes of runtime (over 80 cores, 1 TB RAM, using SQL processing and statistical operations), instead of two decades of human reading.
Holy cow!

Imagine this being applied/integrated to Google Scholar searches. Wow! That would be transformative. There are already some developments toward smarter (more semantic-based) indexing/search of research papers.

I am not a machine learning guy, but from what I gathered, DeepDive uses a probabilistic bootstrapping approach to refine its guesses and make them precise. DeepDive uses distant/weak supervision: the user writes some crappy rules that serve as crappy training data, then DeepDive progressively denoises the data and learns more.

They are using DeepDive in other domains to accelerate science, such as drug repurposing and genomics. Another interesting application of DeepDive is fighting human trafficking by analyzing the dark web. Here is a talk by Chris on the DeepDive system. (Not from the HPTS venue.)

After Chris's talk, Todd Mostak, CEO of MapD, gave a talk about a GPU-powered end-to-end visual analytics platform for interactively exploring big datasets. Todd was Sam Madden's PhD student; his startup evolved from his thesis. His demo looked interesting and fluid/quick.

The other sessions were on Big Data platforms, Distributed System platforms, and storage. The distributed system platform session was dominated by talks about containers and Docker.

So what was all the buzz about at HPTS'15? It was about NVRAMs. That NVRAMs are becoming widely available off-the-shelf is leading us toward new applications and design choices. Similarly, RDMA was attracting excitement. Of course, there was a lot of talk about HPTS's namesake, high-performance transaction systems, and some about in-memory databases.

Overall, HPTS was a great experience. I am amazed by the sheer volume and quality of interactions HPTS cultivated through hallway conversations. Keeping the workshop size limited to at most 100 participants helped. Providing ample opportunities for conversations helped. I met or talked to at least half of the 100 workshop attendees over the two days. And I am a shy person. Comparing the HPTS experience with other conference experiences, the difference gets starker. I hope more conferences can learn from HPTS and manage to improve their interaction volume and quality. That is why we travel many hours to go to conferences, after all.

Wednesday, October 7, 2015

HPTS trip report (days 0 and 1)

Last week, from Sunday to Tuesday night, I attended the 16th International Workshop on High Performance Transaction Systems (HPTS). HPTS is an unconventional workshop. "Every two years, HPTS brings together a lively and opinionated group of participants to discuss and debate the pressing topics that affect today's systems and their design and implementation, especially where scalability is concerned. The workshop includes position paper presentations, panels, moderated discussions, and significant time for casual interaction. The only publications are slide decks by presenters who choose to post them." HPTS is by invitation only and keeps itself under 100 participants. The workshop brings together experts from both industry and academia so they mix and interact. Looking at the program committee, you can see names of entrepreneurs and venture capitalists (David Cheriton, Sequoia Capital), large web companies (Google, Facebook, Salesforce, Cloudera), and academics. HPTS is a legacy of Jim Gray, and its regular participants include Mike Stonebraker (Turing Award winner) and C. Mohan (IBM).

HPTS travel and venue

HPTS is always held at the same venue, Asilomar Conference Grounds, Pacific Grove, CA. My flights Sunday morning (JetBlue BUF-JFK, JFK-SFO) were smooth and on time. I even got to do some writing on the JFK-SFO flight. Since Asilomar is not easily accessible from SFO (or the San Jose airport, for that matter), I had to rent a car. I took the scenic Route 1 for my drive. It was a 3-hour drive; I stopped a couple of times to take pictures. I made it to the Asilomar conference center at 5pm, with just enough time to settle in before dinner at 6pm.

The Asilomar conference grounds sit right on the edge of Monterey Bay, overlooking the Pacific, next to white sand dunes (a nature reserve) and a nice beach. The barracks were run down and showing their age. There is a dining hall in a separate building, where the HPTS participants dined together as a group (breakfast/lunch/dinner). The talks were held in the chapel, another building close to the dining hall. The program was so full that we would go to our rooms only for sleeping, and that too briefly. After dinner, there was a social hour the first night and lightning talks the next 2 nights, all accompanied by beer and wine. And after 9pm, the crowd moved to the Director's cottage, a fancy vacation house, for chatting till midnight, lubed by whisky, wine, and beer (see, the order is different this time). Then breakfast starts at 7:30am; rinse and repeat each day.

HPTS first night

So Sunday evening, at the first dinner, I somehow sat right next to David Cheriton. He is smart and curious. He asked me to explain my work, but I wasn't able to communicate the motivation for our hybrid clocks work. I suspect this was partly because we don't share the same terminology/background, and partly because I was unprepared to explain the motivation from a database industry perspective. David was interested and persistent and pushed to understand the problem completely, a trait shared by ultra-successful people like him. After 10 minutes, I was the party to quit, and told David that I hoped to clarify these points in my lightning talk, which I then proceeded to botch up Monday night :-(

There was a meet and greet after the dinner, from 7-9pm, at the chapel. I am actually a shy guy, but I made an effort to meet people. When I saw a crowd of 3 or more people, I joined to listen and participate. I had nice conversations with the Salesforce crew. I asked them a lot of questions and learned a lot.

At 9pm, the group moved to the director's cottage. I was very tired from the JFK-SFO flight and the 3 hour drive, so after hanging around in the director's cottage for 10 minutes, I went to bed. Since I was jetlagged I woke up at 4am, tried to sleep again but woke up at 6 am. I went for a run, for 3.2 miles. It felt good. Sometimes the best way to fight off exhaustion is by exercising.

Monday morning sessions

Pat Helland (Salesforce) opened the workshop. He said that the workshop is a platform for the academics and practitioners of databases to interact and exchange ideas. He urged participants to make good use of the 30-minute coffee breaks between the sessions: "actually the sessions are there to punctuate the breaks" :-). Phil Bernstein (Microsoft) asked us to refrain from live blogging or tweeting, as some speakers may talk about confidential technology. Speakers who wish to release their slide decks do so after a week, to be posted on the HPTS website. Checking back, I see that a third of the talk slides are posted, and almost all the lightning talk slides are posted.

Here are some of the interesting talks from the two sessions (transactions and applications sessions) Monday morning.

The talk "A1 and FARM: scalable graph database on top of a transactional memory layer" from Microsoft Research was about a high-performance graph database platform enabled by three hardware trends: 1. inexpensive DRAM (currently $8/GB; machines with 128GB; a container will hold more than 100TB), 2. non-volatile RAM (DRAM + battery + SSD), and 3. fast commodity networks with RDMA becoming available. (It turns out the Microsoft Research group has an SOSP 15 paper describing this system.)

Kyle Kingsbury, a computer safety researcher at Stripe, gave a nice talk on his Jepsen toolkit for blackbox verification of systems, which works by injecting network partitions and testing basic invariants.

"The Quick and the Dead" talk by James Barrese (PayPal) described PayPal's technology initiatives for creating a PaaS platform. These were motivated by a need to innovate quickly and to automate everything.

"From Microservices to Teraservices" talk by Adrian Cockcroft (Battery Ventures) described how microservices and containerization are useful for accelerating innovation by enabling daily releases. Adrian defined a microservice as a loosely coupled service-oriented architecture with bounded contexts.

Monday afternoon sessions

In "Data Federations: An Idea Whose Time Has Come Again", Michael Stonebraker (MIT) talked about their BigDAWG polystore platform, which will be released on GitHub in a couple of months.

"From Trash to Treasure" talk by Pat Selinger (Paradata) was about how to clean and integrate dirty customer data with duplicates, missing values, corrupted values, and invalid values.

I really liked Rodrigo Fonseca's (Brown Univ.) talk on causal metadata tracking in distributed systems. 

Eric Grosse (Google Security Team) gave an informative talk titled "Security Lessons from the Front Lines".

Monday evening lightning talks

The Monday evening lightning talk slides are available here. The lightning talks are 5 minutes each. In my talk, I somehow used up 3 minutes on the first 2 slides. After I got the "last 2 minutes" warning, I rushed through a couple more slides, waving my arms frantically :-) I should have prepared and practiced before the talk. I was upset about how the talk went, but several people (including Phil Bernstein of Microsoft Research) showed interest and approached me later to learn more about the hybrid clocks, which made me feel better.

I will write about day 2, newly-minted MacArthur genius Chris Re's talk, and the overall buzz at HPTS in a couple of days. I hope more slides will be added to the HPTS site by then.

Tuesday, October 6, 2015

Consensus in the wild

The consensus problem has been studied extensively in the theory of distributed systems literature. Consensus is a fundamental problem in distributed systems: it requires n nodes to eventually agree on the same decision. The consistency part of the specification says that no two nodes may decide differently. Termination states that all nodes eventually decide. And non-triviality says that the decision cannot be static (you need to decide a value among the inputs/proposals to the system; you can't keep deciding 0, discarding the inputs/proposals). This is not a hard problem if you have reliable and bounded-delay channels and processes, but it becomes impossible in the absence of either. And with even a temporary violation of the reliability and timing/synchronicity assumptions, a consensus system can easily hit corner cases where consistency or termination is violated. E.g., 2-phase commit is blocking (this violates termination), and 3-phase commit is unproven and has many corner cases involving the old leader waking up in the middle of the execution of the new leader (this violates consistency).

Paxos appeared in 1989 and provided a fault-tolerant solution to consensus. Paxos dealt with asynchrony, process crash/recovery, and message loss in a uniform and elegant algorithmic way. When web-scale services and datacenter computing took off in the early 2000s, fault-tolerant consensus became a practical concern. Google started to run into corner cases of consensus that introduced downtime. Luckily, Google had people with academic backgrounds in distributed systems (like Tushar Chandra) who knew what to do. The Paxos algorithm was adopted at Google in the Chubby lock service, and used in the Google File System and for replicating the master node in MapReduce systems. Then Paxos, the algorithm only distributed systems researchers knew about, got popular in the wild. Several other companies adopted Paxos, and several open-source implementations appeared.

(Had we not had a well-engineered, robust algorithm for consensus in the form of Paxos, what would have happened? It would probably be a mess, with many groups coming up with their own implementations of a consensus protocol, each buggy in some small but significant manner.)

My student Ailidani and I are working on a survey of consensus systems in the wild. We compare different flavors of the Paxos consensus protocol with their associated advantages and drawbacks. We also survey how consensus protocols got adopted in industry and for which jobs. Finally, we discuss where Paxos is used in an overkill manner, where a consensus algorithm could be avoided or tucked out of the main/critical path (consensus is expensive, after all).

Paxos flavors

There are three main/popular flavors: classical multi-Paxos, ZooKeeper Zab protocol, and Raft protocol.

The classical multi-Paxos protocol is nicely reviewed and presented in Robbert van Renesse's "Paxos Made Moderately Complex" paper.
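
For orientation, at the heart of all these flavors sits the classic single-decree acceptor rule; here is a minimal illustrative sketch (proposer side, persistence, and networking omitted):

```python
# Minimal single-decree Paxos acceptor, for illustration only. It captures
# the two promises an acceptor makes: ignore lower ballots, and report any
# value it has already accepted. Real implementations persist this state.

class Acceptor:
    def __init__(self):
        self.promised = -1          # highest ballot promised so far
        self.accepted = None        # (ballot, value) last accepted, if any

    def prepare(self, ballot):
        # phase 1: promise not to accept anything below this ballot
        if ballot > self.promised:
            self.promised = ballot
            return ("promise", self.accepted)
        return ("nack", None)

    def accept(self, ballot, value):
        # phase 2: accept unless a higher ballot has been promised
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted = (ballot, value)
            return "accepted"
        return "nack"

a = Acceptor()
print(a.prepare(1))        # ('promise', None)
print(a.accept(1, "x"))    # 'accepted'
print(a.prepare(0))        # ('nack', None) -- lower ballot rejected
print(a.prepare(2))        # ('promise', (1, 'x')) -- reports prior accept
```

The "report prior accept" response in phase 1 is what forces a new leader to re-propose any possibly chosen value, and it is the hinge on which all the flavors' leader-change machinery turns.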

Zab is used in ZooKeeper, the popular "coordination kernel". ZooKeeper is used by Hadoop (replicating the master in HDFS and MapReduce), and in industry for keeping/replicating configurations (Netflix, etc.).

Raft provides a flavor of Paxos very similar to Zab. It comes with a focus on understandability and simplicity and has seen several opensource implementations.

Differences between Paxos and Zab

Zab provides consensus via an atomic broadcast protocol. Zab implements a primary process as the distinguished leader, which is the only proposer in the system. Log entries flow only from this leader to the acceptors.

The leader election in Paxos can be concurrent with the ongoing consensus requests/operations, and multiple leaders may even get requests proposed and accepted. (Mencius/EPaxos systematize this and use it to improve throughput.) In contrast, in Zab, a new leader cannot start proposing a new value before it passes a barrier function, which ensures that the leader has the longest commit history and that every previously proposed value is committed at each acceptor. This way, Zab divides time into three sequential phases.

Another major difference between Zab and Paxos is that the Zab protocol also includes client interaction, which introduces an additional ordering guarantee: per-client FIFO order. All requests from a given client are executed in the order they were sent by the client. No such guarantee holds with Paxos.

Differences between Zab and Raft

There isn't much difference between Zab and Raft. ZooKeeper provides a filesystem-like API and hierarchical znodes, whereas Raft does not specify the state machine. On the whole, if you compare Zab (the protocol underlying ZooKeeper) and Raft, there aren't any major differences in each component, only minor implementation differences.

Abusing Paxos consensus

1) Paxos is meant to be used as fault-tolerant storage of *metadata*, not data. Abusing Paxos for replicated storage of data will kill the performance.

Apache Giraph made this mistake with aggregators. (This was mentioned in Facebook's recent Giraph paper.) In Giraph, workers would write partially aggregated values to znodes (ZooKeeper's data storage) and the master would aggregate these and write the final result back to its znode for the workers to access. This wasn't scalable due to ZooKeeper's write throughput limitations, and it caused a big problem for Facebook, which needed to support very large aggregators.

In the same vein, using Paxos for a queueing or messaging service is a bad idea. When the number of messages increases, performance doesn't scale.

What is the right way to approach this then? Use chain replication! Chain replication uses Paxos for fault-tolerant storage of metadata: "the configuration of replicas in the chain". Replication/storage of the data itself occurs in the chain, without involving Paxos. This way, Paxos doesn't get invoked with every piece of data entering the system. Rather, it gets invoked rarely, only when a replica fails and a new configuration needs to be agreed upon.
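
The division of labor can be sketched with a toy model; the config list below stands in for the chain configuration agreed on via Paxos, and only fail() would trigger a consensus round:

```python
# Toy sketch of chain replication's division of labor: consensus stores only
# the chain configuration (metadata); writes flow head-to-tail through the
# chain without touching consensus. Illustrative only.

class Chain:
    def __init__(self, replicas):
        self.config = list(replicas)      # agreed on via consensus (rarely)
        self.stores = {r: {} for r in replicas}

    def write(self, key, value):
        # data path: propagate down the chain, no consensus involved
        for r in self.config:
            self.stores[r][key] = value

    def read(self, key):
        # reads go to the tail, which holds only fully replicated writes
        return self.stores[self.config[-1]].get(key)

    def fail(self, replica):
        # only a failure triggers consensus: agree on a new configuration
        self.config.remove(replica)       # (stand-in for a Paxos round)

c = Chain(["head", "mid", "tail"])
c.write("k", 1)
print(c.read("k"))   # 1
c.fail("mid")
c.write("k", 2)
print(c.read("k"))   # 2
```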

Apache Kafka and Bookkeeper are built on this principle and are the right way to address the two scenarios above.

2) Paxos implies serializability, but serializability does not imply Paxos. Paxos provides a total order on operations/requests replicated over k replicas and can be overkill for achieving serializability, for two reasons. First, Paxos's true goal is fault-tolerant replication, and serialization is only its side effect. If you just need serializability and don't need fault-tolerant replication of each operation/request, then Paxos slows you down. Secondly, Paxos gives you a total order, but serializability does not require one. A partial order that is serializable is good enough and gives you more options.

Monday, October 5, 2015

Analysis of Bounds on Hybrid Vector Clocks

This work is in collaboration with Sorrachai Yingchareonthawornchai and Sandeep Kulkarni at the Michigan State University and is currently under submission.

Practice of distributed systems employs loosely synchronized clocks, mostly using NTP. Unfortunately, perfect synchronization is unachievable due to messaging with uncertain latency, clock skew, and failures. These sync errors lead to anomalies. For example, a send event at Branch1 may be assigned a timestamp greater than the corresponding receive event at Branch2, because Branch1's clock is slightly ahead of Branch2's clock. This leads to inconsistent state snapshots because, at time T=12:00, a money transfer is recorded as received at Branch2, whereas it is not recorded as sent at Branch1.

Theory of distributed systems shrugs and doesn't even try. Theory abstracts away from the physical clock and uses logical clocks for ordering events. These are basically counters, as in Lamport's clocks and vector clocks. The causality relationship captured by these logical clocks is defined based on passing of information rather than passing of time. As such, it is not possible to query events in relation to physical time.

Recently, we introduced a third option, hybrid clocks. Hybrid clocks combine the best of logical and physical clocks and avoid their disadvantages. Hybrid clocks are loosely synchronized using NTP, yet they also provide provable comparison conditions as in LC or VC.

Our hybrid clocks come in two flavors: hybrid logical clocks (HLC) and hybrid vector clocks (HVC). HLC satisfy the logical clock comparison condition and find applications in multiversion distributed database systems (such as in CockroachDB) where it enables efficient querying of consistent snapshots for read transactions, and ensures that commits of write transactions do not get delayed despite the uncertainties in NTP clock synchronization. HVC satisfy the vector clock comparison condition: in contrast to HLC that can provide a single consistent snapshot for a given time, HVC provide all consistent snapshots for that given time. HVC find applications in debugging and in causal delivery of messages.

The space requirement of VC is shown to be of size n, the number of nodes in the system, which is prohibitive. HVC reduces the overhead of causality tracking in VC by using the fact that the clocks are reasonably synchronized within epsilon. If j does not hear (directly or transitively) from k within epsilon, then hvc.j[k] need not be explicitly maintained; thanks to clock synchronization, we can still infer that hvc.j[k] equals hvc.j[j]-epsilon. So hvc.j only maintains entries for nodes that talked to j within the last epsilon and provided a fresh timestamp higher than hvc.j[j]-epsilon. This way HVC can potentially scale the VC benefits to many thousands of processes while still maintaining a small HVC at each process.
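The maintenance rule above can be sketched as follows (a simplified illustration with hypothetical names; the actual HVC algorithm in the paper also manages a logical-clock component as in HLC):

```python
# Simplified sketch of HVC entry pruning: only entries fresher than
# self_time - EPSILON are kept; a missing entry is read back as
# self_time - EPSILON, which clock synchronization makes safe.

EPSILON = 10  # assumed clock synchronization uncertainty

class HVC:
    def __init__(self, node_id, now):
        self.id = node_id
        self.entries = {node_id: now}  # sparse vector: only fresh entries

    def get(self, k):
        # Absent entries are inferred from clock synchronization.
        return self.entries.get(k, self.entries[self.id] - EPSILON)

    def merge(self, other_entries, now):
        # On message receipt: advance own entry, take fresher remote entries.
        self.entries[self.id] = max(self.entries[self.id], now)
        for k, t in other_entries.items():
            if t > self.get(k):
                self.entries[k] = t
        self._prune()

    def _prune(self):
        # Drop entries that fell out of the epsilon-freshness window.
        cutoff = self.entries[self.id] - EPSILON
        self.entries = {k: t for k, t in self.entries.items()
                        if k == self.id or t > cutoff}
```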

HVC bounds

But how effective are HVC for reducing the size of VC? To address this question, we developed an analytical model that uses four parameters, epsilon: uncertainty of clock synchronization, delta: minimum message delay, alpha: message sending rate, and n: number of nodes in the system. We use a model with random unicast message transmissions and derive the size of HVC in terms of a delay differential equation.

This differential equation captures the rate of propagation of "redness". Red means the node maintains an entry for node j. Initially only j is red, and all other nodes are green. If a red node communicates a message to a green node, and the message is received within epsilon, that node also becomes red (i.e., it starts maintaining an entry for j in its hvc). A red node may turn back to green if it doesn't receive a message containing fresh information about j within the last epsilon.

Our model and simulations show that the HVC size is a sigmoid function with respect to increasing epsilon: it has a slow start but grows exponentially after a critical phase transition. Before the phase transition threshold, HVC maintains only a couple of entries per node; however, when the threshold is crossed, a node not only gets entries added to its clock from direct interactions but also through indirect transfer from other processes' HVCs, and this makes the number of HVC entries blow up. In other words, the redness goes viral after a threshold. We derive this threshold as (1/alpha + delta)*ln((2-√3)*(n-1)), for alpha*delta<1.
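For concreteness, here is a small sketch that evaluates this threshold formula; the parameter values are illustrative picks, not numbers from the paper.

```python
import math

def hvc_threshold(alpha, delta, n):
    """Epsilon at which HVC size blows up, per the formula above:
    (1/alpha + delta) * ln((2 - sqrt(3)) * (n - 1)), valid for alpha*delta < 1."""
    assert alpha * delta < 1, "formula holds only for alpha*delta < 1"
    return (1.0 / alpha + delta) * math.log((2 - math.sqrt(3)) * (n - 1))

# Illustrative parameters (hypothetical): n = 1000 nodes,
# alpha = 1 message per time unit, delta = 0.1 minimum message delay.
threshold = hvc_threshold(alpha=1.0, delta=0.1, n=1000)
```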

Our bounds are tight, and we find that the size predicted by our model is almost identical to the results obtained from our simulations.

Tuesday, September 15, 2015

Serving at NSF panels and what it teaches about how to pitch the perfect proposal

NSF is one of the largest funding sources for academic research. It accounts for about one-fourth of federal support to academic institutions for basic research. NSF receives thousands of proposals from researchers and organizes peer-review panels to decide which ones to fund.

Serving at NSF panels is fun. It is also very useful for understanding the dynamics of proposal review. NSF funding rates are around 10% for computer science and engineering research proposals, so understanding how panels work helps when applying to NSF to secure funding.

How do you get invited as a panelist? 

You get invited to serve at an NSF panel by the program director of that panel. (Program directors are researchers generally recruited from academia to serve at NSF for a couple of years to run panels and help make funding decisions.)

If you have been around and have successfully secured NSF funding, you will get panel invitations; they will have your name and will contact you. But if you are new, don't just wait. You can email program managers at NSF in your topic area and ask them to consider inviting you as a panelist, because you need the experience. This doesn't always work, but it helps. I found this from NSF about volunteering as a reviewer at a panel.

Preparation for a panel: Reading, reading, reading, writing reviews

As a panelist, you will be assigned around 8 proposals to review in 3 weeks. Each proposal body consists of 15 pages, so that means a lot of reading over the next couple of weeks. And it can get boring if you just read idly. You should try to read actively: engage with the proposal, and write notes as you read that will help you prepare your review.

Tips on reading papers also apply somewhat to reading proposals, but proposals have their peculiarities. Proposals need a nicely motivated vision and clearly defined research problems, but they don't need all the solutions worked out. Instead of full-fledged solutions, a proposal provides draft solution approaches and insights into its competitive advantages for attacking these research questions.

Pitching a successful proposal is an art. I provide some tips at the end of this post.

NSF panel structure

You travel to NSF headquarters in Washington, DC the day before the panel. (NSF pays for your travel and reimburses your stay.) Panels generally last one and a half days. On the first day, all the proposals get discussed. (Actually, some proposals that don't receive any "Very Good" ratings can be triaged/skipped; for those, only the 3 reviewer reports are sent back, without a panel discussion summary.) The second day (i.e., the half day) is for preparing and reviewing the panel discussion summaries of the proposals discussed on the first day.

In the panel, everybody is smart and knowledgeable in their fields. Of course some are smarter and more knowledgeable, and it is a treat to listen to them discuss proposals. If the panel is in your narrow field of expertise, it is gonna be a geek-fest for you: you will geek out and have a lot of fun. If the panel is in a field you are familiar with but that is not your specialized field of expertise, it is still a lot of fun, as you get to learn new things.

At the end of the first day, proposals are sorted into 4 categories: HC (Highly Competitive), C (Competitive), LC (Low Competitive), and NC (Not Competitive). HC proposals get funded; usually only 1-2 proposals out of the 20+ make it to HC. The C proposals need to be compared and ranked, which generally happens on the morning of the second day. Only the top 1-2 proposals in C would get funding. LC and NC proposals do not need to be sorted.

The panel does not make the final funding decision; it only provides feedback to NSF, which makes the final funding decisions. NSF is fairly transparent and mostly goes with the panel's recommendations. If a proposal is not funded, the proposers still get detailed reviews with the rationale of the panel. In contrast, some other funding agencies, such as DARPA and DOE, may not even provide you with reviews/evaluations of your proposal.

NSF panels are tiring. You sit for one and a half days, listening and discussing. And on the evening of the first day, you often have homework (hotelwork?) to read some extra proposals to help with comparing/ranking. Instead of trying to multitask during the panel (answering your email, or reading other stuff), it is much better to just participate in the discussion and listen to the discussion of the other proposals, even the ones you have not reviewed. After traveling all that distance to NSF headquarters, you should try to savor the panel as much as possible.

Panel discussions, interesting panel dynamics

Panelists can make mistakes and may have biases. Common mistakes include the following:
  • Panelists may play it safe and prefer incremental proposals over novel but risky proposals. Program managers sometimes warn about this bias, and try to promote high-risk/high-reward proposals.
  • Sometimes panelists may read too much into a proposal. They may like to write/complete the proposal in their heads, and give more credit than deserved to the proposal.
  • Panelists may be overly conformist, resulting in groupthink.
  • There can be a good arguer, a charismatic person who dominates the other panelists.

Lessons for pitching the perfect proposal to the panel

Make sure your proposal has a novel research component and intellectual merit. Your proposed project should "advance knowledge and understanding within its own field" and "explore creative, original, or potentially transformative concepts". You need to get at least one panelist excited for you. So stand for something.

Be prepared to justify/support what you stand for. You can fool some panelists some of the time, but you can't fool all of the panelists all of the time. Don't make promises you can't deliver. Show preliminary results from your work. It is actually better to write the proposal after you write an initial interesting (workshop?) paper on the topic.

Target the correct panel and hit the high notes in the CFP. If your proposal falls into the wrong panel, it will get brutally beaten. When in doubt about the scope and field of a CFP, contact the program director to get information. Most academic sub-disciplines/communities will have their biases and pet peeves. You want to target a panel that will get your proposal and is not adversarial to those ideas.

Write clearly and communicate clearly. Remember, the panelists are overworked: they need to review 8 proposals over a short time, and it gets boring. So make it easy for the reviewer; don't make the reviewer do the work. Spell out the contributions and novelty clearly, and put your contributions in the context of the literature on the topic. If you make the reviewer work, you will leave them angry and frustrated.

Don't forget to write about the proposal's broader impact including education and minority outreach. If you omit it, it will bite you back.

All being said, there is still a luck factor. An adversarial or cranky panelist may ruin your proposal's chances, or a panelist that loves your work may make your case and improve your chances. The acceptance rate is under 10%. So good luck!

Disclaimer: These are of course my subjective views/opinions as an academician that participated in NSF peer-review panels. My views/opinions do not bind NSF and may not reflect NSF's views/stance.

paper summary: One Trillion Edges, Graph Processing at Facebook-Scale

This paper was recently presented at VLDB15.
A. Ching, S. Edunov, M. Kabiljo, D. Logothetis, S. Muthukrishnan, "One Trillion Edges: Graph Processing at Facebook-Scale." Proceedings of the VLDB Endowment 8.12 (2015).

This paper is about graph processing. Graphs provide a general flexible abstraction to model relations between entities, and find a lot of demand in the field of big data analysis (e.g., social networks, web-page linking, coauthorship relations, etc.)
You think the graphs in Table 1 are big, but Frank's laptop begs to differ. These graphs also fail to impress Facebook. At Facebook, they work with graphs of a trillion edges, 3 orders of magnitude larger than these. How would Frank's laptop fare on this? @franks_laptop may step up to answer that question soon. This paper presents how Facebook deals with these huge graphs of one trillion edges.

Apache Giraph and Facebook

In order to analyze social network data more efficiently, Facebook considered some graph processing platforms including Hive, GraphLab, Giraph in the summer of 2012. They ended up choosing Apache Giraph for several reasons: it is open source, it directly interfaces with Facebook's internal version of HDFS and Hive, it is written in Java, and its BSP model is simple and easy to reason about.

(The BSP model and Pregel, which Apache Giraph derived from, was covered in an earlier post of mine. You can read that first, if you are unfamiliar with these concepts. I have also summarized some of the Facebook data storage and processing systems before, if you like to read about them.)

However, choosing Apache Giraph was not the end of the story. Facebook was not happy with the state of Apache Giraph, and extended, polished, and optimized it for production use. (And of course Facebook contributed these changes back to the Apache Giraph project as open source.) This paper explains those extensions.

Significant technical extensions to Giraph

Several of Facebook's extensions were done to generalize the platform. Facebook extended Giraph's original input model, which required a rather rigid and limited layout (all data relative to a vertex, including its outgoing edges, had to be read from the same record and were assumed to exist in the same data source), to enable flexible vertex-based and edge-based input. Facebook added parallelization support for running more workers per machine, and introduced worker-local multithreading to take advantage of additional CPU cores. Finally, Facebook added memory optimizations that serialize the edges of every vertex into a byte array rather than instantiating them as native Java objects.

I was more interested in their extensions to the compute model, which I summarize below.

Sharded aggregators

The aggregator framework in Giraph was implemented over ZooKeeper rather inefficiently. Workers would write partially aggregated values to znodes (Zookeeper's data storage). The master would aggregate all of them and write the final result back to its znode for the workers to access. This wasn't scalable due to znode size constraints (maximum 1 megabyte) and Zookeeper write throughput limitations, and it caused a problem for Facebook, which needed to support very large aggregators (e.g., gigabytes).

(That was in fact a bad use of the ZooKeeper framework; as outlined in this post, there are better ways to do it. Incidentally, my student Ailidani and I are looking at Paxos use in production environments, and we collect anecdotes like this. Email us if you have some examples to share.)

In the sharded aggregator architecture implemented by Facebook (Figure 3), each aggregator is randomly assigned to one of the workers. The assigned worker is in charge of gathering the values of its aggregators from all workers and distributing the final values to the master and the other workers. This balances aggregation across all workers rather than bottlenecking the master, and aggregators are limited only by the total memory available on each worker. Note that this is not fault-tolerant; the crash-tolerance that ZooKeeper provided is lost.
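The sharded aggregator idea can be sketched as follows (hypothetical names, not the actual Giraph code): each aggregator gets a randomly assigned owner worker, which sums the partial values from all workers, spreading the load instead of funneling everything through the master.

```python
import random

def assign_owners(aggregator_names, num_workers, seed=0):
    # Randomly shard aggregators across workers (seeded for reproducibility).
    rng = random.Random(seed)
    return {name: rng.randrange(num_workers) for name in aggregator_names}

def aggregate(partials_by_worker, owners):
    """partials_by_worker maps worker_id -> {aggregator_name: partial value}.
    Each owner worker sums the partial values every worker sends to it."""
    finals = {}
    for name in owners:
        finals[name] = sum(partials.get(name, 0)
                           for partials in partials_by_worker.values())
    return finals  # then distributed to the master and all workers
```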

Worker and Master Phase Extensions

For the worker side, the methods preSuperstep(), postSuperstep(), preApplication(), and postApplication() were added. For example, the preSuperstep() method is executed on every worker prior to every superstep, and can be used in a k-means clustering implementation to let every worker compute the final centroid locations just before the input vectors are processed.

Similarly, Facebook added master computation to perform centralized computation prior to every superstep, which can communicate with the workers via aggregators. This is generally a lightweight task (easily computable without requiring much data analysis) that has a global scope (its result applies as input to all workers in the next superstep).
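These hooks can be sketched for the k-means example as follows (a hypothetical API, not the actual Giraph interfaces): the master's globally scoped output reaches each worker via aggregators, and the per-worker hook installs it before the compute step runs.

```python
# Sketch of a worker phase hook in a BSP loop (hypothetical API).

class KMeansWorker:
    def __init__(self, points):
        self.points = points      # 1-D points, for simplicity
        self.centroids = []

    def pre_superstep(self, aggregated):
        # Executed on every worker prior to the superstep: pick up the
        # centroid locations the master computation shared via aggregators.
        self.centroids = aggregated["centroids"]

    def compute(self):
        # Assign each point to the index of its nearest centroid.
        return [min(range(len(self.centroids)),
                    key=lambda i: abs(p - self.centroids[i]))
                for p in self.points]

def run_superstep(workers, master_aggregated):
    # The master's lightweight, globally scoped output is broadcast to all
    # workers; each runs its hook and then its compute step.
    results = []
    for w in workers:
        w.pre_superstep(master_aggregated)
        results.append(w.compute())
    return results
```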

Superstep Splitting

When operating on very large graphs, a superstep may generate so much data to share with other workers (e.g., in a friends-of-friends score calculation) that the output does not fit in memory. Giraph can spill to disk, but this slows things down significantly. The superstep splitting technique keeps the same computation all in-memory for such applications. The idea is that, in such a message-heavy superstep, a worker can send a fragment of the messages to their destinations and do a partial computation that updates the state of the vertex values.
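A minimal sketch of the idea (hypothetical code, assuming the per-message update is commutative and associative, so that folding in one fragment at a time matches one monolithic superstep):

```python
# Sketch of superstep splitting: rather than materializing all messages of
# a heavy superstep at once, process them one fragment at a time, folding
# each fragment into the vertex state immediately.

def split_superstep(messages, update, state, fragments=4):
    """`update` is assumed commutative/associative, so applying it fragment
    by fragment yields the same vertex state as one big superstep."""
    chunk = max(1, len(messages) // fragments)
    for start in range(0, len(messages), chunk):
        fragment = messages[start:start + chunk]  # only this fragment in memory
        for msg in fragment:
            state = update(state, msg)
    return state
```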

Operational experience

Facebook uses Apache Giraph for production applications, for a variety of tasks including label propagation, variants of PageRank, and k-means clustering. The paper reports that most of Facebook's production applications run in less than an hour and use less than 200 machines. Due to the short execution duration and small number of machines, the chance of failure is relatively low and, when a failure occurs, it is handled by restarts.

The production application workflow is as follows. The developer first develops and unit tests the application locally, then tests it on a small number of servers with a test dataset (e.g., the Facebook graph for one country), and then runs it at scale on 200 workers. After these tests, the application is ready for production use.


This is where Facebook shows off. They ran an iteration of unweighted PageRank on the 1.39B Facebook user dataset with over 1 trillion social connections. They were able to execute PageRank in less than 3 minutes per iteration with only 200 machines.


The paper gives the following as concluding remarks:
First, our internal experiments show that graph partitioning can have a significant effect on network bound applications such as PageRank.
Second, we have started to look at making our computations more asynchronous as a possible way to improve convergence speed.
Finally, we are leveraging Giraph as a parallel machine-learning platform.
These are of course related to what we mentioned recently about the trends in distributed systems research in cloud computing. (Part 1, Part 2)


After I wrote this summary, I found that Facebook already has a nice post summarizing their paper.