CS6963 Distributed Systems

Lecture 07 Scaling

Scalability! But at what COST?
McSherry, Isard, and Murray

  • Objectives:
    • Understand why scaling is important/useful.
    • Understand why scaling is challenging.
    • Understand why a common metric is broken.
    • Understand when scaling isn't meeting its goals/objectives.

Overall, this paper should increase your skepticism when reading recent distributed systems work and help you separate important improvements from misleading measurements.

Key idea: recent software systems have focused on "scaling" for performance, but many of these systems running on many nodes are slower than a simple implementation running on one node (or even a single thread).


  • Big data/data parallelism

    • How can we solve problems that involve a lot of data?
      • Add nodes.
    • Map Reduce example [draw MR figure]
    • Easy way to apply many threads to a problem
    • Scaling is a general term for splitting or partitioning load/data across many machines/cores/etc
      • Basically, a way to parallelize a problem
  • Scaling is a popular topic in recent top systems conferences.

    • Many papers showing speedups as nodes are added.
    • Formula:
      • Take performance of system on one node.
      • Take performance of system on n nodes.
      • Divide tput_n / tput_1; hope the result is about n.
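
The formula as a one-liner (a sketch; `speedup` is just a name used here):

```rust
// Relative speedup on n workers: throughput at n divided by throughput at 1.
// Papers hope to report a result close to n itself.
fn speedup(tput_1: f64, tput_n: f64) -> f64 {
    tput_n / tput_1
}
```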
  • Raises the question: what's the speedup potential with parallelism?

    • Q: Can a parallel algorithm be slower than a sequential one?
    • Yes.
      • Some cost to coordinate parallelism
      • Communication between parallel operations
        • Communication has a cost
      • Parallel framework may force an algorithm that isn't optimal
      • Example: MR wc, emit KV pairs with (word, 1) and reduce
    • [show MR useful work timeline]

  • Back to MR
    • How do we determine the best number of mappers/reducers?
    • More machines, less work for each, faster.
    • There is some fixed setup cost.
    • Setup cost also a function of number of servers involved.
      • More machines, more setup cost, slower.
| o
|    o
|        o
|               o
  1 Workers           n

Is this 1/n curve what we'll get?

| x  x  
|        x                  
|          x           
|             x    
|                x    
|                     x 
  1 Workers            n
  • Q: Why flat at the beginning?
    • Cost to distribute: think about Thor/Sinfonia going from one node to two.
    • Network becomes involved.
    • Cost varies depending on system.
      • MR: schedule nodes, copy code over, assign tasks, etc.
  • Q: Why flat at end?
    • Cost to coordinate begins to outweigh parallelism advantage.
    • e.g. the nth MR task doesn't even get started before the other tasks finish
| x  x  
|        x            z     
|          x     z     
|             x    
| C              x    
|                     x 
  1 Workers            n
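
The dip-then-rise shape above can be reproduced with a toy cost model; the constants below are invented for illustration, not measurements:

```rust
// Toy model of job runtime on n workers: a fixed startup cost, a
// per-worker coordination cost, and the parallelizable work itself.
// All constants are made up for illustration.
fn runtime(workers: u32, work: f64) -> f64 {
    let fixed_setup = 5.0; // e.g. scheduling the job, shipping code over
    let per_worker = 0.5;  // e.g. coordination cost paid per worker added
    fixed_setup + per_worker * workers as f64 + work / workers as f64
}
```

With work = 1000.0 this gives ~1005 at one worker, 50 at forty, and ~506 at a thousand: adding workers first helps, then the coordination term dominates.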
  • Proposed metric of the paper COST:
    • "Configuration that Outperforms a Single Thread"
    • How many machines does it take to beat the naive single threaded solution?
    • Rationale: aggressively minimize setup and coordination costs.
    • Even the one-node version of the scalable system has overheads so that it can "be prepared" to scale.
    • Notice: this isn't really a measurement of the system itself; it's a measurement of a system's ability to implement an algorithm/solve a particular problem.
    • The same system will have different COST for different problems.
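
Given measured runtimes, COST can be computed mechanically. A sketch; the function name and the data in the test are invented:

```rust
// COST = the smallest worker count whose runtime beats the single-threaded
// baseline, or None ("unbounded COST") if no configuration ever does.
fn cost(baseline_secs: f64, runtimes: &[(u32, f64)]) -> Option<u32> {
    runtimes
        .iter()
        .filter(|&&(_, secs)| secs < baseline_secs)
        .map(|&(workers, _)| workers)
        .min()
}
```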

Note: in this paper, when they talk about cores, the cores aren't all in the same machine.

  • [Figure 1]

    • Same system: Naiad, before and after an optimization.
    • System A 'scales', and system B doesn't.
    • System B is still better.
      • Solves the problem with 11 cores in the same time as A with 300.
    • What went wrong? System B's fast one-node performance created a big denominator in the scalability calculation.
    • Points out that scalability in relative numbers can lie.
    • At least report absolute performance numbers.
  • Surprising result of this paper: a simple single threaded implementation may be faster for all points on the graph for some systems.

Rest of the Paper

  • Take a 2014 laptop
  • Take datasets from recent large-scale systems papers
    • Put them on the laptop
  • Write single-threaded algorithms against the data that solve the same problems.
  • Compare results to large scale systems.

Primer: Parallel Graph Frameworks

First, we're going to talk about graph analytics frameworks, since that's the example from this paper. Don't get hung up on graph processing. The high-level point is about scaling and efficiency. Need this for context of the conversation.

  • Disclaimer: I'm not an expert on these.
  • Rough idea: often a vertex-centric programming model.
    • Similar to how MR constrains the programming model for bulk parallel processing of large files.
    • Except here we need a way to divide computation across subsets of a graph.
  • In a "round" each vertex
    • Looks at incoming messages and vertex state
    • Produces messages for neighbors next round.
  • Can divide graph across multiple machines.
    • Determining how to cut is the subject of papers.
  • Run vertex code on each machine in parallel.
  • For simplicity, assume synchronous rounds.

  • What is this good for?

    • Many popular graph computations can map to this.
    • PageRank: determine popularity of a web page based on incoming links.
    • Need for iterative fixed point due to cycles in the graph.
    • Connected components
o--o  o---o
| /   |
o     o   o---o

Basic Graph Computations


fn pagerank<G: EdgeMapper>(graph: &G, nodes: u32, alpha: f32) {
    let mut src: Vec<f32> = (0..nodes).map(|_| 0f32).collect();
    let mut dst: Vec<f32> = (0..nodes).map(|_| 0f32).collect();
    let mut deg: Vec<f32> = (0..nodes).map(|_| 0f32).collect();

    // Initialize deg of each node to its count of outbound neighbors.
    graph.map_edges(|x, _| { deg[x as usize] += 1f32; });

    for _iteration in 0..20 {
        for node in 0..nodes as usize {
            // Divide up the current weight of the node among its neighbors.
            src[node] = alpha * dst[node] / deg[node];

            // Each node starts with 1 - alpha weight.
            dst[node] = 1f32 - alpha;
        }

        // Then add on the weight coming in from neighbors.
        // i.e. Each neighbor y of x receives a share of x's original weight.
        graph.map_edges(|x, y| { dst[y as usize] += src[x as usize]; });
    }
}
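
To make this runnable, here's a minimal in-memory EdgeMapper plus a variant of the loop that returns the ranks. The trait shape is inferred from how the code uses it; the paper's real harness may differ:

```rust
// A minimal in-memory edge list; the trait shape is an assumption
// inferred from how the pagerank code calls map_edges.
trait EdgeMapper {
    fn map_edges<F: FnMut(u32, u32)>(&self, f: F);
}

struct EdgeList(Vec<(u32, u32)>);

impl EdgeMapper for EdgeList {
    fn map_edges<F: FnMut(u32, u32)>(&self, mut f: F) {
        for &(x, y) in &self.0 {
            f(x, y);
        }
    }
}

// Same loop as above, but returning the rank vector so it can be inspected.
fn pagerank_ranks<G: EdgeMapper>(graph: &G, nodes: u32, alpha: f32) -> Vec<f32> {
    let n = nodes as usize;
    let mut src = vec![0f32; n];
    let mut dst = vec![0f32; n];
    let mut deg = vec![0f32; n];

    graph.map_edges(|x, _| deg[x as usize] += 1f32);

    for _iteration in 0..20 {
        for node in 0..n {
            src[node] = alpha * dst[node] / deg[node];
            dst[node] = 1f32 - alpha;
        }
        graph.map_edges(|x, y| dst[y as usize] += src[x as usize]);
    }
    dst
}
```

On a 3-cycle every node converges to the same rank, which is a quick sanity check.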
  • Look at the last map_edges: it does the heavy part, a full scan over the edges of the graph.

    • When on SSD this reads the whole graph from SSD each iteration (up to 20x)!
  • Table 2

    • Single-threaded PageRank beats all systems in at least one case.
    • Single thread beats all systems except GraphLab and GraphX in both cases.

Connected Components

Label propagation is easy in the vertex-centric model


1--3  4---6
| /   |
2     5   7---8---9

After round 1

1--1  4---4
| /   |
1     4   7---7---8
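
The rounds above can be sketched as min-label propagation. This toy version updates labels in place rather than exchanging messages in synchronous rounds, but it converges to the same labels:

```rust
// Each node starts labeled with its own id; repeatedly, each edge pulls
// both endpoints down to the minimum of their labels, until nothing changes.
fn label_propagation(nodes: u32, edges: &[(u32, u32)]) -> Vec<u32> {
    let mut label: Vec<u32> = (0..nodes).collect();
    let mut changed = true;
    while changed {
        changed = false;
        for &(x, y) in edges {
            let m = label[x as usize].min(label[y as usize]);
            if label[x as usize] != m { label[x as usize] = m; changed = true; }
            if label[y as usize] != m { label[y as usize] = m; changed = true; }
        }
    }
    label
}
```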
  • Table 3
    • Even more extreme results
    • Single thread on SSD faster than every other system on both graphs

Better Baselines

Used super naive implementations; simple improvements make them even better.

Improving Graph Layout

  • Reorder edge traversal in PageRank
  • Original enumerated them by source vertex first.
    • (1, 1), (1, 2), (1, 3), (2, 1), (2, 2), ...
  • Use a space-filling Hilbert curve to get better locality on vertex accesses.
    • Don't worry about the details of what this is.
  • Maps 2 dimensional space onto one dimension.
  • Z-curve: interleave bits and sort
  • [probably skip diagrams below, unless students want a vague idea of it]
0 --> 1
|     ^
V     |
2 --> 3

Source order:  Interleaved:
0, 1           0001b
0, 2           0100b
2, 3           1101b
3, 1           1011b

Sorted by interleaved key:

0, 1
0, 2
3, 1
2, 3

In the limit, source order:

0, 1
0, 10
0, 20
0, 1000
0, 100000
2, 1
2, 10
2, 100000

Z-order:

0, 1
2, 1
0, 10
2, 10
0, 20
0, 1000
0, 100000
2, 100000

For x, y edges with close x's and close y's are likely to hit in cache.

Source order: we have to re-read each destination node for an edge from any source. Z-curve: if node 10 has an edge to 200 and so does node 20, then the accesses will be near each other and likely to hit in the buffer cache.
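
The bit interleaving can be sketched directly. Note this is the simpler Z-curve stand-in from these notes, not the Hilbert curve the paper actually uses:

```rust
// Spread the 32 bits of v out to every other bit position of a u64.
fn spread(v: u32) -> u64 {
    let mut v = v as u64;
    v = (v | (v << 16)) & 0x0000_FFFF_0000_FFFF;
    v = (v | (v << 8)) & 0x00FF_00FF_00FF_00FF;
    v = (v | (v << 4)) & 0x0F0F_0F0F_0F0F_0F0F;
    v = (v | (v << 2)) & 0x3333_3333_3333_3333;
    v = (v | (v << 1)) & 0x5555_5555_5555_5555;
    v
}

// Z-order key: interleave the bits of x and y, x's bits in the odd positions.
fn z_order(x: u32, y: u32) -> u64 {
    (spread(x) << 1) | spread(y)
}
```

Sorting the edge list by this key reproduces the sorted order shown above.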

  • Table 4
    • Big speed boost for the in-RAM case.
    • The curve most likely leverages the large L3 cache by working within neighborhoods of the graph.
    • 4x perf boost for running on a laptop instead of a cluster

Improving Algorithms

  • Point: these naive implementations still mimic the vertex-centric model
    • But they run on a single thread in a single machine
    • They can be implemented in any way that makes sense
  • Label propagation costly and chatty
  • Union-find - remember this from algorithms class?

    • Create a forest of vertices; union trees together as edges are encountered.
  • Table 5

    • !!!
    • 15-30x faster than 128 cores?!
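
The union-find approach can be sketched in a few lines. This version labels each component with its smallest vertex id, so its output matches what label propagation converges to:

```rust
// Connected components via union-find with path halving; the smallest
// node id in each component ends up as its label.
fn connected_components(nodes: u32, edges: &[(u32, u32)]) -> Vec<u32> {
    fn find(root: &mut [u32], mut x: u32) -> u32 {
        while root[x as usize] != x {
            root[x as usize] = root[root[x as usize] as usize]; // path halving
            x = root[x as usize];
        }
        x
    }

    let mut root: Vec<u32> = (0..nodes).collect();
    for &(x, y) in edges {
        let rx = find(&mut root, x);
        let ry = find(&mut root, y);
        if rx != ry {
            // Point the larger root at the smaller, so labels are minimums.
            root[rx.max(ry) as usize] = rx.min(ry);
        }
    }
    // Flatten so every node points directly at its component's root.
    for x in 0..nodes {
        let r = find(&mut root, x);
        root[x as usize] = r;
    }
    root
}
```

One pass over the edges, no per-round rescans of the graph: that's where the big win over label propagation comes from.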

Applying COST

  • Q: What does a COST of 16 cores for PageRanking twitter_rv with Naiad mean?
  • Q: COST of 512 cores for GraphLab?
  • Q: Unbounded COST for GraphX?
    • Can't get perf of single threaded implementation regardless of number of machines given.
    • Does GraphX intersect way out there?

Lessons Learned

  • System implementations may add overheads that a single thread doesn't require.
    • High-level language: GC, bounds checks, memcpy
    • Difficult to tell if engineering effort can eliminate these overheads.
    • e.g. a network bottleneck may be thinly hiding an algorithmic bottleneck
  • Computational model matters.
    • It forces particular approaches/algorithms that may not always be the best.
    • This was raised in this class
    • Think back to MR wc emiting KV pairs of form (word, 1)...
  • Target hardware makes a difference
    • Data center machines have slower memory and CPUs, but bigger caches and more total memory...
    • Laptops have high single thread perf, fast small RAM
    • Alternatives should be considered.


  • Partitioning, scaling, and parallelism key to big problems today.
    • Map Reduce, Spark
    • Sharded KVS, databases
    • Graph analytics frameworks
  • Important to be careful in reading the papers and understanding the costs.
  • When writing, prefer absolute numbers to relative.