Scalability! But at what COST?
McSherry, Isard, and Murray
Overall, this paper should increase your skepticism when reading recent distributed systems work and help you separate important improvements from misleading measurements.
Key idea: recent software systems have focused on "scaling out" for performance, but many of these systems running on many nodes are slower than a good implementation running on one node (or even a single thread).
Big data/data parallelism
Scaling is a popular topic in recent top systems conferences.
Raises the question: what's the speedup potential with parallelism?
[Sketch: time vs. number of workers, from 1 to n. The parallel curve pays a fixed setup cost, then time falls roughly as 1/n as workers are added; the sequential implementation is a single point at one worker.]
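The idealized scaling curve in the sketch above can be modeled in a few lines. This is a sketch, not anything from the paper; the setup and work numbers are made up for illustration:

```rust
// Idealized scaling model behind the 1/n curve:
// parallel time = fixed setup/coordination cost + work divided n ways.
fn parallel_time(setup: f64, work: f64, workers: u32) -> f64 {
    setup + work / workers as f64
}

fn main() {
    // Hypothetical costs: 10s of setup, 1000s of total work.
    let (setup, work) = (10.0, 1000.0);
    // With fixed overhead, doubling workers never quite halves total time...
    assert!(parallel_time(setup, work, 2) > parallel_time(setup, work, 1) / 2.0);
    // ...and in the limit, time flattens out at the setup cost, not zero.
    assert!(parallel_time(setup, work, 1_000_000) > setup);
}
```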
Is this 1/n curve what we'll get?
[Sketch: a measured curve on the same axes: time falls as workers are added, but with diminishing returns, flattening out well above zero.]
[Sketch: the same axes with a second system's curve (z) and a horizontal line C for the single-threaded running time; the scalable systems' curves may never drop below C.]
Note: when this paper talks about cores, the cores aren't necessarily all in the same machine.
[Figure 1]
Surprising result of this paper: a simple single-threaded implementation may be faster at every point on the graph for some systems.
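The paper captures this as the COST metric: the Configuration that Outperforms a Single Thread, i.e. how many cores a scalable system needs before it beats a competent single-threaded run. A minimal sketch of computing it from measurements (the `cost` function and the numbers are illustrative, not from the paper):

```rust
// COST: smallest core count at which the scalable system's runtime
// drops below the single-threaded runtime, or None if it never does.
// Assumes measurements are sorted by ascending core count.
fn cost(single_thread_secs: f64, parallel_secs: &[(u32, f64)]) -> Option<u32> {
    parallel_secs
        .iter()
        .find(|&&(_cores, t)| t < single_thread_secs)
        .map(|&(cores, _)| cores)
}

fn main() {
    // Hypothetical measurements: (cores, seconds).
    let measured = [(1, 900.0), (16, 300.0), (128, 110.0), (512, 60.0)];
    // A 50s single-threaded run is never beaten: unbounded COST.
    assert_eq!(cost(50.0, &measured), None);
    // A 200s single-threaded run is first beaten at 128 cores.
    assert_eq!(cost(200.0, &measured), Some(128));
}
```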
First, we're going to talk about graph analytics frameworks, since that's the example from this paper. Don't get hung up on graph processing; the high-level point is about scaling and efficiency. We just need this for context.
For simplicity, assume synchronous rounds.
What is this good for?
[Sketch of a small graph.]
fn pagerank<G: EdgeMapper>(graph: &G, nodes: u32, alpha: f32) {
    let mut src: Vec<f32> = (0..nodes).map(|_| 0f32).collect();
    let mut dst: Vec<f32> = (0..nodes).map(|_| 0f32).collect();
    let mut deg: Vec<f32> = (0..nodes).map(|_| 0f32).collect();
    // Initialize deg of all nodes to the count of outbound neighbors.
    graph.map_edges(|x, _| { deg[x as usize] += 1f32 });
    for _iteration in 0..20 {
        for node in 0..nodes as usize {
            // Divide up the current weight of the node among its neighbors.
            src[node] = alpha * dst[node] / deg[node];
            // Each node starts with 1 - alpha weight.
            dst[node] = 1f32 - alpha;
        }
        // Then add on the weight coming in from neighbors,
        // i.e. each neighbor y of x receives a share of x's original weight.
        graph.map_edges(|x, y| { dst[y as usize] += src[x as usize]; });
    }
}
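The pagerank code assumes some EdgeMapper trait that visits every edge. A minimal sketch of what that trait and an in-memory implementation might look like (the trait shape and the VecGraph type are assumptions for illustration, not the paper's actual code):

```rust
// Hypothetical trait matching how the pagerank example uses map_edges:
// call `action(src, dst)` once per edge.
trait EdgeMapper {
    fn map_edges<F: FnMut(u32, u32)>(&self, action: F);
}

// Simplest possible backing store: an in-memory edge list.
struct VecGraph {
    edges: Vec<(u32, u32)>,
}

impl EdgeMapper for VecGraph {
    fn map_edges<F: FnMut(u32, u32)>(&self, mut action: F) {
        for &(src, dst) in &self.edges {
            action(src, dst);
        }
    }
}

fn main() {
    // The four-node graph used in the ordering example later on.
    let graph = VecGraph { edges: vec![(0, 1), (0, 2), (2, 3), (3, 1)] };
    // Count out-degrees the same way the pagerank code initializes deg.
    let mut deg = vec![0u32; 4];
    graph.map_edges(|x, _| deg[x as usize] += 1);
    assert_eq!(deg, vec![2, 0, 1, 1]);
}
```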
Look at the last map_edges: it does the heavy part, a full scan over the edges of the graph.
Table 2
Label propagation is easy in the vertex-centric model: each round, every node adopts the smallest label among itself and its neighbors, so the minimum label floods outward through each connected component.
[Sketch: node labels before and after one round of min-label propagation.
Start: 1 3 / 2 5 and 4 6 / 7 8 9; after round 1: 1 1 / 1 4 and 4 4 / 7 7 8.]
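A sketch of label propagation for connected components over an edge list (the function name and representation are illustrative, not the paper's code):

```rust
// Label propagation: every node starts with its own index as its label;
// each pass, both endpoints of every edge take the smaller of their two
// labels. Repeat until no label changes. Nodes end up labeled with the
// smallest node index in their connected component.
fn label_propagation(nodes: usize, edges: &[(usize, usize)]) -> Vec<usize> {
    let mut label: Vec<usize> = (0..nodes).collect();
    let mut changed = true;
    while changed {
        changed = false;
        for &(x, y) in edges {
            let m = label[x].min(label[y]);
            if label[x] != m { label[x] = m; changed = true; }
            if label[y] != m { label[y] = m; changed = true; }
        }
    }
    label
}

fn main() {
    // Two components: {0, 1, 2} and {3, 4}.
    let labels = label_propagation(5, &[(0, 1), (1, 2), (3, 4)]);
    assert_eq!(labels, vec![0, 0, 0, 3, 3]);
}
```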
The paper used super-naive single-threaded implementations; simple improvements make them even better.
0 --> 1
|     ^
v     |
2 --> 3
Source order:   Interleaved bits (src and dst bits interleaved):
0, 1            0001b
0, 2            0100b
2, 3            1101b
3, 1            1011b
Z-curve order (edges sorted by interleaved bits):
0, 1
0, 2
3, 1
2, 3
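The bit interleaving above can be sketched as a Morton (Z-order) key; sorting edges by this key yields the Z-curve order (the `z_order` helper is illustrative):

```rust
// Z-order (Morton) key for an edge: interleave the bits of src and dst,
// so edges whose endpoints are both numerically close sort near each other.
fn z_order(src: u32, dst: u32) -> u64 {
    let mut key = 0u64;
    for i in 0..32 {
        key |= (((src >> i) & 1) as u64) << (2 * i + 1); // src bit i -> odd position
        key |= (((dst >> i) & 1) as u64) << (2 * i);     // dst bit i -> even position
    }
    key
}

fn main() {
    // The four edges from the sketch above, with the keys shown there.
    assert_eq!(z_order(0, 1), 0b0001);
    assert_eq!(z_order(0, 2), 0b0100);
    assert_eq!(z_order(2, 3), 0b1101);
    assert_eq!(z_order(3, 1), 0b1011);

    let mut edges = vec![(0u32, 1u32), (0, 2), (2, 3), (3, 1)];
    edges.sort_by_key(|&(s, d)| z_order(s, d));
    assert_eq!(edges, vec![(0, 1), (0, 2), (3, 1), (2, 3)]);
}
```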
In the limit, source order:
0, 1
0, 10
0, 20
...
0, 1000
...
0, 100000
...
2, 1
2, 10
...
2, 100000
Z-curve order interleaves the two runs:
0, 1
2, 1
0, 10
2, 10
0, 20
...
0, 1000
...
0, 100000
2, 100000
...
For (x, y) edges, edges with close x's and close y's are likely to hit in cache.
Source order: we have to read the destination node for an edge from any source, so destination accesses are scattered. Z-curve: if node 10 has an edge to node 200 and so does node 20, then the two accesses will be near each other in the order and likely to hit in the cache.
Union-find (remember this from algorithms class?)
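A minimal union-find sketch with path compression (illustrative code, not the paper's implementation):

```rust
// Union-find for connected components: find follows parent pointers to
// the root, compressing the path as it goes; union links the two roots.
struct UnionFind {
    parent: Vec<usize>,
}

impl UnionFind {
    fn new(n: usize) -> Self {
        UnionFind { parent: (0..n).collect() }
    }

    fn find(&mut self, mut x: usize) -> usize {
        while self.parent[x] != x {
            self.parent[x] = self.parent[self.parent[x]]; // path halving
            x = self.parent[x];
        }
        x
    }

    fn union(&mut self, x: usize, y: usize) {
        let (rx, ry) = (self.find(x), self.find(y));
        if rx != ry {
            self.parent[rx] = ry;
        }
    }
}

fn main() {
    // Same two components as before: {0, 1, 2} and {3, 4}.
    let mut uf = UnionFind::new(5);
    for &(x, y) in &[(0, 1), (1, 2), (3, 4)] {
        uf.union(x, y);
    }
    assert_eq!(uf.find(0), uf.find(2));
    assert_ne!(uf.find(0), uf.find(3));
}
```

A single scan over the edge list with union, then one find per node, labels every component; this is the kind of simple single-threaded baseline the paper measures against.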
Table 5