# First Steps with Spark
Let us first create a small array that we can manipulate. In realistic applications, data will most likely be loaded from a file or from a data stream. The built-in Python command `range` creates a sequence from 0 to \(n-1\). Since we generated this data instead of loading it in a distributed fashion, we need to create an RDD out of it, using the `parallelize()` function provided by the Spark context `sc`. The Spark context is created by default in an interactive session like this one. For standalone programs, you will need to set up the Spark context yourself. Here is a minimal example,
```python
from pyspark import SparkConf, SparkContext

conf = SparkConf()
conf.setMaster("local[4]")
conf.setAppName("reduce")
conf.set("spark.executor.memory", "4g")
sc = SparkContext(conf=conf)
```
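In a standalone program you would also typically shut the context down cleanly once the work is done; a minimal sketch:

```python
# at the end of a standalone script, release the resources held by the Spark context
sc.stop()
```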
First, let's create some simple data, say the integers from 0 to 999. We use the Python command `range` to do this. Note that in practice our data will most likely come from data files.
```python
A = range(1000)
```
Now let us distribute this data across all our processes using the `sc.parallelize` function.
```python
pA = sc.parallelize(A)
```
Let us start with the simple task of computing the sum of the values in this distributed array, \(\sum_i A_i\). We call the `reduce` function with a lambda function that adds two values.
```python
pA.reduce(lambda a, b: a + b)
```
> 499500
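For a simple reduction like this, Spark also provides a built-in action on RDDs that should give the same result as the `reduce` call above:

```python
# built-in action that sums the elements of the RDD
pA.sum()
```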
In either case, Spark will distribute the reduction across the available processes. This is simple enough. Now that we have seen a simple example, let us consider the example from the lecture,
\[ x = \sum_i \sqrt{A_i} \]
First, let us import the `sqrt` function from the `math` module and also write a simple sequential Python version that generates the correct result, so that we can test our Spark implementation.
```python
from math import sqrt

# sequential reference computation of the sum of square roots
total = 0
for i in range(1000):
    total += sqrt(i)
print(total)
```
> 21065.8331109
That works. Now we can compute \(\sum_i \sqrt{A_i}\) in Spark by mapping `sqrt` over the RDD and reducing with the same lambda function,
```python
pA.map(sqrt).reduce(lambda a, b: a + b)
```
> 21065.83311087905
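For comparison, the same pattern can be written sequentially on the driver with Python's built-in `map` and `functools.reduce`; this sketch is only meant to illustrate the correspondence, not to replace the distributed version.

```python
from functools import reduce

# sequential equivalent of pA.map(sqrt).reduce(...), evaluated locally
reduce(lambda a, b: a + b, map(sqrt, A))
```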
Let's go back to the presentation now …
## Basic Transformations
Let us quickly review some basic transformations available within Spark. Let's create a smaller list of numbers to play with.
```python
nums = sc.parallelize([1, 2, 3, 4, 5])
```
```python
# retain elements passing a predicate
evens = nums.filter(lambda x: x % 2 == 0)
```
```python
# map each element to zero or more others
x = nums.flatMap(lambda x: range(x))
```
Now let us look at some actions …
```python
# retrieve RDD contents as a local collection
x.collect()
```
> [0, 0, 1, 0, 1, 2, 0, 1, 2, 3, 0, 1, 2, 3, 4]
```python
# return the first k elements
evens.take(2)
```
> [2, 4]
```python
# count the number of elements
nums.count()
```
> 5
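A couple of other frequently used actions follow the same pattern; a quick sketch:

```python
# return the first element of the RDD
nums.first()

# sum all of the elements (an action, like count())
nums.sum()
```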
## (key, value) pairs
Due to its origins in MapReduce, a lot of data analytics defaults to using (key, value) pairs as the data representation. In Python you would use tuples to represent these,
```python
pair = ('a', 'b')
print(pair[0], pair[1])
```
Let's consider a quick example,
```python
pets = sc.parallelize([('cat', 1), ('dog', 3), ('cat', 2), ('dog', 1), ('hamster', 1)])
```
```python
# sum the values for each key
pets.reduceByKey(lambda x, y: x + y)
```
```python
# group the values belonging to each key into an iterable
pets.groupByKey()
```
```python
# sort the RDD by key
pets.sortByKey()
```
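These are all transformations, so nothing is actually computed until an action is applied. A small sketch to materialize the per-key sums (the ordering of the returned list may vary):

```python
# trigger the computation and bring the per-key sums back to the driver
pets.reduceByKey(lambda x, y: x + y).collect()
```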
Let's try a more complex example: word count and working with files.
```python
lines = sc.textFile("sherlock.txt")
lines.count()
```
> 128457
OK, so that's a long file. Let's do a word count on this file and list the frequent words occurring in The Adventures of Sherlock Holmes by Sir Arthur Conan Doyle.
```python
# split each line into words, keep only long words, count the occurrences
# of each word, and sort by descending count
lines.flatMap(lambda line: line.split(" ")) \
     .filter(lambda w: len(w) > 8) \
     .map(lambda word: (word, 1)) \
     .reduceByKey(lambda x, y: x + y) \
     .sortBy(lambda x: x[1], False)
```
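As written, this is again only a chain of transformations; to actually see the most frequent long words we would finish with an action such as `take`. A sketch, storing the chain in an illustrative variable `counts` (the specific words and counts depend on the text file):

```python
counts = lines.flatMap(lambda line: line.split(" ")) \
              .filter(lambda w: len(w) > 8) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda x, y: x + y) \
              .sortBy(lambda x: x[1], False)

# take the ten most frequent words longer than eight characters
counts.take(10)
```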