# First Steps with Spark
Let us first create a small array to manipulate. In realistic applications, data will most likely be loaded from a file or a data stream. The built-in Python function `range` creates a sequence from 0 to \(n-1\). Since we generated this data instead of loading it in a distributed fashion, we need to create an RDD out of it, using the `parallelize()` function provided by the Spark context `sc`. The Spark context is created by default in an interactive session like this one. For standalone programs, you will need to set up the Spark context yourself. Here is a minimal example,
```python
from pyspark import SparkConf, SparkContext

conf = SparkConf()
conf.setMaster("local[4]")               # run locally with 4 worker threads
conf.setAppName("reduce")
conf.set("spark.executor.memory", "4g")
sc = SparkContext(conf=conf)
```
First, let's create some simple data, say the integers from 0 to 999. We use the Python function `range` to do this. Note that in practice our data will most likely come from data files.
```python
A = range(1000)
```
Now let us distribute this data across all our processes using the `sc.parallelize` function.
```python
pA = sc.parallelize(A)
```
Let us start with the simple task of computing the sum of the values in this distributed array, \(\sum_i A_i\). We call the `reduce` function with a lambda that adds two values.
```python
pA.reduce(lambda a, b: a + b)
```
> 499500
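As a sanity check, the same sum can be computed in plain Python with no Spark at all; it also matches the closed form \(n(n-1)/2\) for the integers \(0,\dots,n-1\):

```python
# plain-Python check of the distributed sum (no Spark required)
n = 1000
local_sum = sum(range(n))        # 0 + 1 + ... + 999
closed_form = n * (n - 1) // 2   # Gauss's formula for 0..n-1
print(local_sum, closed_form)
# → 499500 499500
```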
In this case, Spark will distribute the reduction across the available processes. This is simple enough. Now that we have seen a simple example, let us consider the example from the lecture,
\[ x = \sum_i \sqrt{A_i} \]
First, let us import the `sqrt` function from the `math` module and also write simple sequential Python code to generate the correct result, so that we may test our Spark implementation.
```python
from math import sqrt

total = 0
for i in range(1000):
    total += sqrt(i)
print(total)
```
> 21065.83311087905
That works. Now we can combine `map` with our `reduce` lambda to compute \(\sum_i \sqrt{A_i}\) on the distributed array,
```python
pA.map(sqrt).reduce(lambda a, b: a + b)
```
> 21065.83311087905
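Spark's `map`/`reduce` pair mirrors Python's own building blocks; a plain-Python sketch of the same computation, using `functools.reduce`, looks like this:

```python
from functools import reduce
from math import sqrt

# sequential equivalent: map sqrt over the data, then fold with addition
result = reduce(lambda a, b: a + b, map(sqrt, range(1000)))
print(result)
```

The difference, of course, is that Spark evaluates the same pattern in parallel across partitions.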
Let's go back to the presentation now …
## Basic Transformations
Let us quickly review some basic transformations available in Spark. First, let's create a smaller list of numbers to play with.
```python
nums = sc.parallelize([1, 2, 3, 4, 5])
```
```python
# retain elements passing a predicate
evens = nums.filter(lambda x: x % 2 == 0)
```
```python
# map each element to zero or more output elements
x = nums.flatMap(lambda x: range(x))
```
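To see what `flatMap` computes, a plain-Python equivalent expands each element into a sequence and concatenates the results into one flat list:

```python
nums_local = [1, 2, 3, 4, 5]

# each element x expands to range(x); the per-element results are concatenated
flat = [y for x in nums_local for y in range(x)]
print(flat)
# → [0, 0, 1, 0, 1, 2, 0, 1, 2, 3, 0, 1, 2, 3, 4]
```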
Transformations like these are lazy; nothing is computed until we apply an action. Now let us look at some actions …
```python
# retrieve RDD contents as a local collection
x.collect()
```
> [0, 0, 1, 0, 1, 2, 0, 1, 2, 3, 0, 1, 2, 3, 4]
```python
# return the first k elements
evens.take(2)
```
> [2, 4]
```python
# count the number of elements
nums.count()
```
> 5
## (key, value) pairs
Owing to its origins in MapReduce, a lot of data analytics defaults to using (key, value) pairs as the data representation. In Python you would use tuples to represent these,
```python
pair = ('a', 'b')
print(pair[0], pair[1])
```
Let's consider a quick example,
```python
pets = sc.parallelize([('cat', 1), ('dog', 3), ('cat', 2), ('dog', 1), ('hamster', 1)])
```
```python
# sum the values for each key
pets.reduceByKey(lambda x, y: x + y)
```
```python
# collect all values for each key into an iterable
pets.groupByKey()
```
```python
# sort the pairs by key
pets.sortByKey()
```
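The semantics of `reduceByKey` can be sketched in plain Python with a dictionary that accumulates values per key. This is only a local model of the operation; Spark performs the same accumulation in parallel across partitions:

```python
from collections import defaultdict

pets_local = [('cat', 1), ('dog', 3), ('cat', 2), ('dog', 1), ('hamster', 1)]

# accumulate values per key, mimicking reduceByKey(lambda x, y: x + y)
totals = defaultdict(int)
for key, value in pets_local:
    totals[key] += value
print(sorted(totals.items()))
# → [('cat', 3), ('dog', 4), ('hamster', 1)]
```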
Let's try a more complex example: word count, and working with files.
```python
lines = sc.textFile("sherlock.txt")
lines.count()
```
> 128457
OK, so that's a long file. Let's do a word count on this file and list out frequent words occurring in The Adventures of Sherlock Holmes by Sir Arthur Conan Doyle.
```python
counts = (lines.flatMap(lambda line: line.split(" "))  # split each line into words
               .filter(lambda w: len(w) > 8)           # keep only long words
               .map(lambda word: (word, 1))            # emit (word, 1) pairs
               .reduceByKey(lambda x, y: x + y)        # sum the counts per word
               .sortBy(lambda x: x[1], False))         # sort by count, descending
counts.take(10)                                        # action: return the top 10
```
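The same pipeline can be modeled in plain Python on a few sample lines; the lines below are just a stand-in, not the actual contents of `sherlock.txt`, and `Counter` plays the role of the `(word, 1)` map followed by `reduceByKey`:

```python
from collections import Counter

# stand-in sample lines; in the notebook, lines come from sherlock.txt
sample_lines = [
    "the adventures of sherlock holmes",
    "a scandal in bohemia and the adventures continue",
]

words = [w for line in sample_lines for w in line.split(" ")]   # flatMap
long_words = [w for w in words if len(w) > 8]                   # filter
counts = Counter(long_words)                                    # map + reduceByKey
top = sorted(counts.items(), key=lambda x: x[1], reverse=True)  # sortBy, descending
print(top)
# → [('adventures', 2)]
```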