# First Steps with Spark
Let us first create a small array that we can manipulate. In realistic applications, data will most likely be loaded from a file or from a data stream. The built-in Python command `range` creates a sequence from 0 to \(n-1\). Since we generated this data instead of loading it in a distributed fashion, we need to create an RDD out of it, using the `parallelize()` function provided by the Spark context `sc`. The Spark context is created by default in an interactive session like this one. For standalone programs, you will need to set up the Spark context yourself. Here is a minimal example,
```python
from pyspark import SparkConf, SparkContext

conf = SparkConf()
conf.setMaster("local[4]")
conf.setAppName("reduce")
conf.set("spark.executor.memory", "4g")
sc = SparkContext(conf=conf)
```
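In a standalone program you would also typically shut the context down cleanly once the work is done; a minimal sketch:

```python
# at the end of a standalone script, release the resources held by the Spark context
sc.stop()
```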
First, let's create some simple data, say the integers from 0 to 999. We use the Python command `range` to do this. Note that in practice our data will most likely come from data files.
```python
A = range(1000)
```
Now let us distribute this data across all our processes using the `sc.parallelize` function.
```python
pA = sc.parallelize(A)
```
Let us start with the simple task of computing the sum of the values in this distributed array, \(\sum_i A_i\). We call the `reduce` function with a lambda function that adds two values.
```python
pA.reduce(lambda a, b: a + b)
```
> 499500
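For a simple reduction like this, Spark also provides a built-in action on RDDs that should give the same result as the `reduce` call above:

```python
# built-in action that sums the elements of the RDD
pA.sum()
```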
In either case, Spark will distribute the reduction across the available processes. This is simple enough. Now that we have seen a simple example, let us consider the example from the lecture,
\[ x = \sum_i \sqrt{A_i} \]
First, let us import the `sqrt` function from the `math` module and also write a simple sequential Python version that generates the correct result, so that we can test our Spark implementation.
```python
from math import sqrt

# sequential reference computation of the sum of square roots
total = 0
for i in range(1000):
    total += sqrt(i)
print(total)
```
> 21065.8331109
That works. Now we can compute \(\sum_i \sqrt{A_i}\) in Spark by mapping `sqrt` over the RDD and reducing with the same lambda function,
```python
pA.map(sqrt).reduce(lambda a, b: a + b)
```
> 21065.83311087905
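For comparison, the same pattern can be written sequentially on the driver with Python's built-in `map` and `functools.reduce`; this sketch is only meant to illustrate the correspondence, not to replace the distributed version.

```python
from functools import reduce

# sequential equivalent of pA.map(sqrt).reduce(...), evaluated locally
reduce(lambda a, b: a + b, map(sqrt, A))
```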
Let's go back to the presentation now …
## Basic Transformations
Let us quickly review some basic transformations available within Spark. Let's create a smaller list of numbers to play with.
```python
nums = sc.parallelize([1, 2, 3, 4, 5])
```
```python
# retain elements passing a predicate
evens = nums.filter(lambda x: x % 2 == 0)
```
```python
# map each element to zero or more others
x = nums.flatMap(lambda x: range(x))
```
Now let us look at some actions …
```python
# retrieve RDD contents as a local collection
x.collect()
```
> [0, 0, 1, 0, 1, 2, 0, 1, 2, 3, 0, 1, 2, 3, 4]
```python
# return the first k elements
evens.take(2)
```
> [2, 4]
```python
# count the number of elements
nums.count()
```
> 5
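A couple of other frequently used actions follow the same pattern; a quick sketch:

```python
# return the first element of the RDD
nums.first()

# sum all of the elements (an action, like count())
nums.sum()
```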
## (key, value) pairs
Due to its origins in MapReduce, a lot of data analytics defaults to using (key, value) pairs as the data representation. In Python you would use tuples to represent these,
```python
pair = ('a', 'b')
print(pair[0], pair[1])
```
Let's consider a quick example,
```python
pets = sc.parallelize([('cat', 1), ('dog', 3), ('cat', 2), ('dog', 1), ('hamster', 1)])
```
```python
# sum the values for each key
pets.reduceByKey(lambda x, y: x + y)
```
```python
# group the values belonging to each key into an iterable
pets.groupByKey()
```
```python
# sort the RDD by key
pets.sortByKey()
```
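These are all transformations, so nothing is actually computed until an action is applied. A small sketch to materialize the per-key sums (the ordering of the returned list may vary):

```python
# trigger the computation and bring the per-key sums back to the driver
pets.reduceByKey(lambda x, y: x + y).collect()
```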
Let's try a more complex example: word count and working with files.
```python
lines = sc.textFile("sherlock.txt")
lines.count()
```
> 128457
OK, so that's a long file. Let's do a word count on this file and list the frequent words occurring in The Adventures of Sherlock Holmes by Sir Arthur Conan Doyle.
```python
# split each line into words, keep only long words, count the occurrences
# of each word, and sort by descending count
lines.flatMap(lambda line: line.split(" ")) \
     .filter(lambda w: len(w) > 8) \
     .map(lambda word: (word, 1)) \
     .reduceByKey(lambda x, y: x + y) \
     .sortBy(lambda x: x[1], False)
```
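As written, this is again only a chain of transformations; to actually see the most frequent long words we would finish with an action such as `take`. A sketch, storing the chain in an illustrative variable `counts` (the specific words and counts depend on the text file):

```python
counts = lines.flatMap(lambda line: line.split(" ")) \
              .filter(lambda w: len(w) > 8) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda x, y: x + y) \
              .sortBy(lambda x: x[1], False)

# take the ten most frequent words longer than eight characters
counts.take(10)
```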