## Bigrams
Let us import some helper routines and define a function that cleans and tokenizes text.
python
from operator import add
import re
def clean_and_split(text):
    # split on anything that is not a letter or a period,
    # then drop the empty strings left over by the split
    str1 = re.split('[^a-zA-Z.]', text.lower())
    str2 = filter(None, str1)
    return str2
test_str = """In order to-do this, we will define! features for each book?
For the Jaccard distance, we define this as the intersection/union
of the words used in the two books that I am comparing."""
print clean_and_split(test_str)
['in', 'order', 'to', 'do', 'this', 'we', 'will', 'define', 'features', 'for', 'each', 'book', 'for', 'the', 'jaccard', 'distance', 'we', 'define', 'this', 'as', 'the', 'intersection', 'union', 'of', 'the', 'words', 'used', 'in', 'the', 'two', 'books', 'that', 'i', 'am', 'comparing.']
python
# read the text, join the lines of each partition into one string,
# then split on periods to get (approximate) sentences
sentences = sc.textFile("sherlock.txt") \
             .glom() \
             .map(lambda x: " ".join(x)) \
             .flatMap(lambda x: x.split("."))
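The `glom()` step is what lets a sentence span several lines of the input file: `textFile` yields one record per line, so we collect each partition's lines and re-join them before splitting on periods. A minimal sketch of the idea, assuming the same SparkContext `sc` (the toy strings are purely illustrative):
python
# a one-partition RDD where a sentence wraps across two lines
demo = sc.parallelize(["It was a dark", "and stormy night. The end."], 1)
joined = demo.glom() \
             .map(lambda x: " ".join(x)) \
             .flatMap(lambda x: x.split("."))
print(joined.collect())
# -> ['It was a dark and stormy night', ' The end', '']
Note that `glom()` only joins lines within a partition, so a sentence crossing a partition boundary would still be split.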
Let us first generate all pairs of adjacent words and count them, similar to the word-count example.
python
# tokenize each sentence, then emit ((word, next_word), 1) for every adjacent pair
bigrams = sentences.map(lambda x: x.split()) \
                   .flatMap(lambda s: [((s[i], s[i+1]), 1) for i in range(len(s)-1)])
python
# sum the counts per pair and keep the ten most frequent bigrams
transitions = bigrams.reduceByKey(add) \
                     .sortBy(lambda x: x[1], False) \
                     .take(10)
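The pair construction itself is plain Python; on a single tokenized sentence it produces one `((word, next_word), 1)` record per adjacent pair. A toy illustration, independent of Spark:
python
s = "the quick brown fox".split()
print([((s[i], s[i+1]), 1) for i in range(len(s)-1)])
# -> [(('the', 'quick'), 1), (('quick', 'brown'), 1), (('brown', 'fox'), 1)]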
## Text similarity
You will find here a collection of books from the Gutenberg project. I am interested in seeing how similar these books are, by a metric of my choice. We shall define similarity between books based on two metrics:
* Jaccard distance
* Cosine distance
In order to do this, we will define features for each book. For the Jaccard distance, the feature is the set of words used in a book, and the similarity of two books is the size of the intersection divided by the size of the union of their word sets. For the cosine distance, we will first count the words used across all the books and select the 100 most used words as a feature vector (see the code below). For each book, the normalized counts of these selected words form its feature vector, so the similarity between any two books is the inner or dot product of their normalized feature vectors (\((a,b)=\sum_i a_i b_i\)).
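As a quick sanity check of that formula, here is a minimal sketch with two made-up count vectors (the values are purely illustrative):
python
import numpy as np

a = np.array([3.0, 1.0, 0.0])  # hypothetical word counts for book A
b = np.array([1.0, 2.0, 2.0])  # hypothetical word counts for book B
a = a / np.linalg.norm(a)      # normalize to unit length
b = b / np.linalg.norm(b)
print(np.dot(a, b))            # cosine similarity, ~0.53 here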
First, let us load the books.
python
import os

books = sc.wholeTextFiles("data/*")
# separate the book names (file paths) from their contents
booknames = books.keys().map(os.path.basename)
books.count()
451
## Jaccard distance
Let's extract the set of words in each text and compute the pairwise similarities using `cartesian`.
python
def clean_and_unique(text):
    # split on anything that is not a letter, drop empties,
    # and keep each word only once
    str1 = re.split('[^a-zA-Z]', text.lower())
    str2 = filter(None, str1)
    return set(str2)
test_str = "In order to-do this, we will define! features for each book? For the Jaccard distance, we define this as the intersection/union of the words used in the two books that I am comparing."
print clean_and_unique(test_str)
set(['features', 'am', 'as', 'books', 'in', 'for', 'comparing', 'union', 'that', 'two', 'to', 'book', 'define', 'do', 'we', 'used', 'jaccard', 'words', 'intersection', 'distance', 'this', 'of', 'will', 'i', 'each', 'the', 'order'])
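The Jaccard similarity itself is a one-liner on Python sets. A toy illustration with two made-up phrases:
python
a = clean_and_unique("the cat sat on the mat")
b = clean_and_unique("the cat ate the rat")
print(float(len(a & b)) / len(a | b))  # |intersection| / |union| = 2/7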
python
contents = books.values().map(clean_and_unique)
# pair each book name with its set of unique words
feat_jaccard = booknames.zip(contents).cache()
python
# all pairs of books; keep each unordered pair exactly once,
# compute |intersection| / |union|, and report the most similar pairs
feat_jaccard.cartesian(feat_jaccard) \
    .filter(lambda (b1, b2): b1[0] > b2[0]) \
    .map(lambda (b1, b2): ((b1[0], b2[0]),
                           float(len(b1[1] & b2[1])) / len(b1[1] | b2[1]))) \
    .filter(lambda (b, sim): sim > 0.4) \
    .collect()
[((u'HattonJoseph__Threerecruitsandthegirlstheyleftbehindthemanovel.txt',
   u'BanksLinnaeusGeorge__WooersandwinnersorUnderthescarsaYorkshirestory.txt'),
  0.4051434110737298),
 ((u'TinsleyLily__Intheringanovel.txt',
   u'ForbesUrquhartAtwell__OtterstoneHall.txt'),
  0.40178225493994574),
 ((u'OliphantMrsMargaret__Madam.txt',
   u'ChetwyndHenryWaylandMrs__MrsDorrimananovel.txt'),
  0.41121726239499257),
 ((u'PaynJames__Kitamemory.txt', u'CollinsWilkie__Isayno.txt'),
  0.40801567991349014),
 ((u'OliphantMrsMargaret__ThegreatestheiressinEngland.txt',
   u'GissingGeorge__Theunclassedanovel.txt'),
  0.4014689680499449),
 ((u'YongeCharlotteMary__Nuttiesfather.txt',
   u'BlackWilliam__YolandeThestoryofadaughter.txt'),
  0.41390034480427285),
 ((u'CrommelinMay__Lovethepilgrim.txt', u'BradshawJohnMrs__RogerNorth.txt'),
  0.403788144510698)]
## Cosine distance
Things are a bit more involved for the cosine distance. Let us first compute the global list of the most frequent words. Notice that I use both plain Python (a `Counter` within each book) and Spark (`reduceByKey` across books) to tally the words.
python
from collections import Counter
import numpy as np

def common_words(text):
    str2 = re.split('[^a-zA-Z]', text.lower())
    # keep only words longer than five letters to skip common stopwords
    str3 = [i for i in str2 if len(i) > 5]
    return Counter(str3).most_common(100)

# count words within each book, sum the counts across books,
# and keep the 100 most frequent words overall as the feature words
feat_words = books.values().flatMap(common_words) \
                  .reduceByKey(add) \
                  .sortBy(lambda x: x[1], False) \
                  .keys().take(100)

def gen_norm_feature(text):
    str2 = re.split('[^a-zA-Z]', text.lower())
    wc = Counter(str2)
    # counts of the selected feature words, normalized to unit length
    fv = [wc[i] for i in feat_words]
    nfv = np.array(fv)
    return nfv / np.linalg.norm(nfv)

# now we will create the feature vectors, as we did for Jaccard
contents = books.values().map(gen_norm_feature)
feat_cos = booknames.zip(contents).cache()
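A quick sanity check, assuming the pipeline above has run: every feature vector is normalized, so a book's dot product with itself should be 1.
python
# the first book's feature vector; self-similarity should be ~1.0
v = feat_cos.values().first()
print(np.dot(v, v))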
python
# all pairs of books; keep each unordered pair exactly once, then take
# the dot product of the unit vectors as the cosine similarity
feat_cos.cartesian(feat_cos) \
    .filter(lambda (b1, b2): b1[0] > b2[0]) \
    .map(lambda (b1, b2): ((b1[0], b2[0]), np.dot(b1[1], b2[1]))) \
    .filter(lambda (b, sim): sim > 0.92) \
    .collect()
[((u'FothergillJessie__Borderlandacountrytownchronicle.txt',
   u'ErrollHenry__Theacademician.txt'),
  0.92106217175448735),
 ((u'HarrisonJoanna__Anorthernlilyfiveyearsofanuneventfullife.txt',
   u'BlackburneGertrudeMaryIreland__Inopposition.txt'),
  0.92322759818450328),
 ((u'SmithConstance__TherepentanceofPaulWentworthAnovel.txt',
   u'ErrollHenry__Theacademician.txt'),
  0.93747231623466687),
 ((u'WoodMrsHenry__LadyGraceandotherstories.txt',
   u'MarryatFlorence__Amomentofmadnessandotherstories.txt'),
  0.93804304128916194)]