Hari Sundar


## Bigrams

Let us import some helper routines …

```python
from operator import add
import os
import re

def clean_and_split(text):
    # remove punctuation, keeping sentence-ending periods
    str1 = re.split('[^a-zA-Z.]', text.lower())
    str2 = filter(None, str1)
    return str2

test_str = """In order to-do this, we will define! features for each book? For the Jaccard distance, we define this as the intersection/union of the words used in the two books that I am comparing."""

print clean_and_split(test_str)
```

    ['in', 'order', 'to', 'do', 'this', 'we', 'will', 'define', 'features', 'for', 'each', 'book', 'for', 'the', 'jaccard', 'distance', 'we', 'define', 'this', 'as', 'the', 'intersection', 'union', 'of', 'the', 'words', 'used', 'in', 'the', 'two', 'books', 'that', 'i', 'am', 'comparing.']

```python
# join the lines in each partition and split on periods to get (approximate) sentences
sentences = sc.textFile("sherlock.txt") \
              .glom() \
              .map(lambda x: " ".join(x)) \
              .flatMap(lambda x: x.split("."))
```

Let us first generate all pairs occurring together, and count them, similar to the wordcount example.

```python
# emit ((w_i, w_{i+1}), 1) for every adjacent pair of words in a sentence
bigrams = sentences.map(lambda x: x.split()) \
                   .flatMap(lambda s: [((s[i], s[i+1]), 1) for i in range(0, len(s)-1)])
```

```python
# count each bigram and keep the ten most frequent
transitions = bigrams.reduceByKey(add).sortBy(lambda x: x[1], False).take(10)
```

## Text similarity

You will find here the top 100 books from the Gutenberg project. I am interested in seeing how similar these books are, by a metric of my choice. We shall define similarity between books based on two metrics,

* Jaccard distance
* Cosine distance

In order to do this, we will define features for each book. For the Jaccard distance, we define this as the intersection/union of the words used in the two books that I am comparing. For the cosine distance, we will first compute the words used in all the books, then select the top 1000 most used words as a feature vector. For each book, the normalized word counts for these selected 1000 words will be used for the cosine distance, i.e., the similarity between any two books is the inner or dot product of these normalized feature vectors (\((a,b)=\sum a_i b_i\)).

First let us load the books,

```python
books = sc.wholeTextFiles("data/*")

# separate book names from content
booknames = books.keys().map(os.path.basename)

books.count()
```

    451
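Before computing these two measures over the full collection, here is a minimal pure-Python/NumPy sketch of their definitions on a pair of toy word sets; `words_a` and `words_b` are made up purely for illustration.

```python
import numpy as np

# two toy "books", reduced to the sets of words they use
words_a = set(["the", "hound", "of", "the", "baskervilles"])
words_b = set(["the", "sign", "of", "the", "four"])

# Jaccard: size of the intersection over size of the union
jaccard = float(len(words_a & words_b)) / len(words_a | words_b)
print jaccard                 # 2 common words / 6 distinct words = 0.333...

# Cosine: dot product of normalized word-count vectors over a fixed vocabulary
vocab = sorted(words_a | words_b)
fa = np.array([1.0 if w in words_a else 0.0 for w in vocab])
fb = np.array([1.0 if w in words_b else 0.0 for w in vocab])
fa /= np.linalg.norm(fa)
fb /= np.linalg.norm(fb)
print np.dot(fa, fb)          # (a, b) = sum_i a_i * b_i
```

The toy cosine value is computed on presence vectors; for the real books we will use the normalized word counts instead.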
## Jaccard distance

Let's extract the words in each text and compute the similarity using `cartesian`.

```python
def clean_and_unique(text):
    # remove punctuation
    str1 = re.split('[^a-zA-Z]', text.lower())
    str2 = filter(None, str1)
    return set(str2)

test_str = "In order to-do this, we will define! features for each book? For the Jaccard distance, we define this as the intersection/union of the words used in the two books that I am comparing."

print clean_and_unique(test_str)
```

    set(['features', 'am', 'as', 'books', 'in', 'for', 'comparing', 'union', 'that', 'two', 'to', 'book', 'define', 'do', 'we', 'used', 'jaccard', 'words', 'intersection', 'distance', 'this', 'of', 'will', 'i', 'each', 'the', 'order'])

```python
contents = books.values().map(clean_and_unique)
feat_jaccard = booknames.zip(contents).cache()
```

```python
feat_jaccard.cartesian(feat_jaccard) \
            .filter(lambda (b1, b2): b1[0] > b2[0]) \
            .map(lambda (b1, b2): ((b1[0], b2[0]), float(len(b1[1] & b2[1])) / len(b1[1] | b2[1]))) \
            .filter(lambda (b, sim): sim > 0.4) \
            .collect()
```

    [((u'HattonJoseph__Threerecruitsandthegirlstheyleftbehindthemanovel.txt',
       u'BanksLinnaeusGeorge__WooersandwinnersorUnderthescarsaYorkshirestory.txt'),
      0.4051434110737298),
     ((u'TinsleyLily__Intheringanovel.txt',
       u'ForbesUrquhartAtwell__OtterstoneHall.txt'),
      0.40178225493994574),
     ((u'OliphantMrsMargaret__Madam.txt',
       u'ChetwyndHenryWaylandMrs__MrsDorrimananovel.txt'),
      0.41121726239499257),
     ((u'PaynJames__Kitamemory.txt',
       u'CollinsWilkie__Isayno.txt'),
      0.40801567991349014),
     ((u'OliphantMrsMargaret__ThegreatestheiressinEngland.txt',
       u'GissingGeorge__Theunclassedanovel.txt'),
      0.4014689680499449),
     ((u'YongeCharlotteMary__Nuttiesfather.txt',
       u'BlackWilliam__YolandeThestoryofadaughter.txt'),
      0.41390034480427285),
     ((u'CrommelinMay__Lovethepilgrim.txt',
       u'BradshawJohnMrs__RogerNorth.txt'),
      0.403788144510698)]

## Cosine Distance

Things are a bit more involved for the cosine distance. Let us first compute the global list of words. Notice that I use both Python and Spark to help me remove duplicate words.

```python
from collections import Counter
import numpy as np

def common_words(text):
    str2 = re.split('[^a-zA-Z]', text.lower())
    str3 = [i for i in str2 if len(i) > 5]
    return Counter(str3).most_common(100)

feat_words = books.values().flatMap(common_words).reduceByKey(add).sortBy(lambda x: x[1], False).keys().take(100)

def gen_norm_feature(text):
    str2 = re.split('[^a-zA-Z]', text.lower())
    wc = Counter(str2)
    fv = [wc[i] for i in feat_words]
    nfv = np.array(fv)
    return nfv / np.linalg.norm(nfv)

# now we will create the feature vector similar to how we did for Jaccard.
contents = books.values().map(gen_norm_feature)
feat_cos = booknames.zip(contents).cache()
```

```python
feat_cos.cartesian(feat_cos) \
        .filter(lambda (b1, b2): b1[0] > b2[0]) \
        .map(lambda (b1, b2): ((b1[0], b2[0]), np.dot(b1[1], b2[1]))) \
        .filter(lambda (b, sim): sim > 0.92) \
        .collect()
```

    [((u'FothergillJessie__Borderlandacountrytownchronicle.txt',
       u'ErrollHenry__Theacademician.txt'),
      0.92106217175448735),
     ((u'HarrisonJoanna__Anorthernlilyfiveyearsofanuneventfullife.txt',
       u'BlackburneGertrudeMaryIreland__Inopposition.txt'),
      0.92322759818450328),
     ((u'SmithConstance__TherepentanceofPaulWentworthAnovel.txt',
       u'ErrollHenry__Theacademician.txt'),
      0.93747231623466687),
     ((u'WoodMrsHenry__LadyGraceandotherstories.txt',
       u'MarryatFlorence__Amomentofmadnessandotherstories.txt'),
      0.93804304128916194)]
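Since there are only a few hundred books and the feature vectors are short, the same pairwise cosine similarities can also be checked on the driver with a single matrix product. A minimal sketch, assuming `feat_cos` is small enough to `collect()`:

```python
import numpy as np

# pull the (name, normalized feature vector) pairs to the driver;
# fine here because a few hundred books times 100 features is tiny
pairs = feat_cos.collect()
names = [name for name, vec in pairs]
M = np.vstack([vec for name, vec in pairs])   # one unit-length row per book

S = np.dot(M, M.T)                            # S[i, j] = cosine similarity of books i and j

# report pairs above the same 0.92 threshold, skipping self- and duplicate pairs
for i in range(len(names)):
    for j in range(i):
        if S[i, j] > 0.92:
            print names[i], names[j], S[i, j]
```

The `cartesian` version above is the one that scales when the feature vectors do not fit on the driver; the matrix product is just a convenient cross-check at this size.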