## Word transition counts
In the first assignment, we will continue from the word count example covered in class. Assume you are designing a smart keyboard app and would like to predict the next word based on the last word entered. To do this, we need to estimate transition probabilities between words. In the simplest case, we would like to know the most likely word to follow any given word. Normally, such a list would be built from the user's emails, prior text entry, etc. For the purposes of this assignment, build a simple data structure that counts the number of times word2 follows word1. Effectively, you need something similar to the word count that looks like ((word1, word2), n). Use the same text we used for the word count example, The Adventures of Sherlock Holmes by Sir Arthur Conan Doyle. List the ten most frequently occurring word pairs.
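As a starting point, here is a minimal sketch of the pair-counting idea in plain Python, assuming the book is available locally as a file named sherlock.txt (a hypothetical filename) and using a simple regex tokenizer; the same pairing logic carries over to whatever word-count framework was used in class.

```python
from collections import Counter
import re

# Hypothetical local copy of the text; adjust to however the book is loaded in class.
with open("sherlock.txt", encoding="utf-8") as f:
    words = re.findall(r"[a-z']+", f.read().lower())

# Count how many times word2 follows word1: keys are (word1, word2) pairs.
pair_counts = Counter(zip(words, words[1:]))

# Print the ten most frequently occurring word pairs.
for (w1, w2), n in pair_counts.most_common(10):
    print(w1, w2, n)
```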
## Text similarity
Originally, you would use the top 100 books from the Gutenberg project. Due to problems downloading the Gutenberg files, I have changed to using the British Corpus (451 books). I was able to run the similarity problem on all 451 books on my laptop with 8 GB of RAM in under a minute.
I am interested in computing how similar these books are. We shall define similarity between books based on two metrics:
* Jaccard distance
* Cosine distance
In order to do this, we will define features for each book. For the Jaccard distance, we define this as the intersection over the union of the sets of words used in the two books being compared. For the cosine distance, we will first compute the words used across all the books, then select the top 1000 (or some small subset) most used words as a feature vector. For each book, the normalized word counts for these selected 1000 words will be used for the cosine distance, i.e., the similarity between any two books is the inner (dot) product of these normalized feature vectors (\((a,b)=\sum a_i b_i\)). By normalized, I mean that the feature vector for each book should have a magnitude of 1. In Python:
```python
from numpy import array
from numpy.linalg import norm

a = array([1.0, 1.0, 0.6])
b = a / norm(a)
# b = array([0.65094455, 0.65094455, 0.39056673])
```
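To make the two metrics concrete, here is a hedged sketch of both similarity computations. It assumes each book has already been tokenized into a list of lowercase words, and that `vocabulary` is the list of top-1000 words selected across all books; the function and variable names are illustrative, not a required interface.

```python
import numpy as np

def jaccard_similarity(words_a, words_b):
    """Intersection over union of the word sets of two books."""
    a, b = set(words_a), set(words_b)
    return len(a & b) / len(a | b)

def cosine_similarity(counts_a, counts_b, vocabulary):
    """Dot product of unit-normalized count vectors over a fixed vocabulary.

    counts_a and counts_b are word -> count mappings (e.g. collections.Counter).
    """
    va = np.array([counts_a.get(w, 0) for w in vocabulary], dtype=float)
    vb = np.array([counts_b.get(w, 0) for w in vocabulary], dtype=float)
    va /= np.linalg.norm(va)
    vb /= np.linalg.norm(vb)
    return float(va @ vb)
```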
You should include the similarity matrices plotted as images in your report. Refer to this page for using matplotlib to render a matrix as an image. Also include the similarity matrices as a text file (ASCII) along with your submission.
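One possible way to produce both deliverables, assuming `sim` is an N×N numpy array of pairwise similarities (the placeholder data and filenames below are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

# sim is assumed to be an N x N numpy array of pairwise similarities.
sim = np.random.rand(10, 10)  # placeholder; replace with your actual matrix

# Render the matrix as an image for the report.
plt.imshow(sim, cmap="viridis")
plt.colorbar()
plt.title("Book similarity")
plt.savefig("similarity_matrix.png")

# Plain-text (ASCII) copy of the matrix for the submission.
np.savetxt("similarity_matrix.txt", sim, fmt="%.4f")
```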
Submissions should be done via Canvas.