Hari Sundar


Assignment 3 - Web Search

Due Oct 4, 11:59pm

# Let's search for some Wikipedia pages

You will design a basic web search for this assignment. We will use the current Wikipedia pages as our test data. Your goal is to search for a given text (phrase) and return the best matches, ranked by their PageRank.

I am intentionally keeping this assignment open-ended. The building blocks for this assignment are the two previous assignments. You need to build on them and make reasonable choices. I anticipate two parts to the project:

1. Processing the input files to generate indexed data that can be searched and retrieved.
2. Performing the actual search.

Part 2 can be done in Spark or locally in pure Python (or another language).

For Part 1, the steps would be roughly:

1. Extract the text and links from each wiki page.
    1. Generate a descriptor from the text (say, the feature vectors for Jaccard/Cosine from assignment 1).
    2. Generate the graph or transition probability matrix using the links.
2. Compute PageRank.
3. Save the descriptors and PageRanks to be used by Part 2.

I recommend that you start thinking about the various aspects and complications of each step. You will need to finish the task, so it is important to make reasonable assumptions. Think about how you want to approach this assignment, and come talk to me if you are not sure about the direction you wish to take.

You will submit a report documenting the overall design of the web search and the choices and assumptions you made. You will also submit your code and give a demo of your search.

## Helpful Links

* Wikimedia datasets
* Wikimedia parsers
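
To make the two parts described above concrete, here is a minimal sketch in pure Python of one way the pieces could fit together on a toy dataset. It is not a required design. Everything in it is an assumption for illustration: the helper names (`shingles`, `jaccard`, `pagerank`, `search`), the single-word descriptors, the damping factor of 0.85, the "any overlap counts as a match" threshold, and the four hand-written pages that stand in for parsed Wikipedia text and links.

```python
import re


def shingles(text, k=1):
    """Descriptor: the set of word k-shingles in the text (hypothetical choice)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 0))}


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def pagerank(links, damping=0.85, tol=1e-10, max_iter=200):
    """Power iteration on a dict {page: set of pages it links to}.

    Pages with no outgoing links spread their rank uniformly over all pages.
    """
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for src, targets in links.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for t in targets:
                    new[t] += share
        delta = sum(abs(new[p] - rank[p]) for p in pages)
        rank = new
        if delta < tol:
            break
    return rank


def search(query, descriptors, ranks, min_sim=0.0, top=10):
    """Part 2: keep pages whose similarity exceeds min_sim, order by PageRank."""
    q = shingles(query)
    hits = [p for p, d in descriptors.items() if jaccard(q, d) > min_sim]
    return sorted(hits, key=lambda p: ranks[p], reverse=True)[:top]


# Toy stand-in for parsed wiki pages: title -> (text, outgoing links).
toy_pages = {
    "A": ("graph theory and random walks", {"B", "C"}),
    "B": ("random walks on a graph", {"C"}),
    "C": ("cooking pasta at home", {"A"}),
    "D": ("graph coloring", {"C"}),
}

# Part 1: build descriptors and PageRanks (these would be saved to disk).
descriptors = {title: shingles(text) for title, (text, _) in toy_pages.items()}
ranks = pagerank({title: out for title, (_, out) in toy_pages.items()})

# Part 2: query the saved data.
print(search("graph", descriptors, ranks))
```

Note that Jaccard similarity between a short query and a long page is small by construction, so on real Wikipedia text the matching rule and threshold are among the choices you would need to revisit and justify in your report.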