Assign 2: Search Engine


Search engines are one of the most influential developments of the modern Internet age, having completely revolutionized the way that people use and interact with the Web. What was once an intractable jumble of data with no semblance of organization has been transformed into an easily searchable repository of human information. For this assignment, you will recreate this phenomenal technological development by using the Map and Set ADTs to build a document search engine that can find matching pages for a user's query on demand with lightning-fast response time. This is a simplified version of the technology underpinning Spotlight, Google, Bing, and every other modern search engine that you encounter in your day-to-day internet use.

Want to see the power of search right at your fingertips? Use the "Search" feature in the top-right of the website and search for the phrase "easter egg" and see what you instantly find hidden deep in the recesses of the course website…

Big picture

The program you write for this assignment will produce a custom data structure for a particular set of web pages, represented in the form of a URL ("Uniform Resource Locator") and an associated string of text representing the body of the webpage. First, your program will process the body text of each page and populate the data structure with a mapping from words in the body text to page URLs. Once you have built the data structure, you will then write a function that enables search, given the data structure of webpage information and a target query that one might want to search for. Finally, you will write a console program that allows a user to enter many different search queries and get back the webpages on which those search queries can be found. Put all together this way, you will have built your own mini search engine!

An inverted index

The key to enabling efficient search of a large data structure comes down to how we structure and store the data. A poor choice in data structures can cause search to be very slow, while a wise arrangement can allow search to be near instantaneous. In this section of the handout, we will describe one data storage scheme that can be used to enable super-fast lookup.

To begin with, let's consider the index of a book. For example, when you look in the index of the CS 106B textbook, one of the entries is the keyword "Internet", and two page numbers, 18 and 821. The word "Internet" occurs on page number 18 and again on page number 821. A book's index is an example of an inverted index, where you have a word in mind, and you can find the page number it is listed on (a book's index does not usually include every instance of the word on all pages, but is usually curated to give you the best matches). In other words, an inverted index creates a mapping from content to locations in a document (or table, etc.). This is in contrast to a forward index, which for our book example would be a list of page numbers with all the words listed on that page. A search engine uses an inverted index to allow for fast retrieval of web pages – thus, we will first figure out how to build one of these inverted indexes to efficiently store the data that we want to be able to search through.

Efficiency note: To create an inverted index, you must process the entire document, or set of documents, and catalog where each word is located. This may not be a fast process, but once the inverted index is complete, searching for words and their corresponding locations is extremely fast.
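To make the two directions concrete, the book-index example can be written out as two small maps. This sketch uses std::map and std::set rather than the Stanford classes, and the words other than "internet" are invented purely for illustration:

```cpp
#include <map>
#include <set>
#include <string>

// Forward index: page number -> words appearing on that page.
// (The words other than "internet" are made up for illustration.)
const std::map<int, std::set<std::string>> forwardIndex = {
    {18,  {"internet", "history"}},
    {821, {"internet", "protocols"}},
};

// Inverted index: word -> pages where it appears. This is the direction we
// build in this assignment, with URLs standing in for page numbers.
const std::map<std::string, std::set<int>> invertedIndex = {
    {"internet",  {18, 821}},
    {"history",   {18}},
    {"protocols", {821}},
};
```

Looking up "internet" in the inverted index hands you both page numbers in one step, with no scan over the pages themselves.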

Milestone 1: Building the inverted index

Given that we love decomposition in CS106B, we will break down building the index into a couple different steps.

Your first main task will be to complete the following function, whose declaration has been provided for you in search.cpp.

Map<string, Set<string>> readDocs(string dbfile)

This function should open the named file, read it line by line, and build a map from each URL to a Set<string> representing the unique words contained on that page.

We gathered the body text from each of the pages in the course web site and organized it into a combined database file for your program to process. The format of the database file is very simple:

  • File format
    • The lines of the file are grouped in pairs.
      • The first line of a pair is a page URL.
      • The second line of a pair is the body text of that page, with all newlines removed (basically text of the page in a single string).
    • The third and fourth lines form another pair, the fifth and sixth another, and so on, with alternating lines of page URL and page content.

To process the database file, your code will need to:

  • For each page URL, tokenize its contents – in other words, divide the body text into individual tokens, which will be separated by spaces. Do this division using the stringSplit function from strlib.h.
  • Discard any non-word tokens. A word (or at least a "word-like" token) is one that contains at least one letter.
  • Trim away leading and trailing punctuation marks (according to ispunct) from each token. Specifically, this means removing all punctuation characters from the beginning and end of a token, but not from inside it. The input tokens section, section., and "section" are each trimmed to become the same token, which makes searching more effective. It also means that doesn't should be stored as-is (and likewise for the token as-is), since for both of these words the punctuation occurs in the middle of the word.
    • (As an aside, this sounds similar to the functions you wrote for last week's assignment so you may have some code you can repurpose. The trim task makes a tidy helper function that you can unit-test independently. Decomposition is always a good idea to aid your test and debug efforts.)
    • There are a few oddball characters (curly quotes, bullets, and the like) that are not reported as punctuation by ispunct. You do not need to make a special case for these; simply trim according to ispunct.
  • All words should be stored in lowercase format, to make for easier search eventually.

Note: Even if the body text contains a thousand occurrences of the word "llama", the set of unique tokens contains only one copy, so gathering the unique words in a set is perfect for avoiding unnecessary duplication.
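As a sketch of what the trimming helper suggested above might look like (the name cleanToken and the exact signature are our own; the starter code does not prescribe one), trimming per ispunct, checking for a letter, and lowercasing could be done like this:

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: trims leading and trailing punctuation (per ispunct),
// lowercases the result, and returns "" for tokens with no letters.
std::string cleanToken(const std::string& token) {
    size_t start = 0, end = token.size();
    while (start < end && std::ispunct(static_cast<unsigned char>(token[start]))) start++;
    while (end > start && std::ispunct(static_cast<unsigned char>(token[end - 1]))) end--;
    std::string trimmed = token.substr(start, end - start);

    bool hasLetter = false;
    for (char& c : trimmed) {
        if (std::isalpha(static_cast<unsigned char>(c))) hasLetter = true;
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }
    return hasLetter ? trimmed : "";   // "" signals a non-word token to discard
}
```

Returning the empty string for non-word tokens gives the caller one simple check; a helper shaped like this is easy to unit-test on inputs such as "EGGS!", "@", and "as-is" before wiring it into readDocs.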

Given the following sample file (found in res/tiny.txt),

www.shoppinglist.com
EGGS! milk, fish,      @  bread cheese
www.rainbow.org
red ~green~ orange yellow blue indigo violet
www.dr.seuss.net
One Fish Two Fish Red fish Blue fish !!!
www.bigbadwolf.com
I'm not trying to eat you

your function should return the following map:

{
    "www.bigbadwolf.com" : {"eat", "i'm", "not", "to", "trying", "you"},
    "www.dr.seuss.net" : { "blue", "fish", "one", "red", "two"},
    "www.rainbow.org" : { "blue", "green", "indigo", "orange", "red", "violet", "yellow"},
    "www.shoppinglist.com" :{ "bread", "cheese", "eggs", "fish", "milk"}
}

We've provided some unit tests to get you started. Make sure all the provided tests pass, but that won't be enough to fully vet the function. You should add student tests of your own to ensure you have comprehensive coverage. Don't move on until all tests pass.

Now that we have the document data in a well-structured form, it is time to build our inverted index! Your next task is to complete the following function, defined in search.cpp.

Map<string, Set<string>> buildIndex(Map<string, Set<string>>& docs)

Given the map of document data, your job is to build the inverted index data structure, which maps each word to a set of URLs where that word can be found. Before starting to implement this function, take some time to work through the example map shown above.

  • Q8: What would the sample inverted index for the above document map look like?

Once you've worked through this example, go ahead and implement the function. Make sure to follow the typical cycle of testing and debugging your code. Start with the provided tests and then extend with student tests of your own. Don't move on until all tests pass.
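Once you have answered Q8 for yourself, note that the inversion follows a simple nested-loop pattern. Here is a generic sketch using std::map and std::set in place of the Stanford Map and Set (the name buildIndexSketch is ours, to avoid confusion with the real function):

```cpp
#include <map>
#include <set>
#include <string>

// Inversion sketch: for every (url, words) pair in the document map,
// record the url under each of its words.
std::map<std::string, std::set<std::string>>
buildIndexSketch(const std::map<std::string, std::set<std::string>>& docs) {
    std::map<std::string, std::set<std::string>> index;
    for (const auto& [url, words] : docs) {
        for (const std::string& word : words) {
            index[word].insert(url);   // word -> set of URLs containing it
        }
    }
    return index;
}
```

The same word may appear on many pages, but since the values are sets, each URL is recorded at most once per word.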

Compound queries

Now that we have built an inverted index, we can turn our focus to figuring out how to give users the ability to query our database and get search results back!

Milestone 2: Compiling website matches based on user query

For this part of the assignment, you will be implementing the following function, provided in search.cpp.

Set<string> findQueryMatches(Map<string, Set<string>>& index, string query)

The query string argument can either be a single search term or a compound sequence of multiple terms. A search term is a single word, and a sequence of terms is multiple consecutive words, each of which (besides the first one) may or may not be preceded by a modifier like + or - (see below for details).

When finding the matches for a given query, you should follow these rules:

  • For a single search term, the search matches are the URLs of the webpages that contain the specified term.
  • A sequence of terms is handled as a compound query, where the matches from the individual terms are synthesized into one combined result.
  • By default, the matches are unioned together across search terms. However, the user can apply a modifier to a search term to change how its matches are aggregated into the result.
    • If the user prefaces a search term with +, then matches for this term are intersected with the results.
    • If the user prefaces a search term with -, then matches for this term are removed from the result.
  • The same punctuation stripping rules that apply to webpage contents apply to query terms. Before looking for matches in the inverted index, make sure you strip all punctuation from the beginning and end of the individual query terms.

Here are some example queries and how they are interpreted:

  • quokka
    • matches all pages containing the term "quokka"
  • simple cheap
    • means simple OR cheap
    • matches pages that contain either "simple" or "cheap" or both
  • tasty +healthy
    • means tasty AND healthy
    • matches pages that contain both "tasty" and "healthy"
  • tasty -mushrooms
    • means tasty WITHOUT mushrooms
    • matches pages that contain "tasty" but do not contain "mushrooms"
  • tasty -mushrooms simple +cheap
    • means tasty WITHOUT mushrooms OR simple AND cheap
    • matches pages that match ((("tasty" without "mushrooms") or "simple") and "cheap")

There is no precedence for the operators; the query is simply processed from left to right. The matches for the first term are combined with the matches for the second, then combined with the matches for the third term, and so on. In the last query shown above, the matches for tasty are first filtered to remove all pages containing mushrooms, then unioned with all matches for simple, and lastly intersected with all matches for cheap. In implementing this logic, you will find the Set operators for union, intersection, and difference to be very handy!
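The left-to-right evaluation described above can be sketched as follows. This version uses std::set with the &lt;algorithm&gt; set operations; with the Stanford Set, the +, * and - operators play the same three roles. It assumes terms are already cleaned (lowercased, punctuation trimmed) apart from the leading modifier, and evalQuery is an illustrative name of our own:

```cpp
#include <algorithm>
#include <iterator>
#include <map>
#include <set>
#include <sstream>
#include <string>

using StrSet = std::set<std::string>;

// Process the query strictly left to right, folding each term's matches
// into the running result according to its modifier.
StrSet evalQuery(const std::map<std::string, StrSet>& index, const std::string& query) {
    StrSet result;
    std::istringstream in(query);
    std::string term;
    bool first = true;
    while (in >> term) {
        char mod = ' ';
        if (!first && (term[0] == '+' || term[0] == '-')) {
            mod = term[0];
            term = term.substr(1);
        }
        StrSet matches;                       // empty if the term is unindexed
        auto it = index.find(term);
        if (it != index.end()) matches = it->second;

        StrSet combined;
        if (first || mod == ' ') {            // OR: union with running result
            std::set_union(result.begin(), result.end(), matches.begin(), matches.end(),
                           std::inserter(combined, combined.begin()));
        } else if (mod == '+') {              // AND: intersect with running result
            std::set_intersection(result.begin(), result.end(), matches.begin(), matches.end(),
                                  std::inserter(combined, combined.begin()));
        } else {                              // WITHOUT: subtract from running result
            std::set_difference(result.begin(), result.end(), matches.begin(), matches.end(),
                                std::inserter(combined, combined.begin()));
        }
        result = combined;
        first = false;
    }
    return result;
}
```

Tracing "tasty -mushrooms simple +cheap" through this loop reproduces the interpretation shown above: difference, then union, then intersection, in that order.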

There is a lot of functionality to test in query processing; be sure to add an appropriate set of student tests so that you're catching all the cases.

Putting it all together

So far, we've built the capability to turn a mass of unstructured text data into a highly-organized and quickly-searchable inverted index, and written code that allows us to find matches in this database given specific user queries. Now, let's put it all together and build our own search engine!

Milestone 3: Building a search engine!

Your final task will be to implement the function below, whose definition has been provided in search.cpp.

void searchEngine(string dbfile)

This function should implement a console program with the following logic:

  • Using functions you have already written, construct an inverted index from the contents of the specified file.
  • Display to the user how many website URLs were processed to build the index and how many distinct words were found across all website content.
  • Once you have the index constructed, your program should go into a loop that allows the user to enter queries that your program will evaluate.
  • Given a user query, you should calculate the appropriate matches from the inverted index.
  • Once you have computed the resulting set of matching URLs, you should print it to the screen.
  • Repeat this process until the user enters the empty string ("") as their query. At this point, your program should stop running.
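The loop structure described above might be skeletonized as below. This sketch uses std:: containers, takes its input from an arbitrary stream so it can be exercised without a console, and substitutes a placeholder single-term lookup where your findQueryMatches call would go; the names and prompt text are illustrative only:

```cpp
#include <iostream>
#include <map>
#include <set>
#include <sstream>
#include <string>

// Skeleton of the query loop. In the assignment, the placeholder lookup
// below would be replaced by a call to your findQueryMatches function.
int runQueryLoop(std::istream& in, std::ostream& out,
                 const std::map<std::string, std::set<std::string>>& index) {
    int queriesHandled = 0;
    std::string query;
    while (true) {
        out << "Enter query sentence (RETURN/ENTER to quit): ";
        if (!std::getline(in, query) || query.empty()) break;  // empty line quits
        std::set<std::string> matches;        // placeholder: single-term lookup only
        auto it = index.find(query);
        if (it != index.end()) matches = it->second;
        out << "Found " << matches.size() << " matching pages\n";
        queriesHandled++;
    }
    out << "All done!\n";
    return queriesHandled;
}
```

Passing the streams in as parameters is purely a testing convenience; your searchEngine function can simply use cout and the Stanford library's console input routines.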

After you have completed this function, you should be able to carry out the following interaction with the user.

Example program run (executed by running searchEngine("res/website.txt") in main.cpp):

Stand by while building index...
Indexed 50 pages containing 5595 unique terms.

Enter query sentence (RETURN/ENTER to quit): llama
Found 1 matching pages 
{"http://cs106b.stanford.edu/assignments/assign2/searchengine.html"}

Enter query sentence (RETURN/ENTER to quit): suitable +kits
Found 2 matching pages 
{"http://cs106b.stanford.edu/assignments/assign2/searchengine.html", "http://cs106b.stanford.edu/qt/troubleshooting.html"}

Enter query sentence (RETURN/ENTER to quit): Mac linux -windows
Found 3 matching pages 
{"http://cs106b.stanford.edu/lectures/sets-maps/qa.html", "http://cs106b.stanford.edu/qt/install-linux.html", "http://cs106b.stanford.edu/qt/install-mac.html"}

Enter query sentence (RETURN/ENTER to quit): as-is wow!
Found 3 matching pages 
{"http://cs106b.stanford.edu/about_assignments", "http://cs106b.stanford.edu/assignments/assign1/soundex.html", "http://cs106b.stanford.edu/assignments/assign2/searchengine.html"}

Enter query sentence (RETURN/ENTER to quit):

All done!

Congratulations! You're well on your way to becoming the next Internet search pioneer! 🔍

Notes

  • You should use the Stanford library's stringSplit function for tokenizing the text and parsing the user's query sentence. Also apply the same cleanup operation (removing leading and trailing punctuation, converting to lowercase) to the query search terms as you did to the tokens when building the index.
  • The Map and Set classes are the right data structures for storing the inverted index itself. Sets are especially handy when evaluating the query sentence because of the awesome high-level set operations that combine sets.
  • Searching should be case-insensitive, that is, a search for "binky" should return the same results as "Binky" or "BINKY". Be sure to consider what implications this has for how you create and search the index.
  • The res subfolder of the starter project includes two provided database files. tiny.txt is the small example used in the writeup, and website.txt is the body text extracted from all of the pages in our course website (as of April 17). These are text files whose contents you can view in Qt Creator if desired. The project resource files are listed under Other files -> res. Your program can open a resource file by specifying the path to open as "res/myfilename".

Extension ideas

If you get done with the main portion of the assignment and have some extra time, we would encourage you to try working on an extension! A non-exhaustive list of potential extensions is listed below:

  • Weights
    • When you get the results from a Google search, they are ranked so that the more relevant results are first on the list. The Google algorithm is a well-kept trade secret (though it was originally the PageRank algorithm, named for its creator, then a Stanford Master's student, Larry Page), but a simple idea is to put results that have the most matches inside a page highest in the ranking. For this extension, you would need to re-think how you create your index to include the number of matches.
  • Phrase search
    • The assignment does not allow a search for multi-word terms, such as "section leader". Searching for phrases is not trivial, as you cannot simply keep a mapping of all possible phrases in the document. You could, however, keep track of where in each document a word is, and then use that information to determine if words in a phrase are next to each other in any particular document.
  • Stop Words
    • The English language has many, many words that show up in text but are not particularly important for content, such as "the", "and", "if", "a", etc. These words are called Stop Words, and it would make your index smaller if you removed a set of stop words from the index. You can find out more about stop words here.
  • Stemming
    • For your current index, if a user searches for "section" they won't find matches for "sections", even though pages that talk about sections might be relevant and a good match. Stemming is the process of reducing words to their base form, so that (for example) both "section" and "sections" would become, simply, "section".

If you have other creative ideas for extensions, run them by the course staff and we'd be happy to give you guidance!