Assign2: Search Engine
An introduction to search engines
Search engines are one of the most influential developments of the modern internet age, having completely revolutionized the way that people use and interact with the web. What was once an intractable jumble of data with no semblance of organization has been transformed into an easily searchable repository of human information. For this part of the assignment, you will recreate this phenomenal technological development by using the Map and Set ADTs to build a document search engine that can find matching pages for a user's query with lightning-fast response time. This is a simplified version of the technology underpinning Spotlight, Google, Bing, and every other modern search engine that you encounter in your day-to-day internet use.
Want to see the power of search right at your fingertips? Use the "Search" feature in the top-right of the website to search for the phrase "easter egg" and see what you instantly find hidden deep in the recesses of the course website…
The program you write for this assignment task will produce a custom data structure for a particular set of webpages, represented in the form of a URL ("Uniform Resource Locator") and an associated string of text representing the body of the webpage. All five of the functions and tests we'll ask you to implement will be in search.cpp.
First, your program will process the body text of each page to populate the data structure with a mapping from words in the body text to page URLs. Once you have built the data structure, you will then write a function that can search for a target query, given the data structure of webpage information. Finally, you will write a console program that enables a user to enter many different search queries and get back the webpages on which those search queries can be found. Put all together, you will have built your own mini search engine!
Understanding the webpage data
We gathered the body text from each of the pages in a previous version of this course website and organized it into a combined database file for your program to process. The format of the database file is as follows:
- The lines of the file are grouped into pairs with the following structure:
- The first line of a pair is a page URL.
- The second line of a pair is the body text of that page, with all newlines removed (basically text of the page in a single string).
- The first two lines in the file form the first pair. The third and fourth lines form another pair, the fifth and sixth another, and so on, with alternating lines of page URL and page content.
To see what the database file looks like, check out website.txt inside the "Other files/res" folder in the starter code.
Using an inverted index for searching
The key to enabling efficient search of a large data structure comes down to how we structure and store the data. A poor choice in data structures can cause search to be very slow, while a wise arrangement can allow search to be near instantaneous. We will use a data storage scheme that can be used to enable super-fast lookup.
To begin with, let's consider the index of a book. For example, when you look in the index of the CS 106B textbook, one of the entries is the keyword "Internet" and two page numbers, 18 and 821. The word internet occurs on page number 18 and again on page number 821. A book's index is an example of an inverted index, where you have a word in mind and you can find the page number it is listed on (a book's index does not usually include every instance of the word on all pages but is instead curated to give you the best matches). In other words, an inverted index creates a mapping from content to locations in a document (or table, etc.).
This is in contrast to a forward index, which, for our book example, would be a list of page numbers with all the words listed on that page. A search engine uses an inverted index to allow for fast retrieval of webpages – thus, we will first figure out how to build one of these inverted indexes to efficiently store the data that we want to be able to search through.
Efficiency note: To create an inverted index, you must process the entire document, or set of documents, and catalog where each word is located. This may not be a fast process, but once the inverted index is complete, searching for words and their corresponding locations is extremely fast.
1) Write a word-processing helper cleanToken()
Like in our maze program, we'll start off by writing a reusable helper function that will make our lives much easier later down the line. The following function will take in a "token" from a webpage and process it into a trimmed down word that we want to store in our index:
string cleanToken(string token)
Tokens in our webpage database are whitespace-separated words that appear in the body text of each webpage. If you remember the lecture example for finding unique words in a text, one of the issues we run into with real-world texts is the noisy data generated due to punctuation surrounding words. To create and return "clean" tokens, your cleanToken function should meet the following requirements:
- Trim away leading and trailing punctuation marks (according to ispunct) from each token. Specifically, this means to remove all punctuation characters from the beginning and end of a token, but not from inside a token. The input tokens section and section. and "section" are each trimmed to become the same token (section), which makes searching more effective. It also means that doesn't should be stored as-is (and likewise for as-is), since for both of these words the punctuation occurs in the middle of the word.
- Note: There are a few oddball characters (curly quotes, bullets, and the like) that are not reported as punctuation by ispunct; you do not need to make a special case out of this. Simply trim according to ispunct.
- "Discard" any non-word tokens by just returning the empty string if the token does not contain at least one letter.
- Convert all tokens to lowercase to make for easier searching later on.
Once you've edited the input token to adhere to the above guidelines, return it from your function.
2) Process webpages in readDocs()
Given that we love decomposition in CS106B, we will break down building the index into a couple of different steps. In this first step, we'll process the consolidated database text file into a map of URLs (strings) to Set<string>s representing the unique words contained on the given page. You will complete the following function:
Map<string, Set<string>> readDocs(string dbfile)
This function opens the named file (passed in as dbfile), reads it line by line, and builds a map from a URL to a Set<string> representing the unique words contained on that page.
To process the database file, your code will need to:
- First, read in the contents of the file into a vector of lines. You may find it helpful to refer to the provided readMazeFile code in maze.cpp.
- For each page URL, tokenize its contents – in other words, divide the body text into individual tokens, which will be strings separated by spaces. Do this division using the stringSplit function from strlib.h.
- Clean each token using your cleanToken helper function.
- Store the cleaned tokens in your map under that page's URL.
Note: Even if the body text contains a thousand occurrences of the word "llama," the set of unique tokens contains only one copy, so gathering the unique words in a set is perfect for avoiding unnecessary duplication.
Given the following sample file (found in res/tiny.txt),
www.shoppinglist.com
EGGS! milk, fish, @ bread cheese
www.rainbow.org
red ~green~ orange yellow blue indigo violet
www.dr.seuss.net
One Fish Two Fish Red fish Blue fish !!!
www.bigbadwolf.com
I'm not trying to eat you
your function should return the following map:
{
"www.bigbadwolf.com" : {"eat", "i'm", "not", "to", "trying", "you"},
"www.dr.seuss.net" : { "blue", "fish", "one", "red", "two"},
"www.rainbow.org" : { "blue", "green", "indigo", "orange", "red", "violet", "yellow"},
"www.shoppinglist.com" :{ "bread", "cheese", "eggs", "fish", "milk"}
}
We've provided some unit tests to get you started. Make sure all the provided tests pass, but that won't be enough to fully vet the function. You should add student tests of your own to ensure you have comprehensive coverage. Don't move on until all tests pass.
3) Create the inverted index with buildIndex()
Now that we have the document data in a well-structured form, it is time to build our inverted index! Your next task is to complete the following function:
Map<string, Set<string>> buildIndex(Map<string, Set<string>>& docs)
Given the map of document data (docs) that was created by the readDocs function, your job is to build the inverted index data structure. The output of buildIndex will be a map from word to a set of URLs where that word can be found. In other words, this function will invert the map created by readDocs!
Before starting to implement this function, take some time to work through the example map shown above.
- Q9: What would the sample inverted index for the above example document map in milestone 2) look like?
Once you've worked through this example, go ahead and implement the function. Make sure to follow the typical cycle of testing and debugging your code. Start with the provided tests and then extend with student tests of your own. Don't move on until all tests pass.
4) Search webpages using findQueryMatches()
Now that we have built an inverted index, we can turn our focus to figuring out how to give users the ability to query our database and get search results back!
For this part of the assignment, you will be implementing the following function:
Set<string> findQueryMatches(Map<string, Set<string>>& index, string query)
The query string argument can either be a single search term or a compound sequence of multiple terms. A search term is a single word, and a sequence of search terms is multiple consecutive words, each of which (besides the first one) may or may not be preceded by a modifier like + or - (see below for details).
When finding the matches for a given query, you should follow these rules:
- For a single search term, the search matches are the URLs of the webpages that contain the specified term.
- A sequence of terms is handled as a compound query, where the matches from the individual terms are synthesized into one combined result.
- A single search term can take one of three forms:
- By default, when not prefaced with a + or -, the matches for this term are unioned with the existing results. (any result matching either term is included)
- If the user prefaces a search term with +, then matches for this term are intersected with the existing results. (results must match both terms)
- If the user prefaces a search term with -, then matches for this term are removed from the existing results. (results must match one term without matching the other)
- The same punctuation stripping rules that apply to webpage contents apply to query terms. Before looking for matches in the inverted index, make sure you strip all punctuation from the beginning and end of the individual query terms. As before, you should reject any search query terms that don’t have at least one alphabetic character.
Here are some example queries and how they are interpreted:
- quokka – matches all pages containing the term "quokka"
- simple cheap – means simple OR cheap – matches pages that contain either "simple" or "cheap" or both
- tasty +healthy – means tasty AND healthy – matches pages that contain both "tasty" and "healthy"
- tasty -mushrooms – means tasty WITHOUT mushrooms – matches pages that contain "tasty" but do not contain "mushrooms"
- tasty -mushrooms simple +cheap – matches pages that match ((("tasty" without "mushrooms") or "simple") and "cheap")
There is no precedence for the operators; the query is simply processed from left to right. The matches for the first term are combined with the matches for the second, then combined with the matches for the third term, and so on. In the last query shown above, the matches for tasty are first filtered to remove all pages containing mushrooms, then unioned with all matches for simple, and lastly intersected with all matches for cheap. In implementing this logic, you will find the Set operators for union, intersection, and difference to be very handy!
There is a lot of functionality to test in query processing, so be sure to add an appropriate set of student tests to catch all the cases.
5) Put it all together with searchEngine()
So far, we've built capability to turn a mass of unstructured text data into a highly-organized and quickly-searchable inverted index, as well as written code that allows us to find matches in this database given specific user queries. Now, let's put it all together and build our own search engine!
Your final task will be to implement the function below:
void searchEngine(string dbfile)
This function should implement a console program with the following logic:
- Using functions you have already written, construct an inverted index from the contents of the specified file.
- Display to the user how many website URLs were processed to build the index and how many distinct words were found across all website content.
- Once you have the index constructed, your program should go into a loop that allows the user to enter queries that your program will evaluate.
- Given a user query, you should calculate the appropriate matches from the inverted index.
- Once you have computed the resulting set of matching URLs, you should print it to the screen.
- Repeat this process until the user enters the empty string ("") as their query. At this point, your program should stop running.
Note that searching should be case-insensitive, that is, a search for "binky" should return the same results as "Binky" or "BINKY". Be sure to consider what implications this has for how you create and search the index.
After you have completed this function, you should be able to carry out the following interaction flow with the user.
Example program run (executed by running searchEngine("res/website.txt") in main.cpp):
Stand by while building index...
Indexed 50 pages containing 5595 unique terms.
Enter query sentence (RETURN/ENTER to quit): llama
Found 1 matching pages
{"http://cs106b.stanford.edu/assignments/assign2/searchengine.html"}
Enter query sentence (RETURN/ENTER to quit): suitable +kits
Found 2 matching pages
{"http://cs106b.stanford.edu/assignments/assign2/searchengine.html", "http://cs106b.stanford.edu/qt/troubleshooting.html"}
Enter query sentence (RETURN/ENTER to quit): Mac linux -windows
Found 3 matching pages
{"http://cs106b.stanford.edu/lectures/sets-maps/qa.html", "http://cs106b.stanford.edu/qt/install-linux.html", "http://cs106b.stanford.edu/qt/install-mac.html"}
Enter query sentence (RETURN/ENTER to quit): as-is wow!
Found 3 matching pages
{"http://cs106b.stanford.edu/about_assignments", "http://cs106b.stanford.edu/assignments/assign1/soundex.html", "http://cs106b.stanford.edu/assignments/assign2/searchengine.html"}
Enter query sentence (RETURN/ENTER to quit):
All done!
Congratulations! You're well on your way to becoming the next internet search pioneer! 🔍
Notes
- You should use the Stanford library's stringSplit function for tokenizing the text and parsing the user's query sentence. For some of you, the text file and string parsing you're doing in this assignment may be reminiscent of exercises you did in the Python version of CS106A!
- The Map and Set classes are the right data structures for storing the inverted index itself. Sets are especially handy when evaluating the query sentence because of the awesome high-level set operations that combine sets.
- The res subfolder of the starter project includes two provided database files. tiny.txt is the small example used in the writeup and website.txt is the body text extracted from all of the pages in our course website (as of April 17). You can view the contents of these text files in Qt Creator if desired. The project resource files are listed under Other files -> res. Your program can open a resource file by specifying the path to open as "res/myfilename".
References
- Inverted Index on GeeksForGeeks
- Wikipedia article on Inverted Indexes
- Stanford Natural Language Processing Group on Tokenization
Extension ideas
If you get done with the main portion of the assignment and have some extra time, we would encourage you to try working on an extension! A non-exhaustive list of potential extensions is listed below:
- Weights
- When you get the results from a Google search, they are ranked so that the more relevant results are first on the list. The Google algorithm is a well-kept trade secret (though it was originally the PageRank algorithm, named for its creator, (then) Stanford Master's student Larry Page), but a simple idea is to put results that have the most matches inside a page highest in the ranking. For this extension, you would need to re-think how you create your index, to include the number of matches.
- Phrase search
- The assignment does not allow a search for multi-word terms, such as "section leader". Searching for phrases is not trivial, as you cannot simply keep a mapping of all possible phrases in the document. You could, however, keep track of where in each document a word is, and then use that information to determine if words in a phrase are next to each other in any particular document.
- Stop Words
- The English language has many, many words that show up in text but are not particularly important for content, such as "the", "and", "if", "a", etc. These words are called Stop Words, and it would make your index smaller if you removed a set of stop words from the index. You can find out more about stop words here.
- Stemming
- For your current index, if a user searches for "section" they won't find matches for "sections", even though pages that talk about sections might be relevant and a good match. Stemming is the process of reducing words to their base form, so that (for example) both "section" and "sections" would become, simply, "section".
If you have other creative ideas for extensions, run them by the course staff, and we'd be happy to give you guidance!