An introduction to search engines
Search engines are one of the most influential developments of the modern internet age, having completely revolutionized the way that people use and interact with the web. What was once an intractable jumble of data with no semblance of organization has been transformed into an easily searchable repository of information. For this part of the assignment, you will recreate this phenomenal technological development by using the Map and Set ADTs to build a document search engine that can find matching pages for a user's query with lightning-fast response time. This is a simplified version of the technology underpinning Spotlight, Google, Bing, and every other modern search engine that you encounter in your day-to-day internet use.
Want to see the power of search right at your fingertips? Click 🔍 Search in the top-right of our website navigation bar, search for the phrase "easter egg", and see what you instantly find hidden deep in the recesses of the course website…
In our version of the search engine, each web page has a URL ("Uniform Resource Locator") that serves as its unique id and a string containing the body text of the page. The magic of search is enabled by pre-processing the body text of each page and storing the contents into a data structure optimized for fast retrieval of pages matching the search query.
You will first write functions that process the body text and populate the data structure. Next you'll implement the function to search for pages matching a search query. Finally, you will write a console program that allows the user to enter many search queries and retrieve the matching web pages. Putting it all together, you will have built your own mini search engine!
Understanding the web page data
We have harvested the body text from the pages of our course website into a database file to use as input to your search engine. The format of the database file is as follows:
- The lines of the file are grouped into pairs with the following structure:
- The first line of a pair is a page URL.
- The second line of a pair is the body text of that page, with all newlines removed (basically, the text of the page in a single string).
- The first two lines in the file form the first pair. The third and fourth lines form another pair, the fifth and sixth another, and so on, with alternating lines of page URL and page content.
To view an example database, open the file tiny.txt or website.txt in the folder Other files/res of the starter project.
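For instance, a hypothetical database file describing two pages would be laid out like this (the URLs and body text below are invented for illustration; see tiny.txt for a real example):

```text
www.shoppinglist.com
EGGS! milk, fish, @ bread cheese
www.rainbow.org
red green red blue
```

Lines 1 and 3 are page URLs; lines 2 and 4 hold the complete body text of each page on a single line.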
Using an inverted index for searching
The key to enabling efficient search of a large data structure comes down to how we structure and store the data. A poor choice in data structures would make search painfully slow, while a wise arrangement can allow search to be near instantaneous.
To begin with, let's consider the index of a book. For example, when you look in the index of the CS106B textbook, one of the entries is the keyword "Internet" and two page numbers, 18 and 821. The word internet occurs on page number 18 and again on page number 821. A book's index is an example of an inverted index, where you have a word in mind and you can find the page number it is listed on (a book's index does not usually include every instance of the word on all pages but is instead curated to give you the best matches). In other words, an inverted index creates a mapping from content to locations in a document (or table, etc.).
This is in contrast to a forward index, which, for our book example, would be a list of page numbers with all the words listed on that page. A search engine uses an inverted index to allow for fast retrieval of web pages – thus, we will first figure out how to build one of these inverted indexes to efficiently store the data that we want to be able to search through.
Efficiency note: To create an inverted index, you must process the entire document, or set of documents, and catalog where each word is located. This may not be a fast process, but once the inverted index is complete, searching for words and their corresponding locations is extremely fast.
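To make the contrast concrete, here is a tiny made-up example with two pages, where pageA contains "red fish" and pageB contains "red blue":

```text
Forward index (page -> words)      Inverted index (word -> pages)
pageA: {red, fish}                 red:  {pageA, pageB}
pageB: {red, blue}                 fish: {pageA}
                                   blue: {pageB}
```

With the inverted index, answering "which pages contain red?" is a single lookup rather than a scan over every page.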
Onto the code!
Decomposition is one of your best tools for tackling a complex problem. We'll guide you along by breaking the problem down into a sequence of steps. Follow our steps to success!
All of the functions and tests for the search engine will be implemented in the file search.cpp.
1) Write helper function cleanToken()
Start by writing a reusable helper function that will make your later work easier.
string cleanToken(string token)
cleanToken takes in a whitespace-separated string of characters that appears in the body text and returns a "cleaned" version of that token, ready to be stored in the index. The cleaned version has trimmed off any leading or trailing punctuation characters and has been converted to lowercase. If the token contains no letters whatsoever, cleanToken returns an empty string to indicate this token is to be discarded entirely.
More precisely, cleanToken should:
- Remove all punctuation from the beginning and end of a token, but not from inside a token. The tokens section and section. and "section" are each trimmed to the same token section, which makes searching more effective. Tokens such as doesn't or zelenski@cs are unchanged, since the punctuation occurs in the middle of these tokens.
  - By punctuation, we mean any character for which ispunct returns true.
  - There are a few oddball characters (curly quotes, bullets, and the like) that are not recognized as punctuation by ispunct. Do not make a special case of this; just trim according to ispunct.
- Return the empty string if the token does not contain at least one letter, i.e. isalpha must be true for at least one character. By returning the empty string, cleanToken indicates this token should be discarded as it is not a word.
- Convert the token to lowercase. Standardizing on a canonical form for the tokens allows search queries to operate case-insensitively.
The return value from cleanToken is the trimmed, lowercase version (or empty string if the token is to be discarded).
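The trimming rules above can be sketched in standard C++, using std::string and the cctype classification functions directly. Treat this as an outline of the logic, not the official solution:

```cpp
#include <cctype>
#include <string>

// Sketch of cleanToken: trim leading/trailing punctuation, discard tokens
// with no letters, and lowercase the rest.
std::string cleanToken(std::string token) {
    // Trim punctuation from the front and back (but not the middle).
    size_t start = 0;
    while (start < token.size() && ispunct((unsigned char)token[start])) start++;
    size_t end = token.size();
    while (end > start && ispunct((unsigned char)token[end - 1])) end--;
    token = token.substr(start, end - start);

    // Discard the token entirely if it contains no letters.
    bool hasLetter = false;
    for (char c : token) {
        if (isalpha((unsigned char)c)) { hasLetter = true; break; }
    }
    if (!hasLetter) return "";

    // Canonicalize to lowercase so search is case-insensitive.
    for (char& c : token) c = (char)tolower((unsigned char)c);
    return token;
}
```

For example, cleanToken("\"Section\"") yields "section", while cleanToken("106") yields the empty string because the token contains no letters.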
As always, after writing a function and before moving on, you should stop here and write test cases to confirm your function works correctly on all inputs. You should write at least 3-4 additional tests to make sure your helper function works. Remember to label your tests as STUDENT_TEST.
2) Write helper function gatherTokens()
The helper function gatherTokens extracts the set of unique tokens from the body text.
Set<string> gatherTokens(string bodytext)
The single argument to gatherTokens is a string containing the body text from a single web page. The function returns a Set of the unique cleaned tokens that appear in that body text.
The function must first tokenize the body text, that is, divide it into individual tokens, which will be strings separated by spaces. Do this division using the stringSplit function available in the Stanford library.
Each token is then cleaned using your cleanToken helper function. The cleaned tokens are stored in a Set. Even if the body text contains a thousand occurrences of the word "llama," the set of unique tokens contains only one copy, so gathering the unique words in a set is perfect for avoiding unnecessary duplication.
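The tokenize-clean-collect pipeline can be sketched as follows, with std::set and std::istringstream standing in for the Stanford Set and stringSplit; a copy of the step 1 cleanToken sketch is folded in so the example is self-contained:

```cpp
#include <cctype>
#include <set>
#include <sstream>
#include <string>

// cleanToken from step 1, reproduced so this sketch compiles on its own.
std::string cleanToken(std::string token) {
    size_t start = 0, end = token.size();
    while (start < end && ispunct((unsigned char)token[start])) start++;
    while (end > start && ispunct((unsigned char)token[end - 1])) end--;
    token = token.substr(start, end - start);
    bool hasLetter = false;
    for (char c : token) if (isalpha((unsigned char)c)) hasLetter = true;
    if (!hasLetter) return "";
    for (char& c : token) c = (char)tolower((unsigned char)c);
    return token;
}

// Sketch of gatherTokens: split on whitespace, clean each token, and keep
// only the unique non-empty results in a set.
std::set<std::string> gatherTokens(const std::string& bodytext) {
    std::set<std::string> tokens;
    std::istringstream stream(bodytext);
    std::string raw;
    while (stream >> raw) {                   // whitespace-separated tokens
        std::string cleaned = cleanToken(raw);
        if (!cleaned.empty()) {               // discard non-words
            tokens.insert(cleaned);           // set keeps a single copy
        }
    }
    return tokens;
}
```

On the input "Llama llama LLAMA! red pajama." this produces the three-element set {llama, pajama, red}.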
Time to test! Add test cases that confirm the output from gatherTokens, so you will later be able to call on this function with confidence that it does its job correctly.
3) Create inverted index in buildIndex()
The function buildIndex reads the content from the database file and processes it into the form of an inverted index.
int buildIndex(string dbfile, Map<string, Set<string>>& index)
The first argument to buildIndex is the name of the database file of the web page data; the second argument is the Map to be populated with data for the inverted index. The return value of buildIndex is the number of documents processed from the database file.
Before starting to write code, first work through a small example on paper to ensure you understand what your function is trying to build.
Q8. Draw out the inverted index for the res/tiny.txt database file.
Review the section Understanding the web page data to remind yourself of the format of the database file. Next, read over the code we provided for readMazeFile in maze.cpp to see an example of opening a file and reading the contents into a vector of lines. Feel free to reuse that code for buildIndex.
Once you have read the line containing the body text of a page, call your gatherTokens function to extract the set of unique tokens. For each token in the set, update the inverted index to indicate that this token has a match to this page's URL.
The action of buildIndex is to populate index with entries where each key word is associated with a set of URLs where that word can be found. The function returns the count of web pages that were processed and added to the index.
The Map and Set classes are the right data structures for storing the inverted index itself. Sets are especially handy when evaluating a complex query because of the awesome high-level set operations that combine sets.
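The overall shape of buildIndex can be sketched like this, with std::map and std::set standing in for the Stanford Map and Set. A minimal stand-in for step 2's gatherTokens is included so the sketch is self-contained:

```cpp
#include <cctype>
#include <fstream>
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Minimal stand-in for step 2's gatherTokens (trim punctuation, lowercase,
// drop non-words, collect unique tokens), included for self-containment.
std::set<std::string> gatherTokens(const std::string& bodytext) {
    std::set<std::string> tokens;
    std::istringstream stream(bodytext);
    std::string t;
    while (stream >> t) {
        size_t a = 0, b = t.size();
        while (a < b && ispunct((unsigned char)t[a])) a++;
        while (b > a && ispunct((unsigned char)t[b - 1])) b--;
        t = t.substr(a, b - a);
        bool hasLetter = false;
        for (char& c : t) {
            c = (char)tolower((unsigned char)c);
            if (isalpha((unsigned char)c)) hasLetter = true;
        }
        if (hasLetter) tokens.insert(t);
    }
    return tokens;
}

// Sketch of buildIndex: read the file as alternating URL/body-text line
// pairs and record, for each token, the set of URLs that contain it.
int buildIndex(const std::string& dbfile,
               std::map<std::string, std::set<std::string>>& index) {
    std::ifstream in(dbfile);
    std::vector<std::string> lines;
    for (std::string line; std::getline(in, line); ) lines.push_back(line);

    int nPages = 0;
    for (size_t i = 0; i + 1 < lines.size(); i += 2) {
        for (const std::string& token : gatherTokens(lines[i + 1])) {
            index[token].insert(lines[i]);   // token -> set of page URLs
        }
        nPages++;
    }
    return nPages;
}
```

Note how index[token].insert(url) does all the work of creating a new entry when the token is first seen and extending the URL set on later pages.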
Our starter code contains some provided tests to get you started. Make sure all the provided tests pass, but that won't be enough to fully vet the function. You should add student tests of your own to ensure you have comprehensive coverage. Don't move on until all tests pass.
4) Search using findQueryMatches()
The inverted index you just built is exactly the data structure needed to enable quickly finding those pages that match a search query.
Next up is to implement the function:
Set<string> findQueryMatches(Map<string, Set<string>>& index, string query)
The query string argument can either be a single search term or a compound sequence of multiple terms. A search term is a single word, and a sequence of search terms is multiple consecutive words, each of which (besides the first one) may or may not be preceded by a modifier like + or - (see below for details). The same stringSplit function you used to divide the body text into tokens can be used to divide the user's query sentence into search terms.
When finding the matches for a given query, you should follow these rules:
- For a single search term, matches are the URLs of the web pages that contain the specified term.
- A sequence of terms is handled as a compound query, where the matches from the individual terms are synthesized into one combined result.
- A search term has a slightly altered meaning when the term is prefaced by certain modifiers:
  - By default, when not prefaced with a + or -, the matches are unioned across search terms (any result matching either term is included).
  - If the user prefaces a search term with +, then matches for this term are intersected with the existing results (results must match both terms).
  - If the user prefaces a search term with -, then matches for this term are removed from the existing result (results must match one term without matching the other).
- The same token cleaning applied to the body text is also applied to query terms. Call your helper cleanToken to process each search term to strip punctuation, convert to lowercase, and discard non-words before performing the search for matches.
Note that searching is case-insensitive, that is, a search for "binky" should return the same results as "Binky" or "BINKY". Be sure to consider what implications this has for how you create and search the index.
Here are some example queries and how they are interpreted:

- quokka
  - matches all pages containing the term "quokka"
- simple cheap
  - means simple OR cheap
  - matches pages that contain either "simple" or "cheap" or both
- tasty +healthy
  - means tasty AND healthy
  - matches pages that contain both "tasty" and "healthy"
- tasty -mushrooms
  - means tasty WITHOUT mushrooms
  - matches pages that contain "tasty" but do not contain "mushrooms"
- tasty -mushrooms simple +cheap
  - means tasty WITHOUT mushrooms OR simple AND cheap
  - matches pages that match ((("tasty" without "mushrooms") or "simple") and "cheap")
There is no precedence for the operators; the query is simply processed from left to right. The matches for the first term are combined with the matches for the second, then combined with the matches for the third term, and so on. In the last query shown above, the matches for tasty are first filtered to remove all pages containing mushrooms, then unioned with all matches for simple, and lastly intersected with all matches for cheap. In implementing this logic, you will find the Set operators for union, intersection, and difference to be very handy!
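The left-to-right evaluation can be sketched as below. Here std::map/std::set stand in for the Stanford classes, the std::set_union/set_intersection/set_difference algorithms stand in for the Stanford Set operators, and term cleaning is reduced to lowercasing (your real version should call cleanToken from step 1):

```cpp
#include <algorithm>
#include <cctype>
#include <iterator>
#include <map>
#include <set>
#include <sstream>
#include <string>

// Sketch of findQueryMatches: process each term left to right, combining
// its matches with the running result according to its modifier.
std::set<std::string> findQueryMatches(
        const std::map<std::string, std::set<std::string>>& index,
        const std::string& query) {
    std::set<std::string> result;
    std::istringstream stream(query);
    std::string term;
    while (stream >> term) {
        // Peel off a leading + or - modifier, if any.
        char modifier = (term[0] == '+' || term[0] == '-') ? term[0] : ' ';
        if (modifier != ' ') term = term.substr(1);
        for (char& c : term) c = (char)tolower((unsigned char)c);

        // Look up this term's matches (empty set if the term is not indexed).
        std::set<std::string> matches;
        auto it = index.find(term);
        if (it != index.end()) matches = it->second;

        std::set<std::string> next;
        if (modifier == '+') {          // intersect with results so far
            std::set_intersection(result.begin(), result.end(),
                                  matches.begin(), matches.end(),
                                  std::inserter(next, next.begin()));
        } else if (modifier == '-') {   // remove from results so far
            std::set_difference(result.begin(), result.end(),
                                matches.begin(), matches.end(),
                                std::inserter(next, next.begin()));
        } else {                        // default: union with results so far
            std::set_union(result.begin(), result.end(),
                           matches.begin(), matches.end(),
                           std::inserter(next, next.begin()));
        }
        result = next;
    }
    return result;
}
```

Since the first term never carries a modifier, the first iteration unions the term's matches into an empty result, which correctly seeds the running result.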
You may assume that the query sentence is well-formed, which means:
- The query sentence is non-empty and contains at least one search term
- If a search term has a modifier, it will be the first character
- A modifier will not appear on its own as a search term
- A + or - character within a search term that is not the first character is not considered a modifier
- The first search term in the query sentence will never have a modifier
- You can assume that no search term will clean to the empty string (i.e. has at least one letter)
There is a lot of functionality to test in query processing; be sure you add an appropriate set of student tests so that you catch all the cases.
5) Put it all together with searchEngine()
Thus far, your amazing code has re-arranged a mass of unstructured text data into a highly-organized and quickly-searchable inverted index with fancy query-matching capability. Now take it over the finish line to build your own search engine!
The final function to implement is:
void searchEngine(string dbfile)
This function implements a console program that works as follows:
- It first constructs an inverted index from the contents of the database file.
- It prints how many web pages were processed to build the index and how many distinct words were found across all pages.
- It then enters a loop that prompts the user to enter a query.
- For each query entered by the user, it finds the matching pages and prints the URLs.
- The user enters the empty string ("") to indicate they are done and the program finishes executing.
After you have completed this function, your program should behave as shown in the transcript below.
Example program run (executed by running searchEngine("res/website.txt") in main.cpp):
Stand by while building index...
Indexed 62 pages containing 6284 unique terms
Enter query sentence (RETURN/ENTER to quit): llama
Found 1 matching pages
{"http://cs106b.stanford.edu/assignments/2-adt/searchengine.html"}
Enter query sentence (RETURN/ENTER to quit): suitable +kits
Found 2 matching pages
{"http://cs106b.stanford.edu/assignments/2-adt/searchengine.html", "http://cs106b.stanford.edu/qt/troubleshooting.html"}
Enter query sentence (RETURN/ENTER to quit): linux windows -mac
Found 4 matching pages
{"http://cs106b.stanford.edu/qt/install-cs106.html", "http://cs106b.stanford.edu/qt/install-linux.html", "http://cs106b.stanford.edu/qt/install-windows.html", "http://cs106b.stanford.edu/qt/library_beta.html"}
Enter query sentence (RETURN/ENTER to quit): aren't wow!
Found 8 matching pages
{"http://cs106b.stanford.edu/about_assignments", "http://cs106b.stanford.edu/assignments/1-cpp/soundex.html", "http://cs106b.stanford.edu/assignments/2-adt/searchengine.html", "http://cs106b.stanford.edu/assignments/2-adt/warmup.html", "http://cs106b.stanford.edu/lectures/02-cpp/qa.html", "http://cs106b.stanford.edu/lectures/03-strings/qa.html", "http://cs106b.stanford.edu/lectures/04-vector-grid/qa.html", "http://cs106b.stanford.edu/lectures/05-stack-queue/qa.html"}
Enter query sentence (RETURN/ENTER to quit):
All done!
Way to go, 👍 you're well on your way to becoming the next internet search pioneer!
Notes
- The res folder of the starter project includes two database files: tiny.txt is the small example used in the writeup and website.txt is the body text extracted from all of the pages in our course website (as of Sept 24). You can open these files in Qt Creator to view their contents. The project resource files are listed under Other files -> res. Your program can open a resource file by specifying the path as "res/myfilename".
References
- Inverted Index on GeeksForGeeks
- Wikipedia article on Inverted Indexes
- Stanford Natural Language Processing Group on Tokenization
Extensions
If you have completed the basic assignment requirements and want to go further, we encourage you to try adding an extension! A non-exhaustive list of potential extensions appears below:
- Weights
- When you get the results from a Google search, they are ranked so that the more relevant results are first on the list. The current Google algorithm is a well-kept trade secret (though it was originally the Page Rank algorithm, named for its creator, then-Stanford graduate student Larry Page), but a simple approach is to give higher rank to pages with more occurrences of the search term. For this extension, you would need to re-think how you create your index to include the number of matches.
- Phrase search
  - The assignment does not allow a search for multi-word terms, such as section leader. Searching for phrases is not trivial, as you cannot simply keep a mapping of all possible phrases in the document. You could, however, keep track of where in each document a word is, and then use that information to determine if words in a phrase are next to each other in any particular document.
- Stop Words
  - The English language has many, many words that show up in text but are not particularly important for content, such as the, and, if, a, etc. These words are called Stop Words, and it would make your index smaller if you removed such stop words from the index. Here is more info about stop words.
- Stemming
  - In the current design, if a user searches for section they won't find matches for sections, even though pages that mention either might be a relevant match. Stemming is the process of reducing words to their base form, so that (for example) both section and sections would become, simply, section in the index.
If you have other creative ideas for extensions, run them by the course staff, and we'd be happy to give you guidance!