An introduction to search engines
Search engines are one of the most influential developments of the modern internet age, having completely revolutionized the way that people use and interact with the web. What was once an intractable jumble of data with no semblance of organization has been transformed into an easily searchable repository of information. For this part of the assignment, you will recreate this phenomenal technological development by using the Map and Set ADTs to build a document search engine that can find matching pages for a user's query with lightning-fast response time. This is a simplified version of the technology underpinning Spotlight, Google, Bing, and every other modern search engine that you encounter in your day-to-day internet use.
Want to see the power of search right at your fingertips? Click 🔍 Search in the top-right of our website navigation bar, search for the phrase "easter egg", and see what you instantly find hidden deep in the recesses of the course website…
In our version of the search engine, each web page has a URL ("Uniform Resource Locator") that serves as its unique id and a string containing the body text of the page. The magic of search is enabled by pre-processing the body text of each page and storing the contents into a data structure optimized for fast retrieval of pages matching the search query.
You will first write functions that process the body text and populate the data structure. Next you'll implement the function to search for pages matching a search query. Finally, you will write a console program that allows the user to enter many search queries and retrieve the matching web pages. Putting it all together, you will have built your own mini search engine!
Understanding the web page data
We have harvested the body text from the pages of our course website into a database file to use as input to your search engine. The format of the database file is as follows:
- The lines of the file are grouped into pairs with the following structure:
- The first line of a pair is a page URL.
- The second line of a pair is the body text of that page, with all newlines removed (basically, the text of the page in a single string).
- The first two lines in the file form the first pair. The third and fourth lines form another pair, the fifth and sixth another, and so on, with alternating lines of page URL and page content.
To view an example database, open the file tiny.txt or website.txt in the folder Other files/res of the starter project.
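For instance, a hypothetical database file describing two pages would be laid out like this (the URLs and body text below are invented for illustration; see tiny.txt for a real example):

```text
www.shoppinglist.com
EGGS! milk, fish, @ bread cheese
www.rainbow.org
red green red blue
```

Lines 1 and 3 are page URLs; lines 2 and 4 hold the complete body text of each page on a single line.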
Using an inverted index for searching
The key to enabling efficient search of a large data structure comes down to how we structure and store the data. A poor choice in data structures would make search painfully slow, while a wise arrangement can allow search to be near instantaneous.
To begin with, let's consider the index of a book. For example, when you look in the index of the CS106B textbook, one of the entries is the keyword "Internet" and two page numbers, 18 and 821. The word internet occurs on page number 18 and again on page number 821. A book's index is an example of an inverted index, where you have a word in mind and you can find the page number it is listed on (a book's index does not usually include every instance of the word on all pages but is instead curated to give you the best matches). In other words, an inverted index creates a mapping from content to locations in a document (or table, etc.).
This is in contrast to a forward index, which, for our book example, would be a list of page numbers with all the words listed on that page. A search engine uses an inverted index to allow for fast retrieval of web pages – thus, we will first figure out how to build one of these inverted indexes to efficiently store the data that we want to be able to search through.
Efficiency note: To create an inverted index, you must process the entire document, or set of documents, and catalog where each word is located. This may not be a fast process, but once the inverted index is complete, searching for words and their corresponding locations is extremely fast.
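To make the contrast concrete, here is a tiny made-up example with two pages, where pageA contains "red fish" and pageB contains "red blue":

```text
Forward index (page -> words)      Inverted index (word -> pages)
pageA: {red, fish}                 red:  {pageA, pageB}
pageB: {red, blue}                 fish: {pageA}
                                   blue: {pageB}
```

With the inverted index, answering "which pages contain red?" is a single lookup rather than a scan over every page.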
Onto the code!
Decomposition is one of your best tools for tackling a complex problem. We'll guide you along by breaking the problem down into a sequence of steps. Follow our steps to success!
All of the functions and tests for the search engine will be implemented in the file search.cpp.
1) Write helper function cleanToken()
Start by writing a reusable helper function that will make your later work easier.
string cleanToken(string token)
cleanToken takes in a whitespace-separated string of characters that appears in the body text and returns a "cleaned" version of that token, ready to be stored in the index. The cleaned version has trimmed off any leading or trailing punctuation characters and has been converted to lowercase. If the token contains no letters whatsoever, cleanToken returns an empty string to indicate this token is to be discarded entirely.
More precisely, cleanToken should:
- Remove all punctuation from the beginning and end of a token, but not from inside a token. The tokens section and section. and "section" are each trimmed to the same token section, which makes searching more effective. Tokens such as doesn't or zelenski@cs are unchanged, since the punctuation occurs in the middle of these tokens.
  - By punctuation, we mean any character for which ispunct returns true.
  - There are a few oddball characters (curly quotes, bullets, and the like) that are not recognized as punctuation by ispunct. Do not make a special case of this; just trim according to ispunct.
- Return the empty string if the token does not contain at least one letter, i.e. isalpha must be true for at least one character. By returning the empty string, cleanToken indicates this token should be discarded as it is not a word.
- Convert the token to lowercase. Standardizing on a canonical form for the tokens allows search queries to operate case-insensitively.
The return value from cleanToken is the trimmed, lowercase version (or empty string if the token is to be discarded).
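The trimming rules above can be sketched in standard C++, using std::string and the cctype classification functions directly. Treat this as an outline of the logic, not the official solution:

```cpp
#include <cctype>
#include <string>

// Sketch of cleanToken: trim leading/trailing punctuation, discard tokens
// with no letters, and lowercase the rest.
std::string cleanToken(std::string token) {
    // Trim punctuation from the front and back (but not the middle).
    size_t start = 0;
    while (start < token.size() && ispunct((unsigned char)token[start])) start++;
    size_t end = token.size();
    while (end > start && ispunct((unsigned char)token[end - 1])) end--;
    token = token.substr(start, end - start);

    // Discard the token entirely if it contains no letters.
    bool hasLetter = false;
    for (char c : token) {
        if (isalpha((unsigned char)c)) { hasLetter = true; break; }
    }
    if (!hasLetter) return "";

    // Canonicalize to lowercase so search is case-insensitive.
    for (char& c : token) c = (char)tolower((unsigned char)c);
    return token;
}
```

For example, cleanToken("\"Section\"") yields "section", while cleanToken("106") yields the empty string because the token contains no letters.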
As always, after writing a function and before moving on, you should stop here and write test cases to confirm your function works correctly on all inputs. You should write at least 3-4 additional tests to make sure your helper function works. Remember to label your tests as STUDENT_TEST.
2) Write helper function gatherTokens()
The helper function gatherTokens extracts the set of unique tokens from the body text.
Set<string> gatherTokens(string bodytext)
The single argument to gatherTokens is a string containing the body text from a single web page. The function returns a Set of the unique cleaned tokens that appear in that body text.
The function must first tokenize the body text, that is, divide it into individual tokens, which will be strings separated by spaces. Do this division using the stringSplit function available in the Stanford library.
Each token is then cleaned using your cleanToken helper function. The cleaned tokens are stored in a Set. Even if the body text contains a thousand occurrences of the word "llama," the set of unique tokens contains only one copy, so gathering the unique words in a set is perfect for avoiding unnecessary duplication.
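The tokenize-clean-collect pipeline can be sketched as follows, with std::set and std::istringstream standing in for the Stanford Set and stringSplit; a copy of the step 1 cleanToken sketch is folded in so the example is self-contained:

```cpp
#include <cctype>
#include <set>
#include <sstream>
#include <string>

// cleanToken from step 1, reproduced so this sketch compiles on its own.
std::string cleanToken(std::string token) {
    size_t start = 0, end = token.size();
    while (start < end && ispunct((unsigned char)token[start])) start++;
    while (end > start && ispunct((unsigned char)token[end - 1])) end--;
    token = token.substr(start, end - start);
    bool hasLetter = false;
    for (char c : token) if (isalpha((unsigned char)c)) hasLetter = true;
    if (!hasLetter) return "";
    for (char& c : token) c = (char)tolower((unsigned char)c);
    return token;
}

// Sketch of gatherTokens: split on whitespace, clean each token, and keep
// only the unique non-empty results in a set.
std::set<std::string> gatherTokens(const std::string& bodytext) {
    std::set<std::string> tokens;
    std::istringstream stream(bodytext);
    std::string raw;
    while (stream >> raw) {                   // whitespace-separated tokens
        std::string cleaned = cleanToken(raw);
        if (!cleaned.empty()) {               // discard non-words
            tokens.insert(cleaned);           // set keeps a single copy
        }
    }
    return tokens;
}
```

On the input "Llama llama LLAMA! red pajama." this produces the three-element set {llama, pajama, red}.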
Time to test! Add test cases that confirm the output from gatherTokens, so you will later be able to call on this function with confidence that it does its job correctly.
3) Create inverted index in buildIndex()
The function buildIndex reads the content from the database file and processes it into the form of an inverted index.
int buildIndex(string dbfile, Map<string, Set<string>>& index)
The first argument to buildIndex is the name of the database file of the web page data; the second argument is the Map to be populated with data for the inverted index. The return value of buildIndex is the number of documents processed from the database file.
Before starting to write code, first work through a small example on paper to ensure you understand what your function is trying to build.
Q8. Draw out the inverted index for the res/tiny.txt database file.
Review the section Understanding the web page data to remind yourself of the format of the database file. Next, read over the code we provided for readMazeFile in maze.cpp to see an example of opening a file and reading the contents into a vector of lines. Feel free to reuse that code for buildIndex.
Once you have read the line containing the body text of a page, call your gatherTokens function to extract the set of unique tokens. For each token in the set, update the inverted index to indicate that this token has a match to this page's URL.
The action of buildIndex is to populate index with entries where each key word is associated with a set of URLs where that word can be found. The function returns the count of web pages that were processed and added to the index.
The Map and Set classes are the right data structures for storing the inverted index itself. Sets are especially handy when evaluating a complex query because of the awesome high-level set operations that combine sets.
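The overall shape of buildIndex can be sketched like this, with std::map and std::set standing in for the Stanford Map and Set. A minimal stand-in for step 2's gatherTokens is included so the sketch is self-contained:

```cpp
#include <cctype>
#include <fstream>
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Minimal stand-in for step 2's gatherTokens (trim punctuation, lowercase,
// drop non-words, collect unique tokens), included for self-containment.
std::set<std::string> gatherTokens(const std::string& bodytext) {
    std::set<std::string> tokens;
    std::istringstream stream(bodytext);
    std::string t;
    while (stream >> t) {
        size_t a = 0, b = t.size();
        while (a < b && ispunct((unsigned char)t[a])) a++;
        while (b > a && ispunct((unsigned char)t[b - 1])) b--;
        t = t.substr(a, b - a);
        bool hasLetter = false;
        for (char& c : t) {
            c = (char)tolower((unsigned char)c);
            if (isalpha((unsigned char)c)) hasLetter = true;
        }
        if (hasLetter) tokens.insert(t);
    }
    return tokens;
}

// Sketch of buildIndex: read the file as alternating URL/body-text line
// pairs and record, for each token, the set of URLs that contain it.
int buildIndex(const std::string& dbfile,
               std::map<std::string, std::set<std::string>>& index) {
    std::ifstream in(dbfile);
    std::vector<std::string> lines;
    for (std::string line; std::getline(in, line); ) lines.push_back(line);

    int nPages = 0;
    for (size_t i = 0; i + 1 < lines.size(); i += 2) {
        for (const std::string& token : gatherTokens(lines[i + 1])) {
            index[token].insert(lines[i]);   // token -> set of page URLs
        }
        nPages++;
    }
    return nPages;
}
```

Note how index[token].insert(url) does all the work of creating a new entry when the token is first seen and extending the URL set on later pages.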
Our starter code contains some provided tests to get you started. Make sure all the provided tests pass, but that won't be enough to fully vet the function. You should add student tests of your own to ensure you have comprehensive coverage. Don't move on until all tests pass.
4) Search using findQueryMatches()
The inverted index you just built is exactly the data structure needed to enable quickly finding those pages that match a search query.
Next up is to implement the function:
Set<string> findQueryMatches(Map<string, Set<string>>& index, string query)
The query string argument can either be a single search term or a compound sequence of multiple terms. A search term is a single word, and a sequence of search terms is multiple consecutive words, each of which (besides the first one) may or may not be preceded by a modifier like + or - (see below for details). The same stringSplit function you used to divide the body text into tokens can be used to divide the user's query sentence into search terms.
When finding the matches for a given query, you should follow these rules:
- For a single search term, matches are the URLs of the web pages that contain the specified term.
- A sequence of terms is handled as a compound query, where the matches from the individual terms are synthesized into one combined result.
- A search term has a slightly altered meaning when the term is prefaced by certain modifiers:
  - By default, when not prefaced with a + or -, the matches are unioned across search terms (any result matching either term is included).
  - If the user prefaces a search term with +, then matches for this term are intersected with the existing results (results must match both terms).
  - If the user prefaces a search term with -, then matches for this term are removed from the existing result (results must match one term without matching the other).
- The same token cleaning applied to the body text is also applied to query terms. Call your helper cleanToken to process each search term to strip punctuation, convert to lowercase, and discard non-words before performing the search for matches.
Note that searching is case-insensitive, that is, a search for "binky" should return the same results as "Binky" or "BINKY". Be sure to consider what implications this has for how you create and search the index.
Here are some example queries and how they are interpreted:

- quokka
  - matches all pages containing the term "quokka"
- simple cheap
  - means simple OR cheap
  - matches pages that contain either "simple" or "cheap" or both
- tasty +healthy
  - means tasty AND healthy
  - matches pages that contain both "tasty" and "healthy"
- tasty -mushrooms
  - means tasty WITHOUT mushrooms
  - matches pages that contain "tasty" but do not contain "mushrooms"
- tasty -mushrooms simple +cheap
  - means tasty WITHOUT mushrooms OR simple AND cheap
  - matches pages that match ((("tasty" without "mushrooms") or "simple") and "cheap")
There is no precedence for the operators; the query is simply processed from left to right. The matches for the first term are combined with the matches for the second, then combined with the matches for the third term, and so on. In the last query shown above, the matches for tasty are first filtered to remove all pages containing mushrooms, then unioned with all matches for simple, and lastly intersected with all matches for cheap. In implementing this logic, you will find the Set operators for union, intersection, and difference to be very handy!
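The left-to-right evaluation can be sketched as below. Here std::map/std::set stand in for the Stanford classes, the std::set_union/set_intersection/set_difference algorithms stand in for the Stanford Set operators, and term cleaning is reduced to lowercasing (your real version should call cleanToken from step 1):

```cpp
#include <algorithm>
#include <cctype>
#include <iterator>
#include <map>
#include <set>
#include <sstream>
#include <string>

// Sketch of findQueryMatches: process each term left to right, combining
// its matches with the running result according to its modifier.
std::set<std::string> findQueryMatches(
        const std::map<std::string, std::set<std::string>>& index,
        const std::string& query) {
    std::set<std::string> result;
    std::istringstream stream(query);
    std::string term;
    while (stream >> term) {
        // Peel off a leading + or - modifier, if any.
        char modifier = (term[0] == '+' || term[0] == '-') ? term[0] : ' ';
        if (modifier != ' ') term = term.substr(1);
        for (char& c : term) c = (char)tolower((unsigned char)c);

        // Look up this term's matches (empty set if the term is not indexed).
        std::set<std::string> matches;
        auto it = index.find(term);
        if (it != index.end()) matches = it->second;

        std::set<std::string> next;
        if (modifier == '+') {          // intersect with results so far
            std::set_intersection(result.begin(), result.end(),
                                  matches.begin(), matches.end(),
                                  std::inserter(next, next.begin()));
        } else if (modifier == '-') {   // remove from results so far
            std::set_difference(result.begin(), result.end(),
                                matches.begin(), matches.end(),
                                std::inserter(next, next.begin()));
        } else {                        // default: union with results so far
            std::set_union(result.begin(), result.end(),
                           matches.begin(), matches.end(),
                           std::inserter(next, next.begin()));
        }
        result = next;
    }
    return result;
}
```

Since the first term never carries a modifier, the first iteration unions the term's matches into an empty result, which correctly seeds the running result.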
You may assume that the query sentence is well-formed, which means:
- The query sentence is non-empty and contains at least one search term
- If a search term has a modifier, it will be the first character
- A modifier will not appear on its own as a search term
- A + or - character within a search term that is not the first character is not considered a modifier
- The first search term in the query sentence will never have a modifier
- You can assume that no search term will clean to the empty string (i.e. has at least one letter)
There is a lot of functionality to test in query processing; be sure you add an appropriate set of student tests so that you catch all the cases.
5) Put it all together with searchEngine()
Thus far, your amazing code has re-arranged a mass of unstructured text data into a highly-organized and quickly-searchable inverted index with fancy query-matching capability. Now take it over the finish line to build your own search engine!
The final function to implement is:
void searchEngine(string dbfile)
This function implements a console program that works as follows:
- It first constructs an inverted index from the contents of the database file.
- It prints how many web pages were processed to build the index and how many distinct words were found across all pages.
- It then enters a loop that prompts the user to enter a query.
- For each query entered by the user, it finds the matching pages and prints the URLs.
- The user enters the empty string ("") to indicate they are done and the program finishes executing.
After you have completed this function, your program should behave as shown in the transcript below.
Example program run (executed by running searchEngine("res/website.txt") in main.cpp):
Stand by while building index...
Indexed 62 pages containing 6284 unique terms
Enter query sentence (RETURN/ENTER to quit): llama
Found 1 matching pages
{"http://cs106b.stanford.edu/assignments/2-adt/searchengine.html"}
Enter query sentence (RETURN/ENTER to quit): suitable +kits
Found 2 matching pages
{"http://cs106b.stanford.edu/assignments/2-adt/searchengine.html", "http://cs106b.stanford.edu/qt/troubleshooting.html"}
Enter query sentence (RETURN/ENTER to quit): linux windows -mac
Found 4 matching pages
{"http://cs106b.stanford.edu/qt/install-cs106.html", "http://cs106b.stanford.edu/qt/install-linux.html", "http://cs106b.stanford.edu/qt/install-windows.html", "http://cs106b.stanford.edu/qt/library_beta.html"}
Enter query sentence (RETURN/ENTER to quit): aren't wow!
Found 8 matching pages
{"http://cs106b.stanford.edu/about_assignments", "http://cs106b.stanford.edu/assignments/1-cpp/soundex.html", "http://cs106b.stanford.edu/assignments/2-adt/searchengine.html", "http://cs106b.stanford.edu/assignments/2-adt/warmup.html", "http://cs106b.stanford.edu/lectures/02-cpp/qa.html", "http://cs106b.stanford.edu/lectures/03-strings/qa.html", "http://cs106b.stanford.edu/lectures/04-vector-grid/qa.html", "http://cs106b.stanford.edu/lectures/05-stack-queue/qa.html"}
Enter query sentence (RETURN/ENTER to quit):
All done!
Way to go, 👍 you're well on your way to becoming the next internet search pioneer!
Notes
- The res folder of the starter project includes two database files: tiny.txt is the small example used in the writeup and website.txt is the body text extracted from all of the pages in our course website (as of Sept 24). You can open these files in Qt Creator to view their contents. The project resource files are listed under Other files -> res. Your program can open a resource file by specifying the path as "res/myfilename".
References
- Inverted Index on GeeksForGeeks
- Wikipedia article on Inverted Indexes
- Stanford Natural Language Processing Group on Tokenization
Extensions
If you have completed the basic assignment requirements and want to go further, we encourage you to try adding an extension! A non-exhaustive list of potential extensions appears below:
- Weights
- When you get the results from a Google search, they are ranked so that the more relevant results are first on the list. The current Google algorithm is a well-kept trade secret (though it was originally the Page Rank algorithm, named for its creator, then-Stanford graduate student Larry Page), but a simple approach is to give higher rank to pages with more occurrences of the search term. For this extension, you would need to re-think how you create your index to include the number of matches.
- Phrase search
  - The assignment does not allow a search for multi-word terms, such as section leader. Searching for phrases is not trivial, as you cannot simply keep a mapping of all possible phrases in the document. You could, however, keep track of where in each document a word is, and then use that information to determine if words in a phrase are next to each other in any particular document.
- Stop Words
  - The English language has many, many words that show up in text but are not particularly important for content, such as the, and, if, a, etc. These words are called Stop Words, and it would make your index smaller if you removed such stop words from the index. Here is more info about stop words.
- Stemming
  - In the current design, if a user searches for section they won't find matches for sections, even though pages that mention either might be a relevant match. Stemming is the process of reducing words to their base form, so that (for example) both section and sections would become, simply, section in the index.
If you have other creative ideas for extensions, run them by the course staff, and we'd be happy to give you guidance!