Assign1: Soundex Search
Census Day passed on April 1, 2020. If you live in the U.S. and haven't yet responded, take action to be counted! ๐๐ผ๐๐ฝ๐๐ฟ๐๐ป๐
This exercise is all about C++ string processing, with a little bit of file reading and use of the Vector data structure. For this part of the assignment, you will be writing code and making changes in soundex.cpp, as well as answering some short answer questions in short_answer.txt. As in the last part of the assignment, you can find this text file under "Other Files > src" in the Qt Creator project view.
Here is a summary of what you can expect to complete in this part of the assignment:
- You will evaluate the efficacy of a real-world algorthm used by the U.S. Census to encode the phonetic pronunciation of last names.
- You will implement the algorithm, developing a function that can take last names as input and produce phonetic encodings as output.
- You will implement a console program that allows users to input a surname and then find all matches in a database of Stanford surnames that have the same last name encoding.
Why phonetic name matching is useful
One of the more pesky features of the English language is the lack of consistency between phonetics and spelling. Matching surnames is a particularly vexing problem because many common surnames come in a variety of spellings and continue to change over time and distance. Many of these variations are the result of incorrectly inputted data, cultural differences in spelling, and transliteration errors.
The Soundex system attempts to group last names by phonetic structure in order to make tasks like tracking genealogy or searching for folks by surname easier. In contrast, traditional string matching focuses on exact matches or partial/overlap match, and this does not adapt well to the messy situations that arise when you have real-word data input errors due to human error.
Soundex was patented by Margaret O'Dell and Robert C. Russell in 1918, and the U.S. Census is a big consumer of Soundex along with genealogical researchers, directory assistance, and background investigators. The Soundex algorithm is a coded index based on the way a name sounds rather than the way it is spelled. Surnames that sound the same but are spelled differently, like "Vaska," "Vasque," and "Vussky," have the same code and are classified together.
How Soundex codes are generated
Before you get started with coding, we want to introduce the Soundex algorithm itself. A Soundex code is a four-character string in the form of an initial letter followed by three digits, such as Z452. The initial letter is the first letter of the surname, and the three digits are drawn from the sounds within the surname using the following algorithm:
- Discard all non-letter characters from surname: dashes, spaces, apostrophes, and so on.
-
Encode each letter as a digit using the table below.
Digit represents the letters 0 A E I O U H W Y 1 B F P V 2 C G J K Q S X Z 3 D T 4 L 5 M N 6 R - Coalesce adjacent duplicate digits from code (e.g.
222025becomes2025). - Replace the first digit of code with the first letter of the original name, converting to uppercase if necessary.
- Remove all zeros from code.
- Make the code exactly length 4 by padding with zeros or truncating the excess.
To ensure you understand the construction, get a piece of scratch paper and manually compute a few names, such as "Jue" (J000) and "Master" (M236).
- Q11: What is the Soundex code for "Bowman"? What about for your own last name?
Decomposing the problem
Your best strategy for approaching a complex algorithm like this is to decompose the problem into smaller, more manageable tasks and proceed step by step, testing as you go.
- Q12: Before writing any code, take some time to brainstorm your decomposition strategy and read over the existing starter code for this problem. Briefly describe your decomposition strategy as the answer to this question.
To get you started, we're going to walk you through what it might look like to decompose and implement the first step of the Soundex algorithm. Decomposition is important here because if you tried to implement a single function that accomplished the whole Soundex algorithm all in one go, you might end up with one big, unwieldy piece of code. However, if you break down the problem into a number of different steps, each corresponding to its own helper function, you can develop these helper functions one at a time and test each one as you go.
For example, Step 1 of the Soundex algorithm might correspond to a helper function to remove non-letters from the string. The C++ library has a function isalpha that reports whether a character is alphabetic, and the string operation substr can be used to extract certain desired characters out of a string. Putting those two together, we have this starting point (provided for you in soundex.cpp):
// WARNING: Code is buggy! Use unit tests to identify which inputs are mishandled
string removeNonLetters(string s) {
for (int i = 0; i < s.length(); i++) {
if (!isalpha(s[i])) {
// remove the character at index i
s = s.substr(0,i) + s.substr(i+1);
}
}
return s;
}
Testing, testing, testing
With an implementation of removeNonLetters drafted, the next step is to test it. We've provided a few unit tests to get you started. To run the provided tests that we have given to you, go ahead and hit the green play button and select option "4" from the menu to run the tests located in soundex.cpp. If you run these provided tests, you will see that the tests for the removeNonLetters function all pass. From this, you might conclude that the function is good to go. However, there is a problem lurking within the code that has yet to be uncovered!
Review our provided test cases to see what sorts of inputs we tested, and, more importantly, what sorts of inputs we didn't test. Brainstorm what those missing cases are and then add them. Think about edge cases that could lurk in the extremes or cases that are uniquely distinguished from the provided tests. As a hint, notice that all of the provided test cases have exactly one non-letter in the input. Add your own tests that target inputs at the extremes or that are otherwise unusual, such as input that is completely empty or composed of only non-letters.
You should add at least 1 new test to expose the bug in the given implementation of removeNonLetters.
Your goal in writing tests is to enumerate a comprehensive set of tests that brings any bugs in the code to light so you can debug and fix them. Good tests are the key to writing robust software. A great developer is not only a great coder, but also a great tester.
Debugging a failing test case
Once you have added a test case that fails, use the debugger to get more information about how the function has gone wrong. Set a breakpoint within the code block that contains your test case. A good place to stop is on the line with the operation that you want to trace through, like so:

Now run the tests while using the debugger. When you hit the breakpoint, you can single step through the call toremoveNonLetters while watching the debugger's variables pane to see the different values that your variables take on. This "step-and-watch" approach is the same as you used in the Assignment 0 debugging tutorial.
From this point, you should be able to find the bug and fix it โย go ahead and do that!
Implementing the Soundex algorithm
Once you've fixed the bug from above, you have a working helper function to complete the first step of the Soundex algorithm. You're now set up to follow a similar process to complete the rest of the assignment. In particular, you should implement the following function prototype:
string soundex(string s)
To do so, we recommend revisting the the instructions in the How Soundex codes are generated section and following these steps:
- Pick a small task or part of the algorithm to implement.
- Define a new helper function to accomplish that task.
- Write some tests that define the expected behavior of this new helper function.
- Fill in the code for your helper function, debugging as you go. Continue writing code and debugging until you have passed all the tests from the previous step.
- Rinse and repeat.
Your end goal is to pass all of the top-level Soundex tests that we've provided for you in the starter code. Once you pass all these tests, you've got a pretty solid working implementation of the Soundex algorithm!
As always though, it is important to add tests of your own to make sure you have covered the full possible range of inputs into your function. Make sure to add a couple of additional high-level Soundex tests as well, in addition to the tests that you're writing for your decomposed functions.
Note: Remember, the Soundex algorithm does not distinguish case in the input; letters can be lower, upper, or mixed case. The first character in the result code is always in upper case.
Developing a census search program
The capstone of this part of the assignment is to build a program that emulates the way in which the U.S. Census uses the Soundex algorithm. In particular, the console program that you will write allows a user to perform a Soundex search on a database of surnames. You should implement the following function prototype:
void soundexSearch(string filepath)
This function defines a console program that allows the user to specify a text file containing a database of names to search. The program then repeatedly allows the user to specify surnames that they want to look up in the database. For each surname that is specified, the program calculates the Soundex code of the specified name and then finds all names in the database that have a matching Soundex code. You should implement the following steps when writing this program:
- Read a database of surnames from the specified text file.
- This step is provided for you in the starter code. The "database" is simply a
Vector<string>.
- This step is provided for you in the starter code. The "database" is simply a
- Prompt the user to enter a surname.
- The function
getLinefrom"simpio.h"will be helpful here.
- The function
- Compute the Soundex code of the surname.
- Iterate over the database, compute Soundex of each name, and gather a vector of those surnames with a matching code.
- Print the matches in sorted order.
- The Vector has a handy
sort()operation (you can usevec.sort()wherevecis the name of your vector), and you can print a vector using the<<operator, e.g.cout << vec << endl;. Please note that thesort()function sorts the vector in place and does not return a value.
- The Vector has a handy
- Repeat steps 2-5 until the user indicates that they are done.
In order to run this code, we will need to switch back from running in testing mode to running in main mode. Go to main.cpp and change the SELECTED_TESTS argument back to NO_TESTS. Then, comment out the call to findPerfects from the first part of the assignment and uncomment the call to soundexSearch. Once you have completed these two steps, running your code with the green play button should result in a console window being pulled up that allows you to test your database search console program.
Below is the output from a sample run of the program. If you are able to match this sample output exactly, then you have successfuly completed this part of the assignment!
Read file res/surnames.txt, 26409 names found.
Enter a surname (RETURN to quit): Zelenski
Soundex code is Z452
Matches from database: {"Zelenski", "Zelnick", "Zelnik", "Zelnis", "Zielonka"}
Enter a surname (RETURN to quit): troccoli
Soundex code is T624
Matches from database: {"Therkelsen", "Torkelson", "Trakul", "Traxler", "Trisal", "Troccoli", "Trockel", "Troxel", "Troxell", "Trujillo", "Turkel"}
Enter a surname (RETURN to quit):
All done!
Considering limitations of Soundex
Now that you understand how Soundex works, we want you to consider some of its limitations. The U.S. census has extensively used a variation of the Soundex algorithm for almost a century, but just because it has been used in practice does not make it a perfect algorithm! Q13: Think about one or more examples of a class of names that the Soundex system might not work well for. Explain what this class of names is and why the system might incorrectly group them or mis-categorize one of the names. (Hint: You may find it useful to look over the other possible phonetic systems in Extending Your Phonetic Algorithm.)
A consistent theme of this class is that we want you to consider the ethical implications of the algorithms that we ask you to implement. While using computational systems to solve problems can be powerful, the cases that these systems ignore or perform poorly on can have a disproportionate negative impact on people. Considering a wide variety of use cases is part of why testing your code is so important!
Additional advice for testing
- Testing is key for implementing this problem. Make sure that you take advantage of the SimpleTest testing framework!
- If a test is failing, put a breakpoint inside the failing test, and step through the code line by line using the debugger. Keep an eye on the variables pane โย that will be your most helpful resource to figure out what is going wrong!
- The starter project includes a text file
res/surnames.txtcontaining the surnames of Stanford students, faculty, and staff for the use of Soundex search. There are some 26,000 unique surnames within the Stanford community, wow! For early testing, you may want to change the program to read names from the fileres/small.txtinstead which contains a tiny set of names that is easier to debug on.
Useful resources
- Make sure to take advantage of the
length,substr,concat,replace, etc. string functions. Also you may find the functions available in"strlib.h"useful as well. - For case conversions, you can either use the functions in
<cctype>if you are operating on acharvariable or in"strlib.h"if you are operating on thestringdatatype. - Remember that C++ treats individual characters differently than strings. Individual characters have type
char. To talk about a specific single character, you must use single quotes (i.e.'a'rather than"a"). Strings have typestring. To talk about a specific string, you must use double-quotes (i.e."hello"rather than 'hello'). There are some helpful functions in"strlib.h"to aid you with programatically doing type conversions, including thecharToStringandstringToCharfunctions. - In a similar vein, the integer 5 is a distinct beast from the digit character '5' or the string "5". Take care to express the type carefully to match your intentions.
- Welcome to C++ Guide
- C++ Standard Library Documentation
- Stanford C++ Library Documentation
- Textbook
Soundex references
- https://www.archives.gov/research/census/soundex
- Online Soundex code calculator: http://www.eogn.com/soundex/
Extending your phonetic algorithm
If you finish the main portion of the assignment and have some extra time, we encourage you to try working on an extension! There are lots of other phonetic systems out there besides Soundex. Here is a nonextensive list:
Try implementing one of these other systems and see if you can get better or more intuitive surname matches! We recommend adding new files to the project to implement your extension. When submitting extensions on Paperless, make sure to make a separate submission for your extension code so that it doesn't interfere with the main functionality of the Soundex system. If you have other creative ides for extensions, run them by the course staff, and we'd be happy to give you guidance!