Soundex Search


This program is all about C++ string processing, with a little bit of file reading and use of Vector. For this program, you will be writing code and making changes in soundex.cpp, as well as answering a few short answer questions in short_answer.txt. You can find this text file under "Other Files" in the Qt Creator project pane.

Here is a summary of what you can expect to complete in this part of the assignment:

  • You will evaluate the efficacy of a real-world algorithm used by the U.S. Census to encode the phonetic pronunciation of surname.
  • You will implement the algorithm, developing a function that can take surnames as input and produce phonetic encodings as output.
  • You will implement a console program that allows users to input a surname and then find all matches in a database of Stanford surnames that have the same encoding.

Why phonetic name matching is useful

One of the more pesky features of the English language is the lack of consistency between phonetics and spelling. Matching surnames is a particularly vexing problem because many common surnames come in a variety of spellings and continue to change over time and distance. Many of these variations are the result of incorrectly inputted data, cultural differences in spelling, and transliteration errors.

Traditional string matching algorithms that use exact match or partial/overlap match perform poorly in this messy milieu of real world data. In contrast, the Soundex system groups names by phonetic structure to enable matching by pronunciation rather than literal character match. This make tasks like tracking genealogy or searching for a spoken surname easier.

Soundex was patented by Margaret O'Dell and Robert C. Russell in 1918, and the U.S. Census is a big consumer of Soundex along with genealogical researchers, directory assistance, and background investigators. The Soundex algorithm is a coded index based on the way a name sounds rather than the way it is spelled. Surnames that sound the same but are spelled differently, like "Vaska," "Vasque," and "Vussky," have the same code and are classified together.

How Soundex codes are generated

Before you get started with coding, we want to introduce the Soundex algorithm itself. A Soundex code is a four-character string in the form of an initial letter followed by three digits, such as Z452. The initial letter is the first letter of the surname, and the three digits are drawn from the sounds within the surname using the following algorithm:

  1. Discard all non-letter characters from surname: dashes, spaces, apostrophes, and so on.
  2. Encode each letter as a digit using the table below.

    Digit represents the letters
    0 A E I O U H W Y
    1 B F P V
    2 C G J K Q S X Z
    3 D T
    4 L
    5 M N
    6 R
  3. Coalesce adjacent duplicate digits from code (e.g. 222025 becomes 2025).
  4. Replace the first digit of code with the first letter of the original name, converting to uppercase.
  5. Remove all zeros from code.
  6. Make the code exactly length 4 by padding with zeros or truncating the excess.

To ensure you understand the construction, get a piece of scratch paper and manually compute a few names, such as "Curie" (C600) and "O'Conner" (O256).

Q11. What is the Soundex code for "Angelou"? What is the code for your own surname?

Decomposing the problem

Your best strategy for approaching a complex algorithm like this is to decompose the problem into smaller, more manageable tasks and proceed step by step, testing as you go.

Q12. Before writing any code, brainstorm your plan of attack and sketch how you might decompose the work into smaller tasks. Briefly describe your decomposition strategy.

To get you started, we're going to walk you through what it might look like to decompose and implement the first step of the Soundex algorithm. Decomposition is important here because if you tried to implement a single function that accomplished the whole Soundex algorithm all in one go, you could end up with one big, unwieldy piece of code. However, if you break down the problem into a number of different steps, each corresponding to its own helper function, you can develop these helper functions one at a time and test each one as you go.

For example, Step 1 of the Soundex algorithm might correspond to a helper function to remove non-letters from the string. The C++ library function isalpha will report whether a character is alphabetic, and the function substr can be used to extract subsequences from a string. Putting those two together, we have this starting point (provided for you in soundex.cpp):

// WARNING: Code is buggy! Add test cases to identify which inputs are mishandled
string removeNonLetters(string s) {
    for (int i = 0; i < s.length(); i++) {
        if (!isalpha(s[i])) {
            // remove the character at index i
            s = s.substr(0,i) + s.substr(i+1);
        }
    }
    return s;
}

Testing, testing, testing

With an implementation of removeNonLetters drafted, the next step is to test it. Our starter code includes some provided tests to get you started. Run the program and when prompted, select the option from the menu for tests from soundex.cpp. When you run these provided tests, you will see that the tests for the removeNonLetters function all pass. From this, you might conclude that the function is good to go. However, there is a problem lurking within the code that has yet to be uncovered!

Review our provided test cases to see what sorts of inputs we tested, and, more importantly, what sorts of inputs we didn't test. Brainstorm what those missing cases are and then add them. Think about edge cases that could lurk in the extremes or cases that are uniquely distinguished from the provided tests. As a hint, notice that all of the provided test cases have exactly one non-letter in the input. Add your own tests that target inputs at the extremes or that are otherwise unusual, such as input that is completely empty or composed of only non-letters.

You should add at least 1 new student test to expose the bug in the given implementation of removeNonLetters.

Your goal in writing tests is to enumerate a comprehensive set of tests that brings any bugs in the code to light so you can debug and fix them. Good tests are the key to writing robust software. A great developer is not only a great coder, but also a great tester.

Debugging a failing test case

Once you have added a test case that fails, use the debugger to get more information about how the function has gone wrong. Set a breakpoint within the code block that contains your test case. A good place to stop is on the line with the operation that you want to trace through, like so:

breakpoint on line that has call to removeNonLetters

Now run the tests while using the debugger. When you hit the breakpoint, you can single step through the call to removeNonLetters while watching the debugger's variables pane to see the different values that your variables take on. This "step-and-watch" approach is the same as you used in the Assignment 0 debugging tutorial.

From this point, you should be able to find the bug and fix it – go ahead and do that!

Implementing the Soundex algorithm

Once you've fixed the bug from above, you have a working helper function to complete the first step of the Soundex algorithm. From here, you'll follow a similar process to complete the remaining steps of the full Soundex algorithm. Your ultimate goal is to implement the function:

string soundex(string s)

But not in one fell sweep. Instead, revisit the instructions in the How Soundex codes are generated section and follow this process:

  • Pick a small task or part of the algorithm to implement.
  • Define a new helper function to accomplish that task.
  • Write student tests that confirm the expected behavior of the new function.
  • Fill in the code for your helper function, debugging as you go. Continue writing code and debugging until you have passed all the tests from the previous step.
  • Rinse and repeat.

Your end goal is to pass all of the top-level Soundex tests that we've provided for you in the starter code. Once you pass all these tests, you've got a pretty solid working implementation of the Soundex algorithm!

As always though, it is important to add tests of your own to make sure you have covered the full possible range of inputs into your function. Make sure to add some tests on the completed high-level soundex function, in addition to the tests that you're writing for your decomposed helper functions.

Note: Remember, the Soundex algorithm does not distinguish case in the input; letters can be lower, upper, or mixed case. The first character in the result code is always in upper case.

Developing a census search program

The capstone of this part of the assignment is to build a program that emulates the way in which the U.S. Census uses the Soundex algorithm. In particular, the console program that you will write allows a user to perform a Soundex search on a database of surnames. You should implement the following function prototype:

void soundexSearch(string filepath)

This function defines a console program that allows the user to specify a text file containing a database of names to search. The program then repeatedly allows the user to enter a surname to look up in the database. For each surname that is entered, the program calculates the Soundex code of the entered name and then finds all names in the database that have a matching Soundex code. You should implement the following steps when writing this program:

  1. Read a database of surnames from the specified text file.
    • This step is provided for you in the starter code. The "database" is simply a Vector<string>.
  2. Prompt the user to enter a surname.
    • The function getLine from "simpio.h" will be helpful here.
  3. Compute the Soundex code of the surname.
  4. Iterate over the database, compute Soundex code of each name, and gather a vector of those surnames with a matching code.
  5. Print the matches in sorted order.
    • The Vector has a handy sort() operation (you can use vec.sort() where vec is the name of your vector), and you can print a vector using the << operator, e.g. cout << vec << endl;. Please note that the sort() function sorts the vector in place and does not return a value.
  6. Repeat steps 2-5 until the user indicates that they are done.

In order to run this code, we will need to switch back from running in testing mode to running in main mode. Go to main.cpp and change the SELECTED_TESTS argument back to NO_TESTS. Then, comment out the call to findPerfects from the first part of the assignment and uncomment the call to soundexSearch. Once you have completed these two steps, running your program should bring up a console window that allows you to test your database search console program.

Below is the output from a sample run of the program. If you are able to match this sample output exactly, then you have successfully completed this part of the assignment!

Read file res/surnames.txt, 26584 names found.

Enter a surname (RETURN to quit): Zelenski
Soundex code is Z452
Matches from database: {"Zelenski", "Zelnick", "Zelnik", "Zielonka"}

Enter a surname (RETURN to quit): hanrahan
Soundex code is H565
Matches from database: {"Hammerman", "Hamren", "Haner-McAllister", "Hanrahan"}

Enter a surname (RETURN to quit): 
All done!

Considering limitations of Soundex

Now that you understand how Soundex works, we want you to consider some of its limitations. The U.S. census has extensively used a variation of the Soundex algorithm for almost a century, but just because it has been used in practice does not make it a perfect algorithm!

A consistent theme of this class is that we want you to consider the ethical implications of the algorithms that we ask you to implement. Like many of the algorithms we will encounter in CS106B, Soundex is used in real systems. For example, the US Census has relied on a variation of the Soundex algorithm for almost a century to group names so that researchers can “find a surname even though it may have been recorded under various spellings.” Now that you understand how Soundex works, we want you to consider some of its limitations.

Q13. Think about one or more examples of a class of names that the Soundex system might not work well for. Explain what this class of names is and why the system might incorrectly group them or mis-categorize one of the names.

(Hint: You may find it useful to look over the other possible phonetic systems in Extending Your Phonetic Algorithm.)

As you know, not all names can be represented using ASCII; many are more accurately represented using Unicode. However, ASCII is the default standard in C++ and many other programming languages and computer systems. The strings you've been working with in this assignment all use ASCII.

Q14. Suppose you are a software engineer working for the US government. You have been tasked with implementing a system that collects names as part of the Census survey and analyzes these names using a phonetic algorithm. Your boss suggests using Soundex and notes that your algorithm is only expected to work for ASCII-encoded strings, since supporting Unicode would require extra work and time. What would your response be and why? What representational harms might result from building a system that exclusively uses Soundex and/or ASCII?

Note: To answer Q14, you will need the content of the Ethics lecture on Strings and Linguistic Representations which will be delivered on Monday, April 5.

Additional advice for testing

  • Testing is key for implementing this problem. Make sure that you take advantage of the SimpleTest testing framework!
  • If a test is failing, put a breakpoint inside the failing test, and step through the code line by line using the debugger. Keep an eye on the variables pane – that will be your most helpful resource to figure out what is going wrong!
  • The starter project includes a text file res/surnames.txt containing the surnames of Stanford students, faculty, and staff for the use of Soundex search. There are some 26,000 unique surnames within the Stanford community, wow! For early testing, you may want to change the program to read names from the file res/small.txt instead which contains a tiny set of names that is easier to debug on.
  • Your soundex function does not have to handle the empty string – in other words, all strings passed into soundex will have at least one character.

Useful resources

  • On the Qt Creator window, there is search field in the lower left labeled with a magnifying glass icon. If you enter the name of a header file (e.g. strlib.h or cctype), Qt will display its contents. This is a quick way to review the features of a given library.
  • You can also browse/search all library documentation online:
  • For this assignment, you will be using library functions that operate on strings and characters, in particular:
    • The standard library string functions length, substr, concat, replace, etc. will be useful, along with the Stanford-specific string functions in strlib.h.
    • For case conversions, you can use the functions in cctype to convert a single char or string functions in strlib.h to convert an entire string.
    • Remember that C++ treats individual characters differently than strings of characters. Individual characters have type char and are enclosed in single quotes (i.e. 'a' not "a"). Strings have type string and enclose a sequence of characters in double-quotes (i.e. "hello" not 'hello'). There are helpful functions in strlib.h to aid you with converting between the two, including the charToString and stringToChar functions.
    • In a similar vein, the integer 5 is a distinct beast from the digit character '5' or the string "5". Take care to express the type carefully to match your intentions.
  • Guide to Transitioning from Python to C++
  • Textbook

Soundex references

Extending your phonetic algorithm

If you have completed the assignment and want to go further, we encourage you to try working on an extension! There are many other phonetic systems out there besides Soundex. Here is a non-extensive list:

Try implementing one of these other systems and see if you can get better or more intuitive surname matches! When implementing an extension, add a new .cpp file to your project that contains the extension code, keeping it separate from the regular Soundex implementation. If you have other creative ides for extensions, run them by the course staff, and we'd be happy to give you guidance!