Text prediction


A long time ago in a land far away, smartphones with their full-power keyboards didn't exist, and texting had to be done on a number keypad, using only the digits 0-9. With only 10 digits for 26 letters, each digit key was necessarily mapped to several letters and a variety of janky interfaces were used to indicate which specific letter you intended. An example of an old-style keypad is shown below, with the associated number-to-letter mappings.

telephone keypad

To work around this restriction, there were many algorithms to guess what word a user was trying to enter based on the digits typed. One of these algorithms was Tegic T9 (which stands for Text on 9 keys), a predictive algorithm that guessed the intended letter for a digit by which sequences formed actual words.

Let's work through an example to see how this algorithm works.

  • The user taps the digit keys 6263.
  • On the keypad, the digit 6 maps to the set of potential letters {M,N,O}, the digit 2 to {A,B,C} and the digit 3 to {D,E,F}.
  • The user's sequence has four digits and three letter choices per digit, leading to a total of 3*3*3*3 = 81 possible letter sequences to explore.
  • Consider the recursive decision tree of possible sequences. The first digit tapped was 6; the three possibilities for this digit are {M,N,O}. The recursive decision tree will have three branches: one to form words starting with M, one to form words starting with N, and one for words starting with O. Visualize (or sketch) the rest of the tree to get your bearings.
  • The search first chooses M and goes down the branch that explores all words that start with M. The next decision is choosing whether the second letter is A, B, or C, and after that it chooses the third letter and so on. The letter sequences recursively explored down the M branch include MAMD, MAME, MAMF, MBMD, and many others. However, only one of these combinations is an actual word in the English dictionary: MANE. When you encounter this word, save it as a suggestion for what the user was trying to enter.
  • After unchoosing, the search will backtrack to try other choices and thereby explore other branches of the trees. When exploring sequences starting with N and then those starting with O, the search finds two other valid words: NAME and OBOE, which are also added to the set. The complete set of suggestions predicted for 6263 are the words "mane," "name," and "oboe."

Your task

You are to write a function that use a T9-style prediction algorithm to suggest words that match a sequence of tapped digits. The function prototype is

void predict(string digits, Set<string>& suggestions, Lexicon& lex)

The function operates by recursively exploring all the possible letter sequences formed by the tapped digits and storing those sequences that are actual words in the set of suggestions.

The starter code declares a program-wide constant keypad for the standard mapping from numbers to sets of letters. Here, we make an important distinction – keypad is a constant and not a global variable. Program-wide global variables (that is, variables that are declared in a scope that is outside any specific function) are considered poor style and can lead to functional errors and debugging challenges. The CS106B style guide sets down a clear expectation that the global variables are never welcome in this course.

Q6. In your own words, describe some of the specific drawbacks to using global variables and explain why a program-wide constant like keypad does not suffer from the same drawbacks.

With this justification, you should understand why using a program-wide constant is an acceptable and encouraged practice (and why global variables are not).

Notes

  • Helper functions. There is a firm requirement that the function predict is implemented to exactly match the prototype above. You are free to add additional helper functions with whatever parameters you like.
  • English lexicon. The Stanford library includes Lexicon, a special-purpose class to represent a word list. We also supply a lexicon data file containing a large list of English words. The sample code below demonstrates using a Lexicon to lookup words and prefixes. For more details, see the Lexicon documentation.
     Lexicon lex("res/EnglishWords.txt"); // initialize lex with words from file
        
     if (lex.contains("oboe")) {      // if oboe is on the word list
     	// do something 
     }
    
     if (lex.containsPrefix("ana")) { // if any words on list begin "ana...
     	// do something else
     }
    
  • Pruning. The Lexicon containsPrefix should be used to prune dead-end searches. For example, if the digits tapped were 7849 and the letter sequence being explored thus far is "PV," there is no point in continuing down this path because there are no words in the lexicon that begin with that prefix. Pruning is most cleanly handled by adding a base case to detect a dead end and immediately returning without making further recursive calls.
  • Use of ADTs. You will use Set , Map, and Lexicon in solving this problem. As it can be expensive to make copies of large data structures, pass these variables by reference to allow efficient sharing without copying. Note that the Lexicon lookup is case-insensitive; in contrast, a Set does distinguish by case when comparing two strings. The words in the set of suggestions should be in all lower-case to match our expected result.
  • Efficiency. With a correct algorithm, effective pruning, and appropriate use of ADTs, finding predictions should be blindingly quick, finishing within a second or two.
  • Converting digit chars to numbers. The strlib.h function int charToInteger(char) can be used to convert a single digit character to its numeric value. Your function may assume that the input will contain only digit characters that have assigned letters (e.g. only 2 through 9). Inputs such as 0, 1, #, * or any other character are illegal. You do not have to handle these other characters, and we will not test on such illegal inputs.
  • Testing. A simple way to add test cases of your own is to choose a word, convert it to tapped digits, pass those digits to predict, and be sure the original word is in the set of suggestions. Verify that the other suggested words also match the tapped digits.
  • Assumptions. You may assume that the input digits contain only valid digit characters, that the suggestions set is empty to start, and that the Lexicon is a populated word list. The function adds to the suggestion set all letter sequences that are determined to be valid words, i.e. contained in the lexicon. If none are found, the suggestion set remains empty.

Extensions

  • Error correction. The real-world T9 algorithm included a feature to recognize and correct typing/texting errors, by looking at neighboring keys on the keypad to correct typos. For example, the word "testing" is entered with the digit keys 8378464. If the user mistyped a 2 for the 3 and a 9 for the 6, the mis-entered digits 8278494 match no words. Rather than report no suggestions in that case, the fancier search would instead find words that are near matches. It would suggest words such as "tasting" (8278464), "testing" (8378464), and "tapping" (8277464) that have corrected up to two typos in the entered digits. Finding near-match suggestions is a marvelous further application of your recursive superpowers.
  • Auto-complete. After tapping the initial sequence of few digits, recursion can be further applied to find ways to extend the sequence to form a valid word and to offer these auto-completions to the user. The search for possible auto-completions is similar to the basic T9 algorithm, with a few little tweaks. More practice to build your core strength with recursion!
  • Other prediction/recognition tasks. Another cool recursive prediction algorithm is the "swipe-to-type" interface prevalent on modern cell phones. The same general prediction algorithm can be adapted to work for both T9 and swipe-to-type by configuring the set of possibilities for each user tap/gesture. Neat!