Assign3: Text prediction
A long, long time ago in a land far, far away, smartphones with their full-power keyboards didn't exist, and texting had to be done using a numerical keyboard, with only the digits 0-9 available on a standard phone keypad. With only 10 digits for 26 letters, each digit key was necessarily mapped to several letters and a variety of janky interfaces were used to indicate which letter you intended. An example of an old-style keypad is shown below, with the associated number-to-letter mappings.
To work around this restriction, there were many algorithms used to try and predict what words a user was trying to type based off of the digits they had entered. One of these algorithms was Tegic T9 (which stands for Text on 9 keys), a predictive algorithm that guessed which letter was intended based on which sequences formed actual words.
Let's work through an example to see how this algorithm works. For example, let's say that the user taps the digit keys "6263".
- Looking at the keypad, the number 6 maps to the potential letters "MNO", the number 2 maps to the potential letters "ABC", and the number 3 maps to the potential letters "DEF".
- Since there are 3 letters corresponding to each possible digit in our sequence, we know that there are
3*3*3*3 = 81
letter sequences possibly intended. Let's investigate what some of those letter sequences might be. - The first digit tapped was 6; the three possibilities for this digit are 'm', 'n', or 'o'. Our recursive decision tree will thus branch off to consider words starting with "m", words starting with "n", and words starting with "o". Let's say that we first consider the words starting with "m". The next decision we consider is whether the second letter is 'a', 'b', or 'c'. The possibilities considered down this branch of the decision will include such sequences as "mamd", "mame", "mamf", "mbmd", "mbme", and many others. However, only one of these many combinations is an actual English word contained in the dictionary: "mane". We save this word as one of the possible suggestions for what the user was trying to enter.
- If we repeat this process for the remaining potential starting letters, we would find that there are two more valid words among the remaining letter sequences: "name", and "oboe", which are added to the set. The complete set of suggestions predicted for "6263" are the words "mane", "name", and "oboe".
Thus, your task is to write the function
void predict(string digits, Set<string>& suggestions, Lexicon& lex, string sofar = "")
that implements a T9-style prediction algorithm. The function is given the sequence of digit keys tapped, say, "6263". It recursively forms all of the possible letter sequences that could have been intended ("mamd", "mame", and so on). Whichever of those letter sequences that is a valid word according to the Lexicon
is added to the set of suggestions
. The final result of calling this function is that all valid English words corresponding to the provided digit sequence are stored in the suggestions
set.
Notes
- Your function must operate recursively. You should not loop over the Lexicon trying to match words to the keys tapped. Instead use recursion to generate letter sequences which are looked up in the Lexicon for viability as actual words.
- The Stanford library includes a
Lexicon
ADT. This special-purpose class stores a word list and comes with a datafile containing a large English dictionary. The two operations most commonly used on a Lexicon arelex.contains(str)
(isstr
a valid word?) andlex.containPrefix(pre)
(are there any words that begin withpre
?). We've provided a code snippet below to help you think about how you might use theLexicon
class.Lexicon lex("res/EnglishWords.dat"); // initialize lex with words from file if (lex.contains("oboe")) { // do something } if (lex.containsPrefix("ana")) { // do something else }
The
containsPrefix
operation is particularly useful if you use it to detect when a letter sequence has hit a dead end for which no further exploration will be fruitful. - If the digit sequence contains keys that are not assigned letters (e.g. 1, #),
predict
should ignore those taps and find suggestions for the remaining taps.
Extensions
- The real-world T9 algorithm included a feature to recognize and correct typing/texting errors, by looking at neighboring keys on the keypad to determine an incorrect tap. For example, the word "testing" is entered with the key sequence
8378464
. Attempting to enter that number but hitting two incorrect taps of neighboring keys, e.g.,8278494
, results in T9 suggesting the words "tasting" (8278464
), "testing" (8378464
), and "tapping" (8277464
). Finding these alternate suggestions would be a marvelous further application of recursion. - Another cool recursive prediction algorithm is the "swipe-to-type" interface prevalent on modern cell phones. Your prediction algorithm is likely capable of being adapted to work for both T9 and swipe-to-type by configuring the set of possibilities for each user tap/gesture. Neat!