⚠️ This page is out of date, from a past version of the course Fall Quarter 2020. The current quarter site is at cs106b.stanford.edu

Lexicon ADT: A Case Study

CS 106B: Programming Abstractions

Autumn 2020, Stanford University Computer Science Department

Lecturers: Chris Gregg and Julie Zelenski

Photo of page in English dictionary

Slide 2

Announcements

Last step: personal project presentation
- Report/code/tests submitted to paperless yesterday, hooray!
- Sign up for time with your section leader to present your project
  - Check out our presentation tips and recorded presentation sample
  - Many thanks to SL Sanjaye Narayan for preparing these resources
- We are excited to see your creative accomplishments
  - Congratulations on completing your CS106B journey!

Today's Topics

Lexicon Case Study

Slide 3

What is a lexicon?

Lexicon is a special case of Set<string>
- List of words (and nothing more, no definition/synonyms/etymology)
- Words are (short) strings
- Thousands of entries
- Lexicon test case is OSPD2 125,000 words, average 8 chars, file on disk 1.1 MB
Define as C++ class, Lexicon ADT
- Ease of use for client, keep details private, can swap out implementation
- Core operations: contains and containsPrefix
Key requirements for implementation
- Very efficient search
- Very compact space usage
- Stretch goal: fast/easy to populate

Slide 4

Lexicon as sorted vector

class Lexicon {
private:
    Vector<string> _words;  
                 // ordered lexicographically

Diagram of sorted vector of words

How to implement core operations?

Solution

contains?
- binary search, O(logN)
containsPrefix?
- binary search (substr comparison), O(logN)
Memory used?
- Each string requires sizeof(char)*length, average word length 8
- ≈ 8 bytes per word * 125,000 words ⇒ 1MB total

Slide 5

Lexicon as binary search tree

class Lexicon {
private:
    TreeNode* _root;

Diagram of binary search tree

How to implement core operations?

Solution

contains?
- tree search, O(logN)
containsPrefix?
- tree search (substr comparison), O(logN)
Memory used
- Each word is tree node with string (8 bytes) and left/right pointers (each 8 bytes)
- ≈ 24 bytes per word * 125,000 words ⇒ 3MB total

Slide 6

Lexicon as hash table

class Lexicon {
private:
    ListNode* _buckets[NBuckets];

Diagram of hash table

How to implement core operations?

Solution

contains?
- hash to bucket, search bucket, O(1)
containsPrefix?
- search all buckets, O(N) (ouch!)
Memory used
- Each word is one list node = string (8) and next pointer (8), don't forget also N pointers in bucket array
- ≈ 24 bytes per word * 125,000 words ⇒ 3MB total

Slide 7

Options so far

	`contains`	`containsPrefix`	Memory
Sorted vector	O(logN)	O(logN)	1 MB
BST	O(logN)	O(logN)	3 MB
Hashtable	O(1)	O(N)	3 MB

Slide 8

Notice any patterns…?

straddle
straddled
straddler
straddlers
straddles
straddling
strafe
strafed
strafer
strafers
strafes
strafing
straggle
straggled
straggler
stragglers
straggles
stragglier
straggliest
straggling
straggly
straight
straightaway
straightaways
straighted
straightedge
straightedges
straighten
straightened
straightener
straighteners
straightening
straightens
straighter
straightest
straightforward
straightforwardly
straightforwardness
straighting
straightjacket
straightjackets
straightlaced
straightly
straights
straightway
strain
strained
strainer
strainers
straining
strains
strait
straiten
straitened
straitening
straitens
straiter
straitest
straitjacket
straitjackets
straitlaced
straitly
straitness
straits
strake
straked
strakes
stramash
stramashes
stramonies
stramony
strand
stranded
strander
stranders
stranding
strands
strang
strange
strangely
strangeness
stranger
strangered
strangering
strangers
strangest
strangle
strangled
stranglehold
strangleholds
strangler
stranglers
strangles
strangling
strangulate
strangulated
strangulates
strangulating
strangulation
strangulations
strap
strapless
strapped
strapper
strappers
strapping
strappings
straps
strass
strasses
strata
stratagem
stratagems
stratal
stratas
strategic
strategical
strategically
strategies
strategist
strategists
strategy
strath
straths
stratification
stratifications
stratified
stratifies
stratify
stratifying
stratosphere
stratospheres
stratospheric
stratous
stratum
stratums
stratus
stravage
stravaged
stravages
stravaging
stravaig
stravaiged
stravaiging
stravaigs
straw
strawberries
strawberry
strawed
strawhat

Slide 9

Letter trie

Diagram of letter trie

Trie from retrieval

(pronounced "try")
26-way branching tree
Paths are word prefixes
Path that ends at red letter is a word
Check out this neat demo from Chris!

Slide 10

Lexicon as trie

struct TrieNode {
    bool isWord;
    TrieNode* children[26];
};

class Lexicon {
private:
    TrieNode* _root;

Diagram of trie

How to implement core operations?

Solution

contains?
- trace path, check isWord flag, O(L) (L is length of word)
containsPrefix?
- trace path, O(L)
Memory used
- Each TrieNode 1 byte + 26 pointers (8) = 209 bytes
- Trie for OSPD2 has ~250,000 nodes
- 209 * 54 ⇒ 52.5MB total (ouch!)

Slide 11

Save space: dynamic array of pointers

struct TrieNode {
    char letter;
    bool isWord;
    TrieNode** children;
    int nChildren;
};

Diagram of trie

Replace fixed-size array of children pointers with right-sized array

Most of 26 entries are nullptr anyway

Memory used

Each TrieNode 1 + 1 + 4 + 8 + 8*numChildren (5 children on average) = 54 bytes total
54 * 250,000 nodes ⇒ 13.5 MB total

Slide 12

Save space: flatten into array

struct TrieNode {
    char letter;
    bool isWord;
    int childIndex;
    int nChildren;
};

Diagram of trie

Flatten tree into array

Similar to how we did for binary heap
- some complication because not always 2 children
No longer store any pointers!
Data structure less flexible (more expensive to update array)

Memory used

Each TrieNode reduced to 10 bytes
10 * 250,000 nodes ⇒ 2.5 MB total
- Now less memory than bst/hash!

Slide 13

Where to go next?

adapt
adapted
adapter
adapters
adapting
adaption
adaptions
adapts
adopt
adopted
adopter
adopters
adopting
adoption
adoptions
adopts
affect
affected
affecter
affecters
affecting
affection
affections
affects
assert
asserted
asserter
asserters
asserting
assertion
assertions
asserts
compact
compacted
compacter
compacters
compacting
compaction
compactions
compacts
corrupt
corrupted
corrupter
corrupters
corrupting
corruption
corruptions
corrupts
depict
depicted
depicter
depicters
depicting
depiction
depictions
depicts
desert
deserted
deserter
deserters
deserting
desertion
desertions
deserts
detect
detected
detecter
detecters
detecting
detection
detections
detects
fruit
fruited
fruiter
fruiters
fruiting
fruition
fruitions
fruits
impact
impacted
impacter
impacters
impacting
impaction
impactions
impacts
indent
indented
indenter
indenters
indenting
indention
indentions
indents
insert
inserted
inserter
inserters
inserting
insertion
insertions
inserts
intercept
intercepted
intercepter
intercepters
intercepting
interception
interceptions
intercepts
interrupt
interrupted
interrupter
interrupters
interrupting
interruption
interruptions
interrupts
invent
invented
inventer
inventers
inventing
invention
inventions
invents
perfect
perfected
perfecter
perfecters
perfecting
perfection
perfections
perfects
port
ported
porter
porters
porting
portion
portions
ports

Slide 14

Unify suffixes as well as prefixes!

Diagram of dawg

Dawg

Directed: Edges are one-way
Acyclic: No cycles, banana not bananana...na
Word: Each path traces prefix, node has isWord flag (red/bold in diagram above)
Graph: Can be multiple paths between nodes

Slide 15

Lexicon as dawg

struct DawgNode {
    char letter;
    bool isWord;
    int childIndex;
    int nChildren;
};
class Lexicon {
private:
    DawgNode* _root;

Diagram of trie

Compact trie into dawg

English words have even more more suffix redundancy than prefix
250,000 trie nodes reduce to 80,000 dawg nodes
125,000 words and only 80,000 nodes means many words have no unique dawg node

Memory used

10 bytes * 80,000 ⇒ 800KB
- less than the size of the dictionary file itself!

Slide 16

Bitpack to squeeze further

struct DawgNode {
  5 bits letter;
  1 bit isWord;
  1 bit for lastChild
  25 bits childIndex;
};

Diagram of bit-packed dawg

Apportion each bit very precisely

Only 26 letters, don't need full ASCII, squeeze in 5 bits (Huffman-esque)
1 bit boolean for each of isWord and lastChild
Remaining 25 bits used for index
10-byte DawgNode reduced to 4 bytes

Memory used

4 * 80,000 ⇒ .32 MB total
- less than a third of the size of original dictionary file – wow!

Slide 17

Neat facts about the dawg

Super-efficient to read/write/reconstitute

Write out entire array as-is, no processing
Indexes don't change, no pointers!
What you write is what you read, no allocate individual nodes, no add words one-by-one
That's what the binary EnglishWords.dat file is

Data structure facilitates other interesting operations

Tweak width of "beam" as you traverse to visit near neighbors
Regex, edit distance, spell correction, autocomplete/predictive text
Solving puzzles
Anagrams, hangman, scrabble

Original reference

"The World's Fastest Scrabble Program", Appel & Jacobsen, CACM May 1988