Lexicon ADT: A Case Study
CS 106B: Programming Abstractions
Autumn 2020, Stanford University Computer Science Department
Lecturers: Chris Gregg and Julie Zelenski
Slide 2
Announcements
- Last step: personal project presentation
- Report/code/tests submitted to paperless yesterday, hooray!
- Sign up for time with your section leader to present your project
- Check out our presentation tips and recorded presentation sample
- Many thanks to SL Sanjaye Narayan for preparing these resources
- We are excited to see your creative accomplishments
- Congratulations on completing your CS106B journey!
Today's Topics
- Lexicon Case Study
Slide 3
What is a lexicon?
Lexicon
is a special case ofSet<string>
- List of words (and nothing more, no definition/synonyms/etymology)
- Words are (short) strings
- Thousands of entries
- Lexicon test case is OSPD2 125,000 words, average 8 chars, file on disk 1.1 MB
- Define as C++ class,
Lexicon
ADT- Ease of use for client, keep details private, can swap out implementation
- Core operations:
contains
andcontainsPrefix
- Key requirements for implementation
- Very efficient search
- Very compact space usage
- Stretch goal: fast/easy to populate
Slide 4
Lexicon as sorted vector
class Lexicon {
private:
Vector<string> _words;
// ordered lexicographically
How to implement core operations?
contains
?- binary search, O(logN)
containsPrefix
?- binary search (substr comparison), O(logN)
- Memory used?
- Each string requires sizeof(char)*length, average word length 8
- ≈ 8 bytes per word * 125,000 words ⇒ 1MB total
Slide 5
Lexicon as binary search tree
class Lexicon {
private:
TreeNode* _root;
How to implement core operations?
contains
?- tree search, O(logN)
containsPrefix
?- tree search (substr comparison), O(logN)
- Memory used
- Each word is tree node with string (8 bytes) and left/right pointers (each 8 bytes)
- ≈ 24 bytes per word * 125,000 words ⇒ 3MB total
Slide 6
Lexicon as hash table
class Lexicon {
private:
ListNode* _buckets[NBuckets];
How to implement core operations?
contains
?- hash to bucket, search bucket, O(1)
containsPrefix
?- search all buckets, O(N) (ouch!)
- Memory used
- Each word is one list node = string (8) and next pointer (8), don't forget also N pointers in bucket array
- ≈ 24 bytes per word * 125,000 words ⇒ 3MB total
Slide 7
Options so far
contains |
containsPrefix |
Memory | |
---|---|---|---|
Sorted vector | O(logN) | O(logN) | 1 MB |
BST | O(logN) | O(logN) | 3 MB |
Hashtable | O(1) | O(N) | 3 MB |
Slide 8
Notice any patterns…?
- straddle
- straddled
- straddler
- straddlers
- straddles
- straddling
- strafe
- strafed
- strafer
- strafers
- strafes
- strafing
- straggle
- straggled
- straggler
- stragglers
- straggles
- stragglier
- straggliest
- straggling
- straggly
- straight
- straightaway
- straightaways
- straighted
- straightedge
- straightedges
- straighten
- straightened
- straightener
- straighteners
- straightening
- straightens
- straighter
- straightest
- straightforward
- straightforwardly
- straightforwardness
- straighting
- straightjacket
- straightjackets
- straightlaced
- straightly
- straights
- straightway
- strain
- strained
- strainer
- strainers
- straining
- strains
- strait
- straiten
- straitened
- straitening
- straitens
- straiter
- straitest
- straitjacket
- straitjackets
- straitlaced
- straitly
- straitness
- straits
- strake
- straked
- strakes
- stramash
- stramashes
- stramonies
- stramony
- strand
- stranded
- strander
- stranders
- stranding
- strands
- strang
- strange
- strangely
- strangeness
- stranger
- strangered
- strangering
- strangers
- strangest
- strangle
- strangled
- stranglehold
- strangleholds
- strangler
- stranglers
- strangles
- strangling
- strangulate
- strangulated
- strangulates
- strangulating
- strangulation
- strangulations
- strap
- strapless
- strapped
- strapper
- strappers
- strapping
- strappings
- straps
- strass
- strasses
- strata
- stratagem
- stratagems
- stratal
- stratas
- strategic
- strategical
- strategically
- strategies
- strategist
- strategists
- strategy
- strath
- straths
- stratification
- stratifications
- stratified
- stratifies
- stratify
- stratifying
- stratosphere
- stratospheres
- stratospheric
- stratous
- stratum
- stratums
- stratus
- stravage
- stravaged
- stravages
- stravaging
- stravaig
- stravaiged
- stravaiging
- stravaigs
- straw
- strawberries
- strawberry
- strawed
- strawhat
Slide 9
Letter trie
Trie from retrieval
- (pronounced "try")
- 26-way branching tree
- Paths are word prefixes
- Path that ends at red letter is a word
- Check out this neat demo from Chris!
Slide 10
Lexicon as trie
struct TrieNode {
bool isWord;
TrieNode* children[26];
};
class Lexicon {
private:
TrieNode* _root;
How to implement core operations?
contains
?- trace path, check
isWord
flag, O(L) (L is length of word)
- trace path, check
containsPrefix
?- trace path, O(L)
- Memory used
- Each TrieNode 1 byte + 26 pointers (8) = 209 bytes
- Trie for OSPD2 has ~250,000 nodes
- 209 * 54 ⇒ 52.5MB total (ouch!)
Slide 11
Save space: dynamic array of pointers
struct TrieNode {
char letter;
bool isWord;
TrieNode** children;
int nChildren;
};
Replace fixed-size array of children pointers with right-sized array
- Most of 26 entries are
nullptr
anyway
Memory used
- Each TrieNode 1 + 1 + 4 + 8 + 8*numChildren (5 children on average) = 54 bytes total
- 54 * 250,000 nodes ⇒ 13.5 MB total
Slide 12
Save space: flatten into array
struct TrieNode {
char letter;
bool isWord;
int childIndex;
int nChildren;
};
Flatten tree into array
- Similar to how we did for binary heap
- some complication because not always 2 children
- No longer store any pointers!
- Data structure less flexible (more expensive to update array)
Memory used
- Each TrieNode reduced to 10 bytes
- 10 * 250,000 nodes ⇒ 2.5 MB total
- Now less memory than bst/hash!
Slide 13
Where to go next?
- adapt
- adapted
- adapter
- adapters
- adapting
- adaption
- adaptions
- adapts
- adopt
- adopted
- adopter
- adopters
- adopting
- adoption
- adoptions
- adopts
- affect
- affected
- affecter
- affecters
- affecting
- affection
- affections
- affects
- assert
- asserted
- asserter
- asserters
- asserting
- assertion
- assertions
- asserts
- compact
- compacted
- compacter
- compacters
- compacting
- compaction
- compactions
- compacts
- corrupt
- corrupted
- corrupter
- corrupters
- corrupting
- corruption
- corruptions
- corrupts
- depict
- depicted
- depicter
- depicters
- depicting
- depiction
- depictions
- depicts
- desert
- deserted
- deserter
- deserters
- deserting
- desertion
- desertions
- deserts
- detect
- detected
- detecter
- detecters
- detecting
- detection
- detections
- detects
- fruit
- fruited
- fruiter
- fruiters
- fruiting
- fruition
- fruitions
- fruits
- impact
- impacted
- impacter
- impacters
- impacting
- impaction
- impactions
- impacts
- indent
- indented
- indenter
- indenters
- indenting
- indention
- indentions
- indents
- insert
- inserted
- inserter
- inserters
- inserting
- insertion
- insertions
- inserts
- intercept
- intercepted
- intercepter
- intercepters
- intercepting
- interception
- interceptions
- intercepts
- interrupt
- interrupted
- interrupter
- interrupters
- interrupting
- interruption
- interruptions
- interrupts
- invent
- invented
- inventer
- inventers
- inventing
- invention
- inventions
- invents
- perfect
- perfected
- perfecter
- perfecters
- perfecting
- perfection
- perfections
- perfects
- port
- ported
- porter
- porters
- porting
- portion
- portions
- ports
Slide 14
Unify suffixes as well as prefixes!
Dawg
- Directed: Edges are one-way
- Acyclic: No cycles,
banana
notbananana...na
- Word: Each path traces prefix, node has
isWord
flag (red/bold in diagram above) - Graph: Can be multiple paths between nodes
Slide 15
Lexicon as dawg
struct DawgNode {
char letter;
bool isWord;
int childIndex;
int nChildren;
};
class Lexicon {
private:
DawgNode* _root;
Compact trie into dawg
- English words have even more more suffix redundancy than prefix
- 250,000 trie nodes reduce to 80,000 dawg nodes
- 125,000 words and only 80,000 nodes means many words have no unique dawg node
Memory used
- 10 bytes * 80,000 ⇒ 800KB
- less than the size of the dictionary file itself!
Slide 16
Bitpack to squeeze further
struct DawgNode {
5 bits letter;
1 bit isWord;
1 bit for lastChild
25 bits childIndex;
};
Apportion each bit very precisely
- Only 26 letters, don't need full ASCII, squeeze in 5 bits (Huffman-esque)
- 1 bit boolean for each of
isWord
andlastChild
- Remaining 25 bits used for index
- 10-byte DawgNode reduced to 4 bytes
Memory used
- 4 * 80,000 ⇒ .32 MB total
- less than a third of the size of original dictionary file – wow!
Slide 17
Neat facts about the dawg
Super-efficient to read/write/reconstitute
- Write out entire array as-is, no processing
- Indexes don't change, no pointers!
- What you write is what you read, no allocate individual nodes, no add words one-by-one
- That's what the binary
EnglishWords.dat
file is
Data structure facilitates other interesting operations
- Tweak width of "beam" as you traverse to visit near neighbors
- Regex, edit distance, spell correction, autocomplete/predictive text
- Solving puzzles
- Anagrams, hangman, scrabble
Original reference
- "The World's Fastest Scrabble Program", Appel & Jacobsen, CACM May 1988