Lexicon ADT: A Case Study

CS 106B: Programming Abstractions

Autumn 2020, Stanford University Computer Science Department

Lecturers: Chris Gregg and Julie Zelenski

Photo of page in English dictionary


Slide 2

Announcements

  • Last step: personal project presentation
    • Report/code/tests submitted to paperless yesterday, hooray!
    • Sign up for time with your section leader to present your project
    • We are excited to see your creative accomplishments
      • Congratulations on completing your CS106B journey!

Today's Topics

  • Lexicon Case Study

Slide 3

What is a lexicon?

  • Lexicon is a special case of Set<string>
    • List of words (and nothing more, no definition/synonyms/etymology)
    • Words are (short) strings
    • Thousands of entries
    • Lexicon test case is OSPD2 125,000 words, average 8 chars, file on disk 1.1 MB
  • Define as C++ class, Lexicon ADT
    • Ease of use for client, keep details private, can swap out implementation
    • Core operations: contains and containsPrefix
  • Key requirements for implementation
    • Very efficient search
    • Very compact space usage
    • Stretch goal: fast/easy to populate

Slide 4

Lexicon as sorted vector

class Lexicon {
private:
    Vector<string> _words;  
                 // ordered lexicographically

Diagram of sorted vector of words

How to implement core operations?

  • contains?
    • binary search, O(logN)
  • containsPrefix?
    • binary search (substr comparison), O(logN)
  • Memory used?
    • Each string requires sizeof(char)*length, average word length 8
    • ≈ 8 bytes per word * 125,000 words ⇒ 1MB total


Slide 5

Lexicon as binary search tree

class Lexicon {
private:
    TreeNode* _root;

Diagram of binary search tree

How to implement core operations?

  • contains?
    • tree search, O(logN)
  • containsPrefix?
    • tree search (substr comparison), O(logN)
  • Memory used
    • Each word is tree node with string (8 bytes) and left/right pointers (each 8 bytes)
    • ≈ 24 bytes per word * 125,000 words ⇒ 3MB total


Slide 6

Lexicon as hash table

class Lexicon {
private:
    ListNode* _buckets[NBuckets];

Diagram of hash table

How to implement core operations?

  • contains?
    • hash to bucket, search bucket, O(1)
  • containsPrefix?
    • search all buckets, O(N) (ouch!)
  • Memory used
    • Each word is one list node = string (8) and next pointer (8), don't forget also N pointers in bucket array
    • ≈ 24 bytes per word * 125,000 words ⇒ 3MB total


Slide 7

Options so far

  contains    containsPrefix    Memory
Sorted vector O(logN) O(logN) 1 MB
BST O(logN) O(logN) 3 MB
Hashtable O(1) O(N) 3 MB

Slide 8

Notice any patterns…?

  • straddle
  • straddled
  • straddler
  • straddlers
  • straddles
  • straddling
  • strafe
  • strafed
  • strafer
  • strafers
  • strafes
  • strafing
  • straggle
  • straggled
  • straggler
  • stragglers
  • straggles
  • stragglier
  • straggliest
  • straggling
  • straggly
  • straight
  • straightaway
  • straightaways
  • straighted
  • straightedge
  • straightedges
  • straighten
  • straightened
  • straightener
  • straighteners
  • straightening
  • straightens
  • straighter
  • straightest
  • straightforward
  • straightforwardly
  • straightforwardness
  • straighting
  • straightjacket
  • straightjackets
  • straightlaced
  • straightly
  • straights
  • straightway
  • strain
  • strained
  • strainer
  • strainers
  • straining
  • strains
  • strait
  • straiten
  • straitened
  • straitening
  • straitens
  • straiter
  • straitest
  • straitjacket
  • straitjackets
  • straitlaced
  • straitly
  • straitness
  • straits
  • strake
  • straked
  • strakes
  • stramash
  • stramashes
  • stramonies
  • stramony
  • strand
  • stranded
  • strander
  • stranders
  • stranding
  • strands
  • strang
  • strange
  • strangely
  • strangeness
  • stranger
  • strangered
  • strangering
  • strangers
  • strangest
  • strangle
  • strangled
  • stranglehold
  • strangleholds
  • strangler
  • stranglers
  • strangles
  • strangling
  • strangulate
  • strangulated
  • strangulates
  • strangulating
  • strangulation
  • strangulations
  • strap
  • strapless
  • strapped
  • strapper
  • strappers
  • strapping
  • strappings
  • straps
  • strass
  • strasses
  • strata
  • stratagem
  • stratagems
  • stratal
  • stratas
  • strategic
  • strategical
  • strategically
  • strategies
  • strategist
  • strategists
  • strategy
  • strath
  • straths
  • stratification
  • stratifications
  • stratified
  • stratifies
  • stratify
  • stratifying
  • stratosphere
  • stratospheres
  • stratospheric
  • stratous
  • stratum
  • stratums
  • stratus
  • stravage
  • stravaged
  • stravages
  • stravaging
  • stravaig
  • stravaiged
  • stravaiging
  • stravaigs
  • straw
  • strawberries
  • strawberry
  • strawed
  • strawhat

Slide 9

Letter trie

Diagram of letter trie

Trie from retrieval

  • (pronounced "try")
  • 26-way branching tree
  • Paths are word prefixes
  • Path that ends at red letter is a word
  • Check out this neat demo from Chris!

Slide 10

Lexicon as trie

struct TrieNode {
    bool isWord;
    TrieNode* children[26];
};

class Lexicon {
private:
    TrieNode* _root;

Diagram of trie

How to implement core operations?

  • contains?
    • trace path, check isWord flag, O(L) (L is length of word)
  • containsPrefix?
    • trace path, O(L)
  • Memory used
    • Each TrieNode 1 byte + 26 pointers (8) = 209 bytes
    • Trie for OSPD2 has ~250,000 nodes
    • 209 * 54 ⇒ 52.5MB total (ouch!)


Slide 11

Save space: dynamic array of pointers

struct TrieNode {
    char letter;
    bool isWord;
    TrieNode** children;
    int nChildren;
};

Diagram of trie

Replace fixed-size array of children pointers with right-sized array

  • Most of 26 entries are nullptr anyway

Memory used

  • Each TrieNode 1 + 1 + 4 + 8 + 8*numChildren (5 children on average) = 54 bytes total
  • 54 * 250,000 nodes ⇒ 13.5 MB total

Slide 12

Save space: flatten into array

struct TrieNode {
    char letter;
    bool isWord;
    int childIndex;
    int nChildren;
};

Diagram of trie

Flatten tree into array

  • Similar to how we did for binary heap
    • some complication because not always 2 children
  • No longer store any pointers!
  • Data structure less flexible (more expensive to update array)

Memory used

  • Each TrieNode reduced to 10 bytes
  • 10 * 250,000 nodes ⇒ 2.5 MB total
    • Now less memory than bst/hash!

Slide 13

Where to go next?

  • adapt
  • adapted
  • adapter
  • adapters
  • adapting
  • adaption
  • adaptions
  • adapts
  • adopt
  • adopted
  • adopter
  • adopters
  • adopting
  • adoption
  • adoptions
  • adopts
  • affect
  • affected
  • affecter
  • affecters
  • affecting
  • affection
  • affections
  • affects
  • assert
  • asserted
  • asserter
  • asserters
  • asserting
  • assertion
  • assertions
  • asserts
  • compact
  • compacted
  • compacter
  • compacters
  • compacting
  • compaction
  • compactions
  • compacts
  • corrupt
  • corrupted
  • corrupter
  • corrupters
  • corrupting
  • corruption
  • corruptions
  • corrupts
  • depict
  • depicted
  • depicter
  • depicters
  • depicting
  • depiction
  • depictions
  • depicts
  • desert
  • deserted
  • deserter
  • deserters
  • deserting
  • desertion
  • desertions
  • deserts
  • detect
  • detected
  • detecter
  • detecters
  • detecting
  • detection
  • detections
  • detects
  • fruit
  • fruited
  • fruiter
  • fruiters
  • fruiting
  • fruition
  • fruitions
  • fruits
  • impact
  • impacted
  • impacter
  • impacters
  • impacting
  • impaction
  • impactions
  • impacts
  • indent
  • indented
  • indenter
  • indenters
  • indenting
  • indention
  • indentions
  • indents
  • insert
  • inserted
  • inserter
  • inserters
  • inserting
  • insertion
  • insertions
  • inserts
  • intercept
  • intercepted
  • intercepter
  • intercepters
  • intercepting
  • interception
  • interceptions
  • intercepts
  • interrupt
  • interrupted
  • interrupter
  • interrupters
  • interrupting
  • interruption
  • interruptions
  • interrupts
  • invent
  • invented
  • inventer
  • inventers
  • inventing
  • invention
  • inventions
  • invents
  • perfect
  • perfected
  • perfecter
  • perfecters
  • perfecting
  • perfection
  • perfections
  • perfects
  • port
  • ported
  • porter
  • porters
  • porting
  • portion
  • portions
  • ports

Slide 14

Unify suffixes as well as prefixes!

Diagram of dawg

Dawg

  • Directed: Edges are one-way
  • Acyclic: No cycles, banana not bananana...na
  • Word: Each path traces prefix, node has isWord flag (red/bold in diagram above)
  • Graph: Can be multiple paths between nodes

Slide 15

Lexicon as dawg

struct DawgNode {
    char letter;
    bool isWord;
    int childIndex;
    int nChildren;
};
class Lexicon {
private:
    DawgNode* _root;

Diagram of trie

Compact trie into dawg

  • English words have even more more suffix redundancy than prefix
  • 250,000 trie nodes reduce to 80,000 dawg nodes
  • 125,000 words and only 80,000 nodes means many words have no unique dawg node

Memory used

  • 10 bytes * 80,000 ⇒ 800KB
    • less than the size of the dictionary file itself!

Slide 16

Bitpack to squeeze further

struct DawgNode {
  5 bits letter;
  1 bit isWord;
  1 bit for lastChild
  25 bits childIndex;
};

Diagram of bit-packed dawg

Apportion each bit very precisely

  • Only 26 letters, don't need full ASCII, squeeze in 5 bits (Huffman-esque)
  • 1 bit boolean for each of isWord and lastChild
  • Remaining 25 bits used for index
  • 10-byte DawgNode reduced to 4 bytes

Memory used

  • 4 * 80,000 ⇒ .32 MB total
    • less than a third of the size of original dictionary file – wow!

Slide 17

Neat facts about the dawg

Super-efficient to read/write/reconstitute

  • Write out entire array as-is, no processing
  • Indexes don't change, no pointers!
  • What you write is what you read, no allocate individual nodes, no add words one-by-one
  • That's what the binary EnglishWords.dat file is

Data structure facilitates other interesting operations

  • Tweak width of "beam" as you traverse to visit near neighbors
  • Regex, edit distance, spell correction, autocomplete/predictive text
  • Solving puzzles
  • Anagrams, hangman, scrabble

Original reference