Idea To Text

Oct 17, 2019


Being able to simulate believable, labelled data turns out to be one of the pro-tips of modern machine learning. But how can you generate believable data if what you are synthesizing is something as complex as text (or code)? Enter ideaToText.

ideaToText is a Python module designed to make it easy to generate realistic text (including code) by first thinking about the ideas a person has, and then simulating what their text output could look like given those ideas.

It was originally developed for Rubric Sampling, where the generated text was meant to resemble programs written by high school students for challenges on code.org. When simulating students, ideaToText first imagines the set of misconceptions a student has, and then generates the sort of program we imagine they might write. Mathematically, you can think of ideaToText as a Python mechanism for a "grammar". It is not a context-free grammar (CFG); a CFG would never be able to simulate believable sentences. Instead it is very much a context-rich grammar, with all the complexity that is expressible in a Python program! Using these grammars, for the first time we were able to achieve human-level accuracy on the code-feedback challenge.

Grammars written in ideaToText have several nice properties:

  1. Decisions can be nicely decomposed, and any "state" that needs to be remembered from one decision to another can easily flow down from earlier decisions to later ones.
  2. Inspired by React, we very clearly separate "state" from the "text" being produced. This allows for clean grammars, despite their potential complexity.
  3. By using a structured language, your grammars can easily be hooked up to tools such as Adaptive Sampling and Neural Approximate Parsing (though, to be fair, you won't need those tools for your homework).

Sampling


Here is sample code that makes a sampler and then draws a single sample from it. Each sample has text and the corresponding ideas that led to the text.

# idea to text is a module you can import
import ideaToText

if __name__ == '__main__':
  # you can make a sampler based on an ideaToText grammar
  sampler = ideaToText.Sampler('grammars/demo')
  
  # once you have a sampler, you can draw samples
  sample = sampler.singleSample()
  # each sample is "labelled" with the choices that led to the text
  text = sample['text']
  choices = sample['choices']
  rubric = sample['rubric']
  print(text, '\t', choices)
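
Since singleSample returns one sample per call, collecting a batch is a matter of calling it in a loop. Here is a minimal sketch; drawSamples is a hypothetical helper (not part of ideaToText) and only assumes an object that exposes singleSample():

```python
# Hypothetical helper, not part of ideaToText: it only assumes the
# sampler has the singleSample() method used in the snippet above.
def drawSamples(sampler, n=10):
    """Call sampler.singleSample() n times and return the samples as a list."""
    return [sampler.singleSample() for _ in range(n)]

# Possible usage with the sampler built earlier:
#   for sample in drawSamples(sampler, 10):
#       print(sample['text'], '\t', sample['choices'])
```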

The particular grammar I wrote happens to give random sentences with either a friendly or grumpy sentiment. The grammar is truly simple so that I can highlight the mechanics of ideaToText. Here is output from drawing 10 samples using code like the one above:

welcome.        {'mood': 'friendly', 'punct': '.', 'friendlyPhrase': 'welcome'}
what a day      {'mood': 'friendly', 'punct': '', 'friendlyPhrase': 'what a day'}
leave me alone! {'mood': 'grumpy', 'punct': '!', 'grumpyPhrase': 'leave me alone'}
good morning    {'mood': 'friendly', 'punct': '', 'friendlyPhrase': 'good morning'}
grrrr.          {'mood': 'grumpy', 'punct': '.', 'grumpyPhrase': 'grrrr'}
I appreciate you. {'mood': 'friendly', 'punct': '.', 'friendlyPhrase': 'I appreciate you'}
im tired        {'mood': 'grumpy', 'punct': '', 'grumpyPhrase': 'im tired'}
good morning    {'mood': 'friendly', 'punct': '', 'friendlyPhrase': 'good morning'}
leave me alone  {'mood': 'grumpy', 'punct': '', 'grumpyPhrase': 'leave me alone'}
get off my lawn {'mood': 'grumpy', 'punct': '', 'grumpyPhrase': 'get off my lawn'}
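
Because every sample carries its choices, turning samples into labelled training pairs takes very little code. The sketch below hard-codes three of the samples shown above so it runs without the library; toLabelledPairs is a hypothetical helper, and using 'mood' as the label is just one possible choice:

```python
# Three samples copied from the output above, in the shape returned by
# singleSample() ('text' plus the 'choices' that produced it).
samples = [
    {'text': 'welcome.',
     'choices': {'mood': 'friendly', 'punct': '.', 'friendlyPhrase': 'welcome'}},
    {'text': 'leave me alone!',
     'choices': {'mood': 'grumpy', 'punct': '!', 'grumpyPhrase': 'leave me alone'}},
    {'text': 'im tired',
     'choices': {'mood': 'grumpy', 'punct': '', 'grumpyPhrase': 'im tired'}},
]

# Hypothetical helper: pair each text with the value of one chosen idea.
def toLabelledPairs(samples, labelKey='mood'):
    return [(s['text'], s['choices'][labelKey]) for s in samples]

pairs = toLabelledPairs(samples)
print(pairs)
# [('welcome.', 'friendly'), ('leave me alone!', 'grumpy'), ('im tired', 'grumpy')]
```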
      

Decisions


The grammar gives you a way to articulate all the decision points in the process of generating text. You author a decision point by subclassing a special class called "Decision". In your subclass you will override three methods:

  1. registerChoices: In this method you define all the randomness involved in the decision. We call the random variables defined in this method "choices".
  2. render: You define how to render text based on the choices made.
  3. updateRubric: Turn on rubric items based on the choices made. We will save our conversation about rubric items for later.

Let's take a look at a Decision class.

All grammars start by expanding the decision point called "Start". Here is the Start decision point for the Grumpy/Friendly grammar:

from ideaToText import Decision

# Class: Start
# ------------
# "Start" is a special decision which is invoked by the Sampler
# to generate a single sample. All decisions should subclass ideaToText.Decision
class Start(Decision):

    # Method to override: Register Choices
    # ----------------
    # Predeclare any choices that are made here. This method
    # will be called by the "Sampler"
    def registerChoices(self):
        # This sentence creator starts with two choices.
        # One choice is important (the mood). The other
        # choice is unimportant (whether the phrase has punctuation)
        self.addChoice('mood', {
            'friendly':10, # one possible decision and its weight
            'grumpy':5     # another decision and weight
        })

        # Choices have two parts, an identifier and a dictionary
        # which maps possible outcomes to their relative likelihood
        self.addChoice('punct', {
            '':10,
            '.':5,
            '!':1
        })

    # Method to override: Update Rubric
    # ----------------
    # Based on the choices you have made, you can turn on binary
    # rubric items. This is a way to record if you made a pedagogically
    # important decision
    def updateRubric(self):
        # update rubric is a space for you to separate the
        # important choices from the unimportant ones
        if self.getChoice('mood') == 'grumpy':
            self.turnOnRubric('isGrumpy')

    # Method to override: Render
    # ----------------
    # When this method is called, you should assume that a decision has been made
    # for each of the choices declared in registerChoices. Render
    # should return a string that could result from the given choices.
    # Render should *not* use any randomness, but it can recursively ask
    # the engine to expand other decisions
    def render(self):
        # once choices have been made, you now have to turn
        # those ideas into text
        punctuation = self.getChoice('punct')
        mood = self.getChoice('mood')
        if mood == 'friendly':
            phrase = self.expand('Friendly')
        elif mood == 'grumpy':
            phrase = self.expand('Grumpy')
        return phrase + punctuation

Important: You never have to pre-declare Start. Simply put all decision points in one directory and the Sampler will search through that directory to find them.

We would similarly need to define Decision classes named "Friendly" and "Grumpy".
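
As a rough idea of what those two classes might contain, here is a minimal sketch. To keep it runnable without the library it uses ToyDecision, a hypothetical stand-in base class that is not part of ideaToText and only mimics addChoice, getChoice, and an assumed "register, sample, render" lifecycle. In a real grammar the classes would subclass Decision, as Start does. The phrases are taken from the sample output shown earlier; the weights are invented.

```python
import random


class ToyDecision:
    """Hypothetical stand-in for ideaToText.Decision (an assumption, not the real class)."""

    def __init__(self, rng=None):
        self.rng = rng or random.Random()
        self.outcomes = {}  # choice name -> {outcome: weight}
        self.choices = {}   # choice name -> outcome that was sampled

    def addChoice(self, choiceName, mapOfOutcomes):
        self.outcomes[choiceName] = mapOfOutcomes

    def getChoice(self, choiceName):
        return self.choices[choiceName]

    def sample(self):
        # Assumed lifecycle: declare the choices, sample each one, then render.
        self.registerChoices()
        for name, weights in self.outcomes.items():
            options = list(weights)
            self.choices[name] = self.rng.choices(
                options, weights=list(weights.values()))[0]
        return self.render()


class Friendly(ToyDecision):
    def registerChoices(self):
        # Phrases come from the sample output; the weights are invented.
        self.addChoice('friendlyPhrase', {
            'welcome': 5,
            'what a day': 5,
            'good morning': 10,
            'I appreciate you': 5,
        })

    def render(self):
        # No randomness here: just turn the choice into text.
        return self.getChoice('friendlyPhrase')


class Grumpy(ToyDecision):
    def registerChoices(self):
        self.addChoice('grumpyPhrase', {
            'leave me alone': 10,
            'grrrr': 5,
            'im tired': 5,
            'get off my lawn': 5,
        })

    def render(self):
        return self.getChoice('grumpyPhrase')


print(Friendly(random.Random(0)).sample())
print(Grumpy(random.Random(0)).sample())
```

Note that both render methods stay deterministic: all randomness lives in registerChoices, which matches the separation described for Start.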

Decision Methods:


Any of these can be called from a grammar class (that is, one that subclasses Decision).

Expand

You can ask the engine to expand a decision and return the text that is rendered:
self.expand(nonterminalName, optionalParams = {})

Add Choice

Specify a random variable. Can only be called in registerChoices. Each choice has a name and a dictionary which maps potential outcomes to their weights. The Sampler will sample from the multinomial that your map specifies.
self.addChoice(choiceName, mapOfOutcomes)
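
The weights are relative, so they do not need to sum to one. The sketch below shows how such a map could be normalised into probabilities and sampled with the standard library. It illustrates the idea only and is not the Sampler's actual implementation; outcomeProbabilities and sampleOutcome are hypothetical helpers.

```python
import random

# Hypothetical helper: turn relative weights into probabilities.
def outcomeProbabilities(mapOfOutcomes):
    total = sum(mapOfOutcomes.values())
    return {outcome: weight / total for outcome, weight in mapOfOutcomes.items()}

# Hypothetical helper: draw one outcome in proportion to its weight.
def sampleOutcome(mapOfOutcomes, rng=random):
    outcomes = list(mapOfOutcomes)
    weights = list(mapOfOutcomes.values())
    return rng.choices(outcomes, weights=weights, k=1)[0]

# The 'punct' choice from the Start decision
punct = {'': 10, '.': 5, '!': 1}
probs = outcomeProbabilities(punct)
print(probs)  # {'': 0.625, '.': 0.3125, '!': 0.0625}
print(repr(sampleOutcome(punct)))
```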

Get Choice

What good is a choice if you can't access it? Returns the outcome that the sampler decided on.
self.getChoice(choiceName)

Has Choice

You can check if a choice has been made.
self.hasChoice(choiceName)

Set State

self.setState(stateKey, value)

Get State

self.getState(stateKey)

Has State

self.hasState(stateKey)

Turn on Rubric

self.turnOnRubric(rubricKey)

Get Name

Returns the name of the class as a string.
self.getName()

Get Instance Name

Returns a name which is unique for each invocation of the decision.
self.getInstanceName()

Get Last Instance Name

Useful for recency bias. Returns the name given to the last invocation.
self.getLastInstanceName()
