Oct 17, 2019
Being able to simulate believable, labelled data turns out to be one of the pro-tips of modern machine learning. But how can you generate believable data if what you are synthesizing is something as complex as text (or code)? Enter ideaToText.
ideaToText is a Python module designed to make it easy to generate realistic text (including code as text) by first thinking about the ideas a person has, and then simulating what their text output could look like given those ideas.
It was originally developed for Rubric Sampling, where the generated text was meant to be samples of high school students writing code for challenges on code.org. In the case of simulating students, ideaToText would first imagine the set of misconceptions a student has, and then generate the sort of program we imagine they might write. Mathematically, you can think of ideaToText as a Python mechanism for a "grammar". It is not a context-free grammar (CFG); a CFG would never be able to simulate believable sentences. Instead it is very much a context-rich grammar, with all the complexity that is expressible via a Python program! Using these grammars, for the first time we were able to achieve human-level accuracy on the code-feedback challenge.
Grammars written in ideaToText have several nice properties:
Here is sample code that makes a sampler and then draws a sample from it. Each sample has text and the corresponding ideas that led to the text.
# idea to text is a module you can import
import ideaToText

if __name__ == '__main__':
    # you can make a sampler based on an ideaToText grammar
    sampler = ideaToText.Sampler('grammars/demo')

    # once you have a sampler, you can draw samples
    sample = sampler.singleSample()

    # each sample is "labelled" with the choices that led to the text
    text = sample['text']
    choices = sample['choices']
    rubric = sample['rubric']

    print(text, '\t', choices)
The particular grammar I wrote happens to give random sentences with either a friendly or grumpy sentiment. The grammar is truly simple so that I can highlight the mechanics of ideaToText. Here is output from drawing 10 samples using code like the one above:
welcome.    {'mood': 'friendly', 'punct': '.', 'friendlyPhrase': 'welcome'}
what a day    {'mood': 'friendly', 'punct': '', 'friendlyPhrase': 'what a day'}
leave me alone!    {'mood': 'grumpy', 'punct': '!', 'grumpyPhrase': 'leave me alone'}
good morning    {'mood': 'friendly', 'punct': '', 'friendlyPhrase': 'good morning'}
grrrr.    {'mood': 'grumpy', 'punct': '.', 'grumpyPhrase': 'grrrr'}
I appreciate you.    {'mood': 'friendly', 'punct': '.', 'friendlyPhrase': 'I appreciate you'}
im tired    {'mood': 'grumpy', 'punct': '', 'grumpyPhrase': 'im tired'}
good morning    {'mood': 'friendly', 'punct': '', 'friendlyPhrase': 'good morning'}
leave me alone    {'mood': 'grumpy', 'punct': '', 'grumpyPhrase': 'leave me alone'}
get off my lawn    {'mood': 'grumpy', 'punct': '', 'grumpyPhrase': 'get off my lawn'}
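Because every sample carries the choices that produced it, turning samples into supervised training pairs is a small step. The sketch below assumes only the sample structure shown above (a dict with 'text' and 'choices' keys). The helper name toLabelledPairs is hypothetical and not part of ideaToText, and the two sample dicts are copied by hand from the output above rather than drawn from a real sampler.

```python
# Hypothetical helper, not part of ideaToText. It relies only on the
# sample structure shown above: a dict with 'text' and 'choices' keys.
def toLabelledPairs(samples, labelKey='mood'):
    """Return (text, label) pairs, using one recorded choice as the label."""
    pairs = []
    for sample in samples:
        label = sample['choices'][labelKey]
        pairs.append((sample['text'], label))
    return pairs


# Two samples copied by hand from the output above.
samples = [
    {'text': 'welcome.',
     'choices': {'mood': 'friendly', 'punct': '.',
                 'friendlyPhrase': 'welcome'}},
    {'text': 'leave me alone!',
     'choices': {'mood': 'grumpy', 'punct': '!',
                 'grumpyPhrase': 'leave me alone'}},
]

pairs = toLabelledPairs(samples)
for text, label in pairs:
    print(label, '\t', text)
```

With a real sampler, the list of samples would presumably come from repeated calls to sampler.singleSample() instead of being written out by hand.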
The grammar gives you a way to articulate all the decision points in the process of generating text. You author a decision point by subclassing a special class called "Decision". In your subclass you will override three methods:
registerChoices: In this method you define all the randomness involved in the decision. We call the random variables defined in this method "choices".
render: You define how to render text based on the choices made.
updateRubric: Turn on rubric items based on the choices made. We will save our conversation about rubric items for later.
Let's take a look at a Decision class.
All grammars start by expanding the decision point called "Start". Here is the Start decision point for the Grumpy/Friendly grammar:
from ideaToText import Decision

# Class: Start
# ------------
# "Start" is a special decision which is invoked by the Sampler
# to generate a single sample. All decisions should subclass
# ideaToText.Decision
class Start(Decision):

    # Method to override: Register Choices
    # ----------------
    # Predeclare any choices that are made here. This method
    # will be called by the "Sampler"
    def registerChoices(self):
        # This sentence creator starts with two choices.
        # One choice is important (the mood). The other
        # choice is unimportant (if the phrase has punctuation)
        self.addChoice('mood', {
            'friendly':10, # one possible decision and its weight
            'grumpy':5     # another decision and weight
        })
        # Choices have two parts, an identifier and a dictionary
        # which maps possible outcomes to their relative likelihood
        self.addChoice('punct', {
            '':10,
            '.':5,
            '!':1
        })

    # Method to override: Update Rubric
    # ----------------
    # Based on the choices you have made, you can turn on binary
    # rubric items. This is a way to record if you made a pedagogically
    # important decision
    def updateRubric(self):
        # update rubric is a space for you to separate the
        # important choices from the unimportant ones
        if self.getChoice('mood') == 'grumpy':
            self.turnOnRubric('isGrumpy')

    # Method to override: Render
    # ----------------
    # When this method is called, you should assume that a decision has been made
    # for each of the choices declared in registerChoices. Render
    # should return a string that could result from the given choices.
    # Render should *not* use any randomness, but it can recursively ask
    # the engine to expand other decisions
    def render(self):
        # once choices have been made, you now have to turn
        # those ideas into text
        punctuation = self.getChoice('punct')
        mood = self.getChoice('mood')
        if mood == 'friendly':
            phrase = self.expand('Friendly')
        if mood == 'grumpy':
            phrase = self.expand('Grumpy')
        return phrase + punctuation
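The weights passed to addChoice are relative likelihoods, so they do not need to sum to one. As a sketch of what that means, weights of 10 and 5 correspond to probabilities of 2/3 and 1/3. The code below is plain standard-library Python, not ideaToText's internal code, and the helper names weightsToProbabilities and drawOutcome are made up for illustration.

```python
import random

# Hypothetical helpers for illustration only; ideaToText's own
# sampling code is not shown in this document and may differ.
def weightsToProbabilities(outcomes):
    """Normalize a map of outcome -> relative weight into probabilities."""
    total = sum(outcomes.values())
    return {outcome: weight / total for outcome, weight in outcomes.items()}


def drawOutcome(outcomes):
    """Draw one outcome with probability proportional to its weight."""
    names = list(outcomes.keys())
    weights = list(outcomes.values())
    return random.choices(names, weights=weights, k=1)[0]


# The same weight maps used in the Start decision above.
moodWeights = {'friendly': 10, 'grumpy': 5}
punctWeights = {'': 10, '.': 5, '!': 1}

moodProbs = weightsToProbabilities(moodWeights)
punctProbs = weightsToProbabilities(punctWeights)

print(moodProbs)    # friendly is twice as likely as grumpy
print(punctProbs)   # '!' has weight 1 out of a total of 16
print(drawOutcome(moodWeights))
```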
Important: You never have to pre-declare Start. Simply put all decision points in one directory and the Sampler will intelligently search through that directory.
We would similarly need to define Decision classes named "Friendly" and "Grumpy".
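As a sketch of what those two classes might look like: the choice names (friendlyPhrase, grumpyPhrase) and the phrases are taken from the sample output earlier, but the equal weights are an assumption, and the original grammar may contain more phrases than the ones that happened to appear in ten samples. So that the snippet runs on its own, it defines a tiny stand-in Decision class that mimics only addChoice and getChoice. That stand-in is not the library's implementation. In a real grammar you would instead write from ideaToText import Decision and let the Sampler make the choices.

```python
import random

# Stand-in for ideaToText.Decision so this sketch runs without the library.
# It mimics only addChoice/getChoice and is NOT the real implementation.
class Decision:
    def __init__(self):
        self._choices = {}

    def addChoice(self, choiceName, mapOfOutcomes):
        # Pick an outcome immediately, proportional to its weight.
        outcomes = list(mapOfOutcomes.keys())
        weights = list(mapOfOutcomes.values())
        picked = random.choices(outcomes, weights=weights, k=1)[0]
        self._choices[choiceName] = picked

    def getChoice(self, choiceName):
        return self._choices[choiceName]


class Friendly(Decision):
    def registerChoices(self):
        # Phrases seen in the sample output; equal weights are assumed.
        self.addChoice('friendlyPhrase', {
            'welcome': 1,
            'what a day': 1,
            'good morning': 1,
            'I appreciate you': 1
        })

    def render(self):
        # No randomness here: just turn the choice into text.
        return self.getChoice('friendlyPhrase')


class Grumpy(Decision):
    def registerChoices(self):
        # Phrases seen in the sample output; equal weights are assumed.
        self.addChoice('grumpyPhrase', {
            'leave me alone': 1,
            'grrrr': 1,
            'im tired': 1,
            'get off my lawn': 1
        })

    def render(self):
        return self.getChoice('grumpyPhrase')


if __name__ == '__main__':
    # Drive the stand-in by hand; the real Sampler does this for you.
    for decisionClass in (Friendly, Grumpy):
        decision = decisionClass()
        decision.registerChoices()
        print(decision.render())
```

Neither class defines updateRubric here, on the assumption that the phrase choice turns on no rubric items; in the Start decision it is the mood choice that sets isGrumpy.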
Any of these can be called from a grammar class (that is, one that subclasses Decision):
self.expand(nonterminalName, optionalParams = {})
self.addChoice(choiceName, mapOfOutcomes)
self.getChoice(choiceName)
self.hasChoice(choiceName)
self.setState(stateKey, value)
self.getState(stateKey)
self.hasState(stateKey)
self.turnOnRubric(rubricKey)
self.getName()
self.getInstanceName()
self.getLastInstanceName()