Assignment 3B: Grammar Solver

Assignment by Marty Stepp and Victoria Kirst. Based on a problem by Stuart Reges of U of Washington.

Overview

For this part of the assignment you will write a function to generate random sentences from what linguists, programmers, and logicians refer to as a “grammar.” Let’s begin with some definitions. A formal language is a set of words and/or symbols along with a set of rules, collectively called the syntax of the language, that define how those symbols may be used together—programming languages like Java and C++ are formal languages, as are certain mathematical or logical notations. A grammar, in this context, is a way of describing the syntax and symbols of a formal language. All programming languages then have grammars; here is a link to a formal grammar for the Java language. Your task for this problem will be to write a function that accepts a reference to an input file representing a language’s grammar (in a format that we will explain below), along with a symbol to randomly generate, and the number of times to generate it. Your function should generate the given number of random expansions of the given symbol and return them as a Vector of strings.

Vector<string> grammarGenerate(istream& input, string symbol,
                                  int times)

Sample Output

Your program should exactly reproduce the format and general behavior demonstrated in these logs, although it may not exactly recreate these particular scenarios because of the randomness inherent in your code. But your function should return valid random results as per the grammar that was given to it. (Don't forget to use the course File → Compare Output... feature in the console, or the course web site's Output Comparison Tool - in the Tools dropdown - to check output for various test cases.)

sentence.txt

sentence2.txt

expression.txt

We have also written a Grammar Verifier web tool where you can paste in the randomly generated sentences and phrases from your program, and our page will do its best to validate that they are legal phrases that could have come from the original grammar file. This isn't a perfect test, but it is useful for finding some common types of mistakes.

Reading BNF Grammars

Many language grammars are described in a format called Backus-Naur Form (BNF), which is what we'll use in this assignment. In BNF, some symbols in a grammar are called terminals because they represent fundamental words of the language (i.e., symbols that cannot be transformed or reduced beyond their current state). A terminal in English might be "dog" or "run" or "Chris". Other symbols of the grammar are called non-terminals and represent higher-level parts of the language syntax, such as a noun phrase or a sentence in English. Every non-terminal consists of one or more terminals; for example, the verb phrase "throw a ball" consists of three terminal words.

The BNF description of a language consists of a set of derivation rules, where each rule names a symbol and the legal transformations that can be performed between that symbol and other constructs in the language. You might think of these rules as providing translations between terminals and other elements, either terminals or non-terminals. For example, a BNF grammar for the English language might state that a sentence consists of a noun phrase and a verb phrase, and that a noun phrase can consist of an adjective followed by a noun or just a noun. Note that rules can be described recursively (i.e., in terms of themselves). For example, a noun phrase might consist of an adjective followed by another noun phrase, such as the phrase "big green tree" which consists of the adjective "big" followed by the noun phrase "green tree".

A BNF grammar is specified as an input file containing one or more rules, each on its own line, of the form:

non-terminal::=rule|rule|rule|...|rule

A separator of ::= (colon colon equals) divides the non-terminal from its transformation/expansion rules. There will be exactly one such ::= separator per line. A | (pipe) separates each rule; if there is only one rule for a given non-terminal, there will be no pipe characters. Each rule will consist of one or more whitespace-separated tokens. To illustrate how to read a BNF grammar, let’s consider a very simple example. Suppose that we had the BNF input file below, which describes a language comprising a tiny subset of English. The non- terminal symbols <s>, <pn>, and <iv> are short for grammatical elements: sentences, proper nouns, and intransitive verbs (i.e., verbs that do not take an object). Note that in the following two examples, non-terminal elements are contained in angle brackets, while terminal elements (here, valid English words, such as “Lupe”, “Camillo”, “laughed”, etc.) are not.

<s>::=<pn> <iv>
<pn>::=Lupe|Camillo
<iv>::=laughed|wept

To transform a symbol into a valid output string (that is, into a sentence described by this grammar’s rules), we recursively expand the non-terminal symbols contained in its transformation rules until we have an output made up entirely of terminals—at that point our transformation is complete, and we’re done. So in our trivial example of a grammar, if we are given the symbol <s>, which represents the non-terminal for “sentence”, we examine the rules that describe how to expand the sentence non- terminal, and discover that it is comprised of a proper noun followed by an intransitive verb (<s>::=<pn> <iv>). We can begin by transforming the <pn> symbol, which we do by choosing at random from among the available transformations provided on the right-hand side of the <pn> rule (here, “Lupe” and “Camillo”). After this transformation, our sentence-in-progress might look like this:

Camillo <iv>

At each step of our transformation, we perform a single transformation on one of the remaining non- terminal symbols. In this simple example, there is only one non-terminal left, <iv>. We accordingly transform that symbol by choosing randomly among the transformations provided for it, which could produce the following output:

Camillo laughed

Since there are no more non-terminal symbols remaining, our transformation is complete. Note that other valid sentences in this language (i.e., other valid transformations of the symbol <s>) include the sentences “Camillo wept”, “Lupe laughed”, and “Lupe wept”. Because we can choose at random from the transformations provided for each non-terminal symbol, all of these variations are possible and valid for this very simple language.

Now let’s look at an example of a more complex BNF input file, sentence.txt, which describes a slightly larger subset of the English language. Here, the non-terminal symbols <np>, <dp>, and <tv> are short for the linguistic elements noun phrase, determinate article, and transitive verb.

<s>::=<np> <vp>
<np>::=<dp> <adjp> <n>|<pn>
<dp>::=the|a
<adjp>::=<adj>|<adj> <adjp>
<adj>::=big|fat|green|wonderful|faulty|subliminal|pretentious
<n>::=dog|cat|man|university|father|mother|child|television
<pn>::=John|Jane|Sally|Spot|Fred|Elmo
<vp>::=<tv> <np>|<iv>
<tv>::=hit|honored|kissed|helped
<iv>::=died|collapsed|laughed|wept

Following the rules provided by this input file for expanding non-terminals, this grammar's language can represent sentences such as "The fat university laughed" and "Elmo kissed a green pretentious television". This grammar cannot describe the sentence "Stuart kissed the teacher", because the words "Stuart" and "teacher" are not part of the grammar. It also cannot describe "fat John collapsed Spot" because there are no rules that permit an adjective before the proper noun "John", nor an object after intransitive verb "collapsed".

Though the non-terminals in the previous two example languages are surrounded by < >, this is not required. By definition any token that ever appears on the left side of the ::= of any line is considered a non-terminal, and any token that appears only on the right-hand side of ::= in any line(s) is considered a terminal (so for the above example, as we have noted, <np> is a non- terminal, but “John” is a terminal). Do not assume that non-terminals will be surrounded by < > in your code. Each line's non-terminal will be a non-empty string that does not contain any whitespace.

You may assume that individual tokens in a rule are separated by a single space, and that there will be no outer whitespace surrounding a given rule or token.

Your grammarGenerate function will perform two major tasks:

read an input file with a grammar in Backus-Naur Form and turns its contents into a data structure
randomly generate elements of the grammar (recursively)

You may want to separate these steps into one or more helper function(s), each of which performs one step. It is important to separate the recursive part of the algorithm from the non-recursive part. You are given a client program that handles the user interaction. The main function supplies you with an input file stream to read the BNF file. Your code must read in the file's contents and break each line into its symbols and rules so that it can generate random elements of the grammar as output. When you generate random elements, you store them into a Vector to be returned. The provided main program loops over the vector and prints the elements stored inside it.

Part 1: Reading the Input File

For this program you must store the contents of the grammar input file into a Map. As you know, maps keep track of key/value pairs, where each key is associated with a particular value. In our case, we want to store information about each non-terminal symbol, such that the non-terminal symbols become keys and their rules become values. Other than the Map requirement, you are allowed to use whatever constructs you need from the Stanford C++ libraries. You don't need to use recursion on this part of the algorithm; just loop over the file as needed to process its contents.

One problem you will have to deal with early in this program is breaking strings into various parts. To make it easier for you, the Stanford library's "strlib.h" library provides a stringSplit function that you can use on this assignment:

Vector<string> stringSplit(string s, string delimiter)

This function breaks a large string into a Vector of smaller string tokens; it accepts a delimiter string parameter and looks for that delimiter as the divider between tokens. Here is an example call to this function:

string s = "example;one;two;;three";
Vector<string> v = stringSplit(s, ";");
// now v = {"example", "one", "two", "", "three"}

The parts of a rule will be separated by whitespace, but once you've split the rule by spaces, the spaces will be gone. If you want spaces between words when generating strings to return, you must concatenate those yourself. If you find that a string has unwanted spaces around its edges, you can remove them by calling the trim function, also from "strlib.h":

string s2 = " hello there sir ";
s2 = trim(s2); // "hello there sir"

Part 2: Generating Random Expansions from the Grammar

As we mentioned, producing random grammar expansions is a two-step process. The first step, which we’ve just outlined, requires reading the input grammar file and turning it into an appropriate data structure (non-recursively). The second step requires your program to recursively walk that data structure to generate elements by successively expanding them.

The recursive algorithm that your program will use to generate a random occurrence of a symbol S should have the following logic:

If S is a terminal symbol, there is nothing to do; the result is the symbol itself.
If S is a non-terminal symbol, choose a random expansion rule R for S, and for each of the symbols in that rule R, generate a random occurrence of that symbol.

For example, the example grammar given in the Problem Description above could be used to randomly generate an <s> non-terminal for the sentence, "Fred honored the green wonderful child", as shown in the following diagram.

If the string passed to your function is a non-terminal in your grammar, use the grammar's rules to recursively expand that symbol fully into a sequence of terminals. For example, using the grammar on the previous pages, <np> might potentially expand to the string "the green wonderful child".

Generating a non-terminal involves picking one of its rules at random and then generating each part of that rule, which might involve more non-terminals to recursively generate. For each of these you pick rules at random and generate each part, etc.

When you encounter a terminal, simply include it in your string. This becomes a base case. If the string passed to your function is not a non-terminal in your grammar, you should assume that it is a terminal symbol and simply return it. For example, "green" should just cause "green" to be returned (without any spaces around it).

Special cases to handle: You should throw an exception if the grammar contains more than one line for the same non-terminal. For example, if two lines both specified rules for symbol "<s>", this would be illegal and should result in the exception being thrown. You should throw an exception if the symbol parameter passed to your function is empty, "". You can do this by calling the built-in throw function, which takes a string parameter containing a description of the error that occurred. The throw function will terminate the program.

Testing: The provided input files and main may not test all of the above cases; it is your job to come up with tests for them.

Your function may assume that the input file exists, is non-empty, and is in a proper valid format otherwise. If the number of times passed is 0 or less, return an empty vector.

Implementation Details

The hardest part of this problem is the recursive generation, so make sure you have read the input file and built your data structure properly before tackling the recursive part. Loops are not forbidden in this part of the assignment. In fact, you should definitely use loops for some parts of the code where they are appropriate. For example, the directory crawler example from section uses a for-each loop. This is perfectly acceptable; if you find that part of this problem is easily solved with a loop, please use one. In the directory crawler, the hard part was traversing all of the different sub-directories, and that's where we used recursion. For this program, the hard part is following the grammar rules to generate all the parts of the grammar, so that is the place to use recursion. If your recursive function has a bug, try putting a cout statement at the start of your recursive function to print its parameter values; this will let you see the calls being made. As a final tip: look up the randomInteger function from "random.h" to help you make random choices between rules.

Creative Component: mygrammar.txt

Along with your code, submit a file mygrammar.txt that contains a BNF grammar that can be used as input. For full credit, the file should be in valid BNF format, contain at least 5 non-terminals, and should be your own work (do more than just changing the terminal words in sentence.txt, for example). This is worth a small part of your grade.

Possible Extra Features

Here are some ideas for extra features that you could add to your program for a very small amount of extra credit:

Robust grammar solver: Make your grammar solver able to handle files with excessive whitespace placed between tokens, such as:
```
 "  <adjp> ::=   <adj> |    <adj>    <adjp>  "
```
Inverse grammar solver: Write code that can verify whether a given expansion could possibly have come from a given grammar. For example, "The fat university laughed" could come from sentence.txt, but "Marty taught the curious students" could not. To answer such a question, you may want to generate all possible expansions of a given length that could come from a grammar, and see whether the given expansion is one of them. This is a good use of recursive backtracking and exhaustive searching. (Note that it’s tricky.)
Other: If you have your own creative idea for an extra feature, ask your SL and/or the instructor about it.

Indicating that you have done extra features: If you complete any extra features, then in the comment heading on the top of your program, please list all extra features that you worked on and where in the code they can be found (what functions, lines, etc. so that the grader can easily identify this code).

Submitting a program with extra features: Since we use automated testing for part of our grading process, it is important that you submit a program that conforms to the preceding spec, even if you want to do extra features. If your feature(s) cause your program to change the output that it produces in such a way that it no longer matches the expected sample output test cases provided, you should submit two versions of your program file: a first one with the standard file name with no extra features added (or with all necessary features disabled or commented out), and a second one whose file name has the suffix -extra.cpp with the extra features enabled. Please distinguish them by explaining which is which in the comment header. Our submission system saves every submission you make, so if you make more than one we will be able to view them all; your previously submitted files will not be lost or overwritten.

Frequently Asked Questions (FAQ)

Q: I'm getting an error in my GrammarSolver project when running that "Your project's properties (from your .pro file) are too old". What should I do?: A: This is an issue that has been fixed in the current version of the starter code. To fix it in a project you already have, download this new .Pro file and put it in your GrammarSolver project instead, and re-import your GrammarSolver project (delete the debug folder and .pro.user file).
Q: I'd like to write a helper function and declare a prototype for it, but am I allowed to modify the .h file to do so?: A: Function prototypes don't have to go in a .h file. Declare them near the top of your .cpp file and it will work fine.
Q: What does it mean if my program "unexpectedly finishes"?: A: It probably means that you have infinite recursion, which usually comes when you have not set a proper base case, or when your recursive calls don't pass the right parameters between each other. Try running your program in Debug Mode (F5) and viewing the call stack trace when it crashes to see what line it is on when it crashed.
Q: The spec says that I need to throw an exception in certain cases. But it looks like the provided main program never goes into those cases; it checks for that case before ever calling my function. Doesn't that mean the exception is useless? Do I actually have to do the exception part?: A: The exception is not useless, and yes you do need to do it. Understand that your code might be used by other client code other than that which was provided. You have to ensure that your function is robust no matter what parameter values are passed to it. Please follow the spec and throw the exceptions in the cases specified.
Q: I am confused about grammars; what this assignment is about?: A: In terms of this assignment, a grammar is a set of rules that corresponds to a symbol. For instance, the adj grammar relates to many different adjectives like purple, loud, sticky, and fake. To put it simply, one grammar has many rules. This relationship is also known as non-terminal to terminal. For this assignment, we want you to navigate through these grammars and create them. Notice that some grammars have grammars in their rule sets as well. Upon encountering this situation, remember that it is a grammar that needs to be generated as well. Take a look at the decision tree in the assignment specifications, this should help give you a visualization of the problem at hand.
Q: What type of data should my Map store?: A: Unfortunately, this aspect of the assignment is up to you to figure out. Remember the functions you're going to have to call, along with the idea of a grammar and how it relates to its rules.
Q: I'm having trouble breaking apart the strings.: A: Use the provided stringSplit function. Print lots of debug messages to make sure that your split calls are doing what you think they are.
Q: How do I spot nonterminal symbols if they don't start/end with "<" and ">"?: A: The "<" and ">" symbols are not special cases you should be concerned with. Treat them as you would any other symbol, word, or character.
Q: Spaces are in the wrong places in my output. Why?: A: Try using other characters like "_", "~", or "!" instead of spaces to find out where there's extra whitespaces. Also remember that you can use the trim function to remove surrounding whitespace from strings.
Q: How do I debug my recursive function?: A: Try using print statements to find out which grammar symbols are being generated at certain points in your program. Try before and after your recursive step as well as in your base case.
Q: In my recursive case, why doesn't it join together the various pieces correctly?: A: Remember that setting a string = to another variable within a recursive step will not translate in all instances of that function. Be sure to actively add onto your string instead of reassigning it.
Q: My recursive function is really long and icky. How can I make it cleaner?: A: Remind yourself of the strategy in approaching recursive solutions. When are you stopping each recursive call? Make sure you're not making any needless or redundant checks inside of your code.
Q: When I run expression.txt, I get really long math expressions? Is that okay?: A: Yes, that's expected.
Q: What does it mean if my program "unexpectedly finishes"?: A: It might mean that you have infinite recursion, which usually comes when you have not set a proper base case, or when your recursive calls don't pass the right parameters between each other. Try running your program in Debug Mode (F5) and viewing the call stack trace when it crashes to see what line it is on when it crashed. You could also try printing a message at the start of every call to make sure that you don't have infinite recursion. If you are certain that you do not have infinite recursion, it is possible that there are just too many calls on the call stack because of extremely deep recursion. This would not be your fault.
Q: Can I use one of the STL containers from the C++ standard library?: A: No.
Q: I already know a lot of C/C++ from my previous programming experience. Can I use advanced features, such as pointers, on this assignment?: A: No; you should limit yourself to using the material that was taught in class so far.
Q: Can I add any other files to the program? Can I add some classes to the program?: A: No; you should limit yourself to the files and functions in the spec.