Problem Set 6

Due Friday, August 9 at 5:30 pm Pacific

Solutions are available!

🔑 Solutions

What can you do with regular expressions? What are the limits of regular languages? And how does the material on discrete structures from the first half of this quarter come into play in this latter half on automata and computation? In this sixth problem set, you'll explore the answers to these questions along with their practical consequences.

As always, please feel free to drop by office hours, ask on EdStem, or send us emails if you have any questions. We'd be happy to help out.

Good luck, and have fun!

Starter Files

Here are the starter files for the problems you'll submit electronically. Note in particular that there are two separate sets of starter files that you will need to download for this assignment:

📦 PS6 Starter Files - Regular Expressions

📦 PS6 Starter Files - CFGs

Unpack the files somewhere convenient and open up the bundled project.

If you would like to type your solutions in $\LaTeX$, you may want to download this template file where you can place your answers:

🖋 PS6 $\LaTeX$ Template

Problem One: Designing Regular Expressions

Below are a list of alphabets and languages over those alphabets. For each language, write a regular expression for that language. Provide your answers by downloading the first set of starter files for Problem Set Six and editing the file res/RegularExpressions.regexes. (Unlike the DFA/NFA editor from last time, you’ll need to manually edit these files by opening them in Qt Creator.) Feel free to test your answers locally, and submit your work on GradeScope.

Let $\Sigma = \Set{\mathtt{a}, \mathtt{b}, \mathtt{c}, \mathtt{d}, \mathtt{e}}$. Write a regular expression for the language $L$ = { $w \in \Sigma^\star \suchthat$ the letters in $w$ are sorted alphabetically }. For example, $\mathtt{abcde} \in L\text,$ $\mathtt{bee} \in L\text,$ $\mathtt{a} \in L\text,$ and $\varepsilon \in L\text,$ but $\mathtt{decade} \notin L\text.$

Write a regular expression for the complement of the language from part (i) of this problem.

Like NFAs and unlike DFAs, there isn't a simple construction that starts with a regex for $L$ and turns it into a regex for $\overline{L}$.

On Unix-style operating systems like macOS or Linux, files are organized into directories. You can reference a file by giving a path to the file, a series of directory names separated by slashes. For example, the path /home/username/ might represent a user’s home directory, and a path like /home/username/Documents/PS6.tex might represent that person’s solution to this problem set. Paths that start with a slash character are called absolute paths and say exactly where the file is on disk. Paths that don’t start with a slash are called relative paths and say where, relative to the current folder, a file can be found. For example, if I’m logged into my computer and am in my home folder, I could look up the file Documents/PS6.tex to find my solution to this problem set.

The general pattern here is that a file path consists of a series of directory or file names separated by slashes. That path might optionally start with a slash, but isn’t required to, and it might optionally end with a slash, but isn’t required to. However, you can’t have two consecutive slashes.

Let $\Sigma = \Set{\mathtt{a}, \mathtt{/}}$. Write a regular expression for $L$ = { $w \in \Sigma^\star \suchthat w$ represents the name of a file path on a Unix-style system }. For example, $\mathtt{/aaa/a/aa} \in L$, $\mathtt{/} \in L$, $\mathtt{a} \in L$, $\mathtt{/a/a/a/} \in L$, and $\mathtt{aaa/} \in L$, but $\mathtt{//a//} \notin L$, $\mathtt{a//a} \notin L$, and $\varepsilon \notin L$.

Fun fact: this problem comes from my experience in industry, where we had a bug in our code that arose when someone wrote the wrong regex for this language. Oops.

Suppose you are taking a walk with your dog on a leash of length two. Let $\Sigma = \Set{\mathtt{y}, \mathtt{d}}$ and let $L$ = { $w \in \Sigma^\star \suchthat w$ represents a walk with your dog on a leash where you and your dog both end up at the same location }. For example, we have $\mathtt{yyddddyy} \in L$ because you and your dog are never more than two steps apart and both of you end up four steps ahead of where you started; similarly, $\mathtt{ddydyy} \in L$. However, $\mathtt{yyyydddd} \notin L$, since halfway through your walk you’re three steps ahead of your dog; $\mathtt{ddyd} \notin L$, because your dog ends up two steps ahead of you; and $\mathtt{ddyddyyy} \notin L$, because at one point your dog is three steps ahead of you.

Write a regular expression for $L$.

Note that, unlike Problem Set Five, you and your dog must end at the same position.

Let $\Sigma = \Set{\mathtt{M}, \mathtt{D}, \mathtt{C}, \mathtt{L}, \mathtt{X}, \mathtt{V}, \mathtt{I}}$ and let $L$ = { $w \in \Sigma^\star \suchthat w$ is number less than $2,000$ represented in Roman numerals }. For example, $\mathtt{CMXCIX} \in L$, since it represents the number $999$, as are the strings L (50), VIII (8), DCLXVI (666), CXXXVII (137), CDXII (412), and MDCXVIII (1,618). However, we have that $\mathtt{VIIII} \notin L$ (you'll never have four I's in a row; use IX or IV instead), that $\mathtt{MM} \notin L$ (it's a Roman numeral, but it's for $2,000$, which is too large), that $\mathtt{VX} \notin L$ (this isn't a valid Roman numeral), and that $\mathtt{IM} \notin L$ (the notation of using a smaller digit to subtract from a larger one only lets you use I to prefix V and X, or X to prefix L and C, or C to prefix D and M). The Romans didn't have a way of expressing $0$, so to make your life easier we'll say that $\varepsilon \in L$ and that the empty string represents $0$. (Oh, those silly Romans.)

Write a regular expression for $L$.

As a note, we’re using the “standard form” of Roman numerals. You can see a sample of numbers written out this way via this link.)

Problem Two: Designing CFGs

For each of the following languages, design a CFG for that language. To do so, download the second set of starter files for Problem Set Six and extract them somewhere convenient. Write and save your answers in the file res/Grammars.cfgs. Use the provided program to test and explore your CFG, and submit on Gradescope once you’re finished.

As a note – tests for CFGs take longer to run than tests for regular expressions, DFAs, etc. If the test driver is running slowly on your system, set the project to build in Release rather than Debug mode. To do this, go to the left side of the Qt Creator window and click the picture of a monitor with “Debug” written below it. Then, choose the “Release” option. This disables debugging and turns on optimization, which markedly speeds up the tests.

Before starting this problem, read the Guide to CFGs for advice and tips on how to design context-free grammars.

Given $\Sigma = \Set{\mathtt{a}, \mathtt{b}, \mathtt{c}}$, write a CFG for the language { $w \in \Sigma^\star \suchthat w$ contains aa as a substring }. For example, the strings aa, baac, and ccaabb are all in the language, but aba is not.

In our lecture on regular expressions, we wrote the following regular expression that matched email addresses:

a⁺(.a⁺)*@a⁺(.a⁺)⁺

Given $\Sigma = \Set{\mathtt{@}, \mathtt{.}, \mathtt{a}}$, write a CFG whose language is the same as the language of this regular expression.

Given $\Sigma = \Set{\mathtt{a}, \mathtt{b}}$, write a CFG for the language $L$ = { $w \in \Sigma^\star \suchthat w$ is not a palindrome }, the language of strings that are not the same when read forwards and backwards. For example, $\mathtt{aab} \in L$ and $\mathtt{baabab} \in L$, but $\mathtt{aba} \notin L$, $\mathtt{bb} \notin L$, and $\varepsilon \notin L$.

Don’t try solving this one by starting with the CFG for palindromes and making modifications to it. In general, there’s no way to mechanically turn a CFG for a language $L$ into a CFG for the language $\overline{L}$, since the context-free languages aren’t closed under complementation. However, the idea of looking at the first and last characters of a given string might still be a good idea.

Suppose that we want to check whether $x + y = z$, where $x$, $y$, and $z$ are all natural numbers. If we want to phrase this as a problem as a question of strings and languages, we will need to find some way to standardize our notation. In this problem, we will be using the unary number system, a number system in which the number $n$ is represented by writing out $n$ $1$'s. For example, the number $5$ would be written as $11111$, the number $7$ as $1111111$, the number $12$ as $111111111111$, and the number $0$ as $\varepsilon$.

Given the alphabet $\Sigma = \Set{\mathtt{1}, \mathtt{+}, \mathtt{=}}$, we can consider strings encoding $x + y = z$ by writing out $x$, $y$, and $z$ in unary. For example:

$\begin{array}{rcl} 4 + 3 = 7 &\text{ would be encoded as } & \mathtt{1111+111=1111111} \newline 7 + 1 = 8 &\text{ would be encoded as } & \mathtt{1111111+1=11111111} \newline 0 + 1 = 1 &\text{ would be encoded as } & \mathtt{+1=1} \end{array}$

Consider the following language, which we’ll call $ADD$:

$\Set{ \mathtt{1}^m\mathtt{+1}^n\mathtt{=1}^{m+n} \suchthat m, n \in \naturals }$

For example, the strings $\mathtt{111}\mathtt{+1}\mathtt{=1111}$ and $\mathtt{+1}\mathtt{=1}$ are in the language, but $\mathtt{1}\mathtt{+11}\mathtt{=11}$ is not, nor is the string $\mathtt{1}\mathtt{+1}\mathtt{+1}\mathtt{=111}$.

Write a CFG for $ADD.$

You may find it easier to first write a CFG for strings purely of the form $\mathtt{1}^m\mathtt{=}\mathtt{1}^m\text.$ Once you've done that, see if you can then extend it to the full language.

Let $\Sigma$ be an alphabet containing these symbols:
\[\emptyset \quad \naturals \quad \{ \quad \} \quad , \quad \cup\]
We can form strings from these symbols which represent sets. Here's some examples:
- $\emptyset$
- $\Set{\Set{\naturals, \emptyset} \cup \Set{\emptyset}}$
- $\Set{\emptyset, \Set{\emptyset, \Set{\emptyset}}}$
- $\Set{\emptyset, \naturals} \cup \naturals \cup \emptyset$
- $\naturals \cup \Set{\naturals, \emptyset}$
- $\Set{\Set{\Set{\Set{\naturals}}}}$
- $\Set{\Set{}}$
- $\Set{\emptyset} \cup \naturals \cup \Set{\naturals}$
- $\Set{}$
- $\naturals$
- $\Set{\emptyset, \emptyset, \emptyset}$
- $\Set{\naturals}$
- $\Set{\emptyset, \Set{}}$
Notice that some of these sets, like $\Set{\emptyset, \emptyset}$, are syntactically valid but redundant, and others like {} are syntactically valid but not the cleanest way of writing things. Here's some examples of strings that don't represent sets or aren't syntactically valid:
- $\varepsilon$
- $\naturals, \emptyset, \Set{\emptyset}$
- $\{\emptyset$
- $\}\emptyset\{$
- $\Set{, \naturals}$
- $\}\} \naturals$
- $\emptyset \Set{\naturals}$
- ${ \naturals \emptyset },$
- $\Set{ \emptyset, \emptyset, \emptyset, }$
- $\{\{\}$
- $\Set{,}$
- $\Set{\naturals, , , \emptyset}$
Write a CFG for the language { $w \in \Sigma^\star \suchthat w$ is a syntactically valid string representing a set }. Please use the letters n, u, and o in place of $\naturals$, $\cup$, and $\emptyset$, respectively.

Fun fact: the starter files for Problem Set One contain a parser that’s designed to take as input a string representing a set and to reconstruct what set that is. The logic we wrote to do that parsing was based on a CFG we wrote for sets and set theory. Take CS143 if you’re curious how to go from a grammar to a parser!

As a hint, as is often the case when writing CFGs, we recommend that you use different nonterminals to represent different components of the string. For example, structure of a comma-separated list of sets is different than the structure of an expression representing a single set.

Problem Three: Finite Languages

A language $L$ is called finite if $L$ contains finitely many strings (that is, $\abs{L}$ is a natural number). Given a finite language $L$, explain how to write a regular expression for $L$. Briefly justify your answer; no formal proof is necessary. This shows that all finite languages are regular.

Watch for edge cases!

Problem Four: Distinguishability

The concepts of distinguishability and distinguishing sets play a key role in the Myhill-Nerode theorem. This problem explores them in more depth.

Let $\Sigma = \Set{\mathtt{a}, \mathtt{b}}$ and let $L = \Set{ w \in \Sigma^\star \suchthat \text{the number of }\mathtt{a}\text{'s in }w\text{ is a multiple of three }}\text.$ Which of the following pairs of strings are distinguishable relative to $L$? For each pair that is distinguishable relative to $L$, give us the shortest string $w \in \Sigma^\star$ such that exactly one of $xw$ and $yw$ is in $L$. (If multiple strings are tied for having the shortest length, you can pick any one of them.) No justification is necessary.
1. $x = \mathtt{a}, y = \mathtt{aa}\text.$
2. $x = \mathtt{a}, y = \mathtt{aaa}\text.$
3. $x = \mathtt{aa}, y = \mathtt{aaa}\text.$
4. $x = \mathtt{a}, y = \mathtt{aaaa}\text.$

Let $\Sigma = \Set{\mathtt{a}, \mathtt{b}}$ and let $L = \Set{ w \in \Sigma^\star \suchthat w \text{ contains } \mathtt{aa} \text{ as a substring }}\text.$ Which of the following are distinguishing sets for $L$? For each set that isn't a distinguishing set, give us a pair of distinct strings from the set that aren't distinguishable relative to $L$. No justification is necessary.
1. $\Set{ \mathtt{a}^n \suchthat n \in \naturals }\text.$
2. $\emptyset\text.$
3. $\Sigma^\star\text.$
4. $\Set{\mathtt{abba}}\text.$
5. $\Set{\mathtt{abba}, \mathtt{abbababba}}\text.$
6. $\Set{\varepsilon, \mathtt{a}, \mathtt{aa}}\text.$
7. $\Set{\varepsilon, \mathtt{a}, \mathtt{b}}\text.$

Let $\Sigma = \Set{\mathtt{a}, \mathtt{b}}$ and consider this language $L\text:$
\[L = \Set{ w \in \Sigma^\star \suchthat w \text{ has even length and its first half does not equal its second half } }\text.\]
Give a distinguishing set $S$ for $L$ containing four strings. Then, for each pair of distinct strings $x, y \in S$, show that they're distinguishable by giving a string $w$ where exactly one of $xw$ and $yw$ belongs to $L$.

As above, let $\Sigma = \Set{\mathtt{a}, \mathtt{b}}$ and let $L$ be this language:
\[L = \Set{ w \in \Sigma^\star \suchthat w \text{ has even length and its first half does not equal its second half} }\text.\]
Prove that $L$ is not regular.

See if you can find a way of generalizing the distinguishing set you found in part (iii) to have four strings, then five, and eventually infinitely many.

Problem Five: Brzozowski’s Theorem

Let's begin with a new definition. If $L$ is a language over $\Sigma$ and $x \in \Sigma^\star$, then the derivative of $L$ with respect to $\Sigma$, denoted $\partial_x L$, is the language defined below:

\[\partial_x L = \Set{w \in \Sigma^\star \suchthat xw \in L }\text.\]

This definition is a bit dense, so let’s start by unpacking it.

Let $\Sigma = \Set{\mathtt{a}, \mathtt{b}}$ and let $L = \Set{ \varepsilon, \mathtt{aba}, \mathtt{bab}, \mathtt{aaab}, \mathtt{bbbaaa} }$. Answer each of the following questions; no justification is necessary.
1. What is the longest string in $\partial_\mathtt{a} L$? If two or more strings are tied for having the longest length, list any one of them.
2. What is the longest string $x \in \Sigma^\star$ for which $\abs{\partial_x L} = 1$? If two or more such strings are tied for having the longest length, list any one of them.
3. What is the shortest string $x \in \Sigma^\star$ for which $\abs{\partial_x L} = 0$? If two or more such strings are tied for having the shortest length, list any one of them.

Now, for a remarkable theorem:

Brzozowski’s Theorem: Let $L$ be a language over $\Sigma$ and let $S \subseteq \Sigma^\star$ be a set containing infinitely many strings. If the following statement is true, then $L$ is not regular:
\[\forall x \in S. \forall y \in S. (x \ne y \to \partial_x L \ne \partial_y L)\]

Another way of stating this theorem is that if $L$ is a language with infinitely many different derivatives, then $L$ isn’t regular.

Prove Brzozowski’s Theorem.

Language derivatives have both practical and theoretical applications. Surprisingly, it's possible to build regular expression matchers that work by recursively computing derivatives of regular languages, and recent work has extended this to context-free grammar parsers. And amazingly, the converse of Brzozowski's theorem is also true: if a language $L$ has only finitely many derivatives, then $L$ is regular. Take CS154 to learn more about this!

Optional Fun Problem: Birget’s Theorem

The Myhill-Nerode theorem we proved in lecture is actually a special case of a slightly broader theorem that says the following:

Theorem: Let $L$ be a language over $\Sigma$ and $S \subseteq \Sigma^*$ be a distinguishing set for $L$. Then every DFA for $L$ must have at least $|S|$ states.

It's not that difficult to adapt the proof of the Myhill-Nerode theorem from lecture to prove this version of the theorem. For brevity's sake we aren't going to ask you to do this, but you are welcome to tinker around with this result if you'd like.

Unfortunately, distinguishability is not a powerful enough technique to lower-bound the sizes of NFAs. In fact, it's in general quite hard to bound NFA sizes; there's a million-dollar prize for anyone who finds a efficient algorithm (for some precise definition of “efficient”) that, given an arbitrary NFA, converts it to the smallest possible equivalent NFA!

Although it's generally difficult to lower-bound the sizes of NFAs, there are some techniques we can use to find lower bounds on the sizes of NFAs. Let $L$ be a language over $\Sigma$. A generalized fooling set for $L$ is a set $\mathscr{F}$ of pairs of strings in $\Sigma^\star$ with the following properties:

For any $(x, y) \in \mathscr{F}$, we have $xy \in L$.
For any distinct pairs $(x_1, y_1), (x_2, y_2) \in \mathscr{F}$, we have $x_1y_2 \notin L$ or $x_2y_1 \notin L$ (this is an inclusive OR.)

Prove that if $L$ is a language and there is a generalized fooling set $\mathscr{F}$ for $L$ that contains $n$ pairs of strings, then any NFA for $L$ must have at least $n$ states.