Problem Set 7


Due Friday, May 24 at 3:00 pm Pacific


Solutions are available!

🔑 Solutions


What can you do with regular expressions? What are the limits of regular languages? And how does the material on discrete structures from the first half of this quarter come into play in this latter half on automata and computation? In this seventh problem set, you'll explore the answers to these questions along with their practical consequences.

As always, please feel free to drop by office hours, ask on EdStem, or send us emails if you have any questions. We'd be happy to help out.

Good luck, and have fun!

Starter Files

Here are the starter files for the problems you'll submit electronically:

📦 PS7 Starter Files

Unpack the files somewhere convenient and open up the bundled project.

If you would like to type your solutions in $\LaTeX$, you may want to download this template file where you can place your answers:

đź–‹ PS7 $\LaTeX$ Template

Problem One: Designing Regular Expressions

Below are a list of alphabets and languages over those alphabets. For each language, write a regular expression for that language. Provide your answers by downloading the starter files for Problem Set Seven from Canvas and editing the file res/RegularExpressions.regexes. (Unlike the DFA/NFA editor from last time, you’ll need to manually edit these files by opening them in Qt Creator.) Feel free to test your answers locally, and submit your work on GradeScope.

  1. Let $\Sigma = \Set{\mathtt{a}, \mathtt{b}, \mathtt{c}, \mathtt{d}, \mathtt{e}}$. Write a regular expression for the language $L$ = { $w \in \Sigma^\star \suchthat$ the letters in $w$ are sorted alphabetically }. For example, $\mathtt{abcde} \in L\text,$ $\mathtt{bee} \in L\text,$ $\mathtt{a} \in L\text,$ and $\varepsilon \in L\text,$ but $\mathtt{decade} \notin L\text.$

  1. Write a regular expression for the complement of the language from part (i) of this problem.

    Like NFAs and unlike DFAs, there isn't a simple construction that starts with a regex for $L$ and turns it into a regex for $\overline{L}$.

  1. On Unix-style operating systems like macOS or Linux, files are organized into directories. You can reference a file by giving a path to the file, a series of directory names separated by slashes. For example, the path /home/username/ might represent a user’s home directory, and a path like /home/username/Documents/PS7.tex might represent that person’s solution to this problem set. Paths that start with a slash character are called absolute paths and say exactly where the file is on disk. Paths that don’t start with a slash are called relative paths and say where, relative to the current folder, a file can be found. For example, if I’m logged into my computer and am in my home folder, I could look up the file Documents/PS7.tex to find my solution to this problem set.

    The general pattern here is that a file path consists of a series of directory or file names separated by slashes. That path might optionally start with a slash, but isn’t required to, and it might optionally end with a slash, but isn’t required to. However, you can’t have two consecutive slashes.

    Let $\Sigma = \Set{\mathtt{a}, \mathtt{/}}$. Write a regular expression for $L$ = { $w \in \Sigma^\star \suchthat w$ represents the name of a file path on a Unix-style system }. For example, $\mathtt{/aaa/a/aa} \in L$, $\mathtt{/} \in L$, $\mathtt{a} \in L$, $\mathtt{/a/a/a/} \in L$, and $\mathtt{aaa/} \in L$, but $\mathtt{//a//} \notin L$, $\mathtt{a//a} \notin L$, and $\varepsilon \notin L$.

    Fun fact: this problem comes from former CS103 instructor Amy Liu, who fixed a bug in industrial code that arose when someone wrote the wrong regex for this language. Oops.

  1. Suppose you are taking a walk with your dog on a leash of length two. Let $\Sigma = \Set{\mathtt{y}, \mathtt{d}}$ and let $L$ = { $w \in \Sigma^\star \suchthat w$ represents a walk with your dog on a leash where you and your dog both end up at the same location }. For example, we have $\mathtt{yyddddyy} \in L$ because you and your dog are never more than two steps apart and both of you end up four steps ahead of where you started; similarly, $\mathtt{ddydyy} \in L$. However, $\mathtt{yyyydddd} \notin L$, since halfway through your walk you’re three steps ahead of your dog; $\mathtt{ddyd} \notin L$, because your dog ends up two steps ahead of you; and $\mathtt{ddyddyyy} \notin L$, because at one point your dog is three steps ahead of you.

    Write a regular expression for $L$.

    Note that, unlike Problem Set Six, you and your dog must end at the same position.

  1. Let $\Sigma = \Set{\mathtt{M}, \mathtt{D}, \mathtt{C}, \mathtt{L}, \mathtt{X}, \mathtt{V}, \mathtt{I}}$ and let $L$ = { $w \in \Sigma^\star \suchthat w$ is number less than $2,000$ represented in Roman numerals }. For example, $\mathtt{CMXCIX} \in L$, since it represents the number $999$, as are the strings L (50), VIII (8), DCLXVI (666), CXXXVII (137), CDXII (412), and MDCXVIII (1,618). However, we have that $\mathtt{VIIII} \notin L$ (you'll never have four I's in a row; use IX or IV instead), that $\mathtt{MM} \notin L$ (it's a Roman numeral, but it's for $2,000$, which is too large), that $\mathtt{VX} \notin L$ (this isn't a valid Roman numeral), and that $\mathtt{IM} \notin L$ (the notation of using a smaller digit to subtract from a larger one only lets you use I to prefix V and X, or X to prefix L and C, or C to prefix D and M). The Romans didn't have a way of expressing $0$, so to make your life easier we'll say that $\varepsilon \in L$ and that the empty string represents $0$. (Oh, those silly Romans.)

    Write a regular expression for $L$.

    As a note, we’re using the “standard form” of Roman numerals. You can see a sample of numbers written out this way via this link.)

Problem Two: Finite Languages

A language $L$ is called finite if $L$ contains finitely many strings (that is, $\abs{L}$ is a natural number). Given a finite language $L$, explain how to write a regular expression for $L$. Briefly justify your answer; no formal proof is necessary. This shows that all finite languages are regular.

Watch for edge cases!

Problem Three: State Elimination

The state elimination algorithm gives a way to transform a DFA or NFA into a regular expression. It's a beautiful algorithm once you get the hang of it. In this problem, you’ll use the state elimination algorithm to produce a regular expression for a language.

Let $\Sigma = \Set{\mathtt{a}, \mathtt{b}}$ and let $L$ = { $w \in \Sigma^\star \suchthat w$ has an even number of $\mathtt{a}$'s and an even number of $\mathtt{b}$'s}. Below to the left is a finite automaton for $L$ that we've prepared for the state elimination algorithm by adding in a new start state $q_{start}$ and a new accept state $q_{end}$. If you run two steps of the state elimination algorithm on the above automaton, first eliminating state $q_1$, then eliminating state $q_2$, you will get an automaton whose shape matches the diagram on the right.

First, the initial automata. There are six states: q0, q1, q2, q3, q_start, and q_end. State q_start is the start state and has an epsilon transition to q0. State q0 has a transition on a to q1, a transition on b to q2, and an epsilon transition to q_end. State q1 has a transition on a to q0 and a transition on b to q3. State q2 has a transition on a to q3 and a transition on b to q0. State q3 has a transition on a to q2 and a transition on b to q1. State q_end is accepting and has no outgoing transitions. Next, the automaton after one step of state elimination. There are three states: q_start, q_0, q_3, and q_end. State q_start is the start state and has a transition on epsilon to q_0. State q_0 has a transition to q3 labeled "part (i)", a transition to itself labeled "part (iii)", and a transition to q_end on epsilon. State q_1 has a transition to q0 labeled "part (ii)" and a transition to itself labeled "part (iv)". State q_end is accepting and has no outgoing transitions.

Edit the file res/StateElimination.regexes to answer the following questions.

  1. ii. iii. iv. What regular expressions go at the indicate positions in the diagram?

    Remember that to eliminate a state $q$, you should identify all pairs of states $q_{in}$ and $q_{out}$ where there’s a transition from $q_{in}$ to $q$ and from $q$ to $q_{out}$, then add shortcut edges from $q_{in}$ to $q_{out}$ to bypass state $q$. Remember that $q_{in}$ and $q_{out}$ may be the same state. To help you check your work: there are four such pairs for state $q_1$.

    If you’ve done everything properly, at the end of this stage, none of these transitions should use the Kleene star.

Now, eliminate $q_3$ and then $q_0$. Your automaton should look like this:

An automaton with two states, q_start and q_end. q_start is the start state and has a transition labeled "part (v)" to q_end. q_end is accepting and has no outgoing transitions

The regular expression on this edge has the same language as the original automaton!

  1. What regular expression goes at this spot in the above diagram?

    This is where you’ll start seeing Kleene stars.

Optional but recommended activity: look at the regular expression you ended up with in part (v) of this problem. How does it work? That is, how does that regular expression match all and only strings with an even number of a’s and an even number of b’s?

Problem Four: Distinguishability

The concepts of distinguishability and distinguishing sets play a key role in the Myhill-Nerode theorem. This problem explores them in more depth.

  1. Let $\Sigma = \Set{\mathtt{a}, \mathtt{b}}$ and let $L = \Set{ w \in \Sigma^\star \suchthat \text{the number of }\mathtt{a}\text{'s in }w\text{ is a multiple of three }}\text.$ Which of the following pairs of strings are distinguishable relative to $L$? For each pair that is distinguishable relative to $L$, give us the shortest string $w \in \Sigma^\star$ such that exactly one of $xw$ and $yw$ is in $L$. (If multiple strings are tied for having the shortest length, you can pick any one of them.) No justification is necessary.

    1. $x = \mathtt{a}, y = \mathtt{aa}\text.$
    2. $x = \mathtt{a}, y = \mathtt{aaa}\text.$
    3. $x = \mathtt{aa}, y = \mathtt{aaa}\text.$
    4. $x = \mathtt{a}, y = \mathtt{aaaa}\text.$
  1. Let $\Sigma = \Set{\mathtt{a}, \mathtt{b}}$ and let $L = \Set{ w \in \Sigma^\star \suchthat w \text{ contains } \mathtt{aa} \text{ as a substring }}\text.$ Which of the following are distinguishing sets for $L$? For each set that isn't a distinguishing set, give us a pair of distinct strings from the set that aren't distinguishable relative to $L$. No justification is necessary.

    1. $\Set{ \mathtt{a}^n \suchthat n \in \naturals }\text.$
    2. $\emptyset\text.$
    3. $\Sigma^\star\text.$
    4. $\Set{\mathtt{abba}}\text.$
    5. $\Set{\mathtt{abba}, \mathtt{abbababba}}\text.$
    6. $\Set{\varepsilon, \mathtt{a}, \mathtt{aa}}\text.$
    7. $\Set{\varepsilon, \mathtt{a}, \mathtt{b}}\text.$
  1. Let $\Sigma = \Set{\mathtt{a}, \mathtt{b}}$ and consider this language $L\text:$

    \[L = \Set{ w \in \Sigma^\star \suchthat w \text{ has even length and its first half does not equal its second half } }\text.\]

    Give a distinguishing set $S$ for $L$ containing four strings. Then, for each pair of distinct strings $x, y \in S$, show that they're distinguishable by giving a string $w$ where exactly one of $xw$ and $yw$ belongs to $L$.

  1. As above, let $\Sigma = \Set{\mathtt{a}, \mathtt{b}}$ and let $L$ be this language:

    \[L = \Set{ w \in \Sigma^\star \suchthat w \text{ has even length and its first half does not equal its second half} }\text.\]

    Prove that $L$ is not regular.

    See if you can find a way of generalizing the distinguishing set you found in part (iii) to have four strings, then five, and eventually infinitely many.

Problem Five: Brzozowski’s Theorem

Let's begin with a new definition. If $L$ is a language over $\Sigma$ and $x \in \Sigma^\star$, then the derivative of $L$ with respect to $x\text,$ denoted $\partial_x L$, is the language defined below:

\[\partial_x L = \Set{w \in \Sigma^\star \suchthat xw \in L }\text.\]

This definition is a bit dense, so let’s start by unpacking it.

  1. Let $\Sigma = \Set{\mathtt{a}, \mathtt{b}}$ and let $L = \Set{ \varepsilon, \mathtt{aba}, \mathtt{bab}, \mathtt{aaab}, \mathtt{bbbaaa} }$. Answer each of the following questions; no justification is necessary.

    1. What is the longest string in $\partial_\mathtt{a} L$? If two or more strings are tied for having the longest length, list any one of them.
    2. What is the longest string $x \in \Sigma^\star$ for which $\abs{\partial_x L} = 1$? If two or more such strings are tied for having the longest length, list any one of them.
    3. What is the shortest string $x \in \Sigma^\star$ for which $\abs{\partial_x L} = 0$? If two or more such strings are tied for having the shortest length, list any one of them.

Now, for a remarkable theorem:

Brzozowski’s Theorem: Let $L$ be a language over $\Sigma$ and let $S \subseteq \Sigma^\star$ be a set containing infinitely many strings. If the following statement is true, then $L$ is not regular:

\[\forall x \in S. \forall y \in S. (x \ne y \to \partial_x L \ne \partial_y L)\]

Another way of stating this theorem is that if $L$ is a language with infinitely many different derivatives, then $L$ isn’t regular.

  1. Prove Brzozowski’s Theorem.

Language derivatives have both practical and theoretical applications. Surprisingly, it's possible to build regular expression matchers that work by recursively computing derivatives of regular languages, and recent work has extended this to context-free grammar parsers. And amazingly, the converse of Brzozowski's theorem is also true: if a language $L$ has only finitely many derivatives, then $L$ is regular. Take CS154 to learn more about this!

Optional Fun Problem: Languages with Backspaces

Imagine that you’re typing a string of a’s and b’s into your computer. You might hit the backspace key to delete the most recent character that you entered. Let’s represent pressing the backspace key with the X character. So, if you typed in abXa, you’d end up with the string aa. If you typed in abbaXXXXaab, you’d end up with the string aab. If you typed in aXXXXb, you’d end up with the string b (pressing backspace when there’s no text entered has no effect). And if you typed in XXaabbbXX, you’d end up with the string aab.

Let's formalize this idea in the context of formal language theory. Let $\Sigma_\mathtt{X} = \Set{\mathtt{a}, \mathtt{b}, \mathtt{X}}$ and let $\Sigma = \Set{\mathtt{a}, \mathtt{b}}\text.$ We define a function $\nabla : \Sigma_\mathtt{X} \to \Sigma$ such that $\nabla(w)$ is the string that results from applying all the backspace presses in $w\text.$ For example:

  • $\nabla(\varepsilon) = \varepsilon\text.$
  • $\nabla(\mathtt{XXX}) = \varepsilon\text.$
  • $\nabla(\mathtt{aXbXXbXa}) = \mathtt{a}\text.$
  • $\nabla(\mathtt{aaaXXXbbbXX}) = \mathtt{b}\text.$

Finally, let $\nabla^{-1}[L] = \set{ w \suchthat w \in \Sigma_\mathtt{X}^\star \land \nabla(w) \in L}\text.$ There are exactly two languages $L$ over $\Sigma$ for which $\nabla^{-1}[L]$ is regular. Tell us what those languages are, then prove that for every other language $L$ over $\Sigma\text,$ the language $\nabla^{-1}[L]$ is nonregular.