Problem Set 7

Due Friday, November 14 at 1:00 PM Pacific

Solutions are available!

🔑 Solutions

What can you do with regular expressions? What are the limits of regular languages? And what applications do these concepts have outside of the regular languages? In this seventh problem set, you'll explore the answers to these questions along with their practical consequences.

As always, please feel free to drop by office hours, ask on EdStem, or send us emails if you have any questions. We'd be happy to help out.

Good luck, and have fun!

Starter Files

Here are the starter files for the problems you'll submit electronically:

📦 PS7 Starter Files

Unpack the files somewhere convenient and open up the bundled project.

If you would like to type your solutions in $\LaTeX$, you may want to download this template file where you can place your answers:

🖋 PS7 $\LaTeX$ Template

General Advice

First, read through the index of problems below to get an idea of time management for this assignment.
Next, glance through all the questions before you start working. This not only ensures you won't be caught off guard as you reach later problems, but also gives your brain a chance to start working on some of these as a background process while you do other things.
If you get stuck, consider moving on to another problem for a while! Changing things up is a great way to get out of a rut, make new connections between concepts, and spark creative insights!
If you realize you don't have the requisite knowledge to complete a problem, take a step back from it and review the relevant lecture materials. Skipping lecture and trying to extract the relevant tidbits for each problem will lead to a more frustrating experience than engaging fully with a lecture and approaching the problem sets afterward.

Timeline

This pset has five required problems (one of which is autograded, the rest of which are written / manually graded). Because the problem set is somewhat condensed, we are listing them all on a single page.

We recommend the following timeline for completing questions:

Q1 and Q2 by Sunday evening
Q3 by Monday evening (backup plan: finish by Wednesday)
Q4 by Wednesday evening (backup plan: finish by Thursday evening)
Q5 by Thursday evening

Of course, we recommend reading through the entire problem set ASAP so you have a strong idea of how much time the problems will take and so your brain can start working on them as a background process. 😃

Problem One: Designing Regular Expressions

Below are a list of alphabets and languages over those alphabets. For each language, write a regular expression for that language. Provide your answers by downloading the starter files for Problem Set Seven from Canvas and editing the file res/RegularExpressions.regexes. (Unlike the DFA/NFA editor from last time, you’ll need to manually edit these files by opening them in Qt Creator.) Feel free to test your answers locally, and submit your work on GradeScope.

Before starting this problem, read our handy Guide to Regular Expressions for some useful insights about regex design.

Let $\Sigma = \Set{\mathtt{a}, \mathtt{b}, \mathtt{c}, \mathtt{d}, \mathtt{e}}$. Write a regular expression for the language $L$ = { $w \in \Sigma^\star \suchthat$ the letters in $w$ are sorted alphabetically }. For example, $\mathtt{abcde} \in L\text,$ $\mathtt{bee} \in L\text,$ $\mathtt{a} \in L\text,$ and $\varepsilon \in L\text,$ but $\mathtt{decade} \notin L\text.$

Write a regular expression for the complement of the language from part (i) of this problem.

Like NFAs and unlike DFAs, there isn't a simple construction that starts with a regex for $L$ and turns it into a regex for $\overline{L}$.

On Unix-style operating systems like macOS or Linux, files are organized into directories. You can reference a file by giving a path to the file, a series of directory names separated by slashes. For example, the path /home/username/ might represent a user’s home directory, and a path like /home/username/Documents/PS7.tex might represent that person’s solution to this problem set. Paths that start with a slash character are called absolute paths and say exactly where the file is on disk. Paths that don’t start with a slash are called relative paths and say where, relative to the current folder, a file can be found. For example, if I’m logged into my computer and am in my home folder, I could look up the file Documents/PS7.tex to find my solution to this problem set.

The general pattern here is that a file path consists of a series of directory or file names separated by slashes. That path might optionally start with a slash, but isn’t required to, and it might optionally end with a slash, but isn’t required to. However, you can’t have two consecutive slashes.

Let $\Sigma = \Set{\mathtt{a}, \mathtt{/}}$. Write a regular expression for $L$ = { $w \in \Sigma^\star \suchthat w$ represents the name of a file path on a Unix-style system }. For example, $\mathtt{/aaa/a/aa} \in L$, $\mathtt{/} \in L$, $\mathtt{a} \in L$, $\mathtt{/a/a/a/} \in L$, and $\mathtt{aaa/} \in L$, but $\mathtt{//a//} \notin L$, $\mathtt{a//a} \notin L$, and $\varepsilon \notin L$.

Fun fact: this problem comes from former CS103 instructor Amy Liu, who fixed a bug in industrial code that arose when someone wrote the wrong regex for this language. Oops.

Suppose you are taking a walk with your dog on a leash of length two. Let $\Sigma = \Set{\mathtt{y}, \mathtt{d}}$ and let $L$ = { $w \in \Sigma^\star \suchthat w$ represents a walk with your dog on a leash where you and your dog both end up at the same location }. For example, we have $\mathtt{yyddddyy} \in L$ because you and your dog are never more than two steps apart and both of you end up four steps ahead of where you started; similarly, $\mathtt{ddydyy} \in L$. However, $\mathtt{yyyydddd} \notin L$, since halfway through your walk you’re three steps ahead of your dog; $\mathtt{ddyd} \notin L$, because your dog ends up two steps ahead of you; and $\mathtt{ddyddyyy} \notin L$, because at one point your dog is three steps ahead of you.

Write a regular expression for $L$.

Note that, unlike Problem Set Six, you and your dog must end at the same position.

Let $\Sigma = \Set{\mathtt{M}, \mathtt{D}, \mathtt{C}, \mathtt{L}, \mathtt{X}, \mathtt{V}, \mathtt{I}}$ and let $L$ = { $w \in \Sigma^\star \suchthat w$ is number less than $2,000$ represented in Roman numerals }. For example, $\mathtt{CMXCIX} \in L$, since it represents the number $999$, as are the strings L (50), VIII (8), DCLXVI (666), CXXXVII (137), CDXII (412), and MDCXVIII (1,618). However, we have that $\mathtt{VIIII} \notin L$ (you'll never have four I's in a row; use IX or IV instead), that $\mathtt{MM} \notin L$ (it's a Roman numeral, but it's for $2,000$, which is too large), that $\mathtt{VX} \notin L$ (this isn't a valid Roman numeral), and that $\mathtt{IM} \notin L$ (the notation of using a smaller digit to subtract from a larger one only lets you use I to prefix V and X, or X to prefix L and C, or C to prefix D and M). The Romans didn't have a way of expressing $0$, so to make your life easier we'll say that $\varepsilon \in L$ and that the empty string represents $0$. (Oh, those silly Romans.)

Write a regular expression for $L$.

As a note, we’re using the “standard form” of Roman numerals. You can see a sample of numbers written out this way via this link.)

Problem Two: Finite Languages

A language $L$ is called finite if $L$ contains finitely many strings (that is, $\abs{L}$ is a natural number). Given a finite language $L$, explain how to write a regular expression for $L$. Briefly justify your answer; no formal proof is necessary. This shows that all finite languages are regular.

Watch for edge cases!

Problem Three: Distinguishability

The concepts of distinguishability and distinguishing sets play a key role in the Myhill-Nerode theorem. This problem explores them in more depth. Before starting this problem, read our Guide to the Myhill-Nerode Theorem, which reviews the major concepts and provides some good intuitions.

Let $\Sigma = \Set{\mathtt{a}, \mathtt{b}}$ and let $L = \Set{ w \in \Sigma^\star \suchthat \text{the number of }\mathtt{a}\text{'s in }w\text{ is a multiple of three }}\text.$ Which of the following pairs of strings are distinguishable relative to $L$? For each pair that is distinguishable relative to $L$, give us the shortest string $w \in \Sigma^\star$ such that exactly one of $xw$ and $yw$ is in $L$. (If multiple strings are tied for having the shortest length, you can pick any one of them.) No justification is necessary.
1. $x = \mathtt{a}, y = \mathtt{aa}\text.$
2. $x = \mathtt{a}, y = \mathtt{aaa}\text.$
3. $x = \mathtt{aa}, y = \mathtt{aaa}\text.$
4. $x = \mathtt{a}, y = \mathtt{aaaa}\text.$

Let $\Sigma = \Set{\mathtt{a}, \mathtt{b}}$ and let $L = \Set{ w \in \Sigma^\star \suchthat w \text{ contains } \mathtt{aa} \text{ as a substring }}\text.$ Which of the following are distinguishing sets for $L$? For each set that isn't a distinguishing set, give us a pair of distinct strings from the set that aren't distinguishable relative to $L$. No justification is necessary.
1. $\Set{ \mathtt{a}^n \suchthat n \in \naturals }\text.$
2. $\emptyset\text.$
3. $\Sigma^\star\text.$
4. $\Set{\mathtt{abba}}\text.$
5. $\Set{\mathtt{abba}, \mathtt{abbababba}}\text.$
6. $\Set{\varepsilon, \mathtt{a}, \mathtt{aa}}\text.$
7. $\Set{\varepsilon, \mathtt{a}, \mathtt{b}}\text.$

Let $\Sigma = \Set{\mathtt{a}, \mathtt{b}}$ and consider this language $L\text:$
\[L = \Set{ w \in \Sigma^\star \suchthat w \text{ has even length and its first half does not equal its second half } }\text.\]
Give a distinguishing set $S$ for $L$ containing four strings. Then, for each pair of distinct strings $x, y \in S$, show that they're distinguishable by giving a string $w$ where exactly one of $xw$ and $yw$ belongs to $L$.

As above, let $\Sigma = \Set{\mathtt{a}, \mathtt{b}}$ and let $L$ be this language:
\[L = \Set{ w \in \Sigma^\star \suchthat w \text{ has even length and its first half does not equal its second half} }\text.\]
Prove that $L$ is not regular.

See if you can find a way of generalizing the distinguishing set you found in part (iii) to have four strings, then five, and eventually infinitely many.

Problem Four: That's Odd…

Let $\Sigma = \set{\mathtt{a}, \mathtt{b}}$ and let $L = \set{ w \in \Sigma^\star \suchthat \abs{w} \text{ is odd } }$ be a language over $L\text.$ This language is regular; one way to see this is that it's given by the regex $\Sigma\left(\Sigma\Sigma\right)^\star\text.$

However, below is an (incorrect!) proof that $L$ is not regular.

(Incorrect!) Theorem: $L$ is not regular.

(Incorrect!) Proof: Let $S = \set{ \mathtt{a}^n \in \Sigma^\star \suchthat n \in \naturals }\text.$ We claim that $S$ is an infinite distinguishing set for $L\text.$

To see that $S$ is infinite, note that it contains one string for each natural number.

To see that $S$ is a distinguishing set for $L\text,$ consider any two strings $\mathtt{a}^n, \mathtt{a}^{n+1} \in S\text.$ We will prove that $\mathtt{a}^n \not\equiv_L \mathtt{a}^{n+1}\text.$ To do so, we will show there exists a string $w \in \Sigma^\star$ such that exactly one of $\mathtt{a}^n w$ and $\mathtt{a}^{n+1} w$ is in $L\text.$

Specifically, pick $w = \mathtt{a}^n\text.$ First, notice that $\mathtt{a}^n w = \mathtt{a}^n \mathtt{a}^n = \mathtt{a}^{2n}\text.$ This means that $\abs{\mathtt{a}^{2n}} = 2n\text,$ which is even, and so $\mathtt{a}^n w = \mathtt{a}^{2n} \notin L\text.$ Second, observe that $\mathtt{a}^{n+1}w = \mathtt{a}^{n+1}\mathtt{a}^n = \mathtt{a}^{2n+1}\text.$ Because $\abs{\mathtt{a}^{2n+1}} = 2n+1\text,$ which is odd, we see that $\mathtt{a}^{n+1}w = \mathtt{a}^{2n+1} \in L\text.$ Thus exactly one of $\mathtt{a}^n w$ and $\mathtt{a}^{n+1}w$ is in $L\text,$ as needed.

Because $S$ is an infinite distinguishing set for $L\text,$ by the Myhill-Nerode theorem, we see that $L$ is not regular. $\qed$

This proof has to be wrong, because $L$ is indeed regular.

What's wrong with this proof? Be as specific as possible. For full credit, you should identify the first sentence in the proof that contains a logical error and explain what that error is. (We say "first sentence" because once the proof makes a logic error, everything after that point will be incorrect.)

Problem Five: Properties (?) of Languages

Below is a list of statements about languages. Some number of these statements are true. Some number of these statements are false. For each of the true statements, prove that the statement is true. For each of the false statements, disprove the statement by proving that the negation is true. Here's the general template for a disproof:

Claim: [statement to disprove].

Disproof: [proof of the negation of the statement]. $\qed$

Notice that in a disproof we speak of a claim rather than a theorem, since a theorem is a provably true statement and thus couldn't be disproven.

Prove or disprove: for all languages $A\text,$ $B\text,$ and $C\text,$ we have $A(B \cap C) = AB \cap AC\text.$

Prove or disprove: for all finite languages $L$ and all natural numbers $n \ge 1\text,$ we have $\abs{L^n} = \abs{L}^n\text.$

Just to make sure the notation is clear, the left-hand side of the equality is "the cardinality of the language $L^n\text,$" and the right-hand side of the equality is "the $n\text{th}$ power of the cardinality of the language $L\text{."}$

Prove or disprove: there is a language $L$ where $\overline{\left( L^\star \right)} = \left( \overline{L} \right)^\star\text.$

A great warm-up exercise here: what is $\emptyset^\star\text?$ Don't guess - go back to the formal definition and see what it says.

Optional Fun Problem: Languages with Backspaces

Imagine that you’re typing a string of a’s and b’s into your computer. You might hit the backspace key to delete the most recent character that you entered. Let’s represent pressing the backspace key with the X character. So, if you typed in abXa, you’d end up with the string aa. If you typed in abbaXXXXaab, you’d end up with the string aab. If you typed in aXXXXb, you’d end up with the string b (pressing backspace when there’s no text entered has no effect). And if you typed in XXaabbbXX, you’d end up with the string aab.

Let's formalize this idea in the context of formal language theory. Let $\Sigma_\mathtt{X} = \Set{\mathtt{a}, \mathtt{b}, \mathtt{X}}$ and let $\Sigma = \Set{\mathtt{a}, \mathtt{b}}\text.$ We define a function $\nabla : \Sigma_\mathtt{X}^\star \to \Sigma^\star$ such that $\nabla(w)$ is the string that results from applying all the backspace presses in $w\text.$ For example:

$\nabla(\varepsilon) = \varepsilon\text.$
$\nabla(\mathtt{XXX}) = \varepsilon\text.$
$\nabla(\mathtt{aXbXXbXa}) = \mathtt{a}\text.$
$\nabla(\mathtt{aaaXXXbbbXX}) = \mathtt{b}\text.$

Finally, let $\nabla^{-1}[L] = \set{ w \suchthat w \in \Sigma_\mathtt{X}^\star \land \nabla(w) \in L}\text.$ There are exactly two languages $L$ over $\Sigma$ for which $\nabla^{-1}[L]$ is regular. Tell us what those languages are, then prove that for every other language $L$ over $\Sigma\text,$ the language $\nabla^{-1}[L]$ is nonregular.