% 
% LaTeX Problem Set Template by Sachin Padmanabhan
% I created this when I was a freshman in CS 103,
% and I continue to use it to this day.
%
% Hope you enjoy!
%
% There may be problems with this template.
% If so, feel free to contact me.
%

% Update for Fall 2018 by Michael Zhu

\documentclass{article}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{amsthm}
\usepackage{amssymb}
\usepackage{mathdots}
\usepackage[pdftex]{graphicx}
\usepackage{fancyhdr}
\usepackage[margin=1in]{geometry}
\usepackage{multicol}
\usepackage{bm}
\usepackage{listings}
\PassOptionsToPackage{usenames,dvipsnames}{color}  %% Allow color names
\usepackage{pdfpages}
\usepackage{algpseudocode}
\usepackage{tikz}
\usetikzlibrary{automata,positioning}
\usepackage{enumitem}
\usepackage[T1]{fontenc}
\usepackage{inconsolata}
\usepackage{framed}
\usepackage{wasysym}
\usepackage[thinlines]{easytable}
\usepackage{wrapfig}
\usepackage{hyperref}
\usepackage{mathrsfs}
\usepackage{tabularx} % also loads 'array' package
\newcolumntype{C}{>{\centering\arraybackslash}X} % centered version of 'X' columns
\usepackage{diagbox}

\title{CS 103: Mathematical Foundations of Computing\\Problem Set \#7}
\author{Your Name(s) Here}
\date{\today}
\lhead{Your Name(s) Here}
\chead{Problem Set \#7}
\rhead{\today}
\lfoot{}
\cfoot{CS 103: Mathematical Foundations of Computing --- Autumn 2018}
\rfoot{\thepage}

\newcommand{\abs}[1]{\lvert #1 \rvert}
\newcommand{\absfit}[1]{\left\lvert #1 \right\rvert}
\newcommand{\norm}[1]{\left\lVert #1 \right\rVert}
\newcommand{\eval}[3]{\left[#1\right]_{#2}^{#3}}
\renewcommand{\(}{\left(}
\renewcommand{\)}{\right)}
\newcommand{\floor}[1]{\left\lfloor#1\right\rfloor}
\newcommand{\ceil}[1]{\left\lceil#1\right\rceil}
\newcommand{\pd}[1]{\frac{\partial}{\partial #1}}
\newcommand{\inner}[1]{\langle#1\rangle}
\newcommand{\cond}{\bigg|}
\newcommand{\rank}[1]{\mathbf{rank}(#1)}
\newcommand{\range}[1]{\mathbf{range}(#1)}
\newcommand{\nullsp}[1]{\mathbf{null}(#1)}
\newcommand{\repr}[1]{\left\langle#1\right\rangle}
\newcommand\VRule[1][\arrayrulewidth]{\vrule width #1}

\DeclareMathOperator{\Var}{Var}
\DeclareMathOperator{\tr}{tr}
\DeclareMathOperator{\Tr}{\mathbf{Tr}}
\DeclareMathOperator{\diag}{\mathbf{diag}}
\DeclareMathOperator{\dist}{\mathbf{dist}}
\DeclareMathOperator{\prob}{\mathbf{prob}}
\DeclareMathOperator{\dom}{\mathbf{dom}}
\DeclareMathOperator{\E}{\mathbf{E}}
\DeclareMathOperator{\Q}{\mathbb{Q}}
\DeclareMathOperator{\R}{\mathbb{R}}
\DeclareMathOperator{\var}{\mathbf{var}}
\DeclareMathOperator{\quartile}{\mathbf{quartile}}
\DeclareMathOperator{\conv}{\mathbf{conv}}
\DeclareMathOperator{\VC}{VC}
\DeclareMathOperator*{\argmax}{arg\,max}
\DeclareMathOperator*{\argmin}{arg\,min}
\DeclareMathOperator{\Ber}{Bernoulli}
\DeclareMathOperator{\NP}{\mathbf{NP}}
\DeclareMathOperator{\coNP}{\mathbf{coNP}}
\DeclareMathOperator{\TIME}{\mathsf{TIME}}
\DeclareMathOperator{\polytime}{\mathbf{P}}
\DeclareMathOperator{\PH}{\mathbf{PH}}
\DeclareMathOperator{\SIZE}{\mathbf{SIZE}}
\DeclareMathOperator{\ATIME}{\mathbf{ATIME}}
\DeclareMathOperator{\SPACE}{\mathbf{SPACE}}
\DeclareMathOperator{\ASPACE}{\mathbf{ASPACE}}
\DeclareMathOperator{\NSPACE}{\mathbf{NSPACE}}
\DeclareMathOperator{\Z}{\mathbb{Z}}
\DeclareMathOperator{\N}{\mathbb{N}}
\DeclareMathOperator{\EXP}{\mathbf{EXP}}
\DeclareMathOperator{\NEXP}{\mathbf{NEXP}}
\DeclareMathOperator{\NTIME}{\mathbf{NTIME}}
\DeclareMathOperator{\DTIME}{\mathbf{DTIME}}
\DeclareMathOperator{\poly}{poly}
\DeclareMathOperator{\BPP}{\mathbf{BPP}}
\DeclareMathOperator{\ZPP}{\mathbf{ZPP}}
\DeclareMathOperator{\RP}{\mathbf{RP}}
\DeclareMathOperator{\coRP}{\mathbf{coRP}}
\DeclareMathOperator{\BPL}{\mathbf{BPL}}
\DeclareMathOperator{\IP}{\mathbf{IP}}
\DeclareMathOperator{\PSPACE}{\mathbf{PSPACE}}
\DeclareMathOperator{\NPSPACE}{\mathbf{NPSPACE}}
\DeclareMathOperator{\SAT}{\mathsf{SAT}}
\DeclareMathOperator{\NL}{\mathbf{NL}}
\DeclareMathOperator{\PCP}{\mathbf{PCP}}
\DeclareMathOperator{\PP}{\mathbf{PP}}
\DeclareMathOperator{\cost}{cost}
\let\Pr\relax
\DeclareMathOperator*{\Pr}{\mathbf{Pr}}

\definecolor{shadecolor}{gray}{0.9}

\theoremstyle{plain}
\newtheorem*{lem}{Lemma}

\theoremstyle{plain}
\newtheorem*{claim}{Claim}

\newtheorem{theorem}{Theorem}[section]
\newtheorem*{thm}{Theorem}
\newtheorem{corollary}{Corollary}[theorem]
\newtheorem{lemma}[theorem]{Lemma}

\renewcommand{\headrulewidth}{0.4pt}
\renewcommand{\footrulewidth}{0.4pt}

\setlength{\parindent}{0pt}

\pagestyle{fancy}

\renewcommand{\thefootnote}{\fnsymbol{footnote}}

% start MZ
\usetikzlibrary{automata,positioning}
\let\oldemptyset\emptyset
\renewcommand{\emptyset}{\text{\O}}
\renewcommand\qedsymbol{$\blacksquare$}
\newenvironment{prf}{{\bfseries Proof.}}{\qedsymbol}
\renewcommand{\emph}[1]{\textit{\textbf{#1}}}
\newcommand{\annotate}[1]{\textit{\textcolor{blue}{#1}}}
\usepackage{stmaryrd}
\usepackage{minted}
\usepackage{mathtools}
\usepackage{alltt}
\newcommand{\ttt}[1]{\texttt{#1}}
\newcommand{\match}[1]{\textbf{\underline{#1}}}
\usepackage{parskip}
\newcommand{\lcurly}{\textbf{\ttt{\{}}}
\newcommand{\rcurly}{\textbf{\ttt{\}}}}

% end MZ

\begin{document}

\maketitle{\vspace{-1.0cm}}

\section{Designing Regular Expressions}

Below are a list of alphabets and languages over those alphabets. For each language, write a regular expression for that language. \emph{Please use our online tool to design, test, and submit your regular expressions. Typed or handwritten solutions will not be accepted.} To use it, visit the CS103 website and click the ``Regex Editor'' link under the ``Resources'' header. If you submit in a pair, please tell us in your GradeScope submission who submitted your answers to this question. Also, as a reminder, please test your submissions thoroughly, since we'll be grading them with an autograder.

\begin{enumerate}[label*=\roman*.,ref=\roman*]
    \item Let $\Sigma = \{\ttt{a}, \ttt{b}, \ttt{c}, \ttt{d}, \ttt{e}\}$. Write a regular expression for the language $\{w \in \Sigma^* \mid $ the letters in $w$ are sorted alphabetically\}. 
    
    \item Write a regular expression for the complement of the language from part (i) of this problem.
    
    \annotate{As with NFAs, there's no simple way to start with a regex for a language $L$ and to turn it into a regex for $\overline{L}$.}
    
    \item On Unix-style operating systems like macOS or Linux, files are organized into directories. You can reference a file by giving a \textit{path} to the file, a series of directory names separated by slashes. For example, the path \ttt{/home/username/} might represent a user's home directory, and a path like \ttt{/home/username/Documents/PS7.tex} might represent that person's solution to this problem set. Paths that start with a slash character are called \textit{absolute paths} and say exactly where the file is on disk. Paths that don't start with a slash are called \textit{relative paths} and say where, relative to the current folder, a file can be found. For example, if I'm logged into my computer and am in my home folder, I could look up the file \ttt{Documents/PS7.tex} to find my solution to this problem set.
    
    The general pattern here is that a file path consists of a series of directory or file names separated by slashes. That path might optionally start with a slash, but isn't required to, and it might option- ally end with a slash, but isn't required to. However, you can't have two consecutive slashes. \footnote{In some cases you technically \textit{can} have multiple consecutive slashes, but we'll ignore that for now.}
    
    Let $\Sigma = \{a,/\}$. Write a regular expression for $L = \{w \in \Sigma^* \mid w$ represents the name of a file path on a Unix-style system \}. For example, $\ttt{/aaa/a/aa} \in L$, $\ttt{/} \in L$, $\ttt{a} \in L$, $\ttt{/a/a/a/} \in L$, and $\ttt{aaa/} \in L$, but $\ttt{//a//} \notin L$, $\ttt{a//a} \notin L$, and $\varepsilon \notin L$.
    
    Fun fact: this problem comes from former TA Amy Liu, who fixed a bug in industrial code that arose when someone wrote the wrong regex for this language. Oops.
    
    \item Suppose you are taking a walk with your dog on a leash of length two. Let $\Sigma = \{\ttt{y}, \ttt{d}\}$ and let  $L = \{w \in \Sigma^* \mid w$ represents a walk with your dog on a leash where you and your dog both end up at the same location\}. For example, we have $\ttt{yyddddyy} \in L$ because you and your dog are never more than two steps apart and both of you end up four steps ahead of where you started; similarly, $\ttt{ddydyy} \in L$. However, $\ttt{yyyyddd} \notin L$, since halfway through your walk you're three steps ahead of your dog; $\ttt{ddyd} \notin L$, because your dog ends up two steps ahead of you; and $\ttt{ddyddyyy} \notin L$, because at one point your dog is three steps ahead of you. Write a regular expression for $L$.
    
    \annotate{Note that, unlike Problem Set Six, you and your dog \textbf{must} end at the same position.}
    
    \item Let $\Sigma = \{\texttt{M, D, C, L, X, V, I}\}$ and let $L = \{w \in \Sigma^* \mid w$ is number less than 2,000 represented in Roman numerals\}. For example, $\texttt{CMXCIX} \in L$, since it represents the number 999, as are the strings \ttt{L} (50), \ttt{VIII} (8), \ttt{DCLXVI} (666), \ttt{CXXXVII} (137), \ttt{CDXII} (412), and \ttt{MDCXVIII} (1,618). However, we have $\ttt{VIIII} \notin L$ (you'll never have four \ttt{I}'s in a row; use \ttt{IX} or \ttt{IV} instead), that $\ttt{MM} \notin L$ (it's a Roman numeral, but it's for 2,000, which is too large), that $\ttt{VX} \notin L$ (this isn't a valid Roman numeral), and that $\ttt{IM} \not\in L$ (the notation of using a smaller digit to subtract from a larger one only lets you use \ttt{I} to prefix \ttt{V} and \ttt{X}, or \ttt{X} to prefix \ttt{L} and \ttt{C}, or \ttt{C} to prefix \ttt{D} and \ttt{M}). The Romans didn't have a way of expressing the number 0, so to make your life easier we'll say that $\varepsilon \in L$ and that the empty string represents 0. (Oh, those silly Romans.) Write a regular expression for $L$. 
    
    (As a note, we're using the ``standard form'' of Roman numerals. You can see a sample of numbers written out this way via \href{http://literacy.kent.edu/Minigrants/Cinci/romanchart.htm}{\underline{this link}}.)
\end{enumerate}
\begin{shaded}
Write the SUNetID of the person who submitted the regular expressions here. 
\end{shaded}

\pagebreak

\section{Finite Languages}

A language $L$ is called \emph{finite} if $L$ contains finitely many strings (that is, $|L|$ is a natural number).

\begin{enumerate}[label*=\roman*.,ref=\roman*]
    \item Given a finite language $L$, explain how to write a regular expression for $L$. Briefly justify your answer; no formal proof is necessary. This shows that all finite languages are regular.
    
    \annotate{Watch for edge cases!}
    
    \begin{shaded}
    Provide a proof here. 
    \end{shaded}
    
    \item Look up the \emph{trie} data structure on Wikipedia. Explain, in your own words, how a trie can be thought of as an NFA for a finite language. This gives another view on why finite languages are regular.
    
    \begin{shaded}
    Provide a proof here. 
    \end{shaded}
\end{enumerate}

\pagebreak

\section{State Elimination}

The state elimination algorithm gives a way to transform a finite automaton (a DFA or a NFA) into a regular expression. It's a really beautiful algorithm once you get the hang of it, so we thought that we'd let you try it out on a particular example.

Let $\Sigma = \{\ttt{a}, \ttt{b}\}$ and let $L = \{w \in \Sigma^* \mid w$ has an even number of \ttt{a}'s and an even number of \ttt{b}'s\}. Below is a finite automaton for $L$ that we've prepared for the state elimination algorithm by adding in a new start state $q_{\text{start}}$ and a new accept state $q_{\text{acc}}$:

\begin{center}
\tikzstyle{accepting}=[double distance=2pt, outer sep=1pt+\pgflinewidth]
\begin{tikzpicture}[shorten >=1pt,node distance=2.2cm,on grid,auto, state/.style={circle, draw, minimum size=1.1cm}] 
   % \node[state, options?] (identifier) [placement] {node label};
   \node[state, initial] (s) {$q_{\text{start}}$};
   \node[state] (0) [right=of s] {$q_0$};
   \node[state] (1) [right=of 0] {$q_1$}; 
   \node[state, accepting] (a) [above=of 0] {$q_{\text{acc}}$};
   \node[state] (2) [below=of 0] {$q_2$}; 
   \node[state] (3) [below=of 1] {$q_3$}; 
    
    % (start identifier) edge [options?] node {edge label} (target identifier)
    %                    edge [options?] node {edge label} (another target identifier)                  
    \path[->] 
    (s) edge node {$\varepsilon$} (0)
    (0) edge [bend left=10] node {$a$} (1)
        edge node {$\varepsilon$} (a)
        edge [bend left=10] node {$b$} (2)
    (1) edge [bend left=10] node {$a$} (0)
        edge [bend left=10] node {$b$} (3)
    (2) edge [bend left=10] node {$b$} (0)
        edge [bend left=10] node {$a$} (3)
    (3) edge [bend left=10] node {$b$} (1)
        edge [bend left=10] node {$a$} (2);
\end{tikzpicture}
\end{center}

We'd like you to use the state elimination algorithm to produce a regular expression for $L$.
\begin{enumerate}[label*=\roman*.,ref=\roman*]
    \item Run two steps of the state elimination algorithm on the above automaton. Specifically, first remove state $q_1$, then remove state $q_2$. Show your result at this point.
    
    \textit{\textcolor{blue}{ Go slowly. Remember that to eliminate a state q, you should identify all pairs of states $q_{in}$ and $q_{out}$ where there's a transition from $q_{in}$ to q and from q to $q_{out}$, then add shortcut edges from $q_{in}$ to $q_{out}$ to bypass state q. Remember that $q_{in}$ and $q_{out}$ can be the same state. If you've done everything right at the end of this stage, none of the transitions you have at this point should have Kleene stars in them. }}
    
    \begin{shaded}
    Provide the resulting automaton here. 
    % You can copy and modify the above automaton drawn using the tikz automata package, 
    % or use a finite automata editor such as http://madebyevan.com/fsm/ 
    % If you choose to draw and then scan/take a picture, *please* make sure that the image is clearly legible in Gradescope!
    \end{shaded}
    
    \item Finish the state elimination algorithm, showing your work. \emph{Submit your resulting regular expression through our regular expression tool} along the lines of what you did in Problem One.
    
    \annotate{You should start seeing Kleene stars appearing as you remove the remaining states.}
    
    \annotate{Not sure whether you have the right answer? \textbf{Test your result thoroughly!}}
    
    \begin{shaded}
    Provide an answer here. 
    \end{shaded}
    
    \item Without making reference to the original automaton given above, give an intuitive explanation for how the regular expression you found in part (ii) works.
    
    \annotate{The regex you've found works by matching strings in a very creative way. Tell us what that way is.}
    
    \begin{shaded}
    Provide an answer here. 
    \end{shaded}
\end{enumerate}

\pagebreak

\section{At Least Three Point Five}

The Myhill-Nerode theorem is a powerful tool for finding nonregular languages, but it can take some adjusting to get used to.

Let $\Sigma = \{\ttt{a}, \ttt{b}\}$ and consider the following language:
\begin{equation*}
L = \{w \in \Sigma^* \mid \text{there are at least as many \ttt{a}'s as \ttt{b}'s in }w\}.
\end{equation*}
This language $L$ isn't regular; make sure you have an intuition for why this is. 

Below is an attempted proof that $L$ isn't regular. Although the claim it's proving is indeed true, this proof contains an error that renders it incorrect.
\begin{quote}
\emph{Theorem:} $L$ is not a regular language.

\emph{Proof:} Consider the set $S = \{ a^n\ |\ n \in \N \}$. This set is infinite because it contains one string for each natural number. We will prove that any two distinct strings in $S$ are distinguishable relative to $L$. To do so, consider any distinct strings $\ttt{a}^m, \ttt{a}^n \in S$, and assume without loss of generality that $m > n$. Then $\ttt{b}^m\ttt{a}^m \in L$ because this string contains the same number of \ttt{a}'s and \ttt{b}'s, but $\ttt{b}^m\ttt{a}^n \notin L$ because it contains $m$ \ttt{b}'s and $n$ \ttt{a}'s and $m > n$. Therefore, we see that $\ttt{a}^m \not\equiv_L \ttt{a}^n$. This means that $S$ is an infinite set of strings that are pairwise distinguishable relative to $L$. Therefore, by the Myhill-Nerode theorem, $L$ is not regular. \qedsymbol
\end{quote}

What's wrong with this proof? Be as specific as possible.

\annotate{The best way to identify a flaw in a proof is to point to a specific claim that's being made that's not true or not properly substantiated and to explain why.}

\begin{shaded}
Write your answer here.
\end{shaded}

\pagebreak

\section{Embracing the Braces}

Let $\Sigma$ be an alphabet containing two characters, the open curly brace character \lcurly and the close curly brace character \rcurly. Consider the following language over $\Sigma$:
\begin{equation*}
L_1 = \{w \in \Sigma^* \mid w\text{ is a string of balanced curly braces}\}
\end{equation*}
For example, we have $\lcurly\rcurly \in L_1$, $\lcurly\lcurly\rcurly\rcurly \in L_1$, $\lcurly\lcurly\rcurly\lcurly\rcurly\rcurly\lcurly\rcurly \in L_1$, $\varepsilon \in L_1$, and $\lcurly\lcurly\rcurly\rcurly\lcurly\lcurly\lcurly\rcurly\lcurly\rcurly\rcurly\rcurly \in L_1$, but $\rcurly\lcurly\ \not\in L_1$,
$\lcurly\lcurly\rcurly \not\in L_1$, and $\lcurly\lcurly\lcurly\rcurly\rcurly\rcurly\rcurly \not\in L_1$. This question explores properties of this language.

\begin{enumerate}[label*=\roman*.,ref=\roman*]
    \item Prove that $L_1$ is not a regular language. One consequence of this result -- which you don't need to prove -- is that most languages that support some sort of nested parentheses, such as most programming languages and HTML, aren't regular and so can't be parsed using regular expressions.

    \annotate{As with all problems involving nonregular languages, proceed with this one in stages. First, ask yourself: if you were reading an input string from left to right, what information would you have to keep track of? The Myhill-Nerode theorem asks you to find an infinite set of strings that are all pairwise distinguishable, so try creating an infinite set of strings, one for each possible value that this information could take on.}
    
    \begin{shaded}
    Provide a proof here. 
    \end{shaded}
\end{enumerate}

Let's say that the \emph{nesting depth} of a string of balanced parentheses is the maximum number of unmatched open parentheses at any point inside the string. For example, the string \lcurly\lcurly\rcurly\lcurly\lcurly\rcurly\rcurly\rcurly\ has nesting depth three, the string \lcurly\lcurly\rcurly\lcurly\rcurly\rcurly\lcurly\rcurly\ has nesting depth two, and the string $\varepsilon$ has nesting depth zero. 

Consider the language $L_2 = \{w \in \Sigma^* \mid w$ is a string of balanced curly braces with nesting depth at most 4\}. For example, $\lcurly\rcurly \in L_2$, $\lcurly\lcurly\rcurly\lcurly\rcurly\rcurly \in L_2$, and $\lcurly\lcurly\lcurly\lcurly\rcurly\rcurly\rcurly\rcurly\lcurly\lcurly\rcurly\rcurly \in L_2$, but $\lcurly\lcurly\lcurly\lcurly\lcurly\rcurly\rcurly\rcurly\rcurly\rcurly \notin L_2$ because although it's a string of balanced curly braces, the nesting goes five levels deep.

\begin{enumerate}[resume*]
    \item Design a DFA for $L_2$, showing that $L_2$ is regular. A consequence of \textit{this} result is that while you can't parse all programs or HTML with regular expressions, you can parse programs with low nesting depth or HTML documents without deeply-nested tags using regexes. \emph{Please submit this DFA using the DFA editor on the course website} and tell us on GradeScope who submitted it.
    
    \begin{shaded}
    Write the SUNetID of the person who submitted the DFA here. 
    \end{shaded}
    
    \item Look back at your proof from part (i) of this problem. Imagine that you were to take that exact proof and blindly replace every instance of ``$L_1$'' with ``$L_2$.'' This would give you a (incorrect) proof that $L_2$ is nonregular (which we know has to be wrong because $L_2$ is indeed regular.) Where would the error be in that proof? Be as specific as possible.
    
    \annotate{Again, you should be able to point at a specific spot in the proof that contains a logic error and explain exactly why the statement in question is not true or not supported by the preceding statements. If you can't do this, it likely means you have an error in your proof from part (i)!}
    
    \begin{shaded}
    Provide an answer here. 
    \end{shaded}
    
\end{enumerate}

\pagebreak 

\section{State Lower Bounds}

The Myhill-Nerode theorem we proved in lecture is actually a special case of a more general theorem about regular languages that can be used to prove lower bounds on the number of states necessary to construct a DFA for a given language.

\begin{enumerate}[label*=\roman*.,ref=\roman*]
    \item Let $L$ be a language over $\Sigma$. Suppose there's a set $S$, which may be finite and which may be infinite, such that any two distinct strings $x, y \in S$ are distinguishable relative to $L$ (that is, $x \not\equiv_L y$ for any two strings $x, y \in S$ where $x \ne y$.) Prove that any DFA for $L$ must have at least $|S|$ states. (You sometimes hear this referred to as \textit{lower-bounding} the size of any DFA for $L$.)
    
    \begin{shaded}
    Provide a proof here. 
    \end{shaded}
\end{enumerate}

According to old-school Twitter rules, all tweets need to be 140 characters or less. Let $\Sigma$ be the alphabet of characters that can legally appear in a tweet and consider the following language:
\begin{equation*}
\textit{TWEETS} = \{w \in \Sigma^* \mid |w| \leq 140\}.
\end{equation*}
This is the language of all legal tweets, assuming the empty string is a legal tweet. The good news is that this language is regular. The bad news is that any DFA for it has to be pretty large.

\begin{enumerate}[resume*]
\item Using your result from part (i), prove that any DFA for \textit{TWEETS} must have at least 142 states.

\annotate{It might be easier to tackle this problem if you consider replacing 140 and 142 with some smaller numbers (say, 2 and 4) to build up an intuition. And work backwards -- what will you need to do to invoke part (i)?}

\begin{shaded}
Provide a proof here. 
\end{shaded}

\item Define a 142-state DFA for \textit{TWEETS} using the formal 5-tuple definition of a DFA. Briefly explain how your DFA works. No formal proof is necessary.

\annotate{Again, this might be a lot easier to do if you first reduce 140 and 142 to 2 and 4, respectively, and see what you come up with. Start by drawing out what the DFA would look like, then think about how you'd formalize your idea as a 5-tuple.\\
Remember to define the transition function you'll want to use our methods of formally defining a function that we learned earlier in the course; in particular look at piece-wise functions that you made for PSet 4 Question 1.}

\begin{shaded}
Provide an answer here. 
\end{shaded}
\end{enumerate}

Your results from parts (ii) and (iii) collectively establish that the smallest possible DFA for \textit{TWEETS} has exactly 142 states. This approach to finding the smallest object of some type -- using some theorem to prove a lower bound (``we need at least this many states'') combined with a specific object of the given type (``we certainly can't do worse than this'') is a common strategy in algorithm design and computational complexity theory. If you take classes like CS161, CS254, etc., you'll likely see similar sorts of approaches!

\pagebreak

\section{The Extended Transition Function}

As you saw on Problem Set Six, formally speaking, a DFA is a 5-tuple $(Q, \Sigma, \delta, q_0, F)$. You used the 5-tuple definition to pin down edge cases of DFAs. But we can also use this formal definition to rigorously define concepts about automata that, at this point, we've only discussed at a high-level.

Let $D = (Q, \Sigma, \delta, q_0, F)$ be a DFA. We're going to define a function $\delta^* : \Sigma^* \rightarrow Q$ called the extended transition function of D. Intuitively, the function $\delta^*$ takes as input a string and outputs what state that string would end up in if run through the DFA $D$. The function $\delta^*$ is defined, recursively, as follows:
\begin{itemize}
    \item \emph{Base case}: $\delta^*(\varepsilon) = q_0$.
    \item \emph{Recursive case}: If $w \in \Sigma^*$ and $a \in \Sigma$, then $\delta^*(wa) = \delta(\delta^*(w), a)$.
\end{itemize}
That's quite a mouthful, but there's a nice explanation for what's going on here.

\begin{enumerate}[label=\roman*.]
    \item Explain $\delta^*$'s base case in plain English and why that makes sense given what $\delta^*$ represents.
    
    \begin{shaded}
    Write your answer here.
    \end{shaded}
    
    \item Explain $\delta^*$'s recursive case in plain English and why that makes sense given what $\delta^*$ represents.

    \annotate{The notation here is dense, so proceed slowly. What's the input to the function $\delta^*$ in this case? What are w and a? You may want to draw out an actual DFA and try expanding out $\delta^*$ for a couple of strings.}

    \begin{shaded}
    Write your answer here.
    \end{shaded}

    \item Explain why, formally speaking, we can define $L(D) = \{ w \in \Sigma^* \mid \delta^*(w) \in F \}$. 
    
    \annotate{That's a lot of symbols! Write out what each of them mean. Compare this definition against the one from lecture -- see if you can account for why this definition has the same meaning.}
    
    \begin{shaded}
    Write your answer here.
    \end{shaded}

    \item Let $D = (Q, \Sigma, \delta, q_0, F)$ be a DFA, and let $x, y \in \Sigma^*$ be two strings where $\delta^*(x)=\delta^*(y)$. Prove that, for any string $z \in \Sigma^*$, we have $\delta^*(xz) = \delta^*(yz)$.
    
    \annotate{Your proof will need to rely on the definition of $\delta^*$. Since $\delta^*$ is defined recursively, what style of proof do you think you might want to use here?}
    
    \annotate{Make sure you have an intuition as to what you're asked to prove here. After you peel back the layers of notation, you're left with a nice statement about the behavior of DFAs that's related to something from lecture. While you should use your intuition to guide you, your proof should specifically use the 5-tuple definition of a DFA and the formal definition of the extended transition function. For example, you should not include statements like ``run the DFA on $xz$'' or ``the DFA ends in an accepting state,'' since you now have more formal notation available to express those ideas.}
    
    \begin{shaded}
    Provide a proof here.
    \end{shaded}
\end{enumerate}

In lecture, we used this theorem about distinguishable strings to prove certain languages aren't regular:
\begin{center}
    \emph{Theorem:} Let $x$ and $y$ be strings where $x \not\equiv_L y$. Then $x$ and $y$ cannot end up \\
    in the same state after being run through any DFA for the language $L$.
\end{center}
We can recast this theorem in terms of the $\delta^*$ function that we just defined above:
\begin{center}
    \emph{Theorem (Formalized):} Let $x$ and $y$ be strings where $x \not\equiv_L y$. Then for any DFA $D$ for $L$, \\
    if $\delta^*$ is the extended transition function for $D$, we have $\delta^*(x) \neq \delta^*(y)$.
\end{center}
Now that you're equipped with the formal definition of $\delta^*$, you can rigorously prove the above statement.

\begin{enumerate}[resume*]
    \item Prove the formalized theorem (the second one). Since the goal is to write a rigorous proof of the theorem, you should not cite the informal one from lecture as part of your proof.
    
    \annotate{Use your intuition about DFAs to think through his one, but as with part (iv) of this problem, use the 5-tuple definition of a DFA and the formal definition of the extended transition function in your proof. For example, you should not use phrases like ``run the DFA on $x$'' or ``the DFA ends up in an accepting state,'' since you have more formal notation at your disposal.}
    
    \annotate{Once you've finished, take a minute to marvel at the fact that you're able to read (and prove!) statements like these. Not bad for seven weeks!}
    
    \begin{shaded}
    Provide a proof here.
    \end{shaded}
\end{enumerate}
\pagebreak

\section{Regular Languages and Equivalence Relations}

Throughout this problem set you've been working with the idea that we can take a language $L$ over some alphabet $Î£$, then work with its distinguishability relation $\not\equiv_L$. A closely related binary relation is the \emph{indistinguishability} relation for L, denoted $\equiv_L$. It's also a binary relation over $\Sigma^*$, and its definition is the negation of the one for distinguishability:
$$x \equiv_L y \text{ if } \forall w \in \Sigma^*. (xw \in L \leftrightarrow yw \in L).$$
Amazingly, this is always an equivalence relation, regardless of what $L$ is!

\begin{enumerate}[label=\roman*.]
    \item Prove that if $L$ is a language over $\Sigma$, then $\equiv_L$ is an equivalence relation over $\Sigma^*$.

    \begin{shaded}
    Provide a proof here.
    \end{shaded}
\end{enumerate}

Whenever you see an equivalence relation, you should immediately start thinking about what its equivalence classes are. Doing so will usually tell you something interesting.

Let's make this more concrete. Let $\Sigma = \{a, b\}$ and consider the language $M = \{w \in \Sigma^* \mid$ the number of \ttt{b}'s in $w$ is congruent to 1 modulo 5 or to 3 modulo 5\}. For example, $\ttt{a\match{b}a} \in M, \ttt{\match{b}aaa\match{b}aaa\match{b}} \in M$, and $\ttt{\match{bbbbbb}} \in M$, but $\ttt{aa} \notin M$ and $\ttt{a\match{bb}a}\notin M$.

\begin{enumerate}[resume*]
    \item How many equivalence classes does the $\equiv_M$ relation have? Briefly describe what those equivalence classes are and give a system of representatives for $\equiv_M$.
    
    \annotate{Need a refresher on systems of representatives? Check out Problem Set Three.}

    \begin{shaded}
    Write your answer here.
    \end{shaded}
\end{enumerate}

You might have noticed that each equivalence class of $\equiv_M$ either consists of a bunch of strings not in $M$ or of a bunch of strings that are in $M$. That's not a coincidence!

\begin{enumerate}[resume*]
    \item Let L be a language over some alphabet $\Sigma$ and let $x \in \Sigma^*$ be some string. Prove that either \textit{every string} in $[x]_{\equiv_L}$ is in $L$ or that \textit{no strings} in $[x]_{\equiv_L}$ are.
    
    \begin{shaded}
    Provide a proof here.
    \end{shaded}
\end{enumerate}

The number of equivalence classes of an equivalence relation is called its \emph{index}; the index of an equivalence relation $R$ is denoted \emph{I(R)}. This quantity might be finite, or it might be an infinite cardinality like $\aleph_0$, or even one of the infinities bigger than that.

Armed with the idea of an index, we can state a powerful theorem about finite automata:
\begin{center}
    \emph{Theorem:} If $L$ is a language over $\Sigma$, then every DFA for $L$ must have at least $I(\equiv_L)$ states.
\end{center}
In other words, there's a connection between the number of equivalence classes of a particular binary relation and the minimum sizes of DFAs for that language!

\begin{enumerate}[resume*]
    \item Prove the above theorem. Feel free to use the \emph{axiom of choice}, which says that every equivalence relation has at least one system of representatives.
    
    \annotate{Proving this theorem is mostly an exercise in connecting together ideas you've seen used in other places. Think about the relationship between indices and systems of representatives, between distinguishability and indistinguishability, and between what you're doing here and what you've done earlier on this problem set.}
    
    \begin{shaded}
    Provide a proof here.
    \end{shaded}
\end{enumerate}

There's a very nice intuition for what this theorem says. You can think of the indistinguishability relation for a language $L$ as pinning down the idea ``a DFA for $L$ can't tell the difference between these two strings.'' If you think back to our intuition behind DFA design -- build a DFA where each state keeps track of some different piece of information -- then you can think of $I(\equiv_L)$ as capturing the number of different pieces of information you'd need to remember. The theorem then says that if you want to build a DFA for a language $L$, you'll need at least one state per piece of information.

\pagebreak

\section*{Optional Fun Problem: Generalized Fooling Sets (Extra Credit)}

In Problem Seven, you saw how to use distinguishability to lower-bound the size of DFAs for a particular language. Unfortunately, distinguishability is not a powerful enough technique to lower-bound the sizes of NFAs. In fact, it's in general quite hard to bound NFA sizes; there's a \$1,000,000 prize for anyone who finds an efficient algorithm (for some precise definition of ``efficient'') that, given an arbitrary NFA, converts it to the smallest possible equivalent NFA! 

Although it's generally difficult to lower-bound the sizes of NFAs, there are some techniques we can use to find lower bounds on the sizes of NFAs. Let $L$ be a language over $\Sigma$. A \emph{generalized fooling set} for $L$ is a set $\mathscr{F} \subseteq \Sigma^* \times \Sigma^*$ is a set with the following properties:

\begin{itemize}
\item For any $(x, y) \in \mathscr{F}$, we have $xy \in L$.
\item For any distinct pairs $(x_1, y_1), (x_2, y_2) \in \mathscr{F}$, we have $x_1y_2 \not\in L$ or $x_2y_1 \not\in L$ (this is an inclusive OR.)
\end{itemize}
Prove that if $L$ is a language and there is a generalized fooling set $\mathscr{F}$ for $L$ that contains $n$ pairs of strings,
then any NFA for $L$ must have at least $n$ states.

\annotate{Don't let the notation scare you off. This is a really cool problem to work through!}

\begin{shaded}
Provide a proof here. 
\end{shaded}

\end{document}