\documentclass[11pt]{article}
\usepackage[T1]{fontenc}
\usepackage{textcomp}
\usepackage{lmodern}
\usepackage[margin=1in]{geometry}
\usepackage[pdftex]{graphicx, color}
\usepackage{mathtools}
\usepackage{listings}
\usepackage[pdfusetitle]{hyperref}

\usepackage{tikz}
\usetikzlibrary{automata,positioning}

\renewcommand{\epsilon}{\varepsilon}
\newcommand{\tikzname}{Ti\emph{k}Z}
\tikzset{shorten >=1pt, node distance=2cm, on grid, baseline={([yshift=-8pt] current bounding box.north)}}

\lstset{basicstyle=\small\ttfamily,breaklines=true}

\title{CS143 Spring 2025 -- Written Assignment 1}

\begin{document}
\begin{center}
% Change this:
{\LARGE{CS143 Spring 2025 -- Written Assignment 1}} \\
{\large Due Thursday, April 17, 2025 11:59 PM PDT}
\end{center}

This assignment covers regular languages, finite automata, and lexical
analysis. You may discuss this assignment with other students and work on the
problems together. However, your write-up should be your own individual work,
and you should indicate in your submission who you worked with, if applicable.
Assignments can be submitted electronically through Gradescope as a
\textsc{pdf} by 11:59 \textsc{pm pdt} on the due date. Please review the course
policies for more information:
\url{http://web.stanford.edu/class/cs143/policies/index.html}. A \LaTeX{}
template for writing your solutions is available on the course website. To
create finite automata diagrams, you can either use the \tikzname{} package
directly by following the examples in the template, or a tool like
\url{https://madebyevan.com/fsm/}.

Questions on assignments or exams that ask for regular expressions should be written using the five operators introduced in lecture 3 (in particular, even though the set of regular languages is closed under negation, there is no notation for negation available).
Other notation is fine so long as you concisely define it first (e.g. $A? := A + \epsilon$ is ok).

\begin{enumerate}
% problem 1
\item Write regular expressions for the following languages over the alphabet $\Sigma = \{0, 1\}$.
  \textbf{Note:} the next question will ask you to write DFAs for each of these languages.
    Starting by writing down a DFA/NFA may or may not be easier than the regular expression.
\begin{enumerate}

  \item The set of strings in which the substring $111$ does \emph{not} appear.
    Example strings in the language: $\epsilon$, $1101$, $0101100011$.

    \textbf{Solution:}

    \vspace{1.5in}
  \item
    The set of strings in which the first, fourth, seventh, $\ldots$ (continuing for every third) character, if present, is a 1.
    Example strings in the language: $1001$, $110100$, $\epsilon$.

    \textbf{Solution:}

\end{enumerate}

\newpage

% problem 2
\item Draw DFAs for each of the languages from question~1. Note that a DFA must have a transition defined for every state and symbol pair. You must take this fact into account for your transformations. Your DFAs should not have more than 10 states. Submissions with unnecessarily complex DFAs may not receive full credit.

Notice that a short regular expression does not automatically imply a DFA with few states, nor vice versa.

\begin{enumerate}
    \item
    \textbf{Solution:}

    \vspace{3.0in}
    \item
    \textbf{Solution:}

\end{enumerate}

\newpage

% Problem 3
\item Using the techniques covered in class, transform the following NFAs over the alphabet $\{a, b, c\}$ into DFAs. Your DFAs should not have more than 10 states.  Note that a DFA must have a transition defined for every state and symbol pair, whereas a NFA need not. You must take this fact into account for your transformations. Hint: Is there a subset of states the NFA transitions to when fed a symbol for which the set of current states has no explicit transition?

Also include a mapping from each state of your DFA to the corresponding states of the original NFA.  Specifically, a state $s$ of your DFA maps to the set of states $Q$ of the NFA such that an input string stops at $s$ in the DFA if and only if it stops at one of the states in $Q$ in the NFA.

Tip: for readability, states in the DFA may be labeled according to the set of states they represent in the NFA.  For example, state $q_{012}$ in the DFA would correspond to the set of states $\{q_0, q_1, q_2\}$ in the NFA, whereas state $q_{13}$ would correspond to set of states $\{q_1, q_3\}$ in the NFA.

\begin{enumerate}
    \item \begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto]
        \node[state,initial] (q_0)   {$q_0$};
        \node[state] (q_1) [above right of=q_0] {$q_1$};
        \node[state,accepting] (q_2) [below right of=q_1] {$q_2$};
        \path[->]
        (q_0) edge [loop below] node {$a$,$b$,$c$} (q_0)
              edge  node  {$a$} (q_2)
              edge  node  {$b$} (q_1)
        (q_1) edge  node  {$c$} (q_2);
    \end{tikzpicture}

\textbf{Solution:}
    \vspace{2in}

    \item \begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto]
        \node[state,initial] (q_0)   {$q_0$};
        \node[state] (q_1) [above right of=q_0] {$q_1$};
        \node[state,accepting](q_2) [below right of=q_1] {$q_2$};
        \path[->]
        (q_0) edge [loop below] node {$a$,$b$} (q_0)
              edge node {$a$} (q_1)
        (q_1) edge node {$c$} (q_2)
        (q_2) edge node {$\epsilon$} (q_0);
    \end{tikzpicture}

\textbf{Solution:}

\end{enumerate}

\newpage

\item

  Starting next week, we will discuss parsing and context free languages.
  The language over $\Sigma = \{0,1\}$ consisting of ``all strings with an unequal number of 0s and 1s''
  turns out to be an example of a language that is context free but not regular (there is no DFA that recognizes it exactly).

  \begin{enumerate}
    \item Give a regular expression for an \textbf{infinite subset} of this language (you can pick  any subset you like, so long as it is regular and contains an infinite number of different strings).
      \vspace{1.0in}
    \item Give a regular expression that matches a \textbf{superset} of this language but does not match \emph{every} string in $\Sigma^*$ (that is, $(0+1)^*$ is not a valid answer).
  \end{enumerate}

      \vspace{1.0in}
  (\emph{both parts have short answers that do not depend on any knowledge of context-free grammars})

\newpage

% Problem 4
\newpage
\item
  Consider the following program written in a hypothetical C-like language:
\begin{verbatim}
    while get() == 0:
        pass
    halt
\end{verbatim}
This language only has a few features:

\paragraph{values} Every value is either 0 or 1.
\paragraph{expressions}
\begin{itemize}
\item Literals for 0 and 1.
\item Each call to \verb|get()| asks the environment (e.g. a human user) to input 0 or 1 and returns the response.
\item The \verb|==| operator evaluates to 1 if its two arguments are the same and 0 otherwise.
\end{itemize}

\paragraph{statements}
\begin{itemize}
\item The \verb|pass| statement does nothing but proceed to the next statement.
\item The \verb|halt| statement ends the execution.
\item To evaluate the \verb|while| construct, we evaluate its condition expression and check if it is 1;
if it is, we execute the body and return to the condition (unless a halt instruction was reached in the body);
if the condition is 0, we continue past the body to the next statement.
We re-evaluate the condition for each iteration.
Syntactically, loop bodies are indented once.
\end{itemize}

\emph{(this spec might not be fully precise, but we aren't trying to trick you; it's just a simple language with loops, input, and a `halt' instruction that behave how you would expect)}

When running this example program, there are three types of execution traces that might arise:
\begin{itemize}
  \item The user \emph{stops responding}; in this case, the program gets stuck and does not reach a halt instruction (example \emph{stuck input}: $000$).
  \item The user responds in such a way that the program continues to transition between states but never reaches a halt instruction (example \emph{non-terminating input}: $0000\ldots$).
  \item The user gives an input which reaches a halt instruction; these are called \emph{terminating inputs} (examples: $1$, $01$, $001$).
    Note that after reaching a halt instruction, no further input is possible, even if there are more statements following it.
\end{itemize}

The set of terminating inputs for this particular program form a regular language, specifically $0^*1$.
For each of the following three programs, give a regular expression that matches the set of {terminating inputs} for it (and no other strings):

\begin{enumerate}
\item
\begin{verbatim}
    while get():
        halt
    while get():
        pass
    halt
\end{verbatim}

\newpage
\item
\begin{verbatim}
    while (get() == get()) == 0:
        pass
    halt
\end{verbatim}

\vspace{0.75in}
\item
\begin{verbatim}
    while get() == 0:
        while get() == 1:
            pass
    halt
\end{verbatim}
\end{enumerate}

\vspace{0.75in}
\textbf{Completely optional food for thought:} Is it possible to write a program in this language whose terminating inputs are \emph{not} a regular language?
If not, could you prove it by writing a program that \emph{compiles} an input program into its regex?
What if the language also had variable expressions and assignment statements that could store a boolean value?


\newpage

% Problem 5
\item Consider the following tokens and their associated regular expressions, given as a \textbf{flex} scanner specification:
\begin{quote}
\begin{lstlisting}
%%
(01|10)                   printf("apple");
1(01)*0                   printf("banana");
(1011*0|0100*1)           printf("coconut");
\end{lstlisting}
\end{quote}
Give an input to this scanner such that the output string is $((\mathtt{apple})^3 \mathtt{banana})^3 ((\mathtt{apple})\mathtt{coconut})^2$,
where $\mathtt{A}^i$ denotes $\mathtt{A}$ repeated $i$ times.   (And, of course, the parentheses are not part of the output.)
You may use similar shorthand notation in your answer.

\textbf{Solution:}


\newpage

% Problem 6
\item Recall from the lecture that, when using regular expressions to scan an input, we resolve conflicts greedily by taking the largest possible match at any point.
  That is, if we have the following \textbf{flex} scanner specification:
\begin{quote}
\begin{lstlisting}
%%
do                      { return T_Do; }
[A-Za-z_][A-Za-z0-9_]*  { return T_Identifier; }
\end{lstlisting}
\end{quote}
and we see the input string ``\texttt{dot}'', we will match the second rule and emit T\_Identifier for the whole string, not T\_Do.

However, it is possible to have a set of regular expressions for which we can tokenize a particular string, but for which taking the largest possible match will fail to break the input into tokens.
Give an example of no more than two regular expressions and an input string such that: a) the string can be broken into substrings, where each substring matches one of the regular expressions, b) our usual lexer algorithm, taking the largest match at every step, will fail to break the string in a way in which each piece matches one of the regular expressions.
Explain how the string can be tokenized and why taking the largest match won't work in this case.

As an optional challenge (no extra credit), try to find a solution that only uses one regular expression.

\textbf{Solution}:

\end{enumerate}
\end{document}
