Today: while, break, parsing, parse words out of string patterns

Data and Parsing

Here's some fun looking data...

$GPGGA,005328.000,3726.1389,N,12210.2515,W,2,07,1.3,22.5,M,-25.7,M,2.0,0000*70
$GPGSA,M,3,09,23,07,16,30,03,27,,,,,,2.3,1.3,1.9*38
$GPRMC,005328.000,A,3726.1389,N,12210.2515,W,0.00,256.18,221217,,,D*78
$GPGGA,005329.000,3726.1389,N,12210.2515,W,2,07,1.3,22.5,M,-25.7,M,2.0,0000*71
$GPGSA,M,3,09,23,07,16,30,03,27,,,,,,2.3,1.3,1.9*38
$GPRMC,005329.000,A,3726.1389,N,12210.2515,W,0.00,256.18,221217,,,D*79
$GPGGA,005330.000,3726.1389,N,12210.2515,W,2,07,1.3,22.5,M,-25.7,M,3.0,0000*78
$GPGSA,M,3,09,23,07,16,30,03,27,,,,,,2.3,1.3,1.9*38
...

The above is what a GPS chip outputs
-buried deep in your phone, this is going on
A standard: NMEA
- NMEA 0183 (Wikipedia)
Things to notice: it's just text
-a series of text lines ending with \n
-each line is made of chars
Text is a super common exchange format between systems
"Parsing"
-Have "raw" text like this
-Find and pull out the data you want
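
A minimal sketch of "find and pull out", using the first line of the data above. It uses s.find() and slices to grab the first two comma-separated fields. The variable names are just for illustration, and nothing is assumed here about what each field means.

```python
line = '$GPGGA,005328.000,3726.1389,N,12210.2515,W,2,07,1.3,22.5,M,-25.7,M,2.0,0000*70'

# Find the first two commas
comma1 = line.find(',')
comma2 = line.find(',', comma1 + 1)

kind = line[:comma1]             # text before the first comma
field = line[comma1 + 1:comma2]  # text between the two commas

print(kind)   # $GPGGA
print(field)  # 005328.000
```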

Foreshadow: Advance With `var += 1`

Framing for today's example
Imagine a string on paper
Moving your finger right looking for something
Python: have some var index into string
var += 1 .. like moving one to the right
e.g. looking for the space after 'abc'

'xx @abc @xyz xx'
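
A rough sketch of the "moving finger" idea for the string above: start i at the 'a' of 'abc', then use i += 1 to move right until reaching the space. The variable name i is just for illustration.

```python
s = 'xx @abc @xyz xx'

i = s.find('abc')   # i starts at the 'a', index 4
# Move right one char at a time until reaching a space
while i < len(s) and s[i] != ' ':
    i += 1

print(i)  # 7, the index of the space after 'abc'
```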

See "parse1" Problems on Server

See the "parse1" examples on the experimental server

Start off with some Notes, use them later on real examples

Note 1: Loop Break - censored()

"break" exits the loop immediately
Jumps to the line after the loop, as if it had finished normally
1. Write loop normally
2. Can use if-break inside to break out early
All languages have this, typically called "break"
Can write censored with break nicely
What are the 2 ways to end up at the "return nums" line?

Censor problem: censored(n, censor): Given a non-negative int n. Return a list of the ints [1, 2, 3, ... n]. Except as soon as a number in the given censor list is seen, end the list without that number, so censored(10, [5, 4]) returns [1, 2, 3]. Use "break"

Solution

def censored(n, censor):
    nums = []
    for i in range(1, n + 1):
        if i in censor:  # Key: if-break
            break
        nums.append(i)
    # "break" jumps to here
    return nums
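
To see the 2 ways of reaching the "return nums" line, here is the same solution run on two inputs: one where the break fires, and one where the loop finishes normally. The second input is made up for illustration.

```python
def censored(n, censor):
    nums = []
    for i in range(1, n + 1):
        if i in censor:
            break
        nums.append(i)
    return nums


# Way 1: break fires when i is 4
print(censored(10, [5, 4]))  # [1, 2, 3]

# Way 2: no censored number seen, loop runs to completion
print(censored(3, [7]))      # [1, 2, 3]
```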

Note 2: Can you control `i` Within Range Loop? No!

Tempting: inside a loop, change i with = for some case. This does not work at all. At the top of each iteration, for/range uses = to set i to the next number, overwriting whatever change we made. Therefore: we cannot use for/i/range if we want to adjust i with our own code.

def censored(n, censor):
    nums = []
    for i in range(n):
        ....
        if something:
            i += 10    # NO does not work
        ...
    return nums
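
A small runnable demonstration of the point above. The function name visit() is hypothetical. It records each i the loop body sees, and tries to jump ahead every time.

```python
def visit(n):
    seen = []
    for i in range(n):
        seen.append(i)
        i += 10    # changes i only until the loop-top sets it again
    return seen


print(visit(4))  # [0, 1, 2, 3] .. the += 10 had no effect on the sequence
```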

Aside: `continue`

Rarely used, similar to break: continue
Jumps to the top of the loop, does the next iteration
So in effect, skip this iteration
If censored() used this instead of break, it would skip over each censored number, but continue going through all the numbers
We're never going to use this in CS106A, so just FYI
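
Just FYI, a sketch of what that would look like. The name skip_censored() is made up here to keep it apart from the real censored().

```python
def skip_censored(n, censor):
    nums = []
    for i in range(1, n + 1):
        if i in censor:
            continue   # skip this i, go on to the next iteration
        nums.append(i)
    return nums


print(skip_censored(10, [5, 4]))  # [1, 2, 3, 6, 7, 8, 9, 10]
```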

`for i/range` vs. `while`

The for/i/range form is great for going through numbers which you know ahead of time - a common pattern in real programs. However, while is more flexible - can test as we go, not needing to know ahead of time. Ultimately you need both forms.

Note 3: while Equivalent of for/range

Use for/i/range if you have a series of numbers to step through. That is a common case, and for/i/range is perfect for it. We'll use while for situations that require more flexibility.

for i in range(n) - go-to solution for that sequence
Can write this as a while .. do steps manually
Three parts: init, test, update
Use range() for common cases
Use while where need fine control of i (examples to follow)
Beware: easy to forget update step, result is infinite loop
for/range is so common .. we don't have muscle-memory for the update

Here is the while-equivalent to for i in range(n)

i = 0         # 1. init
while i < n:  # 2. test
    # use i
    i += 1    # 3. update, loop-bottom (easy to forget)
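
To check that the two forms really visit the same numbers, here is a sketch with both loops side by side. The function names count_for() and count_while() are hypothetical.

```python
def count_for(n):
    nums = []
    for i in range(n):
        nums.append(i)
    return nums


def count_while(n):
    nums = []
    i = 0            # 1. init
    while i < n:     # 2. test
        nums.append(i)
        i += 1       # 3. update
    return nums


print(count_for(4))    # [0, 1, 2, 3]
print(count_while(4))  # [0, 1, 2, 3]
```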

Example while_double()

double_char() written as a while (using a range() is easier for this problem, so this just demonstrates what while would look like)

def while_double(s):
    result = ''
    i = 0
    while i < len(s):
        result += s[i] + s[i]
        i += 1
    return result

Recall: s.find(target, start)

The str.find() function takes an optional second parameter, the start index, which indicates where to begin scanning for the target. The default start index is 0.

>>> s = 'xx[abc[xx'
>>> s.find('[')     # start = 0, the default
2
>>> s.find('[', 2)  # start = 2, no help
2
>>> s.find('[', 3)  # start = 3, find next one
6

Today - Bad news / Good news

Here we have three main code examples for today. Bad news: these are challenging. Good news: these follow a parsing pattern that we'll use again and again so you get used to it. It will also be useful in your future career.

1. all_lefts()

This function is a halfway point: it works out about half of the difficulties of the full bracket-parsing problem that follows.

Want - find all the left brackets in s, return a list of them:

'xx[xxx[xxxx[xxx' -> ['[', '[', '[']

Figure Out all_lefts()

Challenges:

Find the '['
Slice out the substring
Update search variable at loop bottom
Exit: detect no more brackets

Space for a drawing, work out the algorithm:



'xx[xxx[xxxx[xxx'

Given string s. Return a list of all the '[' strings in s. Use s.find() within a while loop to find all the '['.

Use search as index variable, marking current position of search within s - this will be a stereotypical pattern for searching through a string. Starting code - work from here with drawing

def all_lefts(s):
    search = 0
    result = []
    while search < len(s):
        # code in here:
        # -s.find() to find the '['
        # -slice to grab '['
        # -update search = ???
        # -if/break when no more '['
    return result

Solution

def all_lefts(s):
    search = 0
    result = []
    while search < len(s):
        # Find '[', at search index
        left = s.find('[', search)
        # No '[' -> exit loop
        if left == -1:
            break
        result.append(s[left:left + 1])
        # Update search for next iteration
        search = left + 1
    return result
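
As a stand-in for the drawing, here is a sketch that records the (search, left) values on each iteration of the same loop. The function all_lefts_trace() is hypothetical, written only to show how the two variables move.

```python
def all_lefts_trace(s):
    search = 0
    trace = []
    while search < len(s):
        left = s.find('[', search)
        trace.append((search, left))  # record values for this iteration
        if left == -1:
            break
        search = left + 1
    return trace


print(all_lefts_trace('xx[xxx[xxxx[xxx'))
# [(0, 2), (3, 6), (7, 11), (12, -1)]
```

The last pair shows the exit: starting the search at 12 finds no more '[', so left is -1 and the loop breaks.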

(Drawing: s=search and L=left marked on top of the input string to work out the algorithm)

2. all_brackets()

Now use both left and right brackets, so this is more realistic. Find all bracket-pairs in s, return a list of all the contained data

'xx[abc]xxx[hi]xxx[woot]xxx' -> ['abc', 'hi', 'woot']

Given string s. Return a list of all the 'abc' strings for each '[abc]' substring within s. For each left bracket, find the right bracket that follows it. End the search if no left or right bracket is found. Use s.find() within a while loop.

Use the all_lefts() code as a starting point. Challenges:

Find the ']' after the '[', slice out data
Cases to exit the loop?
Update search variable at loop bottom

Figure Out all_brackets()

Input case: 'xx[abc]xx[42]xx' -> ['abc', '42']

Draw on top of the input to work out the algorithm:



'xx[abc]xx[42]xx'

Standard CS106A steps - draw an input case, introduce vars left + right on the drawing to work out the details. Run it.

Starter code:

def all_brackets(s):
    search = 0
    result = []
    while search < len(s):
        left = s.find('[', search)
        if left == -1:
            break

        # Your code here





    return result

all_brackets() Solution

(Drawing: s=search, L=left, and R=right marked on top of the input string to work out the algorithm)

def all_brackets(s):
    search = 0
    result = []
    while search < len(s):
        left = s.find('[', search)
        if left == -1:
            break
        right = s.find(']', left)  # or left+1
        if right == -1:
            break
        result.append(s[left + 1:right])
        # Update search at loop end
        search = right  # or right+1
    return result

Q: What does `while search < len(s):` do?

What does test do: while search < len(s):
Turns out: nothing!
Try changing it to while True:
Now the if/break logic takes care of exiting the loop always
Think about the s = 'xxx' case
no '[' at all
trace the code for this case
works fine
Code pattern:
Use while True: at the top
Use if/break logic inside the loop to detect "done" condition
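
A sketch of all_brackets() with the test changed to while True:, so the if/break logic is the only way out of the loop. The inputs below include the no-bracket case and a made-up case with a missing right bracket.

```python
def all_brackets(s):
    search = 0
    result = []
    while True:
        left = s.find('[', search)
        if left == -1:      # exit 1: no more '['
            break
        right = s.find(']', left)
        if right == -1:     # exit 2: '[' without a ']'
            break
        result.append(s[left + 1:right])
        search = right
    return result


print(all_brackets('xxx'))              # []
print(all_brackets('xx[abc]xx[42]xx'))  # ['abc', '42']
print(all_brackets('xx[abc]xx[42'))     # ['abc']
```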

When to Use `while True:`

When to use while True:
If the "done" checking requires a couple steps
e.g. calling s.find() and saving it in a var, then do the check
If the test naturally appears halfway through the body steps
For complex cases: while-True / if-break
For simple cases: while test:
If the loop can be done with test at top .. great
Don't use if/break unless required by test structure
Most problems don't require if/break

`search = right + 1` ?

(optional)
Update search var at bottom of loop
What value works there?
The simplest is fine:
search = right
Slight optimization:
We know that "right" is not a left bracket
So we could begin the search 1 char later
search = right + 1
Both of these are fine
Maybe KISS is better, omitting the + 1
But it's a little satisfying to put it in
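
To check that both updates give the same answer, here is a sketch where the update amount is passed in. The extra parameter is hypothetical, added only for this comparison: 0 means search = right, 1 means search = right + 1.

```python
def all_brackets(s, extra):
    # extra is a made-up parameter for this demo: 0 or 1
    search = 0
    result = []
    while search < len(s):
        left = s.find('[', search)
        if left == -1:
            break
        right = s.find(']', left)
        if right == -1:
            break
        result.append(s[left + 1:right])
        search = right + extra
    return result


s = 'xx[abc]xxx[hi]xxx[woot]xxx'
print(all_brackets(s, 0))  # ['abc', 'hi', 'woot']
print(all_brackets(s, 1))  # ['abc', 'hi', 'woot']
```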

3. at_words() Example

Demonstrate several patterns on this one
Find all the @aaaa words in a string, return in a list
Where 'a' is any alphabetic char
'xx @abc @xyz xx' -> ['abc', 'xyz']
Not just the first word, all the words
Establishes code patterns we'll re-use
We'll work through this one carefully
Points about this code:
Keep "search" index - how far we are through s
Use while True + if/break to detect done
Use str.find() to locate each @
Use a nested while to skip over alpha chars to find end
Use < len(s) to protect use of s[xxx]
Remember to update "search" at loop bottom
Var names: search, at, end - try to keep things straight

at_words(s): For each '@' in s, parse out the "word" substring of 1 or more alphabetic chars which immediately follow the '@', so '@abc @ @xyz' returns ['abc', 'xyz'].

'xx @abc xx @xyz' -> ['abc', 'xyz']

at_words() #1

Say we have a loop structure to find the '@' as we have before

    at = s.find('@', search)
    if at == -1:
        break

    end = at + 1
    # loop, advance end past alpha chars



'xx @abc @xyz xx'

at_words(): Find End of Alpha Chars

AKA skip over the alpha chars
Loop test: when true, keep advancing end
Test: s[end].isalpha()
The loop below is close
One bug case to fix below

    end = at + 1
    while s[end].isalpha():
        end += 1

at_words(): After "end" computed

Use slice to pull out word
Then advance search for next iteration

    word = s[at + 1:end]
    result.append(word)
    search = end

at_words(): End Bug

There is a bug in the end/loop. It has to do with this input case:



'@abc'
 0123

Here is the code again. Think about how the loop works when advancing "end" for '@abc':

    end = at + 1
    while s[end].isalpha():
        end += 1

Problem: the loop keeps advancing "end" .. past the end of the string, eventually end is 4. Then the code s[end].isalpha() raises an IndexError since end (4) is past the end of the string.

The loop above translates to: "advance end so long as the char it refers to is alphabetic"

To fix the bug, we modify the test to: "advance end so long as it refers to a char that exists and that char is alphabetic"

Fix End Bug

Bug: run end off the end of s, testing non-existent s[end]
e.g. this happens if input is s = '@abc'
Think through how the loop works for that case
Solution:
Add guard: end < len(s)
This is the fixed loop:
while end < len(s) and s[end].isalpha():
Q: How to test if index i is valid in s?
A: i < len(s)
Only check s[end] after checking that end is valid
Boolean Short Circuit
Python evaluates the expression left-right
As soon as the boolean value is determined, it stops evaluating
A False on the left side of an and stops it
So the < guards the s[end].isalpha()
Common guard pattern:
Check i < len(s) before trying s[i]
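
A small sketch of the guard in action for s = '@abc' with end at 4. The try/except is used here only to show the error without stopping the program, and the names safe and crashed are just for illustration.

```python
s = '@abc'
end = 4  # one past the last valid index, which is 3

# Guard first: end < len(s) is False, so s[end] is never evaluated
safe = end < len(s) and s[end].isalpha()
print(safe)  # False

# Without the guard, s[end] raises IndexError
try:
    s[end].isalpha()
    crashed = False
except IndexError:
    crashed = True
print(crashed)  # True
```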

at_words() Case #2 - Zero Chars

What about 'xx@abc @ @xyz'
Consider slice of middle @ above
s[at + 1:end]
Turns out to be like s[8:8]
Which is the empty string ''
Add logic to screen out empty string:
if len(word) > 0: result.append(word)
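
Working the numbers for the middle '@' of that input, as a quick sketch. The find() start index of 6 is chosen by hand here to land on the middle '@'.

```python
s = 'xx@abc @ @xyz'

at = s.find('@', 6)   # the middle '@', index 7
end = at + 1
# s[8] is a space, so this loop does not advance end at all
while end < len(s) and s[end].isalpha():
    end += 1

word = s[at + 1:end]  # s[8:8]
print(at, end, len(word))  # 7 8 0
```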

at_words() Solution

def at_words(s):
    search = 0
    words = []
    while True:
        at = s.find('@', search)
        if at == -1:
            break
            
        # Pass over alpha chars to find end
        end = at + 1
        while end < len(s) and s[end].isalpha():
            end += 1
        
        word = s[at + 1:end]
        # Screen out len-0 word
        if len(word) > 0:
            words.append(word)
        
        # Set up next iteration
        search = end
    return words

Extra Example exclaim_words()

Like at_words, but right-left
'x hey!@ho!' -> ['hey!', 'ho!']

exclaim_words(s): For each '!' in s, parse out the "word" substring of one or more alphabetic chars which are immediately to the left of the '!'. Return a list of all such words including the '!', so 'x hey!@ho!' returns ['hey!', 'ho!']. (Like at_words, but right-to-left)
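
One possible sketch, following the at_words() pattern but moving a "start" index leftwards with -= 1. This is not necessarily the intended solution. The variable names exclaim and start are assumptions, and the guard start >= 0 plays the role that end < len(s) played before.

```python
def exclaim_words(s):
    search = 0
    words = []
    while True:
        exclaim = s.find('!', search)
        if exclaim == -1:
            break

        # Move start left over alpha chars, guarding the left edge of s
        start = exclaim - 1
        while start >= 0 and s[start].isalpha():
            start -= 1

        # start is now one to the left of the word
        word = s[start + 1:exclaim + 1]
        # Screen out a bare '!' with no alpha chars
        if len(word) > 1:
            words.append(word)

        # Set up next iteration
        search = exclaim + 1
    return words


print(exclaim_words('x hey!@ho!'))  # ['hey!', 'ho!']
```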