Python Strings

A Python string, like 'Hello', stores text as a sequence of individual characters. Text is central to many computations - URLs, chat messages, the underlying HTML code that makes up web pages.

Python strings are written between single quote marks like 'Hello' or alternately they can be written in double quote marks like "There".

a = 'Hello'
b = "isn't"

Each character in a string is drawn from the unicode character set, which includes the "characters" of pretty much every language on earth, plus many emojis. See the unicode section below for more information.

String len()

The len() function returns the length of a string, the number of chars in it. It is valid to have a string of zero characters, written just as '', called the "empty string". The length of the empty string is 0. The len() function in Python is omnipresent - it's used to retrieve the length of many data types, such as lists and other collections, with string just a first example.

>>> s = 'Python'
>>> len(s)
6
>>> len('')   # empty string
0

Convert Between Int and String

The formal name of the string type is "str". The str() function serves to convert many values to a string form. For example, this code computes the str form of the number 123:

>>> str(123)
'123'

Looking carefully at the values, 123 is a number, while '123' is a string of length 3, made of the three chars '1', '2', and '3'.

Going the other direction, the formal name of the integer type is "int", and the int() function takes in a value and tries to convert it to be an int value:

>>> int('1234')
1234
>>> int('xx1234')   # fails due to extra chars
ValueError: invalid literal for int() with base 10: 'xx1234'
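
When the text comes from a user or a file, it may not hold a valid number. Here is a minimal sketch of one way to handle that, using try/except to catch the ValueError shown above. The function name parse_int and its default parameter are made up for this illustration:

```python
def parse_int(text, default=None):
    """Return text converted to an int, or default if it is not a valid int."""
    try:
        return int(text)
    except ValueError:
        # int() raises ValueError for text like 'xx1234'
        return default


print(parse_int('1234'))      # 1234
print(parse_int('xx1234'))    # None
print(parse_int('oops', 0))   # 0
```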

String Indexing [ ]

Chars are accessed with zero-based indexing with square brackets, so the first char is at index 0, the next is at index 1, and the last char is at index len-1.

string 'Python' shown with index numbers 0..5

Accessing a too large index number is an error. Strings are immutable, so they cannot be changed once created. Code to compute a different string always creates a new string in memory to represent the result (e.g. + below), leaving the original strings unchanged.

>>> s = 'Python'
>>> len(s)
6
>>> s[0]
'P'
>>> s[1]
'y'
>>> s[5]
'n'
>>> s[6]
IndexError: string index out of range
>>> s[0] = 'x'   # no, string is immutable
TypeError: 'str' object does not support item assignment

String +

The + operator combines (aka "concatenates") two strings to make a bigger string. This creates new strings to represent the result, leaving the original strings unchanged.

>>> s1 = 'Hello'
>>> s2 = 'There'
>>> s3 = s1 + ' ' + s2
>>> s3
'Hello There'
>>> s1
'Hello'

Concatenation with + only works between strings, not, for example, between a string and an int. Call the str() function to make a string out of the int, then concatenation works.

>>> 'score:' + 6
TypeError: can only concatenate str (not "int") to str
>>> 'score:' + str(6)
'score:6'
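
Because strings are immutable, code like s = s + '!' does not change the old string. It builds a new string and makes the variable refer to it. The sketch below illustrates this with made-up variable names, and also shows the common pattern of building up a result string in a loop:

```python
s1 = 'Hello'
s2 = s1            # s2 refers to the same string as s1
s1 = s1 + '!'      # + builds a new string; s1 now refers to the new one

print(s1)   # Hello!
print(s2)   # Hello (the original string is unchanged)

# Build up a result string piece by piece, converting each int with str()
result = ''
for n in [1, 2, 3]:
    result = result + str(n) + ' '
print(result)   # 1 2 3 (with a trailing space)
```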

String in

The in operator checks, True or False, if something appears anywhere in a string. In this and other string comparisons, characters must match exactly, so 'a' matches 'a', but does not match 'A'. (Mnemonic: this is the same word "in" as used in the for-loop.)

>>> 'c' in 'abcd'
True
>>> 'c' in 'ABCD'
False
>>> 'aa'  in 'iiaaii'  # test string can be any length
True
>>> 'aaa' in 'iiaaii'
False
>>> '' in 'abcd'       # empty string is always found: True
True

Character Class Tests

It's handy to divide characters into broad classes: "alphabetic" chars like 'abc' that make words, "digits" '0', '1' to make numbers, "space" chars like space, tab, and newline. Then there are all the other miscellaneous characters like '$' and '^' which are not alphabetic, digit, or space.

These test functions return True if all the chars in s are in the given class:

s.isdigit() - True if all chars in s are digits '0..9'

s.isalpha() - True for alphabetic word char, i.e. a-z A-Z (applies to "word" characters in other unicode alphabets too like 'Σ')

s.isalnum() - alphanumeric, just combines isalpha() and isdigit()

s.isspace() - True for whitespace char, e.g. space, tab, newline

s.isupper(), s.islower() - True for uppercase / lowercase alphabetic. False for characters like '9' and '$' which do not have upper/lower versions.

>>> '6'.isdigit()
True
>>> 'a'.isalpha()
True
>>> '$'.isalpha()
False
>>> s = '\u03A3'
>>> s
'Σ'
>>> s.isalpha()
True
>>> 'a'.islower()
True
>>> 'a'.isupper()
False
>>> '$'.islower()
False
>>> '\n'.isspace()
True
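
The examples above test one char at a time, but these functions look at every char in the string. A short sketch of that behavior, plus a hypothetical helper count_digits (not a built-in) that applies the test char by char:

```python
def count_digits(s):
    """Return the number of digit chars in s."""
    count = 0
    for ch in s:
        if ch.isdigit():
            count += 1
    return count


print('123'.isdigit())         # True, every char is a digit
print('12a'.isdigit())         # False, 'a' is not a digit
print(''.isdigit())            # False for the empty string
print(count_digits('a1b22'))   # 3
```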

Startswith EndsWith

These functions return a boolean True/False depending on what appears at one end of a string. They are handy when you need to check for something at an end, e.g. if a filename ends with '.html'.

s.startswith(x) - True if s starts with string x

s.endswith(x) - True if s ends with string x

>>> 'Python'.startswith('Py')
True
>>> 'Python'.startswith('Px')
False
>>> 'resume.html'.endswith('.html')
True
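
As a sketch of the filename use case, the function below picks out the '.html' names from a list. The function name html_files and the sample filenames are invented for this example. It calls lower() first so that '.HTML' counts too, which is a design choice rather than a requirement:

```python
def html_files(filenames):
    """Return a list of the filenames that end with '.html', ignoring case."""
    result = []
    for name in filenames:
        # endswith() matches exactly, so lowercase the name before testing
        if name.lower().endswith('.html'):
            result.append(name)
    return result


print(html_files(['resume.html', 'notes.txt', 'INDEX.HTML']))
# ['resume.html', 'INDEX.HTML']
```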

String find()

s.find(x) - searches s left to right, returns int index where string x appears, or -1 if not found. Use s.find() to compute the index where a substring first appears.

>>> s = 'Python'
>>> s.find('y')
1
>>> s.find('tho')
2
>>> s.find('xx')
-1

s.find(x, start_index) - variant of find(), beginning the search at the given index
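
The start_index variant is handy for finding every occurrence, not just the first: after each hit, search again starting one past it. A minimal sketch, where find_all is a made-up helper name and x is assumed to be non-empty:

```python
def find_all(s, x):
    """Return a list of every index where x appears in s."""
    result = []
    i = s.find(x)
    while i != -1:
        result.append(i)
        # Resume the search just past the previous hit
        i = s.find(x, i + 1)
    return result


print(find_all('banana', 'an'))   # [1, 3]
print(find_all('banana', 'x'))    # []
```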

Change Upper/Lower Case

s.lower() - returns a new version of s where each char is converted to its lowercase form, so 'A' becomes 'a'. Chars like '$' are unchanged. The original s is unchanged - a good example of strings being immutable. Each unicode alphabet includes its own rules about upper/lower case.

s.upper() - returns an uppercase version of s

>>> s = 'Python123'
>>> s.lower()
'python123'
>>> s.upper()
'PYTHON123'
>>> s
'Python123'
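
A common use of lower() is comparing or searching text without regard to case: convert both sides to lowercase, then compare. The two helper names below are hypothetical, and this is only a sketch of the idea:

```python
def same_ignoring_case(a, b):
    """Return True if a and b are the same apart from upper/lower case."""
    return a.lower() == b.lower()


def contains_ignoring_case(s, x):
    """Return True if x appears in s, ignoring upper/lower case."""
    return x.lower() in s.lower()


print(same_ignoring_case('Python', 'PYTHON'))    # True
print(contains_ignoring_case('Python', 'YTH'))   # True
print('YTH' in 'Python')                         # False, exact match required
```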

Strip Whitespace

s.strip() - return a version of s with the whitespace characters from the very start and very end of the string all removed. Handy to clean up strings parsed out of a file.

>>> '   hi there  \n'.strip()
'hi there'
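
Here is a sketch of the clean-up use case, with a made-up list of lines standing in for text read from a file. Note that strip() only removes whitespace at the two ends, so whitespace in the middle of the string is left alone:

```python
lines = ['  alpha\n', '\tbeta  \n', '\n', 'two  words \n']

cleaned = []
for line in lines:
    text = line.strip()
    if text != '':          # skip lines that were only whitespace
        cleaned.append(text)

print(cleaned)   # ['alpha', 'beta', 'two  words']
```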

String Replace

s.replace(old, new) - returns a version of s where all occurrences of old have been replaced by new. Does not pay attention to word boundaries, just replaces every instance of old in s. Replacing with the empty string effectively deletes the matching strings.

>>> 'this is it'.replace('is', 'xxx')
'thxxx xxx it'
>>> 'this is it'.replace('is', '')
'th  it'

Backslash Special Chars

A backslash \ in a string "escapes" a special char we wish to include in the string, such as a quote or \n newline. Common backslash escapes:

\'   # single quote
\"   # double quote
\\   # a backslash
\n   # newline char

A string using \n:

a = 'First line\nSecond line\nThird line\n'
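
Although an escape like \n is typed as two characters in the code, it stands for a single char in the string. A quick sketch to confirm this with len(), using made-up variable names:

```python
newline = '\n'
backslash = '\\'
word = 'isn\'t'          # \' puts a single quote inside a single-quoted string

print(len(newline))      # 1
print(len(backslash))    # 1
print(len(word))         # 5
print(word == "isn't")   # True, same string written two ways
```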

Python strings can be written within triple ''' or """, in which case they can span multiple lines. This is useful for writing longer blocks of text.

a = """First line
Second line
Third line
"""

String Format

The string .format() function is a handy way to paste values into a string. It uses the special marker {} within a string to mark where things go, like this:

>>> 'Count: {}'.format(67)
'Count: 67'
>>> 'Count: {} and word: {}'.format(67, 'Yay')
'Count: 67 and word: Yay'

The older approach would be to compute str(67) manually and use + to put the result string together. The str.format() function is a more convenient tool for that situation.

For floating point values, typically you do not want to print all the digits of a float value. The format marker {:.4g} means print at most 4 significant digits; "g" here is the general format, which works for float and int values as appropriate.

>>> 2/3   # has lots of digits
0.6666666666666666
>>> 'val: {:.4g}'.format(2/3)
'val: 0.6667'
>>> 'val: {:.2g}'.format(2/3)
'val: 0.67'
>>> 'val: {:.2g}'.format(45)
'val: 45'

There are many, many other options for format markers, but {:.4g} is a good one to know for the common situation of printing float values.
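
The number in a "g" marker counts significant digits in the whole number, not just digits after the decimal point. The sketch below runs a few sample values (chosen for this illustration) through {:.4g} to show the effect on large and small numbers:

```python
values = [2/3, 1234.5678, 0.000123456, 45]

formatted = []
for v in values:
    formatted.append('{:.4g}'.format(v))

print(formatted)   # ['0.6667', '1235', '0.0001235', '45']
```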

String Loops

Standard i/range() loop goes through all index numbers for s:

for i in range(len(s)):
    # use s[i] in here

The "foreach" loop works on strings too, accessing each char. Unlike the above form, here you do not have access to the index of each char as it is accessed.

for char in s:
    # use char in here

Calling list() on a string, e.g. list('abc'), yields a list ['a', 'b', 'c'] of its chars.
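
As a sketch of when to pick each loop form: the foreach loop is enough when only the chars matter, while the i/range() loop is the one to use when the answer is an index. Both function names here are hypothetical:

```python
def count_char(s, target):
    """Return how many times the char target appears in s (foreach loop)."""
    count = 0
    for ch in s:
        if ch == target:
            count += 1
    return count


def first_digit_index(s):
    """Return the index of the first digit in s, or -1 if none (index loop)."""
    for i in range(len(s)):
        if s[i].isdigit():
            return i
    return -1


print(count_char('banana', 'a'))    # 3
print(first_digit_index('abc42'))   # 3
print(first_digit_index('abc'))     # -1
```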

More details at official Python String Docs

String Slices

string 'Python' shown with index numbers 0..5

Slice syntax is a powerful way to refer to sub-parts of a string instead of just 1 char. s[ start : end ] - returns a substring of s beginning at the start index, running up to but not including the end index. If the start index is omitted, the slice starts from the beginning of the string. If the end index is omitted, the slice runs through the end of the string. If the start index is equal to the end index, the slice is the empty string.

>>> s = 'Python'
>>> s[2:4]
'th'
>>> s[2:]
'thon'
>>> s[:5]
'Pytho'
>>> s[4:4]  # start = end: empty string
''

If the end index is too large (out of bounds), the slice just runs through the end of the string. This is a case where Python is permissive about wrong/out-of-bounds indexes. Similarly, if the start index is larger than the end index, the slice is just the empty string.

>>> s = 'Python'
>>> s[2:999]
'thon'
>>> s[3:2]  # zero chars
''

Negative numbers also work within [ ] and slices: -1 is the rightmost char, -2 is the char to its left, and so on. This is convenient when you want to extract chars relative to their position from the end of the string.

>>> s[-1]
'n'
>>> s[-2:]
'on'
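
Slices combine naturally with find(): find the index of a marker char, then slice relative to it. A minimal sketch, assuming a simple 'name@domain' string; the function names and the sample address are invented for this example:

```python
def username(email):
    """Return the part of email before the '@', or '' if there is no '@'."""
    at = email.find('@')
    if at == -1:
        return ''
    return email[:at]


def domain(email):
    """Return the part of email after the '@', or '' if there is no '@'."""
    at = email.find('@')
    if at == -1:
        return ''
    return email[at + 1:]


print(username('alice@example.com'))   # alice
print(domain('alice@example.com'))     # example.com
print('Python'[:-1])                   # Pytho (everything but the last char)
```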

String split()

str.split(',') is a string function which divides a string up into a list of string pieces based on a "separator" parameter that separates the pieces:

>>> 'a,b,c'.split(',')
['a', 'b', 'c']
>>> 'a:b:c'.split(':')
['a', 'b', 'c']

A returned piece will be the empty string if we have two separators next to each other, e.g. the '::', or the separator is at the very start or end of the string:

>>> ':a:b::c:'.split(':')
['', 'a', 'b', '', 'c', '']

Special whitespace: split with no arguments at all splits on whitespace (space, tab, newline), and it groups multiple whitespace together. So it's a simple way to break a line of text into 'words' based on whitespace (note how the punctuation is lumped onto each 'word'):

>>> 'Hello there,     he said.'.split()
['Hello', 'there,', 'he', 'said.']

File strategy: a common pattern is to use 'for line in f' to loop over the lines in a file and 'line.split()' to break each line up into pieces. Some text file formats have a format that split() works on easily.
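
The file strategy can be sketched as below. To keep the example runnable without a file on disk, io.StringIO stands in for an open file, and the 'name score' line format is made up for this illustration:

```python
import io

# io.StringIO behaves like an open text file for the purposes of this loop
f = io.StringIO('alice 10\nbob 7\n\ncarol 12\n')

names = []
total = 0
for line in f:
    parts = line.split()       # a blank line yields the empty list []
    if len(parts) == 2:
        names.append(parts[0])
        total += int(parts[1])

print(names)   # ['alice', 'bob', 'carol']
print(total)   # 29
```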

String Join

','.join(lst) is a string function which is approximately the opposite of split: it takes a list of strings as a parameter and forms it into one big string, using the string it is called on as the separator:

>>> ','.join(['a', 'b', 'c'])
'a,b,c'

The elements in the list should be strings, and join just puts them all together to make one big string. Note that split() and join() are both called in noun.verb form on a string; the list is just passed in as a parameter.
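
If the list holds ints rather than strings, join() raises a TypeError, so convert each element with str() first. A short sketch with made-up data, which also shows split() undoing the join:

```python
nums = [1, 2, 3]

pieces = []
for n in nums:
    pieces.append(str(n))      # join() needs strings, not ints

line = ','.join(pieces)
print(line)                    # 1,2,3
print(line.split(','))         # ['1', '2', '3']
```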

Unicode Characters

In the early days of computers, the ASCII character encoding was very common, encoding the roman a-z alphabet. ASCII is simple, and requires just 1 byte to store 1 character, but it has no ability to represent characters of other languages.

Each character in a Python string is a unicode character, so characters for all languages are supported. Also, many emoji have been added to unicode as a sort of character.

Every unicode character is defined by a unicode "code point" which is basically a big int value that uniquely identifies that character. Unicode characters can be written using the "hex" version of their code point, e.g. "03A3" is the "Sigma" char Σ, and "2665" is the heart emoji char ♥.

Hexadecimal aside: hexadecimal is a way of writing an int in base-16 using the digits 0-9 plus the letters A-F, like this: 7F9A or 7f9a. Two hex digits together like 9A or FF represent the value stored in one byte, so hex is a traditional easy way to write out the value of a byte. When you look up an emoji on the web, typically you will see the code point written out in hex, like 1F644, the eye-roll emoji 🙄.

You can write a unicode char out in a Python string with a \u followed by the 4 hex digits of its code point. Notice how each unicode char is just one more character in the string:

>>> s = 'hi \u03A3'
>>> s
'hi Σ'
>>> len(s)
4
>>> s[0]
'h'
>>> s[3]
'Σ'
>>>
>>> s = '\u03A9'  # upper case omega
>>> s
'Ω'
>>> s.lower()     # compute lowercase
'ω'
>>> s.isalpha()   # isalpha() knows about unicode
True
>>>
>>> 'I \u2665'
'I ♥'

For a code point with more than 4 hex digits, use \U (uppercase U) followed by 8 hex digits, with leading 0's as needed, like the fire emoji 1F525, and the inevitable 1F4A9.

>>> 'the place is on \U0001F525'
'the place is on 🔥'
>>> s = 'oh \U0001F4A9'
>>> len(s)
4
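
To go between a char and its code point in code, Python has the built-in functions ord() (char to int) and chr() (int to char). A brief sketch, with escapes used in place of the literal chars so the code itself stays plain ASCII:

```python
sigma_code = ord('\u03A3')     # code point of the Sigma char, as an int
print(sigma_code)              # 931
print(hex(sigma_code))         # 0x3a3

heart = chr(0x2665)            # build the char from its code point
print(heart == '\u2665')       # True

fire = chr(0x1F525)
print(fire == '\U0001F525')    # True
print(len(fire))               # 1, one char even though it needs \U
```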

Not all computers have the ability to display all unicode chars, so the display of a string may fall back to something like \x0001F489 - telling you the hex digits for the char, even though it can't be drawn on screen.