Sorting

Sorting is a classic computer algorithm and all computer languages provide it for you. There are piles of CS research and language features on just this one topic which I would venture say is now a solved problem. (In CS106B you'll design and analyze your own sort algorithm.) Here we'll look at the sorted() function which works on any linear data structure.

Sorting 1.0

The sorted() function orders the elements into increasing order. The elements can be any type (str, int, float, ..) so long as that type supports < and ==.

The simplest use of sorted is just ordering a linear series of strs or ints into increasing order. Sorted() returns a new list, leaving the original list unchanged:

>>> nums = [23, 42, 4, 8, 15, 16]
>>> sorted(nums)
[4, 8, 15, 16, 23, 42]
>>> nums
[23, 42, 4, 8, 15, 16]   # original list not changed
>>>
>>> strs = ['banana', 'zebra', 'apple', 'donut']
>>> sorted(strs)
['apple', 'banana', 'donut', 'zebra']

By default, sorted() returns elements in increasing order. The optional reverse=True parameter returns the elements in decreasing order:

>>> sorted(strs, reverse=True)
['zebra', 'donut', 'banana', 'apple']

Sort Upper / Lower

By default, uppercase chars come before lowercase chars, so uppercase strings will sort to the front of the list:

>>> strs = ['donut', 'ZEBRA', 'BANANA', 'apple']
>>> sorted(strs)
['BANANA', 'ZEBRA', 'apple', 'donut']

One possible fix it to convert the strings to all upper or lower case, although this is not the best looking solution. If the user's data is in a mix of upper and lower case, it's nice to preserve that for them. A fix that preserves the upper/lower case is shown in Sort Upper/Lower Fix below.

Sorting 2.0 - Tuples

Say you have a list of tuples, giving a web domain name and a path on that domain. A common goal for this data is to sort first by domain, but when domains are equal, fall-back to sorting by path. This fall-back strategy is exactly what sorted() does, ordering first by [0], then falling back to [1] and so on.

>>> domains = [('stanford.edu', '/meh'), ('google.com','/search'), ('stanford.edu', '/admit')]
>>> sorted(domains)
[('google.com', '/search'), ('stanford.edu', '/admit'), ('stanford.edu', '/meh')]

Stable - a common feature of good sort algorithm is that it only re-orders elements from the original if they are not ==. So if we had 2 copies of the ('stanford.edu', '/admit') tuple in that list, they would stay in the same order after sorting. The built in sorting algorithm is stable in this way.

Sorting 3.0 - Custom Sorting

Say you have a list of strings, and you want to sort in a way other than alphabetic. Python has a very nice way to do this, the "key" function.

The key function is a function of 1 parameter that produces a sort-by value to use for comparisons in place of the original value.

Sort Key Example 1

Sort strings by their length. Introduce by_len(s) function that takes in one string and returns its length.

When calling sorted, pass in key=by_len. In this way, the sorting uses the length of each string for comparisons.

>>> a = ['aaaa', 'bb', 'z']
>>> sorted(a)
['aaaa', 'bb', 'z']               # sort by alphabetic
>>>
>>> def by_len(s):                # by_len(s) - returns len(s)
..  return len(s)  
>>>
>>> sorted(a, key=by_len)         # sort by len
['z', 'bb', 'aaaa']
>>>
>>> sorted(a, key=by_len, reverse=True)
['aaaa', 'bb', 'z']

Note that in sorted(a, key=by_len) we have the name of the function by_len but it is not followed by parenthesis. This is the rare case of identifying a function by name, but not calling it.

Sort Key Example 2

Say we have (str, int) "food" tuples where the int is a count, such as from a dict-count algorithm.

>>> foods = [('apple', 5), ('banana', 2), ('chocolate', 137), ('bread', 10)]
>>> sorted(foods)
[('apple', 5), ('banana', 2), ('bread', 10), ('chocolate', 137)]

Say we want to sort these same tuples, but in increasing order by the count. Introduce a by_count(food) function.

>>> def by_count(food): return food[1] 
>>> sorted(foods)
[('apple', 5), ('banana', 2), ('bread', 10), ('chocolate', 137)]
>>> sorted(foods, key=by_count)

What if we want the highest count first? Use the reverse=True parameter.

>>> sorted(foods, key=by_count, reverse=True)
[('chocolate', 137), ('bread', 10), ('apple', 5), ('banana', 2)]

Sort Lambda

Lambda is a a very compact wy of writing a short function without requiring a separate def. This works well with sorted(). Here is the food by_count() example, re-written with this lambda

  lambda food: food[1]

The one parameter to the lambda is one food tuple, and the lambda returns food[1] — in this way, sorting uses the count from each food tuple for sorting. Rather than use "return", the lambda just has the expression to use, food[1] in this case. See the map/lambda chapter for more information about lambdas.

>>> foods = [('apple', 5), ('banana', 2), ('chocolate', 137), ('bread', 10)]
>>> # Lambda equivalent of by_count():
>>> sorted(foods, key=lambda food: food[1])
[('banana', 2), ('apple', 5), ('bread', 10), ('chocolate', 137)]
>>>
>>> # Lambda sorting by name:
>>> sorted(foods, key=lambda food: food[0])
[('apple', 5), ('banana', 2), ('bread', 10), ('chocolate', 137)]

Sort Upper/Lower Fix

With lambda, we can write a sort that is not tripped up by upper/lower case in the input strings. The key= lambda computes the lowercase form of each string to use in the comparisons:

>>> strs = ['donut', 'ZEBRA', 'BANANA', 'apple']
>>> sorted(strs)
['BANANA', 'ZEBRA', 'apple', 'donut']      # upper before lower - lame
>>> 
>>> sorted(strs, key=lambda s: s.lower())  # sort by lowercase form - nice
['apple', 'BANANA', 'donut', 'ZEBRA']

Sorting Speed?

How much time does sorting take? An algorithm that makes 1 pass over a list, doing a fixed-cost computation for each element could be called "linear" time cost. If you double the length of the list, the time taken roughly doubles. Many algorithms we've worked out have this linear 1-pass-over-the-data quality.

Sorting is more expensive than linear. It has to go back and forth repeatedly, some elements have to be examined many times. For CS106A, just remember that sorting is medium expensive.