More on Binary Trees

Thursday, August 1


Today, we continue our discussion of binary trees and binary search trees. We encounter some light puzzles, a few applications, and some binary tree code.


Lecture Video


Prezi

Here is the Prezi from today's lecture:


Contents

1. Topic Overview

2. Deleting Values from BSTs

3. Additional BST Deletion Examples

4. Supplementary BST Deletion Code

5. Summary of Runtimes

6. Self-Balancing BSTs

7. So, Why Binary Trees?

8. Traversal Applications

9. BST Code from Class: Insertion, Traversals, and Deallocation

10. Incredibly Important! Clarification: nullptr Assignments in the forestFire() Function

11. Supplementary Level-Order Traversal Code

12. What's next?

13. Exam Prep


Topic Overview

Today, we continued our discussion of BSTs. We covered the deletion algorithm for BSTs (three cases: deletion of a node with no children, one child, or two children) and discussed the best-, average-, and worst-case runtimes for that deletion operation. We also talked briefly about self-balancing BSTs and expected runtimes for operations on a BST that has been kept balanced.

We then went through the Prezi included at the top of today's notes, where we talked about tree applications, a few exotic tree variants, and some benefits and disadvantages of trees compared to arrays and linked lists.

Finally, we coded up the following:

  • a recursive BST insertion function
  • preorder, postorder, and inorder traversal functions
  • a forestFire() function for freeing all dynamically allocated memory associated with a binary tree
  • a test case that checked for memory leaks using the TRACK_ALLOCATIONS_OF approach discussed in your current programming assignment!


Deleting Values from BSTs

Today, we continued our discussion of BSTs. We started by covering the deletion algorithm for BSTs.

There are three distinct cases for deletion:

  • If the node to be deleted is a leaf node:
    Just prune it! Goodbye, leaf node! For example:

  • If the node to be deleted has exactly one child:
    Its child moves up to take its place, bringing along any children of its own. For example:

  • If the node to be deleted has two children:
    Replace the node's value with the maximum value in its left subtree, and then (recursively) delete that maximum value from the left subtree. For example:

Alternatively, if we delete a node that has two children, we could replace the node's value with the minimum value from its right subtree (and then delete that min element from the right subtree). However, for the sake of keeping everything consistent and easy to grade on exams in this class, I always want you to choose the option shown in class: use the maximum element from the node's left subtree.


Additional BST Deletion Examples

Here are some supplementary BST deletion examples to help clarify:


Supplementary BST Deletion Code

(Spoiler alert!) You should try to implement a function for deleting values from a BST before peeking at the code below! This code was not covered in class today, and writing this up from scratch is one of the exam prep problems below.

// Recall that we pass a reference to our root pointer because we might want to
// change what this pointer is pointing to. That happens if we end up deleting the
// root node in this function call, in which case we want that pointer back in our
// calling function to be set to nullptr.

void
bstDelete(Node*& root, int data)
{
    // Base case. :) If there's nothing here, there's nothing to delete. Just return.
    if (root == nullptr)
    {
        return;
    }

    if (data < root->data)
    {
        bstDelete(root->left, data);
    }
    else if (data > root->data)
    {
        bstDelete(root->right, data);
    }
    else
    {
        // Hooray, we found the value we want to delete!

        if (root->left == nullptr && root->right == nullptr)
        {
            // This is the case where the node has no children. We can simply
            // remove it from the tree.
            delete root;
            root = nullptr;
        }
        else if (root->right == nullptr)
        {
            // Here, we have a single child: a left child. It might seem a bit
            // jarring not to check above whether root->left != nullptr, but we
            // know that if root->right == nullptr, root->left can't be nullptr.
            // If it were, then both children would be null, and we would have
            // triggered the if condition above and never made it to this
            // else-if condition.

            // In this case, we need to delete the current node and move its
            // left child up to take its place. Note that we can't safely
            // access root->left after deleting root. The arrow operator (->)
            // dereferences our struct pointer, and we should never dereference
            // a pointer to something that has been deleted. So, we have to do
            // a delicate dance here.

            // Hold onto the left child. This will become the new root of this
            // subtree.
            Node *temp = root->left;

            // Delete the node 'root' is pointing to. (This doesn't delete the
            // local variable called 'root'. It's still a reference to the root
            // pointer passed to this function. This just deletes the
            // dynamically allocated node that 'root' was pointing to.)
            delete root;

            // Now set the root to point to the temp node -- the left child
            // that is moving up to take its parent's place.
            root = temp;
        }
        else if (root->left == nullptr)
        {
            // This is the case where we have just one child: a right child.
            // The operation is symmetric to the one above.
            Node *temp = root->right;
            delete root;
            root = temp;
        }
        else
        {
            // This is the case where we have two children. The max value in
            // the left subtree needs to move up. Note that I'm not rewiring
            // the tree here. I just leave this node in place and change the
            // value it contains.
            root->data = bstFindMax(root->left);
            bstDelete(root->left, root->data);
        }
    }
}

To enable the bstDelete() function, here is a bit of a cheeky bstFindMax() function that doesn't bother checking whether root is null initially. For the purposes of deletion, it doesn't have to. We only ever call this function from bstDelete() if the node we're trying to delete has two children -- meaning that its left child is definitely non-null.

If someone were to pass this function an empty tree, however, it would totally segfault.

// Returns the largest value in this binary tree.
//
// WARNING! This function assumes root is non-null!

int
bstFindMax(Node *root)
{
    Node *current = root;

    // To find the largest value in a BST, simply go right until we can't go
    // right anymore!
    while (current->right != nullptr)
    {
        current = current->right;
    }

    return current->data;
}

An alternative approach to this function, by the way, is to get rid of the current pointer altogether. We can change what root is pointing to locally because it's not a reference; it's just a local pointer variable:

// Returns the largest value in this binary tree.
//
// WARNING! This function assumes root is non-null!

int
bstFindMax(Node *root)
{
    // To find the largest value in a BST, simply go right until we can't go
    // right anymore!
    while (root->right != nullptr)
    {
        root = root->right;
    }

    return root->data;
}


Summary of Runtimes

Now that we have covered the algorithms for search, insertion, and deletion in a BST, here is the summary of their runtimes:

operation   best case   worst case   average case
---------   ---------   ----------   ------------
search      O(1)        O(n)*        O(log n)*
insertion   O(1)        O(n)*        O(log n)*
deletion    O(1)        O(n)*        O(log n)*

Not mentioned in class:

(*) Note: The claim of worst-case O(n) runtimes above is rooted in the possibility of a BST devolving into a linked list. Similarly, the average-case O(log n) runtimes are based on the observation that if we construct all possible BSTs from the same set of n distinct elements, the average height of those trees is O(log n). In a nice, bushy BST whose height is logarithmic, about half the nodes are leaf nodes, and they take O(log n) time to reach.

Some people object to the runtimes as listed above, however, and say that if we had a tree that has devolved into a linked list, we wouldn't reasonably expect to get average-case runtimes of O(log n) out of it. Similarly, in a nice, balanced, bushy, logarithmic-height BST, our worst-case runtimes for the operations above would be O(log n), not O(n).

We could get around those arguments by simply representing those runtimes as O(h), where h is the height of our tree. On the one hand, that expression can generalize to BSTs regardless of their height. On the other hand, an O(h) runtime is less informative to someone who only knows they're throwing n elements into some data structure and isn't aware of the maximum and minimum possible heights of a BST.

For this class, we'll stick to the forms given in the table above, which are the more conventional ways to express those runtimes. I'm only adding this note as a bit of an aside for the sake of completeness.


Self-Balancing BSTs

Having a BST devolve into a linked list is immensely problematic for our runtimes. Who wants to perform an O(n) search on a data structure that could be rearranged to allow for O(log n) searches, if only we had kept it nice and bushy and balanced?

This isn't a pedantic or hypothetical problem, either. A lot of data in the real world is stored in sorted order and would lead to problematically tall BSTs if we were to blindly dump that data into a BST. Suppose, for example, that we have an EnglishWords.txt file that has a list of English words -- something we have seen come up in a few places already this quarter. If the words in that file are stored in alpha order, inserting them one-by-one into a BST would lead to a linked list with progressively slower insertion, deletion, and look-up times. And in fact, it would be super chaotic for the words in that file not to be stored in alpha order.
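This degeneration is easy to see in code. Below is a quick sketch (not from class) that inserts values in sorted order into a naive BST, using a minimal Node struct and insertion function in the spirit of the ones we wrote; the resulting "tree" has height n - 1, just like a linked list:

```cpp
#include <algorithm>

// A minimal node and recursive insertion function in the spirit of the ones
// from class. (Duplicates go right here; that detail doesn't matter for
// this demonstration.)
struct Node
{
    int data;
    Node *left = nullptr;
    Node *right = nullptr;
};

void bstInsert(Node*& root, int data)
{
    if (root == nullptr)
    {
        root = new Node{data};
    }
    else if (data < root->data)
    {
        bstInsert(root->left, data);
    }
    else
    {
        bstInsert(root->right, data);
    }
}

// Height convention from class: a single node has height 0, so we treat an
// empty tree as having height -1.
int height(Node *root)
{
    if (root == nullptr)
    {
        return -1;
    }
    return 1 + std::max(height(root->left), height(root->right));
}
```

Inserting 0 through 99 in ascending order produces a tree of height 99 -- every node has only a right child, so searches take linear time.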

Enter self-balancing BSTs. A self-balancing BST has mechanisms built into its insertion and deletion algorithms that prevent the tree from devolving into a linked list. A tree (or subtree) is said to be "balanced" if its left and right subtrees are approximately equal in height -- generally no more than one level deeper on one side than on the other, although some self-balancing BSTs keep subtree heights perfectly balanced at every node. By guaranteeing that a tree (and all the subtrees) are balanced, we end up with nice, bushy trees that are O(log n) in height -- meaning that our average- and worst-case runtimes are O(log n).

Keeping these trees balanced comes at a cost, of course, albeit a fairly negligible one: the insertion and deletion operations have some extra runtime overhead involved with keeping the trees balanced, but not enough to boot those operations out of the O(log n) runtime club.

Today's Prezi lists several examples of self-balancing BSTs, such as AVL trees and red-black trees. There are even exotic variations of balanced trees that aren't quite BSTs, such as 2-4 trees. You'll encounter some of those in later classes; I'm mentioning them here in case you're super curious and want to start reading ahead.

(Key take-aways!) You don't need to be familiar with all the details above. The main things I want you to know are:

  • Self-balancing BSTs have worst-case search, insertion, and deletion runtimes of O(log n) rather than the O(n) worst-case runtimes seen in regular BSTs. ✨ This is awesome! ✨
  • The Stanford C++ Set and Map classes are driven by self-balancing BSTs, which is how they achieve O(log n) runtimes for insertion, deletion, and search operations.
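For the curious: the C++ standard library has counterparts to those Stanford classes -- std::set and std::map -- which are also typically implemented as self-balancing BSTs (red-black trees, specifically). Here's a small sketch of the idea (this interface was not covered in class):

```cpp
#include <set>

// Insert n values in sorted order. Because std::set is typically backed by
// a red-black tree (a self-balancing BST), even this worst-case insertion
// order keeps every insert and lookup at O(log n) -- no devolving into a
// linked list.
std::set<int> buildSortedSet(int n)
{
    std::set<int> s;
    for (int i = 0; i < n; i++)
    {
        s.insert(i);
    }
    return s;
}
```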


So, Why Binary Trees?

We then talked about some applications for binary trees. For that, I defer largely to the Prezi included at the top of today's notes. I do want to draw your attention to the following, though:

At one point, we talked about why we would ever want to use a BST to store data as opposed to, say, a vector, an array, or even a linked list. Here is a summary of key points on that topic:

  • Balanced BSTs support worst-case logarithmic insertion, search, and deletion operations. If we want to maintain a sorted vector or array, inserting a new smallest element would require us to scoot all the existing elements over by one position, leading to an O(n) runtime, which is significantly worse than O(log n). (Of course, if we use a regular BST and not a balanced BST, we could always devolve into a linked list and incur O(n) runtimes for our insertion, search, and deletion operations.)
  • With BSTs, we don't encounter the same space constraints we do with fixed-size arrays. Recall that a fixed-size array can end up using more space than necessary if we over-estimate how large it needs to be and never end up using all of its cells. Conversely, if we have an array that fills up and we want to add an element to it, we need to go through an expansion operation where we create a new, larger array, copy elements from our old array into the new one, and then add the new element to our larger, expanded array -- again, an O(n) operation. In contrast, with a BST, we always have exactly the number of nodes we need, and we create and delete nodes on the fly as needed.
  • Arrays require contiguous blocks of memory. If we're working in a highly memory constrained system where memory has become fragmented, it might not be possible to find a contiguous block of memory large enough for some array we want to allocate, but it might still be possible to proceed with a linked data structure that spreads its nodes across memory.
  • BSTs are great for streaming input. If we want to be able to binary search a collection of values, and all those values are provided to us ahead of time, sure -- we can just throw them into an array of the appropriate size, sort the array, and do binary search from there. However, if we have a situation where new values could stream into our collection at any time after we have already started performing searches and we want to be able to add those to our container and process them in real time, a balanced BST might be the way to go. A balanced BST will ensure that it remains nice and balanced and bushy and always has worst-case logarithmic runtimes for insertion, search, and deletion. In contrast -- as mentioned above -- with a vector or array, adding new elements could lead to expensive O(n) expansion operations or O(n) operations to scoot larger elements over to make room for new smaller elements at the beginning of our data structure.
  • A key benefit of BSTs over linked lists is that if we keep a BST balanced, we get efficient search runtimes of O(log n). With a linked list, even if our elements are sorted, our worst-case runtime for search is O(n). (And of course, we have talked ad nauseam about how much better logarithmic runtimes are than linear runtimes. What is log2(1,000,000,000) again? 😉)
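(In case that winking question sent you scrambling: log2(1,000,000,000) is about 30.) Here's a tiny helper -- just a sketch, not something from class -- for computing how many steps a logarithmic-time operation takes:

```cpp
#include <cmath>

// Returns the number of steps a logarithmic-time operation needs for n
// elements, i.e., ceil(log2(n)).
int logSteps(double n)
{
    return static_cast<int>(std::ceil(std::log2(n)));
}
```

So a balanced BST finds any of a billion elements in roughly 30 comparisons, where a linked list might need a billion.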

There are a few drawbacks to BSTs compared to arrays, though:

  • Firstly, there are a lot of pointers involved, and dereferencing those pointers and journeying around memory could involve some extra runtime overhead compared to array cell accesses. For the purposes of anything we do in this class, that runtime hit would be so negligible as to be unobservable. That's something to keep in mind for software that needs to operate at a much larger scale, though.
  • Secondly, and more importantly, the nodes in our BSTs take up more space than the cells in an array. On most systems these days, a single int variable in a C++ program takes up 32 bits (or 4 bytes) of memory. So, an array of n integers would take 4n bytes of memory. In a BST of integers, however, each node also keeps track of a left and right child pointer. On most systems these days, a pointer in C++ takes up 64 bits (or 8 bytes) of memory. The total memory for each node in a BST, then, is 4 + 8 + 8 = 20 bytes, meaning our BST takes up 20n bytes of memory in total. That's five times the space required to store those integers in an array!
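One caveat worth knowing (not mentioned above): compilers typically pad structs so their size is a multiple of the strictest member's alignment, which means a node with one int and two pointers usually reports a sizeof of 24 bytes rather than 20 on a 64-bit system -- slightly worse than the back-of-the-envelope figure. A quick sketch you can run to check this on your own machine (the exact numbers are platform-dependent):

```cpp
#include <cstddef>

// A BST node of integers: one 4-byte int plus two 8-byte pointers on a
// typical 64-bit system.
struct Node
{
    int data;
    Node *left;
    Node *right;
};

// The "raw" payload: 4 + 8 + 8 = 20 bytes on a typical 64-bit system.
// sizeof(Node) usually reports 24, because the compiler pads the struct up
// to a multiple of the pointer alignment (8 bytes).
std::size_t nodePayloadBytes()
{
    return sizeof(int) + 2 * sizeof(Node*);
}
```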


Traversal Applications

Throughout today's lecture, I also sprinkled the following examples of where we might use pre-, post-, or in-order traversals:

  • preorder: This is used in JavaScript's getElementById() function, which traverses the DOM tree of a webpage to find an element with a particular ID. (See the Prezi for more information about the DOM.) We wouldn't want to do an in- or post-order traversal for that, as those both postpone processing certain elements until after they have returned from explorations that dive deeper into the tree. Why delve deeper into the tree if we've already stumbled upon the element we want? Preorder is the way to go here.
  • inorder: If we have a BST, we can use an inorder traversal to process its elements in sorted order. Recall that this is only guaranteed with binary search trees. In a regular binary tree, inorder won't necessarily visit our elements in sorted order.
  • postorder: We used a postorder traversal when coding up the forestFire() function that deallocated all the nodes in a binary tree. After we free up all the dynamically allocated memory in a node's left and right subtrees, we can safely return to the node in question and free it up, as well, since we no longer need the left and right subtree pointers it contains.
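The inorder property is easy to verify in code. Here's a sketch (not from class; it reuses a minimal Node struct and insertion function in the spirit of ours) that collects an inorder traversal into a vector instead of printing it, so the sorted-order guarantee can be checked directly:

```cpp
#include <vector>

struct Node
{
    int data;
    Node *left = nullptr;
    Node *right = nullptr;
};

void bstInsert(Node*& root, int data)
{
    if (root == nullptr)
    {
        root = new Node{data};
    }
    else if (data < root->data)
    {
        bstInsert(root->left, data);
    }
    else
    {
        bstInsert(root->right, data);
    }
}

// Inorder: left subtree, then this node, then right subtree. On a BST,
// this visits the values in ascending order.
void inorderCollect(Node *root, std::vector<int>& out)
{
    if (root == nullptr)
    {
        return;
    }
    inorderCollect(root->left, out);
    out.push_back(root->data);
    inorderCollect(root->right, out);
}
```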


BST Code from Class: Insertion, Traversals, and Deallocation

📦 Attachment: bst.zip

Finally, we coded up the following:

  • a recursive BST insertion function
  • preorder, postorder, and inorder traversal functions
  • a forestFire() function for freeing all dynamically allocated memory associated with a binary tree
  • a test case that checked for memory leaks using the TRACK_ALLOCATIONS_OF approach discussed in your current programming assignment!

See the code attached above. Comments are sparse since we discussed the code in class today as we implemented it.


Incredibly Important! Clarification: nullptr Assignments in the forestFire() Function

Here's the forestFire() function we wrote in class today:

void forestFire(Node*& root)
{
    if (root == nullptr)
    {
        return;
    }

    forestFire(root->left);
    forestFire(root->right);
    delete root;

    // Set root to nullptr to protect ourselves from accidentally
    // dereferencing this address later, since our program no
    // longer has any claim over this address in memory.
    root = nullptr;
}

Every now and then, I get questions about how we can set root equal to nullptr after having deleted it, as in the code above. I want to expand upon that here, because understanding how and why that works is essential to developing a strong understanding of pointers and dynamic memory management.

When we delete root, we aren't deleting the root variable itself; rather, root contains the address of some dynamically allocated node in memory, and we're giving that address to delete and releasing any claim our program had over that chunk of memory. The root variable still exists after that operation is finished, but it contains an address that's unsafe for us to return to -- hence my desire to set that equal to nullptr.

Note that this assignment is mostly useless as we traverse the interior of the tree: we pass child pointers by reference and set them to nullptr just before returning to previous function calls that immediately destroy the very nodes where those pointers resided. It is efficacious, however, when we get back to the first call to this function. The original root of the tree was passed by reference, and if we simply returned to whoever called this function without setting that pointer to nullptr, that function would have a lingering record of the deallocated root's now-invalid address. That last line of code zaps that address for the reasons mentioned above.

Here are some diagrams to illustrate what's happening:

Immediately BEFORE we hit the delete root line in our initial call to this function:

===========
STACK SPACE
===========

main()
+---------------+
|     root      |
| +-----------+ |
| |  0x85020  | |
| +-----------+ |
+---------------+

forestFire()
+---------------+
|    🌀 root    |
| +-----------+ |
| |  0x85020  | | <-- This is just a reference (🌀) to the root
| +-----------+ |     variable in main().
+---------------+

==========
HEAP SPACE
==========

0x85020
+----------------+
| data: 42       | <-- Our program currently has a claim on
| left: nullptr  |     this memory address
| right: nullptr |
+----------------+


Immediately AFTER we delete root:

===========
STACK SPACE
===========

main()
+---------------+
|     root      |
| +-----------+ |
| |  0x85020  | | <-- MINOR PROBLEM: This still refers to an
| +-----------+ |     address we no longer have any claim over!
+---------------+     If we dereference it, bad things could
                      happen!

forestFire()
+---------------+
|    🌀 root    |
| +-----------+ |
| |  0x85020  | | <-- This is just a reference (🌀) to the root
| +-----------+ |     variable in main().
+---------------+

==========
HEAP SPACE
==========

0x85020
+----------------+
| 7  42         ,| <-- We no longer have ownership over this
| `       %$     |     chunk of memory. It could contain ANYTHING
|  b ^ 2         |     at this point!
+----------------+


Memory AFTER we execute root = nullptr:

===========
STACK SPACE
===========

main()
+---------------+
|     root      |
| +-----------+ |
| |  nullptr  | | <-- Nice! We no longer have the address of that
| +-----------+ |     chunk of memory that we released with
+---------------+     "delete root"!

forestFire()
+---------------+
|    🌀 root    |
| +-----------+ |
| |  nullptr  | | <-- This is just a reference (🌀) to the root
| +-----------+ |     variable in main().
+---------------+

==========
HEAP SPACE
==========

0x85020
+----------------+
| m  42         ,| <-- We no longer have ownership over this
| x     `$       |     chunk of memory. It could contain ANYTHING
|  3   ^@        |     at this point!
+----------------+


Memory AFTER returning to main() from the forestFire() function:

===========
STACK SPACE
===========

main()
+---------------+
|     root      |
| +-----------+ |
| |  nullptr  | | <-- All is well. :) No lingering record of the
| +-----------+ |     0x85020 address, so we can't accidentally
+---------------+     attempt to go back there. :)

==========
HEAP SPACE
==========

0x85020
+----------------+
| q42q          ,| <-- We no longer have ownership over this
| *      `$      |     chunk of memory. It could contain ANYTHING
|  3   mmm4      |     at this point!
+----------------+
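If all those diagrams boil down to one idea, it's this: assigning to a reference parameter changes the caller's variable, while assigning to a plain (by-value) parameter changes only a local copy. Here's a stripped-down sketch (not from class) of exactly that distinction:

```cpp
// Takes the pointer by reference: assigning to p changes the caller's
// pointer, just like the root = nullptr line in forestFire().
void zapByReference(int*& p)
{
    p = nullptr;
}

// Takes the pointer by value: assigning to p changes only this function's
// local copy; the caller's pointer is untouched.
void zapByValue(int *p)
{
    p = nullptr;
}
```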


Supplementary Level-Order Traversal Code

(Spoiler alert!) You should try to implement a level-order traversal function before peeking at the code below! This code was not covered in class today (we only coded preorder, postorder, and inorder traversals), and writing this up from scratch is one of the exam prep problems below.

void levelorder(Node *root)
{
    Queue<Node*> q;
    q.enqueue(root);

    cout << "Level-Order Traversal:";

    while (!q.isEmpty())
    {
        Node *current = q.dequeue();

        // If we pull a nullptr out of the queue, simply continue
        // on to the next iteration of the loop (skipping over the
        // remaining lines in this code block for this iteration).
        if (current == nullptr)
        {
            continue;
        }

        cout << " " << current->data;
        q.enqueue(current->left);
        q.enqueue(current->right);
    }

    cout << endl;
}


What's next?

On Monday, we will talk about Huffman coding -- a data compression algorithm that will be the central topic for our final assignment in the course. Later next week, we will also talk about two incredibly important CS topics: hashing and graphs.

After that, we'll head into the final week of class, which we've designed to be a bit light in hopes that you won't be too overwhelmed or overworked as we head into final exams.


Exam Prep

1. What are the best-, worst-, and average-case runtimes for the search, insertion, and deletion operations on BSTs, and -- more importantly -- what leads to each of those cases?

2. Be sure to code up all the functions we did in class today from scratch. Make your best effort to write and test each function without peeking back at today's implementations. The functions to implement include:

  • bstInsert(Node*& root, int data) - A recursive function to insert data into the given BST. Note that root could be nullptr initially, indicating an empty BST.
  • preorder(Node* root) - Print the preorder traversal of the given binary tree. Ensure there is a space between each value in your output, but that there is no dangling, invisible space after the final element in your output.
  • postorder(Node* root) - Print the tree's postorder traversal with the same formatting restriction as above.
  • inorder(Node* root) - Print the tree's inorder traversal with the same formatting restriction as above.
  • forestFire(Node*& root) - Deallocate all the dynamically allocated memory associated with the given tree, and ultimately set the root pointer to nullptr so that it will not accidentally be dereferenced from outside this function with the assumption that it still points somewhere useful.

3. In the previous question, why is root passed by reference to bstInsert() and forestFire() but not the other functions listed there?

4. Write an iterative bstInsert(Node*& root, int data) function. Test its functionality by constructing a few BSTs with it and then verifying their structures by printing their preorder, postorder, and inorder traversals.

5. Write an iterative forestFire(Node*& root) function. This one is a bit tricky. You might find it helpful to use a stack or queue to assist.

6. For this question, you have two options:

  1. Write a levelorder(Node *root) function that prints the level-order traversal of a given binary tree.
  2. Trace through the levelorder(Node *root) code provided above in today's lecture notes with an example binary tree of your choosing in order to develop a solid understanding of how it's working.

7. Implement a bstDelete(Node*& root, int data) function that takes the root of a BST and some value to be deleted, and deletes the first occurrence of that value that it encounters in the BST using the algorithm discussed in class.

8. Write a forestFire() function that deallocates all the nodes in a given binary tree in a preorder fashion (i.e., we deallocate a (sub)tree's root before descending into its left and right subtrees, as opposed to the postorder approach we used in class today). Note that the following will not work. You'll have to find a workaround for the issue explained below.

void forestFire(Node *root)
{
    if (root == nullptr)
    {
        return;
    }

    delete root;

    // WARNING! WARNING! WARNING!
    // This approach is unsafe! After we delete root, we should not
    // attempt to dereference it and access root->left and root->right!
    // We no longer have any claim over that address in memory, and
    // there is no guarantee that the left and right pointers will not
    // have been overwritten by the time we reach these lines of code!
    forestFire(root->left);
    forestFire(root->right);
}

9. What's wrong with the following function, whose goal is to find the height of a binary tree?

// This function returns the height of a binary tree based on the observation that
// a tree's height is 1 plus the height of the root's left subtree or right subtree
// -- whichever one is taller.
//
// For example, in the following diagram, the height of the root's left subtree
// (the subtree rooted at 3) is 2. The height of the root's right subtree (the
// subtree rooted at 7) is 1. The max of those two values (2 vs. 1) is 2. We add
// 1 to that to get the overall height of the tree. (Adding 1 accounts for the
// extra level that is added when we unite the left and right subtree beneath the
// root node.)
//
//                  9              <-- height: 1 + max(2, 1) = 3
//                /   \
//              3       7          <-- subtree rooted at 3: height 2
//             / \     / \             subtree rooted at 7: height 1
//           38   20 10   14
//          /  \
//        15    12
//
int
height(Node *root)
{
    // Recall that the height of a tree with a single node is zero (0).
    // If this node has no children, then we can simply return its height.
    if (root->left == nullptr && root->right == nullptr)
    {
        return 0;
    }

    return 1 + max(height(root->left), height(root->right));
}

10. Write a bstFindMax(Node *root) function that returns the maximum value in a non-empty BST (i.e., you may assume the root is non-null). Write two versions of the function: one iterative and one recursive.

11. Write a version of findMax(Node *root) that works for arbitrary binary trees (not necessarily binary search trees). In the case of an empty tree, have your function return numeric_limits<int>::min(), which you can use if you #include <limits> at the top of your program.

12. Write a findMin(Node *root) function that returns the smallest value in the given binary tree (not necessarily a binary search tree). Have your function throw an error if the initial tree is empty. Write a corresponding recursive version of this function that works on BSTs, and then write an iterative version of the BST function as well.

13. The previous three questions are designed, in part, to showcase various approaches we have to handling empty trees in functions that don't really have a good, correct return value when they receive empty trees as their inputs. Take note of those various options, and consider their pros and cons. Namely:

  • Simply assume the tree passed in is non-empty. This is a bit dangerous. If someone violates that assumption, the program might crash with a segmentation fault.
  • Return an extreme value designed to indicate failure. This is not always the best approach. Consider what happens if you return numeric_limits<int>::min() (C++'s smallest possible int value) from findMax() to indicate that the tree was empty. What happens, then, if we have a binary tree that contains only one value, and that value happens to be numeric_limits<int>::min()? When we return that value from our function, whoever processes that result has no way of determining whether it's a flag designed to signal that our tree was empty or an actual return result from a tree that contained that integer as its largest value.
  • Throw an error if the tree is empty. This is not always ideal either. There are many situations where, rather than being forced to terminate, an algorithm could proceed gracefully if it fails to find any values in a binary tree.

Can you think of any alternatives to the three options showcased above?
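One possible answer (a sketch, not something we covered in class): C++17's std::optional<int> lets a function return either a value or an explicit "no value," sidestepping both the sentinel-value ambiguity and the need to throw an error:

```cpp
#include <optional>

struct Node
{
    int data;
    Node *left = nullptr;
    Node *right = nullptr;
};

// Returns the largest value in an arbitrary binary tree, or std::nullopt
// if the tree is empty. No legitimate int is hijacked as a failure flag,
// and no error is thrown; the caller checks has_value() to learn whether
// a maximum exists.
std::optional<int> findMax(Node *root)
{
    if (root == nullptr)
    {
        return std::nullopt;
    }

    int best = root->data;

    // Not necessarily a BST, so we have to check both subtrees.
    std::optional<int> l = findMax(root->left);
    std::optional<int> r = findMax(root->right);

    if (l.has_value() && *l > best)
    {
        best = *l;
    }
    if (r.has_value() && *r > best)
    {
        best = *r;
    }

    return best;
}
```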

14. Write a recursive contains(Node *root, int data) function that returns true if the given binary tree (not necessarily a binary search tree) contains data, false otherwise. Write a corresponding version of this function that works on BSTs, as well. Implement the BST version both iteratively and recursively.

15. For a more challenging exercise, write an iterative version of the function from the previous question that operates on regular binary trees (not necessarily binary search trees). Do you see why this is a bit more difficult to do iteratively? Can you write a solution nonetheless?