Due: Tuesday Oct 21, before 9:00 pm
Objectives
- Implementing a balanced binary search tree (BST) data structure.
- Practice writing rebalancing routines for an AVL tree.
- Practice writing re-structuring routines for a Splay tree.
- Practice use of recursion in programs.
- Practice writing unit tests.
- Practice working in a Linux environment.
- Practice analyzing and understanding a project requirements.
Introduction
With the advances in the field of Natural Language Processing companies are looking for more efficient data structures since they need to work with the big data. You are assigned a task to develop a data structure as the core component of an application for text processing.
The data structure consists of a Splay tree in which we can store the information about several books. The information to store is the title, the author, and the text. The text will be stored in an AVL tree. It means every node in the Splay tree has an AVL tree. The user will be able to access the text by searching the Splay tree for a specific title. The text of every book is stored as words in the AVL tree. The user is able to search the AVL tree and collect some information about the text. The following presents some examples of queries:
- The number of occurrences of a word in a book.
- Check whether a word is commonly used by multiple authors.
In this project we focus more on the correctness of trees implementation than queries. Of course, we need to perform some queries for testing purposes. We read the information about books from a data base, parse the content of the text file, and insert the information in the data structure.
Binary Search Tree (BST)
A binary tree is a tree structure in which each node has either 0, 1, or 2 children. A BST is a derivative of a binary tree where each node contains a key and value pair. The key determines the nodes' placement in the tree, and the value is the data to be stored. Given a set of rules about how to order the keys, we can create a structure where we can query data from it with a specified key. For a BST, we define these rules as follows:
- If the target key is less than the key at the current node, traverse to the left child.
- If the target key is greater than the key at the current node, traverse to the right child.
- If the keys are equal, the action is determined by our application of the tree. More on this later.
A BST on its own can be efficient, but as the dataset increases in size, we can start running into problems. In the worst case, our BST can become a linked list where each of the new keys is greater than or less than the previous one inserted. On the contrary, the best case is inserting elements into the tree in a way to make it a complete tree. Either case is rare to occur with a large dataset, but imbalances are common. An imbalance can be defined when one subtree on a node becomes significantly larger in size or height compared to the other subtree. As the tree becomes increasingly imbalanced, our average query times begin to increase. Luckily, we have methods to prevent large imbalances.
The AVL Tree
An AVL tree employs rotations during insertions or deletions to balance a BST. As the name implies, nodes are literally rotated up the tree to keep its structure complete. A complete tree, or ideally a perfect tree, is the most efficient kind of binary tree. Insertions, deletions, and queries all take O(log(n)) time in such a case. AVL trees have two types of rotations, left and right, which are shown in the diagram below:
The variables "x" and "y" refer to 2 specific nodes whereas the subtrees "a", "b", and "c" refer to subtrees (which is just a pointer to a node which may or may not have more children). Note that the pointers to "a", "b", and/or "c" can be null, but "x" nor "y" will never be null.
The key to keeping an AVL tree efficient is when we perform these rotations. A rotation is performed on a node that is imbalanced, and an imbalance occurs when the node's children's heights differ by more than 1. For example, in the above diagram, consider node "y" to be imbalanced in the right rotation and node "x" to be imbalanced in the left rotation. Using a left and right rotation, we can perform four rotation combinations. The imbalance in the following examples occurs on the node with the height of 2 (in red).
- Single left rotation: This is a simple case where we can apply a left rotation to the top node to balance the tree.
- Single right rotation: Similar to the above case, we can apply a single right rotation to the top node to balance the tree.
- Double left-right rotation: The following two cases become more complicated and require two rotations. In this example, the imbalance still occurs at the node with height 2. If we perform a single right rotation, we still end up with an unbalanced tree, just mirrored (draw a diagram). So, we must perform two rotations. The first left rotation should transform the tree into a form we can balance with a second right rotation. Which node should the first rotation be performed on (hint: it's not necessarily the node with height 2)?
- Double right-left rotation: Likewise, this case uses a right rotation followed by a left rotation.
The Splay Tree
Splay trees are binary search trees in which we store the recently accessed node at the root of tree. Such a tree would be a good choice for data structure, if in the application some data points are accessed more frequently than others. Although some work is required in this data structure to bring up the recently accessed data point to the root, but as soon as a node is at the root, the next time its access time is O(1). The amortized analysis of Splay trees reveals that the search operation is O(log n).
Assignment
Your assignment is to implement a data structure combining multiple binary search trees with balancing and restructuring methods.
For this project, you are provided with the skeleton .h and .cpp files and a sample driver:
- booktree.h – Interface for multiple classes.
- booktree.cpp – A skeleton for the implementation of the BookTree, WordTree, and BNode classes.
- driver.cpp – A sample driver program. (Note: this file is provided to show a typical usage. Since the project is not implemented, trying to compile and run this driver program will not generate the sample output in driver.txt. Once you develop your project, you should be able to generate the similar output as driver.txt by running this driver program. Please note, the randomly generated values are different on different platforms.)
- driver.txt – A sample output produced by driver.cpp.
Please note, you may not change any of the private variables or public function declarations or file names. They will be used for grading. Also, any provided function implementations may not be modified. You may, however, add your own private functions as helpers. The current private function declarations are provided as a backbone to help you.
Additionally, you are responsible for thoroughly testing your program. Your test program, mytest.cpp, must be submitted along with other files. For grading purposes, your implementation will be tested on input data of varying sizes, including very large data. Your submission will also be checked for memory leaks and memory errors.
Specifications
There are four classes in this project. The class BookTree has a member variable of type BNode. The class BNode has a member variable of type WordTree. The class WordTree has a member variable of type Node.
Class BookTree
The class BookTree constructs a Splay tree. The nodes in this tree are of type BNode. The key for a node is the title of a book. The following table describes the public interface of the class.
This operation tries to access a node in the splay tree, then it requires splay operation too. If the title is not found the would be parent of that title is splayed.
This operation tries to access a node in the splay tree, then it requires splay operation too. If the title does not exist in the splay tree, the function returns zero, otherwise it returns the number of operations that happened in the AVL tree. This function does not count the number of operations in the splay tree. If the title is not found the would be parent of that title is splayed.
This function does not perform the splay operation if the title does not exist otherwise it does splay.
Note: The implementation of this function is provided. You do not need to modify this function.
Note: The implementation is provided. You do not need to modify it.
Class BNode
The class BNode constructs a node in the splay tree, i.e. BookTree object. It stores the title and author of a book, and a WordTree object. Once a node is inserted into the splay tree, BNode object constructs the WordTree object, i.e. AVL tree. The following table describes the public interface of the class.
Note: Please make sure the string parsing operation you implement will compile and execute in GL server. Some of the functions provided by Visual Studio compiler are specific to Microsoft development stack. They will not compile in GL (Linux environment).
Class WordTree
The class WordTree constructs an AVL tree in which the nodes are words received from the book database for a book title. The key of a node is the word. In addition, every node stores the number of occurrences of the word in a book title, and the height of the node in the tree. The leaf nodes in this tree carry data and the height of a leaf node is zero. The following table describes the public interface of the class.
Note: The implementation of this function is provided to you. You do not need to modify it.
Note: The implementation for this function is provided. You do not need to modify it.
Additional Requirements
- The class declarations and provided function implementations in booktree.cpp may not be modified in any way. No additional libraries may be used. However, private helper functions are permitted in the BNode, BookTree, and WordTree classes.
- No STL containers or additional libraries may be used in the implementation of the project classes. However, you can use STL containers in the Tester class for the testing purposes.
- The required functionality is provided in the Node class. There is no need for any modifications to the implementation of this class.
- Thee class Book is provided to help with creating a database. For the example of a database please refer to the driver.cpp sample file.
- Your code should not have any memory leaks or memory errors.
- In the AVL tree the lowest level of nodes which store the keys have zero height.
- Your test program must use real text not randomly generated data.
- Follow all coding standards as described on the C++ Coding Standards. In particular, indentations and meaningful comments are important.
- The function WordTree::dump(...) prints out the word (word is the node key) followed by a : followed by the frequency of the word followed by another : followed by the height of the node. Furthermore, WordTree::dump(...) should print an open square bracket before visiting the left subtree and a close square bracket after visiting the right subtree. The following shows an example of such an output. There should not be any space.
The tree viewer tool shows a graphical representation of the output of the dump function. You can copy & paste the dump output in the viewer. This tool facilitates debugging.
Note: The implementation for this requirement is already provided.
[[apple:2:1[banana:1:0]]cherry:1:2[[orange:1:0]plum:1:1[tomato:1:0]]]

Here, the apple:2:1 indicates that the node with key apple has height 1 and the word apple happens 2 times in the text represented by this tree. - The function BookTree::dump(...) prints out the nodes information in an in-order traversal. For every node, it prints the book's title by default. If true value is provided through the verbose parameter, the function prints the book's author too.
The tree viewer tool shows a graphical representation of the output of the dump function. You can copy & paste the dump output in the viewer. This tool facilitates debugging.
Note: The implementation of the dump function is provided to you.
((((((((A TALE OF TWO CITIES)ADVENTURES OF HUCKLEBERRY FINN)Great Expectations)Moby Dick)Pride and Prejudice)THE ADVENTURES OF TOM SAWYER)The Brothers Karamazov)The Masque of the Red Death(The Wonderful Wizard of Oz))
Testing
You need to test your project and you need to submit your tests along with your project. Tests must be submitted in a file called mytest.cpp.
- The test file name must be mytest.cpp; the file name must be in lower case, a file name like myTest.cpp is not acceptable.
- The test file must contain the declaration and implementation of your Tester class and the main() function as well as all your test cases, i.e. calls to your test functions.
- You are responsible for thoroughly testing your work before submission. The following section presents a non-exhaustive list of tests to perform on your implementation.
- You must write a separate function for every test case.
- Every test function must return true/false depending on passing or failing the test. Visual outputs are not accepted as test results.
- Tests cannot be interactive. The test file mytest.cpp must compile and run to completion.
- An example of declaring, implementing, and calling a test function, and outputting the test results was provided in the driver.cpp file of project 0.
- The testing guidelines page provides information that helps you to write more effective test cases.
Note: Testing incrementally makes finding bugs easier. Once you finish a function and it is testable, make sure it is working correctly.
Testing Project Classes
- Test whether the AVL trees are balanced after a decent number of insertions, e.g. inserting the provided database. (Note: this requires visiting all nodes and checking the height values are correct.)
- Test whether the BST property of the AVL trees is preserved after all insertions. (Note: this requires visiting all nodes and comparing key values.)
- Test whether the Splay tree performs the splay operations when inserting. For example, we can insert multiple nodes in the splay tree and after every insertion we check whether the inserted node is at root and the tree preserves the BST property.
- Test whether the AVL trees are balanced after multiple removals. For example, insert database, then remove half of the words, and check the AVL property.
- Test whether the BST property is preserved after multiple removals from the AVL trees.
- Test whether the function WordTree::remove(...) works correctly for multiple word removals, e.g. 50 removals.
- Test whether the function WordTree::find(...) works correctly for multiple word searches, e.g. 50 searches.
- Test whether the function BookTree::searchCount(...) works correctly, i.e. it returns a correct value and it performs the splay operation.
- Test whether the function BookTree::findFrequency(...) works correctly, i.e. it returns a correct value and it performs the splay operation.
- Test whether the BST property is preserved in the Splay tree after multiple Splay operations.
Memory leaks and errors
- Run your test program in valgrind; check that there are no memory leaks or errors.
Note: If valgrind finds memory errors, compile your code with the -g option to enable debugging support and then re-run valgrind with the -s and --track-origins=yes options. valgrind will show you the lines numbers where the errors are detected and can usually tell you which line is causing the error. - Never ignore warnings. They are a major source of errors in a program.
What to Submit
You must submit the following files to the proj2 directory.
- booktree.h
- booktree.cpp
- mytest.cpp - (Note: This file contains the declaration and implementation of your Tester class as well as all your test cases and a main function.)
If you followed the instructions in the Project Submission page to set up your directories, you can submit your code using the following command:
cp booktree.h booktree.cpp mytest.cpp ~/cs341proj/proj2/
Grading Rubric
For details of grading rubric please refer to Project Grading Policy