CMSC 341 - Project 2: Natural Language Processing - Fall 2025

Due: Tuesday Oct 21, before 9:00 pm

Objectives

  • Implementing a balanced binary search tree (BST) data structure.
  • Practice writing rebalancing routines for an AVL tree.
  • Practice writing re-structuring routines for a Splay tree.
  • Practice use of recursion in programs.
  • Practice writing unit tests.
  • Practice working in a Linux environment.
  • Practice analyzing and understanding a project requirements.

Introduction

With the advances in the field of Natural Language Processing companies are looking for more efficient data structures since they need to work with the big data. You are assigned a task to develop a data structure as the core component of an application for text processing.

The data structure consists of a Splay tree in which we can store the information about several books. The information to store is the title, the author, and the text. The text will be stored in an AVL tree. It means every node in the Splay tree has an AVL tree. The user will be able to access the text by searching the Splay tree for a specific title. The text of every book is stored as words in the AVL tree. The user is able to search the AVL tree and collect some information about the text. The following presents some examples of queries:

  • The number of occurrences of a word in a book.
  • Check whether a word is commonly used by multiple authors.

In this project we focus more on the correctness of trees implementation than queries. Of course, we need to perform some queries for testing purposes. We read the information about books from a data base, parse the content of the text file, and insert the information in the data structure.

Binary Search Tree (BST)

A binary tree is a tree structure in which each node has either 0, 1, or 2 children. A BST is a derivative of a binary tree where each node contains a key and value pair. The key determines the nodes' placement in the tree, and the value is the data to be stored. Given a set of rules about how to order the keys, we can create a structure where we can query data from it with a specified key. For a BST, we define these rules as follows:

  1. If the target key is less than the key at the current node, traverse to the left child.
  2. If the target key is greater than the key at the current node, traverse to the right child.
  3. If the keys are equal, the action is determined by our application of the tree. More on this later.

A BST on its own can be efficient, but as the dataset increases in size, we can start running into problems. In the worst case, our BST can become a linked list where each of the new keys is greater than or less than the previous one inserted. On the contrary, the best case is inserting elements into the tree in a way to make it a complete tree. Either case is rare to occur with a large dataset, but imbalances are common. An imbalance can be defined when one subtree on a node becomes significantly larger in size or height compared to the other subtree. As the tree becomes increasingly imbalanced, our average query times begin to increase. Luckily, we have methods to prevent large imbalances.

The AVL Tree

An AVL tree employs rotations during insertions or deletions to balance a BST. As the name implies, nodes are literally rotated up the tree to keep its structure complete. A complete tree, or ideally a perfect tree, is the most efficient kind of binary tree. Insertions, deletions, and queries all take O(log(n)) time in such a case. AVL trees have two types of rotations, left and right, which are shown in the diagram below:

The variables "x" and "y" refer to 2 specific nodes whereas the subtrees "a", "b", and "c" refer to subtrees (which is just a pointer to a node which may or may not have more children). Note that the pointers to "a", "b", and/or "c" can be null, but "x" nor "y" will never be null.

The key to keeping an AVL tree efficient is when we perform these rotations. A rotation is performed on a node that is imbalanced, and an imbalance occurs when the node's children's heights differ by more than 1. For example, in the above diagram, consider node "y" to be imbalanced in the right rotation and node "x" to be imbalanced in the left rotation. Using a left and right rotation, we can perform four rotation combinations. The imbalance in the following examples occurs on the node with the height of 2 (in red).

  1. Single left rotation: This is a simple case where we can apply a left rotation to the top node to balance the tree.
  2. Single right rotation: Similar to the above case, we can apply a single right rotation to the top node to balance the tree.
  3. Double left-right rotation: The following two cases become more complicated and require two rotations. In this example, the imbalance still occurs at the node with height 2. If we perform a single right rotation, we still end up with an unbalanced tree, just mirrored (draw a diagram). So, we must perform two rotations. The first left rotation should transform the tree into a form we can balance with a second right rotation. Which node should the first rotation be performed on (hint: it's not necessarily the node with height 2)?
  4. Double right-left rotation: Likewise, this case uses a right rotation followed by a left rotation.

The Splay Tree

Splay trees are binary search trees in which we store the recently accessed node at the root of tree. Such a tree would be a good choice for data structure, if in the application some data points are accessed more frequently than others. Although some work is required in this data structure to bring up the recently accessed data point to the root, but as soon as a node is at the root, the next time its access time is O(1). The amortized analysis of Splay trees reveals that the search operation is O(log n).

Assignment

Your assignment is to implement a data structure combining multiple binary search trees with balancing and restructuring methods.

For this project, you are provided with the skeleton .h and .cpp files and a sample driver:

  • booktree.h – Interface for multiple classes.
  • booktree.cpp – A skeleton for the implementation of the BookTree, WordTree, and BNode classes.
  • driver.cpp – A sample driver program. (Note: this file is provided to show a typical usage. Since the project is not implemented, trying to compile and run this driver program will not generate the sample output in driver.txt. Once you develop your project, you should be able to generate the similar output as driver.txt by running this driver program. Please note, the randomly generated values are different on different platforms.)
  • driver.txt – A sample output produced by driver.cpp.

Please note, you may not change any of the private variables or public function declarations or file names. They will be used for grading. Also, any provided function implementations may not be modified. You may, however, add your own private functions as helpers. The current private function declarations are provided as a backbone to help you.

Additionally, you are responsible for thoroughly testing your program. Your test program, mytest.cpp, must be submitted along with other files. For grading purposes, your implementation will be tested on input data of varying sizes, including very large data. Your submission will also be checked for memory leaks and memory errors.

Specifications

There are four classes in this project. The class BookTree has a member variable of type BNode. The class BNode has a member variable of type WordTree. The class WordTree has a member variable of type Node.

Class BookTree

The class BookTree constructs a Splay tree. The nodes in this tree are of type BNode. The key for a node is the title of a book. The following table describes the public interface of the class.

BookTree::BookTree()
Constructor initializes an empty object.
BookTree::~BookTree()
Destructor deallocates all dynamically created memory.
void BookTree::loadData(Book data[ ], int arrayLength)
This function reads the database structure, and parses the data. It inserts the data into the the Splay tree.
int BookTree::findFrequency(string title, string word)
This function returns the number of occurrences of the word in a book title. The number of occurrences is stored in the nodes of WordTree. If the title or the word is not found the function returns zero.
This operation tries to access a node in the splay tree, then it requires splay operation too. If the title is not found the would be parent of that title is splayed.
int BookTree::searchCount(string title, string word)
This function returns the number of visited nodes in the WordTree when searching for a word. It starts at zero when visiting root, then the path from root to its child counts as one, and it continues until the word is found or is not found.
This operation tries to access a node in the splay tree, then it requires splay operation too. If the title does not exist in the splay tree, the function returns zero, otherwise it returns the number of operations that happened in the AVL tree. This function does not count the number of operations in the splay tree. If the title is not found the would be parent of that title is splayed.
int BookTree::getTextTreeHeight(string title)
This function returns the height of the word tree for a book title. If the title does not exist the function returns -1.
This function does not perform the splay operation if the title does not exist otherwise it does splay.
string BookTree::getRootKey()
This function returns the root's title of the splay tree. It will be used to test the splay operation. For example, after an access the root should contain the node that was accessed.
Note: The implementation of this function is provided. You do not need to modify this function.
bool BookTree::insert(string title, string author, string text)
This function creates a new node and inserts the information for a book in the splay tree. The splay operation is required when inserting a new node. If the title exists in the tree this function does not modify the node and returns false. If the insertion is successful the function returns true.
bool BookTree::removeWord(string title, string word)
This function removes the word from the AVL tree of the title. This function does not perform the splay operation whether removal is successful or not. It returns true if the removal is successful otherwise it returns false.
void BookTree::dump(bool verbose=false) const
This function prints out the contents of the splay tree using in-order traversal. If verbose is true, it also will dump the contents of the AVL trees. Please refer to the Addition Requirements section on this page or the provided driver.cpp and driver.txt to observe an example of the output.
Note: The implementation is provided. You do not need to modify it.

Class BNode

The class BNode constructs a node in the splay tree, i.e. BookTree object. It stores the title and author of a book, and a WordTree object. Once a node is inserted into the splay tree, BNode object constructs the WordTree object, i.e. AVL tree. The following table describes the public interface of the class.

BNode::BNode()
The default constructor creates an empty object.
BNode::BNode(string title, string author, string text)
This alternative constructor initializes an object with the provided information. Moreover, it needs to parse the provided text by space character (DELIMITER), and to insert the words in the AVL tree.
Note: Please make sure the string parsing operation you implement will compile and execute in GL server. Some of the functions provided by Visual Studio compiler are specific to Microsoft development stack. They will not compile in GL (Linux environment).
BNode::~BNode()
The destructor does not perform any specific operations, no memory allocation is required in this class.
int BNode::findFrequency(string word)
This function calls the corresponding function from WordTree class and returns the word's number of occurrences. If the word is not found, the function returns zero.
int BNode::searchCount(string word)
This function calls the corresponding function from WordTree class.
int BNode::getTextTreeHeight()
This function calls the corresponding function from WordTree class.

Class WordTree

The class WordTree constructs an AVL tree in which the nodes are words received from the book database for a book title. The key of a node is the word. In addition, every node stores the number of occurrences of the word in a book title, and the height of the node in the tree. The leaf nodes in this tree carry data and the height of a leaf node is zero. The following table describes the public interface of the class.

WordTree::WordTree()
Constructor creates an empty object.
Note: The implementation of this function is provided to you. You do not need to modify it.
WordTree::~WordTree()
Destructor deallocates all dynamically created memory.
void WordTree::insert(const string& word)
This function inserts the word in the tree. Since, the tree is an AVL tree, it needs to rebalance the tree and recalculate heights after insertion. A duplicate key is not allowed. In the case of duplicate insertion we only update the word count in the node.
bool WordTree::remove(const string& word)
This function removes the word from the AVL tree. If the remove operation is successful the function returns true, otherwise it returns false.
int WordTree::searchCount(string word)
This function returns the number of visited nodes in the WordTree when searching for a word. It starts at zero when visiting root, then the path from root to its child counts as one, and it continues until the word is found or is not found.
int WordTree::getTreeHeight()
This function returns the height of the tree.
void WordTree::dump(std::ostream& ostr = std::cout)
This function prints out the list of all nodes in the tree. The word (node key) is printed along with its number of occurrences and its height in the tree. The default output for the function is the standard output (std::cout). An example of the output is provided in the Additional Requirements section below.
Note: The implementation for this function is provided. You do not need to modify it.
Node* WordTree::find(const string& word)
This function finds a node using the provided key and returns the node. If the node does not exist the function returns nullptr.

Additional Requirements

  • The class declarations and provided function implementations in booktree.cpp may not be modified in any way. No additional libraries may be used. However, private helper functions are permitted in the BNode, BookTree, and WordTree classes.
  • No STL containers or additional libraries may be used in the implementation of the project classes. However, you can use STL containers in the Tester class for the testing purposes.
  • The required functionality is provided in the Node class. There is no need for any modifications to the implementation of this class.
  • Thee class Book is provided to help with creating a database. For the example of a database please refer to the driver.cpp sample file.
  • Your code should not have any memory leaks or memory errors.
  • In the AVL tree the lowest level of nodes which store the keys have zero height.
  • Your test program must use real text not randomly generated data.
  • Follow all coding standards as described on the C++ Coding Standards. In particular, indentations and meaningful comments are important.
  • The function WordTree::dump(...) prints out the word (word is the node key) followed by a : followed by the frequency of the word followed by another : followed by the height of the node. Furthermore, WordTree::dump(...) should print an open square bracket before visiting the left subtree and a close square bracket after visiting the right subtree. The following shows an example of such an output. There should not be any space. The tree viewer tool shows a graphical representation of the output of the dump function. You can copy & paste the dump output in the viewer. This tool facilitates debugging. Note: The implementation for this requirement is already provided.
      [[apple:2:1[banana:1:0]]cherry:1:2[[orange:1:0]plum:1:1[tomato:1:0]]]


    Here, the apple:2:1 indicates that the node with key apple has height 1 and the word apple happens 2 times in the text represented by this tree.
  • The function BookTree::dump(...) prints out the nodes information in an in-order traversal. For every node, it prints the book's title by default. If true value is provided through the verbose parameter, the function prints the book's author too. The tree viewer tool shows a graphical representation of the output of the dump function. You can copy & paste the dump output in the viewer. This tool facilitates debugging. Note: The implementation of the dump function is provided to you.
    ((((((((A TALE OF TWO CITIES)ADVENTURES OF HUCKLEBERRY FINN)Great Expectations)Moby Dick)Pride and Prejudice)THE ADVENTURES OF TOM SAWYER)The Brothers Karamazov)The Masque of the Red Death(The Wonderful Wizard of Oz))

Testing

You need to test your project and you need to submit your tests along with your project. Tests must be submitted in a file called mytest.cpp.

  • The test file name must be mytest.cpp; the file name must be in lower case, a file name like myTest.cpp is not acceptable.
  • The test file must contain the declaration and implementation of your Tester class and the main() function as well as all your test cases, i.e. calls to your test functions.
  • You are responsible for thoroughly testing your work before submission. The following section presents a non-exhaustive list of tests to perform on your implementation.
  • You must write a separate function for every test case.
  • Every test function must return true/false depending on passing or failing the test. Visual outputs are not accepted as test results.
  • Tests cannot be interactive. The test file mytest.cpp must compile and run to completion.
  • An example of declaring, implementing, and calling a test function, and outputting the test results was provided in the driver.cpp file of project 0.
  • The testing guidelines page provides information that helps you to write more effective test cases.

Note: Testing incrementally makes finding bugs easier. Once you finish a function and it is testable, make sure it is working correctly.

Testing Project Classes

  • Test whether the AVL trees are balanced after a decent number of insertions, e.g. inserting the provided database. (Note: this requires visiting all nodes and checking the height values are correct.)
  • Test whether the BST property of the AVL trees is preserved after all insertions. (Note: this requires visiting all nodes and comparing key values.)
  • Test whether the Splay tree performs the splay operations when inserting. For example, we can insert multiple nodes in the splay tree and after every insertion we check whether the inserted node is at root and the tree preserves the BST property.
  • Test whether the AVL trees are balanced after multiple removals. For example, insert database, then remove half of the words, and check the AVL property.
  • Test whether the BST property is preserved after multiple removals from the AVL trees.
  • Test whether the function WordTree::remove(...) works correctly for multiple word removals, e.g. 50 removals.
  • Test whether the function WordTree::find(...) works correctly for multiple word searches, e.g. 50 searches.
  • Test whether the function BookTree::searchCount(...) works correctly, i.e. it returns a correct value and it performs the splay operation.
  • Test whether the function BookTree::findFrequency(...) works correctly, i.e. it returns a correct value and it performs the splay operation.
  • Test whether the BST property is preserved in the Splay tree after multiple Splay operations.

Memory leaks and errors

  • Run your test program in valgrind; check that there are no memory leaks or errors.
    Note: If valgrind finds memory errors, compile your code with the -g option to enable debugging support and then re-run valgrind with the -s and --track-origins=yes options. valgrind will show you the lines numbers where the errors are detected and can usually tell you which line is causing the error.
  • Never ignore warnings. They are a major source of errors in a program.

What to Submit

You must submit the following files to the proj2 directory.

  • booktree.h
  • booktree.cpp
  • mytest.cpp - (Note: This file contains the declaration and implementation of your Tester class as well as all your test cases and a main function.)

If you followed the instructions in the Project Submission page to set up your directories, you can submit your code using the following command:

   cp booktree.h booktree.cpp mytest.cpp ~/cs341proj/proj2/

Grading Rubric

For details of grading rubric please refer to Project Grading Policy