CMSC 341 - Project 4: Virus Detection App - Spring 2026

Due: Tuesday May 5, before 9:00 pm

Objectives

  • To implement a hash table.
  • To implement linear, quadratic and double-hash probing to manage hash collisions.
  • To implement an incremental re-hashing algorithm.
  • To practice use of function pointers.
  • To practice writing unit tests.
  • To practice working in a Linux environment.
  • To practice analyzing and understanding a project requirements.

Introduction

Some viruses do not have any effects on their hosts, but many are harmful. A pharmaceutical company plans to bring the virus detection to the hands of public. They have designed a nucleic-acid test that can be used by consumers to capture some signature DNA sequences from viruses and search a database to find the information for the potentially harmful virus. Here is a quote from Wikipedia about nucleic acid test:

“ A nucleic acid test (NAT) is a technique used to detect a particular nucleic acid sequence and thus usually to detect and identify a particular species or subspecies of organism, often a virus or bacterium that acts as a pathogen in blood, tissue, urine, etc. NATs differ from other tests in that they detect genetic materials (RNA or DNA) rather than antigens or antibodies. Detection of genetic materials allows an early diagnosis of a disease because the detection of antigens and/or antibodies requires time for them to start appearing in the bloodstream. ”

The DNA molecules have four chemical building blocks which can appear in different orders or different sequences. The sequence of these building blocks tells scientists the kind of genetic information that is carried in a particular DNA segment. For example, the sequence can highlight changes in a gene that may cause a disease. The building blocks are called Adenine (A), Cytosine (C), Guanine (G), and Thymine (T). The notation to present the sequences uses the characters A, C, G, and T. The sequence AACAGGGTCA presents a sample sequence.

You are assigned the task of developing the database. This database uses the DNA sequences as keys to store the information. The main required operations in the database are insertion, search and removal of records. The database will use a hash table to enable the fast search time. The project manager decides to test different collision handling schemes. Therefore, a decision has been made to implement both quadratic and double hashing collision handling policies and allowing to switch between policies during the runtime. Moreover, since the database can be modified the project manager decided the table can be resized, i.e. expand or shrink. The application should be able to rehash the entire table.

Hash Tables

A hash table is a data structure which implements the map ADT. It stores the pairs of key-value. The key determines the index number of an array cell (bucket) in which the value will be stored. A hash function calculates the index number. If there is more than one value for a key in the data set, the hash table uses a collision handling scheme to find another cell for storing the new value.

Linear Probing

When we try to find an index in the hash table for a key and we find out that the determined bucket is taken, we need to find another bucket to store the key. In a linear probing we store the information in the next available bucket. In this scheme the jump step to find the next available bucket is one.

Quadratic Probing

Linear probing can cause clustering, i.e. having many data points in a row. Clustering reduces the search efficiency. Using quadratic probing the jump step to find the next available bucket is the square of iteration number.

Double-Hash Probing

In this project we use the following equation to find the next bucket for the purpose of the double hashing collision handling, in which i = 0,1,2,3 ....

  index = ((Hash(key) % TableSize) + i x (11-(Hash(key) % 11))) % TableSize

Assignment

Your assignment is to complete the class VDetect and write the appropriate test functions.

The application starts with a hash table of size MINPRIME. After certain criteria appearing it will switch to another table and it transfers all data nodes from the current table to the new one incrementally. Once the switching process starts it scans 25% of the table and transfers any live nodes it finds in the old table and at every consecutive operation (insert/remove) It continues to scan 25% more of the table and transfers live data from the old table to the new table until all data is transferred. We do not transfer deleted buckets to the new table.

After an insertion, if the load factor becomes greater than 0.5, we need to rehash to a new hash table. The capacity of the new table would be the smallest prime number greater than 4 times the current number of data points. The current number of data points is total number of occupied buckets minus total number of deleted buckets.

After a deletion, if the number of deleted buckets is more than 80 percent of the total number of occupied buckets, we need to rehash to a new table. The capacity of the new table would be the smallest prime number greater than 4 times the current number of data points. The current number of data points is total number of occupied buckets minus total number of deleted buckets.

During a rehashing process the deleted buckets will be removed from the system permanently. They will not be transferred to the new table.

If a change of collision handling policy is requested by the user while the program is running, the new policy will be applied to the new table when a new rehash is triggered.

For this project, you are provided with the following files:

  • vdetect.h – header file for the Virus and VDetect classes.
  • vdetect.cpp – Complete your VDetect implementation in this file.
  • driver.cpp – A sample driver program.
  • driver.txt – A sample output produced by driver.cpp.

Specifications

VDetect Class

The VDetect class uses the Virus class. It has a member variable to store a pointer to a hash function. It also has two member variables to store pointers to two arrays of Virus objects. These arrays are m_currentTable and m_oldTable, and the m_key member variable of the Virus object is used as the key for hashing purposes. A Virus object has another member variable which stores an ID for the object. The ID and the key define the uniqueness of a Virus object together.

VDetect::VDetect(int size, hash_fn hash, prob_t probing = DEFPOLCY)
The constructor takes size to specify the length of the current hash table, and hash is a function pointer to a hash function. The type of hash is defined in vdetect.h.
The table size must be a prime number between MINPRIME and MAXPRIME. If the user passes a size less than MINPRIME, the capacity must be set to MINPRIME. If the user passes a size larger than MAXPRIME, the capacity must be set to MAXPRIME. If the user passes a non-prime number the capacity must be set to the smallest prime number greater than user's value. The probing parameter specifies the type of collision handling policy for the current hash table.
Moreover, the constructor creates memory for the current table and initializes all member variables.
VDetect::~VDetect()
Destructor deallocates the memory.
bool VDetect::insert(Virus virus)
This function inserts an object into the current hash table. The insertion index is determined by applying the hash function m_hash that is set in the VDetect constructor and then reducing the output of the hash function modulo the table size. A sample hash function is provided in the driver.cpp file.
Hash collisions should be resolved using the probing policy specified in the m_currProbing variable. We insert into the table indicated by m_currentTable. After every insertion we need to check for the proper criteria, and if it is required, we need to rehash the entire table incrementally into a new table. The incremental rehashing proceeds with scanning 25% of the table at a time and transfer any live data found to the new table. Once we transferred the live nodes in the first 25% of the table, the second 25% live data will be transferred at the next operation (insertion or removal). Once all data is transferred to the new table, the old table will be removed, and its memory will be deallocated.
If the Virus object is inserted, the function returns true, otherwise it returns false. A Virus object can only be inserted once. The hash table does not contain duplicate objects. Moreover, the ID value should be a valid one falling in the range [MINID-MAXID]. Every Virus object is a unique object carrying the Virus's key and the ID. The Virus's key is used for hashing.
bool VDetect::remove(Virus virus)
This function removes a data point from either the current hash table or the old hash table where the object is stored. In a hash table we do not empty the bucket, we only tag it as deleted. To tag a removed bucket we can use the member variable m_used in the Virus class. To find the bucket of the object we should use the proper probing policy for the table.
After every deletion we need to check for the proper criteria, and if it is required, we need to rehash the entire table incrementally into the current table. The incremental rehashing proceeds with scanning 25% of the table at a time and transfer any live data found to the new table. Once we transferred the live nodes in the first 25% of the table, the second 25% live data will be transferred at the next operation (insertion or removal). Once all data is transferred to the new table, the old table will be removed, and its memory will be deallocated.
If the Virus object is found and is deleted, the function returns true, otherwise it returns false.
Virus VDetect::getVirus(string key, int id) const
This function looks for the Virus object with the sequence (key) and the ID in the database, if the object is found the function returns it, otherwise the function returns empty object.
bool updateID(Virus virus, int ID)
This function looks for the Virus object in the database, if the object is found the function updates its ID and returns true, otherwise the function returns false.
float VDetect::lambda() const
This function returns the load factor of the current hash table. The load factor is the ratio of occupied buckets to the table capacity. An occupied bucket is a bucket which can contain either a live data node (available to be used) or a deleted node.
float VDetect::deletedRatio() const
This function returns the ratio of the deleted buckets to the total number of occupied buckets .
void changeProbPolicy(prob_t policy)
The user can change the collision handling policy of the hash table at the runtime. If a change request has been submitted by a user, the new policy will be stored in the variable m_newPolicy. Once the next rehash operation is initiated the new policy will be used for the newly created hash table.
void VDetect::dump()
This function dumps the contents of the current hash table and the old hash table if it exists. It prints the contents of the hash table in array-index order. Note: The implementation of this function is provided. The function is provided to facilitate debugging.
int VDetect::findNextPrime(int current)
This function returns the smallest prime number greater than the argument "current". If "current" is less than or equal to MINPRIME, the function returns MINPRIME. If "current" is greater than or equal to MAXPRIME, the function returns MAXPRIME. In a hash table we'd like to use a table with prime size. Then, every time we need to determine the size for a new table, we use this function. Note: The implementation of this function is provided. Also, implement a helper function which returns the difference between current and the smallest prime number greater than current. This function is a helper utility related to table sizing, but it is not used by the main hash table operations. The signature of the function is int VDetect::nextPrimeGap(int current).
bool VDetect::isPrime(int number)
This function returns true if the passed argument "number" is a prime number, otherwise it returns false. Note: The implementation of this function is provided.

Additional Requirements

  • Private helper functions may be added to the VDetect class; however, they must be declared in the private section of the class declaration.
  • Every table may have its own collision handling policy. If previously a change of policy has been requested and then a rehash initiates the new table will use the new policy and the old table will use its own policy while it exists.
  • The allocated memory to the hash table must be dynamically managed at execution time when there is rehashing.
  • Once the current hash table exceeds some criteria the VDetect class rehashes the data into a new hash table. This requires creating a new table and swapping the two tables, so the newly created table becomes the current table.
  • The capacity of the new table is determined by the information from the current table (which will become the old table). It would be the smallest prime number greater than ((m_currentSize - m_numDeleted)*4).
  • For rehashing we scan 25% of the table at every operation and transfer any live data to the new table. The class VDetect has a member variable named m_transferIndex which can be used to keep track of the current status of transfer.
  • The 25% of data is an integer number, we use the floor value of the result.
  • When rehashing, the deleted buckets will be removed from the system.
  • The insertion of data points always happens in the current table.
  • The find operation can happen in the current table and the old table if it exists.
  • The remove operation can happen in the current table and the old table if it exists.
  • Once the incremental rehashing ends and there are no more live data objects in the old table, the old table must be removed and its memory must be deallocated.
  • Here are the rules for lazy deletion,
    • Treat deleted element as empty when inserting.
    • Treat deleted element as occupied when searching.
  • Here are the rules for rehashing criteria,
    • rehash once the load factor exceeds 50%.
    • rehash once the deleted ratio exceeds 80%.
  • The load factor is the number of occupied buckets divided by the table size. The number of occupied buckets is the total of available data and deleted data.
  • The deleted ratio is the number of deleted buckets divided by the number of occupied buckets.
  • No STL containers or additional libraries may be used in the VDetect class. STL containers can be used in mytest.cpp for testing purposes.

Testing

  • The test file name must be mytest.cpp; the file name must be in lower case, a file name like myTest.cpp is not acceptable.
  • The test file must contain the declaration and implementation of your Tester class and the main() function as well as all your test cases, i.e. calls to your test functions.
  • You are responsible for adequately testing your work before submission. The following section presents a non-exhaustive list of tests to perform on your implementation.
  • You must write a separate function for every test case.
  • Every test function must return true/false depending on passing or failing the test. Visual outputs are not accepted as test results.
  • The dump() function should not be called in the test functions. The dump() function is only provided to facilitate debugging.
  • Tests cannot be interactive. The test file mytest.cpp must compile and run to completion.
  • An example of declaring, implementing, and calling a test function, and outputting the test results was provided in the driver.cpp file of project 0.
  • The testing guidelines page provides information that helps you to write more effective test cases.

Note: Testing incrementally makes finding bugs easier. Once you finish a function and it is testable, make sure it is working correctly.

Testing VDetect Class

Please note: all tests must be performed using a descent number of data points, for example, 50 nodes.

  • Test the insertion operation in the hash table with a descent number of data points, e.g. 50 nodes. The following presents a sample algorithm to test the normal insertion operation:
    • There are some non-colliding data points in the hash table.
    • Insert multiple non-colliding keys.
    • Check whether they are inserted in the correct bucket (correct index).
    • Check whether the data size changes correctly.
  • Test the insertion operation in the hash table with a descent number of data points with half of them using colliding keys, e.g. 50 nodes.
  • Test the find operation (getVirus(...) function) for an error case, the Virus object does not exist in the database.
  • Test the find operation (getVirus(...) function) with several non-colliding keys.
  • Test the find operation (getVirus(...) function) with several colliding keys without triggering a rehash. This also tests whether the insertion works correctly with colliding data.
  • Test the remove operation with a few non-colliding keys.
  • Test the remove operation with a number of colliding keys without triggering a rehash.
  • Test the rehashing is triggered after a descent number of data insertion.
  • Test the rehash completion after triggering rehash due to load factor, i.e. all live data is transferred to the new table and the old table is removed.
  • Test the rehashing is triggered after a descent number of data removal.
  • Test the rehash completion after triggering rehash due to delete ratio, i.e. all live data is transferred to the new table and the old table is removed.

Random Numbers for Testing

For testing purposes, we need data. Data can be written as fixed values or can be generated randomly. Writing fixed data values might be a tedious work since we need a large amount of data points. You must use the provided Random class to generate test data.

In the file driver.cpp there is the class Random which generates pseudorandom numbers. The class is using a default seed value. It also is possible to change the seed value. On the same machine it always generates the same sequence of numbers. That is why the numbers are called pseudorandom numbers, they are not real random numbers. Please note, the numbers are machine dependent, therefore, the numbers you see in the sample file driver.txt might be different from the numbers your machine generates.

Testing For Memory Leaks / Memory Errors

  • Run your test program in valgrind; check that there are no memory leaks or errors.
    Note: If valgrind finds memory errors, compile your code with the -g option to enable debugging support and then re-run valgrind with the -s and --track-origins=yes options. valgrind will show you the line numbers where the errors are detected and can usually tell you which line is causing the error.
  • Never ignore warnings. They are a major source of errors in a program.

What to Submit

You must submit the following files to the proj4 submit directory:

  • vdetect.h
  • vdetect.cpp
  • mytest.cpp (Note: This file contains the declaration and implementation of your Tester class as well as all your test cases.)

If you followed the instructions in the Project Submission page to set up your directories, you can submit your code using the following command:

   cp vdetect.h vdetect.cpp mytest.cpp ~/cs341proj/proj4/

Grading Rubric

For details of grading rubric please refer to Project Grading Policy