Due: Tuesday May 6, before 9:00 pm
Objectives
- To implement a hash table.
- To implement linear, quadratic and double-hash probing to manage hash collisions.
- To implement an incremental re-hashing algorithm.
- To practice use of function pointers.
- To practice writing unit tests.
- To practice working in a Linux environment.
- To practice analyzing and understanding a project requirements.
Introduction
A research team collects and analyzes DNA samples from live organisms in a wide forest of amazon. You are assigned the task of developing a database application which is adapted to the project’s specific requirements. The main operations in this application are insert, find, and remove. Since the efficiency of operations is extremely important, we have decided to use a hash table to manage accessing the samples information.
The DNA molecules have four chemical building blocks which can appear in different orders or different sequences. The sequence of these building blocks tells scientists the kind of genetic information that is carried in a particular DNA segment. For example, the sequence can highlight changes in a gene that may cause a disease. The building blocks are called Adenine (A), Cytosine (C), Guanine (G), and Thymine (T). The notation to present the sequences uses the characters A, C, G, and T. The sequence AACAGGGTCA presents a sample sequence.
In this application the keys are sequences, we know that this can cause collisions and clustering in a hash table. To manage the collision cases, we use multiple forms of probing for the collision handling policy. The number of samples in the database can change during different stages of the project. When a change is required, we need to rehash the entire table. However, the rehash operation happens incrementally during multiple regular operations.
Hash Tables
A hash table is a data structure which implements the map ADT. It stores the pairs of key-value. The key determines the index number of an array cell (bucket) in which the value will be stored. A hash function calculates the index number. If there is more than one value for a key in the data set, the hash table uses a collision handling scheme to find another cell for storing the new value.
Linear Probing
When we try to find an index in the hash table for a key and we find out that the determined bucket is taken, we need to find another bucket to store the key. In a linear probing we store the information in the next available bucket. In this scheme the jump step to find the next available bucket is one.
Quadratic Probing
Linear probing can cause clustering, i.e. having many data points in a row. Clustering reduces the search efficiency. Using quadratic probing the jump step to find the next available bucket is the square of iteration number.
Double-Hash Probing
In this project we use the following equation to find the next bucket for the purpose of the double hashing collision handling, in which i = 0,1,2,3 ....
index = ((Hash(key) % TableSize) + i x (11-(Hash(key) % 11))) % TableSize
Assignment
Your assignment is to complete the class DnaDb and write the appropriate test functions.
The application starts with a hash table of size MINPRIME. After certain criteria appearing it will switch to another table and it transfers all data nodes from the current table to the new one incrementally. Once the switching process starts it scans 25% of the table and transfers any live nodes it finds in the old table and at every consecutive operation (insert/remove) It continues to scan 25% more of the table and transfers live data from the old table to the new table until all data is transferred. We do not transfer deleted buckets to the new table.
After an insertion, if the load factor becomes greater than 0.5, we need to rehash to a new hash table. The capacity of the new table would be the smallest prime number greater than 4 times the current number of data points. The current number of data points is total number of occupied buckets minus total number of deleted buckets.
After a deletion, if the number of deleted buckets is more than 80 percent of the total number of occupied buckets, we need to rehash to a new table. The capacity of the new table would be the smallest prime number greater than 4 times the current number of data points. The current number of data points is total number of occupied buckets minus total number of deleted buckets.
During a rehashing process the deleted buckets will be removed from the system permanently. They will not be transferred to the new table.
If a change of collision handling policy is requested by the user while the program is running, the new policy will be applied to the new table when a new rehash is triggered.
For this project, you are provided with the following files:
- dnadb.h – header file for the DNA and DnaDb classes.
- dnadb.cpp – the template file for the DnaDb class. Complete your DnaDb implementation in this file.
- driver.cpp – A sample driver program.
- driver.txt – A sample output produced by driver.cpp.
Specifications
DnaDb Class
The DnaDb class uses the DNA class. It has a member variable to store a pointer to a hash function. It also has two member variables to store pointers to two arrays of DNA objects. These arrays are m_currentTable and m_oldTable, and the m_sequence member variable of the DNA object is used as the key for hashing purposes. A DNA object has another member variable which stores an ID for the location in which the sample has been found. The location ID and the sample sequence define the uniqueness of a DNA object together.
The table size must be a prime number between MINPRIME and MAXPRIME. If the user passes a size less than MINPRIME, the capacity must be set to MINPRIME. If the user passes a size larger than MAXPRIME, the capacity must be set to MAXPRIME. If the user passes a non-prime number the capacity must be set to the smallest prime number greater than user's value. The probing parameter specifies the type of collision handling policy for the current hash table.
Moreover, the constructor creates memory for the current table and initializes all member variables.
Hash collisions should be resolved using the probing policy specified in the m_currProbing variable. We insert into the table indicated by m_currentTable. After every insertion we need to check for the proper criteria, and if it is required, we need to rehash the entire table incrementally into a new table. The incremental rehashing proceeds with scanning 25% of the table at a time and transfer any live data found to the new table. Once we transferred the live nodes in the first 25% of the table, the second 25% live data will be transferred at the next operation (insertion or removal). Once all data is transferred to the new table, the old table will be removed, and its memory will be deallocated.
If the DNA object is inserted, the function returns true, otherwise it returns false. A DNA object can only be inserted once. The hash table does not contain duplicate objects. Moreover, the location ID value should be a valid one falling in the range [MINLOCID-MAXLOCID]. Every DNA object is a unique object carrying the DNA's sequence and the location ID. The DNA's sequence is the key which is used for hashing.
After every deletion we need to check for the proper criteria, and if it is required, we need to rehash the entire table incrementally into the current table. The incremental rehashing proceeds with scanning 25% of the table at a time and transfer any live data found to the new table. Once we transferred the live nodes in the first 25% of the table, the second 25% live data will be transferred at the next operation (insertion or removal). Once all data is transferred to the new table, the old table will be removed, and its memory will be deallocated.
If the DNA object is found and is deleted, the function returns true, otherwise it returns false.
Additional Requirements
- Private helper functions may be added to the DnaDb class; however, they must be declared in the private section of the class declaration.
- Every table may have its own collision handling policy. If previously a change of policy has been requested and then a rehash initiates the new table will use the new policy and the old table will use its own policy while it exists.
- The allocated memory to the hash table must be dynamically managed at execution time when there is rehashing.
- Once the current hash table exceeds some criteria the DnaDb class rehashes the data into a new hash table. This requires creating a new table and swapping the two tables, so the newly created table becomes the current table.
- The capacity of the new table is determined by the information from the current table (which will become the old table). It would be the smallest prime number greater than ((m_currentSize - m_numDeleted)*4).
- For rehashing we scan 25% of the table at every operation and transfer any live data to the new table. The class DnaDb has a member variable named m_transferIndex which can be used to keep track of the current status of transfer.
- The 25% of data is an integer number, we use the floor value of the result.
- When rehashing, the deleted buckets will be removed from the system.
- The insertion of data points always happens in the current table.
- The find operation can happen in the current table and the old table if it exists.
- The remove operation can happen in the current table and the old table if it exists.
- Once the incremental rehashing ends and there are no more live data objects in the old table, the old table must be removed and its memory must be deallocated.
-
Here are the rules for lazy deletion,
- Treat deleted element as empty when inserting.
- Treat deleted element as occupied when searching.
-
Here are the rules for rehashing criteria,
- rehash once the load factor exceeds 50%.
- rehash once the deleted ratio exceeds 80%.
- The load factor is the number of occupied buckets divided by the table size. The number of occupied buckets is the total of available data and deleted data.
- The deleted ratio is the number of deleted buckets divided by the number of occupied buckets.
- No STL containers or additional libraries may be used in the DnaDb class. STL containers can be used in mytest.cpp for testing purposes.
Testing
- The test file name must be mytest.cpp; the file name must be in lower case, a file name like myTest.cpp is not acceptable.
- The test file must contain the declaration and implementation of your Tester class and the main() function as well as all your test cases, i.e. calls to your test functions.
- You are responsible for adequately testing your work before submission. The following section presents a non-exhaustive list of tests to perform on your implementation.
- You must write a separate function for every test case.
- Every test function must return true/false depending on passing or failing the test. Visual outputs are not accepted as test results.
- The dump() function should not be called in the test functions. The dump() function is only provided to facilitate debugging.
- Tests cannot be interactive. The test file mytest.cpp must compile and run to completion.
- An example of declaring, implementing, and calling a test function, and outputting the test results was provided in the driver.cpp file of project 0.
- The testing guidelines page provides information that helps you to write more effective test cases.
Note: Testing incrementally makes finding bugs easier. Once you finish a function and it is testable, make sure it is working correctly.
Testing DnaDb Class
- Test the insertion operation in the hash table. The following presents a sample algorithm to test the normal insertion operation:
- There are some non-colliding data points in the hash table.
- Insert multiple non-colliding keys.
- Check whether they are inserted in the correct bucket (correct index).
- Check whether the data size changes correctly.
- Test the find operation (getDNA(...) function) for an error case, the DNA object does not exist in the database.
- Test the find operation (getDNA(...) function) with several non-colliding keys.
- Test the find operation (getDNA(...) function) with several colliding keys without triggering a rehash. This also tests whether the insertion works correctly with colliding data.
- Test the remove operation with a few non-colliding keys.
- Test the remove operation with a number of colliding keys without triggering a rehash.
- Test the rehashing is triggered after a descent number of data insertion.
- Test the rehash completion after triggering rehash due to load factor, i.e. all live data is transferred to the new table and the old table is removed.
- Test the rehashing is triggered after a descent number of data removal.
- Test the rehash completion after triggering rehash due to delete ratio, i.e. all live data is transferred to the new table and the old table is removed.
Random Numbers for Testing
For testing purposes, we need data. Data can be written as fixed values or can be generated randomly. Writing fixed data values might be a tedious work since we need a large amount of data points. You must use the provided Random class to generate test data.
In the file driver.cpp there is the class Random which generates pseudorandom numbers. The class is using a default seed value. It also is possible to change the seed value. On the same machine it always generates the same sequence of numbers. That is why the numbers are called pseudorandom numbers, they are not real random numbers. Please note, the numbers are machine dependent, therefore, the numbers you see in the sample file driver.txt might be different from the numbers your machine generates.
Testing For Memory Leaks / Memory Errors
- Run your test program in valgrind; check that there are no memory leaks or errors.
Note: If valgrind finds memory errors, compile your code with the -g option to enable debugging support and then re-run valgrind with the -s and --track-origins=yes options. valgrind will show you the line numbers where the errors are detected and can usually tell you which line is causing the error. - Never ignore warnings. They are a major source of errors in a program.
What to Submit
You must submit the following files to the proj4 submit directory:
- dnadb.h
- dnadb.cpp
- mytest.cpp (Note: This file contains the declaration and implementation of your Tester class as well as all your test cases.)
If you followed the instructions in the Project Submission page to set up your directories, you can submit your code using the following command:
cp dnadb.h dnadb.cpp mytest.cpp ~/cs341proj/proj4/
Grading Rubric
For details of grading rubric please refer to Project Grading Policy