Efficient Data Deduplication Mechanism for Genomic Data

  • Tin Thein Thwel, Myanmar Institute of Information Technology (MIIT), Mandalay, Myanmar
  • Dr. G R Sinha, Myanmar Institute of Information Technology, Myanmar
Keywords: Genomic data, Data deduplication, B tree indexing, Data storage, Chunking algorithm


In the age of data science, many people access health-related information and diagnoses through information technology, including telemedicine. Consequently, many researchers are working with medical experts and in the field of bioinformatics. In bioinformatics, handling human genomic data, including its collection, storage, and processing, has become essential. Genomic data refers to the genome and DNA data of an organism. Such data inevitably require a huge amount of storage before customized software can analyze them. Recently, genome researchers have been raising alarms over big data. This paper attempts to reduce the storage requirement significantly by applying data deduplication to genomic data sets. Data deduplication, or 'dedupe' for short, reduces the amount of storage needed because it keeps only a single instance of each piece of data. It is therefore one solution for optimizing the large storage space that genome data require. We implemented a data deduplication method and applied it to genomic data; deduplication was performed successfully using a secure hash algorithm, a B+ tree, and a sub-file-level chunking algorithm, combined in an integrated approach. Files are split into chunks with the Two Threshold Two Divisors (TTTD) algorithm, and a hash function is used to compute chunk identifiers. Indexing keys are constructed from these identifiers in a B+ tree-like index structure. The system can reduce storage space significantly when duplicated data exist. Preliminary testing was carried out using NCBI datasets.
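
The pipeline described in the abstract (content-defined chunking, hashing each chunk to obtain an identifier, and storing each identifier only once in an index) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The names `tttd_chunks` and `DedupStore`, the threshold and divisor values, and the polynomial rolling hash are all hypothetical choices made for illustration. SHA-1 is assumed as the secure hash algorithm, and a plain dictionary stands in for the B+ tree-like index, since the Python standard library has no B+ tree.

```python
import hashlib

# Rolling-hash settings (assumed for illustration, not taken from the paper).
WINDOW = 16                            # bytes covered by the rolling hash
BASE = 257
MOD = (1 << 31) - 1
POW_OUT = pow(BASE, WINDOW - 1, MOD)   # weight of the byte leaving the window


def tttd_chunks(data, t_min=64, t_max=512, d_main=128, d_backup=64):
    """Split bytes into chunks with a simplified Two Threshold Two Divisors scan.

    t_min / t_max are the two thresholds (minimum and maximum chunk size);
    d_main / d_backup are the two divisors (main and backup breakpoints).
    """
    chunks = []
    start = 0        # index of the first byte of the current chunk
    backup = -1      # last position that matched the backup divisor
    h = 0
    for i, byte in enumerate(data):
        # Slide the window: drop the oldest byte, then add the new one.
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * POW_OUT) % MOD
        h = (h * BASE + byte) % MOD

        length = i - start + 1
        if length < t_min:
            continue                      # chunk is still too small to cut
        if h % d_backup == d_backup - 1:
            backup = i                    # remember a fallback breakpoint
        if h % d_main == d_main - 1:
            cut = i                       # main divisor matched
        elif length >= t_max:
            # Forced cut: prefer the backup breakpoint if one was seen.
            cut = backup if backup >= start else i
        else:
            continue
        chunks.append(data[start:cut + 1])
        start = cut + 1
        backup = -1
    if start < len(data):
        chunks.append(data[start:])       # trailing bytes form the last chunk
    return chunks


class DedupStore:
    """Single-instance chunk store; a dict stands in for the B+ tree index."""

    def __init__(self):
        self.index = {}     # chunk identifier -> chunk bytes
        self.recipes = {}   # file name -> ordered list of chunk identifiers

    def put(self, name, data):
        ids = []
        for chunk in tttd_chunks(data):
            cid = hashlib.sha1(chunk).hexdigest()   # chunk identifier
            self.index.setdefault(cid, chunk)       # store only if unseen
            ids.append(cid)
        self.recipes[name] = ids

    def get(self, name):
        # Rebuild the file by concatenating its chunks in recorded order.
        return b"".join(self.index[cid] for cid in self.recipes[name])

    def stored_bytes(self):
        return sum(len(chunk) for chunk in self.index.values())
```

In this sketch, calling `put` twice with the same sequence data under two different names stores the chunks only once, and `get` reconstructs either file from its list of identifiers. A real system would replace the dictionary with an ordered on-disk index such as a B+ tree so that lookups stay efficient when the identifier set no longer fits in memory.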


How to Cite
Thwel, T., & Sinha, G. (2019). Efficient Data Deduplication Mechanism for Genomic Data. CSVTU International Journal of Biotechnology, Bioinformatics and Biomedical, 4(2), 52-58. https://doi.org/10.30732/IJBBB.20190402004