Please use this identifier to cite or link to this item: http://hdl.handle.net/10266/6660
Title: Distributed Data Deduplication Techniques for Efficient Cloud Storage System
Authors: Kaur, Ravneet
Supervisor: Bhattacharya, Jhilik; Chana, Inderveer
Keywords: data deduplication;data reduction;cloud computing;CNN;storage system
Issue Date: 31-Oct-2023
Abstract: Cost-effective storage management has emerged as a critical challenge for cloud storage systems, given the exponential growth of digital data. Storing vast amounts of internet-generated data efficiently requires substantial computing and storage resources. The problem is exacerbated by the large volume of redundant data, which significantly inflates storage requirements. This thesis investigates and proposes deduplication techniques to reduce duplicate data in cloud storage systems. Data deduplication is crucial for large-scale distributed systems, particularly in dynamic infrastructures like cloud storage. The performance of deduplication directly affects the overall efficiency and cost of the system. By reducing data volumes, storage providers can lower the cost of running large storage systems and reduce energy consumption. This work proposes an efficient data deduplication technique that effectively manages and eliminates duplicates in cloud storage systems. A comprehensive investigation of various deduplication techniques has been undertaken to study their efficacy in storage systems. Data-based deduplication techniques are categorized into text-, image-, and video-based methods. Scalability, reliability, distributed environment techniques, and fingerprint indexing emerge as key challenges for distributed data deduplication in cloud storage systems. This research work addresses these challenges and explores measures to overcome them. The thesis focuses on image deduplication techniques in cloud storage systems, with the aim of minimizing exact or near-exact image duplicates. A novel CNN-based online image deduplication technique is proposed to detect such duplicates. A fine-tuned AlexNet for cross-domain online image deduplication is proposed for exact and near-exact image detection.
Comparative analysis with existing CNN techniques shows that the proposed fine-tuned CNN-based feature extraction technique achieves better accuracy, surpassing AlexNet and VGGNet by 24% and 17%, respectively. Additionally, the research introduces the Hot Decomposition Vector (HDV), which optimally stores the dissimilar parts of near-exact images for efficient reconstruction using a base image. HDV outperforms traditional image feature extraction approaches in terms of image-matching accuracy and computing time. Furthermore, a novel energy-saving technique, EsDeDUP, is proposed to analyze the impact of exact or near-exact image deduplication techniques on energy savings and storage reduction. Fine-tuned CNN-based image deduplication techniques are proposed and compared with existing hash-based image deduplication techniques to gauge their effectiveness. The technique assesses the power consumption and performance of various deduplication approaches to ascertain their energy-saving potential. The work evaluates the performance and power consumption of four hash-based duplicate image detection techniques: pHash, wHash, aHash, and dHash. Additionally, this research proposes fine-tuned CNN-based deduplication techniques using neural structures such as fine-tuned AlexNet, fine-tuned VGGNet-16, and fine-tuned VGGNet-19 for extracting exact and near-exact duplicate images. Empirical results demonstrate the effectiveness of the proposed fine-tuned CNN-based deduplication techniques, showing top-5 accuracy rates of 83.1%, 93.2%, and 92.8% for fine-tuned AlexNet, VGGNet-19, and VGGNet-16, respectively, on the augmented ImageNet-Mini dataset. Furthermore, the CNN-based deduplication method achieves a storage reduction of 37.2% to 42.4% when applied to augmented ImageNet-Mini datasets.
Conversely, the hash-based techniques (aHash, dHash, pHash, and wHash) exhibit top-5 accuracy rates of 23.4%, 21.8%, 38.3%, and 37.1%, respectively, on the augmented ImageNet-Mini dataset, achieving a storage reduction of 11% to 17.4%. Fine-tuned CNN-based deduplication techniques show promising accuracy but consume more power than hash-based techniques for exact and near-exact image detection. This research work contributes to the advancement of cloud storage efficiency through innovative deduplication techniques, with a particular focus on image data. The proposed methods offer potential cost savings, energy conservation, and improved performance in cloud storage systems.
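Illustrative note (not taken from the thesis): the hash-based baselines named in the abstract can be pictured with a minimal sketch of two of them, aHash and dHash, compared by Hamming distance. This is an assumption-laden illustration, not the author's implementation: the function names, the 4x4 sample grids, and the `max_distance` threshold are all hypothetical, and the sketch skips the grayscale conversion and downscaling to a small fixed grid that practical implementations usually perform first.

```python
from typing import List

Gray = List[List[int]]  # rows of 0-255 grayscale values (hypothetical input format)


def average_hash(pixels: Gray) -> int:
    """aHash sketch: one bit per pixel, set when the pixel is brighter than the mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def difference_hash(pixels: Gray) -> int:
    """dHash sketch: one bit per horizontal neighbour pair, set when left > right."""
    bits = 0
    for row in pixels:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a: Gray, b: Gray, max_distance: int = 2) -> bool:
    """Treat two images as near-duplicates when their aHash values are close.

    max_distance is an arbitrary illustrative threshold, not a value from the thesis.
    """
    return hamming(average_hash(a), average_hash(b)) <= max_distance


# Hypothetical sample data: a base image, a uniformly brightened copy, and its negative.
base = [
    [10, 20, 200, 210],
    [15, 25, 205, 215],
    [12, 22, 202, 212],
    [11, 21, 201, 211],
]
brighter = [[p + 10 for p in row] for row in base]
inverted = [[255 - p for p in row] for row in base]

print(is_near_duplicate(base, brighter))  # brightened copy keeps the same aHash
print(is_near_duplicate(base, inverted))  # the negative flips every bit
```

In this toy data the uniform brightness shift leaves the above-mean pattern unchanged, so the two hashes match, which is the kind of near-exact duplicate such hashes are designed to catch. Perceptual hashes of this kind are cheap to compute, consistent with the abstract's observation that they draw less power than CNN feature extraction, though the abstract reports far lower accuracy for them.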
URI: http://hdl.handle.net/10266/6660
Appears in Collections: Doctoral Theses@CSED

Files in This Item:
File | Description | Size | Format
Ravneet_Deduplication_21_7_2022 (9).pdf | PhD Thesis | 15.56 MB | Adobe PDF


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.