Distributed Data Deduplication Techniques for Efficient Cloud Storage System
Abstract
Cost-effective storage management has emerged as a critical challenge for cloud storage
systems given the exponential growth of digital data. Storing
vast amounts of internet-generated data efficiently requires substantial computing and
storage resources. This issue is exacerbated by the large share of redundant data, which
significantly inflates storage requirements.
This thesis investigates and proposes deduplication techniques to reduce duplicate data in
cloud storage systems. Data deduplication is crucial for large-scale distributed systems,
particularly in dynamic infrastructures like cloud storage. The performance of deduplication
directly affects the overall efficiency and cost of the system. By reducing data
volumes, storage providers can lower the cost of running large storage systems and
reduce energy consumption.
This work proposes an efficient data deduplication technique that effectively manages
and eliminates duplicates in cloud storage systems. A comprehensive investigation of
various deduplication techniques has been undertaken to study their efficacy in stor-
age systems. Data-based deduplication techniques are categorized into text, image, and
video-based methods. Scalability, reliability, distributed environment techniques, and
fingerprint indexing emerge as key challenges for distributed data deduplication in cloud
storage systems. This research work addresses these challenges and explores measures to
overcome them.
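
To make the role of fingerprint indexing concrete, the following is a minimal, generic sketch of fingerprint-based deduplication; it is not the technique proposed in this thesis. The class name `FingerprintStore`, the fixed 4-byte chunk size, and the choice of SHA-256 are all illustrative assumptions.

```python
import hashlib


class FingerprintStore:
    """Toy content-addressed store (hypothetical): one copy per unique chunk."""

    def __init__(self):
        self.chunks = {}  # fingerprint -> chunk bytes, stored once
        self.files = {}   # file name -> ordered list of fingerprints

    def put(self, name, data, chunk_size=4):
        """Split data into fixed-size chunks and store only unseen chunks."""
        recipe = []
        for i in range(0, len(data), chunk_size):
            chunk = data[i:i + chunk_size]
            fp = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(fp, chunk)  # no-op if fingerprint is known
            recipe.append(fp)
        self.files[name] = recipe

    def get(self, name):
        """Rebuild a file from its list of fingerprints."""
        return b"".join(self.chunks[fp] for fp in self.files[name])

    def stored_bytes(self):
        """Physical bytes held after deduplication."""
        return sum(len(c) for c in self.chunks.values())


store = FingerprintStore()
store.put("a.bin", b"AAAABBBBCCCC")
store.put("b.bin", b"AAAABBBBDDDD")  # shares two chunks with a.bin
```

Here the two files hold 24 logical bytes, but only four unique chunks are stored. In a distributed setting the index itself must be partitioned across nodes, which is where the scalability and reliability challenges named above arise.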
The thesis focuses on image deduplication techniques in cloud storage systems, with the
aim of minimizing exact or near-exact image duplicates. A novel CNN-based online image
deduplication technique is proposed to detect such duplicates. A fine-tuned AlexNet for
cross-domain online image deduplication is proposed for exact and near-exact image detection.
Comparative analysis with existing CNN techniques shows that the proposed fine-tuned
CNN-based feature extraction technique is more accurate, surpassing AlexNet
and VGGNet by 24% and 17%, respectively. Additionally, the research introduces the
Hot Decomposition Vector (HDV), which optimally stores dissimilar parts of near-exact
images for efficient reconstruction using a base image. HDV outperforms traditional im-
age feature extraction approaches in terms of image-matching accuracy and computing
time.
Furthermore, a novel EsDeDUP energy-saving technique is proposed to analyze the im-
pact of exact or near-exact image deduplication techniques on energy savings and storage
reduction. Fine-tuned CNN-based image deduplication techniques are proposed, and their
effectiveness is measured and compared with that of existing
hash-based image deduplication techniques. The technique assesses the power consump-
tion and performance of various deduplication approaches to ascertain their energy-saving
potential. The work evaluates the performance and power consumption of four hash-based
duplicate image detection techniques: pHash, wHash, aHash, and dHash. Additionally, this
research proposes fine-tuned CNN-based deduplication techniques using neural architectures
such as fine-tuned AlexNet, fine-tuned VGGNet-16, and fine-tuned VGGNet-19 for
detecting exact and near-exact duplicate images.
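
As an illustration of how hash-based duplicate image detection works in general, the sketch below implements simplified versions of aHash and dHash on small grayscale grids and compares them by Hamming distance. It assumes the images have already been converted to grayscale and resized to a small fixed grid; the function names, the bit direction in `difference_hash`, and the `max_distance` threshold are illustrative choices, not the configuration evaluated in this work.

```python
def average_hash(pixels):
    """aHash: one bit per pixel, set when the pixel is above the mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return [1 if p > mean else 0 for p in flat]


def difference_hash(pixels):
    """dHash: one bit per horizontal neighbour pair, set when left > right."""
    return [1 if row[i] > row[i + 1] else 0
            for row in pixels for i in range(len(row) - 1)]


def hamming(a, b):
    """Number of bit positions in which two equal-length hashes differ."""
    return sum(x != y for x, y in zip(a, b))


def is_near_duplicate(p, q, hash_fn=difference_hash, max_distance=1):
    """Treat two images as duplicates when their hashes are close enough."""
    return hamming(hash_fn(p), hash_fn(q)) <= max_distance


# Hypothetical 3x3 grayscale grids standing in for resized images.
base = [[10, 20, 30], [40, 50, 60], [70, 80, 90]]
brighter = [[p + 5 for p in row] for row in base]   # same image, brighter
different = [[90, 10, 80], [20, 70, 30], [60, 40, 50]]
```

Because both hashes depend only on relative brightness, a uniform brightness shift leaves them unchanged, so `brighter` matches `base` while `different` does not.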
Empirical results demonstrate the effectiveness of the proposed fine-tuned CNN-based
deduplication techniques, showcasing top-5 accuracy rates of 83.1%, 93.2%, and 92.8%
for fine-tuned AlexNet, VGGNet-19, and VGGNet-16, respectively, on the augmented
ImageNet-Mini dataset. Furthermore, the CNN-based deduplication method achieves a
storage reduction of 37.2% to 42.4% when applied to augmented ImageNet-Mini datasets.
Conversely, the hash-based techniques (aHash, dHash, pHash, and wHash) exhibit top-5
accuracy rates of 23.4%, 21.8%, 38.3%, and 37.1%, respectively, on the augmented
ImageNet-Mini dataset, and achieve a storage reduction of 11% to 17.4%. Fine-tuned
CNN-based deduplication techniques exhibit promising results in terms of accuracy but
consume more power than hash-based techniques for exact and near-exact
image detection.
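
The two metrics reported above can be sketched as follows, assuming their conventional definitions: top-k accuracy as the fraction of queries whose true label appears among the k highest-ranked candidates, and storage reduction as the fraction of bytes no longer stored after deduplication. The abstract does not spell out the exact definitions used, and the function names and inputs here are hypothetical.

```python
def top_k_accuracy(ranked_predictions, true_labels, k=5):
    """Fraction of queries whose true label is among the top-k candidates."""
    hits = sum(truth in ranked[:k]
               for ranked, truth in zip(ranked_predictions, true_labels))
    return hits / len(true_labels)


def storage_reduction(original_bytes, stored_bytes):
    """Fraction of the original volume saved by deduplication."""
    return 1 - stored_bytes / original_bytes
```

For example, under these definitions a store that shrinks from 1000 to 750 bytes has a storage reduction of 0.25, that is, 25%.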
This research work contributes to the advancement of cloud storage efficiency through
innovative deduplication techniques, with a particular focus on image data. The proposed
methods offer potential cost savings, energy conservation, and improved performance in
cloud storage systems.
