An Efficient Framework for Privacy Preservation for Big Data Applications
Abstract
In the modern data-driven world, the real advantage of big data is realized only when data is
efficiently processed and the knowledge extracted from it serves as an important component
of decision making. Data mining techniques are used to discover interesting patterns
and knowledge in large datasets. Providing all the data to data miners may yield good
analytics, but it also raises security challenges, since such data can be misused by
malicious users. Thus, a balance must be maintained between data availability and data
security: the confidentiality of sensitive data must be protected without affecting the efficiency
of applications.
Privacy-preserving data mining techniques extract useful information from
data without compromising the security of the sensitive information it contains. Before any
analysis is performed on a data set, it is anonymized using encryption techniques or by removing
personally identifiable information, so that the person to whom the data
refers remains anonymous. The data sets used for data mining can be centralized,
owned by a single party, or distributed among multiple parties with horizontal,
vertical, or arbitrary distribution. Traditional cryptographic techniques for protecting
the information incur large computation and communication overheads, especially
for large datasets. Anonymization techniques have lower computation and communication
overheads, but they carry a risk of re-identification: because a large amount
of data is available, linking different data sources with the anonymized dataset raises the
probability of re-identification.
This thesis proposes a framework for privacy-preserving data mining on big data. Based
on the proposed framework, two application domains have been identified. The first is
a privacy-preserving collaborative filtering technique for recommendation generation in
a healthcare system where data is arbitrarily distributed among multiple healthcare sites.
It is an item-based collaborative filtering technique in which item-item similarity is securely
computed using a homomorphic encryption technique and a secure scalar dot product algorithm.
The second is a cloud-based privacy-preserving collaborative filtering technique based
on a naive Bayesian classifier for recommendation generation on data arbitrarily distributed
among multiple parties. In this technique, the conditional probability is securely calculated using
the proposed privacy-preserving conditional probability algorithm, and the prior probability is
securely calculated using a homomorphic encryption technique. Both techniques are secure
and have lower computation overhead than state-of-the-art privacy-preserving
collaborative filtering techniques. Further, k-anonymization based on neural network and
support vector machine classifiers helps anonymize social network data before
it is shared or analyzed. The proposed technique is evaluated on the following parameters:
precision, recall, F-measure, information loss, and average path length. This
thesis work concludes that efficient data analytics can be performed securely
on both centralized and distributed data sets without much computational overhead.
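
The abstract does not spell out the secure scalar dot product algorithm, so the following is only a minimal sketch of the general idea, assuming textbook Paillier encryption as the additively homomorphic scheme. This is an assumption, not necessarily the scheme used in the thesis. One party encrypts its rating vector, the other combines the ciphertexts with its own plaintext vector, and only the key holder can decrypt the resulting dot product. All function names and the toy key size are hypothetical and chosen for illustration; the primes are far too small for real use.

```python
import math
import secrets

# Toy parameters (assumption): two known Mersenne primes, NOT secure in practice.
P, Q = 2**31 - 1, 2**61 - 1


def keygen(p=P, q=Q):
    """Return (public n, private (lam, mu)) for Paillier with g = n + 1."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # modular inverse; valid shortcut when g = n + 1
    return n, (lam, mu)


def encrypt(n, m):
    """Encrypt integer 0 <= m < n as (1 + m*n) * r^n mod n^2."""
    n2 = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1  # random r in [1, n-1]
        if math.gcd(r, n) == 1:
            break
    return (1 + m * n) % n2 * pow(r, n, n2) % n2


def decrypt(n, priv, c):
    """Recover m = L(c^lam mod n^2) * mu mod n, with L(u) = (u - 1) // n."""
    lam, mu = priv
    u = pow(c, lam, n * n)
    return (u - 1) // n * mu % n


def encrypted_dot(n, enc_x, y):
    """Second party: compute Enc(sum x_i * y_i) without seeing any x_i."""
    n2 = n * n
    acc = encrypt(n, 0)  # fresh encryption of 0 re-randomizes the result
    for c, yi in zip(enc_x, y):
        # Enc(x_i)^y_i = Enc(x_i * y_i); multiplying ciphertexts adds plaintexts.
        acc = acc * pow(c, yi, n2) % n2
    return acc


# Hypothetical ratings for one item pair, held by two different sites.
n, priv = keygen()
site_a_ratings = [5, 3, 0, 4]
site_b_ratings = [4, 0, 2, 5]
enc_a = [encrypt(n, x) for x in site_a_ratings]   # site A sends ciphertexts
enc_result = encrypted_dot(n, enc_a, site_b_ratings)  # site B computes blindly
dot = decrypt(n, priv, enc_result)                # site A decrypts the sum
```

In an item-based collaborative filtering setting, a dot product like this could serve as the numerator of a cosine similarity between two item rating vectors. A full protocol would also need to protect the vector norms and to mask the result so that neither site learns more than intended; those steps are omitted here.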
