Analyzing and Visualizing the Effect of Noise on Compressed Textual Data

Kumar, Sudesh

Analyzing and Visualizing the Effect of Noise on Compressed Textual Data

Files

Analysing and Visualizing the effect of noise on Compressed Textual Data.pdf (1.49 MB)

Date

2009-07-13T07:33:04Z

Authors

Kumar, Sudesh

Supervisors

Batra, Shalini

Abstract

Clustering organizes data into groups such that each group contains similar type of data. The majority of document (contain similar data) clustering algorithms require a vector representation for each document. Consequently, the memory required during clustering can be extremely high when clustering hundreds of thousands of documents. In recent year there is lot of work going on clustering by using compression. Compression reduces the run-time memory requirements for clustering and also does not degrade the final cluster quality. Compression is one of the techniques for better utilization of storage devices, resulting in saving of storage space. This thesis addresses the compression by using the technique called Normalized Compression Distance (NCD). The Normalized Compression Distance is based on algorithmic complexity developed by Kolmogorov, called Normalized Information Distance. Normalized Compression Distance can be used to cluster objects of any kind, such as music, texts, or gene sequences (microarray classification). The NCD between two binary strings is defined in terms of compressed sizes of the two strings and of their concatenation; it is designed to be an effective approximation of the non computable but universal Kolmogorov distance between two strings. This correspondence studies the influence of noise on the normalized compression distance, a measure based on the use of compressors to compute the degree of similarity of two files. This influence is approximated by a first order differential equation which gives rise to a complex effect, which explains the fact that the NCD may give values greater than 1. Finally, the influence of noise on the clustering of files of different types is explored and the final analysis is that the NCD performs well even in the presence of quite high noise levels

Keywords

Normalised Information Distance, Normalized Compression Distance, Kolmogrov Complexity

URI

http://hdl.handle.net/10266/790

Collections

Masters Theses@CSED

Full item page

Analyzing and Visualizing the Effect of Noise on Compressed Textual Data

Files

Date

Authors

Supervisors

Journal Title

Journal ISSN

Volume Title

Publisher

Abstract

Description

Keywords

Citation

URI

Collections

Endorsement

Review

Supplemented By

Referenced By