Query Execution and Effect of Compression on NoSQL Column Oriented Data-store Using Hadoop and HBase
Abstract
Today everyone is connected over the Internet and expects to find relevant results instantaneously. Terabytes of data are generated every day and subsequently processed, and the size of data-stores keeps increasing. In large-scale, high-concurrency applications, using a traditional relational database to store and query dynamic user data has proved inadequate.
Increased networking and business demands directly increase the cost of resources in terms of space and network utilization. To store terabytes of data, especially human-readable text, it is beneficial to compress the data to gain significant savings in required raw storage. Compression techniques have not been widely used in traditional relational database systems, because the trade-off between time and space that compression offers is not very attractive for them. A column oriented data-store stores all values of a single column together as a row, followed by all values of the next column. Storing records this way helps data compression, since values of the same column are of a similar type and may repeat. Storing data in columns therefore introduces a number of possibilities for better performance by compression algorithms.
The intent of the thesis is to study the effect of compression on a NoSQL column oriented data-store. HBase, the Hadoop database, was chosen for this work. It is one of the most prominent NoSQL column oriented data-stores and is used by large companies such as Facebook. The effect of compression was analysed with three compression codecs, Snappy, LZO and GZIP, using only human-readable text data. The Hadoop MapReduce framework was used for bulk loading the data. Performance on compressed and uncompressed tables was evaluated by executing queries through the advanced HBase API.
Results show that using compression in a NoSQL column oriented data-store like HBase increases the overall performance of the system. Snappy performs consistently well in saving CPU, network and memory usage. LZO is the runner-up. GZIP is not an optimal choice where speed and memory usage are the main concern, but it can work well where disk space is the constraint.
Description
ME, CSED
