Distributed Stream Processing of Twitter Data using Apache Spark

Shruti Arora

Files

801632045_CSE_ShrutiArora.pdf (2.45 MB)

Date

2018-08-08

Authors

Shruti Arora

Supervisors

Rani, Rinkle

Abstract

Data is continuously generated by sources such as machines, network traffic, and sensor networks. Twitter is an online social networking service with more than 300 million users who generate a huge amount of information every day. Its most important characteristic is that users can tweet about events, situations, feelings, opinions, or even something entirely new, in real time. Several workflows currently offer real-time analysis of Twitter data, providing general-purpose processing over streaming data. This study attempts to develop an analytical framework that uses in-memory processing to extract and analyze structured and unstructured Twitter data. The proposed framework comprises data ingestion, stream processing, and data visualization components; the Apache Kafka and Apache Flume messaging systems are used to perform the data ingestion task. Furthermore, Spark makes it possible to run sophisticated data processing and machine learning algorithms in real time. We conducted a case study on tweets, analyzing their time and origin. We also studied the SparkML component, focusing on the K-Means clustering algorithm.

Description

Master of Engineering - CSE

Keywords

Stream Processing, Apache Spark, Twitter, SparkML, Apache Kafka

URI

http://hdl.handle.net/10266/5179

Collections

Masters Theses@CSED
