Distributed Stream Processing of Twitter Data using Apache Spark
Abstract
Data is continuously generated by sources such as machines, network traffic, and
sensor networks. Twitter is an online social networking service with more than 300
million users, who generate a huge amount of information every day. Twitter's most
important characteristic is that users can tweet about events, situations, feelings,
opinions, or even something entirely new, in real time. Several existing workflows
offer real-time data analysis for Twitter, providing general-purpose processing over
streaming data. This study attempts to develop an analytical framework capable of
in-memory processing to extract and analyze structured and unstructured Twitter
data. The proposed framework comprises data ingestion, stream processing, and data
visualization components, with the Apache Kafka and Apache Flume messaging
systems performing the data ingestion task. Apache Spark, in turn, makes it possible
to perform sophisticated data processing and run machine learning algorithms in real
time. We conducted a case study on tweets, analyzing the time and origin of the
tweets. We also studied the SparkML component, focusing on the K-Means
clustering algorithm.
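
To make the K-Means step concrete, the following is a minimal sketch in plain Python of Lloyd's algorithm, the standard iteration behind K-Means. It is not the thesis's implementation and does not use Spark. The feature choice (hour of day and longitude per tweet), the sample values, and the names `kmeans` and `assign` are all assumptions made for illustration.

```python
from math import dist  # Euclidean distance, Python 3.8+


def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
        for p in points
    ]


def kmeans(points, initial_centroids, max_iterations=20):
    """Lloyd's algorithm: alternate assignment and centroid update."""
    centroids = [tuple(c) for c in initial_centroids]
    for _ in range(max_iterations):
        labels = assign(points, centroids)
        updated = []
        for i, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                # New centroid is the per-dimension mean of its members.
                updated.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                # Keep an empty cluster's centroid where it was.
                updated.append(old)
        if updated == centroids:  # converged: no centroid moved
            break
        centroids = updated
    return centroids, assign(points, centroids)


# Hypothetical tweets as (hour_of_day, longitude) pairs; illustrative values only.
# Real features would need scaling, and hour of day is circular, which this ignores.
tweets = [
    (1.0, -74.0), (2.0, -73.5), (1.5, -75.0),
    (13.0, 77.0), (14.0, 78.5), (12.5, 76.0),
]
centroids, labels = kmeans(tweets, initial_centroids=[tweets[0], tweets[3]])
print(labels)
print(centroids)
```

With the two well-separated groups in this toy data, the first three tweets fall into one cluster and the last three into the other. In a Spark setting, the same assignment and update steps would be distributed across partitions of the tweet data rather than run in a single loop.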
Description
Master of Engineering - CSE
