Title: Distributed Stream Processing of Twitter Data using Apache Spark
Authors: Shruti Arora
Supervisor: Rani, Rinkle
Keywords: Stream Processing;Apache Spark;Twitter;SparkML;Apache Kafka
Issue Date: 8-Aug-2018
Abstract: Data is continuously generated by sources such as machines, network traffic, and sensor networks. Twitter is an online social networking service with more than 300 million users who generate a huge amount of information every day. Its most important characteristic is that users can tweet about events, situations, feelings, opinions, or even something entirely new, in real time. Several existing workflows offer real-time data analysis for Twitter, providing general-purpose processing over streaming data. This study attempts to develop an analytical framework capable of in-memory processing to extract and analyze structured and unstructured Twitter data. The proposed framework comprises data ingestion, stream processing, and data visualization components, with Apache Kafka and Apache Flume used as the messaging systems that perform the data ingestion task. Furthermore, Spark makes it possible to perform sophisticated data processing and run machine learning algorithms in real time. We conducted a case study on tweets, analyzing their time and origin. We also studied the SparkML component, using it to examine the K-Means clustering algorithm.
Description: Master of Engineering - CSE
Appears in Collections: Masters Theses@CSED

Files in This Item:
File: 801632045_CSE_ShrutiArora.pdf
Size: 2.51 MB
Format: Adobe PDF

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.