Streaming Twitter Data using Flume

In this blog post, we will learn how to stream Twitter data using Flume on CloudxLab.

To download tweets from Twitter, we first have to create and configure a Twitter app.

Create Twitter App

Step 1

Navigate to the Twitter app URL and sign in with your Twitter account.

Step 2

Click on “Create New App”.

Create New App

Step 3

Provide the Name, Description, and Website of your app. Check the “Developer Agreement” checkbox and click on “Create your Twitter Application”.

Create an application form

Step 4

After your application is successfully created, Twitter will show the Consumer Key, Consumer Secret, Access Token, and Access Token Secret. We will need these keys and tokens to fetch tweets from Twitter. Do not share them with anyone.

Twitter app keys and tokens

Set up the Flume agent

Step 1

Log in to the web console.

Step 2

Create a directory named flume in your home folder using the web console.

mkdir flume

Step 3

Create the flume.conf file inside the flume directory and open it in an editor.

vi flume/flume.conf

Step 4

Paste the following configuration into flume.conf:

TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = xxxxxx
TwitterAgent.sources.Twitter.consumerSecret = xxxxxxx
TwitterAgent.sources.Twitter.accessToken = xxxxxxx
TwitterAgent.sources.Twitter.accessTokenSecret = xxxxxx
TwitterAgent.sources.Twitter.keywords = theinterview, 17YearsOfNash, Warnock, RioCompetition, cpfc, Palace, London, Christmas, New Years

################## SINK #################################
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs:///user/abhinav9884/Tweets
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text

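# Flush events to HDFS in batches of 10
TwitterAgent.sinks.HDFS.hdfs.batchSize = 10
# rollSize = 0 disables rolling by file size; a new file is started
# every 600 seconds or every 10000 events, whichever comes first
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollInterval = 600
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000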

#################### CHANNEL #########################
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 100
#default - TwitterAgent.channels.MemChannel.capacity = 100
TwitterAgent.channels.MemChannel.transactionCapacity = 100

Replace the values of TwitterAgent.sources.Twitter.consumerKey, TwitterAgent.sources.Twitter.consumerSecret, TwitterAgent.sources.Twitter.accessToken, and TwitterAgent.sources.Twitter.accessTokenSecret with your own keys and tokens.

Replace abhinav9884 with your CloudxLab username.

Save the file and exit the editor.
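If you prefer not to edit the username by hand, the replacement can also be scripted after saving the file. The following is a minimal sketch, not part of the original setup: set_flume_user is a hypothetical helper name, and passing "$USER" assumes that your shell login name on the web console matches your CloudxLab username.

```shell
# Hypothetical helper: replace the sample username in a Flume config file.
# Usage: set_flume_user <config-file> <username>
set_flume_user() {
  conf_file="$1"
  user_name="$2"
  # Work on a temporary copy so a failed sed never truncates the config.
  tmp_file="$(mktemp)" || return 1
  # Replace every occurrence of the sample username, then copy the result back.
  sed "s/abhinav9884/${user_name}/g" "$conf_file" > "$tmp_file" &&
    cat "$tmp_file" > "$conf_file"
  status=$?
  rm -f "$tmp_file"
  return "$status"
}

# Example (assumes $USER is your CloudxLab username):
#   set_flume_user ~/flume/flume.conf "$USER"
```

The keys and tokens still have to be entered manually, since they are specific to your Twitter app.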

Step 5

Run the Flume agent using the command below. Replace abhinav9884 with your CloudxLab username.

flume-ng agent -n TwitterAgent -Dtwitter4j.streamBaseURL=https://stream.twitter.com/1.1/ -c conf -f /home/abhinav9884/flume/flume.conf
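If the agent starts but fails to authenticate with Twitter, one thing worth ruling out is that the xxxxxx placeholders are still in flume.conf. The check below is a rough sketch under that assumption; has_real_credentials is a hypothetical helper name, and it only detects values made up entirely of the letter x, not keys that are wrong for other reasons.

```shell
# Hypothetical helper: succeed only if no credential line in the config
# still holds the x-only placeholder from the sample file.
# Usage: has_real_credentials <config-file>
has_real_credentials() {
  conf_file="$1"
  # Match consumerKey/consumerSecret/accessToken/accessTokenSecret lines
  # whose value consists solely of the letter x.
  if grep -Eq '\.(consumerKey|consumerSecret|accessToken|accessTokenSecret)[[:space:]]*=[[:space:]]*x+[[:space:]]*$' "$conf_file"; then
    echo "placeholder credentials found in $conf_file" >&2
    return 1
  fi
  return 0
}

# Example:
#   has_real_credentials ~/flume/flume.conf && echo "credentials look filled in"
```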

Step 6

Check the Twitter data in HDFS. There will be files named FlumeData.* inside the Tweets directory in your HDFS home directory.

hadoop fs -ls Tweets/

We can view the tweets with the command below. Replace FlumeData.1515474234091 with the name of a file inside your Tweets directory.

hadoop fs -cat Tweets/FlumeData.1515474234091
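To avoid copying a file name by hand, the listing can be filtered for the most recent finished file. This is a sketch with some assumptions: latest_flume_file is a hypothetical helper name, the path is taken to be the last whitespace-separated field of each line of the listing, and the numeric suffixes are assumed to have equal length so that a plain sort orders them by time. Files that Flume is still writing carry a .tmp suffix by default and are skipped.

```shell
# Hypothetical helper: read `hadoop fs -ls` output on stdin and print the
# path of the most recent completed FlumeData file.
latest_flume_file() {
  # 1. Keep only the last field of each line (assumed to be the path).
  # 2. Keep paths ending in FlumeData.<digits>, which drops in-progress
  #    *.tmp files and the "Found N items" header line.
  # 3. Sort and take the last entry, i.e. the largest timestamp suffix.
  awk '{ print $NF }' |
    grep 'FlumeData\.[0-9][0-9]*$' |
    sort |
    tail -n 1
}

# Example (requires the hadoop client; prints the first 5 lines of the file):
#   hadoop fs -cat "$(hadoop fs -ls Tweets/ | latest_flume_file)" | head -n 5
```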

Step 7

Once you are done, stop the Flume agent by pressing “Ctrl + C”.

In this blog post, we learned how to stream Twitter data using Flume and store it on HDFS. We hope you liked it. Please feel free to leave your comments.