In this blog post, we will learn how to stream Twitter data using Flume on CloudxLab.
To download tweets from Twitter, we first have to configure a Twitter app.
Create Twitter App
Step 1
Navigate to the Twitter app URL and sign in with your Twitter account.
Step 2
Click on “Create New App”
Step 3
Provide the Name, Description, and Website of your app. Check the “Developer Agreement” checkbox and click on “Create your Twitter Application”.
Step 4
After your application is successfully created, Twitter will show Consumer Key, Consumer Secret, Access Token and Access Token Secret. We will need these tokens to get tweets from Twitter. Please do not share these tokens and keys with others.
Set Up Flume Agent
Step 1
Log in to the web console.
Step 2
Create a directory named flume in your home folder in the web console.
mkdir flume
Step 3
Create the flume.conf file and open it in an editor.
vi flume/flume.conf
Step 4
Copy and paste the configuration below into flume.conf.
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = xxxxxx
TwitterAgent.sources.Twitter.consumerSecret = xxxxxxx
TwitterAgent.sources.Twitter.accessToken = xxxxxxx
TwitterAgent.sources.Twitter.accessTokenSecret = xxxxxx
TwitterAgent.sources.Twitter.keywords = theinterview, 17YearsOfNash, Warnock, RioCompetition, cpfc, Palace, London, Christmas, New Years
################## SINK #################################
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs:///user/abhinav9884/Tweets
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 10
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollInterval = 600
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000
#################### CHANNEL #########################
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 100
#default - TwitterAgent.channels.MemChannel.capacity = 100
TwitterAgent.channels.MemChannel.transactionCapacity = 100
Replace the values of TwitterAgent.sources.Twitter.consumerKey, TwitterAgent.sources.Twitter.consumerSecret, TwitterAgent.sources.Twitter.accessToken and TwitterAgent.sources.Twitter.accessTokenSecret with your keys and tokens.
Replace abhinav9884 with your CloudxLab username.
Save the file and exit the editor.
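The username replacement can also be scripted instead of done by hand in the editor. The following is a minimal sketch, assuming `sed` is available on the web console. The function name `set_hdfs_user` is hypothetical and not part of Flume; it rewrites only the `hdfs.path` line. Since flume.conf now holds your keys and tokens, the sketch also makes the file readable by its owner only.

```shell
# Hypothetical helper (not part of Flume): point the HDFS sink path in a
# Flume config at the given user's HDFS home directory.
# Usage: set_hdfs_user <conf-file> <username>
set_hdfs_user() {
    conf="$1"
    user="$2"
    # Rewrite only the username segment of the hdfs.path line.
    # A temp file is used so this works with any POSIX sed.
    sed "s|^\(TwitterAgent\.sinks\.HDFS\.hdfs\.path = hdfs:///user/\)[^/]*/|\1${user}/|" "$conf" > "$conf.tmp" &&
        mv "$conf.tmp" "$conf"
    # The file contains API keys, so restrict it to the owner.
    chmod 600 "$conf"
}
```

For example, `set_hdfs_user "$HOME/flume/flume.conf" "$USER"` would apply it to the file created above, assuming `$USER` matches your CloudxLab username.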
Step 5
Run the Flume agent using the command below. Replace abhinav9884 with your CloudxLab username.
flume-ng agent -n TwitterAgent -Dtwitter4j.streamBaseURL=https://stream.twitter.com/1.1/ -c conf -f /home/abhinav9884/flume/flume.conf
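Before starting the agent, it can help to confirm that the placeholder keys in flume.conf were actually replaced, because the xxxxxx values are not real credentials. The following is a minimal sketch, assuming the placeholders are runs of lowercase x as in the configuration above. `check_twitter_keys` is a hypothetical helper name, not a Flume command.

```shell
# Hypothetical pre-flight check (not part of Flume): fail if any Twitter
# credential in the given config is missing or is still a run of "x".
# Usage: check_twitter_keys <conf-file>
check_twitter_keys() {
    conf="$1"
    for key in consumerKey consumerSecret accessToken accessTokenSecret; do
        # Extract the value after "=" and strip trailing whitespace.
        value=$(sed -n "s/^TwitterAgent\.sources\.Twitter\.${key}[[:space:]]*=[[:space:]]*//p" "$conf" |
            sed 's/[[:space:]]*$//')
        case "$value" in
            "")
                echo "$key is missing or empty" >&2
                return 1 ;;
            *[!x]*)
                ;;  # contains something other than "x": treat as filled in
            *)
                echo "$key is still a placeholder" >&2
                return 1 ;;
        esac
    done
    return 0
}
```

One way to use it is to run `check_twitter_keys "$HOME/flume/flume.conf"` first and start the agent only if it succeeds. The check says nothing about whether Twitter will accept the keys; it only catches values that were never edited.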
Step 6
Check the Twitter data in HDFS. There will be files named FlumeData.* inside the Tweets directory in your HDFS home directory.
hadoop fs -ls Tweets/
We can view the tweets with the command below. Replace FlumeData.1515474234091 with the name of a file inside your Tweets directory.
hadoop fs -cat Tweets/FlumeData.1515474234091
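Because each FlumeData file name ends in a numeric timestamp, the most recent file can be picked from the listing instead of copied by hand. The following is a minimal sketch, assuming the path is the last whitespace-separated field on each line of `hadoop fs -ls` output and that all numeric suffixes have the same number of digits. `latest_flume_file` is a hypothetical helper name. Files that do not end in digits, such as ones still being written with a .tmp suffix, are skipped.

```shell
# Hypothetical helper: read `hadoop fs -ls` output on stdin and print the
# FlumeData path with the largest numeric suffix.
latest_flume_file() {
    # Keep the last field (the path), keep only names ending in digits,
    # then sort so the largest timestamp comes last.
    awk '{ print $NF }' | grep 'FlumeData\.[0-9][0-9]*$' | sort | tail -n 1
}
```

It could be combined with the commands above as `hadoop fs -cat "$(hadoop fs -ls Tweets/ | latest_flume_file)"`.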
Step 7
Once you are done, stop the Flume agent by pressing “Ctrl + C”.
In this blog post, we learned how to stream Twitter data using Flume and store it on HDFS. We hope you liked the blog post. Please feel free to leave your comments.