Streaming Twitter Data using Flume

[Image: Stream Twitter data using Flume and HDFS]

In this blog post, we will learn how to stream Twitter data using Flume on CloudxLab.

To download tweets from Twitter, we first have to configure a Twitter app.

Create Twitter App

Step 1

Navigate to the Twitter app URL and sign in with your Twitter account.

Step 2

Click on “Create New App”.

[Image: Create New App]

Step 3

Provide the Name, Description, and Website of your app. Check the “Developer Agreement” checkbox and click on “Create your Twitter Application”.

[Image: Create an application form]

Step 4

After your application is successfully created, Twitter will show the Consumer Key, Consumer Secret, Access Token, and Access Token Secret. We will need these keys and tokens to get tweets from Twitter. Please do not share them with others.

[Image: Twitter app keys and tokens]

Set up Flume agent

Step 1

Log in to the web console.

Step 2

Create a directory named flume in your home folder using the web console.

Step 3

Create a flume.conf file inside the flume directory.

Step 4

Copy and paste the code below into flume.conf.

Replace the values of TwitterAgent.sources.Twitter.consumerKey, TwitterAgent.sources.Twitter.consumerSecret, TwitterAgent.sources.Twitter.accessToken, and TwitterAgent.sources.Twitter.accessTokenSecret with your keys and tokens.

Replace abhinav9884 with your CloudxLab username.

Save the file and exit the editor.
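For readers who want a concrete starting point, here is a minimal sketch of what flume.conf could contain, written from the shell. The agent name TwitterAgent, the source name Twitter, the source class, the four credential properties, and the keywords property all appear in this post or its comments. The channel and sink names (MemChannel, HDFS), the capacity values, and the HDFS path /user/<username>/Tweets are assumptions for illustration and may differ from the file used on CloudxLab. CLOUDXLAB_USER and FLUME_DIR are hypothetical helper variables.

```shell
# Hypothetical helper variables; set CLOUDXLAB_USER to your own username.
CLOUDXLAB_USER="${CLOUDXLAB_USER:-abhinav9884}"
FLUME_DIR="${FLUME_DIR:-$HOME/flume}"

mkdir -p "$FLUME_DIR"

# Write a sketch of flume.conf. Values marked as assumed are illustrative.
cat > "$FLUME_DIR/flume.conf" <<EOF
# One source, one channel, one sink (channel and sink names are assumed)
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

# Twitter source: replace the four placeholders with your keys and tokens
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = YOUR_CONSUMER_KEY
TwitterAgent.sources.Twitter.consumerSecret = YOUR_CONSUMER_SECRET
TwitterAgent.sources.Twitter.accessToken = YOUR_ACCESS_TOKEN
TwitterAgent.sources.Twitter.accessTokenSecret = YOUR_ACCESS_TOKEN_SECRET
TwitterAgent.sources.Twitter.keywords = London, Christmas

# HDFS sink: the path is assumed to be the Tweets directory in your HDFS home
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.hdfs.path = /user/${CLOUDXLAB_USER}/Tweets
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text

# In-memory channel (capacities are assumed)
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100
EOF
```

With fileType set to DataStream, the sink writes events as plain text rather than as a SequenceFile, which makes the output easier to inspect by eye.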

Step 5

Run the Flume agent using the command below. Replace abhinav9884 with your CloudxLab username.
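A sketch of the launch command, following the form a reader quotes in the comments on this post. It assumes your home folder is /home/<username> and that flume.conf lives in the flume directory created earlier; CLOUDXLAB_USER, FLUME_CONF, and run_twitter_agent are hypothetical names used only for illustration.

```shell
# Hypothetical helper variable; set CLOUDXLAB_USER to your own username.
CLOUDXLAB_USER="${CLOUDXLAB_USER:-abhinav9884}"
FLUME_CONF="/home/${CLOUDXLAB_USER}/flume/flume.conf"

# Start the agent named TwitterAgent with the given configuration file.
run_twitter_agent() {
  flume-ng agent -n TwitterAgent \
    -Dtwitter4j.streamBaseURL=https://stream.twitter.com/1.1/ \
    -c conf \
    -f "$1"
}

# Only start the agent where Flume is installed (for example, the web console).
if command -v flume-ng >/dev/null 2>&1; then
  run_twitter_agent "$FLUME_CONF"
else
  echo "flume-ng not found; run this on a machine with Flume installed"
fi
```

The value passed to -n must match the agent name used as the prefix in flume.conf (TwitterAgent here); otherwise the agent starts without any sources, channels, or sinks.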

Step 6

Check the Twitter data in HDFS. There will be files named FlumeData.* inside the Tweets directory in your HDFS home directory.

We can view the tweets with the command below. Replace FlumeData.1515474234091 with a file from your Tweets directory.
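One way to do both checks from the shell is sketched below. It assumes the agent writes to the Tweets directory in your HDFS home; relative HDFS paths such as Tweets resolve against /user/<username>. TWEET_FILE is a hypothetical variable holding the example file name from this post.

```shell
# Example file name from this post; replace it with one from your own listing.
TWEET_FILE="Tweets/FlumeData.1515474234091"

if command -v hadoop >/dev/null 2>&1; then
  # List the files the agent has written so far.
  hadoop fs -ls Tweets
  # Print the first few lines of one file.
  hadoop fs -cat "$TWEET_FILE" | head -n 5
else
  echo "hadoop not found; run this on a machine with HDFS access"
fi
```

Piping through head keeps the output short, since a single FlumeData file can hold many tweets.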

Step 7

Kill the Flume agent once you are done by pressing “Ctrl + C”.

In this blog post, we learned how to stream Twitter data using Flume and store it on HDFS. Hope you liked the blog post. Please feel free to leave your comments.

  • Mohammed

    Can you please specify exactly what data is returned when the Flume agent runs?

  • bintao li

    Hi Abhinav,
    In step 3 of creating the Twitter app, I do not know how to create the Website of my app and just have https://twitter.com/ as a placeholder. How do I create one?

    In step 5 of setting up the Flume agent, I followed your instructions and got the error message: “Warning: JAVA_HOME is not set!”. How do I set it?

    Regards,
    Bintao Li

    • Abhinav Singh

      Hi @bintaoli:disqus,

      If you do not have a website of your own, please enter any valid website URL with the proper protocol, such as https://cloudxlab.com.

      Regarding the error in step 5, please ignore the warning. JAVA_HOME will be available to the agent.

      Hope this helps.

      Thanks

      • bintao li

        Hi Abhinav, OK, thanks.

        What can I do with these tweets?
        Could you design a project that uses these tweets with Hive, Pig, HBase, and Spark SQL for practice? Thanks.

        Regards,
        Bintao Li

        • Abhinav Singh

          Hi @bintaoli:disqus,

          You can do sentiment analysis on the tweets as you would have done in the Hive project. There are many interesting analyses you can do with these tweets, such as:

          + Building a word cloud to find the main keywords in the tweets
          + Finding users’ sentiment across geographies
          + Finding the influencers for a particular topic
          + Finding the traffic trend across the day for a particular topic

          I will share more use cases. Also we’ve noted your feedback and are working on providing more projects on every topic.

          Hope this helps.

          Thanks

          Regards,
          Abhinav

  • Redrichmond

    Ok this worked.. However it errored after 30 files. “Block under-replication detection. Rotating File.”…..

    “Hit max consecutive under-replication rotations (30); will not continue rolling files under this path due to under-replication”

    • Abhinav Singh

      Hi @redrichmond:disqus,

      I can see that you have posted this question on the forum https://discuss.cloudxlab.com/t/hit-max-consecutive-under-replication-rotations-30/1583

      Can you please let me know if you had run the same commands as shown in the post?

      Thanks

      • Redrichmond

        Yes exactly the same..

      • Redrichmond

        The only difference was the folder, which I called “Flumes”. You have an issue with your configuration somewhere.

  • Redrichmond

    “flume-ng agent -n TwitterAgent -Dtwitter4j.streamBaseURL=https://stream.twitter.com/1.1/ -c conf -f /home/my_username_goes_here/flumes/flume.conf”

  • Prakul Tomar

    “TwitterAgent.sources.Twitter.keywords = theinterview, 17YearsOfNash, Warnock, RioCompetition, cpfc, Palace, London, Christmas, New Years”
    Are the keywords (theinterview, Palace…) case sensitive, or is theinterview the same as TheInterview?

  • Rakhee Balaraman

    I tried this and got the error below when starting the Flume agent. It seems some JARs are missing.
    ————————————————
    18:10:43.416 [conf-file-poller-0] ERROR org.apache.flume.node.PollingPropertiesFileConfigurationProvider – Failed to load configuration data. Exception follows.
    org.apache.flume.FlumeException: Unable to load source type: com.cloudera.flume.source.TwitterSource, class: com.cloudera.flume.source.TwitterSource
    at org.apache.flume.source.DefaultSourceFactory.getClass(DefaultSourceFactory.java:67) ~[flume-ng-core-1.5.2.2.6.5.0-292.jar:1.5.2.2.6.5.0-292]
    at org.apache.flume.source.DefaultSourceFactory.create(DefaultSourceFactory.java:40) ~[flume-ng-core-1.5.2.2.6.5.0-292.jar:1.5.2.2.6.5.0-292]
    at org.apache.flume.node.AbstractConfigurationProvider.loadSources(AbstractConfigurationProvider.java:328) ~[flume-ng-node-1.5.2.2.6.5.0-292.jar:1.5.2.2.6.5.0-292]
    at org.apache.flume.node.AbstractConfigurationProvider.getConfiguration(AbstractConfigurationProvider.java:102) ~[flume-ng-node-1.5.2.2.6.5.0-292.jar:1.5.2.2.6.5.0-292]
    at org.apache.flume.node.PollingPropertiesFileConfigurationProvider$FileWatcherRunnable.run(PollingPropertiesFileConfigurationProvider.java:140) [flume-ng-node-1.5.2.2.6.5.0-292.jar:1.5.2.2.6.5.0-292]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_181]
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [?:1.8.0_181]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_181]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [?:1.8.0_181]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_181]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_181]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]
    Caused by: java.lang.ClassNotFoundException: com.cloudera.flume.source.TwitterSource
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381) ~[?:1.8.0_181]
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424) ~[?:1.8.0_181]
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349) ~[?:1.8.0_181]
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357) ~[?:1.8.0_181]
    at java.lang.Class.forName0(Native Method) ~[?:1.8.0_181]
    at java.lang.Class.forName(Class.java:264) ~[?:1.8.0_181]
    at org.apache.flume.source.DefaultSourceFactory.getClass(DefaultSourceFactory.java:65) ~[flume-ng-core-1.5.2.2.6.5.0-292.jar:1.5.2.2.6.5.0-292]
    … 11 more

  • Subhash Inti

    Yes, I did exactly the same and got the same output. But can you explain the format of the data in FlumeData.5775513355891? I am not able to understand the format. Please explain how to read the data.