The objective of this step is to create an external table which contains tweets of Iron Man 3 movie and answer the following question:

Question-

How many records are in the tweets_raw table?

INSTRUCTIONS
  1. Make sure you have copied the Iron Man 3 movie tweets in your HDFS home directory in previous slide, if not please copy the data using.

    hadoop fs -cp /data/SentimentFiles /user/$USER
    
  2. Launch hive console by typing the hive command in the web console.

    hive
    
  3. Run below commands in the hive:

    ADD JAR hdfs:///data/hive/json-serde-1.1.9.9-Hive13-jar-with-dependencies.jar;
    SET hive.support.sql11.reserved.keywords=false;
    
  4. Select your database. Run below commands. ${env:USER} gets replaced by your username automatically:

    CREATE DATABASE IF NOT EXISTS ${env:USER};
    USE ${env:USER};
    
  5. Create a tweets_raw external table. It contains details of each tweet. Run below commands in the hive. ${env:USER} gets replaced by your username automatically:

    CREATE EXTERNAL TABLE tweets_raw (
        id BIGINT,
        created_at STRING,
        source STRING,
        favorited BOOLEAN,
        retweet_count INT,
        retweeted_status STRUCT<
        text:STRING,
        users:STRUCT<screen_name:STRING,name:STRING>>,
        entities STRUCT<
        urls:ARRAY<STRUCT<expanded_url:STRING>>,
        user_mentions:ARRAY<STRUCT<screen_name:STRING,name:STRING>>,
        hashtags:ARRAY<STRUCT<text:STRING>>>,
        text STRING,
        user STRUCT<
        screen_name:STRING,
        name:STRING,
        friends_count:INT,
        followers_count:INT,
        statuses_count:INT,
        verified:BOOLEAN,
        utc_offset:STRING, -- was INT but nulls are strings
        time_zone:STRING>,
        in_reply_to_screen_name STRING,
        year int,
        month int,
        day int,
        hour int
    )
    ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
    WITH SERDEPROPERTIES ("ignore.malformed.json" = "true")
    LOCATION '/user/${env:USER}/SentimentFiles/SentimentFiles/upload/data/tweets_raw';
    
  6. Run below query on hive console to answer the question:

    SELECT count(id) FROM tweets_raw;
    

Note - Having trouble with the assessment engine? Follow the steps listed here


Loading comments...