Project - Sentiment Analysis in Hive

7 / 10

In this step, we will create tweets_simple and tweets_clean views.

Question-

From which country tweet with id 330044004693598208 was tweeted?

INSTRUCTIONS
  1. Launch hive console by typing the hive command in the web console.

    hive
    
  2. Run below commands in the hive. This gives the ability to handle JSON format. Please note that you will have to do 'add jar' in every session.

    ADD JAR hdfs:///data/hive/json-serde-1.1.9.9-Hive13-jar-with-dependencies.jar;
    SET hive.support.sql11.reserved.keywords=false;
    
  3. Select your database after creating your database. Run below commands. ${env:USER} gets replaced by your username automatically:

    CREATE DATABASE IF NOT EXISTS ${env:USER};
    USE ${env:USER};
    
  4. Create view tweets_simple. The tweets_simple view contains tweet id, the timestamp of the tweet, tweet text, and user's time zone. Run below command in the Hive on your web console. See the manual of from_unixtime, unix_timestamp, concat, substring.

    CREATE VIEW tweets_simple AS
    SELECT          
    id,
    cast ( from_unixtime( unix_timestamp(concat( '2013 ', substring(created_at,5,15)), 'yyyy MMM dd hh:mm:ss')) as timestamp) ts,
    text,
    user.time_zone 
    FROM tweets_raw;
    

    Here:

    from_unixtime(bigint unixtime[, string format]) - Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the format of "1970-01-01 00:00:00".

    unix_timestamp() - Gets current Unix timestamp in seconds. This function is not deterministic and its value is not fixed for the scope of a query execution, therefore prevents proper optimization of queries - this has been deprecated since 2.0 in favour of CURRENT_TIMESTAMP constant.

    concat(string|binary A, string|binary B...) - Returns the string or bytes resulting from concatenating the strings or bytes passed in as parameters in order. For example, concat('foo', 'bar') results in 'foobar'. Note that this function can take any number of input strings.

    substring(string|binary A, int start) - Returns the substring or slice of the byte array of A starting from start position till the end of string A. For example, substr('foobar', 4) results in 'bar' (see [http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_substr]).

    Sample rows of tweets_simple views are: tweet_simple

  5. Create view tweets_clean. The tweets_clean view maps the user's timezone to the country. Each row of the tweets_clean view contains tweet_id, the timestamp of the tweet, tweet text, and user's country (which is derived from time zone). Run below command in the Hive on your web console.

    CREATE VIEW tweets_clean AS
    SELECT
    id,
    ts,
    text,
    m.country 
    FROM tweets_simple t LEFT OUTER JOIN time_zone_map m ON t.time_zone = m.time_zone;
    

    Sample rows of tweets_simple views are<br><br> <img src="https://s3.amazonaws.com/cloudxlab/static/images/aha/hive-project/tweets_clean_view.png" class="img-responsive" />


Note - Having trouble with the assessment engine? Follow the steps listed here


Loading comments...