Hive - MovieLens Assignment

MovieLens data sets were collected by the GroupLens Research Project at the University of Minnesota.

This data set consists of

  1. 100,000 ratings (1-5) from 943 users on 1,682 movies.
  2. Each user has rated at least 20 movies.
  3. Simple demographic info for the users (age, gender, occupation, zip)

The MovieLens dataset is located at /data/ml-100k in HDFS. Read the README.md file to understand the dataset.

We will load the u.data file into a Hive managed table. Each row of u.data holds four fields: userid, movieid, rating, and timestamp. Fields are terminated by a tab character ("\t").
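
Before loading, it can help to confirm that a file really has four tab-separated fields per row. The following is a minimal sketch and not part of the assignment: the function name check_fields and the three sample rows are made up for illustration. On the real data you would first copy u.data to the local file system (for example with hadoop fs -get) and pass that path instead.

```shell
#!/bin/sh
# check_fields FILE: print the number of rows that do NOT have exactly
# four tab-separated fields (0 means every row matches the expected layout).
check_fields() {
    awk -F '\t' 'NF != 4 { bad++ } END { print bad + 0 }' "$1"
}

# Hypothetical sample in the u.data layout: userid, movieid, rating, timestamp.
sample=$(mktemp)
printf '1\t10\t3\t881250949\n2\t20\t5\t891717742\n3\t30\t1\t878887116\n' > "$sample"

# Report how many malformed rows the sample contains.
check_fields "$sample"
```

A non-zero result would suggest a different delimiter or damaged rows, in which case the FIELDS TERMINATED BY clause used later would not split the columns as intended.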

INSTRUCTIONS

Steps:

  • Log in to the web console.
  • Copy u.data from the /data directory in HDFS to your home directory in HDFS. Run the command below in the Linux console.

    hadoop fs -cp /data/ml-100k/u.data /user/$USER/
    
  • Launch Hive from the console.

  • Run the commands below in Hive. Follow the steps to create a managed table u_data in your database.

  • Create a database named after your CloudxLab username if it does not already exist.

    create database if not exists ${env:USER};
    
  • Select your database

    use ${env:USER};
    
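  • Create a table. The time column is declared as BIGINT because u.data stores Unix epoch seconds, which a TIMESTAMP column in a text table would typically read as NULL.

    CREATE TABLE IF NOT EXISTS u_data(
        userid INT,
        movieid INT,
        rating INT,
        unixtime BIGINT
        )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE;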
    
  • Load the data from your home directory in HDFS.

    LOAD DATA INPATH 'hdfs:///user/${env:USER}/u.data' overwrite into table u_data;
    
  • Check that the data is loaded. Run the command below in the Linux console to list your database's directory under the warehouse directory /apps/hive/warehouse. It should contain a u_data directory.

    hadoop fs -ls /apps/hive/warehouse/$USER.db
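
Listing the directory only shows that the table's folder exists. As a further check, the row count can be compared with the 100,000 ratings the data set is said to contain. This is a minimal sketch, assuming the hadoop and hive commands are on the PATH and that the table lives in the database named after $USER; the function name verify_u_data is hypothetical.

```shell
#!/bin/sh
# verify_u_data: hypothetical helper that checks the managed table's
# directory in the warehouse and compares its row count with 100,000.
verify_u_data() {
    db="${USER}"
    # A managed table's files are expected under <warehouse>/<db>.db/<table>.
    hadoop fs -ls "/apps/hive/warehouse/${db}.db/u_data" || return 1
    # -S (silent mode) suppresses progress output so only the count is captured.
    rows=$(hive -S -e "SELECT COUNT(*) FROM ${db}.u_data;") || return 1
    if [ "$rows" -eq 100000 ]; then
        echo "OK: $rows rows"
    else
        echo "Unexpected row count: $rows"
        return 1
    fi
}

# Run the check only where both tools are installed.
if command -v hadoop >/dev/null 2>&1 && command -v hive >/dev/null 2>&1; then
    verify_u_data
fi
```

Because LOAD DATA INPATH moves the file rather than copying it, u.data should no longer appear in your HDFS home directory once the load has succeeded.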
    