Hive

10 / 18

Hive - Partitions




Not able to play video? Try with vimeo

Let us understand partitioning in hive. Say we have a huge data set of employees containing first name department and year of joining and suppose we have to retrieve details of all employees who joined in the year 2012. The query for retrieving above details will scan the whole table and because of this it will take a lot of time to execute. However if you partition the employee's data by their year of joining and store it in different files the query will become really fast. Let's understand the above scenario with a hands-on exercise login to CloudxLab linux console. Our data set is located at /data/bdhs/employees directory in hdfs run command hadoop fs -ls /data/bdhs/employees to view the data set. The data set contains two files 2012.csv and 2015.csv each csv file contains the data of the employees joined in that year. Copy the data of your home directory in hdfs run command hadoop fs -cp /data/bdhs/employees /user/$USER. Now type hive and wait for hive prompt to appear. Use your database run use followed by database name here the string ${env:USER} is replaced by your username automatically this would work only if your username is same as the name of your database. Create the employees table. Copy the create command and paste it in the hive shell in web console. Note that we are creating a partition table having a partition based on year and fields are terminated by comma. Now let's load 2012.csv and specify the year as 2012 run

load data inpath 'hdfs:///user/${env:USER}/employees/2012.csv' into table
employees partition (year=2012);

Now load 2015.csv and specify partition year as 2015 run

load data inpath 'hdfs:///user/${env:USER}/employees/2015.csv' into table
employees partition (year=2015);

Now check the partitions using the show partitions employees command. Partitions avoid the full table scan hence improving the query performance. The data gets stored in different files in the warehouse as specified by the partitions. We can define a partition using partition by keyword while creating a table. We can also add a partition in the existing table please note that we can specify multiple columns in a partition in the above example the partition can be on year month and day column.

INSTRUCTIONS

Steps:

  • Log in to your web console.
  • Data is located at /data/bdhs/employees/ on HDFS.
  • Copy data to your home directory in HDFS.
  • Run the following command, $USER gets replaced by your username automatically:

    hadoop fs -cp /data/bdhs/employees /user/$USER
    
  • Launch Hive using hive command in the web console. Run the below commands in Hive.

  • Use your database. ${env:USER} gets replaced by your username automatically:

    use ${env:USER};
    
  • To create the table

    CREATE TABLE employees(
        name STRING,
        department STRING,
        somedate DATE
    )
    PARTITIONED BY(year STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
    
  • Load dataset 2012.csv.

    load data inpath 'hdfs:///user/${env:USER}/employees/2012.csv' into table
    employees partition (year=2012);
    
  • Load dataset 2015.csv.

    load data inpath 'hdfs:///user/${env:USER}/employees/2015.csv' into table
    employees partition (year=2015);
    
  • To view the partitions

    SHOW PARTITIONS employees;
    

Loading comments...