Hive - Partitions






Important Note - We recommend you to execute the given commands on Hive Console instead of Hue. The video is only for representational purposes.

Let's understand partitioning in Hive. Say we have a huge dataset of employees containing their first name, department, and year of joining. Suppose, we have to retrieve details of all employees who joined in the year 2012. The query for retrieving the above details will scan the whole table and because of this, it will take a lot of time to execute. However, if we partition the employees' data by their year of joining and store it in different files, the query time will reduce.

Let's understand the above scenario with a hands-on. Our dataset is located at /data/bdhs/employees directory in HDFS. The dataset contains two files 2012.csv and 2015.csv. Each CSV file contains the data of the employees joined in that year. Copy the data to your home directory in HDFS. Create the employee table. Note that we are creating a partitioned table having a partition based on "year" and fields are terminated by ",". Let's load 2012.csv and specify partition year as 2012. Now load 2015.csv and specify partition year as 2015. Check the partitions using the "show partitions employees" command.

There are two partitions with the year 2012 and 2015. Let's check the metastore. As you can see year column is added to the table and it is also a partition column. Let's check the warehouse at /apps/hive/warehouse/. Select your database directory and click on employees. There are two directories with the name "year=2012 and year=2015". "year=2012" directory has 2012.csv file and "year=2015" directory has 2015.csv file.

Now the query for retrieving details of employees joined in the year 2012 will make use of partition "year=2012". This query will be a lot faster than doing a full table scan.

In short, partitions avoid the full table scan, hence improving the query performance. The data gets stored in different files in the warehouse as specified by the partitions. We can define a partition using partition by keyword while creating the table. We can also add a partition to the existing table. Please note that we can specify multiple columns in a partition. In the above example, the partition can be on year, month, and day column.

INSTRUCTIONS

Steps:

  • Log in to your web console.
  • Data is located at /data/bdhs/employees/ on HDFS.
  • Copy data to your home directory in HDFS.
  • Run the following command, $USER gets replaced by your username automatically:

    hadoop fs -cp /data/bdhs/employees /user/$USER
    
  • Launch Hive using hive command in the web console. Run the below commands in Hive.

  • Use your database. ${env:USER} gets replaced by your username automatically:

    use ${env:USER};
    
  • To create the table

    CREATE TABLE employees(
        name STRING,
        department STRING,
        somedate DATE
    )
    PARTITIONED BY(year STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
    
  • Load dataset 2012.csv.

    load data inpath 'hdfs:///user/${env:USER}/employees/2012.csv' into table
    employees partition (year=2012);
    
  • Load dataset 2015.csv.

    load data inpath 'hdfs:///user/${env:USER}/employees/2015.csv' into table
    employees partition (year=2015);
    
  • To view the partitions

    SHOW PARTITIONS employees;
    

Loading comments...