
Hive - Partitions





Let us understand partitioning in Hive. Say we have a huge dataset of employees containing first name, department, and year of joining, and suppose we have to retrieve the details of all employees who joined in the year 2012. A query for these details will scan the whole table and will therefore take a long time to execute. However, if we partition the employee data by year of joining and store it in different files, the query becomes much faster.

Let's understand the above scenario with a hands-on exercise. Log in to the CloudxLab Linux console. Our dataset is located in the /data/bdhs/employees directory in HDFS; run hadoop fs -ls /data/bdhs/employees to view it. The dataset contains two files, 2012.csv and 2015.csv; each CSV file contains the data of the employees who joined in that year. Copy the data to your home directory in HDFS by running hadoop fs -cp /data/bdhs/employees /user/$USER. Now type hive and wait for the Hive prompt to appear. Use your database by running use followed by your database name; the string ${env:USER} is replaced by your username automatically, which works only if your username is the same as the name of your database. Create the employees table by copying the CREATE command from the instructions below and pasting it into the Hive shell in the web console. Note that we are creating a partitioned table with a partition based on year, and that fields are terminated by commas. Now let's load 2012.csv and specify the year as 2012. Run

load data inpath 'hdfs:///user/${env:USER}/employees/2012.csv' into table
employees partition (year=2012);

Now load 2015.csv and specify the partition year as 2015. Run

load data inpath 'hdfs:///user/${env:USER}/employees/2015.csv' into table
employees partition (year=2015);

Now check the partitions using the show partitions employees command. Partitions avoid a full table scan and hence improve query performance. The data is stored under the warehouse in a separate directory per partition. We define partitions with the PARTITIONED BY clause while creating a table, and we can also add a partition to an existing table. Note that a partition can be defined on multiple columns; for example, a table can be partitioned on year, month, and day columns, as sketched below.
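To make this concrete, here is a rough HiveQL sketch building on the employees table from this lesson. The year=2018 partition and the events table below are purely illustrative and not part of the dataset used in the exercise:

-- Only the year=2012 partition directory is scanned, not the whole table.
SELECT name, department
FROM employees
WHERE year = '2012';

-- A partition can also be added to an existing table.
ALTER TABLE employees ADD PARTITION (year = '2018');

-- A partition may span multiple columns; the data then lands in nested
-- directories such as year=2012/month=01/day=15 under the table directory.
CREATE TABLE events(
    name STRING,
    department STRING
)
PARTITIONED BY (year STRING, month STRING, day STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';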

INSTRUCTIONS

Steps:

  • Log in to your web console.
  • Data is located at /data/bdhs/employees/ on HDFS.
  • Copy data to your home directory in HDFS.
  • Run the following command; $USER gets replaced by your username automatically:

    hadoop fs -cp /data/bdhs/employees /user/$USER
    
  • Launch Hive using the hive command in the web console. Run the commands below in Hive.

  • Use your database. ${env:USER} gets replaced by your username automatically:

    use ${env:USER};
    
  • To create the table

    CREATE TABLE employees(
        name STRING,
        department STRING,
        somedate DATE
    )
    PARTITIONED BY(year STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
    
  • Load dataset 2012.csv.

    load data inpath 'hdfs:///user/${env:USER}/employees/2012.csv' into table
    employees partition (year=2012);
    
  • Load dataset 2015.csv.

    load data inpath 'hdfs:///user/${env:USER}/employees/2015.csv' into table
    employees partition (year=2015);
    
  • To view the partitions

    SHOW PARTITIONS employees;
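
  • Optionally, verify where each partition is stored. DESCRIBE FORMATTED prints a Location: field with the table's HDFS directory; listing that directory with hadoop fs -ls from the web console shows one subdirectory per partition, such as year=2012 and year=2015 (the exact warehouse path depends on your cluster's configuration).

    DESCRIBE FORMATTED employees;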
    


7 Comments

Team, I come from an Oracle background. Could you please help me understand how partition logic works in Hive? In Oracle, the partition key must be one of the table's columns, and it is defined as, say, a range partition, a list partition, or monthly/daily incremental partitions. When data comes into the table, say for year 2012, Oracle checks whether that segment already exists; if it does, the incoming records are pushed into the relevant segment, otherwise a new segment is created to hold them.

In Hive:

1. Is it not required to have the partition key among the table's columns?
2. Do we need to create partitions manually, or will Hive create them itself when the relevant records come?
3. If data comes with a new partition key value, like 2018 instead of 2012 or 2015, will a new partition be created?
4. Will the partitions and their data be stored distributed across the HDFS cluster?

 


In Hive, partitioning works broadly the way it does in Oracle: a partition key is declared when the table is created, and that key is used to divide the data into partitions. The main difference is that in Hive the partition column is not listed among the table's regular columns; it is declared separately in the PARTITIONED BY clause, and Hive then exposes it as a column you can query and filter on.

When data is loaded or inserted into a partitioned table, the partition key value determines which partition the rows go to. In this lesson the partition is named explicitly (PARTITION (year=2012)); with dynamic partitioning enabled, Hive derives the partition from the data itself. In either case, if a partition for the key value does not already exist, it is created automatically.

In terms of data storage, partitions in Hive are typically stored as directories in the underlying HDFS file system, with each partition being a subdirectory under the table directory. The data in each partition is typically stored as one or more files in the corresponding partition directory. This allows for efficient querying of data in partitions, as only the relevant partitions need to be read in order to execute a query.

If data is inserted into a partitioned Hive table with a key value that does not match any existing partitions, a new partition will be created automatically to store the data. This means that if data is inserted with a key value of 2018 instead of 2012 or 2015, a new partition will be created for the key value 2018.

Overall, partitioning in Hive provides a way to organize and store large amounts of data in a scalable and efficient manner. It allows for data to be divided into logical units based on the values of one or more columns, which can improve query performance and make it easier to manage and organize the data.
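To make the automatic partition creation mentioned above concrete: Hive calls this dynamic partitioning, and it has to be switched on before an INSERT can create partitions from the data itself. A minimal sketch, assuming a hypothetical unpartitioned staging_employees table with the same columns plus a year column:

-- Enable dynamic partitioning (it is off or restricted by default on many setups).
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- The partition column (year) is listed last in the SELECT; a row carrying
-- year 2018 automatically creates the year=2018 partition if it is missing.
INSERT INTO TABLE employees PARTITION (year)
SELECT name, department, somedate, year
FROM staging_employees;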

 

 


Thank you, Shubh, for the detailed reply.


How can I locate the data directory in HDFS using a command?


Hi,

The data is located at /data/bdhs/employees/.

There is no interactive navigation in HDFS; you have to list each directory explicitly by passing the complete path to the ls command.


Hi Shubh,

So, if the path is not provided by the owner, how can one list all the directories present in HDFS? The "data" directory does not show up with the "hadoop dfs -ls" command.

Thanks 


Hi, please make sure you provide the path of the directory you want to look in. Since the 'data' directory is located in the root directory, your command should look like:

hadoop fs -ls /

 
