Let us understand partitioning in Hive. Say we have a huge dataset of employees containing first name, department, and year of joining, and suppose we have to retrieve the details of all employees who joined in the year 2012. A query for the above details will scan the whole table and will therefore take a long time to execute. However, if you partition the employees' data by their year of joining and store it in different files, the query becomes really fast.

Let's understand the above scenario with a hands-on exercise. Log in to the CloudxLab Linux console. Our dataset is located in the /data/bdhs/employees directory in HDFS. Run the command hadoop fs -ls /data/bdhs/employees to view the dataset. It contains two files, 2012.csv and 2015.csv; each CSV file contains the data of the employees who joined in that year. Copy the data to your home directory in HDFS by running hadoop fs -cp /data/bdhs/employees /user/$USER.

Now type hive and wait for the Hive prompt to appear. Switch to your database by running use followed by the database name. The string ${env:USER} is replaced by your username automatically; this works only if your username is the same as the name of your database. Create the employees table: copy the CREATE command and paste it into the Hive shell in the web console. Note that we are creating a partitioned table with a partition based on year, and that the fields are terminated by commas.
Now let's load 2012.csv and specify the year as 2012. Run:
load data inpath 'hdfs:///user/${env:USER}/employees/2012.csv' into table
employees partition (year=2012);
Now load 2015.csv and specify the partition year as 2015. Run:
load data inpath 'hdfs:///user/${env:USER}/employees/2015.csv' into table
employees partition (year=2015);
Now check the partitions using the show partitions employees command.
Partitions avoid a full table scan, hence improving query performance. The data gets stored in different files in the warehouse, as specified by the partitions. We define the partition using the PARTITIONED BY clause while creating a table. We can also add a partition to an existing table. Note that we can specify multiple columns in a partition; in the above example, the partition could be on the year, month, and day columns.
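As a sketch of the two variants just mentioned, in HiveQL (the table employees_daily and the month/day columns are illustrative only; they are not part of this exercise's dataset):

```sql
-- Multi-column partitioning: each (year, month, day) combination
-- gets its own subdirectory in the warehouse.
CREATE TABLE employees_daily(
    name STRING,
    department STRING
)
PARTITIONED BY (year STRING, month STRING, day STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Adding a partition to an existing table.
ALTER TABLE employees ADD PARTITION (year=2016);
```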
The dataset is located at /data/bdhs/employees/ on HDFS. Run the following command to copy it to your home directory; $USER gets replaced by your username automatically:
hadoop fs -cp /data/bdhs/employees /user/$USER
Launch Hive using the hive command in the web console, then run the commands below in the Hive shell.
Use your database. ${env:USER} gets replaced by your username automatically:
use ${env:USER};
To create the table:
CREATE TABLE employees(
name STRING,
department STRING,
somedate DATE
)
PARTITIONED BY(year STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
Load dataset 2012.csv:
load data inpath 'hdfs:///user/${env:USER}/employees/2012.csv' into table
employees partition (year=2012);
Load dataset 2015.csv:
load data inpath 'hdfs:///user/${env:USER}/employees/2015.csv' into table
employees partition (year=2015);
To view the partitions:
SHOW PARTITIONS employees;
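With both files loaded, SHOW PARTITIONS should list year=2012 and year=2015. A query that filters on the partition column then reads only the matching partition's directory instead of scanning the whole table, for example:

```sql
-- Hive prunes to the year=2012 partition; the 2015 data is never read.
SELECT name, department
FROM employees
WHERE year = 2012;
```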