13 / 18

Hive - Sorting & Bucketing

Not able to play video? Try with youtube

Let's understand sorting in the Hive. Order by orders the data and guarantees global ordering. All of the data goes through only a single reducer and sorting happens on that single reducer. For large datasets, this is unacceptable as it will overload the single reducer. Though you will get the sorted file as output but this is unacceptable as it will take a lot of time.

To overcome the single reducer problem with order by, sort by and distribute by commands were introduced in Hive. In sort by, the data get sorted at multiple reducers. There is one reducer for each 1 GB of data. Though the output files generated by each reducer are sorted individually, if we concatenate these sorted files, the final file is not sorted because these files have overlapping ranges.

Let's understand Sort by with an example. As you can see the data at Reducer 1 and Reducer 2 is sorted but if we concatenate the output of the two reducers, the output is not sorted.

Distribute by x works like a partitioner. Each of the N reducers gets non-overlapping ranges of x. But the output of each reducer may not be sorted and we end up with N unsorted files with non-overlapping ranges.

Let's understand distribute by with an example. As you can see each reducer is having non-overlapping ranges but output at Reducer 1 is not sorted.

To optimally sort data in Hive and ensure global ordering we use cluster by. cluster by combines the distribute by and sort by. We should always use cluster by instead of order by on large datasets.

We can cluster a table into multiple buckets. This ensures that the data is distributed and makes it easy to process in parallel. As displayed on the screen, we are bucketing a table into 32 buckets based on userid.

Loading comments...