Let's understand sorting in Hive. order by orders the data and guarantees global ordering: all of the data goes through a single reducer, and the sorting happens on that one reducer. For large datasets this is unacceptable because it overloads the single reducer; you do eventually get a fully sorted output file, but it takes a very long time.
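Here is a minimal sketch of such a query, assuming a hypothetical employees table with name and salary columns:

-- All rows flow through one reducer, producing a single globally sorted output file.
SELECT name, salary
FROM employees
ORDER BY salary DESC;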
To overcome the single-reducer problem with order by, the sort by and distribute by commands were introduced in Hive. With sort by, the data gets sorted at multiple reducers, roughly one reducer for each 1 GB of data. Although the output file produced by each reducer is sorted individually, concatenating these sorted files does not give a sorted file, because their ranges overlap.
Let's understand sort by with an example. As you can see, the data at Reducer 1 and Reducer 2 is sorted, but if we concatenate the output of the two reducers, the result is not sorted.
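A minimal sketch of the same query using sort by, again assuming the hypothetical employees table:

-- Each reducer sorts only its own share of the rows; every output file is
-- sorted, but their ranges overlap, so the concatenation is not globally sorted.
SELECT name, salary
FROM employees
SORT BY salary DESC;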
distribute by x works like a partitioner: each of the N reducers gets a non-overlapping range of x. However, the output of each reducer may not be sorted, so we end up with N unsorted files with non-overlapping ranges.
Let's understand distribute by with an example. As you can see, each reducer has non-overlapping ranges, but the output at Reducer 1 is not sorted.
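A minimal sketch using distribute by, assuming the hypothetical employees table also has a department column:

-- All rows with the same department value go to the same reducer, but rows
-- within each reducer remain unsorted unless sort by is added.
SELECT name, department, salary
FROM employees
DISTRIBUTE BY department;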
To sort data optimally in Hive and still get a globally ordered result, we use cluster by. cluster by combines distribute by and sort by on the same column: each reducer receives a non-overlapping range and sorts it, so concatenating the reducer outputs yields a globally sorted result. We should always use cluster by instead of order by on large datasets.
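A minimal sketch using cluster by, assuming the same hypothetical employees table:

-- Equivalent to DISTRIBUTE BY salary SORT BY salary: each reducer sorts a
-- non-overlapping range, so the concatenated outputs are globally ordered
-- without a single-reducer bottleneck.
SELECT name, salary
FROM employees
CLUSTER BY salary;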
We can cluster a table into multiple buckets. This ensures that the data is distributed and makes it easy to process in parallel. As displayed on the screen, we are bucketing a table into 32 buckets based on userid.
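A minimal sketch of such a table definition; the userid column and the 32-bucket count come from the lesson, while the table name and the other columns are placeholders:

-- Rows are hashed on userid into 32 buckets, which distributes the data
-- and allows it to be processed in parallel.
CREATE TABLE user_info (
  userid BIGINT,
  firstname STRING,
  lastname STRING
)
CLUSTERED BY (userid) INTO 32 BUCKETS;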