#NoPayJan Offer - Access all CloudxLab Courses for free between 1st to 31st JanEnroll Now >>
Let's understand sorting in the Hive.
Order by orders the data and guarantees global ordering. All of the data goes through only a single reducer and sorting happens on that single reducer. For large datasets, this is unacceptable as it will overload the single reducer. Though you will get the sorted file as output but this is unacceptable as it will take a lot of time.
To overcome the single reducer problem with
sort by and
distribute by commands were introduced in Hive. In
sort by, the data get sorted at multiple reducers. There is one reducer for each 1 GB of data. Though the output files generated by each reducer are sorted individually, if we concatenate these sorted files, the final file is not sorted because these files have overlapping ranges.
Sort by with an example. As you can see the data at Reducer 1 and Reducer 2 is sorted but if we concatenate the output of the two reducers, the output is not sorted.
Distribute by x works like a partitioner. Each of the N reducers gets non-overlapping ranges of x. But the output of each reducer may not be sorted and we end up with N unsorted files with non-overlapping ranges.
distribute by with an example. As you can see each reducer is having non-overlapping ranges but output at Reducer 1 is not sorted.
To optimally sort data in Hive and ensure global ordering we use
cluster by combines the
distribute by and
sort by. We should always use
cluster by instead of
order by on large datasets.
We can cluster a table into multiple buckets. This ensures that the data is distributed and makes it easy to process in parallel. As displayed on the screen, we are bucketing a table into 32 buckets based on userid.
No hints are availble for this assesment
Answer is not availble for this assesment