MapReduce - Understanding Sorting

Say you have a computer with a 1 GHz processor and 2 GB of RAM. How much time will it take to sort 1 TB of data? This data consists of 10 billion names of 100 characters each.

It would take around 6-10 hours.

What's wrong with getting it done in 6-10 hours? Sorting is a very common task, so we need it to be done faster. We may also need to sort bigger data sets, and to do so more often. On 8 September 2011, Google was able to sort 10 petabytes of data in 6.5 hours using 8000 computers with their MapReduce framework.
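The figures quoted so far can be checked with a little arithmetic. The sketch below assumes one byte per character and decimal units (1 TB = 10^12 bytes); the variable and function names are made up for illustration. It shows that the data is hundreds of times larger than the machine's RAM, and it converts both sorting runs into bytes sorted per second per machine.

```python
# Back-of-the-envelope check of the numbers in the text.
# Assumptions (not from the source): 1 character = 1 byte, decimal units.

NAMES = 10_000_000_000      # 10 billion names
BYTES_PER_NAME = 100        # 100 characters each
RAM_BYTES = 2 * 10**9       # 2 GB of RAM

total_bytes = NAMES * BYTES_PER_NAME      # size of the data set
ram_multiples = total_bytes // RAM_BYTES  # how many RAM-fulls of data there are


def throughput(bytes_sorted, hours, machines=1):
    """Average bytes sorted per second per machine."""
    return bytes_sorted / (hours * 3600) / machines


single_fast = throughput(total_bytes, 6)    # one machine, 6-hour estimate
single_slow = throughput(total_bytes, 10)   # one machine, 10-hour estimate
cluster_per_machine = throughput(10 * 10**15, 6.5, machines=8000)

print(f"data size: {total_bytes / 10**12:.0f} TB, {ram_multiples}x the RAM")
print(f"single machine: {single_slow / 10**6:.1f}-{single_fast / 10**6:.1f} MB/s")
print(f"cluster, per machine: {cluster_per_machine / 10**6:.1f} MB/s")
```

Under these assumptions the per-machine rates come out in the same range. The cluster gets its speed from spreading the work over thousands of machines rather than from each machine sorting much faster.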

Why are we talking about sorting, and why is it a big deal? When we talk about data processing, we often think about SQL, because the majority of data processing tasks can be performed with SQL.

If you take a look at SQL queries, most of the operations use, or are affected by, the sorting algorithm of the database. For example, if we create an index on a table, queries with a "where" clause on the indexed column become enormously faster, and indexing is basically "sorting" under the hood. Similarly, the "group by" construct of SQL involves first sorting the rows and then finding the unique values.
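To make the link between these SQL constructs and sorting concrete, here is a minimal Python sketch. It is a simplification, not how any particular database is implemented: real engines typically use structures such as B-trees, and may group with hashing instead. The table `rows` and the helpers `where_name_equals` and `group_by_sum` are hypothetical names invented for this example.

```python
from bisect import bisect_left
from itertools import groupby

# A toy "table" of (name, amount) rows, in no particular order.
rows = [("carol", 30), ("alice", 25), ("bob", 41), ("alice", 7), ("bob", 9)]

# A toy "index": the names in sorted order, each paired with its row position.
index = sorted((name, pos) for pos, (name, _) in enumerate(rows))
keys = [name for name, _ in index]


def where_name_equals(target):
    """Find matching rows by binary search on the sorted index, not a full scan."""
    i = bisect_left(keys, target)
    matches = []
    while i < len(keys) and keys[i] == target:
        matches.append(rows[index[i][1]])
        i += 1
    return matches


def group_by_sum(table):
    """GROUP BY name with SUM(amount): sort first, then merge equal neighbours."""
    ordered = sorted(table, key=lambda r: r[0])
    return {
        name: sum(amount for _, amount in group)
        for name, group in groupby(ordered, key=lambda r: r[0])
    }


print(where_name_equals("alice"))
print(group_by_sum(rows))
```

Once the keys are sorted, a lookup takes a number of steps proportional to the logarithm of the table size, and grouping needs only a single pass because equal keys sit next to each other.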

Joins are easier if the tables are already indexed. And "order by" is, obviously, just sorting the data. Beyond SQL, many complex algorithms also rely partly on sorting.
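One way to see why sorted inputs make joins easier is a sort-merge join, sketched below in Python. This is an illustrative sketch under simple assumptions (rows are `(key, value)` tuples and everything fits in memory), not a description of what a specific database does; `sort_merge_join`, `users` and `orders` are hypothetical names.

```python
def sort_merge_join(left, right):
    """Inner-join two lists of (key, value) rows by walking them in key order."""
    left = sorted(left, key=lambda r: r[0])
    right = sorted(right, key=lambda r: r[0])
    i = j = 0
    joined = []
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][0], right[j][0]
        if left_key < right_key:
            i += 1                      # no partner for this left row
        elif left_key > right_key:
            j += 1                      # no partner for this right row
        else:
            # Find the run of right rows that share this key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == left_key:
                j_end += 1
            # Pair every left row with that key against the whole run.
            while i < len(left) and left[i][0] == left_key:
                for _, right_value in right[j:j_end]:
                    joined.append((left_key, left[i][1], right_value))
                i += 1
            j = j_end
    return joined


users = [(2, "bob"), (1, "alice"), (3, "carol")]
orders = [(1, "book"), (2, "pen"), (1, "lamp"), (4, "desk")]
print(sort_merge_join(users, orders))
```

After the two sorts, the join itself is a single forward pass over both inputs. If the tables were already sorted or indexed on the join key, the sorting step could be skipped.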
