MapReduce Basics

MapReduce - Overview

So, what is MapReduce?

MapReduce, in very simple words, is a programming paradigm that helps us solve Big Data problems. Hadoop MapReduce is a framework that implements this paradigm. It is especially well suited to tasks that are sorting-intensive or disk-read-intensive.

Typically, you write two functions, or pieces of logic: a mapper and a reducer. The mapper converts every record from the input into key-value pairs. The reducer aggregates the values for each key emitted by the mapper, that is, by the map phase.
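To make the two roles concrete, here is a minimal sketch in plain Python, using word counting as the example. The names mapper and reducer and the sample sentence are assumptions chosen for illustration; this is not Hadoop's API, only the shape of the logic you would supply.

```python
def mapper(record):
    """Convert one input record (a line of text) into key-value pairs."""
    for word in record.lower().split():
        # Key: the word itself. Value: a count of 1 for this occurrence.
        yield (word, 1)


def reducer(key, values):
    """Aggregate all the values that share the same key."""
    return (key, sum(values))


# One record goes in, several key-value pairs come out.
pairs = list(mapper("Big data big ideas"))

# The reducer sees one key together with every value emitted for it.
total = reducer("big", [1, 1])
```

Here the mapper decides what the key is, and the reducer decides how the values for that key are combined.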

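The MapReduce paradigm is also supported by many other systems, such as Apache Spark, MongoDB, CouchDB, and Cassandra. In Hadoop, MapReduce jobs can be written in Java, as shell scripts, in Python, or as any executable binary (non-Java programs typically run through Hadoop Streaming).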

Let's take a quick look at how a MapReduce job gets executed. In this diagram, we have three machines containing data, and the map function is executed on each of them. The mapper is logic that you define: it takes a record as input and converts it into key-value pairs. This logic can be very simple or very complex, depending on your need. Hadoop then sorts these key-value pairs and groups them by key. Finally, all of the values for each key are aggregated using your reducer logic.
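The flow described here (map on each machine, sort and group by key, then reduce) can be simulated on a single machine. The sketch below is only an illustration: run_map_reduce, map_record, reduce_values, and the three sample splits are hypothetical names and data, and real Hadoop distributes these steps across nodes rather than running them in one loop.

```python
from itertools import groupby
from operator import itemgetter


def map_record(record):
    """User-supplied map logic: one record in, key-value pairs out."""
    for word in record.split():
        yield (word.lower(), 1)


def reduce_values(key, values):
    """User-supplied reduce logic: combine all values for one key."""
    return (key, sum(values))


def run_map_reduce(splits, mapper, reducer):
    # Map phase: each split stands in for the data held by one machine.
    mapped = []
    for split in splits:
        for record in split:
            mapped.extend(mapper(record))

    # Sort phase: order the pairs by key so equal keys become adjacent.
    mapped.sort(key=itemgetter(0))

    # Group and reduce phase: hand each key and its values to the reducer.
    return [
        reducer(key, [value for _, value in group])
        for key, group in groupby(mapped, key=itemgetter(0))
    ]


# Three splits, mimicking the three machines in the diagram.
splits = [["apple banana"], ["banana cherry"], ["apple apple"]]
word_counts = run_map_reduce(splits, map_record, reduce_values)
```

Note that the sort and group steps are written out by hand here only to show what happens between the two phases; in Hadoop the framework performs them for you.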

So, if you want to group data based on some criterion, you express that criterion in the mapper logic, and your reducer logic governs how the values for each key are combined. The result of the reducer is saved into HDFS.
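As a small illustration of this split of responsibilities, the sketch below groups temperature readings by city and keeps the maximum for each. The record format ("city,temperature"), the sample values, and the names city_mapper and max_reducer are all assumptions made for the example; the result is kept in memory rather than written to HDFS.

```python
from collections import defaultdict


def city_mapper(record):
    """The grouping criterion lives here: the city becomes the key."""
    city, temperature = record.split(",")
    yield (city.strip(), float(temperature))


def max_reducer(key, values):
    """How values are combined lives here: keep the highest reading."""
    return (key, max(values))


# Hypothetical input records in "city,temperature" form.
records = ["Delhi,41.5", "Pune,33.0", "Delhi,39.0", "Pune,35.5"]

# Stand-in for the framework's grouping of values by key.
grouped = defaultdict(list)
for record in records:
    for key, value in city_mapper(record):
        grouped[key].append(value)

hottest = dict(max_reducer(key, values) for key, values in grouped.items())
```

Changing the key in city_mapper changes what the data is grouped by, while swapping max for sum or min in max_reducer changes how each group is summarized.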

Let's imagine for a moment that we would like to prepare veg burgers on a very large scale. As you can see in the diagram, the function cut_Into_Pieces() is executed on each vegetable, chopping it into pieces, and the results are then reduced to form a burger.
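The analogy can be written down with Python's built-in map and functools.reduce. Only the name cut_Into_Pieces() comes from the diagram; its body, the helper add_to_burger, and the vegetable list are made up for this sketch.

```python
from functools import reduce


def cut_Into_Pieces(vegetable):
    """Map step: chop one vegetable into pieces (two, for simplicity)."""
    return [f"{vegetable} piece {i}" for i in (1, 2)]


def add_to_burger(burger, pieces):
    """Reduce step: combine the chopped pieces into the burger so far."""
    return burger + pieces


vegetables = ["tomato", "onion"]

# The same function is applied independently to every vegetable.
chopped = map(cut_Into_Pieces, vegetables)

# All of the chopped results are folded together into one burger.
burger = reduce(add_to_burger, chopped, [])
```

Because each vegetable is chopped independently, the map step is the part that can be spread across many machines.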
