MapReduce Basics

3 / 9

MapReduce - Thinking in MR - Programatic & SQL

Not able to play video? Try with youtube

Converting a problem into map-reduce can be a little tricky and unintuitive. So, we will take it stepwise approach in order to understand MapReduce.

How would you find the frequencies of unique words in a text file? In other words, find the count of each unique word in the text file?

Consider it to be a practical assignment. You can suggest any tool or technique to accomplish this. There are few approaches. We will take you through these approaches and try to weigh pros and cons of each.

Lets first discuss the approach that involves programming. In this approach, we will use an in-memory data structure called, hash-map, or a dictionary. Initially, we create an empty dictionary such that the key will be the word and value will be how many times a word has occurred so far. we read the data one word at a time. for each word, we increase its count if it exists in the dictionary. Otherwise, we add the word in the dictionary with count as 1. And when there are no more words left, we print the key-values from the dictionary.

The same has been demonstrated using python code and flow chart.

What do you think are limitations of this approach?

It can not process the data more than what fits into the memory because the entire dictionary is maintained in the memory. So, if your text is really huge, you can not use this approach.

Second approach is to use SQL. Basically, break down the text from all these files into one word per line and then insert each word into a row in the table with single column. Once such table with one column has been created, we execute an SQL query with group by clause like: "select word, count(*) from table group by word".

This is a more practical approach because it doesn't involve writing too much of code. And it also does not suffer from memory overflow because databases manage memory properly while grouping the data.

Loading comments...