MapReduce Programming - Find anagrams in a text file

Problem

Write MapReduce code to find anagrams in a text file stored in HDFS. An anagram is a rearrangement of the letters of a word. For this exercise, the rearranged word does not need to be meaningful.
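
One common way to detect anagrams is to sort the letters of each word: two words are anagrams of each other exactly when their sorted letters match. The sketch below illustrates that idea in plain Python, outside Hadoop. The function names `anagram_key` and `group_anagrams` are hypothetical and chosen only for illustration, and lowercasing the word before sorting is an assumption, not something the exercise specifies.

```python
from collections import defaultdict


def anagram_key(word):
    """Return the word's letters in sorted order; anagrams share this key."""
    return "".join(sorted(word.lower()))


def group_anagrams(words):
    """Group distinct words by their sorted-letter key."""
    groups = defaultdict(list)
    for word in words:
        key = anagram_key(word)
        # Skip repeated occurrences of the same word.
        if word not in groups[key]:
            groups[key].append(word)
    # Keep only keys shared by at least two distinct words.
    return {k: v for k, v in groups.items() if len(v) > 1}


if __name__ == "__main__":
    sample = ["bore", "boer", "robe", "hadoop", "bore"]
    print(group_anagrams(sample))
```

In a MapReduce job, the sorted-letter string is a natural choice for the key, because the framework then brings all words with the same key to the same reducer.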

Dataset

The file is located at

/data/mr/wordcount/big.txt

Sample Output

Each line of the output file lists a group of anagrams found in the text file, preceded by the number of words in the group:

3   ['bowel,', 'elbow,', 'below,']
3   ['bore', 'boer', 'robe']
3   ['bears', 'baser', 'saber']

Steps

Check out mapper.py and reducer.py on GitHub.
If you haven't cloned the CloudxLab GitHub repository, clone it into your home folder in the web console using the command below
```
git clone https://github.com/singhabhinav/cloudxlab.git ~/cloudxlab
```
Otherwise, update your local copy
```
cd ~/cloudxlab
git pull origin master
```

Go to the find_anagrams directory

cd ~/cloudxlab/hdpexamples/python-streaming/find_anagrams

Run the MapReduce code using Hadoop streaming. Make sure to save the output in the mapreduce-programming/find_anagrams directory inside your home directory in HDFS. Run the command below

hadoop jar /usr/hdp/2.6.2.0-205/hadoop-mapreduce/hadoop-streaming.jar -input /data/mr/wordcount/big.txt -output mapreduce-programming/find_anagrams -mapper mapper.py -file mapper.py -reducer reducer.py -file reducer.py
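
To see what a streaming job of this shape does, it can help to imitate the map, sort, and reduce phases in a single Python process. The sketch below is not the repository's mapper.py or reducer.py; the functions `mapper`, `reducer`, and `run_locally` are hypothetical, and the output format (count, a tab, then a Python list of words) is an assumption based on the sample output above.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'sorted_letters<TAB>word' for every whitespace-separated token."""
    for line in lines:
        for word in line.strip().split():
            word = word.lower()
            yield "%s\t%s" % ("".join(sorted(word)), word)


def reducer(sorted_lines):
    """Group consecutive lines by key and emit 'count<TAB>[words]'."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for _key, group in groupby(pairs, key=lambda kv: kv[0]):
        words = []
        for _, word in group:
            # Record each distinct word once.
            if word not in words:
                words.append(word)
        # A key with a single word has no anagram partner.
        if len(words) > 1:
            yield "%d\t%s" % (len(words), words)


def run_locally(lines):
    """Imitate the streaming pipeline: map, sort by key, then reduce."""
    return list(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    for out in run_locally(["bore boer robe", "the robe elbow below"]):
        print(out)
```

Here `sorted()` stands in for Hadoop's shuffle-and-sort step, which delivers all records with the same key to the reducer together. Real streaming scripts would read lines from standard input and print to standard output instead of passing Python lists around.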

Check the anagrams found by running the command below. It sorts the output by group size in descending order and shows the first 20 lines.

hadoop fs -cat mapreduce-programming/find_anagrams/* | sort -nr | head -n 20
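
If you copy the output to a local file, the same ranking can be reproduced in Python. This is a minimal sketch that assumes every output line has the form shown in Sample Output (a count, whitespace, then a Python-style list of words); `parse_line` and `top_groups` are hypothetical helper names.

```python
import ast


def parse_line(line):
    """Split 'count<whitespace>[list of words]' into (int, list)."""
    count, words = line.strip().split(None, 1)
    return int(count), ast.literal_eval(words)


def top_groups(lines, n=20):
    """Return the n largest anagram groups, biggest first."""
    parsed = [parse_line(line) for line in lines if line.strip()]
    return sorted(parsed, key=lambda cw: cw[0], reverse=True)[:n]


if __name__ == "__main__":
    sample = [
        "2   ['below', 'elbow']",
        "3   ['bore', 'boer', 'robe']",
    ]
    for count, words in top_groups(sample):
        print(count, words)
```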

