MapReduce Programming - Find users having same DNA

Problem

Write MapReduce code to find users who have the same DNA in a file stored in HDFS.

Dataset

The file is located at

/data/mr/dna/dna.txt

Sample Output

The output file lists each DNA sequence followed by the users who share it.

ACG ['User5', 'User3']
ACGT    ['User4', 'User1']

Steps

Review mapper.py and reducer.py on GitHub.
If you haven't cloned the CloudxLab GitHub repository yet, clone it into your home folder from the web console using the command below:
```
git clone https://github.com/singhabhinav/cloudxlab.git ~/cloudxlab
```
Otherwise, update your local copy:
```
cd ~/cloudxlab
git pull origin master
```

Go to the same_dna directory:

cd ~/cloudxlab/hdpexamples/python-streaming/same_dna/
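Before running the job, it can help to see the shape of the logic that a mapper and reducer for this problem need. The following is a minimal sketch, not the repository's actual mapper.py and reducer.py. It assumes each input line holds a user name and a DNA string separated by whitespace, and that only DNA strings shared by more than one user are emitted; the function names `map_lines`, `reduce_pairs`, and `run_locally` and the sample input are hypothetical.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper step: turn 'user dna' lines into (dna, user) pairs."""
    for line in lines:
        parts = line.split()
        if len(parts) != 2:  # skip blank or malformed lines
            continue
        user, dna = parts  # assumed column order: user first, DNA second
        yield dna, user


def reduce_pairs(sorted_pairs):
    """Reducer step: pairs arrive sorted by key; emit DNA shared by 2+ users."""
    for dna, group in groupby(sorted_pairs, key=lambda pair: pair[0]):
        users = [user for _, user in group]
        if len(users) > 1:
            yield dna, users


def run_locally(lines):
    # sorted() stands in for Hadoop's shuffle-and-sort between map and reduce
    return list(reduce_pairs(sorted(map_lines(lines))))


if __name__ == "__main__":
    # Made-up input for illustration only; the real dna.txt may differ
    sample = ["User1 ACGT", "User2 TTGA", "User3 ACG", "User4 ACGT", "User5 ACG"]
    for dna, users in run_locally(sample):
        print("%s\t%s" % (dna, users))
```

With Hadoop streaming, the mapper and reducer run as separate scripts that read standard input and write tab-separated lines to standard output; the framework performs the sort in between. The order of users inside each list is not guaranteed on a real cluster.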

Run the MapReduce code using Hadoop streaming. Make sure to save the output in the mapreduce-programming/same_dna directory inside your home directory in HDFS. Run the command below:

hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar -input /data/mr/dna/dna.txt -output mapreduce-programming/same_dna -mapper mapper.py -file mapper.py -reducer reducer.py -file reducer.py

If you cannot find the Hadoop streaming jar at that path, locate it by running: find /usr/hdp -name hadoop-streaming.jar

Check which users have the same DNA by typing the command below.
```
hadoop fs -cat mapreduce-programming/same_dna/*
```
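If you want to process the results further, each output line can be read back into Python. This is a small sketch that assumes the line format shown under Sample Output (a DNA string, whitespace, then a Python-style list of user names); the helper name `parse_output_line` is hypothetical.

```python
import ast


def parse_output_line(line):
    """Split one result line into (dna, users)."""
    # Split on the first run of whitespace, so a tab or spaces both work
    dna, users_repr = line.strip().split(None, 1)
    # literal_eval safely parses the list literal without executing code
    users = ast.literal_eval(users_repr)
    return dna, users


if __name__ == "__main__":
    # Lines copied from the Sample Output section
    for line in ["ACG ['User5', 'User3']", "ACGT    ['User4', 'User1']"]:
        dna, users = parse_output_line(line)
        print(dna, sorted(users))
```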

