Apache Spark Basics with Python


Apache Spark with Python - Counting Word Frequencies

Now that we know how to create an RDD, we will use one to count the frequency of words in a file located at the following path in HDFS:

/data/mr/wordcount/input/big.txt

To count the frequency of words, we will be using the following methods:

flatMap - It returns a new RDD by first applying a function to all elements of this RDD, and then flattening the results.

map - The map method returns a new RDD by applying a function to each element of this RDD.

reduceByKey - This method merges the values for each key using an associative and commutative reduce function. It also performs the merging locally on each mapper before sending results to a reducer, similar to a "combiner" in MapReduce.
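To see how these three methods fit together, here is a toy run on a small in-memory RDD (a minimal sketch, assuming an existing SparkContext named sc, as in the exercises below):

    # Build a tiny RDD of lines in memory instead of reading a file
    lines = sc.parallelize(["to be or", "not to be"])
    # flatMap: one line in, many words out, flattened into a single RDD
    words = lines.flatMap(lambda x: x.split(" "))
    # map: turn each word into a (word, 1) key-value pair
    pairs = words.map(lambda x: (x, 1))
    # reduceByKey: sum the 1s for each distinct word
    counts = pairs.reduceByKey(lambda x, y: x + y)
    print(counts.collect())  # e.g. [('or', 1), ('not', 1), ('to', 2), ('be', 2)]

Note that the order of the pairs returned by collect() is not guaranteed.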

Let's get to work. (A complete sketch of the finished pipeline follows the step-by-step instructions below.)

INSTRUCTIONS
  • Create an RDD from the file at the location given above and then save it in the variable linesRdd

    <<your code goes here>> = sc.textFile("/data/mr/wordcount/input/big.txt")
    
  • Use the flatMap method described above to split the lines into words and save the result in a variable named words

    <<your code goes here>> = linesRdd.flatMap(lambda x: x.split(" "))
    
  • Using the map method, create a key-value pair from each word, where the word is the key and the number 1 is the value, and store the result in a variable named wordsKv

    wordsKv = words.<<your code goes here>>(lambda x: (x, 1))
    
  • Now let's group by word and add up the numbers associated with each one to get the total number of times that word appears in the file, storing the result in a variable named output

    output = wordsKv.reduceByKey(lambda x, y: x + y)
    
  • Finally, let's look at the first 10 (word, count) pairs using the take method we used in the previous step

    output.<<your code goes here>>(10)
    
  • We can also save the result to a text file using the saveAsTextFile method

    output.<<your code goes here>>("my_result")
    
  • To inspect the my_result output in HDFS, first log in to the web console from another tab, or click the Console tab on the right side of this split screen

  • To list the files inside the my_result output directory, use the below command

    hadoop fs -ls my_result
    
  • To check the content of the first part of the output, use the below command

    hadoop fs -cat my_result/part-00000 | more
    
  • To check the content of the second part of the output, use the below command

    hadoop fs -cat my_result/part-00001 | more
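For reference, here is a minimal sketch of the complete pipeline with the blanks filled in as the method descriptions above suggest (treat it as one possible answer, not the official graded solution):

    # Read the file into an RDD of lines
    linesRdd = sc.textFile("/data/mr/wordcount/input/big.txt")
    # Split each line into words
    words = linesRdd.flatMap(lambda x: x.split(" "))
    # Pair each word with the number 1
    wordsKv = words.map(lambda x: (x, 1))
    # Sum the 1s per word to get its frequency
    output = wordsKv.reduceByKey(lambda x, y: x + y)
    # Peek at the first 10 (word, count) pairs
    output.take(10)
    # Save the full result to HDFS
    output.saveAsTextFile("my_result")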
    





4 Comments

Hi Team,

I am getting the below error; could you please help with this?


Hi, 

Before you can use Spark, you have to set several environment variables and system paths. You can do that manually, or you can use a package that does all this work for you; findspark is a suitable choice. It wraps up all these tasks in just two lines of code:

import findspark
findspark.init('/usr/spark2.4.3')

Here, we have used Spark version 2.4.3. You can specify any other version you want to use. You can check the available Spark versions using the following command:

!ls /usr/spark*
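
Once findspark.init has run, you can create a SparkContext in the usual way (a minimal sketch; getOrCreate reuses a context if the environment has already started one):

    import findspark
    findspark.init('/usr/spark2.4.3')

    from pyspark import SparkContext
    # Reuse an existing SparkContext if one is already running, else create one
    sc = SparkContext.getOrCreate()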

Hi Team,

I am getting the below error as mentioned in the screenshot. It says the RDD is not defined even though I have created "linesRdd" correctly. Kindly help me understand this.


Hi,

It's working fine from my end. Can you check it again?
