Apache Spark Basics


Apache Spark - Counting Word Frequencies





INSTRUCTIONS
  • The Scala code below counts word frequencies:

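    // Read the input file from HDFS; each RDD element is one line of text
    val linesRdd = sc.textFile("/data/mr/wordcount/input/big.txt")
    // Split each line on spaces and flatten the results into one RDD of words
    val words = linesRdd.flatMap(x => x.split(" "))
    // Pair every word with an initial count of 1
    val wordsKv = words.map(x => (x, 1))
    // Sum the counts per word; `_ + _` is shorthand for a named function such as:
    //def myfunc(x:Int, y:Int): Int = x + y
    val output = wordsKv.reduceByKey(_ + _)
    // Fetch the first 10 (word, count) pairs to the driver
    output.take(10)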
    

    We can also save the output to HDFS:

    output.saveAsTextFile("my_result")
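
To see what each step of the word count produces, the same pipeline can be sketched on plain Scala collections, which run without a Spark cluster. This is only an illustration: the two sample lines are made up (they are not taken from big.txt), and groupBy followed by reduce is a local stand-in for reduceByKey, not how Spark executes it. The myfunc helper is the named form of `_ + _` that appears commented out in the code above.

```scala
// Plain Scala collections standing in for the RDD, so this runs without Spark.
// The sample lines are hypothetical; they are not from big.txt.
val lines = Seq("to be or not to be", "that is the question")

// Same shape as linesRdd.flatMap(x => x.split(" ")): one element per word.
val words = lines.flatMap(x => x.split(" "))

// Same shape as words.map(x => (x, 1)): pair each word with a count of 1.
val wordsKv = words.map(x => (x, 1))

// Named form of the reducer; `_ + _` is shorthand for this function.
def myfunc(x: Int, y: Int): Int = x + y

// Local stand-in for reduceByKey(myfunc): group the pairs by word,
// then combine the counts within each group using myfunc.
val output: Map[String, Int] = wordsKv
  .groupBy { case (word, _) => word }
  .map { case (word, pairs) => word -> pairs.map(_._2).reduce(myfunc) }

// "to" and "be" each appear twice in the sample; every other word appears once.
output.foreach { case (word, count) => println(s"$word: $count") }
```

In the Spark version, reduceByKey(myfunc) and reduceByKey(_ + _) give the same result; the named function is just easier to read when the combining logic grows.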
    

Note - In the video, we used Hue to access the results in HDFS. Hue has since been deprecated, so please use the commands below in the web console to access the files.

  • Login to the web console
  • Check the files

    hadoop fs -ls my_result
    
  • Check the content of the first part

    hadoop fs -cat my_result/part-00000 | more
    
  • Check the content of the second part

    hadoop fs -cat my_result/part-00001 | more
    


10 Comments

Hi, I am unable to see Hue near Ambari, Jupyter and Web console. Could you please help me with that?


Hi Tanvi,

We have disabled Hue. You can refer to https://discuss.cloudxlab.com/t/should-we-be-using-hue/5821/2?u=shubh_tripathi for more details.


It's not creating part-00000 files for me. The result is different:

[aimlankit2262@cxln4 ~]$ hadoop fs -ls  my_result
Found 1 items
drwxr-xr-x   - aimlankit2262 aimlankit2262          0 2022-10-03 03:39 my_result/_temporary
[aimlankit2262@cxln4 ~]$

 


Hi Ankit, 

Can you please share the screenshot of the code you used for counting word frequencies?


 

 

Here, inside my directory, it's showing a temporary file; it should show some part files.


But, where have you created the partition? 


I am getting this error. Can anybody please help me with this?

 

<console>:1: error: Decimal integer literals may not have a leading zero. (Octal syntax is obsolete.)
hadoop fs -cat my_result/part-00000 | more


Hi, it's working fine on my end. Can you please check it again? If you are still facing the problem, share the code you used for counting word frequencies here.


Please add the Scala course before the Apache Spark course. This course is not organized correctly, and I am not able to follow it.

