Adv Spark Programming


Adv Spark Programming - Key Performance Considerations - Partitions

Number of partitions in YARN

Let's take a look at a huge file:

hadoop fs -ls /data/msprojects/in_table.csv

Divide the file size (8303338297 bytes) by the 128 MB block size using Python:

$ python
>>> 8303338297.0/128.0/1024.0/1024.0
61.86469120532274

So /data/msprojects/in_table.csv should have 62 blocks theoretically. Let's check.

Let's check the actual number of blocks:

hdfs fsck /data/msprojects/in_table.csv

Let's run spark-shell in YARN mode:

spark-shell --packages net.sf.opencsv:opencsv:2.3 --master yarn

In the Spark shell, load the file and check the number of partitions:

var myrdd = sc.textFile("/data/msprojects/in_table.csv")
myrdd.partitions.length

So, in the case of sc.textFile, the number of partitions is a function of the number of HDFS data blocks.
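The block count is only the default; sc.textFile also accepts a minimum partition count if you want finer splits. A minimal sketch, assuming the same file and an illustrative count of 200:

// Default: roughly one partition per HDFS block (about 62 here).
val byBlocks = sc.textFile("/data/msprojects/in_table.csv")
byBlocks.partitions.length

// minPartitions is a hint: the underlying InputFormat produces at least this many splits when it can.
val finer = sc.textFile("/data/msprojects/in_table.csv", 200)
finer.partitions.length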

Number of partitions in local mode

Check the partitions:

var myrdd = sc.parallelize(1 to 100000)
myrdd.partitions.length

Find the number of processors by executing the following on the shell:

[sandeep@ip-172-31-60-179 ~]$ cat /proc/cpuinfo|grep processor
processor : 0
processor : 1
processor : 2
processor : 3

Since my machine has 4 cores, Spark has created 4 partitions by default.
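This default comes from sc.defaultParallelism, which in local mode equals the number of worker threads (cores) given to the shell. A minimal sketch, where the local[2] master string is just an illustration:

$ spark-shell --master local[2]

scala> sc.defaultParallelism                          // 2, the thread count we asked for
scala> sc.parallelize(1 to 100000).partitions.length  // also 2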

Number of partitions after loading using parallelize on YARN

$ spark-shell --master yarn

scala> var myrdd = sc.parallelize(1 to 100000)
scala> myrdd.partitions.length
res6: Int = 2
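
The 2 here is the scheduler's default parallelism on YARN, which is roughly the total number of executor cores granted, with a floor of 2 (unless spark.default.parallelism is set). A minimal sketch to inspect and override it; the resource flags and counts below are illustrative:

$ spark-shell --master yarn --num-executors 4 --executor-cores 2

scala> sc.defaultParallelism                              // reflects the cores granted by YARN
scala> sc.parallelize(1 to 100000).partitions.length      // follows defaultParallelism
scala> sc.parallelize(1 to 100000, 16).partitions.length  // an explicit slice count wins: 16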

Slides - Adv Spark Programming (2)



20 Comments

What's the difference between these two?

1.

$ spark-shell --packages net.sf.opencsv:opencsv:2.3 --master yarn
var myrdd = sc.parallelize(1 to 100000)

 

2.

$ spark-shell --master yarn
var myrdd = sc.parallelize(1 to 100000)


Hi,

To use an external package with Spark, we use the '--packages' option. Since we want to use the opencsv package here, we have added the '--packages' option to the 'spark-shell' command. If we don't need any external package, we simply omit this option.


Remember, the '--packages' option is used for external packages.
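
For example, once the shell is started with the opencsv package, its CSVParser class is available on the classpath. A minimal sketch, assuming the same file as above; the separator and column handling are illustrative:

import au.com.bytecode.opencsv.CSVParser

// Build one parser per partition so the parser object never needs to be shipped from the driver.
val rows = sc.textFile("/data/msprojects/in_table.csv").mapPartitions { lines =>
  val parser = new CSVParser(',')
  lines.map(line => parser.parseLine(line))
}
rows.take(2).foreach(cols => println(cols.mkString(" | ")))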


According to the above statement, when I run the below code:

var myrdd = sc.parallelize(1 to 100000)
myrdd.partitions.length

I get the partitions length as 2.

But, "cat /proc/cpuinfo|grep processor"  displays 6 processors,

Can anyone explain how this statement "Since my machine has X cores, it has created X partitions" still hold in this case?


Hi,

Please refer to the video from 1:56 for it.



The statements "So, number of partitions is a function of number of data blocks in case of sc.textFile." and "When we are running in yarn mode, the number of partitions is function of tasks that can be executed on a node, Here it is 2."

What is being conveyed? When it is run under YARN mode, would it not use all 16 processors?


Hi Punit,

When we are running on YARN, we can only use the resources allocated to YARN. As said in the video as well, only 2 are available for YARN here.

Hope this helps you.


I am getting 16 partitions in both cases: in the default spark-shell and in spark-shell with master yarn.


Can you explain, with two code examples, the difference between coalesce and repartition? The example with filter from S3 in the video can't be visualized; please give more clarity.

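For reference, a minimal sketch contrasting the two on a generic RDD (the partition counts are illustrative):

val rdd = sc.parallelize(1 to 100000, 16)

// repartition can increase or decrease the partition count; it always does a full shuffle.
val more = rdd.repartition(32)
more.partitions.length     // 32

// coalesce is intended for reducing the count; by default it merges partitions without a shuffle.
val fewer = rdd.coalesce(4)
fewer.partitions.length    // 4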

Hi Satyajit,

The last 2 slides (62 and 63) have the same code. On slide #62 only the first 2 lines are in Scala; the remaining code is in Python. Could CXL provide the Scala code?

I tried to execute the Python code, but the command "rdd.getNumPartitions()" is taking ages to return. It has been executing for more than an hour with no response.


You are right, the Python equivalent of the Scala code 'myrdd.partitions.length' is 'myrdd.getNumPartitions()'.

Sometimes the job gets stuck on YARN/Spark; please try again in such cases.


Not able to do my labs for the Spark hands-on. Please rectify ASAP.


Hi, Amit.

Kindly check now if the error still persists. We have restarted the namenode.
Kindly send the screenshots if the error still exists.

All the best!

-- Satyajit Das


It's fine now. Thanks.


Something is wrong with my account, as I am getting an error in HDFS. Getting this error:

"Cannot access: /user/amitkumar190919768036. The HDFS REST service is not available.
HTTPConnectionPool(host='cxln1.c.thelab-240901.internal', port=50070): Max retries exceeded with url: /webhdfs/v1/user/amitkumar190919768036?op=GETFILESTATUS&user.name=hue&doas=amitkumar190919768036 (Caused by NewConnectionError('<requests.packages.urllib3.connection.httpconnection object="" at="" 0x7f3e0b52ad10="">: Failed to establish a new connection: [Errno 111] Connection refused',))"


What will happen if I create more partitions than the number of nodes in the cluster?
Why are partitions necessary while handling RDDs?


You can create any number of partitions, but as shown in the above video, by default Spark will create a number of partitions equal to the number of cores.
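
A minimal sketch showing that the partition count can safely exceed the number of cores or nodes; each partition simply becomes one task, and the tasks run in waves when cores are scarce (the counts are illustrative):

val rdd = sc.parallelize(1 to 1000, 100)   // 100 partitions even on a small cluster
rdd.partitions.length                      // 100
// Each partition is processed as a separate task; with fewer cores than partitions
// the tasks are queued and executed in successive waves.
rdd.count()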
