Adv Spark Programming


Adv Spark Programming - Key Performance Considerations - Partitions

Number of partitions in YARN

Let's take a look at a huge file:

hadoop fs -ls /data/msprojects/in_table.csv

Divide the file size (8303338297 bytes) by the 128 MB block size using Python:

$ python
>>> 8303338297.0/128.0/1024.0/1024.0
61.86469120532274

So /data/msprojects/in_table.csv should have 62 blocks theoretically. Let's check.

Let's check the actual number of blocks:

hdfs fsck /data/msprojects/in_table.csv

Let's run spark-shell in YARN mode:

spark-shell --packages net.sf.opencsv:opencsv:2.3 --master yarn

In the Spark shell, load the file and check the number of partitions:

var myrdd = sc.textFile("/data/msprojects/in_table.csv")
myrdd.partitions.length

So, in the case of sc.textFile, the number of partitions is a function of the number of HDFS data blocks.
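The block count is only the default; sc.textFile also accepts a minimum partition count if you want finer splits. A minimal sketch, assuming the same file and an illustrative count of 200:

// Default: roughly one partition per HDFS block (about 62 here).
val byBlocks = sc.textFile("/data/msprojects/in_table.csv")
byBlocks.partitions.length

// minPartitions is a hint: the underlying InputFormat produces at least this many splits when it can.
val finer = sc.textFile("/data/msprojects/in_table.csv", 200)
finer.partitions.length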

Number of partitions in local mode

Check the partitions:

var myrdd = sc.parallelize(1 to 100000)
myrdd.partitions.length

Find the number of processors by executing the following on the shell:

[sandeep@ip-172-31-60-179 ~]$ cat /proc/cpuinfo|grep processor
processor : 0
processor : 1
processor : 2
processor : 3

Since my machine has 4 cores, Spark has created 4 partitions by default.
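This default comes from sc.defaultParallelism, which in local mode equals the number of worker threads (cores) given to the shell. A minimal sketch, where the local[2] master string is just an illustration:

$ spark-shell --master local[2]

scala> sc.defaultParallelism                          // 2, the thread count we asked for
scala> sc.parallelize(1 to 100000).partitions.length  // also 2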

Number of partitions after loading using parallelize on YARN

$ spark-shell --master yarn

scala> var myrdd = sc.parallelize(1 to 100000)
scala> myrdd.partitions.length
res6: Int = 2
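
The 2 here is the scheduler's default parallelism on YARN, which is roughly the total number of executor cores granted, with a floor of 2 (unless spark.default.parallelism is set). A minimal sketch to inspect and override it; the resource flags and counts below are illustrative:

$ spark-shell --master yarn --num-executors 4 --executor-cores 2

scala> sc.defaultParallelism                              // reflects the cores granted by YARN
scala> sc.parallelize(1 to 100000).partitions.length      // follows defaultParallelism
scala> sc.parallelize(1 to 100000, 16).partitions.length  // an explicit slice count wins: 16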

Slides - Adv Spark Programming (2)



20 Comments

What's the difference between these two?

1.

$ spark-shell --packages net.sf.opencsv:opencsv:2.3 --master yarn
var myrdd = sc.parallelize(1 to 100000)

 

2.

$ spark-shell --master yarn
var myrdd = sc.parallelize(1 to 100000)


Hi,

To use an external package with Spark, we use the '--packages' option. Since we want to use the opencsv package here, we have added the '--packages' option to the 'spark-shell' command. If we don't need any external package, we simply omit this option.


Remember, the '--packages' option is used for external packages.
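
For example, once the shell is started with the opencsv package, its CSVParser class is available on the classpath. A minimal sketch, assuming the same file as above; the separator and column handling are illustrative:

import au.com.bytecode.opencsv.CSVParser

// Build one parser per partition so the parser object never needs to be shipped from the driver.
val rows = sc.textFile("/data/msprojects/in_table.csv").mapPartitions { lines =>
  val parser = new CSVParser(',')
  lines.map(line => parser.parseLine(line))
}
rows.take(2).foreach(cols => println(cols.mkString(" | ")))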


According to the above statement, when I run the below code:

var myrdd = sc.parallelize(1 to 100000)
myrdd.partitions.length

I get the partitions length as 2.

But, "cat /proc/cpuinfo|grep processor"  displays 6 processors,

Can anyone explain how this statement "Since my machine has X cores, it has created X partitions" still hold in this case?


Hi,

Please refer to the video from 1:56 for it.



The statements "So, number of partitions is a function of number of data blocks in case of sc.textFile." and "When we are running in yarn mode, the number of partitions is function of tasks that can be executed on a node, Here it is 2."

What is being conveyed? When it is run under YARN mode, would it not use all 16 processors?


Hi Punit,

When we are running on YARN, we can only use the resources allocated to YARN. As said in the video as well, only 2 are available for YARN here.

Hope this helps you.


I am getting 16 partitions in both cases: in the default spark-shell and in spark-shell with master yarn.


Can you explain, with two code examples, the difference between coalesce and repartition? The example with filter from S3 in the video can't be visualized; please give more clarity.

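For reference, a minimal sketch contrasting the two on a generic RDD (the partition counts are illustrative):

val rdd = sc.parallelize(1 to 100000, 16)

// repartition can increase or decrease the partition count; it always does a full shuffle.
val more = rdd.repartition(32)
more.partitions.length     // 32

// coalesce is intended for reducing the count; by default it merges partitions without a shuffle.
val fewer = rdd.coalesce(4)
fewer.partitions.length    // 4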

Hi Satyajit,

The last 2 slides (62 and 63) have the same code. On slide #62 only the first 2 lines are in Scala; the remaining code is in Python. Could CXL provide the Scala code?

I tried to execute the Python code, but the command "rdd.getNumPartitions()" is taking ages to return. It has been executing for more than an hour with no response.


You are right, the Python equivalent of the Scala code 'myrdd.partitions.length' is 'myrdd.getNumPartitions()'.

Sometimes the job gets stuck on YARN/Spark; please try again in such cases.


Not able to do my labs for the Spark hands-on. Please rectify ASAP.


Hi, Amit.

Kindly check now if the error still persists. We have restarted the namenode.
Kindly send the screenshots if the error still exists.

All the best!

-- Satyajit Das


It's fine now. Thanks.


Something is wrong with my account, as I am getting an error in HDFS. Getting this error:

"Cannot access: /user/amitkumar190919768036. The HDFS REST service is not available.
HTTPConnectionPool(host='cxln1.c.thelab-240901.internal', port=50070): Max retries exceeded with url: /webhdfs/v1/user/amitkumar190919768036?op=GETFILESTATUS&user.name=hue&doas=amitkumar190919768036 (Caused by NewConnectionError('<requests.packages.urllib3.connection.httpconnection object="" at="" 0x7f3e0b52ad10="">: Failed to establish a new connection: [Errno 111] Connection refused',))"


What will happen if I create more partitions than the number of nodes in the cluster?
Why are partitions necessary while handling RDDs?


You can create any number of partitions, but as shown in the above video, by default Spark will create a number of partitions equal to the number of cores.
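
A minimal sketch showing that the partition count can safely exceed the number of cores or nodes; each partition simply becomes one task, and the tasks run in waves when cores are scarce (the counts are illustrative):

val rdd = sc.parallelize(1 to 1000, 100)   // 100 partitions even on a small cluster
rdd.partitions.length                      // 100
// Each partition is processed as a separate task; with fewer cores than partitions
// the tasks are queued and executed in successive waves.
rdd.count()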
