Adv Spark Programming


Adv Spark Programming - Key Performance Considerations - Partitions

Number of partitions on YARN

Let's take a look at a huge file:

hadoop fs -ls /data/msprojects/in_table.csv

Divide the file size in bytes by the 128 MB HDFS block size using Python:

$ python
>>> 8303338297.0/128.0/1024.0/1024.0
61.86469120532274

So /data/msprojects/in_table.csv should span 62 blocks (61.86 rounded up to a whole block). Let's verify.
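The fractional result rounds up to a whole block, since a partial final block still counts as a block. A quick sanity check in Python, assuming the 128 MB block size used above:

```python
import math

# File size in bytes divided by a 128 MB HDFS block size,
# rounded up: the partial last block still occupies a block slot.
file_size = 8303338297
block_size = 128 * 1024 * 1024
print(math.ceil(file_size / block_size))  # 62
```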

Let's check the actual number of blocks with fsck:

hdfs fsck /data/msprojects/in_table.csv

Let's run spark-shell in YARN mode:

spark-shell --packages net.sf.opencsv:opencsv:2.3 --master yarn

In the Spark shell, load the file and check the number of partitions:

val myrdd = sc.textFile("/data/msprojects/in_table.csv")
myrdd.partitions.length

So, with sc.textFile the number of partitions is a function of the number of HDFS blocks in the file.
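As a rough model (a simplification of Hadoop's FileInputFormat split logic, ignoring split-size tuning): each HDFS block yields one partition, and sc.textFile's minPartitions argument (default 2) only matters when the file has fewer blocks than that. A hedged sketch:

```python
import math

def textfile_partitions(file_size, block_size=128 * 1024 * 1024, min_partitions=2):
    # One partition per HDFS block, but never fewer than minPartitions.
    # Simplified sketch for illustration; the real split computation
    # lives in Hadoop's FileInputFormat.
    return max(math.ceil(file_size / block_size), min_partitions)

print(textfile_partitions(8303338297))  # 62, matching myrdd.partitions.length
print(textfile_partitions(1024))        # 2: a tiny file still gets minPartitions
```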

Number of partitions in local mode

Check the partitions of a parallelized collection:

val myrdd = sc.parallelize(1 to 100000)
myrdd.partitions.length

Find the number of processors by running this in a shell:

[sandeep@ip-172-31-60-179 ~]$ cat /proc/cpuinfo|grep processor
processor : 0
processor : 1
processor : 2
processor : 3

Since my machine has 4 cores, Spark created 4 partitions.
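In local mode, spark.default.parallelism falls back to the number of cores the shell can see (local[*]), while local[N] pins it to N. A sketch of that documented rule, not Spark's actual code path:

```python
import os

def default_parallelism_local(pinned_cores=None):
    # local[N] -> N partitions; local[*] -> all visible cores.
    # Hypothetical helper mirroring Spark's documented local-mode default.
    return pinned_cores if pinned_cores is not None else os.cpu_count()

print(default_parallelism_local(4))  # 4, as on the 4-core machine above
```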

Number of partitions after loading with parallelize on YARN

$ spark-shell --master yarn

scala> val myrdd = sc.parallelize(1 to 100000)
scala> myrdd.partitions.length
res6: Int = 2
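On a cluster manager such as YARN, Spark's documented default parallelism is the larger of the total cores across all executors and 2, which is consistent with the res6: Int = 2 above when only a couple of executor cores (or none yet) were registered. A hedged sketch of that rule:

```python
def default_parallelism_cluster(total_executor_cores):
    # Spark docs: max(total cores on all executor nodes, 2).
    # Hypothetical helper for illustration only.
    return max(total_executor_cores, 2)

print(default_parallelism_cluster(0))  # 2 -> matches the shell output above
print(default_parallelism_cluster(8))  # 8 on a larger executor allocation
```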