hadoop fs -ls /data/msprojects/in_table.csv
$ python
>>> 8303338297.0/128.0/1024.0/1024.0
61.86469120532274
hdfs fsck /data/msprojects/in_table.csv
spark-shell --packages net.sf.opencsv:opencsv:2.3 --master yarn
var myrdd = sc.textFile("/data/msprojects/in_table.csv")
myrdd.partitions.length
So, in the case of sc.textFile, the number of partitions is a function of the number of HDFS data blocks.
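For instance, the file is about 8.3 GB and the HDFS block size here is 128 MB, so we expect roughly 62 blocks, and the RDD's partition count should match. A minimal sketch to check this in spark-shell (the file size is the one reported by hadoop fs -ls above):

// Compare the RDD's partition count with the expected number of 128 MB blocks
val fileRdd = sc.textFile("/data/msprojects/in_table.csv")
val fileSizeBytes = 8303338297L              // size reported by hadoop fs -ls
val blockSizeBytes = 128L * 1024 * 1024      // 128 MB HDFS block size
val expectedBlocks = math.ceil(fileSizeBytes.toDouble / blockSizeBytes).toInt
println(s"expected blocks: $expectedBlocks")           // ~62
println(s"partitions: ${fileRdd.partitions.length}")   // should also be ~62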
var myrdd = sc.parallelize(1 to 100000)
myrdd.partitions.length
[sandeep@ip-172-31-60-179 ~]$ cat /proc/cpuinfo|grep processor
processor : 0
processor : 1
processor : 2
processor : 3
Since my machine has 4 cores, it has created 4 partitions.
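As a quick sketch, you can see the same default via sc.defaultParallelism, and override it by passing an explicit number of slices to parallelize (the value 10 below is only illustrative):

println(sc.defaultParallelism)               // 4 on a 4-core machine in local mode
val rddDefault = sc.parallelize(1 to 100000)
println(rddDefault.partitions.length)        // 4, follows the default
val rddTen = sc.parallelize(1 to 100000, 10)
println(rddTen.partitions.length)            // 10, an explicit slice count wins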
$ spark-shell --master yarn
scala> var myrdd = sc.parallelize(1 to 100000)
scala> myrdd.partitions.length
res6: Int = 2
20 Comments
What's the difference between these two?
1.
$ spark-shell --packages net.sf.opencsv:opencsv:2.3 --master yarn
var myrdd = sc.parallelize(1 to 100000)
2.
$ spark-shell --master yarn
var myrdd = sc.parallelize(1 to 100000)
Hi,
To use any package with Spark, we use the '--packages' option. Since here we want to use the opencsv package with Spark, we have used the '--packages' option with the 'spark-shell' command. If we don't want to use any package, we simply leave the option out.
Remember, the '--packages' option is used for external packages.
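For example, once spark-shell has been started with the opencsv package, its classes can be imported inside the session. A rough sketch, assuming opencsv 2.3, whose classes live under au.com.bytecode.opencsv:

// Works only when spark-shell was started with --packages net.sf.opencsv:opencsv:2.3
import au.com.bytecode.opencsv.CSVParser
val lines = sc.textFile("/data/msprojects/in_table.csv")
val fields = lines.map { line =>
  val parser = new CSVParser(',')   // created inside the closure to avoid serialization issues
  parser.parseLine(line)            // returns an Array[String] of the columns
}
println(fields.first().mkString(" | "))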
According to the above statement, when I run the code below:
var myrdd = sc.parallelize(1 to 100000)
myrdd.partitions.length
I get a partition count of 2. But "cat /proc/cpuinfo|grep processor" displays 6 processors.
Can anyone explain how the statement "Since my machine has X cores, it has created X partitions" still holds in this case?
Hi,
Please refer to the video from 1:56 for it.
Regarding the statements "So, number of partitions is a function of number of data blocks in case of sc.textFile." and "When we are running in yarn mode, the number of partitions is function of tasks that can be executed on a node, Here it is 2.":
What is being conveyed? When it is run under YARN mode, would it not use all 16 processors?
Hi Punit,
When we are running under YARN, we can only use the resources available to YARN. As mentioned in the video, here only 2 are available to YARN.
Hope this helps you.
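One way to see this difference yourself is to compare sc.defaultParallelism in the two modes, a small sketch:

// local mode: sc.defaultParallelism is the number of cores on the machine
// yarn mode : sc.defaultParallelism is max(total cores across the executors, 2)
println(sc.defaultParallelism)
val myrdd = sc.parallelize(1 to 100000)
println(myrdd.partitions.length)   // matches sc.defaultParallelism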
I am getting 16 partitions in both cases: in the default spark-shell and in spark-shell with master yarn.
Upvote ShareHi, Sujeet.
Yes you are correct there are 16 partitions being created :https://discuss.cloudxlab.com/t/error-quota-exceeded-while-running-spark-job-but-havent-used-much-of-disk-space-solved/3472/4
All the best!
Can you explain, with code, two examples showing the difference between coalesce and repartition? The example with filter from S3 in the video can't be visualized; please give more clarity.
Here is the link you can refer to for the examples: https://medium.com/@mrpowers/managing-spark-partitions-with-coalesce-and-repartition-4050c57ad5c4
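In addition to the article, here is a minimal side-by-side sketch of the two (the partition counts are only illustrative):

val rdd = sc.parallelize(1 to 100000, 8)
// repartition can increase or decrease the partition count and always does a full shuffle
val more = rdd.repartition(16)
println(more.partitions.length)    // 16
// coalesce is meant for decreasing the count; by default it merges partitions without a shuffle
val fewer = rdd.coalesce(2)
println(fewer.partitions.length)   // 2
// coalesce(n, shuffle = true) behaves like repartition(n)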
Hi Satyajit,
The last 2 slides (62 and 63) have the same code. On slide 62, only the first 2 lines are changed to Scala; the remaining code is still Python. Could CXL provide the Scala code?
I tried to execute the Python code, but the command "rdd.getNumPartitions()" is taking ages to return. It has been executing for more than an hour with no response.
You are right, the Python equivalent of the 'myrdd.partitions.length' Scala code is 'myrdd.getNumPartitions()'.
Sometimes the job gets stuck on YARN/Spark; please try again in such cases.
Not able to do my labs for the Spark hands-on. Please rectify ASAP.
Upvote ShareHi, Amit.
Kindly check now if still the error persists. We have restarted the namenode.
Kindly send the screenshots if still the error exits.
All the best!
-- Satyajit Das
It's fine now. Thanks.
Upvote Sharesomething wrong at my account as geeting error in HDFS. Getting this error:
"Cannot access: /user/amitkumar190919768036. The HDFS REST service is not available.
Upvote ShareHTTPConnectionPool(host='cxln1.c.thelab-240901.internal', port=50070): Max retries exceeded with url: /webhdfs/v1/user/amitkumar190919768036?op=GETFILESTATUS&user.name=hue&doas=amitkumar190919768036 (Caused by NewConnectionError('<requests.packages.urllib3.connection.httpconnection object="" at="" 0x7f3e0b52ad10="">: Failed to establish a new connection: [Errno 111] Connection refused',))"
What will happen if I create a larger number of partitions than the number of nodes in the cluster?
Why are partitions necessary while handling RDDs?
You can create any number of partitions, but as shown in the above video, by default Spark will create a number of partitions equal to the number of cores.
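For completeness, a small sketch: sc.textFile also accepts an explicit minimum partition count that overrides these defaults (the value 100 is only illustrative):

// Spark will create at least this many input splits, independent of the block count
val rdd = sc.textFile("/data/msprojects/in_table.csv", 100)
println(rdd.partitions.length)   // >= 100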