Apache Spark - Running On Cluster - Cluster Mode - YARN

If you already have a Hadoop cluster, you can run Spark on it by passing the --master option with the value yarn to spark-submit.

Please note that the Spark application's tasks run inside YARN containers on the various node managers.

Before launching Spark on YARN, set the environment variables YARN_CONF_DIR and HADOOP_CONF_DIR to the location of Hadoop's configuration directory. Spark needs at least one of them to find the cluster's configuration files; here we set both. On CloudxLab or Hortonworks Data Platform, the location is /etc/hadoop/conf/

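Let's take a look at how to launch a Spark application on YARN. For this, we are going to run the example application shipped with Spark. This example estimates the value of pi by picking random points inside a square of side a, counting how many of them also fall inside the circle inscribed in that square, and taking the ratio of the two counts. That ratio approximates the ratio of the circle's area, pi*a*a/4, to the square's area, a*a, so pi is roughly 4 times the fraction of points that land inside the circle.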
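To make the area argument concrete, here is a minimal local sketch of the same idea in plain Python. It is not the code of the example shipped with Spark; the function name estimate_pi, the sample count and the seed are assumptions chosen for illustration.

```python
import random


def estimate_pi(num_points, seed=0):
    """Estimate pi by sampling points uniformly in the square [-1, 1] x [-1, 1].

    Hypothetical helper for illustration only; the shipped Spark example
    distributes this kind of counting across the cluster.
    """
    rng = random.Random(seed)  # seeded so that repeated runs give the same estimate
    inside = 0
    for _ in range(num_points):
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        # The point is inside the inscribed circle of radius 1
        if x * x + y * y <= 1.0:
            inside += 1
    # circle area / square area = pi / 4, so pi is about 4 * inside / total
    return 4.0 * inside / num_points


print(estimate_pi(200_000))
```

The estimate gets closer to pi as the number of points grows, which is why a job like this benefits from spreading the sampling over many tasks.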

For this, we first export the two variables and then run the spark-submit command.

Let us take a look. First, log in to the CloudxLab web console or connect over ssh, then export the two variables, and then run the spark-submit command. As you can see, the estimated value of pi is roughly 3.142344.
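The two steps described above can be sketched roughly as follows. This small Python helper only assembles the environment and the spark-submit command line and prints the equivalent shell commands; it does not launch anything. The helper name build_spark_submit, the jar path /path/to/spark-examples.jar and the argument 10 are placeholders assumed for illustration, since the exact command is not given in the text. The class name org.apache.spark.examples.SparkPi is the Pi example that ships with Spark.

```python
import os
import shlex


def build_spark_submit(conf_dir="/etc/hadoop/conf/",
                       examples_jar="/path/to/spark-examples.jar",  # placeholder path (assumption)
                       slices=10):                                   # placeholder argument (assumption)
    """Return (env, cmd) for submitting the Pi example to YARN."""
    env = dict(os.environ)
    # Point both variables at Hadoop's configuration directory
    env["YARN_CONF_DIR"] = conf_dir
    env["HADOOP_CONF_DIR"] = conf_dir
    cmd = [
        "spark-submit",
        "--master", "yarn",
        "--class", "org.apache.spark.examples.SparkPi",
        examples_jar,
        str(slices),
    ]
    return env, cmd


env, cmd = build_spark_submit()
# Print the equivalent shell commands instead of running them
print("export YARN_CONF_DIR=" + shlex.quote(env["YARN_CONF_DIR"]))
print("export HADOOP_CONF_DIR=" + shlex.quote(env["HADOOP_CONF_DIR"]))
print(shlex.join(cmd))
```

On a machine where Spark is installed, the same env and cmd could be passed to subprocess.run(cmd, env=env, check=True) to submit the job, after replacing the placeholder jar path with the real location of the examples jar.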

We can also check in Hue. The Spark job should appear among our YARN jobs. Log in to Hue with your username and password and open the Job Browser. You can see that there is only one job so far, named Spark Pi. So our last job was indeed executed via YARN.
