Spark On Cluster


Apache Spark - Running On Cluster - Cluster Mode - Mesos+AWS

Mesos is a general-purpose cluster manager.

It runs both analytics workloads and long-running services (e.g., databases).

To use Spark on Mesos, pass a mesos:// URI to spark-submit:

spark-submit --master mesos://masternode:5050 yourapp

You can use ZooKeeper to elect a master in Mesos in a multi-master setup.

Use a mesos://zk:// URI pointing to a list of ZooKeeper nodes.

For example, if you have 3 nodes (n1, n2, n3) running ZooKeeper on port 2181, use the URI:

mesos://zk://n1:2181/mesos,n2:2181/mesos,n3:2181/mesos
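The multi-master URI plugs directly into spark-submit; a sketch, assuming the same node names as above and a hypothetical application file app.py:

```shell
# Submit to a multi-master Mesos cluster; ZooKeeper resolves the current leader.
# Node names (n1, n2, n3) and app.py are placeholders for illustration.
spark-submit \
  --master mesos://zk://n1:2181/mesos,n2:2181/mesos,n3:2181/mesos \
  app.py
```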

Spark comes with a built-in script to launch clusters on Amazon EC2.

First, create an Amazon Web Services (AWS) account.

Obtain an access key ID and secret access key.

Export these as environment variables:

export AWS_ACCESS_KEY_ID="..."

export AWS_SECRET_ACCESS_KEY="..."

Create an EC2 SSH key pair and download its private key file (needed for SSH access).
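A key pair can also be created from the command line; a sketch, assuming the AWS CLI is installed and configured (the key-pair name mykeypair is a placeholder):

```shell
# Create an EC2 key pair and save the private key locally.
aws ec2 create-key-pair \
  --key-name mykeypair \
  --query 'KeyMaterial' \
  --output text > mykeypair.pem

# Restrict permissions so SSH accepts the key file.
chmod 400 mykeypair.pem
```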

Launch a cluster with the spark-ec2 script:

cd /path/to/spark/ec2

./spark-ec2 -k mykeypair -i mykeypair.pem launch mycluster
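Once launched, the same script manages the cluster lifecycle; a sketch of common actions, using the same placeholder key pair and cluster name:

```shell
# Log in to the cluster's master node over SSH.
./spark-ec2 -k mykeypair -i mykeypair.pem login mycluster

# Stop the cluster's instances (they can be restarted later with 'start').
./spark-ec2 stop mycluster

# Permanently terminate the cluster and its instances.
./spark-ec2 destroy mycluster
```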

Start with local mode if this is a new deployment.

To use richer resource-scheduling capabilities (e.g., queues), use YARN or Mesos.

When sharing among many users is the primary criterion, use Mesos.

In all cases, it is best to run Spark on the same nodes as HDFS for fast access to storage.

You can either install Mesos or a Standalone cluster on the DataNodes, or use a Hadoop distribution, which installs YARN and HDFS together.
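These deployment choices map to different --master URIs in spark-submit; a sketch (host names, ports, and app.py are placeholders):

```shell
# Local mode: run everything in one JVM with 2 worker threads.
spark-submit --master "local[2]" app.py

# Standalone cluster manager.
spark-submit --master spark://masternode:7077 app.py

# YARN: the ResourceManager is located via HADOOP_CONF_DIR, not the URI.
spark-submit --master yarn app.py

# Mesos.
spark-submit --master mesos://masternode:5050 app.py
```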




9 Comments

Please help with this.


Hi Mohd,

These commands will not work on the lab as we do not have Mesos. Also, these commands should be run in the console, not in the Spark shell.


Which cluster/resource manager is more efficient with respect to Spark, YARN or Mesos?


Hi Punit,

Both YARN and Mesos are good for distributed resource management, and they support a variety of workloads like MapReduce, Spark, Flink, Storm, etc. So there is no specific answer as to which one is more efficient with respect to Spark.

Hope this helps.


Thanks Abhinav


1. Is there a Mesos setup on CloudXLab? Are there any examples for Mesos and EC2?

2. "Spark comes with a built-in script to launch clusters on Amazon EC2."

Is a built-in script available for other cloud platforms like Azure, GCP, etc.?


> 1. Is there a Mesos setup on CloudXLab?

No.

> Are there any examples for Mesos and EC2?

No.

> 2. "Spark comes with a built-in script to launch clusters on Amazon EC2." Is a built-in script available for other cloud platforms like Azure, GCP, etc.?

Nope.


OK, not fully understood.


What are your doubts? Don't spam.
