Spark On Cluster


Apache Spark - Running On Cluster - Architecture

As part of this session, we will learn how to run Spark on a cluster.

Let us try to understand the architecture of Spark, that is, what goes on inside when we execute something on Spark.

As we understand, Spark is all about distributed or parallel computing, meaning utilizing many computers or processors in order to accomplish humongous tasks.

So, at its core, Spark has a master-slave architecture made up of two kinds of nodes: the driver and the workers. There is only one driver, while there can be many workers. The workers execute the workload; the driver launches the workload and coordinates between the user and the workers.

The user interacts with the driver, and the driver interacts with the individual workers.

The Spark driver is the process where the main() method of your Spark program runs.
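
For instance, here is a minimal sketch of a self-contained Spark program in Scala; the object name WordCountDriver, the application name, and the input path are made up for illustration. Whichever process runs this main() method is the driver.

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCountDriver {
      def main(args: Array[String]): Unit = {
        // The process executing this main() method is the Spark driver.
        val conf = new SparkConf().setAppName("word-count-driver")
        val sc = new SparkContext(conf)

        // The driver builds the computation; the executors run the actual tasks.
        val counts = sc.textFile("hdfs:///tmp/input.txt")   // hypothetical input path
                       .flatMap(_.split("\\s+"))
                       .map(word => (word, 1))
                       .reduceByKey(_ + _)

        println(counts.count())

        sc.stop()   // the application ends when the driver terminates
      }
    }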

When you launch a Spark shell, you’ve created a driver program which is running locally.

Once the driver terminates, the application is finished; in other words, the driver remains for the lifetime of the application.

While running, the driver first converts the user's program into tasks in two steps. First, it converts the user program into a DAG of transformations and actions. Then, it converts this DAG (a logical graph) into a physical execution plan.
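
As a rough sketch, you can inspect the logical graph the driver builds by printing an RDD's lineage (sc is assumed to be an existing SparkContext, as in the Spark shell; the numbers are only for illustration):

    // Build a small chain of transformations; nothing runs yet.
    val nums    = sc.parallelize(1 to 100)
    val doubled = nums.map(_ * 2)
    val evens   = doubled.filter(_ % 4 == 0)

    // toDebugString prints the RDD lineage, i.e. the DAG of transformations,
    // from which the driver derives the physical execution plan.
    println(evens.toDebugString)

    // Only when an action is called does the driver turn the DAG into
    // stages and tasks and schedule them on executors.
    println(evens.count())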

The second job that the driver performs is scheduling tasks on executors.

What does scheduling tasks on executors involve?

Scheduling involves coordinating the execution of individual tasks on executors.

The driver schedules tasks based on data placement.

The driver also tracks cached data and uses this information to schedule future tasks that access it.
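
Here is a small sketch of how cached data influences scheduling (the file path is hypothetical, and sc is assumed to be an existing SparkContext):

    // First action: tasks are scheduled close to the HDFS blocks (data placement).
    val logs   = sc.textFile("hdfs:///data/access.log")
    val errors = logs.filter(_.contains("ERROR")).cache()
    println(errors.count())   // computes the partitions and caches them on the executors

    // Later actions: the driver knows which executors hold the cached partitions
    // and prefers to schedule the new tasks on those executors.
    println(errors.filter(_.contains("timeout")).count())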

The driver also runs the Spark web interface, by default at port 4040.
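
For example, a small sketch of configuring and locating the UI; the port 4050 is an arbitrary alternative chosen for illustration, and sc.uiWebUrl is available in Spark 2.x and later:

    import org.apache.spark.{SparkConf, SparkContext}

    // The driver serves its web UI at spark.ui.port (4040 by default).
    val conf = new SparkConf()
      .setAppName("ui-demo")
      .set("spark.ui.port", "4050")   // arbitrary alternative port, for illustration only

    val sc = new SparkContext(conf)

    // The driver reports the UI address it actually bound to.
    println(sc.uiWebUrl.getOrElse("UI disabled"))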

Spark utilizes existing resource managers, such as YARN and Mesos, in order to execute its workload. In case you do not have any resource manager, Spark comes with its own; we call it the standalone cluster manager.

Spark is agnostic to the underlying cluster manager. As long as Spark can acquire executor processes, and these processes can communicate with each other, it can run. It is even easy to run Spark on a cluster manager such as Mesos or YARN that also supports other applications.

The cluster manager is a pluggable component in Spark that lets Spark talk to any resource manager.
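
A sketch of how the cluster manager is selected, via the master URL; the host names and ports below are placeholders:

    import org.apache.spark.SparkConf

    // The master URL tells Spark which cluster manager to talk to.
    val conf = new SparkConf().setAppName("cluster-manager-demo")

    conf.setMaster("local[*]")                    // no cluster: run everything in one local JVM
    // conf.setMaster("spark://master-host:7077") // Spark's own standalone cluster manager
    // conf.setMaster("mesos://mesos-host:5050")  // Apache Mesos
    // conf.setMaster("yarn")                     // Hadoop YARN (cluster located via HADOOP_CONF_DIR)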

Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program (called the driver program).

Specifically, to run on a cluster, the SparkContext can connect to several types of cluster managers (either Spark’s own standalone cluster manager, Mesos or YARN), which allocate resources across applications.

Once connected, Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for your application.

Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to the executors.
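
A rough sketch of these steps from the driver's side, assuming a standalone master at a placeholder address and a hypothetical application JAR:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("executor-demo")
      .setMaster("spark://master-host:7077")   // placeholder standalone master URL
      .set("spark.executor.memory", "2g")      // resources requested for each executor
      .set("spark.executor.cores", "2")        // each executor can run 2 tasks in parallel threads
      .setJars(Seq("/path/to/my-app.jar"))     // hypothetical JAR shipped to the executors

    // Creating the SparkContext connects to the cluster manager, acquires
    // executors on the worker nodes, and sends them the application code.
    val sc = new SparkContext(conf)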

Executors are the worker processes that run the tasks of a job.

Once finished, executors return the results back to the driver.
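
For instance (sc is assumed to be an existing SparkContext, as in the Spark shell):

    // The driver splits this job into one task per partition (4 here) and
    // sends the tasks to the executors.
    val squares = sc.parallelize(1 to 1000, 4).map(n => n * n)

    // Each executor computes its partitions and returns its piece of the
    // result to the driver, which assembles the final array.
    val result: Array[Int] = squares.collect()
    println(result.length)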

Executors are launched once at the beginning of a Spark application.

Executors run for the entire lifetime of an application.

Executors also provide in-memory storage for cached RDDs through a component called the Block Manager.
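
A sketch of caching, which stores the computed partitions in the executors' Block Managers (the path is hypothetical, and sc is assumed to be an existing SparkContext):

    import org.apache.spark.storage.StorageLevel

    val ratings = sc.textFile("hdfs:///data/ratings.csv")   // hypothetical path

    // persist() asks each executor's Block Manager to keep the computed
    // partitions, here in memory only.
    ratings.persist(StorageLevel.MEMORY_ONLY)

    println(ratings.count())   // first action computes and stores the partitions
    println(ratings.count())   // second action reads them back from the Block Managers

    ratings.unpersist()        // free the cached blocks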

Finally, SparkContext sends tasks to the executors to run.

Each application gets its own executor processes, which stay up for the duration of the whole application and run tasks in multiple threads.

This has the benefit of isolating applications from each other, on both the scheduling side (each driver schedules its own tasks) and executor side (tasks from different applications run in different JVMs).

However, it also means that data cannot be shared across different Spark applications (instances of SparkContext) without writing it to an external storage system.
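
A sketch of sharing data between two applications via external storage; the paths are hypothetical, and sc stands for each application's own SparkContext:

    // Application 1: write its result to external storage (e.g. HDFS).
    val result = sc.parallelize(Seq("alpha", "beta", "gamma"))
    result.saveAsTextFile("hdfs:///shared/result")

    // Application 2 (a different SparkContext, possibly started much later):
    // read the same data back from external storage.
    val shared = sc.textFile("hdfs:///shared/result")
    println(shared.count())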

The driver program must listen for and accept incoming connections from its executors throughout its lifetime. As such, the driver program must be network addressable from the worker nodes.

The Spark application's code is executed inside tasks.

These tasks do the actual work of executing the various workloads and maintaining the RDDs.

If Spark is being run on YARN, it runs inside YARN containers. The Spark application can also run on the local machine if we are running in standalone mode. We will learn about standalone mode soon.

The driver application translates the user's program into executable tasks and assigns them to the executors; it also starts these executor processes.

The driver application can run on the client machine or inside the cluster.
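
A sketch of the two options with spark-submit; the --master, --deploy-mode, and --class flags are standard, while the class and JAR names are made up:

    // Launching the same application in the two modes:
    //
    //   client mode  - the driver runs on the machine you submit from:
    //     spark-submit --master yarn --deploy-mode client  --class com.example.MyApp my-app.jar
    //
    //   cluster mode - the driver runs inside the cluster (e.g. in a YARN container):
    //     spark-submit --master yarn --deploy-mode cluster --class com.example.MyApp my-app.jar

    // Inside the application, the chosen mode is visible in the configuration.
    println(sc.getConf.get("spark.submit.deployMode", "client"))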
