Enrollments closing soon for Post Graduate Certificate Program in Applied Data Science & AI By IIT Roorkee | 3 Seats Left

  Apply Now

Spark On Cluster

1 / 23

Apache Spark - Running On Cluster - Architecture

As part of this session we would learn how to run spark on the cluster.

Let us try to understand the architecture of Spark or what goes on inside when we execute something on spark.

As we understand that Spark is all about distributed or parallel computing meaning utilizing many computers or processors in order to accomplish humongous tasks.

So, at the core of spark it is a master slave architecture made of two kinds of nodes - driver and worker. There is only one driver and workers are many. While workers execute the workload, the driver launches the workload and coordinates between user and workers.

The user interacts with Driver. The driver interacts with individual workers.

Spark Driver is the Process where main() method of your spark program runs.

When you launch a Spark shell, you’ve created a driver program which is running locally.

Once the driver terminates, the application is finished. Meaning the driver remains for the lifetime of the application.

While running, the driver first convert a user's program into tasks by the way to two steps. First, it converts a user program into DAG of transformations and actions. Then, it converts DAG (logical graph) into a physical execution plan

The second task that driver performs is scheduling tasks on executors.

What does scheduling tasks on executors involve?

Scheduling involves coordinating the execution of individual tasks on executors

The driver schedules tasks based on data placement.

Tracks cached data and uses it to schedule future tasks

The driver also runs Spark web interface at port 4040.

Spark utilizes these existing resource managers in order to execute workload. In case you do not have any resource managers, spark comes with its own resource manager, we call it stand alone.

Spark is agnostic to the underlying cluster manager. As long as spark can acquire executor processes, and these processes can communicate with each other, it can run. It is easy to run it even on a cluster manager such as Mesos and YARN that also supports other applications.

The cluster manager is a pluggable component in spark to make spark talk to any resource manager.

Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program (called the driver program).

Specifically, to run on a cluster, the SparkContext can connect to several types of cluster managers (either Spark’s own standalone cluster manager, Mesos or YARN), which allocate resources across applications.

Once connected, Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for your application.

Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to the executors.

Executors are the worker processes that run tasks of a job

Once finished, executors return results back to the driver

Executors are launched once at the beginning of a Spark application.

Executors run for the entire lifetime of an application,

Executors also provide in-memory storage for cached RDDs using something called Block Manager

Finally, SparkContext sends tasks to the executors to run.

Each application gets its own executor processes, which stay up for the duration of the whole application and run tasks in multiple threads.

This has the benefit of isolating applications from each other, on both the scheduling side (each driver schedules its own tasks) and executor side (tasks from different applications run in different JVMs).

However, it also means that data cannot be shared across different Spark applications (instances of SparkContext) without writing it to an external storage system.

The driver program must listen for and accept incoming connections from its executors throughout its lifetimeAs such, the driver program must be network addressable from the worker nodes.

Inside task, the spark application is executed.

The Spark Application or task does the actual work of executing various workloads and maintaining RDD.

If spark is being run on YARN, it runs inside yarn containers. The spark application can also run on the local machine if we are running in standalone mode. We will learn about standalone mode soon.

The Driver Application translates user's program into executable tasks and assigns it on each spark Application. It starts the spark Applications.

The driver applications can run on the client machine or inside the cluster.

Slides Download


No hints are availble for this assesment

Answer is not availble for this assesment

Loading comments...