Apache Spark Basics


Apache Spark ecosystem walkthrough





Apache Spark is a fast and general engine for large-scale data processing.

It is around 100 times faster than MapReduce when working entirely in memory (RAM), and around 10 times faster when using the disk.

It builds upon paradigms similar to those of MapReduce.

It is well integrated with Hadoop as it can run on top of YARN and can access HDFS.

Resource Managers

A cluster resource manager, or simply resource manager, is a software component that manages resources such as the memory, disk, and CPU of the machines connected in a cluster.

Apache Spark can run on top of many cluster resource managers, such as YARN, Amazon EC2, or Mesos. If you don't have a resource manager yet, you can use Apache Spark in standalone mode.
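As a rough sketch of how this choice shows up in code (assuming Spark 2.x with Scala; the application name here is made up), the master URL passed when building the session picks the resource manager:

import org.apache.spark.sql.SparkSession

// The master URL selects the resource manager:
//   "yarn"               - run on YARN
//   "mesos://host:5050"  - run on Mesos
//   "spark://host:7077"  - run on a standalone Spark cluster
//   "local[*]"           - run locally, using all cores
val spark = SparkSession.builder()
  .appName("resource-manager-demo")  // hypothetical application name
  .master("local[*]")
  .getOrCreate()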

Sources

Instead of building its own file or data storage, Apache Spark can read from all kinds of data sources: the Hadoop Distributed File System (HDFS), HBase, Hive, Tachyon, and Cassandra.
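For example, here is a minimal sketch of reading from two of these sources, assuming the spark session from above, Hive support enabled on the builder, and hypothetical paths and table names:

// Read a plain text file from HDFS (the path is hypothetical)
val lines = spark.sparkContext.textFile("hdfs:///user/demo/input.txt")

// Query a Hive table (requires .enableHiveSupport() on the builder;
// the table name is hypothetical)
val hiveDf = spark.sql("SELECT * FROM demo_table")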

Libraries

Apache Spark comes with a great set of libraries. DataFrames provide a generic way to represent data in a tabular structure, and they make it possible to query data using R or SQL instead of writing tons of code.
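Here is a minimal sketch of that idea, assuming the spark session from above and a hypothetical JSON file:

// Load a JSON file into a DataFrame (the file name is hypothetical)
val people = spark.read.json("people.json")

// Register it as a temporary view and query it with plain SQL
people.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age >= 18").show()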

The Streaming library makes it possible to process fast incoming streams of huge data using Spark.
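A minimal word-count sketch using the classic DStream API, assuming the spark session from above and a hypothetical host and port to read from:

import org.apache.spark.streaming.{Seconds, StreamingContext}

// Process the stream in 10-second micro-batches
val ssc = new StreamingContext(spark.sparkContext, Seconds(10))

// Count words arriving on a TCP socket (host and port are hypothetical)
val counts = ssc.socketTextStream("localhost", 9999)
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.print()
ssc.start()
ssc.awaitTermination()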

MLlib is a very rich machine learning library. It provides sophisticated algorithms which run in a distributed fashion.
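For instance, here is a minimal k-means clustering sketch; the DataFrame df and its columns x and y are hypothetical:

import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler

// Combine numeric columns into the single vector column MLlib expects
// (the DataFrame df and its columns x and y are hypothetical)
val assembled = new VectorAssembler()
  .setInputCols(Array("x", "y"))
  .setOutputCol("features")
  .transform(df)

// Cluster the points into 3 groups; training runs distributed across the cluster
val model = new KMeans().setK(3).setFeaturesCol("features").fit(assembled)
model.clusterCenters.foreach(println)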

GraphX makes it very simple to represent huge data as a graph. It provides a library of algorithms to process graphs using multiple computers.
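For example, a minimal PageRank sketch, assuming the spark session from above and a hypothetical edge list file on HDFS:

import org.apache.spark.graphx.GraphLoader

// Build a graph from an edge list file with one "srcId dstId" pair per line
// (the HDFS path is hypothetical)
val graph = GraphLoader.edgeListFile(spark.sparkContext, "hdfs:///user/demo/edges.txt")

// Run PageRank until the ranks converge within the given tolerance
val ranks = graph.pageRank(0.0001).vertices
ranks.take(5).foreach(println)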

Spark and its libraries can be used with Scala, Java, Python, R, and SQL. The only exception is GraphX, which can only be used with Scala and Java.

With this set of libraries, it is possible to do ETL, machine learning, real-time data processing, and graph processing on big data.

We will cover each component in detail as we go forward.



7 Comments

I am not able to access the Spark web UI:

http://10.142.0.5:4046/


Hi Sachin,

Let's say your web console is on f.cloudxlab.com and your Spark job is running on port 4045.
To access the Spark UI, type http://f.cloudxlab.com:4045 into your browser.


I tried http://f.cloudxlab.com:4040 but it is not working.

 


With PySpark I can access the UI, but with the Scala Spark shell I get the above error.

 


Hi,

Make sure the Spark session is active. Also, check that you are using the correct port number.


http://f.cloudxlab.com:4040

http://10.142.0.5:4040/

 

The Spark session is active and I tried both URLs. The Spark UI is not working with Scala but works fine with PySpark. Can you fix this?


Hi,

It's working fine from my end.

You need to follow these steps:

1. Open the Spark shell by running:

spark-shell

2. It will display some log lines while starting the shell. One of them reads 'Spark context Web UI available at', followed by a URL that includes a port number. Copy that port number and then check your lab URL.

3. Your lab URL will be either e.cloudxlab.com or f.cloudxlab.com. If it's f, then open: http://f.cloudxlab.com:port_number

If it's e, then open: http://e.cloudxlab.com:port_number

The port_number is the one from your Spark session.
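If you are already inside the Spark shell and missed the log line, you can also print the UI address directly; sc is the SparkContext the shell creates for you, and uiWebUrl is available from Spark 2.1 onwards:

// Prints the bound web UI address, e.g. http://10.142.0.5:4040
sc.uiWebUrl.foreach(println)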

 
