Apache Spark with Python - Apache Spark Ecosystem

Note: In some videos in this course, the instructor uses Scala instead of Python. This course is based on Python, but as you progress you will see that Apache Spark can also be used with Scala and other languages.

We also have a dedicated course for learning Apache Spark with Scala; see our course page for details.

You may also notice Hue being used in some of the videos. We have deprecated Hue in our lab; see the discussion on our forum for details.

What is Apache Spark?

Apache Spark is a fast and general engine for large-scale data processing.
It can run some workloads up to around 100 times faster than MapReduce when the data is held in RAM, and around 10 times faster when it uses the disk.
It builds on paradigms similar to those of MapReduce.
It is well integrated with Hadoop as it can run on top of YARN and can access HDFS.
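
To make the shared paradigm concrete, here is a minimal sketch of the map, shuffle, and reduce steps as a word count. It is plain Python rather than Spark code, and the sample lines and variable names are made up for illustration.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical input: two short lines of text.
lines = ["spark is fast", "spark is general"]

# Map phase: emit a (word, 1) pair for every word in every line.
pairs = [(word, 1) for line in lines for word in line.split()]

# Shuffle phase: group the emitted values by key (the word).
grouped = defaultdict(list)
for word, one in pairs:
    grouped[word].append(one)

# Reduce phase: combine the values of each key into a single count.
counts = {
    word: reduce(lambda a, b: a + b, ones)
    for word, ones in grouped.items()
}

print(counts)
```

In PySpark, the same three steps are commonly written on an RDD with `flatMap`, `map`, and `reduceByKey`, and Spark distributes the work across the machines in the cluster.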

Resource Managers

A cluster resource manager, or simply resource manager, is a software component that manages the resources (such as memory, disk, and CPU) of the machines connected in a cluster.
Apache Spark can run on top of several cluster resource managers, such as YARN or Mesos, and it can be deployed on Amazon EC2.
If you don't have a resource manager yet, you can run Apache Spark in Standalone mode.
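
In practice, the choice of resource manager usually shows up as the "master URL" passed to Spark when an application starts. The sketch below builds such a URL for each mode. The `master_url` helper and the host names are hypothetical. The URL forms (`local[*]`, `yarn`, `spark://host:port`, `mesos://host:port`) and the default ports 7077 and 5050 are the ones commonly given in Spark's documentation, but check them against the Spark version you use.

```python
def master_url(mode, host=None, port=None):
    """Hypothetical helper: build a Spark master URL for a deployment mode."""
    if mode == "local":
        # Run Spark in a single process, using all available cores.
        return "local[*]"
    if mode == "yarn":
        # The cluster location is read from the Hadoop configuration.
        return "yarn"

    schemes = {"standalone": "spark", "mesos": "mesos"}
    default_ports = {"standalone": 7077, "mesos": 5050}
    if mode not in schemes:
        raise ValueError(f"unknown mode: {mode!r}")
    if host is None:
        raise ValueError(f"mode {mode!r} needs a host")
    return f"{schemes[mode]}://{host}:{port or default_ports[mode]}"


# Example hosts below are placeholders, not real machines.
print(master_url("local"))
print(master_url("yarn"))
print(master_url("standalone", "master.example.com"))
print(master_url("mesos", "mesos.example.com"))
```

The resulting string is what you would typically pass to `spark-submit --master` or to `SparkSession.builder.master(...)`.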

Sources

Instead of building its own file or data storage, Apache Spark can read from many kinds of data sources:

Hadoop Distributed File System
HBase
Hive
Tachyon
Cassandra

Libraries

Apache Spark comes with a great set of libraries.

DataFrames provide a generic way to represent data in a tabular structure. They make it possible to query data using R or SQL instead of writing a lot of code.
The Streaming library makes it possible to process large volumes of fast-arriving streaming data using Spark.
MLlib is a very rich machine learning library. It provides sophisticated algorithms that run in a distributed fashion.
GraphX makes it simple to represent huge data as a graph. It provides a library of algorithms to process graphs using multiple computers.

Spark and its libraries can be used with Scala, Java, Python, R, and SQL. The only exception is GraphX, which can only be used with Scala and Java.
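
As an illustration of querying tabular data with SQL from Python, here is a minimal sketch. It assumes PySpark is installed and can start a local Spark session. The sample rows, the `people` view name, and both function names are made up for this example. The plain-Python function is included only as a way to check the expected result on this tiny sample.

```python
# Hypothetical sample data: (name, age) rows.
ROWS = [("Alice", 34), ("Bob", 15), ("Carol", 52)]

# SQL query that will be run against a temporary view named "people".
QUERY = "SELECT name FROM people WHERE age >= 18 ORDER BY name"


def adults_with_spark(rows):
    """Run QUERY on a DataFrame built from rows (requires PySpark)."""
    from pyspark.sql import SparkSession  # imported lazily on purpose

    spark = (
        SparkSession.builder
        .master("local[*]")
        .appName("dataframe-sketch")
        .getOrCreate()
    )
    try:
        # Build a DataFrame with two named columns.
        df = spark.createDataFrame(rows, ["name", "age"])
        # Register it so that SQL statements can refer to it by name.
        df.createOrReplaceTempView("people")
        return [row["name"] for row in spark.sql(QUERY).collect()]
    finally:
        spark.stop()


def adults_plain(rows):
    """Plain-Python equivalent, used only to check the small sample."""
    return sorted(name for name, age in rows if age >= 18)


print(adults_plain(ROWS))
```

Calling `adults_with_spark(ROWS)` on a machine with PySpark should return the same names as `adults_plain(ROWS)`. On a cluster, the same query can run over data far too large for a single machine.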

With this set of libraries, it is possible to do ETL, machine learning, real-time data processing, and graph processing on Big Data.

We will cover each component in detail as we go forward.