Course on Big Data with Spark

Learn From Industry Experts

645 Ratings | 1800 Learners

Why learn Big Data?

As humans, we are immersed in data in our everyday lives. According to IBM, the world's data doubles every two years. The value that data holds can be understood only once we start to identify patterns and trends in it. Conventional computing approaches stop working when data becomes huge.

There is massive growth in the big data space, and job opportunities are skyrocketing, making this the perfect time to launch your career in this space.

Big Data Market Salary


Learning with CloudxLab means top-notch training by industry experts and best-in-class learning content.
100% money-back guarantee (see the FAQ at the bottom of this page for details of the refund policy)

Self-paced Learning

Course Design

High-quality videos, slides, hands-on examples, quizzes, automated assessments, case studies, real-world projects

Course Material

Lifetime access to cutting-edge self-paced learning content


90 Days of CloudxLab access for hands-on practice


24x7 email support to answer your queries


Earn a certificate in Big Data with Apache Spark
299   149

How will you benefit?

Skill Enhancement

Develop skills and competencies to excel and stand out in the Big Data domain

Career Growth

Get better roles and better packages

What do I get?

  • Lifetime access to course material

    Lifetime access to high-quality, self-paced learning content designed by industry experts.

  • 90 days of CloudxLab access

    Learn by practicing on a real-time distributed environment

  • Best-in-class support

    24x7 email support, with answers to your queries within one business day.

  • Training by professionals

    Learn from professionals with years of experience in processing Big Data and building enterprise products.

  • Verified certificate

    Receive a verified certificate and share it on LinkedIn.

  • LinkedIn recommendation & endorsements

    We will provide a LinkedIn Recommendation based on your performance.


Prerequisites and Requirements

  • Basics of SQL. You should know the basics of SQL and databases. If you understand filters in SQL, you should be able to follow the course.

  • Basic programming know-how. If you understand loops in any programming language, and you can create a directory and view the contents of a file from the command line, you can follow the concepts of this course even if you have not touched programming in the last 10 years. In addition, we provide video classes on the basics of Python and Scala.

Course Syllabus



2-3 Months

Skill Level  



  • What is Big Data?
  • Why Now?
  • Big Data Use Cases
  • Various Solutions
  • Overview of Hadoop Ecosystem
  • Spark Ecosystem Walkthrough
  • Quiz

Foundation & Environment

  • Understanding the CloudxLab
  • CloudxLab Hands-On
  • Hadoop & Spark Hands-on
  • Quiz and Assessment
  • Basics of Linux - Quick Hands-On
  • Understanding Regular Expressions
  • Quiz and Assessment
  • Setting up VM (optional)

Data Formats & Management

  • InputFormat and InputSplit
  • JSON
  • XML
  • Avro
  • How to store many small files - SequenceFile?
  • Parquet
  • Protocol Buffers
  • Comparing Compressions
  • Understanding Row Oriented and Column Oriented Formats - RCFile?

Scala Basics

  • Introduction to Scala
  • Accessing Scala using CloudxLab
  • Getting Started: Interactive, Compilation, SBT
  • Types, Variables & Values
  • Functions
  • Collections
  • Classes
  • Parameters
  • More Features
  • Quiz and Assessment

Spark Basics

  • What is Apache Spark?
  • Why Spark?
  • Using the Spark Shell on CloudxLab
  • Example 1 - Performing Word Count
  • Understanding Spark Cluster Modes on YARN
  • RDDs (Resilient Distributed Datasets)
  • General RDD Operations: Transformations & Actions
  • RDD Lineage
  • RDD Persistence Overview
  • Distributed Persistence

Writing and Deploying Spark Applications

  • Creating the SparkContext
  • Building a Spark Application (Scala, Java, Python)
  • The Spark Application Web UI
  • Configuring Spark Properties
  • Running Spark on Cluster
  • RDD Partitions
  • Executing Parallel Operations
  • Stages and Tasks
  • Project: Churning the logs of NASA Kennedy Space Center WWW server

Common Patterns in Spark Data Processing

  • Common Spark Use Cases
  • Example 1 - Data Cleaning (Movielens)
  • Example 2 - Understanding Spark Streaming
  • Understanding Kafka
  • Example 3 - Spark Streaming from Kafka
  • Iterative Algorithms in Spark
  • Project: Real-time analytics of orders in an e-commerce company

DataFrames and Spark SQL

  • Spark SQL and the SQL Context
  • Creating DataFrames
  • Transforming and Querying DataFrames
  • Saving DataFrames
  • DataFrames and RDDs
  • Comparing Spark SQL, Impala, and Hive-on-Spark

Machine Learning with Spark

  • GraphX: Graph Processing and Analysis
  • Understanding Machine Learning
  • MLlib Example: k-means
  • SparkR Example


Common questions and answers

  • How much time will it take to complete the course?

    It will take 2-3 months with 6-8 hours of effort per week.

  • What is the validity of course material?

    We understand that you might need the course material for a longer duration to make the most of your subscription. You get lifetime access to the course material (for as long as the company is operational), so you can refer to it anytime.

  • What is the certification process?

    At the end of the course, you will work on a real-time project. You will receive a problem statement along with a dataset to work on in CloudxLab. Once you complete the project, an expert will review it, and you will be awarded a certificate that you can share on LinkedIn.

  • How will the practicals or hands-on sessions be conducted?

    We provide 90 days of access to CloudxLab so that you can learn by practicing in a real-time environment.

  • I am not from a Java background. Can I take this course?

    Yes. Java is generally required for understanding MapReduce. MapReduce is a programming paradigm in which you write your logic as mapper and reducer functions. We provide a free self-paced course on Java. As soon as you sign up, it will be available in your account section.
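
    To give a feel for the mapper and reducer idea without any Java, here is a minimal sketch in plain Python. It is only an illustration under our own assumptions, not Hadoop code: the names mapper, reducer and run_word_count are hypothetical, and the grouping step stands in for the shuffle that a real framework performs across machines.

```python
from collections import defaultdict


def mapper(line):
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield (word, 1)


def reducer(word, counts):
    """Combine all the counts that were emitted for a single word."""
    return (word, sum(counts))


def run_word_count(lines):
    """Run the map, group-by-key and reduce steps on one machine."""
    grouped = defaultdict(list)
    for line in lines:
        for word, count in mapper(line):
            # Stand-in for the shuffle: collect values under their key.
            grouped[word].append(count)
    return dict(reducer(word, counts) for word, counts in grouped.items())


result = run_word_count(["big data with spark", "big data with hadoop"])
print(result)
```

    The mapper looks at one line at a time and the reducer at one word at a time, which is what allows a framework to spread the same logic over many machines.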

  • What is the refund policy for courses taken from CloudxLab?

    For self-paced courses, we provide a 100% fee refund if the request is raised within 7 days of the enrolment date. After that, no refund is provided.
    For instructor-led courses, we provide a 100% refund if no more than 1 live session has been conducted, and a 50% refund if 2-4 live sessions have been conducted. If 5 or more live sessions have been conducted, no refund is provided.

  • I have some more questions. Can I talk to someone?

    Absolutely! Please contact us at

Program Leads

Sandeep Giri, Course Instructor
Abhinav Singh, Course Developer
Benjamin Bertincourt, Course Advisor
Jatin Shah, Course Advisor
Amit Upadhyay, Course Advisor
Ratnaker Pandey, Course Advisor