Specialization in
Big Data with Hadoop & Spark

Learn HDFS, ZooKeeper, Hive, HBase, NoSQL, Oozie, Flume, Sqoop, Spark, Spark RDD, Spark Streaming, Kafka, SparkR, SparkSQL, MLlib, and GraphX From Industry Experts

(9,025 Learners)

  60+ hours training

  90 days of Lab

  24x7 Support

8 Projects

  Compatible with Hortonworks, Cloudera Certifications

About the Specialization

As humans, we are immersed in data in our everyday lives. According to IBM, the world's data doubles every two years. The value that data holds can only be understood once we start to identify patterns and trends in it. Conventional computing approaches break down when data becomes huge.

There is massive growth in the big data space, and job opportunities are skyrocketing, making this the perfect time to launch your career in this space.

In this specialization, you will learn Hadoop and Spark to drive better business decisions and solve real-world problems.



2 courses

Learn from industry experts. 60+ hours of live video.

Projects & Lab

Apply the skills you learn on a distributed cluster to solve real-world problems.

Certificate

Highlight your new skills on your resume or LinkedIn.

1:1 Mentoring

Subscribe to 1:1 mentoring sessions and get guidance from industry leaders and professionals.

Best-in-class Support

24×7 support and forum access to answer all your queries throughout your learning journey.

Certifications

Compatible with: CCP Data Engineer, CCA Spark and Hadoop Developer, HDP Certified Developer, HDP Certified Developer: Spark
Enrollment

Self-Paced Learning

Start Immediately
Learn at your pace
90 days lab

190 299

Instructor-led Trainings

5 Aug
Sun, Sat
(10 weeks)
10:30 a.m. - 1:30 p.m. America/New_York

90 days lab
699 749
19 Aug
Sun, Sat
(10 weeks)
10:30 a.m. - 1:30 p.m. America/New_York

90 days lab
549 728

(Early Bird Offer)

21 Oct
Sun, Sat
(10 weeks)
10:30 a.m. - 12:30 p.m. America/New_York

90 days lab
549 728

(Early Bird Offer)

Learning Path

Course 1

Big Data with Hadoop

This is the first course in the specialization. In this course, we start with an introduction to Big Data and then dive into Big Data ecosystem tools and technologies such as ZooKeeper, HDFS, YARN, MapReduce, Pig, Hive, HBase, NoSQL, Sqoop, Flume, and Oozie.

Each topic consists of high-quality videos, slides, hands-on assessments, quizzes, and case studies to make learning effective and lasting. With this course, you also get access to a real-world production lab so that you learn by doing.

You can choose to take only this course.

1.1 Big Data Introduction

1.2 Distributed systems

1.3 Big Data Use Cases

1.4 Various Solutions

1.5 Overview of Hadoop Ecosystem

1.6 Spark Ecosystem Walkthrough

1.7 Quiz

2.1 Understanding the CloudxLab

2.2 Getting Started - Hands on

2.3 Hadoop & Spark Hands-on

2.4 Quiz and Assessment

2.5 Basics of Linux - Quick Hands-On

2.6 Understanding Regular Expressions

2.7 Quiz and Assessment

2.8 Setting up VM (optional)

3.1 ZooKeeper - Race Condition

3.2 ZooKeeper - Deadlock

3.3 Hands-On

3.4 Quiz & Assessment

3.5 How does election happen - Paxos Algorithm?

3.6 Use cases

3.7 When not to use

3.8 Quiz & Assessment

4.1 Why HDFS or Why not existing file systems?

4.2 HDFS - NameNode & DataNodes

4.3 Quiz

4.4 Advanced HDFS Concepts (HA, Federation)

4.5 Quiz

4.6 Hands-on with HDFS (Upload, Download, SetRep)

4.7 Quiz & Assessment

4.8 Data Locality (Rack Awareness)

5.1 YARN - Why not existing tools?

5.2 YARN - Evolution from MapReduce 1.0

5.3 Resource Management: YARN Architecture

5.4 Advanced Concepts - Speculative Execution

5.5 Quiz

6.1 MapReduce - Understanding Sorting

6.2 MapReduce - Overview

6.3 Quiz

6.4 Example 0 - Word Frequency Problem - Without MR

6.5 Example 1 - Only Mapper - Image Resizing

6.6 Example 2 - Word Frequency Problem

6.7 Example 3 - Temperature Problem

6.8 Example 4 - Multiple Reducer

6.9 Example 5 - Java MapReduce Walkthrough

6.10 Quiz

7.1 Writing MapReduce Code Using Java

7.2 Building MapReduce project using Apache Ant

7.3 Concept - Associative & Commutative

7.4 Quiz

7.5 Example 8 - Combiner

7.6 Example 9 - Hadoop Streaming

7.7 Example 10 - Adv. Problem Solving - Anagrams

7.8 Example 11 - Adv. Problem Solving - Same DNA

7.9 Example 12 - Adv. Problem Solving - Similar DNA

7.10 Example 12 - Joins - Voting

7.11 Limitations of MapReduce

7.12 Quiz

8.1 Pig - Introduction

8.2 Pig - Modes

8.3 Getting Started

8.4 Example - NYSE Stock Exchange

8.5 Concept - Lazy Evaluation

9.1 Hive - Introduction

9.2 Hive - Data Types

9.3 Getting Started

9.4 Loading Data in Hive (Tables)

9.5 Example: Movielens Data Processing

9.6 Advanced Concepts: Views

9.7 Connecting Tableau and HiveServer 2

9.8 Connecting Microsoft Excel and HiveServer 2

9.9 Project: Sentiment Analysis of Twitter Data

9.10 Advanced - Partition Tables

9.11 Understanding HCatalog & Impala

9.12 Quiz

10.1 NoSQL - Scaling Out / Up

10.2 NoSQL - ACID Properties and RDBMS Story

10.3 CAP Theorem

10.4 HBase Architecture - Region Servers etc

10.5 HBase Data Model - Column-Family Orientation

10.6 Getting Started - Create table, Adding Data

10.7 Adv Example - Google Links Storage

10.8 Concept - Bloom Filter

10.9 Comparison of NoSQL Databases

10.10 Quiz

11.1 Sqoop - Introduction

11.2 Sqoop Import - MySQL to HDFS

11.3 Exporting to MySQL from HDFS

11.4 Concept - Unbounded Dataset Processing or Stream Processing

11.5 Flume Overview: Agents - Source, Sink, Channel

11.6 Example 1 - Data from Local network service into HDFS

11.7 Example 2 - Extracting Twitter Data

11.8 Quiz

11.9 Example 3 - Creating workflow with Oozie

Course 2

Big Data with Spark

This is the second course in the specialization. In this course, we start with an introduction to Big Data and Spark and then dive into Scala and Spark concepts such as RDDs, transformations, actions, persistence, and deploying Spark applications. We then cover Spark Streaming, Kafka, and various data formats such as JSON, XML, Avro, Parquet, and Protocol Buffers. We conclude the course with DataFrames, SparkSQL, SparkR, MLlib, and GraphX.

Each topic consists of high-quality videos, slides, hands-on assessments, quizzes, and case studies to make learning effective and lasting. With this course, you also get access to a real-world production lab so that you learn by doing.

You can choose to take only this course.

1.1 Apache Spark ecosystem walkthrough

1.2 Spark Introduction - Why Spark?

1.3 Quiz

2.1 Scala - Quick Introduction - Access Scala on CloudxLab

2.2 Scala - Quick Introduction - Variables and Methods

2.3 Getting Started: Interactive, Compilation, SBT

2.4 Types, Variables & Values

2.5 Functions

2.6 Collections

2.7 Classes

2.8 Parameters

2.9 More Features

2.10 Quiz and Assessment

3.1 Apache Spark ecosystem walkthrough

3.2 Spark Introduction - Why Spark?

3.3 Using the Spark Shell on CloudxLab

3.4 Example 1 - Performing Word Count

3.5 Understanding Spark Cluster Modes on YARN

3.6 RDDs (Resilient Distributed Datasets)

3.7 General RDD Operations: Transformations & Actions

3.8 RDD lineage

3.9 RDD Persistence Overview

3.10 Distributed Persistence

4.1 Creating the SparkContext

4.2 Building a Spark Application (Scala, Java, Python)

4.3 The Spark Application Web UI

4.4 Configuring Spark Properties

4.5 Running Spark on Cluster

4.6 RDD Partitions

4.7 Executing Parallel Operations

4.8 Stages and Tasks

5.1 Common Spark Use Cases

5.2 Example 1 - Data Cleaning (Movielens)

5.3 Example 2 - Understanding Spark Streaming

5.4 Understanding Kafka

5.5 Example 3 - Spark Streaming from Kafka

5.6 Iterative Algorithms in Spark

5.7 Project: Real-time analytics of orders in an e-commerce company

6.1 InputFormat and InputSplit

6.2 JSON

6.3 XML

6.4 Avro

6.5 How to store many small files - SequenceFile?

6.6 Parquet

6.7 Protocol Buffers

6.8 Comparing Compressions

6.9 Understanding Row Oriented and Column Oriented Formats - RCFile?

7.1 Spark SQL - Introduction

7.2 Spark SQL - DataFrame Introduction

7.3 Transforming and Querying DataFrames

7.4 Saving DataFrames

7.5 DataFrames and RDDs

7.6 Comparing Spark SQL, Impala, and Hive-on-Spark

8.1 Machine Learning Introduction

8.2 Applications Of Machine Learning

8.3 MLlib Example: k-means

8.4 SparkR Example

Projects


1. Sentiment analysis of "Iron Man 3" movie using Hive and visualizing the sentiment data using BI tools such as Tableau


2. Process the NSE (National Stock Exchange) data using Hive for various insights


3. Analyze MovieLens data using Hive


4. Generate movie recommendations using Spark MLlib


5. Derive the importance of various Twitter handles using Spark GraphX


6. Analyze the logs of the NASA Kennedy Space Center WWW server using Spark to derive useful business and DevOps metrics


7. Write an end-to-end Spark application, from writing code on your local machine to deploying it to the cluster


8. Build a real-time analytics dashboard for an e-commerce company using Apache Spark, Kafka, Spark Streaming, Node.js, Socket.IO, and Highcharts

Certificate


Earn your certificate

Our specialization is exhaustive, and the certificate we award is proof that you have taken a big leap in the Big Data domain.


Differentiate yourself

The knowledge you have gained from working on projects, videos, quizzes, hands-on assessments and case studies gives you a competitive edge.


Share your achievement

Highlight your new skills on your resume, LinkedIn, Facebook and Twitter. Tell your friends and colleagues about it.

Course Creators
Sandeep Giri


Founder at CloudxLab; previously at Amazon, InMobi, D. E. Shaw
Course Developer
Abhinav Singh


Co-Founder at CloudxLab; previously at Byjus
Course Developer
 Jatin Shah


LinkedIn, Yahoo, Yale CS Ph.D.
IIT-B
Course Advisor

Reviews

40 reviews
(4.9 out of 5)
...

A must-have for practicing and perfecting Hadoop. To set it up on a PC you need a very high-end configuration, and even then it will only be a pseudo-node setup. For better understanding I recommend CloudxLab.

...

They are great. They take care of all the Big Data technologies (Hadoop, Spark, Hive, etc.) so you do not have to worry about installing and running them correctly on your PC. Plus, they have fantastic customer support. Even when I have had problems debugging my own programs, they have answered me with the correct solution in a few hours, and all of this for a more than reasonable price. I personally recommend it to everyone :)

...

I have been using CloudxLab for the last 3 months for learning Hadoop and Spark, and I can vouch for it.

It’s a platform where you can learn from the tutorial videos and then practice in the lab they provide on the cloud. The study materials are well planned and I would be lying if I said it's not great.
The video lectures explain the technical material in very simple ways, which makes it easier to grasp the concepts. Also, the customer service is great.
So, thumbs up for the team associated with CloudxLab.
To conclude, I would just say that if you are willing to learn Big Data related stuff, I strongly recommend CloudxLab.

...

I think I can give some points on this. I have been using CloudxLab for more than a year… my intention is continuous learning.
For students and professionals switching technologies:
For Big Data and Hadoop in general: (a) you can learn on your personal PC, but that needs a minimum configuration of 12 GB RAM with good processing speed, and jobs will still take longer to run because the machine acts as a single node; (b) if you try to install each and every component, it takes a hell of a lot of admin work, and if something goes wrong, you have to invest a lot of time in debugging.
The main advantages of using CloudxLab:
a) You get a 6-node production cluster with all components installed; with just a username and password, you can start working on it.
b) You have almost all the access.
c) Good amount of components installed.
d) You can play around with each of them with 5 GB of test data.
e) So far I haven't experienced any downtime.
f) You can practice in your college lab in your free time.
g) Good email support on technical matters.
h) They have a couple of test datasets; I use my own.
i) vi and nano editors are supported.
j) Some of the components I remember are HDFS, MapReduce2, YARN, Tez, ZooKeeper, Falcon, Storm, Kafka, Spark, Jupyter Notebook, Hive, HBase, Pig, Sqoop, Oozie, Flume, Accumulo, and Ambari.

...

I have been using CloudxLab for some time, and based on my experience I can say that they have done a fabulous job.

The first problem anyone faces while learning Big Data technologies is running the VMs on his/her laptop. VMs require a good amount of dedicated RAM, so most of the time we end up spending on a hardware upgrade. But even after an upgrade, the requirement of a cluster is never met. The examples we try always run on a single-node setup.

To try this on a production-like cluster setup we have something like AWS, but there is a good amount of cost involved in that. Also, they keep the credit card details with them, which I feel not everyone is comfortable sharing.

And this is where I feel CloudxLab is a better bet. Their pricing is very competitive compared to the other offerings, and it doesn’t require any specific hardware. Any desktop/laptop with any configuration that has an internet connection is good for getting started.

No need to do any setup. Their clusters are fully loaded with all the latest Big Data packages. You can access them from anywhere.

The only thing you need to concentrate on is your learning :)

Hope this helps anyone who is looking for an option beyond VMs.

FAQ

In self-paced learning, you will get:

  • Lifetime access to the self-paced course including videos, assessments, quizzes, and projects
  • Recordings of the previous batch of instructor-led training
  • 24x7 support using discussion forums

In instructor-led training, you will get:

  • Lifetime access to the self-paced course including videos, assessments, quizzes, and projects
  • Access to live instructor-led training as per your enrolled batch
  • Sessions with industry experts over online meeting tools like Zoom
  • 24x7 support using discussion forums

1. Basics of SQL. You should know the basics of SQL and databases. If you know about filters in SQL, you should be able to follow the course.

2. Basic programming knowledge. If you understand 'loops' in any programming language, and you can create a directory and view the contents of a file from the command line, you will be able to grasp the concepts of this course even if you have not touched programming for the last 10 years! In addition, we provide video classes on the basics of Python and Scala.

The instructors for this course are industry experts with years of experience in mentoring students across the world.

It will take 2-3 months with 6-8 hours of effort per week.

We understand that you might need the course material for a longer duration to make the most of your subscription. You will get lifetime access (for as long as the company is operational) to the course material so that you can refer to it anytime.

In online instructor-led training, Sandeep Giri, along with his team of experts, will train you together with a group of our course learners for 25+ hours over online conferencing software like Zoom. Classes are held every Saturday and Sunday.

We offer mentoring sessions with industry leaders and professionals so that you can get 1-on-1 help with any questions you may have, whether technical, job-related, or anything else.
It is a paid service available exclusively to learners enrolling in the course. We will provide more information on the subscription after the course is launched.

At the end of the course, you will work on a real-time project. You will receive a problem statement along with a dataset to work on in CloudxLab. Once you are done with the project (it will be reviewed by an expert), you will be awarded a certificate which you can share on LinkedIn.

Enrollment in the self-paced course includes 90 days of free access to CloudxLab. Enrollment in the instructor-led course includes 90 days of free access to CloudxLab, depending on the date of enrollment.

Yes. Java is generally required for understanding MapReduce. MapReduce is a programming paradigm in which you write your logic in the form of mapper and reducer functions. We provide a self-paced course on Java for free. As soon as you sign up, it will be available in your account section.
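To give a feel for the mapper/reducer paradigm mentioned above, here is a minimal sketch of a word-frequency count. It is written in plain Python and runs on a single machine, so it only imitates the map, shuffle, and reduce phases; it is not Hadoop code and not the course's own example. The names `mapper`, `reducer`, and `run_word_count` are hypothetical, chosen for illustration.

```python
from collections import defaultdict


def mapper(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reducer(word, counts):
    """Reduce step: combine all the counts emitted for a single word."""
    return word, sum(counts)


def run_word_count(lines):
    """Imitate map -> shuffle -> reduce locally (an assumption, not Hadoop's API)."""
    grouped = defaultdict(list)
    for line in lines:
        for word, count in mapper(line):
            # Shuffle: collect every value emitted under the same key.
            grouped[word].append(count)
    # Reduce each key once, visiting keys in sorted order.
    return dict(reducer(word, counts) for word, counts in sorted(grouped.items()))


word_counts = run_word_count(["big data with hadoop", "big data with spark"])
print(word_counts)
```

In a real cluster the framework performs the grouping step across machines, and you supply only the mapper and reducer logic.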

The course requires a good internet connection (1 Mbps or more) and a browser to watch videos and do hands-on work in the lab. We've configured all the tools in the lab so that you can focus on learning and practicing in a real-world cluster.

For the self-paced course, we provide a 100% fee refund if the request is raised within 7 days of the enrollment date. Thereafter, no refund is provided.

For the instructor-led course, we provide a 100% refund if no more than 1 live session has been conducted, and a 50% refund if 2-4 live sessions have been conducted. If 5 or more live sessions have been conducted, no refund is provided.

Yes, you can renew your subscription anytime. Please choose your desired plan for the lab and make the payment to renew your subscription.

Yes, you can upgrade from the self-paced course to the instructor-led course by paying the difference. Please contact us at reachus@cloudxlab.com to do so.

Have more questions? Please contact us at reachus@cloudxlab.com