Certification Course on
Big Data Engineering with Hadoop and Spark (Scala)

Learn HDFS, ZooKeeper, Hive, HBase, NoSQL, Oozie, Flume, Sqoop, Spark, Spark RDD, Spark Streaming, Kafka, SparkR, SparkSQL, MLlib, and GraphX From Industry Experts

Self-Paced Online
3 Months Course Duration
6+ Projects
90 Lab Days
CloudxLab Certificate

About the Course

As humans, we are immersed in data in our everyday lives. According to IBM, the world's data doubles every two years. The value that data holds can only be understood once we start identifying patterns and trends in it. Conventional computing approaches stop working when the data becomes very large.

There is massive growth in the big data space, and job opportunities are skyrocketing, making this the perfect time to launch your career in this space.

In this specialization, you will learn Hadoop and Spark to drive better business decisions and solve real-world problems.

Program Highlights

Certificate of Completion by CloudxLab
Work on 6+ projects to get hands-on experience
Timely Doubt Resolution
Best-in-Class Curriculum
Cloud Lab Access

Why CloudxLab?

CloudxLab is a team of developers, engineers, and educators passionate about building innovative products that make learning fun, engaging, and lifelong. We are a highly motivated team that builds fresh and lasting learning experiences for our users. Powered by our innovation processes, we provide a gamified environment where learning is fun and constructive. From creative design to intuitive apps, we create a seamless learning experience for our users. We upskill engineers in deep tech, making them employable and future-ready.

Programming Languages and Tools

Hands-on Learning

Gamified Learning Platform

Auto-assessment Tests

No Installation Required

Course Creators

Sandeep Giri

Founder at CloudxLab

Past: Amazon, InMobi, D.E.Shaw

Abhinav Singh

Co-Founder at CloudxLab

Past: Byjus

Jatin Shah

Yale CS, Ph.D. IIT-Bombay

Past: LinkedIn, Yahoo

Curriculum

60+ Hours of Online Training

240

90 Days of Lab Access

Projects

16K+ Learners

Big Data with Hadoop

1. Introduction
1.1 Big Data Introduction
1.2 Distributed systems
1.3 Big Data Use Cases
1.4 Various Solutions
1.5 Overview of Hadoop Ecosystem
1.6 Spark Ecosystem Walkthrough
1.7 Quiz

2. Foundation and Environment
2.1 Understanding the CloudxLab
2.2 Getting Started - Hands on
2.3 Hadoop and Spark Hands-on
2.4 Quiz and Assessment
2.5 Basics of Linux - Quick Hands-On
2.6 Understanding Regular Expressions
2.7 Quiz and Assessment
2.8 Setting up VM (optional)

3. ZooKeeper
3.1 ZooKeeper - Race Condition
3.2 ZooKeeper - Deadlock
3.3 Hands-On
3.4 Quiz and Assessment
3.5 How does election happen - Paxos Algorithm?
3.6 Use cases
3.7 When not to use
3.8 Quiz and Assessment

4. HDFS
4.1 Why HDFS or Why not existing file systems?
4.2 HDFS - NameNode & DataNodes
4.3 Quiz
4.4 Advanced HDFS Concepts (HA, Federation)
4.5 Quiz
4.6 Hands-on with HDFS (Upload, Download, SetRep)
4.7 Quiz and Assessment
4.8 Data Locality (Rack Awareness)

5. YARN
5.1 YARN - Why not existing tools?
5.2 YARN - Evolution from MapReduce 1.0
5.3 Resource Management: YARN Architecture
5.4 Advanced Concepts - Speculative Execution
5.5 Quiz

6. MapReduce Basics
6.1 MapReduce - Understanding Sorting
6.2 MapReduce - Overview
6.3 Quiz
6.4 Example 0 - Word Frequency Problem - Without MR
6.5 Example 1 - Only Mapper - Image Resizing
6.6 Example 2 - Word Frequency Problem
6.7 Example 3 - Temperature Problem
6.8 Example 4 - Multiple Reducer
6.9 Example 5 - Java MapReduce Walkthrough
6.10 Quiz

7. MapReduce Advanced
7.1 Writing MapReduce Code Using Java
7.2 Building MapReduce project using Apache Ant
7.3 Concept - Associative and Commutative
7.4 Quiz
7.5 Example 8 - Combiner
7.6 Example 9 - Hadoop Streaming
7.7 Example 10 - Adv. Problem Solving - Anagrams
7.8 Example 11 - Adv. Problem Solving - Same DNA
7.9 Example 12 - Adv. Problem Solving - Similar DNA
7.10 Example 13 - Joins - Voting
7.11 Limitations of MapReduce
7.12 Quiz

8. Analyzing Data with Pig
8.1 Pig - Introduction
8.2 Pig - Modes
8.3 Getting Started
8.4 Example - NYSE Stock Exchange
8.5 Concept - Lazy Evaluation

9. Processing Data with Hive
9.1 Hive - Introduction
9.2 Hive - Data Types
9.3 Getting Started
9.4 Loading Data in Hive (Tables)
9.5 Example: Movielens Data Processing
9.6 Advanced Concepts: Views
9.7 Connecting Tableau and HiveServer2
9.8 Connecting Microsoft Excel and HiveServer2
9.9 Project: Sentiment Analysis of Twitter Data
9.10 Advanced - Partition Tables
9.11 Understanding HCatalog and Impala
9.12 Quiz

10. NoSQL and HBase
10.1 NoSQL - Scaling Out / Up
10.2 NoSQL - ACID Properties and RDBMS Story
10.3 CAP Theorem
10.4 HBase Architecture - Region Servers, etc.
10.5 HBase Data Model - Column Family Orientedness
10.6 Getting Started - Create table, Adding Data
10.7 Adv Example - Google Links Storage
10.8 Concept - Bloom Filter
10.9 Comparison of NoSQL Databases
10.10 Quiz

11. Importing Data with Sqoop, Flume and Oozie
11.1 Sqoop - Introduction
11.2 Sqoop Import - MySQL to HDFS
11.3 Exporting to MySQL from HDFS
11.4 Concept - Unbounded Dataset Processing or Stream Processing
11.5 Flume Overview: Agents - Source, Sink, Channel
11.6 Example 1 - Data from Local network service into HDFS
11.7 Example 2 - Extracting Twitter Data
11.8 Quiz
11.9 Example 3 - Creating workflow with Oozie

Big Data with Spark

1. Introduction
1.1 Apache Spark ecosystem walkthrough
1.2 Spark Introduction - Why Spark?
1.3 Quiz

2. Scala Basics
2.1 Scala - Quick Introduction - Access Scala on CloudxLab
2.2 Scala - Quick Introduction - Variables and Methods
2.3 Getting Started: Interactive, Compilation, SBT
2.4 Types, Variables and Values
2.5 Functions
2.6 Collections
2.7 Classes
2.8 Parameters
2.9 More Features
2.10 Quiz and Assessment

3. Spark Basics
3.1 Apache Spark ecosystem walkthrough
3.2 Spark Introduction - Why Spark?
3.3 Using the Spark Shell on CloudxLab
3.4 Example 1 - Performing Word Count
3.5 Understanding Spark Cluster Modes on YARN
3.6 RDDs (Resilient Distributed Datasets)
3.7 General RDD Operations: Transformations and Actions
3.8 RDD lineage
3.9 RDD Persistence Overview
3.10 Distributed Persistence

4. Writing and Deploying Spark Applications
4.1 Creating the SparkContext
4.2 Building a Spark Application (Scala, Java, Python)
4.3 The Spark Application Web UI
4.4 Configuring Spark Properties
4.5 Running Spark on Cluster
4.6 RDD Partitions
4.7 Executing Parallel Operations
4.8 Stages and Tasks

5. Common Patterns in Spark Data Processing
5.1 Common Spark Use Cases
5.2 Example 1 - Data Cleaning (Movielens)
5.3 Example 2 - Understanding Spark Streaming
5.4 Understanding Kafka
5.5 Example 3 - Spark Streaming from Kafka
5.6 Iterative Algorithms in Spark
5.7 Project: Real-time analytics of orders in an e-commerce company

6. Data Formats and Management
6.1 InputFormat and InputSplit
6.2 JSON
6.3 XML
6.4 AVRO
6.5 How to store many small files - SequenceFile?
6.6 Parquet
6.7 Protocol Buffers
6.8 Comparing Compressions
6.9 Understanding Row Oriented and Column Oriented Formats - RCFile?

7. DataFrames and Spark SQL
7.1 Spark SQL - Introduction
7.2 Spark SQL - DataFrame Introduction
7.3 Transforming and Querying DataFrames
7.4 Saving DataFrames
7.5 DataFrames and RDDs
7.6 Comparing Spark SQL, Impala, and Hive-on-Spark

8. Machine Learning with Spark
8.1 Machine Learning Introduction
8.2 Applications Of Machine Learning
8.3 MLlib Example: k-means
8.4 SparkR Example

Projects

Project 1. Sentiment analysis
We will perform sentiment analysis of the movie "Iron Man 3" using Hive and visualize the sentiment data using BI tools such as Tableau.

Project 2. Process the New York Stock Exchange data
We will see how to process NYSE (New York Stock Exchange) data using Hive to derive various insights.

Project 3. MovieLens Project
We will analyze MovieLens data using Hive.

Project 4. Spark MLlib
We will learn to generate movie recommendations using Spark MLlib.

Project 5. Churn the logs
We will churn through the logs of the NASA Kennedy Space Center WWW server using Spark to extract useful business and DevOps metrics.

Project 6. Spark application
We will understand how to write an end-to-end Spark application, from writing code on your local machine to deploying it to the cluster.

Project 7. Analytics Dashboard
We will build a real-time analytics dashboard for an e-commerce company using Apache Spark, Kafka, Spark Streaming, Node.js, Socket.IO and Highcharts.

Prerequisites

You need a basic knowledge of programming. If you understand loops in any programming language, and can create a directory and view the contents of a file from the command line, you will be able to follow the concepts of this course even if you have not touched programming in the last 10 years! In addition, we provide video classes on the basics of Python and Scala.

Start Learning

29

0

Big Data with Hadoop and Spark (Scala) (lifetime course access) + 0 days Cloud Lab

58

29

Big Data with Hadoop and Spark (Scala) (lifetime course access) + 30 days Cloud Lab

118

59

Big Data with Hadoop and Spark (Scala) (lifetime course access) + 90 days Cloud Lab

or

Free Subscription

0 days Cloud Lab access
1 month access to all CloudxLab self-paced courses
Earn Industry-relevant Certificates
Placement Assistance
Cancel Anytime

CloudxLab Premium Subscription

34 /mo

17 /mo

180 days Cloud Lab access
6 months access to all CloudxLab self-paced courses
Earn Industry-relevant Certificates
Placement Assistance
Cancel Anytime

Get Access to ALL Courses with One Single Subscription.

Testimonials

“Joined the Hadoop class 5 weeks back and it has been a motivating experience. Last I coded was 20 years back, but thanks to the instructor-led training, I am executing Pig Latin and Hive commands to solve data problems and look forward to soon being able to complete small projects all by myself. Sandeep has been a great instructor, very very patient, always ready to put in extra time to clarify doubts and work at your pace and schedule.”

Savita Singh

“Big Data with Apache Spark: This is not just a series of videos with the one-way flow of information. Instead, it is a highly interactive course. The course is well structured, covering the concepts of Big Data in width and depth. I am currently half-way through the course and I am already working on translating the concepts learned in the class to real-world problems.”

Dr. Makhan Virdi, NASA

“Thank you so much Sandeep for all your great sessions. It will help in our career a lot. Your sessions are very explanatory and understandable. Kudos to you. Thanks for all your hard work and time. Definitely, we will recommend all our friends and colleagues to attend your different courses. Thanks a ton”

Hemanta Lenka

“I have been using CloudxLab for a while now, and they are amazing! The best part about using CloudxLab is that you do not need to wait for someone to tell you whether what you did was right or not, it is done automatically on the go. The training materials are of top notch quality. If you get stuck, they have a huge community of trainers and learners to help you out with all your doubts. They have a course structure for everyone, whether you are new to programming or are a seasoned programmer, they have something to offer you. And they are affordable too! I would recommend CloudxLab all the time.”

Rajtilak Bhattacharjee

“This course is suitable for everyone. Being a product manager, I had not done hands-on coding for quite some time. Python was completely new to me. However, Sandeep Giri gave us a crash course in Python and then introduced us to Machine Learning. Also, CloudxLab’s environment was very useful to just log in and start practising coding and playing with the things learnt. A good mix of theory and practical exercises, and specifically the sequence of starting straight away with a project and then going deeper, was a very good way of teaching. I would recommend this course to all.”

Kamal Upadhyay

“It has been a wonderful learning experience with CXL. This is one of the courses that will probably stay with me for a significant amount of time. The platform provides a unique opportunity to try hands-on simultaneously with the coursework in an almost real-life coding example. Besides, learning to use algebra, tech system and Git is a good refresher for anyone planning to start or stay in technology. The course covers the depth and breadth of ML topics. I specifically like the MNIST example and the depth to which it goes in explaining each and every line of code. Would definitely recommend the instructor-led course.”

Pratik Sonthalia

“This is one of the best-designed courses, very informative and well paced. The killer feature of the machine/deep learning courses from CloudxLab is the live session with access to labs for hands-on practice! With that, it becomes easy to follow any discourse, even if one misses the live sessions (read that as me!). Sandeep (course instructor) has loads of patience and his way of explaining things is just remarkable. I might have better comments to add here once I learn more! Great job, guys!”

Dhyan Prem

Senior Software Developer at Decision Resources Group

Related Courses

Post Graduate Certificate Program in AI and Machine Learning by IIT Roorkee

10 Months Online Live Program

4725 Ratings | 58790

Executive Certificate Program in Applied AI by IIT Roorkee

3 Months Online Program

1578 Ratings | 8746

Frequently Asked Questions

Is it an online course?

Yes, it is a self-paced online course. You will get access to videos, quizzes, hands-on assessments and projects. If you have any doubts during your learning journey, you can post them on the discussion forum, where our experts and community will assist you.

Do we have to pay separately for the lab?

No, lab access is included in the course price.

What are the prerequisites for this course?

Basics of SQL: You should know the basics of SQL and databases. If you know about filters in SQL, you should be able to follow the course.
Basics of programming: If you understand loops in any programming language, and can create a directory and view the contents of a file from the command line, you will be able to follow the concepts of this course even if you have not touched programming in the last 10 years! In addition, we provide video classes on the basics of Python and Scala.

What technologies can I practice on CloudxLab?

The tools and components available in the cluster include Hadoop, Spark, Kafka, Hive, Pig, HBase, Oozie, ZooKeeper, Flume, Sqoop, Mahout, R, Linux, Python, Scala, MongoDB, NumPy, SciPy, Pandas, Scikit-learn, and more. If you are looking for other tools, please contact us at reachus@cloudxlab.com.

What is your refund policy?

If you are unhappy with the product for any reason, let us know within 7 days of purchasing or upgrading your account, and we'll cancel your account and issue a full refund. Please contact us at reachus@cloudxlab.com to request a refund within the stipulated time. We will be sorry to see you go though!

Who will be the course instructors?

The instructors for this course are industry experts with years of experience in mentoring students across the world.

What is the validity of course material?

We understand that you might need the course material for a longer duration to make the most of your subscription. You will get lifetime access to the course material so that you can refer to it anytime.

Do I need to install any software before starting this course?

No, we will provide you with access to our online lab and BootML so that you do not have to install anything on your local machine.

What do I need to fulfill to get the CloudxLab certificate for the course?

You should complete 100% of the course along with all the given projects in order to be eligible for the certificate.

Kindly note that there is no deadline for CloudxLab courses.

Can I get a certificate for the projects completed?

We have created a set of Guided Projects on our platform. You may complete these guided projects and earn the certificate for free.
