Self-Paced, Online

Course Duration: 3 Months

Projects: 6+

Lab Days: 90

Certificate: CloudxLab

About the Course

We are immersed in data in our everyday lives. According to IBM, the world's data doubles every two years. The value of that data becomes clear only once we can identify the patterns and trends in it, and conventional computing approaches break down when data grows very large.

There is massive growth in the big data space, and job opportunities are skyrocketing, making this the perfect time to launch your career in this space.

In this specialization, you will learn Hadoop and PySpark to drive better business decisions and solve real-world problems.

Program Highlights

  • Certificate of Completion by CloudxLab

  • Work on 6+ projects to get hands-on experience

  • Timely Doubt Resolution

  • Best In Class Curriculum

  • Cloud Lab Access

Certificate

What is the certificate like?

  • Why CloudxLab?

    CloudxLab is a team of developers, engineers, and educators passionate about building innovative products that make learning fun, engaging, and lifelong. We are a highly motivated team that builds fresh and lasting learning experiences for our users. We provide a gamified environment where learning is fun and constructive, and from creative design to intuitive apps, we create a seamless learning experience. We upskill engineers in deep tech, making them employable and future-ready.

Hands-on Learning

  • Gamified Learning Platform


  • Auto-assessment Tests


  • No Installation Required

Course Creators

Sandeep Giri

Founder at CloudxLab

Past: Amazon, InMobi, D.E.Shaw

Abhinav Singh

Co-Founder at CloudxLab

Past: Byjus

Jatin Shah

Yale CS, Ph.D. IIT-Bombay

Past: LinkedIn, Yahoo

Curriculum

60+ Hours of Online Training
90 Days of Lab Access
6+ Projects
16K+ Learners

Big Data with PySpark (includes Hadoop, Spark and Python)

1. Introduction
1.1 Big Data Introduction
1.2 Distributed Systems
1.3 Big Data Use Cases
1.4 Various Solutions
1.5 Overview of Hadoop Ecosystem
1.6 Spark Ecosystem Walkthrough
1.7 Quiz
2. Foundation and Environment
2.1 Understanding the CloudxLab
2.2 Getting Started - Hands-on
2.3 Hadoop and Spark Hands-on
2.4 Quiz and Assessment
2.5 Basics of Linux - Quick Hands-on
2.6 Understanding Regular Expressions
2.7 Quiz and Assessment
2.8 Setting up VM (optional)
3. ZooKeeper
3.1 ZooKeeper - Race Condition
3.2 ZooKeeper - Deadlock
3.3 Hands-on
3.4 Quiz and Assessment
3.5 How does election happen? - Paxos Algorithm
3.6 Use cases
3.7 When not to use
3.8 Quiz and Assessment
4. HDFS
4.1 Why HDFS or Why not existing file systems?
4.2 HDFS - NameNode & DataNodes
4.3 Quiz
4.4 Advanced HDFS Concepts (HA, Federation)
4.5 Quiz
4.6 Hands-on with HDFS (Upload, Download, SetRep)
4.7 Quiz and Assessment
4.8 Data Locality (Rack Awareness)
5. YARN
5.1 YARN - Why not existing tools?
5.2 YARN - Evolution from MapReduce 1.0
5.3 Resource Management: YARN Architecture
5.4 Advanced Concepts - Speculative Execution
5.5 Quiz
6. MapReduce Basics
6.1 MapReduce - Understanding Sorting
6.2 MapReduce - Overview
6.3 Quiz
6.4 Example 0 - Word Frequency Problem - Without MR
6.5 Example 1 - Only Mapper - Image Resizing
6.6 Example 2 - Word Frequency Problem
6.7 Example 3 - Temperature Problem
6.8 Example 4 - Multiple Reducer
6.9 Example 5 - Java MapReduce Walkthrough
6.10 Quiz
7. MapReduce Advanced
7.1 Writing MapReduce Code Using Java
7.2 Building MapReduce project using Apache Ant
7.3 Concept - Associative and Commutative
7.4 Quiz
7.5 Example 8 - Combiner
7.6 Example 9 - Hadoop Streaming
7.7 Example 10 - Adv. Problem Solving - Anagrams
7.8 Example 11 - Adv. Problem Solving - Same DNA
7.9 Example 12 - Adv. Problem Solving - Similar DNA
7.10 Example 13 - Joins - Voting
7.11 Limitations of MapReduce
7.12 Quiz
8. Analyzing Data with Pig
8.1 Pig - Introduction
8.2 Pig - Modes
8.3 Getting Started
8.4 Example - NYSE Stock Exchange
8.5 Concept - Lazy Evaluation
9. Processing Data with Hive
9.1 Hive - Introduction
9.2 Hive - Data Types
9.3 Getting Started
9.4 Loading Data in Hive (Tables)
9.5 Example: Movielens Data Processing
9.6 Advanced Concepts: Views
9.7 Connecting Tableau and HiveServer 2
9.8 Connecting Microsoft Excel and HiveServer 2
9.9 Project: Sentiment Analysis of Twitter Data
9.10 Advanced - Partition Tables
9.11 Understanding HCatalog and Impala
9.12 Quiz
10. NoSQL and HBase
10.1 NoSQL - Scaling Out / Up
10.2 NoSQL - ACID Properties and RDBMS Story
10.3 CAP Theorem
10.4 HBase Architecture - Region Servers etc.
10.5 HBase Data Model - Column-Family Orientation
10.6 Getting Started - Create table, Adding Data
10.7 Adv Example - Google Links Storage
10.8 Concept - Bloom Filter
10.9 Comparison of NoSQL Databases
10.10 Quiz
11. Importing Data with Sqoop, Flume and Oozie
11.1 Sqoop - Introduction
11.2 Sqoop Import - MySQL to HDFS
11.3 Exporting to MySQL from HDFS
11.4 Concept - Unbounded Dataset Processing or Stream Processing
11.5 Flume Overview: Agents - Source, Sink, Channel
11.6 Example 1 - Data from Local network service into HDFS
11.7 Example 2 - Extracting Twitter Data
11.8 Quiz
11.9 Example 3 - Creating workflow with Oozie

Big Data with Spark

1. Introduction
1.1 Apache Spark ecosystem walkthrough
1.2 Spark Introduction - Why Spark?
1.3 Quiz
2. Scala Basics
2.1 Scala - Quick Introduction - Access Scala on CloudxLab
2.2 Scala - Quick Introduction - Variables and Methods
2.3 Getting Started: Interactive, Compilation, SBT
2.4 Types, Variables and Values
2.5 Functions
2.6 Collections
2.7 Classes
2.8 Parameters
2.9 More Features
2.10 Quiz and Assessment
3. Spark Basics
3.1 Apache Spark ecosystem walkthrough
3.2 Spark Introduction - Why Spark?
3.3 Using the Spark Shell on CloudxLab
3.4 Example 1 - Performing Word Count
3.5 Understanding Spark Cluster Modes on YARN
3.6 RDDs (Resilient Distributed Datasets)
3.7 General RDD Operations: Transformations and Actions
3.8 RDD lineage
3.9 RDD Persistence Overview
3.10 Distributed Persistence
4. Writing and Deploying Spark Applications
4.1 Creating the SparkContext
4.2 Building a Spark Application (Scala, Java, Python)
4.3 The Spark Application Web UI
4.4 Configuring Spark Properties
4.5 Running Spark on Cluster
4.6 RDD Partitions
4.7 Executing Parallel Operations
4.8 Stages and Tasks
5. Common Patterns in Spark Data Processing
5.1 Common Spark Use Cases
5.2 Example 1 - Data Cleaning (Movielens)
5.3 Example 2 - Understanding Spark Streaming
5.4 Understanding Kafka
5.5 Example 3 - Spark Streaming from Kafka
5.6 Iterative Algorithms in Spark
5.7 Project: Real-time analytics of orders in an e-commerce company
6. Data Formats and Management
6.1 InputFormat and InputSplit
6.2 JSON
6.3 XML
6.4 AVRO
6.5 How to store many small files - SequenceFile?
6.6 Parquet
6.7 Protocol Buffers
6.8 Comparing Compressions
6.9 Understanding Row Oriented and Column Oriented Formats - RCFile?
7. DataFrames and Spark SQL
7.1 Spark SQL - Introduction
7.2 Spark SQL - Dataframe Introduction
7.3 Transforming and Querying DataFrames
7.4 Saving DataFrames
7.5 DataFrames and RDDs
7.6 Comparing Spark SQL, Impala, and Hive-on-Spark
8. Machine Learning with Spark
8.1 Machine Learning Introduction
8.2 Applications Of Machine Learning
8.3 MLlib Example: k-means
8.4 SparkR Example

Subscription | CloudxLab

Big Data with PySpark (Incl. Hadoop, Spark and Python)
(lifetime course access) + 30 days Cloud Lab
Price: 29 (was 58)

Big Data with PySpark (Incl. Hadoop, Spark and Python)
(lifetime course access) + 90 days Cloud Lab
Price: 59 (was 118)

Big Data with PySpark (Incl. Hadoop, Spark and Python)
(lifetime course access) + 180 days Cloud Lab
Price: 79 (was 158)

or

Subscribe to CloudxLab Premium

17 /mo (was 34 /mo)

  • 180 days Cloud Lab access
  • 6 months of access to all CloudxLab self-paced courses
  • Earn Industry-relevant Certificates
  • Placement Assistance
  • Cancel Anytime
Explore CloudxLab Pro

Get Access to ALL Courses with One Single Subscription.

Frequently Asked Questions

How do you provide placement support?

We support learners with placement by:

  • Posting the latest jobs from our industry networks on our job portal
  • Conducting the PET (Placement Eligibility Test), a set of proctored tests on the skills the industry needs
  • Helping with resume building and mock interviews

What do I need to do to earn the CloudxLab certificate for this course?

To be eligible for the certificate, you must complete 100% of the course, including all the assigned projects.

Kindly note that there is no deadline for CloudxLab courses.

Are there any prerequisites for this course?

No, this course is for everyone. The complimentary access to CloudxLab courses will help you in learning the required foundations to make the most out of this certificate course.

What is the validity of course material?

We understand that you might need the course material for longer to make the most of your subscription. You get lifetime access to the course material, so you can refer to it anytime.

Do I need to install any software before starting this course?

No, we will give you access to our online lab and BootML, so you do not have to install anything on your local machine.

Will I get support?

Yes! Please feel free to ask your questions on the CloudxLab forum, where our community and team of experts will answer them. We believe the forum will add better perspectives, ideas, and solutions to your questions.

I have some more questions. Can I talk to someone?

Absolutely! Please contact us here. You can also reach our 24/7 support helpline anytime at +918049202224.