Understanding Big Data Stack - Apache Hadoop and Spark

3 / 4

Overview of Apache Hadoop ecosystem




Not able to play video? Try with youtube

The Apache Hadoop is a suite of components. Let us take a look at each of these components briefly. We will cover the details in depth during the full course.

HDFS

HDFS or Hadoop Distributed File System is the most important component because the entire eco-system depends upon it. It is based on Google File System.

It is basically a file system which runs on many computers to provide a humongous storage. If you want to store your petabytes of data in the form of files, you can use HDFS.

YARN or yet another resource negotiator keeps track of all the resources (CPU, Memory) of machines in the network and run the applications. Any application which wants to run in distributed fashion would interact of YARN.

HBase

HBase provides humongous storage in the form of a database table. So, to manage humongous records, you would like to use HBase.

HBase is a kind NoSQL Datastore.

MapReduce

MapReduce is a framework for distributed computing. It utilizes YARN to execute programs and has a very good sorting engine.

You write your programs in two parts Map and reduce. The map part transforms the raw data into key-value and reduce part groups and combines data based on the key. We will learn MapReduce in details later.

Spark

Spark is another computational framework similar to MapReduce but faster and more recent. It uses similar constructs as MapReduce to solve big data problems.

Spark has its own huge stack. We will cover in details soon.

Hive

Writing code in MapReduce is very time-consuming. So, Apache Hive makes it possible to write your logic in SQL which internally converts it into MapReduce. So, you can process humongous structured or semi-structured data with simple SQL using Hive.

Pig (Latin)

Pig Latin is a simplified SQL like language to express your ETL needs in stepwise fashion. Pig is the engine that translates Pig Latin into Map Reduce and executes it on Hadoop.

Mahout

Mahout is a library of machine learning algorithms that run in a distributed fashion. Since machine learning algorithms are complex and time-consuming, mahout breaks down work such that it gets executed on MapReduce running on many machines.

ZooKeeper

Apache Zookeeper is an independent component which is used by various distributed frameworks such as HDFS, HBase, Kafka, YARN. It is used for the coordination between various components. It provides a distributed configuration service, synchronization service, and naming registry for large distributed systems.

Flume

Flume makes it possible to continuously pump the unstructured data from many sources to a central source such as HDFS.

If you have many machines continuously generating data such as Webserver Logs, you can use flume to aggregate data at a central place such as HDFS.

SQOOP

Sqoop is used to transport data between Hadoop and SQL Databases. Sqoop utilizes MapReduce to efficiently transport data using many machines in a network.

Oozie

Since a project might involve many components, there is a need of a workflow engine to execute work in sequence.

For example, a typical project might involve importing data from SQL Server, running some Hive Queries, doing predictions with Mahout, Saving data back to an SQL Server.

This kind of workflow can be easily accomplished with Oozie.

User Interaction

A user can either talk to the various components of Hadoop using Command Line Interface, Web interface, API or using Oozie.

We will cover each of these components in details later.


Loading comments...