Hive - Introduction


In this video, we will learn Hive.

Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on top of Hadoop and makes data churning easy by providing SQL-like queries.

Developers often find it challenging to break a problem down into the MapReduce paradigm. Writing MapReduce code also takes a lot of time, and some operations, such as joins, are hard to write in MapReduce.
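To give a feel for why joins are awkward in MapReduce, here is a minimal sketch of a reduce-side join written in plain Python. It only imitates the map, shuffle, and reduce phases in memory; it is not Hadoop code, and the `users` and `orders` data and all function names are made up for illustration.

```python
from collections import defaultdict

# Hypothetical sample data: (user_id, name) and (user_id, amount).
users = [(1, "asha"), (2, "ben")]
orders = [(1, 250), (1, 100), (2, 75)]

def map_phase(users, orders):
    """Tag each record with its source table and emit (join_key, tagged_value)."""
    for user_id, name in users:
        yield user_id, ("U", name)
    for user_id, amount in orders:
        yield user_id, ("O", amount)

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """For each key, pair every user record with every order record."""
    for key in sorted(groups):
        names = [value for tag, value in groups[key] if tag == "U"]
        amounts = [value for tag, value in groups[key] if tag == "O"]
        for name in names:
            for amount in amounts:
                yield key, name, amount

joined = list(reduce_phase(shuffle(map_phase(users, orders))))
print(joined)
```

In a SQL-like language, the same result would come from a single statement along the lines of `SELECT u.user_id, u.name, o.amount FROM users u JOIN orders o ON u.user_id = o.user_id`, where the table and column names are again hypothetical.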

Most companies already have relational databases and SQL infrastructure in place.

If data from relational databases can be moved to Hadoop, companies can make use of their existing infrastructure and resources. Most data warehousing applications work with SQL-based query languages, and Hive makes it easy to port SQL-based applications to Hadoop.

Also, developers and business analysts are more familiar with SQL queries than with MapReduce or Pig. Hive's SQL-like query language helps end users churn data quickly.

Hive resides on top of Hadoop. The Hive driver takes a query written in HiveQL (Hive Query Language), compiles it into MapReduce jobs, optimizes it, and executes it.
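As a rough illustration of what "compiling a query into MapReduce" means, consider a query such as `SELECT page, COUNT(*) FROM page_views GROUP BY page`. The sketch below shows, in plain Python, the kind of map and reduce steps such a query corresponds to conceptually. The `page_views` table, its rows, and the function names are assumptions for illustration; this is not the plan Hive actually generates.

```python
from collections import defaultdict

# Hypothetical rows of a table page_views(page, user_id).
page_views = [("home", 1), ("cart", 2), ("home", 3), ("home", 2)]

def mapper(row):
    """Emit (group_key, 1) for each input row: the GROUP BY column is the key."""
    page, _user_id = row
    yield page, 1

def reducer(page, counts):
    """Sum the ones for a key: this is the COUNT(*) aggregate."""
    return page, sum(counts)

def run_job(rows):
    """Run map, then group by key (shuffle), then reduce, all in memory."""
    shuffled = defaultdict(list)
    for row in rows:
        for key, value in mapper(row):
            shuffled[key].append(value)
    return [reducer(key, values) for key, values in sorted(shuffled.items())]

result = run_job(page_views)
print(result)
```

The point of the sketch is that the user writes only the one-line query, while the engine takes care of producing and running the map and reduce steps.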

Hive stores the metadata of each table, such as its schema and location, in the metastore, which is backed by a relational database like MySQL or PostgreSQL. With Pig, datasets used in a session are lost once we exit the session. Because the metastore persists this metadata, tables and databases created in one session remain available across sessions.
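The idea of keeping table metadata in a relational database so that it survives across sessions can be sketched with a toy example. Here SQLite stands in for MySQL or PostgreSQL, and the `table_meta` layout, the function names, and the `/data/page_views` path are all hypothetical; the real Hive metastore schema is far richer.

```python
import os
import sqlite3
import tempfile

# A file-backed database, so the data outlives any single connection.
db_path = os.path.join(tempfile.mkdtemp(), "toy_metastore.db")

def register_table(db_path, name, columns_def, location):
    """'Session 1': record a table's schema and data location, then disconnect."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS table_meta "
        "(name TEXT PRIMARY KEY, columns_def TEXT, location TEXT)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO table_meta VALUES (?, ?, ?)",
        (name, columns_def, location),
    )
    conn.commit()
    conn.close()

def lookup_table(db_path, name):
    """'Session 2': open a fresh connection and read the metadata back."""
    conn = sqlite3.connect(db_path)
    row = conn.execute(
        "SELECT columns_def, location FROM table_meta WHERE name = ?", (name,)
    ).fetchone()
    conn.close()
    return row

register_table(db_path, "page_views", "page STRING, user_id INT", "/data/page_views")
found = lookup_table(db_path, "page_views")
print(found)
```

Because the second function opens its own connection, the lookup succeeds even though the connection that created the entry has already been closed, which mirrors how a table defined in one Hive session is visible in the next.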

Users can interact with Hive using the CLI (command-line interface), HWI (Hive web interface), and via JDBC and ODBC through the Thrift server. Hue and Qubole provide a good user interface for interacting with Hive.

Earlier versions of Hive did not support row-level updates, but recent versions do.

Hive is not suited for OLTP (Online Transaction Processing) because Hive queries have higher latency than queries in relational databases. This is because Hive queries are converted into MapReduce jobs, and MapReduce jobs have startup overhead due to resource allocation via YARN and other factors.

Hive is best suited for churning large datasets. For small datasets and millisecond latency requirements, Hive is not a good choice.
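A back-of-the-envelope calculation shows why a fixed startup cost matters so much for small queries and so little for large ones. The numbers below are assumptions chosen only for illustration, not measurements of any Hive cluster.

```python
def overhead_fraction(startup_s, work_s):
    """Fraction of total wall-clock time spent on fixed job startup."""
    return startup_s / (startup_s + work_s)

STARTUP_S = 20.0  # assumed fixed startup cost per job, in seconds (not measured)

# A tiny lookup that needs 50 ms of actual work.
small = overhead_fraction(STARTUP_S, work_s=0.05)
# A large scan that needs an hour of actual work.
large = overhead_fraction(STARTUP_S, work_s=3600.0)

print(f"small query: {small:.1%} of the time is startup overhead")
print(f"large query: {large:.1%} of the time is startup overhead")
```

Under these assumed numbers, the small query spends almost all of its time waiting for the job to start, while for the large scan the same startup cost is negligible.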
