Hi,
In this video, we will learn about Hive.
Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on top of Hadoop and makes data churning easy by providing SQL-like queries.
Developers find it challenging to break a problem down into the MapReduce paradigm. Also, writing MapReduce code takes a lot of time, and some operations, such as joins, are hard to write in MapReduce.
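To get a feel for why joins are tedious in MapReduce, here is a minimal sketch of a reduce-side join written in plain Python rather than on Hadoop. The `users` and `orders` data, the `"U"`/`"O"` tags, and all function names are made up for illustration; a real MapReduce job would also need job configuration, serialization, and input handling.

```python
from collections import defaultdict
from itertools import product

# Hypothetical sample data: (user_id, name) and (user_id, amount).
users = [(1, "alice"), (2, "bob")]
orders = [(1, 30), (1, 12), (2, 7)]

def map_phase(users, orders):
    """Emit (join_key, (source_tag, value)) pairs, as a mapper would."""
    for user_id, name in users:
        yield user_id, ("U", name)
    for user_id, amount in orders:
        yield user_id, ("O", amount)

def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """For each key, pair every user record with every order record."""
    for key in sorted(groups):
        names = [v for tag, v in groups[key] if tag == "U"]
        amounts = [v for tag, v in groups[key] if tag == "O"]
        for name, amount in product(names, amounts):
            yield key, name, amount

joined = list(reduce_phase(shuffle(map_phase(users, orders))))
print(joined)
```

In a SQL-like language, the same join over two hypothetical tables is a single statement along the lines of `SELECT u.name, o.amount FROM users u JOIN orders o ON u.user_id = o.user_id`.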
Most companies already have relational databases and SQL infrastructure in place.
If data from relational databases can be moved to Hadoop, companies can make use of their existing infrastructure and resources. Most data warehousing applications work with SQL-based query languages, and Hive makes it easy to port SQL-based applications to Hadoop.
Also, developers and business analysts are more familiar with SQL queries than with MapReduce or Pig. Hive's SQL-like query language helps end users churn data quickly.
Hive resides on top of Hadoop. The Hive driver takes a query written in HiveQL (Hive Query Language), compiles it into MapReduce jobs, optimizes it, and executes it.
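As a rough picture of what "compiling a query into MapReduce" means, the sketch below hand-translates one simple aggregation into a map step and a reduce step in plain Python. This is a conceptual illustration only, not the plan Hive actually generates; the `employees` rows, the query string, and the function names are all assumptions.

```python
from collections import defaultdict

# Hypothetical rows of an "employees" table: (name, dept).
employees = [("ann", "eng"), ("raj", "eng"), ("li", "ops")]

# The query being modeled, kept here only as a string for reference.
query = "SELECT dept, COUNT(*) FROM employees GROUP BY dept"

def mapper(row):
    """Map step: emit the GROUP BY key with a count of 1 per row."""
    name, dept = row
    yield dept, 1

def reducer(dept, ones):
    """Reduce step: sum the counts collected for one key."""
    return dept, sum(ones)

def run_job(rows):
    """Run map, group by key (the shuffle), then reduce each group."""
    grouped = defaultdict(list)
    for row in rows:
        for key, value in mapper(row):
            grouped[key].append(value)
    return [reducer(key, values) for key, values in sorted(grouped.items())]

result = run_job(employees)
print(result)
```

The point of the sketch is that the user only writes the one-line query; the map and reduce functions are what the engine has to produce and run on their behalf.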
Hive stores the metadata of each table, such as its schema and location, in the metastore, which is backed by a relational database like MySQL or PostgreSQL. With Pig, datasets used in a session are lost once we exit the session. Because the Hive metastore persists this metadata, tables and databases created in one session remain available in later sessions.
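The idea of a metastore can be sketched with a toy example. Here a file-backed SQLite database stands in for MySQL or PostgreSQL, and two separate connections stand in for two sessions. The `tbls` table, its columns, and the `/warehouse/employees` path are hypothetical; the real Hive metastore uses its own, much larger schema.

```python
import os
import sqlite3
import tempfile

# A file-backed SQLite database stands in for the relational metastore.
db_path = os.path.join(tempfile.mkdtemp(), "toy_metastore.db")

def register_table(name, col_spec, location):
    """First session: record a table's schema and location, then disconnect."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tbls "
        "(name TEXT PRIMARY KEY, col_spec TEXT, location TEXT)"
    )
    conn.execute("INSERT INTO tbls VALUES (?, ?, ?)", (name, col_spec, location))
    conn.commit()
    conn.close()

def lookup_table(name):
    """Later session: open a fresh connection and read the metadata back."""
    conn = sqlite3.connect(db_path)
    row = conn.execute(
        "SELECT col_spec, location FROM tbls WHERE name = ?", (name,)
    ).fetchone()
    conn.close()
    return row

register_table("employees", "name STRING, dept STRING", "/warehouse/employees")
found = lookup_table("employees")
print(found)
```

Because the metadata lives in the database file rather than in the session, the second connection can still find the table after the first one has closed.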
Users can interact with Hive using the CLI (command-line interface), HWI (Hive Web Interface), and over JDBC and ODBC through the Thrift server. Hue and Qubole provide good user interfaces for interacting with Hive.
Although earlier versions of Hive did not support row-level updates, recent versions do.
Hive is not suited for OLTP (Online Transaction Processing), because Hive queries have higher latency than queries in relational databases. This is because Hive queries are converted into MapReduce jobs, and MapReduce jobs have startup overhead from resource allocation via YARN and other factors.
Hive is best suited for churning large datasets. For small datasets or millisecond-latency requirements, Hive is not a good choice.
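A back-of-the-envelope model makes the trade-off concrete. The numbers below are assumptions chosen only for illustration, not measurements of any real cluster: a fixed job startup overhead of 20 seconds and a processing rate of one million rows per second.

```python
# Assumed, illustrative values; real clusters will differ.
STARTUP_OVERHEAD_S = 20.0        # fixed cost to launch a job
ROWS_PER_SECOND = 1_000_000      # processing rate once the job is running

def total_time_s(rows):
    """Total time = fixed startup overhead + time spent processing rows."""
    return STARTUP_OVERHEAD_S + rows / ROWS_PER_SECOND

def overhead_share(rows):
    """Fraction of the total time that is spent on startup overhead."""
    return STARTUP_OVERHEAD_S / total_time_s(rows)

small = 10_000               # a small lookup-sized dataset
large = 10_000_000_000       # a large analytical dataset

print(f"small: {total_time_s(small):.2f}s, overhead share {overhead_share(small):.2%}")
print(f"large: {total_time_s(large):.2f}s, overhead share {overhead_share(large):.2%}")
```

Under these assumptions, almost all of the time for the small dataset is startup overhead, while for the large dataset the overhead is negligible, which is why a fixed startup cost hurts small, latency-sensitive queries the most.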
4 Comments
Does Hive only work with structured data?
I ask because HQL supports several collection types, such as map and array.
Hive is a data warehousing tool that is built on top of Hadoop and is used for querying and managing large datasets stored in the Hadoop Distributed File System (HDFS). Hive supports a variety of data formats, including structured data in formats such as CSV and Avro, as well as semi-structured data in formats such as JSON and XML. It also supports collections such as maps and arrays, which can be queried using Hive's query language, HiveQL. In addition, Hive supports external tables, whose data can be located in other data stores, such as HBase or Amazon S3.
What is data churning?
According to Google, churn analysis is the process of using data to understand why your customers have stopped using your product or service.
Is it only the analysis of customers?
Hi Bhavesh,
Interesting question.
Data churning is the process of making sense of your data. It can apply to any kind of data.
What you found, "churn analysis", is one part of data churning, in which we make sense of data to find out why customers have stopped using a product or service.
Hope this helps.