Hive – A Warehousing Solution Over a MapReduce Framework
What is Hive?
Hive is a data warehousing infrastructure built on top of Apache Hadoop.
Hadoop provides massive scale-out and fault-tolerance capabilities for data storage and processing (using the MapReduce programming paradigm) on commodity hardware.
Hive enables easy data summarization, ad-hoc querying and analysis of large volumes of data.
It is best used for batch jobs over large sets of immutable data (like web logs).
It provides a simple query language called HiveQL, which is based on SQL and enables users familiar with SQL to easily perform ad-hoc querying, summarization and data analysis.
At the same time, HiveQL also allows traditional MapReduce programmers to plug in their custom mappers and reducers to perform more sophisticated analysis that may not be supported by the built-in capabilities of the language.
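As an illustration, an ad-hoc HiveQL query reads almost exactly like SQL; the table and column names below are hypothetical:

```sql
-- Hypothetical table of web-server logs: page_views(user_id, url, ts, country).
-- A standard SQL-style aggregation -- Hive compiles this into one or more
-- MapReduce jobs rather than executing it against a relational engine.
SELECT country, COUNT(*) AS views
FROM page_views
WHERE to_date(ts) = '2013-01-01'
GROUP BY country;
```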
Hive Query Language capabilities:
The Hive query language provides basic SQL-like operations. These operations work on tables or partitions.
- Ability to create and manage tables and partitions (create, drop and alter).
- Ability to support various Relational, Arithmetic and Logical Operators.
- Ability to do various joins between two tables.
- Ability to evaluate functions like aggregations on multiple “group by” columns in a table.
- Ability to store the results of a query into another table.
- Ability to download the contents of a table to a local directory.
- Ability to create an external table that points to a specified location within HDFS.
- Ability to store the results of a query in an HDFS directory.
- Ability to plug in custom scripts using the language of choice for custom map/reduce jobs.
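A short HiveQL session sketching several of the capabilities above (the table names, HDFS paths and transform script are all hypothetical, and the target tables are assumed to exist):

```sql
-- Create and manage a partitioned table (create/drop/alter).
CREATE TABLE page_views (user_id BIGINT, url STRING, ts STRING)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Create an external table pointing at an existing HDFS location.
CREATE EXTERNAL TABLE raw_logs (line STRING)
LOCATION '/data/raw_logs';

-- Store the results of a query into another table.
INSERT OVERWRITE TABLE daily_counts
SELECT url, COUNT(*)
FROM page_views
WHERE dt = '2013-01-01'
GROUP BY url;

-- Download the contents of a table to a local directory.
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/daily_counts'
SELECT * FROM daily_counts;

-- Plug in a custom script (here, a hypothetical Python mapper) via TRANSFORM.
ADD FILE parse_url.py;
SELECT TRANSFORM (url) USING 'python parse_url.py' AS (domain)
FROM page_views;
```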
Major Components of Hive and its interaction with Hadoop:
Hive provides external interfaces like a command-line interface (CLI) and a web UI, as well as application programming interfaces (APIs) like JDBC and ODBC.
The Hive Thrift Server exposes a very simple client API to execute HiveQL statements. Thrift is a framework for cross-language services, where a server written in one language (like Java) can also support clients in other languages.
The Metastore is the system catalog. All other components of Hive interact with the Metastore.
The Driver manages the life cycle of a HiveQL statement during compilation, optimization and execution.
The Compiler is invoked by the driver upon receiving a HiveQL statement. The compiler translates this statement into a plan which consists of a DAG of map/reduce jobs.
The driver submits the individual map/reduce jobs from the DAG to the Execution Engine in a topological order. Hive currently uses Hadoop as its execution engine.
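The plan the compiler produces can be inspected directly: prefixing a query with EXPLAIN prints the DAG of stages instead of running the query (the table name below is hypothetical):

```sql
-- Show the stage DAG for a query instead of executing it.
-- The output lists stage dependencies (e.g., which stages are root stages)
-- followed by the map and reduce operator trees for each MapReduce job.
EXPLAIN
SELECT url, COUNT(*)
FROM page_views
GROUP BY url;
```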
What Hive is NOT
Hive is not designed for online transaction processing and does not offer real-time queries and row-level updates.
Hive aims to provide acceptable (but not optimal) latency for interactive data browsing, queries over small data sets, or test queries. Typical applications where Hive is a good fit include:
- Log processing
- Text mining
- Document indexing
- Customer-facing business intelligence (e.g., Google Analytics)
- Predictive modeling, hypothesis testing
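Log processing, the first use case above, is a natural fit for the batch model: raw logs land in HDFS, an external table is layered over them, and summaries run as batch queries. A minimal sketch, with hypothetical paths and a hypothetical log format:

```sql
-- Point an external table at raw, immutable logs already in HDFS.
CREATE EXTERNAL TABLE access_logs (ip STRING, ts STRING, request STRING, status INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/logs/access';

-- Run a batch summary over the full data set.
SELECT status, COUNT(*) AS hits
FROM access_logs
GROUP BY status;
```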
Vijaya R Kolli is a Hadoop-Big Data developer with Bodhtree. Bodhtree specializes in BI and Big Data solutions, providing analytics consulting, implementation, data cleansing, and maintenance services.