What is Hive? Its Interaction with Hadoop and Big Data

Hive – A Warehousing Solution Over a MapReduce Framework

What is Hive?

Hive is a data warehousing infrastructure built on top of Apache Hadoop.

Hadoop provides massive scale-out and fault-tolerance capabilities for data storage and processing (using the MapReduce programming paradigm) on commodity hardware.

Hive enables easy data summarization, ad-hoc querying and analysis of large volumes of data.

It is best used for batch jobs over large sets of immutable data (like web logs).

It provides a simple query language called HiveQL, based on SQL, which enables users familiar with SQL to easily perform ad-hoc querying, summarization and data analysis.

At the same time, HiveQL also allows traditional MapReduce programmers to plug in their custom mappers and reducers to do more sophisticated analysis that may not be supported by the built-in capabilities of the language.
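For instance, custom scripts are plugged in through HiveQL's TRANSFORM clause. The sketch below is a minimal, hedged example; the web_logs table, its columns, and the parse_ua.py script are hypothetical:

    -- Ship the custom script to the cluster nodes
    ADD FILE parse_ua.py;

    -- Stream (ip, user_agent) rows through the script, which emits (ip, browser)
    SELECT TRANSFORM (ip, user_agent)
    USING 'python parse_ua.py'
    AS (ip, browser)
    FROM web_logs;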

Hive Query Language capabilities:

The Hive query language provides basic SQL-like operations. These operations work on tables or partitions; several of them are sketched in the example after the list below.

  • Ability to create and manage tables and partitions (create, drop and alter).
  • Ability to support various Relational, Arithmetic and Logical Operators.
  • Ability to do various joins between two tables.
  • Ability to evaluate functions like aggregations on multiple “group by” columns in a table.
  • Ability to store the results of a query into another table.
  • Ability to download the contents of a table to a local directory.
  • Ability to create an external table that points to a specified location within HDFS.
  • Ability to store the results of a query in an HDFS directory.
  • Ability to plug in custom scripts using the language of choice for custom map/reduce jobs.
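A hedged sketch tying several of these capabilities together; the table names, columns, and HDFS paths are illustrative, not from the original article:

    -- Create and manage a table for immutable web logs, partitioned by date
    CREATE TABLE web_logs (ip STRING, url STRING, status INT)
    PARTITIONED BY (dt STRING);

    -- Create an external table over data already sitting in HDFS
    CREATE EXTERNAL TABLE users (ip STRING, country STRING)
    LOCATION '/data/users';

    -- Join the two tables, aggregate on a group-by column,
    -- and store the result of the query in another table
    CREATE TABLE hits_by_country AS
    SELECT u.country, COUNT(*) AS hits
    FROM web_logs l JOIN users u ON (l.ip = u.ip)
    GROUP BY u.country;

    -- Download the contents of a table to a local directory
    INSERT OVERWRITE LOCAL DIRECTORY '/tmp/hits_by_country'
    SELECT * FROM hits_by_country;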

Major Components of Hive and Their Interaction with Hadoop:

Hive provides external interfaces like the command line (CLI) and web UI, and application programming interfaces (APIs) like JDBC and ODBC.


The Hive Thrift Server exposes a very simple client API to execute HiveQL statements. Thrift is a framework for cross-language services, where a server written in one language (like Java) can also support clients in other languages.

The Metastore is the system catalog. All other components of Hive interact with the Metastore.

The Driver manages the life cycle of a HiveQL statement during compilation, optimization and execution.

The Compiler is invoked by the driver upon receiving a HiveQL statement. The compiler translates this statement into a plan which consists of a DAG of map/reduce jobs.

The driver submits the individual map/reduce jobs from the DAG to the Execution Engine in a topological order. Hive currently uses Hadoop as its execution engine.
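HiveQL's own EXPLAIN statement shows the plan the compiler produces. A small sketch, assuming the hypothetical web_logs table from the earlier example:

    -- Prints the stage dependencies and the map/reduce operator
    -- tree for each stage Hive will submit to Hadoop
    EXPLAIN
    SELECT dt, COUNT(*) AS requests
    FROM web_logs
    GROUP BY dt;

The output lists each stage, the dependencies between stages, and the operators that run on the map and reduce sides.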

What Hive is NOT

Hive is not designed for online transaction processing and does not offer real-time queries and row-level updates.

Hive aims to provide acceptable (but not optimal) latency for interactive data browsing, queries over small data sets or test queries.
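Because there are no row-level updates, changed data is typically folded in by rewriting a whole table or partition. A hedged sketch, reusing the hypothetical web_logs table and assuming a staged_logs source table:

    -- Overwrite one partition in bulk instead of updating individual rows
    INSERT OVERWRITE TABLE web_logs PARTITION (dt = '2013-01-01')
    SELECT ip, url, status
    FROM staged_logs
    WHERE dt = '2013-01-01';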

Hive Applications:

  • Log processing
  • Text mining
  • Document indexing
  • Customer-facing business intelligence (e.g., Google Analytics)
  • Predictive modeling, hypothesis testing

Vijaya R Kolli is a Hadoop-Big Data developer with Bodhtree.  Bodhtree specializes in BI and Big Data solutions, providing analytics consulting, implementation, data cleansing, and maintenance services.


What is Big Data? What is Hadoop? And Why Do They Matter to My Enterprise?

Big Data is when the size of the data itself becomes part of the problem. But there’s more to Big Data than merely being “big”.

The ‘Three Vs’ of Big Data:

Volume – Enterprises across all industries will need to find ways to handle the ever-increasing data volumes being created on a daily basis.

Velocity – Real-time decisions require real-time data.  Velocity refers to the speed with which data must be generated, captured, shared and responded to.

Variety – Big Data encompasses all data types, structured and unstructured, such as text, sensor data, audio, video, click streams and log files.  This broad-view analysis offers insights that siloed data cannot approach.

What is HADOOP?

Hadoop is the open source framework designed to address the Three Vs of Big Data. It enables applications to work with thousands of computationally independent computers processing petabytes of data. Hadoop was derived from Google’s MapReduce and Google File System papers.

Why HADOOP for BIG DATA

• HADOOP handles petabytes of data and most forms of unstructured data.

• The velocity challenge of big data can be addressed by integrating appropriate tools within the Hadoop ecosystem, such as Vertica, HANA, etc.

Advantages of HADOOP

1) Data and computation are distributed, and moving computation to the data prevents network overload.

2) Tasks are independent, therefore Hadoop:

– Can handle partial failure, i.e. entire nodes can fail and restart

– Avoids the crawling horrors of failure-tolerant synchronous distributed systems

– Uses speculative execution to work around stragglers

3) Linear scaling utilizes cheap, commodity hardware

4) Simple programming model: the end-user programmer only writes MapReduce tasks (see the word-count sketch after this list)

5) The Hadoop Distributed File System (HDFS) provides a simple and robust coherency model

6) Data is stored reliably, with blocks replicated across nodes

7) HDFS is scalable without compromising fast access to information.
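As a small illustration of how Hive (above) rides on this programming model, the canonical MapReduce example, word count, can be written as one HiveQL query that the compiler turns into map/reduce tasks; the docs table with a single line column is hypothetical:

    -- Word count: split each line into words, then count per word.
    -- The GROUP BY becomes the reduce phase of the generated job.
    SELECT word, COUNT(*) AS freq
    FROM (SELECT explode(split(line, ' ')) AS word FROM docs) w
    GROUP BY word;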

Traditional vs. HADOOP

[Comparison image: traditional data systems vs. Hadoop]

Phani K Reddy is a Big Data Architect with Bodhtree, a leader in Data Analytics, Business Intelligence, and Big Data services.  Bodhtree provides Hadoop implementation and maintenance services as an end-to-end service to solve specific business challenges.
