PIG and Big Data – Processing Massive Data Volumes at High Speed

For most organizations, availability of data is not the challenge.  Rather, it’s handling, analyzing, and reporting on that data in a way that can be translated into effective decision-making.

PIG is an open source project intended to support ad-hoc analysis of very large data volumes. It allows us to process data collected from a myriad of sources such as relational databases, traditional data warehouses, unstructured internet data, machine-generated log data, and free-form text.

How does it process?

Behind the scenes, PIG compiles a script into a series of MapReduce jobs that spread the load across many servers, so massive quantities of data are processed in a parallel environment that scales out as servers are added.

Unlike traditional BI tools that are used to report on structured data, PIG is a high-level data flow language for writing step-by-step procedures that run on raw data to derive valuable insights. It offers major advantages in efficiency and in the flexibility to access different kinds of data.

What does PIG do?

PIG opens up the power of MapReduce to the non-Java community. It avoids the complexity of writing Java programs by layering a simple procedural language on top of MapReduce, exposing an interface for big data applications that feels closer to Structured Query Language (SQL).

PIG provides the common data processing operations needed by web search platforms, such as web log processing. PIG Latin is a language that follows a specific format: data is read from the file system, a number of operations are performed on the data (transforming it in one or more ways), and then the resulting relation is written back to the file system.
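To make the read, transform, write pattern concrete, here is a minimal sketch in plain Python rather than PIG Latin. The function names, the tab-separated (user, url, bytes) layout, and the sample rows are assumptions made for illustration; the comments note which PIG Latin steps (LOAD, FILTER, GROUP, FOREACH, STORE) each function loosely mirrors.

```python
import csv
import io
from collections import defaultdict

def load(source):
    """Read tab-separated (user, url, bytes) records, loosely like a LOAD step."""
    reader = csv.reader(source, delimiter="\t")
    return [(user, url, int(size)) for user, url, size in reader]

def transform(records, min_bytes=0):
    """Keep rows of at least min_bytes, then sum bytes per user
    (loosely like FILTER, GROUP and FOREACH)."""
    totals = defaultdict(int)
    for user, _url, size in records:
        if size >= min_bytes:
            totals[user] += size
    return sorted(totals.items())

def store(relation, sink):
    """Write the resulting relation back out, loosely like a STORE step."""
    writer = csv.writer(sink, delimiter="\t", lineterminator="\n")
    writer.writerows(relation)

# In-memory stand-ins for files on the file system (hypothetical data).
raw = io.StringIO("alice\t/home\t120\nbob\t/cart\t300\nalice\t/cart\t80\n")
out = io.StringIO()
store(transform(load(raw), min_bytes=100), out)
print(out.getvalue())
```

In PIG the same flow would run as distributed jobs over files in the cluster; the sketch only shows the shape of the script, not how PIG executes it.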

PIG scripts can call functions that you define yourself for tasks such as parsing input data, formatting output data, or even implementing custom operators. These UDFs (user-defined functions) are typically written in Java and let PIG support custom processing. UDFs are the way to extend PIG into your particular application domain.
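As an illustration of the kind of logic a parsing UDF might hold, here is a small sketch. It is plain Python rather than Java, chosen only to keep the examples in one language, and it does not use PIG's UDF API. The log layout, the pattern, and the name `parse_log_line` are assumptions for this example.

```python
import re

# Assumed log layout (hypothetical):
#   IP - - [timestamp] "METHOD path PROTOCOL" status bytes
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+|-)$'
)

def parse_log_line(line):
    """Return (ip, path, status, size) for one log line, or None if malformed."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    # A "-" in the size field means no body was sent; treat it as zero bytes.
    size = 0 if match.group("size") == "-" else int(match.group("size"))
    return (match.group("ip"), match.group("path"),
            int(match.group("status")), size)

sample = '10.0.0.1 - - [12/Mar/2013:10:15:32 +0000] "GET /products/42 HTTP/1.1" 200 5120'
print(parse_log_line(sample))  # ('10.0.0.1', '/products/42', 200, 5120)
```

Returning None for malformed lines lets the calling script filter bad records out instead of failing the whole job.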

PIG allows rapid prototyping of algorithms for processing petabytes of data. It effectively addresses data analysis challenges such as traffic log analysis and user consumption patterns to find things like best-selling products.
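A best-seller analysis boils down to grouping purchases by product, summing quantities, and keeping the top few. The sketch below shows that logic on a tiny in-memory list; the `top_products` name and the sample purchases are made up for illustration, and a real PIG job would run the same grouping across the cluster.

```python
from collections import Counter

def top_products(purchases, n=3):
    """Sum quantity per product and return the n best sellers."""
    totals = Counter()
    for product, quantity in purchases:
        totals[product] += quantity
    return totals.most_common(n)

# Hypothetical (product, quantity) records.
purchases = [("laptop", 2), ("phone", 5), ("tablet", 1), ("phone", 3), ("laptop", 4)]
print(top_products(purchases, n=2))  # [('phone', 8), ('laptop', 6)]
```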

Common Use Cases:

PIG is mostly used for data pipelining, which includes bringing in a data feed, cleansing the data, and enhancing it through transformations. A common example is processing log files.
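The three pipeline stages can be sketched as chained Python generators. Everything here is an assumption for illustration: the comma-separated (user, country, amount) feed, the stage names, and the country-to-region lookup used as the enhancement.

```python
def ingest(lines):
    """Bring in the raw feed: split each comma-separated line into fields."""
    for line in lines:
        yield [field.strip() for field in line.split(",")]

def cleanse(rows):
    """Drop rows with the wrong field count or a non-numeric amount."""
    for row in rows:
        if len(row) != 3:
            continue
        user, country, amount = row
        try:
            value = float(amount)
        except ValueError:
            continue
        yield user.lower(), country.upper(), value

def enhance(rows, regions):
    """Add a derived column: the sales region looked up from the country code."""
    for user, country, amount in rows:
        yield user, country, amount, regions.get(country, "UNKNOWN")

# Hypothetical feed containing two bad records.
feed = ["Alice, de, 100", "bob,US,abc", "broken line", "Carol, fr, 50.5"]
regions = {"DE": "EMEA", "FR": "EMEA"}
result = list(enhance(cleanse(ingest(feed)), regions))
print(result)
```

Because each stage consumes the previous one lazily, records flow through one at a time, which is the same data flow idea a PIG script expresses at cluster scale.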

PIG is used for iterative data processing to allow time sensitive updates to a dataset. A common example is “Bulletin”, which involves constant inflow of small pieces of new data to replace the older feeds every few minutes.
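One way to picture this kind of update is as a merge in which each small batch replaces older entries with the same key. The sketch below is only an assumption about how such a feed could be modeled: the `apply_updates` name, the (timestamp, payload) tuples, and the sample items are all hypothetical.

```python
def apply_updates(current, updates):
    """Merge a small batch into the dataset, keeping the newest item per key.

    Both arguments map key -> (timestamp, payload).
    """
    merged = dict(current)
    for key, (timestamp, payload) in updates.items():
        # Replace an existing entry only if the incoming item is newer.
        if key not in merged or timestamp > merged[key][0]:
            merged[key] = (timestamp, payload)
    return merged

dataset = {"weather": (100, "sunny"), "traffic": (100, "clear")}
batch = {
    "weather": (160, "rain"),          # newer: replaces the old feed
    "traffic": (90, "stale report"),   # older: ignored
    "sports": (150, "final score"),    # new key: added
}
dataset = apply_updates(dataset, batch)
print(dataset)
```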

Sailaja Bhagavatula specializes in SAP Business Objects and Hadoop for Bodhtree, a business analytics services company focused on helping customers get maximum value from their data. Bodhtree not only implements the tools that enable processing and analysis of massive volumes of data, we also help businesses ensure that the questions being asked target the key factors for long-term growth.


What is Big Data? What is Hadoop? And Why Do They Matter to My Enterprise?

Big Data is when the size of the data itself becomes part of the problem. But there’s more to Big Data than merely being “big”.

The ‘Three Vs’ of Big Data:

Volume – Enterprises across all industries will need to find ways to handle the ever-increasing data volumes being created on a daily basis.

Velocity – Real-time decisions require real-time data.  Velocity refers to the speed with which data must be generated, captured, shared and responded to.

Variety – Big Data encompasses all data types, structured and unstructured, such as text, sensor data, audio, video, click streams and log files. Analyzing them together offers insights that siloed data cannot provide.

What is HADOOP?

Hadoop is an open source framework designed to address the Three Vs of Big Data. It enables applications to work with thousands of computationally independent computers processing petabytes of data. Hadoop's design was derived from Google's MapReduce and Google File System.


• HADOOP handles petabytes of data and most forms of unstructured data

• The velocity challenge of Big Data can be addressed by integrating appropriate tools, such as Vertica or HANA, with the Hadoop ecosystem.

Advantages of HADOOP

1) Data and computation are distributed, and running computation locally, where the data is stored, prevents network overload.

2) Tasks are independent, therefore –

– Can handle partial failure, i.e. entire nodes can fail and restart

– Avoids the crawling horrors of failure-tolerant synchronous distributed systems

– Speculative execution available to work around stragglers

– Linear scaling utilizes cheap, commodity hardware

3) Simple programming model. The end-user programmer only writes MapReduce tasks.

4) Hadoop Distributed File System (HDFS) provides a simple and robust coherency model.

5) Data is stored reliably.

6) HDFS is scalable without compromising fast access to information.
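To illustrate the simple programming model mentioned in the list above, here is a single-process word count sketch. The programmer's part is the two small functions `map_task` and `reduce_task`; `run_job` is a hypothetical stand-in for what the framework does (running the maps, grouping values by key, running the reduces). It is not Hadoop's API, and none of the distribution, failure handling, or speculative execution is modeled.

```python
from collections import defaultdict

def map_task(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1

def reduce_task(word, counts):
    """Reduce: combine all values emitted for one key."""
    return word, sum(counts)

def run_job(lines):
    """Stand-in for the framework: run maps, shuffle by key, run reduces."""
    groups = defaultdict(list)
    for line in lines:                      # each line could go to a different node
        for key, value in map_task(line):
            groups[key].append(value)       # shuffle: collect values per key
    return dict(reduce_task(key, values) for key, values in groups.items())

print(run_job(["big data big insight", "data at scale"]))
```

Because each map call looks at one line and each reduce call at one key, the tasks are independent of each other, which is what lets a real cluster rerun a failed task on another node.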

Traditional vs. HADOOP

[Figure: Hadoop Big Data]

Phani K Reddy is a Big Data Architect with Bodhtree, a leader in Data Analytics, Business Intelligence, and Big Data services. Bodhtree provides Hadoop implementation and maintenance services as an end-to-end service to solve specific business challenges.
