Ever wondered what happens between Map and Reduce?

Shuffle and Sort – The input to every reducer is sorted by key. The process by which the system sorts the map outputs and transfers them to the reducers is known as the Shuffle.
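The effect of shuffle and sort can be sketched with a small Python simulation (not Hadoop's actual Java API): map outputs are sorted by key and grouped so that each reducer call sees one key with all of its values.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical map outputs: (key, value) pairs emitted by several mappers
map_outputs = [("b", 2), ("a", 1), ("b", 3), ("a", 4)]

# Shuffle and sort: order the pairs by key, then group them so each
# reducer invocation receives one key and the list of all its values
shuffled = sorted(map_outputs, key=itemgetter(0))
reducer_input = [(k, [v for _, v in grp])
                 for k, grp in groupby(shuffled, key=itemgetter(0))]
# reducer_input == [("a", [1, 4]), ("b", [2, 3])]
```

In the real framework this sorting and grouping happens across machines, but the contract seen by the reducer is the same: keys arrive in sorted order, each with all of its values.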

MAP side

The output produced by the mapper is not written directly to disk; it is first buffered in memory and pre-processed to improve efficiency. It is often a good idea to compress the map output as it is written to disk, since doing so speeds up the write, saves disk space, and reduces the volume of data transferred to the reducer. By default the output is not compressed, but compression is easy to enable by setting ‘mapred.compress.map.output’ to ‘true’.
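As a sketch, the property named above can be set cluster-wide in mapred-site.xml (note that newer Hadoop releases renamed it to mapreduce.map.output.compress; the codec shown is one common choice, not the only one):

```xml
<!-- mapred-site.xml: enable compression of intermediate map output -->
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>
```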

Reduce side

The map output file resides on the local disk of the tasktracker that ran the map task, and it is needed by the tasktracker that is about to run the reduce task for that partition. A reduce task requires the map output for its particular partition from many map tasks across the cluster. The map tasks may finish at different times, so the reduce task starts copying their outputs as soon as each map task completes.
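How a map output key is routed to a particular partition (and hence to a particular reduce task) can be sketched in Python, assuming a hash partitioner in the spirit of Hadoop's default HashPartitioner (the hash below is a deterministic stand-in for Java's `hashCode()`, not the exact function):

```python
def partition(key: str, num_reduce_tasks: int) -> int:
    """Decide which reduce task receives a given map output key."""
    # Deterministic stand-in for Java's String.hashCode()
    h = 0
    for ch in key:
        h = (31 * h + ord(ch)) & 0x7FFFFFFF
    return h % num_reduce_tasks

# Every mapper computes the same partition for the same key, so all
# values for a key end up at the same reduce task.
assert partition("hadoop", 10) == partition("hadoop", 10)
```

Because the partition function is deterministic, each reduce task knows exactly which slice of every map task's output belongs to it, and fetches only that slice.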

Bodhtree is a leader in ‘PACE’ technology IT services: Product Engineering, Analytics, Cloud Computing, and Enterprise Services. Bodhtree empowers innovative business strategies through its mission to Educate, Implement, Align, and Secure transformational technology solutions.


What is Big Data? What is Hadoop? And Why Do They Matter to My Enterprise?

Big Data is when the size of the data itself becomes part of the problem. But there’s more to Big Data than merely being “big”.

The ‘Three Vs’ of Big Data:

Volume – Enterprises across all industries will need to find ways to handle the ever-increasing data volumes being created on a daily basis.

Velocity – Real-time decisions require real-time data.  Velocity refers to the speed with which data must be generated, captured, shared and responded to.

Variety – Big Data encompasses all data types – structured and unstructured – such as text, sensor data, audio, video, click streams, and log files. This broad-view analysis offers insights that siloed data cannot.

What is HADOOP?

Hadoop is an open-source framework designed to address the Three Vs of Big Data. It enables applications to work across thousands of computationally independent computers, processing petabytes of data. Hadoop was derived from Google’s MapReduce and Google File System papers.

Why HADOOP for BIG DATA

• HADOOP handles petabytes of data and most forms of unstructured data

• The velocity challenge of Big Data can be addressed by integrating appropriate tools within the Hadoop ecosystem, such as Vertica, HANA, etc.

Advantages of HADOOP

1) Data and computation are distributed, and moving computation to the data (rather than data to the computation) prevents network overload.

2) Tasks are independent, therefore –

– Can handle partial failure, i.e. entire nodes can fail and restart

– Avoids the crawling horrors of failure-tolerant synchronous distributed systems

– Speculative execution available to work around stragglers

– Linear scaling utilizes cheap, commodity hardware

3) Simple programming model: the end-user programmer only writes MapReduce tasks

4) The Hadoop Distributed File System (HDFS) provides a simple and robust coherency model

5) Data is replicated across nodes for reliability

6) HDFS is scalable without compromising fast access to information
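The "simple programming model" advantage can be illustrated with the classic word count, sketched here in Python rather than Hadoop's actual Java API: the programmer writes only the map and reduce functions, and the framework handles distribution, shuffling, and grouping.

```python
from itertools import groupby
from operator import itemgetter

def map_fn(line):
    # Map: emit (word, 1) for every word in the input line
    for word in line.split():
        yield (word, 1)

def reduce_fn(key, values):
    # Reduce: sum all counts for one word
    return (key, sum(values))

# The framework's job, simulated locally: run maps, shuffle/sort, run reduces
lines = ["big data big hadoop", "hadoop big"]
pairs = sorted((kv for line in lines for kv in map_fn(line)),
               key=itemgetter(0))
counts = [reduce_fn(k, [v for _, v in grp])
          for k, grp in groupby(pairs, key=itemgetter(0))]
# counts == [("big", 3), ("data", 1), ("hadoop", 2)]
```

Everything between `map_fn` and `reduce_fn` is what Hadoop provides: in a real job, the shuffle, sort, and grouping happen transparently across the cluster.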

Traditional vs. HADOOP
[Comparison chart: traditional systems vs. Hadoop]

Phani K Reddy
is a Big Data Architect with Bodhtree, a leader in Data Analytics, Business Intelligence, and Big Data services. Bodhtree provides end-to-end Hadoop implementation and maintenance services to solve specific business challenges.
