Ever wondered what happens between Map and Reduce?

Shuffle and Sort – The input passed to every reducer is sorted by key. The process by which the system sorts the map outputs and transfers them to the reducers as inputs is known as the shuffle.
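To make the idea concrete, here is a minimal sketch in plain Python, not Hadoop code. The names `partition`, `shuffle_and_sort`, and `map_outputs` are made up for illustration, and the toy partitioner is an assumption chosen so the result is easy to follow. It routes each (key, value) pair to a reducer, then sorts and groups each reducer's input by key.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter

NUM_REDUCERS = 2  # assumed number of reduce tasks


def partition(key, num_reducers=NUM_REDUCERS):
    # Deterministic toy partitioner (assumption): sum of character codes.
    # Hadoop's default partitioner hashes the key instead.
    return sum(ord(c) for c in key) % num_reducers


def shuffle_and_sort(map_outputs):
    """Route each (key, value) pair to a reducer, then sort and group by key."""
    per_reducer = defaultdict(list)
    for pairs in map_outputs:
        for key, value in pairs:
            per_reducer[partition(key)].append((key, value))

    reducer_inputs = {}
    for reducer_id, pairs in per_reducer.items():
        pairs.sort(key=itemgetter(0))  # every reducer sees its keys in sorted order
        reducer_inputs[reducer_id] = [
            (key, [value for _, value in group])
            for key, group in groupby(pairs, key=itemgetter(0))
        ]
    return reducer_inputs


# Output of two hypothetical map tasks in a word count job.
map_outputs = [
    [("cat", 1), ("ant", 1), ("cat", 1)],
    [("bee", 1), ("ant", 1)],
]
reducer_inputs = shuffle_and_sort(map_outputs)
for reducer_id in sorted(reducer_inputs):
    print(reducer_id, reducer_inputs[reducer_id])
```

Each reducer ends up with every value for the keys assigned to it, with the keys in sorted order, which is exactly the guarantee described above.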

MAP side

The output produced by the mapper is not written directly to disk. It is first buffered in memory and pre-sorted to improve efficiency. It is often a good idea to compress the map output as it is written to disk, as doing so makes the write faster, saves disk space, and reduces the volume of data transferred to the reducer. By default the output is not compressed, but compression is easy to enable by setting ‘mapred.compress.map.output’ to ‘true’ (in Hadoop 2 and later the property is named ‘mapreduce.map.output.compress’).
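The effect of compressing map output can be illustrated with a small sketch in plain Python. This is not how Hadoop implements it: `serialize` and `write_map_output` are hypothetical helpers, the tab-separated format is a stand-in for Hadoop's real serialization, and `zlib` stands in for whichever codec a cluster is configured with. In a real job, the property described above is what turns compression on.

```python
import zlib


def serialize(pairs):
    # Tab-separated key/value lines; a stand-in for Hadoop's real serialization.
    return "".join(f"{key}\t{value}\n" for key, value in pairs).encode("utf-8")


def write_map_output(pairs, compress=False):
    """Sort pairs by key and return the bytes that would be written to disk."""
    data = serialize(sorted(pairs))
    return zlib.compress(data) if compress else data


# Hypothetical map output: 5000 pairs over 50 distinct keys.
pairs = [(f"word{i % 50}", 1) for i in range(5000)]

raw = write_map_output(pairs)
packed = write_map_output(pairs, compress=True)
print("uncompressed bytes:", len(raw))
print("compressed bytes:  ", len(packed))
```

Sorted map output tends to be repetitive, so it usually compresses well. The trade-off is the extra CPU time spent compressing on the map side and decompressing on the reduce side.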

Reduce side

The map output file resides on the local disk of the task tracker that ran the map task. It is now needed by the task tracker that is about to run the reduce task for the partition. Moreover, the reduce task needs the map output for its particular partition from several map tasks across the cluster. The map tasks may finish at different times, so the reduce task starts copying their outputs as soon as each one completes.
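The copy-then-merge behaviour on the reduce side can be sketched in plain Python as follows. This is a simplification under stated assumptions: `fetch_partition` and `reduce_side_merge` are hypothetical names, each map output is modelled as a dictionary from partition id to an already sorted list of pairs, and the merge runs once after all copies finish, whereas a real cluster copies over the network and can merge while copying is still in progress.

```python
import heapq
from operator import itemgetter


def fetch_partition(map_output, reducer_id):
    # Hypothetical stand-in for copying one partition from a map task's local disk.
    return map_output[reducer_id]


def reduce_side_merge(completed_maps, reducer_id):
    """Copy this reducer's partition from each finished map, then merge by key."""
    copied_runs = []
    for map_output in completed_maps:  # order of completion, not map id
        copied_runs.append(fetch_partition(map_output, reducer_id))
    # Each run is already sorted by key, so a k-way merge preserves key order.
    return list(heapq.merge(*copied_runs, key=itemgetter(0)))


# Each map output is {partition id: sorted (key, value) pairs}.
map_a = {0: [("bee", 1), ("cat", 2)], 1: [("ant", 1)]}
map_b = {0: [("cat", 1)], 1: [("ant", 3), ("dog", 1)]}
map_c = {0: [("bee", 4)], 1: []}

# Suppose the maps finish in the order b, c, a.
merged = reduce_side_merge([map_b, map_c, map_a], reducer_id=0)
print(merged)
```

Because every map task sorts its own output, the reducer never has to sort from scratch. It only merges runs that are already in key order, regardless of the order in which the map tasks finished.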

