Hadoop – How to manage huge numbers of small files in HDFS

A file is considered a small file when its size is less than the HDFS block size (64 MB by default in Hadoop 1.x; 128 MB in Hadoop 2.x and later).

The Hadoop Distributed File System (HDFS) is designed to store and process large (terabytes) data sets. However, storing a large number of small files in HDFS is inefficient.

Issue: Files and blocks are name objects in HDFS and they occupy namespace. The namespace capacity of the system is naturally limited by the physical memory in the NameNode.

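Every file needs at least one file object and one block object, no matter how little data it holds. So when large numbers of small files are stored in the system, their metadata occupies a large portion of the namespace.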
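To make the namespace cost concrete, here is a rough back-of-the-envelope sketch. It assumes about 150 bytes of NameNode memory per file or block object, which is a commonly quoted rule of thumb and not a figure from this post; the 64 MB block size and the data volumes are likewise illustrative assumptions.

```python
import math

# Assumption: a widely quoted rule of thumb, not a measured or guaranteed value.
BYTES_PER_OBJECT = 150


def namenode_bytes(total_mb, file_mb, block_mb=64):
    """Estimate NameNode memory for total_mb of data stored as files of file_mb each."""
    num_files = total_mb // file_mb
    blocks_per_file = math.ceil(file_mb / block_mb)
    # One file object plus one object per block, for every file.
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT


TOTAL_MB = 10_240_000  # roughly 10 TB of data in both cases

small = namenode_bytes(TOTAL_MB, file_mb=1)     # 1 MB files
large = namenode_bytes(TOTAL_MB, file_mb=1024)  # 1 GB files

print(f"1 MB files: {small / 1e6:.1f} MB of NameNode memory")
print(f"1 GB files: {large / 1e6:.1f} MB of NameNode memory")
```

Under these assumptions the same volume of data costs over a hundred times more NameNode memory when it is stored as 1 MB files than as 1 GB files.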

Two techniques to deal with small files in Hadoop

1) HAR files

Technique – HDFS stores small files inefficiently. Hadoop Archives (HAR files) were introduced to HDFS in 0.18.0 to relieve the pressure that large numbers of small files put on NameNode memory. A HAR packs a number of small files into a few large files, so that the original files can still be accessed in parallel, transparently (without expanding the archive) and efficiently. HAR files work by building a layered file system on top of HDFS.

Price – Reading through files in a HAR is no more efficient than reading through files in HDFS, and in fact may be slower since each HAR file access requires two index file reads as well as the data file read. At the current time HARs are probably best used purely for archival purposes.
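As a minimal sketch of how an archive is typically created, the helper below builds the argument list for the `hadoop archive` command and the `har://` URI used to read the result. The archive name and all paths are hypothetical, and the helper only assembles the command; running it assumes a configured `hadoop` CLI on the PATH.

```python
def har_command(archive_name, parent, sources, dest):
    """Build the argv list for: hadoop archive -archiveName NAME -p PARENT SRC... DEST"""
    if not archive_name.endswith(".har"):
        raise ValueError("archive name must end with .har")
    return ["hadoop", "archive", "-archiveName", archive_name,
            "-p", parent, *sources, dest]


def har_uri(dest, archive_name, inner_path=""):
    """Build the har:// URI for a path inside the archive (default file system)."""
    return f"har://{dest.rstrip('/')}/{archive_name}/{inner_path}".rstrip("/")


# Hypothetical paths, for illustration only.
cmd = har_command("logs.har", "/user/hadoop", ["dir1", "dir2"], "/user/archives")
print(" ".join(cmd))
print(har_uri("/user/archives", "logs.har", "dir1"))

# To actually run it on a cluster (not executed here):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

The archived files can then be listed with `hadoop fs -ls` on the `har://` URI, which is where the extra index reads mentioned above come in.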

2) SequenceFiles

Technique – The idea behind SequenceFiles is that the filename is used as the key and the file contents as the value. You can write a single program to put a number of small files into a single SequenceFile, and then process them in a streaming fashion with MapReduce. The advantage of SequenceFiles is that they are splittable, so MapReduce can break them into chunks and operate on each chunk independently. SequenceFiles also support compression.
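The filename-as-key idea can be illustrated with a toy container, sketched below. This is not the Hadoop SequenceFile format: it is a simplified length-prefixed layout invented for illustration, with no sync markers (which are what make real SequenceFiles splittable) and no compression. In a real pipeline you would use Hadoop's own `SequenceFile` writer and reader classes instead.

```python
import io
import struct


def write_records(stream, files):
    """Append (filename, contents) pairs as length-prefixed key/value records."""
    for name, data in files:
        key = name.encode("utf-8")
        # Header: key length and value length as two big-endian unsigned ints.
        stream.write(struct.pack(">II", len(key), len(data)))
        stream.write(key)
        stream.write(data)


def read_records(stream):
    """Yield (filename, contents) pairs one at a time, without loading everything."""
    while True:
        header = stream.read(8)
        if not header:
            return  # clean end of container
        if len(header) < 8:
            raise ValueError("truncated record header")
        key_len, value_len = struct.unpack(">II", header)
        key = stream.read(key_len).decode("utf-8")
        value = stream.read(value_len)
        yield key, value


# Hypothetical small files, packed into one in-memory container.
small_files = [("a.log", b"first"), ("b.log", b"second"), ("c.log", b"")]
container = io.BytesIO()
write_records(container, small_files)

container.seek(0)
for name, data in read_records(container):
    print(name, len(data))
```

Many small files become one large file, so the NameNode tracks a single file and its blocks, while a reader still sees each original file as its own record.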

[Figure: SequenceFile layout]

Price – It can be slow to convert existing data into SequenceFiles. It is best to design your data pipeline to write the data at the source directly into a SequenceFile, if possible, rather than writing to small files as an intermediate step.

Krishnaji is a Big Data developer with Bodhtree, which provides Data Analytics and Big Data solutions to companies seeking an advantage in understanding customers, improving sales, and reacting faster to market trends.
