Big Data Platform Options and Technologies

The following are the three primary architectures used to handle ‘Big Data’:

    1. Symmetric Multiprocessing Solutions (SMP)
    2. Massively Parallel Processing (MPP) data warehousing appliances
    3. NoSQL platforms

SMP solutions are the basis of most business intelligence and data warehousing environments. These solutions use multiple processors that share a common operating system and memory. Because of capacity limits in the operating system architecture, they typically have approximately 16-32 processors. While SMP has traditionally been seen as a platform for online transaction processing (OLTP) systems, the industry has recently seen demand for SMP solutions for data warehousing and business intelligence workloads that deal with large volumes of structured data. More powerful hardware and software, combined with architectures designed specifically to handle large data sets, have resulted in a large increase in the throughput capacity of SMP platforms.

These data warehousing and business intelligence platforms often offer shorter delivery times, are less complex to implement and support, and carry a lower purchase price and total cost of ownership (TCO) than other enterprise-level data management solutions. They are well suited to data warehouse environments in the 5-50 terabyte range. Microsoft is a leader in this space with the Microsoft SQL Server 2008 R2 Fast Track Data Warehouse platform. This platform combines the SQL Server database with standard hardware from manufacturers such as HP and Dell in an architecture that increases performance and reduces costs through traditional clustering.

Massively Parallel Processing (MPP) platforms are built for structured data. These systems harness numerous processors working on different parts of an operation in a coordinated fashion. Since each processor has its own operating system and memory, MPP systems can grow horizontally to increase performance or capacity by simply adding more processors to the architecture. These solutions often contain 50 to 1,000 processors. From a performance and cost perspective, the most important components of an MPP solution are the hardware configuration and the coordination and communication between the processors. MPP solutions range from pure data warehousing appliances, which offer both hardware and software in a single package, to appliance-focused solutions that provide software with the option of a few different hardware configurations.
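The idea of processors working on different parts of one operation can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular MPP product works: the function names (`partition_rows`, `local_sum`, `merge`, `mpp_sum_by_key`) are hypothetical, the "nodes" are plain lists in one process, and rows are assumed to be `(integer key, amount)` pairs.

```python
from collections import defaultdict


def partition_rows(rows, node_count):
    """Assign each (key, amount) row to one node using key modulo node_count."""
    nodes = [[] for _ in range(node_count)]
    for key, amount in rows:
        nodes[key % node_count].append((key, amount))
    return nodes


def local_sum(node_rows):
    """Work done independently on one node: total the amounts per key."""
    totals = defaultdict(int)
    for key, amount in node_rows:
        totals[key] += amount
    return dict(totals)


def merge(partials):
    """Coordinator step: combine the per-node partial totals."""
    merged = defaultdict(int)
    for partial in partials:
        for key, subtotal in partial.items():
            merged[key] += subtotal
    return dict(merged)


def mpp_sum_by_key(rows, node_count):
    """Partition the rows, aggregate on each simulated node, then merge."""
    return merge(local_sum(node) for node in partition_rows(rows, node_count))
```

Calling `mpp_sum_by_key(rows, 2)` and `mpp_sum_by_key(rows, 4)` on the same rows gives the same totals. Only the amount of work per node changes, which is the sense in which adding processors adds capacity.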

The major platform outside the SQL data management world today is NoSQL, meaning “Not Only SQL.” These architectures can provide higher performance at a lower cost, with linear scalability, the ability to use commodity hardware, and a flexible storage scheme with no fixed data model and relaxed data consistency validation. They also offer a number of different database types, each suited to the type of data being stored. NoSQL solutions perform better where data volumes are extremely high or where there is a high proportion of unstructured data such as documents and media files.
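To make "no fixed data model" concrete, here is a minimal in-memory sketch of a document-style store in Python. It is an assumption-laden toy, not the API of any real NoSQL product: the class name `DocumentStore` and its `put` and `find` methods are hypothetical, and nothing is persisted or distributed.

```python
class DocumentStore:
    """Toy schema-less store: each document is a dict with its own fields."""

    def __init__(self):
        self._docs = {}

    def put(self, doc_id, doc):
        """Store a copy of the document; no schema is checked or enforced."""
        self._docs[doc_id] = dict(doc)

    def find(self, field, value):
        """Return the sorted ids of documents whose `field` equals `value`.

        Documents that lack the field are skipped rather than rejected.
        """
        return sorted(
            doc_id
            for doc_id, doc in self._docs.items()
            if field in doc and doc[field] == value
        )


store = DocumentStore()
store.put("a", {"type": "invoice", "total": 40})
store.put("b", {"type": "photo", "path": "img/001.jpg"})
store.put("c", {"type": "invoice", "customer": "Acme"})
invoices = store.find("type", "invoice")
```

The three documents share one store even though their fields differ. A relational table would need a fixed set of columns up front.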

Today the most popular NoSQL platform is Hadoop. Hadoop provides an end-to-end architecture for large volumes of data, including a distributed file system (HDFS), a distributed processing framework (MapReduce), and a range of database and data flow tools including Sqoop, Hive, HBase, and Pig. Hadoop can be deployed as an open source platform or through one of several commercial distributions that can accelerate deployment and add support in exchange for a license fee.
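The MapReduce model itself is easy to illustrate with the classic word count. The sketch below runs in a single Python process and does not use Hadoop or its API. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and a real cluster would run the map and reduce steps on many machines over data stored in HDFS.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

Because each line is mapped independently and each key is reduced independently, both steps can be spread across machines, which is what the framework takes care of.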

Sushanth Reddy is a Big Data solutions architect for Bodhtree, which specializes in analytics solutions, including data warehousing, business intelligence and Big Data.
