What is NoSQL? Why NoSQL? When NoSQL?

What is NoSQL?

Unlike what it sounds like, NoSQL means “not only SQL”: the goal is not to reject SQL but rather to compensate for the technical limitations shared by the majority of relational database implementations.

NoSQL represents a whole new way of thinking about databases: a NoSQL database is not a relational database.

NoSQL is becoming prominent for the simple reason that the relational database model may not be the best solution for every situation.

The best way to think of a NoSQL database is as a distributed, non-relational database with a very loose structure, or no structure at all.

NoSQL databases are finding significant and growing industry use in big data analytics and real-time web applications.

Why NoSQL?

In 2000, Eric Brewer outlined the now-famous CAP (Consistency, Availability, Partition tolerance) theorem,

which states that consistency and high availability cannot both be maintained when a database is partitioned across a fallible wide-area network.

NoSQL databases emerged as one response to this trade-off: by relaxing strict consistency (often to eventual consistency), they preserve availability and partition tolerance while coping with the explosion of data.

Beyond these consistency, availability and partitioning trade-offs, other important advantages include:

  • Horizontal scalability
  • A more flexible data model
  • Performance advantages

When NoSQL?

A NoSQL database would typically be preferred in (but is not limited to) the following scenarios:

  • Real-time web applications
  • Unstructured/“schema-less” data – usually you don’t need to explicitly define your schema up front and can simply include new fields as they appear
  • Huge data volumes (terabytes and beyond)
  • When scalability is critical
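
In a document-oriented NoSQL store, for example, records in the same collection need not share the same fields: a new attribute can simply appear on newer documents without any schema migration. The field names below are purely illustrative:

```json
{ "id": 1, "name": "Asha", "email": "asha@example.com" }
{ "id": 2, "name": "Ravi", "phone": "+91-9000000000", "tags": ["premium"] }
```

Both documents live side by side; nothing forces the second record to carry an email field or the first to carry tags.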


PIG and Big Data – Processing Massive Data Volumes at High Speed

For most organizations, availability of data is not the challenge.  Rather, it’s handling, analyzing, and reporting on that data in a way that can be translated into effective decision-making.

PIG is an open source project intended to support ad-hoc analysis of very large data volumes. It allows us to process data collected from a myriad of sources such as relational databases, traditional data warehouses, unstructured internet data, machine-generated log data, and free-form text.

How does it process?

PIG is used to build complex jobs behind the scenes to spread the load across many servers and process massive quantities of data in an endlessly scalable parallel environment.

Unlike traditional BI tools, which report on structured data, PIG is a high-level data flow language that builds step-by-step procedures over raw data to derive valuable insights. It offers major advantages in efficiency and flexibility when accessing different kinds of data.

What does PIG do?

PIG opens up the power of MapReduce to the non-Java community. The complexity of writing Java programs can be avoided through a simple procedural-language abstraction over MapReduce that exposes a more Structured Query Language (SQL)-like interface for big data applications.

PIG provides common data processing operations for web search platforms, such as web log processing. PIG Latin is a language that follows a specific format: data is read from the file system, a number of operations are performed on the data (transforming it in one or more ways), and the resulting relation is written back to the file system.

PIG scripts can use functions that you define for tasks such as parsing input data or formatting output data, and even custom operators. UDFs (user-defined functions) are written in Java and permit PIG to support custom processing. UDFs are the way to extend PIG into your particular application domain.
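
The read-transform-store cycle described above can be sketched as a short PIG Latin script. The input path, field names, and the UDF jar here are hypothetical:

```pig
-- Hypothetical web-log example: load, transform, and store back to the file system.
REGISTER myudfs.jar;  -- assumed jar containing custom Java UDFs

logs    = LOAD '/data/weblogs' USING PigStorage('\t')
          AS (user:chararray, url:chararray, time:long);
valid   = FILTER logs BY url IS NOT NULL;
by_user = GROUP valid BY user;
hits    = FOREACH by_user GENERATE group AS user, COUNT(valid) AS page_views;
STORE hits INTO '/output/user_page_views';
```

Behind the scenes, PIG compiles these statements into one or more MapReduce jobs spread across the cluster; the script author never writes Java map or reduce code directly.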

PIG allows rapid prototyping of algorithms for processing petabytes of data. It effectively addresses data analysis challenges such as traffic log analysis and user consumption patterns to find things like best-selling products.

Common Use Cases:

PIG is mostly used for data pipelining, which includes bringing in data feeds, data cleansing, and data enhancement through transformations. A common example would be log files.

PIG is also used for iterative data processing to allow time-sensitive updates to a dataset. A common example is “Bulletin”, which involves a constant inflow of small pieces of new data that replace older feeds every few minutes.

Sailaja Bhagavatula specializes in SAP Business Objects and Hadoop for Bodhtree, a business analytics services company focused on helping customers get maximum value from their data. Bodhtree not only implements the tools to enable processing and analysis of massive volumes of data, we also help businesses ensure that the questions being asked target key factors for long-term growth.


Big Data Platform Options and Technologies

The following are the three primary architectures used to handle ‘Big Data’:

    1. Symmetric Multiprocessing (SMP) solutions
    2. Massively Parallel Processing (MPP) data warehousing appliances
    3. NoSQL platforms

SMP solutions are used as the basis of most Business Intelligence / Data Warehousing storage environments. These solutions use multiple processors that share a common operating system and memory. Due to capacity limitations of the operating system architecture, these solutions often have approximately 16-32 processors. While SMP platforms are traditionally seen as a solution for online transaction processing (OLTP) systems, the industry has recently seen demand for SMP-based data storage and business intelligence solutions that deal with large volumes of structured data. Increasing computer and software power, combined with architectures designed specifically to handle large data sets, has resulted in a large increase in the capacity of SMP platforms.

These Data Warehousing / Business Intelligence platforms often offer shorter implementation timelines, are less complex to implement and support, and carry a lower purchase price and total cost of ownership (TCO) than other enterprise-level data management solutions. They are ideal for data storage environments in the 5-50 terabyte range. Microsoft is a leader in this space with its SQL Server 2008 R2 Fast Track Data Warehouse platform, which combines the SQL Server database with standard hardware from manufacturers like HP and Dell in an architecture that increases performance and reduces costs relative to traditional clustering.

Massively Parallel Processing (MPP) platforms are built for structured data; these systems harness numerous processors working on different parts of an operation in a coordinated fashion. Since each processor has its own operating system and memory, MPP systems can grow horizontally to increase performance or capacity by simply adding more processors to the architecture. These solutions often contain 50-1000 processors. From a performance and cost perspective, the most important components of an MPP solution are the hardware configuration and the coordination and communication between the processors. MPP solutions range from pure data warehousing appliances, which offer both hardware and software in a single package, to software-focused solutions that provide the software with a choice of a few different hardware configurations.

The major platform beyond the SQL data management world today is NoSQL, meaning “Not Only SQL.” These architectures can provide higher performance at a lower cost, with linear scalability, the ability to use commodity hardware, and a flexible data retention scheme with no fixed data model and relaxed data consistency validation. They also provide a number of different database types based on the kind of data being retained. NoSQL solutions perform better in conditions with extremely high data volumes or a high proportion of unstructured content such as documents and media files.

Today the most popular NoSQL platform is Hadoop. Hadoop provides an end-to-end architecture for large volumes of data, including a distributed file system (HDFS), a distributed processing framework (MapReduce), and various database and data flow options including Sqoop, Hive, HBase, and Pig. Hadoop can be implemented as a pure open source platform or through one of several commercial distributions that accelerate deployment and add support in exchange for a license fee.

Sushanth Reddy is a Big Data solutions architect for Bodhtree, which specializes in analytics solutions, including data warehousing, business intelligence and Big Data.


What is Hive? Its Interaction with Hadoop and Big Data

Hive – A Warehousing Solution Over a MapReduce Framework

What is Hive?

Hive is a data warehousing infrastructure built on top of Apache Hadoop.

Hadoop provides massive scale-out and fault-tolerance capabilities for data storage and processing (using the MapReduce programming paradigm) on commodity hardware.

Hive enables easy data summarization, ad-hoc querying and analysis of large volumes of data.

It is best used for batch jobs over large sets of immutable data (like web logs).

It provides a simple query language called Hive QL, which is based on SQL and which enables users familiar with SQL to easily perform ad-hoc querying, summarization and data analysis.

At the same time, Hive QL also allows traditional MapReduce programmers to plug in their custom mappers and reducers to do more sophisticated analysis that may not be supported by the built-in capabilities of the language.
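
For example, Hive’s TRANSFORM clause lets a query stream rows through an external script in place of the built-in operators. The table, columns, and mapper script here are hypothetical:

```sql
-- Assumed table page_views(user_id, url); my_mapper.py is a hypothetical
-- custom mapper script that emits (user_id, category) pairs.
ADD FILE my_mapper.py;

SELECT TRANSFORM (user_id, url)
       USING 'python my_mapper.py'
       AS (user_id, category)
FROM page_views;
```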

Hive Query Language capabilities:

The Hive query language provides basic SQL-like operations. These operations work on tables or partitions.

  • Ability to create and manage tables and partitions (create, drop and alter).
  • Ability to support various Relational, Arithmetic and Logical Operators.
  • Ability to do various joins between two tables.
  • Ability to evaluate functions like aggregations on multiple “group by” columns in a table.
  • Ability to store the results of a query into another table.
  • Ability to download the contents of a table to a local directory.
  • Ability to create an external table that points to a specified location within HDFS.
  • Ability to store the results of a query in an HDFS directory.
  • Ability to plug in custom scripts using the language of choice for custom map/reduce jobs.
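
Several of these capabilities can be illustrated with a short sequence of Hive QL statements (the table and column names are hypothetical):

```sql
-- Create a managed, partitioned table and an external table over HDFS data.
CREATE TABLE page_views (user_id STRING, url STRING)
  PARTITIONED BY (dt STRING);

CREATE EXTERNAL TABLE raw_logs (line STRING)
  LOCATION '/data/raw_logs';

-- Aggregate with GROUP BY and store the result into another table.
CREATE TABLE daily_counts (dt STRING, views BIGINT);
INSERT OVERWRITE TABLE daily_counts
  SELECT dt, COUNT(*) FROM page_views GROUP BY dt;
```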

Major Components of Hive and its interaction with Hadoop:

Hive provides external interfaces like a command line interface (CLI) and a web UI, and application programming interfaces (APIs) like JDBC and ODBC.


The Hive Thrift Server exposes a very simple client API to execute HiveQL statements. Thrift is a framework for cross-language services, where a server written in one language (like Java) can also support clients in other languages.

The Metastore is the system catalog. All other components of Hive interact with the Metastore.

The Driver manages the life cycle of a HiveQL statement during compilation, optimization and execution.

The Compiler is invoked by the driver upon receiving a HiveQL statement. The compiler translates this statement into a plan which consists of a DAG of map/reduce jobs.

The driver submits the individual map/reduce jobs from the DAG to the Execution Engine in a topological order. Hive currently uses Hadoop as its execution engine.

What Hive is NOT

Hive is not designed for online transaction processing and does not offer real-time queries and row-level updates.

Hive aims to provide acceptable (but not optimal) latency for interactive data browsing, queries over small data sets or test queries.

Hive Applications:

  • Log processing
  • Text mining
  • Document indexing
  • Customer-facing business intelligence (e.g., Google Analytics)
  • Predictive modeling, hypothesis testing

Vijaya R Kolli is a Hadoop-Big Data developer with Bodhtree.  Bodhtree specializes in BI and Big Data solutions, providing analytics consulting, implementation, data cleansing, and maintenance services.
