Big Data Cloud Today! Experts discuss what today’s data is saying about tomorrow’s opportunities

Increased productivity, new innovations and smarter marketing are just a few advantages being realized by organizations that embrace big data.

Big Data Cloud Today!, an event held on June 7th in Mountain View, drew leaders from business and technology to discuss the next generation of Big Data use cases. Attendees learned emerging techniques, such as 3D data visualization, for distilling new insights from existing data. The event addressed the growth of big data, big data architectures, and the identification of new business opportunities.

As a participant in the event, I would like to share a few of the insights and key learnings that I felt offer the most value for Bodhtree customers and network. Milind Bhandarkar, Chief Scientist at Pivotal; Dr. Mayank Bawa, Co-President, R&D Labs, Teradata Aster; Jim Blomo, Engineering Manager, Data-Mining, Yelp; and Gail Ennis, CEO of Karmasphere, were a few of the experts who made this event truly impactful. Speakers presented first-hand experiences and lessons learned from Big Data early-adopter organizations.

Dr. Mayank Bawa (Co-President, R&D Labs, Teradata Aster) set the tone for the conference with an excellent keynote address. ‘Why is there such excitement around Big Data Analytics in the current environment?’ and ‘How are Big Data Services & Data Sciences Unique?’ were the two questions that framed his remarks.  His presentation marvelously answered them with real-life use cases in two broad categories:

• “All kinds of data in a single platform”
• “All kinds of Analytics in a single platform”

To underscore these points, he presented various applications of Big Data solutions such as ‘Market Basket Analysis’, ‘Telecom and Churn Analysis’, and ‘Predictive Modeling in the Insurance Domain.’ Some of the interesting takeaways, challenges and open questions are as follows:

– How will technology progress to a unified architecture from the current state?
– The focus of every company is on building a platform that brings silos of data together and facilitates seamless dataflow across systems
– Empowering data sciences and improving analytical algorithms
– Relational vs. NoSQL:
  – Is there a need to build a SQL layer over NoSQL?
  – Does it add any business value?
  – The vision of Oracle and Teradata for bringing relational and NoSQL together

How to make Big Money from Big Data? – Sourabh Satish, Security Technologies & Response, CTOs Office, Symantec

While Dr. Bawa presented the motivation to build a unified architecture with better analytics, Sourabh Satish, of Security Technologies and Response at Symantec, explored the advances offered by Big Data in the security domain. He demonstrated some of the security tools built at Symantec and illustrated how three fields (Big Data, Data Science and Domain Expertise) can be combined to build an application.

Hadoop: A Foundation for Change – Milind Bhandarkar, Chief Scientist, Pivotal

If I had to design a metric to identify the most relevant and valuable presentation at the conference, then without a doubt the gold standard would be set by Milind Bhandarkar, Chief Scientist of Pivotal. Mr. Bhandarkar talked about the evolution of analytics and big data, which he characterized in three distinct areas:

• Source Systems + ETL + EDW + Visualization
• Source Systems + Hadoop & M/R + EDW + Visualization
• Hadoop and ecosystem

He went on to explore several key issues in the Big Data field:

• BI vs. Big Data and the future
• Big Data’s journey from batch processing to interactive processing. Is interactive processing possible?
• Hadoop as a service?
• Applications as a service?
• Cloudera Impala bypassing MapReduce
• Myths about how large (“Big”) the volume of data used in an analytics query really is (at Yahoo, Microsoft)

Why Hadoop is the New Infrastructure for the CMO (they may not know it yet!) – Gail Ennis, CEO, Karmasphere

Gail Ennis talked about the business use cases driving demand for Big Data in today’s rapidly changing world, the journey of technology from BI to Big Data (predictive insights), and the potential of predictive analytics in marketing and product development.

Insights from Big Data: How-to? – Panel Discussion

Jim Blomo, Engineering Manager Data-Mining, Yelp
David P. Mariani, VP Engineering, Klout

The frank and energetic interaction between Professor Blomo and Mr. Mariani produced some of the most interesting discussion of the conference, including topics such as:

• How to identify whether a given problem is a BI analytics problem or a Big Data problem
• Is the existing BI framework still needed? How Big Data evolves into interactive BI
• How a company can form a data sciences group, and the panelists’ journeys in building their teams
• What qualities they looked for when selecting data scientists, given that the data scientist role is not well defined across the industry
• Evolving technologies in data sciences and Big Data (Hive vs. Cloudera Impala vs. Shark)
• Is ETL on the fly possible?
• Yelp and its work in data sciences
• Whether academic education or practical experience helps more in becoming a great data scientist

Rhaghav Karnam manages the Big Data Scientists group at Bodhtree, focusing on Big Data for customers in the High Tech, Manufacturing, Life Sciences/Pharmaceuticals and government sectors. Bodhtree enables business transformation for its customers and partners through Big Data and social-mining solutions that are laser-focused on measurable business objectives.

What is Hive? Its Interaction with Hadoop and Big Data

Hive – A Warehousing Solution Over a MapReduce Framework

What is Hive?

Hive is a data warehousing infrastructure built on top of Apache Hadoop.

Hadoop provides massive scale-out and fault-tolerance capabilities for data storage and processing (using the MapReduce programming paradigm) on commodity hardware.

Hive enables easy data summarization, ad-hoc querying and analysis of large volumes of data.

It is best used for batch jobs over large sets of immutable data (like web logs).

It provides a simple query language called HiveQL, which is based on SQL and enables users familiar with SQL to easily perform ad-hoc querying, summarization and data analysis.

At the same time, HiveQL also allows traditional MapReduce programmers to plug in their custom mappers and reducers to do more sophisticated analysis that may not be supported by the built-in capabilities of the language.

Hive Query Language capabilities:

The Hive query language provides basic SQL-like operations. These operations work on tables or partitions.

  • Ability to create and manage tables and partitions (create, drop and alter).
  • Ability to support various Relational, Arithmetic and Logical Operators.
  • Ability to do various joins between two tables.
  • Ability to evaluate functions like aggregations on multiple “group by” columns in a table.
  • Ability to store the results of a query into another table.
  • Ability to download the contents of a table to a local directory.
  • Ability to create an external table that points to a specified location within HDFS.
  • Ability to store the results of a query in an HDFS directory.
  • Ability to plug in custom scripts using the language of choice for custom map/reduce jobs.
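
As a rough illustration of the last capability in the list above, a custom script is usually a small program that reads rows from standard input and writes rows to standard output. The Python sketch below assumes tab-separated input rows with two columns, user_id and url; those column names, and the function names, are hypothetical and chosen only for this example.

```python
import sys
from urllib.parse import urlparse


def map_row(line):
    """Turn one tab-separated input row (user_id, url) into (user_id, host)."""
    user_id, url = line.rstrip("\n").split("\t")
    # Fall back to a placeholder when the value has no recognizable host.
    host = urlparse(url).netloc or "unknown"
    return f"{user_id}\t{host}"


def run(stdin, stdout):
    """Process rows one per line; emit one output row per non-empty input row."""
    for line in stdin:
        if line.strip():
            stdout.write(map_row(line) + "\n")


# When saved as a standalone script, the entry point would be:
#     run(sys.stdin, sys.stdout)
```

In HiveQL, a script like this is typically attached with a TRANSFORM clause, along the lines of SELECT TRANSFORM (user_id, url) USING 'python extract_host.py' AS (user_id, host) FROM page_views, where the script name and the table are again made up for illustration.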

Major Components of Hive and its interaction with Hadoop:

Hive provides external interfaces such as a command-line interface (CLI) and a web UI, and application programming interfaces (APIs) such as JDBC and ODBC.

The Hive Thrift Server exposes a very simple client API to execute HiveQL statements. Thrift is a framework for cross-language services, where a server written in one language (like Java) can also support clients in other languages.

The Metastore is the system catalog. All other components of Hive interact with the Metastore.

The Driver manages the life cycle of a HiveQL statement during compilation, optimization and execution.

The Compiler is invoked by the Driver upon receiving a HiveQL statement. It translates the statement into a plan consisting of a directed acyclic graph (DAG) of map/reduce jobs.
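
To make the idea of a query turning into map/reduce work more concrete, the sketch below imitates in plain Python how a query such as SELECT page, COUNT(*) FROM logs GROUP BY page could be evaluated as a single map/reduce job. It illustrates the programming model only and is not Hive's actual compiler output; the function names and sample rows are invented for this example.

```python
from collections import defaultdict


def map_phase(rows):
    """Emit a (key, 1) pair for every row; the key is the GROUP BY column."""
    for row in rows:
        yield row["page"], 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Aggregate each key's values; summing the 1s gives COUNT(*)."""
    return {key: sum(values) for key, values in grouped.items()}


def count_by_page(rows):
    """Chain the three phases for one GROUP BY ... COUNT(*) style query."""
    return reduce_phase(shuffle(map_phase(rows)))


# Hypothetical sample rows standing in for a web-log table.
logs = [
    {"page": "/home", "user": "a"},
    {"page": "/about", "user": "b"},
    {"page": "/home", "user": "c"},
]
counts = count_by_page(logs)
```

A more complex statement, for example a join followed by an aggregation, would need several such jobs, which is why the plan is a graph of jobs and not a single one.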

The Driver submits the individual map/reduce jobs from the DAG to the Execution Engine in topological order. Hive currently uses Hadoop as its execution engine.
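
Topological order simply means that a job is submitted only after every job it depends on has been submitted. The following is a minimal Python sketch of that ordering using Kahn's algorithm on a hypothetical plan. The job names and the dependency map are invented for illustration, and this is not Hive's driver code.

```python
from collections import deque


def topological_order(dependencies):
    """Return jobs ordered so that each job appears after all its dependencies.

    `dependencies` maps a job name to the list of jobs it depends on.
    Raises ValueError if the graph contains a cycle (i.e. is not a DAG).
    """
    # Collect every job, including ones that only appear as dependencies.
    jobs = set(dependencies)
    for deps in dependencies.values():
        jobs.update(deps)

    remaining = {job: set(dependencies.get(job, ())) for job in jobs}
    dependents = {job: [] for job in jobs}
    for job, deps in remaining.items():
        for dep in deps:
            dependents[dep].append(job)

    # Start with jobs that have no unmet dependencies (sorted for determinism).
    ready = deque(sorted(job for job, deps in remaining.items() if not deps))
    order = []
    while ready:
        job = ready.popleft()
        order.append(job)
        for child in sorted(dependents[job]):
            remaining[child].discard(job)
            if not remaining[child]:
                ready.append(child)

    if len(order) != len(jobs):
        raise ValueError("dependency graph contains a cycle")
    return order


# Hypothetical plan: the join needs both scans; the aggregate needs the join.
plan = {
    "join": ["scan_orders", "scan_users"],
    "aggregate": ["join"],
}
submission_order = topological_order(plan)
```

In this sketch the two scans come first, then the join, then the aggregate, so no job is submitted before its inputs exist.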

What Hive is NOT

Hive is not designed for online transaction processing and does not offer real-time queries or row-level updates.

Hive aims to provide acceptable (but not optimal) latency for interactive data browsing, queries over small data sets or test queries.

Hive Applications:

  • Log processing
  • Text mining
  • Document indexing
  • Customer-facing business intelligence (e.g., Google Analytics)
  • Predictive modeling, hypothesis testing

Vijaya R Kolli is a Hadoop-Big Data developer with Bodhtree.  Bodhtree specializes in BI and Big Data solutions, providing analytics consulting, implementation, data cleansing, and maintenance services.
