Why data of all sizes and complexities including “Big Data” should be “Happy”

My blog is centered on the theme of making a conscious decision to view data as if it were “alive,” with all the complexities and mysteries of a human being. By taking this approach I hope to provide a journey, and a platform to spark a conversation, about how this perspective can begin to change how we act toward data and how our decisions around data might then change. If I accept this premise, then in my personal opinion the goal is that, at the end of the day, the data I personally interact with or am responsible for will be “Happy Data”.

When I studied biopsychology (a combination of psychology and neuroscience) at UCSD, we would often look at how biological processes interact with emotions and cognition. As I was earning my degree, the common debate of nature vs. nurture was frequently highlighted in this branch of psychology. Over the years I have come to believe that the distinction between nature and nurture is very important, but so is the relationship between the two. By nature you carry the traits that might define you, but you are nurtured by your interactions into the adult you become. Those interactions can start with your family environment (data in your organization), your extended family (loosely related data), and peer experience (how data interacts with other data), and extend to socio-economic influences (will you make different decisions around your data if you are economically sound?). So if my goal is to make sure that at the end of the day my data is happy, what can I do to make sure this happens? What should I consider in the DNA of my data? What should I consider to provide a positive environment as my data matures in my ecosystem?

Nurture: In the beginning there was “Little Data”

I have a strong passion for analytics, so many of my examples going forward will probably gravitate toward that subject. (I will try to vary them in future postings.) Let’s look at a first example, in which an organization has decided to launch an existing product line in a different channel. Say this organization has traditionally sold the product only direct to consumer, over the web and print (catalog) channels. Now it wants a physical presence where it can have a more intimate relationship with the customer, and it has begun rolling out its products via kiosks in malls. The mix of this new distribution channel is anticipated to grow to 15% over a 12-month period, so the organization is being cautious not to tarnish the brand, yet cognizant that there is an opportunity cost if it does not move fast enough. The folks in marketing and product development may have decided it was more important to launch the product quickly than to establish a proper process for capturing the entire 360-degree view of customer touch points. In this example, the organization rolled out the product without considering how transactions with the customer at the kiosk might differ from those on the web. Thus, it is treating the data with a limited view.

So the data is small and young at the beginning of this process. If the data were alive like a human being, would you wait until it grows, or would you engage with it at an earlier stage of the process? It is best to think about it, listen to it, analyze it, interpret it, treat it, nurture it, and protect it (we will talk about security in detail in later blogs) at a stage where it is not as complex and its size is manageable. You also have a stronger chance to nurture it along the way, and can influence its outcome, by beginning a relationship with it early on. You are able to change some of the environment and process once you understand the future importance of this data.

I will get into more detail on examples across different stages of data maturity and complexity in future postings; I did not go too deep here because I wanted to introduce the subject first. I am excited to discuss what you should consider if the data in your organization is unstructured and rebellious, and how you would then need to act around it. Also, if you have enterprise data that has been there a long time, what are different ways to deal with historical and older data? Regardless, your data should be happy, and you should consider how to get there. Can you share an example from your organization where taking this approach would have changed the outcome? Did I miss something, or an angle I should have considered? Thank you, and please share your comments.

Kain A Sosa is VP of Analytics at Bodhtree, with expertise in various big data technologies such as Hadoop and BigQuery, and a passionate leader in Data Analytics, Business Intelligence, and Big Data services.


Why are so many customers failing in their Big Data initiatives?

I strongly believe that companies with a successful Big Data strategy have an information-centric culture in which all employees are fully aware of the possibilities of well-analyzed and well-visualized information. Better data visualization can help you make better decisions.

As a matter of fact, Gartner’s top predictions for 2012 and beyond included this prediction about Big Data: “Through 2015, more than 85 percent of Fortune 500 organizations will fail to effectively exploit Big Data for competitive advantage.” This leads to the question “Why are so many customers failing in their Big Data initiatives?”

The success of a Big Data implementation is directly proportional to the maturity model of the organization.

Drawing on Big Data project implementation experience, I would like to share an approach that includes the three assessment steps mentioned below. I thought it would also be insightful to mention the recommendations that lead to a successful Big Data implementation.



Recommend a model that will demonstrate the real value of Big Data as it applies to the organization. The final recommendations and roadmap, based on our learnings, yield one of two possible outcomes:

• If the organization already has all the necessary tools, processes, systems, and solutions to solve its existing problems, then we recommend, through a business case, that it is not a good candidate for adopting Big Data technologies and can resolve its problems with the existing ecosystem.

• If the organization demonstrates potential value from a Big Data investment, then we recommend moving forward with next steps: take the executable roadmap and blueprint and engage in a Big Data proof of concept (POC).


Organizations that approach big data from a value perspective with partnership between the business and IT are much more likely to be successful than those which adopt a pure technology approach. For this reason, making appropriate investments in both technology and organizational skill sets to ensure enterprise capability in extracting value from big data is essential.

Don’t wait, start now

Start collecting massive amounts of data and store it centrally with Hadoop, hire or train your data scientists, and change your culture to that of an information-centric organization. This will help drive innovation and keep you ahead. Don’t wait, as Big Data is the way forward.

Phani Kumar Reddy is a Manager of Analytics at Bodhtree, managing BI presales with expertise in various big data technologies such as Hadoop and BigQuery, and a passionate leader in Data Analytics, Business Intelligence, and Big Data services.


Balance your Supply Chain with Big Data

Let’s start by going back…way back, from a tech perspective. In the 1840s Samuel Finley Breese Morse, the American co-inventor of Morse code, envisioned laying cable across the Atlantic to enable telegraphic communication from the US to Europe. The business-benefit metric of the solution was a reduction in message transmission time from 10 days to only a few minutes. With this massive return, the initiative would seem like a “no brainer” from today’s perspective, where communication happens at millisecond speeds from your cell phone; believe it or not, the question commonly asked then was, ‘Do we really need communication so fast?’ The project ultimately took over 18 years to complete, until US president James Buchanan finally exchanged messages with Queen Victoria over the transatlantic cable, demonstrating the first business benefit. Let us call this the ‘Paradigm Shift Period’ for communication. Modern businesses now rely on instant communication across the world, with voice and data transfers occurring at lightning speed. People, processes and technologies within business have all evolved to conform to this new paradigm of global data interconnection.

In fact, the original challenge has now come full circle. Business and government have become so efficient at capturing and transmitting data that getting the data is no longer the core of the issue. The challenge and opportunity now lie in processing and interpreting the terabytes, even petabytes, of available structured and unstructured data to influence effective business strategy.

The chances are that you’ve been bombarded with Big Data buzz over the last year. But in spite of all the noise, you’ve probably noticed that few of these descriptions contain focused business use cases for applying Big Data technologies. I am the first to acknowledge, and agree with Gartner research, that Big Data analytics is riding a hype cycle that will likely peak sometime in 2013. Between now and then, a lot of mind share will go into figuring out whether there is value for your domain, your industry and your job. If you work in supply chain, irrespective of industry, continue reading to understand how Big Data is expected to bring both direct and indirect impact. Some of these reverberations may fundamentally change the nature of the duties performed in supply chain jobs. In 2010 we witnessed a ‘Paradigm Shift Period’ for Big Data analytics, with major players like SAP announcing the next generation of real-time analytics, as many asked a question similar to the one posed 170 years earlier: ‘Do we really need analytics so fast?’ SAP is now seeing its HANA analytics customer base grow rapidly, as are other big players like Oracle. We are witnessing an epic shift in supply chain data analytics that will make the approaches of the last decade seem antiquated.

The Supply Chain Domain

The core of any supply chain strategy is maintaining an appropriate balance between the supply and respective demand. Every other related model, including the well-known JIT (Just in Time), really targets the same goal with different degrees of precision and timeliness. Every time you enter the car repair shop and the mechanic mentions a part will take X days to order, you get a prime, though frustrating, example of a supply-demand imbalance. It is every organization’s goal to maintain a supply-demand balance by optimizing cost and quality with operational efficiencies.

On a much larger scale, I have observed operations at a $40B hi-tech manufacturer where maintaining the supply-demand balance is a far more complex proposition. Every day, employees and partners in this supply chain ecosystem try to find answers to key supply chain questions, but their view is constrained to only a piece of the picture, since reports rely primarily on structured data. How fast a person can get accurate and relevant information has a significant impact on the growth, profitability and productivity of the supply chain function.

The following are some ballpark metrics for the annual activities involved in keeping supply aligned with constant variation in market demand:

Does this look ugly? It is. But think about what these numbers will be after data volumes grow 16X by 2016.

It’s a category 5 hurricane of data.

All of the above communication is related to one or more of the following four areas: Assess the demand, Assess the supply, Fulfilment of demand, Delivery of the product/service. The efficiency and success of these activities can be tracked through metrics such as lead-time variance, forecast inaccuracies, on-time shipments and quality metrics to name a few.
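As a toy illustration of the metrics just mentioned, here is a minimal sketch (all numbers are hypothetical) of how forecast inaccuracy and on-time shipment rates might be computed:

```python
# Illustrative sketch with hypothetical data: computing two of the supply
# chain health metrics mentioned above.

def mape(actual, forecast):
    """Mean absolute percentage error, a common forecast-inaccuracy metric."""
    return sum(abs(a - f) / a for a, f in zip(actual, forecast)) / len(actual) * 100

def on_time_rate(shipments):
    """Share of shipments delivered on or before the promised date."""
    on_time = sum(1 for promised, delivered in shipments if delivered <= promised)
    return on_time / len(shipments) * 100

actual_demand = [100, 120, 90, 110]   # hypothetical units sold per week
forecast      = [110, 115, 100, 100]  # hypothetical planner forecast
shipments = [(5, 4), (5, 6), (7, 7)]  # hypothetical (promised day, delivered day)

print(f"Forecast inaccuracy (MAPE): {mape(actual_demand, forecast):.1f}%")
print(f"On-time shipments: {on_time_rate(shipments):.1f}%")
```

Tracking these numbers over time gives the baseline against which any Big Data initiative can be measured.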

Big Data for Supply Chain

NOW, let us bring Big Data into the picture and see how this outlook changes. A Big Data problem exists if data Volume, Velocity and Variety become difficult or impossible to store, process, and analyse using traditional technology and methods. With Big Data technologies, the capability to find answers faster and cheaper has grown exponentially.

While data volumes are predicted to grow 16X in just a few years, human ability to comprehend them does not keep the same pace. From the perspective of people, processes and technology within supply chain management, improvements will need to catch up as you implement Big Data technologies. The probability is high that Big Data technologies will play a key role in handling your rapid data expansion, so gear up your people and processes to match the potential of these technological innovations. Within the broad range of supply chain roles, let us consider the role of the planner and see how his or her activities change from today’s traditional technologies to the Big Data technologies of tomorrow.

Key supply chain functions, today (traditional technologies) vs. tomorrow (Big Data technologies):

• Forecasting. Today: running reports and analysis on a daily basis (reports alone can take hours to produce). Tomorrow: forecasting using real-time dashboards, eliminating the concept of running reports; data is ready at lightning-fast speeds, with the capability to capture snapshots of analysis.

• Demand Planning. Today: mostly using human-fed structured data. Tomorrow: demand planning using structured and unstructured data (e.g. web clickstreams, Facebook likes, Twitter feeds, customer reviews, news article mentions).

• Supply Planning. Today: traditional reports and email communications. Tomorrow: supply planning using real-time data with deep insights into news about vendors and partners.

• Fulfilment & Delivery. Today: tracked through workflows and report status. Tomorrow: proactive delivery tracking to predict possible delays and correlate interdependent events.

There is a fundamental shift from planners reading the data and recommending changes to the machine recommending changes and planners managing the exceptions. This has been the goal of many organizations for the last decade, but recent Big Data technology innovations represent quantum-leap advances toward true strategy automation.

The traditional model makes local copies of data, which the planner edits and writes back. The read/write process might take anywhere from seconds to many hours depending on the task. With Big Data, the turnaround becomes milliseconds. The natural reaction is, “Do I really need information flow that fast?” The important question is not how fast the information flows, but how quickly you can change your decision from A to B, capturing a time-sensitive opportunity or averting a major cost. In the current model, failing to cancel a wrong work order, or not considering all available information in an analysis, can mean a poor decision. Visualize planners viewing all the information they want to see in real time while the competition is still updating data and processing reports.

Bringing the Supply Chain Contacts, Content and Context Together for decisions

The most critical factor for effective corporate decisions is bringing contacts, content and context closer to each other. Consider, for example, a supply chain company that knows a part defect would potentially affect assembly, which could in turn delay customer delivery and eventually affect services. Predicting the occurrence of defects well in advance, through analysis of historical Big Data, has huge ROI potential by enabling appropriate adjustments to every event in this chain. Additionally, with Big Data recommending related content and relaying all of this to the right contacts, the result is direct ROI in the form of improved quality metrics, increased customer satisfaction and reduced maintenance costs for part replacement.

Today’s Big Data technologies have the capability to demonstrate how in the automobile industry an alternator part data sheet (Content) can be analysed against all cars sold (Contacts) and reveal the root cause for battery replacements (Context), an issue which has cost the company millions of dollars in repair services. Similar examples can be found in many Big Data technology use cases across industry verticals.

All of these scenarios primarily connect the 3Cs: Contacts (e.g. customer information or an internal employee) and Content (use-case-specific information, e.g. a battery failure) with Context (how a battery replacement is due to alternator failure).

With traditional technologies, much of a planner’s time is spent searching for information across multiple tools, reports and manual communication. One gauge of an effective Big Data implementation is whether it reduces the number of reports to a tenth of the current volume. Let the machines do the job of relating and correlating the huge flow of information, and put the planner in the command seat to review recommendations and approve or disapprove them. This directly increases the planner’s productivity, as he or she can focus on reviewing recommendations rather than searching for information.
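The exception-management pattern described here can be sketched very simply. In this hypothetical example (the SKUs, tolerance, and data are illustrative, not from any real system), the machine scores every line item and surfaces only those outside tolerance for the planner to review:

```python
# Sketch of machine-driven exception flagging: rather than the planner reading
# every report, only items whose demand deviates from forecast beyond a
# tolerance are surfaced for review. Thresholds and data are hypothetical.

TOLERANCE = 0.15  # flag items deviating more than 15% from forecast

def flag_exceptions(items, tolerance=TOLERANCE):
    """Return items whose actual demand deviates from forecast beyond tolerance."""
    exceptions = []
    for item in items:
        deviation = abs(item["actual"] - item["forecast"]) / item["forecast"]
        if deviation > tolerance:
            exceptions.append({**item, "deviation": round(deviation, 2)})
    return exceptions

items = [
    {"sku": "ALT-100", "forecast": 500, "actual": 490},  # within tolerance
    {"sku": "BAT-200", "forecast": 300, "actual": 420},  # large miss -> exception
    {"sku": "CAP-300", "forecast": 800, "actual": 610},  # large miss -> exception
]

for e in flag_exceptions(items):
    print(f"Review {e['sku']}: forecast {e['forecast']}, actual {e['actual']} "
          f"({e['deviation']:.0%} deviation)")
```

Here the planner reviews two exceptions instead of scanning all three line items; at the scale of thousands of SKUs, that ratio is where the productivity gain comes from.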

Where to Start

All of this means that you first need to conduct an assessment of your supply chain ecosystem with a specific use case in mind to which Big Data technologies will be applied. The specific area targeted for improvement might be forecast inaccuracy, which in today’s model relies mostly on structured data combined with massive exchanges of manual communication, ignoring much of the available market feedback (unstructured data). Measure the baseline and set realistic targets. Traditional forecast/demand planning fundamentally relies on a set of numbers entered by internal and external users. It does not factor in Big Data elements such as sentiment analysis of the market or internal/external unstructured communication (e.g. blogs, chats, tweets, customer reviews). When unstructured information is correlated with structured data, new insights arise, prompting better decisions. Empirical research suggests that a 1% improvement in forecasting drives multi-fold improvements across your entire supply chain. Upon realizing these early Big Data benefits, you can then expand to broader supply chain use cases.
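As a toy illustration of correlating unstructured market feedback with a structured forecast (the keyword lists, weight, and data below are hypothetical placeholders, not a real sentiment model):

```python
# Toy sketch: nudging a structured baseline forecast using a sentiment score
# derived from unstructured market feedback. The keyword sets, weight, and
# sample posts are hypothetical stand-ins for a real sentiment model.

POSITIVE = {"love", "great", "recommend"}
NEGATIVE = {"broken", "delay", "refund"}

def sentiment_score(posts):
    """Crude sentiment in [-1, 1]: (positive hits - negative hits) / total hits."""
    pos = sum(w in POSITIVE for post in posts for w in post.lower().split())
    neg = sum(w in NEGATIVE for post in posts for w in post.lower().split())
    return 0.0 if pos + neg == 0 else (pos - neg) / (pos + neg)

def adjusted_forecast(baseline, posts, weight=0.10):
    """Shift the baseline by up to +/-10% based on market sentiment."""
    return baseline * (1 + weight * sentiment_score(posts))

posts = [
    "Love the new kiosk, great experience",
    "Would recommend to friends",
    "Shipping delay on my order",
]

print(adjusted_forecast(1000, posts))  # net-positive sentiment lifts the forecast
```

The point is not the crude scoring but the shape of the pipeline: an unstructured signal, reduced to a number, correlated with the structured baseline to produce an earlier indication of demand variation.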


Now, where do you initiate the change and get quick ROI? Our recommendation is to pick the top five supply chain reports you run on your traditional BI and analytics platform, analyse them, and assess whether Big Data technologies would bring improved results. Consider the dimensions of accuracy, precision, and timeliness. For example, forecasting traditionally depends on sales, BU or operations entering their forecasts and arriving at some form of consensus. Inherent forecast inaccuracies exist and are mitigated by a continuous improvement process. With Big Data, you start feeding unstructured market information into the analysis, casting more light on external reactions to your product. This insight provides early indications of demand variations, allowing forecasts to be corrected.


The fundamental disruption in our supply chain ecosystem has begun, with Big Data technology capabilities impacting people, process and technology. Faster, better and cheaper processing of Big Data will drive improvements in people’s behaviour and actions, bringing improved supply/demand balance. Similarly, process improvements learned from supply-chain-driven industries (e.g. automobile) will flow into other industries like hi-tech and healthcare. The traditional daily job of a supply chain employee, who reads and writes Content, relates it to a Context, and works with his or her set of Contacts, will change dramatically. Human-driven searching will fundamentally shift to machine-driven searching that maps relevant information and recommendations for faster decision making. Get started with a use case that can be easily measured for ROI realization, then use that success as a launch pad to expand Big Data insights across the organization.


Ever wondered what happens between Map and Reduce?

Shuffle and Sort – The input passed to every reducer is sorted by key. The process by which the map outputs are sorted and transferred to the reducers as input is known as the shuffle.

Map side

The output produced by the mapper is not written directly to disk; it is first buffered in memory and processed further to improve efficiency. It is often a good idea to compress the map output as it is written to disk, since doing so improves performance, saves disk space, and reduces the volume of data transferred to the reducers. By default the output is not compressed, but compression is easy to enable by setting the value of ‘mapred.compress.map.output’ to ‘true’.

Reduce side

The map output file resides on the local disk of the tasktracker that ran the map task. It then requires further processing by the tasktracker that is about to run the reduce task for the partition. The reduce task requires the map output for its particular partition from several map tasks across the cluster. The map tasks may complete at different times, and the reduce task starts copying their outputs as soon as each map task completes.
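To make the flow concrete, here is a small, self-contained Python simulation of the map → shuffle/sort → reduce flow for a word count. This is an illustration of the concept only, not Hadoop’s actual implementation:

```python
# Minimal simulation of MapReduce word count: map tasks emit (key, value)
# pairs, the shuffle partitions them by key hash and sorts each partition,
# and each reducer receives its keys in sorted order. Conceptual sketch only.
from collections import defaultdict
from itertools import groupby

def map_task(line):
    """Map phase: emit (word, 1) for every word in the input line."""
    return [(word, 1) for word in line.split()]

def shuffle_and_sort(map_outputs, num_reducers=2):
    """Shuffle: partition pairs by key hash, then sort each partition by key."""
    partitions = defaultdict(list)
    for pairs in map_outputs:
        for key, value in pairs:
            partitions[hash(key) % num_reducers].append((key, value))
    return {p: sorted(pairs) for p, pairs in partitions.items()}

def reduce_task(sorted_pairs):
    """Reduce phase: sum values per key; input arrives already sorted by key."""
    return {key: sum(v for _, v in group)
            for key, group in groupby(sorted_pairs, key=lambda kv: kv[0])}

lines = ["big data big value", "data at rest"]
map_outputs = [map_task(line) for line in lines]
result = {}
for partition in shuffle_and_sort(map_outputs).values():
    result.update(reduce_task(partition))
print(result)  # word counts: big=2, data=2, value=1, at=1, rest=1
```

Because the shuffle delivers each partition sorted by key, the reducer can process one key’s group at a time without holding all keys in memory, which is exactly why the sort happens before the reduce.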

Bodhtree is a leader in ‘PACE’ technology IT services, including Product Engineering, Analytics, Cloud Computing, and Enterprise Services. Bodhtree empowers innovative business strategies through a mission to Educate, Implement, Align, and Secure transformational technology solutions.
