De-mystifying Big Data Analytics

By Chris Brown

Transcript of 3. Examples of Big Data in action

Table of Contents

1. Overview

2. Summary

3. Examples of Big Data in action

4. Jargon Buster

5. Acknowledgements


1. Overview

In October 1997 Michael Cox and David Ellsworth published ‘Application-controlled demand paging for out-of-core visualization’ in the Proceedings of the IEEE 8th conference on Visualization. They start the article with “Visualization provides an interesting challenge for computer systems: data sets are generally quite large, taxing the capacities of main memory, local disk, and even remote disk. We call this the problem of big data. When data sets do not fit in main memory (in core), or when they do not fit even on local disk, the most common solution is to acquire more resources.” It is the first article in the ACM digital library to use the term ‘Big Data’.

The concept of Big Data was defined by Meta Group analyst Doug Laney in a 2001 research report1 and related lectures, in which he described data growth challenges and opportunities as three-dimensional: increasing volume (amount of data), velocity (speed of data in and out) and variety (range of data types and sources). Gartner (which now owns the Meta Group), and much of the IT industry, continue to use this ‘3Vs’ model for describing Big Data. In 2012, Gartner updated its definition as follows: ‘Big Data are high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization.’

The term Big Data is pervasive, yet the notion still engenders confusion. The accepted definition appears clear (see above), but most organisations still do not understand how it may benefit their business. Big Data has been used to convey all sorts of concepts, including huge quantities of data, social media analytics, next generation data management capabilities, real-time data and much more; therein lies much of the confusion.

Major organisations such as Tesco, Boeing and BMW have been running Big Data Analytics projects for a number of years; they understand the benefits of aggregating all the data available to them and then analysing it to predict trends, maintenance issues, product performance and so on. However, until recently this technology was seen as the domain of only the large corporates that could afford the technology and the infrastructure to support it.

Times, however, are changing: a recent report by IDC2 shows that factory revenue for the high performance computing (HPC) technical server market increased 5.3% in the first quarter of 2013 to reach $2.5 billion, up from $2.4 billion in the same period of 2012.


The key point is that this growth was driven by smaller departmental and workgroup systems; the IT buyers for these smaller HPC systems were typically line-of-business buyers within a larger corporation, or ‘smaller’ organisations moving into more technical computing for the first time. Supercomputer sales, defined as systems priced at $500,000 and up, fell 10.9% in the first quarter compared to a year earlier, to $861 million. This trend is believed to be driven by analytics projects. (Note: Big Data servers are not accounted for separately and are generally lumped in with HPC servers.)

Whatever the label, organisations are starting to understand and explore how to process and analyse the vast array of information in new ways. In doing so, a small but growing group of pioneers is achieving breakthrough business outcomes; in industries throughout the world, executives are starting to recognise the need to learn more about how to exploit Big Data.

But despite what seems like unrelenting media attention, it can be hard to find in-depth information on what organisations are really doing with Big Data and what Big Data actually means.

In 2012 the IBM Institute for Business Value and the Saïd Business School at the University of Oxford partnered to develop a report to better understand how organisations view Big Data and to what extent they are currently using it to benefit their businesses. It is based on the Big Data @ Work Survey conducted by IBM in mid-2012 with 1144 professionals from 95 countries across 26 industries. Respondents represent a mix of disciplines, including both business professionals (54% of the total sample) and IT professionals (46%). Respondents self-selected to participate in the web-based survey. Study findings are based on analysis of survey data and discussions with University of Oxford academics, subject matter experts and business executives. IBM is the primary source of the study recommendations.

They found that 63%, nearly two-thirds, of respondents report that the use of information (including Big Data) and analytics is creating a competitive advantage for their organisations. This compares to 37% of respondents in IBM’s 2010 New Intelligent Enterprise Global Executive Study and Research Collaboration, a 70% increase in just two years. As an increasingly important segment of the broader information and analytics market, Big Data is having an impact. Respondents whose organisations had implemented Big Data pilot projects or deployments were 15% more likely to report a significant advantage from information (including Big Data) and analytics compared to those relying on traditional analytics alone.
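The reported jump, from 37% to 63% of respondents, does indeed work out at roughly a 70% relative increase; a quick sanity check (in Python, purely for illustration) confirms the arithmetic:

```python
prev_share = 37  # 2010 study: respondents reporting a competitive advantage (%)
curr_share = 63  # 2012 study: the same measure (%)

# Relative increase over the two years, as a percentage of the 2010 figure.
relative_increase = (curr_share - prev_share) / prev_share * 100
print(f"{relative_increase:.0f}%")  # prints "70%"
```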

Figure 1: Breadth of Big Data (Source: Schroeck et al., 2012)


Understanding how Big Data Analytics fits into today’s organisations

As with any new technology or concept there are always barriers to implementation, and here I have selected two of the ‘10 roadblocks to implementing Big Data analytics’ drawn up by Mary Shacklett as part of the Executive Guide: Making the Business Case for Big Data3.

Budget

Traditional servers in enterprise data centres are not designed for processing Big Data. Minimally, analytics servers, and in some cases high performance computing (HPC) servers and applications, will be needed. This will require new IT investment. The key to success here is for the CIO to build a business case in plain English so that others in the organisation (like the CFO) can understand why servers already installed in the data centre can’t be repurposed to work with Big Data. The CIO should have this understanding (and buy-in) in place before making any IT investment.

Business and IT alignment

Business goals and IT Big Data strategy should be tightly aligned before any IT investments are made. Ultimately, C-level executives are going to look back to see whether they really were able to answer the big questions and gain competitive advantage for the company through the use of Big Data. Do you want to predict when there will be a disruption to a supply chain so you can rearrange your logistics to still be on time with your order fulfilment? Is it important to know when a certain buying trend first emerges so you can be first to market? Know what you are going after before you invest in Big Data Analytics.

I chose these two because I believe they are the key issues organisations face today when thinking about implementing Big Data Analytics. They need to understand that the technology surrounding Big Data needs a different approach, and that IT departments and business managers need to work together to build a Big Data/Analytics strategy; after all, “If you do not know how to ask the right question you discover nothing”4.

Once the barriers have been overcome and the success criteria defined, you should be looking at the ROI for any Big Data/Analytics project. The CFO will likely be the first person to ask for a projected return on investment; the irony CIOs face is that they may not have any existing ROI models in IT to draw from. This is because traditional IT ROI models are based on elements like speed per transaction (Big Data doesn’t work on a speed-per-transaction basis) and shrinking data centre equipment footprints to gain energy savings (Big Data does not run on virtualised machines, which are pivotal to shrinking data centre footprints and saving energy). To compound the situation, Big Data applications such as automobile performance simulations or modelling new drug formulations can sometimes take hours to run. These apps can’t usually qualify for the more compact and inexpensive processing options of simple business analytics computing.

As we’ve already discussed, however, there is a growing number of scalable clustered HPC solutions available to make Big Data analysis a reasonable option for the enterprise. These solutions can start small and then be expanded as the needs for enterprise Big Data analysis grow; they can also be scaled to the level of supercomputers if need be. This can be enough to pass the CFO’s litmus test on cost of acquisition.

Of course, there is still more work to do. The team proposing Big Data technology must also show how the technology is going to bring value to the enterprise and how long it will take the company to recoup its technology investment. Value, as McKinley states, will come in the form of faster times to market from Big Data analysis that give the enterprise a competitive edge, or superior ways to evaluate and respond to consumer buying patterns that enable the company to capture more revenue. In some cases, Big Data analysis (think healthcare) can provide insights that allow organisations to revamp operations for less waste, thereby reducing costs. These savings or earnings projections are usually pencilled out by line-of-business managers at the budget table.

That leaves ROI from the data centre, which can be a challenge for CIOs. Remember, HPC for Big Data is not virtualised, so it is not likely to contribute return on investment in saved data centre floor space or energy savings. Instead, the CIO should look at energy consumption and data centre efficiencies from the standpoint of server utilisation. In a traditional transaction processing environment, server utilisation generally hovers around 40-60% for x86 servers running virtual applications and operating systems, and around 80% for a mainframe. The wasted capacity occurs because transaction servers must often wait for transaction requests to come in.

In contrast, servers that function in HPC clusters for Big Data processing contain different processing nodes, with each node processing a single thread of the data and with no interruptions or wait times. These nodes operate in parallel. It is only at the end of the processing that all Big Data threads are brought back together for a composite Big Data analysis. Because of this parallel processing, Big Data HPC servers usually run at 90-95% utilisation, with almost no idle time.
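That divide-the-data, process-in-parallel, recombine-at-the-end pattern can be sketched in a few lines of Python. This is a single-machine analogy only (real HPC clusters distribute work across physical nodes, not local threads), with invented data purely for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    # Each "node" works on its own slice of the data with no waiting.
    return sum(chunk)

data = list(range(1_000_000))
# Split the data set into four independent threads of work.
chunks = [data[i::4] for i in range(4)]
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(process_chunk, chunks))  # workers run concurrently
# Only at the end are the partial results brought back together.
total = sum(partials)
print(total)  # prints 499999500000
```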

Efficient resource utilisation is the area of Big Data ROI that the CIO needs to bring to the budget discussion. When this resource utilisation argument is combined with the time to market, operational efficiencies, and revenue optimisation arguments from the end business, the CFO will feel a lot more comfortable about the investment.

Finally, think both tactically and strategically about your Big Data project. Tactically, develop a procedure that evaluates the return on investment for each Big Data proposal. Using the criteria mentioned above, determine which idea makes the best Big Data project and use its development as an on-going blueprint for subsequent Big Data efforts. This will help your company develop repeatable processes and a strategy for delivering successful business solutions from its Big Data projects.


2. Summary

Big Data is becoming more prevalent in a wider range of organisations and more readily available as:

- The price of the relevant technology drops and becomes more affordable.

- Organisations begin to understand the benefits of adopting Big Data Analytics as a tool for improving their business.

- Big Data Analytics moves from the domain of the IT professional to the business user, as non-technologists become more comfortable with the technology and the jargon.

The amount of data in the world is increasing exponentially. According to Google CEO Eric Schmidt, “There were 5 exabytes of information created by the entire world between the dawn of civilization and 2003; now that same amount is created every two days." Yet a recent IDC study14 estimated that only 3% of the world’s data is tagged, and a mere 0.5% of it is being analysed, so there is still a long way to go.

So we now have far more data than we used to, which can be processed at higher speed by technology that is becoming more affordable. But why do we need Big Data? Big Data is fast becoming a key tool for companies to outperform their peers; in most industries, established competitors and new entrants alike will use data-driven strategies to innovate, compete and capture value.

While it is true that Big Data emanated from the scientific community, with projects such as planet detection and super-collider analytics, commerce has not been slow to adopt a technology that some see as a game changer. Predictive maintenance in manufacturing, student retention in higher education and customer preferences in supermarkets are just three examples of how organisations have adapted this scientific model and found ways of monetising it, while at the same time improving the customer experience for many.

The combination of an innovative technology and demonstrable benefits allied to more affordable prices is an attractive mix and, unlike many technology driven projects, the paybacks can be both predicted and measured.

Even Microsoft (not noted for its speedy adoption of non-Redmond technologies) says that, in its view, 75% of midsize to large businesses will implement big data-related solutions within the next 12 months15, with customer care, sales, finance and marketing being the top four drivers for this implementation.

After the Second World War a new term was coined, ‘Big Science’, which essentially meant that scientists were able to tackle major scientific issues as large-scale projects thanks to a change in the way these projects were funded (usually by national governments or groups of governments). This was a game changer in scientific and medical disciplines and its effects are on-going (think Large Hadron Collider). ‘Big Data’ comes from this background and has the potential to be as significant as Big Science, but not only in scientific disciplines: this time governments, commerce, education, charities (in fact any organisation looking to improve the way it does business) can benefit from this technology.


3. Examples of Big Data in action

According to Steve Jones, director of strategy for Big Data Analytics for Capgemini, “rather than focus on the goodies of Big Data -- the new-fangled platforms and cutting-edge software -- it's smarter to determine the business problem you're determined to solve and what sources of data can assist your decision-making.”

Tesco aims to save over €20m5 a year by using sophisticated business intelligence technology to ensure its refrigerators operate at the right temperature. The move will help the retailer cut its refrigeration energy costs by up to 20% across 3000 stores in the UK and Ireland.

BMW use Predictive Maintenance to reduce warranty costs by 5% and repeat repairs by 50%6.

Sebastien Sasseville is an extreme athlete with diabetes who uses Big Data biometric sensors to train for the ultimate race, and to help science conquer his disease7.

Apple’s ubiquitous intelligent personal assistant, Siri, available on its latest iPhones, started life in one of the Pentagon’s labs9. Siri now analyses petabytes of information to deliver real-time information to your phone.

Riot Games aims to be the most player-focused game company in the world. To fulfil that mission, it is vital that they develop a deep, detailed understanding of players’ experiences10. They use a Big Data infrastructure to support continued insights into their 32 million active users.

Hertz is the world's largest airport car rental brand, operating in 146 countries. Hertz continually requests and receives feedback from its customers; to retain a competitive edge, the feedback is analysed so that issues can be identified in real time and problems can be addressed and resolved quickly11.

In the time it took you to read this sentence, NASA gathered approximately 1.73 gigabytes of data from nearly 100 currently active missions. They do this every hour, every day, every year13, and it is all processed in a Big Data environment.

Troy Carter, Lady Gaga's business manager, is a big data devotee, reports The South China Morning Post. Carter created Littlemonsters.com, a Gaga-centric social network, by mining the singer's 31 million-plus fans on Twitter and 51 million-plus on Facebook. The reported goal is to woo as many of Gaga's "little monsters" as possible to the site, effectively bypassing the general-purpose social media networks and keeping 100% of future revenues.


4. Jargon Buster

ACID – the acronym for Atomicity, Consistency, Isolation and Durability: a set of properties that guarantee that database transactions are processed reliably.
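Atomicity, the ‘A’ in ACID, can be seen with Python's built-in sqlite3 module (a small illustrative sketch; the account table and amounts are invented). If any statement in a transaction fails, the whole transaction is rolled back, so the database never shows a half-finished transfer:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    with conn:  # a transaction: commits on success, rolls back on any error
        conn.execute("UPDATE accounts SET balance = balance - 60 "
                     "WHERE name = 'alice'")
        raise RuntimeError("simulated failure mid-transfer")
        # the matching credit to 'bob' is never reached
except RuntimeError:
    pass

# Atomicity: the partial debit was rolled back; balances are unchanged.
print(dict(conn.execute("SELECT name, balance FROM accounts")))
# prints {'alice': 100, 'bob': 0}
```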

Analytics - Data Analytics (DA) is the science of examining raw data with the purpose of drawing conclusions about that information.

Big Data - are high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimisation.

Big Data Apps - also known as BDAs, are responsible for taking all the information gathered by Big Data systems and turning it into an easy-to-consume visualisation. This is a broad category that can include everything from end-user analytics to new data management tools like Hadoop.

Cloud - the definition of “the Cloud” is ambiguous. It began as a method for anonymously requesting a business service over the internet, but was redefined when big players such as Microsoft, Amazon and Google became involved; it is now another way of providing a managed environment where information can be stored, this time in a remote (and invisible) location.

Data Cholesterol - a condition that affects computers similarly to humans - the “excessive build-up of data leads to sluggishness across your systems,” which can lead to a hindrance in the system’s ability to function.

Data-mining - the task of analysing large data sets to extract useful information that can then be used by an organisation to cut costs, increase efficiency and better serve customers. Data mining has grown in popularity as businesses realise that previously untapped data sources may hold secrets that provide a competitive edge.

Data Scientist - is generally an individual who performs statistical analysis, data mining and retrieval processes on a large amount of data to identify trends, figures and other relevant information.

Data Sizes

ETL - short for extract, transform, load: three database functions that are combined into one tool to pull data out of one source and place it into another.

Extract is the process of reading data from a database.

Transform is the process of converting the extracted data from its previous form into the form it needs to be in so that it can be placed into another database.

Load is the process of writing the data into the target database.
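The three steps can be sketched with Python's sqlite3 module. The schema, table names and values below are invented purely for illustration:

```python
import sqlite3

# Source database holding the raw data.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE sales (amount_pence INTEGER, region TEXT)")
src.executemany("INSERT INTO sales VALUES (?, ?)",
                [(1250, "uk"), (990, "uk"), (2000, "ie")])

# Extract: read the rows out of the source database.
rows = src.execute("SELECT amount_pence, region FROM sales").fetchall()

# Transform: convert pence to pounds and normalise the region codes.
transformed = [(pence / 100.0, region.upper()) for pence, region in rows]

# Load: write the reshaped rows into the target database.
tgt = sqlite3.connect(":memory:")
tgt.execute("CREATE TABLE sales_gbp (amount REAL, region TEXT)")
tgt.executemany("INSERT INTO sales_gbp VALUES (?, ?)", transformed)

print(tgt.execute("SELECT * FROM sales_gbp").fetchall())
# prints [(12.5, 'UK'), (9.9, 'UK'), (20.0, 'IE')]
```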

Term | Short form | Absolute size | Relative size
Bit | b | binary digit | The smallest unit of data a computer uses; it is either on or off
Byte | B | 8 bits | Equals one character
Kilobyte | KB | 1000^1 bytes | Roughly a standard paragraph of text
Megabyte | MB | 1000^2 bytes | About 873 pages of plain text
Gigabyte | GB | 1000^3 bytes | Could hold the contents of about 10 yards of books on a shelf
Terabyte | TB | 1000^4 bytes | Could hold 1,000 copies of the Encyclopedia Britannica
Petabyte | PB | 1000^5 bytes | Could hold approximately 20 million 4-drawer filing cabinets full of text
Exabyte | EB | 1000^6 bytes | It has been said that 5 exabytes would equal all the words ever spoken by mankind
Zettabyte | ZB | 1000^7 bytes | There is nothing to compare a Zettabyte to, except to say it is a 1 followed by a lot of zeroes
Yottabyte | YB | 1000^8 bytes | Even more zeroes than a Zettabyte
Brontobyte | BB | 1000^9 bytes | Even more zeroes than a Yottabyte
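A small Python helper (illustrative only; the function name is invented) that converts a raw byte count into the decimal, 1000-based units in the table above:

```python
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB", "BB"]

def human_size(n_bytes):
    """Express a byte count in the decimal (1000-based) units from the table."""
    size, idx = float(n_bytes), 0
    while size >= 1000 and idx < len(UNITS) - 1:
        size /= 1000.0
        idx += 1
    return f"{size:g} {UNITS[idx]}"

print(human_size(2_500_000_000))  # prints "2.5 GB"
```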


Hadoop - is an open-source software framework, from the Apache Software Foundation, that supports data-intensive distributed applications.

HDFS - stands for Hadoop Distributed File System, the system of distributing files that allows Hadoop to work on huge datasets at speed. It spreads blocks of data across different servers and duplicates those blocks, storing the copies distinctly, which allows for both parallel processing and compensation for server failure.

MapReduce - is a programming model for processing large datasets which occurs in two steps. First it "maps" out the relevant information for your query, then it "reduces" the information down, sorts it based on any rules you've applied, and gives you just the data you were after.
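A toy word count, the classic MapReduce example, can be sketched in plain Python. This runs in a single process purely to show the two phases; real MapReduce frameworks such as Hadoop run them in parallel across a cluster:

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word - the relevant
    # information for the query, pulled out of each document.
    for doc in documents:
        for word in doc.split():
            yield word, 1

def reduce_phase(pairs):
    # Reduce: group the pairs by key and combine the counts.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["big data big insight", "big deal"]
print(reduce_phase(map_phase(docs)))
# prints {'big': 3, 'data': 1, 'insight': 1, 'deal': 1}
```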

Pig - is a scripting interface to Hadoop, meaning a lack of MapReduce programming experience won't hold you back. It's also known for being able to process a large variety of different data types.

HPC - stands for High Performance Computing (or Computers) and is the use of “Supercomputers” and parallel processing techniques for solving complex computational problems. HPC technology focuses on developing parallel processing algorithms and systems by incorporating both administration and parallel computational techniques.

In-Memory Databases - a class of databases that store data in main memory instead of on disk, which eliminates the latency associated with I/O, making them much faster than traditional databases. In-memory databases have existed for years, but the rise of cheap memory has renewed interest and led to the emergence of products like SAP’s HANA and Oracle’s TimesTen.
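Python's sqlite3 module can demonstrate the idea on a small scale: connecting to ":memory:" keeps the entire database in RAM, so queries never touch the disk (the sensor table and readings here are invented for illustration):

```python
import sqlite3

# ":memory:" keeps the whole database in RAM - no disk I/O at all.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
db.executemany("INSERT INTO readings VALUES (?, ?)",
               [("t1", 21.5), ("t1", 22.0), ("t2", 19.0)])

# Aggregate entirely in memory.
for sensor, avg in db.execute(
        "SELECT sensor, AVG(value) FROM readings GROUP BY sensor"):
    print(sensor, avg)
```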

NewSQL - relational databases and NoSQL alone are still not enough to store and manage continually shifting and growing data stores, so yet another technology has emerged: NewSQL, which includes products like SQLFire and StormDB. NewSQL databases can be thought of as a hybrid between NoSQL and traditional SQL-based relational databases: they provide the scalability and performance of NoSQL systems while also offering the ACID guarantees of a traditional database system. NewSQL systems generally target use cases that involve:

a large number of short-lived transactions

the same queries repeatedly with different inputs

indexed information in a single structure (no complex joins/full table scans)

NoSQL - NoSQL database, also called Not Only SQL, is an approach to data management and database design that's useful for very large sets of distributed data. It encompasses a wide range of technologies and architectures and seeks to solve the scalability and big data performance issues that relational databases weren’t designed to address.

Structured Data - data that resides in fixed fields within a record or file; relational databases and spreadsheets are examples of structured data.

Unstructured Data - data that does not reside in fixed locations. The term generally refers to free-form text, which is ubiquitous; examples are word processing documents, PDF files, e-mail messages, Tweets, blogs and Web pages.

XaaS - a collective term generally accepted as meaning "everything as a service." The acronym refers to the increasing number of services delivered over the Internet rather than provided locally or on-site, such as SaaS, PaaS and IaaS. XaaS is used when referring to more than one of these technologies and you want to condense them into a single term.


5. Acknowledgements

1 http://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf

2 http://www.businesswire.com/news/home/20130612005142/en/HPC-Server-Market-Grows-5.3-Quarter-2013

3 http://www.techrepublic.com/whitepapers/executive-guide-making-the-business-case-for-big-data/32599113

4 Quote by W. Edwards Deming

5 http://www.computerweekly.com/news/2240184482/Tesco-uses-big-data-to-cut-cooling-costs-by-up-to-20m

6 http://www.ibm.com/analytics/us/en/events/leadership-summit/SALS_Predictive_Maintenance.pdf

7 http://www.emc.com/campaigns/global/big-data/human-face-of-big-data.htm

8 http://davebeulke.com/big-data-three-criteria-to-big-data-project-success/

9 http://siliconangle.com/blog/2013/02/20/five-government-big-data-projects-thatll-change-the-world/

10 http://stampedecon.com/agenda/presentations/#Jerome_Boulon

11 http://www.slideshare.net/robdthomas/ibm-big-data-references

12 http://wikibon.org/wiki/v/Big_Data_Vendor_Revenue_and_Market_Forecast_2012-2017

13 http://open.nasa.gov/blog/2012/10/04/what-is-nasa-doing-with-big-data-today/

14 http://www.marketingcharts.com/wp/topics/asia-pacific/just-0-5-of-the-worlds-massive-trove-of-online-data-is-being-analyzed-25463/

15 http://www.microsoft.com/en-us/news/Press/2013/Feb13/02-11BigDataRoundupPR.aspx
