Big Data overview

Big Data overview ESCEN Alexis Roos Senior Sales Engineer / Architect © Copyright 2013, Alexis Roos, [email protected]

description

Subset of overview course I gave for ESCEN, Silicon Valley http://www.escen.fr/lecole/escen-silicon-valley/

Transcript of Big Data overview

Page 1: Big Data overview

Big Data overview
ESCEN
Alexis Roos
Senior Sales Engineer / Architect

© Copyright 2013, Alexis Roos, [email protected]

Page 2: Big Data overview

Course objectives

● Give you a map / big picture and pointers so you can drill down as you need
● Will cover the business side but also the technology, since without a good technical understanding it is not possible to grasp the business side
● Will go over the landscape and possibilities, illustrated with a good number of use cases

Page 3: Big Data overview

Proposed Agenda

● What is Big Data?
● Big Data landscape (Tech heavy)
● Business / Use cases
● Discussion

Page 4: Big Data overview

Proposed Agenda

● What is Big Data?
● Big Data landscape
● Business / Use cases
● Discussion

Page 5: Big Data overview

Big Data

Page 6: Big Data overview

Data and Big Data

● Data is the basis for information
  Economics now make it possible to store virtually unlimited data
● Gartner's definition: "Big data is high-volume, -velocity and -variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making."

http://www.youtube.com/watch?v=ah14LEFKe8Q

Year    Cost of 1 GB
1980    $3,000,000
1990    $8,000
2000    $30
2010    $0.08

Page 7: Big Data overview

Data – Information processing 1/2

● Through processing, data becomes information (knowledge); knowledge creates insight, and insight = success.
● Transaction processing: a sequence of information exchange and related work that is treated as a unit for the purposes of satisfying a request (usually human but not exclusively), aka Online Transaction Processing or OLTP
  Example: you buy an item on Amazon:
  . Item is placed on hold in the inventory system
  . Item is placed in the shopping cart
  . System requests credit card (CC) payment authorization for the item
  . If payment is approved, the CC is charged and the item is removed from inventory and shipped
  -> all of the above or nothing (roll back)
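The "all of the above or nothing" behavior can be sketched with Python's built-in sqlite3 module. This is only an illustration under assumed names: the `inventory` and `orders` tables, the `place_order` function and the `payment_approved` flag are hypothetical and do not describe how any real retailer implements checkout.

```python
import sqlite3

def make_shop():
    """Create a tiny in-memory shop with 5 units of item 1 (illustrative schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE inventory (item_id INTEGER PRIMARY KEY, stock INTEGER)")
    conn.execute("CREATE TABLE orders (item_id INTEGER)")
    conn.execute("INSERT INTO inventory VALUES (1, 5)")
    conn.commit()
    return conn

def place_order(conn, item_id, payment_approved):
    """Reserve the item and record the order; undo both if any step fails."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            cur = conn.execute(
                "UPDATE inventory SET stock = stock - 1 "
                "WHERE item_id = ? AND stock > 0",
                (item_id,),
            )
            if cur.rowcount == 0:
                raise RuntimeError("item out of stock")
            conn.execute("INSERT INTO orders (item_id) VALUES (?)", (item_id,))
            if not payment_approved:
                raise RuntimeError("payment declined")
        return True
    except RuntimeError:
        return False

def stock_and_orders(conn):
    """Return (remaining stock of item 1, number of recorded orders)."""
    stock = conn.execute("SELECT stock FROM inventory WHERE item_id = 1").fetchone()[0]
    orders = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
    return stock, orders
```

A declined payment leaves both tables exactly as they were before the call, even though the inventory update and the order insert had already been executed inside the transaction.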

Page 8: Big Data overview

Data – Information processing 2/2

● Real Time processing: perceived as "immediate" by the originator
  Ex: trading, payment, online booking, "right" ad delivery, gaming, etc.
● Batch processing: delayed execution of a series of programs ("jobs") on a computer without manual intervention
  Ex: billing, virus scanning, web indexing, data mining, analytics, etc.

Page 9: Big Data overview

Data – ACID Transaction

● Technical definition:
  ● Atomicity: each transaction is all or nothing
  ● Consistency: each transaction leaves the data in a state that is consistent with the data rules
  ● Isolation: each transaction is kept isolated from the others
  ● Durability: once a transaction has been committed, it will remain so, even in the event of power loss, crashes, or errors
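As a small illustration of consistency and atomicity working together, the sketch below uses Python's built-in sqlite3 module with a `CHECK` constraint as the "data rule". The `accounts` table, the account names and the `transfer` function are assumptions made for this example.

```python
import sqlite3

def open_bank():
    """In-memory bank whose data rule is: a balance may never be negative."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
        "balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
    )
    conn.commit()
    return conn

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one transaction."""
    try:
        with conn:  # commit if both updates succeed, roll back otherwise
            # Credit first, so a failing debit must undo an already applied step.
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:  # the CHECK rule was violated
        return False

def balances(conn):
    """Return {name: balance} for all accounts."""
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))
```

A transfer larger than the source balance violates the rule, so the credit that was already applied is rolled back and neither account changes.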

Page 10: Big Data overview

Big Data - Applications

● Find deeper insight in data: customers, partners and business.
  All industries will be affected.
  "Software is eating the world"
● Retail: buying patterns, store traffic, etc.
● Logistics: track and optimize shipments, etc.
● Healthcare: preventive medicine, disease management, etc.
● Social media: optimize usage, ads, etc.
● Finance: buying patterns, portfolio optimization

http://www.youtube.com/watch?v=7D1CQ_LOizA

http://online.wsj.com/article/SB10001424053111903480904576512250915629460.html

Page 11: Big Data overview

Big Data – Three dimensions

● Volume
  ● Amount of data
● Velocity
  ● Speed at which it arrives
● Variety
  ● Types of data

Page 12: Big Data overview

Big Data – Volume/Size matters

Name             Value    Example
kilobyte (kB)    10^3     Email (7 kB), images, web pages
megabyte (MB)    10^6     Ebooks, MP3, SD video, etc.
gigabyte (GB)    10^9     HD movie
terabyte (TB)    10^12    For a single journey across the Atlantic Ocean, a four-engine jumbo jet can create 640 terabytes of data
petabyte (PB)    10^15    FB has over 1.5 PB of stored photos
exabyte (EB)     10^18    Seagate Technology reported selling 330 exabytes worth of hard drives during the 2011 fiscal year
zettabyte (ZB)   10^21    Worldwide production and consumption of data. According to International Data Corporation, the total amount of global data is expected to grow to 2.7 zettabytes during 2012
yottabyte (YB)   10^24    Not there yet ..

...

http://en.wikipedia.org/wiki/Zettabyte http://www.youtube.com/watch?v=CsVYID9rMGE
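To get a feel for the scale in the table above, here is a minimal helper that converts a raw byte count into the decimal (powers of 1000) units listed. The function name `human_size` and the one-decimal formatting are choices made for this sketch.

```python
UNITS = ["B", "kB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def human_size(num_bytes):
    """Format a byte count with the largest decimal unit that keeps it below 1000."""
    value = float(num_bytes)
    for unit in UNITS:
        # Stop at the first unit that fits, or at the largest unit we know.
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:.1f} {unit}"
        value /= 1000
```

For instance, `human_size(640 * 10**12)` expresses the jumbo jet figure in terabytes and `human_size(27 * 10**20)` expresses the 2012 global data estimate in zettabytes.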

Page 13: Big Data overview

Big Data - Speed

● How fast is new data coming?
● How does this data need to be used or correlated?
● How long is data valuable?
● How fast does data need to be processed?
● This dimension in particular will affect the system architecture

Page 14: Big Data overview

Big Data - Variety

● What type(s) / format(s)?
  ● Human or machine generated
  ● Text, location, document, picture, video, click streams, log file, event, etc.
● Is it structured or unstructured?
  ● Static vs dynamic
● What are the relationships/dependencies within data elements?

Page 15: Big Data overview

Proposed Agenda

● What is Big Data?
● Big Data landscape
● Business / Use cases
● Discussion

Page 16: Big Data overview

Big Data landscape

Big Data applications are roughly built using three technology layers:
● Storage
● Analytics
● Visualization

Page 17: Big Data overview

For whom?

Page 18: Big Data overview

Big Data landscape

● Storage
● Analytics
● Visualization

Page 19: Big Data overview

Big Data – Storage

● Main logical data models:

● Tabular (represented by rows and columns) - Relational model

● Tree (a set of nodes with parent-children relationship)

● Graph structure (a set of interconnected nodes)

● Document (free structure / unstructured / schema less)

Page 20: Big Data overview

Big Data – Storage

● Physical data models:
  ● Relational Database Management Systems (RDBMS): systems that support ACID transactions and joins are considered relational. They use the SQL language as their API.
  ● Key-value systems basically support get, put, and delete operations based on a primary key.
  ● Column-oriented systems still use tables but have no joins (joins must be handled within your application). They store data by column, as opposed to traditional row-oriented databases, which makes aggregations over a column much easier.
  ● Document-oriented systems store structured "documents" such as JSON or XML but have no joins (joins must be handled within your application). It is very easy to map data from object-oriented software to these systems.

http://nosql-database.org/
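The physical models above can be made concrete with a short in-memory sketch. The `KeyValueStore` class and the sample records are hypothetical; real systems add persistence, distribution and much more. The second half shows the same records laid out by row and by column, and why an aggregation only needs to touch one list in the column layout.

```python
class KeyValueStore:
    """Minimal key-value interface: get, put and delete by primary key."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)  # deleting a missing key is a no-op


# Row-oriented layout: one record (a "document"-like dict) per entry.
rows = [
    {"id": 1, "city": "Paris", "amount": 20},
    {"id": 2, "city": "Lyon", "amount": 35},
    {"id": 3, "city": "Paris", "amount": 45},
]

# Column-oriented layout: one list per column, built from the same records.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# The row layout has to visit every record; the column layout reads one list.
total_from_rows = sum(row["amount"] for row in rows)
total_from_columns = sum(columns["amount"])
```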

Page 21: Big Data overview

Big Data - Storage

● It is not practical to store data on 1 system, but distributing data creates complexity:
  ● Consistency: each client always has the same view of the data.
  ● Availability: all clients can always read and write.
  ● Partition tolerance: the system works well across physical network partitions.
● A distributed system can provide at most 2 of these 3 properties at the same time (known as the CAP theorem): CA, AP or CP.
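The trade-off can be illustrated with a toy model of two replicas holding a single value. This is a deliberately simplified sketch, not a real replication protocol: the `TwoNodeStore` class, its `mode` argument and the `partitioned` flag are all assumptions made for the example.

```python
class Replica:
    """One copy of a single stored value."""

    def __init__(self):
        self.value = None


class TwoNodeStore:
    """Two replicas of one value; mode is "CP" or "AP"."""

    def __init__(self, mode):
        self.mode = mode
        self.nodes = [Replica(), Replica()]
        self.partitioned = False  # True while the replicas cannot talk to each other

    def write(self, node_index, value):
        if self.partitioned:
            if self.mode == "CP":
                return False  # refuse the write: consistency kept, availability lost
            self.nodes[node_index].value = value  # accept locally: replicas diverge
            return True
        for node in self.nodes:  # healthy network: update every replica
            node.value = value
        return True

    def read(self, node_index):
        return self.nodes[node_index].value
```

During a partition the CP store rejects writes but both replicas still agree, while the AP store keeps accepting writes and the two replicas return different values until they are reconciled.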

Page 22: Big Data overview

Big data - Storage

Source: http://blog.nahurst.com/visual-guide-to-nosql-systems

Page 24: Big Data overview

Big Data landscape

● Storage
● Analytics
● Visualization

Page 25: Big Data overview

Big Data – Analytics

● Process of examining large amounts of data of a variety of types to uncover hidden patterns, unknown correlations and other useful information resulting in business benefits, such as more effective marketing or increased revenue.
● Can work on all forms of data as described before
● Can involve transactions, real-time and/or batch-oriented processing

Page 26: Big Data overview

Big Data – Analytics

● "Stages" of analytics:

● Business monitoring: traditional BI, Charting, Key Performance Indicator, etc.

● Business insights: uses statistics, data mining, predictive analysis to generate actionable insights: "Intelligent dashboards". Leverages trending, classification, optimization, simulations.

● Business transformation based on data

Page 27: Big Data overview

Big Data – Analytics

● Anticipate and predict

Page 28: Big Data overview

Big Data – Analytics

● Traditional predictive analytics and data mining are designed for relational or structured data, so a whole set of new technologies has evolved for unstructured data.
● Hadoop (batch oriented): "brute force"
● Real Time processing (new trend): optimized for specific use cases
● Machine learning: data intensive

Page 29: Big Data overview

Big Data – Hadoop

● Designed for large scale (100's of terabytes of data) batch oriented information processing: archiving, transformation, exploration, etc.
● Reliable while using commodity HW and open source
● Main components:
  ● Distributed File System (HDFS)
  ● Map Reduce: distributed data processing
  ● Associated infrastructure components, query mechanisms and machine learning
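The Map Reduce idea can be sketched in a single Python process with the classic word count. This does not use the Hadoop API; the function names `map_phase`, `shuffle` and `reduce_phase` are illustrative, and in a real cluster each phase would run in parallel on many machines over data stored in HDFS.

```python
from collections import defaultdict

def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)

def map_reduce(records):
    """Run map, shuffle and reduce over an iterable of text records."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

Because each record is mapped independently and each key is reduced independently, both phases can be spread across many nodes, which is what makes the "brute force" approach scale.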

Page 30: Big Data overview

Big Data – Hadoop Example

● Derive meaning from logs:
  ● Who is using the web site? IP, location, device, etc.
  ● What pages are they looking at? How long, how often?
  ● Are they buying? Adding products to cart? Checking out?
  ● What are the trends?
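The questions above map onto simple aggregations over log lines. The sketch below assumes a made-up two-field log format (`ip path`) and treats a visit to `/checkout` as a purchase; real web server logs carry more fields and need a proper parser.

```python
from collections import Counter

# Hypothetical, simplified access log: "<ip> <path>" per line.
SAMPLE_LOG = """\
10.0.0.1 /home
10.0.0.1 /product/42
10.0.0.2 /home
10.0.0.1 /cart/add
10.0.0.1 /checkout
10.0.0.3 /product/42
"""

def summarize(log_text):
    """Answer: who visits, which pages are viewed, and how many visitors buy."""
    visitors = set()
    page_views = Counter()
    buyers = set()
    for line in log_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        ip, path = line.split()
        visitors.add(ip)
        page_views[path] += 1
        if path == "/checkout":  # assumption: reaching checkout counts as buying
            buyers.add(ip)
    return {
        "unique_visitors": len(visitors),
        "page_views": dict(page_views),
        "conversion_rate": len(buyers) / len(visitors) if visitors else 0.0,
    }
```

On a real site the same counting logic would be expressed as map and reduce steps so that it can run over terabytes of logs.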

Page 31: Big Data overview

Big Data – Real-Time

● Goal is to process data from highly dynamic sources in real time
● Data is typically streamed to the processing system and stored / processed directly in memory
● Complex Event Processing has been around for years but needs a new architecture for Big Data scale and distributed processing: Storm and Kafka are among the frameworks that could become the "Hadoop" of Real-Time
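A common building block of in-memory stream processing is a sliding time window. The sketch below is plain Python and does not use Storm or Kafka; the `SlidingWindowCounter` class is hypothetical and assumes events arrive with non-decreasing timestamps.

```python
from collections import Counter, deque

class SlidingWindowCounter:
    """Counts events per key over the last `window_seconds`, entirely in memory."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, key) pairs, oldest first
        self.counts = Counter()

    def add(self, timestamp, key):
        """Record one event, then drop everything that fell out of the window."""
        self.events.append((timestamp, key))
        self.counts[key] += 1
        self._expire(timestamp)

    def _expire(self, now):
        # Events at or before (now - window) are no longer inside the window.
        while self.events and self.events[0][0] <= now - self.window_seconds:
            _, old_key = self.events.popleft()
            self.counts[old_key] -= 1
            if self.counts[old_key] == 0:
                del self.counts[old_key]

    def count(self, key):
        return self.counts[key]
```

Each event is handled once on arrival and once on expiry, so the counter keeps up with the stream without ever rescanning stored data, which is the main contrast with the batch approach.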

Page 32: Big Data overview

Big Data – Real-Time Example

● Derive meaning from tweets:
  ● How well is the brand trending?
  ● By time, category?
  ● Compared to competitors
  ● Sentiment?
  ● etc.

http://www.filtize.com/

Page 33: Big Data overview

Big Data – Machine learning

● What is Machine learning?

Page 34: Big Data overview

Big Data – Machine learning

● "A branch of artificial intelligence, that is about the construction and study of systems that can learn from data."
  Supports Predictive Analytics
● Can perform tasks that are too difficult to specify algorithmically
● Example of applications:
  ● Computer vision, natural language processing, fraud detection, game playing, robot locomotion, sentiment analysis, adaptive systems, scientific applications, anomaly detection, recommendation engine, personal assistant, etc.

Page 35: Big Data overview

Big Data – Example

● Handwriting recognition
● Handcrafted rules will result in a large number of rules and exceptions. Best to have a machine that learns from a large training set.

Page 36: Big Data overview

Big Data – Example

● Computer vision: car detection
● First Learning
● Then Testing: Is this a car?

[Figure: training images labeled "Cars" and "Not a car"]

Page 37: Big Data overview

Big Data – Machine learning

● Supervised or unsupervised learning: whether we train the model with labeled examples or the system finds structure on its own
● Types of information processing:
  ● Supervised
    – Classification (discrete)
    – Regression (continuous)
  ● Unsupervised
    – Clustering (discrete)

Page 38: Big Data overview

Big Data – Machine learning
Supervised – Classification / Regression

● First teach the model
● Then verify against the model

Page 39: Big Data overview

Big Data – Machine learning
Classification

● Classifier (single or multi class): given some set of features with corresponding labels, learn a function to predict the labels from the features

[Figure: scatter plot of two classes (x and o) over features x1 and x2]
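A minimal sketch of learning a function from labeled features is the k-nearest neighbor classifier, here over two features (x1, x2) and the two classes "x" and "o". The training points are made up for illustration, and `knn_predict` is a hypothetical helper rather than a library API.

```python
import math
from collections import Counter

# Made-up training set: ((x1, x2), label)
TRAINING = [
    ((1.0, 1.0), "o"), ((1.5, 2.0), "o"), ((2.0, 1.5), "o"),
    ((6.0, 6.5), "x"), ((7.0, 6.0), "x"), ((6.5, 7.5), "x"),
]

def knn_predict(training, point, k=3):
    """Predict the label of `point` as the majority label of its k nearest neighbors."""
    nearest = sorted(training, key=lambda item: math.dist(item[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

Calling `knn_predict(TRAINING, (1.2, 1.4))` looks only at the closest labeled points, so no explicit rule separating the two classes ever has to be written by hand.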

Page 40: Big Data overview

Big Data – Machine learning
Classification

Many algorithms to choose from:
● SVM
● Neural networks
● Naïve Bayes
● Bayesian network
● Logistic regression
● Random forests
● Boosted decision trees
● K-nearest neighbor
● RBMs
● Etc.

● In reality there are many more than 1 variable: size, number of floors, number of rooms, age, location, etc.

Page 41: Big Data overview

Big Data – Machine learning
Regression

● Regression fits an equation to a dataset in order to predict values for new data

Example: calculate the price of a house. In reality there are many more than 1 variable: size, number of floors, number of rooms, age, location, etc.
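The house price example can be sketched with ordinary least squares on a single variable (size). The data points below are invented and lie exactly on a line so the result is easy to check; `fit_line` is an illustrative helper, and real models would use many more variables.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one input variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Invented data: house size in square meters, price in thousands.
sizes_m2 = [50, 80, 100, 120]
prices_k = [150, 240, 300, 360]

slope, intercept = fit_line(sizes_m2, prices_k)
predicted_price = slope * 90 + intercept  # prediction for an unseen 90 m2 house
```

Once the line is fitted, predicting the price of a new house is a single multiplication and addition.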

Page 42: Big Data overview

Big Data – Machine learning
Clustering

● Clustering places data elements into related groups without advance knowledge of the group definitions.
● Example: social network, i.e. similar profiles
● K-means is a popular algorithm for clustering

http://en.wikipedia.org/wiki/K-means_clustering
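A bare-bones version of K-means fits in a few lines of Python. In this sketch the initial centroids are passed in explicitly to keep the run deterministic, whereas real implementations usually choose them randomly; the function name and the fixed iteration count are also choices made for the example.

```python
import math

def k_means(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and moving centroids."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)), key=lambda i: math.dist(p, centroids[i])
            )
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters
```

No labels are supplied anywhere: the groups emerge only from the distances between the points, which is what makes this an unsupervised technique.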

Page 43: Big Data overview

Big Data – Machine learning

● Predictive analytics techniques usage

Page 44: Big Data overview

Big Data – Machine learning

● Designing a high accuracy learning system

“It’s not who has the best algorithm that wins. It’s who has the most data.”

Ex: Classify between confusable words: {to, two, too}, {then, than}
For breakfast I ate _____ eggs.

● Algorithms
  ● Perceptron (Logistic regression)
  ● Winnow
  ● Memory-based
  ● Naïve Bayes

[Figure: accuracy plotted against training set size (millions)]

Page 45: Big Data overview

Big Data landscape

● Storage
● Analytics
● Visualization

Page 46: Big Data overview

Big Data – Visualization

● Helps overcome information overload
● Makes it possible to see patterns and connections, both instantly and over time
● Focus on specific parts of data but also in relation to other parts: data is relative
● Many different tools and techniques can be used based on data sets

http://www.ted.com/talks/david_mccandless_the_beauty_of_data_visualization.html

http://www.ted.com/talks/joann_kuchera_morin_tours_the_allosphere.html

Page 47: Big Data overview

Big Data – Visualization

● Many different types available:
  ● 1D, 2D, 3D
  ● Temporal: timeline, time series, etc.
  ● Advanced types: tag cloud, bubble chart, network graph, rose chart, spider chart, heatmap, tree map, dependency graph, etc.
● Can allow interactivity (navigate, zoom in/out, slice and dice, etc.).

http://guides.library.duke.edu/vis_types

Page 48: Big Data overview

Big Data – Visualization Examples

https://developers.google.com/maps/tutorials/visualizing/earthquakes

http://www.webdesignerdepot.com/2009/06/50-great-examples-of-data-visualization/

Page 49: Big Data overview

Proposed Agenda

● What is Big Data?
● Big Data landscape
● Business / Use cases
● Discussion

Page 50: Big Data overview

Big Data – A Word on Privacy

● Currently mostly ignored: Big Brother?
● Everything is being stored (data retention)
  – Location, calls, SMS, searches, web access, transactions, applications used, contacts, calendar, etc.
● Data doesn't belong to you (Facebook, etc.) and may be resold (based on privacy policy)
● Apps can read your calendar, contacts, etc. and upload data to their servers
● For now users do not seem to care: they care about the service and that it is free (as in $).

Your phone company is watching
Google's drive privacy article
Who's afraid of the bad, big data?

Page 51: Big Data overview

Big Data – And Social Media

An opportunity to:
● Identify trends: tweets, likes, blogs, page views, etc.
● Pinpoint problems: social media data can be used to get sentiment / feedback on products / brands / events (even in real time)
● Predict behavior: what is the trend over time and how does it correlate to particular events?

Page 52: Big Data overview

Big Data – Not just 1 device

http://www.smartinsights.com/mobile-marketing/mobile-marketing-analytics/mobile-marketing-statistics/

Page 53: Big Data overview

Big Data – Mobile is growing faster

http://www.smartinsights.com/mobile-marketing/mobile-marketing-analytics/mobile-marketing-statistics/

Page 54: Big Data overview

Big Data – Shopping habits

Page 55: Big Data overview

Big Data - Business models

● Data is the "new oil"

● Every day, 2.5 quintillion bytes of data are created, with 90 percent of the world's data created in the past two years alone.

● Data production will be 44 times greater in 2020 than in 2009.
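To put the 44x figure in perspective, it can be converted into an implied yearly growth rate, assuming growth compounds evenly over the 11 years from 2009 to 2020 (an assumption made only for this calculation; the function name is illustrative).

```python
def annual_growth_factor(total_factor, start_year, end_year):
    """Yearly multiplier that compounds to `total_factor` between the two years."""
    years = end_year - start_year
    return total_factor ** (1 / years)

yearly = annual_growth_factor(44, 2009, 2020)
yearly_percent = (yearly - 1) * 100  # growth per year, in percent
```

Under that assumption, data production grows by roughly 41 percent every year.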

Page 56: Big Data overview

Big Data - Business models

● Data is the new business model as:
  ● The cost of the HW, SW and networks required to produce and transport data continues to approach an effective cost of zero
  ● Even in the physical manufacturing world, cost will go down: robotics, 3D printing, etc.
  ● Data creates insight, which makes it possible to enhance and disrupt existing business models

Page 57: Big Data overview

Big Data - Business models

● Opportunities for:
  ● Web businesses
    To increase ARPU (average revenue per user)
  ● Enterprises
    Serve their customers better and improve management of suppliers and partners
  ● IoT
    Internet of Things (IoT) or M2M (Machine To Machine), for instance, will allow brand new capabilities and services

Page 58: Big Data overview

Big Data - Business models

● Already used by web businesses (Google, Facebook, etc.) and moving to enterprises

Page 59: Big Data overview

Big Data - Web

● More data can yield more insight, which leads to increased ARPU
● Ex: Ad platform
  Advertisers define ads and campaigns available across web, mobile, TV, etc.
  On Google properties, Google makes money each time an ad is clicked (CPC, cost per click).
  On network members' and content providers' sites, Google makes money each time an ad is clicked (CPC) or displayed (CPM, cost per thousand impressions).

-> Increased relevance and knowledge of the user lead to increased revenues
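The two pricing models can be written down as simple arithmetic. The figures in the usage note are invented, and using the click-through rate (CTR) as a stand-in for ad relevance is an assumption of this sketch, not something the slide states.

```python
def ad_revenue(clicks, cpc, impressions, cpm):
    """Revenue = clicks * CPC + (impressions / 1000) * CPM."""
    return clicks * cpc + impressions / 1000 * cpm

def cpc_revenue(impressions, ctr, cpc):
    """CPC-only revenue when a fraction `ctr` of the displayed ads is clicked."""
    return impressions * ctr * cpc
```

With 100,000 impressions at a CPC of 0.50, raising the click-through rate from 1% to 2% doubles the result of `cpc_revenue`, which is one way to read the link between relevance and revenue.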

Page 60: Big Data overview

Big Data - Enterprises

● All industries are being disrupted

Page 61: Big Data overview

Big Data - Enterprises

Page 62: Big Data overview

Big Data - Enterprises

● Differentiation: satisfy customers, improve existing services and create new service offerings
● Improve processes: from merchandising, forecasting, and purchasing to distribution, allocation, and transportation, etc.
● Data as a service: resell information, analysis and insights

Page 63: Big Data overview

Big Data - Enterprises

Page 64: Big Data overview

Big Data - IoT

More and more machines are connecting and generating data

Page 65: Big Data overview

Big Data - IoT

http://harborresearch.com/wp-content/uploads/2012/05/HarborResearch-nPhase_Paper_March-2011.pdf

Page 66: Big Data overview

Big Data - IoT

http://www.slideshare.net/harborresearch/harbor-research-introduction-to-smart-business-m2-m

Page 67: Big Data overview

Big Data – IoT and Healthcare
Home Healthcare / Tele-Health

● Business and Technology trends
  ● Aging population
  ● Increase in chronic illnesses
  ● Demand from patients for home environment and independence
  ● Cost pressure and scarcity of hospital beds
  ● Affordable and available telecommunications
  ● Computing advances: cost, size, power, performance, imaging, etc.

Page 68: Big Data overview

Proposed Agenda

● What is Big Data?
● Big Data landscape
● Business / Use cases
● Discussion