Big Data Analytics Challenges and Opportunities Ahead: Beyond the Hype
Instructor: Kirk Borne, Principal Data Scientist, Booz Allen Hamilton
What is Big Data Analytics and Why Should I Care?
Crystal City, VA
September 30, 2016
Workshop in Two Parts
Part 1: Big Data Analytics
Part 2: Going for the Gold
Outline for Part 1: Big Data Analytics
How did we arrive at Big Data?
Domains of Data
Data Science, ML, and Analytics
The Internet of Things = hyper-Big Data
Simple Applications Demonstrated
Ever since we first explored our world…
Source for graphic: http://www.livescience.com/27663-seven-seas.html
… We have asked questions about everything around us.
Source for graphic: https://jefflynchdev.wordpress.com/tag/adobe-photoshop-lightroom-3/page/5/
The result is…
As we collect evidence (data) to answer our
questions, the data leads to more questions, etc…
Source for graphic: http://www.airshipman.com/use-people-to-do-your-advertising/
which leads to BIG DATA!
Source: https://www.andertoons.com/data/cartoon/7468/the-data-weve-gathered-discussions-about-big-data-are-up-72
Exponential & Combinatorial Growth (all numbers quoted here are from circa 2014)
16 BILLION DVDs needed to store the internet traffic
generated in a single hour, a stack 3x the
height of Mount Everest.
150 BILLION Emails sent every day, up to 70% of which
are spam.
33 PERCENT of children born in the United States
have an online presence prior to birth.
100 HOURS of video uploaded to YouTube every minute
= 16 years of content each day.
30 BILLION pieces of content shared monthly
on Facebook.
300 MILLION photos uploaded to Facebook daily, nearly
20 times larger than all the photos in the
Library of Congress.
2.4 BILLION tweets every 72 hours from more than 550
million active users.
Defining Big Data
We collect evidence (data) to answer our questions about the world around us … How? Why? What if?
– … and that is how we end up in a world of BIG DATA!
Big Data refers to data collections in which “everything is now being quantified and tracked” (= full-population samples of everything = The End of Demographics!)
– Examples: Social networks (Twitter, YouTube), search & online histories, web logs, financial and e-commerce transactions, environment & health monitors (wearable devices, EHRs), IoT, Astronomy,…
– Huge quantities of data are now being used everywhere.
Source for graphic: http://hinalockim.blogspot.com/2012/08/6th-week-cognitive-learning.html
Big Data in Business: Monetization is a Big Challenge
Source for graphic: http://www.gladwinanalytics.com/blog/big-data-business-model-maturity-index-and-the-internet-of-things-iot
Big Data in Business & Government: Analytics-driven innovation
Source: http://www.gao.gov/products/GAO-16-659SP
Big Data in Government: R&D strategic plan
Source: https://www.whitehouse.gov/sites/default/files/microsites/ostp/NSTC/bigdatardstrategicplan-nitrd_final-051916.pdf
Big Data from your body!
Source for graphic: https://datafloq.com/read/body-source-big-data-infographic/413
Big Data in your face!
Source for graphic: http://qz.com/779625/none-of-your-pixelated-or-blurred-information-will-stay-safe-on-the-internet/
Nothing pixelated (or blurred) will stay safe on the internet.
Deep Learning algorithms can discover deep hidden patterns.
Big Data in Science: Discovery at Petascale & Exascale
Source: http://www.extremetech.com/extreme/124561-ibm-to-build-exascale-supercomputer-for-the-worlds-largest-million-antennae-telescope
SKA = Square Kilometer Array
joint project: Australia and South Africa
http://www.ska.gov.au/
~5 exabytes (5,000,000 Terabytes) every day!
Big Data in Environmental Monitoring
From Data to Information to Knowledge to Understanding
Big Data in Science: Example from Astronomy
LSST Construction began 2014. Survey period = 2022-2032
Deep, Wide, Fast Data to answer Big Questions about the Universe
LSST Key Science Drivers: Mapping the Dynamic Universe
– Complete inventory of the Solar System (Near-Earth Objects; killer asteroids???)
– Nature of Dark Energy (Cosmology; Supernovae at edge of the known Universe)
– Optical transients (10 million daily event notifications sent within 60 seconds)
– Digital Milky Way (Dark Matter; Locations and velocities of 20 billion stars!)
LSST in time and space:
– When? ~2022-2032
– Where? Cerro Pachon, Chile
Architect’s design of LSST Observatory
LSST Summary: Big Data & Data Science
• http://www.lsst.org
• 3-Gigapixel camera
• One 6-Gigabyte image every 20 seconds
• 20 Terabytes every night for 10 years
• Repeat images of the entire night sky every 3 nights:
Celestial Cinematography
• 100-Petabyte final image data archive anticipated
all data are public!!!
• 20-Petabyte final database catalog anticipated
~20 trillion sources with 200+ database attributes each.
This is a combinatorial explosion!
~10 million events per night, every night, for 10 years.
Fast categorization and decisions (triage!) required.
Goal: understand our vast dynamic Universe
DEEP
WIDE
FAST
VALUE
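The LSST figures quoted above can be sanity-checked with a few lines of arithmetic. This is a rough sketch only: it assumes decimal units (1 PB = 1,000 TB) and observing on every night of the year, neither of which is stated on the slide.

```python
# Back-of-the-envelope check of the LSST numbers quoted on the slide.
# Assumptions (not from the slide): decimal units and 365 observing nights
# per year, which is an upper bound.
TB_PER_NIGHT = 20
NIGHTS_PER_YEAR = 365
YEARS = 10

raw_petabytes = TB_PER_NIGHT * NIGHTS_PER_YEAR * YEARS / 1000
print(f"Nightly data summed over the survey: ~{raw_petabytes:.0f} PB")

SOURCES = 20e12        # ~20 trillion sources
ATTRIBUTES = 200       # 200+ database attributes each
catalog_values = SOURCES * ATTRIBUTES
print(f"Catalog values to explore: ~{catalog_values:.0e}")

EVENTS_PER_NIGHT = 10e6
total_events = EVENTS_PER_NIGHT * NIGHTS_PER_YEAR * YEARS
print(f"Events to triage over the survey: ~{total_events:.1e}")
```

Under these assumptions the nightly volume adds up to the same order of magnitude as the anticipated 100-Petabyte archive, and the event count shows why automated triage is required.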
The 4 Rewards of Big Data in all Domains
o Knowledge Discovery – Data-to-Discovery (D2D)
o Data-driven Decision Support – Data-to-Decisions (D2D)
o Big ROI (Return On Innovation) – Data-to-Dollars or Data-to-Dividends (D2D)
– Innovative Applications of sense-making from sensors and sentinels everywhere
o Data Science for Social Good – Data for Good (D4G) – follow @DataSci4Good
http://thinkfuture.com/
Challenges to Achieving Rewards
The 3 V’s of Big Data are not just hype – they represent really big challenges:
1. Volume (DEEP)
2. Variety (WIDE)
3. Velocity (FAST)
But… Volume is not the problem! Storage is manageable.
Data Science & Analytics (integrating and combining disparate data sources to achieve Data-to-Discovery, Data-to-Decisions, and Data-to-Dividends) are hard…
… especially on complex (diverse, high-Variety) and fast-moving (real-time, high-Velocity) data!
Focus on Value Creation through Advanced Analytics / Data Science in order to conquer these challenges.
Source for graphic: http://www.vitria.com/blog/Big-Data-Analytics-Challenges-Facing-All-Communications-Service-Providers/
Some Quick Definitions
Statistics = the practice (and science) of collecting and analyzing numerical data.
Machine Learning (ML) = mathematical algorithms that learn from experience (historical data).
Data Mining = application of ML algorithms to data.
Artificial Intelligence (AI) = application of ML algorithms to robotics and machines.
Data Science = application of scientific method to discovery from data (including statistics, machine learning, and more: visual analytics, machine vision, computational modeling & simulation, semantics, graphs, network analysis, data indexing schemes, …).
Analytics = the products of machine learning & data science.
Source for graphic #1: http://www.satyavedism.com/mathematics-astrophysics/mathematics-resources
Source for graphic #2: http://blogs.sas.com/content/subconsciousmusings/2014/08/22/looking-backwards-looking-forwards-sas-data-mining-and-machine-learning/
Machine Learning: 4 Types of Discovery (algorithms that learn from experience)
1) Class Discovery: Finding new classes of objects (population segments), events, and behaviors. This includes: learning the rules that constrain the class boundaries.
2) Correlation (Predictive and Prescriptive Power) Discovery: Finding patterns and dependencies, which reveal new governing principles or behavioral patterns (the “customer DNA”).
3) Novelty (Surprise!) Discovery: Finding new, rare, one-in-a-[million / billion / trillion] objects and events.
4) Association (or Link) Discovery: Finding unusual (improbable) co-occurring associations.
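As one illustration of the fourth type, Association (Link) Discovery, the sketch below scores item pairs by "lift": how much more often two items co-occur than chance would predict. The basket data and the function name `pair_lift` are hypothetical, and lift is only one of several possible association measures.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: each set holds items observed together.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread", "milk"},
    {"milk", "tea"},
    {"bread", "butter", "milk"},
    {"tea", "jam"},
]

def pair_lift(baskets):
    """Return {(a, b): lift}, where lift = P(a and b) / (P(a) * P(b))."""
    n = len(baskets)
    item_counts = Counter(item for basket in baskets for item in basket)
    pair_counts = Counter(
        pair for basket in baskets for pair in combinations(sorted(basket), 2)
    )
    return {
        (a, b): (count / n) / ((item_counts[a] / n) * (item_counts[b] / n))
        for (a, b), count in pair_counts.items()
    }

lifts = pair_lift(baskets)
# Pairs with lift well above 1 co-occur more often than independence predicts.
for pair, lift in sorted(lifts.items(), key=lambda kv: -kv[1]):
    print(pair, round(lift, 2))
```

A lift near 1 means the pair is unremarkable; the interesting links are the pairs that stand far from 1 in either direction.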
The Data Analytics Revolution
Exploiting the Value Chain: from Digital Data to Information to Knowledge to Insights (and Action)
From Sensors (Measurement & Data Collection) … Big Data (Deep, Fast, Wide)
to Sentinels (Monitoring & Alerts = Information) … Machine Learning
to Sense-making (Knowledge & Insight Discovery) … Data Science
to Cents-making (Your Applications of Data = Action!) … Analytics
… Productizing and Actionizing your Big Data
Data Analytics has evolved with growth in data
5 Levels of Analytics Maturity:
1) Descriptive = hindsight : what happened?
2) Diagnostic = oversight : what is happening, and why?
3) Predictive = foresight : what will happen?
– Predictive : given x, find y (needs historical training data)
4) Prescriptive = insight : how can I prescribe a better outcome?
– Prescriptive : given y, find x (needs comprehensive data set)
5) Cognitive = the “right sight” : asking the right question, at the right time, in the right context, in order to make the right decision!
– Cognitive : the “360 view”, take it all in, ask new questions!
– …to identify your “next-best move” or “next-best action”
– “It is not what you look at that matters – it’s what you see” (Henry David Thoreau)
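The "given x, find y" versus "given y, find x" distinction can be made concrete with a straight-line model. The sketch below is a minimal illustration on hypothetical data (ad spend versus sales); the function names are invented here, and a linear model is assumed only because it is easy to invert.

```python
# Hypothetical historical data: x = ad spend, y = sales (arbitrary units).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Descriptive: summarize what happened with a trend line.
slope, intercept = fit_line(xs, ys)

def predict(x):
    """Predictive: given x, find y."""
    return slope * x + intercept

def prescribe(target_y):
    """Prescriptive: given a desired y, find the x expected to produce it."""
    return (target_y - intercept) / slope

print(predict(6.0))     # expected outcome for a new input
print(prescribe(15.0))  # input needed to reach a target outcome
```

With real data the prescriptive step is rarely a simple inversion; it usually becomes an optimization over many controllable inputs, which is why the slide notes that it needs a comprehensive data set.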
From Descriptive to Predictive to Prescriptive Analytics via Cognitive Analytics: Exploring “Data in Context” leads to new questions and new hypotheses …
http://www.boozallen.com/datascience
The Full Operational Data Analytics Spectrum
© Copyright 2016 Booz Allen Hamilton
Data Science and Analytics
Class Discovery
Correlation (Predictive / Prescriptive Power) Discovery
Surprise (anomaly) Discovery
Association (Link) Discovery
Each one can be applied at the 5 different levels of Analytics Maturity:
Descriptive → Diagnostic → Predictive → Prescriptive → Cognitive
The Future of Big Data Analytics and Data Science
http://www.boozallen.com/datascience
Machine Learning in our Lives
PREDICT: Your Purchase Preferences, Recommender Systems, Credit Scoring, Smart Phone auto-complete, …
OPTIMIZE: Your Thermostat, Your Commute Time and Routing, Personalized Learning, …
DISCOVER: Your Health Issues (wearables), Your Best Deal (Bed & Breakfast or Restaurant), …
DETECT: Your Social Sentiment, Flu Outbreaks, Credit Card Fraud, …
Machine Learning in our Work
PREDICT: Predict outcomes, events, needs, costs, risks, product demand, …
OPTIMIZE: Optimize processes, products, and people (delivery of services, supplies, personnel), …
DISCOVER: Discover insights in publications, social media, quarterly business reports, electronic records, …
DETECT: Detect fraud, anomalies in safety events, behaviors, outbreaks, data usage (HIPAA), cyber systems (data breaches), …
Data Analytics in Medicine & Health Administration
1. Benefits Administration improvement (“ACO = HIE + Analytics”: process mining, best practices, cost-efficiency, success metrics validation)
2. Do Not Pay initiatives (payment error / fraud analytics)
3. Beneficiary Recommendations ("Amazon-style" predictive analytics, prescriptive modeling)
4. Consumer Engagement (personalized online web experience, "marketing analytics")
5. Health Information Exchange (HIE) Exploitation (population health discovery, link analysis, ICD-10 mining)
6. Personalized Healthcare and Patient Wellness (wearables data-sharing/mining, health baselining)
7. Personalized/Precision Medicine and Care Coordination (EHR, HIE monitoring / mining)
8. Predictive Medicine (readmissions, complications, adverse interactions)
9. At-Risk Precursor Analytics (early warning signals of cancer, diabetes, heart disease, suicidal / mental health issues, ...)
10. Patient Trajectories Analysis (mining / segmentation of whole population EHR histories, pathways, outcomes, outliers)
11. Learning Health System Decision Support (advanced analytics embedded in health system data feeds)
12. What Question Should I Be Asking of My Data? (Cognitive Analytics)
Data Science: Applications and Use Cases are everywhere…
Smart Apps (Find best price; real-time travel adjustments; type-ahead texting)
Predictive Retail (Dynamic Pricing, Smart Supply Chain, Precision Demand Forecasting)
Precision Marketing (SegOne, Personalized Real-time Ad Campaigns for Next Best Offer)
Smart Highways (Real-time intelligence among vehicles, weather, roads, repairs)
Precision Traffic (Self-driving & Self-parking Connected Cars)
Smart Cities (Growth, Dynamic Street-lighting, Smart Energy Usage)
Predictive Law Enforcement (Predictive, Prescriptive personnel & resource placements)
Smart Healthcare (Wearables, Personalized Medicine, Patient/Provider Monitoring)
Invisibles (under-the-skin smart sensors that measure, learn, respond) = The Internet of Emotions!
Personalized Online Education (Dynamic learning, Gamification, Real-time interventions)
Precision Forests, Farms, Vineyards,… (Data-driven Planning, Nurturing, Harvesting)
Fintech / Banks / Insurance (Fast Risk analysis, Fraud detection, Personalized services)
Smart Organizations (Talent Placement, Employee Retention, Workforce Deployment, Process Mining for Efficiencies, Workflow recommender engines)
Predictive Machines (Early Warning, Prescriptive Maintenance & Obsolescence, IoT, Industrial IoT)
The XYZ of Data Science: Intelligence at the edge of the network (Edge Analytics at the point of data collection)
Smart X
– Smart Cities
– Smart Highways
– Smart Supply Chain
Precision Y
– Precision Medicine
– Precision Farming
– Precision Pricing
Personalized Z
– Personalized Health
– Personalized Learning
– Personalized Shopping Experience
http://www.loopcayman.com/content/if-smart-cities-are-next-big-thing-what-about-smart-regions
Internet of Things: Everything Interconnected
The Internet of Things (IoT) is an interconnected universe of Dynamic Data-Driven Application Systems (DDDAS)
https://www.nsf.gov/news/news_images.jsp?cntn_id=122028
Drive Big Benefits with Big Data Analytics Triage
General example of Data Analytics Triage in IoT: Event Mining in Dynamic Big Data Collections for Actionable Intelligence:
Behavior modeling (anomaly & trend detection) and ad hoc inquiry for Discovery
Identifying, characterizing, & responding to events for data-driven Decisions
Deciding which events need immediate investigation and/or intervention = Action!
Many other examples:
Web user engagement & recommendations (from web analytics data)
Customer churn early warning (from 360-view customer data)
Predictive Maintenance alerts (from machine / engine sensors)
Infrastructure Monitoring alerts (from ubiquitous sensors)
Supply chain monitoring (from manufacturing & shipping sensors)
Cybersecurity alerts (from network logs)
Preventive Fraud alerts (from financial applications)
Health alerts (from EHRs and national health systems)
Tsunami alerts (from geo sensors everywhere)
Social event alerts or early warnings (from social media)
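The triage idea above (decide which events need immediate action) can be sketched as a simple ranking step. Everything in this example is an assumption made for illustration: the event names, the anomaly scores, the cost weights, and the rule that priority = score × cost.

```python
import heapq

# Hypothetical events: (source, anomaly score in [0, 1], cost of ignoring it).
events = [
    ("engine-sensor-12", 0.91, 8),
    ("web-session-774", 0.15, 1),
    ("network-log-3", 0.70, 9),
    ("card-txn-5521", 0.95, 6),
    ("street-light-40", 0.40, 2),
]

def triage(events, act_now=2, threshold=3.0):
    """Split events into 'act now', 'review later', and 'ignore' lists."""
    scored = [(score * cost, source) for source, score, cost in events]
    # The highest-priority events get immediate attention.
    urgent = heapq.nlargest(act_now, scored)
    urgent_sources = {source for _, source in urgent}
    review = [s for p, s in scored if s not in urgent_sources and p >= threshold]
    ignore = [s for p, s in scored if s not in urgent_sources and p < threshold]
    return [s for _, s in urgent], review, ignore

act, review, ignore = triage(events)
print("Act now:     ", act)
print("Review later:", review)
print("Ignore:      ", ignore)
```

In a streaming setting the same ranking would run continuously on a bounded window of events rather than on a fixed list.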
Prescriptive Risk Mitigation
Infusing Analytics Capability into your organization
Booz Allen’s approach to helping organizations drive competitive advantage through data analytics
• Activities: Enrich; Integrate and Transform Data. Methods: Descriptive Statistics; Filtering; Aggregation
• Activities: Reveal trends; Identify Correlations; Learn Patterns. Methods: Unsupervised Learning; Clustering; Outlier Detection
• Activities: Classify Signals; Predict Risks; Forecast Resources. Methods: Random Forest; Neural Networks; Bayesian Analysis; Collaborative Filtering
• Activities: Optimize Resources; Simulate Decision Outcomes. Methods: Genetic Algorithms; Integer Programming; Non-Linear Programming; Discrete Event Simulation
Acquisition, aggregation and enrichment of information from multiple entry points will help create a holistic view that can enhance operations, reduce risk, provide powerful insight, and create value.
Enables Effective Operations and Decision-Making
• Allows for accurate analysis of trends across the organization against defined KPIs
• Supports strategic C-Suite decision making
• Reveals operational risks and potential bottlenecks in real-time
• Supports critical information infrastructure protection efforts by early detection of vulnerabilities
Products
Reports | Dashboards | Mitigations
360° Data Acquisition
• Business Operations and Performance Data
• Logs: Systems, Customers,…
• Reports, e-Docs, and Manuals
• Open Data
Mars Rovers (metaphor for general use case)
• Mars Rover = intelligent data-gatherer, mobile data mining
agent, and autonomous decision-support system:
– Gathers data (in situ) for remote sensors
– Performs intelligent (autonomous, cognitive) data mining operations
• Class Discovery
• Correlation (Predictive and Prescriptive Power) Discovery
• Novelty Discovery
• Association Discovery
– Enacts on-board Intelligent Data Understanding & Decision Support
• “Stay here and do more, or move elsewhere”
• “Follow trend to more interesting, lucrative, and productive location”
• “Send results immediately, or store for later analysis”
From Sensors to Sentinels to Sense
• New knowledge and insights are acquired by monitoring and mining actionable data from all digital inputs.
–Sensors!
• Alerts are triggered autonomously, without intervention (when it is permitted), applying machine learning and actionable business decision rules for pattern detection and diagnosis.
–Sentinels! (embedded machine learning / data science algorithms)
• “Smart Sensors” (powered by Machine Learning-enabled sentinels) deliver actionable intelligence.
–Sense!
(applies to any application domain with streaming data from sensors)
Dynamic Data-Driven Application Systems (DDDAS)
4 steps from data to action = MIPS:
– Measurement – Inference – Prediction – Steering
This applies to any Network of Sensors:
– Web user interactions & actions (web analytics data), Cyber network usage logs,
Social network sentiment, Machine logs (of any kind), Manufacturing sensors, Health &
Epidemic monitoring systems, Financial transactions, National Security, Utilities and
Energy, Remote Sensing, Tsunami warnings, Weather/Climate events, Astronomical
sky events, …
– IoT (the Internet of Things) and M2M (Machine-to-Machine): e.g., connected cars,
manufacturing plants, transportation systems, locomotive and jet engines, power grid,
“smart home”, “smart cities”, “smart farms”,…
Machine Learning enables the “IP” part of MIPS:
– Pattern (Segment) Discovery
– Correlation (Trend) Discovery
– Novelty (Anomaly) Discovery
– Association (Link) Discovery
http://dddas.org
Alert & Response systems:
• Actionable insights from
streaming business data
• Automation of any data-driven
operational system
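The four MIPS steps can be sketched as a small control loop over a stream of sensor readings. This is an illustrative toy, not a DDDAS implementation: the smoothing model, the one-step trend forecast, the action names, and the temperature-like readings are all assumptions.

```python
def measure(reading):
    """Measurement: accept one raw sensor reading."""
    return float(reading)

def infer(history, alpha=0.5):
    """Inference: estimate the current level with exponential smoothing."""
    level = history[0]
    for value in history[1:]:
        level = alpha * value + (1 - alpha) * level
    return level

def predict(history, level):
    """Prediction: extrapolate one step ahead using the latest change."""
    trend = history[-1] - history[-2] if len(history) > 1 else 0.0
    return level + trend

def steer(forecast, limit):
    """Steering: choose an action based on the forecast."""
    return "throttle" if forecast > limit else "continue"

def mips_loop(readings, limit):
    """Run Measurement, Inference, Prediction, Steering on each new reading."""
    history, actions = [], []
    for reading in readings:
        history.append(measure(reading))
        level = infer(history)
        forecast = predict(history, level)
        actions.append(steer(forecast, limit))
    return actions

# Hypothetical readings that drift upward toward a limit of 90.
actions = mips_loop([70, 71, 73, 78, 86, 95], limit=90)
print(actions)
```

The machine learning algorithms named on the slide would replace the two middle functions; the loop structure stays the same.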
Big Data Volume is great news…
… more data means less uncertainty, and more laser-focused insights & intelligence!
[Figure panels: 5, 10, 50, 100, 1000, and 10000 data points]
Source for graphics: https://rexplorations.wordpress.com/2015/09/05/animated-mean-and-sample-size/
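The claim that more data means less uncertainty can be checked with a short simulation. The sketch below assumes standard normal data and a fixed random seed; it compares the observed spread of the sample mean with the textbook standard error, 1/sqrt(n).

```python
import random
import statistics

random.seed(42)  # fixed seed so repeated runs give the same numbers

def spread_of_sample_means(n, trials=200):
    """Standard deviation of the sample mean over repeated samples of size n."""
    means = [
        statistics.fmean([random.gauss(0.0, 1.0) for _ in range(n)])
        for _ in range(trials)
    ]
    return statistics.stdev(means)

for n in (5, 10, 50, 100, 1000):
    observed = spread_of_sample_means(n)
    expected = 1.0 / n ** 0.5  # standard error of the mean when sigma = 1
    print(f"n={n:>5}  observed={observed:.3f}  expected={expected:.3f}")
```

The uncertainty shrinks only as the square root of the sample size, so each additional factor of 10 in precision costs a factor of 100 in data.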
…but the greatest of V’s is Variety
Source for graphic: http://www.cs.princeton.edu/courses/archive/spr04/cos598B/bib/BrunnerDPS.pdf
The discovery and separation of classes improves when a sufficient
number of “correct” features are available for exploration:
(a) 2 classes are discovered and become separable:
(b) One trend line becomes 2 clusters:
Feature Selection and Projection
Feature Selection is important to disambiguate different classes. More importantly, Class Discovery depends on selecting the right features!
Feature Selection and Model Bias: choosing features in the dark
I picked out two socks from my sock drawer this morning!
It was still dark, but that shouldn’t matter, right? After all, they are the same size … THE SAME ?!?
The Era of Big Data represents the END OF DEMOGRAPHICS (i.e., our models should no longer be based on and biased by a limited selection of attributes and features)
Insufficient Variety: multiple classes are not distinguishable using this one feature
Sufficient Variety: two classes are discovered using this new feature
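A tiny numeric example shows the same effect: two classes that overlap completely in one feature become cleanly separable once a second feature is added. The records and the `separation` score (gap between class means divided by the average within-class spread) are hypothetical choices for illustration.

```python
import statistics

# Hypothetical records: (feature_1, feature_2, true class).
records = [
    (4.8, 1.0, "A"), (5.1, 1.2, "A"), (5.3, 0.9, "A"), (5.0, 1.1, "A"),
    (4.9, 3.0, "B"), (5.2, 3.2, "B"), (5.0, 2.9, "B"), (5.1, 3.1, "B"),
]

def separation(records, feature_index):
    """Gap between the two class means, in units of average class spread."""
    groups = {}
    for record in records:
        groups.setdefault(record[-1], []).append(record[feature_index])
    a, b = groups.values()  # this sketch assumes exactly two classes
    gap = abs(statistics.fmean(a) - statistics.fmean(b))
    spread = (statistics.stdev(a) + statistics.stdev(b)) / 2
    return gap / spread

print("Feature 1 only:", round(separation(records, 0), 2))  # classes overlap
print("Feature 2 only:", round(separation(records, 1), 2))  # classes separate
```

In practice the class labels are unknown, which is why the discovery step relies on clustering across many candidate features.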
Another example of class discovery in a data set: by exploring high-variety (high-dimensional) data
The separation and discovery of classes improves when a sufficient number of “correct”
features are available for exploration:
Source for graphic: http://www.cs.princeton.edu/courses/archive/spr04/cos598B/bib/BrunnerDPS.pdf
Clustering for Persona Discovery and Customer Personalization
Exploiting the 3rd V of Big Data
(Data Exploration and Data Exploitation)
1. Volume
2. Velocity
3. Variety
Digital Marketing Analytics Evolution: From Demographics to Personalization to Hyper-personalization
http://www.webtwit.com/digital-marketing-company-india.html
360 Customer View in Digital Marketing
Clustering = Class / Segment Discovery
Clustering = the process of partitioning a set of data into subsets (segments or clusters) such that a data element belonging to any chosen cluster is more similar to data elements belonging to that same cluster than to the data elements belonging to other clusters.
= Grouping together similar items, and separating dissimilar items
= Identifying similar characteristics, patterns, or behaviors among subsets of the data elements.
Challenges:
#1) No prior knowledge of the number of clusters.
#2) No prior knowledge of semantic meaning of the clusters.
#3) Different clusters are possible from the same data set!
#4) Selecting different features can lead to different clusters.
Types of Clustering
In general terms, there are two approaches to clustering:
– Partitional – One set of clusters is created (e.g., K-Means clustering – choose K, the number of clusters).
– Hierarchical – Nested sets of clusters are created sequentially.
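The partitional approach can be illustrated with a bare-bones K-Means (Lloyd's algorithm). This sketch makes two simplifying assumptions that a production implementation would not: the first K points seed the centroids, and the loop runs a fixed number of iterations. The points are hypothetical.

```python
import math

def k_means(points, k, iterations=20):
    """Lloyd's algorithm; the first k points seed the centroids (a simplification)."""
    centroids = list(points[:k])
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return labels, centroids

# Hypothetical 2-D data with two obvious groups.
points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (0.9, 1.1), (8.2, 7.9), (7.8, 8.1)]
labels, centroids = k_means(points, k=2)
print(labels)
print(centroids)
```

Changing K or the starting centroids can change the result, which is the practical face of challenges #1 and #3 above.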
Example of Hierarchical Clustering
Starting with (a), then going to (e): Bottom-up, Agglomerative Clustering
Starting with (e), then going to (a): Top-down, Divisive Clustering
The “Google Maps” view for your Customer Space
https://www.researchgate.net/figure/273456906_fig3_Figure-4-Example-of-hierarchical-clustering-clusters-are-consecutively-merged-with-the
Hierarchical Clustering Approaches
Clusters are created at multiple levels – creating a new set of clusters at each level.
There are 2 types of hierarchical clustering:
– Agglomerative Clustering
Bottom-Up
Initially, each item is in its own cluster.
Then, clusters are merged together iteratively ...
– ... based upon similarity of data items.
– Divisive Clustering
Top-Down
Initially, all items are in one cluster.
Then, large clusters are successively divided ...
– ... based upon distance between data items.
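The bottom-up variant can be sketched in a few lines. This example assumes single-linkage (cluster distance = distance between the closest pair of members), which is one common choice among several; the points are hypothetical.

```python
import math

def agglomerate(points):
    """Single-linkage, bottom-up clustering; returns the clusters at every level."""
    clusters = [[p] for p in points]  # initially, each item is in its own cluster
    levels = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        # Find the two clusters whose closest members are nearest to each other.
        i, j = min(
            (
                (a, b)
                for a in range(len(clusters))
                for b in range(a + 1, len(clusters))
            ),
            key=lambda ab: min(
                math.dist(p, q) for p in clusters[ab[0]] for q in clusters[ab[1]]
            ),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
        levels.append([list(c) for c in clusters])
    return levels

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5), (10.0, 0.0)]
levels = agglomerate(points)
for level in levels:
    print(len(level), "clusters:", level)
```

Each entry in `levels` is one zoom level of the "Google Maps" view: many fine-grained segments at the start, one all-inclusive cluster at the end.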
Segmentation of One = ‘SegOne’ Marketing = Personalization
Marketing Campaign Segments = Customer Personas
Digital Marketing: your “Mars Rover” in a box – 1
Mining multi-channel big data streams (across your organization)
o Class Discovery
o Correlation (Predictive and Prescriptive Power) Discovery
o Novelty Discovery
o Association Discovery
Hierarchical Segmentation for Personalization (“SegOne Marketing”)
Decision Automation in a rich content (Big Data) environment
Digital Marketing: your “Mars Rover” in a box – 2
Your own “Smart Sentinel (Mars Rover)” – Your business rules determine the decision points,
alerts, and responses (IF-This-Then-That = IFTTT).
– Move beyond historical hindsight and oversight (Descriptive & Diagnostic Analytics)
– Apply insight and foresight (Predictive & Prescriptive Analytics)
– Achieve right sight for your next-best move (Cognitive Analytics)
the 360 view enables the right question, right action, for the right customer, at the right place, at the right time, in the right context.
Data Science improves your odds in the fundamental business gambit: RISK versus REWARD
http://www.telegraph.co.uk/news/worldnews/europe/russia/10061780/Russian-convicts-beat-Americans-in-cyber-chess-battle.html
Are you ready to reap rewards (the 3 D2D’s) from Hyper-Big Data through Data Science?
Learning from data (Data Science):
– Clustering (= New Class discovery, Segmentation)
– Correlation, Trend, Association, & Link discovery
– Classification, Diagnosis (Predictive power discovery)
– Outlier, Anomaly, Novelty detection (Surprise discovery)
… for business value (the 3 D2D’s):
– Data-to-Discoveries
– Data-to-Decisions
– Data-to-Dividends (big ROI = Return on Innovation)
http://thinkfuture.com/
http://www.hadoop360.com/blog/iot-101-everything-you-need-to-know-to-start-your-iot-project
http://www.dataev.com/it-experts-blog/bid/297713/The-Big-Data-Challenges-of-a-Biotechnology-Startup-Company
SUMMARY – Part 1
Big Data is not about “Big” or “Data”
Big Data is a concept, focused on:
1) Data Science Discovery = Data-to-Discovery
2) Analytics Solutions = Data-to-Decisions
3) Value Creation = Data-to-Dividends (Data-to-Dollars)
… The Right ROI in a Big Data World = Return On Innovation
Machine Learning and Data Science are about:
a) Digital data transformations from Sensors to Sentinels to Sense-Making; and
b) Insights through Predictive & Prescriptive Power Discovery and Cognitive Exploration in DEEP, WIDE, FAST data!
http://www.boozallen.com/datascience @KirkDBorne
Part 2 – Going for the Gold
Steps to Cognitive Analytics
The Data Science Bowl (data for good)
Dare to Change the World
Simple Example of Descriptive, Predictive, Prescriptive, and Cognitive Analytics
Trend Lines in data: Descriptive!
Warning: it is tempting to over-fit every wiggle in the data!
[Figure: melting point vs. boiling point for the 92 Naturally Occurring Elements; all measurements are in kelvins]
This is a better fit to the trend line… for use in Predictive & Prescriptive analytics!
[Figure: 92 Naturally Occurring Elements; all measurements are in kelvins]
Trend Lines and Outliers:
Sometimes we are tempted to think that outliers are just noise.
Where is the real discovery?
Add some context to the data! That diagonal line in the plot (where melting point = boiling point) provides some context (your expectations)!
Why is that point below the line?
There’s the Real Discovery!
Arsenic! Melts @ 1089 K, Boils @ 889 K
Cognitive Surprise Discovery (outlier / anomaly / deviation detection): Knowing the right question to ask!
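The arsenic discovery amounts to testing each data point against a contextual expectation rather than against the trend line. The sketch below uses the arsenic values quoted on the slide; the other elements and their rounded values are approximate textbook figures added here purely for illustration.

```python
# (element, melting point in K, boiling point in K).
# Arsenic uses the values quoted on the slide; the remaining rows are
# approximate, rounded textbook values included only as an assumption-laden
# sample, not as part of the original data set.
elements = [
    ("Nitrogen", 63, 77),
    ("Oxygen", 54, 90),
    ("Mercury", 234, 630),
    ("Lead", 601, 2022),
    ("Iron", 1811, 3134),
    ("Arsenic", 1089, 889),
]

def below_context_line(elements):
    """Elements that violate the expectation 'boiling point >= melting point'."""
    return [name for name, melt, boil in elements if boil < melt]

surprises = below_context_line(elements)
print(surprises)
```

The check is trivial once the right question is asked; the cognitive step is realizing that the diagonal line, not the fitted trend, is the expectation worth testing.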
The Catalyst
• Booz Allen’s Data Science Practice
• Our Passion for Data Science
• Lack of a National Data Science Event
http://www.boozallen.com/datascience/
http://www.datasciencebowl.com/
Citizen Data Science!
About Kaggle
● World’s largest online data science competition community
● Over 500,000 members across ~200 countries
● Community uses diverse backgrounds to solve some of the most complex data science problems in the world
● Extremely strong brand within the data science community
“We and the broader data science community share a common passion, culture, and vision for using data science for social good.”
Last year’s Grand Challenge: $175,000 in prizes (provided by Booz Allen)
Assess ocean health at a speed and scale that were previously impossible.
Services provided by Plankton:
• Provide food for humans and marine animals
• Produce oxygen (phytoplankton)
• Remove CO2 from the atmosphere
• Contribute to global biodiversity
• Provide biomedical products
• Major source of nutrients for indigenous populations
Assess Ocean Health by classifying 121 Classes of Plankton in >160K images
Last year’s winning team: Team Deep Sea
A group of postdocs and graduate students from the University of Ghent in Belgium.
● http://datasciencebe.com/2015/03/18/reslab/
Last year’s winning approach
● Read all about it here: http://benanne.github.io/2015/03/17/plankton.html
● Deep Learning with convolutional neural networks
● Average accuracy of 81% across all 121 plankton classes
● Code available at https://github.com/benanne/kaggle-ndsb
● More than 1000 competing teams
● More than 15,000 submissions
● Recap: http://www.datasciencebowl.com/recap-first-annual-data-science-bowl/
We did it again this year with a $200K heart health challenge!
Data providers and partners: Drs. Michael Hansen and Andrew Arai, of the NIH National Heart, Lung, and Blood Institute (NHLBI); and the Children’s National Medical Center. Other partners include: NVIDIA; American College of Cardiology; The Children’s Inn at NIH; FNIH (Foundation for the NIH); MedStar Institute for Innovation; and more.
The Challenge: improve diagnosis of heart disease through faster, more accurate measurement of ejection fraction (end-systolic and end-diastolic volumes) in cardiac MRI data.
The Data: time-series of MRI scans from over 1000 patients.
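For readers unfamiliar with the metric, ejection fraction follows directly from the two volumes named in the challenge: it is the share of the end-diastolic volume that the heart ejects in one beat. The sketch below uses the standard definition; the function name and the sample volumes are hypothetical, and it is not the competition's scoring code.

```python
def ejection_fraction(end_diastolic_ml, end_systolic_ml):
    """Percentage of the end-diastolic volume ejected in one heartbeat."""
    if end_diastolic_ml <= 0:
        raise ValueError("end-diastolic volume must be positive")
    if not 0 <= end_systolic_ml <= end_diastolic_ml:
        raise ValueError(
            "end-systolic volume must lie between 0 and the end-diastolic volume"
        )
    stroke_volume = end_diastolic_ml - end_systolic_ml
    return 100.0 * stroke_volume / end_diastolic_ml

# Hypothetical volumes in millilitres, for illustration only.
print(round(ejection_fraction(120.0, 50.0), 1))
```

Because the result depends on both volumes, an error in either MRI-derived estimate propagates into the ejection fraction, which is why the challenge focused on measuring the volumes accurately.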
It was not just about improving Cardio Imaging Analytics. It’s about Reed’s story: One in 100 children is born with a congenital heart defect!
Results: Volume Predictions
Data Science Bowl co-winner Tencia Lee visits NIH NHLBI to discuss winning algorithm
http://www.datasciencebowl.com/leading-and-winning-team-submissions-analysis/
2016 Format & Logistics
● Web-based competition (www.DataScienceBowl.com)
● Competition Period: 14 December 2015 through 14 March 2016
● Models were quantitatively scored (i.e., no subjective judging panel)
● We are now seeking ideas for the 2017 Data Science Bowl #3 Challenge:
http://www.datasciencebowl.com
1 GRAND CHALLENGE, 90 DAYS, $200,000 in PRIZES
1st place: $125,000
2nd place: $50,000
3rd place: $25,000
NVIDIA also provided complimentary GPU Technology Conference passes to top 3 teams
Big Data + the IoT + Citizen Data Scientists = Partners in Sustainability
The Internet of Things (IoT):
• Knowing the knowable via deep, wide, and fast data from ubiquitous sensors!
Big Data:
• In the Big Data era, Everything is Quantified and Tracked!
• Examples: Social Networks; Population & Personal Health; Smart Cities & Highways; Retail Analytics; Cybersecurity; IoT = Internet of Things
Sustainable Development Goals
17 SDGs are KPIs for the World! (currently, the SDGs have 229 key performance indicators)
Environmental Monitoring with IoT data
Check out and participate in the EPA Smart City Air Quality Challenge: https://www.epa.gov/innovation/epa-challenges-prizes
EPA is challenging communities to deploy hundreds of air quality sensors and to make the data public!
Submissions due October 28, 2016
$100,000 in prizes
@KirkDBorne
@DataSci4Good
@BoozAllen
Are you ready to change the world with Big Data Analytics?
LISTEN
READ www.boozallen.com/datascience
The Field Guide to Data Science
Building a Data Science Capability
Data Science Answers on Demand
10 Signs of Data Science Maturity
PARTICIPATE datasciencebowl.com
Thank you!
Contact information:
@KirkDBorne
http://www.boozallen.com/datascience