Analysis of Big Data – Focus on Manufacturing Bruce Aldridge, Sr. Analytics Consultant – Teradata
March 21, 2017
Agenda – Big data & manufacturing
• Big data – trends
• Internet of things (IoT)
• Manufacturing data
• Analysis of big data
© 2016 Teradata
Bruce Aldridge, Ph.D. – Senior Data Scientist, Advanced Analytics, Teradata Corporation
Bio:
• Analytic / industry (manufacturing) consultant with Teradata since 2002
• 5 years in disk drive manufacturing (Western Digital)
• 14 years in semiconductor manufacturing (Bell Labs, Motorola, Western Digital)
• Ph.D. Physics – University of California, Riverside
Technical:
• Parallel database analytics
• Failure analysis, yield and reliability physics
• 8 patents issued in analytic methods
Big Data – Hype or…?
Big Data is dynamic
"Big data can be thought of as data sets that won't fit in your current tool set (or in the amount of time available for analysis)."
Because tool sets (and capabilities) are dynamic.
Tool Set improvements
Source: Intel – Semiconductor Technology Scaling
[Chart: speed (delay in seconds) and cost ($) over time across vacuum tube, transistor, NMOS, and CMOS technologies]
Physics is getting in the way
Transistor counts continue to increase, but clock speeds are saturating.
Each year we get more processors rather than faster ones, requiring parallel execution.
Why?
• Heat (i7 ≈ 130 W)
• Distance
  – Speed of light = 186,000 miles/sec ≈ 1 ft/nanosecond (10⁻⁹ sec = one cycle of a 1 GHz clock)
  – A 3 GHz clock cycle = 0.33 ns
  – In that time, information can move about 4 inches (a die is roughly 2/3 inch per side)
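The back-of-the-envelope numbers above can be checked directly (a quick sketch using the exact speed of light):

```python
# How far can a signal travel in one clock cycle?
c_m_per_s = 299_792_458            # speed of light in vacuum (m/s)
clock_hz = 3e9                     # 3 GHz clock
period_s = 1 / clock_hz            # one cycle, ~0.33 ns
distance_m = c_m_per_s * period_s  # distance light covers in one cycle

print(round(period_s * 1e9, 2))    # 0.33 (nanoseconds)
print(round(distance_m * 100, 1))  # 10.0 (centimeters, i.e. ~4 inches)
```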
Parallel computations are hard
• Parallel cores (CPUs) are set up with "separate" memory streams
• Programming methodologies break algorithms into separate parallel paths
• However, speed improvement is limited by non-parallel activities
Performance scales nonlinearly due to the management of functions and data.
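The nonlinear scaling described above is commonly modeled by Amdahl's law; a small sketch (the talk does not give this formula explicitly):

```python
def amdahl_speedup(serial_fraction: float, n_processors: int) -> float:
    """Amdahl's law: speedup is capped by the non-parallel fraction of the work."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_processors)

# Even with 1000 processors, a 5% serial fraction caps speedup near 20x.
print(round(amdahl_speedup(0.05, 1000), 1))  # 19.6
```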
Typical parallel calculation (Hadoop word count)
[Diagram: a single worker counts words serially.]

Parallel Analytics in Teradata (shared nothing)
• A client requests a calculation (e.g., word count) of data values.
• The request goes to the Parsing Engine, which examines the function, determines the execution method, and distributes functions across multiple parallel processors.
• Whenever possible, data is left in place at each CPU (AMP) and the function is executed in parallel.
  – Analytics perform best when the data is locally available and does not have to be redistributed.
  – More complex functions, or data distributed differently, may require redistribution of the data.
[Diagram: tools and applications send a request through the Parsing Engine and Message Passing Layer to parallel AMPs, each running F(x) against its own Vdisk of values, e.g. (Deer, Deer), (River, River), (Car, Car, Car), (Bear, Bear).]
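The word-count pattern can be sketched in a few lines of plain Python – a toy stand-in for the map and reduce phases, not actual Hadoop code:

```python
from collections import Counter

# Toy map-reduce word count: each "worker" counts its own shard in isolation,
# then the partial counts are merged into a global total.
shards = [["deer", "bear", "river"], ["car", "car", "river"], ["deer", "car", "bear"]]

# Map phase: one partial count per shard.
partials = [Counter(shard) for shard in shards]

# Reduce phase: merge partial counts.
total = Counter()
for p in partials:
    total.update(p)

print(total["car"])   # 3
print(total["deer"])  # 2
```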
Scaling Analytics: Challenges, Implementations, Ops

Example: assume you want to compute the mean of the numbers 9, 9, 1, 2, 1, 2, 1, 2, 1, 2, 3. The mean is (1/n)·Σxᵢ = 33/11 = 3. What do you do if the values are scattered across multiple computing nodes?

Node level ("dumb parallel"):
1. Find the mean per node: Node1 mean = 9, Node2 mean = 1.33, Node3 mean = 1.66, Node4 mean = 2
2. Return one answer per node
3. Calculate the mean of the means = 3.5 (wrong!)

System level ("smart parallel"):
1. Get the count n and total for each node: Node1 (count 2, total 18), Node2 (count 3, total 4), Node3 (count 3, total 5), Node4 (count 3, total 6)
2. Aggregate counts (11) and totals (33)
3. Calculate total/count = 3 (correct!)

[Diagram: four AMPs, each running F(x) over its Vdisk of values (9,9), (1,2,1), (2,1,2), (1,2,3).]

Don't do this
Data parallel level
What if the function is not easily parallelized?

Example: assume you want to compute the median of the numbers 9, 9, 1, 2, 1, 2, 1, 2, 1, 2, 3. The median (middle value) of the sorted list {1, 1, 1, 1, 2, 2, 2, 2, 3, 9, 9} is 2. What do you do if the values are scattered across multiple computing nodes?

1. Move all the data to a single process (AMP); the other nodes sit idle.
2. Sort and compute the median on that single node = 2.

Advantages:
• Allows use of existing analytic functions in a parallel environment
• Applicable to large data with multiple calculation sets

Disadvantages:
• Significant data movement between AMPs / disks
• High skew (one AMP much busier than the others)

[Diagram: data from Vdisks (9,9), (1,2,1), (2,1,2), (1,2,3) is gathered onto one AMP as (1,1,1,1,2,2,2,2,3,9,9) before the median is computed.]
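A sketch of the gather-then-compute approach on the slide's numbers:

```python
import statistics

# Median is not decomposable the way a mean is: without a smarter algorithm,
# all shards are gathered onto one "AMP" and sorted there.
nodes = [[9, 9], [1, 2, 1], [2, 1, 2], [1, 2, 3]]

gathered = sorted(x for shard in nodes for x in shard)  # all data moved to one node
print(gathered)                     # [1, 1, 1, 1, 2, 2, 2, 2, 3, 9, 9]
print(statistics.median(gathered))  # 2
```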
Big data and tool set summary
• Big Data is dynamic, generally limited by the analysis / reporting tools
• Large systems are moving to parallel calculations
• This requires:
  – Custom parallelized functions
  – Job management (splitting the workload into parallel streams)
  – Data management (distributing data according to function needs)
  – Splitting and recombination logic (map-reduce)
Big Data Sources

Big Data Growth
Refresher – some terms (and bad puns)
Kilo (k) = 10^3 (1,000; or 1,024 bytes = 2^10); 2×10^3 mockingbirds = 2 kilo mockingbird
Mega (M) = 10^6 (1,000,000; or 2^20 = 1,048,576 bytes); 10^6 phones = megaphone
Giga (G) = 10^9 (1,000,000,000; or 2^30 = 1,073,741,824 bytes); 10^9 lows = gigalow
Tera (T) = 10^12 (1,000,000,000,000; or 2^40 = 1,099,511,627,776 bytes); 10^12 bulls = terabull
Peta (P) = 10^15 (or 2^50); 10^15 coats = petacoat
Exa (E) = 10^18 (or 2^60); 10^18 gyrates = exagyrate
Zetta (Z) = 10^21 (or 2^70); 10^21 -Jones = (Catherine) Zetta-Jones
Yotta (Y) = 10^24 (or 2^80); 10^24 ries = yottory
Bronto (B) = 10^27 (1 with 27 zeros); 10^27 saurus = brontosaurus
Geop = 10^30 (1 with 30 zeros)
Size of 2^64 – the chessboard wheat riddle
As a reward, a peasant asked his king to place 1 grain of wheat on the 1st square of a chessboard, doubling the amount for each square (1, 2, 4, 8, 16, …) until all 64 squares were covered with wheat.
If this could be accomplished, by the 64th square there would be 18,446,744,073,709,551,615 grains of wheat, or about 1.2×10^12 metric tons (global production of wheat in 2014 was 729×10^6 metric tons), or about 1,645 years of wheat at current production.
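The riddle's arithmetic checks out (the ~0.065 g per-grain mass is my assumption, a commonly quoted figure):

```python
# Chessboard doubling: 1 grain on square 1, doubling each square, 64 squares.
grains = sum(2**k for k in range(64))      # = 2**64 - 1
print(grains)                              # 18446744073709551615

# Assuming ~0.065 g per grain of wheat, convert to metric tonnes.
tonnes = grains * 0.065 / 1e6              # grams -> tonnes
print(round(tonnes / 1e12, 2))             # 1.2  (i.e. ~1.2e12 tonnes)
print(round(tonnes / 729e6))               # 1645 (years of 2014 world production)
```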
Some relative (and unverified) scales
Kilo (10^3): 1,024 characters (a page)
Mega (10^6): a small book
Giga (10^9): 100 gigabytes = a library, or a small movie
Tera (10^12): 200 terabytes = the Library of Congress
Peta (10^15): most everything ever printed
Exa (10^18): 5 exabytes = all the words ever spoken by mankind
Yotta (10^24): everything ever put on the World Wide Web
Data projections

What is the cloud?
There is no cloud – just someone else's computer.
"Big Data" over time – a historical view of data sizes
• <1960: physical file systems (storage); mechanical analysis (hundreds to thousands of rows)
• <1970: computer / electronic storage; kilobytes to megabytes of storage ($$$); a few kilobytes at a time for analysis
• <1980: minicomputers, supercomputers; megabytes of storage; hundreds of kilobytes of memory
• <1990: microcomputers to supercomputers; gigabytes of storage; megabyte memory size analysis space
• <2000 to present – the age of the internet; petabytes of storage (dispersed); gigabytes of memory
Data Velocity
[Diagram: sensor / transaction data flows into an analysis server's incoming task queue; when no slot is open, work falls back to a batch / load queue rather than "real time" processing.]
The 3 V's (volume, velocity, variety)

Big Data by Industry
Dark Data – Definition
"Dark data is not carefully indexed and stored, so it becomes nearly invisible to scientists and other potential users and therefore is more likely to remain underutilized and eventually lost." ("Shedding Light on the Dark Data in the Long Tail of Science", P. Bryan Heidorn, Library Trends, Vol. 57, No. 2, Fall 2008, pp. 280–299)

Three types of dark data:
1. Data that is in transactional systems but never makes it to the analytical systems
2. Data that is published but which the publisher does not expect to be analysed (e.g., reports)
3. Data that is not stored – intended for other purposes
Where is the Dark Data?
• Web logs
• Data Historians
• LIMS (laboratory information management system)
• PDFs, reports and free text
• Financial systems
• Sensor data
• Statutory reports to comply with regulations
Internet of Things (IoT)
Increasingly, everything is computerized and networked – which means even more data.
Sensor Data
• Sensors are always on – generating data
• But frequently only used for dashboards and simple rules
Sensor Data (continued)
• Most of the time, sensor data is boring (and ignored or thrown away)
Using Sensor Data
• Anomaly detection
• Abuse
• Efficiency
• Degradation
• Faults
• Records
Sensor Data Processing
• Even if you can keep the detailed data – should you?
  – Generally summarize
  – Compare key parameters to known values (mean, 98th percentile, stddev)
  – Process control summaries
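A minimal sketch of that summarization step (simulated data; NumPy is an assumed tool, not one named in the talk):

```python
import numpy as np

# Summarize a raw sensor stream down to the key statistics the slide suggests
# (mean, standard deviation, 98th percentile) instead of keeping every sample.
rng = np.random.default_rng(0)
samples = rng.normal(loc=20.0, scale=2.0, size=100_000)  # simulated sensor stream

summary = {
    "mean": float(np.mean(samples)),
    "stddev": float(np.std(samples)),
    "p98": float(np.percentile(samples, 98)),
}
print({k: round(v, 2) for k, v in summary.items()})
```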
Sensors sometimes lie
[Chart: sensor output (mV/V) vs. day]
Issues with Sensors and Sensor Data – Aerospace
Extreme values, or sensor reading issues? (SN: 118426)
False sensor readings indicate the aircraft is restarting at Mach 0.90 at altitude, but it is actually on the ground at O'Hare – not an in-air restart as thought.
Even when they aren't lying, sensors don't always tell the whole truth.
[Chart: what the sensor reads vs. what the control unit stores-and-forwards once a threshold is crossed]
IoT Architecture – Management View
[Diagram: sensors and actuators at "the Edge" connect through gateways (policies & orchestration), edge security, device services, and device management; an ingest framework handles streams, batch, and event processing into a data lake and data warehouse; operations & analytics span data access, analytics, visualization, and a data lab; applications, APIs, and micro-services sit on top, with data management, security, integration, analytics, and governance throughout; control systems close the loop.]
Update and improve Edge devices
Operations analysts and data scientists develop algorithms; updates and feature improvements are pushed back out to the edge devices.

New & Improved
Poor security from "edge" devices
How will the IoT evolve? End-to-end value chain optimisation in Smart Farming
[Diagram: a progression from Product → System → System of Systems. Tractors, tillers, planters, and combine harvesters connect into a farm equipment system; rain, humidity, and temperature field sensors and irrigation nodes feed a weather data system and irrigation system (with weather forecasts, weather maps, and a weather data application); a seed database and farm performance database feed a seed optimization system; all roll up into a farm management system.]
Source: "How Smart, Connected Products Are Transforming Competition", Michael E. Porter & James E. Heppelmann, Harvard Business Review
Manufacturing

Social data vs. industrial data: the Internet of (Consumer) Things vs. the Internet of (Industrial) Things
Industrial Sensors (Operational Technology)
"In the industrial sectors, we've had the Things for ages. What we didn't have was the Internet."
Industrial sensors are used for monitoring and controlling operations (OT = Operational Technology).
50
Industry 1.0 Water & steam
power
Industry 2.0Mass production,
assembly line, electricity
Industry 3.0Computerized
automation
Industry 4.0Connected cyber data
OT and sensor data has been used since the 1800’s
51
The Second Great Divide...Industrial things are already instrumented – but sensor data are either discarded…
“99% of data collected
from 30,000 sensors on
an oil rig was lost before
reaching operational
decision makers.”
The Internet of Things:
Mapping The Value Beyond The Hype
McKinsey Global Institute, June 2015
…or the data make it only as far as an OT silo.
"The data that are used today are mostly for anomaly detection and control, not optimisation and prediction, which provide the greatest value." (The Internet of Things: Mapping the Value Beyond the Hype, McKinsey Global Institute, June 2015)
Industrial Intelligence – What Problems Do We Solve?
Complex assets fail for many reasons and require analytics at multiple levels.
1) Sensor data management – curate sensor data to allow for …
2) Quality – OPS improvements
   a) Cumulative fail projections: daily model build for every single part/level (proven/easy). Value: spares & financial planning; continuous improvement; sensor data improves accuracy
   b) Asset field exposure: unique asset incident probability; more sensor data and up-front correlation work required (proven/harder). Value: prioritize assets for examination
3) Asset management – remaining useful life: individual models for high-priority specific failure mechanisms; deep sensor/field data and model build/learning (POC experience/emerging/more difficult). Value: improved utilization; lower maintenance costs; improved contract profitability
Example: Semiconductor Manufacturing
[Chart: product complexity, data volumes, number of variables, time to insight/failure, and IT cost & complexity are all rising.]
• Multi-petabyte data volume
• Million+ variables
• Constantly changing structure
A process is needed to identify the key variables driving yield so that action can be taken.
Technology Challenge
• Data Storage (cost effective)
• Scalable Analytical Engine
• 20,000+ Attributes
• Dynamic Schema
Analytical Challenge
• Large scale multi-linear regression
• Need to operationalize analytics
• Visualization / Front End
Business Challenge
• Highly skilled engineers spending too much time loading / manipulating / analyzing data
• Unable to look at all the data – need to split up into smaller sub-sets
Complexity of the Semiconductor Manufacturing Process
[Diagram: 600 or more steps across 15–30 "loops", with processing loops repeated 15–30+ times, followed by testing.]
The Hunt for Zero Defects – Industry Challenges
• Prevent yield loss and quality escapes, but…
  – Process technology is becoming exponentially complex
  – Yield loss and quality outliers are driven by process variation
  – Close to 1 million process variables can influence yield & quality
  – Finding the top yield & quality factors is like trying to find "needles in a haystack"
• Advanced analytics at scale using ALL the data: move the analytics to the data
Random Forest – repeatedly fit decision trees
• Sample the variables and data (to make individual trees manageable)
• Build hundreds of "independent" trees
• Score the trees for predictors and accuracy
• Determine variable importance (repeated use and impact on accuracy)
• A Random Forest may also be used as a model for what-if analysis
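A minimal sketch of the Random Forest recipe above, using scikit-learn (the library choice is my assumption – the talk does not name one, and the actual work ran in-database):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic "process" data: 20 variables, only a few actually drive the outcome.
X, y = make_classification(n_samples=500, n_features=20, n_informative=3,
                           n_redundant=0, random_state=0)

# Hundreds of "independent" trees, each fit on a sample of rows and variables.
rf = RandomForestClassifier(n_estimators=300, max_features="sqrt", random_state=0)
rf.fit(X, y)

# Variable importance: how often a variable is used and its impact on accuracy.
ranked = sorted(enumerate(rf.feature_importances_), key=lambda t: -t[1])
print([f"V{i}" for i, _ in ranked[:3]])  # the top candidate "key variables"
```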
Correlation with Tool ID – Process Step: Container Integrated Dry Etch (Sensor ID: XYZ)
One tool is driving significantly lower yields! Fix the tool.
Yield improvement estimated at 1%
By setting control limits on the top 5 or 6 factors, yield can be improved by 1%. This is a $100M opportunity for this customer.
Cat 797 Mining Truck
• 400-ton payload capacity
• 4,000 HP, 20-cylinder diesel engine (105 liter)
• 42 MPH top speed under full load
• Fully sensored with telematics
• Utilization is critical (> $200,000/hr operation)
Predictive Analytics – Fluid Anomaly (Analysis #1)
• Purpose: to detect possible anomalies in system fluids (pressure, temperature, viscosity) over time.
• Process:
  1) Cleanse and load data "continuously" into Teradata (analyses grouped into 1–2 hour data buckets)
  2) Perform multiple regression on an engineering-based model (Matlab code ported to run in-database in Teradata using Fuzzy Logix functions) with telematic data
     A. Define a "golden" time period – a period when the equipment was operating normally and in spec; initial run on each unit
     B. Automatically compare subsequent sensor data to the model (residual analysis, ANOVA)
     C. Score shifts based on known engineering parameters
• Automated detection of parameter shifts
• Automated initialization as new equipment comes on line or after select maintenance events
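A toy sketch of steps A and B – fit a model on a "golden" period, then score later readings by their standardized residual (all names and numbers here are hypothetical; the actual implementation used Matlab models ported into Teradata):

```python
import numpy as np

rng = np.random.default_rng(1)

# Golden period: fluid pressure roughly linear in temperature, plus noise.
temp = rng.uniform(60, 100, 500)
pressure = 2.0 * temp + 30 + rng.normal(0, 3, 500)

slope, intercept = np.polyfit(temp, pressure, 1)           # engineering-style model
resid_std = np.std(pressure - (slope * temp + intercept))  # baseline residual spread

def anomaly_score(t, p):
    """Standardized residual of a new reading vs. the golden-period model."""
    return abs(p - (slope * t + intercept)) / resid_std

print(anomaly_score(80.0, 2.0 * 80 + 30) < 2)   # in-spec reading  -> True
print(anomaly_score(80.0, 2.0 * 80 + 60) > 2)   # shifted reading  -> True
```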
BSI Episode 11:
After the government notifies this consumer-goods food producer, Great Brands uses big data technology to isolate contamination sources and run a recall campaign quickly! http://tinyurl.com/kn4ffhh
Analysis of Big Data

Extracting useful signal from time-series sensor data requires "multi-genre" analytics and integrated data.
By themselves, sensor data are of only limited value – answering operational questions requires joining sensor data with Master, Operations, Repair & Maintenance, Supply Chain, HR, and Finance data:
• Probability that a component will fail? ✓ ✓ ✓
• Probable time to failure? ✓ ✓ ✓
• Where is the train now – and over the next few hours? ✓ ✓
• Probable issue and resolution? ✓ ✓ ✓
• Availability of required spare parts? ✓ ✓
• Availability of a suitably qualified engineer? ✓ ✓
• Replacement train available? ✓
• What is the impact of cancellation of service? ✓ ✓
Data Analytics Process
1. Raw sensor data
   – Input: raw sensor data
   – Techniques: N/A
   – Purpose: capture full-fidelity data to enable use-case-specific event detection
2. Cleansed sensor data
   – Input: raw sensor data from adjacent sensors; reference and master data
   – Techniques: interpolation, neural networks, FFTs, smoothing
   – Purpose: interpolation of missing values; "virtual sensor" correction for drift, re-calibration, etc.
3. Event detection
   – Input: alerts data; "whole fleet" historical sensor data; environmental data
   – Techniques: time-series, path, pattern, similarity
   – Purpose: identification of changes of state; signature matching
4. Path-to / event association
   – Input: "whole device" and "whole fleet" historical sensor data
   – Techniques: path, graph, clustering, co-occurrence
   – Purpose: comparison and correlation with other system / device events
5. Labelled sensor data
   – Input: maintenance and operations data
   – Techniques: text, relational
   – Purpose: comparison and correlation with human observations
Extracting useful signal from time-series sensor data requires “multi-genre” Analytics, integrated data
Thousands of sensors, millions of data values: analytics can only deal with so much at a time.
Example: monitoring a gas valve and using the observations for yield and process improvement.
Figure 1: Capability Performance Verification
Demo – Equipment Solenoid Current
Compare to a known standard (and reduce through piecewise approximation).
"On the fly" sensor data can then be analyzed quickly.
Pulse 1 scores high because of a time shift (jitter).
The Problem of Summary Statistics
• Three distributions: bimodal, uniform, Gaussian
• Same median and comparable range
• The difference is obvious only when comparing probability distributions
[Figure: side-by-side boxplots vs. distribution plots of the three distributions on the same scale]
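The point is easy to reproduce: three toy distributions with identical medians but very different shapes (illustrative values, not the slide's data):

```python
import statistics

# Same median (0) and comparable range, yet obviously different shapes --
# a boxplot would make these look alike; a distribution plot would not.
bimodal = [-4, -4, -4, 0, 4, 4, 4]
uniform = [-3, -2, -1, 0, 1, 2, 3]
gaussian = [-2, -1, -0.5, 0, 0.5, 1, 2]

for d in (bimodal, uniform, gaussian):
    print(statistics.median(d), min(d), max(d))
```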
Handling a Million Variables
• Convert to column-based analytics with high-performance pivot / unpivot functions
• Analytics with "unlimited" independent variables

Wide table:
Id | V1  | V2  | V3 | V4 | V5  | V6 | V7  | V8  | V9
AA | 1.2 | 3.1 | 41 | 56 | 'a' | 9  | 0.2 | ?   | ?
AB | 0.9 | 2.7 | 41 | 62 | 'a' | 8  | 0.2 | 1.1 | 7
BA | 1.0 | 2.9 | 42 | 57 | 'b' | 9  | 0.1 | 1.1 | ?

"Unpivot" to a long table (Id, Col ID, Val):
AA | V1 | 1.2
AA | V3 | 41
AB | V1 | 0.9
AB | V8 | 1.1
AB | V2 | 2.7
Multiple tables simply add more rows (e.g., CA | V4 | 56, CB | V3 | 41, BB | V1 | 1.0).
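A sketch of the unpivot step using pandas.melt (the library choice is my assumption; in Teradata this would be an in-database unpivot function):

```python
import pandas as pd

# Wide table from the slide (subset): one column per variable.
wide = pd.DataFrame({
    "Id": ["AA", "AB", "BA"],
    "V1": [1.2, 0.9, 1.0],
    "V2": [3.1, 2.7, 2.9],
})

# Wide -> long: one (Id, ColID, Val) row per populated cell.
tall = wide.melt(id_vars="Id", var_name="ColID", value_name="Val").dropna()
print(len(tall))              # 6 rows: 3 ids x 2 variable columns
print(tall.iloc[0].tolist())  # ['AA', 'V1', 1.2]
```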
Images as Aggregations (a picture is worth 64 kBytes)
• Plotting a million data points can take more time than many analytics – the data must be pulled to a plotting tool
• Frequently, analysis of data requires generating graphic images. Typical graphs include:
  – X-Y scatter plots
  – Quantile plots (e.g., quantile-quantile plots, boxplots)
  – Spatial plots (e.g., wafer maps, heat maps)
• In-database generation avoids moving data from the warehouse to the graphic tool (and back)
• Complex or customized images can be made available to anyone with access to the database (no special application or license required; e.g., wafer maps are not easily generated in standard tools)
• Much faster than bringing large data sets to custom plotting tools
In-database (parallel) graphics generation
[Diagram: a SQL client requests an image from Teradata; plot fragments are aggregated across AMPs into a √64kB × √64kB plot area with surrounding margin area, then extracted as an image.]
Telemetric Data Graphs – powerful in-DB analysis enables the use of simple BI tools
Analysis of big data: 1.1M rows of telemetry (16 of 159 plots/units shown) graphed and stored in-database for evaluation; approximately 2 seconds of processing time.
Parallel analytics available in Teradata
• Built-in (> 300 functions): aggregation (count, min, max, mean, sum, stddev, correlation, regression, …)
• 3rd-party add-in functions (over 600 functions)
A virtually new world – largest listed companies by market capitalisation, $bn
End 2006: Exxon Mobil, General Electric, Gazprom, Microsoft, Citigroup, Bank of America, Royal Dutch Shell, BP, PetroChina, HSBC (sectors: energy, industrials, IT, financials)
2016*: Apple, Alphabet, Microsoft, Berkshire Hathaway, Exxon Mobil, Amazon, Johnson & Johnson, General Electric, China Mobile (sectors: IT, financials, energy, health care, industrials, telecom)
Source: Bloomberg | Economist.com. *At August 24th, 2016
Summary: the largest companies now have data & analytics as their principal product.
Thank you