Analysis of Big Data – Focus on Manufacturing Bruce Aldridge, Sr. Analytics Consultant – Teradata
March 21, 2017
Agenda – Big data & manufacturing
• Big data – trends
• Internet of things (IoT)
• Manufacturing data
• Analysis of big data
© 2016 Teradata
Bruce Aldridge, Ph.D. – Senior Data Scientist, Advanced Analytics, Teradata Corporation
Bio:
• Analytic / industry (manufacturing) consultant with Teradata since 2002
• 5 years in disk drive manufacturing (Western Digital)
• 14 years in semiconductor manufacturing (Bell Labs, Motorola, Western Digital)
• Ph.D. Physics – University of California, Riverside
Technical:
• Parallel database analytics
• Failure analysis, yield and reliability physics
• 8 patents issued in analytic methods
Big Data – Hype or…?
Big Data is dynamic
"Big data can be thought of as data sets that won't fit in your current tool set (or in the amount of time available for analysis)."
Because tool sets (and capabilities) are dynamic.
Tool Set improvements
Source: Intel – Semiconductor Technology Scaling
[Chart: speed (delay in seconds) and cost ($) over time across vacuum tube, transistor, NMOS, and CMOS technologies]
Physics is getting in the way
Transistor counts continue to increase, but clock speeds are saturating.
Each year we get more processors rather than faster ones, requiring parallel execution.
Why?
• Heat (i7 ≈ 130 W)
• Distance
  – Speed of light = 186,000 miles/sec ≈ 1 ft/nanosecond (10⁻⁹ sec = one cycle of a 1 GHz clock)
  – A 3 GHz clock cycle = 0.33 ns
  – In that time, information can move about 4 inches (a die is roughly 2/3 inch per side)
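The back-of-the-envelope numbers above can be checked directly (a quick sketch using the exact speed of light):

```python
# How far can a signal travel in one clock cycle?
c_m_per_s = 299_792_458            # speed of light in vacuum (m/s)
clock_hz = 3e9                     # 3 GHz clock
period_s = 1 / clock_hz            # one cycle, ~0.33 ns
distance_m = c_m_per_s * period_s  # distance light covers in one cycle

print(round(period_s * 1e9, 2))    # 0.33 (nanoseconds)
print(round(distance_m * 100, 1))  # 10.0 (centimeters, i.e. ~4 inches)
```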
Parallel computations are hard
• Parallel cores (CPUs) are set up with "separate" memory streams
• Programming methodologies break algorithms into separate parallel paths
• However, speed improvement is limited by non-parallel activities
Performance scales nonlinearly due to the management of functions and data.
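The nonlinear scaling described above is commonly modeled by Amdahl's law; a small sketch (the talk does not give this formula explicitly):

```python
def amdahl_speedup(serial_fraction: float, n_processors: int) -> float:
    """Amdahl's law: speedup is capped by the non-parallel fraction of the work."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_processors)

# Even with 1000 processors, a 5% serial fraction caps speedup near 20x.
print(round(amdahl_speedup(0.05, 1000), 1))  # 19.6
```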
Typical parallel calculation (Hadoop word count)
[Diagram: a single worker counts words serially.]

Parallel Analytics in Teradata (shared nothing)
• A client requests a calculation (e.g., word count) of data values.
• The request goes to the Parsing Engine, which examines the function, determines the execution method, and distributes functions across multiple parallel processors.
• Whenever possible, data is left in place at each CPU (AMP) and the function is executed in parallel.
  – Analytics perform best when the data is locally available and does not have to be redistributed.
  – More complex functions, or data distributed differently, may require redistribution of the data.
[Diagram: tools and applications send a request through the Parsing Engine and Message Passing Layer to parallel AMPs, each running F(x) against its own Vdisk of values, e.g. (Deer, Deer), (River, River), (Car, Car, Car), (Bear, Bear).]
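The word-count pattern can be sketched in a few lines of plain Python – a toy stand-in for the map and reduce phases, not actual Hadoop code:

```python
from collections import Counter

# Toy map-reduce word count: each "worker" counts its own shard in isolation,
# then the partial counts are merged into a global total.
shards = [["deer", "bear", "river"], ["car", "car", "river"], ["deer", "car", "bear"]]

# Map phase: one partial count per shard.
partials = [Counter(shard) for shard in shards]

# Reduce phase: merge partial counts.
total = Counter()
for p in partials:
    total.update(p)

print(total["car"])   # 3
print(total["deer"])  # 2
```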
Scaling Analytics: Challenges, Implementations, Ops

Example: assume you want to compute the mean of the numbers 9, 9, 1, 2, 1, 2, 1, 2, 1, 2, 3. The mean is (1/n)·Σxᵢ = 33/11 = 3. What do you do if the values are scattered across multiple computing nodes?

Node level ("dumb parallel"):
1. Find the mean per node: Node1 mean = 9, Node2 mean = 1.33, Node3 mean = 1.66, Node4 mean = 2
2. Return one answer per node
3. Calculate the mean of the means = 3.5 (wrong!)

System level ("smart parallel"):
1. Get the count n and total for each node: Node1 (count 2, total 18), Node2 (count 3, total 4), Node3 (count 3, total 5), Node4 (count 3, total 6)
2. Aggregate counts (11) and totals (33)
3. Calculate total/count = 3 (correct!)

[Diagram: four AMPs, each running F(x) over its Vdisk of values (9,9), (1,2,1), (2,1,2), (1,2,3).]

Don't do this
Data parallel level
What if the function is not easily parallelized?

Example: assume you want to compute the median of the numbers 9, 9, 1, 2, 1, 2, 1, 2, 1, 2, 3. The median (middle value) of the sorted list {1, 1, 1, 1, 2, 2, 2, 2, 3, 9, 9} is 2. What do you do if the values are scattered across multiple computing nodes?

1. Move all the data to a single process (AMP); the other nodes sit idle.
2. Sort and compute the median on that single node = 2.

Advantages:
• Allows use of existing analytic functions in a parallel environment
• Applicable to large data with multiple calculation sets

Disadvantages:
• Significant data movement between AMPs / disks
• High skew (one AMP much busier than the others)

[Diagram: data from Vdisks (9,9), (1,2,1), (2,1,2), (1,2,3) is gathered onto one AMP as (1,1,1,1,2,2,2,2,3,9,9) before the median is computed.]
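A sketch of the gather-then-compute approach on the slide's numbers:

```python
import statistics

# Median is not decomposable the way a mean is: without a smarter algorithm,
# all shards are gathered onto one "AMP" and sorted there.
nodes = [[9, 9], [1, 2, 1], [2, 1, 2], [1, 2, 3]]

gathered = sorted(x for shard in nodes for x in shard)  # all data moved to one node
print(gathered)                     # [1, 1, 1, 1, 2, 2, 2, 2, 3, 9, 9]
print(statistics.median(gathered))  # 2
```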
Big data and tool set summary
• Big Data is dynamic, generally limited by the analysis / reporting tools
• Large systems are moving to parallel calculations
• This requires:
  – Custom parallelized functions
  – Job management (splitting the workload into parallel streams)
  – Data management (distributing data according to function needs)
  – Splitting and recombination logic (map-reduce)
Big Data Sources

Big Data Growth
Refresher – some terms (and bad puns)
Kilo (k) = 10^3 (1,000; or 1,024 bytes = 2^10); 2×10^3 mockingbirds = 2 kilo mockingbird
Mega (M) = 10^6 (1,000,000; or 2^20 = 1,048,576 bytes); 10^6 phones = megaphone
Giga (G) = 10^9 (1,000,000,000; or 2^30 = 1,073,741,824 bytes); 10^9 lows = gigalow
Tera (T) = 10^12 (1,000,000,000,000; or 2^40 = 1,099,511,627,776 bytes); 10^12 bulls = terabull
Peta (P) = 10^15 (or 2^50); 10^15 coats = petacoat
Exa (E) = 10^18 (or 2^60); 10^18 gyrates = exagyrate
Zetta (Z) = 10^21 (or 2^70); 10^21 -Jones = (Catherine) Zetta-Jones
Yotta (Y) = 10^24 (or 2^80); 10^24 ries = yottory
Bronto (B) = 10^27 (1 with 27 zeros); 10^27 saurus = brontosaurus
Geop = 10^30 (1 with 30 zeros)
Size of 2^64 – the chessboard wheat riddle
As a reward, a peasant asked his king to place 1 grain of wheat on the 1st square of a chessboard, doubling the amount for each square (1, 2, 4, 8, 16, …) until all 64 squares were covered with wheat.
If this could be accomplished, by the 64th square there would be 18,446,744,073,709,551,615 grains of wheat, or about 1.2×10^12 metric tons (global production of wheat in 2014 was 729×10^6 metric tons), or about 1,645 years of wheat at current production.
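The riddle's arithmetic checks out (the ~0.065 g per-grain mass is my assumption, a commonly quoted figure):

```python
# Chessboard doubling: 1 grain on square 1, doubling each square, 64 squares.
grains = sum(2**k for k in range(64))      # = 2**64 - 1
print(grains)                              # 18446744073709551615

# Assuming ~0.065 g per grain of wheat, convert to metric tonnes.
tonnes = grains * 0.065 / 1e6              # grams -> tonnes
print(round(tonnes / 1e12, 2))             # 1.2  (i.e. ~1.2e12 tonnes)
print(round(tonnes / 729e6))               # 1645 (years of 2014 world production)
```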
Some relative (and unverified) scales
Kilo (10^3): 1,024 characters (a page)
Mega (10^6): a small book
Giga (10^9): 100 gigabytes = a library, or a small movie
Tera (10^12): 200 terabytes = the Library of Congress
Peta (10^15): most everything ever printed
Exa (10^18): 5 exabytes = all the words ever spoken by mankind
Yotta (10^24): everything ever put on the World Wide Web
Data projections

What is the cloud?
There is no cloud – just someone else's computer.
"Big Data" over time – a historical view of data sizes
• <1960: physical file systems (storage); mechanical analysis (hundreds to thousands of rows)
• <1970: computer / electronic storage; kilobytes to megabytes of storage ($$$); a few kilobytes at a time for analysis
• <1980: minicomputers, supercomputers; megabytes of storage; hundreds of kilobytes of memory
• <1990: microcomputers to supercomputers; gigabytes of storage; megabyte memory size analysis space
• <2000 to present – the age of the internet; petabytes of storage (dispersed); gigabytes of memory
Data Velocity
[Diagram: sensor / transaction data flows into an analysis server's incoming task queue; when no slot is open, work falls back to a batch / load queue rather than "real time" processing.]
The 3 V's (volume, velocity, variety)

Big Data by Industry
Dark Data – Definition
"Dark data is not carefully indexed and stored, so it becomes nearly invisible to scientists and other potential users and therefore is more likely to remain underutilized and eventually lost." ("Shedding Light on the Dark Data in the Long Tail of Science", P. Bryan Heidorn, Library Trends, Vol. 57, No. 2, Fall 2008, pp. 280–299)

Three types of dark data:
1. Data that is in transactional systems but never makes it to the analytical systems
2. Data that is published but which the publisher does not expect to be analysed (e.g., reports)
3. Data that is not stored – intended for other purposes
Where is the Dark Data?
• Web logs
• Data Historians
• LIMS (laboratory information management system)
• PDFs, reports and free text
• Financial systems
• Sensor data
• Statutory reports to comply with regulations
Internet of Things (IoT)
Increasingly, everything is computerized and networked – which means even more data.
Sensor Data
• Sensors are always on – generating data
• But frequently only used for dashboards and simple rules
Sensor Data (continued)
• Most of the time, sensor data is boring (and ignored or thrown away)
Using Sensor Data
• Anomaly detection
• Abuse
• Efficiency
• Degradation
• Faults
• Records
Sensor Data Processing
• Even if you can keep the detailed data – should you?
  – Generally summarize
  – Compare key parameters to known values (mean, 98th percentile, stddev)
  – Process control summaries
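A minimal sketch of that summarization step (simulated data; NumPy is an assumed tool, not one named in the talk):

```python
import numpy as np

# Summarize a raw sensor stream down to the key statistics the slide suggests
# (mean, standard deviation, 98th percentile) instead of keeping every sample.
rng = np.random.default_rng(0)
samples = rng.normal(loc=20.0, scale=2.0, size=100_000)  # simulated sensor stream

summary = {
    "mean": float(np.mean(samples)),
    "stddev": float(np.std(samples)),
    "p98": float(np.percentile(samples, 98)),
}
print({k: round(v, 2) for k, v in summary.items()})
```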
Sensors sometimes lie
[Chart: sensor output (mV/V) vs. day]
Issues with Sensors and Sensor Data – Aerospace
Extreme values, or sensor reading issues? (SN: 118426)
False sensor readings indicate the aircraft is restarting at Mach 0.90 at altitude, but it is actually on the ground at O'Hare – not an in-air restart as thought.
Even when they aren't lying, sensors don't always tell the whole truth.
[Chart: what the sensor reads vs. what the control unit stores-and-forwards once a threshold is crossed]
IoT Architecture – Management View
[Diagram: sensors and actuators at "the Edge" connect through gateways (policies & orchestration), edge security, device services, and device management; an ingest framework handles streams, batch, and event processing into a data lake and data warehouse; operations & analytics span data access, analytics, visualization, and a data lab; applications, APIs, and micro-services sit on top, with data management, security, integration, analytics, and governance throughout; control systems close the loop.]
Update and improve Edge devices
Operations analysts and data scientists develop algorithms; updates and feature improvements are pushed back out to the edge devices.

New & Improved
Poor security from "edge" devices
How will the IoT evolve? End-to-end value chain optimisation in Smart Farming
[Diagram: a progression from Product → System → System of Systems. Tractors, tillers, planters, and combine harvesters connect into a farm equipment system; rain, humidity, and temperature field sensors and irrigation nodes feed a weather data system and irrigation system (with weather forecasts, weather maps, and a weather data application); a seed database and farm performance database feed a seed optimization system; all roll up into a farm management system.]
Source: "How Smart, Connected Products Are Transforming Competition", Michael E. Porter & James E. Heppelmann, Harvard Business Review
Manufacturing

Social data vs. industrial data: the Internet of (Consumer) Things vs. the Internet of (Industrial) Things
Industrial Sensors (Operational Technology)
"In the industrial sectors, we've had the Things for ages. What we didn't have was the Internet."
Industrial sensors are used for monitoring and controlling operations (OT = Operational Technology).
50
Industry 1.0 Water & steam
power
Industry 2.0Mass production,
assembly line, electricity
Industry 3.0Computerized
automation
Industry 4.0Connected cyber data
OT and sensor data has been used since the 1800’s
51
The Second Great Divide...Industrial things are already instrumented – but sensor data are either discarded…
“99% of data collected
from 30,000 sensors on
an oil rig was lost before
reaching operational
decision makers.”
The Internet of Things:
Mapping The Value Beyond The Hype
McKinsey Global Institute, June 2015
…or the data make it only as far as an OT silo.
"The data that are used today are mostly for anomaly detection and control, not optimisation and prediction, which provide the greatest value." (The Internet of Things: Mapping the Value Beyond the Hype, McKinsey Global Institute, June 2015)
Industrial Intelligence – What Problems Do We Solve?
Complex assets fail for many reasons and require analytics at multiple levels.
1) Sensor data management – curate sensor data to allow for …
2) Quality – OPS improvements
   a) Cumulative fail projections: daily model build for every single part/level (proven/easy). Value: spares & financial planning; continuous improvement; sensor data improves accuracy
   b) Asset field exposure: unique asset incident probability; more sensor data and up-front correlation work required (proven/harder). Value: prioritize assets for examination
3) Asset management – remaining useful life: individual models for high-priority specific failure mechanisms; deep sensor/field data and model build/learning (POC experience/emerging/more difficult). Value: improved utilization; lower maintenance costs; improved contract profitability
Example: Semiconductor Manufacturing
[Chart: product complexity, data volumes, number of variables, time to insight/failure, and IT cost & complexity are all rising.]
• Multi-petabyte data volume
• Million+ variables
• Constantly changing structure
A process is needed to identify the key variables driving yield so that action can be taken.
Technology Challenge
• Data Storage (cost effective)
• Scalable Analytical Engine
• 20,000+ Attributes
• Dynamic Schema
Analytical Challenge
• Large scale multi-linear regression
• Need to operationalize analytics
• Visualization / Front End
Business Challenge
• Highly skilled engineers spending too much time loading / manipulating / analyzing data
• Unable to look at all the data – need to split up into smaller sub-sets
Complexity of the Semiconductor Manufacturing Process
[Diagram: 600 or more steps across 15–30 "loops", with processing loops repeated 15–30+ times, followed by testing.]
The Hunt for Zero Defects – Industry Challenges
• Prevent yield loss and quality escapes, but…
  – Process technology is becoming exponentially complex
  – Yield loss and quality outliers are driven by process variation
  – Close to 1 million process variables can influence yield & quality
  – Finding the top yield & quality factors is like trying to find "needles in a haystack"
• Advanced analytics at scale using ALL the data: move the analytics to the data
Random Forest – repeatedly fit decision trees
• Sample the variables and data (to make individual trees manageable)
• Build hundreds of "independent" trees
• Score the trees for predictors and accuracy
• Determine variable importance (repeated use and impact on accuracy)
• A Random Forest may also be used as a model for what-if analysis
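A minimal sketch of the Random Forest recipe above, using scikit-learn (the library choice is my assumption – the talk does not name one, and the actual work ran in-database):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic "process" data: 20 variables, only a few actually drive the outcome.
X, y = make_classification(n_samples=500, n_features=20, n_informative=3,
                           n_redundant=0, random_state=0)

# Hundreds of "independent" trees, each fit on a sample of rows and variables.
rf = RandomForestClassifier(n_estimators=300, max_features="sqrt", random_state=0)
rf.fit(X, y)

# Variable importance: how often a variable is used and its impact on accuracy.
ranked = sorted(enumerate(rf.feature_importances_), key=lambda t: -t[1])
print([f"V{i}" for i, _ in ranked[:3]])  # the top candidate "key variables"
```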
Correlation with Tool ID – Process Step: Container Integrated Dry Etch (Sensor ID: XYZ)
One tool is driving significantly lower yields! Fix the tool.
Yield improvement estimated at 1%
By setting control limits on the top 5 or 6 factors, yield can be improved by 1%. This is a $100M opportunity for this customer.
Cat 797 Mining Truck
• 400-ton payload capacity
• 4,000 HP, 20-cylinder diesel engine (105 liter)
• 42 MPH top speed under full load
• Fully sensored with telematics
• Utilization is critical (> $200,000/hr operation)
Predictive Analytics – Fluid Anomaly (Analysis #1)
• Purpose: to detect possible anomalies in system fluids (pressure, temperature, viscosity) over time.
• Process:
  1) Cleanse and load data "continuously" into Teradata (analyses grouped into 1–2 hour data buckets)
  2) Perform multiple regression on an engineering-based model (Matlab code ported to run in-database in Teradata using Fuzzy Logix functions) with telematic data
     A. Define a "golden" time period – a period when the equipment was operating normally and in spec; initial run on each unit
     B. Automatically compare subsequent sensor data to the model (residual analysis, ANOVA)
     C. Score shifts based on known engineering parameters
• Automated detection of parameter shifts
• Automated initialization as new equipment comes on line or after select maintenance events
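A toy sketch of steps A and B – fit a model on a "golden" period, then score later readings by their standardized residual (all names and numbers here are hypothetical; the actual implementation used Matlab models ported into Teradata):

```python
import numpy as np

rng = np.random.default_rng(1)

# Golden period: fluid pressure roughly linear in temperature, plus noise.
temp = rng.uniform(60, 100, 500)
pressure = 2.0 * temp + 30 + rng.normal(0, 3, 500)

slope, intercept = np.polyfit(temp, pressure, 1)           # engineering-style model
resid_std = np.std(pressure - (slope * temp + intercept))  # baseline residual spread

def anomaly_score(t, p):
    """Standardized residual of a new reading vs. the golden-period model."""
    return abs(p - (slope * t + intercept)) / resid_std

print(anomaly_score(80.0, 2.0 * 80 + 30) < 2)   # in-spec reading  -> True
print(anomaly_score(80.0, 2.0 * 80 + 60) > 2)   # shifted reading  -> True
```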
BSI Episode 11:
After the government notifies this consumer-goods food producer, Great Brands uses big data technology to isolate contamination sources and run a recall campaign quickly! http://tinyurl.com/kn4ffhh
Analysis of Big Data

Extracting useful signal from time-series sensor data requires "multi-genre" analytics and integrated data.
By themselves, sensor data are of only limited value – answering operational questions requires joining sensor data with Master, Operations, Repair & Maintenance, Supply Chain, HR, and Finance data:
• Probability that a component will fail? ✓ ✓ ✓
• Probable time to failure? ✓ ✓ ✓
• Where is the train now – and over the next few hours? ✓ ✓
• Probable issue and resolution? ✓ ✓ ✓
• Availability of required spare parts? ✓ ✓
• Availability of a suitably qualified engineer? ✓ ✓
• Replacement train available? ✓
• What is the impact of cancellation of service? ✓ ✓
Data Analytics Process
1. Raw sensor data
   – Input: raw sensor data
   – Techniques: N/A
   – Purpose: capture full-fidelity data to enable use-case-specific event detection
2. Cleansed sensor data
   – Input: raw sensor data from adjacent sensors; reference and master data
   – Techniques: interpolation, neural networks, FFTs, smoothing
   – Purpose: interpolation of missing values; "virtual sensor" correction for drift, re-calibration, etc.
3. Event detection
   – Input: alerts data; "whole fleet" historical sensor data; environmental data
   – Techniques: time-series, path, pattern, similarity
   – Purpose: identification of changes of state; signature matching
4. Path-to / event association
   – Input: "whole device" and "whole fleet" historical sensor data
   – Techniques: path, graph, clustering, co-occurrence
   – Purpose: comparison and correlation with other system / device events
5. Labelled sensor data
   – Input: maintenance and operations data
   – Techniques: text, relational
   – Purpose: comparison and correlation with human observations
Extracting useful signal from time-series sensor data requires “multi-genre” Analytics, integrated data
Thousands of sensors, millions of data values: analytics can only deal with so much at a time.
Example: monitoring a gas valve and using the observations for yield and process improvement.
Figure 1: Capability Performance Verification
Demo – Equipment Solenoid Current
Compare to a known standard (and reduce through piecewise approximation).
"On the fly" sensor data can then be analyzed quickly.
Pulse 1 scores high because of a time shift (jitter).
The Problem of Summary Statistics
• Three distributions: bimodal, uniform, Gaussian
• Same median and comparable range
• The difference is obvious only when comparing probability distributions
[Figure: side-by-side boxplots vs. distribution plots of the three distributions on the same scale]
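The point is easy to reproduce: three toy distributions with identical medians but very different shapes (illustrative values, not the slide's data):

```python
import statistics

# Same median (0) and comparable range, yet obviously different shapes --
# a boxplot would make these look alike; a distribution plot would not.
bimodal = [-4, -4, -4, 0, 4, 4, 4]
uniform = [-3, -2, -1, 0, 1, 2, 3]
gaussian = [-2, -1, -0.5, 0, 0.5, 1, 2]

for d in (bimodal, uniform, gaussian):
    print(statistics.median(d), min(d), max(d))
```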
Handling a Million Variables
• Convert to column-based analytics with high-performance pivot / unpivot functions
• Analytics with "unlimited" independent variables

Wide table:
Id | V1  | V2  | V3 | V4 | V5  | V6 | V7  | V8  | V9
AA | 1.2 | 3.1 | 41 | 56 | 'a' | 9  | 0.2 | ?   | ?
AB | 0.9 | 2.7 | 41 | 62 | 'a' | 8  | 0.2 | 1.1 | 7
BA | 1.0 | 2.9 | 42 | 57 | 'b' | 9  | 0.1 | 1.1 | ?

"Unpivot" to a long table (Id, Col ID, Val):
AA | V1 | 1.2
AA | V3 | 41
AB | V1 | 0.9
AB | V8 | 1.1
AB | V2 | 2.7
Multiple tables simply add more rows (e.g., CA | V4 | 56, CB | V3 | 41, BB | V1 | 1.0).
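A sketch of the unpivot step using pandas.melt (the library choice is my assumption; in Teradata this would be an in-database unpivot function):

```python
import pandas as pd

# Wide table from the slide (subset): one column per variable.
wide = pd.DataFrame({
    "Id": ["AA", "AB", "BA"],
    "V1": [1.2, 0.9, 1.0],
    "V2": [3.1, 2.7, 2.9],
})

# Wide -> long: one (Id, ColID, Val) row per populated cell.
tall = wide.melt(id_vars="Id", var_name="ColID", value_name="Val").dropna()
print(len(tall))              # 6 rows: 3 ids x 2 variable columns
print(tall.iloc[0].tolist())  # ['AA', 'V1', 1.2]
```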
Images as Aggregations (a picture is worth 64 kBytes)
• Plotting a million data points can take more time than many analytics – the data must be pulled to a plotting tool
• Frequently, analysis of data requires generating graphic images. Typical graphs include:
  – X-Y scatter plots
  – Quantile plots (e.g., quantile-quantile plots, boxplots)
  – Spatial plots (e.g., wafer maps, heat maps)
• In-database generation avoids moving data from the warehouse to the graphic tool (and back)
• Complex or customized images can be made available to anyone with access to the database (no special application or license required; e.g., wafer maps are not easily generated in standard tools)
• Much faster than bringing large data sets to custom plotting tools
In-database (parallel) graphics generation
[Diagram: a SQL client requests an image from Teradata; plot fragments are aggregated across AMPs into a √64kB × √64kB plot area with surrounding margin area, then extracted as an image.]
Telemetric Data Graphs – powerful in-DB analysis enables the use of simple BI tools
Analysis of big data: 1.1M rows of telemetry (16 of 159 plots/units shown) graphed and stored in-database for evaluation; approximately 2 seconds of processing time.
Parallel analytics available in Teradata
• Built-in (> 300 functions): aggregation (count, min, max, mean, sum, stddev, correlation, regression, …)
• 3rd-party add-in functions (over 600 functions)
A virtually new world – largest listed companies by market capitalisation, $bn
End 2006: Exxon Mobil, General Electric, Gazprom, Microsoft, Citigroup, Bank of America, Royal Dutch Shell, BP, PetroChina, HSBC (sectors: energy, industrials, IT, financials)
2016*: Apple, Alphabet, Microsoft, Berkshire Hathaway, Exxon Mobil, Amazon, Johnson & Johnson, General Electric, China Mobile (sectors: IT, financials, energy, health care, industrials, telecom)
Source: Bloomberg | Economist.com. *At August 24th, 2016
Summary: the largest companies now have data & analytics as their principal product.
Thank you