Serious Data Mining - SIM-Flanders · iMinds Dept. MEDICAL IT KU Leuven ESAT-STADIUS Serious Data...
Transcript of Serious Data Mining - SIM-Flanders · iMinds Dept. MEDICAL IT KU Leuven ESAT-STADIUS Serious Data...
iMinds Dept. MEDICAL IT
KU Leuven ESAT-STADIUS
Serious Data Mining
Prof.Dr. Bart De Moor
Content
Big DataWhatWho
Six issues DataCompute InfrastructureStorage InfrastructureAnalyticsVisualizationSecurity & Privacy
Machine learning as a commodity
Expertise of ESAT-STADIUS, KU Leuven
Books & Spin-offsAlgorithmsApplications 2
1 million = 1 000 0001 billion = 1 000 000 0001 trillion = 1 000 000 000 0001 quadrillion = 1 000 000 000 000 000
1 kB = 1 000 1 MB = 1 000 0001 GB = 1 000 000 0001 TB = 1 000 000 000 0001 PB = 1 000 000 000 000 000
1 TB = large university library= 212 DVD discs = 1430 CDs= 3 year music in CD quality
3
4
5
6
7
Moore’s law:
computing power
doubles
every 18 months
Next GenerationSequencing
Carlson’s law:
complexity/cost
evolves
exponentially
8
Genome data
9
• Human genome project (2003)
– 13 year project
– $300 million value with 2002 technology
• Personal genome (2007)
– Genome of James Watson, 2 months
– $1 000 000
• €1000-genome
– Expected 2012-2020
1,00E-07
1,00E-06
1,00E-05
1,00E-04
1,00E-03
1,00E-02
1,00E-01
1,00E+00
1,00E+01
1,00E+02
1,00E+03
1,00E+04
1,00E+05
1,00E+06
1,00E+07
1,00E+08
1,00E+09
1,00E+10
1,00E+11
1990 1995 2000 2002 2005 2007 2010 2015
Cost per base pair
Genome cost
GS-FLX Roche Applied Science 454
Sequencers
Tsunami of medical data
PACS
UZ Leuven
1,6 PetaByte
Genomics core
HiSeq 2000 full
speed exome
sequencing
1 TeraByte / week
1 small
animal
image
1
GigaByte1 CD-ROM
750
MegaByte
sequencing all newborns
by 2020 (125k births /
year)
125 PetaByte / year
index of 20
million
Biomedical
PubMed
records
23 GigaByte
1 slice mouse
brain MSI at
10 μm
resolution
81 GigaByte
raw NGS data
of 1 full genome
1 TeraByte
10
11
Data explosion in finance
Growing ~ 30-50% every year, half of this is unstructured!
Data storage became 30% cheaper, yet budgets for data storage are still rising.
12
How big is big?
?
Microsoft
1 million servers (2013)
Amazon
?
13
14
15
16
Content
Big DataWhatWho
Six issues DataCompute InfrastructureStorage InfrastructureAnalyticsVisualizationSecurity & Privacy
Machine learning as a commodity
Expertise of ESAT-STADIUS, KU Leuven
Books & Spin-offsAlgorithmsApplications 17
Big Data
18
Data
19
Compute infrastructure
20
Storage
21
22
AnalyticsFlowchart
23
Visualization
24
Security
25
Content
Big DataWhatWho
Six issues DataCompute InfrastructureStorage InfrastructureAnalyticsVisualizationSecurity & Privacy
Machine learning as a commodity
Expertise of ESAT-STADIUS, KU Leuven
Books & Spin-offsAlgorithmsApplications 26
Big Data Landscape
More and moreanalytics as a commodity!
27
Machine Learning as a commodity
28
Big Data LandscapeMany possible applications!
Energy
Industry
Environment
Social networks
Fraud and predictive analysis
Health
…
Focus onSerious Big Data
29
Content
Big DataWhatWho
Six issues DataCompute InfrastructureStorage InfrastructureAnalyticsVisualizationSecurity & Privacy
Machine learning as a commodity
Expertise of ESAT-STADIUS, KU Leuven
Books & Spin-offsAlgorithmsApplications 30
Analytics
Big Data Analytics
Numerical algorithms Rule-based
31
Main tasks
Prediction Segmentation Anomalies
Regression Clustering
Classification
Outlier
32
What can we do?
Shopping cart analysis
Fraud detection
Face recognition
Movie reccomendation
Just-in-time production
Credit worthiness
Disease spreadingTraffic management
33
black blond
orange
blue
long
short
Hair color
length
Color clothes
FeauturesClustersSimilarityDecision
Clustering/Classification
34
Forecasting
35
Deep Learning● Neural networks.● New algorithms.● Multiple layers on
top of each other.● Each layer learns a
more complex representation.
● Learn feature hierarchies.
Stadius - Books
37
Stadius - Spin-offs
38
DATA
ANALYTICS
Supervised learning Unsupervised learning Statistical methodsOther analytical
techniques
General
methodology: Dusan
Popovic, Marc
Claesen, Jack Simm,
Peter Roelants
Esemble Methods :
Dusan Popovic, Marc
Claesen, Jaak Simm
Semi-supervised
learning : Marc
Claesen
Large-scale learning :
Marc Claesen, Nico
Verbeeck
Hyperparameter
optimization : Marc
Claesen, Dusan
Popovic
Optimization meta-
heuristics : Dusan
Popovic, Maira
Rodriguez
Neural Networks :
Peter Roelants, Dusan
Popovic
Deep learning : Peter
Roelants
ICA : Nico Verbeeck
Clustering : Yousef el
Alamaat
Matrix factorization :
Jaak Simm
Manifold learning :
Xian Mao
Decision Trees :
Dusan Popovic, Jaak
Simm
Regularized
regression & GLM :
Yousef el Alamaat,
Arnaud Installe
Wavelet compression
:
Nico Verbeeck
Visualization : Toni
Verbeiren, Ryo Sakai
Multi-task learning :
Jaak Simm
Data fusion : Dusan
Popovic, Pooya Zakeri
, Gorana Nikolic
General
methodology:
Yousef el Alamaat,
Dusan Popovic, Marc
Claesen
SVMs & Kernel
Methods : Marc
Claesen, Jaak Simm,
Pooya Zakeri
Survival analysis :
Marc Claesen
Algorithms in Stadius
39
Big DataApplications
Large-scalescience
Bio-Informatics
Data Assimilation
Health Smart City
Electricity
Internet of Things
Traffic
Finance
Risk Assesment
Stock Trading
Customer Relation
Management
Fraud Detection
Churn
Text Mining
Stadius
40
BIOMEDICAL APPLICATIONS
Diseases/Bodily
processes/Tissues
Technological platforms/data
types
Cancer : Dusan Popovic, Xian
Mao
Diabetes : Marc Claesen
Endometriosis : Yousef el
Alamat, Dusan Popovic, Nico
Verbeek
Mass Spectrometry: Nico
Verbeek, Yousef el Alamat, Xian
Mao
Genomics : Inge Thijs, Dusan
Popovic, Maira Rodriguez, Ryo
Sakai
Proteomics : Inge Thijs, Yousef
el Alamat, Xian Mao, Ryo Sakai,
Pooya Zakeri
Microarrays : Dusan Popovic,
Griet Laenen, Pooya Zakeri
miRNA : Jaak Simm
RNA sequencing : Jaak Simm
NGS : Amin Ardeshirdavani, Jaak
Simm
Metagenomics : Jaak Simm
Rare genetic disorders : Dusan
Popovic, Ryo Sakai
IT support
Object-oriented programming
: Arnaud Installe, Amin
Ardeshirdavani, Gorana Nikolic,
Marc Claesen, Toni
Verbeiren,Xian Mao
Web applications : Amin
Ardeshirdavani, Gorana Nikolic,
Toni Verbeiren
Databases : Arnaud Installe,
Amin Ardeshirdavani, Gorana
Nikolic
Big data technologies (ex.
Hadoop , MapReduce) : Amin
Ardeshirdavani, Jaak Simm, Toni
Verbeiren
Distributed computing : Toni
Verbeiren
Glucose level monitoring : Tom
Van Herpe
Mouse brain : Nico Verbeek,
Yousef el Alamat, Xian Mao
High-performance computing :
Jaak Simm
Bacteria&Fungi : Jaak SimmAgent-based systems : Maira
Rodriguez
Clinical data : Arnaud Installe,
Dusan Popovic, Yousef el Alamat
Annotation data : Pooya Zakeri
Biological networks : Griet
Laenen, Pooya Zakeri
Drug discovery : Griet Laenen,
Pooya Zakeri
41
Energy
Industry
Environment
Social networks
Fraud and predictive analysis
Health
42
Power grid
43
Electric load forecastingProblem
Energy Production
Energy Demand
How to forecast the demand?
44
Electric load forecasting
Data sets
Elia KMI
Transmission system operator
Hourly data
10 substationsconsidered
Meteorological Institute
Weatherconditions
45
Electric load forecasting
Seasonal Effects
What hour? What day? What week?
Weather Effects
Temperature? Sunny?
Accurate forecast
Model update: Every week!
46
47
250 transformer substationsEvery 15 min, 5 years
1 post, 1 week 1 post, four seasons
Electric load forecasting
48
Seasonalities in the load: day, week, year, holidays
49
6 posts, 1 yearSeasonalities, calender holidays !
50
51
- Seasonalities = a priori information (regress Monday on Monday !)- Normalization:
- remove effects of temperature, cloudiness,…. - remove effect of holidays – calender days (use dummies)
- Calculate ‘eigenprofiles’ = daily shape per post
52
Energy
Industry
Environment
Social networks
Fraud and predictive analysis
Health
53
Top 99 %
Bottom 1 %
steam 15.2 T/h
Top 99.75 % (99.8%)
Bottom 0.25 % (0.2 %)
steam 20 T/h
Top 99.8 % (99.8%)
Bottom 0.31 % (0.2 %)
steam 20 T/h
Setpoint
changes
Idealrank top
3 -> 2
54
Modelling for control
900
901
902
903
900
901
902
903
100 200 300 400 500 600 700 800 900 1000
900
901
902
903
No control, Quasi steady state and fast control
100
0 10 20 30 40 50 60 70 80 90 100
Histograms
No c
on
trol
Qu
asi s
tea
dy
co
ntro
lF
ast d
yn
am
ic
co
ntro
l
55
vgl,
hgl
qbg
vopw,
hopw
qman
qopw
K7 A
v1,
h1
E
vs, hs
qK7
qA qE
vs2, hs2
qs
vs3, hs3
D
qs2 qs3
vvg, hvg
qhopw
vs4, hs4
qs4 qhs
vhopw,
hhopw
qvs qD
q2
qzbopw qzb1
qgopw
qgafw
v3,
h3
qvopw
hvopw
qK18
q3 q4
vw,
hw
qK19 qK30
qbgopw
qK7lg
qgl
vbg,
hbg
qK7bg
qzb3
q7 Demer
Zw
artew
ater
Zwartebeek
Vlootgracht
Schulensmeer
Webbekom
Herk
G
ete
Beg
ijn
eb
eek
Velp
e
Leu
geb
eek
Gro
te L
eig
ra
ch
t
Ho
uw
ersb
eek
q6 q5
qK31
qzb2
v4,
h4
vzb,
hzb
qzw
q1
vlg,
hlg
qK24B
hbgopw
qK24A
vzw,
hzw
qzwopw
qhs2
v2,
h2
qh
qsa
MPC Estimator
disturbances
Model Based Predictive Control for Flood Regulation: Demer
56
57
Energy
Industry
Environment
Social networks
Fraud and predictive analysis
Health
58
Data assimilation is the common name given to several numerical techniques thatcombine the outputs of a numerical model with observational data in order toimprove the quality of the model predictions.
Some data assimilation techniques: 3DVAR, 4DVAR, Ensemble Kalman Filter (EnKF)and its variants, Optimal Interpolation (OI), particle filters, etc.
Measurements
Numerical model
( 1) ( ), ( ) ( )k k k k x f x u Gw
Data assimilation techniques
EnKFDEnKF
ETKF
Parameters of the algorithms
Better estimation of
x(k)OI
Data Assimilation
59
0 50 100 150 200 250 3000
20
40
60
80
100
120
140
Concentr
ation [
ug/m
3]
Time [Hours]
Measurements
Aurora (Model)
OI
DEnKF
O3 air-quality stationsAverage of the O3 concentration over
the validation stations
The Deterministic Ensemble Kalman Filter (DEnKF) and the OI technique have beenused to improve the O3 estimates of the Air-quality model AURORA.
- Validation stations
- Assimilation stations Starting date: May 28th, 2005 at midnight
3 E 4 E 5 E 6 E
50 N
51 N
2
46
40
73
22
784
1667
56
19
Data Assimilation
60
o - Validation station
x - Assimilation station
3/g m
Optimal Interpolation
3/g m
DEnKF
3/g m
Free-run of Aurora
3/g m
Average of the O3 concentration field over the 14 day period
61
PM10 air-quality stationsAverage of the PM10 concentration over
the validation stations
The Deterministic Ensemble Kalman Filter (DEnKF) and the OI technique have beenused to improve the PM10 estimates of the Air-quality model AURORA.
- Validation stations
- Assimilation stations Starting date: January 20th, 2010 at midnight
3 E 4 E 5 E 6 E
50 N
51 N 3
11
13
18
0 50 100 150 200 2500
20
40
60
80
100
120
140
Concentr
ation [
ug/m
3]
Time [Hours]
Measurements
Aurora
DEnKF
OI
Data Assimilation
62
Average of the PM10 concentration field
o - Validation station
x - Assimilation station
Optimal Interpolation
3/g m
DEnKF
3/g m
Free-run of Aurora
3/g m
63
Average of the PM10 concentration field
3/g m 3/g m64
Energy
Industry
Environment
Social networks
Fraud and predictive analysis
Health
65
Journal Clustering
Find all aboutspecific topic?
66
Journal Clustering
67
Journal Clustering
68
Journal Clustering
69
Author Collaboration Clustering
70
Author Collaboration Clustering
71
Golub around the world commemoration February 29 2008
72
138 seed papers + all cited and citing publications
Result: 4943 nodes, 6216 edges
Link based clustering identifiestopically homogeneous clusters.
13 papers are writtenby another L. Ljung.
Web of Science based literature network for Lennart Ljung
73
“Bibliometrics”
“Patent analysis” “Information retrieval”
“Social aspects”
“Webometrics”
Community detection
74
Energy
Industry
Environment
Social networks
Fraud and predictive analysis
Health
75
Customer Intelligence
76
Customer Intelligence
Score & Target
Analyze & Model
Measure response
Contact 77
Short
Duration Long
Duration High
Frequency International Same
Destination Off
Peak Call
Forwarding Behaviour
Change Direct call
selling X X X X PABX fraud
X X X X X Freephone
fraud X X X X Premium
rate fraud X X X X Subscription
fraud X Handset
theft X X X X X
Call frequency
Average call duration
Fraud detection on mobile phone network
78
Energy
Industry
Environment
Social networks
Fraud and predictive analysis
Health
79
Participatory
IOTA app:population based assessment of ovariantumour malignancy:
IOTA app available in iTunes app store and on http://homes.esat.kuleuven.be/~sistawww/biomed/iota/
Leuven
Malmö
Monza
London
Maurepas Paris
RomeNapels
Milan
# patients: 1066 + 1938
Lund
Lublin
Genk
Ontario, Canada
Bologna
Sardinia
Beijing, China
80
Performance
Performance of an expert Performance of IOTA models
Performance of old models
Performance of non-experts
You share, we care ! 81
© Armstrong SA et al. Nat Genet. 2002 Jan;30(1):41-7. 12 600 genes 72 patients
- 28 Acute Lymphoblastic Leukemia (ALL)- 24 Acute Myeloid Leukemia (AML)- 20 Mixed Linkage Leukemia (MLL)
Genomic markers for Leukemia
82
High-throughputgenomics
Data analysis Candidate genes
Information sources
Candidate prioritization
Validation
Endeavour: Aerts et al., Nature Biotechnology, 2006
Genomic Data Fusion
83
84
85
86