Extreme Spatio-Temporal Data Analysis

41
Extreme Spatio Temporal Data Analysis in Biomedical Informatics Joel Saltz MD, PhD Director Center for Comprehensive Informatics

description

Talked delivered at XLDB Asia, June 2012

Transcript of Extreme Spatio-Temporal Data Analysis

Page 1: Extreme Spatio-Temporal Data Analysis

Extreme Spatio Temporal Data Analysis in Biomedical Informatics

Joel Saltz MD, PhD

Director Center for Comprehensive Informatics

Page 2: Extreme Spatio-Temporal Data Analysis

C

en

ter

for

Com

pre

hen

siv

e I

nfo

rmati

cs

Contributions

• Computer Science: Methods and middleware for analysis, classification of very large datasets from low dimensional spatio-temporal sensors; methods to carry out comparisons and change detection between sensor datasets

• Biomedical: Mine whole slide image datasets to better predict outcome and response to treatments, generate basic insights into pathophysiology and identify new treatment targets

Page 3: Extreme Spatio-Temporal Data Analysis

C

en

ter

for

Com

pre

hen

siv

e I

nfo

rmati

cs

Outline of Talk

• Pathology: Analysis of Digitized Tissue for Research and Practice

• Feature Clustering: Morphologic Tumor Subtypes in GBM Brain Tumors and Relationship to “omic” classifications

• Whole Slide Image Analysis in Clinical Practice: Neuroblastoma

• Tissue Flow: Multiplex Quantum Dot • HPC/BIGDATA Feature Pipeline• Pathology data analytic tools and techniques

Page 4: Extreme Spatio-Temporal Data Analysis

C

en

ter

for

Com

pre

hen

siv

e I

nfo

rmati

cs

Whole Slide Imaging: Scale

Data per slide: 500MB to 100GBRoughly 250-500M Slides/Year in USA

Total: 0.1-10 Exabytes/year

Page 5: Extreme Spatio-Temporal Data Analysis

C

en

ter

for

Com

pre

hen

siv

e I

nfo

rmati

cs

Pathology Computer Assisted Diagnosis

Shimada, Gurcan, Kong, Saltz

Page 6: Extreme Spatio-Temporal Data Analysis

Computerized Classification System for Grading Neuroblastoma

• Background Identification• Image Decomposition (Multi-

resolution levels)• Image Segmentation

(EMLDA)• Feature Construction (2nd

order statistics, Tonal Features)

• Feature Extraction (LDA) + Classification (Bayesian)

• Multi-resolution Layer Controller (Confidence Region)

No

YesImage Tile

InitializationI = L

Background? Label

Create Image I(L)

Segmentation

Feature Construction

Feature Extraction

Classification

Segmentation

Feature Construction

Feature Extraction

Classifier Training

Down-sampling

Training Tiles

Within ConfidenceRegion ?

I = I -1

I > 1?

Yes

Yes

No

No

TRAINING

TESTING

Page 7: Extreme Spatio-Temporal Data Analysis

C

en

ter

for

Com

pre

hen

siv

e I

nfo

rmati

cs

Page 8: Extreme Spatio-Temporal Data Analysis

C

en

ter

for

Com

pre

hen

siv

e I

nfo

rmati

cs

Direct Study of Relationship Between Image Features vs Clinical Outcome, Response to Treatment, Molecular Information

Page 9: Extreme Spatio-Temporal Data Analysis

In Silico Brain Tumor Center

Anaplastic Astrocytoma(WHO grade III)

Glioblastoma(WHO grade IV)

Page 10: Extreme Spatio-Temporal Data Analysis

C

en

ter

for

Com

pre

hen

siv

e I

nfo

rmati

cs

Morphological Tissue Classification

Nuclei Segmentation

Cellular Features

Lee Cooper,Jun Kong

Whole Slide Imaging

Page 11: Extreme Spatio-Temporal Data Analysis

Cen

ter

for

Com

pre

hen

sive In

form

ati

cs

Consensus clustering of morphological signatures

Study includes 200 million nuclei taken from 480 slides corresponding to 167 distinct patients

Each possibility evaluated using 2000 iterations of K-means to quantify co-clustering

Nuclear Features Used to Classify GBMs

3 2 1

20 40 60 80 100 120 140 160

20

40

60

80

100

120

140

1602 3 4 5 6 725

30

35

40

45

50

# Clusters

Silh

ouet

te A

rea

0 0.5 1

1

2

3

Silhouette Value

Clu

ster

Page 12: Extreme Spatio-Temporal Data Analysis

Cen

ter

for

Com

pre

hen

sive In

form

ati

cs

Clustering identifies three morphological groups• Analyzed 200 million nuclei from 162 TCGA GBMs (462 slides)• Named for functions of associated genes:

Cell Cycle (CC), Chromatin Modification (CM),

Protein Biosynthesis (PB)• Prognostically-significant (logrank p=4.5e-4)

Featu

re I

ndic

es

CC CM PB

10

20

30

40

500 500 1000 1500 2000 2500 3000

0

0.2

0.4

0.6

0.8

1

Days

Sur

viva

l

CC

CM

PB

Page 13: Extreme Spatio-Temporal Data Analysis

Cen

ter

for

Com

pre

hen

sive In

form

ati

cs

• Cox proportional hazards– Gene expression class not significant p=0.58– Morphology clustering p=5.0e-3

Gene Expression Class Associations

CC CM PB0

20

40

60

80

100

Cluster

Sub

type

Per

cent

age

(%)

Classical

MesenchymalNeural

Proneural

Page 14: Extreme Spatio-Temporal Data Analysis

Cen

ter

for

Com

pre

hen

sive In

form

ati

cs

Clustering Validation

• Separate set of 84 GBMs from Henry Ford Hospital• ClusterRepro: CC p=7.2e-3, CM p=1.3e-2

Fea

ture

Ind

ices

CC Mixed CM

10

20

30

40

50

0 20 40 60 80 1000

0.2

0.4

0.6

0.8

1

Months

Sur

viva

l

CCMixedCM

Page 15: Extreme Spatio-Temporal Data Analysis

Cen

ter

for

Com

pre

hen

sive In

form

ati

cs

Associations

Page 16: Extreme Spatio-Temporal Data Analysis

Novel Pathology Modalities

ImagingExcellent Spatial Resolution

Limited Molecular Resolution

GenomicsExcellent Molecular Resolution

Limited Spatial Resolution

100

0’s

of g

en

es

Page 17: Extreme Spatio-Temporal Data Analysis

Quantum Dots

Professor Robin Bostick

Page 18: Extreme Spatio-Temporal Data Analysis

Imaging Pipeline – Feature Extraction

Page 19: Extreme Spatio-Temporal Data Analysis

Example Application: Cancer Stem Cell Niche

• Cancer stem cells– Rare(?), proliferative cells, regenerative– Do they prefer to live near blood vessels, or necrosis?

Page 20: Extreme Spatio-Temporal Data Analysis

Cen

ter

for

Com

pre

hen

sive In

form

ati

cs

• Leverage exascale data and computer resources to squeeze the most out of image, sensor or simulation data

• Run lots of different algorithms to derive same features

• Run lots of algorithms to derive complementary features

• Data models and data management infrastructure to manage data products, feature sets and results from classification and machine learning algorithms

Extreme Spatio-Temporal Sensor Data Analytics

Page 21: Extreme Spatio-Temporal Data Analysis

Cen

ter

for

Com

pre

hen

sive In

form

ati

cs

Application Targets

• Multi-dimensional spatial-temporal datasets– Microscopy image analyses

– Biomass monitoring using satellite imagery

– Weather prediction using satellite and ground sensor data

– Large scale simulations

• Can we analyze 100,000+ microscopy images per hour?

• Correlative and cooperative analysis of data from multiple sensor modalities and sources

• What-if scenarios and multiple design choices or initial conditions

Page 22: Extreme Spatio-Temporal Data Analysis

Cen

ter

for

Com

pre

hen

sive In

form

ati

cs

Biomass Monitoring (joint with ORNL)• Investigate changes in vegetation and land use• Hierarchical, multi-resolution coarse/fine-grained analytics into a

unified framework • Changes identified using high temporal/low spatial resolution MODIS

data• Segmentation and classification methods used to characterize

changes using higher resolution data (e.g. multitemporal AWiFS data)

• Segmentation and classification to identify man-made structures.

Page 23: Extreme Spatio-Temporal Data Analysis

Cen

ter

for

Com

pre

hen

sive In

form

ati

cs

Page 24: Extreme Spatio-Temporal Data Analysis

Core Transformations

• Data Cleaning and Low Level Transformations• Data Subsetting, Filtering, Subsampling• Spatio-temporal Mapping and Registration• Object Segmentation • Feature Extraction, Object Classification• Spatio-temporal Aggregation• Change Detection, Comparison, and Quantification

Page 25: Extreme Spatio-Temporal Data Analysis

Extreme DataCutter

DataCutter Pipeline of filters connected though logical streams

In transit processing

Flow control between filters and streams

Developed 1990s-2000s; led to IBM System S

Extreme DataCutter Two level hierarchical pipeline framework

In transit processing

Coarse grained components coordinated by Manager that coordinates work on pipeline stages between nodes

Fine grained pipeline operations managed at the node level

Both levels employ filter/stream paradigm

Page 26: Extreme Spatio-Temporal Data Analysis

Cen

ter

for

Com

pre

hen

sive In

form

ati

cs

Extreme DataCutter – Two Level Model

Coarse Grained Level

Page 27: Extreme Spatio-Temporal Data Analysis

Cen

ter

for

Com

pre

hen

sive In

form

ati

cs

Node Level Work Scheduling

• Features of Node Level Architectures – Nodes contain CPUs, GPUs– Each CPU contains multiple cores– GPU has complex internal architecture– Data locality within node– Data paths between CPUs and GPUs

Keeneland Node

Page 28: Extreme Spatio-Temporal Data Analysis

Cen

ter

for

Com

pre

hen

sive In

form

ati

cs

Node Level Work Scheduling

• Attempt to minimize data movement• Identify and assign operations that perform

well on GPU • Balance load between CPUs and GPUs• Prefetch data• Identify and use high bandwidth CPU/GPU

data paths• Schedule exclusive GPU access for

components (e.g. morphological reconstruction) requiring fine grained parallelism

Page 29: Extreme Spatio-Temporal Data Analysis

Cen

ter

for

Com

pre

hen

sive In

form

ati

cs

Node Level Work Scheduling

Fine Grained Level

Page 30: Extreme Spatio-Temporal Data Analysis

Cen

ter

for

Com

pre

hen

sive In

form

ati

cs

Brain Tumor Pipeline Scaling on Keeneland (100 Nodes)

Page 31: Extreme Spatio-Temporal Data Analysis

Cen

ter

for

Com

pre

hen

sive In

form

ati

cs

Control Structures for Handling Fine Grained/Runtime Dependent Parallelism in GPUs

Morphological Reconstruction: 8-15 Fold speedup vis one CPU core (Intel i7 2.66 GHz) on NVIDIA C2070 and GTX580 GPUs

Page 32: Extreme Spatio-Temporal Data Analysis

Cen

ter

for

Com

pre

hen

sive In

form

ati

cs

Large Scale Data Management

Implemented with IBM DB2 for large scale pathology image metadata (~million markups per slide)

Represented by a complex data model capturing multi-faceted information including markups, annotations, algorithm provenance, specimen, etc.

Support for complex relationships and spatial query: multi-level granularities, relationships between markups and annotations, spatial and nested relationships

Highly optimized spatial query and analyses

Page 33: Extreme Spatio-Temporal Data Analysis

Spatial Centric – Pathology Imaging “GIS”Point query: human marked point inside a nucleus

.

Window query: return markups contained in a rectangle

Spatial join query: algorithm validation/comparison

Containment query: nuclear featureaggregation in tumor regions

Page 34: Extreme Spatio-Temporal Data Analysis

PAIS (Pathology Analytical Imaging Standards)Supported by caBIG, R01 and ACTSI

PAIS Logical Model 62 UML classes

markups, annotations, imageReferences, provenance

PAIS Data Representation XML (compressed) or HDF5

PAIS Databases loading, managing and

querying and sharing data Native XML DBMS or

RDBMS + SDBMS

class Domain Mo...

Annotation

GeometricShape

CalculationObservation

Specimen

ImageReference

Provenance

User

PAIS

EquipmentGroup

AnatomicEntity

Subject

Field

Project

MicroscopyImageReference

DICOMImageReference

TMAImageReference

Markup

Inference

Region

WholeSlideImageReferencePatient

Surface

Collection

AnnotationReference

10..1

1

0..1

0..*

0..*

1

0..*1

0..1

1 0..*

1

0..1

1

0..1

10..1

10..*

1

0..*

0..*

0..*

1 0..11

0..1

1

0..*

0..1

0..*

1

0..*

1

0..1

1

0..*

10..1

10..1

1

0..*

10..*

1 0..*

1

0..*

PAIS

Page 35: Extreme Spatio-Temporal Data Analysis

Cen

ter

for

Com

pre

hen

sive In

form

ati

cs

Example Query for Integrative Studies• Find mean nuclear feature vector and covariance on

tumor regions for each patient grouped by tumor subtype

PAIS: Example Queries

SELECT c.pais_uid, pc.subtype, AVG(area), AVG(perimeter), AVG(eccentricity), COVARIANCE(area, perimeter), COVARIANCE(area, eccentricity)FROM pais.calculation_flat c,TCGA.PATIENT_CHARACTERISTIC pc, pais.patient p WHERE p.patientid = pc.patient_id AND p.pais_uid = c.pais_uid GROUP BY c.pais_uid, pc.subtype;

2 1 3 4

50 100 150

20

40

60

80

100

120

140

160

0 0.2 0.4 0.6 0.8 1

1

2

3

4

Silhouette Value

Clu

ste

r

Fea

ture

Ind

ices

10

20

30

40

50

60

70

80

90

100

110

0 500 1000 1500 2000 2500 30000

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

DaysS

urvi

val

Cluster 1

Cluster 2Cluster 3

Cluster 4

Page 36: Extreme Spatio-Temporal Data Analysis

Algorithm Validation: Intersection between Two Result Sets (Spatial Join)

PAIS: Example Queries

. .

Page 37: Extreme Spatio-Temporal Data Analysis

Cen

ter

for

Com

pre

hen

sive In

form

ati

cs

VLDB 2012

Change Detection, Comparison, and Quantification

Page 38: Extreme Spatio-Temporal Data Analysis

C

en

ter

for

Com

pre

hen

siv

e I

nfo

rmati

cs

Summary and Perspective

• Large scale integrative data analytic methods and tools to integrate clinical, molecular, Pathology, Radiology data

• Characterize new cancer subtypes and biomarkers, predict outcome, treatment response

• Algorithms to quantify Pathology classification• HPC/BIGDATA analysis pipelines

Page 39: Extreme Spatio-Temporal Data Analysis

C

en

ter

for

Com

pre

hen

siv

e I

nfo

rmati

cs

Importance:

• Computer Science: general approaches to analysis and classification of very large datasets from low dimensional spatio-temporal sensors

• Biomedical: generate basic insights into pathophysiology, clues to new treatments, better ways of evaluating existing treatments and core infrastructure needed for comparative effectiveness research studies

Page 40: Extreme Spatio-Temporal Data Analysis

Thanks to:• In silico center team: Dan Brat (Science PI), Tahsin Kurc, Ashish Sharma, Tony Pan, David

Gutman, Jun Kong, Sharath Cholleti, Carlos Moreno, Chad Holder, Erwin Van Meir, Daniel Rubin, Tom Mikkelsen, Adam Flanders, Joel Saltz (Director)

• caGrid Knowledge Center: Joel Saltz, Mike Caliguiri, Steve Langella co-Directors; Tahsin Kurc, Himanshu Rathod Emory leads

• caBIG In vivo imaging team: Eliot Siegel, Paul Mulhern, Adam Flanders, David Channon, Daniel Rubin, Fred Prior, Larry Tarbox and many others

• In vivo imaging Emory team: Tony Pan, Ashish Sharma, Joel Saltz• Emory ATC Supplement team: Tim Fox, Ashish Sharma, Tony Pan, Edi Schreibmann, Paul

Pantalone• Digital Pathology R01: Foran and Saltz; Jun Kong, Sharath Cholleti, Fusheng Wang, Tony

Pan, Tahsin Kurc, Ashish Sharma, David Gutman (Emory), Wenjin Chen, Vicky Chu, Jun Hu, Lin Yang, David J. Foran (Rutgers)

• NIH/in silico TCGA Imaging Group: Scott Hwang, Bob Clifford, Erich Huang, Dima Hammoud, Manal Jilwan, Prashant Raghavan, Max Wintermark, David Gutman, Carlos Moreno, Lee Cooper, John Freymann, Justin Kirby, Arun Krishnan, Seena Dehkharghani, Carl Jaffe

• ACTSI Biomedical Informatics Program: Marc Overcash, Tim Morris, Tahsin Kurc, Alexander Quarshie, Circe Tsui, Adam Davis, Sharon Mason, Andrew Post, Alfredo Tirado-Ramos

• NSF Scientific Workflow Collaboration: Vijay Kumar, Yolanda Gil, Mary Hall, Ewa Deelman, Tahsin Kurc, P. Sadayappan, Gaurang Mehta, Karan Vahi

Page 41: Extreme Spatio-Temporal Data Analysis

Thanks!