Dr. Mahout: Analyzing clinical data using scalable and distributed computing
![Page 1: Dr. Mahout: Analyzing clinical data using scalable and distributed computing](https://reader035.fdocuments.net/reader035/viewer/2022062501/568165de550346895dd8f98c/html5/thumbnails/1.jpg)
Dr. Mahout: Analyzing clinical data using scalable and distributed computing
Shannon Quinn
[email protected] | [email protected]
November 10, 2011
1/29
![Page 2: Dr. Mahout: Analyzing clinical data using scalable and distributed computing](https://reader035.fdocuments.net/reader035/viewer/2022062501/568165de550346895dd8f98c/html5/thumbnails/2.jpg)
Punchline
Cloud computing for biological and clinical data analysis
Problem: high-dimensional, noisy!
Heart tissue: biomedcentral | fMRI: wikipedia | segmentation: biodynamics UCSD
tech2date.com
2/29
![Page 3: Dr. Mahout: Analyzing clinical data using scalable and distributed computing](https://reader035.fdocuments.net/reader035/viewer/2022062501/568165de550346895dd8f98c/html5/thumbnails/3.jpg)
Disclaimer
3/29
Biology jargon
Academic jargon
![Page 4: Dr. Mahout: Analyzing clinical data using scalable and distributed computing](https://reader035.fdocuments.net/reader035/viewer/2022062501/568165de550346895dd8f98c/html5/thumbnails/4.jpg)
My Background
2nd-year Ph.D. student in CPCB Program
Research in bioimage informatics
4/29
![Page 5: Dr. Mahout: Analyzing clinical data using scalable and distributed computing](https://reader035.fdocuments.net/reader035/viewer/2022062501/568165de550346895dd8f98c/html5/thumbnails/5.jpg)
My Background
Other
5/29
http://collegefootballbelt.com/Logos/
http://s3.amazonaws.com/data.tumblr.com/
![Page 6: Dr. Mahout: Analyzing clinical data using scalable and distributed computing](https://reader035.fdocuments.net/reader035/viewer/2022062501/568165de550346895dd8f98c/html5/thumbnails/6.jpg)
Computational biology and …the cloud?
Biological data
• is BIG
• requires repetitive analysis in chunks
• modeling involves linear algebra and statistics
6/29
![Page 7: Dr. Mahout: Analyzing clinical data using scalable and distributed computing](https://reader035.fdocuments.net/reader035/viewer/2022062501/568165de550346895dd8f98c/html5/thumbnails/7.jpg)
Use case 1: protein behavior
[Figure: timescale of relevant motions. Axis ticks: 10^-15, 10^-12, 10^-9, 10^-6, 10^-3, 10^0. Labeled motions: bond vibration, side-chain rotation, domain shifts/max. catalysis, protein folding, global conformational shifts. Annotations: detail, sampling.]
a common tradeoff…
7/29
![Page 8: Dr. Mahout: Analyzing clinical data using scalable and distributed computing](https://reader035.fdocuments.net/reader035/viewer/2022062501/568165de550346895dd8f98c/html5/thumbnails/8.jpg)
Molecular dynamics
8/29
![Page 9: Dr. Mahout: Analyzing clinical data using scalable and distributed computing](https://reader035.fdocuments.net/reader035/viewer/2022062501/568165de550346895dd8f98c/html5/thumbnails/9.jpg)
“The curse of [MD] dimensionality”
MD := for every atom, for every t …
F = ma
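The loop on this slide can be sketched in a few lines. This is only an illustration of why the cost grows with (number of atoms) × (number of time steps): the spring force, the velocity Verlet integrator, and all names and parameter values below are assumptions, since the slide does not say which force field or integrator is used.

```python
def spring_force(x, k=1.0):
    """Toy force (an assumption): each atom is tied to the origin by a spring."""
    return -k * x


def simulate(positions, velocities, mass=1.0, dt=0.01, n_steps=1000):
    """Velocity Verlet integration: for every time step, for every atom, F = ma."""
    xs = list(positions)
    vs = list(velocities)
    forces = [spring_force(x) for x in xs]
    force_evals = 0  # counts force evaluations made inside the loops
    for _ in range(n_steps):          # for every t ...
        for i in range(len(xs)):      # ... for every atom
            a_old = forces[i] / mass  # a = F / m
            xs[i] += vs[i] * dt + 0.5 * a_old * dt * dt
            forces[i] = spring_force(xs[i])
            force_evals += 1
            a_new = forces[i] / mass
            vs[i] += 0.5 * (a_old + a_new) * dt
    return xs, vs, force_evals


# Three 1-D "atoms", initially at rest (made-up starting positions).
xs, vs, force_evals = simulate([1.0, -0.5, 2.0], [0.0, 0.0, 0.0])
# Total energy (kinetic + spring potential) should stay close to its initial value.
energy = sum(0.5 * v * v + 0.5 * x * x for x, v in zip(xs, vs))
```

Even in this toy setting every step touches every atom, and each step's output is one high-dimensional frame of the trajectory.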
9/29
http://icanhascheezburger.files.wordpress.com/
http://www.pdb.org/pdb/explore/explore.do?structureId=3fxi
![Page 10: Dr. Mahout: Analyzing clinical data using scalable and distributed computing](https://reader035.fdocuments.net/reader035/viewer/2022062501/568165de550346895dd8f98c/html5/thumbnails/10.jpg)
Pipeline for MD trajectory analysis
Find a “surface” of protein shapes
1. MD output
2. Define surface (graph!)
3. Partition surface
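The three steps above can be sketched on a single machine as follows. This is a hedged illustration, not Mahout's implementation: it assumes each MD frame has been reduced to a short feature vector, builds a Gaussian affinity graph, and splits it with a plain spectral bipartition (sign of the second eigenvector of the random-walk matrix, found by power iteration). The toy data, `sigma`, and all function names are hypothetical.

```python
import math


def affinity_matrix(points, sigma=1.0):
    """Step 2: Gaussian affinity between every pair of conformations."""
    n = len(points)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            w[i][j] = math.exp(-d2 / (2.0 * sigma ** 2))
    return w


def bipartition(w, n_iter=500):
    """Step 3: split the graph by the sign of the second eigenvector of D^-1 W."""
    n = len(w)
    degree = [sum(row) for row in w]
    total = sum(degree)
    pi = [d / total for d in degree]   # stationary distribution of the random walk
    v = [float(i) for i in range(n)]   # deterministic starting vector
    for _ in range(n_iter):
        # Deflate: remove the trivial constant eigenvector (eigenvalue 1).
        mean = sum(p * x for p, x in zip(pi, v))
        v = [x - mean for x in v]
        # One power-iteration step with the random-walk matrix D^-1 W.
        v = [sum(w[i][j] * v[j] for j in range(n)) / degree[i] for i in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    mean = sum(p * x for p, x in zip(pi, v))
    return [0 if x - mean < 0 else 1 for x in v]


# Step 1: stand-in for MD output, six conformations as 2-D feature vectors.
conformations = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
                 (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
labels = bipartition(affinity_matrix(conformations))
```

The all-pairs affinity matrix is the expensive part: it grows quadratically with the number of frames, which is where distributed matrix operations become attractive.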
10/29
http://www.dillgroup.ucsf.edu/
![Page 11: Dr. Mahout: Analyzing clinical data using scalable and distributed computing](https://reader035.fdocuments.net/reader035/viewer/2022062501/568165de550346895dd8f98c/html5/thumbnails/11.jpg)
Mahout implementation
Defining surface/graph:
MatrixMultiplicationJob (matrixmult)
TransposeJob (transpose)
DistributedLanczosSolver (svd)
StochasticSVD (ssvd)
Partitioning surface/graph:
SpectralKMeans (spectralkmeans)
Eigencuts (eigencuts)
Kmeans (kmeans)
. . .
11/29
![Page 12: Dr. Mahout: Analyzing clinical data using scalable and distributed computing](https://reader035.fdocuments.net/reader035/viewer/2022062501/568165de550346895dd8f98c/html5/thumbnails/12.jpg)
MD in Mahout conclusion
MD simulations (x@Home projects)
Existing Mahout functionality
Additional algorithms
http://folding.stanford.edu/
12/29
![Page 13: Dr. Mahout: Analyzing clinical data using scalable and distributed computing](https://reader035.fdocuments.net/reader035/viewer/2022062501/568165de550346895dd8f98c/html5/thumbnails/13.jpg)
Use case 2: diseases affecting cilia
What are cilia?
• Hairlike structures
• Keep things moving
• Diseased cilia = [image]
13/29
http://fc06.deviantart.net/fs71/f/2010/177/d/5/Sad_Panda_by_jinxii24.jpg
![Page 14: Dr. Mahout: Analyzing clinical data using scalable and distributed computing](https://reader035.fdocuments.net/reader035/viewer/2022062501/568165de550346895dd8f98c/html5/thumbnails/14.jpg)
Importance of correct diagnoses
Symptoms look familiar
Consequences do not
14/29
![Page 15: Dr. Mahout: Analyzing clinical data using scalable and distributed computing](https://reader035.fdocuments.net/reader035/viewer/2022062501/568165de550346895dd8f98c/html5/thumbnails/15.jpg)
Beat pattern of cilia tells a lot!
Clinicians look at cilia motion in making their diagnoses
1. What is the motion called?
2. Can we create a database of motions?
15/29
![Page 16: Dr. Mahout: Analyzing clinical data using scalable and distributed computing](https://reader035.fdocuments.net/reader035/viewer/2022062501/568165de550346895dd8f98c/html5/thumbnails/16.jpg)
Clinicians’ ultimate goal
Category 1 | Category 2 | Category 3
? ? ?
16/29
![Page 17: Dr. Mahout: Analyzing clinical data using scalable and distributed computing](https://reader035.fdocuments.net/reader035/viewer/2022062501/568165de550346895dd8f98c/html5/thumbnails/17.jpg)
Cilia as dynamic textures
Computer vision
Saisan et al. 2001
Properties
17/29
![Page 18: Dr. Mahout: Analyzing clinical data using scalable and distributed computing](https://reader035.fdocuments.net/reader035/viewer/2022062501/568165de550346895dd8f98c/html5/thumbnails/18.jpg)
The [proposed] pipeline
Step 1
• Clinician captures video and uploads it
http://googolplex.dyndns.org/cilia/
18/29
![Page 19: Dr. Mahout: Analyzing clinical data using scalable and distributed computing](https://reader035.fdocuments.net/reader035/viewer/2022062501/568165de550346895dd8f98c/html5/thumbnails/19.jpg)
The [proposed] pipeline
Step 2
• Mahout job: autoregressive modeling
$y_t \sim C x_t$ (Appearance Model)
$x_t \sim A_1 x_{t-1} + \dots$ (Dynamic Model)
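A minimal sketch of the dynamic-model half, under stated assumptions: first order only (a single transition matrix `A`), 2-D hidden states that have already been obtained (for example via an SVD of the video frames), and an ordinary least-squares fit. The toy rotation data and all function names are hypothetical and do not come from the talk.

```python
import math


def fit_transition_matrix(states):
    """Least-squares fit of A in x_t ~ A x_{t-1} for 2-D states."""
    # Accumulate S10 = sum of x_t x_{t-1}^T and S00 = sum of x_{t-1} x_{t-1}^T.
    s10 = [[0.0, 0.0], [0.0, 0.0]]
    s00 = [[0.0, 0.0], [0.0, 0.0]]
    for prev, curr in zip(states[:-1], states[1:]):
        for i in range(2):
            for j in range(2):
                s10[i][j] += curr[i] * prev[j]
                s00[i][j] += prev[i] * prev[j]
    # Explicit inverse of the 2x2 matrix S00.
    det = s00[0][0] * s00[1][1] - s00[0][1] * s00[1][0]
    inv = [[s00[1][1] / det, -s00[0][1] / det],
           [-s00[1][0] / det, s00[0][0] / det]]
    # A = S10 * S00^-1
    return [[sum(s10[i][k] * inv[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def synthesize(a, x0, n_steps):
    """Roll the dynamics forward from x0 to generate a state sequence."""
    xs = [x0]
    for _ in range(n_steps):
        x = xs[-1]
        xs.append((a[0][0] * x[0] + a[0][1] * x[1],
                   a[1][0] * x[0] + a[1][1] * x[1]))
    return xs


# Toy "beat pattern": a slowly decaying rotation, with no noise added.
theta, decay = 0.3, 0.99
true_a = [[decay * math.cos(theta), -decay * math.sin(theta)],
          [decay * math.sin(theta), decay * math.cos(theta)]]
states = synthesize(true_a, (1.0, 0.0), 50)
fitted_a = fit_transition_matrix(states)
```

Because the same `A` can be rolled forward with `synthesize`, a fitted model can also generate new sequences; mapping those states back through `C` would give synthetic frames.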
http://web.media.mit.edu/~tristan/phd/dissertation/figures/manifold2.jpg
19/29
![Page 20: Dr. Mahout: Analyzing clinical data using scalable and distributed computing](https://reader035.fdocuments.net/reader035/viewer/2022062501/568165de550346895dd8f98c/html5/thumbnails/20.jpg)
The [proposed] pipeline
Step 3
• Add the transition matrices to cloud library
A =
20/29
![Page 21: Dr. Mahout: Analyzing clinical data using scalable and distributed computing](https://reader035.fdocuments.net/reader035/viewer/2022062501/568165de550346895dd8f98c/html5/thumbnails/21.jpg)
The [proposed] pipeline
Step 4
• Recompute network with added videos
[Figure: plot with axes “Axis 1” and “Axis 2”, containing a “?” marker]
21/29
![Page 22: Dr. Mahout: Analyzing clinical data using scalable and distributed computing](https://reader035.fdocuments.net/reader035/viewer/2022062501/568165de550346895dd8f98c/html5/thumbnails/22.jpg)
One more thing…
What’s really cool about AR models:
• Can you spot the fake?
Synthetic | Original
22/29
![Page 23: Dr. Mahout: Analyzing clinical data using scalable and distributed computing](https://reader035.fdocuments.net/reader035/viewer/2022062501/568165de550346895dd8f98c/html5/thumbnails/23.jpg)
Mahout implementation
Learning autoregressive models:
MatrixMultiplicationJob (matrixmult)
TransposeJob (transpose)
DistributedLanczosSolver (svd)
StochasticSVD (ssvd)
Comparing autoregressive parameters:
SpectralKMeans (spectralkmeans)
Eigencuts (eigencuts)
Frobenius norm
Tensors
? ? ?
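As one concrete reading of the “Frobenius norm” entry above, the sketch below compares a new transition matrix against a small library and returns the closest entry. It is an assumption about how such a comparison might look, not the talk's method; the library contents and labels are made up.

```python
import math


def frobenius_distance(a, b):
    """Frobenius norm of (a - b): square root of the sum of squared differences."""
    return math.sqrt(sum((x - y) ** 2
                         for row_a, row_b in zip(a, b)
                         for x, y in zip(row_a, row_b)))


def nearest_entry(query, library):
    """Return the name of the library matrix closest to `query`."""
    return min(library, key=lambda name: frobenius_distance(query, library[name]))


# Hypothetical library of transition matrices, keyed by made-up labels.
library = {
    "pattern_a": [[0.9, -0.3], [0.3, 0.9]],
    "pattern_b": [[0.5, 0.0], [0.0, 0.5]],
}
query = [[0.88, -0.31], [0.29, 0.91]]
closest = nearest_entry(query, library)
```

One caveat with this choice: a transition matrix is only defined up to the basis chosen for the hidden states, so a plain Frobenius distance is meaningful only if all models share a comparable basis.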
23/29
![Page 24: Dr. Mahout: Analyzing clinical data using scalable and distributed computing](https://reader035.fdocuments.net/reader035/viewer/2022062501/568165de550346895dd8f98c/html5/thumbnails/24.jpg)
Cilia on Mahout conclusions
Autoregressive modeling uses linear algebra that is already implemented
Maintaining AR library requires new functionality
Mahout framework gives us elbow room
24/29
![Page 25: Dr. Mahout: Analyzing clinical data using scalable and distributed computing](https://reader035.fdocuments.net/reader035/viewer/2022062501/568165de550346895dd8f98c/html5/thumbnails/25.jpg)
Final Thoughts
Biological / biomedical data is large, high-dimensional, and noisy
We extend Mahout’s current linear algebra framework (spectral clustering, autoregressive models)
We provide a cloud framework!
25/29
![Page 26: Dr. Mahout: Analyzing clinical data using scalable and distributed computing](https://reader035.fdocuments.net/reader035/viewer/2022062501/568165de550346895dd8f98c/html5/thumbnails/26.jpg)
Research Group
University of Pittsburgh
• Dr. Chakra Chennubhotla Lab (advisor)
CMU@Qatar
• Dr. Majd Sakr Lab (collaborator)
University of Pittsburgh Medical Center
• Dr. Cecilia Lo Lab (collaborator)
26/29
![Page 27: Dr. Mahout: Analyzing clinical data using scalable and distributed computing](https://reader035.fdocuments.net/reader035/viewer/2022062501/568165de550346895dd8f98c/html5/thumbnails/27.jpg)
Sources
Resources
• Apache Mahout
• Spectrally Clustered
Links
• Categorizing ciliary motion defects (BSEC 2011)
• Eigencuts spectral clustering algorithm
Technical report (coming soon!)
27/29
![Page 29: Dr. Mahout: Analyzing clinical data using scalable and distributed computing](https://reader035.fdocuments.net/reader035/viewer/2022062501/568165de550346895dd8f98c/html5/thumbnails/29.jpg)
Thank you!
29/29
http://icanhascheezburger.files.wordpress.com/