PROGRESS REVIEW
Mike Langston’s Research Team
Department of Computer Science
University of Tennessee
with collaborative efforts at
Oak Ridge National Laboratory
November 22, 2005
Team Members in Attendance
Bhavesh Borate, Elissa Chesler, John Eblen, Roumyana Kirova, Mike Langston,
Andy Perkins, Yun Zhang
Team Members Absent
Xinxia Peng, Jon Scharff, Josh Steadmon
Mike Langston’s Progress Report
Fall, 2005
• Team Changes
– New Students: Belma Ford (GST), Peter Shaw (Australia)
– New Colleagues: Elissa Chesler & Roumyana Kirova (ORNL)
– Graduating Soon: Xinxia Peng (December), Jon Scharff (May)
– Moved Collaborators: Jay Snoddy (Vanderbilt)
• Recent Conferences/Talks
– ACiD (England), Dagstuhl (Germany), COCOON (China), Purdue, Supercomputing (Seattle)
• Upcoming Visits/Talks
– RECOMB WS (San Diego), Texas A&M, Carleton (Canada), Göteborg (Sweden), AICCSA-06 (UAE), ACM SAC (France)
• Support
– NIH (John), ORNL (Yun), Science Alliance (Andy)
– Proposals Outstanding
• Sample Projects
– Eukaryotes: Allergy (Human), Diabetes (Mice), IR (Mice), Neuroscience (Mice), others
– Prokaryotes: Operon (R. palustris), Shock (Shewanella)
Yun Zhang
• Recent conferences/talks
– Prepared slides for COCOON05, China
– Presented at SC05 (SuperComputing), Seattle
• Upcoming events
– Cray MTA (Multithreaded Architecture) Workshop, ORNL
• Projects: maximal clique enumeration
– Comparisons of multithreaded implementations on
  • Altix vs. Cray vs. IBM
  • Cray: Vectorization of for-loops
– Implementations on distributed-memory machines
  • Using MPI vs. Global Arrays
  • Load-balancing using master/slave vs. peer-to-peer model
– Comparison of MPI vs. Multithreaded
Parallel Clique Enumeration
• Objective
– Minimize data communication vs. maximize balanced load
• Dynamic load balancing
– Data transfer: peer-to-peer
– DLB strategies: master/slave vs. peer-to-peer
[Figure: search tree with levels k = 1 to k = 5, showing a task that needed to be transferred from slave1 to slave5]
Clique Enumeration
• Methods to speed up the computation core
• Bit compression to save memory, and corresponding bitwise operations on compressed bitmaps
    a b c d e f g
a   0 1 1 1 1 0 0
b   1 0 1 1 1 1 0
c   1 1 0 1 1 1 1
d   1 1 1 0 1 0 1
e   1 1 1 1 0 0 1
f   0 1 1 0 0 0 1
g   0 0 1 1 1 1 0
(a, b, c, d): 0 0 0 0 1 0 0
[Figure: example graph on vertices a to g; plot of cliques vs. vertices, ranging from sparse to dense]
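The bitmap idea on this slide can be sketched as follows. This is a minimal illustration, assuming each adjacency row is packed into a single Python integer; the helper names (`to_bits`, `to_row`, `common_neighbors`) are hypothetical and not taken from the team's code. ANDing the rows of a clique's members yields the bitmap of vertices adjacent to all of them, which for (a, b, c, d) reproduces the 0 0 0 0 1 0 0 row shown on the slide.

```python
# Adjacency rows from the slide, one string of bits per vertex (order: a..g).
VERTICES = "abcdefg"
ROWS = {
    "a": "0111100",
    "b": "1011110",
    "c": "1101111",
    "d": "1110101",
    "e": "1111001",
    "f": "0110001",
    "g": "0011110",
}

def to_bits(row):
    """Pack a row of '0'/'1' characters into one integer (leftmost bit = vertex a)."""
    return int(row, 2)

def to_row(bits):
    """Unpack an integer bitmap back into a 7-character bit string."""
    return format(bits, "07b")

BITMAPS = {v: to_bits(r) for v, r in ROWS.items()}

def common_neighbors(clique):
    """AND the bitmaps of all clique members: set bits are vertices adjacent to every member."""
    bits = (1 << len(VERTICES)) - 1  # start with all vertices as candidates
    for v in clique:
        bits &= BITMAPS[v]
    return bits

if __name__ == "__main__":
    print(to_row(common_neighbors("abcd")))   # only the bit for e remains set
    print(to_row(common_neighbors("abcde")))  # all zeros: the clique cannot be extended
```

A clique whose common-neighbor bitmap is all zeros is maximal, so the same AND operation doubles as a maximality test.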
Andy Perkins
Projects
• Low dose
• Allergy
• Shewanella
• HRT
Microarray Data
• Normalization
• Filtering low or unchanging expression values
• Control spots
Differential Analysis
• Cliquification
– In a large percent of cliques in one group and few in the other.
• Expression
– 2-fold change in expression between groups
• Correlation
– Correlation value >= 0.85 in one group and <= 0.25 in the other.
Differential Analysis
Red edge: >=0.85 in dose and <= 0.25 in control
Blue edge: >= 0.85 in control and <= 0.25 in dose
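The red/blue edge rule can be sketched as below. This is a minimal illustration, assuming Pearson correlation and assuming the thresholds apply to the signed correlation value (the slides do not say whether absolute values are used); the function names and the expression vectors are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length expression vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def edge_color(r_dose, r_control, high=0.85, low=0.25):
    """Classify a gene pair by its correlation in the dose and control groups."""
    if r_dose >= high and r_control <= low:
        return "red"
    if r_control >= high and r_dose <= low:
        return "blue"
    return None  # not a differential edge

if __name__ == "__main__":
    # Hypothetical expression values for one gene pair in each group.
    dose_g1, dose_g2 = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.5]
    ctrl_g1, ctrl_g2 = [1.0, 2.0, 3.0, 4.0], [3.0, 1.0, 4.0, 2.0]
    print(edge_color(pearson(dose_g1, dose_g2), pearson(ctrl_g1, ctrl_g2)))
```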
Other research
• Thresholding
• Pearson’s vs Spearman’s
• Random graphs
Papers
• “Computational Analysis of Mass Spectrometry Data Using Novel Combinatorial Methods,” Proceedings, ACS/IEEE International Conference on Computer Systems and Applications, Dubai, United Arab Emirates, March, 2006, with A. Fadiel, M. A. Langston, F. Naftolin, X. Peng, P. Pevsner, H. S. Taylor, O. Tuncalp, and D. Vitello.
• “Innovative Computational Methods for Transcriptomic Data Analysis,” Proceedings, ACM Symposium on Applied Computing, Dijon, France, April, 2006, with M. A. Langston, A. M. Saxton, J. A. Scharff, and B. H. Voy.
John Eblen
Clique Analysis Tool Chain
• Projects
– Gerling Data – NOD mice
– Shewanella Data
• Three Interesting Problems
– Aggregating Maximal Cliques
– Thresholding
– Biological Analysis of Clique Results
Aggregating Maximal Cliques
• The Problem
– A great deal of overlap among maximal cliques
– Many cliques differ by only a few nodes
• Solutions
– Paraclique (Dr. Langston)
– Nucleated Clique (Jon Scharff)
– Clique Difference or “Nonoverlap”
– Others
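The overlap problem can be seen on a very small graph. The sketch below uses the textbook Bron–Kerbosch recursion without pivoting (a stand-in, not necessarily the enumeration code the team uses) on the 7-vertex graph from the bitmap slide, then counts how many nodes each pair of maximal cliques differs by. The function names are hypothetical, and the symmetric-difference count is only one possible way to measure overlap; it is not claimed to be the “Clique Difference” method listed above.

```python
from itertools import combinations

# The 7-vertex example graph from the bitmap slide, as neighbor sets.
GRAPH = {
    "a": set("bcde"),
    "b": set("acdef"),
    "c": set("abdefg"),
    "d": set("abceg"),
    "e": set("abcdg"),
    "f": set("bcg"),
    "g": set("cdef"),
}

def maximal_cliques(graph):
    """Enumerate maximal cliques with basic Bron-Kerbosch (no pivoting)."""
    found = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            found.append("".join(sorted(current)))  # nothing can extend it: maximal
            return
        for v in sorted(candidates):
            expand(current | {v}, candidates & graph[v], excluded & graph[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(graph), set())
    return sorted(found)

def pairwise_differences(cliques):
    """Number of nodes by which each pair of cliques differs (symmetric difference)."""
    return {(a, b): len(set(a) ^ set(b)) for a, b in combinations(cliques, 2)}

if __name__ == "__main__":
    cliques = maximal_cliques(GRAPH)
    print(cliques)
    for pair, diff in pairwise_differences(cliques).items():
        print(pair, diff)
```

Even here, two of the maximal cliques differ by only two nodes, which is the kind of redundancy that aggregation methods aim to collapse.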
Direct Maximum Clique
• Parallel version scales well on Altix supercomputer, shared memory machines
• Currently working on base serial code efficiency
• Ultimate goal is speed
– Best algorithm possible
– Smart implementation(s)
Keller 7 Conjecture
• Goal is to find or prove nonexistence of 128-clique in Keller 7 graph
• Current approach
– Found set of 128 nonoverlapping independent sets (ISs)
– Currently searching for more
– Should greatly reduce search space
Bhavesh Borate
Thresholding
• GO Pairwise Similarity Analysis
• Percentage of Cliques with Biological Meaning at each threshold
• Confidence Intervals
• Graph Properties (Edge Density, Maximal Cliques, Maximum Clique)
• Spectral Graph Theory
• Bayesian Statistics
• Control Spot Threshold verification
• Utilization of Info from Pathway Databases
• Combinatorial Strategy
• Kentucky Windage ;)
Graph of GO-Pairwise Scores vs. Correlation Values
Shewanella data
[Figure: Avg functional similarity vs. Correlation; x-axis: Correlation (0 to 1.2), y-axis: Avg functional similarity (0 to 0.14)]
For each pair of genes, we find the GO category X that covers both genes and has the minimum total number of genes.
Get a GO score for each pair of genes.
Accumulate correlation scores in bins 1, 0.99, 0.98, …, 0.
Average the GO scores of the pairs in each bin.
Plot.
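The steps above can be sketched as follows. The slides do not say how a GO score is derived from the chosen category, so this sketch only finds the smallest covering category and then bins pair scores that are taken as given. The function names, category names, and gene names are hypothetical, and rounding each correlation to the nearest 0.01 is an assumption about how the bins are formed.

```python
from collections import defaultdict

def smallest_common_category(gene_a, gene_b, categories):
    """Return (name, size) of the smallest GO category containing both genes, or None."""
    covering = [(name, len(genes)) for name, genes in categories.items()
                if gene_a in genes and gene_b in genes]
    if not covering:
        return None
    return min(covering, key=lambda item: item[1])

def average_score_by_bin(scored_pairs):
    """Average GO scores of gene pairs grouped into 0.01-wide correlation bins.

    scored_pairs: iterable of (correlation, go_score), correlations assumed in [0, 1].
    """
    bins = defaultdict(list)
    for correlation, go_score in scored_pairs:
        bins[round(correlation, 2)].append(go_score)
    return {b: sum(scores) / len(scores) for b, scores in sorted(bins.items())}

if __name__ == "__main__":
    # Hypothetical GO categories and pair scores, for illustration only.
    categories = {"X": {"g1", "g2", "g3"}, "Y": {"g1", "g2"}, "Z": {"g1"}}
    print(smallest_common_category("g1", "g2", categories))
    print(average_score_by_bin([(0.991, 0.2), (0.989, 0.4), (0.5, 0.1)]))
```

The resulting bin-to-average mapping is what would be plotted as average functional similarity against correlation.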
GO Pairwise Similarity Analysis
Pairwise scores → score for each clique
Get a P-value for each clique
For each threshold 0.8:0.01:0.95, calculate the % of cliques with P-value < 0.01
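The threshold sweep can be sketched as below. How the cliques are scored and how their P-values are computed is not described on the slide, so the sketch takes per-threshold P-values as input. The function names and the P-values in the usage example are hypothetical.

```python
def thresholds(start=0.80, stop=0.95, step=0.01):
    """Correlation thresholds start:step:stop, rounded to avoid floating-point drift."""
    count = round((stop - start) / step) + 1
    return [round(start + i * step, 2) for i in range(count)]

def percent_significant(p_values, alpha=0.01):
    """Percentage of cliques whose P-value is below alpha."""
    if not p_values:
        return 0.0
    return 100.0 * sum(p < alpha for p in p_values) / len(p_values)

if __name__ == "__main__":
    # Hypothetical clique P-values at two of the thresholds.
    p_values_by_threshold = {
        0.80: [0.001, 0.5, 0.02, 0.009],
        0.90: [0.001, 0.004],
    }
    for t in thresholds():
        if t in p_values_by_threshold:
            print(t, percent_significant(p_values_by_threshold[t]))
```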
Updates from Xinxia
• Kevin was born in May
• Defended in October
• Graduating in December
• Working on publications
• Starting a job in December
Thank you all and keep in touch!
Suman Duvvuru
Data analysis
• Effect of Strain: Currently working on Dr. Brynn’s mouse strain data; I am writing SAS code to see which strain produces strong correlation in the data.
• The problem with microarray data
1. The number of variables is much higher than the number of observations. This causes many eigenvalues of the covariance matrix to be 0, so the correlation matrix is problematic.
2. Can be corrected using
• shrinkage-based correlation
• information-criteria-based methods (using smooth covariance estimators)
• (Implementation of these methods currently in progress)
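A minimal sketch of the problem and of the shrinkage idea: with three variables but only two observations, every pairwise correlation is ±1 and the correlation matrix is singular. Shrinking the off-diagonal entries toward zero (that is, shrinking the matrix toward the identity) restores a nonzero determinant. The fixed shrinkage intensity `lam` is an assumption for illustration, since published shrinkage estimators choose it from the data. The data and function names are hypothetical, and this is not the SAS implementation mentioned above.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def correlation_matrix(variables):
    """Sample correlation matrix; each variable is a list of observations."""
    return [[1.0 if i == j else pearson(vi, vj)
             for j, vj in enumerate(variables)]
            for i, vi in enumerate(variables)]

def shrink(matrix, lam):
    """Shrink toward the identity: (1 - lam) * R + lam * I."""
    return [[value if i == j else (1.0 - lam) * value
             for j, value in enumerate(row)]
            for i, row in enumerate(matrix)]

def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

if __name__ == "__main__":
    # Hypothetical data: 3 variables (genes), only 2 observations each.
    genes = [[1.0, 2.0], [2.0, 4.0], [5.0, 3.0]]
    r = correlation_matrix(genes)
    print(det3(r))               # singular sample correlation matrix
    print(det3(shrink(r, 0.2)))  # nonsingular after shrinkage
```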
Roumyana Kirova
mRNA expressions and Linkage
Gene expression data: N genes, K strains
Probe   BXD1   BXD2   BXD3   BXD5   BXD8   …
1       4.46   5.30   5.80   5.51   4.90   ...
2       4.10   4.49   4.24   4.06   4.46   ...
3       5.15   4.74   5.04   6.10   5.20   ...
4       6.45   6.03   5.79   6.56   7.32   ...
5       4.06   5.06   4.35   4.09   4.09   ...
…
12000   4.16   4.06   5.37   5.28   5.31   …
[Figure: histogram with bins from 4 to 7.5 and “More”; legend: aa, AA]
Polymorphisms
Marker   BXD1   BXD2   BXD3
M1       AA     aa     AA
M2       AA     AA     AA
M3       aa     AA     AA
M4       AA     aa     AA
M5       AA     AA     aa
…
M3000    aa     aa     AA
Model: QTL mapping
$$y = q + \sum_{i=1}^{m} \beta_i x_i + \sum_{i,j}^{l} \beta_{ij} x_i x_j + e$$
y: expression levels
x_i: 2 if AA, 0 if aa
e: error terms
$$\mathrm{LOD} = P(\mathrm{Model},\ q \neq 0) \,/\, P(\mathrm{Model},\ q = 0)$$
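As a rough illustration of how a LOD score is obtained for one probe and one marker, the sketch below fits a single-marker regression and compares it with the no-QTL model. It assumes the conventional definition of a LOD score as a log10 likelihood ratio, which for normal errors reduces to (n/2)·log10(RSS_null/RSS_marker). It is a simplification of the multi-term model on the slide, and the function name and the test data are hypothetical.

```python
from math import inf, log10

# Genotype coding from the slide: 2 if AA, 0 if aa.
GENOTYPE_CODE = {"AA": 2, "aa": 0}

def lod_score(expression, genotypes):
    """LOD score of a single-marker linear regression against the no-QTL model."""
    x = [GENOTYPE_CODE[g] for g in genotypes]
    n = len(expression)
    mean_x = sum(x) / n
    mean_y = sum(expression) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, expression))
    rss_null = sum((yi - mean_y) ** 2 for yi in expression)
    if sxx == 0 or rss_null == 0:
        return 0.0  # marker not polymorphic, or no variation to explain
    rss_marker = rss_null - sxy * sxy / sxx  # residual sum of squares with the marker
    if rss_marker <= 0:
        return inf  # perfect fit
    return (n / 2) * log10(rss_null / rss_marker)

if __name__ == "__main__":
    # Probe 1 and marker M1 for strains BXD1 to BXD3, as shown in the tables above.
    print(lod_score([4.46, 5.30, 5.80], ["AA", "aa", "AA"]))
    # Marker M2 is AA in all three strains, so it carries no information.
    print(lod_score([4.46, 5.30, 5.80], ["AA", "AA", "AA"]))
```

With only three strains the first value is not meaningful statistically; the point is the shape of the computation, which would be repeated for every probe and marker pair.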
[Figure: diagram with panels C1, Paraclique1, Regulatory Model 1 and C2 (Clique 2), Paraclique2, Regulatory Model 2. Expressions: 0.46 0.30 0.80 1.51 0.90]
[Figure: Regulatory ID 2840. Correlation histogram (x-axis: res, y-axis: Frequency) and plot of paraclique members (y-axis: res)]
[Figure: Regulatory ID 267. Correlation histogram (x-axis: res, y-axis: Frequency) and plot of paraclique members (y-axis: res)]
[Figure: QTL Model 1 (C1, Paraclique1, C2; principal components) and QTL Model 2 (C1, Paraclique2, C2; principal components)]
[Figure: principal components of QTL Model 1 and QTL Model 2, with a common QTL and a meta component]
Open questions:
1. How stable are the paracliques and QTL models if we choose different samples (not the average of the replicates)?
• Generate samples of the data by randomly choosing replicates and build confidence intervals.
• Fit a multi-variance model: Expression ~ Strain + Sample + Strain:Sample
• Add covariates to the QTL model to adjust for the gender effect.
2. Power issues: how many strains, replicates, and terms in the model?
• Simulate expression data and calculate power as a function of the sample size.
3. Parametric vs. non-parametric analysis.
4. Multiple-test adjustments.
Ontological Discovery for Ethanol Research
(…the new acronym stinks)
Elissa J. Chesler
Department of Anatomy and Neurobiology
Center for Genomics and Bioinformatics
University of Tennessee-Memphis
Health Science Center
Ontological Discovery for Ethanol Research
SPECIFIC AIMS
• Aim 1: To develop a data archive of ethanol, brain and behavior related gene sets that have been derived both empirically and through literature review.
• Aim 2: To develop a tool that allows cross-species, cross-molecule type gene set comparison.
• Aim 3: To develop a Web interface to the data archive and analysis that is aimed toward behavioral neuroscientists.
The Seizure Related Phenotype Landscape
[Figure: phenotype landscape with regions labeled Pressure, Audiogenic, EtOH Withdrawal, Cocaine & PTZ, T4, and ATPases]
Highly related phenotypes share many common mRNA correlates
Ontological Discovery from Phenotype Centered Gene Sets
ERGO: Ethanol Related Gene Ontology
Phenotypes are operationally defined, based on phenomenology.
Gene sets can be empirically associated with phenotypes.
But what is the underlying construct, really? Can we identify it by examining shared biological substrates of related processes?
AIM 1: Gene set assembly and archive
• Gene set is broadly defined.
– mRNA differential expression
– mRNA correlation
– Literature review
– KO, mutants with trait effects
• Search
– by gene
– by descriptor
– by set matching
• Attributes of each gene set include:
– Type (mRNA, lit, protein)
– Species
– Free text description
– Structured description, e.g. MPO
– Source DB (GO, KEGG, WebQTL100)
– Associated document (e.g. abstract, publication)
Aim 2: Analytic tool
• Translates gene sets to a common reference species via homology.
• Similar to existing tools, but archives more information about each gene set.
• Allows multiple set comparisons (intersection analyses are not limited to two sets).
• Percent positive matching allows estimation of the relation of gene sets without specific regard to the identity of genes. This provides a basis for clustering phenotypes based on gene annotation.
GeneKeyDB can be used to generate translation tables across species
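The translate-then-intersect flow of Aim 2 can be sketched as follows. The translation table here is a plain dictionary standing in for whatever GeneKeyDB would produce (no GeneKeyDB interface is assumed), and the gene identifiers, set names, and function names are all hypothetical. Genes without a homologue are simply dropped in this sketch, which is only one possible policy.

```python
def translate(gene_set, homology):
    """Map a gene set to the reference species, dropping genes with no homologue."""
    return {homology[gene] for gene in gene_set if gene in homology}

def common_genes(*gene_sets):
    """Intersection of any number of gene sets (not limited to two)."""
    if not gene_sets:
        return set()
    return set.intersection(*map(set, gene_sets))

if __name__ == "__main__":
    # Hypothetical translation tables to a common reference species.
    rat_to_ref = {"rat_g1": "ref_g1", "rat_g2": "ref_g2", "rat_g3": "ref_g3"}
    human_to_ref = {"hum_g1": "ref_g1", "hum_g2": "ref_g2"}

    rat_set = translate({"rat_g1", "rat_g2", "rat_g3", "rat_g9"}, rat_to_ref)
    human_set = translate({"hum_g1", "hum_g2"}, human_to_ref)
    reference_set = {"ref_g2", "ref_g3", "ref_g4"}

    print(sorted(common_genes(rat_set, human_set, reference_set)))
```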
Aim 3: Behavioral Neuroscience Friendly interface
• Does the world need another boutique?
• Making genomics accessible to the broader research community.
• Text searching to retrieve, e.g., all gene sets related to ‘stress’.
• Text mining
• Apparatus-specific details
• OUR GOAL IS TO CREATE A TOOL FOR PHENOTYPIC ANALYSIS; GENES CAN BE A BLACK BOX THAT GETS US THERE!
Future Directions
Bleeding Edge
From a matrix of set-set correlations estimated by Jaccard’s positive match, can we draw and analyze graphs of gene set relations?
From a set of documents associated with overlapping gene sets, can we mine text for frequently occurring terms? E.g., to answer “What term occurs most commonly in the set of sets extracted by match to expression upregulation in response to handling stress?”
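The first question can be made concrete with a small sketch: compute the Jaccard coefficient (shared genes divided by all genes in either set) for every pair of gene sets, and keep the pairs above a cutoff as edges of a gene-set graph. The set names, gene identifiers, function names, and the cutoff value are hypothetical, and the sketch assumes all sets have already been translated to one reference species.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard coefficient: |A ∩ B| / |A ∪ B| (0.0 for two empty sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def similarity_edges(gene_sets, min_similarity):
    """Edges (set_a, set_b, similarity) for all pairs at or above the cutoff."""
    edges = []
    for name_a, name_b in combinations(sorted(gene_sets), 2):
        similarity = jaccard(gene_sets[name_a], gene_sets[name_b])
        if similarity >= min_similarity:
            edges.append((name_a, name_b, similarity))
    return edges

if __name__ == "__main__":
    # Hypothetical gene sets keyed by a short phenotype label.
    gene_sets = {
        "A": {"g1", "g2", "g3"},
        "B": {"g2", "g3", "g4"},
        "C": {"g9"},
    }
    print(similarity_edges(gene_sets, min_similarity=0.5))
```

The resulting edge list could then be handed to the same clique tools used elsewhere in the project, although the slides leave that step open.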
Research challenges
• Translation of genes across species:
– Homology is not perfect; how do we match when no homologues are found?
• Reference Set
– What is the “reference set” for category representation analysis when gene sets are drawn from diverse sources?
– Lack of comprehensiveness of reference sets, e.g. a list of KO mice does not include all genes screened.
• Generation and curation of gene sets: establishing meaningful protocols and definitions to increase the quality and utility
– Use GenMAPP or Stanford models.
Gene set overlap unites diverse phenomena
[Figure: overlapping gene sets labeled Consumption Correlates in RI lines; Gene Expression Correlates of Htr1b; Upregulated in Social Isolation; Upregulated in P vs NP; Literature on Neuroactive Steroid Synthesis; ontology]
Induction of a research question:
“If I antagonize the gene product of consumption correlate in socially isolated monkeys, consumption will decrease.”
“Hey, you put your social isolation in my NP mice!
Yeah, well you put your P mice in my binge drinking!”