Mike Langston’s Progress Report Summer, 2005
![Page 1: Mike Langston’s Progress Report Summer, 2005](https://reader031.fdocuments.net/reader031/viewer/2022020918/5681466b550346895db39010/html5/thumbnails/1.jpg)
PROGRESS REVIEW
Mike Langston’s Research Team
Department of Computer Science
University of Tennessee
with collaborative efforts at
Oak Ridge National Laboratory
June 27, 2005
![Page 2: Mike Langston’s Progress Report Summer, 2005](https://reader031.fdocuments.net/reader031/viewer/2022020918/5681466b550346895db39010/html5/thumbnails/2.jpg)
Team Members in Attendance
Bhavesh Borate, Suman Duvvuru, John Eblen, Mike Langston, Xinxia Peng, Andy Perkins, Jon Scharff, Henry Suters, Yun Zhang
Team Members Absent
Josh Steadmon
![Page 3: Mike Langston’s Progress Report Summer, 2005](https://reader031.fdocuments.net/reader031/viewer/2022020918/5681466b550346895db39010/html5/thumbnails/3.jpg)
Mike Langston’s Progress Report
Summer, 2005
• Team Changes
  – Graduating Soon: Xinxia Peng, Jon Scharff
  – New Member: Andy Perkins
• Team Foci
  – FPT Tools and Applications
  – Computational Biology
• Recent Conference Talks
  – AICCSA-05 (Egypt), RTST-05 (Lebanon), DIMACS (New Jersey)
• Recent Visits
  – Cold Spring Harbor Lab (New York)
• Upcoming Conference Talks
  – ACiD-05 (England), Dagstuhl-05 (Germany), COCOON-05 (China)
• Upcoming Major Program Committee Service
  – AICCSA-06 (Program Chair), IWPEC-06 (Program Co-Chair)
![Page 4: Mike Langston’s Progress Report Summer, 2005](https://reader031.fdocuments.net/reader031/viewer/2022020918/5681466b550346895db39010/html5/thumbnails/4.jpg)
John Eblen
Dr. Ivan Gerling’s Data
• Details
  – Leukocyte data – 2 ages, 3 strands
  – Islet data – 3 ages, 4 strands
• Current Project – Adding Proteins
  – Add 60 proteins to leukocyte data of 22690 probe sets
  – How can we improve correlation?
  – What other types of analysis are possible?
![Page 5: Mike Langston’s Progress Report Summer, 2005](https://reader031.fdocuments.net/reader031/viewer/2022020918/5681466b550346895db39010/html5/thumbnails/5.jpg)
General Clique Problem
• Specific Approaches
  – “Biographs”, or graphs created from correlation values
  – Brock graphs
  – Approach for Keller graphs?
• Information Gathering
  – Markov chains
  – General graph properties
• Combining Algorithms
![Page 6: Mike Langston’s Progress Report Summer, 2005](https://reader031.fdocuments.net/reader031/viewer/2022020918/5681466b550346895db39010/html5/thumbnails/6.jpg)
Additional Projects
• Fast Direct Clique Codes
  – Currently testing on DIMACS challenge graphs
  – Work continues
• Common Neighbor Preprocessing
![Page 7: Mike Langston’s Progress Report Summer, 2005](https://reader031.fdocuments.net/reader031/viewer/2022020918/5681466b550346895db39010/html5/thumbnails/7.jpg)
Jon Scharff
Differential Expression
• Student’s t-test for two normally distributed populations:
  – Null hypothesis: the means are equal
  – Variances are assumed to be equal
• Applied on a gene-by-gene basis
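The gene-by-gene test above can be sketched as follows. This is a minimal illustration that assumes expression values arrive as plain dictionaries mapping a gene name to its replicate measurements in each condition; the function names `pooled_t_statistic` and `rank_genes` are hypothetical and not taken from the team's code.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t_statistic(a, b):
    """Student's t statistic for two samples under the equal-variance assumption."""
    na, nb = len(a), len(b)
    # Pooled variance: weighted average of the two sample variances.
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    # Raises ZeroDivisionError if both samples are constant (pooled variance 0).
    return (mean(a) - mean(b)) / sqrt(pooled_var * (1 / na + 1 / nb))


def rank_genes(expr_a, expr_b):
    """Order genes by |t|, most differentially expressed first (assumed input layout)."""
    scores = {g: pooled_t_statistic(expr_a[g], expr_b[g]) for g in expr_a}
    return sorted(scores, key=lambda g: abs(scores[g]), reverse=True)
```

Converting each statistic to a p-value would use the t distribution with `na + nb - 2` degrees of freedom, which is left out of this sketch.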
![Page 8: Mike Langston’s Progress Report Summer, 2005](https://reader031.fdocuments.net/reader031/viewer/2022020918/5681466b550346895db39010/html5/thumbnails/8.jpg)
Differential Correlation
![Page 9: Mike Langston’s Progress Report Summer, 2005](https://reader031.fdocuments.net/reader031/viewer/2022020918/5681466b550346895db39010/html5/thumbnails/9.jpg)
Differential Cliquification
• Cliques that appear in one graph but not the comparison graph
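One way to read the bullet above is as a set difference between the maximal cliques of two graphs on the same vertex set. The sketch below takes that reading as an assumption, uses a basic Bron-Kerbosch recursion rather than the team's clique codes, and introduces the hypothetical names `maximal_cliques` and `differential_cliques`.

```python
def maximal_cliques(adj, min_size=3):
    """Return the maximal cliques of an undirected graph as a set of frozensets.

    adj maps each vertex to the set of its neighbours.
    """
    found = set()

    def expand(r, p, x):
        if not p and not x:
            if len(r) >= min_size:
                found.add(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return found


def differential_cliques(adj_a, adj_b, min_size=3):
    """Maximal cliques of graph A that are not maximal cliques of graph B."""
    return maximal_cliques(adj_a, min_size) - maximal_cliques(adj_b, min_size)
```

A looser comparison, for example treating a clique as shared when most of its vertices still form a clique in the other graph, would need a different definition than this exact-match version.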
![Page 10: Mike Langston’s Progress Report Summer, 2005](https://reader031.fdocuments.net/reader031/viewer/2022020918/5681466b550346895db39010/html5/thumbnails/10.jpg)
Nucleus Cliques/Clique Nuclei
![Page 11: Mike Langston’s Progress Report Summer, 2005](https://reader031.fdocuments.net/reader031/viewer/2022020918/5681466b550346895db39010/html5/thumbnails/11.jpg)
Yun Zhang
Clique Enumeration Problem (1)
• Proposed a new maximal clique enumeration algorithm
  – Inspired by the Kose et al. algorithm
  – Enumerates cliques in non-decreasing order of size
  – Uses bitwise operations to speed up computation and reduce space requirements
  – Sequential algorithm is parallelizable
  – Serial code is almost 400 times faster than Kose RAM on the 0.85 threshold MAS5.0 graph (size 12,422)
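The two ideas named above, size-ordered enumeration and bitwise operations, can be illustrated with a small level-by-level sketch. It is not the algorithm from the slides: it assumes vertices are numbered 0 to n-1, stores each adjacency list as a Python integer bitmask, and uses the hypothetical names `adjacency_bits` and `cliques_by_size`.

```python
def adjacency_bits(n, edges):
    """Build one integer bitmask per vertex: bit u of bits[v] is set when u-v is an edge."""
    bits = [0] * n
    for u, v in edges:
        bits[u] |= 1 << v
        bits[v] |= 1 << u
    return bits


def cliques_by_size(adj_bits):
    """Yield (size, vertices) for every maximal clique, in non-decreasing size order."""
    # Each entry pairs a clique with the bitmask of vertices adjacent to all its members.
    level = [((v,), adj_bits[v]) for v in range(len(adj_bits))]
    size = 1
    while level:
        next_level = []
        for verts, common in level:
            if common == 0:
                yield size, verts  # no vertex can extend this clique, so it is maximal
                continue
            # Extend only with higher-numbered vertices so each clique is built once.
            shift = verts[-1] + 1
            higher = (common >> shift) << shift
            while higher:
                low = higher & -higher          # isolate the lowest set bit
                u = low.bit_length() - 1
                next_level.append((verts + (u,), common & adj_bits[u]))
                higher ^= low
        level = next_level
        size += 1
```

Because every clique of the current size is held in memory before the next level is built, a sketch like this also shows why storage becomes the limiting resource on dense graphs.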
![Page 12: Mike Langston’s Progress Report Summer, 2005](https://reader031.fdocuments.net/reader031/viewer/2022020918/5681466b550346895db39010/html5/thumbnails/12.jpg)
Clique Enumeration Problem (2)
• Space required to hold the cliques is enormous
[Figure: Memory usage on a graph with 2895 vertices. x-axis: clique size, 3 to 28; y-axis: memory usage (GBytes), 0 to 20.]
On the 0.7 threshold MAS5.0 graph, it used almost 1 terabyte of memory after 12 hours of running.
![Page 13: Mike Langston’s Progress Report Summer, 2005](https://reader031.fdocuments.net/reader031/viewer/2022020918/5681466b550346895db39010/html5/thumbnails/13.jpg)
Clique Enumeration Problem (3)
• Parallelism on a shared-memory machine
  – SGI Altix, 256 processors, 2 terabytes of shared memory, 8 GB per CPU
  – Use a dynamic task scheduler to
    • Synchronize multiple threads
    • Make load balancing decisions
  – Achieves a super-linear speedup on up to 64 processors
![Page 14: Mike Langston’s Progress Report Summer, 2005](https://reader031.fdocuments.net/reader031/viewer/2022020918/5681466b550346895db39010/html5/thumbnails/14.jpg)
Clique Enumeration Problem (4)
[Figure: Run times with and without load balancing using up to 64 processors on a graph with 2895 vertices. x-axis: number of processors (threads), 0 to 68; y-axis: run time (seconds), 0 to 600; series: no load balancing, with load balancing.]
![Page 15: Mike Langston’s Progress Report Summer, 2005](https://reader031.fdocuments.net/reader031/viewer/2022020918/5681466b550346895db39010/html5/thumbnails/15.jpg)
Maximum Common Subgraph
• Clique branch algorithm by Henry (COCOON-05)
  – Takes advantage of the special structure of the association graph built from two graphs
• Finished serial implementation
• Preliminary performance testing on small graphs
• Next steps:
  – Benchmarking
  – Parallel implementation
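For readers unfamiliar with the association graph mentioned above, here is a minimal sketch of the textbook construction (the modular product), in which cliques correspond to common induced subgraphs of the two inputs. Whether the clique branch algorithm uses exactly this variant is an assumption, and `association_graph` is a hypothetical name.

```python
from itertools import combinations


def association_graph(vertices1, edges1, vertices2, edges2):
    """Build the association graph of two undirected graphs.

    Nodes are pairs (u, v) with u from graph 1 and v from graph 2. Two pairs are
    adjacent when they use distinct vertices on both sides and agree on adjacency:
    u1-u2 is an edge in graph 1 exactly when v1-v2 is an edge in graph 2.
    """
    e1 = {frozenset(e) for e in edges1}
    e2 = {frozenset(e) for e in edges2}
    nodes = [(u, v) for u in vertices1 for v in vertices2]
    adj = {node: set() for node in nodes}
    for (u1, v1), (u2, v2) in combinations(nodes, 2):
        if u1 == u2 or v1 == v2:
            continue
        if (frozenset((u1, u2)) in e1) == (frozenset((v1, v2)) in e2):
            adj[(u1, v1)].add((u2, v2))
            adj[(u2, v2)].add((u1, v1))
    return adj
```

A maximum clique in the returned graph then gives a largest set of vertex pairings that preserves both edges and non-edges.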
![Page 16: Mike Langston’s Progress Report Summer, 2005](https://reader031.fdocuments.net/reader031/viewer/2022020918/5681466b550346895db39010/html5/thumbnails/16.jpg)
Andy Perkins
• Working with Jon on Brynn Voy's low dose IR mouse data
• Finding and examining paracliques in the low dose data
• Thresholding via spectral graph theory
• Clique on MPSS mouse data
![Page 17: Mike Langston’s Progress Report Summer, 2005](https://reader031.fdocuments.net/reader031/viewer/2022020918/5681466b550346895db39010/html5/thumbnails/17.jpg)
Bhavesh Borate
Thresholding in High-Throughput Data
Ways of getting to the threshold
Graph & Statistical Analysis
• Graph features/characteristics
• Using confidence intervals with Bayesian statistics
• Random: 0.5% edges in graph
Utilizing Biological Info
• Gene Ontology
• Utilization of info from pathway databases
![Page 18: Mike Langston’s Progress Report Summer, 2005](https://reader031.fdocuments.net/reader031/viewer/2022020918/5681466b550346895db39010/html5/thumbnails/18.jpg)
Normal distribution of No. of edges
[Figure: five panels, one each for Spleen, Skin, MAS5, RMA, and PDNN data; each plots a single series against a horizontal axis running from 0 to 1.2.]
![Page 19: Mike Langston’s Progress Report Summer, 2005](https://reader031.fdocuments.net/reader031/viewer/2022020918/5681466b550346895db39010/html5/thumbnails/19.jpg)
Comparison with other datasets
| Data | No. of edges | Maximum degree | Size of maximum clique | Avg. size of maximal cliques |
|---|---|---|---|---|
| Spleen data (0.85) | 34753 | 349 | 39 | 20.03229 |
| Skin data (0.87) | 32384 | 606 | 66 | 48.009 |
| MAS5 data (0.84) | 3704 | 134 | 19 | 10.35285 |
| RMA data (0.92) | 34814 | 698 | 116 | -- |
| PDNN data (0.87) | 34225 | 678 | 88 | 68.1974 |
![Page 20: Mike Langston’s Progress Report Summer, 2005](https://reader031.fdocuments.net/reader031/viewer/2022020918/5681466b550346895db39010/html5/thumbnails/20.jpg)
Gene Ontology
[Figure: scores from GO plotted against correlation scores; both axes run from 0 to 1.]
Limitations
• GO data: helpful, but relying on it blindly is questionable
• Only applicable to genes with GO annotation
• For Elissa’s data: Bing got inexplicable results (a more or less flat curve)
![Page 21: Mike Langston’s Progress Report Summer, 2005](https://reader031.fdocuments.net/reader031/viewer/2022020918/5681466b550346895db39010/html5/thumbnails/21.jpg)
Info from Pathway Databases
• What graphs mean in a biological context
• Extrapolate info from what is “known” to the “unknown”
• Expression data from housekeeping genes is invaluable
Limitations
• Info in Pathway databases not arranged in tissue-specific or condition-specific manner.
![Page 22: Mike Langston’s Progress Report Summer, 2005](https://reader031.fdocuments.net/reader031/viewer/2022020918/5681466b550346895db39010/html5/thumbnails/22.jpg)
A Combinatorial Strategy
• Gather the info and develop an algorithm to make sense of it all and suggest a threshold to the user.
• Also suggest to the biologist the ideal threshold under each method.
• Provide a facility for displaying the graph at each threshold.
• Better still if it is interactive and dynamic (perhaps too ambitious?).
• In the end, user discretion determines the right threshold.
![Page 23: Mike Langston’s Progress Report Summer, 2005](https://reader031.fdocuments.net/reader031/viewer/2022020918/5681466b550346895db39010/html5/thumbnails/23.jpg)
Comparison of clustering algorithms
-Suman Duvvuru
![Page 24: Mike Langston’s Progress Report Summer, 2005](https://reader031.fdocuments.net/reader031/viewer/2022020918/5681466b550346895db39010/html5/thumbnails/24.jpg)
What is clustering
• Clustering:
  – Partitioning into dissimilar groups of similar objects (in our case, objects refer to genes).
  – Cluster analysis is used to identify genes that show similar expression patterns over a wide range of experimental conditions.
• Traditional definition of a “good” clustering:
  – Points assigned to the same cluster should be highly similar.
  – Points assigned to different clusters should be highly dissimilar.
![Page 25: Mike Langston’s Progress Report Summer, 2005](https://reader031.fdocuments.net/reader031/viewer/2022020918/5681466b550346895db39010/html5/thumbnails/25.jpg)
Overview of clustering algorithms
• K-cores (Implemented):
  – A k-core of a graph is a maximal subgraph with minimum degree at least k
  – The k-cores of a graph can be generated by
    • repeatedly deleting vertices whose degree is less than k, until none remain, and
    • performing a DFS on the resulting graph to find all the cores (its connected components).
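The two steps above can be sketched in a few lines. This is an illustrative version, not the team's implementation; it assumes the graph is given as a dictionary mapping each vertex to the set of its neighbours, and `k_cores` is a hypothetical name.

```python
def k_cores(adj, k):
    """Return the k-cores of an undirected graph as a list of vertex sets."""
    adj = {v: set(ns) for v, ns in adj.items()}  # work on a copy
    removed = set()
    pending = [v for v in adj if len(adj[v]) < k]
    # Step 1: peel off low-degree vertices; each removal may expose new ones.
    while pending:
        v = pending.pop()
        if v in removed:
            continue
        removed.add(v)
        for u in adj[v]:
            adj[u].discard(v)
            if u not in removed and len(adj[u]) < k:
                pending.append(u)
        adj[v] = set()
    # Step 2: depth-first search over what is left; each component is one core.
    cores, seen = [], set()
    for start in adj:
        if start in removed or start in seen:
            continue
        component, stack = set(), [start]
        seen.add(start)
        while stack:
            v = stack.pop()
            component.add(v)
            for u in adj[v]:
                if u not in seen:
                    seen.add(u)
                    stack.append(u)
        cores.append(component)
    return cores
```

The peeling step has to repeat because deleting one vertex can drop a neighbour's degree below k, which a single pass over the original degrees would miss.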
![Page 26: Mike Langston’s Progress Report Summer, 2005](https://reader031.fdocuments.net/reader031/viewer/2022020918/5681466b550346895db39010/html5/thumbnails/26.jpg)
HCS (Highly Connected Subgraph):
– The edge connectivity, or simply the connectivity, k(G) of a graph G is the minimum number k of edges whose removal results in a disconnected graph.
– A minimum cut (abbreviated mincut) is a cut with a minimum number of edges.
– A graph G with n vertices is called highly connected if k(G) > n/2.
– A highly connected subgraph (HCS) is an induced subgraph H such that H is highly connected.
– This algorithm identifies highly connected subgraphs as clusters.
![Page 27: Mike Langston’s Progress Report Summer, 2005](https://reader031.fdocuments.net/reader031/viewer/2022020918/5681466b550346895db39010/html5/thumbnails/27.jpg)
HCS Algorithm
• Uses Dinic’s algorithm to compute the mincut. The complexity of this computation is O(nm^(2/3)).
• Edge density is half that of our clique method.
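The "highly connected" test, k(G) > n/2, can be sketched as below. Note the substitution: for brevity this uses a plain breadth-first augmenting-path max flow with unit capacities in place of Dinic’s algorithm, so it shows the definition rather than the stated complexity. The graph is assumed to be a dictionary of neighbour sets with at least two vertices, and all function names are hypothetical.

```python
from collections import deque


def max_flow(adj, s, t):
    """Number of edge-disjoint s-t paths in an undirected graph (unit capacities)."""
    cap = {(u, v): 1 for u in adj for v in adj[u]}
    flow = 0
    while True:
        # Breadth-first search for an augmenting path in the residual graph.
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v in adj[u]:
                if v not in parent and cap[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return flow
        # Push one unit of flow back along the path that was found.
        v = t
        while parent[v] is not None:
            u = parent[v]
            cap[(u, v)] -= 1
            cap[(v, u)] += 1
            v = u
        flow += 1


def edge_connectivity(adj):
    """k(G): the smallest max flow from a fixed vertex to any other vertex."""
    vertices = list(adj)
    source = vertices[0]
    return min(max_flow(adj, source, t) for t in vertices[1:])


def is_highly_connected(adj):
    """True when k(G) > n/2, the condition used to accept a subgraph as a cluster."""
    return edge_connectivity(adj) > len(adj) / 2
```

Fixing one source is enough because any minimum cut separates that vertex from at least one other vertex.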
![Page 28: Mike Langston’s Progress Report Summer, 2005](https://reader031.fdocuments.net/reader031/viewer/2022020918/5681466b550346895db39010/html5/thumbnails/28.jpg)
HCS: An example
![Page 29: Mike Langston’s Progress Report Summer, 2005](https://reader031.fdocuments.net/reader031/viewer/2022020918/5681466b550346895db39010/html5/thumbnails/29.jpg)
Other clustering methods
Using Cluster 3.0 software:
• K-means
• Hierarchical clustering
Disadvantages:
• Neither of these methods allows a single gene to be present in multiple clusters.
![Page 30: Mike Langston’s Progress Report Summer, 2005](https://reader031.fdocuments.net/reader031/viewer/2022020918/5681466b550346895db39010/html5/thumbnails/30.jpg)
Quality Assessment
• Different measures of the quality of a clustering solution are applicable in different situations.
• The choice depends on the data and on whether the true solution is available.
• When the true solution is known and we wish to compare it to another solution, we can use the Minkowski measure or the Jaccard coefficient.
• When the true solution is not known, edge density, homogeneity and separation, and average silhouette are used as evaluation criteria.
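The two measures for the known-solution case are usually defined by counting gene pairs. The sketch below uses those common pair-counting definitions, assumes each clustering is a dictionary mapping a gene to a single cluster label, and uses hypothetical function names.

```python
from itertools import combinations
from math import sqrt


def pair_counts(true_labels, pred_labels):
    """Count gene pairs by whether they share a cluster in each solution.

    Returns (n11, n10, n01): together in both, in the true solution only,
    and in the proposed solution only.
    """
    n11 = n10 = n01 = 0
    for g, h in combinations(sorted(true_labels), 2):
        same_true = true_labels[g] == true_labels[h]
        same_pred = pred_labels[g] == pred_labels[h]
        if same_true and same_pred:
            n11 += 1
        elif same_true:
            n10 += 1
        elif same_pred:
            n01 += 1
    return n11, n10, n01


def jaccard_coefficient(true_labels, pred_labels):
    """n11 / (n11 + n10 + n01); 1.0 means the two solutions agree on every pair."""
    n11, n10, n01 = pair_counts(true_labels, pred_labels)
    return n11 / (n11 + n10 + n01)


def minkowski_measure(true_labels, pred_labels):
    """sqrt((n10 + n01) / (n11 + n10)); 0.0 means perfect agreement."""
    n11, n10, n01 = pair_counts(true_labels, pred_labels)
    return sqrt((n10 + n01) / (n11 + n10))
```

Both functions divide by zero when no pair shares a cluster, so singleton-only solutions need separate handling. Overlapping clusters, which the clique-based approach allows, would also need a different pair definition.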