Statistical Data Analysis
Lecture I: motivation and organization
Anna Gambin, Institute of Informatics, University of Warsaw
Friday, 10 February 2012
lecture outline
• motivation: statistical bioinformatics, two examples (from our own backyard)
• mass spectrometry
• aCGH microarrays
• organization: lecture plan and grading rules
Diagram of the microarray-based comparative genomic hybridization (aCGH) process © 2008 Nature Education
EXPERIMENT
DATA (large scale)
STATISTICAL DATA ANALYSIS
people (bioputer.mimuw.edu.pl)
Michał, Piotrek
projects (example 1: mass spectrometry)
2. MASS SPECTRA PREPROCESSING
• if there is exactly one candidate cluster, we extend it with the active peak,
• if more than one candidate cluster exists, we assign the active peak to the cluster whose monoisotopic peak is the highest (such a situation is quite rare, but it happens when the signal coming from one peptide is artificially split in the retention time domain (cf. Fig. 2.6)).
Figure 2.6: Artificial peak separation in the retention time domain. Isotopic envelopes visualized with the Sparky tool (Goddard and Kneller, 2006). Peaks found by NMRPipe are depicted as black Xes. Horizontal axis: mass-to-charge ratio; vertical axis: retention time; height is color coded, increasing from red to blue.
2.3.3 Automated Charge Determination
We have implemented two versions of this step: a simple and fast one, and a more sophisticated one. The simple version uses only information from the peak spacing
mass spectrometry
peptide spectrum = isotopic envelope
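The idea of reading the charge from the peak spacing can be illustrated with a minimal sketch. It assumes that neighbouring isotopic peaks of an ion with charge z lie roughly 1.00335/z apart on the m/z axis (the 13C/12C mass difference); the function names, the tolerance and the charge range are hypothetical and are not taken from the thesis excerpt.

```python
# Assumption: isotopic peaks of a z-charged ion are ~1.00335/z apart in m/z.
ISOTOPE_SPACING = 1.00335  # approximate 13C - 12C mass difference, in Da


def estimate_charge(mz_peaks, max_charge=6, tol=0.02):
    """Return the charge whose expected spacing best matches the mean spacing
    of consecutive peaks, or None if no charge fits within tol."""
    peaks = sorted(mz_peaks)
    if len(peaks) < 2:
        return None
    gaps = [b - a for a, b in zip(peaks, peaks[1:])]
    mean_gap = sum(gaps) / len(gaps)
    best = min(range(1, max_charge + 1),
               key=lambda z: abs(mean_gap - ISOTOPE_SPACING / z))
    if abs(mean_gap - ISOTOPE_SPACING / best) > tol:
        return None
    return best


def monoisotopic_mass(mz, charge, proton=1.00728):
    """Neutral monoisotopic mass from the monoisotopic peak m/z and charge."""
    return charge * (mz - proton)
```

For example, peaks at 500.0, 500.5017 and 501.0034 are spaced by about 0.5017, which matches charge 2.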
2.3 Isotopic envelopes detection
Figure 2.8: Masses and charges calculated by our algorithm for a fragment of the spectrum. Peaks are marked as black crosses, a small arrow denotes the monoisotopic peak in each isotopic cluster, and the monoisotopic mass (M) and charge (Q) are given for each identified peptide.
automatic spectrum interpretation
DBSCAN
3.2 Alignment via clustering
Figure 3.6: Colorectal cancer data clustered with the DBSCAN algorithm, εm = 5, εrt = 30, minPts = 10. The picture on the right presents a fragment of the data in greater detail. The cluster colors are recycled.
• the upper limit for the size of a preliminary cluster was 1000 elements.
In the case of minPts = 0 no peaks are treated as noise, even the ones in very sparse regions. For minPts = 10 we assume that some of the peaks might be noise. Of course, another explanation is that they in fact correspond to real peptides, but the retention time drift was so big that they cannot be aligned to any peaks in this iteration. Fortunately, excluding them at this stage does not necessarily mean that they will never be properly aligned. If the retention time correction step (discussed in Section 3.2.3) is performed after the clustering step, the drifts might become smaller.
There were 9026 preliminary groups obtained with parameter minPts = 0 and 8216 with minPts = 10 (cf. Fig. 3.6). In the latter case 3076 points were marked as noise.
All the models were fitted within each of the preliminary clusters in the second stage of the algorithm. At any one time the same model was assumed in all the preliminary clusters. Hence, the model selection problem was solely to select the appropriate number of clusters, the one that minimizes the BIC value. For a
clustering in the retention time domain
density-based methods: DBSCAN
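The role of the two radii and of minPts can be seen in a small DBSCAN sketch. It is only an illustration under assumptions: the neighbourhood is taken to be a rectangle (m/z difference at most eps_m and retention time difference at most eps_rt), and a point counts as its own neighbour; the thesis may define both differently. The quadratic neighbour search is fine for a toy example but not for real data.

```python
def dbscan(points, eps_m, eps_rt, min_pts):
    """points: list of (mz, rt) pairs. Returns one label per point; -1 is noise.
    Assumed neighbourhood: |d mz| <= eps_m and |d rt| <= eps_rt."""
    def neighbours(i):
        mz, rt = points[i]
        return [j for j, (m, r) in enumerate(points)
                if abs(m - mz) <= eps_m and abs(r - rt) <= eps_rt]

    UNSEEN, NOISE = None, -1
    labels = [UNSEEN] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNSEEN:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE          # may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point reached from a core point
            if labels[j] is not UNSEEN:
                continue
            labels[j] = cluster
            nb = neighbours(j)
            if len(nb) >= min_pts:     # j is a core point: keep expanding
                queue.extend(nb)
    return labels
```

With min_pts set to 0 every point has enough neighbours, so nothing is labelled as noise, which agrees with the behaviour described in the excerpt.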
3.2 Alignment via clustering
[Figure 3.7, nine scatter-plot panels of peaks; axes mz (ticks 1205 to 1230) and rt (ticks 2900 to 3300). Panel titles: a) λrBr, b) λB, c) λrB, d) λBr, e) m/z 0.04, rt 50, f) m/z 0.04, rt 100, g) m/z 0.04, rt 200, h) m/z 0.04, rt estim., i) m/z 0.04, var. rt estim.]
Figure 3.7: Example of clustering acquired for one of the preliminary clusters from Figure 3.6, of size 999. Panels a) to i) correspond to the fitted models: a) λrBr, b) λB, c) λrB, d) λBr; models e) to i) have the deviation in the m/z dimension fixed to 0.04, and their retention time deviations are: e) 50, f) 100, g) 200, h) estimated from the data, the same for every cluster, i) estimated from the data for each cluster.
clustering using a probabilistic model
k-means algorithm
EM algorithm
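As a point of reference for the k-means algorithm named on this slide, here is a minimal sketch of Lloyd's iteration. The initial centroids are passed in explicitly and the distance is plain, unscaled Euclidean distance; both are simplifying assumptions made for the illustration (on raw m/z and retention time values the retention time would dominate). EM for a Gaussian mixture follows the same alternating scheme, but replaces the hard assignment with posterior membership probabilities.

```python
def kmeans(points, centroids, n_iter=100):
    """Lloyd's algorithm. points and centroids are lists of equal-length tuples.
    Returns (assignment, centroids)."""
    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    centroids = [tuple(c) for c in centroids]
    assignment = None
    for _ in range(n_iter):
        # assignment step: nearest centroid for every point
        new_assignment = [min(range(len(centroids)),
                              key=lambda k: dist2(p, centroids[k]))
                          for p in points]
        if new_assignment == assignment:
            break  # converged: assignments no longer change
        assignment = new_assignment
        # update step: centroid = mean of the points assigned to it
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = tuple(sum(c) / len(members)
                                     for c in zip(*members))
    return assignment, centroids
```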
quality assessment
(simultaneous) testing of (multiple) statistical hypotheses
FDR - false discovery rate
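One standard way to control the FDR when many hypotheses are tested at once is the Benjamini-Hochberg step-up procedure. The sketch below is offered as a generic illustration; the slide does not say which procedure the lecture uses, and the function name and the default level are assumptions.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans, True where the hypothesis is rejected,
    controlling the FDR at level alpha (Benjamini-Hochberg step-up)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # find the largest rank k (1-based) with p_(k) <= k / m * alpha
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            k_max = rank
    # reject the k_max hypotheses with the smallest p-values
    rejected = [False] * m
    for i in order[:k_max]:
        rejected[i] = True
    return rejected
```

Note the step-up behaviour: for p-values 0.03 and 0.04 at alpha = 0.05 both hypotheses are rejected, although 0.03 alone exceeds its own threshold of 0.025.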
A. Gambin et al. / International Journal of Mass Spectrometry 260 (2007) 20–30
Fig. 10. Results of PPC method on the colorectal cancer example. Each of the four panels in the right figure shows a histogram of peptide intensities in the training set for one mass cluster for healthy donors (top) and cancer patients (bottom). Cluster centroid coordinates (m/z and retention time value) are given. Additionally we estimated the density of signal abundance and calculated the PPC threshold as before. Right panel shows the false-discovery-rate (FDR) value as a function of the PPC threshold.
two lines (retention times) in the previous approach. Our algorithm operates on the sorted peak list (of length n) and calculates the list of masses (i.e., peptides) of length k. Assuming that every vertical strip intersects a constant number of isotopic clusters (in practice this number is less than 10), we can estimate the time complexity of the algorithm to be O(n(log n + log k)). Memory requirements can be bounded by O(n + k). Efficiency tests were performed to find the execution time and memory requirements.
feature (biomarker) selection
PPC method
t-test (Welch)
feature (biomarker) selection
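For a single candidate feature, Welch's t statistic compares the two group means without assuming equal variances. A minimal sketch follows, with the Welch-Satterthwaite approximation for the degrees of freedom; the function name is hypothetical, and converting the statistic to a p-value (which needs the t distribution) is left out.

```python
from math import sqrt


def welch_t(x, y):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom
    for two samples with possibly unequal variances."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)  # unbiased sample variance
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    se2 = vx / nx + vy / ny                        # squared standard error
    t = (mx - my) / sqrt(se2)
    df = se2 ** 2 / ((vx / nx) ** 2 / (nx - 1) + (vy / ny) ** 2 / (ny - 1))
    return t, df
```

Ranking features by the absolute value of t is one simple scoring function of the kind used for biomarker selection.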
biomarker instability
The structure of the graph underlying the Markov chain MC1 is presented in Figure 7. Depicted is the nested family of sets grouped together during the consecutive phases of the algorithm.
Readers interested in the details of the approximation algorithm are referred to [19] (an extended version of it is available at [20]).
The hierarchical approach from [19] complies well with chain MC1. The structure of MC4 does not allow for efficient aggregation, but because of the limited number of features selected from each input ranking, it is still tractable. Table 1 summarizes the efficiency of the Markov chain aggregation method. For MC1 we have 7 grouping stages; the sizes of the resulting groups vary from 4 to 87. For MC4 all 242 states are grouped together in the first phase and the stationary distribution vector is calculated exactly using the GTH algorithm [18]. The significant speed-up obtained with the approximation algorithm is crucial when aggregation is applied to complete rankings of features from massive datasets.
Figure 7: Markov chain hierarchical structure. The structure of the state space graph for rank aggregation Markov chain MC1. The type of edge corresponds to the transition probability. Ellipses surround the top ranked features appearing in each phase (from 1 up to 3 in this example). States joined at an earlier stage have higher stationary probability, and therefore rank higher in the aggregated ranking.
Markov models
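The link between the stationary distribution and the aggregated ranking can be illustrated with a small sketch. It uses plain power iteration rather than the GTH algorithm or the hierarchical approximation mentioned in the excerpt, and it assumes an irreducible, aperiodic chain given as a row-stochastic matrix; the function names are hypothetical.

```python
def stationary_distribution(P, n_iter=10_000, tol=1e-12):
    """Power iteration for pi = pi P, where P is a row-stochastic matrix
    given as a list of rows. Assumes an irreducible, aperiodic chain."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(n_iter):
        new = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(new, pi)) < tol:
            return new
        pi = new
    return pi


def ranking(P):
    """States (features) ordered by decreasing stationary probability."""
    pi = stationary_distribution(P)
    return sorted(range(len(pi)), key=lambda s: -pi[s])
```

For the two-state chain with rows (0.9, 0.1) and (0.5, 0.5) the stationary distribution is (5/6, 1/6), so state 0 ranks first.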
sample alignment
classification
biomarker consensus
Figure 3: Ovarian cancer classification results. Classification results for four classifiers (random forest (RF), SVM, decision trees (DT) and LDA) on the MALDI-TOF ovarian cancer dataset are shown separately in the four panels. Classifier performance using a specified number of best features from individual scoring functions (peak probability contrast (PPC), mutual information (MI), t-statistic (TT) and random forest feature ranking (RF)) is plotted in black. Performance with features selected by MC1 rank aggregation of the four functions is shown in blue. Results for regular PCA and our modified "Consensus" version using only the best features from the four scoring functions are plotted in green and red respectively. For all methods the average accuracy (fraction of samples correctly classified) over 20 cross-validation runs is shown. See Section Results for discussion.
classification
RF - random forest
SVM - support vector machines
DT - decision trees
LDA - linear discriminant analysis
Proteolysis
peptidases: enzymes cleaving polypeptide chains, divided into:
• exopeptidases - cleave near the ends of the polypeptide chain
• endopeptidases - cleave the polypeptide chain in the middle
process of proteolysis: decomposition of proteins into peptides and amino acids
modeling proteolytic activity
[Figure 2, graph nodes besides the source and the sink: FT, FTS, TS, FTSS, TSS, SS, FTSST, TSST, SST, ST, FTSSTS, TSSTS, SSTS, STS, SSTSY, STSY, TSY, SY]
Figure 2: The cleavage graph for 2 precursor peptides, FTSSTS and SSTSY, with source and sink nodes added.
model I: exopeptidases
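The node set of the cleavage graph in Figure 2 can be reproduced with a short sketch. It assumes that an exopeptidase removes exactly one residue from either end of a peptide and that fragments shorter than two residues are not kept as nodes (an assumption read off the figure, where the shortest nodes have length 2); the source and sink nodes are omitted and the function name is hypothetical.

```python
def cleavage_graph(precursors, min_len=2):
    """Build nodes and edges of an exopeptidase cleavage graph.
    Edges s -> s[1:] and s -> s[:-1] model removing one residue from the
    N- or C-terminus; fragments shorter than min_len are dropped."""
    nodes = set(precursors)
    edges = set()
    stack = list(precursors)
    while stack:
        s = stack.pop()
        for child in (s[1:], s[:-1]):
            if len(child) < min_len:
                continue
            edges.add((s, child))
            if child not in nodes:
                nodes.add(child)
                stack.append(child)
    return nodes, edges
```

For the precursors FTSSTS and SSTSY this yields 18 peptide nodes, the same set that is listed for Figure 2; fragments such as SSTS are shared by both precursors.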
model II: endopeptidases
modeling
Bayesian modeling
posterior probability
Metropolis-Hastings
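A minimal random-walk Metropolis-Hastings sampler, given as a generic sketch rather than the sampler used in the proteolysis model: it assumes a one-dimensional target known up to a constant through its log density and a symmetric Gaussian proposal, so the acceptance ratio reduces to the ratio of target densities. The function name and parameters are hypothetical.

```python
import math
import random


def metropolis_hastings(log_target, x0, n_samples, step=1.0, rng=random):
    """Random-walk Metropolis-Hastings with a symmetric Gaussian proposal.
    log_target: log of the (unnormalised) target density."""
    x, lp = x0, log_target(x0)
    samples = []
    for _ in range(n_samples):
        y = x + rng.gauss(0.0, step)   # propose a move
        lq = log_target(y)
        # accept with probability min(1, target(y) / target(x))
        if rng.random() < math.exp(min(0.0, lq - lp)):
            x, lp = y, lq
        samples.append(x)              # a rejected move repeats the old state
    return samples
```

Used with log_target = lambda x: -0.5 * x * x the chain samples from a standard normal distribution; in a Bayesian model log_target would be the log prior plus the log likelihood, that is the unnormalised log posterior.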
trypsin activity
[Figure 4, four panels, each plotting clr(activity) (ticks 0 to 8) against peptidase index 1 to 14. Panel titles: A (real data), B (synthetic data with std = 0.1), C (synthetic data with std = 0.01), D (synthetic data with std = 0.001). Peptidase labels in panel A: elastase.1, trypsin.1, ADAM10.peptidase, matrix.metallopeptidase.20, membrane.type.matrix.metallopeptidase.3, cathepsin.S, membrane.type.matrix.metallopeptidase.4, calpain.2, membrane.type.matrix.metallopeptidase.6, ADAMTS5.peptidase, myeloblastin, membrane.type.matrix.metallopeptidase.1, calpain.1, cathepsin.H]
Figure 4. (A) Inferred peptidase activities for sample no. 19 (colorectal cancer patient). (B-D) The same parameters for synthetic data generated from the model with standard deviation set to 0.1, 0.01 and 0.001, respectively. Red lines correspond to the model peptidase activities, which we aim to recover.
estimated parameters are known (red line). In those cases we only test the accuracy of the estimation procedure.
The set of identified enzymes does not vary significantly between the investigated samples (39). There are 9 peptidases identified in all samples and 22 peptidases found in at least one sample.
For further analysis we selected 19 samples (11 healthy, 8 diseased), for which acceptable estimates had been obtained (cf. Table I). The heatmap in Figure 5 presents the activities of the peptidases identified for these samples. Hierarchical clustering of activity profiles groups the samples into two clusters that are in good accordance with the patients' diagnoses.
[Figure 5, heatmap. Columns (data set no.): 24 25 31 35 32 15 19 39 26 20 6 29 10 5 1 37 23 28 2. Rows: membrane.type.matrix.metallopeptidase.4, elastase.1, ADAMTS5.peptidase, ADAM10.peptidase, cathepsin.S, membrane.type.matrix.metallopeptidase.3, trypsin.1, membrane.type.matrix.metallopeptidase.1, calpain.2, membrane.type.matrix.metallopeptidase.6, insulysin, neprilysin, signalase..animal..21.kDa.component, cathepsin.G, calpain.1, cathepsin.H, cathepsin.B, aminopeptidase.A, angiotensin.converting.enzyme.2, chymase...Homo.sapiens..type., myeloblastin, matrix.metallopeptidase.20. Legend: healthy, diseased]
Figure 5. Peptidase activities for 19 samples. The red-white scale represents peptidase activities in descending order (for missing peptidases, values are set to the minimum).
Let's take a closer look at the set of identified peptidases. Among them we detected the family of matrix metallopeptidases, whose role in cancer development and progression is significant [16], [17]. The calpain enzyme is used as a marker for the early detection of colorectal carcinoma [18], and inhibitors of cathepsins are used as possible therapeutics in colorectal diseases [19]. Moreover, cathepsins, because of their ability to degrade extracellular matrix proteins, have been implicated in invasion and metastasis of colorectal cancer. Members of the ADAM family are also known to be involved in various biological and disease-related processes [20].
V. CONCLUSION
In this paper we significantly extend the formal model of protein degradation proposed in [5]. The extension is twofold: firstly, the current approach encompasses endopeptidase activity as well (while [5] deals with exopeptidases only), and secondly, we integrate our model with knowledge about proteolytic events stored in the MEROPS database [6]. Moreover, we formulate the task of inferring the parameters of our model as a constrained optimization problem, which we solve by a standard procedure for non-linear least squares. This approach turned out to be more time efficient for complex MS data when compared to the previous Markov Chain Monte Carlo method
parameter estimation
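The activities in Figure 4 are plotted on a clr scale. Assuming that clr stands for the centered log-ratio transform commonly used for compositional data (the excerpt does not spell this out), it can be sketched as follows; the function name is hypothetical.

```python
from math import log


def clr(values):
    """Centered log-ratio transform: the log of each part divided by the
    geometric mean of all parts. All values must be strictly positive."""
    logs = [log(v) for v in values]
    mean_log = sum(logs) / len(logs)   # log of the geometric mean
    return [l - mean_log for l in logs]
```

The transformed values always sum to zero, so they describe activities relative to each other rather than on an absolute scale.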
clustering
PCA: principal component analysis
projects (example 2: aCGH microarrays)
Diagram of the microarray-based comparative genomic hybridization (aCGH) process © 2008 Nature Education
aCGH technology
patient database at IMiD
identified segments
Semester I
Course                                       Lecture  Exercises  Form of credit
Fundamentals of chemistry                    30       30         graded credit
Fundamentals of physics                      60       30         exam
Differential and integral calculus           60       60         exam
Introduction to computer science             30       30         exam
Introduction to biology                      60                  exam
Foreign language                             -        60         graded credit
Physical education (WF)                      -        30         credit
University-wide course                       30       -          exam
Total number of hours: 510

Semester II
Course                                       Lecture  Exercises  Form of credit
Fundamentals of chemistry                    30       15         exam
Probability theory                           60       60         exam
Algorithms and data structures               30       30         exam
Ecology                                      30       30         exam
Biochemistry                                 60       -          exam
Foreign language                             -        60         graded credit
Physical education (WF)                      -        30         credit
University-wide course                       30       -          graded credit
Total number of hours: 465

Semester III
Course                                       Lecture  Exercises  Form of credit
Discrete mathematics and linear algebra      30       30         exam
Object-oriented programming and design       30       30         exam
Cell biology                                 60       30         exam
Introduction to bioinformatics               15       45         exam
Molecular biology with genetics, part I      60       -          exam
Foreign language                             -        60         graded credit
Physical education (WF)                      -        30         credit
University-wide course                       30       -          graded credit
Total number of hours: 4?0 (middle digit unreadable)

Semester IV
Course                                       Lecture  Exercises  Form of credit
Computational mathematics                    30       30         exam
Statistical data analysis                    30       30         exam
Molecular biology with genetics, part II     30       30         exam
Introduction to bioinformatics               15       45         exam
Physiology and regulation of metabolism      60       -          graded credit
Molecular foundations of enzymology          60       -          graded credit
Foreign language                             -        60         graded credit
Foreign language certification exam                              exam
Physical education (WF)                      -        30         credit
Total number of hours: 450
Course organization
✦ lecture ~ theory (slides + notes)
✦ passing the lecture = oral exam
✦ lab ~ data analysis (R language)
✦ passing the lab = project
literature
Statistics Using R with Biological Examples, Kim Seefeld, Ernst Linder, http://cran.r-project.org/doc/contrib/Seefeld_StatsRBio.pdf
Applied Statistics for Bioinformatics using R, Wim P. Krijnen, http://cran.r-project.org/doc/contrib/Krijnen-IntroBioInfStatistics.pdf
Statistical Bioinformatics with R, Sunil K. Mathur, Elsevier Academic Press, 2010
Course organization
✦ lecture 1: introduction - course organization, where we came from, where we are heading...
✦ lectures 2 and 3: basic probability distributions, parametric and nonparametric tests
✦ lectures 4, 5: cluster analysis = grouping = clustering
  ✦ graph-based methods, hierarchical methods, relocation methods, model-based methods
Course organization
✦ lectures 6, 7: dimensionality reduction, feature (biomarker) selection
  ✦ principal component analysis, multidimensional scaling
✦ lectures 8-11: classification
  ✦ LDA, QDA, linear regression, tree classifiers, boosting, ...
✦ lectures 11-14: Markov models
Course organization
✦ lecture 1: introduction - course organization, where we came from, where we are heading...
✦ lectures 2 and 3: basic probability distributions, parametric and nonparametric tests
✦ lectures 4-7: Bayesian statistics
  ✦ Markov models, stochastic simulations, Gibbs sampler, MCMC, ...
Course organization
✦ lectures 8, 9: modeling
  ✦ estimation of model parameters
✦ lectures 10-12: linear models, analysis of variance
  ✦ ANOVA, linear regression, ...
✦ lectures 13-14: experimental design