CLOSED BOOK!!!
Maastricht University – Knowledge Engineering Bachelor
EXAMINATION
INTRODUCTION to BIOINFORMATICS
(resit)
Examiner: Rachel Cavill
Date: Wednesday 1st Feb 2017
Time: 9:00-12:00
Place: Tapijn, building Z
Notes:
1. The exam is a closed-book exam.
2. A calculator is allowed.
3. The exam consists of 11 pages (including this page and appendices).
4. The exam time is 180 minutes.
5. The number of exam questions is 5.
6. The numbers in brackets at the end of each subpart give the allocation of marks to that subpart; there are a total of 10 marks across the entire paper.
7. Before answering the questions, please first read all the exam questions, and then make a plan for how to spend the three hours.
8. When answering the questions, please do not forget:
   - to write your name and student number on each answer page;
   - to number the answers; and
   - to number the answer pages.
Good Luck!!!
1. Sequence statistics (1 mark)
a) Describe what has been plotted in this plot. What labels would you add to each
axis? (0.2 marks)
b) What would happen if you lowered the window size in the calculations for this
plot? (0.1 marks)
c) Why would such a plot be useful? Suggest a hypothesis that such a plot could be
used to visually test. (0.1 marks)
d)
       A       C       G       T
A   0.1202  0.0505  0.0483  0.0912
C   0.0665  0.0372  0.0396  0.0484
G   0.0514  0.0522  0.0363  0.0499
T   0.0721  0.0518  0.0656  0.1189
This is the dimer dinucleotide frequency in influenza.
Given this table for the probabilities of the dimers, calculate the probability of
each individual nucleotide occurring singularly. (0.2 marks)
e) Given these probabilities, calculate the odds ratios for the most and least
common dimers. (0.2 marks)
f) Describe briefly how you would assess if these odds ratios are significant. (0.2
marks)
2. Alignment (2 marks)
Here is a small score matrix for amino acids, created by considering only certain amino acids in the
BLOSUM62 matrix:
a) Which substitution/s receive/s the lowest score according to this score matrix? (0.1 marks)
b) What would such a score matrix be used for? (0.1 marks)
c) How are BLOSUM score matrices obtained? (0.3 marks)
d) Explain the off-diagonal positive scores. (0.2 marks)
e) Using this score matrix and a gap penalty of -2 use Needleman-Wunsch to align DELD and
DEELM. (1 mark)
f) The above process was global alignment, how would local alignment differ and when would
it be applicable? (0.3 marks)
3. Gene finding (1 mark)
a) What is an Open Reading Frame? (0.2 marks)
b) Describe a simple algorithm to find open reading frames. (0.4 marks)
c) How do the discovered open reading frames relate to genes? (0.1 marks)
d) If you wish to find genes, what further steps could you then perform on your open reading
frames to help this process? (0.3 marks)
4. Hidden Markov Models (3 marks)
G-protein Coupled Receptors (GPCRs) are proteins found in the cell membrane and they allow signals
to be transferred from outside to inside the cell. They thread through the membrane 7 times, and as
such are a subfamily of the 7-transmembrane protein family.
A pharmaceutical company is interested in a subset of these GPCRs which are targets of their
potential new drug. Since they can physically test which GPCRs are targeted by the drug, and then
they can sequence these GPCRs, they have three lists of GPCRs, those which are known drug targets,
those which are known non-targets and those which are untested (they also have sequence data for
the untested GPCRs).
a) Using this data, describe how you could use a Hidden Markov Model to generate a classifier
which could be used to predict which of the untested GPCRs would be targeted by the drug.
(1 mark)
b) Describe how you would ensure this classifier did not overfit the input data. (0.3 marks)
The pharmaceutical company has found a new protein, in a previously unstudied species. It is clear
that, like GPCRs, this protein has regions which are inside the cell, regions outside the cell and regions
which thread through the cell membrane. However, unlike GPCRs, this new protein does not always
go from in to out, but sometimes enters the membrane and then emerges again on the same side as
before.
You suggest that due to the different environments there may be differences in the amino acid
distribution inside, outside and within the membrane.
Your supervisor gives you access to the full database about known transmembrane proteins which
contains their full amino acid sequences, labelled with whether each amino acid is found inside,
outside or within the membrane.
c) Describe how you would use this data to test your theory that there are differences in the
distribution of amino acids. (0.3 marks)
d) Using this approach you show that there are indeed differences in the amino acid
distributions in these three areas. Explain how you could use this data to build another
Hidden Markov Model to predict which parts of a protein amino-acid sequence will be inside,
outside and within the cell membrane. Include an estimated transition matrix, an
explanation as to how you would use your gathered data to estimate the emission matrices
and a diagram showing the architecture of the HMM. (1 mark)
e) Describe how you would use this HMM to predict the folding pattern through the cell
membrane for the newly discovered protein. What problems might you foresee in this
process? (0.4 marks)
5. Omics data and phylogenetic trees (2.5 marks)
Matthias and his team want to look at the bacteria found in their patients’ guts.
First they look at which species are present in the different patients. They grow colonies of each of
the species they find and use metabolomics to analyse the metabolic variation between the species.
They want to use this metabolomics data to build a hierarchical clustering tree.
a) Describe the general steps needed to build this tree from the raw data table of metabolomics
data. Include any preprocessing and quality control procedures you would apply. (1 mark)
The tree obtained from this procedure is shown on the right.
They spot that in terms of the broad categories of bacteria the metabolic
profiles which they measured group the species according to their
phylogenetic grouping. They now want to know if the fine structure of the
tree generated from the metabolic profiles also matches the tree generated
from the sequences. They decide to concentrate on the clostridia species as
these are the most numerous in their dataset.
b) The table below contains partial sequences for these species. Based on
these sequences, construct the distance matrix between these species,
using the number of substitutions as your distance metric. (0.2 marks)
Species   Sequence
V3-b24    CGT CCC ATC AGT
CL-b11    GGA CTC AGC AGT
CL-b8     GCA CTC AGG AGC
CL-b16    GCA CTC AGC AGT
V3-b34    CGT CCC ATG GGT
CL-b14    CGA CCC ATC AGT
c) How do these distances relate to the true genetic distances? (0.1 marks)
d) Calculate an estimate of the true genetic distance between V3-b24 and CL-b11 using two
separate methods, and explain the difference between these methods. (0.3 marks)
e) Use UPGMA and your distance matrix from part b to construct a phylogenetic tree based on
these sequences. (0.9 marks)
f) Based on your result, can you conclude that the metabolic variation matches the sequence
variation? (0.2 marks)
g) What other algorithm/s could you have used instead of UPGMA in the above example? How
would they differ? (0.3 marks)
APPENDICES
APPENDIX 1: STANDARD GENETIC Code DNA → protein translation table
NOTE: ‘M’ = ‘START’ (first codon) or Methionine (else); ‘*’ = ‘STOP’
APPENDIX 2: Needleman-Wunsch algorithm for global alignment
1. Create a table M of size (m+1)x(n+1) for sequences s and t of lengths m and n.
2. Fill table entries (m:1) (first column: s) and (1:n) (first row: t) with the values:
$$M_{i,1} = \sum_{k=1}^{i} \sigma(s_k, -), \qquad M_{1,j} = \sum_{k=1}^{j} \sigma(-, t_k)$$
3. Starting from the top left, compute each entry using the recursive relation:
$$M_{i,j} = \max \begin{cases} M_{i-1,j-1} + \sigma(s_i, t_j) \\ M_{i-1,j} + \sigma(s_i, -) \\ M_{i,j-1} + \sigma(-, t_j) \end{cases}$$
Select the highest value and fill this in the cell, and draw an arrow to the preceding cell that generated this value (or multiple arrows if required).
4. Perform the trace-back procedure from the bottom-right corner.
Here $\sigma(a, b)$ is the score for aligning symbol a with symbol b, and $-$ denotes a gap.
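The four steps of this appendix can be sketched in Python as follows. This is a minimal illustration and not part of the exam paper. The function name `needleman_wunsch`, the `score` callable and the single linear `gap` penalty are assumptions made for the example. The boundary row and column are stored at index 0, and when several arrows are possible only one alignment is traced back.

```python
def needleman_wunsch(s, t, score, gap):
    """Global alignment of s and t with a linear gap penalty (illustrative sketch)."""
    m, n = len(s), len(t)
    # Step 1: table of size (m+1) x (n+1).
    M = [[0] * (n + 1) for _ in range(m + 1)]
    # Step 2: boundary column and row hold cumulative gap penalties.
    for i in range(1, m + 1):
        M[i][0] = M[i - 1][0] + gap
    for j in range(1, n + 1):
        M[0][j] = M[0][j - 1] + gap
    # Step 3: fill each cell with the best of the three possible moves.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            M[i][j] = max(
                M[i - 1][j - 1] + score(s[i - 1], t[j - 1]),  # align s_i with t_j
                M[i - 1][j] + gap,                            # s_i against a gap
                M[i][j - 1] + gap,                            # gap against t_j
            )
    # Step 4: trace back from the bottom-right corner (one path only).
    top, bottom = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and M[i][j] == M[i - 1][j - 1] + score(s[i - 1], t[j - 1]):
            top.append(s[i - 1])
            bottom.append(t[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and M[i][j] == M[i - 1][j] + gap:
            top.append(s[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(t[j - 1])
            j -= 1
    return M[m][n], "".join(reversed(top)), "".join(reversed(bottom))


def simple_score(a, b):
    """Hypothetical scoring: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1
```

For instance, `needleman_wunsch("ACGT", "ACT", simple_score, -2)` returns the optimal score together with one pair of gapped strings. A substitution matrix such as BLOSUM62 could be supplied by passing a `score` function that looks the pair up in a dictionary.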
APPENDIX 3: Smith-Waterman algorithm for local alignment
APPENDIX 4: Jukes-Cantor correction
The relation between the observed genetic distance d and true genetic distance K (here d and K are
fractions of the sites so always between 0 and 1) in the model of Jukes and Cantor is given by:
$$K = -\frac{3}{4}\ln\left(1 - \frac{4}{3}d\right)$$
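As an illustration, the correction can be evaluated with a few lines of Python. This sketch assumes the standard Jukes-Cantor form K = -(3/4) ln(1 - (4/3) d). The function name `jukes_cantor` is chosen for the example only.

```python
import math


def jukes_cantor(d):
    """Jukes-Cantor estimate of the true distance K from the observed fraction d.

    Assumes the standard form K = -(3/4) * ln(1 - (4/3) * d), which is only
    defined for 0 <= d < 0.75.
    """
    if not 0 <= d < 0.75:
        raise ValueError("d must satisfy 0 <= d < 0.75")
    return -0.75 * math.log(1 - 4 * d / 3)
```

For small d the corrected distance is close to d. It grows faster than d as more sites differ, because multiple substitutions at the same site become more likely.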
APPENDIX 5: Kimura-2-parameter correction
The relation between the true genetic distance K and the fractions of transitions P and fractions of
transversions Q (K, P, and Q are fractions so always between 0 and 1) in the model of Kimura is given
by:
$$K = -\frac{1}{2}\ln(1 - 2P - Q) - \frac{1}{4}\ln(1 - 2Q)$$
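A short Python sketch of this correction follows, assuming the standard Kimura form K = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q). The helper `transition_transversion_fractions` and the function `kimura_2p` are names invented for the example. The helper assumes two aligned, gap-free DNA strings of equal length.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def transition_transversion_fractions(seq1, seq2):
    """Return (P, Q): fractions of sites showing a transition or a transversion."""
    if len(seq1) != len(seq2) or not seq1:
        raise ValueError("sequences must be non-empty and of equal length")
    transitions = transversions = 0
    for a, b in zip(seq1, seq2):
        if a == b:
            continue
        same_class = ({a, b} <= PURINES) or ({a, b} <= PYRIMIDINES)
        if same_class:
            transitions += 1    # purine <-> purine or pyrimidine <-> pyrimidine
        else:
            transversions += 1  # purine <-> pyrimidine
    return transitions / len(seq1), transversions / len(seq1)


def kimura_2p(P, Q):
    """Kimura 2-parameter distance, assuming the standard form of the correction."""
    a = 1 - 2 * P - Q
    b = 1 - 2 * Q
    if a <= 0 or b <= 0:
        raise ValueError("P and Q are too large for the correction to be defined")
    return -0.5 * math.log(a) - 0.25 * math.log(b)
```

When transitions make up exactly one third of all substitutions (P = Q/2), this expression gives the same value as the Jukes-Cantor correction. The two methods differ only when transitions and transversions occur at different rates.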
APPENDIX 6: Rooted and unrooted trees
A phylogenetic (a.k.a. evolutionary) tree is a special kind of graph of a set of nodes (e.g. species) and
connections (e.g. evolutionary relation) between these nodes.
An unrooted tree consists of only internal and external nodes. An external node (a.k.a. leaf) has degree
1 (i.e. has only one connection to other nodes) and an internal node has degree 3 (i.e. has exactly three
connections to other nodes).
In a rooted tree we define *one* special internal node called the root that has degree 2 (i.e. has
exactly two connections to other nodes), and can be regarded as the last common ancestor (LCA).
We can make a rooted tree from an unrooted tree by selecting the appropriate connection, and
putting the root right there.
For a set of n>3 unconnected nodes, the number of all possible unrooted trees is:
$$\frac{(2n-5)!}{2^{n-3}\,(n-3)!}$$
and the number of all possible rooted trees is:
$$\frac{(2n-3)!}{2^{n-2}\,(n-2)!}$$
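To get a feel for how quickly these numbers grow, the two counts can be computed directly. The sketch below assumes the usual closed forms (2n-5)!/(2^(n-3) (n-3)!) for unrooted trees and (2n-3)!/(2^(n-2) (n-2)!) for rooted trees. The function names are chosen for the example.

```python
from math import factorial


def count_unrooted_trees(n):
    """Number of unrooted bifurcating trees on n labelled leaves (n >= 3)."""
    if n < 3:
        raise ValueError("n must be at least 3")
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


def count_rooted_trees(n):
    """Number of rooted bifurcating trees on n labelled leaves (n >= 2)."""
    if n < 2:
        raise ValueError("n must be at least 2")
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))
```

The number of rooted trees on n leaves equals the number of unrooted trees on n+1 leaves, since placing a root is equivalent to attaching one extra leaf to a branch.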
APPENDIX 7: The UPGMA algorithm
The unweighted pair-group method with arithmetic mean (UPGMA) is a popular distance analysis
method for constructing rooted ultrametric phylogenetic trees that assumes the same
evolutionary speed on all lineages, i.e. the rate of mutations is constant over time and for all lineages
in the tree. This is called the 'molecular clock hypothesis'. UPGMA starts with a matrix of pairwise
distances K(1..n, 1..n).
1. Find the pair (taxon i and j) with the smallest distance value in the distance matrix: K(i,j).
2. Define a new taxon comprising taxon i and j: taxon i is connected by a branch to the common ancestor node, and the same applies for taxon j. The distance K(i,j) is therefore split over the two branches, so each of the two branches obtains a length of K(i,j)/2.
3. If i and j were the last 2 taxa, the tree is finished. If not, the new taxon is called u.
4. Define the distance from u to each other taxon (n, with n ≠ i and n ≠ j) to be the average of the distances K(n,i) and K(n,j): K(n,u) = (K(n,i) + K(n,j))/2.
5. Go back to step 1 with one less taxon: taxa i and j are eliminated, and taxon u is added to the tree.
Reference: Michener, C.D., Sokal, R.R. (1957): A quantitative approach to a problem of classification.
Evolution, 11:490–499.
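The steps of this appendix can be sketched in Python as below. This is an illustration under stated assumptions, not a reference implementation. The function name `upgma`, the `frozenset`-keyed distance dictionary and the nested-tuple tree are choices made for the example. Step 4 is implemented exactly as written in the appendix (a plain average of the two distances). Many textbook descriptions of UPGMA weight this average by cluster size, which this sketch does not do.

```python
from itertools import combinations


def upgma(labels, dist):
    """Cluster taxa following the appendix steps (illustrative sketch).

    labels: list of unique taxon names.
    dist:   dict mapping frozenset({a, b}) to the distance K(a, b).
    Returns (tree, merges): tree is a nested tuple, and merges lists
    (i, j, K(i,j)/2) for every join, in the order the joins were made.
    """
    clusters = list(labels)
    d = dict(dist)  # copy, so the caller's matrix is left untouched
    merges = []
    while len(clusters) > 1:
        # Step 1: the pair with the smallest distance (first one found on ties).
        i, j = min(combinations(clusters, 2), key=lambda pair: d[frozenset(pair)])
        k_ij = d[frozenset((i, j))]
        # Step 2: new taxon u joining i and j; K(i,j) is split over two branches.
        u = (i, j)
        merges.append((i, j, k_ij / 2))
        # Step 4: distance from u to every remaining taxon is a plain average.
        rest = [c for c in clusters if c != i and c != j]
        for c in rest:
            d[frozenset((c, u))] = (d[frozenset((c, i))] + d[frozenset((c, j))]) / 2
        # Steps 3 and 5: drop i and j, add u, repeat until one taxon is left.
        clusters = rest + [u]
    return clusters[0], merges
```

A distance matrix for three hypothetical taxa could be passed as `{frozenset(("A", "B")): 2, frozenset(("A", "C")): 6, frozenset(("B", "C")): 6}`. The returned nested tuple then groups A with B before joining C.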
APPENDIX 9: Viterbi algorithm
APPENDIX 10: The Neighbour joining algorithm:
APPENDIX 11: Over-representation calculation
$$p_i = \sum_{j=k_i}^{\min(K, N_i)} \frac{\binom{N_i}{j}\binom{M-N_i}{K-j}}{\binom{M}{K}}, \qquad i \in \text{pathways}$$
Where $N_i$ is the size of pathway i, M is the overall number of genes or metabolites available to be picked, K is the number of genes or metabolites picked and $k_i$ is the number of genes or metabolites picked from the pathway.
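This calculation is a hypergeometric tail probability and can be sketched in a few lines of Python. The sketch assumes the sum runs from the observed count up to min(K, N_i), with each term being C(N_i, j) C(M - N_i, K - j) / C(M, K). The function name `overrepresentation_p` and its argument names are chosen for the example. It needs Python 3.8 or later for `math.comb`.

```python
from math import comb


def overrepresentation_p(pathway_size, total, picked, picked_in_pathway):
    """Probability of picking at least `picked_in_pathway` pathway members by chance.

    pathway_size      -- size of the pathway (N_i)
    total             -- genes or metabolites available to be picked (M)
    picked            -- genes or metabolites picked (K)
    picked_in_pathway -- picked items that belong to the pathway (k_i)
    """
    upper = min(picked, pathway_size)
    favourable = sum(
        comb(pathway_size, j) * comb(total - pathway_size, picked - j)
        for j in range(picked_in_pathway, upper + 1)
    )
    return favourable / comb(total, picked)
```

A small p-value suggests the pathway is over-represented among the picked items. When many pathways are tested, the resulting p-values would normally also need a multiple-testing correction.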