CLOSED BOOK!!!

Maastricht University – Knowledge Engineering Bachelor

EXAMINATION

INTRODUCTION to BIOINFORMATICS

(resit)

Examiner: Rachel Cavill

Date: Wednesday 1st Feb 2017

Time: 9:00-12:00

Place: Tapijn, building Z

Notes:

1. The exam is a closed-book exam.

2. A calculator is allowed.

3. The exam consists of 11 pages (including this page and appendices).

4. The exam time is 180 minutes.

5. The number of exam questions is 5.

6. The numbers in brackets at the end of each subpart give the allocation of marks to that subpart; there are a total of 10 marks across the entire paper.

7. Before answering the questions, please first read all the exam questions, and then make a plan to spend the three hours.

8. When answering the questions, please do not forget:

   to write your name and student number on each answer page;

   to number the answers; and

   to number the answer pages.

Good Luck!!!

1. Sequence statistics (1 mark)

a) Describe what has been plotted in this plot. What labels would you add to each axis? (0.2 marks)

b) What would happen if you lowered the window size in the calculations for this plot? (0.1 marks)

c) Why would such a plot be useful? Suggest a hypothesis that such a plot could be used to visually test. (0.1 marks)

d)

         A        C        G        T
    A    0.1202   0.0505   0.0483   0.0912
    C    0.0665   0.0372   0.0396   0.0484
    G    0.0514   0.0522   0.0363   0.0499
    T    0.0721   0.0518   0.0656   0.1189

This table gives the dinucleotide (dimer) frequencies in influenza.

Given this table for the probabilities of the dimers, calculate the probability of each individual nucleotide occurring on its own. (0.2 marks)

e) Given these probabilities, calculate the odds ratios for the most and least common dimers. (0.2 marks)

f) Describe briefly how you would assess whether these odds ratios are significant. (0.2 marks)

2. Alignment (2 marks)

Here is a small score matrix for amino acids, created by considering only certain amino acids in the BLOSUM62 matrix:

a) Which substitution/s receive/s the lowest score according to this score matrix? (0.1 marks)

b) What would such a score matrix be used for? (0.1 marks)

c) How are BLOSUM score matrices obtained? (0.3 marks)

d) Explain the off-diagonal positive scores. (0.2 marks)

e) Using this score matrix and a gap penalty of -2, use Needleman-Wunsch to align DELD and DEELM. (1 mark)

f) The above process was global alignment. How would local alignment differ, and when would it be applicable? (0.3 marks)

3. Gene finding (1 mark)

a) What is an Open Reading Frame? (0.2 marks)

b) Describe a simple algorithm to find open reading frames. (0.4 marks)

c) How do the discovered open reading frames relate to genes? (0.1 marks)

d) If you wish to find genes, what further steps could you then perform on your open reading frames to help this process? (0.3 marks)

4. Hidden Markov Models (3 marks)

G-protein Coupled Receptors (GPCRs) are proteins found in the cell membrane and they allow signals to be transferred from outside to inside the cell. They thread through the membrane 7 times, and as such are a subfamily of the 7-transmembrane protein family.

A pharmaceutical company is interested in a subset of these GPCRs which are targets of their potential new drug. Since they can physically test which GPCRs are targeted by the drug, and then they can sequence these GPCRs, they have three lists of GPCRs: those which are known drug targets, those which are known non-targets and those which are untested (they also have sequence data for the untested GPCRs).

a) Using this data, describe how you could use a Hidden Markov Model to generate a classifier which could be used to predict which of the untested GPCRs would be targeted by the drug. (1 mark)

b) Describe how you would ensure this classifier did not overfit the input data. (0.3 marks)

The pharmaceutical company has found a new protein in a previously unstudied species. It is clear that, like GPCRs, this protein has regions which are inside the cell, regions outside the cell and regions which thread through the cell membrane. However, unlike GPCRs, this new protein does not always go from in to out, but sometimes enters the membrane and then emerges again on the same side as before.

You suggest that due to the different environments there may be differences in the amino acid distribution inside, outside and within the membrane.

Your supervisor gives you access to the full database about known transmembrane proteins which contains their full amino acid sequences, labelled with whether each amino acid is found inside, outside or within the membrane.

c) Describe how you would use this data to test your theory that there are differences in the distribution of amino acids. (0.3 marks)

d) Using this approach you show that there are indeed differences in the amino acid distributions in these three areas. Explain how you could use this data to build another Hidden Markov Model to predict which parts of a protein amino-acid sequence will be inside, outside and within the cell membrane. Include an estimated transition matrix, an explanation as to how you would use your gathered data to estimate the emission matrices and a diagram showing the architecture of the HMM. (1 mark)

e) Describe how you would use this HMM to predict the folding pattern through the cell membrane for the newly discovered protein. What problems might you foresee in this process? (0.4 marks)

5. Omics data and phylogenetic trees (2.5 marks)

Matthias and his team want to look at the bacteria found in their patients’ guts.

First they look at which species are present in the different patients. They grow colonies of each of the species they find and use metabolomics to analyse the metabolic variation between the species. They want to use this metabolomics data to build a hierarchical clustering tree.

a) Describe the general steps needed to build this tree from the raw data table of metabolomics data. Include any preprocessing and quality control procedures you would apply. (1 mark)

The tree obtained from this procedure is shown on the right.

They spot that in terms of the broad categories of bacteria the metabolic profiles which they measured group the species according to their phylogenetic grouping. They now want to know if the fine structure of the tree generated from the metabolic profiles also matches the tree generated from the sequences. They decide to concentrate on the clostridia species as these are the most numerous in their dataset.

b) The table below contains partial sequences for these species. Based on these sequences, construct the distance matrix between these species, using substitution number as your distance metric. (0.2 marks)

    Species   Sequence
    V3-b24    CGT CCC ATC AGT
    CL-b11    GGA CTC AGC AGT
    CL-b8     GCA CTC AGG AGC
    CL-b16    GCA CTC AGC AGT
    V3-b34    CGT CCC ATG GGT
    CL-b14    CGA CCC ATC AGT

c) How do these distances relate to the true genetic distances? (0.1 marks)

d) Calculate an estimate of the true genetic distance between V3-b24 and CL-b11 using two separate methods, and explain the difference between these methods. (0.3 marks)

e) Use UPGMA and your distance matrix from part b to construct a phylogenetic tree based on these sequences. (0.9 marks)

f) Based on your result, can you conclude that the metabolic variation matches the sequence variation? (0.2 marks)

g) What other algorithm/s could you have used instead of UPGMA in the above example? How would they differ? (0.3 marks)

APPENDICES

APPENDIX 1: Standard genetic code (DNA → protein translation table)

NOTE: ‘M’ = ‘START’ (first codon) or Methionine (else); ‘*’ = ‘STOP’

APPENDIX 2: Needleman-Wunsch algorithm for global alignment

1. Create a table M of size (m+1)x(n+1) for sequences s and t of lengths m and n.

2. Fill table entries (m:1) (first column: s) and (1:n) (first row: t) with the values:

$$M_{i,1} = \sum_{k=1}^{i} \sigma(s_k, -), \qquad M_{1,j} = \sum_{k=1}^{j} \sigma(-, t_k)$$

3. Starting from the top left, compute each entry using the recursive relation:

$$M_{i,j} = \max \begin{cases} M_{i-1,j-1} + \sigma(s_i, t_j) \\ M_{i-1,j} + \sigma(s_i, -) \\ M_{i,j-1} + \sigma(-, t_j) \end{cases}$$

Select the highest value and fill this in the cell, and draw an arrow to the preceding cell that generated this value (or multiple arrows if required).

4. Perform the trace-back procedure from the bottom-right corner.

Here σ(a, b) is the score for aligning a with b, and "-" denotes a gap.
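The four steps above can be sketched in Python. This is a minimal illustration, not a reference implementation: it assumes a single linear gap penalty (so every σ(x, -) equals `gap`), indexes the table from 0 so that row and column 0 stand for the empty prefix, and reports only one optimal alignment when several exist. The function names and the toy scoring function are hypothetical.

```python
def needleman_wunsch(s, t, score, gap):
    """Global alignment of s and t; returns (best score, aligned s, aligned t)."""
    m, n = len(s), len(t)
    # Step 1: table of size (m+1) x (n+1); index 0 is the empty prefix.
    M = [[0] * (n + 1) for _ in range(m + 1)]
    # Step 2: first column and first row are cumulative gap penalties.
    for i in range(1, m + 1):
        M[i][0] = M[i - 1][0] + gap
    for j in range(1, n + 1):
        M[0][j] = M[0][j - 1] + gap
    # Step 3: fill the table with the recursive relation.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            M[i][j] = max(
                M[i - 1][j - 1] + score(s[i - 1], t[j - 1]),  # align s_i with t_j
                M[i - 1][j] + gap,                             # s_i against a gap
                M[i][j - 1] + gap,                             # t_j against a gap
            )
    # Step 4: trace back from the bottom-right corner along one optimal path.
    a, b = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and M[i][j] == M[i - 1][j - 1] + score(s[i - 1], t[j - 1]):
            a.append(s[i - 1])
            b.append(t[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and M[i][j] == M[i - 1][j] + gap:
            a.append(s[i - 1])
            b.append("-")
            i -= 1
        else:
            a.append("-")
            b.append(t[j - 1])
            j -= 1
    return M[m][n], "".join(reversed(a)), "".join(reversed(b))


def toy_score(x, y):
    # Hypothetical scoring for illustration: +1 for a match, -1 for a mismatch.
    return 1 if x == y else -1


result = needleman_wunsch("ACGT", "AGT", toy_score, gap=-2)
print(result)  # (1, 'ACGT', 'A-GT')
```

To use a real substitution matrix, `score` could look the pair up in a dictionary keyed by amino-acid pairs instead of comparing the letters.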

APPENDIX 3: Smith-Waterman algorithm for local alignment

APPENDIX 4: Jukes-Cantor correction

The relation between the observed genetic distance d and true genetic distance K (here d and K are fractions of the sites so always between 0 and 1) in the model of Jukes and Cantor is given by:

$$K = -\frac{3}{4} \ln\left(1 - \frac{4}{3} d\right)$$
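The correction can be evaluated numerically. The sketch below assumes the standard Jukes-Cantor form K = -(3/4) ln(1 - (4/3) d); the function name is hypothetical, and the range check reflects the fact that the logarithm is undefined once d reaches 0.75.

```python
import math


def jukes_cantor(d):
    """Estimate the true distance K from the observed fraction d of differing sites."""
    if not 0 <= d < 0.75:
        # For d >= 0.75 the argument of the logarithm is zero or negative.
        raise ValueError("the correction is only defined for 0 <= d < 0.75")
    return -0.75 * math.log(1 - 4 * d / 3)


k_example = jukes_cantor(0.3)  # about 0.383, larger than the observed 0.3
```

The corrected value is always at least as large as d, because repeated substitutions at the same site hide part of the true change.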

APPENDIX 5: Kimura-2-parameter correction

The relation between the true genetic distance K and the fractions of transitions P and fractions of transversions Q (K, P, and Q are fractions so always between 0 and 1) in the model of Kimura is given by:

$$K = -\frac{1}{2} \ln(1 - 2P - Q) - \frac{1}{4} \ln(1 - 2Q)$$
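As an illustration, P and Q can be counted from two aligned DNA sequences and then corrected. The sketch assumes the standard Kimura two-parameter form K = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q), ungapped sequences of equal length, and the usual definition of a transition as a purine-purine or pyrimidine-pyrimidine change. Function names are hypothetical.

```python
import math

PURINES = {"A", "G"}  # C and T are the pyrimidines


def transition_transversion_fractions(s, t):
    """Return (P, Q): fractions of sites showing a transition or a transversion."""
    if len(s) != len(t):
        raise ValueError("sequences must be aligned and of equal length")
    transitions = transversions = 0
    for x, y in zip(s, t):
        if x == y:
            continue
        if (x in PURINES) == (y in PURINES):
            transitions += 1      # change within the same chemical class
        else:
            transversions += 1    # purine <-> pyrimidine
    return transitions / len(s), transversions / len(s)


def kimura_2p(P, Q):
    """Kimura two-parameter estimate of the true distance K."""
    a = 1 - 2 * P - Q
    b = 1 - 2 * Q
    if a <= 0 or b <= 0:
        raise ValueError("P and Q are too large for the correction to be defined")
    return -0.5 * math.log(a) - 0.25 * math.log(b)


P, Q = transition_transversion_fractions("AACC", "AGCA")  # one transition, one transversion
K = kimura_2p(P, Q)
```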

APPENDIX 6: Rooted and unrooted trees

A phylogenetic (a.k.a. evolutionary) tree is a special kind of graph of a set of nodes (e.g. species) and connections (e.g. evolutionary relation) between these nodes.

An unrooted tree consists of only internal and external nodes. An external node (a.k.a. leaf) has degree 1 (i.e. has only one connection to another node) and internal nodes have degree 3 (i.e. have exactly three connections to other nodes).

In a rooted tree we define one special internal node called the root that has degree 2 (i.e. has exactly two connections to other nodes), and can be regarded as the last common ancestor (LCA). We can make a rooted tree from an unrooted tree by selecting the appropriate connection, and putting the root right there.

For a set of n>3 unconnected nodes, the number of all possible unrooted trees is:

$$\frac{(2n-5)!}{2^{n-3}\,(n-3)!}$$

and the number of all possible rooted trees is:

$$\frac{(2n-3)!}{2^{n-2}\,(n-2)!}$$
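To get a feel for how quickly these counts grow, the two expressions can be evaluated directly. The sketch assumes the standard forms (2n-5)! / (2^(n-3) (n-3)!) for unrooted trees and (2n-3)! / (2^(n-2) (n-2)!) for rooted trees; the function names are hypothetical.

```python
from math import factorial


def count_unrooted_trees(n):
    """Number of distinct unrooted bifurcating trees on n labelled leaves."""
    if n < 3:
        raise ValueError("need at least 3 leaves")
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


def count_rooted_trees(n):
    """Number of distinct rooted bifurcating trees on n labelled leaves."""
    if n < 2:
        raise ValueError("need at least 2 leaves")
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


for n in (4, 5, 6):
    print(n, count_unrooted_trees(n), count_rooted_trees(n))
# 4 3 15
# 5 15 105
# 6 105 945
```

The rooted count for n leaves equals the unrooted count for n+1 leaves, which matches the idea that a root can be placed on any connection of an unrooted tree.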

APPENDIX 7: The UPGMA algorithm

The unweighted pair-group method with arithmetic mean (UPGMA) is a popular distance analysis method for constructing rooted ultrametric phylogenetic trees that assumes the same evolutionary speed on all lineages, i.e. the rate of mutations is constant over time and for all lineages in the tree. This is called a 'molecular clock hypothesis'. UPGMA starts with a matrix of pairwise distances K(1..n, 1..m).

1. Find the pair (taxa i and j) with the smallest distance value in the distance matrix: K(i,j).

2. Define a new taxon comprising taxa i and j: taxon i is connected by a branch to the common ancestor node. The same applies for taxon j. Therefore, the distance K(i,j) is split onto the two branches, so each of the two branches obtains a length of K(i,j)/2.

3. If i and j were the last 2 taxa, the tree is finished. If not, the algorithm defines a new taxon called u.

4. Define the distance from u to each other taxon (n, with n ≠ i or j) to be an average of the distances K(n,i) and K(n,j): K(n,u) = (K(n,i) + K(n,j))/2.

5. Go back to step 1 with one less taxon. Taxa i and j are eliminated, and taxon u is added to the tree.

Reference: Michener, C.D., Sokal, R.R. (1957): A quantitative approach to a problem of classification.

Evolution, 11:490–499.
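The UPGMA steps above can be sketched as follows. The code follows step 4 exactly as written, i.e. a plain average of the two distances; the more common formulation weights the average by cluster size, which gives the same result only when the merged clusters are equally large. The data layout (a dictionary keyed by `frozenset` pairs), the nested-tuple tree and the toy distances are all assumptions made for illustration.

```python
def upgma(names, dist):
    """Return (tree, merges): a nested-tuple tree and a list of (node, height)."""
    d = dict(dist)          # copy so the caller's matrix is left untouched
    taxa = list(names)
    merges = []
    while len(taxa) > 1:
        # Step 1: find the pair with the smallest distance.
        i, j = min(
            ((a, b) for k, a in enumerate(taxa) for b in taxa[k + 1:]),
            key=lambda pair: d[frozenset(pair)],
        )
        k_ij = d[frozenset((i, j))]
        # Steps 2 and 3: the new taxon u joins i and j at height K(i,j)/2.
        u = (i, j)
        merges.append((u, k_ij / 2))
        # Step 4: distance from u to every remaining taxon is a plain average.
        for n in taxa:
            if n not in (i, j):
                d[frozenset((n, u))] = (d[frozenset((n, i))] + d[frozenset((n, j))]) / 2
        # Step 5: i and j are eliminated, u is added.
        taxa = [n for n in taxa if n not in (i, j)] + [u]
    return taxa[0], merges


# Toy distances between four hypothetical taxa.
pairs = {("A", "B"): 2, ("A", "C"): 6, ("A", "D"): 10,
         ("B", "C"): 6, ("B", "D"): 10, ("C", "D"): 10}
dist = {frozenset(p): v for p, v in pairs.items()}

tree, merges = upgma(["A", "B", "C", "D"], dist)
print(tree)                      # ('D', ('C', ('A', 'B')))
print([h for _, h in merges])    # [1.0, 3.0, 5.0]
```

Ties in step 1 are broken by taking the first minimal pair encountered, so a different input order can produce a different but equally valid tree.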

APPENDIX 9: Viterbi algorithm

APPENDIX 10: The Neighbour joining algorithm:

APPENDIX 11: Over-representation calculation

$$\forall i \in \text{pathways}: \quad p_i = \sum_{j=k_i}^{\min(K, N_i)} \frac{\binom{N_i}{j} \binom{M-N_i}{K-j}}{\binom{M}{K}}$$

Where N_i is the size of pathway i, M is the overall number of genes or metabolites available to be picked, K is the number of genes or metabolites picked and k_i is the number of genes or metabolites picked from the pathway.
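A direct evaluation of this sum is short. The sketch assumes the formula is the upper tail of the hypergeometric distribution, i.e. the probability of picking at least k_i members of the pathway by chance; the function name and the example numbers are hypothetical, and `math.comb` requires Python 3.8 or later.

```python
from math import comb


def overrepresentation_p(N_i, M, K, k_i):
    """Probability of drawing k_i or more pathway members in K picks out of M."""
    upper = min(K, N_i)
    favourable = sum(
        comb(N_i, j) * comb(M - N_i, K - j)   # j from the pathway, K - j from the rest
        for j in range(k_i, upper + 1)
    )
    return favourable / comb(M, K)


# Hypothetical example: pathway of 5 among 20 metabolites, 4 picked, all 4 in the pathway.
p = overrepresentation_p(N_i=5, M=20, K=4, k_i=4)
```

A small p suggests the pathway is picked more often than expected by chance; when many pathways are tested, the resulting p-values would normally need a multiple-testing correction.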
