CSE280 Vineet Bafna
CSE280a: Projects
Vineet Bafna
Project Logistics
• Research project (70%)
• Work individually, or in groups of 2
• Two presentations:
  – Introductory presentation: Feb 1st week (20 minutes) (20% grade)
    • Describe the goals of the project
    • Describe your (computational) formulation
    • Summarize/critique reading assignment
    • Present an algorithm
    • Constructive criticism of other projects
  – One-on-one meeting with instructor (end of February) (10% grade)
    • Discuss preliminary results
  – Final presentation (last 2-3 classes) (30% grade)
    • Submit a final report
    • Final presentation
Project 1: disease gene mapping
• Recall: Linkage Disequilibrium (LD)
• In the absence of recombination,
  – Correlation between columns
  – The joint probability Pr[A=a,B=b] is different from P(a)P(b)
• With extensive recombination
  – Pr(a,b)=P(a)P(b)
Measures of LD
• Consider two bi-allelic sites with alleles marked with 0 and 1
• Define
  – P00 = Pr[Allele 0 in locus 1, and 0 in locus 2]
  – P0* = Pr[Allele 0 in locus 1]
  – P*0 = Pr[Allele 0 in locus 2]
• Linkage equilibrium if P00 = P0* P*0
• D = abs(P00 - P0* P*0) = abs(P01 - P0* P*1) = …
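The definition of D can be sketched directly from two SNP columns. This is a minimal illustration, assuming each site is given as a list of 0/1 alleles over the same individuals; the function name `ld_d` is hypothetical and the probabilities are estimated as simple sample frequencies.

```python
def ld_d(site1, site2):
    """Estimate D = |P00 - P0* P*0| from two equal-length 0/1 allele lists.

    site1[i] and site2[i] are the alleles of individual i at locus 1 and 2.
    (Hypothetical helper; frequencies are plain sample proportions.)
    """
    if len(site1) != len(site2) or not site1:
        raise ValueError("sites must be non-empty and of equal length")
    n = len(site1)
    # P00: fraction of individuals carrying allele 0 at both loci
    p00 = sum(1 for a, b in zip(site1, site2) if a == 0 and b == 0) / n
    p0_star = site1.count(0) / n  # P0*: allele 0 at locus 1
    p_star0 = site2.count(0) / n  # P*0: allele 0 at locus 2
    return abs(p00 - p0_star * p_star0)
```

Two identical columns such as `[0, 0, 1, 1]` give D = 0.25, while the columns `[0, 0, 1, 1]` and `[0, 1, 0, 1]` give D = 0 (linkage equilibrium).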
LD can be used to map disease genes
• LD decays with distance from the disease allele.
• By plotting LD, one can short list the region containing the disease gene.
[Figure: a SNP column (011001) shown against the labels DNNDDN, with a plot of LD along the region]
Multiple loci
• In complex diseases, multiple loci interact to confer disease susceptibility
[Figure: two SNP columns (001001 and 011000) shown against the labels DNNDDN, with a plot of LD]
Testing for multiple loci
• Assume a SNP matrix with n individuals and m loci. Testing all sets of 5 SNPs implies a huge number of computations:
  $\binom{m}{5} \cdot n$
• Can you come up with computational strategies that can speed it up?
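To get a feel for how quickly this count grows, the expression can be evaluated for a few matrix sizes. This is a small sketch; the function name `exhaustive_cost` and the example sizes are illustrative assumptions, not values from the project.

```python
import math


def exhaustive_cost(n, m, k=5):
    """Cost of testing every k-subset of m SNPs on n individuals: C(m, k) * n."""
    return math.comb(m, k) * n


if __name__ == "__main__":
    # Hypothetical sizes, chosen only to show the growth rate
    for m in (100, 1000, 10000):
        print(m, exhaustive_cost(n=1000, m=m))
```

Even m = 1000 already gives more than 8 × 10^12 subsets of size 5, before multiplying by n.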
Speeding up multiple locus computations
• A filtering strategy?
• Input: a SNP matrix with one or more pairs that interactively associate
• Output: a set of SNP pairs that includes the interacting pair(s).
• Method should be fast, and should NOT consider all pairs.
Speeding up the computations
• Correlated SNPs should also have low Hamming distance.
• Random SNPs should have high Hamming distance.
• Strategy: select k individuals at random.
  – Hash each SNP restricted to those k individuals
  – Correlated SNPs should fall in the same bin with high probability
[Figure: example SNP columns 110011, 001001, and 101011, hashed with k=2]
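The hashing strategy might look like the sketch below. It assumes the SNP matrix is a list of rows (one per individual) of 0/1 values; the function name `candidate_pairs` and the `rounds` and `seed` parameters are hypothetical additions. Only SNPs that land in the same bin are paired, so most pairs are never examined.

```python
import random
from collections import defaultdict
from itertools import combinations


def candidate_pairs(snp_matrix, k, rounds=1, seed=0):
    """Return SNP index pairs that share a hash bin in at least one round.

    snp_matrix[i][j] is the 0/1 allele of individual i at SNP j.
    Each round samples k individuals and uses a SNP's alleles on that
    sample as its hash key. (Illustrative sketch, not a tuned method.)
    """
    rng = random.Random(seed)
    n, m = len(snp_matrix), len(snp_matrix[0])
    pairs = set()
    for _ in range(rounds):
        sample = rng.sample(range(n), k)  # k random individuals
        bins = defaultdict(list)
        for j in range(m):
            key = tuple(snp_matrix[i][j] for i in sample)
            bins[key].append(j)
        # Only SNPs sharing a bin become candidate pairs
        for members in bins.values():
            pairs.update(combinations(members, 2))
    return pairs
```

Smaller k gives larger bins (more candidates, fewer misses); larger k gives tighter bins, and repeating for several rounds reduces the chance of missing a correlated pair.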
Project 2: mtDNA phylogeny
• In the absence of recombination, the history of mitochondrial DNA can be expressed by a tree.
• The goal of this project is to build a robust phylogeny using a heuristic modification of the perfect phylogeny.
The Genographic project
• The genographic project aims to trace geographic origins of the human race using mitochondrial DNA.
https://www3.nationalgeographic.com/genographic/atlas.html
Without recurrent mutations
• Unique tree can explain the evolutionary history
[Figure: tree rooted at r with leaves A, B, C, D, E; edges labeled with mutations 1 to 5]

    1 2 3 4 5
A   1 1 0 0 0
B   0 0 1 0 0
C   1 1 0 1 0
D   0 0 1 0 1
E   1 0 0 0 0
With recurrent mutations
• Adding another individual F destroys perfect phylogeny
• Why?
• It is not so easy to place F
• Can you suggest a strategy?

    1 2 3 4 5
A   1 1 0 0 0
B   0 0 1 0 0
C   1 1 0 1 0
D   0 0 1 0 1
E   1 0 0 0 0
F   0 1 0 0 0

[Figure: the same tree rooted at r with F attached; mutation labels 1 and 2 appear a second time near F]
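One way to see why F causes trouble is to check columns pairwise. The sketch below assumes the root r carries allele 0 at every site; under that assumption a binary matrix admits a perfect phylogeny exactly when, for every pair of columns, the sets of individuals carrying a 1 are either disjoint or nested. The function name `conflicting_columns` is hypothetical.

```python
from itertools import combinations


def conflicting_columns(matrix):
    """Return 1-based column pairs that rule out a perfect phylogeny.

    matrix[i][j] is the 0/1 state of individual i at site j, with the
    root assumed to be all zeros. Two columns conflict when their sets
    of 1-carriers overlap without one containing the other.
    """
    m = len(matrix[0])
    ones = [{i for i, row in enumerate(matrix) if row[j] == 1} for j in range(m)]
    conflicts = []
    for a, b in combinations(range(m), 2):
        shared = ones[a] & ones[b]
        if shared and shared != ones[a] and shared != ones[b]:
            conflicts.append((a + 1, b + 1))
    return conflicts


# Rows A to E from the slide, then the same matrix with F added
A_TO_E = [
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [1, 0, 0, 0, 0],
]
WITH_F = A_TO_E + [[0, 1, 0, 0, 0]]
```

For rows A to E no pair conflicts; with F added, columns 1 and 2 conflict because A and C carry both mutations, E carries only 1, and F carries only 2.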
Tests of Selection
• In class, we have discussed alleles that can be selectively neutral, or under active selection
  – Active selection may be positive or negative
• How do we identify regions under positive, or negative selection?
• Balancing selection: sometimes it is helpful for a population to
Adaptive Selection
• Selection leads to loss of heterozygosity (will be explained in detail in the next lecture).
• Can you come up with a test for selection?
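As a naive starting point (an assumption for illustration, not a method given in these slides), one could scan for windows of unusually low heterozygosity. The sketch uses the expected heterozygosity 2p(1-p) per bi-allelic site, where p is the sample frequency of allele 1; the function names are hypothetical, and turning the scan into an actual statistical test is left open.

```python
def site_heterozygosity(column):
    """Expected heterozygosity 2p(1-p) at one bi-allelic site (0/1 alleles)."""
    p = sum(column) / len(column)
    return 2 * p * (1 - p)


def window_heterozygosity(haplotypes, window):
    """Average per-site heterozygosity in each sliding window of sites.

    haplotypes[i][j] is the 0/1 allele of haplotype i at site j.
    Windows with unusually low values are candidate regions only.
    """
    m = len(haplotypes[0])
    per_site = [site_heterozygosity([h[j] for h in haplotypes]) for j in range(m)]
    return [sum(per_site[s:s + window]) / window for s in range(m - window + 1)]
```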
Balancing selection
• Sometimes both alleles are useful in a population, and it helps to have both around
• A simple example is when diversity is important (the two variants help maintain diversity)
• Bipolar disorder genes could be under balancing selection
  – High creativity which might confer some selective/reproductive advantage.
  – Depression offers a disadvantage
• If so, the tests for this disorder might be tricky.
• How can we identify regions under balancing selection?
Testing for Balancing Selection
• Adaptive selection leads to loss of heterozygosity (will be explained in detail in the next lecture).
• Balancing selection leads to two dominant haplotypes
• Can you come up with a test for balancing selection?
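A first, deliberately crude statistic for "two dominant haplotypes" could be the share of the sample taken up by the two most frequent haplotypes in a region. The function name `top_two_share` is hypothetical, and the statistic is an illustrative assumption rather than an established test.

```python
from collections import Counter


def top_two_share(haplotypes):
    """Fraction of haplotypes that belong to the two most common types.

    haplotypes is a list of 0/1 sequences over the same region.
    Values near 1 with two comparably frequent types would be
    consistent with two dominant haplotypes (a heuristic only).
    """
    counts = Counter(tuple(h) for h in haplotypes)
    top = counts.most_common(2)
    return sum(count for _, count in top) / len(haplotypes)
```

A usable test would also need to compare this value against what neutral evolution produces, which the sketch does not attempt.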
Project: Primer design for cancer genomics
The Science behind Gleevec
• Fusions
  – observed in leukemia, lymphoma, and sarcomas
• “Philadelphia Translocation”
  – Drugs target this fusion protein
Fluorescence in situ hybridization (FISH)
• Cancer genomes show extensive structural variation
Assaying for tumor variants
• Most tumors start off with a single cell, which then proliferates.
• Drugs like Gleevec are used well after cancer has taken hold.
• Can we detect the cancer early by detecting the genomic abnormality?
– If a very few cells in the person are cancerous, can we still detect it?
• Can we track a patient through his treatment?
Cancer genomics
• In cancers, large genetic changes can occur, including deletions, inversions, and rearrangements of genomes
• In the early stages, only a few cells will show this
deletion
Polymerase Chain Reaction
• PCR is a technique for amplifying and detecting a specific portion of the genome
• Amplification takes place if the primers are an ‘appropriate’ distance apart (<2kb)
Assaying for Rare Variants
• PCR can be used to assay for a given genomic abnormality, even in a heterogeneous population of tumor and normal cells
[Figure: workflow Extract Genomic DNA → PCR → Detection; labels: "Distance too large for amplification" and "Tumor cell"]
Variant Variants
• What if the variant is the minority in the cell population?
• What if deletion boundaries are uncertain?
[Figure: deletions with different boundaries in Patient A, Patient B, and Patient C]
Observed variation in deletion size
Sizes of homozygous deletions in cell lines from different human cancers.
(scale is in megabases).
Primer Approximation Multiplex PCR (PAMP)*
• Multiple primers are optimally spaced, flanking a breakpoint of interest
  – Upstream of breakpoint, forward primers
  – Downstream of breakpoint, reverse primers
• The primers are run in a multiplex PCR reaction
  – Any pair can form a viable product
[Figure: deletions in Patient B and Patient C]
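The idea that any forward/reverse pair can form a product once a deletion brings the two primers close together can be sketched as follows. The coordinate model is an assumption: primers are single genomic positions, the deletion removes the half-open interval [del_start, del_end), and a pair amplifies when its post-deletion distance is below the 2 kb limit mentioned earlier. The function name `amplifying_pairs` is hypothetical.

```python
def amplifying_pairs(forward, reverse, del_start, del_end, max_product=2000):
    """List (forward, reverse, distance) pairs that amplify across a deletion.

    forward / reverse: primer positions upstream / downstream of the
    breakpoint region. Only pairs that span the whole deletion are
    considered; primers inside the deleted interval are treated as lost.
    """
    deleted = del_end - del_start
    pairs = []
    for f in forward:
        if f >= del_start:  # not upstream of the deletion
            continue
        for r in reverse:
            if r < del_end:  # not downstream of the deletion
                continue
            distance = (r - f) - deleted  # spacing after the deletion
            if distance < max_product:
                pairs.append((f, r, distance))
    return pairs
```

With forward primers at 1000, 3000, 5000, reverse primers at 20000, 22000, 24000, and a deletion of [5500, 19500), only the pair closest to the breakpoints (5000, 20000) amplifies, at a distance of 1000.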
Experimental Design (500Kb region)
• 10 sets of 25 primers: upstream and downstream
  – 250 upstream
  – 250 downstream
• Primer-pairs closest to breakpoint amplified
• Assay by oligo array
Goal: Computational selection of an ‘optimal’ primer set
Goal
• Input: a collection of primers
• Identify a subset of primers that do not cross-hybridize, are unique, yet cover the region completely
• Use combinatorial optimization, simulated annealing, integer linear programming, …
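As a baseline to compare the optimization methods against, a simple greedy heuristic can be sketched. It is not one of the techniques named above. The model is an assumption: candidate primers on one strand are given by position, cross-hybridizing pairs are given as a set of `frozenset` pairs, and coverage means consecutive chosen primers are at most `max_gap` apart. All names are hypothetical.

```python
def greedy_primer_subset(positions, conflicts, max_gap):
    """Greedily pick primers left to right, avoiding cross-hybridizing pairs.

    positions: candidate primer positions on one strand.
    conflicts: set of frozenset({p, q}) pairs that must not both be chosen.
    max_gap:   desired maximum spacing between consecutive chosen primers.
    Picks the farthest compatible primer within reach; if none is in
    reach, takes the nearest compatible one and accepts a coverage gap.
    """
    def compatible(p, chosen):
        return all(frozenset((p, q)) not in conflicts for q in chosen)

    remaining = sorted(positions)
    chosen = []
    while remaining:
        usable = [p for p in remaining if compatible(p, chosen)]
        if not usable:
            break
        in_reach = [p for p in usable if chosen and p - chosen[-1] <= max_gap]
        pick = max(in_reach) if in_reach else min(usable)
        chosen.append(pick)
        remaining = [p for p in remaining if p > pick]
    return chosen
```

Because the greedy choice never reconsiders earlier picks, it can leave gaps that simulated annealing or an integer linear program would avoid.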
Spectral Networks Algorithms for De Novo Interpretation of
Tandem Mass Spectra
Nuno Bandeira, Ph.D.
Department of Computer Science and Engineering, University of California, San Diego
ProtIG seminar series
September 21, 2007
Proteins and their modifications
Proteins are fundamental players in the regulation of biological processes.
[Figure: DNA and Proteins, linked by arrows labeled "encodes for" and "regulate"]
Knowing proteins involves knowing many things. This dissertation focuses on:
- Identification
- Sequencing
- Post-translational modifications
Protein sequences and modifications
From a computational perspective, a protein can be represented as a string over a weighted alphabet:
Protein sequence: …AFSRLEMILGF…
Subsequences are called peptides (obtained via enzymatic digestion): AFSRL, SRLEMILGF, EMILG

Amino acid   Mass
A            71
F            147
S            87
R            156
L            113
E            129
M            131
I            113
G            57

Modifications change amino acid masses (a modified residue is marked with *):
SRLEMILGF     Mass(SRLEMILGF)=1047     Mass(M)=131
SRLEM*ILGF    Mass(SRLEM*ILGF)=1063    Mass(M*)=147    Mass(*)=16
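The weighted-alphabet view translates directly into a mass computation. The sketch below uses the integer masses from the table; the function name `peptide_mass` and the `mods` argument (residue index to mass offset) are hypothetical. Note that summing the integer table for SRLEMILGF gives 1046, one less than the 1047 quoted above; this may reflect rounding of more precise masses, which the table does not show.

```python
# Integer residue masses from the table above
MASS = {"A": 71, "F": 147, "S": 87, "R": 156, "L": 113,
        "E": 129, "M": 131, "I": 113, "G": 57}


def peptide_mass(peptide, mods=None):
    """Sum of residue masses, plus optional per-site modification offsets.

    mods maps a 0-based residue index to a mass offset, for example
    {4: 16} for a +16 modification on the fifth residue.
    """
    mods = mods or {}
    return sum(MASS[aa] for aa in peptide) + sum(mods.values())


if __name__ == "__main__":
    print(peptide_mass("SRLEMILGF"))           # unmodified peptide
    print(peptide_mass("SRLEMILGF", {4: 16}))  # +16 on M
```

The modified and unmodified masses differ by exactly the modification offset of 16, matching the difference between the two values on the slide.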
Nobel prize in chemistry, 2002
What is mass spectrometry?
http://nobelprize.org/chemistry/laureates/2002/chemadv02.pdf
Amino acid Mass
A 71
F 147
S 87
…
Tandem Mass Spectrometry (MS/MS)
Protein sequence: …THISISAVERYLARGESAMPLEPRTEINSEQENCE…
Peptide: LARGE
[Figure: MS/MS spectrum of peptide LARGE; axes: Mass (m/z) vs. Intensity. Prefix masses (b): L, LA, LAR, LARG. Suffix masses (y): E, GE, RGE, ARGE. The right end of the mass axis is labeled PM.]
Modification: any event that changes the mass at a specific site.
[Figure: MS/MS spectrum of modified peptide LARG*E; axes: Mass (m/z) vs. Intensity. Prefix masses (b): L, LA, LAR, LARG*. Suffix masses (y): E, G*E, RG*E, ARG*E. Peaks that contain G* show mass shifts.]
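The prefix and suffix masses of a peptide, and the way a modification shifts some of them, can be sketched as below. This is a simplification: it uses integer residue masses and ignores the small constant offsets that real b and y ion m/z values include. The function name and the +16 offset placed on G are hypothetical and chosen only to show which peaks move.

```python
from itertools import accumulate

# Integer residue masses for the residues of LARGE
MASS = {"L": 113, "A": 71, "R": 156, "G": 57, "E": 129}


def prefix_suffix_masses(peptide, mods=None):
    """Return (prefix masses, suffix masses) of all proper prefixes/suffixes.

    mods maps a 0-based residue index to a mass offset (hypothetical
    values; the slides do not state the offset on G*).
    """
    mods = mods or {}
    residues = [MASS[aa] + mods.get(i, 0) for i, aa in enumerate(peptide)]
    prefix = list(accumulate(residues))[:-1]            # L, LA, LAR, LARG
    suffix = list(accumulate(reversed(residues)))[:-1]  # E, GE, RGE, ARGE
    return prefix, suffix
```

For LARGE this gives prefix masses 113, 184, 340, 397 and suffix masses 129, 186, 342, 413. Placing an offset on G (index 3) shifts only the last prefix mass and the suffix masses of GE, RGE, and ARGE, which is the pattern of mass shifts described for LARG*E.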
Example of a real MS/MS spectrum
[Figure: a real MS/MS spectrum; labels: "Symmetric", b10, y12]
Tandem Mass Spectrometry (MS/MS)
[Figure: Proteins → enzymatic digestion → Peptides → tandem mass spectrometry → large set of MS/MS spectra]
Database of known peptides:
MDERHILNM, KLQWVCSDL, PTYWASDL, ENQIKRSACVM, TLACHGGEM, NGALPQWRT,
HLLERTKMNVV, GGPASSDA, GGLITGMQSD, MQPLMNWE,
ALKIIMNVRT, SEQUENCE, HEWAILF, GHNLWAMNAC,
GVFGSVLRA, EKLNKAATYIN..
[Figure: a spectrum is interpreted as Peptide SEQUENCE by two routes: Database search and De novo sequencing]
Mixture spectra
Sometimes, the instrument generates a single spectrum from two or more peptides:
[Figure: spectra of Peptide A (NLAFFQLR) and Peptide B (ALDDILNLK) combine into one mixture spectrum; which peptides produced it?]
How to identify mixture spectra?
Proposed approach
• When identifying a mixture spectrum of peptides A,B, assume you have non-mixture spectra for the same peptides.
• Compare the non-mixture spectra of known peptides to putative mixture spectra to determine peptide identifications
Project description
• Implement an algorithm to identify mixture spectra from pairs of peptides by combining previously identified spectra from isolated peptides.
• Test the above implementation by simulating mixture spectra using an existing database of spectra from isolated peptides.
• Propose a scoring procedure to separate correct from false identifications.
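The three steps above can be sketched in a toy form. Everything here is an assumption made for illustration: spectra are reduced to lists of peak masses (intensities ignored), a mixture is simulated as the union of two peak lists, and a candidate pair is scored by the fraction of mixture peaks it explains within a tolerance. The function names, the tolerance, and the example library are hypothetical.

```python
from itertools import combinations


def simulate_mixture(spec_a, spec_b):
    """Simulate a mixture spectrum as the sorted union of two peak lists."""
    return sorted(set(spec_a) | set(spec_b))


def explained_fraction(mixture, spec_a, spec_b, tol=0.5):
    """Fraction of mixture peaks within tol of a peak from either spectrum.

    Assumes the mixture has at least one peak.
    """
    combined = list(spec_a) + list(spec_b)
    hits = sum(1 for peak in mixture
               if any(abs(peak - q) <= tol for q in combined))
    return hits / len(mixture)


def best_pair(mixture, library, tol=0.5):
    """Return the pair of library peptides that best explains the mixture.

    library maps a peptide name to its non-mixture peak list.
    """
    return max(
        combinations(sorted(library), 2),
        key=lambda pair: explained_fraction(
            mixture, library[pair[0]], library[pair[1]], tol),
    )
```

This score favors spectra with many peaks and says nothing about how to separate correct from false identifications, so a proper scoring procedure remains the open part of the project.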
Nuno Bandeira