Beyond Genome Annotation - Characterizing Chromosome Features Terry Clark Assistant Professor...
Beyond Genome Annotation - Characterizing Chromosome Features
Terry Clark
Assistant Professor Electrical Engineering and Computer Science
The University of Kansas
2005 ITTC Research Review
April 7, 2005
ABSTRACT
Genome sequence data and their annotations are routinely used for determining genetic variation, assessing gene products, designing primers for various experiments, designing microarrays, and other laboratory and computational applications. Well-known methods for genome analysis include sequence alignment, motif-based systems, and stochastic models, among others. Genome sequences also reflect a dynamic chemical and physical interplay among proteins and DNA in the eukaryotic nucleus, involving chromatin and various proteins. This organization of nuclear DNA is critical to the function and specialization of cells through the regulation of genes. Toward understanding genome structure, our laboratory develops and applies methods ranging from computational linguistics to molecular modeling. One such method is an unsupervised, alignment-free approach that naturally tolerates the reorganizations and insertions common in genome evolution and, being unsupervised, permits de novo determination of features and feature associations. In this presentation I develop, in this unsupervised, alignment-free context, a notion that we call a lexicon: an inductively generated set of nucleotide “words” of varying length devised to represent a given sequence optimally. The resulting lexicon and parse provide points of departure for sequence analyses that use lexicon content, the sequence representation, and sequence information content. The insights gained from bioinformatics are rationalized by, and also steer, molecular modeling studies. A representative application will be presented in this talk. (Selected slides from the presentation follow.)
DNA sequence
GCTGAGGGAAGTGAGAGACTGAGGTGGGGNCTGGAGGAGCCTGAAAAGCAGAAGTAGGAGGAAGCAGAGCTGCTCGGAACAGATCCAGAAACAGCATGTACTCACCCATCCCCCAGAGCGGCTCTCCGTTCCCACCGACCGTGAAGCTCCCTGGCCTGCACATATGGAGGGTGGAGAAGCTGAAGCCAGTGCCTGTGGCCCCTGAGAACTACGGCATTTTCTTCTCGGGAGACTCCTACCTGGTGCTGCACAATGGCCCGGAAGAGCTCTCCCACCTGCACCTGTGGATCGGCCAGCAGTCGTCCCGGGACGAGCAGGGGGGCTGCGCCATATTGGCCGTGCACCTCAACACCCTGCTCGGAGAGCGGCCTGTGCAGCACCGAGAGTCACAGGGCAATGAGTCCGACCTCTTCATGAGCTACTTCCCCCACGGCCTCAAGTACCAGGAAGGCGGCGTGGAGTCGGCGTTTCACAAGACCTCCCCAGGAACCGCCCCAGCTGCCATCAAGAAACTCTACCAGGTGAAGGGCAAGAAGAACATTCGTGCCACTGAGCGGGTGCTGAGCTGGGACAGTTTCAACACAGGGGACTGCTTCATCCTGGATCTGGGCCAGAACATCTTTGCCTGGTGTGGTGCGAAGTCCAACATATTGGAGCGGAACAAGGCACGGGACCTGGCACTGGCCATCCGGGACAGCGAGCGGCAGGGCAAGGCCCACGTGGAGATCGTCACCGATGGGGAGGAGCCTGCCGACATGATACAGGTCTTGGGTCCCAAGCCCTCTCTGAAGGAGGGTAACCCTGAGGAAGACCTCACAGCTGACCGGACAAACGCACAGGCCGCGGCTCTGTATAAGGTCTCTGACGCCACTGGACAGATGAACCTGACCAAGCTGGCTGATTCCAGCCCCTTCGCCCTCGAGCTGCTGATACCCGATGACTGCTTTGTGTTGGACAACGGACTCTGCGGCAAGATCTACATCTGGAAGGGGCGCAAAGCTAATGAGAAGGAGAGGCAGGCGGCCCTCCAAGTGGCGGAGGACTTTATCACCCGCATGCGGTATGCCCCAAACACTCAGGTGGAGATTCTGCCCCAGGGCCGCGAGAGTGCCATCTTCAAGCAATTCTTCAAGGACTGGAAGTGAGGGTGGGCATCTCCCTGCCCCTACCTCCTACCCACTTGCTCCTCC
The Model: DNA as a Sequence of Features
[Figure: a DNA sequence drawn as a series of features: genes, a binding site, a transposon, and LTRs.]
To detect features in a nucleotide sequence without prior knowledge, based solely on nucleotide occurrence patterns, we apply an unsupervised algorithm originally developed for modeling speech acquisition.
The text (the DNA sequences) is presented to the algorithm as unbroken sequences of characters over the nucleotide alphabet. The task is to find the vocabulary for the text, which we also call a corpus.
A chromosome may be thought of as a collection of different languages. This analogy intuitively follows from the inhomogeneities in nucleotide compositions arising from the various functions that DNA performs.
A central computation in this approach is the probability of a parameter in the representation of the sequence (corpus). For this we use the well-known forward-backward algorithm, which takes into account all paths through a lattice of representations, where a representation of a sequence is a concatenation of words.
Represented above are two positions in a sequence, namely positions a and b. The arcs into these locations are all possible paths, each using some combination of words from the current lexicon. Roughly, $\alpha_G(a)$ is the sum of the probabilities of paths in the model from the front of the sequence to location a, whereas $\beta_G(b)$ is the same from the end of the sequence back to location b. The parameter under consideration, word w, spans the sequence between locations a and b.
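[Figure: positions a and b in a sequence, with word w spanning the region between them and arcs showing the forward and backward paths into a and b.]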
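The forward and backward quantities described here can be written as recursions. The following is a sketch under stated assumptions, not necessarily the exact form used on the slides: positions $0,\dots,n$ are taken to be boundaries between characters of $s$, $L$ is the current lexicon, $G$ stands for the current model (the lexicon with its word probabilities), and the probability of a representation is assumed to be the product of the probabilities $p(w)$ of its words.

$$
\alpha_G(0) = 1, \qquad
\alpha_G(i) = \sum_{\substack{w \in L \\ s[i-|w|\,..\,i] = w}} \alpha_G(i - |w|)\, p(w)
$$

$$
\beta_G(n) = 1, \qquad
\beta_G(i) = \sum_{\substack{w \in L \\ s[i\,..\,i+|w|] = w}} p(w)\, \beta_G(i + |w|)
$$

$$
p_G(s) = \alpha_G(n) = \beta_G(0)
$$

Under these assumptions each sum runs only over lexicon words that match the sequence immediately before (for $\alpha_G$) or after (for $\beta_G$) position $i$, so both tables can be filled in a single pass over the sequence.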
With the forward and backward probabilities, and the probability of the parameter under consideration, w, the probability of w spanning the region from a to b in sequence s is given by:
$$
p(w,\, a \to b \mid s) \;=\; \frac{\alpha_G(a)\; p(w)\; \beta_G(b)}{p_G(s)}
$$
Applying this equation over all representations and all points a and b gives the expected count of parameter w. Such counts are the basis of the expectation step in the EM optimization algorithm; the maximization step then adjusts the probabilities in the model to maximize the expectation of the evidence under the model.
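The expectation and maximization steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the laboratory's implementation: it assumes the probability of a representation is the product of its word probabilities, and the function names (`forward_backward`, `expected_counts`, `em_step`) and the toy lexicon are hypothetical. It also works with raw probabilities, which would underflow on chromosome-scale input; a practical version would rescale or work in log space.

```python
from collections import defaultdict


def forward_backward(seq, probs):
    """Return (alpha, beta) over boundary positions 0..len(seq).

    alpha[i]: summed probability of all parses of seq[:i].
    beta[i]:  summed probability of all parses of seq[i:].
    """
    n = len(seq)
    max_len = max(len(w) for w in probs)
    alpha = [0.0] * (n + 1)
    beta = [0.0] * (n + 1)
    alpha[0] = 1.0
    for i in range(1, n + 1):
        for k in range(1, min(max_len, i) + 1):
            w = seq[i - k:i]
            if w in probs:
                alpha[i] += alpha[i - k] * probs[w]
    beta[n] = 1.0
    for i in range(n - 1, -1, -1):
        for k in range(1, min(max_len, n - i) + 1):
            w = seq[i:i + k]
            if w in probs:
                beta[i] += probs[w] * beta[i + k]
    return alpha, beta


def expected_counts(seq, probs):
    """E step: posterior-weighted count of each word over all spans (a, b)."""
    alpha, beta = forward_backward(seq, probs)
    p_seq = alpha[-1]
    if p_seq == 0.0:
        raise ValueError("sequence cannot be parsed with this lexicon")
    n = len(seq)
    max_len = max(len(w) for w in probs)
    counts = defaultdict(float)
    for a in range(n):
        for k in range(1, min(max_len, n - a) + 1):
            w = seq[a:a + k]
            if w in probs:
                # alpha(a) * p(w) * beta(b) / p(s), with b = a + k
                counts[w] += alpha[a] * probs[w] * beta[a + k] / p_seq
    return dict(counts), p_seq


def em_step(seq, probs):
    """M step: renormalize expected counts into new word probabilities."""
    counts, p_seq = expected_counts(seq, probs)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}, p_seq


# Toy usage: four nucleotides plus one candidate word, all equally likely.
toy_seq = "ACGTACGTTTACGT"
toy_probs = {w: 0.2 for w in ["A", "C", "G", "T", "ACGT"]}
for _ in range(3):
    toy_probs, likelihood = em_step(toy_seq, toy_probs)
    print(likelihood, toy_probs)
```

A useful sanity check on the expectation step is that the expected counts, weighted by word length, sum to the length of the sequence, since every parse covers each position exactly once.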
Parameters are added to and deleted from the lexicon by combining existing parameters based on the evidence and the estimated cost/benefit of the new parameter to the description length.
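The cost/benefit test for a candidate word can be illustrated with a deliberately simplified two-part description length. The sketch below is an assumption-laden stand-in for the actual criterion: it uses a single fixed parse rather than expected counts, charges a flat 2 bits per nucleotide to spell each lexicon entry, and charges each token of the parse the negative log2 of its relative frequency. All names (`description_length`, `merge_pair`, `merge_benefit`) are hypothetical.

```python
import math
from collections import Counter

BITS_PER_NUCLEOTIDE = 2.0  # assumption: lexicon entries are spelled at 2 bits per base


def corpus_bits(tokens):
    """Bits to encode the parse when each word costs -log2(relative frequency)."""
    counts = Counter(tokens)
    total = len(tokens)
    return sum(c * -math.log2(c / total) for c in counts.values())


def lexicon_bits(tokens):
    """Bits to spell out every distinct word used by the parse."""
    return sum(BITS_PER_NUCLEOTIDE * len(w) for w in set(tokens))


def description_length(tokens):
    """Two-part code: cost of the lexicon plus cost of the parse given it."""
    return lexicon_bits(tokens) + corpus_bits(tokens)


def merge_pair(tokens, left, right):
    """Rewrite the parse, replacing each adjacent (left, right) by left+right."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
            merged.append(left + right)
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def merge_benefit(tokens, left, right):
    """Bits saved by adding the word left+right; positive means the merge pays off."""
    return description_length(tokens) - description_length(merge_pair(tokens, left, right))


# A frequent pair is worth a lexicon entry; a pair seen once is not.
print(merge_benefit(list("ACACACACGT"), "A", "C"))
print(merge_benefit(list("ACAGATCG"), "A", "C"))
```

In this toy setting the first call reports a saving and the second a loss, which mirrors the idea that a new parameter is kept only when the evidence for it outweighs what it costs to describe.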
word                          count in representation  frequency
A                             1363                     0.312471
T                             664                      0.152224
C                             624                      0.143054
G                             465                      0.106602
...
CCTTA                         9                        0.00206327
AAACCCTAAT                    9                        0.00206327
GTTTT                         9                        0.00206327
TCCTAAACCCT                   9                        0.00206327
CAAACC                        8                        0.00183402
CCAT                          8                        0.00183402
AACCCTAAACC                   8                        0.00183402
ACTCCA                        8                        0.00183402
CCTTAAACCCTAAACC              8                        0.00183402
CTAAACCCTAA                   8                        0.00183402
CTTTAAAACCTAAATCCTA           8                        0.00183402
CTAG                          8                        0.00183402
ATCCTACTTTAGCTTC              8                        0.00183402
TTCGTATGATTTTTGGTTTTC         7                        0.00160477
GGATT                         7                        0.00160477
ACCCTAAACATTAAAACCTAAACCC     7                        0.00160477
ATCTTCCAACAAGGAAAGAACACTTTA   7                        0.00160477
ATCTAGTCATATTTGAC             7                        0.00160477
AAAGTATATTTGGTC               7                        0.00160477
CTTCTA                        7                        0.00160477
GTTGCGGTTCTAGTTCTTATACTCAATC  7                        0.00160477
A portion of a lexicon from a chunk containing satellites
% wc -l chr4range007[789]_Lexicon_Frequency.txt
201 chr4range0077_Lexicon_Frequency.txt
117 chr4range0078_Lexicon_Frequency.txt
215 chr4range0079_Lexicon_Frequency.txt
Number of words contained in lexicons around this region
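The three columns of the lexicon excerpt are related by simple arithmetic, which the short Python sketch below makes explicit. It copies a few rows from the excerpt, checks that each frequency is consistent with count divided by a common total, and converts each frequency into an information content in bits. The helper names (`implied_total`, `code_length_bits`) are hypothetical, and treating -log2(frequency) as the per-token information content is an assumption about how "sequence information content" might be computed.

```python
import math

# Rows copied from the lexicon excerpt: (word, count in representation, frequency).
rows = [
    ("A", 1363, 0.312471),
    ("T", 664, 0.152224),
    ("C", 624, 0.143054),
    ("G", 465, 0.106602),
    ("CCTTA", 9, 0.00206327),
    ("CTAG", 8, 0.00183402),
    ("GGATT", 7, 0.00160477),
]


def implied_total(rows):
    """Estimate the number of word tokens in the representation from count / frequency."""
    estimates = [count / freq for _, count, freq in rows]
    return round(sum(estimates) / len(estimates))


def code_length_bits(freq):
    """Information content of one occurrence of a word with this frequency."""
    return -math.log2(freq)


total = implied_total(rows)
print("implied tokens in representation:", total)
for word, count, freq in rows:
    bits = code_length_bits(freq)
    # Bits per nucleotide shows how much a longer word compresses the sequence.
    print(f"{word:8s} {count:5d} {freq:.6f} {bits:6.2f} bits/token {bits / len(word):5.2f} bits/nt")
```

For these rows the listed frequencies agree with count divided by the implied total to the precision shown in the excerpt.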
>1KX5:A HISTONE H3
ARTKQTARKSTGGKAPRKQLATKAARKSAPATGGVKKPHRYRPGTVALREIRRYQKSTEL
LIRKLPFQRLVREIAQDFKTDLRFQSSAVMALQEASEAYLVALFEDTNLCAIHAKRVTIM
PKDIQLARRIRGERA
>1KX5:B HISTONE H4
SGRGKGGKGLGKGGAKRHRKVLRDNIQGITKPAIRRLARRGGVKRISGLIYEETRGVLKV
FLENVIRDAVTYTEHAKRKTVTAMDVVYALKRQGRTLYGFGG
. . .
>1KX5:H HISTONE H2B.2
PEPAKSAPAPKKGSKKAVTKTQKKDGKKRRKTRKESYAIYVYKVLKQVHPDTGISSKAMS
IMNSFVNDVFERIAGEASRLAHYNKRSTITSREIQTAVRLLLPGELAKHAVSEGTKAVTK
YTSAK
>1KX5:I DNA
ATCAATATCCACCTGCAGATACTACCAAAAGTGTATTTGGAAACTGCTCCATCAAAAGGC
ATGTTCAGCTGGAATCCAGCTGAACATGCCTTTTGATGGAGCAGTTTCCAAATACACTTT
TGGTAGTATCTGCAGGTGGATATTGAT
>1KX5:J DNA
ATCAATATCCACCTGCAGATACTACCAAAAGTGTATTTGGAAACTGCTCCATCAAAAGGC
ATGTTCAGCTGGATTCCAGCTGAACATGCCTTTTGATGGAGCAGTTTCCAAATACACTTT
TGGTAGTATCTGCAGGTGGATATTGAT
Protein and DNA Sequences: 8 Histones and 2 DNA Strands
ITTC High Performance Computing Infrastructure
• 128 processor cluster (64 nodes)
  – 3.2 GHz Processors (Xeon based)
  – 4 GB RAM / node
  – 146 GB SCSI Disk / node
• 8 dual processor server nodes
• 25-Terabyte File Server
• Tape Robot System (LTO3 Ultrium)
• High Performance Network
Compute nodes and server cluster components. System housed in newly expanded and remodeled machine room 218, Nichols Hall.