Kevin C. O'Kane Department of Computer Science The University of Northern Iowa
The Effect of Inverse Document Frequency Weights on Retrieval of
Genomic Sequences:
Towards a vector space approach
Kevin C. O'Kane
Department of Computer Science
The University of Northern Iowa
Cedar Falls, Iowa 50613
The area of natural language text indexing and retrieval has been studied since the mid-1950s. In text retrieval, the problem is to locate documents related to a natural language query.
To this purpose, natural language text indexing programs have employed many techniques to identify terms in a document most likely to be content descriptors as opposed to terms that are poor content descriptors.
By eliminating poor descriptors and pre-indexing documents by descriptors more likely to be good discriminators, the speed of selection and precision of document relevance ranking can be improved.
The “vector space model”, developed by G. Salton, views the problem as an n-dimensional hyperspace in which documents and queries are represented as vectors.
Overview
In text retrieval, the problem is to locate documents related to a natural language query.
Natural language text indexing programs identify terms in a document most likely to be content descriptors.
The goal of these experiments is to apply text indexing techniques to genomic data bases.
Natural Language Indexing
Natural language text indexing and retrieval have been studied since the mid-1950s. In text retrieval, the problem is to locate documents related to a natural language query.
Natural language text indexing programs employ techniques to identify terms in a document most likely to be content descriptors.
By eliminating poor descriptors and pre-indexing documents by descriptors likely to be good discriminators, the speed of selection and precision of document relevance ranking can be improved.
The “vector space model”, developed by G. Salton, views the problem as an n-dimensional hyperspace of documents and queries.
Document Hyperspace
Hyperspace Queries
Clustering Objects by Feature
Cosine Similarity Coefficient
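The cosine similarity coefficient named above measures the angle between a document vector and a query vector in the hyperspace. A minimal Python sketch follows; the three-term vectors and their weights are invented for illustration and do not come from the slides.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length term-weight vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot / (norm_a * norm_b)

# Hypothetical three-term vocabulary; the weights are made up.
query = [1.0, 0.0, 2.0]
doc_a = [2.0, 0.0, 4.0]   # points in the same direction as the query
doc_b = [0.0, 3.0, 0.0]   # shares no terms with the query

print(cosine_similarity(query, doc_a))
print(cosine_similarity(query, doc_b))
```

Because the coefficient depends only on direction, a long document and a short one with the same term proportions score alike against a query.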
Genomic Data Bases
EMBL (http://www.embl.org)
SWISS-PROT (http://www.expasy.org/sprot/sprot-top.html)
PROSITE (http://www.expasy.org/prosite/)
PIR (http://pir.georgetown.edu/home.shtml)
NCBI/NLM GenBank (http://www.ncbi.nih.gov/)
MGD: The Mouse Genome Database (http://www.informatics.jax.org/)
OMIM - Online Mendelian Inheritance in Man (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM)
“nt” Sequence Data Base
NCBI “nt” data base: ~ 12 billion bytes in length comprising 2,584,440 sequences in FASTA format (Sept 2004).
Example sequence:
>gi|2695852|emb|Y13263.1|ABY13263 Acipenser baeri mRNA for immunoglobulin heavy chain, clone
CAAGAACCACAATACTGCAGTACAATGGGGATTTTAACAGCTCTCTGTATAATAATGACAGCTCTATCAAGTGTCCGGTCTGATGTAGTGTTGACTGAGTCCGGACCAGCAGTTATAAAGCCTGGAGAGTCCCATAAACTGTCCTGTAAAGCCTCTGGATTCACATTCAGCAGCAACAACATGGGCTGGGTTCGACAAGCTCCTGGAAAGGGTCTGGAATGGGTGTCTACTATAAGCTATAGTGTAAATGCATACTATGCCCAGTCTGTCCAGGGAAGATTCACCATCTCCAGAGACGATTCCAACAGCATGCTGTATTTACAAATGAACAGCCTGAAGACTGAAGACTCTGCCGTGTATTACTGTGCTCGAGAGTCTAACTTCAACCGCTTTGACTACTGGGGATCCGGGACTATGGTGACCGTAACAAATGCTACGCCATCACCACCGACAGTGTTTCCGCTTATGCAGGCATGTTGTTCGGTCGATGTCACGGGTCCTAGCGCTACGGGCTGCTTAGCAACCGAATTC
GenBank
LOCUS       AAB2MCG1     289 bp    DNA     linear   PRI 23-AUG-2002
DEFINITION  Aotus azarai beta-2-microglobulin precursor exon 1.
ACCESSION   AF032092
VERSION     AF032092.1  GI:3265027
KEYWORDS    .
SEGMENT     1 of 2
SOURCE      Aotus azarai (Azara's night monkey)
  ORGANISM  Aotus azarai
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Primates; Platyrrhini; Cebidae; Aotinae; Aotus.
REFERENCE   1  (bases 1 to 289)
  AUTHORS   Canavez,F.C., Ladasky,J.J., Muniz,J.A., Seuanez,H.N., Parham,P. and
            Cavanez,C.
  TITLE     beta2-Microglobulin in neotropical primates (Platyrrhini)
  JOURNAL   Immunogenetics 48 (2), 133-140 (1998)
  MEDLINE   98298008
  PUBMED    9634477
REFERENCE   2  (bases 1 to 289)
  AUTHORS   Canavez,F.C., Ladasky,J.J., Seuanez,H.N. and Parham,P.
  TITLE     Direct Submission
  JOURNAL   Submitted (31-OCT-1997) Structural Biology, Stanford University,
            Fairchild Building Campus West Dr. Room D-100, Stanford, CA
            94305-5126, USA
FEATURES             Location/Qualifiers
     source          1..289
                     /organism="Aotus azarai"
                     /mol_type="genomic DNA"
                     /db_xref="taxon:30591"
     sig_peptide     134..193
     exon            <134..200
                     /number=1
     intron          201..>289
                     /number=1
ORIGIN
        1 gtccccgcgg gccttgtcct gattggctgt ccctgcgggc cttgtcctga ttggctgtgc
       61 ccgactccgt ataacataaa tagaggcgtc gagtcgcgcg ggcattactg cagcggacta
      121 cacttgggtc gagatggctc gcttcgtggt ggtggccctg ctcgtgctac tctctctgtc
      181 tggcctggag gctatccagc gtaagtctct cctcccgtcc ggcgctggtc cttcccctcc
Sequence Matching
Current access to sequence databases is mainly by heuristic-assisted pattern matching on flat or nearly flat files using programs such as BLAST and FASTA.
The underlying data bases are growing rapidly, with consequent deterioration of search times even on large, multiprocessor systems as current software tools reach their design limits.
BLAST systems index data base sequences according to short code letter words (usually 3 letters for amino acid and 11 for nucleotide data bases); scoring matrices are used to evaluate matches.
Queries are also decomposed into similar short code words. The data base is scanned, and sequences with words in common with the query are processed to extend the initial code word match.
Example BLAST Output
                                                                 Score    E
Sequences producing significant alignments:                      (bits) Value

emb|BX015832.1|CNS08KDO Single read from an extremity of a ...   918   0.0
emb|BX032891.1|CNS08XJJ Single read from an extremity of a ...   902   0.0
emb|BX065445.1|CNS09MNT Single read from an extremity of a ...   894   0.0
emb|BX052703.1|CNS09CTV Single read from an extremity of a ...   894   0.0
emb|BX030708.1|CNS08VUW Single read from an extremity of a ...   894   0.0
emb|BX030663.1|CNS08VTN Single read from an extremity of a ...   894   0.0
.............................................................................

>emb|BX015832.1|CNS08KDO Single read from an extremity of a full-length cDNA clone made from Anopheles gambiae total adult females. 3-PRIME end of clone FK0AAA23DA12
          Length = 866

 Score = 918 bits (463), Expect = 0.0
 Identities = 535/559 (95%)
 Strand = Plus / Plus

Query: 1   tctttactattattggggaatttcgaggaacatttgttccccttacaatgcatttctata 60
           ||||||||||||||||  |||||||| |||||||||||||||||||||||||||||||||
Sbjct: 1   tctttactattattggtcaatttcgatgaacatttgttccccttacaatgcatttctata 60

Query: 61  acctacacctggagtaggtggttccggttcagccacttcagtgggaggaacttccgtttc 120
           | |||||||||||||||||||||||||||||||||||||||||||||| |||||||||||
Sbjct: 61  aactacacctggagtaggtggttccggttcagccacttcagtgggaggcacttccgtttc 120
Developing A Vector Space Approach to Sequence Indexing
This work explores natural language indexing techniques applied to genomic data bases through:
Weight-based indexing of k-tuples derived from the NCBI “nt” sequence data base;
Text terms used in genomic sequence data banks and literature.
Both applications are implemented for Linux and written in Mumps and MDH, a Mumps-related C++ toolkit capable of indexing data sets of up to 256 terabytes using a B-tree based multidimensional data model that includes many retrieval and sequence matching functions.
Inverse Document Frequency Wgt.
The IDF weight yields higher values for words whose distribution is more concentrated and lower values for words whose use is more widespread.
Thus, words of broad context are weighted lower than words of narrow context.
Words of low weight are hypothesized to be poor indexing terms while words with high weights are hypothesized to be good indexing terms.
The bulk of the words, as is the case in natural language text, reside in the middle range.
Natural Language Example

Word        Freq(i,j)  TotFreq  DocFreq    Wgt1    Wgt2  Wgt3       MCA

[1] Death of a cult. (Apple Computer needs to alter its strategy) (column)
apple               4      261      112   1.716   9.757    17   -1.1625
computer            4      706      358   2.028   5.109    10  -19.4405
mac                 2      146       71   0.973   6.290     6   -0.0256
macintosh           4      210      107   2.038   9.940    20   -0.5855
strategy            2       79       67   1.696   6.406    11   -0.0592

[3] WordPerfect. (WordPerfect for the Macintosh 2.0) (evaluation) Taub, Eric.
edit                2      111       77   1.387   6.128     8   -0.0961
frame               2        9        7   1.556  10.924    17    0.0131
import              2       29       19   1.310   8.927    12    0.0998
macintosh           3      210      107   1.529   7.705    12   -0.5855
macro               3       38       24   1.895  12.189    23    0.1075
outstand            1       10        9   0.900   5.711     5    0.0168
user                4      861      435   2.021   4.330     9  -26.8094
wordperfect         8       24        8   2.667  39.627   106    0.1747

[4] Radius Pivot for Built-In Video and Radius Color Pivot. (Hardware Review) (new Mac monitors)(includes related article on design of
built-in            3       35       29   2.486  11.621    29    0.0678
color               3       81       47   1.741  10.173    18    0.0809
mac                 2      146       71   0.973   6.290     6   -0.0256
monitor             6       88       52   3.545  18.739    66    0.0946
resolution          2       50       32   1.280   7.884    10    0.0288
screen              2       92       62   1.348   6.561     9    0.0199
video               4      106       61   2.302  12.188    28    0.0187
Indexing Experiment
Sequences from the NCBI "nt" (non-redundant nucleotide) data base were used.
The “nt” data base is approximately 12 billion bytes in length, comprising 2,584,440 sequences in FASTA format (Sept 2004).
A “word” size of 11 was used throughout. A total of 4,194,299 words were identified, slightly fewer than the theoretical maximum of 4,194,304.
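The theoretical maximum is the number of distinct 11-letter strings over the four nucleotide letters, 4^11. The sketch below shows one way a word could be reduced to a number by reading it as a base-4 integer. The letter-to-digit mapping and the function names are assumptions made for illustration; the slides do not state which encoding the Mumps/MDH implementation uses.

```python
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed mapping, not from the source
WORD_SIZE = 11

def encode_word(word):
    """Read a nucleotide word as a base-4 integer.

    Returns None if the word contains a letter other than A, C, G or T.
    """
    value = 0
    for letter in word.upper():
        code = CODES.get(letter)
        if code is None:
            return None
        value = value * 4 + code
    return value

def words(sequence, k=WORD_SIZE):
    """Yield the integer code of every overlapping k-letter word."""
    for start in range(len(sequence) - k + 1):
        code = encode_word(sequence[start:start + k])
        if code is not None:
            yield code

print(4 ** WORD_SIZE)                # number of possible 11-letter words
print(encode_word("T" * WORD_SIZE))  # largest code under this mapping
```

With this encoding every word maps to an integer between 0 and 4^11 - 1, so a plain array can serve as the lookup table.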
Calculating the IDF Weight
The overall frequencies of occurrence of all possible 11 character words from each sequence were determined along with the number of sequences in which each unique word was found.
A weight Wgt_i for each word i was calculated by taking the base-10 logarithm of the total number of sequences (N) divided by the number of sequences in which the word occurred (DocFreq_i), multiplying by 10, and truncating to an integer:

$$\mathrm{Wgt}_i = \left\lfloor 10 \cdot \log_{10}\!\left( \frac{N}{\mathrm{DocFreq}_i} \right) \right\rfloor$$

In natural language indexing, this is referred to as the inverse document frequency (IDF) weight.
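As a minimal sketch, the weight formula above can be written in Python as follows. The function name and the document frequencies in the usage lines are hypothetical; only the formula and the sequence count of 2,584,440 come from the slides.

```python
import math

def idf_weight(total_sequences, doc_freq):
    """Integer IDF weight: 10 * log10(N / DocFreq), truncated."""
    if doc_freq <= 0 or doc_freq > total_sequences:
        raise ValueError("doc_freq must be between 1 and total_sequences")
    return int(10 * math.log10(total_sequences / doc_freq))

N = 2584440  # number of sequences in the "nt" data base (from the slides)

# Hypothetical document frequencies: a rare word and a common word.
print(idf_weight(N, 25))
print(idf_weight(N, 100000))
```

A word found in every sequence gets weight 0, and the weight grows as the word becomes rarer, which matches the description of narrow-context words being weighted higher.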
File Sizes
Initial file analysis produces about 110 intermediate files of about 440 million bytes each from the input data base (12 GB).
out.table is a large (40 billion byte) word-sequence file.
freq.bin contains the inverse document frequency weight for each word (53 million bytes).
index (76 million bytes) gives, for each word, the eight-byte offset of the word's entry in out.table.
index and freq.bin are merged into ITABLE (112 million bytes) which contains for each word its weight, offset, and a pointer to a list of aliases (not used with the “nt” data base).
Data Base
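$$W = (w_1, w_2, w_3, \ldots, w_M) \qquad \text{vector of } M \text{ weights}$$

$$F = \begin{pmatrix}
f_{1,1} & f_{1,2} & f_{1,3} & \cdots & f_{1,N} \\
f_{2,1} & f_{2,2} & f_{2,3} & \cdots & f_{2,N} \\
f_{3,1} & f_{3,2} & f_{3,3} & \cdots & f_{3,N} \\
\vdots  & \vdots  & \vdots  &        & \vdots  \\
f_{M,1} & f_{M,2} & f_{M,3} & \cdots & f_{M,N}
\end{pmatrix} \qquad \text{word-sequence matrix}$$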
Number of Words at each Weight
for i = 1 to 120
    z_i ← 0
    for j = 1 to M
        if w_j = i then z_i ← z_i + 1
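The counting loop above makes one pass over all M words for each of the 120 weights. A single pass with a counter gives the same tallies; the sketch below assumes the weights are available as a plain list of integers, and the sample values are invented.

```python
from collections import Counter

def words_per_weight(weights, max_weight=120):
    """Return {i: number of words whose weight equals i} for i = 1..max_weight."""
    counts = Counter(weights)
    return {i: counts.get(i, 0) for i in range(1, max_weight + 1)}

# Hypothetical weight vector W for five words.
z = words_per_weight([50, 50, 62, 120, 0])
print(z[50], z[62], z[120])
```

Weights outside the range 1 to 120 (such as the 0 in the sample) are ignored, as in the loop above.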
[Figure: Number of Words at Each IDF Wgt. Axis ticks run from IDF weight 26 to 107 and from 0 to 180,000 words.]
Sum of all Instances of Each Weight
for i = 1 to 120                    // for each weight
    x_i ← 0
    for j = 1 to M                  // for each word
        for k = 1 to N              // for each sequence
            if f_{j,k} = i then x_i ← x_i + 1
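The loop above, read literally, compares a word-sequence entry with the weight i. The sketch below assumes a different reading suggested by the slide title: for each weight, add up the occurrences, across all sequences, of every word that carries that weight. That interpretation, the sparse dictionary layout, and the sample data are all assumptions, not the author's implementation.

```python
def occurrences_per_weight(weights, freq, max_weight=120):
    """Total word occurrences grouped by IDF weight.

    weights : word -> integer weight
    freq    : word -> {sequence id: occurrences of the word in that sequence}
              (a sparse stand-in for the M x N word-sequence matrix)
    """
    totals = {i: 0 for i in range(1, max_weight + 1)}
    for word, per_sequence in freq.items():
        w = weights[word]
        if w in totals:
            totals[w] += sum(per_sequence.values())
    return totals

# Hypothetical data: three words spread over a few sequences.
weights = {"w1": 50, "w2": 50, "w3": 70}
freq = {"w1": {0: 2, 3: 1}, "w2": {1: 4}, "w3": {0: 1}}
x = occurrences_per_weight(weights, freq)
print(x[50], x[70])
```

Storing only the non-zero entries avoids materializing a matrix with millions of rows and columns.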
[Figure: Number of Occurrences at Each IDF Level. Axis ticks run from IDF weight 26 to 107 and from 0 to 700,000,000 occurrences.]
Sequence Retrieval
For retrieval, a query sequence is read and decomposed into 11-character words. These words are reduced to a numeric equivalent, which is used as an index into the word-sequence table. Entries in a master vector corresponding to sequences are incremented by the weight of the word if the word occurs in the sequence and the weight of the word lies within a specified range. When all words have been processed, entries in the master sequence vector are normalized according to the length of the underlying sequence relative to the length of the query. Finally, the master sequence vector is sorted and the top-scoring entries are printed, or submitted to a Smith-Waterman alignment, sorted again, and then printed. Optionally, the Smith-Waterman alignments themselves can be printed, and the selected sequences can be extracted from the nt data base and stored in a separate output file for additional processing. FASTA post-processing is an option.
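The scoring steps described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Mumps/MDH implementation: words are kept as strings rather than numeric codes, the index is an in-memory dictionary, and the length normalization is one plausible choice, since the slides do not give the exact formula. All names and the sample data are hypothetical.

```python
from collections import defaultdict

WORD_SIZE = 11

def rank_sequences(query, index, weights, lengths, low=65, high=120, top=10):
    """Rank data base sequences by the summed IDF weights of shared words.

    index   : word -> iterable of ids of sequences containing the word
    weights : word -> integer IDF weight
    lengths : sequence id -> sequence length
    """
    scores = defaultdict(float)
    for start in range(len(query) - WORD_SIZE + 1):
        word = query[start:start + WORD_SIZE]
        weight = weights.get(word)
        if weight is None or not (low <= weight <= high):
            continue  # unknown word, or weight outside the selected range
        for seq_id in index.get(word, ()):
            scores[seq_id] += weight
    # Assumed normalization: scale down sequences longer than the query.
    for seq_id in scores:
        scores[seq_id] *= min(1.0, len(query) / lengths[seq_id])
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top]

# Hypothetical toy index with two sequences and two words.
query = "ACGTACGTACGT"
index = {"ACGTACGTACG": ["s1", "s2"], "CGTACGTACGT": ["s1"]}
weights = {"ACGTACGTACG": 70, "CGTACGTACGT": 80}
lengths = {"s1": 12, "s2": 24}
print(rank_sequences(query, index, weights, lengths))
```

Narrowing the weight range skips common words entirely, so fewer index entries are visited, which is consistent with the shorter times reported for the weighted runs.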
[Figure: Unweighted Result for 500 Random Queries. Axis ticks run from 1 to 59 and from 0 to 60.]
[Figure: Result for 500 Random Queries, Weight Range 65-120. Axis ticks run from 1 to 58 and from 0 to 350.]
Overall Results for 500 Random Queries
Range                Time  Avg Rank   Mean    Found   N/F

Weights 50-60       17.13      8.04   1.00   495.00  5.00
Weights 60-70       17.23      6.46   1.00   497.00  3.00
Weights 70-80       17.23      5.69   1.00   499.00  1.00
Weights 80-90       17.38      8.25   1.00   499.00  1.00
Weights 90-100      17.50      5.06   1.00   497.00  3.00

Weights 40-65       17.28      7.31   1.00   500.00  0.00
Weights 55-70       17.25      8.04   1.00   495.00  1.00
Weights 70-85       17.19      6.47   1.00   497.00  3.00
Weights 85-100      17.26      5.69   1.00   499.00  1.00

Weights 40-60       17.26      7.31   1.00   500.00  0.00
Weights 60-80       17.27      8.04   1.00   495.00  5.00
Weights 80-100      17.34      6.47   1.00   497.00  3.00

Weights 75-84       17.37      7.31   1.00   500.00  0.00
Weighted 65-120     18.34      7.31   1.00   500.00  0.00
Weighted 0-120      19.50      7.67   1.00   499.00  1.00

Unweighted          23.60     38.27  29.00   492.00  8.00
BLAST               41.74      6.12   1.00   499.00  1.00
Index Scoring Results
Query: >gi|19911940|dbj|AB072098.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, partial cds, isolate:71
Query string has 289 letters
Searching ...

68224 >gi|19911940|dbj|AB072098.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p
31420 >gi|29467317|dbj|AB089555.1| Hepatitis C virus NS3 gene for polyprotein, partial cds, isol
30508 >gi|19911912|dbj|AB072084.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p
30296 >gi|29467668|dbj|AB100815.1| Hepatitis C virus NS3 gene for polyprotein, partial cds, isol
29800 >gi|14150634|gb|AF369255.1| Hepatitis C virus Pt.2F NS3 protease gene, partial cds
29444 >gi|19911960|dbj|AB072108.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p
29240 >gi|19911888|dbj|AB072072.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p
29196 >gi|14150646|gb|AF369261.1| Hepatitis C virus Pt.6A NS3 protease gene, partial cds
29120 >gi|19911862|dbj|AB072059.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p
28896 >gi|14150628|gb|AF369252.1| Hepatitis C virus Pt.128 NS3 protease gene, partial cds
28116 >gi|2731651|gb|U81612.1|HCU81612 Hepatitis C virus polyprotein gene, partial cds
28116 >gi|3157741|dbj|AB013621.1| Hepatitis C virus RNA for polyprotein (NS3 proteinase region),
27700 >gi|14150620|gb|AF369248.1| Hepatitis C virus Pt.1 NS3 protease gene, partial cds
.............................................................................................

Total fetch time used: 13
Total number of accessions found: 501
Total primary query word count: 279
Total alias count: 0
Total number of sequences searched: 2584440
Max Indx 1300023
Smith-Waterman Result Scoring
Query: >gi|19911940|dbj|AB072098.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, partial cds, isolate:71
Query string has 289 letters

top= >gi|19911940|dbj|AB072098.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, partial cds, isolate:71

166 TTCCACGGTGCCGGCTCAAAGACCCTAGCCGGCCCGAAGGGCCAAATCACCCAGATGTACACCAATGTAGACCAGGACCT 245
    ::::::::::::::::::::::::::::::::::::::::::::: :::::::::::::::::::::::::::::: :::
  1 TTCCACGGTGCCGGCTCAAAGACCCTAGCCGGCCCGAAGGGCCAAGTCACCCAGATGTACACCAATGTAGACCAGGTCCT 80

246 CGTCGGCTGGCCGGCGCCCCCCGGAGCGCGTTCCTTGACACCATGCACCTGCGGCAGCTCGGACCTTTATTTGGTCACGA 325
    :::::::::::::::::: ::::::::::::::::::: :::::::::::::::::::::::::::::::::::::::::
 81 CGTCGGCTGGCCGGCGCCGCCCGGAGCGCGTTCCTTGAGACCATGCACCTGCGGCAGCTCGGACCTTTATTTGGTCACGA 160

326 GACATGCTGACGTCATCCCGGTGCGCCGGCGGGGCGACAGCAGGGGGAGCTTGCTTTCTCCTAGGCCCATCTCTTACTTA 405
    ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
161 GACATGCTGACGTCATCCCGGTGCGCCGGCGGGGCGACAGCAGGGGGAGCTTGCTTTCTCCTAGGCCCATCTCTTACTTA 240

406 AAGGGCTCTTCGGGCGGTCCACTGCTTTGCCCCTCGGGGCACGCTGTGG 454
    :::::::::::::::::::::::::::::::::::::::::::::::::
241 AAGGGCTCTTCGGGCGGTCCACTGCTTTGCCCCTCGGGGCACGCTGTGG 289
score=566
S-W Scores
566 >gi|19911940|dbj|AB072098.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p
505 >gi|29467668|dbj|AB100815.1| Hepatitis C virus NS3 gene for polyprotein, partial cds, isol
504 >gi|29467317|dbj|AB089555.1| Hepatitis C virus NS3 gene for polyprotein, partial cds, isol
503 >gi|19911914|dbj|AB072085.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p
503 >gi|14150628|gb|AF369252.1| Hepatitis C virus Pt.128 NS3 protease gene, partial cds
502 >gi|29467247|dbj|AB089520.1| Hepatitis C virus NS3 gene for polyprotein, partial cds, isol
502 >gi|3157741|dbj|AB013621.1| Hepatitis C virus RNA for polyprotein (NS3 proteinase region),
501 >gi|19911862|dbj|AB072059.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p
499 >gi|29467670|dbj|AB100816.1| Hepatitis C virus NS3 gene for polyprotein, partial cds, isol
498 >gi|3157753|dbj|AB013627.1| Hepatitis C virus RNA for polyprotein (NS3 proteinase region),
498 >gi|14150634|gb|AF369255.1| Hepatitis C virus Pt.2F NS3 protease gene, partial cds
497 >gi|29467311|dbj|AB089552.1| Hepatitis C virus NS3 gene for polyprotein, partial cds, isol
497 >gi|19911934|dbj|AB072095.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p
497 >gi|19911912|dbj|AB072084.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p
496 >gi|19911900|dbj|AB072078.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p
495 >gi|14150638|gb|AF369257.1| Hepatitis C virus Pt.3O NS3 protease gene, partial cds
495 >gi|14150616|gb|AF369246.1| Hepatitis C virus Pt.1A NS3 protease gene, partial cds
495 >gi|14150646|gb|AF369261.1| Hepatitis C virus Pt.6A NS3 protease gene, partial cds
494 >gi|14150620|gb|AF369248.1| Hepatitis C virus Pt.1 NS3 protease gene, partial cds
494 >gi|19911888|dbj|AB072072.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p
493 >gi|19911960|dbj|AB072108.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p
...

Total fetch time used: 26
Total number of accessions found: 31
Total primary query word count: 279
Total alias count: 0
Total number of sequences searched: 2584440
Max Indx 1300023
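The re-ranking step scores each candidate with a Smith-Waterman local alignment. The sketch below computes only the best local alignment score, using a linear gap penalty. The match, mismatch and gap values are assumptions chosen for illustration; the slides do not state the scoring parameters behind the scores listed above.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (linear gap penalty)."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            current[j] = max(
                0,                      # start a new local alignment
                previous[j - 1] + step, # align a[i-1] with b[j-1]
                previous[j] + gap,      # gap in b
                current[j - 1] + gap,   # gap in a
            )
            best = max(best, current[j])
        previous = current
    return best

print(smith_waterman_score("TTACGTTT", "GGACGTGG"))
```

Keeping only two rows of the dynamic programming table is enough when the score alone is needed; recovering the alignment itself, as printed in the example output, would require the full table or a traceback structure.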
Larger Sequences
On larger query sequences (5,000 to 6,000 letters), the IDF method performed slightly better than BLAST. On 25 randomly generated sequences, the IDF method ranked the original sequence first 24 times and third once. BLAST, on the other hand, ranked the original sequence first 21 times, while the remaining 4 were ranked 2, 2, 3 and 4. The average time per query was 47.4 seconds for the IDF method and 122.8 seconds for BLAST.
The Next Step
Future work
Weighted Term Vectors.
Other weighting schemes such as the Modified Centroid Algorithm.
Sequence-Sequence and Term-Term Correlations.
Sequence clustering.
References
Altschul, S.F., Gish, W., Miller, W., Myers, E.W. and Lipman, D.J. (1990) Basic local alignment search tool. J. Mol. Biol. 215:403-410.
O'Kane, K.C. and Lockner, M.J. (2004) Indexing genomic sequence libraries. Information Processing and Management 41:265-274.
O'Kane, K.C. (2004) The Effect of Inverse Document Frequency Weights on Indexed Sequence Retrieval, submitted.
Pearson, W.R. (2000) Flexible sequence similarity searching with the FASTA3 program package. Methods Mol. Biol. 132:185-219.
Salton, G. (1983) Introduction to Modern Information Retrieval. McGraw-Hill, New York.
Smith, T.F. and Waterman, M.S. (1981) Identification of common molecular subsequences. J. Mol. Biol. 147:195-197.
Hierarchical Data Base
Bioinformatics
Sloan Report on Bioinformatics from June 2004.
Number of graduates
* There were only 26 new PhDs produced...
* 102 masters degrees awarded...
* Only 17 Bachelor's degrees produced...
* The data is for January 2002 until March 2003.
"... in the next few years the number of graduates is expected to increase by two or three times."
Average program enrollment:
* 103 = Bachelors
* 435 = Masters
* 296 = PhD
B.S. in Bioinformatics at UNI
Mathematics: 800:060; 800:061; 800:064; 800:152; 800:164 (17 hours)
Computer Science: 810:061; 810:062; 810:065; 810:066; 810:080; 810:114; 810:115; 810:180 (24 hours)
Biology: 840:051; 840:052; 840:130; 840:140; 840:153 (19 hours)
Chemistry: 860:070* or both 860:044 and 860:048; 860:063 (9-12 hours)
Elective: One course from the following (3 hours)
Computer Science: 810:143; 810:147; 810:153; 810:155; 810:161; 810:172; 810:181
Total 73-75 hours
Courses
800:060. Calculus I. The derivatives and integrals of elementary functions and their applications.
800:061. Calculus II. Continuation of 800:060.
800:064. Elementary Probability and Statistics for Bioinformatics. Descriptive statistics, basic probability concepts, confidence intervals, hypothesis testing, correlation and regression, elementary concepts of survival analysis.
800:152. Introduction to Probability. Axioms of probability, sample spaces having equally likely outcomes, conditional probability and independence, random variables, expectation, moment generating functions, jointly distributed random variables, weak law of large numbers, central limit theorem.
800:164. Statistical Methods in Bioinformatics. Analysis of a DNA sequence, analysis of multiple DNA and protein sequences, BLAST.
810:061. Computer Science I. Introduction to computer programming in the context of a modern object-oriented programming language. Emphasis on good programming techniques, object-oriented design, and style through extensive practice in designing, coding, and debugging programs.
810:062. Computer Science II. Intermediate programming in an object-oriented environment. Topics include object-oriented design, implementation of classes and methods, dynamic polymorphism, frameworks, patterns, software reuses, limitations, exceptions, and threads.
810:065. Computing for Bioinformatics I. Intermediate programming with emphasis on bioinformatics. Includes file handling, memory management, multi-threading, B-trees, introduction to dynamic programming including Needleman-Wunsch and Smith-Waterman algorithms for optimal alignments, exploration of BLAST, FASTA and gapped alignment, substitution matrices.
810:066. Computing for Bioinformatics II. Advanced bioinformatics computing: Perl and CGI programming; data base facilities for bioinformatics; pattern matching with regular expressions; advanced dynamic programming: optimal versus local alignment, multiple alignments; data base mining tools, Entrez, SRS, BLAST, FASTA, CLUSTAL; graphical 3-D representation of proteins; phylogenetic trees.
810:080. Discrete Structures. Topics include propositional and first-order logic; proofs and inference; mathematical induction; sets, relations, and functions; and graphs, lattices, and Boolean algebra, all in the context of computer science.
810:114. Database Systems. Storage of, and access to, physical databases; data models, query languages, transaction processing, and recovery techniques; object-oriented and distributed database systems; and database design.
810:115. Information Storage and Retrieval. Natural language processing; analysis of textual material by statistical, syntactic, and logical methods; retrieval systems models, dictionary construction, query processing, file structures, content analysis; automatic retrieval systems and question-answering systems; and evaluation of retrieval effectiveness.
810:180. Undergraduate Research in Computer Science
840:051. General Biology: Organismal Diversity. Study of organismic biology emphasizing evolutionary patterns and diversity of organisms and interdependency of structure and function in living systems.
840:052. General Biology: Cell Structure and Function. Study of cells, genetics, and DNA technology emphasizing the chemical basis of life and flow of information.
840:130. Molecular Biology of the Cell. Introduction to the molecular, biochemical, and cellular structure and function of cells, DNA structure and functions, and the translation of genetic information into functional structures of living cells. DNA replication, transcription of genes, and synthesis and processing of proteins will be emphasized.
840:140. Genetics. Analytical approach to classical, molecular, and population genetics.
840:153. Recombinant DNA Techniques. Study of techniques for manipulating and analyzing DNA, including genomic library construction, polymerase chain reaction, oligonucleotide synthesis, genomic analysis with computers, and DNA and RNA isolation.
860:070. General Chemistry I-II. Accelerated course for well-prepared students. Content similar to 860:044 and 860:048 but covered in one semester. Completion satisfies General Chemistry requirement of any chemistry major.
860:063. Applied Organic and Biochemistry. Basic concepts in organic chemistry and biochemistry, including nomenclature, functional groups, reactivity, and macromolecules.
Elective from:
810:143(g). Operating Systems. History and evolution of operating systems; process and processor management; primary and auxiliary storage management; performance evaluation, security, and distributed systems issues; and case studies of modern operating systems.
810:147. Networking. Network architectures and communication protocol standards. Topics include communication of digital data, data-link protocols, local-area networks, network-layer protocols, transport-layer protocols, applications, network security, and management.
810:153. Design and Analysis of Algorithms. Algorithm design techniques such as dynamic programming and greedy algorithms; complexity analysis of algorithms; efficient algorithms for classical problems; intractable problems and techniques for addressing them; and algorithms for parallel machines.
810:155. Translation of Programming Languages. Introduction to analysis of programming languages and construction of translators.
810:161. Artificial Intelligence. Models of intelligent behavior and problem solving; knowledge representation and search methods; learning; topics such as knowledge-based systems, language understanding, and vision; optional 1-hour lab in symbolic programming techniques: heuristic programming; symbolic representations and algorithms; and applications to search, parsing, and high-level problem-solving tasks.
810:172. Software Engineering. Study of software life cycle models and their phases--planning, requirements, specifications, design, implementation, testing, and maintenance. Emphasis on tools, documentation, and applications.
810:181. Theory of Computation. Topics include regular languages and grammars; finite state automata; context-free languages and grammars; language recognition and parsing; and Turing computability and undecidability.