Kevin C. O'Kane Department of Computer Science The University of Northern Iowa

47
The Effect of Inverse Document Frequency Weights on Retrieval of Genomic Sequences: Towards a vector space approach Kevin C. O'Kane Department of Computer Science The University of Northern Iowa Cedar Falls, Iowa 50613

description

Kevin C. O'Kane Department of Computer Science The University of Northern Iowa Cedar Falls, Iowa 50613. The Effect of Inverse Document Frequency Weights on Retrieval of Genomic Sequences: Towards a vector space approach. - PowerPoint PPT Presentation

Transcript of Kevin C. O'Kane Department of Computer Science The University of Northern Iowa

Page 1: Kevin C. O'Kane Department of Computer Science The University of Northern Iowa

The Effect of Inverse Document Frequency Weights on Retrieval of

Genomic Sequences:

Towards a vector space approach

Kevin C. O'KaneDepartment of Computer ScienceThe University of Northern Iowa

Cedar Falls, Iowa 50613

Page 2: Kevin C. O'Kane Department of Computer Science The University of Northern Iowa

The area of natural language text indexing and retrieval has been studied since the mid-50's. In text retrieval, the problem is to locate documents related to a natural language query.

To this purpose, natural language text indexing programs have employed many techniques to identify terms in a document most likely to be contentdescriptors as opposed to terms that are poor content descriptors.

By eliminating poor descriptors and pre-indexing documents by descriptors more likely to be good discriminators, the speed of selection and precision of document relevance ranking can be improved.

The “vector space model”, developed by G. Salton, views the problem as an n-dimensional hyperspace in which documents and queries.

Page 3: Kevin C. O'Kane Department of Computer Science The University of Northern Iowa

Overview

In text retrieval, the problem is to locate documents related to a natural language query.

Natural language text indexing programs identify terms in a document most likely to be content descriptors.

The goal of these experiments is to apply text indexing techniques to genomic data bases.

Page 4: Kevin C. O'Kane Department of Computer Science The University of Northern Iowa

Natural Language Indexing

Natural language text indexing and retrieval has been studied since the mid-50's. In text retrieval, the problem is to locate documents related to a natural language query.

Natural language text indexing programs employ techniques to identify terms in a document most likely to be content descriptors.

By eliminating poor descriptors and pre-indexing documents by descriptors likely to be good discriminators, the speed of selection and precision of document relevance ranking can be improved.

The “vector space model”, developed by G. Salton, views the problem as an n-dimensional hyperspace of documents and queries.

Page 5: Kevin C. O'Kane Department of Computer Science The University of Northern Iowa

Document Hyperspace

Page 6: Kevin C. O'Kane Department of Computer Science The University of Northern Iowa

Hyperspace Queries

Page 7: Kevin C. O'Kane Department of Computer Science The University of Northern Iowa

Clustering Objects by Feature

Page 8: Kevin C. O'Kane Department of Computer Science The University of Northern Iowa

Cosine Similarity Coefficient

Page 9: Kevin C. O'Kane Department of Computer Science The University of Northern Iowa

Genomic Data Bases

EMBL (http://www.embl.org) SWISS-PROT (http://www.expasy.org/sprot/sprot-top.html)PROSITE (http://www.expasy.org/prosite/)PIR (http://pir.georgetown.edu/home.shtml)NCBI/NLM GenBank (http://www.ncbi.nih.gov/)MGD: The Mouse Genome Database (http://www.informatics.jax.org/)OMIM - Online Mendelian Inheritance in Man

(http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM)

Page 10: Kevin C. O'Kane Department of Computer Science The University of Northern Iowa

“nt” Sequence Data Base

NCBI “nt” data base: ~ 12 billion bytes in length comprising 2,584,440 sequences in FASTA format (Sept 2004).

Example sequence:

> gi|2695852|emb|Y13263.1|ABY13263 Acipenser baeri mRNA for immunoglobulin heavy chain, clone CAAGAACCACAATACTGCAGTACAATGGGGATTTTAACAGCTCTCTGTATAATAATGACAGCTCTATCAAGTGTCCGGTCTGATGTAGTGTTGACTGAGTCCGGACCAGCAGTTATAAAGCCTGGAGAGTCCCATAAACTGTCCTGTAAAGCCTCTGGATTCACATTCAGCAGCAACAACATGGGCTGGGTTCGACAAGCTCCTGGAAAGGGTCTGGAATGGGTGTCTACTATAAGCTATAGTGTAAATGCATACTATGCCCAGTCTGTCCAGGGAAGATTCACCATCTCCAGAGACGATTCCAACAGCATGCTGTATTTACAAATGAACAGCCTGAAGACTGAAGACTCTGCCGTGTATTACTGTGCTCGAGAGTCTAACTTCAACCGCTTTGACTACTGGGGATCCGGGACTATGGTGACCGTAACAAATGCTACGCCATCACCACCGACAGTGTTTCCGCTTATGCAGGCATGTTGTTCGGTCGATGTCACGGGTCCTAGCGCTACGGGCTGCTTAGCAACCGAATTC

Page 11: Kevin C. O'Kane Department of Computer Science The University of Northern Iowa

GenBank● LOCUS AAB2MCG1 289

bp DNA linear PRI 23-AUG-2002● DEFINITION Aotus azarai beta-2-

microglobulin precursor exon 1.● ACCESSION AF032092● VERSION AF032092.1 GI:3265027● KEYWORDS .● SEGMENT 1 of 2● SOURCE Aotus azarai (Azara's night

monkey)● ORGANISM Aotus azarai● Eukaryota; Metazoa; Chordata;

Craniata; Vertebrata; Euteleostomi;● Mammalia; Eutheria; Primates;

Platyrrhini; Cebidae; Aotinae; Aotus.● REFERENCE 1 (bases 1 to 289)● AUTHORS Canavez,F.C., Ladasky,J.J.,

Muniz,J.A., Seuanez,H.N., Parham,P. and● Cavanez,C.● TITLE beta2-Microglobulin in

neotropical primates (Platyrrhini)● JOURNAL Immunogenetics 48 (2), 133-140

(1998)● MEDLINE 98298008● PUBMED 9634477● REFERENCE 2 (bases 1 to 289)● AUTHORS Canavez,F.C., Ladasky,J.J.,

Seuanez,H.N. and Parham,P.● TITLE Direct Submission

JOURNAL Submitted (31-OCT-1997) Structural Biology, Stanford University,

● Fairchild Building Campus West Dr. Room D-100, Stanford, CA

● 94305-5126, USA● FEATURES Location/Qualifiers● source 1..289● /organism="Aotus

azarai"● /mol_type="genomic

DNA"●

/db_xref="taxon:30591"● sig_peptide 134..193● exon <134..200● /number=1● intron 201..>289● /number=1● ORIGIN● 1 gtccccgcgg gccttgtcct gattggctgt

ccctgcgggc cttgtcctga ttggctgtgc● 61 ccgactccgt ataacataaa tagaggcgtc

gagtcgcgcg ggcattactg cagcggacta● 121 cacttgggtc gagatggctc gcttcgtggt

ggtggccctg ctcgtgctac tctctctgtc● 181 tggcctggag gctatccagc gtaagtctct

cctcccgtcc ggcgctggtc cttcccctcc●

Page 12: Kevin C. O'Kane Department of Computer Science The University of Northern Iowa

Sequence Matching

Currrent access to sequence databases mainly by heuristic-assisted pattern matching on flat or nearly flat files using programs such as BLAST and FASTA.

Underlying data bases growing rapidly with consequent deterioration of search times even on large, multiprocessor systems as current software tools reach design limits.

BLAST systems index data base sequences according to short code letter words (usually, 3 letters for amino acids and 11 for nucleotide data bases); scoring matrices.

Queries also decomposed to similar short code words. The data base is scanned & sequences with words in common with the query are processed to extend the initial code word match.

Page 13: Kevin C. O'Kane Department of Computer Science The University of Northern Iowa

Example BLAST Output● Score E●Sequences producing significant alignments: (bits) Value

●emb|BX015832.1|CNS08KDO Single read from an extremity of a ... 918 0.0 ●emb|BX032891.1|CNS08XJJ Single read from an extremity of a ... 902 0.0 ●emb|BX065445.1|CNS09MNT Single read from an extremity of a ... 894 0.0 ●emb|BX052703.1|CNS09CTV Single read from an extremity of a ... 894 0.0 ●emb|BX030708.1|CNS08VUW Single read from an extremity of a ... 894 0.0 ●emb|BX030663.1|CNS08VTN Single read from an extremity of a ... 894 0.0 ●.............................................................................

●>emb|BX015832.1|CNS08KDO Single read from an extremity of a full-length cDNA clone made from Anopheles gambiae total adult females. 3-PRIME end of clone FK0AAA23DA12 ● Length = 866

● Score = 918 bits (463), Expect = 0.0● Identities = 535/559 (95%)● Strand = Plus / Plus

● ●Query: 1 tctttactattattggggaatttcgaggaacatttgttccccttacaatgcatttctata 60● |||||||||||||||| |||||||| |||||||||||||||||||||||||||||||||●Sbjct: 1 tctttactattattggtcaatttcgatgaacatttgttccccttacaatgcatttctata 60

● ●Query: 61 acctacacctggagtaggtggttccggttcagccacttcagtgggaggaacttccgtttc 120● | |||||||||||||||||||||||||||||||||||||||||||||| |||||||||||●Sbjct: 61 aactacacctggagtaggtggttccggttcagccacttcagtgggaggcacttccgtttc 120

Page 14: Kevin C. O'Kane Department of Computer Science The University of Northern Iowa

Developing A Vector Space Approach to Sequence Indexing

This work attempts to explore natural language indexing techniques applied genomic data bases through:

Weight based indexing of k-tuples derived from NCBI “nt” sequence data base.

Text terms used in genomic sequence data banks and literature;

Both applications are implemented for Linux and written in Mumps and MDH, a Mumps related C++ toolkit capable of indexing data sets of up to 256 terabytes using a B-tree based multidimensional data model, that includes many retrieval and sequence matching functions.

Page 15: Kevin C. O'Kane Department of Computer Science The University of Northern Iowa

Inverse Document Frequency Wgt.

The IDF weight yields higher values for words whose distribution is more concentrated and lower values for words whose use is more widespread.

Thus, words of broad context are weighted lower than words of narrow context.

Words of low weight are hypothesized to be poor indexing terms while words with high weights are hypothesized to be good indexing terms.

The bulk of the words, as is the case in natural language text, reside in the middle range.

Page 16: Kevin C. O'Kane Department of Computer Science The University of Northern Iowa

Natural Language ExampleWord Freq(i,j) TotFreq DocFreq Wgt1 Wgt2 Wgt3 MCA

[1] Death of a cult. (Apple Computer needs to alter its strategy) (column)

apple 4 261 112 1.716 9.757 17 -1.1625computer 4 706 358 2.028 5.109 10 -19.4405mac 2 146 71 0.973 6.290 6 -0.0256macintosh 4 210 107 2.038 9.940 20 -0.5855strategy 2 79 67 1.696 6.406 11 -0.0592

[3] WordPerfect. (WordPerfect for the Macintosh 2.0) (evaluation) Taub, Eric.

edit 2 111 77 1.387 6.128 8 -0.0961frame 2 9 7 1.556 10.924 17 0.0131import 2 29 19 1.310 8.927 12 0.0998macintosh 3 210 107 1.529 7.705 12 -0.5855macro 3 38 24 1.895 12.189 23 0.1075outstand 1 10 9 0.900 5.711 5 0.0168user 4 861 435 2.021 4.330 9 -26.8094wordperfect 8 24 8 2.667 39.627 106 0.1747

[4] Radius Pivot for Built-In Video an Radius Color Pivot. (Hardware Review) (new Mac monitors)(includes related article on design of

built-in 3 35 29 2.486 11.621 29 0.0678color 3 81 47 1.741 10.173 18 0.0809mac 2 146 71 0.973 6.290 6 -0.0256monitor 6 88 52 3.545 18.739 66 0.0946resolution 2 50 32 1.280 7.884 10 0.0288screen 2 92 62 1.348 6.561 9 0.0199video 4 106 61 2.302 12.188 28 0.0187

Page 17: Kevin C. O'Kane Department of Computer Science The University of Northern Iowa

Indexing Experiment

Sequences from the NCBI "nt" (non-redundant nucleotide) data base were used.

The “nt” data base is approximately ~12 billion bytes in length comprising 2,584,440 sequences in FASTA format (Sept 2004).

A “word” size of 11 was used throughout. A total of 4,194,299 words were identified, slightly less than the theoretical maximum of 4,194,304.

Page 18: Kevin C. O'Kane Department of Computer Science The University of Northern Iowa

Calculating the IDF Weight

The overall frequencies of occurrence of all possible 11 character words from each sequence were determined along with the number of sequences in which each unique word was found.

A weight Wgti for each word i was calculated by taking the Log

10,

multiplied by 10 and truncated to the nearest integer, of the total number of sequences (N) divided by the number of sequences in which the word occurred (DocFreq

i).

Wgti= (int) 10 * Log

10 ( N / DocFreq

i )

In natural language indexing, this is referred to as the inverse document

frequency (IDF) weight.

Page 19: Kevin C. O'Kane Department of Computer Science The University of Northern Iowa

File Sizes

Initial file analysis produces about 110 intermediate files of about 440 million bytes each from the input data base (12 GB).

out.table is a large (40 billion byte) word-sequence file.

freq.bin contains the inverse document frequency weight for each word (53 million bytes);

index (76 million bytes) gives for each word the eight byte offset of the word's entry in out.table.

index and freq.bin are merged into ITABLE (112 million bytes) which contains for each word its weight, offset, and a pointer to a list of aliases (not used with the “nt” data base).

Page 20: Kevin C. O'Kane Department of Computer Science The University of Northern Iowa

Data Base

W = ( w1, w2, w3, ... wM) vector of M weights

F = ( f1,1 f1,2 f1,3 ... f1,N ) ( f2,1 f2,2 f2,3 ... f2,N ) ( f3,1 f3,2 f3,3 ... f3,N )

... word-sequence matrix

( fM,1 fM,2 fM,3 ... fM,N )

Page 21: Kevin C. O'Kane Department of Computer Science The University of Northern Iowa

Number of Words at each Weight

for i = 1 to 120z

i ← 0

for j = 1 to Mif w

j = i then z

i ← z

i + 1

Page 22: Kevin C. O'Kane Department of Computer Science The University of Northern Iowa

Number of Words at Each IDF Wgt.

26 29 32 35 38 41 44 47 50 53 56 59 62 65 68 71 74 77 80 83 86 89 92 95 98 101

104

107

0

20000

40000

60000

80000

100000

120000

140000

160000

180000

Page 23: Kevin C. O'Kane Department of Computer Science The University of Northern Iowa

Sum of all Instances of Each Weight

for i = 1 to 120 // for each weightx

i ← 0

for j = 1 to M // for each wordfor k = 1 to N // for each sequence

if fj, k

= i then xi ← x

i + 1

Page 24: Kevin C. O'Kane Department of Computer Science The University of Northern Iowa

Number of Occurrences at Each IDF Level

26 29 32 35 38 41 44 47 50 53 56 59 62 65 68 71 74 77 80 83 86 89 92 95 98 101

104

107

0

100000000

200000000

300000000

400000000

500000000

600000000

700000000

Page 25: Kevin C. O'Kane Department of Computer Science The University of Northern Iowa

Sequence Retrieval

For retrieval, a query sequence is read and decomposed into 11 character words. These words are reduced to a numeric equivalent which is used as an index into the word-sequence table. Entries in a master vector corresponding to sequences are incremented by the weight of the word if the word occurs in the sequence if the weight of the word lies within a specified range. When all words have been processed, entries in the master sequence vector are normalized according to the length of the underlying sequence in respect to the length of the query. Finally, the master sequence vector is sorted and the top scoring entries printed or submitted to a Smith-Waterman alignment, sorted and then printed. Optionally, the Smith-Waterman alignments themselves can be printed and the selected sequences can be extracted from the nt data base and stored in a separate output file for additional processing. FASTA post-processing is an option.

Page 26: Kevin C. O'Kane Department of Computer Science The University of Northern Iowa

Unweighted Result for 500 Random Queries

1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49 51 53 55 57 590

10

20

30

40

50

60

Page 27: Kevin C. O'Kane Department of Computer Science The University of Northern Iowa

Result for 500 Random Queries Weight Range 65-120

1 4 7 10 13 16 19 22 25 28 31 34 37 40 43 46 49 52 55 580

50

100

150

200

250

300

350

Page 28: Kevin C. O'Kane Department of Computer Science The University of Northern Iowa

Overall Results for 500 Random Queries

Range Time Avg Rank Mean Found N/F

Weights 50-60 17.13 8.04 1.00 495.00 5.00Weights 60-70 17.23 6.46 1.00 497.00 3.00Weights 70-80 17.23 5.69 1.00 499.00 1.00Weights 80-90 17.38 8.25 1.00 499.00 1.00Weights 90-100 17.50 5.06 1.00 497.00 3.00

Weights 40-65 17.28 7.31 1.00 500.00 0.00Weights 55-70 17.25 8.04 1.00 495.00 1.00Weights 70-85 17.19 6.47 1.00 497.00 3.00Weights 85-100 17.26 5.69 1.00 499.00 1.00

Weights 40-60 17.26 7.31 1.00 500.00 0.00Weights 60-80 17.27 8.04 1.00 495.00 5.00Weights 80-100 17.34 6.47 1.00 497.00 3.00

Weights 75-84 17.37 7.31 1.00 500.00 0.00Weighted 65-120 18.34 7.31 1.00 500.00 0.00Weighted 0-120 19.50 7.67 1.00 499.00 1.00

Unweighted 23.60 38.27 29.00 492.00 8.00BLAST 41.74 6.12 1.00 499.00 1.00

Page 29: Kevin C. O'Kane Department of Computer Science The University of Northern Iowa

Index Scoring Results

Query: >gi|19911940|dbj|AB072098.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, partial cds, isolate:71Query string has 289 lettersSearching ...

68224 >gi|19911940|dbj|AB072098.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p31420 >gi|29467317|dbj|AB089555.1| Hepatitis C virus NS3 gene for polyprotein, partial cds, isol30508 >gi|19911912|dbj|AB072084.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p30296 >gi|29467668|dbj|AB100815.1| Hepatitis C virus NS3 gene for polyprotein, partial cds, isol29800 >gi|14150634|gb|AF369255.1| Hepatitis C virus Pt.2F NS3 protease gene, partial cds29444 >gi|19911960|dbj|AB072108.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p29240 >gi|19911888|dbj|AB072072.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p29196 >gi|14150646|gb|AF369261.1| Hepatitis C virus Pt.6A NS3 protease gene, partial cds29120 >gi|19911862|dbj|AB072059.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p28896 >gi|14150628|gb|AF369252.1| Hepatitis C virus Pt.128 NS3 protease gene, partial cds28116 >gi|2731651|gb|U81612.1|HCU81612 Hepatitis C virus polyprotein gene, partial cds28116 >gi|3157741|dbj|AB013621.1| Hepatitis C virus RNA for polyprotein (NS3 proteinase region),27700 >gi|14150620|gb|AF369248.1| Hepatitis C virus Pt.1 NS3 protease gene, partial cds.............................................................................................

Total fetch time used: 13Total number of accessions found: 501Total primary query word count: 279Total alias count: 0Total number of sequences searched: 2584440Max Indx 1300023

Page 30: Kevin C. O'Kane Department of Computer Science The University of Northern Iowa

Smith-Waterman Result Scoring

Query: >gi|19911940|dbj|AB072098.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, partial cds, isolate:71Query string has 289 letters

top= >gi|19911940|dbj|AB072098.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, partial cds, isolate:71

166 TTCCACGGTGCCGGCTCAAAGACCCTAGCCGGCCCGAAGGGCCAAATCACCCAGATGTACACCAATGTAGACCAGGACCT 245 ::::::::::::::::::::::::::::::::::::::::::::: :::::::::::::::::::::::::::::: ::: 1 TTCCACGGTGCCGGCTCAAAGACCCTAGCCGGCCCGAAGGGCCAAGTCACCCAGATGTACACCAATGTAGACCAGGTCCT 80

246 CGTCGGCTGGCCGGCGCCCCCCGGAGCGCGTTCCTTGACACCATGCACCTGCGGCAGCTCGGACCTTTATTTGGTCACGA 325 :::::::::::::::::: ::::::::::::::::::: ::::::::::::::::::::::::::::::::::::::::: 81 CGTCGGCTGGCCGGCGCCGCCCGGAGCGCGTTCCTTGAGACCATGCACCTGCGGCAGCTCGGACCTTTATTTGGTCACGA 160

326 GACATGCTGACGTCATCCCGGTGCGCCGGCGGGGCGACAGCAGGGGGAGCTTGCTTTCTCCTAGGCCCATCTCTTACTTA 405 :::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::: 161 GACATGCTGACGTCATCCCGGTGCGCCGGCGGGGCGACAGCAGGGGGAGCTTGCTTTCTCCTAGGCCCATCTCTTACTTA 240

406 AAGGGCTCTTCGGGCGGTCCACTGCTTTGCCCCTCGGGGCACGCTGTGG 454 ::::::::::::::::::::::::::::::::::::::::::::::::: 241 AAGGGCTCTTCGGGCGGTCCACTGCTTTGCCCCTCGGGGCACGCTGTGG 289

score=566

Page 31: Kevin C. O'Kane Department of Computer Science The University of Northern Iowa

S-W Scores

566 >gi|19911940|dbj|AB072098.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p505 >gi|29467668|dbj|AB100815.1| Hepatitis C virus NS3 gene for polyprotein, partial cds, isol504 >gi|29467317|dbj|AB089555.1| Hepatitis C virus NS3 gene for polyprotein, partial cds, isol503 >gi|19911914|dbj|AB072085.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p503 >gi|14150628|gb|AF369252.1| Hepatitis C virus Pt.128 NS3 protease gene, partial cds502 >gi|29467247|dbj|AB089520.1| Hepatitis C virus NS3 gene for polyprotein, partial cds, isol502 >gi|3157741|dbj|AB013621.1| Hepatitis C virus RNA for polyprotein (NS3 proteinase region),501 >gi|19911862|dbj|AB072059.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p499 >gi|29467670|dbj|AB100816.1| Hepatitis C virus NS3 gene for polyprotein, partial cds, isol498 >gi|3157753|dbj|AB013627.1| Hepatitis C virus RNA for polyprotein (NS3 proteinase region),498 >gi|14150634|gb|AF369255.1| Hepatitis C virus Pt.2F NS3 protease gene, partial cds497 >gi|29467311|dbj|AB089552.1| Hepatitis C virus NS3 gene for polyprotein, partial cds, isol497 >gi|19911934|dbj|AB072095.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p497 >gi|19911912|dbj|AB072084.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p496 >gi|19911900|dbj|AB072078.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p495 >gi|14150638|gb|AF369257.1| Hepatitis C virus Pt.3O NS3 protease gene, partial cds495 >gi|14150616|gb|AF369246.1| Hepatitis C virus Pt.1A NS3 protease gene, partial cds495 >gi|14150646|gb|AF369261.1| Hepatitis C virus Pt.6A NS3 protease gene, partial cds494 >gi|14150620|gb|AF369248.1| Hepatitis C virus Pt.1 NS3 protease gene, partial cds494 >gi|19911888|dbj|AB072072.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p493 >gi|19911960|dbj|AB072108.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p...Total fetch time used: 26Total number of accessions found: 31Total primary query word count: 279Total alias count: 0Total number of sequences searched: 2584440Max Indx 1300023

Page 32: Kevin C. O'Kane Department of Computer Science The University of Northern Iowa

Larger Sequences

On larger query sequences (5,000 to 6,000 letters), the IDF method performed slightly better than BLAST. On 25 sequences randomly generated, the IDF method correctly ranked the original sequence first 24 times and once at rank 3. BLAST, on the other hand, ranked the original sequence first 21 times while the remaining 4 were ranked 2, 2, 3 and 4. Average time per query for the IDF method was 47.4 seconds and the average time for BLAST was 122.8 seconds.

Page 33: Kevin C. O'Kane Department of Computer Science The University of Northern Iowa

The Next Step

Future work

Weighted Term Vectors.

Other weighting schemes such as the Modified Centroid Algorithm.

Sequence-Sequence and Term-Term Correlations.

Sequence clustering.

Page 34: Kevin C. O'Kane Department of Computer Science The University of Northern Iowa

ReferencesAltschul SF, Gish W, Miller W, Myers EW, Lipman DJ. (1990) Basic local alignment search tool. J. Mol. Biol. 215:403-10.

O'Kane, K.C.; and Lockner, M. J. (2004) Indexing genomic sequence libraries, Information Processing and Management, 41:265-274.

O'Kane, K.C. (2004) The Effect of Inverse Document Frequency Weights on Indexed Sequence Retrieval, submitted.

Pearson, W. R. (2000) Flexible sequence similarity searching with the FASTA3 program package. Methods Mol. Biol. 132:185-219.

Salton, G. (1983), Introduction to Modern Information Retrieval, McGraw-Hill (New York 1983).

Smith, T.F. & Waterman, M.S. (1981) Identification of common molecular subsequences. J. Mol. Biol. 147:195-197

Page 35: Kevin C. O'Kane Department of Computer Science The University of Northern Iowa
Page 36: Kevin C. O'Kane Department of Computer Science The University of Northern Iowa

Hierarchical Data Base

Page 37: Kevin C. O'Kane Department of Computer Science The University of Northern Iowa

Bioinformatics

Sloan Report on Bioinformatics from June 2004.

Number of graduates * There were only 26 new PhD's produced... * 102 masters degrees awarded... * Only 17 Bachelor's degrees produced... * The data is for January 2002 until March 2003.

"... in the next few years the number of graduates is expected to increase by two or three times."

Average program enrollment: * 103 = Bachelors * 435 = Masters * 296 = Phd

Page 38: Kevin C. O'Kane Department of Computer Science The University of Northern Iowa

B.S. in Bioinformtics at UNI

Mathematics: 800:060; 800:061; 800:064; 800:152; 800:164 (17 hours)

Computer Science: 810:061; 810:062; 810:065; 810:066; 810:080; 810:114; 810:115; 810:180 (24 hours) Biology: 840:051; 840:052; 840:130; 840:140; 840:153 (19 hours) Chemistry: 860:070* or both 860:044 and 860:048; 860:063 (9-12 hours)

Elective: One course from the following (3 hours) Computer Science: 810:143; 810:147; 810:153; 810:155; 810:161; 810:172; 810:181

Total 73-75 hours

Page 39: Kevin C. O'Kane Department of Computer Science The University of Northern Iowa

Courses

800:060. Calculus I . The derivatives and integrals of elementary functions and their applications.

800:061. Calculus II. Continuation of 800:060

800:064. Elementary Probability and Statistics for Bioinformatics. Descriptive statistics, basic probability concepts, confidence intervals, hypothesis testing, correlation and regression, elementary concepts of survival analysis

800:152. Introduction to Probability. Axioms of probability, sample spaces having equally likely outcomes, conditional probability and independence, random variables, expectation, moment generating functions, jointly distributed random variables, weak law of large numbers, central limit theorem

Page 40: Kevin C. O'Kane Department of Computer Science The University of Northern Iowa

Courses

800:164. Statistical Methods in Bioinformatics. Analysis of a DNA sequence, analysis of multiple DNA and protein sequences, BLAST.

810:061. Computer Science I. Introduction to computer programming in the context of a modern object-oriented programming language. Emphasis on good programming techniques, object-oriented design, and style through extensive practice in designing, coding, and debugging programs.

810:062. Computer Science II. Intermediate programming in an object-oriented environment. Topics include object-oriented design, implementation of classes and methods, dynamic polymorphism, frameworks, patterns, software reuses, limitations, exceptions, and threads.

Page 41: Kevin C. O'Kane Department of Computer Science The University of Northern Iowa

Courses

810:065. Computing for Bioinformatics I. Intermediate programming with emphasis on bioinformatics. Includes file handling, memory management, multi-threading, B-trees, introduction to dynamic programming including Wunsch-Neddleman and Smith-Waterman algorithms for optimal alignments, exploration of BLAST, FASTA and gapped alignment, substitution matrices.

810:066. Computing for Bioinformatics II. Advanced bioinformatics computing: Perl and CGI programming; data base facilities for bioinformatics; pattern matching with regular expressions; advanced dynamic programming: optimal versus local alignment, multiple alignments; data base mining tools, Entrez, SRS, BLAST, FASTA, CLUSTAL; graphical 3-D representation of proteins; phylogenic trees.

Page 42: Kevin C. O'Kane Department of Computer Science The University of Northern Iowa

Courses810:080. Discrete Structures. Topics include propositional and first-

order logic; proofs and inference; mathematical induction; sets, relations, and functions; and graphs, lattices, and Boolean algebra, all in the context of computer science.

810:114. Database Systems. Storage of, and access to, physical databases; data models, query languages, transaction processing, and recovery techniques; object-oriented and distributed database systems; and database design.

810:115. Information Storage and Retrieval. Natural language processing; analysis of textual material by statistical, syntactic, and logical methods; retrieval systems models, dictionary construction, query processing, file structures, content analysis; automatic retrieval systems and question-answering systems; and evaluation of retrieval effectiveness.

Page 43: Kevin C. O'Kane Department of Computer Science The University of Northern Iowa

Courses

810:180. Undergraduate Research in Computer Science

840:051. General Biology: Organismal Diversity. Study of organismic biology emphasizing evolutionary patterns and diversity of organisms and interdependency of structure and function in living systems.

840:052. General Biology: Cell Structure and Function. Study of cells, genetics, and DNA technology emphasizing the chemical basis of life and flow of information.

840:130. Molecular Biology of the Cell. Introduction to the molecular, biochemical, and cellular structure and function of cells, DNA structure and functions, and the translation of genetic information into functional structures of living cells. DNA replication, transcription of genes, and synthesis and processing of proteins will be emphasized.

Page 44: Kevin C. O'Kane Department of Computer Science The University of Northern Iowa

Courses

840:140. Genetics. Analytical approach to classical, molecular, and population genetics

840:153. Recombinant DNA Techniques. Study of techniques for manipulating and analyzing DNA, including genomic library construction, polymerase chain reaction, oligonucleotide synthesis, genomic analysis with computers, and DNA and RNA isolation.

860:070. General Chemistry I-II. Accelerated course for well-prepared students. Content similar to 860:044 and 860:048 but covered in one semester. Completion satisfies General Chemistry requirement of any chemistry major.

Page 45: Kevin C. O'Kane Department of Computer Science The University of Northern Iowa

Courses

860:063. Applied Organic and Biochemistry. Basic concepts in organic chemistry and biochemistry, including nomenclature, functional groups, reactivity, and macromolecules.

Elective from:

810:143(g). Operating Systems. History and evolution of operating systems; process and processor management; primary and auxiliary storage management; performance evaluation, security, and distributed systems issues; and case studies of modern operating systems.

Page 46: Kevin C. O'Kane Department of Computer Science The University of Northern Iowa

Courses

810:147. Networking. Network architectures and communication protocol standards. Topics include communication of digital data, data-link protocols, local-area networks, network-layer protocols, transport-layer protocols, applications, network security, and management.

810:153. Design and Analysis of Algorithms. Algorithm design techniques such as dynamic programming and greedy algorithms; complexity analysis of algorithms; efficient algorithms for classical problems; intractable problems and techniques for addressing them; and algorithms for parallel machines.

810:155. Translation of Programming Languages. Introduction to analysis of programming languages and construction of translators.

Page 47: Kevin C. O'Kane Department of Computer Science The University of Northern Iowa

Courses810:161. Artificial Intelligence. Models of intelligent behavior and

problem solving; knowledge representation and search methods; learning; topics such as knowledge-based systems, language understanding, and vision; optional 1-hour lab in symbolic programming techniques: heuristic programming; symbolic representations and algorithms; and applications to search, parsing, and high-level problem-solving tasks.

810:172. Software Engineering. Study of software life cycle models and their phases--planning, requirements, specifications, design, implementation, testing, and maintenance. Emphasis on tools, documentation, and applications.

810:181. Theory of Computation. Topics include regular languages and grammars; finite state automata; context-free languages and grammars; language recognition and parsing; and turing computability and undecidability.