Bioinfo/Stat 545 Biostat646 Data Analysis in Molecular Biology Lab 1: Bioinformatics Online Resources Dongxiao Zhu

Bioinfo/Stat 545 Biostat646 Data Analysis in Molecular Biology

Lab 1: Bioinformatics Online Resources

Dongxiao Zhu

Overview

Main types of biological data
  Sequence data
  Interaction data
  Microarray and gene expression data
  Others: macromolecule structure data, human genes and disease data
Information retrieval strategies

Part I. Online Biological Data Resources

2004 Nucleic Acids Research database issue
  http://www3.oup.co.uk/nar/database/cap/ (database list)
  548 databases listed in total, 162 more than the previous year
Main types of biomedical data
  Sequence data (DNA and protein sequence)
    Gene sequencing: “whole genome shotgun” and the Lander & Waterman assembly algorithm
    Protein sequencing: de novo sequencing from tandem mass spectra
    Gene prediction, sequence alignment and BLAST
    Gene annotation and Gene Ontology
    Protein/RNA secondary/tertiary structure prediction
  Interaction data: biological pathways and networks
  Microarray and gene expression data
  Others: structure data, human genes and disease

Gene/Protein sequencing: data acquisition and data accuracy

Whole genome shotgun [1]
  Double-end sequencing
    Short reads off both ends of large inserts
    Additional information for assembly
  Clone coverage vs. sequence coverage
  Scaffolds
    Ordered and oriented contigs
    Sequence gaps
De novo protein sequencing from tandem mass spectra [2]
Accuracy issues:
  Large-scale repeats
  Missing and contaminating data
  Plasmids and minichromosomes
  Signature of tandem repeats
  Polymorphism
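
Sequence coverage and the expected number of sequence gaps can be made concrete with the idealized Lander-Waterman model. The sketch below is only an illustration: it assumes reads land uniformly at random and ignores the minimum overlap needed to join two reads. The function names and the example numbers are hypothetical and are not taken from reference [1].

```python
import math

def coverage(num_reads, read_len, genome_len):
    """Average coverage c = N * L / G."""
    return num_reads * read_len / genome_len

def fraction_uncovered(c):
    """Expected fraction of bases hit by no read: e^(-c)."""
    return math.exp(-c)

def expected_contigs(num_reads, c):
    """Expected number of contigs when no minimum overlap is required: N * e^(-c)."""
    return num_reads * math.exp(-c)

# Illustrative numbers only: a 1 Mb genome sequenced with 500 bp reads.
GENOME_LEN, READ_LEN = 1_000_000, 500
for n in (2_000, 10_000, 20_000):
    c = coverage(n, READ_LEN, GENOME_LEN)
    print(f"N={n:6d}  c={c:4.1f}  "
          f"uncovered={fraction_uncovered(c):.2e}  "
          f"contigs~{expected_contigs(n, c):.1f}")
```

Clone coverage can be computed with the same `coverage` function by passing the number of clones and the insert length instead of the number of reads and the read length.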

Gene Prediction, Annotation and Gene Ontology

GENSCAN web service [3]
  http://genes.mit.edu/GENSCAN.html
  Sensitive in recognizing at least one exon
Biochemical functional annotation (biochemical view)
  Cloning, expression and functional studies
  Database homolog/ortholog search
    Sequence alignment (similar sequence -> similar function)
    Structure alignment (similar structure -> similar function)
Protein sub-cellular location prediction using primary sequence alone (cellular view)
  Codon usage bias in differently localized proteins
  Signal peptide
Gene Ontology: consistent descriptions of gene products in different databases

Sequence Alignment/BLAST and Literature Search: Bioinformatics approaches to gene annotation

Why BLAST?
  The number of novel sequences is increasing explosively
    Even in E. coli, arguably the best-characterized organism, about half of the ~4200 proteins have not been studied experimentally. Moreover, every newly sequenced genome encodes hundreds to thousands of novel proteins.
  There is a need to infer the functional roles of these novel proteins
    Compare novel sequences with previously characterized genes to annotate function
BLAST algorithm [4]
  http://www.bioinformatics.med.umich.edu/Courses/526/lecturenotes.html
BLAST program selection guide
  http://www.ncbi.nlm.nih.gov/BLAST/producttable.shtml
BLAST tutorial
  http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.html
Literature search (Part II)

Gene ontology (GO)[5]

Why GO?
  Use of GO terms by several collaborating databases facilitates uniform queries across them
  Hierarchically structured, so the vocabulary can be queried at different levels. For example, you can use GO to find all the gene products in the mouse genome that are involved in signal transduction, or you can zoom in on all the receptor tyrosine kinases
  Allows annotators to assign properties to gene products at different levels, depending on how much is known about a gene product

http://www.geneontology.org/index.shtml#downloads
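
Querying a hierarchical vocabulary at different levels can be sketched with a toy example. Everything below is hypothetical: the term names, the parent links and the gene names are made up for illustration and are not copied from the GO files. The point is only that a query on a broad term also returns products annotated to its more specific descendants.

```python
# Toy "is a kind of" links, child -> parents (illustrative, not real GO data).
PARENTS = {
    "receptor tyrosine kinase signaling": ["signal transduction"],
    "G-protein coupled receptor signaling": ["signal transduction"],
    "signal transduction": ["biological process"],
}

# Direct annotations: gene product -> most specific term known for it.
ANNOTATIONS = {
    "geneA": "receptor tyrosine kinase signaling",
    "geneB": "G-protein coupled receptor signaling",
    "geneC": "signal transduction",  # little is known, so a broad term is used
}

def ancestors(term):
    """Return the term itself plus every term reachable through parent links."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS.get(current, []))
    return seen

def products_for(term):
    """Gene products annotated to `term` or to any of its descendants."""
    return sorted(g for g, t in ANNOTATIONS.items() if term in ancestors(t))

print(products_for("signal transduction"))                 # broad query
print(products_for("receptor tyrosine kinase signaling"))  # zoomed-in query
```

Because a term may list several parents, the same traversal also works when the vocabulary is a graph rather than a strict tree.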

What is GO [5]?

GO is designed to be a structured, precisely defined, common, controlled vocabulary for describing the roles of genes and gene products in any organism. GO is used to annotate genes and gene products.

Three categories of GO
  Biological Process: a biological objective to which the gene or gene product contributes. A process is accomplished via one or more ordered assemblies of molecular functions. E.g. “cell growth and maintenance”, “signal transduction”, “cAMP biosynthesis”.
  Molecular Function: the biochemical activity of a gene product. E.g. “enzyme”, “ligand”, “Toll receptor ligand”.
  Cellular Component: the place in the cell where a gene product is active. E.g. “ribosome” or “proteasome”, “nuclear membrane”.

An interesting analogy for GO

Statistician’s view
  A multivariate definition
DB developer’s view
  An entity/attribute definition in a DB schema
Biologist’s view
  A nomenclature accepted by biochemists/molecular biologists, cell biologists, geneticists, neuroscientists and developmental biologists

What is GO NOT?

GO is not a database of gene sequences, nor a catalog of gene products. Rather, GO describes how gene products behave in a cellular context.

GO is not a way to unify biological databases (i.e. GO is not a 'federated solution'). Sharing vocabulary is a step towards unification, but is not, in itself, sufficient. Reasons for this include the following.
  a. Knowledge changes and updates lag behind.
  b. Individual curators evaluate data differently. While we can agree to use the word 'kinase', we must also agree to support this by stating how and why we use 'kinase', and consistently apply it. Only in this way can we hope to compare gene products and determine whether they are related.
  c. GO does not attempt to describe every aspect of biology. For example, domain structure, 3D structure, evolution and expression are not described by GO.

GO is not a dictated standard, mandating nomenclature across databases. Groups participate because of self-interest, and cooperate to arrive at a consensus.

Protein/RNA secondary/tertiary structure prediction

Protein secondary/tertiary structure prediction
  Server list: http://www.embl-heidelberg.de/predictprotein/doc/explain_meta.html#list
  Prediction methodologies: sliding-window based and machine-learning based
  Easier and feasible at this moment: prediction of 2D topology for some functionally important proteins with simple patterns, e.g. transmembrane proteins [7]
RNA secondary/tertiary structure prediction
  Algorithms: Biological Sequence Analysis, R. Durbin et al., Cambridge University Press, 1998, p. 267
  Michael Zuker’s prediction server [6]
    http://www.bioinfo.rpi.edu/applications/mfold/old/rna/form1.cgi

Interaction Data: Biological Pathway and Network

Three main types of interaction data
  Signal transduction or transcription regulation
  Protein-protein interaction
  Metabolic pathway (best in terms of studying network topology)
Interaction databases
  KEGG database: metabolic pathways and signal transduction pathways in 107 organisms
  http://dip.doe-mbi.ucla.edu/dip/Links.cgi
Network model (random vs. scale free, small world)
Network analysis and visualization software
  http://www-personal.umich.edu/~mejn/courses/2004/cscs535/syllabus.pdf
  Pajek, AT&T DOT etc.

Metabolic Network in Homo sapiens

Summary statistics of network analysis in 16 organisms

Domain    Kingdom / Phylum              Organism        Num nodes  Num edges  Max Kout  Max Kin  Fraction of roots  Fraction of leaves  Single edges  Mutual edges
Eukarya                                 H.sapiens            1040       1528        12       11           0.130769            0.161538           572           478
Eukarya                                 R.norvegicus          763       1028        10        8           0.138925            0.165138           348           340
Eukarya                                 C.elegans             706        974        10        9            0.15864            0.157224           324           325
Eukarya                                 S.cerevisiae          748       1072         9       10           0.129679            0.140374           396           338
Bacteria  Proteobacteria (gamma)        E.coli                893       1365        12       14           0.139978            0.113102           459           453
Bacteria  Proteobacteria (gamma)        V.cholerae            738       1076        12       12           0.150407            0.123306           370           353
Bacteria  Proteobacteria (beta)         R.solanacearum        864       1238        11       12           0.138889            0.118056           406           416
Bacteria  Firmicutes (Bacillales)       B.subtilis            787       1151        12       12           0.133418            0.125794           401           375
Bacteria  Firmicutes (Lactobacillales)  L.lactis              545        778        11       11           0.157798             0.12844           280           249
Bacteria  Actinobacteria                S.coelicolor          814       1154        12       12            0.14742            0.135135           406           374
Bacteria  Cyanobacteria                 T.elongatus           509        697        12       12           0.143418            0.133595           237           230
Archaea   Euryarchaeota                 M.acetivorans         489        633         8        7           0.143149            0.134969           209           212
Archaea   Euryarchaeota                 T.acidophilum         458        593         8        8           0.170306            0.135371           197           198
Archaea   Crenarchaeota                 S.solfataricus        586        730         8        7           0.187713            0.151877           256           237
Archaea   Crenarchaeota                 S.tokodaii            522        651         8        7           0.180077            0.149425           229           211
Archaea   Crenarchaeota                 P.aerophilum          482        622         8        7           0.161826            0.120332           204           209
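
The columns of the table can be reproduced from a directed edge list. The sketch below is a minimal illustration under stated assumptions: a root is taken to be a node with no incoming edge, a leaf a node with no outgoing edge, self-loops are ignored, and a mutual pair (u -> v together with v -> u) is counted once. The last assumption matches the table arithmetic, e.g. for H.sapiens 572 + 2 x 478 = 1528. The function name and the toy edge list are hypothetical.

```python
def network_summary(edges):
    """Summarize a directed graph given as an iterable of (source, target) pairs."""
    edge_set = {(u, v) for u, v in edges if u != v}  # drop duplicates and self-loops
    nodes = {n for edge in edge_set for n in edge}
    k_out = {n: 0 for n in nodes}
    k_in = {n: 0 for n in nodes}
    for u, v in edge_set:
        k_out[u] += 1
        k_in[v] += 1
    # Each reciprocal pair is counted once (only when u < v), so that
    # num_edges == single_edges + 2 * mutual_edges.
    mutual = sum(1 for u, v in edge_set if u < v and (v, u) in edge_set)
    single = len(edge_set) - 2 * mutual
    roots = sum(1 for n in nodes if k_in[n] == 0)
    leaves = sum(1 for n in nodes if k_out[n] == 0)
    return {
        "num_nodes": len(nodes),
        "num_edges": len(edge_set),
        "max_k_out": max(k_out.values(), default=0),
        "max_k_in": max(k_in.values(), default=0),
        "fraction_roots": roots / len(nodes) if nodes else 0.0,
        "fraction_leaves": leaves / len(nodes) if nodes else 0.0,
        "single_edges": single,
        "mutual_edges": mutual,
    }

# Toy network: A <-> B is a mutual pair, D is a root, C is a leaf.
toy = [("A", "B"), ("B", "A"), ("B", "C"), ("D", "B")]
print(network_summary(toy))
```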

Microarray and Gene Expression Data

Assumptions
  Measured signal is proportional to the amount of the corresponding cDNA/mRNA
  The amount of mRNA determines the amount of protein, i.e. there is no regulation at the translational level
  Neither assumption has been proven yet.
DNA microarray databases (useful links)
  http://industry.ebi.ac.uk/~alan/MicroArray/
  http://genome-www5.stanford.edu/resources.html
  http://www.ebi.ac.uk/microarray/
  There are many more to explore!

Download Gene Expression Data from SMD: An example

Stanford Microarray Database (SMD)
Retrieving public data from SMD
  Retrieving data for an organism
    ftp://genome-ftp.stanford.edu/pub/smd/organisms
    One directory per organism, named with the two-letter code used by SMD
    Under each directory, one file per experiment
  Three ways to retrieve
    Web client, e.g. IE, Netscape, etc.
    Graphical ftp client, e.g. Flashget, etc.
    Command-line ftp client
      ftp -i genome-ftp.stanford.edu   (-i turns off per-file prompting, so mget gets them all)
      Name: anonymous
      Password: XX@
      cd pub/smd/organisms/SC
      mget *gz
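
The command-line session above can also be scripted. The following is a minimal sketch using Python's standard `ftplib`; the host and directory are the ones given on the slide, while the function names and the local directory `smd_SC` are hypothetical. The server may no longer be reachable, so treat this as a pattern rather than a recipe.

```python
from ftplib import FTP
from pathlib import Path

def fetch_gz(ftp, remote_dir, dest):
    """Download every file ending in 'gz' from remote_dir, like `mget *gz`."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    ftp.cwd(remote_dir)
    fetched = []
    for name in ftp.nlst():
        if not name.endswith("gz"):
            continue
        with open(dest / name, "wb") as handle:
            ftp.retrbinary(f"RETR {name}", handle.write)
        fetched.append(name)
    return fetched

def download_sc(dest="smd_SC"):
    """Anonymous download of the SC directory named on the slide."""
    with FTP("genome-ftp.stanford.edu") as ftp:
        ftp.login()  # no arguments means anonymous login
        return fetch_gz(ftp, "pub/smd/organisms/SC", dest)
```

Calling `download_sc()` performs the transfer; `fetch_gz` only needs an object with `cwd`, `nlst` and `retrbinary`, so it can be tried against a stand-in object without network access.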

Continued

Retrieving all public data for a publication
  Go to http://genome-www5.stanford.edu/cgi-bin/tools/display/listMicroArrayData.pl?tableName=publication
  Click any entry in the column “Data in SMD”
  Click “view” to read a brief description of the experiment design
  Click “display data” to do an experiment-wise query
  Click “Data Retrieval and Analysis” to filter and retrieve data

Part II. Information Retrieval in Bioinformatics

Mastering effective information retrieval techniques keeps your research thinking and work up to date
My steps in doing biomedical research
  Identify an interesting topic and raise a scientific hypothesis
  Start from NCBI Entrez, the life science search engine: http://www.ncbi.nlm.nih.gov/Entrez/
  Input the keyword or phrase into the query box and click GO
  The number of items retrieved from each database is displayed
  Briefly go through each kind of resource
NCBI Entrez (good starting point)
  Common retrieval interface to many databases
  Controlled links between databases
  Maintained at the National Center for Biotechnology Information (NCBI) in the National Library of Medicine (NLM)

PubMed and related IR Strategies

- biomedical literature and books

What is PubMed?
  PubMed is a web-based database of bibliographic information drawn primarily from the life sciences literature
  PubMed tutorial: http://www.nlm.nih.gov/bsd/pubmed_tutorial/m1001.html
Search mechanisms
  PubMed uses an Automatic Term Mapping feature
    Look first in the MeSH Translation Table (translates keywords into MeSH terms, e.g. from “renal transplant” to “kidney transplant”)
    Then look in the Journal Translation Table
    Finally look in the Author Index
  As soon as PubMed finds a match, the mapping stops. That is, if a term matches in the MeSH Translation Table, PubMed does not continue looking in the next table. It is absolutely necessary to specify the “Limit” in NCBI, e.g. “cell” is a MeSH term and also a journal name.
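
The lookup order can be sketched as a cascade that stops at the first hit. This is a toy model, not PubMed's implementation: the three tables below contain a handful of made-up entries, and the final fallback (individual words ANDed together in All Fields) follows the “no match” behaviour described for PubMed in these slides.

```python
# Toy lookup tables; the entries are illustrative, not PubMed's real tables.
MESH_TABLE = {"renal transplant": "kidney transplant", "cell": "cells"}
JOURNAL_TABLE = {"cell": "Cell"}
AUTHOR_INDEX = {"states dj"}

def map_term(term):
    """Return (source, mapped_value) for the first table that matches."""
    key = term.lower().strip()
    if key in MESH_TABLE:              # 1. MeSH Translation Table
        return ("MeSH", MESH_TABLE[key])
    if key in JOURNAL_TABLE:           # 2. Journal Translation Table
        return ("Journal", JOURNAL_TABLE[key])
    if key in AUTHOR_INDEX:            # 3. Author Index
        return ("Author", term)
    # No match anywhere: search the individual words in All Fields, ANDed.
    return ("All Fields", " AND ".join(term.split()))

for query in ("renal transplant", "cell", "States DJ",
              "TATA Box associated transcription factor"):
    print(query, "->", map_term(query))
```

In this toy model “cell” is caught by the MeSH table, so the journal entry is never reached, which is the situation the slide warns about.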

PubMed - Continued

What if “no match” is found?
  PubMed is unable to match a search term with either of the translation tables or the Author Index
  PubMed will then search the individual words in All Fields. Individual terms will be combined (ANDed) together.
  Example: TATA Box associated transcription factor
Phrase Searching
  Enclosing a phrase in double quotes instructs PubMed to bypass automatic term mapping. Instead PubMed looks for the phrase in its Index of searchable terms. If the phrase is in the Index, PubMed will retrieve citations that contain the phrase.
  PubMed may fail to find a phrase because it is not in the Index.
  Your phrase may actually appear in citation and abstract data, but may not be in the Index. If this is the case, the double quotes are ignored and the phrase is processed using Automatic Term Mapping.

MeSH database (“GO” in literature search)

Database of indexing terms
Entry example
  NF-kappa B
  Ubiquitous, inducible, nuclear transcriptional activator that binds to enhancer elements in many different cell types and is activated by pathogenic stimuli. The NF-kappa B complex is a heterodimer composed of two DNA-binding subunits: NF-kappa B1 and relA.
  Year introduced: 1991
Entrez => MeSH
  http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=mesh
NLM => MeSH
  http://www.nlm.nih.gov/mesh/meshhome.html

Structure of MeSH (Combination of EC and GO)

Divisions
  Anatomy [A]
  Organisms [B]
  Diseases [C]
  Chemicals and Drugs [D]
  Analytical, Diagnostic and Therapeutic Techniques and Equipment [E]
  Psychiatry and Psychology [F]
  Biological Sciences [G]
  Physical Sciences [H]
  Anthropology, Education, Sociology and Social Phenomena [I]
  Technology and Food and Beverages [J]
  Humanities [K]
  Information Science [L]
  Persons [M]
  Health Care [N]
  Geographic Locations [Z]

Hierarchy with Multiple Inheritance
  Amino Acids, Peptides, and Proteins [D12]
    Proteins [D12.776]
      DNA-Binding Proteins [D12.776.260]
        NF-kappa B [D12.776.260.600]
  Amino Acids, Peptides, and Proteins [D12]
    Proteins [D12.776]
      Nuclear Proteins [D12.776.660]
        NF-kappa B [D12.776.260.600]
  Amino Acids, Peptides, and Proteins [D12]
    Proteins [D12.776]
      Transcription Factors [D12.776.930]
        NF-kappa B [D12.776.260.600]
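
Multiple inheritance can be illustrated with tree numbers: a heading that sits under several parents carries one tree number per location, and “exploding” a term means also taking every heading below it. The sketch below is an illustration under a loud assumption: the slide prints D12.776.260.600 under all three parents, so the second and third NF-kappa B numbers used here are assumed by analogy with the parent numbers and may not match the real MeSH files. The function names are hypothetical.

```python
# Tree numbers of the parent headings are the ones printed on the slide.
# ASSUMPTION: the 2nd and 3rd NF-kappa B numbers are inferred by analogy.
TREE_NUMBERS = {
    "Amino Acids, Peptides, and Proteins": ["D12"],
    "Proteins": ["D12.776"],
    "DNA-Binding Proteins": ["D12.776.260"],
    "Nuclear Proteins": ["D12.776.660"],
    "Transcription Factors": ["D12.776.930"],
    "NF-kappa B": ["D12.776.260.600", "D12.776.660.600", "D12.776.930.600"],
}

def parents(term):
    """Headings whose tree number is the direct prefix of one of term's numbers."""
    by_number = {n: t for t, numbers in TREE_NUMBERS.items() for n in numbers}
    prefixes = {n.rsplit(".", 1)[0] for n in TREE_NUMBERS[term] if "." in n}
    return sorted(by_number[p] for p in prefixes if p in by_number)

def explode(term):
    """The term itself plus every heading located below it in the tree."""
    roots = TREE_NUMBERS[term]
    return sorted(
        t for t, numbers in TREE_NUMBERS.items()
        if any(n == r or n.startswith(r + ".") for n in numbers for r in roots)
    )

print(parents("NF-kappa B"))
print(explode("Transcription Factors"))
```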

MeSH Full Listing

NF-kappa B
  Ubiquitous, inducible, nuclear transcriptional activator that binds to enhancer elements in many different cell types and is activated by pathogenic stimuli. The NF-kappa B complex is a heterodimer composed of two DNA-binding subunits: NF-kappa B1 and relA.
  Year introduced: 1991
Subheadings:
  administration and dosage; agonists; analysis; antagonists and inhibitors; biosynthesis; blood; cerebrospinal fluid; chemistry; classification; deficiency; diagnostic use; drug effects; genetics; immunology; isolation and purification; metabolism; pharmacokinetics; pharmacology; physiology; radiation effects; secretion; therapeutic use; toxicity; ultrastructure
Restrict Search to Major Topic headings only
Do Not Explode this term (i.e., do not include MeSH terms found below this term in the MeSH tree).
Entry Terms:
  NF-kB
  NF kB
  Nuclear Factor kappa B
  kappa B Enhancer Binding Protein
  Immunoglobulin Enhancer-Binding Protein
  Enhancer-Binding Protein, Immunoglobulin
  Immunoglobulin Enhancer Binding Protein
  Transcription Factor NF-kB
  Factor NF-kB, Transcription
  NF-kB, Transcription Factor
  Transcription Factor NF kB
  Ig-EBP-1
  Ig EBP 1
Previous Indexing:
  DNA-Binding Proteins (1987-1990)
  Transcription Factors (1987-1990)
See Also:
  I-kappa B
All MeSH Categories > Chemicals and Drugs Category > Amino Acids, Peptides, and Proteins > Proteins > DNA-Binding Proteins > NF-kappa B
All MeSH Categories > Chemicals and Drugs Category > Amino Acids, Peptides, and Proteins > Proteins > Nuclear Proteins > NF-kappa B
All MeSH Categories > Chemicals and Drugs Category > Amino Acids, Peptides, and Proteins > Proteins > Transcription Factors > NF-kappa B

Tips for increasing your searching sensitivity and specificity

Chop the query yourself with logical AND and OR; look a term up yourself in the MeSH database, and use MeSH terms in your query
Use tags to search efficiently
  [au], “author”, e.g. States DJ[au]
  [dp], “date of publication”, e.g. 2004[dp]
  [ad], “address”, e.g. Ann Arbor[ad], etc.
  [MeSH], “MeSH term”, e.g. Transcription factor[MeSH]
Select the “Limited to” option to prevent the search from stopping prematurely
Use phrase searching (“”) if you don’t want your phrase to be partially searched
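
Tagged queries are easy to assemble programmatically. The helper names below (`tagged`, `all_of`, `any_of`) are hypothetical; only the tag syntax itself comes from the slide. This is a small sketch for composing a query string, not a client for any NCBI service.

```python
def tagged(value, tag):
    """Attach a field tag such as au, dp, ad or MeSH to a search value."""
    return f"{value}[{tag}]"

def all_of(*parts):
    """Combine sub-queries with AND, parenthesized so precedence is explicit."""
    return "(" + " AND ".join(parts) + ")"

def any_of(*parts):
    """Combine sub-queries with OR, parenthesized so precedence is explicit."""
    return "(" + " OR ".join(parts) + ")"

query = all_of(
    tagged("Transcription factor", "MeSH"),
    tagged("States DJ", "au"),
    any_of(tagged("2003", "dp"), tagged("2004", "dp")),
)
print(query)
```

Writing the parentheses explicitly avoids relying on how the search engine orders AND and OR.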

Entrez Clipboard and Address Issue

Send to “clipboard”
  Place to save results collected from multiple searches
  Saved for ~1 hr
Task: Find a local expert on NF kappa B
  “NF kappa B” AND (48109 [ad] OR “Ann Arbor” [ad] NOT Pfizer [ad])
  (scan results for the most common senior author)
Need to think about all the ways people write addresses
  “University of Michigan” fails to pick up “Univ. Mich.” or “UMMS” etc.
  Zip codes are very specific, but only get about 70%
  Won’t catch co-authored articles with a remote collaborator
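
Scanning the results for the most common senior author is a counting exercise. The sketch below assumes the senior author is the last name in each author list, which is a convention rather than a rule; the author lists are hypothetical stand-ins for retrieved citations, and the function names are made up for this example.

```python
from collections import Counter

def local_expert_query(topic, zip_code, city, exclude):
    """Build the address-restricted query in the form shown on the slide."""
    return f'"{topic}" AND ({zip_code}[ad] OR "{city}"[ad] NOT {exclude}[ad])'

def most_common_senior_authors(author_lists, top=3):
    """Count last-listed authors, assuming the senior author is listed last."""
    seniors = Counter(authors[-1] for authors in author_lists if authors)
    return seniors.most_common(top)

# Hypothetical author lists standing in for retrieved citations.
results = [
    ["Lee A", "Smith J"],
    ["Chen B", "Smith J"],
    ["Smith J", "Patel R"],
    ["Kim C"],
]

print(local_expert_query("NF kappa B", "48109", "Ann Arbor", "Pfizer"))
print(most_common_senior_authors(results, top=1))
```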

IR Strategies

Term search
  Simple search for term matches (exact or stemmed)
  “Find articles containing ‘p53’”
Boolean
  Logical combination of term matches
  “Find articles containing ‘p53’ AND ‘apoptosis’”
Statistical neighboring
  Assume that articles on the same subject will use similar words
  Rank articles by similarity of word use
  “Find articles using vocabulary similar to the vocabulary in this title/abstract”
Deeper parsing
  Natural language processing and deeper understanding
  The field is still in its infancy
  “Find articles describing the mechanism of p53 activation in apoptosis”
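
The first two strategies, term search and Boolean combination, can be sketched with an inverted index over a toy corpus. The three titles are hypothetical, matching is exact (no stemming), and the function names are made up for this example.

```python
import re

# Hypothetical titles standing in for a literature database.
ARTICLES = {
    1: "p53 activation triggers apoptosis in tumor cells",
    2: "Structure of the p53 DNA-binding domain",
    3: "Caspase cascades in apoptosis",
}

def tokens(text):
    """Lowercased alphanumeric words of a text, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

# Inverted index: term -> set of article ids containing it.
INDEX = {}
for doc_id, title in ARTICLES.items():
    for tok in tokens(title):
        INDEX.setdefault(tok, set()).add(doc_id)

def term_search(term):
    """Articles containing the exact term."""
    return set(INDEX.get(term.lower(), set()))

def boolean_and(*terms):
    """Articles containing every term."""
    hits = [term_search(t) for t in terms]
    return set.intersection(*hits) if hits else set()

def boolean_or(*terms):
    """Articles containing at least one term."""
    return set().union(*(term_search(t) for t in terms))

print(term_search("p53"))
print(boolean_and("p53", "apoptosis"))
print(boolean_or("p53", "apoptosis"))
```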

Boolean Searches

Entrez attempts to intelligently parse your query
  Query: dna binding transcription factor macrophage
  Details => (((("dna"[MeSH Terms] OR dna[Text Word]) AND (("pharmacokinetics"[MeSH Subheading] OR "pharmacokinetics"[MeSH Terms]) OR binding[Text Word])) AND ("transcription factors"[MeSH Terms] OR transcription factor[Text Word])) AND ("macrophages"[MeSH Terms] OR macrophage[Text Word]))
You can force a Boolean search
  Query: "dna binding" AND "transcription factor" AND macrophage
  Details => (("dna binding"[All Fields] AND "transcription factor"[All Fields]) AND ("macrophages"[MeSH Terms] OR macrophage[Text Word]))

Phrase Searching

Specify with quotes
  “transcription factor” vs. “transcription” “factor”
Precomputed
Fast
Often mapped to synonyms and MeSH terms
Just because you get a “phrase not found” message does not mean it is not present

Text Neighboring

Related articles link (single or multiple articles)
Term usage similarity
  Articles talking about the same thing are likely to use the same words
  Good recall (sensitivity)
  Precomputed and fast
Limitations
  Strictly algorithmic, no understanding
    “Ras activates PI3K” vs. “PI3K activates Ras”
  Historical and author biases in vocabulary
  Poor precision (specificity)
  Ranking cannot satisfy everyone
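
Why word-use similarity has no understanding can be seen with a bag-of-words cosine similarity. This is a generic sketch, not the scoring used by the Related Articles link: word order is discarded, so the two sentences from the slide become indistinguishable.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Word counts of a text, ignoring case and word order."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between the word-count vectors of two texts."""
    va, vb = bag_of_words(a), bag_of_words(b)
    dot = sum(count * vb[word] for word, count in va.items())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

# Opposite biological claims, identical vocabulary: similarity is (numerically) 1.
print(cosine("Ras activates PI3K", "PI3K activates Ras"))
# No shared words: similarity is 0.
print(cosine("Ras activates PI3K", "p53 induces apoptosis"))
```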

Computational Issues in Statistical Text Retrieval

Stop words
  Simple words like “the” and “and” are not worth scoring
Term weights
  Should weight matches of rare words more heavily than matches of common words
Stemming and synonyms
  Need to stem verbs and plural forms
  May or may not be able to reduce to a normalized set of synonyms
Normalizing for length
  Don’t want to exclude short articles or articles without an abstract
All vs. all comparison is not feasible
  10^7 articles => 10^14 comparisons, not feasible
  Compute demands of the task are growing faster than Moore’s law
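
The issues listed above map onto a few lines of code. The sketch below is one common way to handle them (a stop-word list, inverse-document-frequency weights, unit-length vectors); it is not the scheme used by any particular retrieval system, and the stop-word list and the plural stripping are deliberately crude. The last function counts distinct article pairs: for 10^7 articles that is n(n-1)/2, about 5 x 10^13, the same order of magnitude as the 10^14 quoted above.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "and", "of", "in", "a"}  # tiny illustrative list

def terms(text):
    """Lowercase, drop stop words, and crudely strip a plural 's'."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w[:-1] if len(w) > 3 and w.endswith("s") else w
            for w in words if w not in STOP_WORDS]

def idf_weights(docs):
    """Rare terms get larger weights: log(N / document frequency)."""
    n = len(docs)
    doc_freq = Counter(t for d in docs for t in set(terms(d)))
    return {t: math.log(n / count) for t, count in doc_freq.items()}

def weighted_vector(doc, idf):
    """TF-IDF vector scaled to unit length, so text length does not dominate."""
    tf = Counter(terms(doc))
    vec = {t: count * idf.get(t, 0.0) for t, count in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else vec

def all_pairs(n):
    """Number of distinct pairs among n articles."""
    return n * (n - 1) // 2

docs = ["p53 binds DNA", "p53 kinase"]
idf = idf_weights(docs)
print(idf)
print(weighted_vector(docs[0], idf))
print(all_pairs(10**7))
```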

Acknowledgements

Some slides in Part II are taken from Dr. States’ Bioinfo 526 class
  http://www.bioinformatics.med.umich.edu/Courses/526
Dr. Zhaohui (Steve) Qin for helpful discussion
All authors of the references that I have cited