Bioinformatics for Genomic and Proteomic data analysis


Page 1: Bioinformatics for Genomic and Proteomic data analysis

Bioinformatics for Genomic and Proteomic data analysis

• Sequence Analysis

-- Predicting Function, domains etc.

-- Predicting physico-chemical properties of proteins (ProtParam).

-- Predicting signal peptides and transmembrane proteins (SignalP).

-- Finding homology between sequences, identifying repeats, etc. (DOTPLOT).

-- Major databases and retrieval techniques.

• Structure analysis

-- Gene Prediction

-- Phylogenetic analysis

-- Alignment techniques (BLAST, PSI-BLAST)

-- Analysis of protein structure and conformation (Rasmol, SwissPDBViewer, VMD, etc.).

-- Protein structure prediction: homology modeling (SwissModel, Modeller).

• Some practical applications

-- Sequence analysis

-- Structure analysis

Page 2: Bioinformatics for Genomic and Proteomic data analysis

Major Bioinformatics databases, Search engines and data formats.

By: Sachin Pundhir Bioinformatics sub-centre DAVV, Indore

Page 3: Bioinformatics for Genomic and Proteomic data analysis

Database

• Collection of records and files

• Organized for a particular purpose

• Tables

• Tuples (records)

– Attributes

» Values

Page 4: Bioinformatics for Genomic and Proteomic data analysis

BIO520 Student Database (1998)

Name  ID   Grade
Amy   123  A
Joe   456  B
Sue   789  C

Figure labels: Table (the whole grid), Tuple (one row), Attribute (one column), Value (one cell).

Page 5: Bioinformatics for Genomic and Proteomic data analysis

Database Operations

• Tables: create, delete

• Tuples (records): read, write, delete

• Search, sort, modify, print…

Example table (1998):

Name  ID   Grade
Amy   123  A
Joe   456  B
Sue   789  C
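The operations listed on this slide can be tried on the example table. The following is a minimal sketch using Python's built-in sqlite3 module with an in-memory database; the table name `students` and the function name `demo_student_table` are chosen here for illustration and do not come from the slides.

```python
import sqlite3


def demo_student_table():
    """Run create/write/search/modify/delete/sort on the slide's example table."""
    conn = sqlite3.connect(":memory:")  # throwaway database, nothing is saved
    cur = conn.cursor()

    # Table: create
    cur.execute("CREATE TABLE students (name TEXT, id INTEGER, grade TEXT)")

    # Tuples: write (one tuple per student)
    cur.executemany(
        "INSERT INTO students VALUES (?, ?, ?)",
        [("Amy", 123, "A"), ("Joe", 456, "B"), ("Sue", 789, "C")],
    )

    # Search: which students have grade B?
    cur.execute("SELECT name FROM students WHERE grade = ?", ("B",))
    found = [row[0] for row in cur.fetchall()]

    # Modify: change one value in one tuple
    cur.execute("UPDATE students SET grade = ? WHERE name = ?", ("A", "Joe"))

    # Tuples: delete
    cur.execute("DELETE FROM students WHERE name = ?", ("Sue",))

    # Sort and read: remaining tuples ordered by ID, highest first
    cur.execute("SELECT name, id, grade FROM students ORDER BY id DESC")
    rows = cur.fetchall()

    # Table: delete
    cur.execute("DROP TABLE students")
    conn.close()
    return found, rows


if __name__ == "__main__":
    print(demo_student_table())
```

Each SQL statement corresponds to one operation on the slide; the sequence databases discussed later expose the same kinds of operations through web interfaces rather than SQL.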

Page 6: Bioinformatics for Genomic and Proteomic data analysis

International Nucleotide Sequence Database Collaboration (INSDC)

• Consists of:

– DDBJ (Japan)

– GenBank (USA)

– EMBL Nucleotide Sequence Database (Europe)

• The three databases exchange new and updated data on a daily basis to achieve optimal synchronisation.

Page 7: Bioinformatics for Genomic and Proteomic data analysis

Bioinformatics databases

• Nucleotide sequence database:

– GenBank: nucleotide sequence database. Highly redundant.

– DDBJ: DNA Data Bank of Japan.

– EMBL: nucleotide sequence database.

– RefSeq: integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products, for major research organisms.

Primary databases

• Protein sequence database:

– GenPept: protein sequence database.

– UniProtKB/Swiss-Prot: curated protein sequence database with a minimal level of redundancy and a high level of integration with other databases.

– UniProtKB/TrEMBL: computer-annotated supplement of Swiss-Prot that contains all the translations of EMBL nucleotide sequence entries not yet integrated in Swiss-Prot.

– RefSeq: well-curated, non-redundant database.

• Structure database:

– PDB: Protein Data Bank

– MMDB: Molecular Modeling Database

Secondary database

Page 8: Bioinformatics for Genomic and Proteomic data analysis

GenBank Record

• Header: information that applies to the whole record

• Features: annotations on the record

• Sequence
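The three parts of a record can be separated with a few lines of code. The sketch below assumes the usual flat-file layout, in which a `FEATURES` line opens the feature table, an `ORIGIN` line opens the sequence, and `//` ends the record. `SAMPLE_RECORD` is a made-up record written for this example, not a real GenBank entry, and `split_genbank` is a hypothetical helper name.

```python
# Hypothetical record for illustration only; it is not a real GenBank entry.
SAMPLE_RECORD = """\
LOCUS       EXAMPLE01                 24 bp    mRNA    linear   SYN 01-JAN-2000
DEFINITION  Hypothetical example record, not a real GenBank entry.
ACCESSION   EXAMPLE01
VERSION     EXAMPLE01.1
FEATURES             Location/Qualifiers
     source          1..24
     CDS             1..24
ORIGIN
        1 atggctagct agctagctag ctaa
//
"""


def split_genbank(text):
    """Split one flat-file record into header lines, feature lines, and sequence."""
    sections = {"header": [], "features": [], "sequence": []}
    current = "header"
    for line in text.splitlines():
        if line.startswith("FEATURES"):
            current = "features"  # the feature table starts after this line
            continue
        if line.startswith("ORIGIN"):
            current = "sequence"  # the sequence starts after this line
            continue
        if line.startswith("//"):
            break  # end of record
        sections[current].append(line)

    # Sequence lines carry position numbers and spaces; keep only the letters.
    sequence = "".join(
        ch for line in sections["sequence"] for ch in line if ch.isalpha()
    )
    return sections["header"], sections["features"], sequence


if __name__ == "__main__":
    header, features, sequence = split_genbank(SAMPLE_RECORD)
    print(len(header), "header lines;", len(features), "feature lines;", sequence)
```

For real work a maintained parser is the safer choice, since actual records contain multi-line fields and qualifiers that this sketch does not interpret.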

Page 9: Bioinformatics for Genomic and Proteomic data analysis

GenBank Record: Header

• Locus Name

• Sequence Length

• Molecule Type

• GenBank Division

• Modification Date

• Accession Number

• Version Number
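The header fields named on this slide can be read from the LOCUS, ACCESSION, and VERSION lines of a record. The following is a minimal sketch that assumes an eight-token LOCUS line (keyword, name, length, "bp", molecule type, topology, division, date). The topology token is not mentioned on the slide, and the header lines are made up for this example; `parse_header` is a hypothetical helper name.

```python
# Hypothetical header lines for illustration only; not a real GenBank entry.
EXAMPLE_HEADER = [
    "LOCUS       EXAMPLE01                 24 bp    mRNA    linear   SYN 01-JAN-2000",
    "ACCESSION   EXAMPLE01",
    "VERSION     EXAMPLE01.1",
]


def parse_header(lines):
    """Extract the slide's header fields from LOCUS, ACCESSION, and VERSION lines."""
    fields = {}
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue
        if tokens[0] == "LOCUS" and len(tokens) == 8 and tokens[3] == "bp":
            fields["locus_name"] = tokens[1]
            fields["sequence_length"] = int(tokens[2])
            fields["molecule_type"] = tokens[4]
            fields["topology"] = tokens[5]  # assumption: present in this layout
            fields["division"] = tokens[6]
            fields["modification_date"] = tokens[7]
        elif tokens[0] == "ACCESSION" and len(tokens) >= 2:
            fields["accession"] = tokens[1]
        elif tokens[0] == "VERSION" and len(tokens) >= 2:
            fields["version"] = tokens[1]
    return fields


if __name__ == "__main__":
    for name, value in parse_header(EXAMPLE_HEADER).items():
        print(f"{name}: {value}")
```

Splitting on whitespace keeps the sketch short; LOCUS lines that omit a token or use a different layout are skipped rather than guessed at.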

Page 10: Bioinformatics for Genomic and Proteomic data analysis

GenBank Record

Link to Seq

FEATURE

Page 11: Bioinformatics for Genomic and Proteomic data analysis

GenBank Record: Sequence

Page 12: Bioinformatics for Genomic and Proteomic data analysis

Using Entrez

An integrated database search and retrieval system

Page 13: Bioinformatics for Genomic and Proteomic data analysis

WWW Access

Entrez & BLAST

Page 14: Bioinformatics for Genomic and Proteomic data analysis

Entrez: Database Integration

[Figure: linked Entrez databases: PubMed abstracts, Nucleotide sequences, Protein sequences, 3-D Structure, Genomes, Taxonomy. Neighbor-link labels: Word weight, VAST, BLAST, Phylogeny.]

Page 15: Bioinformatics for Genomic and Proteomic data analysis

Database Searching with Entrez

• Using limits and field restriction to find human MutL homolog

• Linking and neighboring with MutL
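Field restriction can also be written directly into the query text by appending a field name in square brackets to each term. The sketch below builds such a query and the matching request URL for NCBI's E-utilities `esearch` service. The endpoint, the database name `nuccore`, and the `retmax` parameter come from E-utilities rather than from these slides, and the helper names are chosen here for illustration. No network request is made.

```python
from urllib.parse import urlencode

# E-utilities search endpoint (not shown in the slides).
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_term(clauses):
    """Join (text, field) pairs into one fielded query string."""
    return " AND ".join(f"{text}[{field}]" for text, field in clauses)


def esearch_url(db, term, retmax=20):
    """Return the URL for an esearch request; the caller decides whether to fetch it."""
    return EUTILS_ESEARCH + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


if __name__ == "__main__":
    # Restrict "MutL" to the Title field and "human" to the Organism field.
    term = build_term([("MutL", "Title"), ("human", "Organism")])
    print(term)
    print(esearch_url("nuccore", term))
```

Title and Organism both appear in the Limits field list shown in the slides; any other field from that list can be substituted in the same way.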

Page 16: Bioinformatics for Genomic and Proteomic data analysis

Global Entrez Search

Page 17: Bioinformatics for Genomic and Proteomic data analysis

Document Summaries: MutL[All Fields]

Page 18: Bioinformatics for Genomic and Proteomic data analysis

Entrez Nucleotides: Limits & Preview/Index

Tabs

Page 19: Bioinformatics for Genomic and Proteomic data analysis

MutL

Entrez Nucleotides: Limits

Fields: Accession, All Fields, Author Name, EC/RN Number, Feature key, Filter, Gene Name, Issue, Journal Name, Keyword, Modification Date, Organism, Page Number, Primary Accession, Properties, Protein Name, Publication Date, SeqID String, Sequence Length, Substance Name, Text Word, Title, Uid, Volume

Field Restriction

Exclude bulk sequences

Page 20: Bioinformatics for Genomic and Proteomic data analysis

MutL

Entrez Nucleotides: Limits

Title == Definition

Exclude Bulk Sequences

Page 21: Bioinformatics for Genomic and Proteomic data analysis

Document Summaries: Limits

Page 22: Bioinformatics for Genomic and Proteomic data analysis

Adding Terms: Preview/Index

Fields: Accession, All Fields, Author Name, EC/RN Number, Feature key, Filter, Gene Name, Issue, Journal Name, Keyword, Modification Date, Organism, Page Number, Primary Accession, Properties, Protein Name, Publication Date, SeqID String, Sequence Length, Substance Name, Text Word, Title, Uid, Volume

Page 23: Bioinformatics for Genomic and Proteomic data analysis

Human MutL Search Results

Page 24: Bioinformatics for Genomic and Proteomic data analysis

Human MutL RefSeq

GenBank Records

Page 25: Bioinformatics for Genomic and Proteomic data analysis

NM_000249: Links

Page 26: Bioinformatics for Genomic and Proteomic data analysis

Literature Links

PubMed

OMIM

Page 27: Bioinformatics for Genomic and Proteomic data analysis

NM_000249: PubMed

Books

Page 28: Bioinformatics for Genomic and Proteomic data analysis

Books Link

Page 29: Bioinformatics for Genomic and Proteomic data analysis

OMIM: Human Disease Genes

Conserved Domain

Page 30: Bioinformatics for Genomic and Proteomic data analysis

Sequence Links

Nucleotide Protein

Page 31: Bioinformatics for Genomic and Proteomic data analysis

NM_000249: Related Sequences

similarity

Original GenBank mRNAs

Original GenBank genomic

Genome Project BAC

Page 32: Bioinformatics for Genomic and Proteomic data analysis

Taxonomy Link

The Tax Browser

NCBI’s Taxonomy

Page 33: Bioinformatics for Genomic and Proteomic data analysis

Taxonomy Link

Page 34: Bioinformatics for Genomic and Proteomic data analysis

NCBI Protein Databases

• GenPept: GenBank, EMBL, DDBJ CDS translations

• RefSeq: mRNA based (NP_) and genome based (XP_)

• Swiss-Prot: curated, high-quality protein reviews

• PIR: Protein Information Resource, Georgetown University

• PRF: Protein Research Foundation

• PDB: Protein Data Bank, sequences from structures
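RefSeq accession prefixes indicate what kind of record an identifier points to, which helps when search results mix several databases. The following is a small sketch of a prefix lookup. NP_ and XP_ appear on this slide and NM_ appears elsewhere in the deck (NM_000249); XM_ is added here for completeness and is not from the slides. The function name and the wording of the descriptions are chosen for illustration.

```python
# Prefix table: NM_, NP_, and XP_ appear in the slides; XM_ is added here.
REFSEQ_PREFIXES = {
    "NM_": "curated mRNA",
    "NP_": "curated protein",
    "XM_": "model mRNA",
    "XP_": "model protein",
}


def describe_accession(accession):
    """Return a short description based on the first three characters."""
    prefix = accession[:3].upper()
    return REFSEQ_PREFIXES.get(prefix, "not a RefSeq prefix in this table")


if __name__ == "__main__":
    print(describe_accession("NM_000249"))
```

The table covers only four prefixes, so anything else falls through to the default message instead of being guessed.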

Page 35: Bioinformatics for Genomic and Proteomic data analysis

Protein Link

BLAST Link

Conserved Domains

Page 36: Bioinformatics for Genomic and Proteomic data analysis

Related Proteins: Redundancy

Redundant Sequences
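Redundancy arises when the same sequence is stored under several identifiers. One simple way to collapse it is to group records by exact sequence, as sketched below. This is only an illustration of the idea and not a description of how NCBI builds its non-redundant sets; the identifiers and sequences are made up.

```python
def collapse_identical(records):
    """Group (identifier, sequence) pairs so identical sequences share one entry.

    Returns a list of identifier lists, in order of first appearance.
    """
    groups = {}
    for identifier, sequence in records:
        # Compare case-insensitively so "mkv" and "MKV" count as the same sequence.
        groups.setdefault(sequence.upper(), []).append(identifier)
    return list(groups.values())


if __name__ == "__main__":
    # Hypothetical identifiers and sequences, for illustration only.
    records = [
        ("entryA", "MKVLAA"),
        ("entryB", "MKVLGG"),
        ("entryC", "mkvlaa"),
    ]
    print(collapse_identical(records))
```

Exact matching is the strictest possible criterion; near-identical sequences stay separate under this approach.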

Page 37: Bioinformatics for Genomic and Proteomic data analysis

Sequence from MutL structure

Related Proteins: Links

Page 38: Bioinformatics for Genomic and Proteomic data analysis

BLink: non-redundant relatives

Arabidopsis homolog

Conserved Domain

Page 39: Bioinformatics for Genomic and Proteomic data analysis

MLH1 Domain Structure: CDD

ATPase Domain; Mismatch Repair Domain

Page 40: Bioinformatics for Genomic and Proteomic data analysis

MLH1: ATPase Domain

Page 41: Bioinformatics for Genomic and Proteomic data analysis

ATPase structural alignment

ATP Binding site helix

Page 42: Bioinformatics for Genomic and Proteomic data analysis

Genome Resources

Page 43: Bioinformatics for Genomic and Proteomic data analysis

NM_000249: Genome Links

Page 44: Bioinformatics for Genomic and Proteomic data analysis

Higher Genome Resources

Page 45: Bioinformatics for Genomic and Proteomic data analysis

MLH1: UniGene Cluster

Page 46: Bioinformatics for Genomic and Proteomic data analysis

ESTs in UniGene

Page 47: Bioinformatics for Genomic and Proteomic data analysis

The New Homologene

[Figure: an early globin gene undergoes gene duplication into an A-chain gene and a B-chain gene. Frog A, chick A, and mouse A are orthologs of one another, as are mouse B, chick B, and frog B; the A and B genes are paralogs.]

• No longer UniGene based

• Protein similarities first

• Guided by taxonomic tree

• Includes orthologs and paralogs
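The globin example can be expressed as a small rule: genes on the same side of the duplication (same chain) in different species are orthologs, and genes on opposite sides are paralogs. The sketch below encodes that rule under the assumption, taken from the figure, that the duplication happened before any of the species diverged. The (species, chain) representation and the function name are chosen here for illustration.

```python
def relationship(gene1, gene2):
    """Classify two globin genes given as (species, chain) pairs.

    Assumes one ancestral duplication (A versus B) that predates all
    speciation events, as in the figure on this slide.
    """
    if gene1 == gene2:
        return "same gene"
    _, chain1 = gene1
    _, chain2 = gene2
    if chain1 == chain2:
        return "orthologs"  # separated by speciation only
    return "paralogs"  # separated by the gene duplication


if __name__ == "__main__":
    print(relationship(("frog", "A"), ("mouse", "A")))
    print(relationship(("mouse", "A"), ("mouse", "B")))
```

With more than one duplication, or with duplications after speciation, the rule no longer holds and the gene tree itself has to be examined.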

Page 48: Bioinformatics for Genomic and Proteomic data analysis

The New Homologene

Page 49: Bioinformatics for Genomic and Proteomic data analysis

Entrez Genes: integrated gene-based access

• LocusLink

• Complete Genomes

– eukaryotic

– microbial

– organelle

Page 50: Bioinformatics for Genomic and Proteomic data analysis

Genes MLH1: Central Resource

Page 51: Bioinformatics for Genomic and Proteomic data analysis

QUESTIONS!!!