Crash course in Omics

47
Crash course in Omics EuPathDB.org University of Pennsylvania

description

Crash course in Omics. EuPathDB.org University of Pennsylvania. Putative Function (GO). Expression profiling arrays (many platforms). BLAST Similarities. Subcellular Location. Genome and Annotation. Protein Expression. RNA Sequencing. Phenotype. Pathways. Isolates. SNPs. ChIP-Chip. - PowerPoint PPT Presentation

Transcript of Crash course in Omics

Page 1: Crash course in Omics

Crash course in Omics

EuPathDB.orgUniversity of Pennsylvania

Page 2: Crash course in Omics

Genome and Annotation

Structure

Phylogenetics

BLAST Similarities

SNPs Expressionprofiling arrays

(many platforms)

ComparativeGenomics

IsolatesPhenotype

Subcellular Location

Protein Expression

Orthologs

RNA Sequencing

Putative Function

(GO)

Interactions

ChIP-Chip

Pathways

Page 3: Crash course in Omics

Genome assembly

• 5X random genome shotgun• Library insert size• Paired-end? (Mated end pairs)• Contigs• Scaffolds

Page 4: Crash course in Omics

PrimerPrimer

SEQUENCE

End Reads (Mates)End Reads (Mates)

550bp

Mean & Std.Dev.Mean & Std.Dev.is knownis known

2-pair2-pair

ScaffoldScaffold

Gaps in scaffolds are traditionally indicated by 100 “N’s”

Pairs Give Order & Orientation

ConsensusConsensus

ReadsReads

SNPsSNPs

Distance?

PlasmidFosmid

NGS

Page 5: Crash course in Omics

ChromosomeChromosome

STS-mapped ScaffoldsSTS-mapped Scaffolds

ContigContig

Gap (mean & std. dev. Known)Gap (mean & std. dev. Known)Read pair (mates)Read pair (mates)

ConsensusConsensus

ReadsReads

SNPsSNPs

Anatomy of a WGS Assembly

STSSTS

Page 6: Crash course in Omics
Page 7: Crash course in Omics

Six Frame TranslationORF-finding

ORFs ≠ Genes

Page 8: Crash course in Omics
Page 9: Crash course in Omics
Page 10: Crash course in Omics
Page 11: Crash course in Omics

>Translation Frame 1 MQKPVCLVVAMTPKRGIGINNGLPWPHLTTDFKHFSRVTKTTPEEASRLN GWLPRKFAKTGDSGLPSPSVGKRFNAVVMGRKTWESMPRKFRPLVDRLNI VVSSSLKEEDIAAEKPQAEGQQRVRVCASLPAALSLLEEEYKDSVDQIFV VGGAGLYEAALSLGVASHLYITRVAREFPCDVFFPAFPGDDILSNKSTAA QAAAPAESVFVPFCPELGREKDNEATYRPIFISKTFSDNGVPYDFVVLEK RRKTDDAATAEPSNAMSSLTSTRETTPVHGLQAPSSAAAIAPVLAWMDEE DRKKREQKELIRAVPHVHFRGHEEFQYLDLIADIINNGRTMDDRT

Page 12: Crash course in Omics

Terminology

Page 13: Crash course in Omics

Eukaryotic Relationships ca. 2005

Keeling, 2005

Page 14: Crash course in Omics

Homology

Early Globin Gene

PARALOGS

ORTHOLOGS ORTHOLOGS

Gene Duplicationα-chain gene β-chain gene

β mouseα mouse β chickα chick β frogα frog

HOMOLOGS

Page 15: Crash course in Omics

Evolutionary relationships

• Homology - related by evolutionary descent not equivalent to similarity

• Orthology - same gene in different organisms, e.g. alpha hemoglobin in humans and chimps

• Paralogy - genes within an organism related by gene duplication, e.g. alpha and beta hemoglobin in humans

• Xenology - genes related by gene transfer

Page 16: Crash course in Omics

Synteny = large regions of chromosomes containing the same

genes

Page 17: Crash course in Omics

Synteny among Plasmodia

Page 18: Crash course in Omics

Expression Profiles

• The pattern of expression of one or more genes over time or a set of experimental conditions, e.g. during development or a drug treatment or in a genetic mutant such as a gene knock-out.

• Always… has a time and space component

Page 19: Crash course in Omics

Microarrays

•cDNA microarrays•“GeneChip” in situ synthesized oligonucleotide arrays•Oligomer (~70mer) arrays

Experiments are almost always Competitions between conditions or stages

Page 20: Crash course in Omics

The RNA samples from the test and the control are labeled with different colors in a reverse-transcription reaction and then hybridized, together, competitively to a slide or chip containing gene sequences in multiple copies.

Page 21: Crash course in Omics

Ratios of experimental to control expression are often expressed as colors rather than numbers

Page 22: Crash course in Omics

ClusteredMicroarrayDataGenes with SimilarExpressionProfiles areGroupedtogether

Page 23: Crash course in Omics

Other RNA expression

• Expressed Sequence Tags, ESTs– Usually represent

partial cDNA– Often clustered– Come from libraries

that may, or may not be normalized

– Often used to identify genes in genomes and locations of introns

• SAGE-tags (Serial Analysis of Gene Expression)– Primary purpose is

relative levels of gene expression

• RNA-Seq (NGS)– Little sequence bias– Quantitative– Can be strand-

specific

Page 24: Crash course in Omics

Genes can be located on either DNA strand

Page 25: Crash course in Omics

Overview of transcription: Either strand can serve as a template for a gene

Figure 8-4

Page 26: Crash course in Omics

Sequences of DNA and transcribed RNA

ConventionGene location = non-template strand, i.e. same as the mRNA

Page 27: Crash course in Omics

Complex patterns of eukaryotic mRNA splicing: What is a Gene?

Figure 8-14

Page 28: Crash course in Omics

Bioinformatics uses algorithms

• Algorithms are sets of rules for solving problems or identifying patterns

• Algorithms can be general or case specific and often need to be trained

• Computational analysis, like wet-bench analyses are only as good as the tools, techniques and material allow, and all interpretations come with caveats (like the experimental conditions, often call parameters in bioinformatics.

Page 29: Crash course in Omics

How to find an intron

• Usually begins with GT and end with AG• Must be longer than 19 nucleotides• Must contain a branchpoint “A”• Donor GT often followed by a sequence

pattern. This pattern is species-specific• Acceptor AG often proceeded by

pyrimidine stretch• Has a mean length of “X” as is observed

in this species

Page 30: Crash course in Omics
Page 31: Crash course in Omics

Different prediction methods often generate different

results

Prediction 1

Prediction 2

Page 32: Crash course in Omics

Protein Expression/Sequence

Data• MW-Isoelectric point• MW• Sequence/spans

Technology• 2D gel electrophoresis• Mass spectrometry• Tandem MS (MS-MS,

LC MS-MS etc)

Typical 2 D gel

Page 33: Crash course in Omics

•Direct identification of proteins from biological sample.•Capillary liquid chromatography apparatus (LC) coupled with...•Electrospray tandem mass spectroscopy (MS/MS)•“Sequest”, Mascot, or other software links mass spectra with genomic sequence database.

High throughput mass spectrometry

Page 34: Crash course in Omics

How Tandem MS WorksHow Tandem MS Works

Complex mixture Protein

Complex mixture Protein

PeptidesPeptides

PeptidesPeptidesIonizedIonized

+

+

+

++ +

+

+

+

+

MeasurementMeasurementIsolationIsolation FragmentationFragmentation

++ ++

Collision Inducted

Dissociation (CID)

Collision Inducted

Dissociation (CID)

Liquid chromatograp

hy

Liquid chromatograp

hy

Page 35: Crash course in Omics

Mass Spectrometer Protein Database

Nucleic Acid Database

EST Database

Tandem Mass SpectrumTheoretical Mass Spectrum

Correlation AnalysisCorrelation Analysis

Ranked Score of Matched PeptidesRanked Score of Matched Peptides

Sequest Database Search

Page 36: Crash course in Omics

ENNPCKLQYDYNTNVTHGFGQEYPCETDIVERFSDTEGAQCDKKKIKDNSEGACAPYRRLHVCVRNLENINDYSKINNKHNLLVEVCLAAKYEGESITGRYPQHQETNPDTKSQLCTVLARSFADIGDIIRGKDLYRGGNTKEKKKRKKLEENLKTIFGHIYDELKNGKTNGEEELQKRYRGDKDNDFYQLREDWWDANRETVWKAITCNAGSYQYSQPTCGRGEIPYVTLSKCQCIAGEVPTYFDYVPQYLRWFEEWAEDFCRKKKKKIPNVKTNCRQVQRGKEKYCDRDGYNCDGTIRKQYIYRLDTDCTKCSLACKTFAEWIDNQKEQFDKQKQKYQNEISGGGGRRQKRSTHSTKEYEGYEKHFNEELRNEGKDVRSFLQLLSKEKICKERIQVGEETANYGNFENESNTFSHTEYCDRCPLCGVDCSSDNCRKKPDKSCDEQITDKEYPPENTTKIPKLTAEKRKTGILKKYEKFCKNSDGNNGGQIKKWECHYEKNDKDDGNGDINNCIQGDWKTSKNVYYPISYYSFFYGSIIDMLNESIEWRERLKSCINDAKLGKCRKGCKNPCECYKRWVEKKKDEWDKIKEFFRKQKDLLKDIAGMDAGELLEFYLENIFLEDMKNANGDPKVIEKFKEILGKENEEVQDPLKTKKTIDDFLEKELNEAKNCVEKNPDNECPKQKAPGDGAAPSDPPREDITHHDGEHSSDEDEEEEEEEEQQPPAEGTEQGEEKSESKEVVEQQETPQKDTEKTVPTTTPTVDVCDTVKTALADTGSLNAACSLKYVTGKNYGWRCIAPSGTTSGKDGAICVPPRTQELCLYYLKELSDTTQKGLREAFIKTAAQETYLLWQKYKEDKQNETASTELDIDDPQTQLNGGEIPEDFKRQMFYTFGDYRDLFLGRYIGNDLDKVNNNITAVFQNGDHIPNGQKTDRQRQEFWGTYGKDIWKGMLCALQEAGGKKTLTETYNYSNVTFNGHLTGTKLNEFASRPSFLRWMTEWGDQFCRERITQLQILKERCMVYQYNGDKGKDDKKEKCTEACTYYKEWLTNWQDNYKKQNQRYTEVKGTSPYKEDSDVKESKYAHGYLRKILKNIICTSGTDIAYCNCMEGTSTTDSSNNDNIPESLKYPPIEIEEGCTCKDPSPGEVIPEKKVPEPKVLPKPPKLPKRQPKERDFPTPALKNAMLSSTIMWSIGIGFATFTYFYLKKKTKSTIDLLRVINIPKSDYDIPTKLSPNRYIPYTSGKYRGKRYIYLEGDSGTDSGYTDHYSDITSSSESEYEELDINDIYAPRAPKYKTLIEVVLEPSGNNTTASGNNTPSDTQNDIQNDGIPSSKITDNEWNTLKDEFISQYLQSEQPNDVPNDYSSGDIPLNTQPNTLYFDNPDEKPFITSIHDRDLYSGEEYSYNVNMVNTNNDIPISGKNGTYSGIDLINDSLNSNN

Peptide database

Note: ORFs in addition to predictedGenes must be searched

Page 37: Crash course in Omics

Mass Spectrometry can be used to

measure metabolic and other chemical

compounds

Page 38: Crash course in Omics
Page 39: Crash course in Omics
Page 40: Crash course in Omics

Homologous chromsomes (in a diploid)

Page 41: Crash course in Omics

Loci, alleles and SNPs in a population

A

a

AAGCCTCATC

ACGCCTCATC

SNP =Single Nucleotide Polymorphism

Page 42: Crash course in Omics

Alleles and Phenotype

• Some phenotypes are caused by a single locus in the genome and a single allele at that locus (e.g. some flower colors, or Drosophila eye color)

• Other phenotypes (Type-I diabetes, heart disease are multi-locus or “complex” (i.e. many genes are involved, each potentially with many alleles)

Page 43: Crash course in Omics

Population data

Data• Single Nucleotide

Polymorphisms, SNPs• Alleles• Allele frequency• Haplotypes

Technology• Chip-Seq• NGS

Page 44: Crash course in Omics

Alleles have frequencies in

different populations

Page 45: Crash course in Omics

Populations and alleles have geographic

boundariesA parasite isolate comes from a particular population, a

particular location and will have a specific haplotype (e.g. representation of alleles) often characterized via SNPs

Page 46: Crash course in Omics

Parasite Isolates

Data• Species, Strain, • Isolate• Location, Date• SNP• Sequence• Allele• phenotpe

Technology• PCR-RFLP• Sequencing• SNP chip• GPS

Page 47: Crash course in Omics

Experimental systems

PathogenVector

Host

Infectious Disease Paradigm