Bioinformatics Basics Cyrus Courtesy from LO Leung Yau’s original presentation.

  • date post

    22-Dec-2015

Bioinformatics Basics

Cyrus

Courtesy of LO Leung Yau’s original presentation

Outline

  • Biological Background
      • Cell
      • Protein
      • DNA & RNA
      • Central Dogma
      • Gene Expression
  • Bioinformatics
      • Sequence Analysis
      • Phylogenetic Trees
      • Data Mining

Biological Background – Cell

  • Basic unit of organisms
      • Prokaryotic
      • Eukaryotic
  • A bag of chemicals
  • Metabolism controlled by various enzymes
  • Correct working needs suitable amounts of various proteins

Picture taken from http://en.wikipedia.org/wiki/Cell_(biology)

Biological Background – Protein

  • Polymer of 20 types of amino acids
  • Folds into 3D structure
  • Shape determines the function
  • Many types
      • Transcription factors
      • Enzymes
      • Structural proteins
      • …

Pictures taken from http://en.wikipedia.org/wiki/Protein and http://en.wikipedia.org/wiki/Amino_acid

Biological Background – DNA & RNA

  • DNA
      • Double stranded
      • Adenine, Cytosine, Guanine, Thymine
      • A-T, G-C pairing
      • Those parts coding for proteins are called genes
  • RNA
      • Single stranded
      • Adenine, Cytosine, Guanine, Uracil

Picture taken from http://en.wikipedia.org/wiki/Gene
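
The pairing rules on this slide can be sketched in a few lines of Python. This is only an illustration and not part of the original presentation: the names `reverse_complement` and `transcribe` are made up for the example, and the input is assumed to contain only the letters A, C, G and T.

```python
# Watson-Crick pairing for DNA: A pairs with T, G pairs with C.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the opposite strand, read in its own 5'-to-3' direction."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def transcribe(coding_strand: str) -> str:
    """Return the RNA that has the same sequence as the coding strand (T becomes U)."""
    return coding_strand.upper().replace("T", "U")


print(reverse_complement("ATGC"))  # GCAT
print(transcribe("ATGC"))          # AUGC
```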

Biological Background – Genes

  • Genes: protein-coding regions
  • 3 nucleotides (a codon) code for one amino acid
  • There are also start and stop codons
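
A rough sketch of how codons map to amino acids, assuming the standard genetic code and the simplification that translation starts at the first ATG and ends at the first stop codon. The helper name `translate` is hypothetical, and the function raises `KeyError` on letters other than A, C, G and T.

```python
from itertools import product

# Standard genetic code, listed in TCAG order for each codon position.
# '*' marks the three stop codons (TAA, TAG, TGA).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate from the first start codon (ATG) up to the first stop codon."""
    dna = dna.upper()
    start = dna.find("ATG")
    if start == -1:
        return ""  # no start codon found
    protein = []
    for i in range(start, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":  # stop codon: translation ends here
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ccATGGCCTGGTAAgg"))  # MAW
```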

Biological Background – in a nutshell

Abstractions

  • Functional units: proteins
  • Templates: RNAs
  • Blueprints: DNAs

Not only the information (data), but also the control signals about what and how much data is to be sent. Proteins (TFs) also help.

Biological Background – Sequences

Abstractions

Sequences

acatggccgatcaggctgtttttgtgtgcctgtttttctattttacgtaaatcaccctgaacatgtTTGCATCAacctactggtgatgcacctttgatcaatacattttagacaaacgtggtttttgagtccaaagatcagggctgggttgacctgaatactggatacagggcatataaaacaggggcaaggcacagactc

FT   intron          <1..28
FT                   /gene="CREB"
FT                   /number=3
FT                   /experiment="experimental evidence…
FT                   recorded"
FT   exon            29..174
FT                   /gene="CREB"
FT                   /number=4
FT                   /experiment="experimental evidence…
FT                   recorded"
FT   intron          175..>189
FT                   /gene="CREB"
FT                   /number=4

Annotations

Visualizations
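
To connect the annotation lines to the sequence, here is a small sketch of how a feature location such as `29..174` selects part of a sequence. It assumes 1-based, inclusive coordinates, as used in EMBL/GenBank-style feature tables, and simply drops the `<` and `>` markers for partial features. The helper names and the toy sequence are invented for the example.

```python
def parse_location(location: str) -> tuple[int, int]:
    """Turn a simple 'start..end' location into two integers.

    '<' and '>' (feature continues beyond the record) are ignored here.
    """
    start, end = location.replace("<", "").replace(">", "").split("..")
    return int(start), int(end)


def extract_feature(sequence: str, start: int, end: int) -> str:
    """Slice a feature out of a sequence using 1-based inclusive coordinates."""
    return sequence[start - 1:end]


print(parse_location("175..>189"))         # (175, 189)
print(extract_feature("acgtacgt", 3, 5))   # gta
```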

Biological Background – DNA → RNA → Protein

Picture taken from http://en.wikipedia.org/wiki/Gene

gene

Biological Background – DNA → RNA → Protein

A transcriptional regulatory network is the complex interaction between genes, transcription factors (TFs) and transcription factor binding sites (TFBSs).

[Figure labels: Transcription factors; Binding sites; Genes; Promoter regions; Other functions]

Complex interactions between genes, TFs and TFBSs


Gene Expression Microarray Data

  • High throughput
  • Measures RNA level
  • Relies on A-T, G-C pairing
  • Can monitor expression of many genes

Picture taken from http://en.wikipedia.org/wiki/DNA_microarray_experiment

Gene Expression Microarray Data

Picture taken from http://en.wikipedia.org/wiki/DNA_microarray

Genes

Time points/Conditions

Colors: Expression (RNA) Levels

Bioinformatics – Sequence Analysis

Alignments

  • A way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences

http://en.wikipedia.org/wiki/Sequence_alignment

Bioinformatics – Sequence Analysis

Pair-wise alignments

  • Method: dynamic programming!
  • No penalty for the consecutive ‘-’s before and after the sequence to be aligned

\\Pc91106\Old_FYP\Bioinformatics for FYPs\CSC3220 Lectures
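
A minimal sketch of the dynamic programming idea, with free gaps before and after the shorter sequence. The scoring values (match +1, mismatch -1, gap -1) and the function name `semiglobal_score` are assumptions made for illustration; the lecture notes referenced above may use a different scheme. Only the best score is computed, not the alignment itself.

```python
def semiglobal_score(query: str, target: str,
                     match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Best score for aligning all of `query` somewhere inside `target`.

    Target letters left unaligned before and after the query cost nothing,
    which corresponds to unpenalised runs of '-' at both ends of the query.
    """
    # Row 0 is all zeros: the alignment may start anywhere in the target for free.
    prev = [0] * (len(target) + 1)
    for i in range(1, len(query) + 1):
        # Column 0: the first i query letters are aligned against gaps.
        curr = [i * gap]
        for j in range(1, len(target) + 1):
            pair = match if query[i - 1] == target[j - 1] else mismatch
            curr.append(max(prev[j - 1] + pair,   # align the two letters
                            prev[j] + gap,        # gap in the target
                            curr[j - 1] + gap))   # gap in the query
        prev = curr
    # Taking the maximum over the last row makes the trailing target letters free.
    return max(prev)


print(semiglobal_score("GATT", "CCGATTCC"))  # 4
print(semiglobal_score("GATT", "CCGACTCC"))  # 2
```

The table has one row per query letter and one column per target letter, so time grows with the product of the two lengths.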

Bioinformatics – Sequence Analysis

Multiple (global) sequence alignment

  • Also dynamic programming (but can’t scale up!)

Bioinformatics – Sequence Analysis

Multiple local sequence alignment

  • i.e. motif (pattern) discovery

>seq1
acatggccgatcagctggtttttgtgtgcctgtttctgaatc
>seq2
ttctattttacgtaaatcagcttgaacatgtacctactggtg
>seq3
atgcacctttgatcaataccagctagacaaacgtgtgttg
>seq4
agtccaaagatcagggctggctgaatactggatcagct
>seq5
cagctacagggcatataaaggggcaaggcacagactc

Such overrepresented patterns are often important components (e.g. TFBSs if the sequences are promoters of similar genes).

TFBSs are the controlling keyholes in gene regulation!

DNA motifs

  • Similar DNA fragments across individuals and/or species
  • TFBS motifs: DNA fragments similar to “TATAA” are common because they are needed to make genes function
  • Expensive and time-consuming to try a large set of candidates in biological experiments

[Figure: a transcription factor (TF) binds a TFBS (controlling, e.g. TATAA) on the DNA next to a gene (functioning); transcription gives RNA, and translation gives protein.]

TFBS Motif Discovery

[Figure labels: motif discovery; motif CGATTGA; function f, maximized; similar controlled functions, e.g. cancer gene activities]

SNP (single nucleotide polymorphism) Motif Discovery

[Figure labels: DNA from different people; nucleotide variants (A, C, G, T) at SNP positions; function f, maximized, to distinguish Normal from Disease]

Bioinformatics – Data mining

Classification: to predict!

  • Pre-processing: tidy up your materials!
  • Feature selection: the key points to go over
  • Classifier: the thinking style/manner of how to combine the key points and get some answer
  • Training: your practice of your thinking manner with answers known
  • Validation: mock quiz to evaluate what you’ve learnt from the training
  • Testing: your examination!

\\Pc91106\Old_FYP\Bioinformatics for FYPs\CSC5180 Data Mining Notes\c3class1.pdf
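
The training / validation / testing steps can be sketched with a toy example. Everything here is an assumption for illustration: the 60/20/20 split, the one-feature threshold "classifier", the toy data, and the function names. The referenced notes may describe other classifiers.

```python
import random


def split_dataset(samples, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle the samples and cut them into training, validation and test parts."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


def accuracy(threshold, samples):
    """Fraction of samples where the rule 'value > threshold' matches the label."""
    return sum((x > threshold) == label for x, label in samples) / len(samples)


def train_threshold(train):
    """Training: pick the threshold that scores best on the samples with known answers."""
    candidates = sorted(x for x, _ in train)
    return max(candidates, key=lambda t: accuracy(t, train))


# Toy data: one numeric feature; the label is True when the value is at least 5.
data = [(x, x >= 5) for x in range(10)]
train, val, test = split_dataset(data)
threshold = train_threshold(train)
print("validation accuracy:", accuracy(threshold, val))  # the mock quiz
print("test accuracy:", accuracy(threshold, test))       # the examination
```

The test part is touched only once, at the very end, so that its score is an honest estimate of performance on unseen data.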

Underfitting & Overfitting

TRANSFAC Project

  • TF: Transcription Factors, important regulators
  • TFBS: Transcription Factor Binding Site, major regulatory elements
  • TRANSFAC: the most representative DB for TFs and TFBSs

  1. Modeling: statistical models, representations, Markov chains. Discovery: stochastic searching, indexing (suffix trees)
  2. Relationship: TF-TFBS; TFBS-Gene… (understanding, prediction). Mining: text mining, approximate matching
  3. Annotations: accurate wet-lab candidates (reduced labor and costs). Computation: large-scale data processing; parallel computing

Representative Publications

[1] Gang Li, Tak-Ming Chan, Kwong-Sak Leung and Kin-Hong Lee, A Cluster Refinement Algorithm for Motif Discovery, IEEE/ACM Transactions on Computational Biology and Bioinformatics (accepted)

[2] Tak-Ming Chan, Kwong-Sak Leung, Kin-Hong Lee, TFBS identification based on genetic algorithm with combined representations and adaptive post-processing. Bioinformatics, 2008, 24(3), pp. 341-349

Bioinformatics – Data mining

Evaluation (scores!)

  • Confusion matrix
      • Binary classification
  • Performance evaluation metrics
      • Accuracy
      • Sensitivity/Recall/TP rate
      • Specificity/TN rate
      • Precision/PPV
      • …

\\Pc91106\Old_FYP\Bioinformatics for FYPs\CSC5180 Data Mining Notes\c3class3.pdf

  Accuracy = (TP + TN) / (TP + TN + FP + FN)
  Sensitivity = TP / (TP + FN)
  Specificity = TN / (TN + FP)
  Precision = TP / (TP + FP)
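
The four metrics named on this slide can be computed directly from the cells of a binary confusion matrix. A short sketch follows; the counts are made-up numbers and `binary_metrics` is a hypothetical helper name.

```python
def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Compute the slide's metrics from the four confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),   # also called recall or TP rate
        "specificity": tn / (tn + fp),   # also called TN rate
        "precision": tp / (tp + fp),     # also called PPV
    }


# Made-up counts: 40 true positives, 10 false positives,
# 45 true negatives, 5 false negatives.
metrics = binary_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts the accuracy is 0.85 and the precision is 0.80. The sketch does not guard against empty classes, which would cause a division by zero.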

Bioinformatics – Data mining

Evaluation

  • ROC (Receiver Operating Characteristic)
  • Trade-off between positive hits (TP) and false alarms (FP)
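
One way to see the trade-off is to lower the decision threshold step by step and record the false-positive rate and true-positive rate after each step. The sketch below does this for made-up classifier scores. It assumes all scores are distinct and that both classes are present, and the names `roc_points` and `auc` are chosen for the example.

```python
def roc_points(scores, labels):
    """Return (FP rate, TP rate) points obtained by lowering the threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing is predicted positive
    tp = fp = 0
    # Visit samples from the highest score to the lowest.
    for _, label in sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True):
        if label:
            tp += 1  # one more positive hit
        else:
            fp += 1  # one more false alarm
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


# Made-up scores; label 1 is the positive class.
points = roc_points([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
print(points)       # [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
print(auc(points))  # 0.75
```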

Not The End

Your corresponding tutor will have more project-specific stuff to tell you

Thanks Q & A