Post on 19-Aug-2020

“V Jornada de Usuarios de la RES”

Use of TBLASTX to find regions of homology among multiple large-size mammalian genomes

Francisco Câmara Ferreira

Bioinformatics & Genomics Unit (Roderic Guigó, CRG)

Why TBLASTX?

• SGP2 (Parra et al. 2003)

• ab initio Geneid + sequence similarity search algorithm (TBLASTX)

WHAT IS SGP2?

SGP2 is a comparative gene prediction tool: QUERY sequences from one genome (e.g. H. sapiens) are compared with TBLASTX against a collection of sequences from a second TARGET (REFERENCE; e.g. M. musculus) genome, and the resulting high-scoring segment pairs (“HSPs”) are used to modify the scores of the exons produced by the underlying ab initio gene prediction tool GENEID.

Geneid:

• Geneid is a protein-coding gene prediction tool that can be optimized for prediction in different species.

• Geneid follows a hierarchical structure: signal -> exon -> gene

• Exon score: score of exon-defining signals + protein-coding potential

• Dynamic programming algorithm: maximize score of assembled exons -> assembled gene
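The exon-assembly step can be illustrated with a small sketch. This is not Geneid's actual implementation: it is a textbook weighted-interval dynamic program that picks the highest-scoring set of non-overlapping exons, and it ignores reading-frame compatibility, strand, and intron length limits. The names `Exon` and `best_gene` are hypothetical.

```python
from bisect import bisect_left
from typing import List, NamedTuple, Tuple


class Exon(NamedTuple):
    start: int    # 1-based, inclusive
    end: int      # 1-based, inclusive
    score: float  # assumed to be signal score + coding-potential score


def best_gene(exons: List[Exon]) -> Tuple[float, List[Exon]]:
    """Return (score, chain) for the best chain of non-overlapping exons."""
    ordered = sorted(exons, key=lambda e: e.end)
    ends = [e.end for e in ordered]
    # best[i] holds the best (score, chain) using only the first i exons.
    best: List[Tuple[float, List[Exon]]] = [(0.0, [])]
    for i, exon in enumerate(ordered):
        # j = number of earlier exons that end strictly before this one starts.
        j = bisect_left(ends, exon.start, 0, i)
        take_score = best[j][0] + exon.score
        if take_score > best[i][0]:
            best.append((take_score, best[j][1] + [exon]))
        else:
            best.append(best[i])
    return best[-1]


if __name__ == "__main__":
    candidates = [
        Exon(1, 100, 5.0),
        Exon(50, 150, 6.0),
        Exon(120, 200, 4.0),
        Exon(160, 260, 3.0),
    ]
    score, chain = best_gene(candidates)
    print(score, [(e.start, e.end) for e in chain])
```

In this toy input the two compatible exons (1-100 and 120-200) together outscore the single best exon, which is the effect the dynamic programming step is meant to capture.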

SGP2
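How TBLASTX evidence can modify an ab initio exon score is sketched below. This is a simplified illustration, not SGP2's actual scoring function: it assumes the bonus is proportional to the number of exon bases covered by at least one HSP, and the `weight` parameter and the function names are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), 1-based inclusive


def covered_bases(exon: Interval, hsps: List[Interval]) -> int:
    """Count exon positions covered by at least one HSP."""
    # Keep only HSPs that overlap the exon, clipped to the exon boundaries.
    clipped = sorted(
        (max(s, exon[0]), min(e, exon[1]))
        for s, e in hsps
        if s <= exon[1] and e >= exon[0]
    )
    total, last_end = 0, exon[0] - 1
    for s, e in clipped:
        if e > last_end:
            # Count only positions not already counted by earlier HSPs.
            total += e - max(s, last_end + 1) + 1
            last_end = e
    return total


def rescore_exon(ab_initio_score: float, exon: Interval,
                 hsps: List[Interval], weight: float = 0.5) -> float:
    """Add a similarity bonus (assumed linear in coverage) to the exon score."""
    return ab_initio_score + weight * covered_bases(exon, hsps)


if __name__ == "__main__":
    exon = (100, 199)
    hsps = [(50, 120), (110, 150), (300, 400)]
    print(covered_bases(exon, hsps))
    print(rescore_exon(2.0, exon, hsps))
```

Exons supported by conserved sequence in the reference genome gain score and are therefore more likely to be selected when the gene is assembled.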

TBLASTX as a gene-prediction tool

Coding sequences evolve slowly compared to surrounding DNA

“Proper” evolutionary distance?

TBLASTX CHR5_1_5000000 CHR1_mm -hspmax=0 -gspmax=0 W=5 E=0.01 E2=0.01 -nogap -filter=xnu+seg S2=80 -matrix=blosum62 -altscore="* any -999" -altscore="any * -999"
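When the same search is launched for hundreds of fragments, it helps to build the command line programmatically. The sketch below only assembles the argument list using the options shown on the slide; the executable name `tblastx`, the function name, and the argument order (copied from the slide) are assumptions, and nothing is executed.

```python
import shlex
from typing import List


def build_tblastx_command(query: str, database: str,
                          executable: str = "tblastx") -> List[str]:
    """Assemble the TBLASTX argument list with the options from the slide."""
    return [
        executable, query, database,
        "-hspmax=0", "-gspmax=0",
        "W=5", "E=0.01", "E2=0.01",
        "-nogap", "-filter=xnu+seg", "S2=80",
        "-matrix=blosum62",
        # Heavily penalise alignments against stop codons ("*").
        "-altscore=* any -999",
        "-altscore=any * -999",
    ]


if __name__ == "__main__":
    cmd = build_tblastx_command("CHR5_1_5000000", "CHR1_mm")
    # shlex.join adds the quoting a shell would need for the altscore options.
    print(shlex.join(cmd))
```

Passing the arguments as a list avoids shell-quoting problems with the two `-altscore` values, which contain spaces and an asterisk.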

TBLASTX is computationally expensive

“flavour” of BLAST

6-frame translation of query/target
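The 6-frame translation can be sketched in a few lines. This is an illustrative helper using the standard genetic code, not the code TBLASTX itself runs; the function names are hypothetical, and codons containing ambiguous bases are translated as X.

```python
from typing import Dict

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built in TCAG order.
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons only; unknown codons become X."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna: str) -> Dict[str, str]:
    """Return the three forward and three reverse-strand translations."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frame_translation("ATGGCCTAA").items():
        print(frame, protein)
```

Because both query and target are translated in six frames, every pair of sequences implies 6 x 6 = 36 protein-level comparisons, which is a large part of why TBLASTX is so much more expensive than a nucleotide search.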

Why MareNostrum?

• H. sapiens vs. M. musculus

• 7-10 days on a 20-25 CPU grid

• 12-13 hours on 256 CPUs

• Multiple genomes compared concurrently

PARALLELIZATION!
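The run times quoted above can be turned into a rough wall-clock speedup. The sketch below only does the arithmetic on the figures from the slide; it assumes 24-hour days and does not account for differences in CPU speed between the two systems, so it is not a like-for-like benchmark. The function names are hypothetical.

```python
def wall_clock_speedup(slow_hours: float, fast_hours: float) -> float:
    """Ratio of elapsed times between two runs of the same job."""
    return slow_hours / fast_hours


def cpu_hours(cpus: int, hours: float) -> float:
    """Total CPU time consumed by a run."""
    return cpus * hours


if __name__ == "__main__":
    # Grid: 7-10 days; MareNostrum: 12-13 hours (figures from the slide).
    lowest = wall_clock_speedup(7 * 24, 13)
    highest = wall_clock_speedup(10 * 24, 12)
    print(f"wall-clock speedup: {lowest:.1f}x to {highest:.1f}x")
    print("grid CPU-hours:", cpu_hours(20, 7 * 24), "to", cpu_hours(25, 10 * 24))
    print("MareNostrum CPU-hours:", cpu_hours(256, 12), "to", cpu_hours(256, 13))
```

On these figures the elapsed time drops by roughly 13x to 20x, while the total CPU-hours stay in a similar range.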

LARGE SIZE OF MAMMALIAN GENOMES (e.g. Human & Cow ~3 Gbases, Mouse 2.5 Gbases…)

Strategy for MN TBLASTX:

• Fragment “query” genome:

• H. sapiens genome: >650 5-Mbase fragments

• Reference genome divided into 10-Mbase fragments (internally)

• 22 chromosomes for M. musculus
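The fragmentation strategy can be sketched as follows. This is not the MareNostrum pipeline itself: the naming scheme (chromosome_start_end) is inferred from the fragment name CHR5_1_5000000 in the TBLASTX command line, and the function names, the example chromosome length, and the reference database names are hypothetical.

```python
from itertools import product
from typing import Iterator, List, Tuple


def fragment_chromosome(name: str, length: int,
                        fragment_size: int = 5_000_000
                        ) -> Iterator[Tuple[str, int, int]]:
    """Yield (fragment_name, start, end) with 1-based inclusive coordinates."""
    for start in range(1, length + 1, fragment_size):
        end = min(start + fragment_size - 1, length)
        yield f"{name}_{start}_{end}", start, end


def job_list(query_fragments: List[str],
             reference_databases: List[str]) -> List[Tuple[str, str]]:
    """Every query fragment is searched against every reference database."""
    return list(product(query_fragments, reference_databases))


if __name__ == "__main__":
    # Hypothetical 12-Mbase chromosome, used only to show the last short fragment.
    fragments = [n for n, _, _ in fragment_chromosome("CHR5", 12_000_000)]
    print(fragments)
    jobs = job_list(fragments, ["CHR1_mm", "CHR2_mm"])
    print(len(jobs), "independent TBLASTX jobs")
```

Each (fragment, database) pair is an independent job, which is what makes the comparison easy to distribute over many CPUs.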

TBLASTX MN PIPELINE: David García Cortés/Xavier Pastor

Significant publications (MN-derived)

SGP2 importance as an annotation tool

SGP2 is a component of the comparative gene prediction pipelines used to annotate:

• Human (MN)

• Mouse

• Rat

• Cow (MN)

• Chicken

• Paramecium

• Also several species of insects and plants (Melon/Bean)

UCSC Genome browser: http://www.genome.ucsc.edu

GBL Web server: http://genome.crg.es/genepredictions/

Acknowledgments

• BSC-CNS/U. de Cantabria

• Xavier Pastor

• David García Cortés

• Genis Parra/Josep Abril/Roderic Guigó (developers of SGP2)