Fauteux Seeder Bosc2009

Post on 29-Nov-2014


François Fauteux

Department of Plant Science

McGill University

Macdonald campus

Seeder: Perl Modules for

Cis-regulatory Motif Discovery

Bioinformatics Open Source Conference

June 28 2009, Stockholm

• Precise control of where, when and at which level transcription occurs

• Synthetic promoter engineering

M. Venter, Trends Plant Sci 12, 118 (2007).

Introduction

Transcription Factor Binding Sites

• Searching for imperfect copies of an unknown pattern

• Sequence-driven approaches: not guaranteed to yield a global optimum

• Enumerative approaches: computationally expensive

• Convergence towards low-complexity motifs

D. GuhaThakurta, Nucleic Acids Res 34, 3585 (2006).

DNA Motif Discovery

W. W. Wasserman, A. Sandelin, Nat Rev Genet 5, 276 (2004).

• Set B={B1,...,Bm} of background sequences

• Set P={P1,...,Pn} of positive sequences

• Length k of the motif seed

• Length l of the full motif to discover

F. Fauteux, M. Blanchette, M. V. Stromvik, Bioinformatics 24, 2303 (2008).

Seeder Algorithm: Input

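• Enumerate all words of length k over the alphabet {A, C, G, T}

• SMD (substring minimal distance): smallest Hamming distance (HD) between a word w and any |w|-length substring of a sequence s

• SMDs between word w and the background sequences → probability distribution g_w(y)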

Seeder::Background

F. Fauteux, M. Blanchette, M. V. Stromvik, Bioinformatics 24, 2303 (2008).
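
The background step above can be illustrated with a minimal sketch. It is not the Seeder::Background implementation: the function names (hamming_distance, smd, background_distribution) and the toy sequences are hypothetical, and only the forward strand is scanned.

```perl
use strict;
use warnings;

# Hamming distance between two equal-length words.
sub hamming_distance {
    my ($x, $y) = @_;
    die "length mismatch" unless length($x) == length($y);
    my $d = 0;
    for my $i (0 .. length($x) - 1) {
        $d++ if substr($x, $i, 1) ne substr($y, $i, 1);
    }
    return $d;
}

# SMD: smallest Hamming distance between word $w and any
# |w|-length substring of sequence $s (forward strand only).
sub smd {
    my ($w, $s) = @_;
    my $k    = length $w;
    my $best = $k;    # the distance can never exceed the word length
    for my $i (0 .. length($s) - $k) {
        my $d = hamming_distance($w, substr($s, $i, $k));
        $best = $d if $d < $best;
    }
    return $best;
}

# Empirical distribution g_w(y): fraction of background sequences
# whose SMD to $w equals y, for y = 0 .. |w|.
sub background_distribution {
    my ($w, @background) = @_;
    my $m = scalar @background;
    my @g = (0) x (length($w) + 1);
    $g[ smd($w, $_) ]++ for @background;
    $_ /= $m for @g;
    return @g;
}

# Toy background set (hypothetical data).
my @g = background_distribution("ACG", "TTACGTT", "TTAGGTT", "TTTTTTT", "ACCACC");
print "@g\n";    # prints "0.25 0.5 0 0.25"
```

Each entry of the printed list is g_w(y) for y = 0, 1, 2, 3: one of the four toy sequences contains the word exactly, two contain a one-mismatch copy, and one has no match closer than three mismatches.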

• Sum S(w) of SMDs between w and the positive sequences → p-value

• Closest match to word w* (min. q-value) found in each positive sequence → seed PWM

• Matrix is extended to motif width, and sites maximizing the score to the extended weight matrix are selected

• PWM is built from those sites and the process is iterated
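
One way to turn the sum S(w) into a p-value can be sketched as follows, under a stated assumption: the SMDs to the n positive sequences are treated as independent draws from the background distribution g_w(y). The null distribution of their sum is then the n-fold convolution of g_w, and the p-value is the probability of a sum no larger than the observed one. The function names are hypothetical and this is not taken from Seeder::Finder.

```perl
use strict;
use warnings;

# Convolve two discrete distributions stored as arrays indexed by value.
sub convolve {
    my ($p, $q) = @_;
    my @r = (0) x (scalar(@$p) + scalar(@$q) - 1);
    for my $i (0 .. $#$p) {
        for my $j (0 .. $#$q) {
            $r[ $i + $j ] += $p->[$i] * $q->[$j];
        }
    }
    return \@r;
}

# P(sum of n i.i.d. SMDs <= observed sum) under background distribution $g.
sub sum_p_value {
    my ($g, $n, $observed) = @_;
    my $dist = [1];    # distribution of an empty sum: all mass on 0
    $dist = convolve($dist, $g) for 1 .. $n;
    my $p = 0;
    for my $y (0 .. $#$dist) {
        $p += $dist->[$y] if $y <= $observed;
    }
    return $p;
}

# Toy distribution g_w(y) for y = 0 .. 3 (hypothetical values),
# three positive sequences, observed sum S(w) = 1.
my @g = (0.25, 0.5, 0, 0.25);
printf "%.4f\n", sum_p_value(\@g, 3, 1);    # prints "0.1094"
```

A small sum means the word has close matches in most positive sequences, so the lower tail is the relevant one here.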

Seeder::Finder

F. Fauteux, M. Blanchette, M. V. Stromvik, Bioinformatics 24, 2303 (2008).
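
The seed-building step can be sketched in the same spirit: take the closest match to the selected word in each positive sequence and tabulate per-position base frequencies. This is a simplified illustration with hypothetical names (closest_match, frequency_matrix) and toy data. It scans the forward strand only and omits the pseudocounts, log-odds scoring, extension and iteration that a full PWM procedure would involve.

```perl
use strict;
use warnings;

# Substring of $s closest to $w in Hamming distance (first one on ties).
sub closest_match {
    my ($w, $s) = @_;
    my $k = length $w;
    my ($best, $best_d);
    for my $i (0 .. length($s) - $k) {
        my $site = substr($s, $i, $k);
        # grep in scalar context counts the mismatching positions
        my $d = grep { substr($w, $_, 1) ne substr($site, $_, 1) } 0 .. $k - 1;
        ($best, $best_d) = ($site, $d) if !defined $best_d || $d < $best_d;
    }
    return $best;
}

# Per-position base frequencies over a set of equal-length sites.
sub frequency_matrix {
    my @sites = @_;
    my $n = scalar @sites;
    my $k = length $sites[0];
    my @matrix;
    for my $pos (0 .. $k - 1) {
        my %count = (A => 0, C => 0, G => 0, T => 0);
        $count{ substr($_, $pos, 1) }++ for @sites;
        my %freq;
        $freq{$_} = $count{$_} / $n for qw(A C G T);
        push @matrix, \%freq;
    }
    return @matrix;
}

# Toy positive set (hypothetical data) and seed word.
my @positives = ("TTACGTT", "GGACTGG", "CCAAGCC");
my @sites     = map { closest_match("ACG", $_) } @positives;
print "@sites\n";    # prints "ACG ACT AAG"

my @pfm = frequency_matrix(@sites);
for my $pos (0 .. $#pfm) {
    print join(" ", map { sprintf "%s=%.2f", $_, $pfm[$pos]{$_} } qw(A C G T)), "\n";
}
```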

• List of indices corresponding to words of increasing HD

• Efficient lookup of minimally distant subsequence

Seeder::Index

F. Fauteux, M. Blanchette, M. V. Stromvik, Bioinformatics 24, 2303 (2008).
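
The idea of an index ordered by increasing HD can be illustrated with a small sketch. It assumes a simplified in-memory layout, not the file format written by Seeder::Index, and the function names are hypothetical. Words are grouped by their distance to a query word, and the SMD to a sequence is found by walking the groups from distance 0 upwards until a word that occurs in the sequence is met.

```perl
use strict;
use warnings;

# All words of length $k over the DNA alphabet.
sub all_words {
    my ($k) = @_;
    my @words = ('');
    for (1 .. $k) {
        @words = map { my $p = $_; map { $p . $_ } qw(A C G T) } @words;
    }
    return @words;
}

# $index[$d] lists every word at Hamming distance $d from $w.
sub hd_index {
    my ($w) = @_;
    my $k = length $w;
    my @index;
    for my $v (all_words($k)) {
        my $d = grep { substr($w, $_, 1) ne substr($v, $_, 1) } 0 .. $k - 1;
        push @{ $index[$d] }, $v;
    }
    return @index;
}

# Walk the index by increasing distance; the first word that is
# present in $s gives the substring minimal distance.
sub smd_lookup {
    my ($index, $s, $k) = @_;
    my %present;
    $present{ substr($s, $_, $k) } = 1 for 0 .. length($s) - $k;
    for my $d (0 .. $#$index) {
        for my $v (@{ $index->[$d] }) {
            return $d if $present{$v};
        }
    }
    return;    # sequence shorter than the word
}

my @index = hd_index("ACG");
print join(" ", map { scalar @$_ } @index), "\n";    # prints "1 9 27 27"
print smd_lookup(\@index, "TTAGGTT", 3), "\n";       # prints "1"
```

The lookup stops at the first hit, so sequences with a close match to the word are resolved after inspecting only the low-distance groups.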

Seeder Algorithm: Usage

#!/usr/bin/perl

use Seeder::Index;
use Seeder::Finder;
use Seeder::Background;

# Build the Hamming distance index for seeds of width 6
my $index = Seeder::Index->new(
    seed_width => "6",
    out_file   => "6.index",
);
$index->get_index;

# Compute the background distributions from the background sequences
my $background = Seeder::Background->new(
    seed_width    => "6",
    strand        => "revcom",
    hd_index_file => "6.index",
    seq_file      => "seqs.fasta",
    out_file      => "seqs.bkgd",
);
$background->get_background;

# Search the positive sequences for one motif of width 12
my $finder = Seeder::Finder->new(
    seed_width    => "6",
    strand        => "revcom",
    motif_width   => "12",
    n_motif       => "1",
    hd_index_file => "6.index",
    seq_file      => "prom.fasta",
    bkgd_file     => "seqs.bkgd",
    out_file      => "prom.finder",
);
$finder->find_motifs;

• Binding site sequences from the Transfac database

G. K. Sandve, O. Abul, V. Walseng, F. Drablos, BMC Bioinformatics 8, 193 (2007).

Benchmark Against Popular Tools

F. Fauteux, M. Blanchette, M. V. Stromvik, Bioinformatics 24, 2303 (2008).

SSP Promoter Motifs

F. Fauteux, M. V. Stromvik, submitted.

http://seeder.agrenv.mcgill.ca

Supervisor: Dr Martina Strömvik

Advisory committee: Dr Mathieu Blanchette, Dr Pierre Dutilleul

Acknowledgements