lecture-2 - Pennsylvania State University · 2012-08-30 ·...
2012 - BMMB 597D: Analyzing Next Generation Sequencing Data
Week 1, Lecture 2
István Albert
Biochemistry and Molecular Biology and Bioinformatics Consulting Center
Penn State
Get a good text editor
Desired features: syntax highlighting, line numbering, ability to view white space
• Komodo Edit
• Sublime Text
• TextMate
There are many other options.
Download the data for the lecture
The URL sent out via email (also on the course webpage)
http://downloads.yeastgenome.org/curation/chromosomal_feature/saccharomyces_cerevisiae.gff
Biological file formats
Each file format represents
1. Information – types of knowledge that are stored in the file
2. Optimization – types of operations that are easy/efficient to perform
The above implies that some information may not be present or cannot be easily extracted from a certain file format.
Tabular formats
• Many common bioinformatics data formats are column based and tab-separated
• The first format we deal with will be GFF3 – Generic Feature Format
(search for GFF3 to see the specification for version 3)
http://www.sequenceontology.org/gff3.shtml
The GFF3 specification
GFF format: search for GFF3 → http://www.sequenceontology.org/gff3.shtml
Tab separated with 9 columns. Missing attributes may be replaced with a dot → .
1. Seqid (usually chromosome)
2. Source (where is the data coming from)
3. Type (usually a term from the sequence ontology)
4. Start (interval start relative to the seqid)
5. End (interval end relative to the seqid)
6. Score (the score of the feature, a floating point number)
7. Strand (+/-/.)
8. Phase (used to indicate reading frame for coding sequences)
9. Attributes (semicolon separated attributes → Name=ABC;ID=1)
Example attribute specification: name=REB1;id=YP33546
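A minimal sketch of how the nine columns can be pulled apart on the command line. The record is a made-up toy line, the file name toy_columns.gff is hypothetical, and tr is an assumption (it is not among the commands listed for this lecture). The | pipe used in the last command is introduced later in this lecture.

```shell
# One toy GFF3 record (illustrative values, not taken from the course file).
printf 'chrI\ttoy\tgene\t335\t649\t.\t+\t.\tName=ABC;ID=1\n' > toy_columns.gff

# cut splits on tabs by default; column 3 is the feature type.
cut -f 3 toy_columns.gff

# Columns 4 and 5 are the start and end of the interval.
cut -f 4,5 toy_columns.gff

# Column 9 holds the attributes; tr puts each key=value pair on its own line.
cut -f 9 toy_columns.gff | tr ';' '\n'
```

Because the format is tab separated, cut needs no extra options to find the columns; a file that uses spaces instead of tabs would not split this way.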
Variants of GFF – GTF 2
GTF 2 – Gene Transfer Format, same 9 columns as the GFF
http://mblab.wustl.edu/GTF2.html
Differences
1. Only a subset of types are allowed in column 3: CDS, start_codon, stop_codon and a few more
2. Attribute column format change: each key is separated from its value by a space, not by an equals sign (=)
3. Two mandatory attributes at the end of the record:
• gene_id value; A globally unique identifier for the genomic source of the transcript
• transcript_id value; A globally unique identifier for the predicted transcript.
Example attribute specification: name "REB1"; id "YP33546"
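A small sketch of reading the GTF attribute column, for comparison with the GFF3 style. The record, the identifiers G1 and G1.t1, and the file name toy.gtf are all made up for illustration; tr is an assumption, as it is not among the commands listed for this lecture.

```shell
# One toy GTF record (illustrative values); column 9 uses key "value"; pairs.
printf 'chrI\ttoy\tCDS\t335\t649\t.\t+\t0\tgene_id "G1"; transcript_id "G1.t1";\n' > toy.gtf

# Show the raw attribute column.
cut -f 9 toy.gtf

# Put each pair on its own line, keep the gene_id pair, then strip the quotes.
cut -f 9 toy.gtf | tr ';' '\n' | grep gene_id | tr -d '"'
```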
What do the terms mean?
Sequence ontology browser
Searching for X_element_combinatorial_repeat
Unix commands in this lecture
• wc, cat, head, tail, sort, cut, grep, more, clear
Handy Tips
• CTRL-C → interrupts any process that may be running
• clear → clears the screen
• cursor keys → recall past commands
• auto-complete → write part of the filename then press TAB
Investigate your data
Check head/tail of the file
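A minimal sketch of checking both ends of a file. It builds a small toy file first so the commands can be tried without the course data; the file name toy_peek.gff and all values in it are hypothetical.

```shell
# Build a toy file: one header line plus three records (illustrative values).
# %b makes printf expand the \t escapes in each argument.
printf '%b\n' \
  '##gff-version 3' \
  'chrI\ttoy\tgene\t100\t150\t.\t+\t.\tID=g1' \
  'chrI\ttoy\tgene\t200\t250\t.\t+\t.\tID=g2' \
  'chrI\ttoy\tgene\t300\t350\t.\t-\t.\tID=g3' \
  > toy_peek.gff

# First two lines of the file.
head -n 2 toy_peek.gff

# Last line of the file.
tail -n 1 toy_peek.gff

# Number of lines in the file.
wc -l toy_peek.gff
```

Looking at both ends is worthwhile because a file may start with header lines and may end with content that looks nothing like the records in the middle.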
Paging data with: less (more)
• q or ESC → quits the pager
• SPACE or f → go forward, next page
• b → go backward
• /word → search for a word
• / → repeats the search for the last word
Find patterns in the file
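A short sketch of pattern matching with grep, on a toy file so that the result is easy to check by eye. The file name toy_grep.gff and its records are made up; the options -c and -v are standard grep options, but whether the lecture used them is an assumption.

```shell
# Toy file: one header line and four records (illustrative values).
printf '%b\n' \
  '##gff-version 3' \
  'chrI\ttoy\tgene\t100\t500\t.\t+\t.\tID=g1' \
  'chrI\ttoy\tCDS\t100\t500\t.\t+\t0\tID=c1' \
  'chrI\ttoy\ttRNA\t700\t780\t.\t-\t.\tID=t1' \
  'chrII\ttoy\tgene\t200\t900\t.\t-\t.\tID=g2' \
  > toy_grep.gff

# Lines containing the text "gene" anywhere (not only in column 3).
grep gene toy_grep.gff

# -c counts the matching lines instead of printing them.
grep -c gene toy_grep.gff

# -v inverts the match; '^#' matches lines that start with '#',
# so this keeps the records and drops the header line.
grep -v '^#' toy_grep.gff
```

Note that grep matches the whole line, so a word that also appears in the attribute column would be counted as well.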
Connecting streams
• Input streams: entry from the keyboard or files
• Output streams: print on screen, into files
Stream redirection uses the "arrow" symbols <, >
Input stream redirection from a file: < filename
Output stream redirection to a file: > filename
Redirecting to a file creates/overwrites that file
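The two arrows can be tried on a throwaway file. This is only a sketch; the file names types.txt and sorted_types.txt are hypothetical.

```shell
# > creates types.txt (or overwrites it if it already exists).
printf 'gene\nCDS\ntRNA\n' > types.txt

# < feeds types.txt to sort as input; > writes the sorted output to a new file.
sort < types.txt > sorted_types.txt
cat sorted_types.txt

# Redirecting to the same name again replaces the old contents.
printf 'gene\n' > types.txt
wc -l < types.txt
```

After the last two commands types.txt holds a single line, which is why redirecting output onto a file you still need is risky.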
Piping streams
• The pipe character | (located above the ENTER key) channels the output of one command into the input of the next
You can pipe multiple commands together
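A minimal sketch of chaining commands with pipes, again on a made-up toy file (toy_pipe.gff and toy_types_sorted.txt are hypothetical names).

```shell
# Toy file: one header line and four records (illustrative values).
printf '%b\n' \
  '##gff-version 3' \
  'chrI\ttoy\tgene\t100\t500\t.\t+\t.\tID=g1' \
  'chrI\ttoy\tCDS\t100\t500\t.\t+\t0\tID=c1' \
  'chrI\ttoy\ttRNA\t700\t780\t.\t-\t.\tID=t1' \
  'chrII\ttoy\tgene\t200\t900\t.\t-\t.\tID=g2' \
  > toy_pipe.gff

# Two commands: keep the record lines, then count them.
grep -v '^#' toy_pipe.gff | wc -l

# Three commands: keep records, isolate column 3, show the first two types.
grep -v '^#' toy_pipe.gff | cut -f 3 | head -n 2

# Pipes and redirection combine: save the sorted type column to a file.
grep -v '^#' toy_pipe.gff | cut -f 3 | sort > toy_types_sorted.txt
```

Each command only sees the output of the one before it, so a pipeline can be built up and checked one step at a time.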
Piping commands
Isolating relevant parts of our file
How many of each element
Find out how many of each feature
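One possible way to count features by type, shown on a toy file. This is a sketch under assumptions: uniq is not among the commands listed for this lecture, the file names are hypothetical, and the real file may contain lines other than records that need filtering first.

```shell
# Toy file: one header line and four records (illustrative values).
printf '%b\n' \
  '##gff-version 3' \
  'chrI\ttoy\tgene\t100\t500\t.\t+\t.\tID=g1' \
  'chrI\ttoy\tCDS\t100\t500\t.\t+\t0\tID=c1' \
  'chrI\ttoy\ttRNA\t700\t780\t.\t-\t.\tID=t1' \
  'chrII\ttoy\tgene\t200\t900\t.\t-\t.\tID=g2' \
  > toy_count.gff

# Drop the header, isolate the type column, sort it so equal values
# become adjacent, then let uniq -c collapse each run into a count.
grep -v '^#' toy_count.gff | cut -f 3 | sort | uniq -c > toy_type_counts.txt

# Each line holds a count followed by the type it belongs to.
cat toy_type_counts.txt
```

The sort step matters because uniq only merges lines that are next to each other; without it, the two gene records in the toy file would be reported separately.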
Homework 2
• Create a file that lists all ontology terms present in the provided GFF file, each with a count of how many times it occurs in the yeast genome. Sort this file by the count in reverse order (hint: man sort)
• Pick an ontology term that is unfamiliar to you, look it up in the Sequence Ontology, and paste the explanation into the homework