Download - Data files DB NGS Data and Sequence WEB Alignmenthpc.ilri.cgiar.org/.../content/NGSDataAlignments.pdfData files Applications and Servers Compute SSH WEB WEB App Service Data files

Transcript

8/11/16

1

DB

IGV

SERVER/REMOTE

Personal Computer/Local

Terminal

Data files

Applications and Servers

Compute

SSH

WEB

WEB

App

Service

Data files

SCP

Window

NGSDataandSequenceAlignment

ManpreetS.KatariAug11,2016

Outline

• NGSData• FastA• FastQ• SAM• BAM• GFF

• SequenceAlignment• GlobalvsLocal• DynamicProgramming• BurrowWheeler’sAlgorithm.

ImportantfilestypesFASTA

FASTQ

SAMBAM

GFF

Sequencefiles

Alignmentfiles

Annotationfiles

Importantfiletypes:FASTAAsequenceinFASTAformatbeginswithasingle-linedescription,followedbylinesofsequencedata.Thedescriptionlineisdistinguishedfromthesequencedatabyagreater-than(">")symbolinthefirstcolumn.Thewordfollowingthe">"symbolistheidentifierofthesequence,andtherestofthelineisthedescription(bothareoptional).

Thereshouldbenospacebetweenthe">"andthefirstletteroftheidentifier.Itisrecommendedthatalllinesoftextbeshorterthan80characters.Thesequenceendsifanotherlinestartingwitha">"appears;thisindicatesthestartofanothersequence.

Importantfiletypes:FASTA>chrICCACACCACACCCACACACCCACACACCACACCACACACCACACCACACCCACACACACACATCCTAACACTACCCTAACACAGCCCTAATCTAACCCTG

GCCAACCTGTCTCTCAACTTACCCTCCATTACCCTGCCTCCACTCGTTACCCTGTCCCATTCAACCATACCACTCCGAACCACCATCCATCCCTCTACTTACTACCACTCACCCACCGTTACCCTCCAATTACCCATATCCAACCCACTGCCACTTACCCTACCATTACCCTACCATCCACCATGACCTACTCACCATACTGTTCTTCTACCCACCATATTGAAACGCTAACAAATGATCGTAAATAACA

CACACGTGCTTACCCTACCACTTTATACCACCACCACATGCCATACTCACCCTCACTTGTATACTGATTTTACGTACGCACACGGATGCTACAGTATATACCATCTCAAACTTACCCTACTCTCAGATTCCACTTCACTCCATGGCCCATCTCTCACTGAATCAGTACCAAATGCACTCACATCATTATGCACGGCACTTGCCTCAGCGGTCTATACCCTGTGCCATTTACCCATAACGCCCATCATTAT

CCACATTTTGATATCTATATCTCATTCGGCGGTCCCAAATATTGTATAACTGCCCTTAATACATACGTTATACCACTTTTGCACCATATACTTACCACTCCATTTATATACACTTATGTCAATATTACAGAAAAATCCCCACAAAAATCACCTAAACATAAAAATATTCTACTTTTCAACAATAATACATAAACATATTGGCTTGTGGTAGCAACACTATCATGGTATCACTAACGTAAAAGTTCCTCAA

TATTGCAATTTGCTTGAACGGATGCTATTTCAGAATATTTCGTACTTACACAGGCCATACATTAGAATAATATGTCACATCACTGTCGTAACACTCTTTAGCGT

8/11/16

2

Importantfiletypes:FASTA>seq0FQTWEEFSRAAEKLYLADPMKVRVVLKYRHVDGNLCIKVTDDLVCLVYRTDQAQDVKKIEKF>seq1

KYRTWEEFTRAAEKLYQADPMKVRVVLKYRHCDGNLCIKVTDDVVCLLYRTDQAQDVKKIEKFHSQLMRLME LKVTDNKECLKFKTDQAQEAKKMEKLNNIFFTLM>seq2EEYQTWEEFARAAEKLYLTDPMKVRVVLKYRHCDGNLCMKVTDDAVCLQYKTDQAQDVKKVEKLHGK>seq3

MYQVWEEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSVCLQYKTDQAQDVK>seq4EEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSVVSYEMRLFGVQKDNFALEHSLL

Importantfiletypes:FASTQ

FASTQformatisatext-basedformatforstoringbothabiologicalsequence(usuallynucleotidesequence)anditscorrespondingqualityscores.BoththesequenceletterandqualityscoreareeachencodedwithasingleASCIIcharacterforbrevity

Fastq formatReadIdentifier

ReadSequence

ReadSequenceQuality

FASTQ:DataFormat• FASTQ

• Textbased• EncodessequencecallsandqualityscoreswithASCIIcharacters• Storesminimalinformationaboutthesequenceread• 4linespersequence

• Line1:beginswith@;followedbysequenceidentifierandoptionaldescription

• Line2:thesequence• Line3:beginswiththe“+”andisfollowedbysequenceidentifiersanddescription(bothareoptional)

• Line4:encodingofqualityscoresforthesequenceinline2

• References/Documentation• http://maq.sourceforge.net/fastq.shtml• Cocketal.(2009).Nuc AcidsRes38:1767-1771.

Sequencedataformat

Phred QualityScore Probabilityofincorrectbasecall Basecallaccuracy

10 1in10 90 %

20 1in100 99 %

30 1in1000 99.9 %

40 1in10000 99.99 %

50 1in100000 99.999 %

Q=Phred QualityScoresP=Base-callingerrorprobabilities

Qualityscores

!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~| | | | | |33 59 64 73 104 126

S - Sanger Phred+33, raw reads typically (0, 40)X - Solexa Solexa+64, raw reads typically (-5, 40)

I - Illumina 1.3+ Phred+64, raw reads typically (0, 40)J - Illumina 1.5+ Phred+64, raw reads typically (3, 40)

with 0=unused, 1=unused, 2=Read Segment Quality Control Indicator L - Illumina 1.8+ Phred+33, raw reads typically (0, 41)

Format/Platform QualityScoreType ASCIIencodingSanger Phred:0-93 33-126Solexa Solexa:-5-62 64-126Illumina 1.3 Phred:0-62 64-126Illumina 1.5 Phred:0-62 64-126Illumina 1.8 Phred:0-62 33-126 ***Sangerformat!

Qualityscoreencodingdifferamongtheplatforms

MostanalysistoolsrequireSangerfastq qualityscoreencoding

8/11/16

3

http://en.wikipedia.org/wiki/FASTQ_format

FASTQQualityscores

http://en.wikipedia.org/wiki/FASTQ_format

FASTQQualityscores

CapitalLettersaregood!

SAM(SequenceAlignment/Map)

• SAMistheoutputofalignersthatmapreadstoareferencegenome

• Tabdelimitedw/headersectionandalignmentsection

• Headersectionsbeginwith@(areoptional)• Alignmentsectionhas11mandatoryfields

• BAMisthebinaryformatofSAM

http://samtools.sourceforge.net/

Alignmentdataformat

http://samtools.sourceforge.net/SAM1.pdf

MandatoryAlignmentFields

BitwiseFlag

0x20= 16^1*2+16^0*0 = 32What is 77? Find greatest value without going over77-64 = 13 4013- 8 = 5 8

5- 4 = 1 41- 1 = 0 1

What is 141?

12481632641282565121024

CIGARstring

29M 1D 9M 1D 9M 2D 21M 2D 18M 1D 70M

8/11/16

4

SAMformat SAMformatgetsconvertedtoaBAMfile

AnnotationFormats

• Mostlytabdelimitedfilesthatdescribethelocationofgenomefeatures(i.e.,genes,etc.)

• Alsousedfordisplayingannotationsonstandardgenomebrowsers• Importantforassociatingalignmentswithspecificgenomefeatures• Descriptions

Columns:Seqid,Source,Type,Start,End,Score,Strand,Phase,Attribute(Identifier)

GFFformat

Columns:Seqid,Source,Type,Start,End,Score,Strand,Phase,Attribute(Identifier)

Globalandlocalapproachestoaligningsequences

24

GLOBAL: Attempt to “match” and assess similarity between two entire sequences

LOCAL: Find subsequences of high similarity

… and then possibly “stick” (chain, net, thread) together local alignments to

obtain an overall comparison of the original sequences.

The second approach is more meaningful

(especially for long sequences, of different lengths, like whole

genomes)

Two protein or DNA sequences are unlikely to present a straightforward overall “match”, even if they are closely

related.

Why? Substitutions are not the only process by which they diverge:

insertions, deletions and rearrangements

8/11/16

5

25

Dynamicprogramming:Pairwisesequencealignment

Createamatrix tableforcomparingsequenceswithonesequencealongeachaxis(sizem+1,n+1)

Fillinpartialalignmentscores untilscoreforentiresequencehasbeencalculated:- Assignscoreforeachpositioni,j,progressingfrom

toplefttobottomright

- Scorerule:takemaximumofthethreechoices1.Takevaluefrom left,assigngappenalty(gapalongleftaxis)

2.Takevaluefrom top,assigngappenalty(gapalongtopaxis)

3.Takevaluefrom diagonalaboveleft,assignmatch/m ismatchscore

- Repeat untiltableiscompleted.

Usetrace-back toobtainfullalignment:AC--TCG

ACAGTAG

BLAST: heuristic database search using local alignmentBasic Local Alignment Search Tool

26

AcartoonforBLAST

(parameters)

…TLSRDQHAWRLS……

QW(queryword),sizeW=3.

{(RDQ ,16)(RBQ ,14)…(REQ ,12)…(RDB,11)} ForeachQW,usethescoringmatrixtoformaneighborhoodNB={allwordsofsizeWwithascore> T=11}

…TLSRDQ HAWRLS………RLSREQ HTWRSS……

Findmatch(es)towordsbelongingtoNBinthesubject(target)sequence

Foreachmatch,usethescoringmatrixandgappenalties toproduceanHSP(HighscoringSegmentPair),thenextendalignmentonbothsides,until• dropis>X,or• scoregoesbelowSmin (minimumhitscore)

querysequence

Smin

X

(cumulative)score

AboutBlat(fromgenome.ucsc.edu)• “BLATonDNAisdesignedtoquicklyfindsequencesof95%andgreatersimilarityoflength25basesormore.“

• “Itmaymissmoredivergentorshortersequencealignments.Itwillfindperfectsequencematchesof20bases.“

• “BLATisnotBLAST.”

• “DNABLATworksbykeepinganindexoftheentiregenomeinmemory.Theindexconsistsofalloverlapping11-merssteppingby5exceptforthoseheavilyinvolvedinrepeats.”

• “Theindextakesupabout2gigabytesofRAM.Thegenomeitselfisnotkeptinmemory,allowingBLATtodeliverhighperformanceonareasonablypricedLinuxbox.“

• “Theindexisusedtofindareasofprobablehomology,whicharethenloadedintomemoryforadetailedalignment.”

HashTable(BLAT)

28

ShortReadApplications

Findingthealignmentsistypicallytheperformancebottleneck

…CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGTATAC…GCGCCCTA

GCCCTATCGGCCCTATCG

CCTATCGGACTATCGGAAA

AAATTTGCAAATTTGC

TTTGCGGTTTGCGGTA

GCGGTATA

GTATAC…

TCGGAAATTCGGAAATTT

CGGTATAC

TAGGCTATA

GCCCTATCGGCCCTATCG

CCTATCGGACTATCGGAAA

AAATTTGCAAATTTGC

TTTGCGGT

TCGGAAATTCGGAAATTTCGGAAATTT

AGGCTATATAGGCTATATAGGCTATATGGCTATATG

CTATATGCG

…CC…CC…CCA…CCA…CCAT

ATAC…C…C…

…CCAT…CCATAG TATGCGCCC

GGTATAC…CGGTATAC

GGAAATTTG

…CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGTATAC…ATAC……CC

GAAATTTGC

NewalignmentalgorithmsmustaddresstherequirementsandcharacteristicsofNGSreads

• Millionsofreadsperrun(30xofgenomecoverage)• ShortReads(asshortas36bp)• Differenttypesofreads(single-end,paired-end,mate-pair,etc.)• Base-callingqualityfactors• Sequencingerrors(~1%)• Repetitiveregions• Sequencingorganismvs.referencegenome• Mustadjusttoevolvingsequencingtechnologiesanddataformats

8/11/16

6

Indexing• Genomesandreadsaretoolargefordirectapproacheslikedynamicprogramming

• Indexing isrequired

• ChoiceofindexiskeytoperformanceSuffixtree Suffixarray Seedhashtables

Manyvariants,incl.spacedseeds

SuffixArray Find"ctat”inthereference

• InventedbyDavidWheelerin1983(BellLabs).Publishedin1994.“ABlockSortingLosslessDataCompressionAlgorithm”

SystemsResearchCenterTechnicalReportNo124.PaloAlto,CA:DigitalEquipmentCorporation,BurrowsM,WheelerDJ.1994

• Originallydevelopedforcompressinglargefiles(bzip2,etc.)

• Lossless,FullyReversible

• AlignmentToolsbasedonBWT:bowtie,BWA,SOAP2,etc.

• Approach:• Alignreadsonthetransformed referencegenome,usinganefficientindex(FMindex)• Solvethesimpleproblemfirst(alignonecharacter)andthenbuildonthatsolutiontosolveaslightly

harderproblem(twocharacters)etc.

• Resultsingreatspeedandefficiencygains(afewGigaByte ofRAMfortheentireH.Genome).OtherapproachesrequiretensofGigaBytes ofmemoryandaremuchslower.

NGSReadAlignmentBurrowsWheelerTransformation(BWT)

c t g a a a c t g g t $t g a a a c t g g t $ cg a a a c t g g t $ c ta a a c t g g t $ c t ga a c t g g t $ c t g aa c t g g t $ c t g a ac t g g t $ c t g a a at g g t $ c t g a a a cg g t $ c t g a a a c tg t $ c t g a a a c t gt $ c t g a a a c t g g$ c t g a a a c t g g t

Text= c t g a a a c t g g t $

Ø Introduce$attheendandconstructallcyclicpermutationsofText

BurrowsWheelerTransformation

$ c t g a a a c t g g ta a a c t g g t $ c t ga a c t g g t $ c t g aa c t g g t $ c t g a ac t g a a a c t g g t $c t g g t $ c t g a a ag a a a c t g g t $ c tg g t $ c t g a a a c tg t $ c t g a a a c t gt $ c t g a a a c t g gt g a a a c t g g t $ ct g g t $ c t g a a a c

BWT(Text)= t g a a $ a t t g g c cBurrowsWheelerMatrix

• Sortrowsalphabetically,keepingofwhichrowwentwhere

BurrowsWheelerTransformation

ExactMatchingwithFMIndex

• Inprogressiverounds,top &bot delimittherangeofrowsbeginningwithprogressivelylongersuffixesofQ

8/11/16

7