MES7594-01 Genome Informatics I - Lecture V. Short Read Alignment Sangwoo Kim, Ph.D. Assistant...
![Page 1: MES7594-01 Genome Informatics I - Lecture V. Short Read Alignment Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei.](https://reader035.fdocuments.net/reader035/viewer/2022062718/56649e7d5503460f94b80bba/html5/thumbnails/1.jpg)
Genome Informatics I (2015 Spring)
MES7594-01 Genome Informatics I
- Lecture V. Short Read Alignment
Sangwoo Kim, Ph.D.
Assistant Professor, Severance Biomedical Research Institute, Yonsei University College of Medicine
![Page 2: MES7594-01 Genome Informatics I - Lecture V. Short Read Alignment Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei.](https://reader035.fdocuments.net/reader035/viewer/2022062718/56649e7d5503460f94b80bba/html5/thumbnails/2.jpg)
Overview
• Goal of this lecture
– You will learn the principle of mapping NGS short reads to a reference genome and practice with alignment tools
• Short Read Alignment Theory
– Why do we need a special algorithm?
– The Burrows-Wheeler Transformation (BWT)
– BWT indexing
– LF search
– Examples
• Practice with BWA
– with NA18507 sequences
• Understanding alignment information
– Viewing/Converting SAM/BAM format
– Interpreting alignment information
![Page 3: MES7594-01 Genome Informatics I - Lecture V. Short Read Alignment Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei.](https://reader035.fdocuments.net/reader035/viewer/2022062718/56649e7d5503460f94b80bba/html5/thumbnails/3.jpg)
SHORT READ ALIGNMENT THEORY
![Page 4: MES7594-01 Genome Informatics I - Lecture V. Short Read Alignment Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei.](https://reader035.fdocuments.net/reader035/viewer/2022062718/56649e7d5503460f94b80bba/html5/thumbnails/4.jpg)
RAW NGS DATA (FASTQ)
@SRR764745.4352210/1
TAACACCTGGGAAATTCATCACAAAAAGATCTTAGCCTAGGCACATTGTCATTAGGTTATCCAAAGTTAAGACAAAGGAAAGAATCTTAAGAGCTGTGAGA
+
5FIFEFHFGHHEFFEEIFFIFHFGGGGKGFJHFEKJJIFKKJGHGGGJFKHGGGLLFGGHLKHJJMGGGJNJKIJJLLIIIKJIHIKJEGFACGEEEDC>
[email protected]/1
ATATATGAAGGAAAGATACAGTCATTTTCAGACAAACAAATGCTGACAGAATTTGCCATTACCAAGCCAGGACTCTAAGAACTGCTAAAAGGAGCTCTAAA
+
6FFDBDGDEGFEEEGEDBEEFDFEEDEEFFGEEFGFFFGFEHGGHEFFGFGEFFHGGFFFDGGGGHGGGHHGFHGGEGHGHFGIIGCFFFED?ADC>B<>>
@SRR764746.2695391/1
TAAAAGAGACAAAGAGAGACAGTATATCATCTGTCATCTGACAGTCTCATCCAACAGAAAAATATGACAATCCTAAACATATGTGAACCTAACACTGGAGC
+
6FIEEFDFEEEFEFEFEFEEEFDBECEFFEFFGEFFEFGHEFFGDGGFFEEGFGFFHFGGGGEDFHFFGHFGFHFGGGFFEFIGJFGGIHBDECCCD?;>
[email protected]/1
TTAAATAACCTGCTCCTGAATGAGCATTGGGTGAAAAACGAAATCAAGATGGAAATGTAAAAAATTTCTTCGAACTGGATGACACAACCTATCAAGACCTC
+
5FBCC@A*CHDFDDDDEFBDDGADFCBDFFEEGEGADEEAE4DEFFEGBEHE8;ADHD@DGGFCGDEDGFB==B?GNG@FMC@JFF>:FG=DDED=&>@A#
@SRR764746.5506495/1
CACAACCTATCAAGACCTCTGGGATACAGCAAAGGCAGTGCTAAGAGGAAAGTTTATAGCACTAAACACCTACGTCGAAAAGTCTGAAAGAGCACAGACAA
+
5HIDDDEEBDEEEFEEEFEFGFFEECFFGFFFFGFFFGDHGGCFGFGGFGGHDEFDFDHGGFGDGGFGFGFDFAEFBCFFFFJDIKCEEFACFBCA?;A@
[email protected]/1
CCATAGAAAGGAATGAATTAACAGCATTTCCTGTGACCTGGACGAGATTGGAGACTATTGTTCTAAGTGATGTAACCCAGGAATGGAAAACTCAACATTGT
+
5IHCBE@EEFFDEDGDEDDCFEEGFEEEDFDFGEHEFFFHEBHABHDEDHGDGFFGDFFHEEGGDGHFIFFIEDGFGHGHHCJCIGCEEEHFAB?B@<
[email protected]/1
TGTCCTTTCCAGGGACATGGATGAAGCTGGAAACCATCATTCTCAGCAAACTAACACAAGAAAAGAAAACCAGGCCAGGAGCAGTGGCTCATGCCTGTAGT
+
5JIAIHEDHHDHGGFFFEIJFFHDCIHHHKFGHIIGGFGGGGHIGDGGIIIIGGJGFGGIIFHHKHIJIJKHLKILGCIIHMHKDKMLKFJBHHHBGFABB
@SRR764745.944258/1
GAGAACACATGGACACAGGGAGGGGAACATCACACACTGGGGCCTGTCAAAGGGTGGGAGGCTGGGGGAGGAACAGCATTAGGAGAAATACCTAATGTAGA
+
5FFDEFEFEDIH?CECEHEHCHIJI>BCCCIDFFFFIHIBHBHFAAFEGGFHMM8FDCDGIEHGAGG@BGAAFKH?6>DKDDNIK?9<FHGBICDBG@<<
[email protected]/1
TGGGGAAAAAAAACATTCTCTGAAATTTGCTTTTATACCATTAAAGACTTATTTTTTATTACCAGCAATACAGGGCAACTCATTCAGGTTGAATCTTGAAG
+
6NMHHFBGGFFEGHEEEIHIDIFGFDFFHFFEFEEGFIJGGGEHHLHIJEFHGHGHFFGGFJKHJJHHFFMHKNBEIFMMGLEIGJHMJCM@CA?FCD;GB
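Each FASTQ record takes four lines: a header starting with “@”, the read sequence, a “+” separator, and one quality character per base. A minimal Python sketch of reading such records follows. The record used here (`read1/1`) is hypothetical, not from NA18507, and the quality offset of 33 (Phred+33) is an assumption about the encoding of the file.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one record per four lines: @name, sequence, +, quality."""
    while True:
        header = handle.readline()
        if header == "":          # end of file
            return
        header = header.rstrip("\n")
        if header == "":          # skip blank lines between records
            continue
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("not a four-line FASTQ record: " + header)
        if len(sequence) != len(quality):
            raise ValueError("sequence/quality length mismatch: " + header)
        yield FastqRecord(header[1:], sequence, quality)


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Convert quality characters to scores (offset 33 is an assumption)."""
    return [ord(char) - offset for char in quality]


# Hypothetical record, for illustration only.
EXAMPLE = "@read1/1\nGATTCAAA\n+\nIIIIHHG5\n"

records = list(read_fastq(io.StringIO(EXAMPLE)))
for record in records:
    print(record.name, record.sequence, phred_scores(record.quality))
```

For a real file, `read_fastq(open("yourdata.fastq"))` would be used in place of the in-memory example.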
![Page 5: MES7594-01 Genome Informatics I - Lecture V. Short Read Alignment Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei.](https://reader035.fdocuments.net/reader035/viewer/2022062718/56649e7d5503460f94b80bba/html5/thumbnails/5.jpg)
Mapping back to genome
TAACACCTGGGAAATTCATCACAAAAAGATCTTAGCCTAGGCACATTGTCATTAGGTTATCCAAAGTTAAGACAAAGGAAAGAATCTTAAGAGCTGTGAGA
Where is this sequence in the human genome?
![Page 6: MES7594-01 Genome Informatics I - Lecture V. Short Read Alignment Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei.](https://reader035.fdocuments.net/reader035/viewer/2022062718/56649e7d5503460f94b80bba/html5/thumbnails/6.jpg)
Do this as fast as possible!
![Page 7: MES7594-01 Genome Informatics I - Lecture V. Short Read Alignment Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei.](https://reader035.fdocuments.net/reader035/viewer/2022062718/56649e7d5503460f94b80bba/html5/thumbnails/7.jpg)
Brute-force way
Find “GATTCAAA” in the human genome
T G A C G T G T G A T T C A A A A A A G C   (the reference genome, chr1, start; this is very long: 3 billion bases)
G A T T C A A A   (your query)
  G A T T C A A A
    G A T T C A A A
      G A T T C A A A
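The sliding comparison above can be sketched in a few lines of Python. This is only an illustration of the brute-force idea on the short example string from the slide, not how any real aligner works; the function name is made up for this sketch.

```python
from typing import List


def naive_find_all(reference: str, query: str) -> List[int]:
    """Slide the query along the reference one base at a time."""
    hits = []
    for start in range(len(reference) - len(query) + 1):
        # Compare the query with the window that begins at `start`.
        if reference[start:start + len(query)] == query:
            hits.append(start)
    return hits


reference = "TGACGTGTGATTCAAAAAAGC"
print(naive_find_all(reference, "GATTCAAA"))  # [8] (0-based position)
```

Every read is compared against every position, so the cost grows with (genome length) x (read length) x (number of reads). That is why this approach is too slow for a 3-billion-base genome.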
![Page 8: MES7594-01 Genome Informatics I - Lecture V. Short Read Alignment Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei.](https://reader035.fdocuments.net/reader035/viewer/2022062718/56649e7d5503460f94b80bba/html5/thumbnails/8.jpg)
How fast should it be?

| | time per 1 read (sec) | time per 80x WGS (sec) | is equal to |
|---|---|---|---|
| eyeballing | 3x10^9 | 3.6x10^18 | 1x10^11 yrs |
| naïve matching | 2400 | 1.2x10^9 | 7,608 yrs |
| improved algorithm | 3 | 3.6x10^8 | 10 yrs |
| minimum required | 0.01 | 1.2x10^7 | 11.5 days |
| desired | 0.001 | 1.2x10^6 | 1.2 days |

based on 200 bp read length, 80x single-end WGS
![Page 9: MES7594-01 Genome Informatics I - Lecture V. Short Read Alignment Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei.](https://reader035.fdocuments.net/reader035/viewer/2022062718/56649e7d5503460f94b80bba/html5/thumbnails/9.jpg)
Searching with index
• Assume you’re searching for “genome” in an English dictionary
– You don’t search every line on every page
– You first find the page range of “g” in the dictionary
– In the above range (of ‘g’), you find the page range of “ge”
– In the above range (of ‘ge’), you find the page range of “gen”
– ...
– until you find “genome”
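The dictionary lookup above can be sketched as repeated range narrowing on a sorted word list. This is a minimal illustration, assuming a tiny made-up word list and using binary search from Python's `bisect` module. The `"\uffff"` sentinel is an assumption that no word contains that character.

```python
from bisect import bisect_left
from typing import List, Tuple


def narrow_by_prefix(sorted_words: List[str], word: str) -> List[Tuple[str, int, int]]:
    """Return (prefix, lo, hi) after each step; words[lo:hi] share the prefix."""
    lo, hi = 0, len(sorted_words)
    steps = []
    for length in range(1, len(word) + 1):
        prefix = word[:length]
        # Words with a common prefix are contiguous in a sorted list,
        # so we only search inside the previous range [lo, hi).
        lo = bisect_left(sorted_words, prefix, lo, hi)
        hi = bisect_left(sorted_words, prefix + "\uffff", lo, hi)
        steps.append((prefix, lo, hi))
        if lo == hi:  # no word starts with this prefix
            break
    return steps


# Hypothetical mini dictionary, for illustration only.
words = sorted(["apple", "gene", "genome", "genus", "giraffe", "zebra"])
for prefix, lo, hi in narrow_by_prefix(words, "genome"):
    print(prefix, words[lo:hi])
```

Each step only looks inside the range left by the previous step, which is the same idea the genome index uses.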
![Page 10: MES7594-01 Genome Informatics I - Lecture V. Short Read Alignment Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei.](https://reader035.fdocuments.net/reader035/viewer/2022062718/56649e7d5503460f94b80bba/html5/thumbnails/10.jpg)
Indexing genome
• We are going to make an index for the genome
– to make it possible to search for a read sequence as we do in an English dictionary
![Page 11: MES7594-01 Genome Informatics I - Lecture V. Short Read Alignment Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei.](https://reader035.fdocuments.net/reader035/viewer/2022062718/56649e7d5503460f94b80bba/html5/thumbnails/11.jpg)
Burrows-Wheeler Transformation
BANANA
![Page 12: MES7594-01 Genome Informatics I - Lecture V. Short Read Alignment Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei.](https://reader035.fdocuments.net/reader035/viewer/2022062718/56649e7d5503460f94b80bba/html5/thumbnails/12.jpg)
Burrows-Wheeler Transformation
BANANA$
($ is lexicographically smallest)
![Page 13: MES7594-01 Genome Informatics I - Lecture V. Short Read Alignment Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei.](https://reader035.fdocuments.net/reader035/viewer/2022062718/56649e7d5503460f94b80bba/html5/thumbnails/13.jpg)
Burrows-Wheeler Transformation
BANANA$
ANANA$B
![Page 14: MES7594-01 Genome Informatics I - Lecture V. Short Read Alignment Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei.](https://reader035.fdocuments.net/reader035/viewer/2022062718/56649e7d5503460f94b80bba/html5/thumbnails/14.jpg)
Burrows-Wheeler Transformation
BANANA$
ANANA$B
NANA$BA
![Page 15: MES7594-01 Genome Informatics I - Lecture V. Short Read Alignment Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei.](https://reader035.fdocuments.net/reader035/viewer/2022062718/56649e7d5503460f94b80bba/html5/thumbnails/15.jpg)
Burrows-Wheeler Transformation
BANANA$
ANANA$B
NANA$BA
ANA$BAN
NA$BANA
A$BANAN
$BANANA
![Page 16: MES7594-01 Genome Informatics I - Lecture V. Short Read Alignment Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei.](https://reader035.fdocuments.net/reader035/viewer/2022062718/56649e7d5503460f94b80bba/html5/thumbnails/16.jpg)
Burrows-Wheeler Transformation
0 BANANA$
1 ANANA$B
2 NANA$BA
3 ANA$BAN
4 NA$BANA
5 A$BANAN
6 $BANANA
![Page 17: MES7594-01 Genome Informatics I - Lecture V. Short Read Alignment Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei.](https://reader035.fdocuments.net/reader035/viewer/2022062718/56649e7d5503460f94b80bba/html5/thumbnails/17.jpg)
Burrows-Wheeler Transformation
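Rotations (index, rotation):
0 BANANA$
1 ANANA$B
2 NANA$BA
3 ANA$BAN
4 NA$BANA
5 A$BANAN
6 $BANANA
sort
Sorted rotations (sorted index, original index, rotation):
0 6 $BANANA
1 5 A$BANAN
2 3 ANA$BAN
3 1 ANANA$B
4 0 BANANA$
5 4 NA$BANA
6 2 NANA$BA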
![Page 18: MES7594-01 Genome Informatics I - Lecture V. Short Read Alignment Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei.](https://reader035.fdocuments.net/reader035/viewer/2022062718/56649e7d5503460f94b80bba/html5/thumbnails/18.jpg)
Burrows-Wheeler Transformation
Last column of the sorted rotations: ANNB$AA
![Page 19: MES7594-01 Genome Informatics I - Lecture V. Short Read Alignment Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei.](https://reader035.fdocuments.net/reader035/viewer/2022062718/56649e7d5503460f94b80bba/html5/thumbnails/19.jpg)
Burrows-Wheeler Transformation
![Page 20: MES7594-01 Genome Informatics I - Lecture V. Short Read Alignment Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei.](https://reader035.fdocuments.net/reader035/viewer/2022062718/56649e7d5503460f94b80bba/html5/thumbnails/20.jpg)
Burrows-Wheeler Transformation
BWT(“BANANA$”) = “ANNB$AA”
1. BWT just changes the order of the characters in the string
2. BWT tends to collect similar characters together
3. With only the transformed string, we can easily get the original string
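The rotate, sort, and take-the-last-column steps can be written directly in Python. This is a teaching sketch that sorts whole rotations, which is fine for “BANANA” but far too memory-hungry for a genome; real tools build the index differently. The function name is made up for this sketch.

```python
from typing import List, Tuple


def bwt_with_rotation_order(text: str) -> Tuple[str, List[int]]:
    """Return (last column, original index of each sorted rotation)."""
    if not text.endswith("$"):
        text += "$"  # "$" sorts before A-Z in ASCII, so it is the smallest symbol
    # Rotation i is text[i:] + text[:i]; sort rotation indices by their strings.
    order = sorted(range(len(text)), key=lambda i: text[i:] + text[:i])
    # The last character of rotation i is the character just before position i.
    last_column = "".join(text[i - 1] for i in order)
    return last_column, order


last_column, order = bwt_with_rotation_order("BANANA")
print(last_column)  # ANNB$AA
print(order)        # [6, 5, 3, 1, 0, 4, 2]
```

The second value is the “original index” column of the sorted table, which is later used to turn a matching row into a position in the original string.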
![Page 21: MES7594-01 Genome Informatics I - Lecture V. Short Read Alignment Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei.](https://reader035.fdocuments.net/reader035/viewer/2022062718/56649e7d5503460f94b80bba/html5/thumbnails/21.jpg)
Inverse BWT
We are given “ANNB$AA”
![Page 22: MES7594-01 Genome Informatics I - Lecture V. Short Read Alignment Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei.](https://reader035.fdocuments.net/reader035/viewer/2022062718/56649e7d5503460f94b80bba/html5/thumbnails/22.jpg)
Inverse BWT
We are given “ANNB$AA” (the last column of the sorted rotations)
0 6 $BANANA
1 5 A$BANAN
2 3 ANA$BAN
3 1 ANANA$B
4 0 BANANA$
5 4 NA$BANA
6 2 NANA$BA
![Page 23: MES7594-01 Genome Informatics I - Lecture V. Short Read Alignment Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei.](https://reader035.fdocuments.net/reader035/viewer/2022062718/56649e7d5503460f94b80bba/html5/thumbnails/23.jpg)
Inverse BWT
We are given “ANNB$AA”
sort: ANNB$AA → $AAABNN
![Page 24: MES7594-01 Genome Informatics I - Lecture V. Short Read Alignment Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei.](https://reader035.fdocuments.net/reader035/viewer/2022062718/56649e7d5503460f94b80bba/html5/thumbnails/24.jpg)
Inverse BWT
![Page 25: MES7594-01 Genome Informatics I - Lecture V. Short Read Alignment Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei.](https://reader035.fdocuments.net/reader035/viewer/2022062718/56649e7d5503460f94b80bba/html5/thumbnails/25.jpg)
Inverse BWT
Attach the last column (ANNB$AA) in front of the sorted column ($AAABNN)
![Page 26: MES7594-01 Genome Informatics I - Lecture V. Short Read Alignment Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei.](https://reader035.fdocuments.net/reader035/viewer/2022062718/56649e7d5503460f94b80bba/html5/thumbnails/26.jpg)
Inverse BWT
A$ NA NA BA $B AN AN
sort
![Page 27: MES7594-01 Genome Informatics I - Lecture V. Short Read Alignment Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei.](https://reader035.fdocuments.net/reader035/viewer/2022062718/56649e7d5503460f94b80bba/html5/thumbnails/27.jpg)
Inverse BWT
sort: A$ NA NA BA $B AN AN → $B A$ AN AN BA NA NA
![Page 28: MES7594-01 Genome Informatics I - Lecture V. Short Read Alignment Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei.](https://reader035.fdocuments.net/reader035/viewer/2022062718/56649e7d5503460f94b80bba/html5/thumbnails/28.jpg)
Inverse BWT
Attach the last column (ANNB$AA) again, in front of the sorted pairs ($B A$ AN AN BA NA NA)
![Page 29: MES7594-01 Genome Informatics I - Lecture V. Short Read Alignment Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei.](https://reader035.fdocuments.net/reader035/viewer/2022062718/56649e7d5503460f94b80bba/html5/thumbnails/29.jpg)
Inverse BWT
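Repeating “attach the last column, then sort” until every row is as long as the input rebuilds all sorted rotations; the row that ends with “$” is the original string. A minimal Python sketch of this naive inversion follows. It assumes the input contains exactly one “$” end marker, and it is meant for tiny examples only.

```python
def inverse_bwt(last_column: str) -> str:
    """Rebuild the original string from its BWT by repeated attach-and-sort."""
    n = len(last_column)
    rows = [""] * n
    for _ in range(n):
        # Attach the last column in front of every row, then sort the rows.
        rows = sorted(last_column[i] + rows[i] for i in range(n))
    for row in rows:
        if row.endswith("$"):  # the rotation with "$" at the end is the original
            return row
    raise ValueError("no '$' end marker found in the input")


print(inverse_bwt("ANNB$AA"))  # BANANA$
```

After the first pass the rows are `$ A A A B N N`, after the second `$B A$ AN AN BA NA NA`, matching the steps on the slides.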
![Page 30: MES7594-01 Genome Informatics I - Lecture V. Short Read Alignment Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei.](https://reader035.fdocuments.net/reader035/viewer/2022062718/56649e7d5503460f94b80bba/html5/thumbnails/30.jpg)
LF Search
Question: Find “NAN” in “BANANA”
Sorted rotations (sorted index, original index, rotation):
0 6 $BANANA
1 5 A$BANAN
2 3 ANA$BAN
3 1 ANANA$B
4 0 BANANA$
5 4 NA$BANA
6 2 NANA$BA
![Page 31: MES7594-01 Genome Informatics I - Lecture V. Short Read Alignment Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei.](https://reader035.fdocuments.net/reader035/viewer/2022062718/56649e7d5503460f94b80bba/html5/thumbnails/31.jpg)
LF Search
Question: Find “NAN” in “BANANA”
N → AN → NAN (the query is matched from its last character backwards)
![Page 32: MES7594-01 Genome Informatics I - Lecture V. Short Read Alignment Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei.](https://reader035.fdocuments.net/reader035/viewer/2022062718/56649e7d5503460f94b80bba/html5/thumbnails/32.jpg)
LF Search
The range of strings that start with “N” can be calculated from:
• the number of symbols that are lexicographically less than ‘N’
– to determine the start point
• the number of ‘N’
– to determine the end point
![Page 33: MES7594-01 Genome Informatics I - Lecture V. Short Read Alignment Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei.](https://reader035.fdocuments.net/reader035/viewer/2022062718/56649e7d5503460f94b80bba/html5/thumbnails/33.jpg)
LF Search
The range of strings that start with “N” can be calculated from:
• the number of symbols that are lexicographically less than ‘N’
– to determine the start point
– = 5
• the number of ‘N’
– to determine the end point
– = 2 (end = 5 + 2 - 1 = 6, so the range is rows 5 to 6)
![Page 34: MES7594-01 Genome Informatics I - Lecture V. Short Read Alignment Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei.](https://reader035.fdocuments.net/reader035/viewer/2022062718/56649e7d5503460f94b80bba/html5/thumbnails/34.jpg)
LF Search
![Page 35: MES7594-01 Genome Informatics I - Lecture V. Short Read Alignment Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei.](https://reader035.fdocuments.net/reader035/viewer/2022062718/56649e7d5503460f94b80bba/html5/thumbnails/35.jpg)
LF Search
The range of strings that start with “AN” can be calculated from:
• the number of symbols that are lexicographically less than ‘A’
– to determine the start point
– = 1
• the number of ‘A’
– to determine the end point
– = 3
![Page 36: MES7594-01 Genome Informatics I - Lecture V. Short Read Alignment Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei.](https://reader035.fdocuments.net/reader035/viewer/2022062718/56649e7d5503460f94b80bba/html5/thumbnails/36.jpg)
LF Search
This is a range for ‘A’ (rows 1 to 3), not ‘AN’!!
![Page 37: MES7594-01 Genome Informatics I - Lecture V. Short Read Alignment Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei.](https://reader035.fdocuments.net/reader035/viewer/2022062718/56649e7d5503460f94b80bba/html5/thumbnails/37.jpg)
LF Search
![Page 38: MES7594-01 Genome Informatics I - Lecture V. Short Read Alignment Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei.](https://reader035.fdocuments.net/reader035/viewer/2022062718/56649e7d5503460f94b80bba/html5/thumbnails/38.jpg)
LF Search
count of ‘A’ in the last column before the start point (row 5) = 1
![Page 39: MES7594-01 Genome Informatics I - Lecture V. Short Read Alignment Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei.](https://reader035.fdocuments.net/reader035/viewer/2022062718/56649e7d5503460f94b80bba/html5/thumbnails/39.jpg)
LF Search
The range of strings that start with “AN” can be calculated from:
• the number of symbols that are lexicographically less than ‘A’ + the number of ‘A’ in the last column before the start point
– to determine the start point
– = 1 + 1 = 2
• the number of ‘A’ in the last column before the end point
– to determine the end point
– = 3 (end = 1 + 3 - 1 = 3, so the range is rows 2 to 3)
count of ‘A’ before start point = 1
“Ax” is not “AN” and is less than “AN” (here “A$”, row 1)
![Page 40: MES7594-01 Genome Informatics I - Lecture V. Short Read Alignment Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei.](https://reader035.fdocuments.net/reader035/viewer/2022062718/56649e7d5503460f94b80bba/html5/thumbnails/40.jpg)
LF Search
The range of strings that start with “NAN” can be calculated from:
• the number of symbols that are lexicographically less than ‘N’ + the number of ‘N’ in the last column before the start point
– to determine the start point
– = 5 + 1 = 6
• the number of ‘N’ in the last column before the end point
– to determine the end point
– = 2 (end = 5 + 2 - 1 = 6, so the range is row 6 only)
![Page 41: MES7594-01 Genome Informatics I - Lecture V. Short Read Alignment Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei.](https://reader035.fdocuments.net/reader035/viewer/2022062718/56649e7d5503460f94b80bba/html5/thumbnails/41.jpg)
LF Search
Row 6 of the sorted table is NANA$BA, which is rotation 2 of the original string
= the original string rotated 2 times
= “NAN” exists at the 3rd position of “BANANA”: BA[NAN]A
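The LF search above can be sketched in Python with two small tables: `first[c]`, the number of symbols lexicographically less than `c`, and `occ[c][i]`, the number of `c` in the first `i` characters of the last column. This is a minimal teaching sketch with made-up function names, not the data layout of any real aligner. It uses a half-open row range `[start, end)`, so its `end` is one more than the inclusive end point on the slides.

```python
from collections import Counter
from typing import Dict, List, Tuple


def build_fm_index(text: str) -> Tuple[List[int], Dict[str, int], Dict[str, List[int]]]:
    """Build (suffix array, first, occ) for text with a '$' end marker."""
    if not text.endswith("$"):
        text += "$"
    # With a unique "$", sorting suffixes gives the same order as sorting rotations.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    last_column = "".join(text[i - 1] for i in sa)
    counts = Counter(text)
    first, total = {}, 0
    for ch in sorted(counts):
        first[ch] = total          # symbols lexicographically less than ch
        total += counts[ch]
    occ = {ch: [0] * (len(last_column) + 1) for ch in counts}
    for i, symbol in enumerate(last_column):
        for ch in occ:
            occ[ch][i + 1] = occ[ch][i] + (1 if symbol == ch else 0)
    return sa, first, occ


def backward_search(text: str, query: str) -> List[int]:
    """Return 0-based positions of query in text, matching from the last character."""
    sa, first, occ = build_fm_index(text)
    start, end = 0, len(sa)        # half-open range of rows
    for ch in reversed(query):
        if ch not in first:
            return []
        start = first[ch] + occ[ch][start]
        end = first[ch] + occ[ch][end]
        if start >= end:           # the range became empty: no match
            return []
    return sorted(sa[row] for row in range(start, end))


print(backward_search("BANANA", "NAN"))  # [2], i.e. the 3rd position
print(backward_search("BANANA", "ANA"))  # [1, 3]
```

For “NAN” the row range goes [5, 7) → [2, 4) → [6, 7), the same rows 5 to 6, 2 to 3, and 6 as in the walkthrough. The number of steps depends on the query length, not on the length of the reference.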
![Page 42: MES7594-01 Genome Informatics I - Lecture V. Short Read Alignment Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei.](https://reader035.fdocuments.net/reader035/viewer/2022062718/56649e7d5503460f94b80bba/html5/thumbnails/42.jpg)
Genome query
imported from Mike Schatz’s slides: http://schatzlab.cshl.edu/teaching/2010/Lecture%202%20-%20Sequence%20Alignment.pdf
![Page 43: MES7594-01 Genome Informatics I - Lecture V. Short Read Alignment Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei.](https://reader035.fdocuments.net/reader035/viewer/2022062718/56649e7d5503460f94b80bba/html5/thumbnails/43.jpg)
![Page 44: MES7594-01 Genome Informatics I - Lecture V. Short Read Alignment Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei.](https://reader035.fdocuments.net/reader035/viewer/2022062718/56649e7d5503460f94b80bba/html5/thumbnails/44.jpg)
![Page 45: MES7594-01 Genome Informatics I - Lecture V. Short Read Alignment Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei.](https://reader035.fdocuments.net/reader035/viewer/2022062718/56649e7d5503460f94b80bba/html5/thumbnails/45.jpg)
![Page 46: MES7594-01 Genome Informatics I - Lecture V. Short Read Alignment Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei.](https://reader035.fdocuments.net/reader035/viewer/2022062718/56649e7d5503460f94b80bba/html5/thumbnails/46.jpg)
![Page 47: MES7594-01 Genome Informatics I - Lecture V. Short Read Alignment Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei.](https://reader035.fdocuments.net/reader035/viewer/2022062718/56649e7d5503460f94b80bba/html5/thumbnails/47.jpg)
![Page 48: MES7594-01 Genome Informatics I - Lecture V. Short Read Alignment Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei.](https://reader035.fdocuments.net/reader035/viewer/2022062718/56649e7d5503460f94b80bba/html5/thumbnails/48.jpg)
![Page 49: MES7594-01 Genome Informatics I - Lecture V. Short Read Alignment Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei.](https://reader035.fdocuments.net/reader035/viewer/2022062718/56649e7d5503460f94b80bba/html5/thumbnails/49.jpg)
Inexact matching
T G A C G T G T G A T T C A A A A A A G C
G A T T G A A A
When an exact match does not exist:
• continue with the other possible candidates (G -> A, C, T) and increase the mismatch count
• If another mismatch occurs, branch out again.
• So the allowed edit distance is critical to alignment speed
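The branching idea can be sketched as a recursive backward search that tries every base at each step and counts a mismatch when the base differs from the query. This is a simplified Python illustration with made-up function names: it handles substitutions only (no insertions or deletions), and it is not the pruning strategy any particular aligner uses.

```python
from collections import Counter
from typing import Dict, List, Tuple


def build_fm_index(text: str) -> Tuple[List[int], Dict[str, int], Dict[str, List[int]]]:
    """Build (suffix array, first, occ) for text with a '$' end marker."""
    if not text.endswith("$"):
        text += "$"
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    last_column = "".join(text[i - 1] for i in sa)
    counts = Counter(text)
    first, total = {}, 0
    for ch in sorted(counts):
        first[ch] = total
        total += counts[ch]
    occ = {ch: [0] * (len(last_column) + 1) for ch in counts}
    for i, symbol in enumerate(last_column):
        for ch in occ:
            occ[ch][i + 1] = occ[ch][i] + (1 if symbol == ch else 0)
    return sa, first, occ


def inexact_search(reference: str, query: str, max_mismatches: int) -> List[Tuple[int, int]]:
    """Return sorted (position, mismatches) pairs with at most max_mismatches."""
    sa, first, occ = build_fm_index(reference)
    alphabet = [ch for ch in first if ch != "$"]
    best: Dict[int, int] = {}

    def extend(pos: int, start: int, end: int, mismatches: int) -> None:
        if pos < 0:  # whole query consumed: every row in the range is a hit
            for row in range(start, end):
                hit = sa[row]
                best[hit] = min(mismatches, best.get(hit, mismatches))
            return
        for ch in alphabet:  # branch out over all candidate bases
            cost = 0 if ch == query[pos] else 1
            if mismatches + cost > max_mismatches:
                continue
            new_start = first[ch] + occ[ch][start]
            new_end = first[ch] + occ[ch][end]
            if new_start < new_end:
                extend(pos - 1, new_start, new_end, mismatches + cost)

    extend(len(query) - 1, 0, len(sa), 0)
    return sorted(best.items())


reference = "TGACGTGTGATTCAAAAAAGC"
print(inexact_search(reference, "GATTGAAA", 0))  # []
print(inexact_search(reference, "GATTGAAA", 1))  # [(8, 1)]
```

Each extra allowed mismatch multiplies the number of branches to explore, which is why the mismatch limit has such a large effect on speed.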
![Page 50: MES7594-01 Genome Informatics I - Lecture V. Short Read Alignment Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei.](https://reader035.fdocuments.net/reader035/viewer/2022062718/56649e7d5503460f94b80bba/html5/thumbnails/50.jpg)
Goal achieved

| | time per 1 read (sec) | time per 80x WGS (sec) | is equal to |
|---|---|---|---|
| eyeballing | 3x10^9 | 3.6x10^18 | 1x10^11 yrs |
| naïve matching | 2400 | 1.2x10^9 | 7,608 yrs |
| improved algorithm | 3 | 3.6x10^8 | 10 yrs |
| minimum required | 0.01 | 1.2x10^7 | 11.5 days |
| desired | 0.001 | 1.2x10^6 | 1.2 days |
![Page 51: MES7594-01 Genome Informatics I - Lecture V. Short Read Alignment Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei.](https://reader035.fdocuments.net/reader035/viewer/2022062718/56649e7d5503460f94b80bba/html5/thumbnails/51.jpg)
PRACTICE WITH BWA
![Page 52: MES7594-01 Genome Informatics I - Lecture V. Short Read Alignment Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei.](https://reader035.fdocuments.net/reader035/viewer/2022062718/56649e7d5503460f94b80bba/html5/thumbnails/52.jpg)
BWA
![Page 53: MES7594-01 Genome Informatics I - Lecture V. Short Read Alignment Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei.](https://reader035.fdocuments.net/reader035/viewer/2022062718/56649e7d5503460f94b80bba/html5/thumbnails/53.jpg)
bwa practice
• In the cluster
– >bwa
![Page 54: MES7594-01 Genome Informatics I - Lecture V. Short Read Alignment Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei.](https://reader035.fdocuments.net/reader035/viewer/2022062718/56649e7d5503460f94b80bba/html5/thumbnails/54.jpg)
bwa process
• bwa index
– to index the reference genome (one-time process)
– = to create the BWT for the reference genome
• bwa aln
– will calculate suffix array (SA) coordinates
• bwa samse (or bwa sampe for paired-end sequencing)
– will convert the SA coordinates to chromosomal locations
• Input for bwa
– reference genome
– fastq file (the raw NGS data)
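The two per-sample steps (bwa aln, then bwa samse) can be chained from a small Python script. This is a sketch under assumptions: the file names (`ref.fa`, `reads.fastq`) are placeholders, `bwa` is assumed to be on the PATH, and the reference is assumed to be indexed already with bwa index. The helper names are made up for this example.

```python
import shlex
import subprocess
from typing import List, Tuple

Command = Tuple[List[str], str]  # (argument list, file that receives stdout)


def bwa_single_end_commands(reference: str, fastq: str, prefix: str) -> List[Command]:
    """Build the bwa aln and bwa samse steps for one single-end fastq file."""
    sai = prefix + ".sai"
    sam = prefix + ".sam"
    return [
        (["bwa", "aln", reference, fastq], sai),          # SA coordinates
        (["bwa", "samse", reference, sai, fastq], sam),   # chromosomal locations
    ]


def as_shell(argv: List[str], output: str) -> str:
    """Show a step the way it would be typed at the prompt."""
    return " ".join(shlex.quote(arg) for arg in argv) + " > " + shlex.quote(output)


def run_all(commands: List[Command]) -> None:
    """Run each step in order, redirecting stdout to its output file."""
    for argv, output in commands:
        with open(output, "w") as out:
            subprocess.run(argv, stdout=out, check=True)


# Placeholder paths, for illustration only.
commands = bwa_single_end_commands("ref.fa", "reads.fastq", "reads")
for argv, output in commands:
    print(as_shell(argv, output))
# run_all(commands)  # uncomment on a machine where bwa and the files exist
```

Printing the commands first is a cheap way to check the file names before submitting the job to the cluster.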
![Page 55: MES7594-01 Genome Informatics I - Lecture V. Short Read Alignment Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei.](https://reader035.fdocuments.net/reader035/viewer/2022062718/56649e7d5503460f94b80bba/html5/thumbnails/55.jpg)
reference data
![Page 56: MES7594-01 Genome Informatics I - Lecture V. Short Read Alignment Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei.](https://reader035.fdocuments.net/reader035/viewer/2022062718/56649e7d5503460f94b80bba/html5/thumbnails/56.jpg)
reference data
“bwa index” indexes the reference genome (so the reference is ready). It is already done here; do not try to do it again.
![Page 57: MES7594-01 Genome Informatics I - Lecture V. Short Read Alignment Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei.](https://reader035.fdocuments.net/reader035/viewer/2022062718/56649e7d5503460f94b80bba/html5/thumbnails/57.jpg)
sequence data
- Pick one chromosome for yourself
- Copy the fastq files to your directory
- Use the “cp” command to do it
- Example (copying chr8 NGS data to the rachmani directory)
>cp NA18507_chr8.* /scratch/2015_GenomeInformatics/rachmani/
![Page 58: MES7594-01 Genome Informatics I - Lecture V. Short Read Alignment Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei.](https://reader035.fdocuments.net/reader035/viewer/2022062718/56649e7d5503460f94b80bba/html5/thumbnails/58.jpg)
run bwa aln
>bwa aln reference yourdata.fastq > yourdata.sai
example
>bwa aln /data/resources/reference/human/UCSC/hg19/BWAIndex/genome.fa NA18507_chr8.01.fastq > NA18507_chr8.01.sai
write a job script: runbwaaln.sh
submit to cluster: >qsub runbwaaln.sh
![Page 59: MES7594-01 Genome Informatics I - Lecture V. Short Read Alignment Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei.](https://reader035.fdocuments.net/reader035/viewer/2022062718/56649e7d5503460f94b80bba/html5/thumbnails/59.jpg)
run bwa samse
>bwa samse reference yourdata.sai yourdata.fastq > yourdata.sam
example
>bwa samse /data/resources/reference/human/UCSC/hg19/BWAIndex/genome.fa NA18507_chr8.01.sai NA18507_chr8.01.fastq > NA18507_chr8.01.sam
write a job script: runbwasamse.sh
submit to cluster: >qsub runbwasamse.sh
![Page 60: MES7594-01 Genome Informatics I - Lecture V. Short Read Alignment Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei.](https://reader035.fdocuments.net/reader035/viewer/2022062718/56649e7d5503460f94b80bba/html5/thumbnails/60.jpg)
the output
>less NA18507_chr8.01.sam
This is your first alignment with real NGS data
![Page 61: MES7594-01 Genome Informatics I - Lecture V. Short Read Alignment Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei.](https://reader035.fdocuments.net/reader035/viewer/2022062718/56649e7d5503460f94b80bba/html5/thumbnails/61.jpg)
break
• Please ask us any questions if you have problems (do not give up)
• If possible, try mapping in paired-end mode
– bwa sampe reference data01.sai data02.sai data01.fastq data02.fastq > output.sam
![Page 62: MES7594-01 Genome Informatics I - Lecture V. Short Read Alignment Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei.](https://reader035.fdocuments.net/reader035/viewer/2022062718/56649e7d5503460f94b80bba/html5/thumbnails/62.jpg)
The SAM Format
For more details about the SAM format, please refer to: https://samtools.github.io/hts-specs/SAMv1.pdf
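Each alignment line in a SAM file has 11 mandatory tab-separated fields (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL), followed by optional tags. A minimal Python sketch of splitting one line into those fields is shown below. The example line is hypothetical, not taken from the NA18507 output, and only a subset of the FLAG bits is decoded.

```python
from typing import List, NamedTuple


class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int       # 1-based leftmost mapping position
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    tags: List[str]


# A subset of the FLAG bits defined in the SAM specification.
FLAG_BITS = {
    0x1: "paired",
    0x4: "unmapped",
    0x10: "reverse strand",
    0x100: "secondary alignment",
    0x400: "PCR or optical duplicate",
}


def parse_sam_line(line: str) -> SamRecord:
    """Split one alignment line (not a header line starting with '@')."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("expected at least 11 tab-separated fields")
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10], fields[11:],
    )


def describe_flag(flag: int) -> List[str]:
    """List the names of the decoded bits that are set in FLAG."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


# Hypothetical alignment line, for illustration only.
EXAMPLE = "read1\t16\tchr8\t1000\t37\t8M\t*\t0\t0\tGATTCAAA\tIIIIIIII\tNM:i:0"
record = parse_sam_line(EXAMPLE)
print(record.rname, record.pos, record.cigar, describe_flag(record.flag))
```

For real work, an existing SAM/BAM library is a safer choice than hand-written parsing; this sketch is only meant to make the column layout concrete.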
![Page 63: MES7594-01 Genome Informatics I - Lecture V. Short Read Alignment Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei.](https://reader035.fdocuments.net/reader035/viewer/2022062718/56649e7d5503460f94b80bba/html5/thumbnails/63.jpg)
SAM/BAM
• SAM and BAM are interconvertible (exactly the same information)
• SAM file
– human-readable text file
• BAM file (binary)
– not human-readable
– compressed (much smaller size)
– can be indexed (for random access)
![Page 64: MES7594-01 Genome Informatics I - Lecture V. Short Read Alignment Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei.](https://reader035.fdocuments.net/reader035/viewer/2022062718/56649e7d5503460f94b80bba/html5/thumbnails/64.jpg)
Converting SAM to BAM
• >samtools view -Sb yourdata.sam > yourdata.bam
– -S option means input is SAM format
– -b option means output is BAM format
![Page 65: MES7594-01 Genome Informatics I - Lecture V. Short Read Alignment Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei.](https://reader035.fdocuments.net/reader035/viewer/2022062718/56649e7d5503460f94b80bba/html5/thumbnails/65.jpg)
Sorting and Indexing BAM
• samtools sort yourdata.bam yourdata.sorted
– will create yourdata.sorted.bam
• samtools index yourdata.sorted.bam
– will create yourdata.sorted.bam.bai
• Now everything’s ready
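The conversion, sorting, and indexing steps can also be scripted. The sketch below builds the three command lines in Python. Note the assumption: it uses the option style of newer samtools releases (1.x), where the output file is given with `-o`, which differs from the older `samtools sort input output-prefix` form. File names are placeholders and the helper name is made up.

```python
import subprocess
from typing import List


def samtools_commands(sam_path: str, prefix: str) -> List[List[str]]:
    """Build convert, sort, and index steps (samtools 1.x option style assumed)."""
    bam = prefix + ".bam"
    sorted_bam = prefix + ".sorted.bam"
    return [
        ["samtools", "view", "-b", "-o", bam, sam_path],   # SAM -> BAM
        ["samtools", "sort", "-o", sorted_bam, bam],       # sort by coordinate
        ["samtools", "index", sorted_bam],                 # writes sorted_bam + ".bai"
    ]


def run_all(commands: List[List[str]]) -> None:
    """Run the steps in order and stop at the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)


# Placeholder file names, for illustration only.
steps = samtools_commands("yourdata.sam", "yourdata")
for argv in steps:
    print(" ".join(argv))
# run_all(steps)  # uncomment where samtools and the input file exist
```

The index step must run on the sorted BAM, because random access relies on the reads being in coordinate order.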
![Page 66: MES7594-01 Genome Informatics I - Lecture V. Short Read Alignment Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei.](https://reader035.fdocuments.net/reader035/viewer/2022062718/56649e7d5503460f94b80bba/html5/thumbnails/66.jpg)
Visualizing alignment
• IGV (Integrative Genomics Viewer)
![Page 67: MES7594-01 Genome Informatics I - Lecture V. Short Read Alignment Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei.](https://reader035.fdocuments.net/reader035/viewer/2022062718/56649e7d5503460f94b80bba/html5/thumbnails/67.jpg)
Visualizing alignment
• samtools tview yourdata.bam reference
– example:
>samtools tview NA18507_chr8.01.sorted.bam /data/resources/reference/human/UCSC/hg19/BWAIndex/genome.fa
![Page 68: MES7594-01 Genome Informatics I - Lecture V. Short Read Alignment Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei.](https://reader035.fdocuments.net/reader035/viewer/2022062718/56649e7d5503460f94b80bba/html5/thumbnails/68.jpg)