1. HPC & I/O
2. BioPerl
A simplified picture of the system

[Figure: diagram of the cluster. Labels recoverable from the slide:]

- User machines, LAN (0.1-1 Gbps)
- Login server(s): jhpce01.jhsph.edu, jhpce02.jhsph.edu
- Compute farm: 72 nodes, ~3000 cores
- Direct attached storage
- Ethernet switches (10-40 Gbps)
- Data transfer server: transfer01.jhsph.edu
- Research network (10-100 Gbps)
- NFS exported file systems (/users)
- Storage labels: DCS01, DCS02, DCS03, DCL01
- Lustre file system
Review of technical notions
- Central Processing Unit (CPU)
  - The part of the computer that executes instructions (programs)
- Random Access Memory (RAM)
  - Very fast volatile memory that is used like a scratchpad by the CPU
- Mass Storage (Disk)
  - Where data & apps are kept more or less permanently. Very slow compared to RAM
- Network (Ethernet, internet)
  - Computers and devices communicate over networks.
  - These days it's mostly Ethernet.
Review of sizes

- Storage and memory sizes
  - 1 byte = 8 bits = 1 character
  - 1 megabyte (MB) = 10^6 bytes
  - 1 gigabyte (GB) = 1000 MB = 10^9 bytes
  - 1 terabyte (TB) = 1000 GB = 10^12 bytes
  - 1 petabyte (PB) = 1000 TB = 10^15 bytes
- Typical sizes
  - USB stick: 4-128 GB
  - Laptop disk drive: 250-1000 GB
  - Enterprise storage appliance: 100 TB
  - Scale-out cluster storage: > 1 PB
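
The decimal units above can be turned into a small helper for printing byte counts in readable form. This is only a sketch: the function name `format_bytes` and the sample values are made up for illustration, and it uses the powers of ten from the slide rather than the binary (1024-based) units some tools report.

```perl
use strict;
use warnings;

# Decimal units as defined on the slide: 1 MB = 10^6 bytes, and so on.
my @UNITS = ( [ PB => 1e15 ], [ TB => 1e12 ], [ GB => 1e9 ], [ MB => 1e6 ] );

# Hypothetical helper: express a byte count in the largest unit that fits.
sub format_bytes {
    my ($bytes) = @_;
    for my $unit (@UNITS) {
        my ($name, $size) = @$unit;
        return sprintf("%.1f %s", $bytes / $size, $name) if $bytes >= $size;
    }
    return "$bytes bytes";
}

print format_bytes(128e9),  "\n";   # the largest USB stick on the slide
print format_bytes(2.5e15), "\n";   # an assumed cluster file system size
print format_bytes(512),    "\n";   # smaller than 1 MB: left in bytes
```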
Key technical notions (network)

- Bandwidth
  - How much data per second you can pump through a pipe.
  - Measured in gigabits per second (Gbps).
- Latency
  - How long it takes for the first piece of data to get through.
  - Measured in nanoseconds, microseconds or (gasp!) milliseconds.
- A practical demonstration
  - http://speedtest.comcast.net
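
A rough way to see how the two notions combine is to model a transfer as latency plus payload size divided by bandwidth. The sketch below assumes that simple model (real networks add protocol overhead and congestion). The function name `transfer_seconds`, the 0.5 ms latency and the payload sizes are illustrative assumptions, not measurements.

```perl
use strict;
use warnings;

# Estimated seconds to move a payload over one link:
#   time = latency + (bytes * 8 bits) / (bandwidth in bits per second)
sub transfer_seconds {
    my (%arg) = @_;
    my $bits = $arg{bytes} * 8;
    return $arg{latency_s} + $bits / ( $arg{gbps} * 1e9 );
}

# Assumed values: a 1 Gbps link with 0.5 ms latency.
my $small = transfer_seconds( bytes => 1e3, gbps => 1, latency_s => 0.0005 );
my $large = transfer_seconds( bytes => 1e9, gbps => 1, latency_s => 0.0005 );

printf "1 KB: %.6f s\n", $small;   # dominated by latency
printf "1 GB: %.6f s\n", $large;   # dominated by bandwidth
```

Under this model a small message spends almost all of its time waiting on latency, while a large file is limited by bandwidth.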
Time scales for data transfer
Latency Comparison Numbers

| Operation | Time (ns) | Time (ms) | Comparison |
|---|---:|---:|---|
| L1 cache reference | 0.5 | | |
| Branch mispredict | 5 | | |
| L2 cache reference | 7 | | 14x L1 cache |
| Mutex lock/unlock | 25 | | |
| Main memory reference | 100 | | 20x L2 cache, 200x L1 cache |
| Compress 1K bytes with Zippy | 3,000 | | |
| Send 1K bytes over 1 Gbps network | 10,000 | 0.01 | |
| Read 4K randomly from SSD* | 150,000 | 0.15 | |
| Read 1 MB sequentially from memory | 250,000 | 0.25 | |
| Round trip within same datacenter | 500,000 | 0.5 | |
| Read 1 MB sequentially from SSD* | 1,000,000 | 1 | 4x memory |
| Disk seek | 10,000,000 | 10 | 20x datacenter roundtrip |
| Read 1 MB sequentially from disk | 20,000,000 | 20 | 80x memory, 20x SSD |
| Send packet CA->Netherlands->CA | 150,000,000 | 150 | |
What if we multiply all the time scales by 1 billion to humanize them?

| Operation | Scaled time | Human-scale comparison |
|---|---:|---|
| Main memory reference | 1.6 min | Brushing your teeth |
| Send 2KB over 1 Gbps network | 5.5 hr | From lunch to end of work day |
| Read 1 MB sequentially from memory | 2.9 days | A long weekend |
| Round trip within same datacenter | 5.8 days | A medium vacation |
| Reading 1MB from SSD: SSD random read | 1.7 days | A normal weekend |
| Reading 1MB from SSD: SSD read 1 MB sequentially | 11.6 days | Waiting for almost 2 weeks for a delivery |
| Reading 1MB from disk: seek | 16.5 weeks | A semester in university |
| Reading 1MB from disk: read 1 MB sequentially | 7.8 months | Almost producing a new human being |
| Reading 1MB from disk: total time | 1 year | |
| Internet packet: round trip from California to Netherlands | 4.8 years | Average time it takes to complete a bachelor's degree |
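
The scaling is plain arithmetic: a latency of N nanoseconds multiplied by one billion is N seconds. A minimal sketch of the conversion follows. The function name `humanize` is made up, and the unit list is an assumption that stops at days and years, so it prints days where the slide uses weeks or months. Its rounding can also differ from the slide by a tenth.

```perl
use strict;
use warnings;

# Units used for display (an assumption: no weeks or months).
my @UNITS = (
    [ years => 365 * 86_400 ],
    [ days  => 86_400 ],
    [ hr    => 3_600 ],
    [ min   => 60 ],
);

# Scale a latency given in nanoseconds by 1e9 and format the result.
sub humanize {
    my ($ns) = @_;
    my $seconds = $ns;    # (ns * 1e-9 s) * 1e9 = the same number, in seconds
    for my $unit (@UNITS) {
        my ($name, $size) = @$unit;
        return sprintf( "%.1f %s", $seconds / $size, $name ) if $seconds >= $size;
    }
    return sprintf( "%.1f s", $seconds );
}

# A few entries from the latency table, in nanoseconds.
my %latency_ns = (
    'Main memory reference'              => 100,
    'Read 1 MB sequentially from memory' => 250_000,
    'Read 1 MB sequentially from SSD'    => 1_000_000,
    'Send packet CA->Netherlands->CA'    => 150_000_000,
);

for my $op ( sort { $latency_ns{$a} <=> $latency_ns{$b} } keys %latency_ns ) {
    printf "%-36s %s\n", $op, humanize( $latency_ns{$op} );
}
```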
The main lesson
- Do as much computing as you can in RAM
- Avoid disk I/O as much as possible
- If you must go to the disk, read in entire files at a time rather than fetching one record at a time.
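
In Perl, reading a whole file in one go is usually done by undefining the input record separator `$/`. The sketch below contrasts that with a record-at-a-time loop. The file is a temporary one generated for the demo, and the subroutine names are hypothetical. Two caveats: Perl already buffers line reads internally, so on a local disk the difference may be small, and slurping only makes sense when the file fits comfortably in RAM.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Demo data (made up): a temporary file of 1000 tab-separated records.
my ( $fh, $path ) = tempfile( UNLINK => 1 );
print {$fh} "rec$_\t", $_ * 2, "\n" for 1 .. 1000;
close $fh or die "close failed: $!";

# Read the entire file in one go, then split it into records in memory.
sub slurp_records {
    my ($file) = @_;
    open my $in, '<', $file or die "Cannot open $file: $!";
    local $/;                       # undefined separator = slurp mode
    my $content = <$in>;
    close $in;
    return split /\n/, $content;
}

# For contrast: ask for one record per readline call.
sub read_one_at_a_time {
    my ($file) = @_;
    open my $in, '<', $file or die "Cannot open $file: $!";
    my @records;
    while ( my $line = <$in> ) {
        chomp $line;
        push @records, $line;
    }
    close $in;
    return @records;
}

my @slurped = slurp_records($path);
my @looped  = read_one_at_a_time($path);
printf "slurped %d records, looped %d records\n", scalar @slurped, scalar @looped;
```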
BioPerl on the Cluster
BioPerl

- BioPerl provides object-oriented software modules for many of the typical tasks of bioinformatics programming:
  - Manipulating individual sequences
  - Accessing genomic data directly from databases
  - Transforming formats of database/file records
  - Searching for "similar" sequences
  - Creating and manipulating sequence alignments
  - Searching for genes and other structures in DNA
  - Developing machine-readable sequence annotations
www.bioperl.org
Bioperl Module Groups
BioPerl IO modules

| Module | Formats |
|---|---|
| SeqIO | FASTA, GenBank, EMBL, etc. |
| SearchIO | BLAST, FASTA, HMMER |
| AlignIO | ClustalW, Phylip, MSF, etc. |
| TreeIO | Newick, Nexus, NHX |
| MapIO | MapMaker |
| Matrix::IO | Scoring, Phylip |
| Assembly::IO | Ace, Phrap |
| Ontology::IO | InterPro, GO, SO |

More…
Bio::SeqIO

- The principal class for input/output
- Methods
  - new -- opens a new seqstream for I/O
  - next_seq -- gets the next entry in the input seqstream
  - write_seq -- writes to a seqstream
  - there is more…
- Refer to the web site for documentation
- Example: format conversion
UniProtKB/Swiss-Prot format

Each sequence entry is composed of lines. Each line begins with a two-character code, which indicates the type of data contained in the line.

| Code | Content |
|---|---|
| ID | Identification |
| AC | Accession number(s) |
| DT | Date |
| DE | Description |
| GN | Gene name(s) |
| OS | Organism species |
| OG | Organelle |
| OC | Organism classification |
| RN | Reference number |
| RP | Reference position |
| RC | Reference comments |
| RX | Reference cross-references |
| RA | Reference authors |
| RL | Reference location |
| CC | Comments or notes |
| DR | Database cross-references |
| KW | Keywords |
| FT | Feature table data |
| SQ | Sequence header |
| (blanks) | Sequence data |
| // | Termination line |
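
To make the line structure concrete, here is a small sketch in plain Perl (no BioPerl) that groups the lines of one entry by their two-character code and collects the sequence data. The entry itself, including its ID and accession, is made up for illustration, and `parse_entry` is a hypothetical helper. For real work, `Bio::SeqIO` with the `swiss` format handles the full specification.

```perl
use strict;
use warnings;

# Made-up entry for illustration only (not a real database record).
my $entry = <<'END';
ID   TEST_HUMAN              Reviewed;          12 AA.
AC   P00000;
DE   Hypothetical example protein.
OS   Homo sapiens (Human).
SQ   SEQUENCE   12 AA;
     MKTAYIAKQR QI
//
END

# Group the content of each line under its two-character code and
# collect the sequence data from the lines that begin with blanks.
sub parse_entry {
    my ($text) = @_;
    my %fields;
    my $sequence = '';
    for my $line ( split /\n/, $text ) {
        last if $line =~ m{^//};                    # termination line
        if ( $line =~ /^\s+(\S.*)$/ ) {             # (blanks): sequence data
            ( my $residues = $1 ) =~ s/\s+//g;
            $sequence .= $residues;
        }
        elsif ( $line =~ /^([A-Z]{2})\s+(.*)$/ ) {  # coded line
            push @{ $fields{$1} }, $2;
        }
    }
    return ( \%fields, $sequence );
}

my ( $fields, $sequence ) = parse_entry($entry);
printf "%s: %s\n", $_, join( '; ', @{ $fields->{$_} } ) for sort keys %$fields;
printf "sequence: %s (%d residues)\n", $sequence, length $sequence;
```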
Swiss-Prot to FASTA format conversion

swiss2fasta.pl

```perl
#!/usr/bin/perl -w
use strict;
use Bio::SeqIO;

# create a SeqIO object for the input stream
my $in = Bio::SeqIO->new('-file'   => "sprot.txt",
                         '-format' => 'swiss');

# create a SeqIO object for the output stream
my $out = Bio::SeqIO->new('-file'   => ">sprot.fasta",
                          '-format' => 'fasta');

# read the input stream and write to the output stream
# one record at a time
while ( my $seq = $in->next_seq() ) {
    $out->write_seq($seq);
}
```
Example: Remote database query

```perl
#!/usr/bin/perl -w
use strict;
use Bio::DB::GenBank;

my ($gb, $seq1, $seq2, $seq_id);

# use eval to test for success of code block
eval { $gb = Bio::DB::GenBank->new() };
if ($@) {
    die "Couldn't connect to GenBank: $@";
}

# get by sequence id
$seq1   = $gb->get_Seq_by_id('MUSIGHBA1');
$seq_id = $seq1->display_id();
print "got seq1 display id is $seq_id \n";

# get by accession number
$seq2   = $gb->get_Seq_by_acc('AF303112');
$seq_id = $seq2->display_id();
print "got seq2 display id is $seq_id \n";

# get a bunch of sequences by id or accession number
my $seqio = $gb->get_Stream_by_id([ qw(2981014 J00522 AF303112) ]);
while ( my $seqobj = $seqio->next_seq() ) {
    print $seqobj->display_id(), "\n";
    print $seqobj->seq(), "\n\n";
}
```