Andy Kudlicki
Office: BSB 547
Phone: 772-2253, 771-1011 cell
BMB 6216 – Algorithms for Biology - Class 1
Welcome!
Imagine doing science without computers. Almost all of it can be done:
– Paper file folders
– Xeroxing
– Photographs on film
– Actually going to the library to browse journals
– Abstract collections
– Telephone, Snail-mail, Telegrams
– Typewriters
The one exception:
Science is quantitative, and has always been.
This course:
– Using computers for computing.
– Aspects useful in biology / bioinformatics
• Simple tasks ( 2 * 71.12 = ? )
• Simple repetitive tasks (few or many repetitions)
• Somewhat complicated tasks
• Typical problems of high complexity
– BLAST, genome assembly, motif discovery, ...
spreadsheets
( Solved, software available )
Course Overview
Class 1 Introduction to the course and to the Perl programming language
Class 2 Computational complexity and numerical stability of algorithms
Class 3 Data Structures and Containers in PERL and other languages
1. Tables, lists, queues, hashes and when to use them
2. When PERL is not enough: A quick look at R and C++
Class 4 Matrix operations; Principal Component Analysis; ICA
Class 5 Network / graph algorithms
1. Interaction Networks
2. Regulation networks
3. Graphs for enumerating hypotheses
Class 6 Strings and Regular Expressions
1. In silico enzyme digestion
2. Gene translation
Class 7 Randomization and Monte Carlo simulations
1. Randomization by permutation
2. Modeling the null-hypothesis probability distribution
Class 8 Custom vector graphics: generating SVG from your data
1. Create and re-create the killer graph for your paper
Class 9 Visualization of multidimensional data
Class 10 Web tools
1. The components of a web page, elements of HTML.
2. Extracting data from webpages and other documents.
3. Connect to GenBank using BioPerl
Class 11 Cgi-bin: Creating dynamic web-based tools for data analysis
Class 12 Relational databases and SQL
1. Relational Model, normalization
2. Basic SQL
3. Examples: Experimental results,
Class 13 Databases and WWW
Class 14 Clustering
1. Hierarchical
2. K-means
3. Friends-of-friends
Class 15 Timecourses and spectral analysis; Convolution.
Format:
Mixed – lecture with hands-on assignments.
Computer environment:
Linux
Perl, also C/C++, R, shell, awk, sed, ..., when needed
Supplementary reading:
Larry Wall et al.: Programming Perl
Wing-Kin Sung: Algorithms in Bioinformatics
James Tisdall: Beginning Perl for Bioinformatics
James Tisdall: Mastering Perl for Bioinformatics
Stroustrup: The C++ Programming Language
Special requests: welcome!
Format:
Mixed – lecture with hands-on assignments.
Computer environment:
Linux
* Rich in standard tools, mostly open-source
* Industry standard
* Very similar to macOS, Android, iOS, BSD, ChromeOS, etc.
* Has many flavors created for specific purposes
Using your laptop in class:
To get a *nix environment:
* Linux laptop (or Unix console on Mac)
– Live CD distribution
* cygwin
* virtual machine
* remote session (preferred, guaranteed to work)
Remote session:
Use
– “Remote Desktop Connection” from win*
– Server: 129.109.88.185
From mac – install “Remote Desktop Connection Client for Mac”
From Linux “rdesktop 129.109.88.185”
Also works from off campus
• (mycitrix.utmb.edu -> remote desktop session)
Other options:
– ssh (PuTTY on Windows); no graphics though, and on-campus only
– NX NoMachine
Login to: 129.109.54.80
Username:
Password:
Unix / linux shell / command line:
– List files: ls ls -a ls -1 ls -l ls -lrt
– Directory: cd pwd
– Copy, move, delete, link: cp mv rm ln
– Machine status: ps w uptime top df du whoami /sbin/ifconfig date
– Text editors: joe nano emacs (c-x c-f) vi
– Pager: more less; also: cat, head, tail, tac
– Misc: echo tr sed man wc chmod
Simple data flow / spreadsheet-like
• Find in file : grep [grep -v; grep -f; egrep]
• Select top/bottom lines from file: head, tail
• Select columns: awk, e.g. awk '{print $2, $3, $5+$6}'
• Merge lines: cat
• Merge columns: paste
• Sort: sort
• Data flow: > >> < | tee tac
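The awk column selection above can also be written in Perl, the main language of this course. A minimal sketch, assuming whitespace-separated fields; the sample rows and the name select_columns are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample rows; a real run would read a file or STDIN instead.
my @rows = (
    "g1 YAL001C TFC3 0.5 1.0 2.0",
    "g2 YAL002W VPS8 0.1 3.0 4.5",
);

# Equivalent of: awk '{print $2, $3, $5+$6}'
sub select_columns {
    my ($line) = @_;
    my @f = split ' ', $line;    # split on whitespace, like awk's default
    # awk counts fields from 1, Perl arrays from 0: awk's $2 is $f[1]
    return join ' ', $f[1], $f[2], $f[4] + $f[5];
}

print select_columns($_), "\n" for @rows;
```

The only real trap is the off-by-one between awk field numbers and Perl array indices.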
Exercise:
The file /data/students/classes/remastercycle.csv contains gene expression data arranged as time-series in columns. (affy-id, name, gene-id, data*36)
• How many named genes are there?
• What is the average expression at timepoint 1? In how many genes is it above average?
• What is the average expression at t1 of named genes, unnamed genes, non-genes? (genes have systematic names like YLR405W)
• List 200 named genes that have the highest (t7+t19+t31)-(t1+t13+t25)
Log in to your account (on 129.109.88.185)
– Make a fresh directory, e.g.
mkdir bmb6216
cd bmb6216
mkdir class_1; cd class_1
cp /data/students/classes/hello.pl .
* Cat it. * Less it. * Run it.
• Backup: cp hello.pl hello-0.pl
• Edit it: vi hello.pl
Editing with vi
– I / i (insert)
– A / a (append)
– X / x / dd (delete)
– R (eplace) / r (eplace 1 character)
– {n} W / w / B / b / hjkl -move around
– [ESC] – back from insert to command
– ZZ / :w / :q / :wq / :x / :q! - exit / save / quit
– xp – swap chars. ddp – swap lines
Exercise:
The file /home/students/classes/Class_1/remastercycle.csv contains gene expression data arranged as time-series in columns. (affy-id, name, gene-id, data*36)
• How many named genes are there?
• What is the average expression at timepoint 1? In how many genes is it above average?
• What is the average expression at t1 of named genes, unnamed genes, non-genes? (Genes have systematic names like YLR405W; named genes also have a common name in column 2.)
• List 200 named genes that have the highest (t7+t19+t31)-(t1+t13+t25)
PERL
Why PERL?
Practical Extraction and Report Language
Pathologically Eclectic Rubbish Lister
• Versatile, portable
• Widely used in bioinformatics and web applications
• There's more than one way to do it
• Not the most elegant language, great for dirty hacks
• Easily integrated with anything
Warning: PERL6 ain't PERL
PERL
HELLO WORLD:
print "Hello \n";
PERL
HELLO WORLD:
> perl
print "Hello \n";
^D
PERL
HELLO WORLD:
> perl -e 'print "Hello \n";'
PERL
HELLO WORLD:
hello.pl
==================
#!/usr/bin/perl
print "Hello \n";
==================
> perl hello.pl
Or
> ./hello.pl
(after chmod +x hello.pl)
VARIABLES:
Scalar:
$dna = 'ATTTGCCCTGCCCATT';
$mouse_tail_inches = 2.13;
$RNA = "GGGUUCAAUAUAUGGC";
$seven = -6;
Default variable: $_
No need to declare variables. If not specified, $_ is assumed.
VARIABLES:
No need to declare variables.
Risky though:
$my_variable = 51;
$something = $my_variable + 3;
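$something_else = $myvariable + 4; # typo: $myvariable is undefined, so the result is 4
Remedy: use strict; (every variable must then be declared with my)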
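A small sketch of what use strict does with the typo above. The snippet is compiled from a string with eval so that the demo script itself keeps running and can show the error; that wrapper is only a device for the demonstration, not how strict is normally used.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Compile a snippet containing the typo under "use strict" and capture the error.
my $snippet = q{
    use strict;
    my $my_variable = 51;
    my $something_else = $myvariable + 4;   # typo: missing underscore
    1;
};

my $ok    = eval $snippet;   # undef, because compilation fails
my $error = $@;              # holds the compiler's complaint

print "strict rejected the snippet:\n$error" unless $ok;
```

In a normal script the same error would simply stop the program before the first line runs.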
OPERATIONS:
String:
$dna = "ATAGAGGTA" . "CATATC";
$at_repeat = "AT" x 50;
substr() sub-string
length()
Binding: print $dna if $dna =~ /ATA/;
chop (removes the last char)
chomp (removes a trailing end of line)
Special characters: \t \n
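The string operations listed above, put together in one runnable sketch. The variable names beyond $dna and $at_repeat are made up for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $dna         = "ATAGAGGTA" . "CATATC";   # concatenation
my $at_repeat   = "AT" x 50;                # repetition: 100 characters
my $first_codon = substr($dna, 0, 3);       # 3 characters starting at offset 0
my $len         = length($dna);             # 15

my $line = "GATTACA\n";
chomp $line;                 # removes the trailing newline only
my $chopped = $line;
chop $chopped;               # removes the last character, whatever it is

print "$dna contains ATA\n" if $dna =~ /ATA/;
print "first codon: $first_codon, length: $len\n";
print "chomp: $line, chop: $chopped\n";
```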
The different quotations
$x=6;
print "x= $x \n";
print 'x= $x \n';
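A short sketch of the difference between the two quoting styles: double quotes interpolate variables and escapes, single quotes keep the text as typed. The names $double and $single are made up for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $x = 6;
my $double = "x= $x \n";    # interpolated: $x and \n are expanded
my $single = 'x= $x \n';    # literal: the characters $x and \n stay as typed

print $double;
print $single, "\n";
print "lengths: ", length($double), " vs ", length($single), "\n";
```

Comparing the lengths makes the point without relying on how a terminal displays the newline.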
OPERATIONS:
Arithmetic:
$a + $b
$a - $b
$a * $b
$a % $b
$a ** $b
OPERATIONS:
Incrementation (C-like)
$a ++
$a *= 4
$repeat = 'AT'; $repeat x=36;
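The arithmetic and C-like assignment operators in one sketch. The variables are named $p and $q here rather than $a and $b, because Perl reserves $a and $b for use inside sort; the values are arbitrary.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my ($p, $q) = (17, 5);
my $remainder = $p % $q;     # modulo: 2
my $power     = $q ** 2;     # exponentiation: 25

my $n = 3;
$n++;                        # increment: 4
$n *= 4;                     # multiply in place: 16

my $repeat = 'AT';
$repeat x= 36;               # repeat in place: 72 characters

print "$remainder $power $n ", length($repeat), "\n";
```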
LISTS/TABLES:
@a = (4, 6, 3.21, 7, 'cat', "dog");
$a[0] = 6;
$#a index of last element
@a + 0 size of array
OPERATIONS:
* join / split
* push / pop / shift / unshift
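A sketch of the list operations named above, using the same example list. The array is called @list and the extra elements 'emu' and 'ant' are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @list = (4, 6, 3.21, 7, 'cat', "dog");
my $last_index = $#list;         # 5
my $size       = @list + 0;      # 6, same as scalar(@list)

push @list, 'emu';               # add at the end
my $popped  = pop @list;         # remove from the end: 'emu'
unshift @list, 'ant';            # add at the front
my $shifted = shift @list;       # remove from the front: 'ant'

my $csv    = join ',', @list;    # one string: "4,6,3.21,7,cat,dog"
my @fields = split /,/, $csv;    # back to six elements

print "$csv\n";
```

push/pop treat the list as a stack; push/shift treat it as a queue.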
HASHES:
The most important data type in biology!
$expression{"RPS16"} = 4.65;
%expression = (
RPL12 => 1.23,
CDC28 => 5.31,
STAT1 => "experiment gone south"
);
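A sketch of typical hash use: fill it, test for a key, and loop over the entries. The expression values are the ones from the slide; the variables $has_stat1 and @genes are made up for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %expression = (
    RPL12 => 1.23,
    CDC28 => 5.31,
);
$expression{"RPS16"} = 4.65;     # add one more entry

# exists() tests for a key without creating it
my $has_stat1 = exists $expression{STAT1} ? 1 : 0;

# keys() returns the keys in no particular order, so sort for stable output
my @genes = sort keys %expression;
for my $gene (@genes) {
    printf "%-6s %.2f\n", $gene, $expression{$gene};
}
```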
FLOW CONTROL:
if ( $a > 4 ) { print sqrt ($a), "\n"; };
while ( $x > 0 ) { print --$x , "\n"};
$x>0 or $x = 6;
for $z (1..333) {print $z, ' ';};
for ($i=0; $i<=1000; ++$i)
{
next unless $a[$i] > 0
};
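The same control structures on a small made-up array, as a sketch that can be run as is. The data and the names @values, $positives and @countdown are illustrative only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @values = (3, -1, 0, 8, -5, 2);   # hypothetical data
my $positives = 0;
for my $i (0 .. $#values) {
    next unless $values[$i] > 0;     # skip zero and negative entries
    $positives++;
}

my $x = 3;
my @countdown;
while ($x > 0) {
    push @countdown, --$x;           # pre-decrement: stores 2, 1, 0
}

$x > 0 or $x = 6;                    # the assignment runs only if $x > 0 is false

print "$positives positives; countdown @countdown; x is now $x\n";
```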
TRUE or FALSE
false strings:
– "0"
– ""
Every other string is true!
"0.00" is true
"0.00" + 0 is false
– if ( 'Elvis is alive' ) { print 4+5, "\n"; };
– undef() is false
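A quick way to check these truth rules yourself. The helper truth is made up for this sketch; it just reports how Perl evaluates its argument in a boolean test.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Returns 'true' or 'false' for any scalar, using Perl's own rules.
sub truth { return $_[0] ? 'true' : 'false' }

my %result = (
    'string "0"'    => truth("0"),
    'empty string'  => truth(""),
    'string "0.00"' => truth("0.00"),
    '"0.00" + 0'    => truth("0.00" + 0),   # the addition turns it into the number 0
    'undef'         => truth(undef),
);

print "$_: $result{$_}\n" for sort keys %result;
```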
SUBROUTINES
sub addit {
my ($x1, $x2) = @_;
return $x1 + $x2;
};
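A sketch of how addit is called, plus a second subroutine built on the same pattern. The helper mean is not from the slides; it is added here only to show a subroutine taking any number of arguments.

```perl
#!/usr/bin/perl
use strict;
use warnings;

sub addit {
    my ($x1, $x2) = @_;      # copy the arguments out of @_
    return $x1 + $x2;
}

# Hypothetical helper: mean of any number of values.
sub mean {
    my @values = @_;
    return unless @values;   # empty list: return undef instead of dividing by zero
    my $sum = 0;
    $sum = addit($sum, $_) for @values;
    return $sum / @values;   # an array in numeric context gives its size
}

print addit(2, 71.12), "\n";
print mean(1, 2, 3, 6), "\n";
```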
Input / Output:
while (<>)
{
chomp;
$sum += $_;
};
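The loop above reads from standard input or from files named on the command line. A self-contained sketch of the same idea follows; it reads from an in-memory filehandle so that it runs without any input file. The subroutine name sum_lines and the sample numbers are made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sum the numbers read from a filehandle, one number per line.
sub sum_lines {
    my ($fh) = @_;
    my ($sum, $count) = (0, 0);
    while (my $line = <$fh>) {
        chomp $line;
        next unless length $line;    # skip blank lines
        $sum += $line;
        $count++;
    }
    return ($sum, $count);
}

# Demo on an in-memory filehandle; pass \*STDIN instead to read real input.
my $text = "1.5\n2.5\n\n4\n";
open my $fh, '<', \$text or die "cannot open in-memory file: $!";
my ($sum, $count) = sum_lines($fh);
close $fh;

print "sum=$sum n=$count mean=", $sum / $count, "\n";
```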
Input:
open BLABLA, "data.csv" or die "Cannot open data.csv: $!";
$firstline = <BLABLA>;
@headers = split "\t", $firstline;
while (<BLABLA>) {something};
close BLABLA;
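A fuller sketch of the same reading pattern, written with a lexical filehandle and the three-argument form of open, which is the style usually recommended in current Perl. The file is a temporary one created by the script with made-up contents, so the example does not depend on any course data.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a small tab-separated file so the example is self-contained
# (hypothetical contents; the course data files are not reproduced here).
my ($tmp, $path) = tempfile(UNLINK => 1);
print {$tmp} "id\tname\tt1\n", "1\tCDC28\t5.31\n", "2\tRPL12\t1.23\n";
close $tmp;

open my $in, '<', $path or die "cannot open $path: $!";
my $firstline = <$in>;
chomp $firstline;                        # otherwise the last header keeps its "\n"
my @headers = split /\t/, $firstline;

my @records;
while (my $line = <$in>) {
    chomp $line;
    my %row;
    @row{@headers} = split /\t/, $line;  # hash slice: header => value
    push @records, \%row;
}
close $in;

print "$_->{name}: $_->{t1}\n" for @records;
```

Storing each line as a hash keyed by the header names means later code can say $row{t1} instead of counting columns.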
Output:
– print $x, "\n";
– printf "format", $x;
– print + join " ", @list;
open BLABLA, ">outdata.csv";
print BLABLA $x, $y, "\n"; # no comma after the filehandle!
close BLABLA;
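A sketch of formatted output and of writing to a file. The format string, the values and the temporary output file are all made up for the example; sprintf is used so the formatted text can be inspected before printing, and it takes the same format codes as printf.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my ($x, $y) = (3.14159, 42);
my @list = ('a', 'b', 'c');

my $formatted = sprintf "%.2f\t%5d", $x, $y;   # 2 decimals, then a width-5 integer
print $formatted, "\n";
print join(" ", @list), "\n";

# Write to a file through a lexical filehandle; note: no comma after {$out}.
my ($tmp, $path) = tempfile(UNLINK => 1);      # hypothetical output location
close $tmp;
open my $out, '>', $path or die "cannot write $path: $!";
print {$out} $x, ",", $y, "\n";
close $out or die "cannot close $path: $!";

# Read the line back to confirm what was written.
open my $check, '<', $path or die "cannot reopen $path: $!";
my $written = <$check>;
close $check;
```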
Exercises:
1. repeat in PERL the awk/sort exercise from last hour
2. a-S_cer_TANAY_1000upstream.fasta contains the upstream sequences (UTRs) of genes. What is the correlation between the position of the GATGAGA sequence and the average expression of the gene?
C / C++ -> for total control
=========================== Hello.C ======
#include <iostream>
using namespace std;
int main ()
{
    cout << "Hello :) " << 5+4 << endl;
    return 0;
}