Applied Bioinformatics Course Overview & Introduction to Linux
Bing Zhang Department of Biomedical Informatics
Vanderbilt University
What is bioinformatics
Bio / Data / informatics
- Hypotheses, questions, samples, experiments
- DNA, RNA, protein, metabolite, phenotype
- Sequence, expression, structure, interaction
- Storage/retrieval, visualization, computational methods, statistical methods
Genomic sequences
Source: http://www.ncbi.nlm.nih.gov/genbank/genbankstats.html
Milestones:
- Human Genome Project (1990-2003)
- First bacterial genome (H. influenzae)
- First eukaryotic genome (yeast)
- First metazoan genome (C. elegans)
Completely sequenced genomes, September 2012 (source: http://www.genomesonline.org)
Genome sequencing costs plunge
The Cancer Genome Atlas (TCGA)
- Mission (Bio)
  - To accelerate the understanding of the molecular basis of cancer through the application of genome analysis technologies.
- 2014 target (Data)
  - 25 tumor types x 500 cases each
  - Exome/whole genome sequencing
  - Copy number variation
  - Promoter methylation
  - mRNA expression
  - miRNA expression
Why now?
Roles for different investigators in bioinformatics
- Algorithm developer
  - Statisticians
  - Mathematicians
  - Computer scientists
- Tool developer
  - Bioinformaticians
- Data provider/consumer
  - Biologists
Graph courtesy of http://www.incogen.com/
Comprehensive resource list
http://bioinformatics.ca/links_directory/ (as of 01/2012 and 01/2013)
Sequence and structure databases
- GenBank: http://www.ncbi.nlm.nih.gov/genbank/
  - Annotated collection of all publicly available DNA sequences
  - 126,551,501,141 bases in 135,440,924 sequences as of April 2011
- UniProt: http://www.uniprot.org/
  - Comprehensive resource for protein sequences and functional information
  - 534,242 reviewed entries as of January 2012
- PDB: http://www.rcsb.org/
  - 3D structures of large biological molecules, including proteins and nucleic acids
  - 79,180 structures as of February 2012
- Pfam: http://pfam.sanger.ac.uk/
  - Collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs)
  - 13,672 families as of November 2011
Genome browsers
- UCSC genome browser
  - http://genome.ucsc.edu/cgi-bin/hgGateway
- Ensembl genome browser
  - http://www.ensembl.org/index.html
Gene-centric databases
- Entrez Gene
  - http://www.ncbi.nlm.nih.gov/gene
  - NCBI/NIH
  - All completely sequenced genomes
  - One gene per page
- Ensembl BioMart
  - http://www.ensembl.org/biomart/martview
  - EMBL-EBI and Sanger Institute
  - Vertebrates and other selected eukaryotic species
  - Batch information retrieval
Gene expression data
- Gene Expression Omnibus (GEO)
  - http://www.ncbi.nlm.nih.gov/geo/
- ArrayExpress
  - http://www.ebi.ac.uk/arrayexpress/
Pathway and network resources
- Gene Ontology (GO): http://www.geneontology.org/
- Pathway databases
  - KEGG: http://www.genome.jp/kegg/pathway.html
  - Reactome: http://www.reactome.org/
  - WikiPathways: http://www.wikipathways.org/
- Protein-protein interaction databases
  - DIP: http://dip.doe-mbi.ucla.edu/
  - MINT: http://mint.bio.uniroma2.it/mint/
  - BioGRID: http://www.thebiogrid.org/
  - HPRD: http://www.hprd.org
- Protein-DNA interaction database
  - Transfac: http://www.gene-regulation.com
Course content and grades
Applied Bioinformatics
IGP300B Bioregulation II, Spring 2014
(M/W/F, 10:00-10:55am, Location TBA)
Module director: Bing Zhang, Ph.D. ([email protected]; Department of Biomedical Informatics; 2525 West End Ave, Room 656; Phone: 936-0090)
Team members: William Bush, Ph.D., Qi Liu, Ph.D., Zhongming Zhao, Ph.D.
Date | Room | Subject | Instructor | Homework (HW) / Project
2/14 | 206 PRB | Course overview & Introduction to Linux | Zhang |
2/17 | 407 A-C LH | Pairwise sequence alignment | Zhao |
2/19 | 407 A-C LH | Multiple sequence alignment | Zhao |
2/21 | 206 PRB | Inferring phylogenetic relationships | Zhao | HW I distribution (20 pts)
2/24 | 407 A-C LH | Gene prediction | Bush |
2/26 | 407 A-C LH | Gene regulatory elements and conservation | Bush | HW I due
2/28 | 208 LH | In silico and In clinico characterization of genetic variations | Bush | HW II distribution (20 pts)
3/3 | 407 A-C LH | Supervised analysis of gene expression data | Zhang |
3/5 | 407 A-C LH | Unsupervised analysis of gene expression data | Zhang | HW II due
3/7 | 206 PRB | Functional interpretation of gene lists | Zhang |
3/10 | 411 A-C LH | Next-Generation Sequencing data analysis | Liu | HW III distribution (20 pts)
3/12 | 407 A-C LH | Data Analysis Project | Zhang & Liu |
3/14 | 208 LH | | | HW III due
3/17 | 407 A-C LH | | |
3/19 | 415 A-C LH | Project presentation | | Project presentation (40 pts)
3/21 | 206 PRB | | |

HW assignments will be graded by each instructor for their respective sections.
Final grade = sum of the three HW scores and the project presentation score (100 pts in total).
A: 85-100; B: 75-84; C: 65-74; D: 55-64; F: 0-54
Course materials and assignments
- Lecture slides available at https://sites.google.com/site/vandyigp/bioregulation-ii/minimester-2/applied-bioinformatics
- Homework assignments available at the same URL on the distribution date (2/21, 2/28, 3/10)
- Homework assignments are due on paper at the beginning of class on the due date (2/26, 3/5, 3/14). There will be a 10% per day deduction for late reports.
- Start thinking about forming project teams (~5 people per team)
- Instructor contact information
  - Dr. Bing Zhang: [email protected]
  - Dr. Zhongming Zhao: [email protected]
  - Dr. William Bush: [email protected]
  - Dr. Qi Liu: [email protected]
ACCRE
- Advanced Computing Center for Research & Education
  - http://www.accre.vanderbilt.edu/
  - The compute cluster currently consists of more than 500 Linux systems with quad or hex core processors
- Linux system
  - An operating system (OS) like Windows or Mac OS
  - Portable, multi-tasking, multi-user OS
  - High performance and free, making it ideal for high-performance computing clusters
Get an ACCRE account
- http://www.accre.vanderbilt.edu/?page_id=617
- Registration form
  - Name, VUNetID, Department (VU), School (VU), Email, Phone, Position
  - Group: IGP300b_ab (igp300b_ab)
  - Primary research area: bioinformatics
  - Primary application: Existing Application
  - Primary application name: R
  - Primary application type: Serial
  - Expected typical number of processors: NA
  - Expected typical number of concurrent running jobs: 1
  - Linux experience:
  - Expected compilers/languages: C, C++, R, perl, python
  - Expected external libraries: NA
  - BlueArc User: No
  - Other useful information: NA
Logging onto the cluster and changing your password
- Windows
  - Application: SSH (http://its.vanderbilt.edu/downloads)
  - Two steps: add profile -> edit profile
  - Host name: vmplogin.accre.vanderbilt.edu
  - Username: your_user_name
- Mac
  - Use Spotlight to find the application: Terminal
  - Command: ssh [email protected]
- Change password
  - rsh vmpsched
  - passwd
- Exit
  - exit
Logging onto the cluster and changing your password (using SSH in Windows)
Logging onto the cluster and changing your password (using Terminal in Mac)
You won't see any response while typing your password, which is fine.
Hierarchical file system
/
  bin/       chmod, cp, date, grep, mv, rm, vi
  usr/
    bin/     diff, find, gcc, id, make, perl, ssh
    lib/     libc.so, libgpfs.so, libjpeg.so, libstdc++.so
  home/
    igptest/
      bin/   myprog.sh, dothis.pl, dothat.py
      docs/
      src/   prog1.c, prog2.f77, prog3.cpp
    annie/
    cody/
  scratch/
  etc/
  tmp/
Example paths:
/home
/home/igptest
/home/igptest/src/prog3.cpp
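
To get a feel for how a full path walks down the tree, the sketch below rebuilds a small piece of the example layout under a scratch directory. The variable names root and file are made up for this illustration, and the scratch directory only stands in for "/", so nothing on the real system is touched.

```shell
# Scratch directory that plays the role of "/" in this example.
root=$(mktemp -d)

# Recreate part of the example tree: /home/igptest with bin, docs and src.
mkdir -p "$root/home/igptest/bin" "$root/home/igptest/docs" "$root/home/igptest/src"
touch "$root/home/igptest/src/prog3.cpp"

# A full path names every directory from the top of the tree down to the file.
file="$root/home/igptest/src/prog3.cpp"
dirname "$file"     # the directory that contains the file
basename "$file"    # the file name itself
```

On the cluster the same idea applies with the real root, for example /home/igptest/src/prog3.cpp.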
Working with directories
- pwd (prints your present working directory)
- ls (lists directory contents)
- mkdir (makes a directory)
- cd (changes directories)
  - .. (parent directory)
  - . (current directory)
  - ~ or no parameter (home directory)
- rmdir (removes an empty directory)
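
A short practice session using the directory commands above might look like the following sketch. It runs inside a scratch directory created with mktemp; the names work and project are placeholders chosen for this example.

```shell
work=$(mktemp -d)   # scratch directory so the example is safe to run
cd "$work"
pwd                 # print the present working directory
mkdir project       # make a directory called project
ls                  # list contents: project is now shown
cd project          # change into the new directory
cd ..               # go back up to the parent directory
rmdir project       # remove the directory (works only because it is empty)
ls                  # prints nothing: the directory is gone
```

Typing cd with no parameter afterwards returns you to your home directory.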
Working with files
- more (displays the contents of a file)
  - space bar to show next page
  - q to quit
- cp (copies files)
- mv (renames/moves files)
- rm (removes files)
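
The file commands above can be tried out safely in a scratch directory, as in this sketch. The file and directory names are invented for the example, and cat (a standard command that prints a file without paging) is used in place of more so the session does not stop and wait for a key press.

```shell
work=$(mktemp -d)                       # scratch directory for practice
cd "$work"
printf 'line 1\nline 2\n' > notes.txt   # create a small file to work with
cat notes.txt                           # print the file (more would page through it)
cp notes.txt copy.txt                   # copy: notes.txt and copy.txt now both exist
mv copy.txt renamed.txt                 # rename copy.txt to renamed.txt
mkdir archive
mv renamed.txt archive/                 # move the file into a directory
rm notes.txt                            # remove the original; there is no undo
ls -R                                   # list what is left, including subdirectories
```

Note that rm deletes immediately, so it is worth checking the file name before pressing Enter.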
Getting help
- man (displays manual pages for a command)
  - man ls (displays the manual for the ls command)
  - space bar to show next page
  - q to quit
- Options for ls
  - ls -a (do not ignore entries starting with .)
  - ls -l (use a long listing format)
  - ls -al (use a long listing format and do not ignore entries starting with .)
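
The effect of the ls options is easiest to see with one ordinary file and one file whose name starts with a dot. The sketch below sets that up in a scratch directory; the file names are placeholders for this example.

```shell
work=$(mktemp -d)
cd "$work"
touch visible.txt .hidden   # names starting with "." are hidden by default
ls        # shows only visible.txt
ls -a     # also shows .hidden, plus the entries . and ..
ls -l     # long format: permissions, owner, size, modification time
ls -al    # long format, including the entries starting with .
```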
Editing files with nano
- cd (change to home directory)
- nano .bashrc (use nano to edit the file .bashrc, which contains commands that are executed each time a new shell starts)
- Add "setpkgs -a R" to the end of the file (this will allow you to use the R environment for statistical computing, which has been installed on the ACCRE system)
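
If you would rather not open an editor, the same line can be appended from the command line. The sketch below is one possible way to do it, not an ACCRE-recommended procedure: it works on a practice file in a scratch directory (demo_home is a made-up stand-in for your home directory) and checks first whether the line is already present so that it is never added twice.

```shell
demo_home=$(mktemp -d)        # stand-in for your home directory (assumption)
bashrc="$demo_home/.bashrc"   # on the cluster this would be ~/.bashrc
line='setpkgs -a R'           # the line to add, as given on the slide

touch "$bashrc"               # make sure the file exists

# Append the line only if an identical line is not already in the file.
grep -qxF -- "$line" "$bashrc" || echo "$line" >> "$bashrc"

# Running the same command again changes nothing.
grep -qxF -- "$line" "$bashrc" || echo "$line" >> "$bashrc"

tail -n 1 "$bashrc"           # show the last line of the file
```

To apply this for real, set bashrc to "$HOME/.bashrc" instead of the practice file, then log out and back in so the new line takes effect.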
Copying files to/from a local computer
- Windows
  - Application: SSH (http://its.vanderbilt.edu/downloads)
- Mac
  - Application: Fugu (http://its.vanderbilt.edu/downloads)
  - Connect to: vmplogin.accre.vanderbilt.edu
  - Username: your_user_name
  - Don't change other items
Copying files to/from a local computer (using SSH in Windows)
Copying files to/from a local computer (using Fugu in Mac)
Homework
- Get an ACCRE account
- Log onto the cluster and change your password
- Get familiar with the Linux commands introduced today
- Copy the file sample_file.txt from the directory /home/igptest to your home directory
- Add "setpkgs -a R" to the end of your .bashrc file