
Page 1: Applied Bioinformatics - Vanderbilt Universitybioinfo.vanderbilt.edu/zhanglab/lectures/AB2014Lecture01.pdfThe Cancer Genome Atlas (TCGA) " Mission (Bio) # To accelerate the understanding

Applied Bioinformatics Course Overview & Introduction to Linux

Bing Zhang Department of Biomedical Informatics

Vanderbilt University

[email protected]


What is bioinformatics


Bio informatics

Data

§ Hypotheses § Questions § Samples § Experiments

§ DNA § RNA § Protein § Metabolite § Phenotype

§ Sequence § Expression § Structure §  Interaction

§ Storage/retrieval § Visualization § Computational methods § Statistical methods

Bioinformatics


Genomic sequences


http://www.ncbi.nlm.nih.gov/genbank/genbankstats.html

Human genome project (1990-2003)

First bacterial (H. influenzae)

First eukaryote (yeast)

First metazoan (C. elegans)

http://www.genomesonline.org

Completely Sequenced Genomes September 2012


Genome sequencing costs plunge



The Cancer Genome Atlas (TCGA)

- Mission (Bio)
  - To accelerate the understanding of the molecular basis of cancer through the application of genome analysis technologies.
- 2014 target (Data)
  - 25 tumor types x 500 cases each
  - Exome/whole genome sequencing
  - Copy number variation
  - Promoter methylation
  - mRNA expression
  - miRNA expression


Why now?


Bio informatics

Data

§ Hypotheses § Questions § Samples § Experiments

§ DNA § RNA § Protein § Metabolite § Phenotype

§ Sequence § Expression § Structure §  Interaction

§ Storage/retrieval § Visualization § Computational methods § Statistical methods

informatics


Roles for different investigators in bioinformatics

- Algorithm developer
  - Statisticians
  - Mathematicians
  - Computer scientists
- Tool developer
  - Bioinformaticians
- Data provider/consumer
  - Biologists


Graph courtesy of http://www.incogen.com/


Comprehensive resource list


01/2012

01/2013

http://bioinformatics.ca/links_directory/


Sequence and structure databases

- GenBank: http://www.ncbi.nlm.nih.gov/genbank/
  - Annotated collection of all publicly available DNA sequences
  - 126,551,501,141 bases in 135,440,924 sequences as of April 2011
- UniProt: http://www.uniprot.org/
  - Comprehensive resource for protein sequences and functional information
  - 534,242 reviewed entries as of January 2012
- PDB: http://www.rcsb.org/
  - 3D structures of large biological molecules, including proteins and nucleic acids
  - 79,180 structures as of February 2012
- Pfam: http://pfam.sanger.ac.uk/
  - Collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs)
  - 13,672 families as of November 2011


Genome browsers

- UCSC genome browser
  - http://genome.ucsc.edu/cgi-bin/hgGateway
- Ensembl genome browser
  - http://www.ensembl.org/index.html


Gene-centric databases

- Entrez Gene
  - http://www.ncbi.nlm.nih.gov/gene
  - NCBI/NIH
  - All completely sequenced genomes
  - One gene per page
- Ensembl BioMart
  - http://www.ensembl.org/biomart/martview
  - EMBL-EBI and Sanger Institute
  - Vertebrates and other selected eukaryotic species
  - Batch information retrieval


Gene expression data

- Gene Expression Omnibus (GEO)
  - http://www.ncbi.nlm.nih.gov/geo/
- ArrayExpress
  - http://www.ebi.ac.uk/arrayexpress/


Pathway and network resources

- Gene Ontology (GO): http://www.geneontology.org/
- Pathway databases
  - KEGG: http://www.genome.jp/kegg/pathway.html
  - Reactome: http://www.reactome.org/
  - WikiPathways: http://www.wikipathways.org/
- Protein-protein interaction databases
  - DIP: http://dip.doe-mbi.ucla.edu/
  - MINT: http://mint.bio.uniroma2.it/mint/
  - BioGRID: http://www.thebiogrid.org/
  - HPRD: http://www.hprd.org
- Protein-DNA interaction database
  - Transfac: http://www.gene-regulation.com


Course content and grades


Applied Bioinformatics
IGP300B Bioregulation II, Spring 2014
(M/W/F, 10:00-10:55am, Location TBA)

Module director: Bing Zhang, Ph.D. ([email protected]; Department of Biomedical Informatics; 2525 West End Ave, Room 656; Phone: 936-0090)
Team members: William Bush, Ph.D., Qi Liu, Ph.D., Zhongming Zhao, Ph.D.

Date | Room | Subject | Instructor | Homework (HW) / Project
2/14 | 206 PRB | Course overview & Introduction to Linux | Zhang |
2/17 | 407 A-C LH | Pairwise sequence alignment | Zhao |
2/19 | 407 A-C LH | Multiple sequence alignment | Zhao |
2/21 | 206 PRB | Inferring phylogenetic relationships | Zhao | HW I distribution (20 pts)
2/24 | 407 A-C LH | Gene prediction | Bush |
2/26 | 407 A-C LH | Gene regulatory elements and conservation | Bush | HW I due
2/28 | 208 LH | In silico and In clinico characterization of genetic variations | Bush | HW II distribution (20 pts)
3/3 | 407 A-C LH | Supervised analysis of gene expression data | Zhang |
3/5 | 407 A-C LH | Unsupervised analysis of gene expression data | Zhang | HW II due
3/7 | 206 PRB | Functional interpretation of gene lists | Zhang |
3/10 | 411 A-C LH | Next-Generation Sequencing data analysis | Liu | HW III distribution (20 pts)
3/12 | 407 A-C LH | Data Analysis Project | Zhang & Liu |
3/14 | 208 LH | | | HW III due
3/17 | 407 A-C LH | | |
3/19 | 415 A-C LH | Project presentation | | Project presentation (40 pts)
3/21 | 206 PRB | | |

HW assignments will be graded by each instructor for their respective sections.
Final grade = sum of the three HW scores and the project presentation score (100 pts in total).
A: 85-100; B: 75-84; C: 65-74; D: 55-64; F: 0-54
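
The grading rule above can be checked with a little shell arithmetic. This is only a minimal sketch: the function name letter_grade and the example scores are made up for illustration, scores are assumed to be whole numbers, and the late-report deduction is not modeled.

```shell
#!/bin/sh
# letter_grade: hypothetical helper that maps a total score (0-100)
# to the letter-grade bands listed on the slide.
letter_grade() {
    score=$1
    if   [ "$score" -ge 85 ]; then echo A
    elif [ "$score" -ge 75 ]; then echo B
    elif [ "$score" -ge 65 ]; then echo C
    elif [ "$score" -ge 55 ]; then echo D
    else                           echo F
    fi
}

# Made-up example scores: three homeworks (20 pts each) and the project (40 pts).
hw1=18; hw2=16; hw3=19; project=35

# Final grade = sum of the three HW scores and the project presentation score.
total=$((hw1 + hw2 + hw3 + project))
echo "total=$total grade=$(letter_grade "$total")"
```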


Course materials and assignments

- Lecture slides available at https://sites.google.com/site/vandyigp/bioregulation-ii/minimester-2/applied-bioinformatics
- Homework assignments available at the same URL on the distribution date (2/21, 2/28, 3/10)
- Homework assignments are due on paper at the beginning of class on the due date (2/26, 3/5, 3/14). There will be a 10% per day deduction for late reports.
- Start thinking about forming project teams (~5 people per team)
- Instructor contact information
  - Dr. Bing Zhang: [email protected]
  - Dr. Zhongming Zhao: [email protected]
  - Dr. William Bush: [email protected]
  - Dr. Qi Liu: [email protected]


ACCRE

- Advanced Computing Center for Research & Education
  - http://www.accre.vanderbilt.edu/
  - The compute cluster currently consists of more than 500 Linux systems with quad or hex core processors
- Linux system
  - An operating system (OS) like Windows or Mac
  - Portable, multi-tasking, multi-user OS
  - High performance and free, making it ideal for high performance computing clusters


Get an ACCRE account

- http://www.accre.vanderbilt.edu/?page_id=617
- Registration form
  - Name, VUNetID, Department (VU), School (VU), Email, Phone, Position
  - Group: IGP300b_ab (igp300b_ab)
  - Primary research area: bioinformatics
  - Primary application: Existing Application
  - Primary application name: R
  - Primary application type: Serial
  - Expected typical number of processors: NA
  - Expected typical number of concurrent running jobs: 1
  - Linux experience:
  - Expected compilers/languages: C, C++, R, perl, python
  - Expected external libraries: NA
  - BlueArc User: No
  - Other useful information: NA


Logging onto the cluster and change password

- Windows
  - Application: SSH (http://its.vanderbilt.edu/downloads)
  - Two steps: add profile -> edit profile
  - Host name: vmplogin.accre.vanderbilt.edu
  - Username: your_user_name
- Mac
  - Use Spotlight to find the application: Terminal
  - Command: ssh your_user_name@vmplogin.accre.vanderbilt.edu
- Change password
  - rsh vmpsched
  - passwd
- Exit
  - exit
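
The Terminal login can be sketched as follows. The user name is the slide's placeholder and must be replaced with your own; because the cluster is only reachable with a valid account, this sketch assembles and prints the login command instead of running it, and lists the follow-up commands from the slide as comments.

```shell
#!/bin/sh
# Placeholder user name from the slide; replace it with your own.
user="your_user_name"
# Login host named on the slide.
host="vmplogin.accre.vanderbilt.edu"

# Assemble the login command in the form user@host.
login_cmd="ssh ${user}@${host}"
echo "$login_cmd"

# Once logged in, the sequence given on the slide is:
#   rsh vmpsched   (connect to the host where the password is changed)
#   passwd         (prompts for the old and new passwords)
#   exit           (close the session)
```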


Logging onto the cluster and change password (using SSH in Windows)


Logging onto the cluster and change password (using Terminal in Mac)


You won’t see any response while typing your password, which is fine.


Hierarchical File system

/
    bin
        chmod, cp, date, grep, mv, rm, vi
    usr
        bin
            diff, find, gcc, id, make, perl, ssh
        lib
            libc.so, libgpfs.so, libjpeg.so, libstdc++.so
    home
        igptest
            bin
                myprog.sh, dothis.pl, dothat.py
            docs
            src
                prog1.c, prog2.f77, prog3.cpp
        annie
        cody
    scratch
    etc
    tmp

Example paths:
    /home
    /home/igptest
    /home/igptest/src/prog3.cpp
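
The example paths above read from the root (/) down through each directory to the file. As a small illustration, the sketch below recreates that one branch under a scratch directory (created with mktemp, an assumption about the local system) and splits the path into its directory and file parts with the standard dirname and basename utilities, which the slides themselves do not cover.

```shell
#!/bin/sh
# Scratch root so that nothing is created under the real /home.
root=$(mktemp -d)

# Recreate one branch of the tree: home -> igptest -> src -> prog3.cpp
mkdir -p "$root/home/igptest/src"
touch "$root/home/igptest/src/prog3.cpp"

path="/home/igptest/src/prog3.cpp"
dir_part=$(dirname "$path")     # everything up to the last slash
file_part=$(basename "$path")   # the final component
echo "directory: $dir_part"
echo "file:      $file_part"

# The same path exists below the scratch root.
ls "$root$path"

# Clean up the scratch tree.
rm -r "$root"
```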


Working with directories

- pwd (prints your present working directory)
- ls (lists directory contents)
- mkdir (makes a directory)
- cd (changes directories)
  - .. (parent directory)
  - . (current directory)
  - ~ or no parameter (home directory)
- rmdir (removes an empty directory)
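
A short session using these commands might look like the sketch below. It works inside a scratch directory created with mktemp (an assumption about the local system), and the directory name project is made up for illustration.

```shell
#!/bin/sh
start=$(pwd)             # remember where we began
work=$(mktemp -d)        # scratch directory so nothing real is touched

cd "$work"
pwd                      # prints the scratch directory path
mkdir project            # make a directory
listing=$(ls)            # list the contents of the scratch directory
echo "$listing"          # project
cd project               # change into the new directory
cd ..                    # .. is the parent, so we are back in the scratch directory
cd .                     # . is the current directory, so this changes nothing
rmdir project            # succeeds because project is empty

cd "$start"              # plain `cd` or `cd ~` would go to your home directory instead
rmdir "$work"            # clean up the scratch directory
```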


Working with files

- more (displays the contents of a file)
  - space bar to show next page
  - q to exit
- cp (copies files)
- mv (renames/moves files)
- rm (removes files)
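
The file commands can be tried safely in a scratch directory, as in this sketch. The file names are made up for illustration, and cat stands in for more here because more is interactive when run in a terminal.

```shell
#!/bin/sh
work=$(mktemp -d)               # scratch directory (assumes mktemp is available)
cd "$work"

# Create a small two-line file to work with.
printf 'line 1\nline 2\n' > notes.txt

cp notes.txt backup.txt         # copy: both files now exist
mv backup.txt old_notes.txt     # rename: backup.txt becomes old_notes.txt
rm notes.txt                    # remove the original

cat old_notes.txt               # show the contents (use `more` to page through long files)
remaining=$(ls)                 # only old_notes.txt is left
echo "$remaining"

# Clean up.
cd /
rm -r "$work"
```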


Getting help

- man (displays manual pages for a command)
  - man ls (displays the manual for the ls command)
  - space bar to show next page
  - q to exit
- Options for ls
  - ls -a (do not ignore entries starting with .)
  - ls -l (use a long listing format)
  - ls -al (use a long listing format and do not ignore entries starting with .)
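
The effect of these options is easiest to see with a hidden file present. The sketch below creates one in a scratch directory; the file names are made up for illustration.

```shell
#!/bin/sh
work=$(mktemp -d)       # scratch directory (assumes mktemp is available)
cd "$work"
touch visible.txt .hidden

plain=$(ls)             # entries starting with . are skipped
all=$(ls -a)            # includes ., .., and .hidden
echo "$plain"
echo "$all"

ls -l                   # long format: permissions, owner, size, date, name
ls -al                  # long format, hidden entries included

# Clean up.
cd /
rm -r "$work"
```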


Editing files with nano

- cd (change to home directory)
- nano .bashrc (use nano to edit the file .bashrc, which contains commands that are executed when a new shell session starts)
- Add "setpkgs -a R" to the end of the file (this lets you use the R environment for statistical computing, which is installed on the ACCRE system)
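
The slide makes this edit interactively in nano. As an alternative that is not taken from the slides, the same line can be appended from the command line. The sketch below writes to a temporary stand-in file rather than your real .bashrc, and only adds the line if it is not already present.

```shell
#!/bin/sh
# Temporary stand-in for ~/.bashrc so the sketch does not modify your real file.
rc=$(mktemp)
line='setpkgs -a R'

# Append the line only when no identical line exists (-x whole line, -F fixed string).
grep -qxF -e "$line" "$rc" || echo "$line" >> "$rc"

# Running the same command again adds nothing, so the file never collects duplicates.
grep -qxF -e "$line" "$rc" || echo "$line" >> "$rc"

count=$(grep -cxF -e "$line" "$rc")
echo "occurrences: $count"

rm "$rc"
```

To apply this to the real file, rc would be set to "$HOME/.bashrc" and the final rm dropped.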


Copying files to/from a local computer

- Windows
  - Application: SSH (http://its.vanderbilt.edu/downloads)
- Mac
  - Application: Fugu (http://its.vanderbilt.edu/downloads)
  - Connect to: vmplogin.accre.vanderbilt.edu
  - Username: your_user_name
  - Don’t change other items
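
The slides use graphical clients for file transfer. For readers who prefer the command line, the scp tool that ships with OpenSSH offers a rough equivalent, although the slides do not mention it. The sketch below only assembles and prints the commands; the user name is the slide's placeholder and the file names are made up for illustration.

```shell
#!/bin/sh
user="your_user_name"                   # placeholder from the slide
host="vmplogin.accre.vanderbilt.edu"    # login host from the slide

# Hypothetical file names. A remote path after the colon is relative to the
# remote home directory; an empty remote path means the home directory itself.
upload_cmd="scp results.txt ${user}@${host}:"
download_cmd="scp ${user}@${host}:sample_file.txt ."

echo "$upload_cmd"      # local file -> remote home directory
echo "$download_cmd"    # remote file -> current local directory
```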


Copying files to/from a local computer (using SSH in Windows)


Copying files to/from a local computer (using Fugu in Mac)


Homework

- Get an ACCRE account
- Log onto the cluster and change password
- Get familiar with the Linux commands introduced today
- Copy the file sample_file.txt under directory /home/igptest to your home directory
- Add "setpkgs -a R" to the end of your .bashrc file
