CS 466 Introduction to Bioinformatics Saurabh Sinha.
-
date post
19-Dec-2015 -
Category
Documents
-
view
225 -
download
1
Transcript of CS 466 Introduction to Bioinformatics Saurabh Sinha.
CS 466Introduction to Bioinformatics
Saurabh Sinha
What is the course about?
• Algorithmic concepts, applied to sample (toy) problems in “bioinformatics”– Follows the text book
• “Real” bioinformatics research– Follows the best journals
• Not about practical training in the use of popular bioinformatics software
Grading
• Assignments: 40%– About one every two weeks
• Mid Term: 30%
• Final: 30%
Expectations
• Some programming skills– Any programming language is fine
Administrative Details
• Instructor: – Saurabh Sinha– Room 2122, Siebel Center– Email: [email protected]
• Class hrs: Tue & Thu, 3:30pm-4:45pm, 1131SC• Office hrs: Tue, before class (2:30 - 3:30 pm) 2122SC• Web site: http://veda.cs.uiuc.edu/courses/fa08/cs466/
• Welcome to sit in, if not taking for credit
Text books
• Jones and Pevzner: required
• Durbin et al.: recommended
Other course
• CS 591 BIO: weekly seminar on biomedical informatics
• Fridays at 10:30 am.
Motivating bioinformatics
Special issue of journal Science, July 1, 2005.
>What Is the Universe Made Of?>What is the Biological Basis of Consciousness?>Why Do Humans Have So Few Genes?>To What Extent Are Genetic Variation and Personal Health Linked?>Can the Laws of Physics Be Unified?>How Much Can Human Life Span Be Extended?>What Controls Organ Regeneration?>How Can a Skin Cell Become a Nerve Cell?>How Does a Single Somatic Cell Become a Whole Plant?>How Does Earth's Interior Work?>Are We Alone in the Universe?>How and Where Did Life on Earth Arise?>What Determines Species Diversity?>What Genetic Changes Made Us Uniquely Human?>How Are Memories Stored and Retrieved?>How Did Cooperative Behavior Evolve?>How Will Big Pictures Emerge from a Sea of Biological Data?>How Far Can We Push Chemical Self-Assembly?>What Are the Limits of Conventional Computing?>Can We Selectively Shut Off Immune Responses?>Do Deeper Principles Underlie Quantum Uncertainty and Nonlocality?>Is an Effective HIV Vaccine Feasible?>How Hot Will the Greenhouse World Be?>What Can Replace Cheap Oil -- and When?>Will Malthus Continue to Be Wrong?
Many of the most profound scientific questions of today are within the realm of bioinformatics research
“Why do humans have so few genes ?”
A simple organism
GENE
Raw
mat
eria
lsEnvironmental signal
Response (protein)
A simple organism
GENE1
GENE2
GENE3
A simple organism
GENE1
GENE2
GENE3
GENE4
GENE5
GENE6
GENE7
GENE8
GENE9
GENE10
A complex organism
GENE1
GENE2
GENE3
GENE4
GENE5
GENE6
GENE7
GENE8
GENE9
GENE10
Complex circuit of interactions
Regulatory networks
• This may be the reason why humans have so few genes (the circuit, not the number of switches, carries the complexity)
• Bioinformatics can unravel such networks, given the genome (DNA sequence) and gene activity information
Decoding the regulatory network
• Find patterns (“motifs”) in DNA sequence that occur more often than expected by chance
• Statistics on DNA sequences and words
• Knowing these can tell us about edges in the regulatory network
Decoding the regulatory network
• An example computational problem:• Given a string of length 10,000 over the
alphabet {A,C,G,T}• Count the number of occurrences Nw
of every 6 letter word w
• Are there specific words that occur more frequently than expected by chance?
Decoding the regulatory network
• What is expected by chance?• What is “more frequently”?• Interesting mathematical questions
• The Moby Dick example
Decoding the regulatory network
• What is expected by chance?• What is “more frequently”?• Interesting mathematical questions
• The Moby Dick example• This is called “motif finding”, and helps
decode the regulatory network!
Comparing DNA
• Humans are about 99.9% identical to each other, DNA-wise.
• How do we know that ?
• Compare the genome of two individuals.
• The computational problem: Are two sequences similar ?
Sequence alignment
• Why is this a problem?
• The two sequences will differ by “substitutions”, “insertions” and “deletions” accumulated during evolution
• The comparison algorithm has to be robust to such possibilities.– A special technique called “dynamic programming”
does all this, and is “efficient”
Sequence alignment
• Why should we care?
• Compare human genome with fish. You’ll see some portions that are highly similar.
• These “conserved” portions are often genes …
• … or regulatory sequences! The regulatory network again.
On counting genes
• The original question was “Why do humans have so few genes?”
• How do we know how many genes there are in the human genome ? (And where they are in the genome)
• Experiments can be designed, but bioinformatics plays a major role
Gene prediction
• The task of predicting the locations of genes in a new genome (“annotation”)
• Gene prediction software
• The more sophisticated ones use “Hidden Markov models” (HMM) and multiple species comparison
HMM for Gene Prediction
http://researchweb.watson.ibm.com/journal/rd/453/birney.html
What is this graph?
It captures the “architecture” of a gene.
It translates into a “probabilisticmodel”.
It leads directly to a gene finding algorithm
“What controls organ regeneration ?”
“How does a single somatic cell become a whole plant ?”
Development and Regeneration
• Developmental biology
• The timeline from a single cell (with genetic material from mother and father) to a multicellular embryo, and to an adult
• A paradox : All cells in the adult body have the same DNA, then how come different cells are different ?
Regulatory networks again
• Bioinformatics used to scan entire genome for regions that participate in “segmenting” the embryo
• Hidden Markov models used to detect such regions
• Multiple species comparison aids discovery
“How did cooperative behavior evolve?”
Social behavior and bioinformatics?
• Social behavior in honey bees
• Young worker bees are nurses in the hive; older ones go out to forage
• This behavioral maturation is determined by needs of colony. What is the genetic basis of this ?
Social behavior and bioinformatics
• Illinois team scanned the genome to understand this (2006)
• Regulatory network of social behavior
• Statistical tools, such as Hypergeometric test
• Machine learning tools such as “support vector machine classification”
Other challenges
Protein structure prediction
http://www.denizyuret.com/students/vkurt/thesis-main_dosyalar/image006.gif
Protein structure prediction
• Can we predict the 3-D structure of a protein from its amino acid sequence ?
• Why ? – One good reason: structure gives clues about function. If we
can tell the structure, we can perhaps tell the function
– We can design amino acid sequences that will fold into proteins that do what we want them to do. Drug design !!
• Neural networks, a popular technique in computer science, applied to this problem
Metagenomics• Most studies to date are on genomes of one
species
• A sample from the soil contains hundreds of bacteria, thousands of viruses. Can we study all of these ?
• The Sorcerer II expedition• http://www.sorcerer2expedition.org/version1/HTML/main.htm
Many more challenges
• New types of data come due to technological breakthroughs in biology
• High throughput data carries unprecedented amount of information
• Too much noise
• Bioinformatics removes the noise and reveals the truth
Bioinformatics
• Is not about one problem (e.g., designing better computer chips, better compilers, better graphics, better networks, better operating systems, etc.)
• Is about a family of very different problems, all related to biology, all related to each other
• How can computers help solve any of this family of problems ?
Bioinformatics and You
• You can learn the tools of bioinformatics• These tools owe their origin to computer
science, information theory, probability theory, statistics, etc.
• You can learn the language of biology, enough to understand what the problems are
• You can apply the tools to these problems and contribute to science