Basic Examples from Genomics
Scientific Computing 2013-2014
TIMELINE
Genetics as a set of principles and analytical procedures did not begin until 1866, when an Augustinian monk named Gregor Mendel performed a set of experiments that revealed the basic mathematics of inheritance (the information that is carried between generations).
Until 1944, it was generally assumed that chromosomal proteins carry genetic information, and that DNA plays a secondary role. This view was shattered by Avery, MacLeod, and McCarty, who demonstrated that the molecule deoxyribonucleic acid (DNA) is the major carrier of genetic material in living organisms, i.e., responsible for inheritance. The basic biological units responsible for possessing and passing on a single characteristic are called genes.
In 1953 James Watson and Francis Crick deduced the three-dimensional double helix structure of DNA and immediately inferred its method of replication.
In February 2001, the first draft of the human genome was published.
Two kinds of Cells
• Prokaryotes – no nucleus (bacteria)
• Their genomes are circular
• Eukaryotes – have a nucleus (animals, plants)
• Linear genomes with multiple chromosomes in pairs.
DNA
• Deoxyribonucleic acid (DNA) is a nucleic acid that contains the genetic instructions used in the development and functioning of all known living organisms.
Backbone: sugars and phosphate groups
DNA is a long polymer of simple units called nucleotides
Bases
A: adenine C: cytosine G: guanine T: thymine
CTGCTGTACTAGGATGCTGGTGGAGAGAGCTGCATATAAATCTTTGAGAGATGCACCAAG AATCACCATCATGGTTTCCGCCATAGGGGCTTCTTTTTTTATTCAAAATCTTGCCATTGT TTTATTTGGTGGTAGACCGAAAACTGTTCCAACGGTGGAGGTATTGTCCGGGGTGATAAA GCTGGGGTCCGTATCTCTACAAAGGCTGACCTTAGTGATTCCAGTAGTAACCATACTGCT ATTATTTCTTTTGATGTTTTTAGTGAACCAAACGAAAACTGGAATGGCAATGCGTGCCGT ATCCAAGGACTATGAAACCGCGCGGCTTATGGGAATTGACGTCAATAAAATTATTACCAT AACCTTTGGTATTGGCTCTGCTCTGGCAGCTATTGGTGGCATCATGTGGGGCGCAAAATT TCCTAAAATAGACCCTTTTGTTGGGACTATGCCGGGTATTAAATGCTTTATTGCTGCAGT TCTAGGTGGAATCGGAAACATTCCCGGTGCAGTAATCGGGGGGTTCATCTTAGGGATTGG AGAGATTATGCTCATTGCTTTTCTACCGAGCCTAACTGGCTATCGAGATGCCTTTGCTTT CATACTACTGATTATCATTCTACTGTTTAAGCCAACAGGAATCATGGGTGAAAAAATTGC GGAGAAGGTGTAGACGATGAAAAAGAAAAATACCATATTAACTGGATTAGCAGTATTGCT TTTATTGATTTATTTGATTTATGCAAATAAGAATTATGATTCTTATAAAATTAGAGTTCT AAATCTATGTGCAATTTATGCTGTATTGGGACTCAGTATGAATTTGATCAATGGATTTAC AGGTTTATTTTCCCTTGGACATGCAGGTTTTATGGCAGTAGGTGCCTATACTACCGCTCT TCTGACCATGACACCGCAAAGTAAGGAGGCAACATTCTTCTTAGTGCCCATTGTAGAGCC TTTGGCTAAAATTCAGCTTCCTTTTTTTGTGGCACTGATCATCGGTGGACTACTTTCAGC AATGGTGGCATTTTTAATCGGTGCACCGACTTTAAGGCTGAAGGGCGATTATTTAGCCAT
Complementary Base Pairing: A-T, C-G
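The pairing rule can be illustrated with a minimal Python sketch. The function names `complement` and `reverse_complement` are chosen here for illustration and do not come from the slides; the input is assumed to contain only the letters A, C, G and T.

```python
# Watson-Crick pairing: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement(seq):
    """Return the base-by-base complement of a DNA string."""
    return "".join(COMPLEMENT[base] for base in seq.upper())

def reverse_complement(seq):
    """Return the opposite strand, read in its own 5'->3' direction."""
    return complement(seq)[::-1]

# First twelve bases of the sequence shown on the slide.
fragment = "CTGCTGTACTAG"
print(complement(fragment))
print(reverse_complement(fragment))
```

The reverse complement is usually the more useful of the two, because the two strands of the double helix run in opposite directions.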
Sizes of Genomes
Protein Structure and Function
Views of a protein: Wireframe, Ball and stick
Views of a protein: Spacefill, Cartoon
CPK colors
Carbon = green, black, or grey
Nitrogen = blue
Oxygen = red
Sulfur = yellow
Hydrogen = white
VIDEO:
- DNA is Packaged
- Central Dogma (of Biology)
- Transcription
- Translation
Clustering
Clustering is the process of grouping data objects into a set of disjoint classes, called clusters, so that objects within a class have high similarity to each other, while objects in separate classes are more dissimilar.
Clustering is an example of unsupervised classification.
“Classification” refers to a procedure that assigns data objects to a set of classes.
“Unsupervised” means that clustering does not rely on predefined classes and training examples while classifying the data objects. Thus, clustering is distinguished from pattern recognition or the areas of statistics known as discriminant analysis and decision analysis, which seek to find rules for classifying objects from a given set of preclassified objects.
What is DNA Microarray?
• Scientists used to be able to perform genetic analyses of only a few genes at once. DNA microarrays allow us to analyze thousands of genes in one experiment!
Purposes.
• So why do we use DNA microarray?
• To measure changes in gene expression levels – gene expression can be compared between two different samples, such as cells at different stages of mitosis.
• To observe genomic gains and losses: microarray Comparative Genomic Hybridization (CGH).
• To observe mutations in DNA.
The Plate.
• Usually made commercially.
• Made of glass, silicon, or nylon.
• Each plate contains thousands of spots, and each spot contains a probe for a different gene.
• A probe can be a cDNA fragment, a synthetic oligonucleotide, or a larger clone such as a BAC (bacterial artificial chromosome).
• Probes can be attached either by robotic means, where a needle applies the cDNA to the plate, or by a method similar to making silicon chips for computers. The latter is called a Gene Chip.
Let’s perform a microarray!
1) Collect Samples.
2) Isolate mRNA.
3) Create Labelled DNA.
4) Hybridization.
5) Microarray Scanner.
6) Analyze Data.
STEP 1: Collect Samples.
These can come from a variety of organisms. We’ll use two samples: cancerous human skin tissue and healthy human skin tissue.
STEP 2: Isolate mRNA.
• Extract the RNA from the samples, using either a column or a solvent such as phenol-chloroform.
• After isolating the RNA, we need to isolate the mRNA from the rRNA and tRNA. mRNA has a poly-A tail, so we can use a column containing beads with poly-T tails to bind the mRNA.
• Rinse with buffer to release the mRNA from the beads. The buffer disrupts the pH, disrupting the hybrid bonds.
STEP 3: Create Labelled DNA.
Add a labelling mix to the RNA. The labelling mix contains poly-T (oligo dT) primers, reverse transcriptase (to make cDNA), and fluorescently dyed nucleotides.
We will add cyanine 3 (fluoresces green) to the healthy cells and cyanine 5 (fluoresces red) to the cancerous cells.
The primer and RT bind to the mRNA first; the RT then adds the fluorescently dyed nucleotides, creating a complementary strand of DNA (cDNA).
STEP 4: Hybridization.
• Apply the cDNA we have just created to a microarray plate.
• When comparing two samples, apply both samples to the same plate.
• The labelled single-stranded cDNA will bind to the complementary probes already present on the plate.
STEP 5: Microarray Scanner.
The scanner has a laser, a computer, and a camera.
The laser excites the fluorescent dyes on the hybridized cDNA, causing them to fluoresce.
The camera records the images produced when the laser scans the plate.
The computer allows us to immediately view our results and it also stores our data.
STEP 6: Analyze the Data.
GREEN – the healthy sample hybridized more than the diseased sample.
RED – the diseased/cancerous sample hybridized more than the nondiseased sample.
YELLOW - both samples hybridized equally to the target DNA.
BLACK - areas where neither sample hybridized to the target DNA.
By comparing the differences in gene expression between the two samples, we can understand more about the genomics of a disease.
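The colour reading above can be sketched in Python as a rule on the two channel intensities of one spot. This is only an illustration: the function name `spot_color` and the thresholds `background` and `fold` are assumptions made here, not values from the slides, and real analysis software applies background correction and normalization before any such call.

```python
import math

def spot_color(cy3, cy5, background=100.0, fold=2.0):
    """Classify one spot from its green (Cy3, healthy) and red (Cy5, cancerous)
    intensities. `background` and `fold` are illustrative thresholds."""
    # Neither sample hybridized above the background level.
    if cy3 < background and cy5 < background:
        return "BLACK"
    # Log ratio of red over green; clamp at 1.0 to avoid dividing by zero.
    log_ratio = math.log2(max(cy5, 1.0) / max(cy3, 1.0))
    if log_ratio >= math.log2(fold):
        return "RED"      # cancerous sample hybridized more
    if log_ratio <= -math.log2(fold):
        return "GREEN"    # healthy sample hybridized more
    return "YELLOW"       # both samples hybridized about equally

for cy3, cy5 in [(4000, 500), (500, 4000), (1000, 1200), (50, 40)]:
    print(cy3, cy5, spot_color(cy3, cy5))
```

Working with the logarithm of the ratio treats a twofold increase and a twofold decrease symmetrically, which is why expression data are commonly reported as log ratios.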
Benefits.
• Relatively affordable (for some people!), about $60,000 for an arrayer and scanner setup.
• The plates are convenient to work with because they are small.
• Fast - Thousands of genes can be analyzed at once.
Problems.
• Oligonucleotide libraries – redundancy and contamination.
• DNA Microarray only detects whether a gene is turned on or off.
• Massive amounts of data.
EXAMPLES (only two experiments)
A GENE EXPRESSION MATRIX
The original gene expression matrix obtained from a scanning process contains noise, missing values, and systematic variations arising from the experimental procedure. Data preprocessing is indispensable before any cluster analysis can be performed.
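As a rough sketch of what such preprocessing can look like, the Python function below applies three common steps to each gene (row): a log transform, filling missing values with the gene's mean, and standardization. The choice of these particular steps and the name `preprocess` are assumptions for illustration; the slides do not prescribe a specific procedure.

```python
import math
from statistics import mean, pstdev

def preprocess(matrix):
    """Clean a gene expression matrix given as a list of gene rows.

    Each entry is a positive intensity (or ratio), or None if missing.
    """
    cleaned = []
    for row in matrix:
        # 1) Log-transform so that up- and down-regulation are symmetric.
        logged = [math.log2(v) if v is not None else None for v in row]
        observed = [v for v in logged if v is not None]
        if not observed:
            raise ValueError("gene has no measurements to impute from")
        # 2) Fill each missing value with the mean of the observed values.
        row_mean = mean(observed)
        filled = [row_mean if v is None else v for v in logged]
        # 3) Standardize the gene to zero mean and unit variance.
        mu, sigma = mean(filled), pstdev(filled)
        if sigma == 0:
            cleaned.append([0.0] * len(filled))
        else:
            cleaned.append([(v - mu) / sigma for v in filled])
    return cleaned

# Two genes measured under three conditions; one value is missing.
print(preprocess([[1.0, 4.0, None], [2.0, 2.0, 2.0]]))
```

Standardizing each gene makes genes with similar expression patterns but different absolute levels look alike, which is usually what a clustering of expression profiles is meant to capture.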
Hierarchical clustering is a technique that organizes elements into a tree, rather than forming an explicit partitioning of the elements into clusters. In this case, the genes are represented as the leaves of a tree. The edges of the tree are assigned lengths, and the distances between leaves (that is, the length of the path in the tree that connects two leaves) correlate with entries in the distance matrix. Such trees are used in both the analysis of expression data and in studies of molecular evolution.
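The distance matrix mentioned above has to be computed from the expression profiles first. A minimal Python sketch follows, assuming Euclidean distance between gene rows; the slides do not fix a particular distance measure, and the name `distance_matrix` is chosen here for illustration.

```python
import math

def distance_matrix(profiles):
    """Return the symmetric n x n matrix of Euclidean distances
    between n expression profiles (one list of values per gene)."""
    n = len(profiles)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist = math.sqrt(sum((a - b) ** 2
                                 for a, b in zip(profiles[i], profiles[j])))
            d[i][j] = d[j][i] = dist   # distance is symmetric
    return d

# Three genes measured under two conditions.
for row in distance_matrix([[0.0, 0.0], [3.0, 4.0], [0.0, 1.0]]):
    print(row)
```

Other measures, such as one minus the Pearson correlation, are also widely used for expression data and would slot into the same structure.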
Example.
The HIERARCHICALCLUSTERING algorithm below takes an n×n distance matrix d as an input, and progressively generates n different partitions of the data as the tree it outputs. The largest partition has n single-element clusters, with every element forming its own cluster. The second-largest partition combines the two closest clusters from the largest partition, and thus has n − 1 clusters. In general, the ith partition combines the two closest clusters from the (i − 1)th partition and has n − i + 1 clusters.
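A compact Python sketch of this merging process is given here. It is not the HIERARCHICALCLUSTERING pseudocode itself: it records the n partitions as lists of clusters instead of building the tree, and it assumes that the distance between two clusters is the smallest distance between their members (single linkage), which is only one possible choice. The function name and the `linkage` parameter are hypothetical.

```python
def hierarchical_clustering(d, linkage=min):
    """Return the n successive partitions for an n x n distance matrix d.

    `linkage` reduces the pairwise distances between two clusters to one
    number; min gives single linkage, max gives complete linkage.
    """
    n = len(d)
    clusters = [(i,) for i in range(n)]       # n single-element clusters
    partitions = [list(clusters)]
    while len(clusters) > 1:
        # Find the two closest clusters in the current partition.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                dist = linkage(d[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        _, a, b = best
        # Merge them into one cluster and record the new partition.
        merged = tuple(sorted(clusters[a] + clusters[b]))
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
        partitions.append(list(clusters))
    return partitions

d = [[0, 1, 6, 7],
     [1, 0, 5, 8],
     [6, 5, 0, 2],
     [7, 8, 2, 0]]
for partition in hierarchical_clustering(d):
    print(len(partition), sorted(partition))
```

With Python's zero-based indexing, `partitions[i - 1]` is the ith partition of the text and contains n − i + 1 clusters. This naive search over all cluster pairs is cubic or worse in n, which is acceptable for a teaching example but not for thousands of genes.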