Human Genome Sequence and Variability
Gabor T. Marth, D.Sc.
Department of Biology, Boston College
Medical Genomics Course – Debrecen, Hungary, May 2006
Lecture overview
1. Genome sequencing strategies, sequencing informatics
2. Genome annotation, functional and structural features in the human genome
3. Genome variability, DNA nucleotide, structural, and epigenetic variations
1. The human genome sequence
The nuclear genome (chromosomes)
The genome sequence
• the primary template on which to outline functional features of our genetic code (genes, regulatory elements, secondary structure, tertiary structure, etc.)
Completed genomes
~1 Mb
~100 Mb
>100 Mb
~3,000 Mb
Main genome sequencing strategies
Clone-based shotgun sequencing (Human Genome Project)
Whole-genome shotgun sequencing (Celera Genomics, Inc.)
Hierarchical genome sequencing
BAC library construction
clone mapping
shotgun subclone library construction
sequencing/read processing
sequence reconstruction (sequence assembly)
Lander et al. Nature 2001
Clone mapping – “sequence ready” map
Shotgun subclone library construction
BAC primary clone
cloning vector
sequencing vector
subclone insert
Sequencing
Robotic automation
Lander et al. Nature 2001
Base calling
PHRED: base = A, Q = 40
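A Phred quality value Q expresses the estimated probability P that a base call is wrong, using Q = -10 log10(P). The sketch below is an illustration only: the function names are hypothetical and are not part of PHRED itself. It shows that Q = 40 corresponds to an estimated error rate of 1 in 10,000.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred quality value Q to an estimated error probability P."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an estimated error probability P to a Phred quality value Q."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    # The slide's example: base = A called with Q = 40.
    print(phred_to_error_prob(40))
    print(error_prob_to_phred(0.001))
```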
Vector clipping
Sequence assembly
PHRAP
Repetitive DNA may confuse assembly
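To illustrate why repeats are a problem, here is a toy suffix-prefix overlap check. It is a simplified sketch, not the algorithm PHRAP uses, and the reads and the `overlap` helper are made up for the example. Two reads that end in the same repeat overlap a third read equally well, so overlap alone cannot tell which one it should be joined to.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of read a that equals a prefix of read b.

    Returns 0 if no overlap of at least min_len bases exists.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


if __name__ == "__main__":
    # Hypothetical reads: r1 and r2 come from different loci but both end
    # in the same CT repeat; r3 starts inside that repeat.
    r1 = "AAACTCTCTC"
    r2 = "GGGCTCTCTC"
    r3 = "CTCTCTCTTT"
    # Both overlaps have the same length, so the join is ambiguous.
    print(overlap(r1, r3), overlap(r2, r3))
```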
Sequence completion (finishing)
CONSED, AUTOFINISH
gap
region of low sequence coverage and/or quality
2. Human genome annotation
Genome annotation – Goals
protein-coding genes
RNA genes
repetitive elements
GC content
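GC content is the fraction of G and C bases in a sequence, and it is usually reported per window along a chromosome. A minimal sketch, assuming non-overlapping windows and ignoring any character other than A, C, G or T (the function names are hypothetical):

```python
def gc_content(seq):
    """Fraction of G and C among the A, C, G, T bases of seq (0.0 if none)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def gc_windows(seq, window):
    """GC content of consecutive non-overlapping windows of the given size."""
    return [
        gc_content(seq[i:i + window])
        for i in range(0, len(seq) - window + 1, window)
    ]


if __name__ == "__main__":
    print(gc_content("ATGC"))
    print(gc_windows("GGCCATAT", 4))
```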
The starting material
AGCGTGGTAGCGCGAGTTTGCGAGCTAGCTAGGCTCCGGATGCGACCAGCTTTGATAGATGAATATAGTGTGCGCGACTAGCTGTGTGTTGAATATATAGTGTGTCTCTCGATATGTAGTCTGGATCTAGTGTTGGTGTAGATGGAGATCGCGTAGCGTGGTAGCGCGAGTTTGCGAGCTAGCTAGGCTCCGGATGCGACCAGCTTTGATAGATGAATATAGTGTGCGCGACTAGCTGTGTGTTGAATATATAGTGTGTCTCTCGATATGTAGTCTGGATCTAGTGTTGGTGTAGATGGAGATCGCGTGCTTGAGTCGTTCGTTTTTTTATGCTGATGATATAAATATATAGTGTTGGTGGGGGGTACTCTACTCTCTCTAGAGAGAGCCTCTCAAAAAAAAAGCTCGGGGATCGGGTTCGAAGAAGTGAGATGTACGCGCTAGXTAGTATATCTCTTTCTCTGTCGTGCTGCTTGAGATCGTTCGTTTTTTTATGCTGATGATATAAATATATAGTGTTGGTGGGGGGTACTCTACTCTCTCTAGAGAGAGCCTCTCAAAAAAAAAGCTCGGGGATCGGGTTCGAAGAAGTGAGATGTACGCGCTAGXTAGTATATCTCTTTCTCTGTCGTGCT
Coding genes – ab initio predictions
ATGGCACCACCGATGTCTACGTGGTAGGGGACTATAAAAAAAAAAA
Open Reading Frame = ORF
Start codon
Stop codon
PolyA signal
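The open reading frame on this slide can be found with a simple scan: locate a start codon (ATG), then read in-frame triplets until a stop codon (TAA, TAG or TGA). The sketch below is a forward-strand toy with a hypothetical function name, not one of the gene finders named in this lecture. Real ab initio predictors also model splice sites, codon usage and other signals.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_first_orf(seq):
    """Return (start, end) of the first ATG...stop ORF on the forward strand.

    end is exclusive and includes the stop codon. Returns None if no ATG
    is followed by an in-frame stop codon.
    """
    seq = seq.upper()
    start = seq.find("ATG")
    while start != -1:
        # Walk codon by codon in the frame set by this ATG.
        for i in range(start + 3, len(seq) - 2, 3):
            if seq[i:i + 3] in STOP_CODONS:
                return start, i + 3
        start = seq.find("ATG", start + 1)
    return None


if __name__ == "__main__":
    slide_seq = "ATGGCACCACCGATGTCTACGTGGTAGGGGACTATAAAAAAAAAAA"
    start, end = find_first_orf(slide_seq)
    print(slide_seq[start:end])
```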
Ab initio predictions
Gene structure
Ab initio predictions
…AGAATAGGGCGCGTACCTTCCAACGAAGACTGGG…
splice donor site splice acceptor site
Ab initio predictions
Genscan, Grail, Genie, GeneFinder, Glimmer, etc.
Homology-based predictions
EST_genome, Sim4, Spidey, EXALIN
ATGGCACCACCGATGTCTACGTGGTAGGGGACTATAAAAAAAAAAA
ACGGAAGTCT
known coding sequence from another organism
GGACTATAAA
expressed sequence
genes predicted by homology
Genomescan, Twinscan, etc.
Consolidation – gene prediction systems
Otto
Ensembl
FgenesH
Genscan
Grail
Genewise
Sim4
dbEST
ncRNA genes
prediction based on structure (e.g. tRNAs)
for other novel ncRNAs, only homology-based predictions have been successful
Repeat annotations
Repeat annotations are based on sequence similarity to known repetitive elements in a repeat sequence library
The landscape of the human genome
Gene annotations – # of coding genes
Lander et al. Initial sequencing and analysis of the human genome, Nature, 2001
Gene annotations – gene length
Lander et al. Initial sequencing and analysis of the human genome, Nature, 2001
Gene annotations – gene function
Lander et al. Initial sequencing and analysis of the human genome, Nature, 2001
GC content and coding potential
Lander et al. Initial sequencing and analysis of the human genome, Nature, 2001
ncRNAs
Lander et al. Initial sequencing and analysis of the human genome, Nature, 2001
Segmental duplications
Lander et al. Initial sequencing and analysis of the human genome, Nature, 2001
Repeat elements
Lander et al. Initial sequencing and analysis of the human genome, Nature, 2001
Genes and repeats
Physical vs. genetic map (Mb/cM)
0.4 cM 1.3 cM 0.7 cM
0.4 Mb 0.7 Mb 0.3 Mb
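The relationship between the two maps can be expressed as a local recombination rate in cM per Mb, or as its inverse in Mb per cM. The sketch below assumes that the three genetic distances and the three physical distances above pair up in the order listed; the original figure is not available, so that pairing is an assumption, and the function name is hypothetical.

```python
def recombination_rate(genetic_cm, physical_mb):
    """Local recombination rate in cM per Mb for one map interval."""
    return genetic_cm / physical_mb


if __name__ == "__main__":
    # (genetic distance in cM, physical distance in Mb), paired by position.
    intervals = [(0.4, 0.4), (1.3, 0.7), (0.7, 0.3)]
    for cm, mb in intervals:
        rate = recombination_rate(cm, mb)
        print(f"{cm} cM over {mb} Mb: {rate:.2f} cM/Mb, {1 / rate:.2f} Mb/cM")
```

Under that pairing, the same genetic distance can span quite different physical distances, which is the point of comparing the two maps.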
3. Human genome variability
DNA sequence variations
• the reference human genome sequence is about 99.9% identical to the genome of any individual human
• sequence variations make our genetic makeup unique
SNP
• the most abundant human variations are single-nucleotide polymorphisms (SNPs) – 10 million SNPs are currently known
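As a rough worked example, the 99.9% figure can be combined with the ~3,000 Mb genome size quoted earlier in the lecture to estimate how many positions differ between one genome and the reference. This is back-of-the-envelope arithmetic under those two rounded numbers, with a hypothetical function name, and it treats every difference as a single base.

```python
def expected_differences(genome_bp, identity):
    """Approximate number of differing positions given fractional identity."""
    return round(genome_bp * (1 - identity))


if __name__ == "__main__":
    genome_bp = 3_000_000_000  # ~3,000 Mb, as quoted for the human genome
    identity = 0.999           # 99.9% identical
    diffs = expected_differences(genome_bp, identity)
    print(diffs)               # roughly 3 million positions
    print(genome_bp // diffs)  # about one difference per this many bases
```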
DNA sequence variations
insertion-deletion (INDEL) polymorphisms
Structural variations
Speicher & Carter, NRG 2005
Structural variations
Feuk et al. Nature Reviews Genetics 7, 85–97 (February 2006) | doi:10.1038/nrg1767
Detection of structural variants
Feuk et al. Nature Reviews Genetics 7, 85–97 (February 2006) | doi:10.1038/nrg1767
Epigenetic changes: chromatin structure
Sproul, NRG 2005
Epigenetic changes: DNA methylation
Laird, NRC 2003