Introduction to human genomics and genome informatics
Prince of Wales Clinical School
Session 1
Dr Jason Wong
ARC Future Fellow
Head, Bioinformatics & Integrative Genomics
Adult Cancer Program, Lowy Cancer Research Centre
Prince of Wales Clinical School, Faculty of Medicine, UNIVERSITY OF NEW SOUTH WALES, SYDNEY NSW 2052
What we will cover
• Structure of the human genome
• Layers of genomic information – DNA (Sequence variation) – RNA (Genes & gene expression) – Epigenetics (DNA methylation) – Epigenetics (Histone code/Transcription factors)
• Genomic data acquisition technologies – Microarray – Next-generation sequencing
Structure of human genome
• Consists of 23 pairs of chromosomes.
• Because chromosomes come in pairs, the genome is diploid.
• Each individual chromosome is made up of double-stranded DNA.
• Approximately 3 billion base pairs per haploid set.
Reference human genome
• Human genomes vary significantly between individuals (~0.1%)
• Computationally, a reference genome is used.
• Important things to note about the reference genome: – Is haploid (i.e. only 1 sequence)
– Is a composite sequence (i.e. does not correspond to anyone’s genome)
Representation of genomic data
• Genomic data is most commonly represented in two ways:
1. Sequence data – fasta format (.fa or .fasta)
2. Location data – bed format (.bed)
>chr1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
ACAGTACTGGCGGATTATAGGGAAACACCCGGAGCATATGCTGTTTGGTC
TCAgtagactcctaaatatgggattcctgggtttaaaagtaaaaaataaa
tatgtttaatttgtgaactgattaccatcagaattgtactgttctgtatc
ccaccagcaatgtctaggaatgcctgtttctccacaaagtgtttactttt
....
chromosome start end name score strand
chr1 934343 935552 HES4 0 -
chr1 948846 949919 ISG15 0 +
...
All about genomic formats here - http://genome.ucsc.edu/FAQ/FAQformat.html
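As a rough illustration of the two formats above, the following Python sketch reads FASTA and BED text into simple data structures. The function names and the shortened example strings are assumptions made for this illustration. BED start coordinates are 0-based and the end is exclusive, so a feature's length is end minus start.

```python
from typing import Dict, Iterable, List, NamedTuple


class BedRecord(NamedTuple):
    chrom: str
    start: int   # 0-based, inclusive
    end: int     # exclusive
    name: str
    score: int
    strand: str


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {sequence name: sequence} from FASTA-formatted lines."""
    chunks: Dict[str, List[str]] = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The name is the first word after '>'.
            words = line[1:].split()
            current = words[0] if words else ""
            chunks[current] = []
        elif current is not None:
            chunks[current].append(line)
    return {name: "".join(parts) for name, parts in chunks.items()}


def parse_bed(lines: Iterable[str]) -> List[BedRecord]:
    """Parse six-column BED lines; shorter or comment lines are skipped."""
    records = []
    for line in lines:
        fields = line.split()
        if len(fields) < 6 or fields[0].startswith("#"):
            continue
        records.append(BedRecord(fields[0], int(fields[1]), int(fields[2]),
                                 fields[3], int(fields[4]), fields[5]))
    return records


# Shortened versions of the slide examples (sequence lines truncated here).
FASTA_TEXT = """>chr1
NNNNNNNNNN
ACAGTACTGG
TCAgtagact
"""
BED_TEXT = """chr1 934343 935552 HES4 0 -
chr1 948846 949919 ISG15 0 +
"""

genome = parse_fasta(FASTA_TEXT.splitlines())
for rec in parse_bed(BED_TEXT.splitlines()):
    print(rec.name, rec.strand, rec.end - rec.start)
```

In practice an established library would normally be used instead of hand-written parsers; the sketch only shows how little structure the two formats need.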
What do chromosomes contain?
Genes: ~1.2% coding ~2% non-coding
Regulatory regions: ~2%
Repetitive elements comprise another ~50% of the human genome
Layers of genetic information
• DNA sequence variation
• Gene expression – Coding
– Non-coding
• Epigenetic regulation – DNA methylation
– Histone/transcription factor binding
Sequence variation
Variations in DNA sequence
• Cytological level: – Chromosome numbers – Segmental duplications, rearrangements, and deletions
• Sub-chromosomal level: – Transposable elements – Short Deletions/Insertions, Tandem repeats
• Sequence level: – Single Nucleotide Polymorphisms (SNPs) – Small Nucleotide Insertions and Deletions (Indels)
Sequence variation
• Single nucleotide polymorphisms (SNPs) – DNA sequence variations that exist among members of a species. – They are inherited and therefore present in all cells.
• Somatic mutations – Are acquired rather than inherited, i.e. only present in some cells – Are often observed in cancer cells
Types of SNPs/Mutations
• Most SNPs and mutations fall in intergenic regions.
• Within genes, they can either fall in the non-coding or coding regions.
• Within coding regions, they can either leave the encoded amino acid unchanged (synonymous) or change it (non-synonymous).
[Figure: gene diagram marking the TSS, the intergenic region, non-coding and coding regions, and synonymous vs non-synonymous variants]
Effects of sequence variation
• Non-synonymous variants: – Missense (changes an amino acid, which can alter protein structure)
– Nonsense (truncates protein)
• Synonymous or non-coding variants: – Alter transcriptional/translational efficiency
– Alter mRNA stability
– Alter gene regulation (e.g. alter TF binding)
– Alter RNA regulation (e.g. affect miRNA binding)
The majority of sequence variants are neutral
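The synonymous, missense and nonsense categories above can be illustrated with a small Python sketch that compares a reference codon with a mutated codon using the standard genetic code. The function name and the extra "stop-lost" label are choices made for this illustration, and real annotation tools also consider transcripts, splicing and reading frame.

```python
from itertools import product

# Standard genetic code, written compactly: codons are enumerated in
# T, C, A, G order and paired with one-letter amino acids ('*' = stop).
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def classify_coding_change(ref_codon: str, alt_codon: str) -> str:
    """Label a single-codon change by its effect on the encoded amino acid."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"    # new stop codon truncates the protein
    if ref_aa == "*":
        return "stop-lost"   # not covered on the slide; included for completeness
    return "missense"        # a different amino acid is encoded


for ref, alt in [("GAA", "GAG"), ("GAG", "GTG"), ("TGG", "TGA")]:
    print(ref, "->", alt, classify_coding_change(ref, alt))
```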
Genes and gene expression
• A gene is a functional unit of DNA that is transcribed into RNA.
• Total genes in the human genome – 57,445
Types of genes
Source: GENCODE (version 18)
Coding genes
Source: http://www.news-medical.net
• Traditionally considered to be the most important functional unit of genomes.
• ~ 20,000 in the human genome.
• Due to splicing one gene can make many proteins.
Non-coding genes
microRNA
• Plays a role in post-transcriptional regulation.
• Only discovered in 1993.
• Acts by either causing RNA degradation or inhibition of translation.
• Implicated in many aspects of health and disease including: – Development – Cancer – Heart disease
Long non-coding RNA (lncRNA)
• Arbitrarily defined as non-coding transcripts > 200 nt in length.
• Implicated in many functions including: – Altering protein/DNA interaction. – Binding mRNA. – Acting as a sink for miRNAs. – Etc…
• Unlike coding genes and miRNAs, lncRNAs are less conserved and the functions of many are unknown.
Prensner and Chinnaiyan (2011) Cancer Discov. 1:391
Gene expression
• Measures the level of RNA (typically mRNA) in the sample.
• Generally microarray- or sequencing-based.
• Commonly used for measuring differential expression – between samples, or – between genes
• Computational analysis and normalisation of expression data can be complicated.
Source: OPENbeta
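To give a feel for why normalisation matters when comparing samples, here is a minimal Python sketch for sequencing-based data. It scales raw read counts to counts per million (CPM) and then computes a log2 fold change between two samples. The gene names and counts are invented for illustration, and the pseudocount of 1 is an assumption. Real differential expression analysis uses dedicated statistical methods with replicates.

```python
import math
from typing import Dict


def counts_per_million(counts: Dict[str, int]) -> Dict[str, float]:
    """Scale raw read counts so that the sample sums to one million."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no reads")
    return {gene: count * 1e6 / total for gene, count in counts.items()}


def log2_fold_change(value_a: float, value_b: float,
                     pseudocount: float = 1.0) -> float:
    """log2(B / A); the pseudocount avoids division by zero."""
    return math.log2((value_b + pseudocount) / (value_a + pseudocount))


# Hypothetical raw counts (not real data). Sample B was sequenced twice
# as deeply, so raw counts alone would exaggerate differences.
sample_a = {"GENE1": 100, "GENE2": 300, "GENE3": 600}
sample_b = {"GENE1": 400, "GENE2": 600, "GENE3": 1000}

cpm_a = counts_per_million(sample_a)
cpm_b = counts_per_million(sample_b)
for gene in sample_a:
    print(gene, round(log2_fold_change(cpm_a[gene], cpm_b[gene]), 3))
```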
Gene-set/Pathway analysis
• Differential expression of individual genes is not necessarily informative.
• Genes are often grouped in gene-sets based on ontology or biological pathways.
• Gene-set and pathway analyses are therefore a common downstream step after differential gene expression analysis.
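One common way to ask whether a gene-set is over-represented among differentially expressed genes is a hypergeometric (one-sided Fisher) test. The Python sketch below is one possible illustration, not the method of any particular tool. The function name and the numbers in the example are hypothetical.

```python
from math import comb


def enrichment_p_value(n_background: int, n_in_set: int,
                       n_selected: int, n_overlap: int) -> float:
    """Probability of seeing at least n_overlap gene-set members among
    n_selected genes drawn at random from n_background genes."""
    largest_possible = min(n_in_set, n_selected)
    total_draws = comb(n_background, n_selected)
    tail = sum(
        comb(n_in_set, k) * comb(n_background - n_in_set, n_selected - k)
        for k in range(n_overlap, largest_possible + 1)
    )
    return tail / total_draws


# Hypothetical numbers: 20,000 background genes, a pathway of 100 genes,
# 500 differentially expressed genes, 12 of which are in the pathway.
p = enrichment_p_value(20000, 100, 500, 12)
print(f"p = {p:.3g}")
```

Because many gene-sets are usually tested at once, the resulting p-values would normally be corrected for multiple testing.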
Gene regulation
Gene regulation/epigenetics
• Epigenetics is the study of mechanisms that alter cellular function independently of any changes in DNA sequence.
• Mechanisms include: – DNA methylation
– Nucleosome positioning/Histone modification
– Transcription factors
– Non-coding RNA
DNA methylation
• DNA is methylated on cytosines in CpG dinucleotides
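Since methylation occurs at CpG dinucleotides, a first computational step is often simply locating them. A minimal Python sketch, with a hypothetical function name and a made-up example sequence:

```python
from typing import List


def cpg_positions(sequence: str) -> List[int]:
    """Return 0-based positions of the C in every CpG dinucleotide."""
    seq = sequence.upper()
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == "CG"]


example = "ACGTCGGA"  # made-up sequence for illustration
print(cpg_positions(example))
```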
Nucleosomes & Histones
• Histones are proteins that package DNA into nucleosomes.
Histone modifications
• Acetylation
• Methylation
• Phosphorylation
• Ubiquitination
• Can enhance or repress gene expression
Transcription factors
• Proteins that bind DNA to regulate gene expression.
• Typically bind at gene promoters or enhancers.
Studying gene regulation
• Has traditionally been more difficult than studying gene expression because:
– The locations of many regulatory regions are poorly defined.
– Regulatory regions differ greatly between cell types.
– There are many modes of gene regulation.
• Next-generation sequencing technologies have enabled great progress to be made.
Genomic technologies
Genomic technologies
• Microarray-based data
– SNP profiling
– Copy number profiling
– DNA methylation profiling
– Gene expression profiling
• Next-generation sequencing
– “Swiss-army knife” of genomics
Data acquisition
• Relies on fluorescence-based detection of DNA hybridised to complementary probes on the array.
• Can be used to study DNA or any molecule that can be converted to cDNA. – SNP array (probe for two alleles)
– Methylation array (probe for bisulfite-converted DNA)
– Expression array (probe for exonic DNA regions)
• Limited by probes present on the array.
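The methylation array relies on bisulfite conversion: unmethylated cytosines are converted to uracil and read as T, while methylated cytosines stay as C. A methylated and an unmethylated site therefore differ in sequence after treatment, which is what the probes can distinguish. The Python sketch below simulates this on one strand; the function name and example sequence are hypothetical.

```python
from typing import Iterable


def bisulfite_convert(sequence: str, methylated_positions: Iterable[int]) -> str:
    """Simulate bisulfite treatment of one strand: every unmethylated C
    is read as T, every methylated C (given by 0-based position) stays C."""
    protected = set(methylated_positions)
    converted = []
    for position, base in enumerate(sequence.upper()):
        if base == "C" and position not in protected:
            converted.append("T")
        else:
            converted.append(base)
    return "".join(converted)


example = "ACGTCGGA"                      # made-up sequence
print(bisulfite_convert(example, {1}))    # C at position 1 is methylated
print(bisulfite_convert(example, set()))  # fully unmethylated
```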
Microarray gene expression analysis
• Gene signatures
• Sample classification
Gene Value
D26528_at 193
D26561_cds1_at -70
D26561_cds2_at 144
D26561_cds3_at 33
D26579_at 318
D26598_at 1764
D26599_at 1537
D26600_at 1204
D28114_at 707
Class Sno D26528 D63874 D63880 …
ALL 2 193 4157 556
ALL 3 129 11557 476
ALL 4 44 12125 498
ALL 5 218 8484 1211
AML 51 109 3537 131
AML 52 106 4578 94
AML 53 211 2431 209
…
[Figure: microarray workflow: Microarray chips, Images scanned by laser, Datasets, Data mining and analysis]
Next-generation sequencing
What is NGS?
NGS covers a number of different technologies. Illumina sequencing is used here as an example.
Figures provided by Illumina Inc.
Sequences are inferred from fluorescence signals during synthesis
Figures provided by Illumina Inc.
[Figure: alignment of short sequencing reads, showing aligned reads over a gene]
NGS file formats
• Fastq – Stores sequencing reads from NGS. Contains read sequences and quality scores.
• BAM/SAM – A BAM file (.bam) is a binary file containing the coordinates where each read has mapped in a genome. SAM is the same information in text format.
• BedGraph/Wig – for storing continuous profile information for visualisation.
• VCF – for storing information about variants.
https://powcs.med.unsw.edu.au/sites/default/files/powcs/page/example_file_formats.zip
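As an illustration of the first format, each FASTQ record spans four lines: an "@" header, the read sequence, a "+" separator, and a quality string with one character per base. The Python sketch below decodes the quality characters assuming the common Phred+33 encoding. That encoding, the function names and the example read are assumptions, and older files may use a different offset.

```python
from typing import Iterable, List, Tuple


def parse_fastq(lines: Iterable[str]) -> List[Tuple[str, str, List[int]]]:
    """Return (read name, sequence, Phred scores) for each 4-line record."""
    kept = [line.rstrip("\n") for line in lines if line.strip()]
    records = []
    for i in range(0, len(kept) - 3, 4):
        header, sequence, separator, quality = kept[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record at line {i + 1}")
        # Phred+33: quality score = ASCII code of the character minus 33.
        scores = [ord(char) - 33 for char in quality]
        records.append((header[1:], sequence, scores))
    return records


def error_probability(phred_score: int) -> float:
    """Convert a Phred score Q into an error probability 10^(-Q/10)."""
    return 10 ** (-phred_score / 10)


FASTQ_TEXT = """@read1
ACGT
+
II5!
"""  # made-up read for illustration

for name, sequence, scores in parse_fastq(FASTQ_TEXT.splitlines()):
    print(name, sequence, scores)
```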
Pros/cons of each technology
• NGS – Greater dynamic range (only limited by depth of sequencing)
– Coverage of genome does not need to be limited.
– Many more applications from sequencing data.
– Data analysis and management can be challenging.
• Microarrays – Microarrays are still significantly cheaper.
– Largest public datasets are likely to be microarray based.
– Data analysis pipelines are well standardised.
Example of using public resources to tell us more about our data
http://www.powcs.med.edu.au/OncoCis
OncoCis uses public data from various sources to assign potential function to non-coding mutations
Given a non-coding mutation what do we want to know?
1. Does the mutation fall within a cis-regulatory region (ENCODE/Human Epigenome Atlas)?
2. Is the mutation site highly conserved (UCSC)?
3. What gene might the mutation affect (FANTOM5)?
4. What transcription factor binding site might be altered (JASPAR)?
5. Does the mutation affect a gene which is druggable (DGIdb)?
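The first question in the list above boils down to an interval lookup: does the mutation's position fall inside any annotated region? The Python sketch below shows the idea with BED-style coordinates (0-based start, exclusive end). It is not how OncoCis itself is implemented, and the region names and coordinates are invented for illustration.

```python
from typing import List, Tuple

# (chromosome, start, end, label) with 0-based start and exclusive end.
Region = Tuple[str, int, int, str]


def regions_containing(chrom: str, position: int,
                       regions: List[Region]) -> List[Region]:
    """Return every region on `chrom` that covers the 0-based `position`."""
    return [region for region in regions
            if region[0] == chrom and region[1] <= position < region[2]]


# Hypothetical cis-regulatory regions (not taken from ENCODE).
regulatory_regions: List[Region] = [
    ("chr1", 1000, 1500, "enhancer_A"),
    ("chr1", 2000, 2600, "promoter_B"),
    ("chr2", 1000, 1500, "enhancer_C"),
]

hits = regions_containing("chr1", 1200, regulatory_regions)
print([label for _, _, _, label in hits])
```

A linear scan is fine for a handful of regions; genome-scale annotation sets are usually queried with sorted or indexed interval structures instead.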
Gene mapping from FANTOM5 or GREAT
Link out to UCSC genome browser
Epigenetic data from ENCODE/Epigenome project
Conservation data from UCSC
Motif data from JASPAR
FANTOM5 regulatory data
Link out to Drug-Gene interaction database (DGIdb)