Introduction to human genomics and genome informatics · Head, Bioinformatics & Integrative...

Introduction to human genomics and genome informatics

Prince of Wales Clinical School

Session 1

Dr Jason Wong

ARC Future Fellow

Head, Bioinformatics & Integrative Genomics

Adult Cancer Program, Lowy Cancer Research Centre

Prince of Wales Clinical School, Faculty of Medicine, UNIVERSITY OF NEW SOUTH WALES, SYDNEY NSW 2052

What we will cover

• Structure of the human genome

• Layers of genomic information

– DNA (Sequence variation)

– RNA (Genes & gene expression)

– Epigenetics (DNA methylation)

– Epigenetics (Histone code/Transcription factors)

• Genomic data acquisition technologies

– Microarray

– Next-generation sequencing

Structure of human genome

• Consists of 23 pairs of chromosomes.

• Chromosomes come in pairs, meaning that the genome is diploid.

• Each individual chromosome is made up of double-stranded DNA.

• Approximately 3 billion base pairs in the haploid genome.

Reference human genome

• Human genomes vary between individuals (~0.1% of bases differ).

• Computationally, a reference genome is used.

• Important things to note about the reference genome:

– Is haploid (i.e. only 1 sequence)

– Is a composite sequence (i.e. does not correspond to anyone’s genome)

Representation of genomic data

• Genomic data is most commonly represented in two ways:

1. Sequence data – fasta format (.fa or .fasta)

2. Location data – bed format (.bed)

>chr1

NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

ACAGTACTGGCGGATTATAGGGAAACACCCGGAGCATATGCTGTTTGGTC

TCAgtagactcctaaatatgggattcctgggtttaaaagtaaaaaataaa

tatgtttaatttgtgaactgattaccatcagaattgtactgttctgtatc

ccaccagcaatgtctaggaatgcctgtttctccacaaagtgtttactttt

....

chromosome start end name score strand

chr1 934343 935552 HES4 0 -

chr1 948846 949919 ISG15 0 +

...

All about genomic formats here - http://genome.ucsc.edu/FAQ/FAQformat.html
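The two formats above can be read with a few lines of Python. This is a minimal sketch, not a full parser: the names `BedRecord`, `parse_bed_line` and `read_fasta` are made up for illustration, and it assumes six-column BED lines like the ones shown. BED start coordinates are 0-based and the end is exclusive, so a feature's length is simply end minus start.

```python
from typing import Dict, Iterable, List, NamedTuple


class BedRecord(NamedTuple):
    """One line of a six-column BED file (hypothetical helper type)."""
    chrom: str
    start: int   # 0-based, inclusive
    end: int     # 0-based, exclusive
    name: str
    score: int
    strand: str

    @property
    def length(self) -> int:
        # Half-open coordinates make the length a plain subtraction.
        return self.end - self.start


def parse_bed_line(line: str) -> BedRecord:
    """Split a whitespace-separated BED line into typed fields."""
    chrom, start, end, name, score, strand = line.split()[:6]
    return BedRecord(chrom, int(start), int(end), name, int(score), strand)


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Collect sequences keyed by the name that follows '>'."""
    parts: Dict[str, List[str]] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].split()[0]
            parts[current] = []
        elif current is not None:
            # Case is preserved; sequence lines are joined without separators.
            parts[current].append(line)
    return {name: "".join(chunks) for name, chunks in parts.items()}


hes4 = parse_bed_line("chr1 934343 935552 HES4 0 -")
print(hes4.name, hes4.strand, hes4.length)
```

In practice a dedicated library would normally be used for large files; the sketch only shows how little structure the formats have.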

What do chromosomes contain?

Genes: ~1.2% coding, ~2% non-coding

Regulatory regions: ~2%

Repetitive elements comprise another ~50% of the human genome

Layers of genetic information

• DNA sequence variation

• Gene expression

– Coding

– Non-coding

• Epigenetic regulation

– DNA methylation

– Histone/transcription factor binding

Sequence variation

Variations in DNA sequence

• Cytological level:

– Chromosome numbers

– Segmental duplications, rearrangements, and deletions

• Sub-chromosomal level:

– Transposable elements

– Short Deletions/Insertions, Tandem repeats

• Sequence level:

– Single Nucleotide Polymorphisms (SNPs)

– Small Nucleotide Insertions and Deletions (Indels)

Sequence variation

• Single nucleotide polymorphisms (SNPs)

– DNA sequence variations that exist between members of a species.

– They are inherited and therefore present in all cells.

• Somatic mutations

– Are acquired after conception, i.e. only present in some cells.

– Often observed in cancer cells.

Types of SNPs/Mutations

• Most SNPs and mutations fall in intergenic regions.

• Within genes, they can either fall in the non-coding or coding regions.

• Within coding regions, they can either leave the encoded amino acid unchanged (synonymous) or change it (non-synonymous).

[Figure: gene diagram labelling the intergenic region, TSS, non-coding and coding regions, and synonymous vs non-synonymous variants]

Effects of sequence variation

• Non-synonymous variants:

– Missense (changes an amino acid, which can alter protein structure or function)

– Nonsense (introduces a premature stop codon, truncating the protein)

• Synonymous or non-coding variants:

– Alter transcriptional/translational efficiency

– Alter mRNA stability

– Alter gene regulation (e.g. alter TF binding)

– Alter RNA regulation (e.g. affect miRNA binding)

The majority of sequence variants are neutral.
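The synonymous, missense and nonsense categories can be made concrete with the standard genetic code. The sketch below is illustrative only: `classify_coding_variant` is a hypothetical helper that compares a reference codon with a mutated codon, and it assumes the codons are already in frame and on the coding strand. The "stop-lost" label is an extra category not listed on the slide, included so that every input gets a sensible answer.

```python
# Standard genetic code, encoded compactly: codons are enumerated with
# each position cycling through T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_coding_variant(ref_codon: str, alt_codon: str) -> str:
    """Classify a single-codon change by its effect on the protein."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"      # same amino acid (or stop) as before
    if alt_aa == "*":
        return "nonsense"        # new premature stop codon
    if ref_aa == "*":
        return "stop-lost"       # not on the slide; stop codon removed
    return "missense"            # different amino acid


for ref, alt in [("GAA", "GAG"), ("GAG", "GTG"), ("CAG", "TAG")]:
    print(ref, "->", alt, classify_coding_variant(ref, alt))
```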

Genes and gene expression

• A gene is a functional unit of DNA that is transcribed into RNA.

• Total genes in the human genome – 57,445

Types of genes

Source: GENCODE (version 18)

Coding genes

Source: http://www.news-medical.net

• Traditionally considered to be the most important functional unit of genomes.

• ~ 20,000 in the human genome.

• Due to alternative splicing, one gene can make many proteins.

Non-coding genes

microRNA

• Plays a role in post-transcriptional regulation.

• Only discovered in 1993.

• Acts by either causing RNA degradation or inhibiting translation.

• Implicated in many aspects of health and disease including:

– Development

– Cancer

– Heart disease

Long non-coding RNA (lncRNA)

• Arbitrarily defined as non-coding transcripts > 200 nt in length.

• Implicated in many functions including:

– Altering protein/DNA interactions.

– Binding mRNA.

– Acting as a sink for miRNAs.

– Etc…

• Unlike coding genes and miRNAs, lncRNAs are less conserved and the functions of many are unknown.

Prensner and Chinnaiyan (2011) Cancer Discov. 1:391

Gene expression

• Measuring the level of RNA (typically mRNA) in the sample.

• Generally microarray- or sequencing-based.

• Commonly used for measuring differential expression

– between samples, or

– between genes

• Computational analysis and normalisation of expression data can be complicated.

Source: OPENbeta
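A small example shows why normalisation matters before comparing samples. The sketch below assumes sequencing-style read counts and uses counts-per-million scaling followed by a log2 fold change; this is one simple scheme chosen for illustration, not the method used in the lecture. The gene names and counts are invented.

```python
import math
from typing import Dict


def counts_per_million(counts: Dict[str, int]) -> Dict[str, float]:
    """Scale raw counts so that each sample sums to one million."""
    total = sum(counts.values())
    return {gene: 1e6 * n / total for gene, n in counts.items()}


def log2_fold_change(
    a: Dict[str, float], b: Dict[str, float], pseudocount: float = 1.0
) -> Dict[str, float]:
    """log2(b / a) per gene; the pseudocount avoids division by zero."""
    return {
        gene: math.log2((b[gene] + pseudocount) / (a[gene] + pseudocount))
        for gene in a
    }


# Hypothetical samples: sample_b was sequenced about twice as deeply.
sample_a = {"geneX": 100, "geneY": 900}
sample_b = {"geneX": 400, "geneY": 1600}

lfc = log2_fold_change(counts_per_million(sample_a), counts_per_million(sample_b))
for gene, value in lfc.items():
    print(gene, round(value, 2))
```

The raw counts for geneX differ four-fold, but after scaling for sequencing depth the difference is about two-fold (a log2 fold change close to 1).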

Gene-set/Pathway analysis

• Differential expression of individual genes is not necessarily informative.

• Genes are often grouped into gene-sets based on ontology or biological pathways.

• Gene-set and pathway analyses are therefore a common downstream step after differential gene expression analysis.
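One common way to test a gene-set is over-representation analysis: given a list of differentially expressed genes, ask whether it contains more members of the set than expected by chance. The sketch below uses a one-sided hypergeometric test, which is an assumption about the method (the slides do not name one), and `enrichment_p_value` is a hypothetical function name.

```python
from math import comb


def enrichment_p_value(
    n_background: int, n_in_set: int, n_selected: int, n_overlap: int
) -> float:
    """P(overlap >= n_overlap) when n_selected genes are drawn at random
    from n_background genes, of which n_in_set belong to the gene-set."""
    max_overlap = min(n_in_set, n_selected)
    total = comb(n_background, n_selected)
    tail = sum(
        comb(n_in_set, k) * comb(n_background - n_in_set, n_selected - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


# Toy numbers: 10 genes measured, 3 in the pathway, 3 differentially
# expressed, and all 3 of those are pathway members.
print(enrichment_p_value(10, 3, 3, 3))
```

Real analyses test many gene-sets at once, so the resulting p-values would also need correcting for multiple testing.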

Gene regulation

Gene regulation/epigenetics

• Epigenetics is the study of mechanisms that alter cellular function independently of any changes in DNA sequence.

• Mechanisms include:

– DNA methylation

– Nucleosome positioning/Histone modification

– Transcription factors

– Non-coding RNA

DNA methylation

• DNA is methylated on cytosines in CpG dinucleotides
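Because methylation occurs at CpG dinucleotides, candidate sites can be listed directly from the sequence. A minimal sketch, with `cpg_positions` as a hypothetical helper that returns the 0-based position of each cytosine followed by a guanine:

```python
from typing import List


def cpg_positions(sequence: str) -> List[int]:
    """Return 0-based positions of the C in every CG dinucleotide."""
    seq = sequence.upper()  # treat lowercase bases the same way
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == "CG"]


print(cpg_positions("ACGTCGCG"))
```

This only finds where methylation could occur; whether a given site is methylated has to be measured.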

Nucleosomes & Histones

• Histones are proteins that package DNA into nucleosomes.

Histone modifications

• Acetylation

• Methylation

• Phosphorylation

• Ubiquitination

• Can enhance or repress gene expression

Transcription factors

• Proteins that bind DNA to regulate gene expression.

• Typically bind at gene promoters or enhancers.

Studying gene regulation

• Has traditionally been more difficult than studying gene expression because:

– The locations of many regulatory regions are poorly defined.

– Regulatory regions differ greatly between cell types.

– There are many modes of gene regulation.

• Next-generation sequencing technologies have enabled great progress to be made.

Genomic technologies

Genomic technologies

• Microarray-based data

– SNP profiling

– Copy number profiling

– DNA methylation profiling

– Gene expression profiling

• Next-generation sequencing

– “Swiss-army knife” of genomics

Data acquisition

• Relies on fluorescence-based detection of DNA hybridised to complementary probes on the array.

• Can be used to study DNA or any molecule that can be converted to cDNA.

– SNP array (probes for two alleles)

– Methylation array (probes for bisulfite-converted DNA)

– Expression array (probes for exonic DNA regions)

• Limited by probes present on the array.
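Methylation arrays work on bisulfite-converted DNA: bisulfite treatment converts unmethylated cytosines to uracil, which reads as T after amplification, while methylated cytosines are protected and still read as C. The sketch below simulates that conversion on one strand; `bisulfite_convert` and the example positions are hypothetical.

```python
from typing import Iterable


def bisulfite_convert(sequence: str, methylated_positions: Iterable[int]) -> str:
    """Simulate bisulfite conversion of a single DNA strand.

    Unmethylated C becomes T; C at a methylated (0-based) position stays C.
    """
    methylated = set(methylated_positions)
    converted = []
    for i, base in enumerate(sequence.upper()):
        if base == "C" and i not in methylated:
            converted.append("T")
        else:
            converted.append(base)
    return "".join(converted)


# The C at position 1 is methylated, the C at position 4 is not.
print(bisulfite_convert("ACGTCG", {1}))
```

A probe matching the converted sequence with C and another matching it with T can then distinguish the methylated from the unmethylated state at that site.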

Microarray gene expression analysis

• Gene signatures

• Sample classification

Gene Value

D26528_at 193

D26561_cds1_at -70

D26561_cds2_at 144

D26561_cds3_at 33

D26579_at 318

D26598_at 1764

D26599_at 1537

D26600_at 1204

D28114_at 707

Class Sno D26528 D63874 D63880 …

[Table rows: ALL samples 2 to 5 and AML samples 51 to 53, with one expression value per probe]

Microarray chips → Images scanned by laser → Datasets → Data mining and analysis

Next-generation sequencing

What is NGS?

NGS covers a number of different technologies. Illumina sequencing is used here as an example.

Figures provided by Illumina Inc.

Sequences are inferred from fluorescence signals during synthesis

Figures provided by Illumina Inc.

[Figure: short sequencing reads → alignment → aligned reads over a gene]

NGS file formats

• Fastq – Stores sequencing reads from NGS. Contains read sequences and quality scores.

• BAM/SAM – A BAM file (.bam) is a binary file containing the coordinates at which each read has mapped to a genome. SAM is the same file in text format.

• BedGraph/Wig – For storing continuous profile information for visualisation.

• VCF – For storing information about variants.

https://powcs.med.unsw.edu.au/sites/default/files/powcs/page/example_file_formats.zip
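As an illustration of the Fastq format, each read occupies four lines: a header starting with "@", the sequence, a "+" separator, and a quality string with one character per base. The sketch below assumes the Phred+33 encoding (quality = ASCII code minus 33), which should be checked against the sequencer's documentation; `read_fastq` is a hypothetical helper, not part of any library.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (read name, sequence, per-base Phred qualities) per record."""
    records = [line.strip() for line in lines if line.strip()]
    if len(records) % 4 != 0:
        raise ValueError("FASTQ input should contain 4 lines per read")
    for i in range(0, len(records), 4):
        header, sequence, _separator, quality = records[i:i + 4]
        # Assumed Phred+33: subtract 33 from each character's ASCII code.
        yield header[1:], sequence, [ord(ch) - 33 for ch in quality]


example = ["@read1", "ACGT", "+", "II5!"]
for name, seq, quals in read_fastq(example):
    print(name, seq, quals)
```

A Phred score Q corresponds to an estimated error probability of 10^(-Q/10), so a score of 40 means roughly a 1 in 10,000 chance that the base call is wrong.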


Pros/cons of each technology

• NGS

– Greater dynamic range (only limited by depth of sequencing).

– Coverage of genome does not need to be limited.

– Many more applications from sequencing data.

– Data analysis and management can be challenging.

• Microarrays

– Microarrays are still significantly cheaper.

– Largest public datasets are likely to be microarray based.

– Data analysis pipelines are well standardised.

Example of using public resources to tell us more about our data

http://www.powcs.med.edu.au/OncoCis



OncoCis uses public data from various sources to assign potential function to non-coding mutations

Given a non-coding mutation, what do we want to know?

1. Does the mutation fall within a cis-regulatory region (ENCODE/Human Epigenome Atlas)?

2. Is the mutation site highly conserved (UCSC)?

3. What gene might the mutation affect (FANTOM5)?

4. What transcription factor binding site might be altered (JASPAR)?

5. Does the mutation affect a gene which is druggable (DGIdb)?
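The first question in the list comes down to an interval lookup: does the mutation's position fall inside any annotated region? The sketch below is not how OncoCis is implemented; it is a minimal illustration that assumes regions are given in BED-style coordinates (0-based start, exclusive end) and the mutation position is 1-based, as in VCF. The region labels and coordinates are invented.

```python
from typing import Iterable, List, Tuple

# (chromosome, start, end, label) with BED-style half-open coordinates.
Region = Tuple[str, int, int, str]


def regions_overlapping(
    chrom: str, position_1based: int, regions: Iterable[Region]
) -> List[str]:
    """Return labels of all regions containing the given 1-based position."""
    pos0 = position_1based - 1  # convert to the 0-based BED convention
    return [
        label
        for region_chrom, start, end, label in regions
        if region_chrom == chrom and start <= pos0 < end
    ]


# Hypothetical annotation; real data would come from sources such as ENCODE.
annotation = [
    ("chr1", 100, 200, "enhancer_A"),
    ("chr1", 150, 300, "promoter_B"),
    ("chr2", 100, 200, "enhancer_C"),
]
print(regions_overlapping("chr1", 151, annotation))
```

A linear scan is fine for a handful of regions; genome-wide annotations would normally be queried with an indexed structure such as an interval tree.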

Gene mapping from FANTOM5 or GREAT

Link out to UCSC genome browser

Epigenetic data from ENCODE/Epigenome project

Conservation data from UCSC

Motif data from JASPAR

FANTOM5 regulatory data

Link out to Drug-Gene interaction database (DGIdb)
