Dr Andrew Harrison Departments of Mathematical Sciences and Biological Sciences University of Essex...


Page 1: Dr Andrew Harrison Departments of Mathematical Sciences and Biological Sciences University of Essex harry@essex.ac.uk Looking for signals in tens of thousands.

Dr Andrew Harrison

Departments of Mathematical Sciences and Biological Sciences

University of Essex

harry@essex.ac.uk

Looking for signals in tens of thousands of GeneChips

There are >10^5 GeneChip experiments in the public domain, which cost ~$10^9 to produce. Extracting further information from this resource will be very cost-effective.


Microarray informatics at Essex University
Departments of Mathematical Sciences and Biological Sciences

Faculty (degrees in ...)
Dr Andrew Harrison: Physics
Professor Graham Upton: Statistics
Dr Berthold Lausen: Statistics
+ Dr Hugh Shanahan (Royal Holloway): Physics

PhD students
Farhat Memon: Computer Science
Anne Owen: Mathematics
Fajriyah Rohmatul: Statistics

Alumni
Dr Jose Arteaga-Salas: Statistics
Dr Renata Camargo: Computer Science
Dr Caroline Johnston: Molecular Biology and Bioinformatics
Dr William Langdon: Computer Science and Physics
Dr Joanna Rowsell: Mathematics
Dr Olivia Sanchez-Graillet: Computer Science and Bioinformatics
Dr Maria Stalteri: Inorganic Chemistry and Bioinformatics
+ 4 former MSc students

Current MSc and UG students
Aleksandra Iljina: Statistics and Data Analysis
Lina Hamadeh: Statistics and Data Analysis
Madalina Ghita: Mathematics


There is a huge multiple-testing problem.

m=log2(Fold Change), a=log2(Average Intensity)
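The m and a values defined above can be computed for a pair of chips as in the following minimal sketch. It assumes the common MA-plot convention in which a is the mean of the two log2 intensities (the log2 of their geometric mean); the slide only says "log2(Average Intensity)", so that reading is an assumption. The function name `ma_values` and the example intensities are hypothetical.

```python
import numpy as np

def ma_values(x, y):
    """Return (m, a) for two arrays of positive intensities on the same probes.

    m is the log2 fold change of x over y.
    a is the mean of the two log2 intensities (an assumed convention).
    """
    log_x = np.log2(np.asarray(x, dtype=float))
    log_y = np.log2(np.asarray(y, dtype=float))
    m = log_x - log_y
    a = 0.5 * (log_x + log_y)
    return m, a

# Made-up intensities for three probes on two chips.
m, a = ma_values([400.0, 100.0, 64.0], [100.0, 100.0, 256.0])
print(m)
print(a)
```

With tens of thousands of probes per chip, many points will lie far from m = 0 by chance alone, which is the multiple-testing problem the slide refers to.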

What can be learnt from comparing different experiments?

Perfect Match (PM)

Mismatch (MM)

The biggest uncertainty in GeneChip analysis is how to merge all the probe information for one gene (Harrison, Johnston and Orengo, 2007, BMC Bioinformatics, 8:195).


Some genes are represented by multiple probe-sets.

Probe-set A Probe-set B

If they are measuring the same thing the signals should be up and down regulated together.

Is that always true? No

Stalteri and Harrison, 2007, BMC Bioinformatics, 8:13


Probes map to different exons. Alternative splicing may cause some exons to be upregulated and others to be downregulated.


Genes come in pieces.

But exons do not. Multiple probes mapping to the same exon should measure the same thing.


We are studying the correlations in expression across >6,000 GeneChips (HG-U133A), sampling RNA from many tissues and phenotypes.


Figure: the correlations in intensities (log2) between probes in probeset 208772_at on the HG-U133A array. The number in each square is the correlation ×10. Blue = low correlation; yellow = high correlation. Axis labels: probe order along the gene; average intensity in GEO.

The correlation calculated for PM probes 9 and 11 (the data in the earlier scatter plot) is reported as 8 (0.76 multiplied by 10 and rounded).
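A grid of this kind can be produced along the following lines. This is a minimal sketch, not the authors' code: it assumes Pearson correlation on log2 intensities, with one row per chip and one column per probe, and it uses synthetic data in place of real GEO intensities. The function name `probe_correlation_grid` is hypothetical.

```python
import numpy as np

def probe_correlation_grid(intensities):
    """Correlate probes across chips and report each correlation times 10.

    intensities: array of shape (n_chips, n_probes) of positive raw values.
    Returns the full correlation matrix and its rounded x10 integer version.
    """
    log_int = np.log2(np.asarray(intensities, dtype=float))
    corr = np.corrcoef(log_int, rowvar=False)  # columns (probes) are variables
    grid = np.rint(corr * 10).astype(int)
    return corr, grid

# Synthetic stand-in: 11 probes sharing one signal across 200 chips.
rng = np.random.default_rng(0)
signal = rng.normal(8.0, 1.0, size=(200, 1))
noise = rng.normal(0.0, 0.5, size=(200, 11))
toy = 2.0 ** (signal + noise)

corr, grid = probe_correlation_grid(toy)
print(grid)
```

Rounding to a single digit keeps an 11 by 11 grid readable on a slide; a correlation of 0.76 becomes 8, as in the example quoted above.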


This probeset shows no coherent correlations amongst its probes.


Some probesets clearly have outliers.


Probes 1-11 all map to the same exon.

This is a different probe-set mapping to the same exon – there seems to be one outlier.


The outliers are correlated with each other!


Virtually all of the probes in the group have runs of guanines within their 25 bases.

TCCTGGACTGAGAAAGGGGGTTCCT

GAGACACACTGTACGTGGGGACCAC

GGTAGACTGGGGGTCATTTGCTTCC

There is little sequence similarity between the probes, and they are from probe-sets picking up different biology, yet they are correlated!
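Finding such probes is a simple string search. The sketch below measures the longest run of contiguous Gs in each of the three example sequences shown on the slide; the function name `longest_g_run` is hypothetical.

```python
import re

def longest_g_run(sequence):
    """Length of the longest run of contiguous G bases in a probe sequence."""
    runs = re.findall(r"G+", sequence.upper())
    return max((len(run) for run in runs), default=0)

# The three 25-base probe sequences shown on the slide.
probes = [
    "TCCTGGACTGAGAAAGGGGGTTCCT",
    "GAGACACACTGTACGTGGGGACCAC",
    "GGTAGACTGGGGGTCATTTGCTTCC",
]

run_lengths = [longest_g_run(p) for p in probes]
print(run_lengths)  # [5, 4, 5]
```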


Comparing probes with runs of Gs.

Number of contiguous Gs    Mean correlation
3                          0.14
4                          0.42
5                          0.49
6                          0.62
7                          0.75

We are only looking at a small fraction of the entire probe, yet it is dominating the effects across all experiments.


All probes within a cell share the same sequence, so a run of guanines will result in closely packed DNA with just the right properties to form G-quadruplexes.

Upton et al. 2008 BMC Genomics, 9, 613

[Figure: G-quadruplexes formed from GGGG runs]


How do we deal with known outliers such as G-quadruplexes?

What is the best way to calculate expression in the presence of outliers?


G-stacks bias which genes are reported to be clustered together within published experiments.


Kerkhoven et al. 2008, PLoS ONE 3(4): e1980

Probes containing GCCTCCC will hybridize to the primer spacer sequence that is attached to all aRNA prior to hybridization.
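Probes carrying this motif can be flagged with a literal substring search, as in the minimal sketch below. The probe names and the first sequence are made up for illustration; the second sequence is one of the G-run probes shown earlier. The sketch searches the probe sequence exactly as written and makes no assumption about strand orientation.

```python
MOTIF = "GCCTCCC"  # motif named on the slide

def contains_motif(sequence, motif=MOTIF):
    """True if the probe sequence contains the motif as a literal substring."""
    return motif in sequence.upper()

# Hypothetical probe names; "probe_a" is an invented 25-base sequence.
probes = {
    "probe_a": "ATGCCTCCCATTGACGGTACCATGA",
    "probe_b": "TCCTGGACTGAGAAAGGGGGTTCCT",
}

flagged = [name for name, seq in probes.items() if contains_motif(seq)]
print(flagged)  # ['probe_a']
```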


Log(magnitude) of averaged probe values

Colour coded by size. Note the perimeter of bright-dark pairs.

Cell (0,0) contains a probe which does not measure any biology


Corner correlations (correlations with values in cell (0,0))

Numbers are correlations times 10 (red greater than 0.8). Negative correlations appear as blanks. Filled circles indicate probes not listed in CDF file. Large circles indicate correlations greater than 0.8.


Correlations with cell (0,0)

Being in the opposite corner has not reduced the correlations of the interior row and column


What is in the sheep pens?

Entries are log(mean(Intensity))

Entries are correlation with cell (0,0)

Sheep!


Many thousands of probes are correlated with each other simply because they are adjacent to bright probes.

We believe that the focus of the scanner may be responsible – regions adjacent to bright spots will gain the same fraction of light.

A comparison of many images at different levels of blurriness will appear to indicate that dark regions adjacent to bright regions are correlated in their intensities.
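The argument can be illustrated with a toy simulation. This is a sketch under made-up assumptions, not the authors' analysis: each image has its own blur fraction, a dark cell next to a bright cell gains that fraction of its neighbour's light, and all underlying intensities are independent. Every parameter value below is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n_images = 2000

# Per-image blur: the fraction of a bright cell's light leaking into a neighbour.
blur = rng.uniform(0.0, 0.05, size=n_images)

# Two unrelated bright cells and four unrelated dark cells per image.
bright = 10000.0 * rng.lognormal(0.0, 0.1, size=(n_images, 2))
dark_true = 100.0 * rng.lognormal(0.0, 0.3, size=(n_images, 4))

# Dark cells 0 and 1 each sit next to a different bright cell; 2 and 3 do not.
near = dark_true[:, :2] + blur[:, None] * bright
far = dark_true[:, 2:]

r_near = np.corrcoef(np.log2(near[:, 0]), np.log2(near[:, 1]))[0, 1]
r_far = np.corrcoef(np.log2(far[:, 0]), np.log2(far[:, 1]))[0, 1]
print(f"dark cells next to bright cells: r = {r_near:.2f}")
print(f"dark cells with no bright neighbour: r = {r_far:.2f}")
```

In this model the two dark cells next to bright cells are correlated only because they share the per-image blur, while the isolated dark cells stay close to uncorrelated.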


A CEL file contains information about the ID of the scanner as well as the date on which the image was scanned – how does the impact of blur change over time for each scanner?

Upton and Harrison, 2010, Stat Appl Genet Mol Biol, 9(1), Article 37


How best to transform a DAT image into a CEL file? We are testing whether ideas from astronomy are applicable.

We are checking whether the temporal patterns in scanner performance for human and other organisms are related.


Bioinformatix, Genomix, Mathematix, Physix, Statistix, Transcriptomix

are needed in order to extract reliable information from Affymetrix GeneChips

Thank you for your attention.