Final VIPER presentation at BioVis 2013



Jessie Kennedy, Martin Graham Edinburgh Napier University

Trevor Paterson, Andy Law The Roslin Institute, University of Edinburgh

Visual Cleaning of Genotype Data

• VIPER is a visualisation for spotting areas of error (impossible inheritance) in pedigree genotype datasets
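To make "impossible inheritance" concrete, here is a minimal sketch of a single-marker check on one parent/parent/child trio. It is an illustration only, not VIPER's checking algorithm; the function name `is_consistent` and the convention that `None` stands for a missing genotype are assumptions.

```python
def is_consistent(child, sire, dam):
    """Return True if the child's genotype could be inherited from its parents.

    Each genotype is a pair of alleles such as ("G", "A"), or None when the
    genotype is missing. A missing genotype is treated as compatible with
    anything (an assumption made for this sketch).
    """
    if child is None:
        return True

    def can_give(parent, allele):
        # A missing parent genotype places no constraint on the allele.
        return parent is None or allele in parent

    a, b = child
    # One allele must come from the sire and the other from the dam,
    # in either order.
    return (can_give(sire, a) and can_give(dam, b)) or (
        can_give(sire, b) and can_give(dam, a)
    )


# Parents G|G and T|A can produce G|A but cannot produce T|C.
print(is_consistent(("G", "A"), ("G", "G"), ("T", "A")))  # True
print(is_consistent(("T", "C"), ("G", "G"), ("T", "A")))  # False
```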

Background

[Figure: pedigree structure with each individual's genotype for one marker (e.g. G|G, T|A, G|A, G|T, T|C); many more markers, with similar data per marker]

• The visualisation aggregated errors across markers and displayed them as offspring groups
  – Along with ancillary tables and bar charts

• For it to be a useful biological tool, it needed to be extended to become a data cleaning application

Background

• Data Wrangling
  – Fixing unreliable or useless data
  – General Purpose vs Specific Task

• General Purpose Tools
  – Wrangler / Google Refine
  – Tabular data

• Ours is a Specific Task
  – Remove the errors as they break further analyses
  – Fixing errors often creates new ones as our data is an inheritance graph of related data rather than a table

Background

• Error Visualisation Topics (in order of volume of work)
  – Uncertainty visualisation: show bounds of reliability
  – Missing data visualisation: is data present?
      • Usually the bane of visualisation rather than the aim
  – Correctness visualisation: is data right?

Background

• We cover missing data and correctness. For us...
  – Incorrect data: bad
  – Missing (incomplete) data: manageable

• Cleaning ≠ Correcting
  – Correction is preferable, but often impossible

• We clean by deleting erroneous data points and inferring data from ancestor individuals
  – We swap wrong data for missing data
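Swapping wrong data for missing data can be sketched as a mask laid over the recorded genotypes. The names `trio_ok` and `apply_mask` are hypothetical, and the rule that a missing genotype is compatible with anything is an assumption of this sketch, not a description of VIPER's internals.

```python
def trio_ok(child, sire, dam):
    """True if the child genotype is compatible with both parents.

    None stands for a missing or masked genotype and is treated as
    compatible with anything.
    """
    if child is None:
        return True

    def gives(parent, allele):
        return parent is None or allele in parent

    a, b = child
    return (gives(sire, a) and gives(dam, b)) or (gives(sire, b) and gives(dam, a))


def apply_mask(genotypes, masked):
    """Return a view in which masked individuals read as missing.

    The recorded genotypes are left untouched, so a mask can be lifted later.
    """
    return {ind: (None if ind in masked else g) for ind, g in genotypes.items()}


recorded = {"sire": ("G", "G"), "dam": ("T", "A"), "child": ("T", "C")}

before = trio_ok(recorded["child"], recorded["sire"], recorded["dam"])
cleaned = apply_mask(recorded, {"child"})
after = trio_ok(cleaned["child"], cleaned["sire"], cleaned["dam"])

print(before, after)  # False True: the wrong data point became a missing one
```

Keeping the mask separate from the recorded data is a design choice of the sketch: it makes every cleaning step reversible.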

Data Cleaning

• Four basic masking operations

Data Cleaning - Operations

1. Mask markers

2. Mask individuals

3. Mask single data points

4. Break relationships

• Markers are independent of each other
  – Masking one marker doesn’t change the errors in any other markers

• Thus markers with lots of errors can be quickly removed with no side-effect
  – Early version in VIPER hid errors (but didn’t do anything to the underlying data)
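The independence of markers can be shown with a small sketch: errors are counted per marker, markers above a threshold are masked, and the counts for the remaining markers stay the same. The data, the marker names and the threshold are invented for illustration.

```python
def trio_ok(child, sire, dam):
    """True if the child genotype is compatible with both parents (None = missing)."""
    if child is None:
        return True

    def gives(parent, allele):
        return parent is None or allele in parent

    a, b = child
    return (gives(sire, a) and gives(dam, b)) or (gives(sire, b) and gives(dam, a))


def count_errors(genotypes, trios):
    """Number of inconsistent (child, sire, dam) trios for one marker."""
    return sum(
        not trio_ok(genotypes.get(c), genotypes.get(s), genotypes.get(d))
        for c, s, d in trios
    )


trios = [("kid1", "sire", "dam"), ("kid2", "sire", "dam")]

# marker -> individual -> genotype (made-up values)
markers = {
    "m1": {"sire": ("A", "A"), "dam": ("A", "G"), "kid1": ("A", "G"), "kid2": ("A", "A")},
    "m2": {"sire": ("C", "C"), "dam": ("C", "C"), "kid1": ("T", "T"), "kid2": ("C", "T")},
    "m3": {"sire": ("G", "T"), "dam": ("G", "G"), "kid1": ("T", "T"), "kid2": ("G", "T")},
}

counts = {m: count_errors(g, trios) for m, g in markers.items()}

# Mask every marker with more than one error, then recount the rest.
threshold = 1
kept = {m: g for m, g in markers.items() if counts[m] <= threshold}
recount = {m: count_errors(g, trios) for m, g in kept.items()}

print(counts)   # {'m1': 0, 'm2': 2, 'm3': 1}
print(recount)  # {'m1': 0, 'm3': 1}
```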

Data Cleaning - Markers

• Wanted to adopt the same approach for individuals...
  – But something odd happened
  – Removing individuals changes the error counts of other individuals
      • Because individuals inherit from each other
      • So e.g. removing every individual with > 5 errors produced new individuals with > 5 errors

Data Cleaning - Individuals

• Some errors turned out to simply drop from one generation to the next
  – Literal “chase to the bottom”, lots of lost data

• In these situations it is often necessary to break a child/parent relationship across all markers in the pedigree
  – Which is where the fourth masking operation originates
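The way an error drops a generation, and why cutting a link can be cheaper than masking, can be sketched on a five-individual pedigree. Everything here is an assumption made for illustration: the inference rule (a masked individual may carry any allele its own parents could pass on), the attribution of an error to the child of a trio, and the names `passable` and `find_errors`. It is not VIPER's checking algorithm.

```python
def passable(ind, genotypes, parents):
    """Alleles `ind` could pass on, or None when unconstrained."""
    if ind is None:
        return None
    genotype = genotypes.get(ind)
    if genotype is not None:
        return set(genotype)
    if ind not in parents:
        return None  # masked founder: nothing to infer from
    sire, dam = parents[ind]
    from_sire = passable(sire, genotypes, parents)
    from_dam = passable(dam, genotypes, parents)
    if from_sire is None or from_dam is None:
        return None
    # Masked individual: infer its possible alleles from its ancestors.
    return from_sire | from_dam


def find_errors(genotypes, parents):
    """Individuals whose genotype cannot be inherited from their parents."""
    flagged = set()
    for child, (sire, dam) in parents.items():
        genotype = genotypes.get(child)
        if genotype is None:
            continue
        from_sire = passable(sire, genotypes, parents)
        from_dam = passable(dam, genotypes, parents)

        def ok(allele, pool):
            return pool is None or allele in pool

        a, b = genotype
        if not ((ok(a, from_sire) and ok(b, from_dam))
                or (ok(b, from_sire) and ok(a, from_dam))):
            flagged.add(child)
    return flagged


# GS x GD -> P, then P x M -> C. P does not match its recorded parents.
genotypes = {"GS": ("A", "A"), "GD": ("A", "A"),
             "P": ("G", "G"), "M": ("G", "G"), "C": ("G", "G")}
parents = {"P": ("GS", "GD"), "C": ("P", "M")}

original = find_errors(genotypes, parents)

# Masking P: its alleles are now inferred from GS and GD, so C becomes the error.
after_mask = find_errors(dict(genotypes, P=None), parents)

# Cutting the link between P and its recorded parents keeps all genotypes.
after_cut = find_errors(genotypes, {"C": ("P", "M")})

print(original, after_mask, after_cut)  # {'P'} {'C'} set()
```

In this toy case masking chases the error down to C and would cost two data points, while breaking the relationship costs none.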

Data Cleaning - Individuals

www.napier.ac.uk/iidi

Masking - 1

[Figure: example pedigree showing one genotype per individual, e.g. A/G, G/T, C/G, C/C]

• Mask all errors
• Recheck for errors
• Repeat

Lose 50% of data


Masking - 2

[Figure: example pedigree showing one genotype per individual, e.g. A/G, G/T, C/G, C/C]

• Mask errors top down
• Recheck for errors
• Repeat

Lose 25% of data


Masking - 3

[Figure: example pedigree showing one genotype per individual, e.g. A/G, G/T, C/G, C/C]

• Mask errors top down + cut links
• Recheck for errors
• Repeat

Lose <20% of data

• Masked and missing data are shown in a different colour to error data

Showing Missing

• Being careful not to use any other colours in the interface, we can see how cleaning is going (red vs blue)

• New masking interactions available through standard context menus (and through tables)

Representations

• With such a hypothetical / experimental method of cleaning errors, undo is a must
  – Part of Shneiderman’s mantra
  – Beyond single-step, branching history
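A branching history can be sketched as a tree of cleaning states: undo moves to the parent state, and a new action taken after an undo starts a new branch instead of discarding the old one. The class and method names are hypothetical and the state is reduced to a set of masked items; how VIPER stores its history is not described in the slides.

```python
class HistoryNode:
    """One cleaning state: the set of masked items after an action."""

    def __init__(self, masked, label, parent=None):
        self.masked = frozenset(masked)
        self.label = label
        self.parent = parent
        self.children = []


class History:
    """Tree-shaped undo history; earlier branches are never thrown away."""

    def __init__(self):
        self.root = HistoryNode(frozenset(), "start")
        self.current = self.root

    def apply(self, label, to_mask):
        """Record a masking action as a child of the current state."""
        node = HistoryNode(self.current.masked | set(to_mask), label, self.current)
        self.current.children.append(node)
        self.current = node
        return node

    def undo(self):
        """Step back to the parent state, if there is one."""
        if self.current.parent is not None:
            self.current = self.current.parent

    def jump(self, node):
        """Return to any previously visited state."""
        self.current = node


history = History()
first = history.apply("mask marker m2", {"m2"})
history.undo()
second = history.apply("mask individual P", {"P"})

# Both attempts survive as separate branches under the start state.
print([child.label for child in history.root.children])
history.jump(first)
print(sorted(history.current.masked))
```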

Visual History

Final Interface

• Genotype Checker vs VIPER+ interfaces

• Both run using the same underlying data checking algorithm

• Same dataset

• 11 Biologists/Geneticists/Bioinformaticians at The Roslin Institute

• Asked them to attempt a pair of representative tasks with both interfaces (split into 12 questions)

Experiment

Experiment - Objective

• Over the whole question set there was no objective difference, but one did emerge when we considered questions that involved pedigree exploration

[Chart: horizontal axis 1 to 11, vertical axis 0 to 12; series: Genotype Checker, VIPER]

Experiment - Objective

• Over the whole question set there was no objective difference, but one did emerge when we considered questions that involved pedigree exploration

[Chart: horizontal axis 1 to 11, vertical axis 0 to 8; series: Genotype Checker, VIPER]

Experiment - Subjective

Key: 1 = Strongly prefer VIPER (VP), 3 = No preference, 5 = Strongly prefer Genotype Checker (GC). The median was shown in bold on the slide.

Question | 1 (VP) | 2 | 3 (No Pref) | 4 | 5 (GC)
Finding structural information on a pedigree | 7 | 1 | 2 | 1 | 0
Finding descendants of an individual | 8 | 2 | 0 | 1 | 0
Finding ancestors of an individual | 7 | 3 | 1 | 0 | 0
Finding error information on a single individual | 4 | 1 | 1 | 4 | 1
Finding error information on a single marker | 3 | 3 | 2 | 3 | 0
Distinguishing between different types of error | 7 | 2 | 2 | 0 | 0
Tracing errors to a shared parent | 8 | 0 | 2 | 1 | 0
Finding error information on a single family | 7 | 1 | 2 | 1 | 0
Comparing errors between related families (one shared parent) | 8 | 1 | 1 | 1 | 0
Masking errors | 1 | 2 | 4 | 3 | 1
Overall understanding of errors | 5 | 1 | 4 | 1 | 0
Overall ease of use | 5 | 2 | 3 | 0 | 1
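The median for each question can be recomputed from the response counts. The sketch below assumes each row lists the number of participants who gave scores 1 to 5, in that order; the function name `likert_median` is hypothetical, and for an even number of responses it returns the lower of the two middle scores.

```python
def likert_median(counts):
    """Median score given counts[i] = number of responses with score i + 1."""
    total = sum(counts)
    middle = (total + 1) // 2  # position of the middle response
    running = 0
    for score, count in enumerate(counts, start=1):
        running += count
        if running >= middle:
            return score
    raise ValueError("no responses")


rows = {
    "Finding structural information on a pedigree": [7, 1, 2, 1, 0],
    "Finding error information on a single individual": [4, 1, 1, 4, 1],
    "Finding error information on a single marker": [3, 3, 2, 3, 0],
    "Masking errors": [1, 2, 4, 3, 1],
}

for question, counts in rows.items():
    print(question, likert_median(counts))
```

On these rows the pedigree question has a median of 1 (strong preference for VIPER), while masking errors sits at 3 (no preference).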

• A lot of incorrect/skipped answers in both scenarios
  – GC 61/132 = 46%
  – VP 45/132 = 34%

• These users were occasional users of cleaning software but it does show that Pedigree Cleaning is hard

• Excelitis – Biologists love Excel. The first move of many was to investigate the tables of error info rather than the main pedigree visualisation

Experiment - Observations

• Thanks for listening

• Sponsored by BBSRC

• http://www.bioinformatics.roslin.ed.ac.uk/viper/

End