Heatmap best practices
![Page 1: Heatmap best practices](https://reader033.fdocuments.net/reader033/viewer/2022052413/559b0e761a28abb6638b4825/html5/thumbnails/1.jpg)
Alex Priem (@_alex_priem_), Edwin de Jonge (@edwindjonge)
Strata, 21 Nov 2014, Barcelona
Patterns and meta patterns in Income Tax Data
![Page 2: Heatmap best practices](https://reader033.fdocuments.net/reader033/viewer/2022052413/559b0e761a28abb6638b4825/html5/thumbnails/2.jpg)
Age vs mortgage debt (men)
![Page 3: Heatmap best practices](https://reader033.fdocuments.net/reader033/viewer/2022052413/559b0e761a28abb6638b4825/html5/thumbnails/3.jpg)
Who are we?
Statistical consultants / data scientists working at the R&D department of Statistics Netherlands
Statistics Netherlands (SN):
- Government agency
- Produces all official statistics of the Netherlands
![Page 4: Heatmap best practices](https://reader033.fdocuments.net/reader033/viewer/2022052413/559b0e761a28abb6638b4825/html5/thumbnails/4.jpg)
Income statistics based on Tax data
![Page 5: Heatmap best practices](https://reader033.fdocuments.net/reader033/viewer/2022052413/559b0e761a28abb6638b4825/html5/thumbnails/5.jpg)
Income Tax data
– Contains all income tax records for the Netherlands
– Approx 17M records with 550 variables.
– Used to produce income statistics!
Analysis is not trivial
– Income Tax is complex (at least in the Netherlands)
‐ stages of progressive tax
‐ Complex Tax deductions (mortgage, flex workers)
‐ Complex Tax benefits (child care, social benefits)
![Page 6: Heatmap best practices](https://reader033.fdocuments.net/reader033/viewer/2022052413/559b0e761a28abb6638b4825/html5/thumbnails/6.jpg)
Tax data (2)
- 550 variables (for each person in NL):
  - 15 identifiers/unique keys
    - Dwelling, person id, etc.
  - 70 categorical variables
  - 250 numerical variables from the income tax form
  - >200 derived variables (useful for analysis)
    - E.g. expendable income, income of dwelling/household
![Page 7: Heatmap best practices](https://reader033.fdocuments.net/reader033/viewer/2022052413/559b0e761a28abb6638b4825/html5/thumbnails/7.jpg)
Income/tax distributions
Income (re)distribution hot topic since Piketty
So how are income/tax/benefits distributed?
- Look at 1D distributions: histograms
- Look at 2D distributions: heatmaps
- Problem: potentially 0.5 n(n-1) > 100k heatmaps!
- even more when categorical variables are included
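As a rough check on that count, the sketch below evaluates the number of unordered variable pairs, n(n-1)/2. The choice n = 450 (roughly the 250 numerical plus the 200+ derived variables from the breakdown on the earlier slide) is an assumption made here for illustration; counting all 550 variables gives a larger figure.

```python
from math import comb

def n_heatmaps(n_variables: int) -> int:
    """Number of unordered pairs of variables: n * (n - 1) / 2."""
    return comb(n_variables, 2)

# Assumed variable counts, read off the variable breakdown in the slides.
numerical_only = n_heatmaps(450)   # 250 numerical + about 200 derived
all_variables = n_heatmaps(550)    # every variable, keys included

print(numerical_only)  # 101025
print(all_variables)   # 150975
```

Either way the number of candidate heatmaps is far too large to inspect by hand, which motivates the pruning and clustering discussed later.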
![Page 8: Heatmap best practices](https://reader033.fdocuments.net/reader033/viewer/2022052413/559b0e761a28abb6638b4825/html5/thumbnails/8.jpg)
Let's look at patterns…
![Page 9: Heatmap best practices](https://reader033.fdocuments.net/reader033/viewer/2022052413/559b0e761a28abb6638b4825/html5/thumbnails/9.jpg)
Heatmap Patterns
– What defines a pattern in a heatmap?
‐ Peak/Spike? (mode, 0D point)
‐ Stripe (1D):
• Horizontal Line?
• Vertical Line?
• Band?
• Ridge?
‐ Blob (2D)
‐ Similarity between distributions (2D)
![Page 10: Heatmap best practices](https://reader033.fdocuments.net/reader033/viewer/2022052413/559b0e761a28abb6638b4825/html5/thumbnails/10.jpg)
Meta pattern?
Meta patterns consist of patterns that repeat in:
‐ different subpopulations
• E.g. male/female, socioeconomic status, branch of industry a person works in
‐ different pairs of variables
• Income x age
• Benefits x age
• Etc.
So: patterns that are generic across different heatmaps.
![Page 11: Heatmap best practices](https://reader033.fdocuments.net/reader033/viewer/2022052413/559b0e761a28abb6638b4825/html5/thumbnails/11.jpg)
Looking for patterns
Subpopulations:
– Generate a heatmap per category, e.g. Age x Gross Income per socioeconomic status
– Automatically cluster heatmaps on distribution similarity
Pairs of variables:
- Generate heatmaps for all pairs
- Prune: remove heatmaps with low support
1. Use image classification to cluster them
2. Or cluster on extracted mode/line (work in progress)
You will still need to look at the result!
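One possible way to prepare heatmaps for clustering is sketched below: heatmaps with low support are dropped, the rest are normalized to sum to 1, and a pairwise distance matrix is computed. The function names, the `min_support` threshold, and the use of total variation distance are illustrative assumptions, not the method used in the talk. The resulting matrix could be passed to any standard clustering routine.

```python
import numpy as np

def prune_low_support(heatmaps, min_support):
    """Keep only heatmaps (2D count arrays) with at least min_support records."""
    return {name: h for name, h in heatmaps.items() if h.sum() >= min_support}

def distance_matrix(heatmaps):
    """Pairwise total variation distance between normalized heatmaps.

    All heatmaps must share the same shape. Returns (names, D) where
    D[i, j] lies in [0, 1]: 0 for identical distributions, 1 for disjoint ones.
    """
    names = sorted(heatmaps)
    # Normalize counts to probabilities and flatten each heatmap to a vector.
    probs = np.array([(heatmaps[n] / heatmaps[n].sum()).ravel() for n in names])
    # Total variation distance is half the L1 distance between distributions.
    diffs = np.abs(probs[:, None, :] - probs[None, :, :]).sum(axis=2)
    return names, 0.5 * diffs

# Tiny hypothetical example with 2 x 2 heatmaps.
maps = {
    "a": np.array([[10, 0], [0, 10]]),
    "b": np.array([[100, 0], [0, 100]]),   # same pattern as "a", more records
    "c": np.array([[0, 50], [50, 0]]),     # mass in the other two bins
    "d": np.array([[1, 0], [0, 0]]),       # too few records to be reliable
}
kept = prune_low_support(maps, min_support=5)
names, dist = distance_matrix(kept)
print(names)
print(dist)
```

Normalizing first means two subpopulations of very different size can still be recognized as having the same distribution, as "a" and "b" do here.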
![Page 12: Heatmap best practices](https://reader033.fdocuments.net/reader033/viewer/2022052413/559b0e761a28abb6638b4825/html5/thumbnails/12.jpg)
Why Visualization?
![Page 13: Heatmap best practices](https://reader033.fdocuments.net/reader033/viewer/2022052413/559b0e761a28abb6638b4825/html5/thumbnails/13.jpg)
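Anscombe's quartet…

| DS1 x | DS1 y | DS2 x | DS2 y | DS3 x | DS3 y | DS4 x | DS4 y |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 10 | 8.04 | 10 | 9.14 | 10 | 7.46 | 8 | 6.58 |
| 8 | 6.95 | 8 | 8.14 | 8 | 6.77 | 8 | 5.76 |
| 13 | 7.58 | 13 | 8.74 | 13 | 12.74 | 8 | 7.71 |
| 9 | 8.81 | 9 | 8.77 | 9 | 7.11 | 8 | 8.84 |
| 11 | 8.33 | 11 | 9.26 | 11 | 7.81 | 8 | 8.47 |
| 14 | 9.96 | 14 | 8.1 | 14 | 8.84 | 8 | 7.04 |
| 6 | 7.24 | 6 | 6.13 | 6 | 6.08 | 8 | 5.25 |
| 4 | 4.26 | 4 | 3.1 | 4 | 5.39 | 19 | 12.5 |
| 12 | 10.84 | 12 | 9.13 | 12 | 8.15 | 8 | 5.56 |
| 7 | 4.82 | 7 | 7.26 | 7 | 6.42 | 8 | 7.91 |
| 5 | 5.68 | 5 | 4.74 | 5 | 5.73 | 8 | 6.89 |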
![Page 14: Heatmap best practices](https://reader033.fdocuments.net/reader033/viewer/2022052413/559b0e761a28abb6638b4825/html5/thumbnails/14.jpg)
Anscombe’s quartet

| Property | Value |
| --- | --- |
| Mean of x1, x2, x3, x4 | All equal: 9 |
| Variance of x1, x2, x3, x4 | All equal: 11 |
| Mean of y1, y2, y3, y4 | All equal: 7.50 |
| Variance of y1, y2, y3, y4 | All equal: 4.1 |
| Correlation for ds1, ds2, ds3, ds4 | All equal: 0.816 |
| Linear regression for ds1, ds2, ds3, ds4 | All equal: y = 3.00 + 0.500x |

Looks the same, right?
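The summary statistics can be recomputed directly from the four datasets. The sketch below uses only the Python standard library; the helper name `summarize` is ours, and the variance is the sample variance (divisor n - 1).

```python
from statistics import mean, variance

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "DS1": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "DS2": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "DS3": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "DS4": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def summarize(x, y):
    """Mean and sample variance of x and y, correlation, least-squares line."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    slope = sxy / sxx
    return {
        "mean_x": mx, "var_x": variance(x),
        "mean_y": my, "var_y": variance(y),
        "corr": sxy / (sxx * syy) ** 0.5,
        "intercept": my - slope * mx, "slope": slope,
    }

for name, (x, y) in quartet.items():
    stats = summarize(x, y)
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

All four rows print (nearly) the same numbers, which is exactly why the summary table alone cannot distinguish the datasets.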
![Page 15: Heatmap best practices](https://reader033.fdocuments.net/reader033/viewer/2022052413/559b0e761a28abb6638b4825/html5/thumbnails/15.jpg)
Let's plot!
![Page 16: Heatmap best practices](https://reader033.fdocuments.net/reader033/viewer/2022052413/559b0e761a28abb6638b4825/html5/thumbnails/16.jpg)
Machine learning
So, is clustering (machine learning) different?
![Page 17: Heatmap best practices](https://reader033.fdocuments.net/reader033/viewer/2022052413/559b0e761a28abb6638b4825/html5/thumbnails/17.jpg)
![Page 18: Heatmap best practices](https://reader033.fdocuments.net/reader033/viewer/2022052413/559b0e761a28abb6638b4825/html5/thumbnails/18.jpg)
Visualization helps to …
– Test your (hidden model) assumptions!
– Find structure in data, e.g. “How is my data distributed?”
– Visually explore patterns:
‐ Are there clusters?
‐ Are there outliers?
![Page 19: Heatmap best practices](https://reader033.fdocuments.net/reader033/viewer/2022052413/559b0e761a28abb6638b4825/html5/thumbnails/19.jpg)
Heatmap recipe
![Page 20: Heatmap best practices](https://reader033.fdocuments.net/reader033/viewer/2022052413/559b0e761a28abb6638b4825/html5/thumbnails/20.jpg)
1. Take two numerical variables x and y
2. Determine range $r_x = [\min(x), \max(x)]$
3. Chop $r_x$ into $n_x$ equal pieces
4. Repeat for y
5. We now have $n_x \cdot n_y$ bins
6. Count # records in each bin
7. Assign colors to counts
8. Plot matrix
9. Enjoy!
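The recipe can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names, the gray-level mapping in step 7, and the synthetic data are assumptions made here.

```python
import numpy as np

def heatmap_counts(x, y, nx, ny):
    """Follow the recipe: equal-width bins over [min, max], then count records."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Steps 2-4: range per axis, chopped into nx (ny) equal pieces.
    x_edges = np.linspace(x.min(), x.max(), nx + 1)
    y_edges = np.linspace(y.min(), y.max(), ny + 1)
    # Steps 5-6: nx * ny bins, count the records in each bin.
    counts, _, _ = np.histogram2d(x, y, bins=[x_edges, y_edges])
    return counts, x_edges, y_edges

def to_gray(counts):
    """Step 7: map counts linearly to gray levels 0..255 (0 = empty bin)."""
    peak = counts.max()
    if peak == 0:
        return np.zeros(counts.shape, dtype=np.uint8)
    return np.round(255 * counts / peak).astype(np.uint8)

rng = np.random.default_rng(0)
age = rng.uniform(18, 90, size=10_000)                    # synthetic stand-in data
income = rng.lognormal(mean=10, sigma=0.5, size=10_000)   # synthetic stand-in data

counts, x_edges, y_edges = heatmap_counts(age, income, nx=50, ny=50)
# Step 8 would hand this matrix to a plotting library. Note that counts[i, j]
# has x along the first axis, so most image routines need counts.T.
image = to_gray(counts)
```

Each of the remaining slides revisits one of these steps and shows where the naive version goes wrong.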
![Page 21: Heatmap best practices](https://reader033.fdocuments.net/reader033/viewer/2022052413/559b0e761a28abb6638b4825/html5/thumbnails/21.jpg)
Easy as pie?
Best practices and problems with heatmaps:
- Resolution
- Rescaling
- Zooming
- Outliers
- Color scales
![Page 22: Heatmap best practices](https://reader033.fdocuments.net/reader033/viewer/2022052413/559b0e761a28abb6638b4825/html5/thumbnails/22.jpg)
![Page 23: Heatmap best practices](https://reader033.fdocuments.net/reader033/viewer/2022052413/559b0e761a28abb6638b4825/html5/thumbnails/23.jpg)
Recipe, step 2: Determine range $r_x = [\min(x), \max(x)]$
![Page 24: Heatmap best practices](https://reader033.fdocuments.net/reader033/viewer/2022052413/559b0e761a28abb6638b4825/html5/thumbnails/24.jpg)
Range: Outliers? (1D)
[Figure: histogram of gross income, axis from -1M€ to +5M€]
![Page 25: Heatmap best practices](https://reader033.fdocuments.net/reader033/viewer/2022052413/559b0e761a28abb6638b4825/html5/thumbnails/25.jpg)
Range: outliers removed (1% removed)
[Figure: histogram of gross income, axis up to +150k€]
![Page 26: Heatmap best practices](https://reader033.fdocuments.net/reader033/viewer/2022052413/559b0e761a28abb6638b4825/html5/thumbnails/26.jpg)
Range: outliers…
Does your data contain outliers?
- If so: most pixels are empty
- In most cases, outliers have low mass and are barely visible
Truncate the range in the x or y direction, e.g. at the 99% quantile
- Interactively: allow for zoom and pan.
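Truncating at a quantile could look like the sketch below. The helper name `truncate_range` and the synthetic income data are assumptions for illustration; for a heatmap the same mask would be computed for x and y and the two masks combined.

```python
import numpy as np

def truncate_range(values, lower_q=0.0, upper_q=0.99):
    """Return (lo, hi) limits at the given quantiles and a mask of kept records."""
    values = np.asarray(values, dtype=float)
    lo, hi = np.quantile(values, [lower_q, upper_q])
    keep = (values >= lo) & (values <= hi)
    return (lo, hi), keep

rng = np.random.default_rng(1)
income = rng.lognormal(mean=10, sigma=1.0, size=100_000)  # synthetic, heavy right tail

(lo, hi), keep = truncate_range(income, upper_q=0.99)
print(f"full max: {income.max():.0f}, truncated max: {hi:.0f}")
print(f"records kept: {keep.mean():.1%}")
```

Dropping only 1% of the records shrinks the plotted range considerably, so far fewer pixels are spent on nearly empty space.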
![Page 27: Heatmap best practices](https://reader033.fdocuments.net/reader033/viewer/2022052413/559b0e761a28abb6638b4825/html5/thumbnails/27.jpg)
Range: data skewed?
– Many variables are not normally distributed:
‐ Power law: $x^{\alpha}$
‐ Exponential: $e^{ax+b}$
So rescale x or y or both
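One common rescaling is to bin on a logarithmic axis. The sketch below builds bin edges that are equally spaced in log10(x) and compares them with equal-width bins. The helper `log_bin_edges` and the synthetic Pareto-distributed data are assumptions for illustration, and the approach only applies to strictly positive values.

```python
import numpy as np

def log_bin_edges(values, n_bins):
    """Bin edges equally spaced in log10(x); requires strictly positive x."""
    values = np.asarray(values, dtype=float)
    if np.any(values <= 0):
        raise ValueError("log rescaling needs strictly positive values")
    edges = np.logspace(np.log10(values.min()), np.log10(values.max()), n_bins + 1)
    # Pin the outer edges to the data range to guard against rounding.
    edges[0], edges[-1] = values.min(), values.max()
    return edges

rng = np.random.default_rng(2)
x = rng.pareto(a=2.0, size=50_000) + 1.0   # synthetic power-law-like data, x >= 1

linear_counts, _ = np.histogram(x, bins=20)
log_counts, _ = np.histogram(x, bins=log_bin_edges(x, 20))

# With equal-width bins almost every record lands in the first bin;
# log-spaced bins distribute the records far more evenly.
print(f"largest bin, linear: {linear_counts.max() / x.size:.1%}")
print(f"largest bin, log:    {log_counts.max() / x.size:.1%}")
```

Variables that take zero or negative values (income can be negative) need a different transform or a separate treatment of those records.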
![Page 28: Heatmap best practices](https://reader033.fdocuments.net/reader033/viewer/2022052413/559b0e761a28abb6638b4825/html5/thumbnails/28.jpg)
Recipe, step 3: Chop $r_x$ into $n_x$ equal pieces
![Page 29: Heatmap best practices](https://reader033.fdocuments.net/reader033/viewer/2022052413/559b0e761a28abb6638b4825/html5/thumbnails/29.jpg)
Chop: AKA “Binning”
![Page 30: Heatmap best practices](https://reader033.fdocuments.net/reader033/viewer/2022052413/559b0e761a28abb6638b4825/html5/thumbnails/30.jpg)
Chop: resolution
Resolution matters
![Page 31: Heatmap best practices](https://reader033.fdocuments.net/reader033/viewer/2022052413/559b0e761a28abb6638b4825/html5/thumbnails/31.jpg)
25 x 25
![Page 32: Heatmap best practices](https://reader033.fdocuments.net/reader033/viewer/2022052413/559b0e761a28abb6638b4825/html5/thumbnails/32.jpg)
50 x 50
![Page 33: Heatmap best practices](https://reader033.fdocuments.net/reader033/viewer/2022052413/559b0e761a28abb6638b4825/html5/thumbnails/33.jpg)
100 x 100
![Page 34: Heatmap best practices](https://reader033.fdocuments.net/reader033/viewer/2022052413/559b0e761a28abb6638b4825/html5/thumbnails/34.jpg)
250 x 250
![Page 35: Heatmap best practices](https://reader033.fdocuments.net/reader033/viewer/2022052413/559b0e761a28abb6638b4825/html5/thumbnails/35.jpg)
500 x 500
![Page 36: Heatmap best practices](https://reader033.fdocuments.net/reader033/viewer/2022052413/559b0e761a28abb6638b4825/html5/thumbnails/36.jpg)
Chop: Too small / Too big
If there are too few bins:
- patterns are hidden
If there are too many bins:
- heatmap is noisy (signal vs noise)
The optimal number of bins depends on the data (it can be approximated with kernel-based methods), but always play with bin size / resolution!
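A quick way to experiment with resolution is to bin the same data at several grid sizes and look at how thinly the records get spread. The sketch below does this for the resolutions shown on the preceding slides; the two diagnostics (fraction of occupied bins, mean count per occupied bin) and the synthetic data are assumptions chosen for illustration, not a rule from the talk.

```python
import numpy as np

def resolution_report(x, y, resolutions=(25, 50, 100, 250, 500)):
    """For each n x n grid, report occupancy and mean count per occupied bin."""
    report = {}
    for n in resolutions:
        counts, _, _ = np.histogram2d(x, y, bins=n)
        occupied = counts > 0
        report[n] = {
            "occupied_fraction": float(occupied.mean()),
            "mean_count_occupied": float(counts[occupied].mean()),
        }
    return report

rng = np.random.default_rng(3)
x = rng.normal(size=20_000)               # synthetic stand-in data
y = 0.5 * x + rng.normal(size=20_000)

report = resolution_report(x, y)
for n, stats in report.items():
    print(n, f"{stats['occupied_fraction']:.2f}", f"{stats['mean_count_occupied']:.1f}")
```

When the mean count per occupied bin approaches 1, most of what the heatmap shows is sampling noise, which suggests the grid is too fine for the amount of data.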
![Page 37: Heatmap best practices](https://reader033.fdocuments.net/reader033/viewer/2022052413/559b0e761a28abb6638b4825/html5/thumbnails/37.jpg)
Chop: integers…
![Page 38: Heatmap best practices](https://reader033.fdocuments.net/reader033/viewer/2022052413/559b0e761a28abb6638b4825/html5/thumbnails/38.jpg)
Recipe, step 6: Count # records in each bin
![Page 39: Heatmap best practices](https://reader033.fdocuments.net/reader033/viewer/2022052413/559b0e761a28abb6638b4825/html5/thumbnails/39.jpg)
Count: zero counts
Not every variable is relevant for each person!
![Page 40: Heatmap best practices](https://reader033.fdocuments.net/reader033/viewer/2022052413/559b0e761a28abb6638b4825/html5/thumbnails/40.jpg)
Count: exclude zero values
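Excluding zero values amounts to a simple mask applied before binning. In the sketch below the helper name `drop_zero_records` and the synthetic age and mortgage-debt data are assumptions for illustration.

```python
import numpy as np

def drop_zero_records(x, y):
    """Keep only records where both variables are non-zero."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    keep = (x != 0) & (y != 0)
    return x[keep], y[keep]

# Synthetic example: most people have no mortgage debt at all.
rng = np.random.default_rng(4)
age = rng.uniform(18, 90, size=10_000)
has_mortgage = rng.random(10_000) < 0.3
debt = np.where(has_mortgage, rng.uniform(10_000, 400_000, size=10_000), 0.0)

age_nz, debt_nz = drop_zero_records(age, debt)
print(f"records with zero debt removed: {age.size - age_nz.size}")
```

Without the mask, the row of bins at zero debt would hold most of the records and dominate the color scale.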
![Page 41: Heatmap best practices](https://reader033.fdocuments.net/reader033/viewer/2022052413/559b0e761a28abb6638b4825/html5/thumbnails/41.jpg)
Assign colors!
![Page 42: Heatmap best practices](https://reader033.fdocuments.net/reader033/viewer/2022052413/559b0e761a28abb6638b4825/html5/thumbnails/42.jpg)
Recipe, step 7: Assign colors to counts
![Page 43: Heatmap best practices](https://reader033.fdocuments.net/reader033/viewer/2022052413/559b0e761a28abb6638b4825/html5/thumbnails/43.jpg)
![Page 44: Heatmap best practices](https://reader033.fdocuments.net/reader033/viewer/2022052413/559b0e761a28abb6638b4825/html5/thumbnails/44.jpg)
Colors: scales
– Color ‘intensity’ implies value
– Perceived response depends on ‘color’ and ‘color lightness’ (compare #00ff00 with #0000ff)
– Different models for color response:
‐ RGB (models computer monitor)
‐ HSV
‐ HCL
‐ CIELAB (models human eye)
– Gradient generator:
http://davidjohnstone.net/pages/lch-lab-colour-gradient-picker
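The difference between #00ff00 and #0000ff can be quantified with the CIELAB lightness L*. The sketch below converts an sRGB hex color to relative luminance and then to L*, using the standard sRGB transfer curve and D65 luminance weights; the function names are ours.

```python
def srgb_to_linear(c: float) -> float:
    """Undo the sRGB transfer curve for one channel in [0, 1]."""
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

def cielab_lightness(hex_color: str) -> float:
    """CIELAB L* (0 = black, 100 = white) of an sRGB color such as '#00ff00'."""
    r, g, b = (int(hex_color[i:i + 2], 16) / 255 for i in (1, 3, 5))
    # Relative luminance Y for sRGB primaries with a D65 white point.
    y = (0.2126 * srgb_to_linear(r)
         + 0.7152 * srgb_to_linear(g)
         + 0.0722 * srgb_to_linear(b))
    # CIE lightness function, with its linear segment for very dark colors.
    if y > (6 / 29) ** 3:
        return 116 * y ** (1 / 3) - 16
    return (29 / 3) ** 3 * y

print(round(cielab_lightness("#00ff00"), 1))  # 87.7
print(round(cielab_lightness("#0000ff"), 1))  # 32.3
```

Both colors are at full intensity in RGB terms, yet pure green is almost three times as light as pure blue, so a gradient that is linear in RGB is not linear to the eye.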
![Page 45: Heatmap best practices](https://reader033.fdocuments.net/reader033/viewer/2022052413/559b0e761a28abb6638b4825/html5/thumbnails/45.jpg)
Colors
– Color has two functions in a heatmap:
‐ Show ‘counts’ in your data
‐ Show ‘patterns’
At least, use a perceptually uniform gradient
- Libs: chroma.js, colorbrewer (R)
…but patterns need distinct colors
![Page 46: Heatmap best practices](https://reader033.fdocuments.net/reader033/viewer/2022052413/559b0e761a28abb6638b4825/html5/thumbnails/46.jpg)
Color scales
– Range of color scale depends on distribution of data.
– Often have multiple populations/distributions in data
– Severe spikes/stripes drown the smaller distributions:
‐ We suggest log scale
‐ Sometimes log scale is not enough
– In practice, a linear scale with a low maximum cut-off works well
– The effect is best understood in 3D (!).
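The three scalings can be compared on a small example. In the sketch below the function names, the use of log(1 + count) for the log scale, and the hypothetical count matrix with a single severe spike are assumptions for illustration. Each function returns values in [0, 1], ready to be mapped onto a color gradient.

```python
import numpy as np

def scale_linear(counts):
    """Linear scale: the largest count maps to 1."""
    return counts / counts.max()

def scale_log(counts):
    """Log scale: log(1 + count), so empty bins stay at 0."""
    return np.log1p(counts) / np.log1p(counts.max())

def scale_linear_cutoff(counts, cutoff):
    """Linear scale with a maximum cut-off: counts above cutoff saturate at 1."""
    return np.clip(counts, 0, cutoff) / cutoff

# Hypothetical counts: a broad, low distribution plus one severe spike.
counts = np.array([[0.0, 5.0, 10.0],
                   [5.0, 20.0, 10.0],
                   [0.0, 5.0, 10_000.0]])

for name, scaled in [
    ("linear", scale_linear(counts)),
    ("log", scale_log(counts)),
    ("cut-off at 20", scale_linear_cutoff(counts, cutoff=20)),
]:
    print(name, np.round(scaled, 3).tolist())
```

On the linear scale everything except the spike is squeezed close to 0. The log scale recovers some contrast, and the cut-off spends the full color range on the low counts at the price of saturating the spike.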
![Page 47: Heatmap best practices](https://reader033.fdocuments.net/reader033/viewer/2022052413/559b0e761a28abb6638b4825/html5/thumbnails/47.jpg)
Peaks are best cut off
![Page 48: Heatmap best practices](https://reader033.fdocuments.net/reader033/viewer/2022052413/559b0e761a28abb6638b4825/html5/thumbnails/48.jpg)
Example: Linear gradient
![Page 49: Heatmap best practices](https://reader033.fdocuments.net/reader033/viewer/2022052413/559b0e761a28abb6638b4825/html5/thumbnails/49.jpg)
Log-gradient
![Page 50: Heatmap best practices](https://reader033.fdocuments.net/reader033/viewer/2022052413/559b0e761a28abb6638b4825/html5/thumbnails/50.jpg)
Linear gradient with cut-off
![Page 51: Heatmap best practices](https://reader033.fdocuments.net/reader033/viewer/2022052413/559b0e761a28abb6638b4825/html5/thumbnails/51.jpg)
Perceptually uniform gradient
![Page 52: Heatmap best practices](https://reader033.fdocuments.net/reader033/viewer/2022052413/559b0e761a28abb6638b4825/html5/thumbnails/52.jpg)
Colors: background/missings matters
![Page 53: Heatmap best practices](https://reader033.fdocuments.net/reader033/viewer/2022052413/559b0e761a28abb6638b4825/html5/thumbnails/53.jpg)
Heatmaps side-by-side: gross income, men vs women
![Page 54: Heatmap best practices](https://reader033.fdocuments.net/reader033/viewer/2022052413/559b0e761a28abb6638b4825/html5/thumbnails/54.jpg)
Meta pattern
Meta patterns consist of patterns that repeat in:
‐ different subpopulations
‐ different pairs of variables
So: patterns that are generic across different heatmaps.
![Page 55: Heatmap best practices](https://reader033.fdocuments.net/reader033/viewer/2022052413/559b0e761a28abb6638b4825/html5/thumbnails/55.jpg)
Heatmaps decomposed into subpopulations:
![Page 56: Heatmap best practices](https://reader033.fdocuments.net/reader033/viewer/2022052413/559b0e761a28abb6638b4825/html5/thumbnails/56.jpg)
Gross income by socioeconomic status
![Page 57: Heatmap best practices](https://reader033.fdocuments.net/reader033/viewer/2022052413/559b0e761a28abb6638b4825/html5/thumbnails/57.jpg)
Gross income, men, categorized by socioeconomic status
![Page 58: Heatmap best practices](https://reader033.fdocuments.net/reader033/viewer/2022052413/559b0e761a28abb6638b4825/html5/thumbnails/58.jpg)
Patterns
– Stripes are real, not outliers:
– They correspond to benefits and tax breaks
– This needs a paradigm shift: data is not normally distributed (but we knew that).
![Page 59: Heatmap best practices](https://reader033.fdocuments.net/reader033/viewer/2022052413/559b0e761a28abb6638b4825/html5/thumbnails/59.jpg)
Meta pattern
Meta patterns consist of patterns that repeat in:
‐ different subpopulations
‐ different pairs of variables
So: patterns that are generic across different heatmaps.
![Page 60: Heatmap best practices](https://reader033.fdocuments.net/reader033/viewer/2022052413/559b0e761a28abb6638b4825/html5/thumbnails/60.jpg)
Image classification of heatmaps
![Page 61: Heatmap best practices](https://reader033.fdocuments.net/reader033/viewer/2022052413/559b0e761a28abb6638b4825/html5/thumbnails/61.jpg)
No Domain knowledge required?
![Page 62: Heatmap best practices](https://reader033.fdocuments.net/reader033/viewer/2022052413/559b0e761a28abb6638b4825/html5/thumbnails/62.jpg)
![Page 63: Heatmap best practices](https://reader033.fdocuments.net/reader033/viewer/2022052413/559b0e761a28abb6638b4825/html5/thumbnails/63.jpg)
Salary pay structure
![Page 64: Heatmap best practices](https://reader033.fdocuments.net/reader033/viewer/2022052413/559b0e761a28abb6638b4825/html5/thumbnails/64.jpg)
Domain knowledge, take II
![Page 65: Heatmap best practices](https://reader033.fdocuments.net/reader033/viewer/2022052413/559b0e761a28abb6638b4825/html5/thumbnails/65.jpg)
Pattern removal: Effect of weighting
![Page 66: Heatmap best practices](https://reader033.fdocuments.net/reader033/viewer/2022052413/559b0e761a28abb6638b4825/html5/thumbnails/66.jpg)
Summary
Heatmaps:
– Ideal tool for analyzing big datasets
– Be aware of perceptual and data biases!
![Page 67: Heatmap best practices](https://reader033.fdocuments.net/reader033/viewer/2022052413/559b0e761a28abb6638b4825/html5/thumbnails/67.jpg)
Questions?
Thank you for your attention!
More info?
[email protected] / @_alex_priem
[email protected] / @edwindjonge
Heatmapping code available at
https://github.com/alexpriem/heatmapr