Page 1: A Visualization Tool for Mining Large Correlation Tables: The …stat.wharton.upenn.edu/~buja/PAPERS/Buja-et-al... · 2016. 8. 22. · A Visualization Tool for Mining Large Correlation

A Visualization Tool for Mining Large Correlation Tables:

The Association Navigator

Andreas Buja, Abba Krieger, Ed George
Statistics Department, The Wharton School, University of Pennsylvania∗

August 22, 2016

1 Overview

The Association Navigator is an interactive visualization tool for viewing large tables of correlations. The basic operation is zooming and panning of a table that is presented in graphical form, here called a “blockplot”.

The tool is really a tool box that includes, among other things: (1) display of p-values and missing value patterns in addition to correlations, (2) mark-up facilities to highlight variables and sub-tables as landmarks when navigating the larger table, (3) histograms/barcharts, scatterplots and scatterplot matrices as “lenses” into the distributions of variables and variable pairs, (4) thresholding of correlations and p-values to show only strong correlations and highly significant p-values, (5) trimming of extreme values of the variables for robustness, (6) “reference variables” that stay in sight at all times, and (7) wholesale adjustment of groups of variables for other variables.

The tool has been applied to data with nearly 2,000 variables and associated tables approaching a size of 2,000×2,000. The usefulness of the tool is less in beholding gigantic tables in their entirety and more in searching for interesting association patterns by navigating manageable but numerous and interconnected sub-tables.

2 Introduction

This document describes the Association Navigator (AN for short) in three sections: (1) In this introductory Section 2 we give some background about the data analytic and statistical problem addressed by this tool; (2) in Section 3 we describe the graphical displays used by the tool; (3) in Section 4 we describe the actual operation of the tool. We start with some background:

∗ This work was partially supported by a grant from the Simons Foundation (SFARI awards #121221 and #296012 to A.K.). We appreciate obtaining access to the phenotypic data on SFARI Base (https://base.sfari.org). Partial support was also provided by NSF Grant DMS-1007689 to A. Buja.

[Figure 1: blockplot image not reproduced. The bottom and left margins list the same 38 variables: age_at_ados_p1.CDV, family_type_p1.CDV, sex_p1.CDV, ethnicity_p1.CDV, cpea_dx_p1.CDV, adi_r_cpea_dx_p1.CDV, adi_r_soc_a_total_p1.CDV, adi_r_comm_b_non_verbal_total_p1.CDV, adi_r_b_comm_verbal_total_p1.CDV, adi_r_rrb_c_total_p1.CDV, adi_r_evidence_onset_p1.CDV, ados_module_p1.CDV, diagnosis_ados_p1.CDV, ados_css_p1.CDV, ados_social_affect_p1.CDV, ados_restricted_repetitive_p1.CDV, ados_communication_social_p1.CDV, ssc_diagnosis_verbal_iq_p1.CDV, ssc_diagnosis_verbal_iq_type_p1.CDV, ssc_diagnosis_nonverbal_iq_p1.CDV, ssc_diagnosis_nonverbal_iq_type_p1.CDV, ssc_diagnosis_full_scale_iq_p1.CDV, ssc_diagnosis_full_scale_iq_type_p1.CDV, ssc_diagnosis_vma_p1.CDV, ssc_diagnosis_nvma_p1.CDV, vineland_ii_composite_standard_score_p1.CDV, srs_parent_t_score_p1.CDV, srs_parent_raw_total_p1.CDV, srs_teacher_t_score_p1.CDV, srs_teacher_raw_total_p1.CDV, rbs_r_overall_score_p1.CDV, cbcl_2_5_internalizing_t_score_p1.CDV, cbcl_2_5_externalizing_t_score_p1.CDV, cbcl_6_18_internalizing_t_score_p1.CDV, cbcl_6_18_externalizing_t_score_p1.CDV, abc_total_score_p1.CDV, non_febrile_seizures_p1.CDV, febrile_seizures_p1.CDV. The plot carries the annotation “Correlations (Compl.Pairs)”.]

Figure 1: A first example of a “blockplot”: labels in the bottom and left margins show variable names, and blue and red blocks in the plotting area show positive and negative correlations.


An important focus of contemporary statistical research is on methods for large multivariate data. The term “large” can have two meanings, not mutually exclusive: (1) a large number of cases (records, rows), also called the “large-n problem”, or (2) a large number of variables (attributes, columns), also called the “large-p problem.” The two types of largeness call for different data analytic approaches and determine the kinds of questions that can be answered by the data. Most fundamentally it should be observed that increasing n, the number of cases, and increasing p, the number of variables, each has very different and in some ways opposite effects on statistical analysis. Since the general multivariate analysis problem is to make statistical inference about the association among variables, increasing n has the effect of improving the certainty of inference due to improved precision of estimates, whereas increasing p has the contrary effect of reducing the certainty of inference due to the multiplicity problem or, more colorfully, the “data dredging fallacy.” Therefore the level of detail that can be inferred about association among variables improves with increasing n but it plummets with increasing p.
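To give a sense of scale for the multiplicity problem, the following minimal sketch counts the distinct variable pairs for a given p and the number of pairs one would expect to pass a significance threshold by chance alone if no variable were truly associated with any other. The function names and the 5% level are illustrative assumptions and are not taken from the AN software; the calculation also assumes that each individual test holds its nominal level.

```python
def n_pairs(p):
    """Number of distinct variable pairs among p variables."""
    return p * (p - 1) // 2


def expected_false_positives(p, alpha=0.05):
    """Expected number of pairs flagged at level alpha when every true
    correlation is zero (assumes each test has exact level alpha)."""
    return alpha * n_pairs(p)


# The table sizes mentioned in this document: 38, 757 and about 2,000 variables.
for p in (38, 757, 2000):
    print(p, n_pairs(p), round(expected_false_positives(p)))
```

For p = 757 there are 286,146 distinct pairs, so on the order of 14,000 correlations would look "significant" at the 5% level even in pure noise.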

The problem we address here is primarily the large-p problem. From the above discussion it follows that, for large p, associations among variables can generally be inferred only to a low level of detail and certainty. Hence it is sufficient to measure association by simple means such as plain correlations. Correlations indicate the basic directionality in pairwise association, and as such they answer the simplest but also most fundamental question: are higher values in X associated with higher or lower values in Y, at least in tendency?

Reliance on correlations may be subject to objections because they seem limited in their range of applicability for several reasons: (1) they are considered to be measures of linear association only, (2) they describe bivariate association only, and (3) they apply to quantitative variables only. In Appendix A we refute or temper each of these objections by showing (1) that correlations are usually useful measures of directionality even when the associations are non-linear, (2) that higher-order associations play a reduced role especially in large-p problems, and (3) that with the help of a few tricks of the trade (“scoring” and “dummy coding”) correlations are useful even for categorical variables, both ordinal and nominal. In view of these arguments we proceed from the assumption that correlation tables, when used creatively, form quite general and powerful summaries of association among many variables.
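As a generic illustration of what “scoring” and “dummy coding” usually mean, the sketch below turns an ordinal variable into numeric scores and a nominal variable into 0/1 indicator columns, so that plain correlations can be computed for both. Appendix A is not reproduced here, so this should be read as a common textbook version of the two tricks rather than as the authors' exact procedure; all names and the toy data are hypothetical.

```python
import math


def pearson(x, y):
    """Plain Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def score(values, ordered_levels):
    """Scoring: replace ordinal categories by their rank 1, 2, ..."""
    rank = {level: i + 1 for i, level in enumerate(ordered_levels)}
    return [rank[v] for v in values]


def dummy_code(values):
    """Dummy coding: one 0/1 indicator column per distinct category."""
    return {level: [1 if v == level else 0 for v in values]
            for level in sorted(set(values))}


# Hypothetical toy data, made up purely for illustration.
severity = ["low", "high", "medium", "low", "high", "medium"]
site = ["A", "B", "A", "C", "B", "C"]
outcome = [1.0, 3.2, 2.1, 0.8, 2.9, 2.0]

severity_score = score(severity, ["low", "medium", "high"])
print("ordinal, scored:", round(pearson(severity_score, outcome), 3))
for level, indicator in dummy_code(site).items():
    print("nominal, dummy for site", level, round(pearson(indicator, outcome), 3))
```

Each dummy column then enters a correlation table like any other quantitative variable.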

In the following sections we describe first how we graphically present large correlation tables and second how we navigate and search them interactively. The software written to this end, the Association Navigator or AN, implements the essential displays and interactive functionality to support the “mining” of large correlation tables. The AN software is written entirely in the R language1.

All data examples in this document are drawn from the phenotypic data in the “Simons Simplex Collection” (SSC) created by the Simons Foundation Autism Research Initiative (SFARI). Approved researchers can obtain the SSC dataset used in this document by applying at https://base.sfari.org.

1 http://www.cran.r-project.org

3 Graphical Displays

3.1 Graphical Display of Correlation Tables: Blockplots

Figure 1 shows a first example of what we call a “blockplot”2 of a dataset with p = 38 variables. This plot is intended as a direct and fairly obvious translation of a numeric correlation table into visual form. The elements of the plot are as follows:

• The labels in the bottom and left margins show line-ups of the same 38 variables: age_at_ados_p1.CDV, family_type_p1.CDV, sex_p1.CDV, ... In contrast to tables, where the vertical axis lists variables top down, we follow the convention of scatterplots where the vertical axis is ascending and hence the variables are listed bottom up.

• The blue and red squares or “blocks” represent the pairwise correlations between variables at the intersections of the (imagined) horizontal and vertical lines drawn from the respective margin labels. The magnitude of a correlation is reflected in the size of the block and its sign in the color: positive correlations are shown in blue and negative correlations in red.3 Along the ascending 45-degree diagonal are the correlations +1 of the variables with themselves, hence these blocks are of maximal size. The closeness of other correlations to +1 or −1 can be gauged by a size comparison with the diagonal blocks.

• Finally, the plot shows a small comment in the bottom left, “Correlations (Compl.Pairs)”, indicating that what is represented by the blocks is correlation of complete (that is, non-missing) pairs of values of the two variables in question. This comment refers to the missing values problem and to the fact that correlation can only be calculated from the cases where the values of both variables are non-missing. The comment also alludes to the possibility that very different types of information could be represented by the blocks, and this is indeed made use of by the AN software (see Sections 3.3 and 3.4).
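The two ingredients of a block described above, a correlation computed from complete pairs and its translation into size and color, can be sketched as follows. This is an illustration only, not the AN implementation (which is written in R): the function names are hypothetical, missing values are represented by None, and taking the block's side length proportional to |r| is an assumption, since the text says only that size reflects magnitude.

```python
import math


def complete_pairs_correlation(x, y):
    """Pearson correlation using only cases where both values are present.

    Returns (r, n_complete). r is None when fewer than two complete pairs
    exist or when one variable is constant on the complete pairs.
    """
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        return None, n
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    if sxx == 0 or syy == 0:
        return None, n
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    return sxy / math.sqrt(sxx * syy), n


def block_glyph(r, max_side=1.0):
    """Map a correlation to (side_length, color): blue for positive,
    red for negative. Side length proportional to |r| is an assumption."""
    color = "blue" if r >= 0 else "red"
    return abs(r) * max_side, color


# Toy example: only cases 1, 4 and 5 are complete pairs.
x = [1.0, 2.0, None, 4.0, 5.0]
y = [2.0, None, 3.0, 8.0, 10.0]
r, n = complete_pairs_correlation(x, y)
print(n, block_glyph(r))
```

Returning the number of complete pairs alongside r is a design choice of this sketch: with many missing values, two correlations of equal size can rest on very different amounts of data.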

As a “reading exercise” consider Figure 2: This is the same blockplot as in Figure 1, but for ease of pointing we marked up two variables on the horizontal axis4:

age_at_ados_p1.CDV, ados_restricted_repetitive_p1.CDV,

meaning “age at the time of the administration of the ADOS or Autism Diagnostic Observation Schedule”, and “problems due to restricted and repetitive behaviors”, respectively. Two other variables are marked up on the vertical axis:

ssc_diagnosis_vma_p1.CDV, ssc_diagnosis_nvma_p1.CDV,

meaning “verbal mental age”, and “non-verbal mental age”, respectively, which are related to notions of IQ. For readability we will shorten the labels in what follows.

2 This type of plot is also called “fluctuation diagram” (Hofmann, 2000). The term “blockplot” is ours, and we introduce it because it is more descriptive of the plot’s visual appearance. We may even dare propose that “blockplot” be contracted to “blot”, which would be in the tradition of contracting “scatterplot matrix” to “splom” and “graphics object” to “grob”.

3 We follow the convention from finance where “being in the red” implies negative numbers; the opposite convention is from physics where red symbolizes higher temperatures. Users can easily change the defaults for blockplots; see the programming hints in Appendix B.

4 This dataset represents a version of the table proband_cdv.csv in Version 9 of the phenotypic SSC. The acronym cdv means “core descriptive variables”.

[Figure 2: blockplot image not reproduced. The margins list the same 38 variables as in Figure 1, and the plot carries the annotation “Correlations (Compl.Pairs)”.]

Figure 2: A “reading exercise” illustrated with the same example as in Figure 1. The salmon-colored strips highlight the variables age_at_ados_p1.CDV and ados_restricted_repetitive_p1.CDV on the horizontal axis, and the variables ssc_diagnosis_vma_p1.CDV and ssc_diagnosis_nvma_p1.CDV on the vertical axis. At the intersections of the strips are the blocks that reflect the respective correlations.

As for the actual reading exercise, in the intersection of the left vertical strip with the horizontal strip, we find two blue blocks of which the lower is recognizably larger than the upper (the reader may have to zoom in if viewing the figure in a PDF reader), implying that the correlation of age_at_ados.. with both ..vma.. and ..nvma.. is positive, but more strongly with the former than the latter, which may be news to the non-specialist: verbal skills are more strongly age-related than non-verbal skills. (Strictly speaking we can claim this only for the present sample of autistic probands.) Similarly, following the right vertical strip to the intersection with the horizontal strip, we find two red blocks of which again the lower block is slightly larger than the upper, but both are smaller than the blue blocks in the left strip. This implies that ..restricted_repetitive.. is negatively correlated with both ..vma.. and ..nvma.., but more strongly with the former, and both are more weakly correlated with ..restricted_repetitive.. than with age_at_ados... All of this makes sense in light of the apparent meanings of the variables: Any notion of “mental age” is probably quite strongly and positively associated with chronological age; with hindsight we may also accept that problems with specific behaviors tend to diminish with age, but the association is probably less strong than that between different notions of age.

Some other patterns are quickly parsed and understood: The two 2×2 blocks on the upper right diagonal stem from two versions of the same underlying measurements: “raw total” and “t score”. Next, the alternating patterns of red and blue in the center indicate that the three IQ measures (“verbal”, “nonverbal”, “full scale”) are in an inverse association with the corresponding IQ types. This makes sense because IQ types are dummy variables that indicate whether an IQ test suitable for cognitively highly impaired probands was applied. It becomes apparent that one can spend a fair amount of time wrapping one’s mind around the visible blockplot patterns and their meanings.

3.2 Graphical Overview of Large Correlation Tables

Figure 1 shows a manageably small set of 38 variables which is yet large enough that the presentation of a numeric table would be painful for anybody. This table, however, is a small subset of a larger dataset of 757 variables which is shown in Figure 3. In spite of the very different appearance, this, too, is a blockplot drawn by the same tool and by the same principles, with some allowance for the fact that 757^2 = 573,049 blocks cannot be sensibly displayed on screens with an image resolution comparable to 757^2. When blocks are represented by single pixels, blocksize variation is no longer possible. In this case, the tool displays only a selection of correlations that are largest in magnitude. The default (which can be changed) is to show 10,000 of the most extreme correlations, and it is these that give the blockplot in Figure 3 the characteristic pattern of streaks and rectangular concentrations.
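The selection of the most extreme correlations can be sketched as a top-k search by absolute value. The code below is a minimal illustration under stated assumptions, not the AN implementation: the function name is hypothetical, the table is a plain list of lists with None for unavailable entries, and only off-diagonal pairs with i < j are scanned. Whether the tool's default of 10,000 counts each symmetric pair once or twice is not stated in the text.

```python
import heapq


def largest_correlations(corr, k=10000):
    """Return up to k entries (i, j, r) with the largest |r|, in decreasing
    order of |r|. Scans only i < j, since a correlation table is symmetric
    and its diagonal is trivially +1."""
    candidates = (
        (i, j, corr[i][j])
        for i in range(len(corr))
        for j in range(i + 1, len(corr))
        if corr[i][j] is not None
    )
    return heapq.nlargest(k, candidates, key=lambda t: abs(t[2]))


# Toy 3x3 table: the strongest pair is negative, so sign must not affect ranking.
corr = [
    [1.0, 0.2, -0.9],
    [0.2, 1.0, 0.5],
    [-0.9, 0.5, 1.0],
]
print(largest_correlations(corr, k=2))
```

With p = 757 there are 757 · 756 / 2 = 286,146 off-diagonal pairs, so a selection of 10,000 keeps only a few percent of them.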

[Figure 3: blockplot image not reproduced. The margin labels are a sample of the variable names, beginning with family.ID and pb_sites.FAM and running through groups of variables with suffixes such as .FAM, .SRS, _IND, _racePARENT, _cuPARENT, .CDV, .OCUV, .AUTO.IMM, .BTH.DEF, .CHRC.ILL and .DT.MED.SLP.]

a.D

T.M

ED

.SLP

ov

er_c

ount

ers_

2_de

sc_f

a.D

T.M

ED

.SLP

dp

t_re

actio

n_p1

.DT.

ME

D.S

LP

dtap

_rea

ctio

n_p1

.DT.

ME

D.S

LP

hib_

reac

tion_

p1.D

T.M

ED

.SLP

he

patit

is_b

_rea

ctio

n_p1

.DT.

ME

D.S

LP

polio

_ora

l_re

actio

n_p1

.DT.

ME

D.S

LP

polio

_inj

ecte

dl_r

eact

ion_

p1.D

T.M

ED

.SLP

m

mr_

reac

tion_

p1.D

T.M

ED

.SLP

flu

_sho

t_re

actio

n_p1

.DT.

ME

D.S

LP

chic

ken_

pox_

varic

ella

_if_

recd

_p1.

DT.

ME

D.S

LP

vacc

inat

ion_

othe

r_1_

p1.D

T.M

ED

.SLP

va

ccin

atio

n_ot

her_

1_re

actio

n_p1

.DT.

ME

D.S

LP

vacc

inat

ion_

othe

r_2_

reac

tion_

p1.D

T.M

ED

.SLP

y.

n_an

gelm

an.G

EN

.DIS

y.

n_do

wn.

synd

rom

e.G

EN

.DIS

m

at.c

ousi

n_do

wn.

synd

rom

e.G

EN

.DIS

m

at.a

untu

ncle

_dow

n.sy

ndro

me.

GE

N.D

IS

y.n_

phen

ylke

tonu

ria.G

EN

.DIS

pa

t.cou

sin_

phen

ylke

tonu

ria.G

EN

.DIS

m

at.a

untu

ncle

_cri.

du.c

hat.G

EN

.DIS

sp

ecify

_oth

er.g

enet

ic.G

EN

.DIS

m

at_o

ther

.gen

etic

.GE

N.D

IS

sibl

ing_

othe

r.gen

etic

.GE

N.D

IS

pat.c

ousi

n_ot

her.g

enet

ic.G

EN

.DIS

pa

t.aun

tunc

le_o

ther

.gen

etic

.GE

N.D

IS

pat.g

rand

pare

nt_o

ther

.gen

etic

.GE

N.D

IS

gest

atio

nal_

age_

days

_LB

R.D

LY.B

TH

.FE

ED

la

bor_

dura

tion_

LBR

.DLY

.BT

H.F

EE

D

labo

r_du

ratio

n_na

_LB

R.D

LY.B

TH

.FE

ED

an

esth

esia

_spi

nal_

LBR

.DLY

.BT

H.F

EE

D

anes

thes

ia_g

ener

al_L

BR

.DLY

.BT

H.F

EE

D

labo

r_in

duce

d_LB

R.D

LY.B

TH

.FE

ED

in

duct

ion_

prol

onge

d_ru

ptur

e_m

embr

ane_

LBR

.DLY

.BT

H.F

EE

D

indu

ctio

n_po

st_d

ates

_LB

R.D

LY.B

TH

.FE

ED

in

duct

ion_

reas

on_o

ther

_spe

cify

_LB

R.D

LY.B

TH

.FE

ED

in

duct

ion_

met

hod_

pito

cin_

LBR

.DLY

.BT

H.F

EE

D

indu

ctio

n_m

etho

d_pr

osta

glan

dins

_LB

R.D

LY.B

TH

.FE

ED

in

duct

ion_

met

hod_

othe

r_sp

ecify

_LB

R.D

LY.B

TH

.FE

ED

au

gmen

tatio

n_pr

emat

ure_

rupt

ure_

mem

bran

e_LB

R.D

LY.B

TH

.FE

ED

au

gmen

tatio

n_fa

ilure

_to_

prog

ress

_LB

R.D

LY.B

TH

.FE

ED

au

gmen

tatio

n_re

ason

_oth

er_L

BR

.DLY

.BT

H.F

EE

D

augm

enta

tion_

met

hod_

amni

otom

y_LB

R.D

LY.B

TH

.FE

ED

au

gmen

tatio

n_m

etho

d_st

rippi

ng_L

BR

.DLY

.BT

H.F

EE

D

augm

enta

tion_

met

hod_

othe

r_LB

R.D

LY.B

TH

.FE

ED

pr

esen

tatio

n_ce

phal

ic_L

BR

.DLY

.BT

H.F

EE

D

pres

enta

tion_

tran

sver

se_L

BR

.DLY

.BT

H.F

EE

D

deliv

ery_

type

_LB

R.D

LY.B

TH

.FE

ED

c_

sect

ion_

emer

gent

_LB

R.D

LY.B

TH

.FE

ED

c_

sect

ion_

plan

ned_

LBR

.DLY

.BT

H.F

EE

D

c_se

ctio

n_ot

her_

spec

ify_L

BR

.DLY

.BT

H.F

EE

D

nuch

al_c

ord_

LBR

.DLY

.BT

H.F

EE

D

plac

enta

_pre

via_

LBR

.DLY

.BT

H.F

EE

D

child

_sev

erel

y_tr

aum

atiz

ed_L

BR

.DLY

.BT

H.F

EE

D

birt

h_w

eigh

t_lb

s_LB

R.D

LY.B

TH

.FE

ED

bi

rth_

wei

ght_

not_

sure

_LB

R.D

LY.B

TH

.FE

ED

bi

rth_

head

_circ

umfe

renc

e_pe

rcen

tile_

LBR

.DLY

.BT

H.F

EE

D

birt

h_he

ad_c

ircum

fere

nce_

not_

sure

_lis

t_LB

R.D

LY.B

TH

.FE

ED

bi

rth_

leng

th_n

ot_s

ure_

LBR

.DLY

.BT

H.F

EE

D

apga

r_sc

ore_

at_5

_min

utes

_LB

R.D

LY.B

TH

.FE

ED

ho

spita

l_af

ter_

birt

h_LB

R.D

LY.B

TH

.FE

ED

hy

perb

iliru

bine

mia

_rx_

no_t

reat

men

t_LB

R.D

LY.B

TH

.FE

ED

hy

perb

iliru

bine

mia

_rx_

exch

ange

_tra

ns_L

BR

.DLY

.BT

H.F

EE

D

mec

oniu

m_a

spira

tion_

LBR

.DLY

.BT

H.F

EE

D

o2_s

uppl

emen

t_LB

R.D

LY.B

TH

.FE

ED

o2

_by_

mas

k_an

d_ve

ntila

tion_

LBR

.DLY

.BT

H.F

EE

D

resu

scita

tion_

requ

ired_

LBR

.DLY

.BT

H.F

EE

D

nicu

_dur

atio

n_ad

mis

sion

_LB

R.D

LY.B

TH

.FE

ED

an

emia

_LB

R.D

LY.B

TH

.FE

ED

hy

poka

lem

ia_L

BR

.DLY

.BT

H.F

EE

D

hypo

glyc

emia

_LB

R.D

LY.B

TH

.FE

ED

di

ff_re

gula

te_t

emp_

LBR

.DLY

.BT

H.F

EE

D

phys

ical

_ano

mal

ies_

poly

dact

yly_

LBR

.DLY

.BT

H.F

EE

D

phys

ical

_ano

mal

ies_

kidn

ey_L

BR

.DLY

.BT

H.F

EE

D

brea

st_b

ottle

_fee

d_LB

R.D

LY.B

TH

.FE

ED

br

east

_bot

tle_m

onth

s_LB

R.D

LY.B

TH

.FE

ED

br

east

_tot

al_m

onth

s_LB

R.D

LY.B

TH

.FE

ED

bo

ttle_

tota

l_m

onth

s_LB

R.D

LY.B

TH

.FE

ED

ty

pe_o

f_fo

rmul

a_co

w_m

ilk_L

BR

.DLY

.BT

H.F

EE

D

type

_of_

form

ula_

othe

r_sp

ecify

_LB

R.D

LY.B

TH

.FE

ED

po

or_s

uck_

LBR

.DLY

.BT

H.F

EE

D

stiff

_inf

ant_

LBR

.DLY

.BT

H.F

EE

D

leth

argi

c_ov

erly

_sle

epy_

LBR

.DLY

.BT

H.F

EE

D

prob

and_

spee

ch.d

elay

.LA

NG

.DIS

pa

t_sp

eech

.del

ay.L

AN

G.D

IS

mat

.hal

fsib

ling_

spee

ch.d

elay

.LA

NG

.DIS

m

at.c

ousi

n_sp

eech

.del

ay.L

AN

G.D

IS

mat

.aun

tunc

le_s

peec

h.de

lay.

LAN

G.D

IS

mat

.gra

ndpa

rent

_spe

ech.

dela

y.LA

NG

.DIS

y.

n_ex

pres

sive

.lang

.dis

orde

r.LA

NG

.DIS

m

at_e

xpre

ssiv

e.la

ng.d

isor

der.L

AN

G.D

IS

sibl

ing_

expr

essi

ve.la

ng.d

isor

der.L

AN

G.D

IS

pat.c

ousi

n_ex

pres

sive

.lang

.dis

orde

r.LA

NG

.DIS

pa

t.aun

tunc

le_e

xpre

ssiv

e.la

ng.d

isor

der.L

AN

G.D

IS

y.n_

rece

ptiv

e.la

ng.d

isor

der.L

AN

G.D

IS

mat

_rec

eptiv

e.la

ng.d

isor

der.L

AN

G.D

IS

sibl

ing_

rece

ptiv

e.la

ng.d

isor

der.L

AN

G.D

IS

prob

and_

mix

ed.e

xpre

ssiv

e.di

sord

er.L

AN

G.D

IS

pat_

mix

ed.e

xpre

ssiv

e.di

sord

er.L

AN

G.D

IS

mat

.cou

sin_

mix

ed.e

xpre

ssiv

e.di

sord

er.L

AN

G.D

IS

mat

.aun

tunc

le_m

ixed

.exp

ress

ive.

diso

rder

.LA

NG

.DIS

m

at.g

rand

pare

nt_m

ixed

.exp

ress

ive.

diso

rder

.LA

NG

.DIS

pr

oban

d_co

mm

unic

atio

n.di

sord

er.L

AN

G.D

IS

mat

.cou

sin_

com

mun

icat

ion.

diso

rder

.LA

NG

.DIS

y.

n_pr

agm

atic

s.la

ng.d

isor

der.L

AN

G.D

IS

sibl

ing_

prag

mat

ics.

lang

.dis

orde

r.LA

NG

.DIS

pa

t.cou

sin_

prag

mat

ics.

lang

.dis

orde

r.LA

NG

.DIS

pa

t.aun

tunc

le_p

ragm

atic

s.la

ng.d

isor

der.L

AN

G.D

IS

y.n_

stut

terin

g.LA

NG

.DIS

m

at_s

tutte

ring.

LAN

G.D

IS

sibl

ing_

stut

terin

g.LA

NG

.DIS

pa

t.hal

fsib

ling_

stut

terin

g.LA

NG

.DIS

pa

t.cou

sin_

stut

terin

g.LA

NG

.DIS

pa

t.aun

tunc

le_s

tutte

ring.

LAN

G.D

IS

pat.g

rand

pare

nt_s

tutte

ring.

LAN

G.D

IS

prob

and_

redu

ced.

artic

ulat

ion.

LAN

G.D

IS

pat_

redu

ced.

artic

ulat

ion.

LAN

G.D

IS

mat

.hal

fsib

ling_

redu

ced.

artic

ulat

ion.

LAN

G.D

IS

mat

.cou

sin_

redu

ced.

artic

ulat

ion.

LAN

G.D

IS

mat

.aun

tunc

le_r

educ

ed.a

rtic

ulat

ion.

LAN

G.D

IS

mat

.gra

ndpa

rent

_red

uced

.art

icul

atio

n.LA

NG

.DIS

y.

n_ot

her.l

ang.

diso

rder

.LA

NG

.DIS

pr

oban

d_ot

her.l

ang.

diso

rder

.LA

NG

.DIS

pa

t_ot

her.l

ang.

diso

rder

.LA

NG

.DIS

m

at.h

alfs

iblin

g_ot

her.l

ang.

diso

rder

.LA

NG

.DIS

m

at.c

ousi

n_ot

her.l

ang.

diso

rder

.LA

NG

.DIS

m

at.a

untu

ncle

_oth

er.la

ng.d

isor

der.L

AN

G.D

IS

mat

.gra

ndpa

rent

_oth

er.la

ng.d

isor

der.L

AN

G.D

IS family.ID

pb_sites.FAM vn_sites.FAM xx_sites.FAM sb_sites.FAM

fl_sites.FAM kr_sites.FAM

sz.sorted_sites.FAM evalage_s1_bg.SRS

evalage_mo_bg.SRS evalmonth_s1_bg.SRS

evalmonth_mo_bg.SRS num.sibs_IND

sex.f1m0.p1_IND birth.p1_IND birth.fa_IND

death.p1_IND death.mo_IND

genetic.ab.s1_IND genetic.ab.mo_IND

overall_genetic_FAM father_genetic_FAM

sib_genetic_FAM asian.fa_racePARENT

native−hawaiian.fa_racePARENT not−specified.fa_racePARENT

ethnicity.hispanic.fa_racePARENT asian.mo_racePARENT

native−hawaiian.mo_racePARENT not−specified.mo_racePARENT

ethnicity.hispanic.mo_racePARENT bapq_rigid_average.fa_cuPARENT bapq_aloof_average.fa_cuPARENT

srs_adult_nbr_missing.fa_cuPARENT bapq_nbr_missing.mo_cuPARENT

bapq_pragmatic_average.mo_cuPARENT bapq_overall_average.mo_cuPARENT

srs_adult_total.mo_cuPARENT family_type_p1.CDV

ethnicity_p1.CDV adi_r_cpea_dx_p1.CDV

adi_r_comm_b_non_verbal_total_p1.CDV adi_r_rrb_c_total_p1.CDV

ados_module_p1.CDV ados_css_p1.CDV

ados_restricted_repetitive_p1.CDV ssc_diagnosis_verbal_iq_p1.CDV

ssc_diagnosis_nonverbal_iq_p1.CDV ssc_diagnosis_full_scale_iq_p1.CDV

ssc_diagnosis_vma_p1.CDV vineland_ii_composite_standard_score_p1.CDV

srs_parent_raw_total_p1.CDV srs_teacher_raw_total_p1.CDV

cbcl_2_5_internalizing_t_score_p1.CDV cbcl_6_18_internalizing_t_score_p1.CDV

abc_total_score_p1.CDV febrile_seizures_p1.CDV

bckgd_hx_highest_edu_father_p1.OCUV bckgd_hx_parent_relation_status_p1.OCUV

ssc_dx_overallcertainty_p1.OCUV nbr_stillbirth_miscarriage_p1.OCUV

family_structure_p1.OCUV word_delay_p1.OCUV

phrase_delay_p1.OCUV adi_r_q86_abnormality_evident_p1.OCUV

ados1_algorithm_p1.OCUV a1_non_echoed_p1.OCUV

ados_reciprocal_social_p1.OCUV vabs_ii_dls_standard_p1.OCUV vabs_ii_motor_skills_p1.OCUV

srs_parent_awareness_p1.OCUV srs_parent_communication_p1.OCUV

srs_parent_motivation_p1.OCUV srs_teacher_awareness_p1.OCUV

srs_teacher_communication_p1.OCUV srs_teacher_motivation_p1.OCUV

rbs_r_i_stereotyped_behavior_p1.OCUV rbs_r_iii_compulsive_behavior_p1.OCUV

rbs_r_v_sameness_behavior_p1.OCUV abc_nbr_missing_p1.OCUV

abc_iv_hyperactivity_p1.OCUV abc_iii_stereotypy_p1.OCUV

scq_life_nbr_missing_p1.OCUV scq_life_total_p1.OCUV

cbcl_2_5_somatic_complaints_p1.OCUV cbcl_2_5_sleep_problems_p1.OCUV

cbcl_2_5_aggressive_behavior_p1.OCUV cbcl_2_5_affective_problems_p1.OCUV

cbcl_2_5_pervasive_developmental_p1.OCUV cbcl_2_5_oppositional_defiant_p1.OCUV

cbcl_6_18_social_p1.OCUV cbcl_6_18_total_competence_p1.OCUV

cbcl_6_18_withdrawn_p1.OCUV cbcl_6_18_social_problems_p1.OCUV

cbcl_6_18_attention_problems_p1.OCUV cbcl_6_18_aggressive_behavior_p1.OCUV

cbcl_6_18_affective_problems_p1.OCUV cbcl_6_18_somatic_prob_p1.OCUV

cbcl_6_18_oppositional_defiant_p1.OCUV cbcl_2_5_emotionally_reactive_p1.OCUV

mat_rheum.arthritis.juvenile.AUTO.IMM sibling_rheum.arthritis.juvenile.AUTO.IMM

pat.cousin_rheum.arthritis.juvenile.AUTO.IMM pat.auntuncle_rheum.arthritis.juvenile.AUTO.IMM

y.n_rheum.arthritis.adult.AUTO.IMM mat_rheum.arthritis.adult.AUTO.IMM

mat.auntuncle_rheum.arthritis.adult.AUTO.IMM mat.grandparent_rheum.arthritis.adult.AUTO.IMM

y.n_systemic.lupus.eryth.AUTO.IMM mat_systemic.lupus.eryth.AUTO.IMM

mat.cousin_systemic.lupus.eryth.AUTO.IMM mat.auntuncle_systemic.lupus.eryth.AUTO.IMM

mat.grandparent_systemic.lupus.eryth.AUTO.IMM y.n_asthma.AUTO.IMM

mat_asthma.AUTO.IMM sibling_asthma.AUTO.IMM

pat.halfsibling_asthma.AUTO.IMM pat.cousin_asthma.AUTO.IMM

pat.auntuncle_asthma.AUTO.IMM pat.grandparent_asthma.AUTO.IMM

proband_hyperthyroidism.AUTO.IMM pat_hyperthyroidism.AUTO.IMM

mat.auntuncle_hyperthyroidism.AUTO.IMM mat.grandparent_hyperthyroidism.AUTO.IMM

y.n_hypothyroidism.AUTO.IMM mat_hypothyroidism.AUTO.IMM

sibling_hypothyroidism.AUTO.IMM pat.cousin_hypothyroidism.AUTO.IMM

pat.auntuncle_hypothyroidism.AUTO.IMM pat.grandparent_hypothyroidism.AUTO.IMM

proband_hashimotos.tyroiditis.AUTO.IMM pat_hashimotos.tyroiditis.AUTO.IMM

mat.cousin_hashimotos.tyroiditis.AUTO.IMM pat.auntuncle_hashimotos.tyroiditis.AUTO.IMM

pat.grandparent_hashimotos.tyroiditis.AUTO.IMM mat_diabetes.mellitus.type1.AUTO.IMM

sibling_diabetes.mellitus.type1.AUTO.IMM pat.cousin_diabetes.mellitus.type1.AUTO.IMM

pat.auntuncle_diabetes.mellitus.type1.AUTO.IMM pat.grandparent_diabetes.mellitus.type1.AUTO.IMM

proband_diabetes.mellitus.type2.AUTO.IMM pat_diabetes.mellitus.type2.AUTO.IMM

pat.cousin_diabetes.mellitus.type2.AUTO.IMM pat.auntuncle_diabetes.mellitus.type2.AUTO.IMM

pat.grandparent_diabetes.mellitus.type2.AUTO.IMM proband_adrenal.insufficiency.AUTO.IMM

mat.auntuncle_adrenal.insufficiency.AUTO.IMM pat.grandparent_adrenal.insufficiency.AUTO.IMM

proband_psoriasis.AUTO.IMM pat_psoriasis.AUTO.IMM

pat.halfsibling_psoriasis.AUTO.IMM pat.cousin_psoriasis.AUTO.IMM

pat.auntuncle_psoriasis.AUTO.IMM pat.grandparent_psoriasis.AUTO.IMM proband_bowel.disorders.AUTO.IMM

pat_bowel.disorders.AUTO.IMM mat.halfsibling_bowel.disorders.AUTO.IMM

pat.cousin_bowel.disorders.AUTO.IMM pat.auntuncle_bowel.disorders.AUTO.IMM

pat.grandparent_bowel.disorders.AUTO.IMM proband_cellac.disease.AUTO.IMM

pat_cellac.disease.AUTO.IMM mat.cousin_cellac.disease.AUTO.IMM

mat.auntuncle_cellac.disease.AUTO.IMM mat.grandparent_cellac.disease.AUTO.IMM

y.n_multiple.sclerosis.AUTO.IMM mat_multiple.sclerosis.AUTO.IMM

mat.cousin_multiple.sclerosis.AUTO.IMM mat.auntuncle_multiple.sclerosis.AUTO.IMM

mat.grandparent_multiple.sclerosis.AUTO.IMM y.n_other.autoimmune.disorder.AUTO.IMM

proband_other.autoimmune.disorder.AUTO.IMM pat_other.autoimmune.disorder.AUTO.IMM

pat.halfsibling_other.autoimmune.disorder.AUTO.IMM pat.cousin_other.autoimmune.disorder.AUTO.IMM

pat.auntuncle_other.autoimmune.disorder.AUTO.IMM pat.grandparent_other.autoimmune.disorder.AUTO.IMM

proband_cleft.lip.palate.BTH.DEF sibling_cleft.lip.palate.BTH.DEF

mat.cousin_cleft.lip.palate.BTH.DEF mat.auntuncle_cleft.lip.palate.BTH.DEF

mat.grandparent_cleft.lip.palate.BTH.DEF proband_open.spine.BTH.DEF

mat.cousin_open.spine.BTH.DEF pat.auntuncle_open.spine.BTH.DEF

y.n_congenital.heart.defect.BTH.DEF mat_congenital.heart.defect.BTH.DEF

sibling_congenital.heart.defect.BTH.DEF mat.cousin_congenital.heart.defect.BTH.DEF

mat.auntuncle_congenital.heart.defect.BTH.DEF mat.grandparent_congenital.heart.defect.BTH.DEF

y.n_kidney.defect.BTH.DEF mat_kidney.defect.BTH.DEF

sibling_kidney.defect.BTH.DEF pat.cousin_kidney.defect.BTH.DEF

pat.auntuncle_kidney.defect.BTH.DEF pat.grandparent_kidney.defect.BTH.DEF

proband_abnormal.shape.polydactyly.BTH.DEF pat_abnormal.shape.polydactyly.BTH.DEF

mat.cousin_abnormal.shape.polydactyly.BTH.DEF mat.auntuncle_abnormal.shape.polydactyly.BTH.DEF

mat.grandparent_abnormal.shape.polydactyly.BTH.DEF y.n_other.birth.defect.BTH.DEF

proband_other.birth.defect.BTH.DEF pat_other.birth.defect.BTH.DEF

pat.halfsibling_other.birth.defect.BTH.DEF pat.cousin_other.birth.defect.BTH.DEF

pat.auntuncle_other.birth.defect.BTH.DEF pat.grandparent_other.birth.defect.BTH.DEF

proband_heart.disease.CHRC.ILL pat_heart.disease.CHRC.ILL

mat.cousin_heart.disease.CHRC.ILL mat.auntuncle_heart.disease.CHRC.ILL

mat.grandparent_heart.disease.CHRC.ILL y.n_stroke.CHRC.ILL pat_stroke.CHRC.ILL

mat.auntuncle_stroke.CHRC.ILL mat.grandparent_stroke.CHRC.ILL

y.n_cancer.CHRC.ILL pat_cancer.CHRC.ILL

mat.halfsibling_cancer.CHRC.ILL pat.cousin_cancer.CHRC.ILL

pat.auntuncle_cancer.CHRC.ILL pat.grandparent_cancer.CHRC.ILL

proband_death.under.50.CHRC.ILL mat.halfsibling_death.under.50.CHRC.ILL

mat.cousin_death.under.50.CHRC.ILL mat.auntuncle_death.under.50.CHRC.ILL

mat.grandparent_death.under.50.CHRC.ILL y.n_other.disorder.illness1.CHRC.ILL

proband_other.disorder.illness1.CHRC.ILL pat_other.disorder.illness1.CHRC.ILL

mat.halfsibling_other.disorder.illness1.CHRC.ILL mat.cousin_other.disorder.illness1.CHRC.ILL

mat.auntuncle_other.disorder.illness1.CHRC.ILL mat.grandparent_other.disorder.illness1.CHRC.ILL

y.n_other.disorder.illness2.CHRC.ILL proband_other.disorder.illness2.CHRC.ILL

pat_other.disorder.illness2.CHRC.ILL mat.halfsibling_other.disorder.illness2.CHRC.ILL

pat.cousin_other.disorder.illness2.CHRC.ILL pat.auntuncle_other.disorder.illness2.CHRC.ILL

pat.grandparent_other.disorder.illness2.CHRC.ILL diet_other_1_past_p1.DT.MED.SLP diet_other_2_past_p1.DT.MED.SLP

other_meds_2_desc_p1.DT.MED.SLP over_counters_2_desc_p1.DT.MED.SLP other_meds_1_reason_s1.DT.MED.SLP other_meds_2_reason_s1.DT.MED.SLP

over_counters_1_reason_s1.DT.MED.SLP over_counters_2_reason_s1.DT.MED.SLP

other_meds_2_desc_fa.DT.MED.SLP over_counters_2_desc_fa.DT.MED.SLP

dpt_reaction_p1.DT.MED.SLP dtap_reaction_p1.DT.MED.SLP

hib_reaction_p1.DT.MED.SLP hepatitis_b_reaction_p1.DT.MED.SLP polio_oral_reaction_p1.DT.MED.SLP

polio_injectedl_reaction_p1.DT.MED.SLP mmr_reaction_p1.DT.MED.SLP

flu_shot_reaction_p1.DT.MED.SLP chicken_pox_varicella_if_recd_p1.DT.MED.SLP

vaccination_other_1_p1.DT.MED.SLP vaccination_other_1_reaction_p1.DT.MED.SLP vaccination_other_2_reaction_p1.DT.MED.SLP

y.n_angelman.GEN.DIS y.n_down.syndrome.GEN.DIS

mat.cousin_down.syndrome.GEN.DIS mat.auntuncle_down.syndrome.GEN.DIS

y.n_phenylketonuria.GEN.DIS pat.cousin_phenylketonuria.GEN.DIS

mat.auntuncle_cri.du.chat.GEN.DIS specify_other.genetic.GEN.DIS

mat_other.genetic.GEN.DIS sibling_other.genetic.GEN.DIS

pat.cousin_other.genetic.GEN.DIS pat.auntuncle_other.genetic.GEN.DIS

pat.grandparent_other.genetic.GEN.DIS gestational_age_days_LBR.DLY.BTH.FEED

labor_duration_LBR.DLY.BTH.FEED labor_duration_na_LBR.DLY.BTH.FEED anesthesia_spinal_LBR.DLY.BTH.FEED

anesthesia_general_LBR.DLY.BTH.FEED labor_induced_LBR.DLY.BTH.FEED

induction_prolonged_rupture_membrane_LBR.DLY.BTH.FEED induction_post_dates_LBR.DLY.BTH.FEED

induction_reason_other_specify_LBR.DLY.BTH.FEED induction_method_pitocin_LBR.DLY.BTH.FEED

induction_method_prostaglandins_LBR.DLY.BTH.FEED induction_method_other_specify_LBR.DLY.BTH.FEED

augmentation_premature_rupture_membrane_LBR.DLY.BTH.FEED augmentation_failure_to_progress_LBR.DLY.BTH.FEED

augmentation_reason_other_LBR.DLY.BTH.FEED augmentation_method_amniotomy_LBR.DLY.BTH.FEED

augmentation_method_stripping_LBR.DLY.BTH.FEED augmentation_method_other_LBR.DLY.BTH.FEED

presentation_cephalic_LBR.DLY.BTH.FEED presentation_transverse_LBR.DLY.BTH.FEED

delivery_type_LBR.DLY.BTH.FEED c_section_emergent_LBR.DLY.BTH.FEED

c_section_planned_LBR.DLY.BTH.FEED c_section_other_specify_LBR.DLY.BTH.FEED

nuchal_cord_LBR.DLY.BTH.FEED placenta_previa_LBR.DLY.BTH.FEED

child_severely_traumatized_LBR.DLY.BTH.FEED birth_weight_lbs_LBR.DLY.BTH.FEED

birth_weight_not_sure_LBR.DLY.BTH.FEED birth_head_circumference_percentile_LBR.DLY.BTH.FEED

birth_head_circumference_not_sure_list_LBR.DLY.BTH.FEED birth_length_not_sure_LBR.DLY.BTH.FEED

apgar_score_at_5_minutes_LBR.DLY.BTH.FEED hospital_after_birth_LBR.DLY.BTH.FEED

hyperbilirubinemia_rx_no_treatment_LBR.DLY.BTH.FEED hyperbilirubinemia_rx_exchange_trans_LBR.DLY.BTH.FEED

meconium_aspiration_LBR.DLY.BTH.FEED o2_supplement_LBR.DLY.BTH.FEED

o2_by_mask_and_ventilation_LBR.DLY.BTH.FEED resuscitation_required_LBR.DLY.BTH.FEED

nicu_duration_admission_LBR.DLY.BTH.FEED anemia_LBR.DLY.BTH.FEED

hypokalemia_LBR.DLY.BTH.FEED hypoglycemia_LBR.DLY.BTH.FEED

diff_regulate_temp_LBR.DLY.BTH.FEED physical_anomalies_polydactyly_LBR.DLY.BTH.FEED

physical_anomalies_kidney_LBR.DLY.BTH.FEED breast_bottle_feed_LBR.DLY.BTH.FEED

breast_bottle_months_LBR.DLY.BTH.FEED breast_total_months_LBR.DLY.BTH.FEED bottle_total_months_LBR.DLY.BTH.FEED

type_of_formula_cow_milk_LBR.DLY.BTH.FEED type_of_formula_other_specify_LBR.DLY.BTH.FEED

poor_suck_LBR.DLY.BTH.FEED stiff_infant_LBR.DLY.BTH.FEED

lethargic_overly_sleepy_LBR.DLY.BTH.FEED proband_speech.delay.LANG.DIS

pat_speech.delay.LANG.DIS mat.halfsibling_speech.delay.LANG.DIS

mat.cousin_speech.delay.LANG.DIS mat.auntuncle_speech.delay.LANG.DIS

mat.grandparent_speech.delay.LANG.DIS y.n_expressive.lang.disorder.LANG.DIS

mat_expressive.lang.disorder.LANG.DIS sibling_expressive.lang.disorder.LANG.DIS

pat.cousin_expressive.lang.disorder.LANG.DIS pat.auntuncle_expressive.lang.disorder.LANG.DIS

y.n_receptive.lang.disorder.LANG.DIS mat_receptive.lang.disorder.LANG.DIS

sibling_receptive.lang.disorder.LANG.DIS proband_mixed.expressive.disorder.LANG.DIS

pat_mixed.expressive.disorder.LANG.DIS mat.cousin_mixed.expressive.disorder.LANG.DIS

mat.auntuncle_mixed.expressive.disorder.LANG.DIS mat.grandparent_mixed.expressive.disorder.LANG.DIS

proband_communication.disorder.LANG.DIS mat.cousin_communication.disorder.LANG.DIS

y.n_pragmatics.lang.disorder.LANG.DIS sibling_pragmatics.lang.disorder.LANG.DIS

pat.cousin_pragmatics.lang.disorder.LANG.DIS pat.auntuncle_pragmatics.lang.disorder.LANG.DIS

y.n_stuttering.LANG.DIS mat_stuttering.LANG.DIS

sibling_stuttering.LANG.DIS pat.halfsibling_stuttering.LANG.DIS

pat.cousin_stuttering.LANG.DIS pat.auntuncle_stuttering.LANG.DIS

pat.grandparent_stuttering.LANG.DIS proband_reduced.articulation.LANG.DIS

pat_reduced.articulation.LANG.DIS mat.halfsibling_reduced.articulation.LANG.DIS

mat.cousin_reduced.articulation.LANG.DIS mat.auntuncle_reduced.articulation.LANG.DIS

mat.grandparent_reduced.articulation.LANG.DIS y.n_other.lang.disorder.LANG.DIS

proband_other.lang.disorder.LANG.DIS pat_other.lang.disorder.LANG.DIS

mat.halfsibling_other.lang.disorder.LANG.DIS mat.cousin_other.lang.disorder.LANG.DIS

mat.auntuncle_other.lang.disorder.LANG.DIS mat.grandparent_other.lang.disorder.LANG.DIS

Correlations(Compl.Pairs)

Figure 3: An overview blockplot of 757 variables. Groups of variables are marked by background highlight squares along the ascending diagonal. The blockplot of Figure 1 is contained in this larger plot and can be found in the small highlight square in the lower left marked by a faint crosshair. Readers who are viewing this document in a PDF reader may zoom in to verify that this highlight square contains an approximation to Figure 1.


The function of this blockplot is not so much to facilitate discovery as to provide an overview organized in such a way that meaningful subsets are recognizable to the expert who is knowledgeable about the dataset. The tool helps in this regard by providing a way to group the variables and showing the variable groups as diagonal highlight squares. In Figure 3 two large highlight squares are recognizable near the lower left and the upper right. Closer scrutiny should allow the reader to recognize many more much smaller highlight squares up and down the diagonal, each marking a small variable group. In particular, the reader should be able to locate the third-largest highlight square in the lower left, shown in turquoise as opposed to gray and pointed at by a faint crosshair: this square marks the group of 38 variables shown in Figures 1 and 2.

The mechanism by which variable grouping is conveyed to the AN is a naming convention for variable names: to define a variable group, the variables to be included must be given names that end in the same suffix separated by an underscore “_” (default, can be changed), and the variables must be contiguous in the order of the dataset. As an example, Figure 4 shows the 38-variable group of Figure 1 in the context of its neighbor groups: This group is characterized by the suffix “p1.CDV”, whereas the neighbor group on the upper right (only a small part is visible) has the suffix “p1.OCUV” and the two groups on the lower left have suffixes “cuPARENT” and “racePARENT”.5 The background highlight squares cover the intra-group correlations for the respective variable groups. As in Figure 3, the highlight square for the 38-variable group is shown in turquoise whereas the neighboring highlight squares are in gray.
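The suffix convention can be illustrated with a minimal sketch. It is not the AN's own code: the function names `suffix_of` and `variable_groups` are hypothetical, and treating names without a separator as having an empty suffix is an assumption made here for illustration. The sketch splits each name at its last underscore and collects contiguous runs of equal suffixes.

```python
from itertools import groupby


def suffix_of(name, sep="_"):
    """Return the text after the last separator, or "" if there is none.

    Treating separator-free names as having an empty suffix is an
    assumption of this sketch, not documented AN behavior.
    """
    _head, found, tail = name.rpartition(sep)
    return tail if found else ""


def variable_groups(names, sep="_"):
    """Return (suffix, first_index, last_index) for each contiguous run
    of variables that share the same suffix."""
    groups = []
    start = 0
    # groupby only merges adjacent items, which matches the requirement
    # that the variables of a group be contiguous in the dataset order.
    for suffix, run in groupby(names, key=lambda n: suffix_of(n, sep)):
        size = len(list(run))
        groups.append((suffix, start, start + size - 1))
        start += size
    return groups


if __name__ == "__main__":
    example = [
        "bapq_overall_average.mo_cuPARENT",
        "srs_adult_total.mo_cuPARENT",
        "age_at_ados_p1.CDV",
        "sex_p1.CDV",
        "bckgd_hx_highest_edu_mother_p1.OCUV",
    ]
    for group in variable_groups(example):
        print(group)
```

Each returned index range corresponds to one diagonal highlight square. Because only adjacent names are merged, two runs with the same suffix that are separated by other variables would be reported as two groups.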

Figures 1-4 are a prelude for the zooming and panning functionality to be described in Section 4.

3.3 Other Uses of Blockplots (1): P-Values

Associated with correlations are other quantities of interest that can also be displayed with blockplots, foremost among them the p-values of the correlations. A p-value in this case is a measure of evidence in favor of the assumption that the observed correlation is spurious, that is, its deviation from zero is due to chance alone while the population correlation is zero.6
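As a rough illustration, a two-sided p-value for a sample correlation can be approximated with the Fisher z-transformation and a normal reference distribution. This is a sketch under stated assumptions: the text does not say how the AN computes its p-values, the function name `correlation_p_value` is hypothetical, and the normal approximation is only one of several standard choices.

```python
import math


def correlation_p_value(r, n):
    """Approximate two-sided p-value for a sample correlation r from n
    complete pairs, under the null hypothesis of zero population correlation.

    Uses the Fisher z-transformation: atanh(r) * sqrt(n - 3) is treated as
    approximately standard normal. This is an illustrative approximation,
    not necessarily the method used by the AN.
    """
    if n <= 3:
        raise ValueError("need more than 3 complete pairs")
    if abs(r) >= 1.0:
        return 0.0  # a perfect correlation is the limiting case
    z = math.atanh(r) * math.sqrt(n - 3)
    # Two-sided tail probability of a standard normal: 2 * (1 - Phi(|z|)).
    return math.erfc(abs(z) / math.sqrt(2.0))


if __name__ == "__main__":
    for r in (0.0, 0.1, 0.5, -0.5):
        print(r, correlation_p_value(r, n=30))
```

Only the magnitude of `r` enters the result, so correlations of equal size and opposite sign receive the same p-value.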

P-values are hypothetical probabilities, hence they fall in the interval [0, 1]. As p-values represent evidence in favor of the assumption that no linear association exists, it is small p-values that are of interest, because they indicate that the chance of a spurious detection of linear association is small. By convention one is looking for p-values at least below 0.05, for a “Type I error” probability of one in twenty or less. When considering p-values of many correlations on the same dataset, as is the case here, one needs to protect against “multiplicity”, that is, the fact that 5% of p-values will be below 0.05 even if in truth all

5 These suffixes abbreviate the following full-length meanings: “proband 1, core descriptive variables”, “proband 1, other commonly used variables”, “commonly used for parents” and “race of parents”. These variable groups are from the following SSC tables: proband_cdv.csv, proband_ocuv.csv, and parent.csv.

6 Technically, the (two-sided) p-value of a correlation is the hypothetical probability of observing a future sample correlation greater in magnitude than the sample correlation observed in the actual data, assuming that in truth the population correlation is zero.


[Blockplot image: both axes list the variable names of the “racePARENT”, “cuPARENT”, “p1.CDV” and “p1.OCUV” groups; plot title "Correlations (Compl. Pairs)"]

Figure 4: The 38 variable group of Figure 1 in the context of the neigboring variable groups.


population correlations vanish. Such protection is provided by choosing a threshold much smaller than 0.05, by the conservative Bonferroni rule as small as 0.05/#correlations. In the data example with 757 variables, the number of correlations is 286,146, hence one might want to choose the threshold on the p-values as low as 0.05/286,146, or about 1.75 in 10 million. The point is that in large-p problems one is interested in very small p-values.⁷
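The arithmetic behind the Bonferroni threshold can be checked with a few lines of Python. This is only an illustrative sketch; the function name `bonferroni_threshold` is hypothetical and not part of the AN.

```python
def bonferroni_threshold(num_variables, alpha=0.05):
    """Return (number of distinct correlations, Bonferroni-adjusted threshold)."""
    # Each unordered pair of distinct variables contributes one correlation.
    num_correlations = num_variables * (num_variables - 1) // 2
    return num_correlations, alpha / num_correlations


num_corr, threshold = bonferroni_threshold(757)
print(num_corr)            # 286146
print(f"{threshold:.3g}")  # 1.75e-07, i.e. about 1.75 in 10 million
```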

P-values lend themselves easily to graphical display with blockplots, but the direct mapping of p-value to blocksize has some drawbacks. These drawbacks, on the other hand, can be easily fixed:

• P-values are blind to the sign of the correlation: correlation values of +0.95 and -0.95, for example, result in the same two-sided p-value. We correct for this drawback by showing p-values of negative correlations in red color.

• Of interest are small p-values that correspond to correlations of large magnitude, hence a direct mapping would represent the interesting p-values by small blocks, which is visually incorrect because the eye is drawn to large objects, not to large holes. We therefore invert the mapping and associate blocksize with the complement 1−(p-value).

• Drawing on the preceding discussion, our interest is really in very small p-values, and one may hence want to ignore p-values greater than 0.05 altogether in the display. We therefore map the interval [0, 0.05] inversely to blocksize, meaning that p-values below but near 0.05 are shown as small blocks and p-values very near 0.00 as large blocks.
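The three fixes above can be summarized in a small sketch. It assumes a linear inverse map from [0, cutoff] to a relative block size in [0, 1] and uses "blue" as a placeholder color for non-negative correlations; the text specifies only that the map is inverse and that negative correlations are shown in red, so the function name, the linearity, and the second color are assumptions.

```python
def pvalue_block(p_value, correlation, cutoff=0.05):
    """Map a p-value to (relative block size, color), or None if it is not shown."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie in [0, 1]")
    if p_value > cutoff:
        return None  # p-values above the cutoff are ignored in the display
    # Inverse map (assumed linear): p = 0 gives size 1, p = cutoff gives size 0.
    size = 1.0 - p_value / cutoff
    # The sign is inherited from the correlation; "blue" is a placeholder.
    color = "red" if correlation < 0 else "blue"
    return size, color
```

For example, `pvalue_block(0.0, 0.5)` yields a full-size block, while `pvalue_block(0.2, 0.1)` yields `None` because the p-value exceeds the cutoff.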

The resulting p-value blockplots are illustrated in Figure 5. The two plots show the same 38-variable group as in Figure 1 with p-values truncated at 0.05 and at 0.0000001, respectively, as shown near the bottom left corners of the plots. The p-values are calculated using the usual normal approximation to the null distribution of the correlations. In view of the large sample size, n ≥ 1,800, the normal approximation can be assumed to be quite good, even though one is going out on a limb when relying on normal tail probabilities as small as 10^-7. Then again, p-values this small are strong evidence against the assumption that the correlations are spurious.
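The text does not state which normal approximation is used. As one common choice, the sketch below assumes Fisher's z-transform, under which atanh(r)·sqrt(n − 3) is approximately standard normal when the population correlation is zero. The function name `correlation_pvalue` is hypothetical.

```python
import math


def correlation_pvalue(r, n):
    """Two-sided p-value for H0: population correlation = 0 (Fisher z, assumed)."""
    if n <= 3:
        raise ValueError("need more than 3 complete pairs")
    if abs(r) >= 1.0:
        return 0.0  # perfect correlation: atanh would diverge
    z = math.atanh(r) * math.sqrt(n - 3)
    # 2 * (1 - Phi(|z|)), written with erfc to stay accurate in the far tail.
    return math.erfc(abs(z) / math.sqrt(2.0))
```

With n = 1,800, even a modest correlation of 0.1 gives a p-value on the order of 10^-5 under this approximation, which illustrates why very small thresholds are needed at this sample size.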

3.4 Other Uses of Blockplots (2): Fraction of Missing and Complete Pairs of Values

Missing values are so common that they require special attention and special tools for understanding their patterns. Missing values are sometimes approached with imputation methods, but in view of the large number of variables we wish to explore we use simple deletion methods that rely on the largest number of available values. For correlations this means that we use for a given pair of variables the full set of complete pairs of values. Another common and more stringent deletion method is to use only cases that are complete on all variables,

7 The letters ‘p’ in “large-p” and “p-value” bear no relation. In the former, p is derived from “parameter”, in the latter from “probability”.


[Figure 5: two blockplots of the 38 .CDV variables, with panel titles "P-values < .05 (Normal)" (left) and "P-values < .0000001 (Normal)" (right).]

Figure 5: Blockplots of the p-values for the 38-variable group of Figure 1. Smaller and hence statistically more significant p-values are shown as larger blocks. The colors are inherited from the correlations to reflect their signs. Truncation levels of p-values: Left ≥ 0.05; right ≥ 0.0000001. Many modest correlations are extremely statistically significant due to n ≥ 1,800.

[Figure 6: two blockplots of the 38 .CDV variables, with panel titles "#MissingPairs/N" (left) and "#CompletePairs/N" (right).]

Figure 6: Blockplots of the fractions of missing (left) and complete (right) pairs of values.


but in the large-p problem this is not a viable approach because complete cases may well not exist when the number of variables reaches into the hundreds or thousands.

An issue with calculating correlations from maximal sets of complete pairs of values is that this set may vary from correlation to correlation because it is formed from the overlap of non-missing values in both variables. Thus, associated with each correlation r(x, y) are

• the number n(x, y) (≤ n) of complete pairs from which r(x, y) is calculated, and

• the number m(x, y) = n − n(x, y) of incomplete pairs where at least one of the two, x or y, is missing.
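To make the three quantities concrete, here is a minimal sketch of pairwise deletion for a single pair of variables. It assumes that missing values are encoded as `None`; the function name `pairwise_complete` is hypothetical and the code is not taken from the AN.

```python
import math


def pairwise_complete(x, y):
    """Return (r, n_xy, m_xy) using only cases where both x and y are present."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of cases")
    # Keep the complete pairs only (missing values assumed to be None).
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n_xy = len(pairs)
    m_xy = len(x) - n_xy  # m(x, y) = n - n(x, y)
    if n_xy < 2:
        return float("nan"), n_xy, m_xy
    mean_x = sum(a for a, _ in pairs) / n_xy
    mean_y = sum(b for _, b in pairs) / n_xy
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    if sxx == 0 or syy == 0:
        return float("nan"), n_xy, m_xy  # correlation undefined for a constant
    return sxy / math.sqrt(sxx * syy), n_xy, m_xy
```

For instance, with `x = [1, 2, None, 4, 5]` and `y = [2, 4, 6, None, 10]`, three pairs are complete and two are incomplete.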

Just like the correlations r(x, y), the values n(x, y) and m(x, y) form square tables with one row and one column per variable, hence can be easily visualized with blockplots in their fractional forms n(x, y)/n and m(x, y)/n. An example of each is shown in Figure 6, again for the same 38-variable group of Figure 1. Apparently four variables have a major missing value problem.

Depending on whether the number of complete or incomplete pairs dominates, one or the other plot is more sensible in that it uses less ink. Finally, we note that in this case of a blockplot the diagonal is not occupied by a constant but contains instead the fraction of non-missing (n(x, x)/n) and missing (m(x, x)/n) values, respectively, for each individual variable x. The two tables have inverse relationships between the diagonal and off-diagonal elements: n(x, x) ≥ n(x, y) and m(x, x) ≤ m(x, y). That is, in the n(x, y)-table the diagonal dominates its row and column, whereas in the m(x, y)-table the diagonal is dominated by its row and column.
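The diagonal relationships can be verified on a toy data set. The sketch below builds both count tables for three made-up variables with missing values encoded as `None`; the function name `complete_pair_counts` and the data are purely illustrative.

```python
def complete_pair_counts(columns):
    """columns: dict name -> equal-length list of values, None meaning missing.

    Returns a dict mapping (a, b) to n(a, b), the number of complete pairs.
    """
    names = list(columns)
    present = {k: [v is not None for v in columns[k]] for k in names}
    return {
        (a, b): sum(pa and pb for pa, pb in zip(present[a], present[b]))
        for a in names
        for b in names
    }


data = {
    "x": [1.0, None, 3.0, 4.0],
    "y": [None, 2.0, 3.0, 4.0],
    "z": [1.0, 2.0, 3.0, None],
}
n = 4  # total number of cases
n_tab = complete_pair_counts(data)
m_tab = {pair: n - count for pair, count in n_tab.items()}

# Diagonal dominates in the n-table and is dominated in the m-table.
assert all(n_tab[(a, a)] >= n_tab[(a, b)] for a in data for b in data)
assert all(m_tab[(a, a)] <= m_tab[(a, b)] for a in data for b in data)
```

Dividing each entry by `n` gives the fractional forms that are displayed as blockplots.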

3.5 Marginal and Bivariate Plots: Histograms/Barcharts, Scatterplots, and Scatterplot Matrices

The correlation of a pair of variables is a simple summary measure of association between two variables, hence one often wonders about the detailed nature of the association. The full details can be learned from a scatterplot of the two variables. Often the association is constrained by the marginal distribution, hence we also show histograms and barcharts. Figure 7 shows three examples of triples consisting of a pairwise scatterplot and two marginal histograms (for quantitative variables) and barcharts (for categorical variables). From Figure 7 we can draw a few conclusions and recommendations:

• A most basic use of the plots is to note the type of the variables: In Figure 7, both variables on the left (..nonverbal_iq.. and ..verbal_iq..) and the y-variable in the center (..vma..) are quantitative, the x-variable in the center (ados_module..) is apparently ordinal with four levels, and both variables on the right (nonverbal_iq_type and verbal_iq_type) are binary. Quantitative variables can have strong marginal features: It might be of interest to observe that the x-variable on the left is slightly bimodal, with a major mode around x = 90 and a minor mode around x = 30.⁸ The y-variable in the center

8 The bimodality of the IQ distribution is a measurement artifact: For cognitively highly impaired probands a different and more appropriate IQ test is administered. In theory this alternative test should be


[Figure 7: three frames, each consisting of a scatterplot and two marginal histograms/barcharts with "Frequency" on the vertical axis. Frame #1: y = ssc_diagnosis_nonverbal_iq_p1.CDV, x = ssc_diagnosis_verbal_iq_p1.CDV, n = 1887, Corr = 0.824 (pval = 0). Frame #2: y = ssc_diagnosis_vma_p1.CDV, x = ados_module_p1.CDV, n = 1880, Corr = 0.707 (pval = 0). Frame #3: y = ssc_diagnosis_nonverbal_iq_type_p1.CDV, x = ssc_diagnosis_verbal_iq_type_p1.CDV, n = 1887, Corr = 0.76 (pval = 0).]

Figure 7: Scatterplots and histograms/barplots for three variable pairs.


scatterplot is partially censored on the upper side at about y = 210, as can be seen both in the scatterplot and in the (lower) histogram.

• Categorical variables, when scored numerically, can be gainfully displayed in scatterplots. It is useful to jitter them to avoid being misled by overplotting. In Figure 7, jittering is applied to the x-variable in the center scatterplot and to both binary variables in the right-hand scatterplot.

• To enhance the perception of the association, the scatterplots can be decorated with smooths for continuous variables and with traces of group means when the x-variable is categorical with fewer than, say, 8 groups (default, can be changed). In the left and center scatterplots of Figure 7, the associations of the y-variables with the x-variables are seen to be somewhat non-linear, but compared to the linear component of the association, the non-linearities are relatively modest.⁹
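The jittering and the group-mean traces described in the last two items might be sketched as follows. The uniform jitter, its default width, and both function names are assumptions made for illustration; only the "fewer than 8 groups" default comes from the text.

```python
import random
from collections import defaultdict


def jitter(values, width=0.4, seed=None):
    """Add uniform noise in [-width/2, +width/2] to numerically scored categories."""
    rng = random.Random(seed)
    return [v + rng.uniform(-width / 2, width / 2) for v in values]


def group_mean_trace(x, y, max_groups=8):
    """Return sorted (level, mean of y) pairs if x has fewer than max_groups levels.

    Returns None otherwise, in which case a smooth would be the natural decoration.
    """
    groups = defaultdict(list)
    for xi, yi in zip(x, y):
        groups[xi].append(yi)
    if len(groups) >= max_groups:
        return None
    return [(level, sum(ys) / len(ys)) for level, ys in sorted(groups.items())]
```

The trace is computed from the unjittered x-values; jittering would be applied only to the plotted coordinates so that the group levels remain intact.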

The AN shows scatterplots and histograms/barcharts in a window separate from the blockplot window, one triple of plots at a time. To overcome the one-at-a-time limitation, the AN also offers scatterplot matrices (sometimes called “sploms”) of arbitrary numbers of variables. An example, involving four variables (different from those in Figure 7), is shown in Figure 8. For readers not familiar with scatterplot matrices, note that each variable pair is shown twice, in plots located symmetrically off the diagonal, and with reverse roles as x- and y-variables. Each diagonal cell shows a variable label that indicates (1) the common x-axis in the column of the cell and (2) the common y-axis in the row of the cell. For the reader familiar with scatterplot matrices, note that we show the vertical order of the variables ascending from bottom to top, the reason being consistency with the convention we use in the blockplots.
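The layout convention can be made explicit with a short sketch that lists, for each grid cell, which variable is on the x-axis and which on the y-axis. The function name `splom_cells` is hypothetical; rows are listed from the top of the display downward, so the first variable ends up in the bottom row.

```python
def splom_cells(variables):
    """Return a grid (list of rows, top row first) of (x_variable, y_variable) pairs.

    Columns run left to right in the given order; rows ascend from bottom to top.
    """
    k = len(variables)
    grid = []
    for row_from_top in range(k):
        # The top row holds the last variable, the bottom row the first.
        y_var = variables[k - 1 - row_from_top]
        grid.append([(x_var, y_var) for x_var in variables])
    return grid


grid = splom_cells(["a", "b", "c"])
# Diagonal cells (same x and y variable) run from bottom left to top right.
assert grid[2][0] == ("a", "a") and grid[0][2] == ("c", "c")
```

Each off-diagonal pair appears twice with the roles of x and y reversed, for example `("b", "a")` in the bottom row and `("a", "b")` in the middle row.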

As for particulars of the scatterplot matrix shown in Figure 8, the visually most striking features concern marginal distributions, not associations: The first variable is capped at the maximal value +90, and the fourth variable is binary. Otherwise the associations look simply monotone and seem well-summarized by correlations.

3.6 Variations on Blockplots

Blockplots are not the most common visualizations of correlation tables. As a Google search of “correlation plot” reveals, the most frequent visual rendering of correlation tables is in terms of “heatmaps” where square cells are always filled and numeric values are coded on a gray or color scale. An example is shown in the left frame of Figure 9; for comparison, the right frame shows the corresponding blockplot. Here are a few observations about the two types of plots:

scaled to cohere with the test administered to the majority, but in practice it creates a minor mode in the low end of the IQ distribution, more so for verbal IQ than nonverbal IQ.

9 The non-linearity on the left could be due to the marginal distributions. The non-linearity in the center is expected by the expert: verbal mental age (vma) on the y-axis should be considerably higher on average in ADOS modules 3 and especially 4 because these modules or levels are formed from a simple test of language competence.


[Figure 8: scatterplot matrix of the four variables ados2_algorithm_p1.OCUV, srs_teacher_t_score_p1.CDV, srs_parent_t_score_p1.CDV, and vineland_ii_composite_standard_score_p1.CDV.]

Figure 8: Scatterplot matrix of four variables. (Note the convention for the vertical order of the variables: bottom to top, for consistency with the blockplots.)


• Color and gray scale is generally a weaker visual cue than size. This argument favors blockplots as long as the blocks are not too small, that is, as long as the view is not zoomed out too much. The superiority of blockplots over heatmaps is also noted by Wickham, Hofmann and Cook (2006, Figure 2).

• In heatmaps color fuses adjacent cells when they are close in value. This may or may not be a problem for the trained eye, but there is a loss of identity of the rows and columns in heatmaps.

• Heatmaps do not permit mark-up with background color because they fill the square or rectangular cells completely. This problem can be overcome by shrinking the heatmap cells somewhat to allow some surrounding space to be freed up that can be filled with background color for mark-up, as shown in the center frame of Figure 9. This method of rendering, however, seems to further decrease the crispness of heatmaps.

• Heatmaps perform nicely when the view is heavily zoomed out, in which case the individual blocks are so small that size is no longer visually functional as a cue. In this case color coding works well and gives an accurate impression of global structure. We solve this problem for blockplots by showing only 10,000 or so of the largest correlations when heavily zoomed out. Thinning the table in this manner works well even when the visible table is so large that each cell is strictly speaking below the pixel resolution of a raster screen.

Because neither of the two types of plots, blockplots or heatmaps, may be uniformly superior at all scales, the AN provides both, and with one keystroke one can toggle between the two rendering methods. Varying block size allows for the mixed variant shown in the center of Figure 9.

Visualization of correlation tables has a small literature in statistics. An early reference that addresses large correlation tables is Hills (1969), who applies half-normal plots to tell statistically significant from insignificant correlations and clusters variables visually in two-dimensional projections. Closer to the present work are articles by Murdoch and Chow (1996) and Friendly (2002). Both propose relatively complex renderings of correlations with ellipses or augmented circles that may not scale up to the sizes of tables we have in mind but may be useful for conveying richer information for tables that are smaller, yet too large for numeric table display. Blockplot coding, which uses squares, has the advantage that these shapes can completely fill their cells to represent extremal correlations, as they are geometrically similar to the shapes of the containing cells (at least if the default aspect ratio of the blockplot is maintained), whereas all other shapes leave residual space even when maximally expanded.

What we prefer to call descriptively “blockplots”, possibly contracted to “blots”, has previously been named “fluctuation diagrams” (Hofmann 2000). Under this term one can find a static implementation in the R-package extracat on the CRAN site, authored by Pilhoefer and Unwin (2013). Static software for heatmaps is readily available, for example, in the R-function heatmap(). Heatmaps are often applied to raw data tables, but they can be equally applied to correlation tables. Many variations of glyph coding can be found in the classic book by Bertin (1983).

An interesting aspect of blockplots is that there exists science regarding the perception of area size. A general theory holds that most continuous stimuli (“continua” such as length, area, volume, weight, brightness, loudness, ...) result in perceptions according to “Stevens’ power law” (Stevens (1957), Stevens and Galanter (1957) and Stevens (1975)). That is, a quantitative stimulus $x$ translates to a quantitative perception $p(x)$ through a law of the form $p(x) = c\,x^{\beta}$. As discussed by Cleveland (1985, p. 243) with reference to Stevens (1975), for area perception the power is about $\beta = 0.7$, meaning that an actual area ratio of 2:1 is on average perceived as a ratio of $(2{:}1)^{0.7} \approx 1.62$. This law can be leveraged to determine the transformation that should be used to map correlations to squares in a blockplot. In R the symbol size is parametrized in terms of a linear expansion factor called cex (‘character expansion’). Our goal is to use block sizes such that their perceived ratios faithfully reflect the ratios of respective correlations. This results in the condition $\mathrm{cor} \sim p(\mathrm{cex}^2) = (\mathrm{cex}^2)^{0.7} = \mathrm{cex}^{1.4}$, hence $\mathrm{cex} \sim \mathrm{cor}^{1/1.4} \approx \mathrm{cor}^{0.7}$. This is indeed the default power transformation in the AN, although users can change it (see Appendix B). If most correlations are very small, a power closer to zero will expand the range of small values, resulting in enhanced discrimination at the low end at the cost of attenuated discrimination at the high end.
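The power transformation just derived can be sketched in a few lines of R. This is only an illustration of the arithmetic, not the AN's own code: the function name cor_to_cex and its arguments beta and cex.max are hypothetical, and the exponent 0.7 is the value quoted above for area perception.

```r
# Map correlations to symbol expansion factors (cex) such that the
# *perceived* block area is roughly proportional to |correlation|.
# Assumption: Stevens' exponent beta = 0.7 for area perception.
cor_to_cex <- function(r, beta = 0.7, cex.max = 1) {
  # perceived size ~ (cex^2)^beta, hence cex ~ |r|^(1 / (2 * beta))
  cex.max * abs(r)^(1 / (2 * beta))
}

r   <- c(0.1, 0.25, 0.5, 1)
cex <- cor_to_cex(r)
round(cex, 3)

# Applying the power law to the block areas recovers |r|:
perceived <- (cex^2)^0.7
all.equal(perceived, abs(r))
```

Lowering the exponent 1 / (2 * beta) toward zero stretches the small correlations apart, which is the trade-off described in the last sentence above.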

Figure 9: A heatmap (left) compared with a corresponding blockplot (right), as well as a “shrunk heatmap” in the center.


4 Operation of the Association Navigator

The purpose of the AN is to generate the displays described above in rapid order and even with realtime motion. Numerous realtime operations are under mouse and keyboard control, while a few text-based operations are under dialog and menu control. Further parameters can be controlled from the R language (see Appendix B), but this will not be necessary for most users. This section describes the operations of the AN, the purposes they serve, as well as a minimal set of R-related instructions that concern one-time setup, regular starting up, and saving of state. The software will be available as an R-package, but the instructions below do not reflect this and get the reader going by sourcing the software from the first author’s site.

4.1 Starting Up the AN

In order to simply see an AN running, the reader may paste the following code into an R interpreter:

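source("http://stat.wharton.upenn.edu/~buja/association-navigator.R")  # download and source the AN software

p <- 200  # number of variables (columns)

mymatrix <- matrix(rnorm(100*p), ncol=p)  # 100 rows of normal random numbers

colnames(mymatrix) <- paste("V", 1:p, "_", c(rep("A",p/2),rep("B",p/2)), sep="")  # names V1_A, ..., V200_B

a.n <- a.nav.create(mymatrix)  # create an AN instance from the data matrix

a.nav.run(a.n)  # open the master window and start the AN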

This code will download and source the software, generate an artificial data matrix of normal random numbers, generate an instance of an AN from it, and start up by creating a window showing a blockplot of correlations as they arise from pure random association among 200 variables given a sample size of 100. The variables are divided into two blocks of 100 variables each, suffixed “A” and “B”, respectively. The reader may left-drag the mouse in the plot to see a first realtime response.
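To get a feeling for what “pure random association” looks like before starting the AN, one can inspect the correlations directly. The following sketch uses only base R and the same dimensions as the matrix generated by the code above (100 rows, 200 columns); the seed is arbitrary and serves reproducibility only.

```r
set.seed(1)                        # arbitrary seed, for reproducibility only
n <- 100; p <- 200                 # rows (cases) and columns (variables)
x <- matrix(rnorm(n * p), ncol = p)

r     <- cor(x)
r.off <- r[upper.tri(r)]           # the p*(p-1)/2 distinct off-diagonal correlations

length(r.off)                      # number of variable pairs
sd(r.off)                          # spread of the spurious correlations ...
1 / sqrt(n - 1)                    # ... which is close to this null value
max(abs(r.off))                    # the largest purely spurious correlation
```

With a sample size of 100 the spurious correlations scatter around zero with a standard deviation of about 0.1, so the blockplot of such data shows many small blocks and no structure.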

To prevent confusion in the operation of an AN, users should note the following fundamental points:

• Important: While the AN is running, the R interpreter (R Gui) is blocked by the execution of the AN’s event loop! All interactions must be directed at the master window of the AN, which usually shows a blockplot.

• Quitting the AN and returning to the R interpreter is done by typing the capital letter ‘Q’ into the AN master window. The master window will remain as a passive R plot window. It will no longer respond to user input, but the R interpreter (R Gui) will be responsive again. (A live AN can also be stopped violently by typing interrupt characters ctrl-C into the R interpreter or by killing the AN master window, but an educated R user wouldn’t be this crude.)


• Help: On typing the letter ‘h’ into a live AN, a help window will appear with terse documentation of all AN interactions. The window is meant to give reminders to previously initiated AN users, not introductions to beginners. The help window is actually a menu: selecting a line that documents a keystroke will emulate the effect of that keystroke. Because the help window is a menu, it must be closed in order to regain the AN’s attention. (This behavior will be changed in a future version.)

• Notion of ‘state’: An AN instance has internal state. As a consequence, whenever a user stops a live AN and restarts it, it will resume in the exact state in which it was stopped.

• Saving ‘state’: It follows from the previous point that the state of an AN is saved across R sessions if the workspace image has been saved (save.image()) before quitting the R session.

4.2 Moving Around: Crosshair Placement, Panning and Zooming

When an AN is run for the first time, it shows an overview of the complete correlation table, which may comprise hundreds of variables. Most likely the variables will be organized in variable groups that are characterized by shared suffixes of variable names and visually form a series of highlight squares along the ascending diagonal. The first order of business is to zoom in and pan up and down the ascending diagonal to gain an overview of these sub-tables. Here are the steps:

• Crosshair: Place it by left-clicking anywhere in the plotting area. All subsequent zooming is done with regard to the location of the crosshair; it is also the reference point for some panning operations. Repeat left-clicking a few times for practice. The last location of the crosshair will be the target for zooming, described next.

• Zooming: Hit the following for a single step of zooming, or keep depressed for “continuous” zooming.

– ‘i’ for zooming in (alternate: ‘=’).

– ‘I’ for accelerated zooming in (alternate: ‘+’).

– ‘o’ for zooming out (alternate: ‘-’).

– ‘O’ for accelerated zooming out (alternate: ‘_’).

Accelerated zooming changes the visible range by a factor of 2, whereas regular zooming is adjusted such that 12 steps change the visible range by a factor of 2. Thus the accelerated zooms are usually done discretely with single keystrokes, and the regular zooms in “continuous” mode with depressed keys. For practice, zoom in and out a few times with your choice of key alternates.


• Panning (shifting, translating) is most frequently done by dragging the mouse, but keystrokes are sometimes useful for vertical, horizontal, and diagonal searching.

– Left-depress the mouse and drag; the plot will follow. When heavily zoomed out from a large table, the response may be slow. The more zoomed in the view is, the swifter the response to mouse dragging will be.

– ‘←’, ‘→’, ‘↑’, ‘↓’ for translation in the obvious directions by one block/variable per keystroke.

– ‘d’/‘D’ for diagonal moves down/up the ascending 45 degree diagonal.

– ‘ ’, the space bar for accelerated panning by doing the last single-step keyboard move in jumps of five blocks/variables instead of one.

– ‘.’ to pan so the crosshair location becomes the center of the view.

– ‘[’, ‘]’, ‘{’, ‘}’ to pan so the crosshair location becomes, respectively, the bottom left, the bottom right, the top left, or the top right of the view.

Yet another method of panning will be described below under “Text Search for Variable Names.” Combined pan/zoom based on focus rectangles is described in Section 4.6.
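The arithmetic behind the two zoom speeds described above can be sketched as follows. The helper zoom_about and the numeric ranges are hypothetical and only illustrate the idea of zooming relative to the crosshair; the AN's internal implementation may differ.

```r
step.regular     <- 2^(1/12)   # twelve regular steps compound to a factor of 2
step.accelerated <- 2          # one accelerated step

# Zoom an axis range 'lim' about a fixed point 'center' (the crosshair).
# factor > 1 zooms in, factor < 1 zooms out.
zoom_about <- function(lim, center, factor) center + (lim - center) / factor

xlim  <- c(0, 200)             # made-up visible range, in block units
cross <- 50                    # made-up crosshair position

z <- xlim
for (i in 1:12) z <- zoom_about(z, cross, step.regular)
z                                          # after twelve regular zoom-in steps
zoom_about(xlim, cross, step.accelerated)  # after one accelerated zoom-in step
```

Both results agree up to rounding, and the crosshair position keeps its relative place in the view, which is why the crosshair should be placed before zooming.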

4.3 Graphical Parameters

Graphical parameters that determine the aesthetics of a plot are rarely gotten right by automatic algorithms. The problem of aesthetics is particularly difficult when zooming in and out over several orders of magnitude is the order of the day. The AN therefore does not even attempt to guess pleasing, much less optimal, values for such graphical parameters as font size of variable labels and margin size in blockplots. Instead, the user gets to choose them by trial and error as follows:

• Block size in the blockplot: hit or depress

– ‘b’ to decrease,

– ‘B’ to increase.

After starting up a new AN, adjusting the block size is usually the second operation after zooming in.

• Crosshair size: hit or depress

– ‘c’ to decrease,

– ‘C’ to increase.

Exploding the crosshair by depressing ‘C’ is an effective method for reading the variable names of a given block in the margins.

• Font size of the variable labels: hit or depress


– ‘f’ to decrease,

– ‘F’ to increase.

Important: When the font size is large in relation to the zoom, the variable labels get “thinned out” to avoid gross overplotting (only every second, third ... label might be shown). This allows viewers to at least identify the variable group from the suffix.

• Margin size for the variable labels: hit or depress

– ‘m’ to decrease,

– ‘M’ to increase.

Margin size needs adjusting according to the prevalent label length and font size. A dilemma occurs when, for example, the x-variable labels are much shorter than the y-variable labels. For this situation we want the following:

• Differential margin size for the variable labels: hit or depress

– ‘n’ to decrease the left/y margin and increase the bottom/x margin,

– ‘N’ to increase the left/y margin and decrease the bottom/x margin.

4.4 Correlations, P-values, Missing and Complete Pairs

By default the blockplot of an AN represents correlations, but the user can choose to have it represent p-values, the fraction of missing (incomplete) pairs, or the fraction of complete pairs as follows: Hit

• ‘ctrl-O’ for observed correlations,

• ‘ctrl-P’ for p-values of the correlations (Section 3.3),

• ‘ctrl-M’ for fraction of missing/incomplete pairs (Section 3.4),

• ‘ctrl-N’ for fraction of complete pairs (Section 3.4).

As discussed in Section 3.3, p-values can be thresholded to obtain Bonferroni-style protection against multiplicity. The thresholds are confined to a ladder of “round” values. Stepping up and down the ladder is achieved by repeatedly hitting

• ‘>’ to lower the threshold and obtain greater protection,

• ‘<’ to raise the threshold and lose protection.

Recall Figure 5 for two examples of p-value blockplots that differ in the threshold only. Thresholding also applies to correlation blockplots, in which case ‘>’ raises the threshold on the magnitude of the correlations that are shown, and ‘<’ lowers it.
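The effect of stepping along a threshold ladder can be imitated outside the AN with base R. The sketch below is an assumption-laden illustration: the ladder values, the data sizes and the seed are made up, and the p-values are the usual two-sided t-test p-values for Pearson correlations, which may or may not match the exact definition used in Section 3.3.

```r
set.seed(2)                          # arbitrary seed, for reproducibility only
n <- 100; p <- 20                    # made-up numbers of cases and variables
x <- matrix(rnorm(n * p), ncol = p)

R <- cor(x)
r <- R[upper.tri(R)]                 # distinct pairwise correlations

# Two-sided p-values from the t-statistic of a Pearson correlation
tstat <- r * sqrt((n - 2) / (1 - r^2))
pval  <- 2 * pt(-abs(tstat), df = n - 2)

m      <- length(pval)               # number of tests (variable pairs)
ladder <- c(0.05, 0.01, 0.001, 1e-4, 1e-5)   # hypothetical ladder of "round" thresholds

# Number of pairs that would remain visible at each rung of the ladder
shown <- sapply(ladder, function(a) sum(pval < a))
names(shown) <- ladder
shown

# Bonferroni protection at overall level 0.05 corresponds to the threshold 0.05 / m
sum(pval < 0.05 / m)
```

Lowering the threshold can only remove pairs from the display, never add any, which is the sense in which a lower threshold gives greater protection.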

Sometimes it is useful to compare magnitudes of the blocks without the distraction of color, hence it may be convenient to hit

• ‘ctrl-A’ to toggle between showing all blocks in blue (ignoring signs) and showing the negative correlations (and their p-values) in red.
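The quantities selectable with ‘ctrl-M’ and ‘ctrl-N’ can be approximated in base R as follows. This is a sketch under the assumption that the “fraction of complete pairs” for two variables is the fraction of cases in which both values are observed; Section 3.4 gives the authoritative definition. The data and the pattern of missing values are made up.

```r
set.seed(3)                               # arbitrary seed, for reproducibility only
x <- matrix(rnorm(50 * 4), ncol = 4,
            dimnames = list(NULL, paste0("V", 1:4)))
x[sample(length(x), 40)] <- NA            # punch random holes, for illustration

obs <- (!is.na(x)) + 0                    # 1 where a value is observed, 0 where missing

# Entry (j,k): fraction of cases in which both variable j and variable k are observed
frac.complete <- crossprod(obs) / nrow(x)
frac.missing  <- 1 - frac.complete
round(frac.complete, 2)

# Correlations computed from the complete pairs of each variable pair
r <- cor(x, use = "pairwise.complete.obs")
round(r, 2)
```

The diagonal of frac.complete gives the fraction of observed values per variable, so variables with many missing values stand out along the diagonal.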


4.5 Highlighting (1): Strips

Highlight strips are horizontal or vertical bands that run across the whole width or height of the blockplot. They help users search the associations of a given variable with all other variables. Cross-wise highlight strips are also often placed to maintain the connection between a given block and the labels of the associated variable pair. By default the color of highlight strips is “lightgoldenrod1” in R. Their appearance is shown in Figure 2. Highlight strips can co-exist in any number and combination, horizontally and vertically. The mechanisms for creating and removing them are as follows:

• Right-click the mouse on

– a block in the blockplot to place a horizontal and a vertical highlight strip through the block;

– an x-variable label on the horizontal axis to place a vertical highlight strip through this variable;

– a y-variable label on the vertical axis to place a horizontal highlight strip through this variable.

• Hit ‘ctrl-C’ to clear the strips and start from scratch.

Instead of clicking one can right-depress and drag the mouse across the blockplot with the effect that horizontal and vertical strips are placed across all blocks touched by the drag motion.

Vertical highlight strips lend themselves to convenient searching of associations between a fixed variable on the horizontal axis and all variables on the vertical axis. To this end it is useful to pan vertically with ‘↑’, ‘↓’, and the space bar as accelerator (Section 4.2).

4.6 Highlighting (2): Rectangles

A highlight rectangle is a rectangular area in the blockplot selected by the user for highlighting. Highlight rectangles are meant to help the user focus on the associations between contiguous groups of variables on the horizontal and the vertical axis. By default the color of highlight rectangles is “lightcyan1” in R. Their appearance is that of the center square in Figure 4. In the case of this figure, the highlight rectangle coincides with the highlight square for the variable group defined by the suffix “p1.CDV”. Unlike highlight squares, which mark predefined variable groups, highlight rectangles can be placed (and removed from) anywhere by the user. The mechanisms to this end are as follows:

• Define a highlight rectangle in arbitrary position by placing two opposite corners:

– Place the crosshair in the location of the desired first corner; then hit ‘1’ to place the first corner of a new rectangle.

– Place the crosshair in the location of the desired second corner; then hit ‘2’ to place the second corner.


Action ‘1’ creates a new highlight rectangle consisting of just one block. Action ‘2’ never creates a new rectangle but only sets/resets the second corner of the most recent rectangle.

• Define a highlight rectangle in terms of two variable groups:

– Place the crosshair such that the x-coordinate is in the desired horizontal variable group and the y-coordinate in the desired vertical variable group; then

– hit ‘3’ to create the highlight rectangle.

As a special case, this allows a highlight square to become a highlight rectangle by letting the x- and y-variable groups be the same, as in Figure 4.

• Pan and zoom to snap the view and the highlight rectangle to each other:

– Place the crosshair in the highlight rectangle to be snapped; then

– either hit ‘4’ to snap, preserving the aspect ratio,

– or hit ‘5’ to snap, distorting the aspect ratio, unless the rectangle is a square.

If the crosshair is not placed in a highlight rectangle, the most recent one will be used. Note that the squares in a blockplot always remain squares, even if the aspect ratio of the plot has been distorted. Changing the aspect ratio has the consequence that the squares can no longer fill their cells because the cells have become rectangles.

• Any number of highlight rectangles can co-exist. Remove them selectively as follows:

– Place the crosshair anywhere in a highlight rectangle to be removed; then

– hit ‘0’ to remove it.

4.7 Reference Variables

A recurrent issue when using the AN is that some variables are often of persistent interest. In autism phenotype data, for example, a recurrent theme is to check up on age, gender and site association (potential confounders) while examining associations within and between various “autism instruments” such as ADOS, ADI, RBS, ... To spare users the distraction of hopping back and forth across the multi-hundred square table, the AN implements a notion of “reference variables”, that is, variables that never disappear from view. The AN keeps them tucked in the left and the bottom of the blockplot. The manner in which reference variables present themselves is shown in Figure 10. Reference variables are selected by first marking them with highlight strips (Section 4.5), and then hitting

• ‘R’ to turn the strip variables into reference variables,

• ‘r’ to toggle on and off the display of the selected reference variables.


Figure 10: Reference variables shown in the left and bottom bands. Whenever the user zooms and pans the blockplot, these variables stay in place and show their associations with the variables from the rest of the blockplot.


The disentangling of the two actions allows users to keep marking up strips without changing the earlier selected reference variables.

In Figure 10, the y-reference variables are “sz.sorted_sites.FAM” and “family.ID”, and their associations with the x-variables are shown in the horizontal band at the bottom. Similarly, the x-reference variables are “age_at_ados_p1.CDV”, “sex_p1.CDV”, “ethnicity_p1.CDV” and “ados_module_p1.CDV”, and their associations with the y-variables are shown in the vertical band on the left. The bottom left corner, where the two reference bands intersect, shows the associations between the x- and y-reference variables.

4.8 Searching Variables

Another recurrent issue in analyzing large numbers of variables is simply finding variables. For example, one may need to

• find a variable whose name one remembers partly, but not exactly; or

• find a set of variables whose names share a meaningful syllable.

In the context of autism, for example, it might be of interest to find all variables related to anxiety across all instruments; it would then be sensible to search for all variables that contain the string “anx” in their name. This type of problem can be solved in the AN with a blend of text search and menu selection. We address here the problem of locating one variable and panning to it. To this end hit...

• ‘H’ to locate a variable on the x-axis;

• ‘V’ to locate a variable on the y-axis;

• ‘@’ to locate a variable on both the x- and the y-axis.

In each case a dialog box pops up where a search string or regular expression can be entered. On hitting ‘<Return>’ or ‘OK’, a menu appears with the list of variables that contain the search string or match the regular expression (according to R’s grep() function). The user is then asked to select one of the offered variables, upon which the AN pans to the variable on the x-axis, the y-axis, or both (depending on ‘H’, ‘V’ or ‘@’), marks it with a vertical or horizontal highlight strip or both, and places the crosshair on it. See Figure 11.

Search can be bypassed by not entering a search string at all. The menu then shows the complete list of all variables, with scrolling.
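The matching step can be mimicked outside the AN with base R's grep(), which the text names as the matching engine. The following is only a minimal sketch: the vector `var.names` is a hypothetical stand-in holding a few variable names of the kind shown in Figure 11, not the AN's internal state, and the AN's dialog and menu code is not reproduced.

```r
# Hypothetical stand-in for the AN's list of variable names
var.names <- c("cbcl_2_5_anxious_depressed_p1.OCUV",
               "cbcl_2_5_withdrawn_p1.OCUV",
               "cbcl_2_5_anxiety_problems_p1.OCUV",
               "cbcl_6_18_anxious_depressed_p1.OCUV",
               "scq_life_total_p1.OCUV")

# Plain search string: all names containing "anx"
hits <- grep("anx", var.names, value = TRUE)
print(hits)

# Regular expression: names containing "anx" within the CBCL 2-5 instrument only
hits.2.5 <- grep("^cbcl_2_5_.*anx", var.names, value = TRUE)
print(hits.2.5)

# An empty pattern matches every name, analogous to bypassing the search
all.names <- grep("", var.names, value = TRUE)
```

The list returned by grep() corresponds to what the menu would offer for selection.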

4.9 Lenses: Scatterplots and Barplots/Histograms

We think of barplots, histograms and scatterplots as lenses into the blocks, each of which represents a pair (x, y) of variables. Taking the pair “under the lens” means looking at the association (and the marginal distributions) in greater detail; see Section 3.5 above. The mechanics are as follows: Hit


[Figure 11 image: blockplot titled “Correlations (Compl. Pairs)”; both axes are labeled with SSC phenotype variable names, among them “cbcl_2_5_anxious_depressed_p1.OCUV”.]

Figure 11: Text search with ‘H’ for horizontal variables containing “anx”, followed by selection of “cbcl_2_5_anxious_depressed_p1.OCUV”. The view pans horizontally to the selected variable, marks it with a vertical highlight strip, and places the crosshair on it.


• ‘x’ to see in a separate window (Figure 7) a scatterplot and barplots/histograms of the two variables marked by the crosshair cursor.

• ‘y’ to switch the x-y roles of the variables.

• ‘l’ to toggle showing a “line”, that is, a smooth if x is quantitative, and a trace of y-means of the x-groups if x is categorical.

Important: The lens window is passive and does not accept interactive input. One must expose the blockplot master window to continue with AN interactions.

These lenses have a simple history mechanism in that the consecutive x-y variable names are collected in a list that can be traversed and edited: Hit

• ‘PgUp’ to take one step back in the history,

• ‘PgDn’ to take one step forward in the history,

• ‘Home’ to jump to the beginning of the history,

• ‘End’ to jump to the end of the history (the present),

• ‘Delete’ to delete the current lens from the history.

Finally, there is a separate lens mechanism with its own window that shows all pairwise scatterplots of the variables currently in highlight strips. An example is shown in Figure 8. As to the mechanics, hit

• ‘z’ to create the scatterplot matrix with independently scaled axes;

• ‘Z’ to create the scatterplot matrix with identically scaled axes.

The latter option is sometimes useful when all variables live on the same scale but have somewhat different ranges.

4.10 Color-Brushing in Scatterplots

Often one would like to focus on groups of cases in the scatterplots of the lens window. This can be achieved with color brushing as follows:

• Hit ‘s’ to see the current lens scatterplot in the main window, replacing the blockplot.

• Hit ‘r’ to fix one corner of a brush at the current mouse location.

• Left-depress and drag the mouse: the rectangular brushing area should open up and change shape. Whenever the brush moves over a scatterplot point, it will change color.

• Right-depress and drag the mouse: the rectangular brushing area will translate along with the mouse. Again, moving over scatterplot points will change their color.

• The brushing color can be changed by cycling through a series of colors, hitting ‘S’. The color gray does not paint; it is useful for counting the points under the brush as their number is shown in the bottom left corner.

• Hit ‘s’ to return to the blockplot in the main window.

Thus, hitting ‘s’ toggles between blockplot and scatterplot in the main window. After each brushing operation, the lens scatterplot will follow suit and color its points to match those in the main window.

Figure 12: Screenshot of the adjustment menu. As shown, it enables adjustment of the “srs” variables for “age_at_ados_p1.CDV” and “sex_p1.CDV”.

4.11 Linear Adjustment

Another recurrent task in large tables is what we may call “adjustment”. The phrase “adjusting for x” has many synonyms: “accounting for x”, “controlling for x”, “correcting for x”, “allowing for x”, and “holding x fixed” or “conditioning on x”. Technically most correct is the last expression: We are often interested in the conditional association between variables y and z given (holding fixed) a variable x, as measured for example by the conditional correlation $r(y, z \mid x)$. In the context of the autism phenotype, one may be interested in adjusting for age and/or gender. In practice, particularly in large-p problems, there is rarely sufficient data to truly estimate conditional distributions (see footnote 10), hence one makes the simplifying assumption that all associations are linear with constant conditional variances (homoscedasticity; see footnote 11). In that case, adjustment of y for x amounts to a linear regression and forming residuals, that is, “residualizing” or “partialling out” is done by subtracting the equation fitted with linear regression: $y_{\bullet x} = y - (b_0 + b_1 x)$. As a consequence, $r(y_{\bullet x}, x) = 0$, that is, by forming $y_{\bullet x}$ one removes from y the linear association with x. This type of linear adjustment generalizes to multiple x variables by residualizing with regard to a multiple linear regression.
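Residualizing can be illustrated with a few lines of base R. This is a sketch on simulated data, not the AN's internal code; the variables `x`, `y` and `z` and their coefficients are made up for illustration.

```r
set.seed(1)                         # simulated data, for illustration only
n <- 200
x <- rnorm(n)                       # adjustor, e.g. a stand-in for age
y <- 2 + 0.8 * x + rnorm(n)         # adjustee that depends linearly on x
z <- 1 - 0.5 * x + rnorm(n)         # a second variable that also depends on x

fit <- lm(y ~ x)                    # least-squares fit b0 + b1 * x
y.x <- y - (coef(fit)[1] + coef(fit)[2] * x)   # residuals, same values as resid(fit)
z.x <- resid(lm(z ~ x))

cor(y.x, x)                         # zero up to rounding error
cor(y, z)                           # raw correlation, partly driven by x
cor(y.x, z.x)                       # correlation after adjusting both for x
```

With several adjustors the same idea applies, with the regression formula extended to, say, `y ~ x1 + x2`.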

In the AN implementation of linear adjustment, one has to select a set of “independent” x-variables, called “adjustors”, and a set of “dependent” y-variables, called “adjustees”. Often the set of adjustors is small, possibly just one variable such as age, whereas the set of adjustees can be large, for example, all items and summary scales of an autism phenotype instrument such as the SRS (“Social Responsiveness Scale”). The selection mechanisms are the same for both adjustors and adjustees: text search or regular expression matching, followed by menu selection, similar to Section 4.8, but here the menu selection allows multiple choices. The mechanics are as follows: Hit

• ‘A’ to call up a large menu that forms the interface for all adjustment operations.

An example is shown in Figure 12. Initially, the lists of adjustors and adjustees will be empty, so both need to be populated with text searches that require a dialog initiated by selecting the lines “Find ADJUSTORS...” and “Find ADJUSTEES...” in sequence. Figure 12 shows the state after having matched the regular expression “age_at|sex_p1.CDV” for adjustors and searched the string “CDV” for adjustees.

Finally, after selection of adjustors and adjustees is completed, the user may select the top line of the menu to actually “Do Adjustment”. Each raw adjustee will then be replaced by its residuals obtained from the regression onto the adjustors. (To undo adjustment, select the second line from the Adjustment dialog, “Undo Adjustment”.)

10 Natural exceptions do exist: If we analyze females and males separately, for example, we study gender-conditional associations.

11 Both assumptions may be wrong, but some form of adjustment, even if flawed, is often more informative than remaining with raw variables.

To assist the visual examination of adjustment results, one may want to select the third line from the top of the menu in order to highlight the adjustors among the x-variables and the adjustees among the y-variables (“Mark with Highlight Strips...”). Turning them further into reference variables (Section 4.7) by hitting ‘R’, we obtain Figure 13. As it should be, the correlations between the two adjustors on the x-axis and the many adjustees on the y-axis vanish. The correlations of the adjustees with other variables may now be of renewed interest because they are free of age and gender “effects”, which would invite a search of the correlations in the horizontal band of the adjustees.

A word of caution: Adjustment of a y-variable uses only those cases in which neither the adjustee nor any of the adjustors is missing. Thus the underlying set of cases may have been inadvertently decreased. It is therefore good advice to check the missing-pairs patterns with either ‘ctrl-M’ or ‘ctrl-N’ (Section 4.4) or by looking at scatterplots (Section 4.9).
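The loss of cases can be checked directly in R before adjusting. The sketch below uses simulated data with hypothetical variables `age`, `sex` and `y`; the use of `na.exclude` is one way to mimic the behavior described above in base R and is an assumption, not a description of the AN's internals.

```r
set.seed(2)                                   # simulated data, for illustration only
n <- 100
age <- rnorm(n, mean = 9, sd = 3)
sex <- rbinom(n, 1, 0.5)
y   <- 50 + 2 * age + 5 * sex + rnorm(n)
age[sample(n, 15)] <- NA                      # inject missing adjustor values
y[sample(n, 10)]   <- NA                      # inject missing adjustee values

d <- data.frame(y, age, sex)
used <- complete.cases(d)                     # cases usable for the regression
sum(used)                                     # number of cases behind the residuals

# na.exclude drops incomplete cases for fitting but pads residuals back with NA
fit <- lm(y ~ age + sex, data = d, na.action = na.exclude)
y.adj <- resid(fit)                           # length n, NA where a case was dropped
```

Comparing `sum(used)` with `n` shows how many cases the adjusted variable has lost.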

Having done adjustment of variables, one often wonders how much of it was done and to which variable. To answer this question, select the fourth line from the adjustment dialog (“Sort Adjustees...”): The result is a sorted list of the adjustees according to the $R^2$ values from the regression of the adjustees/y-variables onto the adjustors/x-variables. See Figure 14 for an example.
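A sorted list of this kind can be sketched in base R as follows. The data are simulated and the column names `strong`, `medium` and `none` are hypothetical; the AN's own menu code is not shown here.

```r
set.seed(3)                                   # simulated data, for illustration only
n <- 150
age <- rnorm(n)
sex <- rbinom(n, 1, 0.5)
adjustees <- cbind(strong = 1.5 * age + rnorm(n),
                   medium = 0.5 * age + 0.5 * sex + rnorm(n),
                   none   = rnorm(n))

# R^2 from regressing each adjustee on the adjustors
r2 <- apply(adjustees, 2, function(y) summary(lm(y ~ age + sex))$r.squared)

sort(r2, decreasing = TRUE)                   # most strongly adjusted variables first
```

Variables near the top of the list are the ones that adjustment changed the most.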

4.12 The Future of the AN

The functionality described here reflects the 2015 implementation of the AN. Changes to the AN are planned, the major one being a redesign to give the lens windows interactive responsiveness as well. Currently all interaction is funneled through the blockplot window, even if the actions affect the lens window.

Other obvious functionality is still missing, above all sorting of variables, manual and algorithmic, and a limited set of sorting operations may be added in a future version of the AN. If readers of this document and users of the AN have further suggestions, the authors would appreciate hearing them.


[Figure 13 image: blockplot titled “Correlations (Compl. Pairs)”; both axes are labeled with SSC phenotype variable names ending in “.CDV” and “.OCUV”.]

Figure 13: Results of adjustment of the “CDV” variables for “age_at_ados_p1.CDV” and “sex_p1.CDV”: the former are reference variables on the y-axis, the latter on the x-axis. As it should be, the correlations between adjustors and adjustees vanish.


Figure 14: List of adjustees/y-variables sorted according to the $R^2$ values from the regressions onto the adjustors/x-variables.


A Appendix: The Versatility of Correlation Analysis

We return to the apparent limitations of correlations as measures of association, which were left as a loose end in the Introduction. We address the objections that (1) correlations are measures of linear association only, (2) correlations reflect bivariate association only, and (3) correlations apply to quantitative variables only. Towards this end we make the following observations and recommendations:

(1) While it is true that correlation is strictly speaking a measure of linear association among quantitative variables, it is also a fact that correlation is useful as a measure of monotone association in general, even when it is non-linear. As long as the association is roughly monotone, correlation will be positive when the association is increasing and negative when it is decreasing. Admittedly, correlation is not an optimal measure of non-linear monotone association, but it is still a useful one, in particular in the large-p problem. Lastly, if gross non-linearity is discovered, it is always possible to replace a variable X with a non-linear transform f(X) (often log(X)) so its association with other variables becomes more linear (see footnote 12).

(2) The objection that correlations only reflect bivariate association is factually correct but practically not very relevant. In practical data analysis it is too contrived to entertain the possibility that, for example, there exists association among three variables but there exists no monotone association among each pair of variables (see footnote 13). In general one follows the principle that lower-order association is more likely than higher-order association, hence pairwise association is more likely than true interaction among three variables. Therefore data analysts look first for groups of variables that are linked by pairwise association, and thereafter they may examine whether these variables also exhibit higher-order association. Note, however, that even multivariate methods such as principal components analysis (PCA) do not detect true higher-order interaction because they, too, rely on correlations only. Finally, we are not asserting that simple correlation analysis should be the end of data analysis, but it should certainly be near the beginning in the large-p problems envisioned here, namely, in the analysis of relatively noisy data as they arise in many social science and medical contexts (see footnote 14).

12 Linearity of association is not a simple concept. For one thing, it is asymmetric: if Y is linearly associated with X, it does not follow that X is linearly associated with Y. The reason is that the definition of linear association, $E[Y \mid X] = \beta_0 + \beta_1 X$, is not symmetric in X and Y. Linearity of association in both directions holds only for certain “nice” distributions such as bivariate Gaussians. A counter-example is as follows: Let X be uniformly distributed on an interval and $Y = \beta_0 + \beta_1 X + \varepsilon$ with independent Gaussian $\varepsilon$; then Y is linearly associated with X by construction, yet X is not linearly associated with Y.

13 An example would be three variables jointly uniformly distributed on the surface of a 2-sphere in 3-space.

14 In other large-p problems the variables may be so highly structured that they become intrinsically low-dimensional, as for example in the analysis of libraries of registered images where each variable corresponds to a pixel location and its values consist of intensities at that location across the images. The problem here is not to locate groups of variables with association but to describe the manifold formed by the images in very high-dimensional pixel space. A sensible approach in this case would be non-linear dimension reduction.


(3) The final objection we consider is that correlations do not apply to categorical variables. This objection can be refuted with very practical advice on how to make categorical data quantitative and how to interpret the meaning of the resulting correlations. We discuss several cases in turn:

– If a categorical variable X is ordinal (its categories have a natural order), it is common practice to simply number the categories in order and use the resulting integer variable as a quantitative variable. The resulting correlations will be able to reflect monotone association with other variables that may be expressed by saying “the higher categories of X tend to be associated with higher/lower values/categories of other variables.” An obvious objection is that the equi-spaced integers may not be a good quantification of the categories. If this is a serious concern worth some effort, one may want to look into optimal scoring procedures (see, for example, De Leeuw and van Rijckevorsel (1980) and Gifi (1990)). The idea behind these methods is to estimate new scores for the categorical variables by making them as linearly associated as possible through optimization of the fit of a joint PCA.

– If a categorical variable X is binary, it is common practice to numerically code its two categories with the values 0 and 1, thereby creating a so-called “dummy variable.” This practice is pervasive in the Analysis of Variance (ANOVA), but its usefulness is less well known in multivariate analysis, which is our concern. The interpretation of correlations with dummy variables is highly interesting as it solves two seemingly different association problems:

∗ First order association between a binary variable X and a quantitative variable Y means that there exists a difference between the two means of Y in the two groups denoted by X. As it turns out, the correlation of a dummy variable X with a quantitative variable Y is mathematically equivalent to forming the t-statistic for a two-sample comparison of the two means of Y in the two categories of X ($t \propto r/(1 - r^2)^{1/2}$). Even more, the statistical test for a zero correlation is virtually identical to the t-test for equality of the two means. Thus two-sample mean-comparisons can be subsumed under correlation analysis.

∗ Association between two binary variables means that their 2×2 table shows dependence. This situation is usually addressed with Fisher’s exact test of independence. It turns out, however, that Fisher’s exact test is equivalent to testing the correlation between the dummy variables, the only discrepancy being that the normal approximation used to calculate the p-value of a correlation is just that, an approximation, although an adequate one in most cases.

– If a categorical variable X is truly nominal with more than two values, that is, neither binary nor ordinal, we may again follow the lead of ANOVA and replace X with a collection of dummy variables, one per category. For example, if in a medical context data are collected at multiple sites, it will be of interest to see whether substantive variables at some sites are systematically different from other sites. It is then useful to introduce dummy variables for the sites and examine their correlations with the substantive variables. A significant correlation indicates a significant mean difference at that site compared to the other sites.

This discussion shows that categorical variables can be fruitfully included in correlation analysis, either with numerical coding of ordinal variables, or with dummy coding of binary and nominal variables.
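The relation between the correlation with a dummy variable and the two-sample t-statistic can be checked numerically. The sketch below uses simulated data; the constant of proportionality $\sqrt{n-2}$ is the standard one for the pooled-variance t-test and is stated here as background, since the text gives the relation only up to proportionality.

```r
set.seed(4)                                   # simulated data, for illustration only
n <- 60
x <- rep(c(0, 1), each = n / 2)               # dummy variable for a binary category
y <- 10 + 1.2 * x + rnorm(n)                  # quantitative variable

r <- cor(x, y)
t.from.r <- r * sqrt(n - 2) / sqrt(1 - r^2)   # t-statistic computed from r alone

# Pooled-variance two-sample t-test; group 1 minus group 0 to match the sign of r
tt <- t.test(y[x == 1], y[x == 0], var.equal = TRUE)

c(from.correlation = t.from.r, from.t.test = unname(tt$statistic))
cor.test(x, y)$p.value                        # same p-value as tt$p.value
```

The agreement holds for the equal-variance form of the t-test; Welch's unequal-variance version (the default of `t.test`) would differ slightly.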

This concludes our discussion of the versatility of correlation analysis.

B Appendix: Creating and Programming AN Instances

To create a new instance of an Association Navigator for a given dataset, use the following R statement:

a.n <- a.nav.create(datamatrix)

where ‘datamatrix’ is a numeric matrix, not a dataframe. The new AN instance ‘a.n’ can be run with the following R statement:

a.nav.run(a.n)

These steps are completely general and may be useful for arbitrary numeric data matrices with up to about 2,000 variables.
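Because a.nav.create() expects a numeric matrix, data that arrive as a dataframe with categorical columns need to be coded first. The following sketch shows one way to do this in base R, following the coding advice of Appendix A; the dataframe `raw` and its columns are hypothetical, and the final call is commented out because it requires the authors' AN code.

```r
# Hypothetical raw data with one variable of each kind
raw <- data.frame(
  age      = c(4.5, 7.0, 9.5, 12.0, 6.0, 8.5),
  sex      = factor(c("male", "female", "male", "male", "female", "male")),
  severity = factor(c("low", "high", "medium", "low", "medium", "high"),
                    levels = c("low", "medium", "high"), ordered = TRUE),
  site     = factor(c("A", "B", "C", "A", "B", "C"))
)

datamatrix <- cbind(
  age      = raw$age,                          # quantitative: used as is
  sex.male = as.numeric(raw$sex == "male"),    # binary: 0/1 dummy variable
  severity = as.numeric(raw$severity),         # ordinal: categories numbered in order
  model.matrix(~ site - 1, data = raw)         # nominal: one dummy per category
)

stopifnot(is.matrix(datamatrix), is.numeric(datamatrix))
colnames(datamatrix)

# a.n <- a.nav.create(datamatrix)              # requires the authors' AN code
```

The formula `~ site - 1` drops the intercept so that every site gets its own dummy column.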

Table 1 shows a template for forming potentially useful instances of ANs that display large numbers of SSC phenotype variables. As written, the statement would produce an AN on the order of 3,000 variables.

ANs are implemented not as lists but as “environments,” a data structure that is relatively little known among R users. Environments have some interesting properties. One can look inside an AN with the R idiom

with(a.n, objects())

in order to list the AN-internal variables inside the AN instance ‘a.n’. Assignments and any other kind of programming of the internal state variables can be achieved the same way. For example, if one desires a change of color of highlight strips to “mistyrose”, one can achieve this with the following:

with(a.n, { strips.col <- "mistyrose"; a.nav.blockplot() })

The call to a.nav.blockplot() redisplays the blockplot with the new parameter setting. Changing the blockplot glyph from square to diamond is achieved with

with(a.n, { blot.pch <- 18; a.nav.blockplot() })


and reversing the color convention from “blue = positive” to “red = positive” in the style of heatmaps is done with

with(a.n, { blot.col.pos <- 2; blot.col.neg <- 4; a.nav.blockplot() })

Note, however, that this affects only blockplots, not heatmaps, the latter requiring computation of a color scale, not just a binary color decision. Still, there is plenty of opportunity for playfulness by experimenting with display parameters. A more sophisticated example concerns changing the power transformation that maps correlations to glyph sizes:

with(a.n, { blot.pow <- .7; a.nav.cors.trans(); a.nav.blockplot() })

In addition to redisplay with a.nav.blockplot(), this also requires recomputation of the display table with a.nav.cors.trans().

R environments represent one of the two data types (the other being “connections”) that disobey the functional programming paradigm that is otherwise fundamental to R. As a consequence, assignment of an AN does not allocate a new copy but passes a reference instead. In particular, the R statement

b.n <- a.n

creates a variable ‘b.n’ that will be a reference to the same environment as the variable ‘a.n’. Hence the two statements

a.nav.run(a.n)

a.nav.run(b.n)

will run off the same AN instance. They have identical effects in the sense that interactive operations affect the same instance.
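The reference behavior of environments can be seen without an AN at hand. In the sketch below the environment `e` is a toy stand-in for an AN instance; it borrows the name `strips.col` from the example above but contains nothing else of the AN.

```r
e <- new.env()                 # toy environment standing in for an AN instance
assign("strips.col", "gray", envir = e)

f <- e                         # no copy is made: f refers to the same environment
with(f, { strips.col <- "mistyrose" })   # assignment happens inside the environment

get("strips.col", envir = e)   # the change made through f is visible through e
identical(e, f)                # both names refer to one and the same environment

# Contrast with a list, which follows R's usual copy semantics
l1 <- list(strips.col = "gray")
l2 <- l1
l2$strips.col <- "mistyrose"
l1$strips.col                  # l1 is unchanged
```

This is also why the with() idiom above can change the state of an AN in place, which would not work if ANs were lists.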


References

[1] J. Bertin. Semiology of Graphics. Madison, WI: The University of Wisconsin Press, 1983.

[2] W. S. Cleveland. The Elements of Graphing Data. Pacific Grove, CA: Wadsworth & Brooks/Cole, 1985.

[3] J. De Leeuw and J. van Rijckevorsel. HOMALS and PRINCALS - some generalizations of principal components analysis. In E. Diday et al., editor, Data Analysis and Informatics II, pages 231–242. Amsterdam: Elsevier Science Publisher, North Holland, 1980.

[4] M. Friendly. Corrgrams: Exploratory displays of correlation matrices. The American Statistician, 56(4):316–324, 2002.

[5] A. Gifi. Nonlinear multivariate analysis. New York: John Wiley & Sons, 1990.

[6] M. Hills. On looking at large correlation matrices. Biometrika, 56(2):249–253, 1969.

[7] H. Hofmann. Exploring categorical data: Interactive mosaic plots. Metrika, 51(1):11–26, 2000.

[8] D. J. Murdoch and E. D. Chow. A graphical display of large correlation matrices. The American Statistician, 50(2):178–180, 1996.

[9] A. Pilhoefer and A. Unwin. New approaches in visualization of categorical data: R package extracat. Journal of Statistical Software, 53(7):1–25, 2013.

[10] S. S. Stevens. On the psychophysical law. The Psychological Review, 64(3):153–181, 1957.

[11] S. S. Stevens. Psychophysics. New York: John Wiley & Sons, 1975.

[12] S. S. Stevens and E. H. Galanter. Ratio scales and category scales for a dozen perceptual continua. Journal of Experimental Psychology, 54(6):377–411, 1957.

[13] H. Wickham, H. Hofmann, and D. Cook. Exploring cluster analysis. http://www.had.co.nz/model-vis/clusters.pdf, 2006.

a.n <- a.nav.create(cbind(
    "family.ID"=as.numeric(v.families),
    v.sites, v.srs.bg, v.individual,
    v.family, v.parent.race, v.parent.common,
    v.proband.cdv, v.proband.ocuv, v.sibling.s1, v.sibling.s2,
    v.ados.common,
    v.ados.1, v.ados.1.raw, v.ados.2, v.ados.2.raw,
    v.ados.3, v.ados.3.raw, v.ados.4, v.ados.4.raw,
    v.adi.r.diagnostic, v.adi.r.pca, v.adi.r,
    v.adi.r.dum, v.adi.r.loss,
    v.ssc.diagnosis,
    v.vineland.ii.p1, v.vineland.ii.s1,
    v.cbcl.2.5.p1, v.cbcl.2.5.s1,
    v.cbcl.6.18.p1, v.cbcl.6.18.s1,
    v.abc, v.abc.raw, v.rbs.r, v.rbs.r.raw,
    v.srs.parent.p1, v.srs.parent.recode.p1,
    v.srs.teacher.p1, v.srs.teacher.recode.p1,
    v.srs.parent.s1, v.srs.parent.recode.s1,
    v.srs.teacher.s1, v.srs.teacher.recode.s1,
    v.srs.adult.fa, v.srs.adult.recode.fa,
    v.srs.adult.mo, v.srs.adult.recode.mo,
    v.bapq.fa, v.bapq.recode.fa, v.bapq.mo, v.bapq.recode.mo,
    v.fhi.interviewer.fa, v.fhi.interviewer.mo,
    v.scq.current.p1, v.scq.life.p1,
    v.scq.current.s1, v.scq.life.s1,
    v.ctopp.nr, v.purdue.pegboard, v.dcdq, v.ppvt,
    v.das.ii.early.years, v.das.ii.school.age,
    v.ctrf.2.5, v.trf.6.18,
    v.ssc.med.hx.v2.autoimmune.disorders, v.ssc.med.hx.v2.birth.defects,
    v.ssc.med.hx.v2.chronic.illnesses, v.ssc.med.hx.v2.diet.medication.sleep,
    v.ssc.med.hx.v2.genetic.disorders, v.ssc.med.hx.v2.labor.delivery.birth.feeding,
    v.ssc.med.hx.v2.language.disorders,
    v.ssc.med.hx.v2.medical.history.child.1, v.ssc.med.hx.v2.medical.history.child.2,
    v.ssc.med.hx.v2.medical.history.child.3,
    v.ssc.med.hx.v2.medications.drugs.mother,
    v.ssc.med.hx.v2.neurological.conditions,
    v.ssc.med.hx.v2.other.developmental.disorders, v.ssc.med.hx.v2.pdd,
    v.ssc.med.hx.v2.pregnancy.history, v.ssc.med.hx.v2.pregnancy.illness.vaccinations,
    v.ssc.psuh.fa, v.ssc.psuh.mo,
    v.temperature.form.raw
  ), remove=T )

Table 1: Template for joining large numbers of SSC tables and creating an AN for them. Readers should make a selection from this template, as the full collection creates a data matrix with about 3,000 variables.
