November 2015 The Data Lab Tufts University Tufts Technology Services (TTS)
A computational tool for depth-based statistical analysis
Eynat Rafalin, Tufts University, Computer Science Department
The tool
- An easy-to-use, efficient and expandable interface for statistical research, based on the notion of data depth.
- For scientists with no computer science background.
Our goal
- Present the tool to the community
- Code/software available on request
- Run on real data
- Get feedback: Is such a tool needed? Additions/improvements?
General
- C++-based software (no additional tools/software needed)
- Simple interface. It should allow the user to:
  - enter data files, sort the data points and filter unwanted data
  - perform calculations
  - present the results in an easy-to-understand graphical interface
  - save and output data for future use
- Fast
- Portable code
General description
- Data filter
- Contours display and selection
- Statistical modules
- Output: txt and Excel files, Geomview
Data filter
- Graphical user interface developed in C++
- Used to crop/manipulate a data set before it is fed into the statistical modules
- Fast and light
- Convenient and easy-to-use user interface
- Portable code (UNIX, Solaris, Linux, Win)
Data filter
Statistical modules
Depth contours (2D)
- Half-space (location) depth contours
  - optimal O(n^2) time
  - supports two approaches for defining contours
  - including the Tukey median and the bagplot
  - including contours' parameters (size, etc.)
- Convex hull peeling depth contours
- Simplicial depth contours
- Tukey median computation (O(n log^3 n))
- Locating a new point in a set of depth contours (O(log n) query time)
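As an illustration of one of the listed notions, convex hull peeling depth can be sketched in a few lines of Python. This is not the tool's C++ code: the function names `hull_boundary_points` and `peeling_depths` are hypothetical, and the sketch assumes 2D points given as tuples, with the outermost layer assigned depth 1 and points lying on a hull edge counted as part of that layer.

```python
def cross(o, a, b):
    """z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def hull_boundary_points(points):
    """Return the set of points lying on the convex hull boundary.

    Andrew's monotone chain, popping only on strict clockwise turns so that
    points collinear with a hull edge are kept as boundary points.
    """
    pts = sorted(set(points))
    if len(pts) <= 2:
        return set(pts)
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) < 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) < 0:
            upper.pop()
        upper.append(p)
    return set(lower) | set(upper)


def peeling_depths(points):
    """Map each distinct point to its hull peeling depth (outer layer = 1)."""
    remaining = set(points)
    depths = {}
    layer = 1
    while remaining:
        boundary = hull_boundary_points(remaining)
        for p in boundary:
            depths[p] = layer
        remaining -= boundary  # peel the current layer off
        layer += 1
    return depths


if __name__ == "__main__":
    nested = [(0, 0), (4, 0), (4, 4), (0, 4),   # outer square
              (1, 1), (3, 1), (3, 3), (1, 3),   # inner square
              (2, 2)]                            # center
    for point, depth in sorted(peeling_depths(nested).items()):
        print(point, depth)
```

Each pass recomputes the hull from scratch, so the sketch favors clarity over running time.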
Approaches for defining depth contours
- P. Rousseeuw et al.: The k-th depth contour is the boundary of the set of points in the plane with depth at least k.
- R. Liu et al. (based on order statistics): The sample p-th central hull is the convex hull containing the most central fraction p of the sample points.
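Both definitions rest on the half-space (Tukey) depth of a point: the smallest number of sample points contained in any closed half-plane whose boundary passes through that point. The sketch below is a direct O(n^2)-per-query evaluation in Python, not the optimal algorithm the slides refer to. The function name `halfspace_depth` and the tolerance `eps` are assumptions made for illustration.

```python
import math


def halfspace_depth(q, points, eps=1e-12):
    """Half-space depth of q with respect to a 2D sample.

    The count of sample points in a half-plane through q only changes when
    the boundary line sweeps over a sample point, so it is enough to test
    one direction inside each angular interval between such events.
    """
    coincident = sum(1 for p in points if p == q)  # always inside
    offsets = [(p[0] - q[0], p[1] - q[1]) for p in points if p != q]
    if not offsets:
        return coincident

    # Angles of the half-plane normal at which some sample point lies
    # exactly on the boundary line (normal perpendicular to the offset).
    critical = []
    for dx, dy in offsets:
        a = math.atan2(dy, dx)
        critical.append((a + math.pi / 2) % (2 * math.pi))
        critical.append((a - math.pi / 2) % (2 * math.pi))
    critical.sort()

    best = len(offsets)
    for i, a in enumerate(critical):
        if i + 1 < len(critical):
            b = critical[i + 1]
        else:
            b = critical[0] + 2 * math.pi  # wrap around the circle
        if b - a < eps:
            continue  # duplicate event angle, no interval to test
        mid = (a + b) / 2
        ux, uy = math.cos(mid), math.sin(mid)
        count = sum(1 for dx, dy in offsets if ux * dx + uy * dy > 0)
        best = min(best, count)
    return best + coincident


if __name__ == "__main__":
    sample = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0), (0.5, 0.5)]
    for query in [(0.5, 0.5), (0.0, 0.0), (2.0, 2.0)]:
        print(query, halfspace_depth(query, sample))
```

Under the first definition, the region enclosed by the k-th contour is the set of query points for which this function returns at least k. Under the second, the sample points would be ranked by their depth and the convex hull of the top fraction p taken.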
Half-space (location) depth contours module
Depth contours for a sample set with 8 data points
Depth contours for a data set describing diabetic patients
Statistical modules (cont'd): Plots
- DD (Depth vs. Depth) plots: O(n^2) time
- Shrinkage plots
- Fan plots
DD (Depth vs. Depth) plots module
Two 2D data sets of 50 points each, drawn from normal distributions centered at (0,0) with different covariance matrices (the identity and 4 times the identity).
[Figure: DD plot. Axes: depth according to set A, depth according to set B.]
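A DD plot pairs, for every point of the pooled sample, its depth with respect to set A and its depth with respect to set B. The Python sketch below shows how such coordinates could be produced. To stay short it swaps in Mahalanobis depth, 1 / (1 + squared Mahalanobis distance), rather than the half-space depth used by the tool. The names `mahalanobis_depth` and `dd_plot_coordinates` are hypothetical, and any other depth function with the same signature can be passed in.

```python
import random


def mahalanobis_depth(x, sample):
    """Mahalanobis depth of a 2D point x with respect to a sample.

    Uses the sample mean and the unbiased sample covariance matrix;
    the sample must not be degenerate (covariance determinant nonzero).
    """
    n = len(sample)
    mx = sum(p[0] for p in sample) / n
    my = sum(p[1] for p in sample) / n
    sxx = sum((p[0] - mx) ** 2 for p in sample) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in sample) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in sample) / (n - 1)
    det = sxx * syy - sxy * sxy
    dx, dy = x[0] - mx, x[1] - my
    # (x - mean)^T Sigma^{-1} (x - mean) with the 2x2 inverse written out
    d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return 1.0 / (1.0 + d2)


def dd_plot_coordinates(set_a, set_b, depth=mahalanobis_depth):
    """Return (depth w.r.t. A, depth w.r.t. B) for every pooled point."""
    pooled = list(set_a) + list(set_b)
    return [(depth(x, set_a), depth(x, set_b)) for x in pooled]


if __name__ == "__main__":
    rng = random.Random(0)
    # Standard deviation 1 gives covariance I; standard deviation 2 gives 4I.
    set_a = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(50)]
    set_b = [(rng.gauss(0, 2), rng.gauss(0, 2)) for _ in range(50)]
    for depth_a, depth_b in dd_plot_coordinates(set_a, set_b)[:5]:
        print(round(depth_a, 3), round(depth_b, 3))
```

When both sets come from the same distribution the pairs cluster around the diagonal. A difference in scale, as in the example described above, pulls them away from it.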
Fan plots
50 data points, created from a random distribution, with covariance matrix 4 times the identity. The fans are created for data sets containing the 1/6, 2/6, ... central regions. For each region, the area of the convex hull (CH) of 2, 4, 6, ...% of the points is computed.
[Figure: fan plot. Axes: relative area (CH of p% / CH), percentile of points.]
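The quantity on the relative-area axis can be sketched as follows: take the most central p% of the points, compute the area of their convex hull, and divide by the area of the hull of the whole set. This Python sketch produces a single such curve. It assumes the points are already sorted from most to least central by some depth function, which is not shown here, and the names `hull_area` and `relative_hull_areas` are hypothetical.

```python
import math


def cross(o, a, b):
    """z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def hull_area(points):
    """Area of the convex hull of the points (shoelace formula)."""
    hull = convex_hull(points)
    if len(hull) < 3:
        return 0.0
    twice_area = 0
    for i, (x1, y1) in enumerate(hull):
        x2, y2 = hull[(i + 1) % len(hull)]
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2


def relative_hull_areas(points_by_depth, percents):
    """Relative hull area of the most central p% of the points, for each p.

    points_by_depth must be ordered from most central to least central.
    """
    total = hull_area(points_by_depth)
    if total == 0:
        raise ValueError("the full point set has a degenerate convex hull")
    n = len(points_by_depth)
    areas = []
    for p in percents:
        k = math.ceil(n * p / 100)  # number of most central points kept
        areas.append(hull_area(points_by_depth[:k]) / total)
    return areas


if __name__ == "__main__":
    ordered = [(0, 0),
               (1, 0), (0, 1), (-1, 0), (0, -1),
               (2, 2), (-2, 2), (-2, -2), (2, -2)]
    print(relative_hull_areas(ordered, [10, 50, 100]))
```

Repeating the computation for each of the nested central regions described above would give one curve per region.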
Graphical contour selection tool
- Plots depth contours and selects data ranges.
- Actions: import/export, select points, depth slider, filter
Future work
- Run the tool on existing data sets
- Distribute preliminary versions and get users' feedback
- Data filter
  - Group by row/column
  - Filter by row/column
  - Interactions between rows/columns (addition, substitution, logical operations)
- Statistical modules
  - Implement additional modules
  - Improve running times
Contributors
Prof. Diane Souvaine, Prof. Alva Couch, Eynat Rafalin, Michael Burr, Joe Handelman, James Hayes, Ori Taka, Alok Lal, Janet Luan, Kim Miller, Tim Mitchell, Nikolai Shvertner