Data Hub: A Collaborative Data Analytics and Visualization Platform

Sam Madden (madden@csail.mit.edu), with a cast of many….

Page 1: Sam Madden madden@csail.mit.edu With a cast of many….

Data Hub: A Collaborative Data Analytics and Visualization Platform

Sam Madden, madden@csail.mit.edu

With a cast of many….

Page 2: Sam Madden madden@csail.mit.edu With a cast of many….

BIG Data

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

Page 3: Sam Madden madden@csail.mit.edu With a cast of many….

Example: Medical Costs

MGH Cancer Center “Super-Database”

• Largest cancer database in the world (173,301 patients)
• Based on national tumor registry
• Cross linked with death registry
• Includes billing, reports, labs, imagery, genome SNPs

Question: What are the factors driving costs for lung cancer patients?

Some results:

No correlation of cost with
• Stage of presentation
• Survival

Strong correlation of cost with oncologist!

- Dr. James Michaelson, PhD, MGH, Harvard Medical School

Page 4: Sam Madden madden@csail.mit.edu With a cast of many….

Super Duper Indexes

Beyond scalable platforms

Challenge: Making Data Accessible

• Main Memory DBs
• Column Oriented DBs
• Map Reduce

What does the data look like?

How do I correlate it with other data sets?

How do I present it to users/execs?

Where are these anomalies and outliers coming from?

Page 5: Sam Madden madden@csail.mit.edu With a cast of many….

Introducing Datahub

Challenge: Making Data Accessible

[Figure: DB Technology + Octocat (the GitHub mascot) = Datahub]

Page 6: Sam Madden madden@csail.mit.edu With a cast of many….

Introducing Datahub

Data Commons

Selective Sharing and Access Control

Easy to Find, Combine, Clean Data Sets

Secure, Hosted Data Storage (“Database Service”)

Ability to Browse, Visualize, and Query Data in situ

Page 7: Sam Madden madden@csail.mit.edu With a cast of many….

Lots of other places to find data!

For example:

Datahub: “five-star” integrated, browse-able, & query-able repository of linked data

Aka … Just a bunch of zip files

Versus open, linked data (Tim Berners-Lee Taxonomy):

★ make your stuff available on the Web under an open license
★★ make it available as structured data
★★★ use non-proprietary formats (e.g., CSV instead of Excel)
★★★★ use URIs to denote things, so that people can point at your stuff
★★★★★ link your data to other data to provide context

Page 8: Sam Madden madden@csail.mit.edu With a cast of many….

Datahub Interface

Anant Bhardwaj

Page 9: Sam Madden madden@csail.mit.edu With a cast of many….

Datahub Interface

Page 10: Sam Madden madden@csail.mit.edu With a cast of many….

Datahub Interface

Page 11: Sam Madden madden@csail.mit.edu With a cast of many….

“Wrangling” Features

Wrangler: Interactive Visual Specification of Data Transformation Scripts
Sean Kandel, Andreas Paepcke, Joseph Hellerstein, Jeffrey Heer

Page 12: Sam Madden madden@csail.mit.edu With a cast of many….

Post-Wrangling

Page 13: Sam Madden madden@csail.mit.edu With a cast of many….

More Datahub Interface

Versions

Browsing and Visualization

Page 14: Sam Madden madden@csail.mit.edu With a cast of many….

MIT Living Lab

• Goal: allow MIT community to access, selectively share, and use data about itself, using DataHub.

A Dogfood Eating Exercise

Page 15: Sam Madden madden@csail.mit.edu With a cast of many….

MIT Living Lab

• Goal: allow MIT community to access, selectively share, and use data about itself, using DataHub.

MIT Data Hub

• Organizational Data
• Personal Data
• Public Data

MIT data: ID card swipes, network packets, expense reports, medical data, payroll, parking garages, buses and cars, course catalogs, registrar, benefits, on-campus events/seminars.
Infrastructure: energy, HVAC, maintenance, etc.
Academic/Research: publications, presentations, research data…

Personal Data: location/GPS, calendar, video/pictures, exercise/physio data, application usage, meetings…

Relevant Linked Data: local transit / transport data, crime data, nearby restaurants, events etc.

Page 16: Sam Madden madden@csail.mit.edu With a cast of many….

What Will Data Hub Enable at MIT?

• Campus “Quantification”
– is going to class correlated with better grades?
– which dining facilities are most popular amongst different groups?

• Transportation planning:
– bus utilization and on demand routing
– parking lot utilization
– carpool finding, etc.

• Health + Medical:
– campus wide public health, e.g., flu tracking
– observing who is missing class, depressed
– Health signals: exercise and eating habits; partners
– outpatient care

• Research:
– expert finding
– data sharing between groups

Page 17: Sam Madden madden@csail.mit.edu With a cast of many….

Challenges: It’s Not All Fuzzy Stuff

Platform Challenges:
• How to efficiently store thousands or millions of databases?
• How to anonymize data, control access, etc.?
• How to keep data private while allowing querying over it?

Challenges in Improving Interaction with Databases:
• Data Cleaning and Integration
• Interactive Data Presentation
• Understanding Why Results are the Way They Are
• How to Leverage Experts in an Organization

Projects: Monomi, MapD, Scorpion

We also don’t want our research to be like this guy

Page 18: Sam Madden madden@csail.mit.edu With a cast of many….

Private Data Problem

Confidential data leaks. 2012: hackers extracted 6.5 million hashed passwords from the DB of LinkedIn

Threat: passive DB server attacks

[Figure: User 1, User 2, User 3 connect to the Application, which sends SQL to the DB Server (Datahub) holding sensitive content; hackers and the system administrator can observe the DB server]

Page 19: Sam Madden madden@csail.mit.edu With a cast of many….

How to protect data confidentiality?

Encrypt data: server may not be able to process queries!

Compute on encrypted data! Without giving server encryption key!

[Figure: Client and DB Server, each labeled with sensitive content, exchanging [request] and [result]]

General approach has been proposed several times…

Page 20: Sam Madden madden@csail.mit.edu With a cast of many….

Monomi / CryptDB

w/ Raluca Popa, Stephen Tu, Hari Balakrishnan, Frans Kaashoek, Nickolai Zeldovich

1. Process SQL queries on encrypted data
   Hide DB from sys. admins., outsource DB to the cloud

2. Modest overhead

3. No changes to DBMS (e.g., Postgres, MySQL) and no changes to applications

Threat 1: passive DB server attacks

[Figure: User 1, User 2, User 3 connect to the Application, which sends SQL to the DB Server holding sensitive content]

Page 21: Sam Madden madden@csail.mit.edu With a cast of many….

SQL Queries on Encrypted Data Example

Application: SELECT * FROM emp WHERE salary = 100

Proxy: SELECT * FROM table1 WHERE col3 = x5a8c34

table1/emp, with columns col1/rank, col2/name, col3/salary

col3/salary (plaintext): 60, 100, 800, 100
Randomized encryption: x4be219, x95c623, x2ea887, x17cea7
Deterministic encryption: x934bc1, x5a8c34, x84a21c, x5a8c34

Result: the two rows whose col3 value is x5a8c34
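
The equality example on this slide can be sketched in a few lines of Python. This is only an illustration, not the CryptDB or Monomi implementation: the key, the function names, and the use of HMAC-SHA256 are all assumptions. HMAC is a keyed hash rather than a cipher, so these tokens cannot be decrypted; it stands in here only to show why a deterministic scheme lets the server evaluate `=` while a randomized one does not.

```python
import hashlib
import hmac
import os

KEY = b"demo-key-held-by-proxy"  # hypothetical key; never sent to the DB server

def det_token(value: int) -> str:
    """Deterministic: equal plaintexts always yield the same token."""
    digest = hmac.new(KEY, str(value).encode(), hashlib.sha256).hexdigest()
    return "x" + digest[:12]

def rnd_token(value: int) -> str:
    """Randomized: a fresh nonce makes equal plaintexts look unrelated."""
    nonce = os.urandom(16)
    digest = hmac.new(KEY, nonce + str(value).encode(), hashlib.sha256).hexdigest()
    return "x" + digest[:12]

salaries = [60, 100, 800, 100]  # the col3/salary values from the slide

# What the untrusted server would store for col3 under the deterministic scheme.
col3 = [det_token(s) for s in salaries]

def rewrite_equality(constant: int) -> str:
    """Proxy step: anonymized names, constant replaced by its token."""
    return f"SELECT * FROM table1 WHERE col3 = {det_token(constant)}"

def server_select_eq(column, token):
    """Server step: plain equality on tokens, no key required."""
    return [row for row, cell in enumerate(column) if cell == token]

matches = server_select_eq(col3, det_token(100))
print(rewrite_equality(100))
print(matches)  # indexes of the rows whose salary is 100
```

With randomized tokens the same filter would find nothing, because two encryptions of 100 differ. That is the trade-off the slide points at: stronger hiding versus the ability to answer equality predicates on the server.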

Page 22: Sam Madden madden@csail.mit.edu With a cast of many….

Application: SELECT * FROM emp WHERE salary ≥ 100

Proxy: SELECT * FROM table1 WHERE col3 ≥ x638e54

table1 (emp), with columns col1/rank, col2/name, col3/salary

col3/salary (plaintext): 60, 100, 800, 100
Deterministic encryption: x934bc1, x5a8c34, x84a21c, x5a8c34
OPE (order) encryption: x1eab81, x638e54, x922eb4, x638e54

Result: x638e54, x922eb4, x638e54
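
To make the range query concrete, here is a toy order-preserving code in Python. It is a sketch under stated assumptions, not the OPE scheme used by CryptDB or Monomi, and it is not secure: the key, the function names, and the construction (a running sum of keyed pseudo-random positive gaps) are invented for illustration. The only property it demonstrates is that `a < b` implies `enc(a) < enc(b)`, which is what lets the server evaluate `≥` on ciphertexts.

```python
import hashlib
import hmac

KEY = b"demo-ope-key"  # hypothetical key held by the trusted proxy

def _gap(i: int) -> int:
    """Keyed pseudo-random gap (always >= 1) for plaintext i."""
    digest = hmac.new(KEY, str(i).encode(), hashlib.sha256).digest()
    return 1 + int.from_bytes(digest[:2], "big")

def ope_encrypt(value: int) -> int:
    """Toy order-preserving code: sum of positive gaps up to value."""
    if value < 0:
        raise ValueError("toy scheme covers non-negative integers only")
    return sum(_gap(i) for i in range(value + 1))

salaries = [60, 100, 800, 100]  # the col3/salary values from the slide
col3_ope = [ope_encrypt(s) for s in salaries]

def server_select_ge(column, threshold_code):
    """Server step: ordinary >= comparison on the encoded values."""
    return [row for row, cell in enumerate(column) if cell >= threshold_code]

# Proxy rewrites "salary >= 100" into a comparison against the code for 100.
rows = server_select_ge(col3_ope, ope_encrypt(100))
print(rows)  # indexes of the rows with salary >= 100
```

Because the codes reveal the ordering of the plaintexts, a scheme like this leaks more than deterministic or randomized encryption, which is why it would be applied only to columns that need range predicates.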

Page 23: Sam Madden madden@csail.mit.edu With a cast of many….

Monomi: Protecting Data in Datahub

• Extensions to CryptDB to efficiently support OLAP queries

• Show how to run all of TPC-H, rather than just 4 of 22 queries
– Key insight: split queries, run as much as possible on untrusted DBMS, compute remainder on trusted client
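
The split-execution idea can be illustrated with a small Python sketch. Everything here is an assumption made for the example: the `dept` column, the keys, the helper names, and the toy XOR stream cipher are hypothetical and are not Monomi's actual design. The point is only the division of labor: the untrusted side filters on deterministic tokens, and the trusted client decrypts the survivors and computes the aggregate that the server could not.

```python
import hashlib
import hmac
import os

DET_KEY = b"demo-det-key"  # hypothetical keys; both stay on the trusted client
ENC_KEY = b"demo-enc-key"

def det_token(text: str) -> str:
    """Deterministic token so the server can test equality without a key."""
    return hmac.new(DET_KEY, text.encode(), hashlib.sha256).hexdigest()

def _pad(nonce: bytes) -> bytes:
    """8-byte keyed pad derived from the nonce (toy stream cipher)."""
    return hmac.new(ENC_KEY, nonce, hashlib.sha256).digest()[:8]

def encrypt_int(value: int) -> bytes:
    """Randomized encryption of a small non-negative integer."""
    nonce = os.urandom(8)
    body = bytes(a ^ b for a, b in zip(value.to_bytes(8, "big"), _pad(nonce)))
    return nonce + body

def decrypt_int(blob: bytes) -> int:
    nonce, body = blob[:8], blob[8:]
    return int.from_bytes(bytes(a ^ b for a, b in zip(body, _pad(nonce))), "big")

# Hypothetical plaintext rows: (dept, salary). Only the encrypted form is uploaded.
plain_rows = [("sales", 60), ("eng", 100), ("sales", 800), ("eng", 100)]
server_table = [(det_token(d), encrypt_int(s)) for d, s in plain_rows]

def server_filter(table, dept_token):
    """Untrusted side: equality filter on tokens; salaries stay opaque."""
    return [enc_salary for token, enc_salary in table if token == dept_token]

def client_avg(encrypted_salaries):
    """Trusted side: decrypt the filtered rows and finish the aggregate."""
    values = [decrypt_int(blob) for blob in encrypted_salaries]
    return sum(values) / len(values)

# SELECT AVG(salary) FROM emp WHERE dept = 'sales', split across the two sides.
result = client_avg(server_filter(server_table, det_token("sales")))
print(result)
```

The server does the selective part of the work, so only the matching rows travel to the client. How to choose that split for each query is the hard part the slide refers to.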

Page 24: Sam Madden madden@csail.mit.edu With a cast of many….

Monomi vs Plaintext
TPC-H SF10, Postgres

[Figure: Monomi runtime vs plaintext]

Takeaway: median overhead 1.24x

See Stephen Explain How it Really Works Right after this Talk!

Page 25: Sam Madden madden@csail.mit.edu With a cast of many….

Many Open Problems

Understanding performance more broadly

How to reason about security of non-randomized schemes?

Auditing, information flow, etc.

Page 26: Sam Madden madden@csail.mit.edu With a cast of many….

DataHub Research Challenges

Platform Challenges:
• How to efficiently store thousands or millions of databases?
• How to anonymize data, control access, etc.?
• How to keep data private while allowing querying over it?

Challenges in Improving Interaction with Databases:
• Data Cleaning and Integration
• Interactive Data Presentation
• Understanding Why Results are the Way They Are
• How to Leverage Experts in an Organization

Projects: Monomi, MapD, Scorpion

Page 27: Sam Madden madden@csail.mit.edu With a cast of many….

Interactive Large-Scale Visualization

using a GPU Database

Page 28: Sam Madden madden@csail.mit.edu With a cast of many….

The Need for Interactive Analytics

• DataHub needs to support browsing massive data sets

• Browsing is best supported through visualization

• Ad-hoc analytics, with millisecond response times

Page 29: Sam Madden madden@csail.mit.edu With a cast of many….

MapD: GPU Accelerated SQL Database

• Key insight: GPUs have enough memory that a cluster of them can store substantial amounts of data

• Not an accelerator, but a full blown query processor!

• Massive parallelism enables interactive browsing interfaces
– 4x GPUs can provide > 1 TB/sec of bandwidth
– 12 Tflops compute
– Order of magnitude speedups over CPUs, when data is on GPU

• “Shared nothing” arrangement
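
As a rough illustration of the “shared nothing” arrangement and the bandwidth figure above, the following Python sketch simulates four devices that each aggregate only their own partition, followed by a cheap merge. It runs on the CPU and is not MapD code; the function names, the sample rows, and the round-robin partitioning are assumptions. The `scan_seconds` helper just divides data size by the 1 TB/sec lower bound quoted on the slide.

```python
from collections import Counter

def partition(rows, num_devices):
    """Round-robin rows across devices; each device owns its slice outright."""
    return [rows[i::num_devices] for i in range(num_devices)]

def local_histogram(rows):
    """Per-device partial aggregate: count rows per key, touching local data only."""
    return Counter(key for key, _ in rows)

def merge(partials):
    """Combine the small partial results; no raw rows move between devices."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

def scan_seconds(num_bytes, bytes_per_second=1e12):
    """Time to stream num_bytes at the given aggregate bandwidth."""
    return num_bytes / bytes_per_second

# Hypothetical sample rows: (key, value).
rows = [("US", 1), ("Italy", 2), ("US", 3), ("Spain", 4), ("France", 5), ("US", 6)]
parts = partition(rows, 4)
counts = merge(local_histogram(p) for p in parts)
print(counts)
print(scan_seconds(10e9))  # 10 GB at 1 TB/sec
```

At 1 TB/sec, a full scan of 10 GB takes about 10 milliseconds, which is the kind of latency that makes brute-force scans viable for interactive browsing.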

Page 35: Sam Madden madden@csail.mit.edu With a cast of many….

Next Steps

• Scale out to many nodes, automate layout algorithms

• Add various advanced analytics (e.g., machine learning algorithms)

• Generalize visualization beyond maps

Page 36: Sam Madden madden@csail.mit.edu With a cast of many….

DataHub Research Challenges

Platform Challenges:
• How to efficiently store thousands or millions of databases?
• How to anonymize data, control access, etc.?
• How to keep data private while allowing querying over it?

Challenges in Improving Interaction with Databases:
• Data Cleaning and Integration
• Interactive Data Presentation
• Understanding Why Results are the Way They Are
• How to Leverage Experts in an Organization

Projects: Monomi, MapD, Scorpion

Page 37: Sam Madden madden@csail.mit.edu With a cast of many….

Visual Provenance: Scorpion

• Visualization of data is most common form of big data analysis

• Common problem: outliers

• Would be nice to have a tool that identifies why outliers exist

Eugene Wu

Page 38: Sam Madden madden@csail.mit.edu With a cast of many….

Definition of Why

Given an outlier group, find a predicate over the inputs that makes the output no longer an outlier.

[Figure: Input Data (i) with predicate p highlighted; Output Visualization is a bar chart over Italy, France, Spain, US (y-axis 0 to 5) with the Outlier Group marked]

i = Input Data
p = predicate

Page 39: Sam Madden madden@csail.mit.edu With a cast of many….

Definition of Why

Given an outlier group, find a predicate over the inputs that makes the output no longer an outlier.

[Figure: Input Data (i) with predicate p highlighted; Output Visualization is a bar chart over Italy, France, Spain, US (y-axis 0 to 5)]

i = Input Data
p = predicate

Page 40: Sam Madden madden@csail.mit.edu With a cast of many….

Definition of Why

Given an outlier group, find a predicate over the inputs that makes the output no longer an outlier.

[Figure: Input Data (i) with predicate p highlighted; Output Visualization is a bar chart over Italy, France, Spain, US (y-axis 0 to 5)]

Removing the records that match the predicate makes US no longer an outlier

What are common properties of those records?

{Bill Gates, Steve Ballmer}
p: Company = MSFT
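
The definition can be checked mechanically. The Python sketch below is an illustration only, not Scorpion's algorithm: the records, their values, and the outlier test (a group average more than twice the median of the other groups) are all hypothetical choices. It removes the inputs matching a candidate predicate, re-runs the aggregate, and asks whether the group is still an outlier.

```python
from statistics import mean, median

# Hypothetical input records; values are made up for illustration.
records = [
    {"country": "Italy", "company": "A", "value": 2},
    {"country": "Italy", "company": "B", "value": 3},
    {"country": "France", "company": "A", "value": 2},
    {"country": "France", "company": "B", "value": 2},
    {"country": "Spain", "company": "A", "value": 3},
    {"country": "Spain", "company": "B", "value": 3},
    {"country": "US", "company": "A", "value": 2},
    {"country": "US", "company": "B", "value": 3},
    {"country": "US", "company": "MSFT", "value": 10},
    {"country": "US", "company": "MSFT", "value": 11},
]

def avg_by_country(rows):
    """The aggregate behind the bar chart: AVG(value) GROUP BY country."""
    groups = {}
    for r in rows:
        groups.setdefault(r["country"], []).append(r["value"])
    return {country: mean(values) for country, values in groups.items()}

def is_outlier(averages, group, factor=2.0):
    """Assumed outlier test: far above the median of the other groups."""
    others = [v for c, v in averages.items() if c != group]
    return averages[group] > factor * median(others)

def without(rows, predicate):
    """Drop the input records that match the predicate."""
    return [r for r in rows if not predicate(r)]

def p(r):
    """Candidate predicate from the slide: Company = MSFT."""
    return r["company"] == "MSFT"

before = avg_by_country(records)
after = avg_by_country(without(records, p))
print(is_outlier(before, "US"), is_outlier(after, "US"))
```

A predicate counts as an explanation when the first check is true and the second is false, as it is for `Company = MSFT` on this made-up data.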

Page 41: Sam Madden madden@csail.mit.edu With a cast of many….

Why is this hard?

Exponential search space over records, attributes

In general, each candidate predicate requires re-running aggregation

[Figure: A B C D E F G]

Page 42: Sam Madden madden@csail.mit.edu With a cast of many….

Why is this hard?

Exponential search space over records, attributes

In general, each candidate predicate requires re-running aggregation

[Figure: A B C D E F G; AVG(rows) = 2.7]

Page 43: Sam Madden madden@csail.mit.edu With a cast of many….

Why is this hard?

Exponential search space over records, attributes

In general, each candidate predicate requires re-running aggregation

[Figure: A B C D E F G; AVG(rows) = 2.9]

Page 44: Sam Madden madden@csail.mit.edu With a cast of many….

Why is this hard?

Exponential search space over records, attributes

In general, each candidate predicate requires re-running aggregation

[Figure: A B C D E F G; AVG(rows) = 2.2]

Page 45: Sam Madden madden@csail.mit.edu With a cast of many….

Why is this hard?

Exponential search space over records, attributes

In general, each candidate predicate requires re-running aggregation

[Figure: A B C D E F G; AVG(rows) = 3.3]

Page 46: Sam Madden madden@csail.mit.edu With a cast of many….

Why is this hard?

Exponential search space over records, attributes

In general, each candidate predicate requires re-running aggregation

Desire for simple, understandable predicates and a general purpose visualization framework

[Figure: A B C D E F G; AVG(rows) = 3.1]

See Eugene Explain How it Really Works this Afternoon!
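
To see why the cost adds up, consider the most naive search, sketched here in Python. This is deliberately not Scorpion's method: the data, the attribute names, and the outlier test are hypothetical, and only single-attribute equality predicates are tried. Even so, every candidate triggers a full re-aggregation, and allowing conjunctions or arbitrary record subsets would multiply the candidate count exponentially.

```python
from statistics import mean, median

FIELDS = ("country", "company", "role", "value")
# Hypothetical input records; values are made up for illustration.
DATA = [dict(zip(FIELDS, t)) for t in [
    ("Italy", "A", "eng", 2), ("Italy", "B", "exec", 3),
    ("France", "A", "eng", 2), ("France", "B", "eng", 2),
    ("Spain", "A", "exec", 3), ("Spain", "B", "eng", 3),
    ("US", "A", "eng", 2), ("US", "B", "eng", 3),
    ("US", "MSFT", "eng", 10), ("US", "MSFT", "exec", 11),
]]

def averages(rows):
    """AVG(value) GROUP BY country, recomputed from scratch."""
    groups = {}
    for r in rows:
        groups.setdefault(r["country"], []).append(r["value"])
    return {c: mean(v) for c, v in groups.items()}

def explains(rows, group, attr, value, factor=2.0):
    """True if dropping rows with attr == value leaves the group non-outlying."""
    kept = [r for r in rows if r[attr] != value]
    avgs = averages(kept)  # one full re-aggregation per candidate
    others = [v for c, v in avgs.items() if c != group]
    if group not in avgs or not others:
        return False  # predicate wiped out the group or everything else
    return avgs[group] <= factor * median(others)

def naive_search(rows, group, attrs=("company", "role")):
    """Try every single-attribute equality predicate seen in the outlier group."""
    in_group = [r for r in rows if r["country"] == group]
    candidates = [(a, v) for a in attrs for v in sorted({r[a] for r in in_group})]
    found = [(a, v) for a, v in candidates if explains(rows, group, a, v)]
    return found, len(candidates)

explanations, aggregations_run = naive_search(DATA, "US")
print(explanations, aggregations_run)
```

On this tiny table the search runs five aggregations to find one explanation. With many attributes, many distinct values, and multi-clause predicates, the same loop becomes infeasible, which motivates a smarter search.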

Page 47: Sam Madden madden@csail.mit.edu With a cast of many….

Next Steps

• A general purpose visualization language for expressing visualizations with provenance support

References to underlying data set

Page 48: Sam Madden madden@csail.mit.edu With a cast of many….

Conclusion

Big Data is a cry for help from non-DB people

Lots of exciting work on scalable systems

DB community should be doing a much better job of helping users use data

We risk losing mindshare

Datahub aims to make data easy to find, visualize, and query, securely and efficiently

Many fascinating, hard problems! (Monomi, MapD, Scorpion)