Post on 10-May-2015
04/11/2023 1
Bill Howe, PhD
Director of Research, Scalable Data Analytics
University of Washington eScience Institute
eScience and Data Science at the UW eScience Institute
Bill Howe, UW
“It’s a great time to be a data geek.”
-- Roger Barga, Microsoft Research
“The greatest minds of my generation are trying to figure out how to make people click on ads”
-- Jeff Hammerbacher, co-founder, Cloudera
Jim Gray
The University of Washington eScience Institute
• Rationale
– The exponential increase in physical and virtual sensing tech is transitioning all fields of science and engineering from data-poor to data-rich
– Techniques and technologies include sensors and sensor networks, data management, data mining, machine learning, visualization, cluster/cloud computing
• Mission
– Help position the University of Washington and partners at the forefront of research both in modern eScience techniques and technologies, and in the fields that depend upon them.
• Strategy
– Bootstrap a cadre of Research Scientists
– Add faculty in key fields
– Build out a “consultancy” of students and non-research staff
[Figure: # of bytes vs. # of data sources, Astronomy and Ocean Sciences]
Astronomy
– LSST (~100PB; images, spectra)
– PanSTARRS (~40PB; images, trajectories)
– SDSS (~400TB; images, spectra, catalogs)
Ocean Sciences
– OOI (~50TB/year; sims, RSN)
– IOOS (~50TB/year; sims, satellite, gliders, AUVs, vessels, more)
– CMOP (~10TB/year; sims, stations, gliders, AUVs, vessels, more)
Data sources shown: telescopes, spectra, n-body sims, models, AUVs, stations, cruises, CTDs, flow cytometry, gliders, ADCP, satellites
3 V’s of Big Data
Volume
Variety
Velocity
PDB
GenBank
UniProt
Pfam
Spreadsheets, Notebooks
Local, Lost
High throughput experimental methods
Industrial scale
Commons based production
Publicly data sets
Cherry picked results
Preserved
CATH, SCOP (Protein Structure Classification)
ChemSpider
Long Tail of Research Data
[src: Carol Goble]
Types of Data Stored
[Bar chart; axis 0% to 100%; bar values as extracted: 5.20%, 11.70%, 15.20%, 20.50%, 22.80%, 48.00%, 61.20%, 71.60%]
Categories:
• Quantitative/tabular/statistical
• Text (literature, transcriptions, field notes)
• Images (photographs, maps)
• Video recordings
• Audio recordings
• Multimedia digital objects
• Geo-tagged objects/spatial data
• Other
Lewis et al 2008
src: Conversations with Research Leaders (2008)
Where do you store your data?
src: Conversations with Research Leaders (2008)
src: Faculty Technology Survey (2011)
[Bar chart; axis 0% to 100%; bar values as extracted: 5%, 6%, 12%, 27%, 41%, 66%, 87%]
Categories:
• Other
• External (non-UW) data center
• Department-managed server
• My computer
Lewis et al 2011
How much data do you work with?
Wright 2013
The Long Tail is worthy of investment
Jean-Michel Fortin, David J. Currie, Big Science vs. Little Science: How Scientific Impact Scales with Funding. PLoS ONE 8(6)
[Plots; axes: log(NSERC grants) ($) and log(NSERC + CIHR grants) ($)]
“In sum, greater productivity is not strongly related to greater funding.”
“…the best article of one rich researcher received, on average, 14% fewer citations than the best article from any random pair of researchers, each of whom received only half as much funding!”
“if maximizing the total impact of the entire pool of grantees is the goal, then the “few big” strategy would be less effective than the “many small” strategy.”
“Ensuring a successful future for the biological sciences will require restraint in the growth of large centers and -omics-like projects, so as to provide more financial support for the critical work of innovative small laboratories striving to understand the wonderful complexity of living systems.”
Alberts B (2012) The End of “Small Science”? Science 337: 1583
The long tail is worthy of investment
100x more impact
How much more productive is a great programmer than a mediocre programmer?
Problem
How much time do you spend “handling data” as opposed to “doing science”?
Mode answer: “90%”
Simple Example
ANNOTATIONSUMMARY-COMBINEDORFANNOTATION16_Phaeo_genome
COGAnnotation_coastal_sample.txt
SELECT * FROM Phaeo_genome p, coastal_sample c WHERE p.COG_hit = c.hit
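The join above can be tried locally. The following is a minimal sketch using Python's built-in sqlite3 module rather than SQLShare itself; the column names other than COG_hit and hit, and all row values, are made up for illustration.

```python
import sqlite3

# Hypothetical stand-ins for the two annotation files on the slide.
# Only COG_hit and hit come from the slide; other columns and all rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Phaeo_genome (orf TEXT, COG_hit TEXT);
    CREATE TABLE coastal_sample (read_id TEXT, hit TEXT);
    INSERT INTO Phaeo_genome VALUES
        ('orf1', 'COG0001'), ('orf2', 'COG0002'), ('orf3', 'COG0099');
    INSERT INTO coastal_sample VALUES
        ('r1', 'COG0001'), ('r2', 'COG0001'), ('r3', 'COG0002');
""")

# Same join as on the slide, with an explicit column list and ordering.
rows = conn.execute("""
    SELECT p.orf, p.COG_hit, c.read_id
    FROM Phaeo_genome p, coastal_sample c
    WHERE p.COG_hit = c.hit
    ORDER BY p.orf, c.read_id
""").fetchall()

for row in rows:
    print(row)
```

Genome entries with no matching hit in the sample (orf3 here) simply drop out of the result, which is the behavior of an inner join.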
“The future is already here; it’s just not very evenly distributed.”
-- William Gibson
Three Avenues of Attack
• Technological
• Educational
• Organizational
SQLSHARE: QUERY-AS-A-SERVICE
Technology for the Long Tail
Alex Szalay Jim Gray
How can we deliver 1000 little SDSSs?
1) Upload data “as is”
Cloud-hosted; no need to install or design a database; no pre-defined schema
2) Analyze data with SQL
Right in your browser, writing queries on top of queries on top of queries ...
SELECT hit, COUNT(*) AS cnt
FROM tigrfam_surface
GROUP BY hit
ORDER BY cnt DESC
3) Share the results
Queries on queries on queries…
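The three steps can be imitated on a laptop. This is a rough sketch with Python's sqlite3 module, not the SQLShare service or its API; the tab-delimited contents and the view name hit_counts are assumptions made for the example.

```python
import csv
import io
import sqlite3

# Hypothetical tab-delimited upload; the rows are made up for illustration.
raw = "read_id\thit\nr1\tTIGR00001\nr2\tTIGR00001\nr3\tTIGR00002\nr4\tTIGR00001\n"

reader = csv.reader(io.StringIO(raw), delimiter="\t")
header = next(reader)

conn = sqlite3.connect(":memory:")

# Step 1: load the file "as is". Column names come from the header row
# and no types or keys are declared.
columns = ", ".join(f'"{name}"' for name in header)
conn.execute(f"CREATE TABLE tigrfam_surface ({columns})")
placeholders = ", ".join("?" for _ in header)
conn.executemany(f"INSERT INTO tigrfam_surface VALUES ({placeholders})", reader)

# Step 2: analyze with SQL, and save the query under a name.
conn.execute("""
    CREATE VIEW hit_counts AS
    SELECT hit, COUNT(*) AS cnt
    FROM tigrfam_surface
    GROUP BY hit
""")

# Step 3: anyone with access can write a query on top of the saved query.
frequent_hits = conn.execute(
    "SELECT hit, cnt FROM hit_counts WHERE cnt > 1 ORDER BY cnt DESC"
).fetchall()
print(frequent_hits)
```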
Browse English descriptions
Save the results, share them with others
Edit a Query
VizDeck
Python Client
“Flagship” SQLShare App (Python) on EC2
SQLShare REST API
Excel Addin
R Client
Spreadsheet Crawler
ASP.NET
OAuth2
WCF
METADATA
Aside
What about metadata?
• Claim: Comprehensive metadata standards* represent a shared consensus about the world
• At the frontier of research, this shared consensus does not exist, by definition
• Any consensus that does emerge will change frequently, by definition
• Data found “in the wild” will typically not conform to any standard, by definition
So are we stuffed?
* ontology / schema / controlled vocabulary / etc.
“As each need is satisfied, the next higher level in the hierarchy dominates conscious functioning.”
-- Maslow 43
Maslow’s Needs Hierarchy
A “Needs Hierarchy” of Science Data Management
storage
sharing
query
curation
analytics
A “Needs Hierarchy” of Science Data Management
storage
sharing
curation
query
analytics
Views
• A view is just a query with a name
• We can use the view just like a real table
Why can we do this?
Because we know that every query returns a relation:
We say that the language is “algebraically closed”
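Closure is what lets a view stand in for a table anywhere. A small sketch with Python's sqlite3 module illustrates this; the table, view names, and values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical base table; the values are made up.
    CREATE TABLE samples (station TEXT, depth_m REAL, temp_c REAL);
    INSERT INTO samples VALUES
        ('A', 5, 12.1), ('A', 50, 9.8), ('B', 5, 13.0), ('B', 50, 10.2);

    -- A view is a named query.
    CREATE VIEW surface AS
    SELECT station, temp_c FROM samples WHERE depth_m < 10;

    -- Because a query returns a relation, a view can be defined over a view.
    CREATE VIEW warm_surface AS
    SELECT station FROM surface WHERE temp_c > 12.5;
""")

# The view is queried exactly like a table...
warm = conn.execute("SELECT station FROM warm_surface").fetchall()

# ...and can be joined back to a real table.
joined = conn.execute("""
    SELECT s.station, s.depth_m
    FROM samples s JOIN warm_surface w ON s.station = w.station
    ORDER BY s.depth_m
""").fetchall()

print(warm)
print(joined)
```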
Scientific data management reduces to sharing views
• Integrate data from multiple sources?
– joins and unions with views
• Standardize on units, apply naming conventions?
– rename columns, apply functions with views
• Attach metadata?
– add new tables with descriptive names, add new columns with views
• Data cleaning, quality control?
– hide bad values with views
• Maintain provenance?
– inspect view dependencies
• Propagate updates?
– view maintenance
• Protect sensitive data?
– expose subsets with views (assuming views carry permissions)
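A few of the items above can be sketched with Python's sqlite3 module. Everything here is an illustrative assumption: the table raw_ctd, its columns, the -999 sentinel, and the use of SQLite's sqlite_master catalog as a stand-in for whatever dependency tracking a real service provides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical raw upload: terse names, Fahrenheit, -999 as a missing-value marker.
    CREATE TABLE raw_ctd (stn TEXT, t_f REAL, sal REAL);
    INSERT INTO raw_ctd VALUES ('A', 53.6, 31.2), ('B', 55.4, -999), ('C', 50.0, 30.8);

    -- Standardize units and names: rename columns, apply a function.
    CREATE VIEW ctd_standard AS
    SELECT stn AS station,
           (t_f - 32) * 5.0 / 9.0 AS temp_c,
           sal AS salinity
    FROM raw_ctd;

    -- Quality control: hide bad values instead of deleting rows.
    CREATE VIEW ctd_clean AS
    SELECT * FROM ctd_standard WHERE salinity <> -999;
""")

clean_stations = [
    row[0] for row in conn.execute("SELECT station FROM ctd_clean ORDER BY station")
]
print(clean_stations)

# Provenance: the stored view definitions show what each view was derived from.
view_definitions = dict(
    conn.execute("SELECT name, sql FROM sqlite_master WHERE type = 'view'")
)
for name, sql in view_definitions.items():
    print(name, "<-", sql)
```

The raw table is never modified, so a different cleaning rule is just another view over the same data.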
Pay-as-you-go metadata
Deeply nested hierarchies of views
Provenance
Controlled Sharing
Implicit re-execution
Bring the computation to the data
• Rich query services, not data cemeteries
• Avoid “Transloading”: pointless data movement from one environment to another
– Compute to Vis
– Server to Client
– Cloud to Cloud
• “Share the soup” and curate incrementally as a side effect of usage
• Pay-as-you-go metadata
USE CASES
SeaFlow
Francois Ribalet, Jarred Swalwell, Ginger Armbrust
[Instrument diagram: Laser, Microscope Objective, Pinhole Lens, Nozzle; detectors d1, d2, FSC (Forward scatter), Orange fluo, Red fluo]
[Scatter plots: axes d1 / FSC and d2 / FSC; axes RED fluorescence and FSC; populations labeled Picoplankton, Nanoplankton, IS, Ultraplankton, Prochlorococcus]
Continuous observations of various phytoplankton groups from 1-20 µm in size
Based on RED fluo: Prochlorococcus, Pico-, Ultra- and Nanoplankton
Based on ORANGE fluo: Synechococcus, Cryptophytes
Based on FSC: Coccolithophores
Script-oriented computing was killing them
• Scripts (typically in R) must be pre-shared with all collaborators
• When the data changes, everybody has to re-run all the scripts
• When the scripts change, everybody has to re-run all the scripts
• Implicit assumption that all data fits in main memory
• Terrible provenance, terrible reproducibility
• Pipeline of scripts dependent on intricate file formats and file naming schemes
Howe, et al., CISE 2012
Workflow Systems
• Scripts (typically in R) must be pre-shared with all collaborators
• When the data changes, everybody has to re-run all the scripts
• When the scripts change, everybody has to re-run all the scripts.
• Implicit assumption that data fits in main memory
• No provenance
• Pipeline of scripts dependent on intricate file formats and/or file naming schemes
Steven Roberts
SQL as a lab notebook: http://bit.ly/16Xj2JP
Popular service for Bioinformatics Workflows
Halperin, Howe, et al. SSDBM 2013
Robin Kodner
“I have had two students who are struggling with R come up and tell me how much more they like working in SQLShare.”
Data management and statistics for biologists
“An undergraduate student and I are working with gigabytes of tabular data derived from analysis of protein surfaces.
Previously, we were using huge directory trees and plain text files.
Now we can accomplish a 10 minute 100 line script in 1 line of SQL.”
-- Andrew D White
Andrew White, UW Chemistry
Decoding nonspecific interactions from nature. A. White, A. Nowinski, W. Huang, A. Keefe, F. Sun, S. Jiang. (2012) Chemical Science. Accepted
SELECT x.strain, x.chr, x.region AS snp_region,
       x.start_bp AS snp_start_bp, x.end_bp AS snp_end_bp,
       w.start_bp AS nc_start_bp, w.end_bp AS nc_end_bp,
       w.category AS nc_category,
       CASE
         WHEN (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp) THEN x.end_bp - x.start_bp + 1
         WHEN (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp) THEN x.end_bp - w.start_bp + 1
         WHEN (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp) THEN w.end_bp - x.start_bp + 1
       END AS len_overlap
FROM [koesterj@washington.edu].[hotspots_deserts.tab] x
INNER JOIN [koesterj@washington.edu].[table_noncoding_positions.tab] w
  ON x.chr = w.chr
WHERE (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp)
   OR (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp)
   OR (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp)
ORDER BY x.strain, x.chr ASC, x.start_bp ASC
Non-programmers can write very complex queries (rather than relying on staff programmers)
Example: Computing the overlaps of two sets of blast results
We see thousands of queries written by non-programmers
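For readers who want to experiment with interval overlaps, here is a small sketch in Python's sqlite3 module. It is not the researcher's query: it uses the common min/max formulation of overlap length (which also handles one interval lying entirely inside the other), and the table names, columns, and rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical, simplified versions of the two interval tables.
    CREATE TABLE snp (chr TEXT, start_bp INTEGER, end_bp INTEGER);
    CREATE TABLE noncoding (chr TEXT, start_bp INTEGER, end_bp INTEGER);
    INSERT INTO snp VALUES ('chr1', 100, 200), ('chr1', 500, 600), ('chr2', 10, 20);
    INSERT INTO noncoding VALUES ('chr1', 150, 250), ('chr1', 520, 540), ('chr2', 30, 40);
""")

# Two closed intervals on the same chromosome intersect when each starts
# before the other ends; the overlap runs from the later start to the earlier end.
# SQLite's two-argument MIN/MAX are scalar functions.
overlaps = conn.execute("""
    SELECT x.chr,
           x.start_bp AS snp_start_bp,
           w.start_bp AS nc_start_bp,
           MIN(x.end_bp, w.end_bp) - MAX(x.start_bp, w.start_bp) + 1 AS len_overlap
    FROM snp x
    INNER JOIN noncoding w ON x.chr = w.chr
    WHERE x.start_bp <= w.end_bp AND w.start_bp <= x.end_bp
    ORDER BY x.chr, x.start_bp
""").fetchall()

for row in overlaps:
    print(row)
```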
DATA SCIENCE
Education
Drew Conway’s Data Science Venn Diagram
“I worry that the Data Scientist role is like the mythical “webmaster” of the 90s: master of all trades.”
-- Aaron Kimball, CTO Wibidata
But what are the abstractions of data science?
tools abstr.
“Data Jujitsu”
“Data Wrangling”
“Data Munging”
Translation: “We have no idea what this is all about”
Claim: Relational Algebra is the universal formalism for data modeling and manipulation, and every data scientist should know it
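To make the claim concrete, the core operators fit in a few lines. This is a toy sketch in plain Python, assuming relations are lists of dictionaries; the function names and sample data are illustrative and not taken from any library.

```python
def select(relation, predicate):
    """Selection: keep the rows that satisfy the predicate."""
    return [row for row in relation if predicate(row)]


def project(relation, columns):
    """Projection: keep only the named columns, dropping duplicate rows."""
    seen, result = set(), []
    for row in relation:
        key = tuple(row[c] for c in columns)
        if key not in seen:
            seen.add(key)
            result.append(dict(zip(columns, key)))
    return result


def join(left, right, on):
    """Equi-join on a single shared column name."""
    return [{**l, **r} for l in left for r in right if l[on] == r[on]]


# Made-up relations for illustration.
genes = [
    {"gene": "g1", "cog": "COG1"},
    {"gene": "g2", "cog": "COG2"},
    {"gene": "g3", "cog": "COG1"},
]
hits = [
    {"cog": "COG1", "sample": "s1"},
    {"cog": "COG2", "sample": "s2"},
]

# Every operator takes relations and returns a relation, so they compose freely.
result = project(
    select(join(genes, hits, "cog"), lambda row: row["sample"] == "s1"),
    ["gene"],
)
print(result)
```

The nested call corresponds to the SQL query SELECT DISTINCT gene FROM genes JOIN hits USING (cog) WHERE sample = 's1'.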
UW Data Science Education Efforts
Audiences:
• Students
– CS/Informatics: undergrads, grads
– Non-Major: undergrads, grads
• Non-Students: professionals, researchers
Offerings:
• UWEO Data Science Certificate
• Graduate Certificate in Big Data
• CS Data Management Courses
• eScience workshops
• Intro to data programming
• eScience Masters (planned)
• Coursera Course: Intro to Data Science
Previous courses:
• Scientific Data Management, Graduate CS, Summer 2006, Portland State University
• Scientific Data Management, Graduate CS, Spring 2010, University of Washington
Bill Howe
“Very very interesting tht there is high correlation in the way the election results were being announced and the way the graph is shaped.”
“I was quite amazined that I was able to obtain this analysis with just 2 days of lecturing and practice!”
“Inspired by all my new learning, I thought about doing a little sentiment analysis myself for our national elections!”
“Darth Grader”
… I was frankly amazed that you were so fast in responding to my query, in spite of the class being so huge. I’d like to thank you so much for teaching this course. This has been one of the most useful courses I’ve ever taken and you’re an awesome instructor. With such great quality education freely available, I wonder why I joined a Master’s program haha. Thanks again and take care. I’ll try my best to finish another competition.
“With such great quality education freely available, I wonder why I
joined a Master’s program haha.”
INTELLECTUAL INFRASTRUCTURE
Institution
Seek work in “Pasteur’s Quadrant”
[Figure: Pasteur’s Quadrant; axes: Considerations of use, Quest for fundamental understanding; quadrants labeled Pasteur, Edison, Bohr]
Multiple modes of interaction, multiple time scales
1-2 years and up
1-2 weeks and down
1-2 quarters
Incubation (projects)
Communication (events)
Collaboration (partnerships)
2018
2008
Some local observations:
• Big data work exposes common ground
• Every job is becoming “data scientist”
• More π-shaped people!
• Democratization to the long tail is key
• Industry and research aren’t too different
Incubator
• Seed grants to students and postdocs
• Rotating staff from science and industry
• An evolving portfolio of reusable tools
• Produce digital capital and human capital
Data Science Incubator
2013
On NoSQL
http://sqlshare.escience.washington.edu
billhowe@cs.washington.edu
http://escience.washington.edu
WHY SQL?
5/18/10 Garret Cole, eScience Institute
What’s the point?
• Conventional wisdom says “Science data isn’t relational”
– This is nonsense
• Conventional wisdom says “Scientists won’t write SQL”
– This is nonsense
• So why aren’t databases being used more often?
– They’re a PITA
• We implicate difficulty in
– installation, configuration
– schema design, data loading
– performance tuning
– app-building (NoGUI?)
We ask instead, “What kind of platform can support ad hoc scientific Q&A with SQL?”
God made the integers; all else is the work of man.
(Leopold Kronecker, 19th Century Mathematician)
slide src: Mike Franklin
Codd made relations; all else is the work of man.
(Raghu Ramakrishnan, DB text book author)
Key Idea: Algebraic Optimization
N = ((z*2)+((z*3)+0))/1
Algebraic Laws:
1. (+) identity: x+0 = x
2. (/) identity: x/1 = x
3. (*) distributes: (n*x+n*y) = n*(x+y)
4. (*) commutes: x*y = y*x
Apply rules 1, 3, 4, 2:
N = (2+3)*z
two operations instead of five, no division operator
Same idea works with the Relational Algebra!
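The rewrite can be mechanized. Below is a minimal sketch in Python, assuming expressions are nested tuples of the form (operator, left, right); the function names are invented for this example, and a real query optimizer applies analogous rules to relational operators rather than arithmetic.

```python
import operator

OPS = {"+": operator.add, "*": operator.mul, "/": operator.truediv}


def is_op(expr, op):
    return isinstance(expr, tuple) and expr[0] == op


def rewrite(expr, rule):
    """Apply one rule bottom-up to every node of the expression tree."""
    if isinstance(expr, tuple):
        op, left, right = expr
        expr = (op, rewrite(left, rule), rewrite(right, rule))
    return rule(expr)


def add_identity(e):   # rule 1: x + 0 = x
    return e[1] if is_op(e, "+") and e[2] == 0 else e


def div_identity(e):   # rule 2: x / 1 = x
    return e[1] if is_op(e, "/") and e[2] == 1 else e


def distribute(e):     # rule 3: n*x + n*y = n*(x + y)
    if is_op(e, "+") and is_op(e[1], "*") and is_op(e[2], "*") and e[1][1] == e[2][1]:
        return ("*", e[1][1], ("+", e[1][2], e[2][2]))
    return e


def commute(e):        # rule 4: x*y = y*x, used here to move a variable to the right
    if is_op(e, "*") and isinstance(e[1], str) and not isinstance(e[2], str):
        return ("*", e[2], e[1])
    return e


def count_ops(e):
    return 1 + count_ops(e[1]) + count_ops(e[2]) if isinstance(e, tuple) else 0


def evaluate(e, env):
    if isinstance(e, str):
        return env[e]
    if not isinstance(e, tuple):
        return e
    return OPS[e[0]](evaluate(e[1], env), evaluate(e[2], env))


# N = ((z*2) + ((z*3) + 0)) / 1
original = ("/", ("+", ("*", "z", 2), ("+", ("*", "z", 3), 0)), 1)

# Apply the rules in the order given on the slide: 1, 3, 4, 2.
optimized = original
for rule in (add_identity, distribute, commute, div_identity):
    optimized = rewrite(optimized, rule)

print(optimized, count_ops(original), count_ops(optimized))
```

Both trees evaluate to the same value for any z, but the rewritten one has two operators instead of five and no division.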
[Figure: landscape of data-related areas, communities, and systems]
Areas: Data Curation, Data Management, Data Analytics, Cyberinfrastructure, Big Data, Open Data, Intellectual Infrastructure
Communities: Database & Systems Researchers; Stats, ML, and Viz; Library Science Researchers
Systems and repositories: DataONE, Hadoop, GraphLab, Vertica, Greenplum, Oracle/MS/IBM, Dataverse, R/SPSS/MATLAB/Stata, Dryad, ICPSR, Geodata.gov, Tableau, Weka, GenBank, Spark, Pig, HIVE, Shark, Dremel
SQLShare as a CS Research Platform
• Automatic “Starter” Queries
– (Bill Howe, Garret Cole, Nodira Khoussainova, Leilani Battle)
• VizDeck: Automatic Mashups and Visualization
– (Bill Howe, Alicia Key, Daniel Perry, Cecilia Aragon)
• Info Extraction from Spreadsheets
– (Mike Cafarella, Dave Maier, Bill Howe, Sagar Chitnis, Abdu Alwani)
• Scalable Analytics-as-a-Service
– (Dan Suciu, Magda Balazinska, Bill Howe)
• Optimizing Iterative Queries for Machine Learning
– (Dan Suciu, Magda Balazinska, Bill Howe)
• Case Studies in Metagenomics, Chemistry, more
Venues:
– SSDBM 2011, SIGMOD 2011 (demo)
– SSDBM 2011, CHI 2012, SIGMOD 2012 (demo)
– VLDB 2010, Datalog2.0 2012, CIDR 2013
– Data engineering 2012, CiSE 2012
Two Failure Modes Serving the Long Tail
over-abstraction
“uber-system”
“neither fish nor fowl”
tries to address so many requirements, it addresses none
too reactive, ad hoc, one-off
Addresses exactly 1 application
no leverage; doesn’t scale
A stripped-down version of Jim Gray’s “20 questions” methodology
Experimental Engagement Algorithm for the Long Tail
1. Get the data
2. Load the data “as is” – no schema design
3. Get ~20 questions (in English)
4. Translate the questions into SQL (when possible)
5. Provide these “starter queries” to the researchers
Q: Can researchers’ questions be expressed in SQL?
Q: Are a few examples sufficient for novices to self-train with SQL?
Q: Can we scale this process up?
Q: If so, will the use of SQL reduce their data handling overhead?
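Steps 3 to 5 amount to handing researchers a list of English questions paired with SQL. A rough sketch of what such a starter set might look like, using Python's sqlite3 module; the table, questions, and data are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Loaded "as is": no declared types, keys, or constraints. Rows are made up.
    CREATE TABLE casts (station, depth_m, temp_c);
    INSERT INTO casts VALUES ('A', 5, 12.1), ('A', 50, 9.8), ('B', 5, 13.0);
""")

# Each starter query pairs a researcher's question with a SQL translation.
starter_queries = [
    ("How many measurements were taken at each station?",
     "SELECT station, COUNT(*) AS n FROM casts GROUP BY station ORDER BY station"),
    ("What is the warmest temperature observed?",
     "SELECT MAX(temp_c) FROM casts"),
    ("Which stations were sampled below 20 m?",
     "SELECT DISTINCT station FROM casts WHERE depth_m > 20"),
]

answers = {}
for question, sql in starter_queries:
    answers[question] = conn.execute(sql).fetchall()
    print(question, "->", answers[question])
```

Researchers can then adapt a starter query by changing a column name or a threshold, which is one way novices might pick up SQL from examples.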