Hazy: Making Statistical Applications Easier to Build and Maintain
Christopher Ré (BigLearn)
Collaborators listed throughout
Big data is the future.
Big data is great for vendors and consulting $$$, but is ‘Big’ the heart of the problem?
How big is "big"?
Is it a GB? That fits on your phone… but LeCun & Farabet have the same spirit.
Is it a TB? A TB is big on Hadoop… but it fits in main memory in many apps.
Is it a PB? Do you have that problem?
NB: I love Hadoop, Y!, and Cloudera.
"Let's end peta-philia" – John Doyle
Maybe something other than size is the common thread?
Big Shifts Underlying Big Data
(1) More signal (data) beats a more complex model nine times out of ten. This is why we acquire big data.
(2) Move computation to the data, not data to the computation.
My bet: the key is the ability to quickly sling signal together, where the data live.
Two Trends that Drive Hazy
1. Data in an unprecedented number of formats
2. Arms race for deeper understanding of data
Statistical tools attack both 1 and 2.
Hazy = statistical + data management techniques
Hazy’s Goal: Understand common patterns in deploying & maintaining statistical tools on data.
Hazy’s Thesis
The next breakthrough in data analysis may not be a new data analysis algorithm…
…but may be in the ability to rapidly combine, deploy, and maintain existing algorithms.
Outline
Three Application Areas for Hazy
Victor/Bismarck (in Parallel RDBMS)
HogWild! (Shared Memory)
Other Hazy Projects
Data is constantly generated on the Web, Twitter, blogs, and Facebook.
Extract and classify sentiment about products, ad campaigns, and customer-facing entities.
Often-deployed statistical tools: extraction (CRFs) and classification (SVMs).
DARPA Machine Reading: “List members of the Brazilian Olympic Team in this corpus with years of membership”
Simplified View of MR Data Flow
[Diagram: ClueWeb, Freebase, and a news corpus (500M docs, 15 TB) flow through the Felix infrastructure into Entity Linking and Slot Filling, producing mention and entity features and pairs of entity features.]
Goals: High quality. Low effort. Easy to test.
Two key features of our system, Felix:
1. Felix allows many best-in-breed algorithms to work jointly together: extraction (CRF) and classification (SVM).
2. A simple SQL-like rule language (Markov Logic) lets us leverage Wikipedia, Freebase, and ClueWeb in one high-level language (TBs of data).
Key: Marry simple, robust statistical tools with flexible rules, and be able to scale to TBs.
TAC-KBP: F1 = 39; state-of-the-art systems: F1 = 20.
Feng Niu, Ce Zhang, Josh Slauson: www.cs.wisc.edu/hazy/felix
Rough Demo!
A physicist interpolates sensor readings and uses regression to understand the data more deeply.
Digital Optical Module (DOM)
Workflow of IceCube
In ice: detection occurs.
At the Pole: an algorithm says "Interesting!"
Via satellite: interesting DOM readings travel north.
In Madison: lots of data analysis.
A Key Step: Detecting Tracks
The mathematical structure used to track neutrinos is similar to labeling text.
Mark's code goes to the South Pole in 2012.
Mark Wellons, Ben Recht, Francis Halzen, and Gary Hill
Rules + Simple Regression
NB: Many data analysts are ex-physicists
(Experts: Both are Convex Programs)
OCR and Speech
Getting the text is challenging! (statistical model errors)
A social scientist wants to extract the frequency of synonyms of English words in 18th-century texts.
The output of speech and OCR models is similar to text-labeling models (transducers).
Social scientists @ UW want to unify content management systems with an RDBMS.
Application Takeaways
Statistical processing on large data enables a wide variety of new applications.
An algorithm (incremental gradient methods, IGMs) that solves these models close to the data in (1) an RDBMS and (2) shared memory.
Goal: Develop the ability to rapidly combine, deploy, and maintain existing algorithms.
Outline
Three Application Areas for Hazy
Victor/Bismarck (in Parallel RDBMS)
HogWild! (Shared Memory)
Other Hazy Projects
Classify publications by subject area
Example: Linear Models
Label papers as DB Papers or Non-DB Papers.
1. Map each paper to R^d.
2. Classify via a plane x.
[Figure: papers plotted as points 1-5, with a plane x separating DB Papers from Non-DB Papers.]
The question is: how do we pick a good plane (x)?
Example: Linear Models
Input: labeled papers. Each point y_i is a paper vector and its label, DB (+) or not (-).
Idea: score each plane x and minimize misclassification.
Instead: change f to a smooth function of the distance to the plane, e.g., the squared distance (least squares), the hinge loss (SVM), or the log loss (logistic regression). Different f = different model.
[Figure: + and - points with a plane x; x says "DB Papers" on one side and "Non-DB Papers" on the other.]
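A minimal sketch (not from the talk) of how different choices of f score a candidate plane x on labeled papers; the function names and the (a, b) data layout are illustrative.

```python
import numpy as np

# Each labeled paper y_i is a pair (a, b): a feature vector a and a label b in {+1, -1}.

def squared_loss(x, a, b):       # least squares
    return 0.5 * (np.dot(x, a) - b) ** 2

def hinge_loss(x, a, b):         # SVM
    return max(0.0, 1.0 - b * np.dot(x, a))

def log_loss(x, a, b):           # logistic regression
    return np.log1p(np.exp(-b * np.dot(x, a)))

def score_plane(x, papers, f=hinge_loss):
    # A plane's score is the sum of per-paper losses; different f = different model.
    return sum(f(x, a, b) for a, b in papers)
```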
Framework: Inverse Problems
minimize over x:  Σ_{i=1..N} f(x, y_i) + P(x)
where x is the model, y_i is a data item, f scores the error, and P enforces a prior.
Examples:
1. Paper classification: y_i is the paper with its label
2. Neutrino tracking: y_i is a DOM (sensor) reading
3. CRFs: y_i is a (document, labeling)
4. Netflix: y_i is a (user, movie, rating)
Claim: a general data analysis technique that is no more difficult to compute than a SQL AVG.
(Experts: f and P are convex.)
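As a concrete sketch of this objective (not from the talk), assume the hinge loss plays the role of f and an L2 penalty plays the role of P; the names and the regularization constant are illustrative.

```python
import numpy as np

def hinge(x, a, b):
    # f(x, y_i): hinge loss on one labeled item y_i = (a, b)
    return max(0.0, 1.0 - b * np.dot(x, a))

def l2_prior(x, lam=0.1):
    # P(x): an L2 prior that penalizes large models
    return 0.5 * lam * np.dot(x, x)

def objective(x, data):
    # The inverse-problem objective: sum_i f(x, y_i) + P(x)
    return sum(hinge(x, a, b) for a, b in data) + l2_prior(x)
```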
Background: Gradient Methods
Gradient methods are iterative: 1. start at the current x; 2. take the gradient at x; 3. move in the opposite direction.
[Figure: the objective F(x), with gradient steps descending toward its minimum.]
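A minimal sketch of those three steps, assuming the gradient of F is available as a function; the step size and iteration count are illustrative.

```python
import numpy as np

def gradient_descent(grad_F, x0, step=0.1, iters=100):
    # 1. start at the current x; 2. take the gradient at x;
    # 3. move in the opposite direction.
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = x - step * grad_F(x)
    return x

# Example: minimize F(x) = (x - 3)^2, whose gradient is 2(x - 3).
x_star = gradient_descent(lambda x: 2 * (x - 3.0), x0=[0.0])
```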
Incremental Gradient Methods
Select a single data item to approximate the gradient.
Gradient methods are iterative: 1. start at the current x; 2. approximate the gradient at x by selecting j in {1…N}; 3. move in the opposite direction.
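A sketch of the incremental variant, assuming G(x, y_j) is the gradient of f on a single data item; visiting the items in a random order each pass is one common choice and is illustrative here.

```python
import random
import numpy as np

def incremental_gradient(G, data, x0, step=0.1, epochs=10):
    # Each step approximates the full gradient by G evaluated on one item y_j.
    x = np.asarray(x0, dtype=float)
    for _ in range(epochs):
        for j in random.sample(range(len(data)), len(data)):  # a random order
            x = x - step * G(x, data[j])
    return x
```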
Incremental Gradient Methods (iGMs)
Why use iGMs? iGMs converge to an optimum for many problems, but the real reason is:
iGMs are fast.
Technical connection: iGM processing is isomorphic to computing a SQL AVG
(x is the accumulator, G is an expression on a single tuple).
Solve statistical models in an RDBMS for free.
iGMs ≈ User-Defined Aggregates
Most RDBMSs expose three functions in a UDA:
1. Initialize(State)
2. Transition(State, data)
3. Terminate(State)
AVG: state in R^2 is (# terms, running total); Transition((n, T), d) = (n, T) + (1, d); Terminate((n, T)) = T/n.
Gradient: state in R^d is the model x; Transition(x, y_j) = x - G(x, y_j); Terminate(x) = x.
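A sketch of that correspondence in code, with the three UDA hooks written as plain Python methods; the class names and the aggregate driver are illustrative, not a specific RDBMS API.

```python
import numpy as np

class Avg:
    """AVG as a UDA: the state is (# terms, running total)."""
    def initialize(self):
        return (0, 0.0)
    def transition(self, state, d):
        n, total = state
        return (n + 1, total + d)
    def terminate(self, state):
        n, total = state
        return total / n

class Gradient:
    """An iGM as a UDA: the state is the model x; G is the scaled gradient on one tuple."""
    def __init__(self, G, x0):
        self.G = G
        self.x0 = np.asarray(x0, dtype=float)
    def initialize(self):
        return self.x0.copy()
    def transition(self, x, y):
        return x - self.G(x, y)
    def terminate(self, x):
        return x

def aggregate(uda, tuples):
    # How an engine drives any UDA during a table scan.
    state = uda.initialize()
    for t in tuples:
        state = uda.transition(state, t)
    return uda.terminate(state)
```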
Some subtle differences…
UDAs are typically commutative and algebraic.
IGMs are formally neither, but are morally both.
Morally algebraic: one can average models [Zinkevich et al. 10].
Morally commutative: different orders give different exact results, but any order converges to the same result.
Key: IGMs work (in parallel) off the shelf in a UDA. (This is how we do logistic regression on the Web.)
MADLib & Oracle versions coming. Terminology from Gray et al.
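A sketch of the "morally algebraic" point: run the gradient pass independently on partitions of the data and average the per-partition models, in the spirit of [Zinkevich et al. 10]; the partitioning and the averaging rule here are illustrative.

```python
import numpy as np

def partitioned_igm(G, partitions, x0, step=0.1, epochs=5):
    # Run an incremental gradient pass independently on each data partition...
    models = []
    for part in partitions:
        x = np.asarray(x0, dtype=float)
        for _ in range(epochs):
            for y in part:
                x = x - step * G(x, y)
        models.append(x)
    # ...then combine by averaging the per-partition models.
    return np.mean(models, axis=0)
```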
Hogwild!
[Niu, Recht, Ré, & Wright NIPS 11] Code on Web.
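A minimal shared-memory sketch in the spirit of Hogwild!: several threads update one shared model with no locks, relying on sparse gradients so collisions are rare; the sparse-gradient interface and the thread count are assumptions for illustration.

```python
import threading
import numpy as np

def hogwild_sgd(sparse_grad, data, dim, step=0.01, n_threads=4, epochs=5):
    x = np.zeros(dim)              # one shared model, never locked

    def worker(shard):
        for _ in range(epochs):
            for y in shard:
                # sparse_grad returns (coordinate, gradient value) pairs for one item
                for idx, g in sparse_grad(x, y):
                    x[idx] -= step * g    # racy, lock-free update
    shards = [data[i::n_threads] for i in range(n_threads)]
    threads = [threading.Thread(target=worker, args=(s,)) for s in shards]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return x
```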
Other Hazy work
• Dealing with data in different formats
  – Staccato: Arun Kumar. Query OCR documents [PODS 10 & VLDB 12]
  – Hazy goes to the South Pole: Mark Wellons. Trigger software for IceCube
• Machine Learning Algorithms
  – Hogwild!: run SGD on convex problems with no locking [NIPS 11]
  – Jellyfish: faster than Hogwild! on matrix completion [Opt Online 11]
• Maintain Supervised Techniques on Evolving Corpora
  – iClassify: M. Levent Koc. Maintain classifiers on evolving examples [VLDB 11]
  – CRFLex: Aaron Feng. Maintain CRFs on evolving corpora [ICDE 12]
• Populate Knowledge Bases from Text
  – Tuffy: SQL + weights to combine multiple models [VLDB 11]
  – Felix: running Markov Logic on millions of documents [Preprint arXiv 11]
  – Feng Niu, Ce Zhang, Josh Slauson, and Adel Ardalan
• Infrastructure with Mike Cafarella (Michigan)
  – Manimal: optimizing MapReduce with relational operations [VLDB 11]
Students who did the real work are shown in red.
Conclusion
Many statistical data analyses are no more complicated to compute than a SQL AVG.
The Hogwild! approach to multicore parallelism uses no locking but achieves good speedups.
Code, data, and papers: www.cs.wisc.edu/hazy