Generative Models for Crowdsourced Data
![Page 1: Generative Models for Crowdsourced Data](https://reader035.fdocuments.net/reader035/viewer/2022062410/5681602f550346895dcf44f0/html5/thumbnails/1.jpg)
Generative Models for Crowdsourced Data
![Page 2: Generative Models for Crowdsourced Data](https://reader035.fdocuments.net/reader035/viewer/2022062410/5681602f550346895dcf44f0/html5/thumbnails/2.jpg)
Outline
• What is Crowdsourcing?
• Modeling the labeling process
• Example with real data
• Extensions
• Future Directions
![Page 3: Generative Models for Crowdsourced Data](https://reader035.fdocuments.net/reader035/viewer/2022062410/5681602f550346895dcf44f0/html5/thumbnails/3.jpg)
What is Crowdsourcing?
• Human-based computation.
• Outsourcing certain steps of a computation to humans.
• “Artificial artificial intelligence.”
• Data science:
  – Making an immediate decision.
  – Creating a labeled data set for learning.
![Page 4: Generative Models for Crowdsourced Data](https://reader035.fdocuments.net/reader035/viewer/2022062410/5681602f550346895dcf44f0/html5/thumbnails/4.jpg)
Immediate Decision Workflow
![Page 5: Generative Models for Crowdsourced Data](https://reader035.fdocuments.net/reader035/viewer/2022062410/5681602f550346895dcf44f0/html5/thumbnails/5.jpg)
Labeled Data Set Workflow
![Page 6: Generative Models for Crowdsourced Data](https://reader035.fdocuments.net/reader035/viewer/2022062410/5681602f550346895dcf44f0/html5/thumbnails/6.jpg)
An Example HIT
![Page 7: Generative Models for Crowdsourced Data](https://reader035.fdocuments.net/reader035/viewer/2022062410/5681602f550346895dcf44f0/html5/thumbnails/7.jpg)
An Example HIT
![Page 8: Generative Models for Crowdsourced Data](https://reader035.fdocuments.net/reader035/viewer/2022062410/5681602f550346895dcf44f0/html5/thumbnails/8.jpg)
Funny enough …
• Not everybody agrees on the gender of a Twitter profile.
• Difficult Instances
• Worker Ability / Motivation
• Worker Bias
• Adversarial Behaviour
![Page 9: Generative Models for Crowdsourced Data](https://reader035.fdocuments.net/reader035/viewer/2022062410/5681602f550346895dcf44f0/html5/thumbnails/9.jpg)
Difficult Instance
![Page 10: Generative Models for Crowdsourced Data](https://reader035.fdocuments.net/reader035/viewer/2022062410/5681602f550346895dcf44f0/html5/thumbnails/10.jpg)
Difficult Instance
![Page 11: Generative Models for Crowdsourced Data](https://reader035.fdocuments.net/reader035/viewer/2022062410/5681602f550346895dcf44f0/html5/thumbnails/11.jpg)
Difficult Instance
![Page 12: Generative Models for Crowdsourced Data](https://reader035.fdocuments.net/reader035/viewer/2022062410/5681602f550346895dcf44f0/html5/thumbnails/12.jpg)
Worker Ability
![Page 13: Generative Models for Crowdsourced Data](https://reader035.fdocuments.net/reader035/viewer/2022062410/5681602f550346895dcf44f0/html5/thumbnails/13.jpg)
Worker Ability
![Page 14: Generative Models for Crowdsourced Data](https://reader035.fdocuments.net/reader035/viewer/2022062410/5681602f550346895dcf44f0/html5/thumbnails/14.jpg)
Worker Ability
![Page 15: Generative Models for Crowdsourced Data](https://reader035.fdocuments.net/reader035/viewer/2022062410/5681602f550346895dcf44f0/html5/thumbnails/15.jpg)
Worker Motivation
![Page 16: Generative Models for Crowdsourced Data](https://reader035.fdocuments.net/reader035/viewer/2022062410/5681602f550346895dcf44f0/html5/thumbnails/16.jpg)
Worker Motivation
![Page 17: Generative Models for Crowdsourced Data](https://reader035.fdocuments.net/reader035/viewer/2022062410/5681602f550346895dcf44f0/html5/thumbnails/17.jpg)
Worker Bias
![Page 18: Generative Models for Crowdsourced Data](https://reader035.fdocuments.net/reader035/viewer/2022062410/5681602f550346895dcf44f0/html5/thumbnails/18.jpg)
Worker Bias
![Page 19: Generative Models for Crowdsourced Data](https://reader035.fdocuments.net/reader035/viewer/2022062410/5681602f550346895dcf44f0/html5/thumbnails/19.jpg)
Worker Bias
![Page 20: Generative Models for Crowdsourced Data](https://reader035.fdocuments.net/reader035/viewer/2022062410/5681602f550346895dcf44f0/html5/thumbnails/20.jpg)
Disagreements
• When some workers say “male” and some workers say “female”, what to do?
![Page 21: Generative Models for Crowdsourced Data](https://reader035.fdocuments.net/reader035/viewer/2022062410/5681602f550346895dcf44f0/html5/thumbnails/21.jpg)
Majority Rules Heuristic
• Assign label l to item x if a majority of workers agree.
• Otherwise item x remains unlabeled.
• Ignores prior worker data.
• Introduces bias in labeled data.
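A minimal sketch of the heuristic, assuming "majority" means a strict majority (more than half of the votes). The function name `majority_label` is hypothetical and only for illustration.

```python
from collections import Counter

def majority_label(worker_labels):
    """Return the label chosen by a strict majority of workers, else None."""
    if not worker_labels:
        return None
    label, count = Counter(worker_labels).most_common(1)[0]
    # Strict majority: more than half of all votes (an assumption of this sketch).
    return label if 2 * count > len(worker_labels) else None

print(majority_label(["male", "male", "female"]))  # male
print(majority_label(["male", "female"]))          # None (item stays unlabeled)
```

Every vote counts equally here, whatever a worker's track record, which is the "ignores prior worker data" limitation.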
![Page 24: Generative Models for Crowdsourced Data](https://reader035.fdocuments.net/reader035/viewer/2022062410/5681602f550346895dcf44f0/html5/thumbnails/24.jpg)
Train on all labels
• For labeled data set workflow.
• Add all item-label pairs to the data set.
• Equivalent to cost vector of:
  – $P(l \mid \{l_w\}) = \frac{1}{n_w} \sum_w 1\{l = l_w\}$
• Ignores prior worker data.
• Models the crowd, not the “ground truth.”
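The empirical label distribution in the formula above can be sketched as follows. The function name and the explicit `label_set` argument are assumptions for illustration.

```python
from collections import Counter

def empirical_label_distribution(worker_labels, label_set):
    """P(l | {l_w}) = (1 / n_w) * sum_w 1{l = l_w} for each l in label_set."""
    n_w = len(worker_labels)
    if n_w == 0:
        raise ValueError("need at least one worker label")
    counts = Counter(worker_labels)
    return {l: counts[l] / n_w for l in label_set}

dist = empirical_label_distribution(["male", "male", "female"], ["male", "female"])
print(dist)
```

Each label's weight is simply the fraction of workers who chose it, so the target describes the crowd rather than a latent true label.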
![Page 27: Generative Models for Crowdsourced Data](https://reader035.fdocuments.net/reader035/viewer/2022062410/5681602f550346895dcf44f0/html5/thumbnails/27.jpg)
What is ground truth?
• Different theoretical approaches.
  – PAC learning with noisy labels.
  – Fully-adversarial active learning.
• Bayesians have been very active.
  – “Easy” to posit a functional form and quickly develop inference algorithms.
  – Issue of model correctness is ultimately empirical.
![Page 28: Generative Models for Crowdsourced Data](https://reader035.fdocuments.net/reader035/viewer/2022062410/5681602f550346895dcf44f0/html5/thumbnails/28.jpg)
Bayesian Literature
• (2009) Whitehill et al. GLAD framework.
  – (1979) Dawid and Skene. Maximum Likelihood Estimation of Observer Error-Rates Using the EM Algorithm.
• (2010) Welinder et al. The Multidimensional Wisdom of Crowds.
• (2010) Raykar et al. Learning from Crowds.
![Page 29: Generative Models for Crowdsourced Data](https://reader035.fdocuments.net/reader035/viewer/2022062410/5681602f550346895dcf44f0/html5/thumbnails/29.jpg)
Bayesian Approach
• Define ground truth via a generative model which describes how “ground truth” is related to the observed output of crowdsource workers.
• Fit to observed data.
• Extract posterior over ground truth.
• Make decision or train classifier.
![Page 30: Generative Models for Crowdsourced Data](https://reader035.fdocuments.net/reader035/viewer/2022062410/5681602f550346895dcf44f0/html5/thumbnails/30.jpg)
Generative Model
![Page 31: Generative Models for Crowdsourced Data](https://reader035.fdocuments.net/reader035/viewer/2022062410/5681602f550346895dcf44f0/html5/thumbnails/31.jpg)
Example: Binary Classification
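• Each worker has a matrix:
$$\alpha = \begin{pmatrix} -1 & \alpha_{01} \\ \alpha_{10} & -1 \end{pmatrix}$$
• Each item has a scalar difficulty $\beta > 0$.
• $P(l_w = j \mid z = i) = e^{-\beta \alpha_{ij}} / \sum_k e^{-\beta \alpha_{ik}}$
• $\alpha_{ij} \sim N(\mu_{ij}, 1)$; $\mu_{ij} \sim N(0, 1)$
• $\log \beta \sim N(\rho, 1)$; $\rho \sim N(0, 1)$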
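A small sketch of the label likelihood above, assuming the exponent reads $-\beta \alpha_{ij}$ and the diagonal entries are fixed at -1. The function name and the numeric values of the matrix are made up for illustration.

```python
import math

def label_probabilities(alpha_row, beta):
    """P(l_w = j | z = i) for every j, given row i of a worker's alpha matrix."""
    scores = [-beta * a for a in alpha_row]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical worker matrix: diagonal fixed at -1, off-diagonals free.
alpha = [[-1.0, 1.5],
         [0.5, -1.0]]

for beta in (0.1, 1.0, 5.0):
    print(beta, label_probabilities(alpha[0], beta))
```

Under this parameterization the distribution approaches uniform as $\beta \to 0$ and concentrates on the entry with the smallest $\alpha_{ij}$ as $\beta$ grows.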
![Page 32: Generative Models for Crowdsourced Data](https://reader035.fdocuments.net/reader035/viewer/2022062410/5681602f550346895dcf44f0/html5/thumbnails/32.jpg)
Other Problems
• Multiclass classification:
  – Same as binary with larger confusion matrix.
• Ordinal classification (“Hot or not”):
  – Confusion matrix has special form.
  – $O(L)$ parameters instead of $O(L^2)$.
• Multilabel classification:
  – Reduce to multiclass on power set.
  – Assume low-rank confusion matrix.
![Page 33: Generative Models for Crowdsourced Data](https://reader035.fdocuments.net/reader035/viewer/2022062410/5681602f550346895dcf44f0/html5/thumbnails/33.jpg)
EM
![Page 34: Generative Models for Crowdsourced Data](https://reader035.fdocuments.net/reader035/viewer/2022062410/5681602f550346895dcf44f0/html5/thumbnails/34.jpg)
EM
• Initially all workers are assumed moderately accurate and without bias.
  – Implies initial estimate of ground truth distribution favors consensus.
  – Disagreeing with the majority is a likely error.
![Page 35: Generative Models for Crowdsourced Data](https://reader035.fdocuments.net/reader035/viewer/2022062410/5681602f550346895dcf44f0/html5/thumbnails/35.jpg)
EM
• Initially all workers are assumed moderately accurate.
• Workers consistently in the minority have their confusion probabilities increase.
• Workers with higher confusion probabilities contribute less to the distribution of ground truth.
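The dynamics in these bullets can be illustrated with a deliberately simplified EM loop. This sketch swaps in a "one-coin" model (a single accuracy per worker, binary labels, uniform prior over the true label, add-one smoothing) in place of the $\alpha$/$\beta$ model from the slides. All names and the toy data are assumptions.

```python
def em_one_coin(votes, n_iters=20, init_accuracy=0.7):
    """votes: dict item -> list of (worker, label), label in {0, 1}."""
    workers = {w for pairs in votes.values() for w, _ in pairs}
    # Start with every worker moderately accurate.
    accuracy = {w: init_accuracy for w in workers}
    posterior = {}
    for _ in range(n_iters):
        # E-step: posterior over the true label of each item.
        for item, pairs in votes.items():
            like = [1.0, 1.0]  # uniform prior over z (assumption)
            for w, l in pairs:
                for z in (0, 1):
                    like[z] *= accuracy[w] if l == z else 1.0 - accuracy[w]
            total = like[0] + like[1]
            posterior[item] = [like[0] / total, like[1] / total]
        # M-step: expected fraction of each worker's labels that are correct.
        agree = {w: 0.0 for w in workers}
        count = {w: 0 for w in workers}
        for item, pairs in votes.items():
            for w, l in pairs:
                agree[w] += posterior[item][l]
                count[w] += 1
        for w in workers:
            # Add-one smoothing keeps accuracies strictly inside (0, 1).
            accuracy[w] = (agree[w] + 1.0) / (count[w] + 2.0)
    return posterior, accuracy

# Toy data: workers "a" and "b" always agree, "c" is always in the minority.
votes = {
    "x1": [("a", 1), ("b", 1), ("c", 0)],
    "x2": [("a", 1), ("b", 1), ("c", 0)],
    "x3": [("a", 0), ("b", 0), ("c", 1)],
    "x4": [("a", 0), ("b", 0), ("c", 1)],
}
posterior, accuracy = em_one_coin(votes)
print(accuracy)
```

On this toy data the estimated accuracy of "c" falls below 0.5 while "a" and "b" rise, so the posterior follows the majority more strongly with each iteration.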
![Page 37: Generative Models for Crowdsourced Data](https://reader035.fdocuments.net/reader035/viewer/2022062410/5681602f550346895dcf44f0/html5/thumbnails/37.jpg)
“Different” workers are marginalized
![Page 38: Generative Models for Crowdsourced Data](https://reader035.fdocuments.net/reader035/viewer/2022062410/5681602f550346895dcf44f0/html5/thumbnails/38.jpg)
“Different” workers are marginalized
• Workers that are consistently in the minority will not contribute strongly to the posterior distribution over ground truth.
  – Even if they are actually more accurate.
• Can correct when one or more accurate workers are paired with some inaccurate workers.
• Good for breaking ties.
• Raykar et al.
![Page 39: Generative Models for Crowdsourced Data](https://reader035.fdocuments.net/reader035/viewer/2022062410/5681602f550346895dcf44f0/html5/thumbnails/39.jpg)
Example with real data
![Page 40: Generative Models for Crowdsourced Data](https://reader035.fdocuments.net/reader035/viewer/2022062410/5681602f550346895dcf44f0/html5/thumbnails/40.jpg)
Online EM
• Given a set of worker-label pairs for a single item:
• (Inference) Using current α, find most likely β* and distribution q* over ground truth.
• (Training) Do SGD update of α with respect to EM auxiliary function evaluated at β* and q*.
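One way to read the two steps, as a rough sketch only. It assumes binary labels, a uniform prior over the true label, $\rho = 0$, a grid search over $\log \beta$ in place of a proper optimizer, and it ignores the priors on $\alpha$. The names `label_prob`, `posterior`, and `online_em_step` are hypothetical.

```python
import math

def label_prob(alpha_row, beta, j):
    """P(l_w = j | z = i): softmax over j of -beta * alpha_ij."""
    scores = [-beta * a for a in alpha_row]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return weights[j] / sum(weights)

def posterior(alpha, beta, pairs):
    """Return q(z) and the marginal likelihood, with a uniform prior on z."""
    joint = []
    for z in (0, 1):
        p = 0.5
        for w, l in pairs:
            p *= label_prob(alpha[w][z], beta, l)
        joint.append(p)
    total = sum(joint)
    return [p / total for p in joint], total

def online_em_step(alpha, pairs, lr=0.01):
    """One update from the (worker, label) pairs of a single item."""
    # Inference: grid search over log(beta), scoring the marginal likelihood
    # plus an N(0, 1) log-prior on log(beta) (rho = 0 is an assumption).
    best = None
    for step in range(-20, 21):
        log_beta = step / 10.0
        beta = math.exp(log_beta)
        q, marginal = posterior(alpha, beta, pairs)
        score = math.log(marginal) - 0.5 * log_beta ** 2
        if best is None or score > best[0]:
            best = (score, beta, q)
    _, beta_star, q_star = best
    # Training: one gradient-ascent step on sum_z q*(z) sum_w log P(l_w | z)
    # with respect to the off-diagonal alphas (the diagonal stays at -1).
    for w, l in pairs:
        for z in (0, 1):
            j = 1 - z  # the only free entry in row z
            p_j = label_prob(alpha[w][z], beta_star, j)
            grad = -beta_star * q_star[z] * ((1.0 if l == j else 0.0) - p_j)
            alpha[w][z][j] += lr * grad
    return beta_star, q_star

alpha = {w: [[-1.0, 1.0], [1.0, -1.0]] for w in ("a", "b", "c")}
beta_star, q_star = online_em_step(alpha, [("a", 1), ("b", 1), ("c", 0)])
print(beta_star, q_star, alpha["c"])
```

In this toy run worker "c" disagrees with the other two, so the step lowers `alpha["c"][1][0]`, making the confusion "says 0 when the truth is 1" more probable for that worker.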
![Page 42: Generative Models for Crowdsourced Data](https://reader035.fdocuments.net/reader035/viewer/2022062410/5681602f550346895dcf44f0/html5/thumbnails/42.jpg)
Things to do with q*
• Take an immediate cost-sensitive decision:
  – $d^* = \arg\min_d E_{z \sim q^*}[f(z, d)]$
• Train an (importance-weighted) classifier:
  – Cost vector: $c_d = E_{z \sim q^*}[f(z, d)]$
  – E.g., 0/1 loss: $c_d = 1 - q^*_d$
  – E.g., binary 0/1 loss: $|c_1 - c_0| = |1 - 2 q^*_1|$
  – No need to decide what the true label is!
• Raykar et al.: why not jointly estimate the classifier and worker confusion?
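A short sketch of both uses of q*, assuming the set of decisions equals the set of labels and that q* is given as a list indexed by label. The function names and the example posterior are made up.

```python
def expected_costs(q, loss):
    """c_d = E_{z ~ q}[loss(z, d)] for each decision d."""
    n = len(q)
    return [sum(q[z] * loss(z, d) for z in range(n)) for d in range(n)]

def zero_one(z, d):
    """0/1 loss: no cost when the decision matches the true label."""
    return 0.0 if z == d else 1.0

q_star = [0.2, 0.8]  # hypothetical posterior over a binary ground truth
costs = expected_costs(q_star, zero_one)
decision = min(range(len(costs)), key=costs.__getitem__)
importance = abs(costs[1] - costs[0])  # binary importance weight

print(costs)     # [0.8, 0.2]
print(decision)  # 1
```

The classifier trains directly on the cost vector, so an uncertain posterior (q* near 0.5) simply yields a small importance weight instead of forcing a hard label.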
![Page 43: Generative Models for Crowdsourced Data](https://reader035.fdocuments.net/reader035/viewer/2022062410/5681602f550346895dcf44f0/html5/thumbnails/43.jpg)
Raykar et al. insight
• Cost vector is constructed by estimating worker confusion matrices.
• Subsequently, classifier is trained; it will sometimes disagree with workers.
• Would be nice to use that disagreement to inform the worker confusion matrices.
• Circular dependency suggests joint estimation.
![Page 44: Generative Models for Crowdsourced Data](https://reader035.fdocuments.net/reader035/viewer/2022062410/5681602f550346895dcf44f0/html5/thumbnails/44.jpg)
Generative Model
![Page 45: Generative Models for Crowdsourced Data](https://reader035.fdocuments.net/reader035/viewer/2022062410/5681602f550346895dcf44f0/html5/thumbnails/45.jpg)
Generative Model
![Page 46: Generative Models for Crowdsourced Data](https://reader035.fdocuments.net/reader035/viewer/2022062410/5681602f550346895dcf44f0/html5/thumbnails/46.jpg)
Online Joint Estimation
![Page 47: Generative Models for Crowdsourced Data](https://reader035.fdocuments.net/reader035/viewer/2022062410/5681602f550346895dcf44f0/html5/thumbnails/47.jpg)
Online Joint Estimation
• Initially the classifier will output an uninformative prior and therefore will be trained to follow consensus of workers.
• Eventually workers who disagree with the classifier will have their confusion probabilities increase.
• Workers consistently in the minority can contribute strongly to the posterior if they tend to agree with the classifier.
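A sketch of how a classifier's output can act as the prior when forming the posterior over ground truth, assuming each worker's likelihood P(l_w | z) has already been computed. The function name and all numbers are illustrative.

```python
def posterior_with_classifier_prior(prior, worker_likelihoods):
    """q(z) proportional to prior[z] * prod_w P(l_w | z).

    prior: classifier output p(z | x), one entry per label.
    worker_likelihoods: one list per worker, entry z is P(l_w | z).
    """
    joint = list(prior)
    for like in worker_likelihoods:
        joint = [p * l_z for p, l_z in zip(joint, like)]
    total = sum(joint)
    return [p / total for p in joint]

# Two workers' labels favor z = 0, one worker's label favors z = 1.
likes = [[0.6, 0.4], [0.6, 0.4], [0.4, 0.6]]

print(posterior_with_classifier_prior([0.5, 0.5], likes))  # uninformative prior
print(posterior_with_classifier_prior([0.2, 0.8], likes))  # confident classifier
```

With the uninformative prior the posterior follows the two-worker consensus. With the confident prior it sides with the minority worker who agrees with the classifier.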
![Page 48: Generative Models for Crowdsourced Data](https://reader035.fdocuments.net/reader035/viewer/2022062410/5681602f550346895dcf44f0/html5/thumbnails/48.jpg)
Additional Resources
• Software
  – http://code.google.com/p/nincompoop
• Blog
  – http://machinedlearnings.com/