Get Another Label? Improving Data Quality and Machine Learning Using Multiple, Noisy Labelers



Panos Ipeirotis, New York University

Joint work with Jing Wang, Foster Provost, and Victor Sheng


Outsourcing machine learning preprocessing

Traditionally, modeling teams have invested substantial internal resources in data formulation, information extraction, cleaning, and other preprocessing

Now, we can outsource preprocessing tasks, such as labeling, feature extraction, verifying information extraction, etc.
– using Mechanical Turk, oDesk, etc.
– quality may be lower than expert labeling (much?)
– but low costs can allow massive scale

Example: Build an “Adult Web Site” Classifier

Need a large number of hand-labeled sites. Get people to look at sites and classify them as:
G (general audience), PG (parental guidance), R (restricted), X (porn)


Cost/Speed Statistics
– Undergrad intern: 200 websites/hr, cost: $15/hr
– MTurk: 2500 websites/hr, cost: $12/hr


Noisy labels can be problematic

Many tasks rely on high-quality labels for objects:
– webpage classification for safe advertising
– learning predictive models
– searching for relevant information
– finding duplicate database records
– image recognition/labeling
– song categorization
– sentiment analysis

Noisy labels can lead to degraded task performance


Quality and Classification Performance

Labeling quality increases → classification quality increases

[Chart: AUC vs. number of examples ("Mushroom" data set), learning curves for P = 50%, 60%, 80%, 100%]

P: single-labeler quality (probability of correctly assigning a binary label)


Solutions

Get better labelers
– Often beyond our control or too expensive

Get more labelers per item
– Our focus


Majority Voting and Label Quality

Ask multiple labelers, keep the majority label as the "true" label
Quality is the probability of the integrated label being correct

[Chart: Integrated quality vs. number of labelers (1, 3, 5, …, 13), curves for P = 0.4 to P = 1.0]

P is the probability of an individual labeler being correct
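The integrated quality above is just a binomial tail. A minimal sketch (my own illustration, not code from the deck), assuming independent labelers and an odd number of votes so ties cannot occur:

```python
from math import comb

def majority_vote_quality(p: float, n: int) -> float:
    """Probability that a majority of n independent labelers,
    each correct with probability p, yields the correct label."""
    assert n % 2 == 1, "use an odd number of labelers to avoid ties"
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

# Quality improves with n when p > 0.5, stays flat at p = 0.5,
# and degrades when p < 0.5: the shape of the curves above.
for p in (0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0):
    print(p, [round(majority_vote_quality(p, n), 3) for n in (1, 3, 5, 7, 9, 11, 13)])
```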


Tradeoffs for Modeling

Get more examples → improve classification
Get more labels → improve label quality → improve classification

[Chart: Accuracy vs. number of examples (Mushroom), learning curves for P = 0.5, 0.6, 0.8, 1.0]


Basic Labeling Strategies

Single Labeling
– Get as many data points as possible
– One label each

Round-robin Repeated Labeling
– Repeatedly label data points, giving the next label to the one with the fewest labels so far


Round Robin vs. Single: Low Noise

p = 0.8 (labeling quality), #examples = 50

[Chart: learning curves for FRR (50 examples) vs. Single labeling]

With low noise/few examples, more (single-labeled) examples is better


Round Robin vs. Single: High Noise

p = 0.6 (labeling quality), #examples = 100

[Chart: learning curves for FRR (100 examples) vs. SL (single labeling)]

With high noise/many examples, repeated labeling is better


Selective Repeated-Labeling

We have seen so far:
– With enough examples and noisy labels, getting multiple labels is better than single-labeling
– When we consider the extra cost of acquiring the unlabeled part, the benefit is magnified

Can we do better than the basic strategies?

Key observation: we have additional information to guide the selection of data for repeated labeling: the current multiset of labels

Example: {+,-,+,-,-,+} vs. {+,+,+,+,+,+}


Natural Candidate: Entropy

Entropy is a natural measure of label uncertainty:

E({+,+,+,+,+,+}) = 0
E({+,-,+,-,-,+}) = 1

Strategy: Get more labels for high-entropy label multisets

$$E(S) = -\frac{|S^+|}{|S|}\log_2\frac{|S^+|}{|S|} - \frac{|S^-|}{|S|}\log_2\frac{|S^-|}{|S|}$$

where $S^+$ is the multiset of positive labels in $S$ and $S^-$ the multiset of negative labels.
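A quick way to reproduce the examples above (a hypothetical helper; the function name is mine):

```python
from math import log2

def label_entropy(pos: int, neg: int) -> float:
    """Entropy of a binary label multiset with pos positive and neg negative labels."""
    total = pos + neg
    e = 0.0
    for count in (pos, neg):
        if count:  # treat 0 * log2(0) as 0
            frac = count / total
            e -= frac * log2(frac)
    return e

print(label_entropy(6, 0))      # {+,+,+,+,+,+} -> 0.0
print(label_entropy(3, 3))      # {+,-,+,-,-,+} -> 1.0
print(label_entropy(3, 2))      # {3+, 2-}      -> ~0.97
print(label_entropy(600, 400))  # scale invariant: also ~0.97
```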


What Not to Do: Use Entropy

[Chart: Labeling quality vs. number of labels (mushroom, p=0.6), GRR vs. ENTROPY]

Entropy: Improves at first, hurts in long run

Why not Entropy?

In the presence of noise, entropy will be high even with many labels

Entropy is scale invariant: {3+, 2-} has the same entropy as {600+, 400-}, i.e., ≈0.97


Estimating Label Uncertainty (LU)

Observe +'s and –'s and compute Pr{+|obs} and Pr{-|obs}
Use a Bayesian estimate of the Bernoulli parameter: Beta prior + update
Uncertainty = tail of the Beta distribution on the far side of the 0.5 decision boundary

[Figure: Beta probability density function; S_LU is the tail mass beyond the 0.5 decision boundary]

Label Uncertainty

p = 0.7, 5 labels (3+, 2-): Entropy ≈ 0.97, CDFb = 0.34

p = 0.7, 10 labels (7+, 3-): Entropy ≈ 0.88, CDFb = 0.11

p = 0.7, 20 labels (14+, 6-): Entropy ≈ 0.88, CDFb = 0.04

As the label multiset grows at the same ratio, entropy barely moves, but the Beta tail (CDFb) keeps shrinking: the label becomes more certain.
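The CDFb values above can be reproduced as the posterior Beta tail at the 0.5 decision boundary. A minimal sketch, assuming a uniform Beta(1, 1) prior (the deck does not state the prior, but this choice matches the slide numbers):

```python
from scipy.stats import beta

def label_uncertainty(pos: int, neg: int) -> float:
    """S_LU: posterior probability that the true Bernoulli parameter
    is below the 0.5 decision boundary, given pos/neg observed labels.
    Assumes a uniform Beta(1, 1) prior (my assumption)."""
    return beta.cdf(0.5, pos + 1, neg + 1)

print(round(label_uncertainty(3, 2), 2))    # 0.34, as on the slide
print(round(label_uncertainty(7, 3), 2))    # 0.11
print(round(label_uncertainty(14, 6), 2))   # 0.04
```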


Label Uncertainty vs. Round Robin


[Chart: Labeling quality vs. number of labels (mushroom, p=0.6), GRR vs. LU]

Similar results across a dozen data sets

Remember: GRR is already better than single labeling

Why is LU a hack?


More sophisticated label uncertainty

Observe +'s and –'s and compute Pr{+|obs} and Pr{-|obs}
Estimated Beta distribution = quality of the labelers for the example
Estimate the prior Pr(+) by iterating

More sophisticated LU improves labeling quality under class imbalance and fixes some pesky LU learning-curve glitches


Both techniques perform essentially optimally with balanced classes


Another strategy: Model Uncertainty (MU)

Learning models of the data provides an alternative source of information about label certainty
– (a random forest for the results to come)

Model uncertainty: get more labels for instances that cause model uncertainty

Intuition?
– for modeling: why improve training-data quality if the model is already certain there?
– for data quality: low-certainty "regions" may be due to incorrect labeling of the corresponding instances (see the sketch below)
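A minimal sketch of the idea (my own illustration; the deck only says a random forest is used, so the exact scoring rule here, distance of the predicted probability from the 0.5 boundary, is an assumption):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def model_uncertainty_scores(X, y):
    """Score labeled examples by how uncertain a model is about them.
    Sketch: fit a random forest (as in the deck) and score each example
    by how close its predicted positive-class probability falls to the
    0.5 decision boundary (scoring rule is an assumption)."""
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X, y)
    proba = model.predict_proba(X)[:, 1]
    return 0.5 - np.abs(proba - 0.5)  # high score = model is uncertain

# Candidates for repeated labeling: the k most model-uncertain examples,
# e.g. np.argsort(-model_uncertainty_scores(X, y))[:k]
```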

[Diagram: models ↔ examples; self-healing process [Brodley et al., 99]]


Yet another strategy: Label & Model Uncertainty (LMU)

Label and model uncertainty (LMU): avoid examples where either strategy is certain

$$S_{LMU} = \sqrt{S_{LU} \cdot S_{MU}}$$

Label Quality


[Chart: Labeling quality vs. number of labels (waveform, p=0.6); curves: UNF (uniform round robin), LU (label uncertainty), MU (model uncertainty), LMU (label + model uncertainty)]

Model Uncertainty alone also improves quality


Model Quality

Label & Model Uncertainty

Across 12 domains, LMU is always better than GRR. LMU is statistically significantly better than LU and MU.

[Chart: Accuracy vs. number of labels (sick, p=0.6); curves: GRR, MU, LU, LMU]

Adult content classification


Example: Build an “Adult Web Site” Classifier

Need a large number of hand-labeled sites. Get people to look at sites and classify them as:
G (general audience), PG (parental guidance), R (restricted), X (porn)

Cost/Speed Statistics
– Undergrad intern: 200 websites/hr, cost: $15/hr
– MTurk: 2500 websites/hr, cost: $12/hr

Bad news: Spammers!

Worker ATAMRO447HWJQ labeled X (porn) sites as G (general audience)

Redundant votes, infer quality

Look at our spammer friend ATAMRO447HWJQ together with 9 other workers

Using redundancy, we can compute error rates for each worker

1. Initialize the "correct" label for each object (e.g., use majority vote)
2. Estimate error rates for workers (using the "correct" labels)
3. Estimate the "correct" labels (using error rates; weight worker votes according to quality)
4. Go to Step 2 and iterate until convergence

Repeated Labeling, EM, and Confusion Matrices (Dawid & Skene, 1979)

Iterative process to estimate worker error rates
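A compact sketch of the four-step loop above (my own illustration, not the reference implementation; production code such as the get-another-label tool adds smoothing, priors, and convergence checks):

```python
import numpy as np

def dawid_skene(votes, n_classes, n_iter=50):
    """Sketch of the Dawid & Skene (1979) EM iteration outlined above.
    votes[i] maps worker id -> label (0..n_classes-1) for object i."""
    workers = sorted({w for v in votes for w in v})
    widx = {w: k for k, w in enumerate(workers)}
    n_obj, n_wrk = len(votes), len(workers)

    # Step 1: initialize "correct" labels with a (soft) majority vote
    T = np.zeros((n_obj, n_classes))
    for i, v in enumerate(votes):
        for w, label in v.items():
            T[i, label] += 1.0
    T /= T.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # Step 2: estimate worker error rates (confusion matrices)
        conf = np.full((n_wrk, n_classes, n_classes), 1e-9)
        for i, v in enumerate(votes):
            for w, label in v.items():
                conf[widx[w], :, label] += T[i]
        conf /= conf.sum(axis=2, keepdims=True)
        prior = T.mean(axis=0)

        # Step 3: re-estimate labels, weighting votes by worker quality
        logT = np.tile(np.log(prior + 1e-9), (n_obj, 1))
        for i, v in enumerate(votes):
            for w, label in v.items():
                logT[i] += np.log(conf[widx[w], :, label])
        T = np.exp(logT - logT.max(axis=1, keepdims=True))
        T /= T.sum(axis=1, keepdims=True)
        # Step 4: repeat (fixed iteration count stands in for convergence)
    return T, conf

# votes = [{"w1": 0, "w2": 0, "spammer": 0}, {"w1": 1, "w2": 1, "spammer": 0}, ...]
```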

Our friend ATAMRO447HWJQ marked almost all sites as G.

Seems like a spammer…

Error rates for ATAMRO447HWJQ
P[G → G] = 99.947%   P[G → X] = 0.053%
P[X → G] = 99.153%   P[X → X] = 0.847%

Challenge: From Confusion Matrices to Quality Scores

The algorithm generates "confusion matrices" for workers

How do we check whether a worker is a spammer using the confusion matrix? (hint: the error rate is not enough)

Error rates for ATAMRO447HWJQ
P[X → X] = 0.847%    P[X → G] = 99.153%
P[G → X] = 0.053%    P[G → G] = 99.947%

Challenge 1: Spammers are lazy and smart!

Confusion matrix for a spammer
P[X → X] = 0%    P[X → G] = 100%
P[G → X] = 0%    P[G → G] = 100%

Confusion matrix for a good worker
P[X → X] = 80%   P[X → G] = 20%
P[G → X] = 20%   P[G → G] = 80%

Spammers figure out how to fly under the radar…

In reality, we have 85% G sites and 15% X sites

Error rate of spammer = 0% * 85% + 100% * 15% = 15%
Error rate of good worker = 20% * 85% + 20% * 15% = 20%

False negatives: Spam workers pass as legitimate

Challenge 2: Humans are biased!

Error rates for the CEO of AdSafe
P[G → G] = 20.0%   P[G → P] = 80.0%   P[G → R] = 0.0%     P[G → X] = 0.0%
P[P → G] = 0.0%    P[P → P] = 0.0%    P[P → R] = 100.0%   P[P → X] = 0.0%
P[R → G] = 0.0%    P[R → P] = 0.0%    P[R → R] = 100.0%   P[R → X] = 0.0%
P[X → G] = 0.0%    P[X → P] = 0.0%    P[X → R] = 0.0%     P[X → X] = 100.0%

In reality, we have 85% G sites, 5% P sites, 5% R sites, 5% X sites

Errors of spammer (all in G) = 0% * 85% + 100% * 15% = 15%

Error rate of biased worker = 80% * 85% + 100% * 5% = 73%

False positives: Legitimate workers appear to be spammers

Solution: Reverse errors first, compute error rate afterwards

When the biased worker says G, it is 100% G
When the biased worker says P, it is 100% G
When the biased worker says R, it is 50% P, 50% R
When the biased worker says X, it is 100% X

Small ambiguity for “R-rated” votes but other than that, fine!

Error rates for the biased worker
P[G → G] = 20.0%   P[G → P] = 80.0%   P[G → R] = 0.0%     P[G → X] = 0.0%
P[P → G] = 0.0%    P[P → P] = 0.0%    P[P → R] = 100.0%   P[P → X] = 0.0%
P[R → G] = 0.0%    P[R → P] = 0.0%    P[R → R] = 100.0%   P[R → X] = 0.0%
P[X → G] = 0.0%    P[X → P] = 0.0%    P[X → R] = 0.0%     P[X → X] = 100.0%

When the spammer says G, it is 25% G, 25% P, 25% R, 25% X
When the spammer says P, it is 25% G, 25% P, 25% R, 25% X
When the spammer says R, it is 25% G, 25% P, 25% R, 25% X
When the spammer says X, it is 25% G, 25% P, 25% R, 25% X
[note: assume equal priors]

The results are highly ambiguous. No information provided!

Error rates for spammer ATAMRO447HWJQ
P[G → G] = 100.0%   P[G → P] = 0.0%   P[G → R] = 0.0%   P[G → X] = 0.0%
P[P → G] = 100.0%   P[P → P] = 0.0%   P[P → R] = 0.0%   P[P → X] = 0.0%
P[R → G] = 100.0%   P[R → P] = 0.0%   P[R → R] = 0.0%   P[R → X] = 0.0%
P[X → G] = 100.0%   P[X → P] = 0.0%   P[X → R] = 0.0%   P[X → X] = 0.0%

Solution: Reverse errors first, compute error rate afterwards
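A sketch of the error-reversal step (names and structure are mine; it reproduces the posterior votes listed above for the biased worker):

```python
import numpy as np

def reverse_errors(confusion, priors):
    """Bayes-invert a worker's confusion matrix P[true -> observed]
    into P[true | observed], using the class priors."""
    joint = priors[:, None] * confusion   # P(true) * P(observed | true)
    return joint / joint.sum(axis=0)      # normalize each observed column

# The biased worker from the slides (classes G, P, R, X; priors 85/5/5/5):
priors = np.array([0.85, 0.05, 0.05, 0.05])
biased = np.array([[0.20, 0.80, 0.00, 0.00],
                   [0.00, 0.00, 1.00, 0.00],
                   [0.00, 0.00, 1.00, 0.00],
                   [0.00, 0.00, 0.00, 1.00]])
print(reverse_errors(biased, priors).round(2))
# Column R comes out 50% P / 50% R; columns G and P are 100% G, as above.
```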

Cost of a "soft" label ⟨p1, p2, …, pL⟩, with cost c_ij for labeling as j an object of class i:

$$\mathrm{Cost}(\langle p_1, \dots, p_L \rangle) = \sum_{i=1}^{L} \sum_{j=1}^{L} p_i \, p_j \, c_{ij}$$

"Soft" labels with probability mass in a single class are good
"Soft" labels with probability mass spread across classes are bad

Cost(spammer) = the same cost with each p_i replaced by the class prior Prior_i

Computing Quality Scores

Quality = 1 – Cost(worker) / Cost(spammer)
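A minimal sketch of the quality-score computation, following the cost definition above (function names are mine):

```python
import numpy as np

def soft_label_cost(p, costs):
    """Expected cost of a soft label p: sum_ij p_i * p_j * c_ij."""
    p = np.asarray(p, dtype=float)
    return float(p @ costs @ p)

def quality_score(worker_soft_labels, priors, costs):
    """Quality = 1 - Cost(worker) / Cost(spammer); a spammer's soft
    label is just the class priors, which sets the cost baseline."""
    worker_cost = np.mean([soft_label_cost(p, costs) for p in worker_soft_labels])
    return 1 - worker_cost / soft_label_cost(priors, costs)

# Two classes with 0/1 misclassification costs and equal priors:
costs = np.array([[0.0, 1.0], [1.0, 0.0]])
print(quality_score([[0.95, 0.05]], np.array([0.5, 0.5]), costs))  # ~0.81
print(quality_score([[0.50, 0.50]], np.array([0.5, 0.5]), costs))  # 0.0
```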

Experimental Results

500 web pages in G, P, R, X
100 workers per page (just to evaluate the effect of more labels)
339 workers
Lots of noise! 95% accuracy with majority vote (only!)
Dropped all workers with quality score 50% or below
– Using error rate: 1% of labels dropped
– Using quality scores: 30% of labels dropped, accuracy 99.8%

Note the massive amount of redundancy and the very conservative spam rejection

Too much theory?

Open source implementation available at: http://code.google.com/p/get-another-label/

Input:
– Labels from Mechanical Turk
– Cost of incorrect labelings (e.g., X→G costlier than G→X)

Output:
– Corrected labels
– Worker error rates
– Ranking of workers according to their quality

Beta version, more improvements to come!
Suggestions and collaborations welcome!


Workers reacting to quality scores

Score-based feedback leads to strange interactions:

The angry, has-been-burnt-too-many-times worker:
"F*** YOU! I am doing everything correctly and you know it! Stop trying to reject me with your stupid 'scores'!"

The overachiever:
"What am I doing wrong?? My score is 92% and I want to have 100%"


An unexpected connection at the NAS "Frontiers of Science" conf.

Your spammers behave like my mice!


An unexpected connection at the NAS "Frontiers of Science" conf.

Your spammers behave like my mice!

Eh?


An unexpected connection at the NAS "Frontiers of Science" conf.

Your spammers want to engage their brain only for motor skills and not for cognitive skills

Yeah, makes sense…


An unexpected connection at the NAS "Frontiers of Science" conf.

And here is how I train my mice to behave…

I should try this the moment that I get back to my room


Implicit Feedback with Frustration

Punish bad answers with frustration (e.g., delays between tasks)
– "Loading image, please wait…"
– "Image did not load, press here to reload"
– "Network error. Return the HIT and accept again"

Reward good answers by giving the next task immediately, with no problems

→ Make this probabilistic to keep the feedback implicit


First result

Spammer workers quickly abandon
Good workers keep labeling

Bad: spammer bots are unaffected. How do you frustrate a bot?
– Give it a CAPTCHA


Second result (more impressive)

Remember, Don was training the mice…

15% of the spammers start submitting good work! Putting in cognitive effort is more beneficial…

Key trick: Learn to test workers on the fly
– Model performance using the Dirichlet compound multinomial distribution (details offline)

Thanks!

Q & A?

Why does Model Uncertainty (MU) work?

MU score distributions for correctly labeled (blue) and incorrectly labeled (purple) cases




Why does Model Uncertainty (MU) work?

[Diagram: models ↔ examples; self-healing process]

Self-healing MU vs. "active learning" MU

What if different labelers have different qualities?

(Sometimes) the quality of multiple noisy labelers is better than the quality of the best labeler in the set
– here, 3 labelers: p-d, p, p+d


Soft Labeling vs. Majority Voting

MV: majority voting
ME: soft labeling

[Chart: Accuracy vs. number of examples (bmg, p=0.6), MV vs. ME]


Related topic

Estimating (and using) labeler quality
– for multi-labeled data: Dawid & Skene 1979; Raykar et al. JMLR 2010; Donmez et al. KDD09
– for single-labeled data with variable-noise labelers: Donmez & Carbonell 2008; Dekel & Shamir 2009a,b
– to eliminate/down-weight poor labelers: Dekel & Shamir; Donmez et al.; Raykar et al. (implicitly)
– and to correct labeler biases: Ipeirotis et al. HCOMP-10
– example-conditional labeler performance: Yan et al. 2010a,b

Using a learned model to find bad labelers/labels: Brodley & Friedl 1999; Dekel & Shamir; us (I'll discuss)