Active Content-Based Crowdsourcing Task Selection


Piyush Bansal, Carsten Eickhoff, Thomas Hofmann
ETH Zurich


Outline
● Past work
○ Exploiting document content for vote aggregation
● Ongoing extensions
○ Crowdsourcing under extreme budget constraints
○ Information-theoretic approaches
○ Experiments and results
○ Conclusion

State of the Art
● Crowdsourced relevance assessment is cheap and effective

● Quality control via redundancy yields strong performance

● Untapped source of information: document content

● Key idea: Locality of relevance

Davtyan et al. 2015: Exploiting Document Content for Efficient Aggregation of Crowdsourcing Votes


Clustering Hypothesis for relevance assessment

Methods
● (Informal) problem statement: Given a set of relevance assessments, how accurately can we infer the relevance of unjudged Web pages?
○ Solution ideas:
■ Assign the same relevance label as the nearest neighbor.
■ Borrow relevance assessments from the <n> nearest neighbors and then assign the majority label.
■ Smooth expected relevance across the similarity space (KDE, GPs).
○ Baseline:
■ Majority voting for label aggregation, and a coin toss for unjudged Web pages.
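The first two solution ideas can be illustrated with a minimal sketch. It assumes binary relevance votes, documents represented as dense vectors, and cosine similarity as the notion of "nearest"; the function names (`majority_vote`, `knn_label`) and the toy vectors are hypothetical and not taken from Davtyan et al.

```python
import numpy as np

def majority_vote(votes):
    """Aggregate binary votes (0/1) into one label; ties are resolved as 1."""
    votes = np.asarray(votes)
    return int(votes.sum() * 2 >= len(votes))

def knn_label(query_vec, judged_vecs, judged_labels, n=3):
    """Borrow labels from the n most similar judged documents
    (cosine similarity) and return their majority label."""
    judged_vecs = np.asarray(judged_vecs, dtype=float)
    q = np.asarray(query_vec, dtype=float)
    sims = judged_vecs @ q / (np.linalg.norm(judged_vecs, axis=1) * np.linalg.norm(q))
    nearest = np.argsort(-sims)[:n]  # indices of the n most similar documents
    return majority_vote(np.asarray(judged_labels)[nearest])

# Toy data: four judged documents with three crowd votes each (assumed values).
judged_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
judged_labels = [majority_vote(v) for v in ([1, 1, 0], [1, 0, 1], [0, 0, 1], [0, 0, 0])]

# Infer the label of an unjudged document from its neighbors.
label_1nn = knn_label([0.8, 0.2], judged_vecs, judged_labels, n=1)
label_3nn = knn_label([0.8, 0.2], judged_vecs, judged_labels, n=3)
```

With n=1 this reduces to the nearest-neighbor idea; larger n gives the majority-of-neighbors variant.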


Davtyan et al.: Results

Motivation for our work
Consider the task of search relevance assessment

● Extremely budget-constrained scenario

● Can only ask humans to rate a few Web pages per query

● In the previous figure: number of votes < 1

A Generic Model of Crowdsourcing



Difallah et al. 2013: Pick-a-crowd

Nushi et al. 2015: Crowd Access Path Optimization



Kazai et al. 2011: Worker types and personality traits in crowdsourcing relevance labels

Davtyan et al. 2015: Exploiting Document Content for Efficient Aggregation of Crowdsourcing Votes



Preliminaries
● RequestVote
○ Sample a random vote from the crowd
● AggregateVotes
○ Gaussian Processes (GP) for inferring relevance labels for unjudged documents
○ Described by a mean function (here: constant) and a covariance function (here: linear covariance)
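As a rough illustration of AggregateVotes, the following sketch runs GP regression with a constant mean and a linear covariance on toy document vectors. The constant mean value (0.5), the noise level, and the function names are assumptions made for the example, not settings reported in the slides.

```python
import numpy as np

def linear_kernel(X1, X2):
    """Linear covariance: k(x, x') = x . x'."""
    return X1 @ X2.T

def gp_posterior(X_train, y_train, X_test, mean_const=0.5, noise=0.1):
    """GP regression posterior mean and variance at X_test.
    mean_const and noise are illustrative assumptions."""
    K = linear_kernel(X_train, X_train) + noise * np.eye(len(X_train))
    K_s = linear_kernel(X_test, X_train)
    K_ss = linear_kernel(X_test, X_test)
    alpha = np.linalg.solve(K, y_train - mean_const)
    mean = mean_const + K_s @ alpha
    cov = K_ss - K_s @ np.linalg.solve(K, K_s.T)
    return mean, np.diag(cov)

# Two judged documents (aggregated relevance 1 and 0) and three query points.
X_train = np.array([[1.0, 0.0], [0.0, 1.0]])
y_train = np.array([1.0, 0.0])
X_test = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])

mean, var = gp_posterior(X_train, y_train, X_test)
```

Thresholding the posterior mean (for example at the constant mean) would turn the smoothed estimate into a relevance label for each unjudged document.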


PickDocument
● What subset of documents to select for labeling?
○ Typical active learning problem
○ Focus on optimal data acquisition
○ Baseline: random sampling
● Select points that the classifier is most uncertain about
○ Uncertainty-based sampling

Solution
● Variance-based sampling:
○ Variance serves as a proxy for “uncertainty”, as entropy is a measure of uncertainty
○ Variance-based sampling as an approximation to maximum entropy sampling
○ In Gaussian processes, the posterior variance does not depend on the actual observed values of the random variables.
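For reference, the standard GP regression identities behind these points can be written out as follows. The notation (A for the set of sampled documents, \sigma_n^2 for the observation noise) is introduced here for illustration and does not appear in the slides.

```latex
\sigma^2(x_* \mid A) = k(x_*, x_*) - k_{*A}\,(K_{AA} + \sigma_n^2 I)^{-1}\,k_{A*},
\qquad
H(x_* \mid A) = \tfrac{1}{2}\log\!\bigl(2\pi e\,\sigma^2(x_* \mid A)\bigr)
```

The observed labels do not appear in the variance, only the locations of the sampled documents do, and the Gaussian entropy grows monotonically with the variance, which is why the point of largest posterior variance is also the point of largest entropy.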


Solution
● Selecting points that maximise variance is NP-complete [2]
● However, this criterion is “submodular”
○ Submodularity (informally): a set function is submodular if the incremental value gained by adding a single element to an input set decreases as the input set grows.
○ By Nemhauser et al. (1978) [3], a greedy algorithm achieves an approximate solution within a factor (1 - 1/e) of OPT for monotone submodular functions.

[2] Krause et al. 2008: Near-optimal sensor placements in Gaussian processes

[3] Nemhauser et al. 1978: An analysis of approximations for maximizing submodular set functions
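Stated formally, using the usual textbook notation (a ground set V, a set function f, and a budget k, none of which are named in the slides), submodularity and the greedy guarantee read:

```latex
f(A \cup \{x\}) - f(A) \;\ge\; f(B \cup \{x\}) - f(B)
\quad \text{for all } A \subseteq B \subseteq V,\; x \in V \setminus B

f(A_{\text{greedy}}) \;\ge\; \Bigl(1 - \tfrac{1}{e}\Bigr)\,\max_{|A| \le k} f(A)
\quad \text{for monotone submodular } f \text{ with } f(\emptyset) = 0
```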

Algorithm: Variance based sampling
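The algorithm on this slide did not survive as text. The following is a minimal sketch of what greedy variance-based selection could look like under the assumptions used earlier (a precomputed covariance matrix K over all documents and a fixed noise level); it is not the authors' listing, and `greedy_variance_sampling` is a hypothetical name.

```python
import numpy as np

def greedy_variance_sampling(K, budget, noise=0.1):
    """Greedily pick `budget` document indices; each step takes the candidate
    with the largest posterior variance given the documents picked so far."""
    selected = []
    for _ in range(budget):
        if selected:
            K_AA = K[np.ix_(selected, selected)] + noise * np.eye(len(selected))
            K_xA = K[:, selected]
            # Posterior variance of every document given the selected set.
            var = np.diag(K) - np.einsum("ij,ji->i", K_xA, np.linalg.solve(K_AA, K_xA.T))
            var[selected] = -np.inf  # never pick the same document twice
        else:
            var = np.diag(K).copy()
        selected.append(int(np.argmax(var)))
    return selected

# Toy documents with a linear covariance (assumed values).
X = np.array([[2.0, 0.0], [1.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
K = X @ X.T
picked = greedy_variance_sampling(K, budget=2)
```

Because the posterior variance ignores the observed labels, the whole selection can be computed before any vote is requested.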


Mutual Information based sampling
● Variance-based sampling is only concerned with reducing uncertainty at sampled points.
● We care about system-wide uncertainty.
● Maximise mutual information between the selected documents and the rest of the space.
● Equivalent to maximally reducing the entropy of the rest of the space (D\A) given the selected documents (A).
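The equivalence follows from the standard definition of mutual information, written here with D for the full document set and A for the selected subset:

```latex
I(A;\, D \setminus A) \;=\; H(D \setminus A) \;-\; H(D \setminus A \mid A)
```

Maximising the left-hand side means choosing A so that observing it removes as much entropy as possible from the unselected documents.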


Algorithm: MI based sampling
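The listing for this slide is also missing from the transcript. As a hedged sketch, the greedy rule described by Krause et al. 2008 picks, at each step, the document y that maximises the ratio of its variance given the selected set to its variance given all remaining unselected documents. The code below follows that rule with a precomputed covariance matrix K and an assumed noise level; function names are hypothetical and the authors' exact procedure may differ.

```python
import numpy as np

def conditional_variance(K, y, S, noise):
    """Posterior variance at index y given noisy observations at indices S."""
    if not S:
        return K[y, y]
    K_SS = K[np.ix_(S, S)] + noise * np.eye(len(S))
    k_yS = K[y, S]
    return K[y, y] - k_yS @ np.linalg.solve(K_SS, k_yS)

def greedy_mi_sampling(K, budget, noise=0.1):
    """Greedy mutual-information selection over document indices."""
    n = K.shape[0]
    selected = []
    for _ in range(budget):
        best, best_score = None, -np.inf
        for y in range(n):
            if y in selected:
                continue
            # Unselected documents other than the candidate itself.
            rest = [i for i in range(n) if i != y and i not in selected]
            score = (conditional_variance(K, y, selected, noise)
                     / conditional_variance(K, y, rest, noise))
            if score > best_score:
                best, best_score = y, score
        selected.append(best)
    return selected

# Toy documents with a linear covariance (assumed values).
X = np.array([[2.0, 0.0], [1.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
K = X @ X.T
picked_mi = greedy_mi_sampling(K, budget=2)
```

A candidate scores highly when it is still uncertain given what has been selected (large numerator) but well explained by the other unselected documents (small denominator), which favours documents that are informative about the rest of the space.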


Experiments

● TREC Crowdsourcing Track 2011 data

● 30 (28) topics

● ~100 documents (ClueWeb’09) to be judged per topic

● ~15 historic votes per query-document pair

● Documents projected into a 100-dimensional doc2vec space

Results on the TREC 2011 Crowdsourcing Track dataset

Qualitative Analysis


Conclusions
● Active learning for crowdsourcing vote sampling
● Two information-theoretic criteria
○ Variance
○ Mutual information
● Saves up to 25% budget at constant quality
● Can be computed efficiently (greedy)
● Does not depend on sampled observations

● In the future: application to other modalities (images, videos)


Thank you!

Questions?

