Using CrowdSourcing for Data Analytics - Drexel CCIcci.drexel.edu/bigdata/bigdata2013/Using...

29
1 1 Using CrowdSourcing for Data Analytics Hector Garcia-Molina (work with Steven Whang, Peter Lofgren, Aditya Parameswaran and others) Stanford University Big Data Analytics CrowdSourcing

Transcript of Using CrowdSourcing for Data Analytics - Drexel CCIcci.drexel.edu/bigdata/bigdata2013/Using...

Page 1: Using CrowdSourcing for Data Analytics - Drexel CCIcci.drexel.edu/bigdata/bigdata2013/Using CrowdSourcing for Data Analytics.pdf · Using CrowdSourcing for Data Analytics Hector Garcia-Molina

1

1

Using CrowdSourcingfor Data Analytics

Hector Garcia-Molina(work with Steven Whang, Peter Lofgren, Aditya Parameswaran and others)

Stanford University

• Big Data Analytics

• CrowdSourcing

Page 2: Using CrowdSourcing for Data Analytics - Drexel CCIcci.drexel.edu/bigdata/bigdata2013/Using CrowdSourcing for Data Analytics.pdf · Using CrowdSourcing for Data Analytics Hector Garcia-Molina

2

CrowdSourcing

3

Real World Examples

4

Image MatchingTranslation

Image MatchingTranslation

Categorizing ImagesCategorizing Images

Search RelevanceSearch Relevance

Data GatheringData Gathering

Page 3: Using CrowdSourcing for Data Analytics - Drexel CCIcci.drexel.edu/bigdata/bigdata2013/Using CrowdSourcing for Data Analytics.pdf · Using CrowdSourcing for Data Analytics Hector Garcia-Molina

3

Many Crowdsourcing Marketplaces!

Many Research Projects!

6

Page 4: Using CrowdSourcing for Data Analytics - Drexel CCIcci.drexel.edu/bigdata/bigdata2013/Using CrowdSourcing for Data Analytics.pdf · Using CrowdSourcing for Data Analytics Hector Garcia-Molina

4

7

analytics

data

results

humans

Example tasks:• get missing data• verify results• analyze data

8

analytics

data

results

humans

Example tasks:• get missing data• verify results• analyze data

Key Point:• use humans judiciously

Page 5: Using CrowdSourcing for Data Analytics - Drexel CCIcci.drexel.edu/bigdata/bigdata2013/Using CrowdSourcing for Data Analytics.pdf · Using CrowdSourcing for Data Analytics Hector Garcia-Molina

5

Today will illustrate with

• Entity Resolution

• (may cover another topic briefly)

9

10

Traditional Entity Resolution

cleansing

analysis

System 1 System n...

what matches what??

Page 6: Using CrowdSourcing for Data Analytics - Drexel CCIcci.drexel.edu/bigdata/bigdata2013/Using CrowdSourcing for Data Analytics.pdf · Using CrowdSourcing for Data Analytics Hector Garcia-Molina

6

Why is ER Challenging?

• Huge data sets

• No unique identifiers

• Missing data

• Lots of uncertainty

• Many ways to skin the cat

11

Simple ER Example

12

Page 7: Using CrowdSourcing for Data Analytics - Drexel CCIcci.drexel.edu/bigdata/bigdata2013/Using CrowdSourcing for Data Analytics.pdf · Using CrowdSourcing for Data Analytics Hector Garcia-Molina

7

Simple ER Example

13

a

c

b

d

sim=0.9

sim=0.8

Simple ER Example

14

a

c

b

d

sim=0.9

sim=0.8

Page 8: Using CrowdSourcing for Data Analytics - Drexel CCIcci.drexel.edu/bigdata/bigdata2013/Using CrowdSourcing for Data Analytics.pdf · Using CrowdSourcing for Data Analytics Hector Garcia-Molina

8

15

ER: Exact vs Approximate

products

camerasresolvedcameras

CDs

books

...

resolvedCDs

resolvedbooks

...

ER

ER

ER

Simple ER Algorithm

• Compute pairwise similarities

• Apply threshold

• Perform transitive closure

16

Page 9: Using CrowdSourcing for Data Analytics - Drexel CCIcci.drexel.edu/bigdata/bigdata2013/Using CrowdSourcing for Data Analytics.pdf · Using CrowdSourcing for Data Analytics Hector Garcia-Molina

9

Simple ER Algorithm

• Compute pairwise similarities

• Apply threshold

• Perform transitive closure

17

0.45

0.630.5

0.9

0.950.9

0.7

0.9

0.87

0.7 0.4

0.6

0.5

0.63

Simple ER Algorithm

• Compute pairwise similarities

• Apply threshold

• Perform transitive closure

18

0.9

0.950.9

0.7

0.9

0.87

0.7

threshold = 0.7

Page 10: Using CrowdSourcing for Data Analytics - Drexel CCIcci.drexel.edu/bigdata/bigdata2013/Using CrowdSourcing for Data Analytics.pdf · Using CrowdSourcing for Data Analytics Hector Garcia-Molina

10

Simple ER Algorithm

• Compute pairwise similarities

• Apply threshold

• Perform transitive closure

19

0.9

0.950.9

0.7

0.9

0.87

0.7

Crowd ER

20

Page 11: Using CrowdSourcing for Data Analytics - Drexel CCIcci.drexel.edu/bigdata/bigdata2013/Using CrowdSourcing for Data Analytics.pdf · Using CrowdSourcing for Data Analytics Hector Garcia-Molina

11

Same as this?

21

Crowd ER

• First Cut: For every pair of records,ask workers if they match (i.e., get similarity)

22

Page 12: Using CrowdSourcing for Data Analytics - Drexel CCIcci.drexel.edu/bigdata/bigdata2013/Using CrowdSourcing for Data Analytics.pdf · Using CrowdSourcing for Data Analytics Hector Garcia-Molina

12

Crowd ER

• First Cut: For every pair of records,ask workers if they match (i.e., get similarity)

• Too expensive!

23

0.45

0.630.5

0.9

0.950.9

0.7

0.9

0.87

0.7 0.4

0.6

0.5

0.63

Crowd ER

• Second Cut: Compute similarities;workers verify "critical" pairs

24

0.45

0.630.5

0.9

0.950.9

0.7

0.9

0.87

0.7 0.4

0.6

0.5

0.63

critical??

Page 13: Using CrowdSourcing for Data Analytics - Drexel CCIcci.drexel.edu/bigdata/bigdata2013/Using CrowdSourcing for Data Analytics.pdf · Using CrowdSourcing for Data Analytics Hector Garcia-Molina

13

Crowd ER

• Second Cut: Compute similarities;workers verify "critical" pairs

25

0.45

0.630.5

0.9

0.950.9

0.7

0.9

0.87

0.7 0.4

0.6

0.5

0.63

critical??

Crowd ER

• Second Cut: Compute similarities;workers verify "critical" pairs

26

0.45

0.630.5

0.9

0.950.9

0.7

0.9

0.87

0.7 0.4

0.6

0.5

0.63

critical??

Page 14: Using CrowdSourcing for Data Analytics - Drexel CCIcci.drexel.edu/bigdata/bigdata2013/Using CrowdSourcing for Data Analytics.pdf · Using CrowdSourcing for Data Analytics Hector Garcia-Molina

14

27

pairwise analysis

records

clusters

generate questions

Key Point:• use humans judiciously

global analysis

crowd

new evidence

Key Issue: Semantics of Crowd Answer

28

Page 15: Using CrowdSourcing for Data Analytics - Drexel CCIcci.drexel.edu/bigdata/bigdata2013/Using CrowdSourcing for Data Analytics.pdf · Using CrowdSourcing for Data Analytics Hector Garcia-Molina

15

Key Issue: Semantics of Crowd Answer

29

?

EDC

B A

Also issue: Similarities as Probabilities

30

sim(a,b) → prob(a,b)

Page 16: Using CrowdSourcing for Data Analytics - Drexel CCIcci.drexel.edu/bigdata/bigdata2013/Using CrowdSourcing for Data Analytics.pdf · Using CrowdSourcing for Data Analytics Hector Garcia-Molina

16

Strategy

31

current state

a

b c

0.9

0.5

0.2

ER result use any given ER algorithm

Strategy

32

current state

a

b c

0.9

0.5

0.2 Q(a,b)

Q(b,c)

Q(a,c)

consider ALL possible questions(three in this example)

Page 17: Using CrowdSourcing for Data Analytics - Drexel CCIcci.drexel.edu/bigdata/bigdata2013/Using CrowdSourcing for Data Analytics.pdf · Using CrowdSourcing for Data Analytics Hector Garcia-Molina

17

Strategy

33

current state

a

b c

0.9

0.5

0.2 Q(a,b)

Q(b,c)

Q(a,c)

new state

new state

new state

new state

new state

new state

ER result

ER result

ER result

ER result

ER result

ER result

Y

Y

Y

N

N

N

considerpossibleoutcomes

Strategy

34

current state

a

b c

0.9

0.5

0.2

Q(b,c)

new state ER resultY

example

a

b c

0.9

1.0

0.2a

b c

Page 18: Using CrowdSourcing for Data Analytics - Drexel CCIcci.drexel.edu/bigdata/bigdata2013/Using CrowdSourcing for Data Analytics.pdf · Using CrowdSourcing for Data Analytics Hector Garcia-Molina

18

Strategy

35

current state

a

b c

0.9

0.5

0.2 Q(a,b)

Q(b,c)

Q(a,c)

new state

new state

new state

new state

new state

new state

ER result

ER result

ER result

ER result

ER result

ER result

score?

score?

score?

score?

score?

score?

Y

Y

Y

N

N

N

Two Remaining Issues

• How do we score an ER result?

36

• Efficiency?

ER result gold standard

F score

Page 19: Using CrowdSourcing for Data Analytics - Drexel CCIcci.drexel.edu/bigdata/bigdata2013/Using CrowdSourcing for Data Analytics.pdf · Using CrowdSourcing for Data Analytics Hector Garcia-Molina

19

Gold Standard?

37

a

b c

0.9

0.5

0.2a

b c

1.0

0.6

0.2

sim to prob

Gold Standard?

38

a

b c

0.9

0.5

0.2a

b c

1.0

0.6

0.2

a

b c

a

b c

a

b c

a

b c

0.12

0.48

0.08

0.32

sim to prob

possible worlds

Page 20: Using CrowdSourcing for Data Analytics - Drexel CCIcci.drexel.edu/bigdata/bigdata2013/Using CrowdSourcing for Data Analytics.pdf · Using CrowdSourcing for Data Analytics Hector Garcia-Molina

20

Gold Standard?

39

a

b c

0.9

0.5

0.2a

b c

1.0

0.6

0.2

a

b c

a

b c

a

b c

a

b c

0.12

0.48

0.08

0.32

a

b c

0.68

a

b c

0.32sim to prob

possible worlds possible clustering(via ER algorithm)

Strategy

40

current state

a

b c

0.9

0.5

0.2 Q(a,b)

Q(b,c)

Q(a,c)

new state

new state

new state

new state

new state

new state

ER result

ER result

ER result

ER result

ER result

ER result

scorevs GS?Y

Y

Y

N

N

N

scorevs GS?

scorevs GS?

scorevs GS?

scorevs GS?

scorevs GS?

Page 21: Using CrowdSourcing for Data Analytics - Drexel CCIcci.drexel.edu/bigdata/bigdata2013/Using CrowdSourcing for Data Analytics.pdf · Using CrowdSourcing for Data Analytics Hector Garcia-Molina

21

Evaluating Efficiently

• See: Steven E. Whang, Peter Lofgren, and H. Garcia-Molina. Question Selection for Crowd Entity Resolution. To appear in Proc. 39th Int'l Conf. on Very Large Data Bases (PVLDB), Trento, Italy, 2013.

41

Sample Result

42

Page 22: Using CrowdSourcing for Data Analytics - Drexel CCIcci.drexel.edu/bigdata/bigdata2013/Using CrowdSourcing for Data Analytics.pdf · Using CrowdSourcing for Data Analytics Hector Garcia-Molina

22

43

analytics

data

results

humans

Example tasks:• get missing data• verify results• analyze data

Key Point:• use humans judiciously

Summary

44

analytics

bigdata

Now for something completely different!

DBMS

Page 23: Using CrowdSourcing for Data Analytics - Drexel CCIcci.drexel.edu/bigdata/bigdata2013/Using CrowdSourcing for Data Analytics.pdf · Using CrowdSourcing for Data Analytics Hector Garcia-Molina

23

45

analytics

bigdata

Now for something completely different!

DBMS

humans

46

data

DeCo: Declarative CrowdSourcing

DBMS

humans

Enduser

what is best price forNikon DSLR cameras?

Page 24: Using CrowdSourcing for Data Analytics - Drexel CCIcci.drexel.edu/bigdata/bigdata2013/Using CrowdSourcing for Data Analytics.pdf · Using CrowdSourcing for Data Analytics Hector Garcia-Molina

24

47

data

DeCo: Declarative CrowdSourcing

DBMS

humans

Enduser

what is best price forNikon DSLR cameras?

model type brand

D7100 DSLR Nikon

7D DSLR Canon

P5000 comp Nikon

• • • • • • • • •

48

data

DeCo: Declarative CrowdSourcing

DBMS

humans

Enduser

what is best price forNikon DSLR cameras?

model type brand

D7100 DSLR Nikon

7D DSLR Canon

P5000 comp Nikon

• • • • • • • • •

what is best price forNikon D7100 camera?

Crowd

Page 25: Using CrowdSourcing for Data Analytics - Drexel CCIcci.drexel.edu/bigdata/bigdata2013/Using CrowdSourcing for Data Analytics.pdf · Using CrowdSourcing for Data Analytics Hector Garcia-Molina

25

restaurant rating cuisine

Chez Panisse 4.9 French

Chez Panisse 4.9 California

Bytes 3.8 California

• • • • • • • • •

Example with a bit more detail:

Userview

restaurant rating cuisine

Chez Panisse 4.9 French

Chez Panisse 4.9 California

Bytes 3.8 California

• • • • • • • • •

⋈o

50

Userview

restaurant

Chez Panisse

Bytes

• • •

restaurant rating

Chez Panisse 4.8

Chez Panisse 5.0

Chez Panisse 4.9

Bytes 3.6

Bytes 4.0

• • • • • •

restaurant cuisine

Chez Panisse French

Chez Panisse California

Bytes California

Bytes California

• • • • • •

• • • • • •

Anchor

Dependent Dependent

Example with a bit more detail:

Page 26: Using CrowdSourcing for Data Analytics - Drexel CCIcci.drexel.edu/bigdata/bigdata2013/Using CrowdSourcing for Data Analytics.pdf · Using CrowdSourcing for Data Analytics Hector Garcia-Molina

26

restaurant rating cuisine

Chez Panisse 4.9 French

Chez Panisse 4.9 California

Bytes 3.8 California

• • • • • • • • •

⋈o

51

Userview

restaurant

Chez Panisse

Bytes

• • •

restaurant rating

Chez Panisse 4.8

Chez Panisse 5.0

Chez Panisse 4.9

Bytes 3.6

Bytes 4.0

• • • • • •

restaurant cuisine

Chez Panisse French

Chez Panisse California

Bytes California

Bytes California

• • • • • •

• • • • • •

Anchor

Dependent Dependent

fetch rule

fetch ruleBytes

Chez Panisse

fetch rule

Example with a bit more detail:

restaurant rating cuisine

Chez Panisse 4.9 French

Chez Panisse 4.9 California

Bytes 3.8 California

• • • • • • • • •

⋈o

52

Userview

restaurant

Chez Panisse

Bytes

• • •

restaurant rating

Chez Panisse 4.8

Chez Panisse 5.0

Chez Panisse 4.9

Bytes 3.6

Bytes 4.0

• • • • • •

restaurant cuisine

Chez Panisse French

Chez Panisse California

Bytes California

Bytes California

• • • • • •

• • • • • •

Anchor

Dependent Dependent

fetch rule

fetch ruleBytes

Chez Panisse

fetch rulefetch rule

French

Example with a bit more detail:

Page 27: Using CrowdSourcing for Data Analytics - Drexel CCIcci.drexel.edu/bigdata/bigdata2013/Using CrowdSourcing for Data Analytics.pdf · Using CrowdSourcing for Data Analytics Hector Garcia-Molina

27

restaurant rating cuisine

Chez Panisse 4.9 French

Chez Panisse 4.9 California

Bytes 3.8 California

• • • • • • • • •

⋈o

53

Userview

restaurant

Chez Panisse

Bytes

• • •

restaurant rating

Chez Panisse 4.8

Chez Panisse 5.0

Chez Panisse 4.9

Bytes 3.6

Bytes 4.0

• • • • • •

restaurant cuisine

Chez Panisse French

Chez Panisse California

Bytes California

Bytes California

• • • • • •

• • • • • •

Anchor

Dependent Dependent

resolutionrule

resolutionrule

Bytes

Chez Panisse

Example with a bit more detail:

restaurant rating cuisine

Chez Panisse 4.9 French

Chez Panisse 4.9 California

Bytes 3.8 California

• • • • • • • • •

⋈o

54

Userview

restaurant

Chez Panisse

Bytes

• • •

restaurant rating

Chez Panisse 4.8

Chez Panisse 5.0

Chez Panisse 4.9

Bytes 3.6

Bytes 4.0

• • • • • •

restaurant cuisine

Chez Panisse French

Chez Panisse California

Bytes California

Bytes California

• • • • • •

• • • • • •

Anchor

Dependent Dependent

1. Fetch2. Resolve3. Join

Example with a bit more detail:

Page 28: Using CrowdSourcing for Data Analytics - Drexel CCIcci.drexel.edu/bigdata/bigdata2013/Using CrowdSourcing for Data Analytics.pdf · Using CrowdSourcing for Data Analytics Hector Garcia-Molina

28

Fetch[n]Fetch[ln]Fetch

[ln,c]Scan

D1(n,l)ScanA(n)

55

Join

Join

AtLeast [8]SELECT n,l,cFROM countryWHERE l = ‘Spanish’ATLEAST 8

Resolve[m3]Resolve[d.e]

Fetch[nl]

ScanD2(n,c)

Resolve[m3]

Fetch[nl,c]

Many Query Processing Challenges

Filter [l=‘Spanish’]

Fetch[nl,c]

Deco Prototype V1.0

56

Page 29: Using CrowdSourcing for Data Analytics - Drexel CCIcci.drexel.edu/bigdata/bigdata2013/Using CrowdSourcing for Data Analytics.pdf · Using CrowdSourcing for Data Analytics Hector Garcia-Molina

29

Conclusion

• Crowdsourcing is important formanaging data!

• Still many challenges ahead!

57

58