Social Media, Data Integration, and Human Computation

AnHai Doan, University of Wisconsin, @WalmartLabs


Transcript of Social Media, Data Integration, and Human Computation

Page 1: Social Media, Data Integration, and  Human Computation

AnHai Doan, University of Wisconsin, @WalmartLabs

Social Media, Data Integration, and Human Computation

@WalmartLabs

Page 2: Social Media, Data Integration, and  Human Computation

A Journey Starting in 2001 ...

Worked in data integration
– combine multiple data sources into one
– e.g., aggregation/comparison shopping sites, Google Scholar
– use schema matching, information extraction, entity disambiguation

Ph.D. thesis focused on schema matching

Example query: “Find houses with 2 bedrooms under 400K”, asked over realestate.com, fsbo.com, homes.com

Page 3: Social Media, Data Integration, and  Human Computation

Schema Matching

address            price
31 Bagley Ct ...   250K
12 Hope St ...     375K

location           sold-at
14 Main St ...     249,000
25 West St ...     324,000

Matches: address = location, price = sold-at

Developed automatic solution using machine learning
Realized that automatic solutions are not good enough
– only 65-85% accuracy
– need human intervention

Proposed a crowdsourcing approach
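The matching task on this slide can be illustrated with a toy instance-based matcher. This is only a sketch under stated assumptions: it is not the machine-learning solution from the thesis, and the helper names (normalize_price, column_signature, match_schemas) and the two hand-picked features are hypothetical. It pairs columns whose values "look alike" (price-like vs. street-like).

```python
import re

def normalize_price(value):
    """Convert strings such as '250K' or '249,000' to a number; return None otherwise."""
    text = value.strip().replace(",", "")
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([Kk]?)", text)
    if match is None:
        return None
    number = float(match.group(1))
    return number * 1000 if match.group(2) else number

def column_signature(values):
    """Summarize a column as (fraction of price-like values, fraction of street-like values)."""
    n = len(values)
    price_like = sum(normalize_price(v) is not None for v in values) / n
    street_like = sum(bool(re.search(r"\b(St|Ct|Ave|Rd)\b", v)) for v in values) / n
    return (price_like, street_like)

def match_schemas(source_a, source_b):
    """Pair each column of source_a with the column of source_b whose signature is closest."""
    matches = {}
    for name_a, values_a in source_a.items():
        sig_a = column_signature(values_a)

        def distance(name_b):
            sig_b = column_signature(source_b[name_b])
            return sum(abs(x - y) for x, y in zip(sig_a, sig_b))

        matches[name_a] = min(source_b, key=distance)
    return matches

# Sample values taken from the slide (the "..." in the addresses is dropped).
source_a = {"address": ["31 Bagley Ct", "12 Hope St"], "price": ["250K", "375K"]}
source_b = {"location": ["14 Main St", "25 West St"], "sold-at": ["249,000", "324,000"]}
print(match_schemas(source_a, source_b))  # {'address': 'location', 'price': 'sold-at'}
```

A heuristic like this breaks down quickly on real sources, which is consistent with the slide's point that automatic matching alone needs human help.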

Page 4: Social Media, Data Integration, and  Human Computation

Crowdsourced Schema Matching

[Figure: example tables (address/price and location/sold-at), the candidate match address = location, and user votes: Yes, Yes, No]

Can crowdsource other DI tasks too
Difficult to publish
– Building data integration systems via mass collaboration, WebDB-03
– Subsequent reviews: great work, I don’t believe it, neutral

Build a large-scale DI system on the Web
Show that crowdsourcing is practical
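One simple way to combine user answers such as "Yes, Yes, No" is a thresholded majority vote. The sketch below is an assumption for illustration, not the aggregation scheme of the cited work; the function name crowd_decision and the parameters min_votes and threshold are hypothetical.

```python
from collections import Counter

def crowd_decision(votes, min_votes=3, threshold=0.5):
    """Decide on a candidate match from user votes.

    Returns True (accept), False (reject), or None when fewer than
    min_votes answers have been collected so far.
    """
    if len(votes) < min_votes:
        return None
    counts = Counter(v.strip().lower() for v in votes)
    return counts["yes"] / len(votes) > threshold

print(crowd_decision(["Yes", "Yes", "No"]))  # True: 2 of 3 users agree
print(crowd_decision(["Yes"]))               # None: not enough votes yet
```

A production system would also have to weigh user reliability, which this sketch ignores.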

Page 5: Social Media, Data Integration, and  Human Computation

Started DBLife Project in 2005

[Figure: DBLife architecture]
– Data sources: researcher homepages, conference pages, group pages, DBworld mailing list, DBLP
– Web pages are turned into entities and relationships, e.g., HV Jagadish give-talk SIGMOD-07
– Superpages (e.g., for HV Jagadish)
– Services: keyword search, SQL querying, question answering, browse, mining, alert/monitor, news summary
– Infrastructure: file system, RDBMS, Hadoop

Page 6: Social Media, Data Integration, and  Human Computation


Example Superpage

Page 7: Social Media, Data Integration, and  Human Computation


Example Crowdsourcing

Picture is removed if enough users vote “no”.

Page 8: Social Media, Data Integration, and  Human Computation

Project Status in 2009

Data integration
– overall methodology: VLDB-07a, VLDB-07b, CIDR-09
– DI operators: VLDB-07c
– optimization: VLDB-07c, SIGMOD-08, ICDE-08a, SIGMOD-09a
– provenance/others: ICDE-07a, ICDE-07b, VLDB-08a

Crowdsourcing / human computation
– schema matching: ICDE-08b
– best-effort information extraction: SIGMOD-08
– human feedback into the DI pipeline: SIGMOD-09b
– how lay users can query the database: SIGMOD-09c

System development
– hard to build/maintain systems in academia

Wanted to know what’s going on in industry
Wanted to take DBLife to the next level
Joined Kosmix in 2010 to do “DBLife on steroids”

Page 9: Social Media, Data Integration, and  Human Computation

Kosmix

Founded by Anand Rajaraman & Venky Harinarayan
– formerly of Junglee, sold to Amazon for 250M
55M in funding, 30+ engineers
Integrated Web data sources into a giant taxonomy

[Figure: Kosmix pipeline]
– Sources: IMDB, Musicbrainz, Tripadvisor, Wikipedia, …
– Processing: information extraction, entity disambiguation, entity merging, ...
– Taxonomy: all > people > actors > Angelina Jolie, Mel Gibson; all > places
– Output: topic pages
– Infrastructure: file system, RDBMS, Hadoop

Page 10: Social Media, Data Integration, and  Human Computation

Raised many interesting challenges
– e.g., incremental updates, recycling human edits

Very good in certain topics (e.g., health)
But hard to compete with Google and Wikipedia
Switched to social media in early 2010

Page 11: Social Media, Data Integration, and  Human Computation

Social Media Exploding

Every two days now we create as much information as we did from the dawn of civilization up until 2003.

-- Eric Schmidt

• 100 million tweets per day
• 1 billion Facebook shares per day
• 1.5 million Foursquare checkins per day
• 40,000 Flickr photos per second

Page 12: Social Media, Data Integration, and  Human Computation

Switching Made Much Business Sense

Lot of social media data
Lot of people using it, spending a lot of time on it
– lot of links now come from social media, not search engines
– Google is worried (hence Buzz, Google+, Google++)

New level playing field
Have a secret weapon: the giant taxonomy
Next hot Internet wave
– SoLoMo = social + local + mobile

But can we build interesting applications?
What is social media good for?

Page 13: Social Media, Data Integration, and  Human Computation

From Frivolous to Serious

95% of tweets are still junk
– I feel good today
Help teenagers track Justin Bieber
– the background noise of Twitter
Charlie Sheen, celebrity fighting, Weiner losing his job
Foster customer relationships
– follow your dentist
Spread news
Manage disasters
Promote e-commerce
Help organize events, movements
– revolutions

Page 14: Social Media, Data Integration, and  Human Computation

Lot of Companies / Actions in This Space

Build platforms for social media
– how to tweet more effectively
Understand social media
– social analytics / route relevant information to users
Use social media to make predictions
Use social media to affect real-world changes

Mostly operate at the keyword level
– how many times has the keyword “Obama” been mentioned today?

Kosmix: the leader in performing semantic analysis
– how many times has the entity President Obama been mentioned today?
– “Obama”, “Barack”, “Barry”, “BO”, “the Pres”, “the Messiah”, ...
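The difference between keyword-level and entity-level counting can be shown with a small alias dictionary. This is a hedged sketch, not Kosmix's implementation: the aliases are the ones listed on the slide, while the ALIASES table, the function names, and the sample tweets are hypothetical.

```python
import re

# Alias list from the slide; the dictionary layout is an assumption for illustration.
ALIASES = {
    "President Obama": ["Obama", "Barack", "Barry", "BO", "the Pres", "the Messiah"],
}

def _pattern(phrase):
    """Whole-word, case-insensitive pattern for one keyword or alias."""
    return re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)

def keyword_count(tweets, keyword):
    """Keyword level: count tweets that contain the literal keyword."""
    pattern = _pattern(keyword)
    return sum(1 for t in tweets if pattern.search(t))

def entity_count(tweets, entity):
    """Entity level (naive): count tweets that contain any alias of the entity."""
    patterns = [_pattern(alias) for alias in ALIASES[entity]]
    return sum(1 for t in tweets if any(p.search(t) for p in patterns))

# Made-up sample tweets.
tweets = [
    "Obama speaks at noon",
    "Barack just landed in Ohio",
    "the Pres is on TV again",
    "I feel good today",
]
print(keyword_count(tweets, "Obama"))           # 1
print(entity_count(tweets, "President Obama"))  # 3
```

Pure alias matching over-counts, since "Barry" or "BO" can refer to someone else entirely; deciding which mention means which entity is the disambiguation problem that later slides address.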

Page 15: Social Media, Data Integration, and  Human Computation

Kosmix Solution

[Figure: Kosmix architecture]
– Sources: IMDB, Musicbrainz, Wikipedia, …
– Processing: information extraction, entity disambiguation, entity merging, schema matching, event detection, event monitoring, ...
– Social Genome
– Applications
– Highly scalable real-time infrastructure: file system, RDBMS, Hadoop, Muppet, Slates, stream servers
– Crowdsourcing: internal analysts, users, Mechanical Turks, others

Page 16: Social Media, Data Integration, and  Human Computation

Social Genome

[Figure: the Social Genome as a graph]
– Taxonomy: all > people > actors > Angelina Jolie, Mel Gibson; all > places
– Twitter users: @melgibson, @dsmith, …
– FB users: mel-gibson, davesmith, …
– Events: celebrities, sports, politics, …; e.g., Gibson car crash, Egyptian uprising
– Places: Tahrir, Cairo, Egypt
– Tweets: “@dsmith: Mel crashed. Maserati is gone.”, “@far213: Tahrir is packed!”
– Edge labels: the-same-as, tweet-about, related-to, located-in, capital-of

Page 17: Social Media, Data Integration, and  Human Computation
Page 18: Social Media, Data Integration, and  Human Computation

Building Social Genome: Three Sample Challenges

[Figure: the Social Genome graph from Page 16, repeated]

Page 19: Social Media, Data Integration, and  Human Computation

Extraction and Disambiguation: Traditional Methods Ill Suited for Social Media

[Figure: taxonomy (all > people > actors, directors: Angelina Jolie, Mel Gibson, Mel Brooks; all > places) and events (celebrities, sports, politics, …; e.g., Gibson car crash, Egyptian uprising)]

Example texts mentioning “Mel”
– “Mel was arrested again. What a dramatic fall since his Oscar-winning day.”
– “@dsmith: mel crashed. maserati is gone.”

Extraction
– use rule-based / NLP / machine learning techniques
– use dictionaries, use rules

Disambiguation
– Long-term, Web context: actor, movie, Oscar, Hollywood
– Short-term, social context: crash, car, Maserati
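Context-based disambiguation of the mention "Mel" can be sketched as a term-overlap score. This is an illustrative assumption, not the method Kosmix used: the context terms for Mel Gibson are the ones on the slide, the terms for Mel Brooks are made up, and the CONTEXTS table, the prefix matching, and short_term_weight are hypothetical.

```python
import re

# Mel Gibson's terms come from the slide; Mel Brooks' terms are invented for the example.
CONTEXTS = {
    "Mel Gibson": {
        "long_term": {"actor", "movie", "oscar", "hollywood"},
        "short_term": {"crash", "car", "maserati"},
    },
    "Mel Brooks": {
        "long_term": {"director", "comedy", "producer"},
        "short_term": set(),
    },
}

def tokens(text):
    """Lowercase alphabetic tokens, so 'Oscar-winning' yields 'oscar' and 'winning'."""
    return set(re.findall(r"[a-z]+", text.lower()))

def overlap(terms, words):
    """Count context terms that prefix-match a word (crude: 'crash' matches 'crashed')."""
    return sum(any(w.startswith(term) for w in words) for term in terms)

def disambiguate(text, short_term_weight=2.0):
    """Return the best-scoring candidate entity, or None if no context term matches."""
    words = tokens(text)

    def score(entity):
        ctx = CONTEXTS[entity]
        return overlap(ctx["long_term"], words) + short_term_weight * overlap(ctx["short_term"], words)

    best = max(CONTEXTS, key=score)
    return best if score(best) > 0 else None

print(disambiguate("@dsmith: mel crashed. maserati is gone."))
print(disambiguate("Mel was arrested again. What a dramatic fall since his Oscar-winning day."))
```

The tweet is resolved only through the short-term social context (crash, Maserati), while the longer sentence is resolved through the long-term Web context (Oscar), which mirrors the contrast drawn on the slide.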

Page 20: Social Media, Data Integration, and  Human Computation

Must Maintain a Highly Dynamic Social Genome

[Figure: taxonomy (all > people > actors, directors: Angelina Jolie, Mel Gibson, Mel Brooks; all > places) and events (celebrities, sports, politics, …; e.g., Gibson car crash, Egyptian uprising), with two contexts attached]
– Long-term, Web context: actor, movie, Oscar, Hollywood
– Short-term, social context: crash, car, Maserati

Latency less than 2 seconds

Page 21: Social Media, Data Integration, and  Human Computation

The Giant Traditional Taxonomy is the Secret Weapon

Without it, dictionary-based extraction is not possible
Provides a framework to
– “understand” social media, find related concepts, “hang” social contexts
Very hard to develop, takes years
– like learning a new foreign language
Partly explains why it was hard for others to catch up
Must integrate traditional data well, then bootstrap

[Figure: taxonomy fragment: all > people > actors > Angelina Jolie, Mel Gibson; all > places; places shown: Tahrir, Cairo, Egypt; edge labels: located-in, capital-of]

Page 22: Social Media, Data Integration, and  Human Computation

Event Detection: Current Solutions

• Focus on Twitter + Foursquare
• Lot of current work in academia / industry
• Limitations of most of the current solutions
  – exploit just one kind of heuristic
    • e.g., find popular, strongly correlated words (Egypt, revolt)
  – do not exploit crowdsourcing
  – do not scale
    • not designed explicitly for parallelism

[Figure: sources (Twitter, 4square, Facebook, Myspace, Flickr, …) feed event detection, which produces events (celebrities, sports, politics, …), e.g., Gibson car crash, Egyptian uprising]

Page 23: Social Media, Data Integration, and  Human Computation

Event Detection: Kosmix Solution

[Figure: event detection pipeline]
– Inputs: Twitter, Foursquare
– Detector 1, Detector 2, ..., Detector n each produce candidate events
– Event evaluator and ranker turns candidate events into ranked events
– Population 1, Population 2, Population 3, ...
– Infrastructure: Hadoop, Muppet, Slates, stream servers
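The shape of this pipeline (several independent detectors feeding one evaluator/ranker) can be sketched as follows. Everything here is an assumption for illustration: the two toy detectors, the weights, and the sample tweets are hypothetical, and the sketch does not model the crowd populations or the distributed infrastructure shown on the slide.

```python
from collections import Counter
from itertools import combinations

STOPWORDS = {"is", "the", "on", "in", "a", "at"}

def words(tweet):
    """Lowercased words with surrounding punctuation stripped and stopwords removed."""
    return {w.strip(".,!?:").lower() for w in tweet.split()} - STOPWORDS - {""}

def popular_word_detector(tweets, min_count=2):
    """Toy detector 1: single words that occur in at least min_count tweets."""
    counts = Counter(w for t in tweets for w in words(t))
    return {(w,): c for w, c in counts.items() if c >= min_count}

def correlated_pair_detector(tweets, min_count=2):
    """Toy detector 2: word pairs that co-occur in at least min_count tweets."""
    counts = Counter(p for t in tweets for p in combinations(sorted(words(t)), 2))
    return {p: c for p, c in counts.items() if c >= min_count}

def rank_events(tweets, weighted_detectors):
    """Evaluator/ranker: sum weighted detector scores per candidate, highest first."""
    scores = Counter()
    for detect, weight in weighted_detectors:
        for candidate, score in detect(tweets).items():
            scores[candidate] += weight * score
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

# Made-up sample tweets.
tweets = [
    "Tahrir is packed!",
    "Protest in Tahrir, Egypt",
    "Egypt protest growing",
    "I feel good today",
]
ranked = rank_events(tweets, [(popular_word_detector, 1.0), (correlated_pair_detector, 2.0)])
print(ranked[0])  # (('egypt', 'protest'), 4.0)
```

Because each detector only needs the tweet stream and returns its own candidates, detectors can run in parallel and new heuristics can be added without touching the ranker.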

Page 24: Social Media, Data Integration, and  Human Computation

Event Monitoring: Current Solutions

• Manually write rules to match tweets to events
  – e.g., tweet contains certain keywords / userids → positive
  – conceptually simple, relatively easy to implement
  – often achieve high initial precision

• Limitations
  – expensive, don’t scale
  – manually writing good rules can be hard
  – rules often become invalid/inadequate over time
    • e.g., Baltimore shooting → Johns Hopkins shooting

[Figure: example events and matching tweets]
– Baltimore shooting: “@dsmith: Baltimore shooting on TV5!”
– Egyptian uprising: “@far213: Tahrir is packed!”

Page 25: Social Media, Data Integration, and  Human Computation

Event Monitoring: Kosmix Solution

[Figure: profile learning loop]
– Event: Baltimore shooting
– Initial profile: {Baltimore, shoot}
– Tweets matched from the Twitter firehose: “Baltimore shooting on TV5!”, “Baltimore shooting. Johns Hopkins shut down.”, ...
– Learning algorithm
– New profile: {Baltimore, shoot, Johns Hopkins}
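The idea of learning a new profile from matched tweets can be sketched with a simple support threshold. This is not the learning algorithm on the slide, which is not specified; the matching rule (at least two profile terms), the min_support parameter, the function names, and the third and fourth sample tweets are assumptions made for the example.

```python
import re
from collections import Counter

STOPWORDS = {"a", "at", "in", "is", "on", "the"}

def tokens(text):
    """Lowercase alphanumeric tokens without stopwords."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS]

def hits(profile, toks):
    """Number of profile terms that prefix-match some token ('shoot' matches 'shooting')."""
    return sum(any(t.startswith(term) for t in toks) for term in profile)

def matches(profile, tweet, min_hits=2):
    """A tweet matches the event when it contains at least min_hits profile terms."""
    return hits(profile, tokens(tweet)) >= min_hits

def update_profile(profile, stream, min_support=0.5):
    """Add terms that appear in at least min_support of the currently matched tweets."""
    matched = [tokens(t) for t in stream if matches(profile, t)]
    if not matched:
        return set(profile)
    support = Counter(w for toks in matched for w in set(toks))
    new_terms = {
        w for w, c in support.items()
        if c / len(matched) >= min_support and hits(profile, [w]) == 0
    }
    return set(profile) | new_terms

initial = {"baltimore", "shoot"}
stream = [
    "Baltimore shooting on TV5!",
    "Baltimore shooting. Johns Hopkins shut down.",
    "Shooting at Johns Hopkins in Baltimore",  # made-up tweet
    "I feel good today",                       # made-up noise
]
learned = update_profile(initial, stream)
print(sorted(learned))  # ['baltimore', 'hopkins', 'johns', 'shoot']
print(matches(initial, "Johns Hopkins shooting update"))  # False
print(matches(learned, "Johns Hopkins shooting update"))  # True
```

The last two lines show the drift problem from the previous slide: a tweet that no longer mentions Baltimore is missed by the initial profile but caught by the learned one.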

Page 26: Social Media, Data Integration, and  Human Computation
Page 27: Social Media, Data Integration, and  Human Computation

Social Analytics with The NYTimes

Tweets → Annotators → Tweets & Dimensions → SocialCubes → Stats
– annotators: e.g., location, sentiment, entity extraction, etc.

[Figure: a SocialCube with three dimensions]
– Topics: Barack Obama, Medicare, Hillary Clinton
– Location: Arizona, California, New York
– Sentiment: Positive, Negative, Neutral

Example queries
– How many people in Arizona feel positive about the new Medicare plan?
– How many feel negative about Barack Obama across the US?
– How many are tweeting about Barack Obama in New York, by the minute for the last 60 mins, by hour for the last 24 hours, and by day for the last 10 days?

Entity aliases: Barack Obama, President Obama, the Pres, Barry, BO, ...
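The cube on this slide can be approximated by counting annotated tweets per (topic, location, sentiment) cell and summing cells to answer a query. This is a minimal sketch under assumptions: the field names, the build_cube and query helpers, and the sample rows are hypothetical, and the sketch counts tweets rather than distinct people.

```python
from collections import Counter

DIMENSIONS = ("topic", "location", "sentiment")

def build_cube(annotated_tweets):
    """Count annotated tweets per (topic, location, sentiment) cell."""
    cube = Counter()
    for row in annotated_tweets:
        cube[tuple(row[d] for d in DIMENSIONS)] += 1
    return cube

def query(cube, topic=None, location=None, sentiment=None):
    """Sum the cells that agree with every dimension that is not None (a roll-up)."""
    wanted = (topic, location, sentiment)
    return sum(
        n for cell, n in cube.items()
        if all(w is None or w == c for w, c in zip(wanted, cell))
    )

# Made-up output of the annotators.
annotated = [
    {"topic": "Medicare", "location": "Arizona", "sentiment": "Positive"},
    {"topic": "Medicare", "location": "Arizona", "sentiment": "Positive"},
    {"topic": "Medicare", "location": "Arizona", "sentiment": "Negative"},
    {"topic": "Barack Obama", "location": "New York", "sentiment": "Negative"},
    {"topic": "Barack Obama", "location": "California", "sentiment": "Negative"},
    {"topic": "Barack Obama", "location": "New York", "sentiment": "Positive"},
]
cube = build_cube(annotated)
print(query(cube, topic="Medicare", location="Arizona", sentiment="Positive"))  # 2
print(query(cube, topic="Barack Obama", sentiment="Negative"))                  # 2
print(query(cube, topic="Barack Obama", location="New York"))                   # 2
```

Adding a time bucket as a fourth dimension would support the per-minute, per-hour, and per-day breakdowns in the third example query.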

Page 28: Social Media, Data Integration, and  Human Computation

Social Monitoring with an Unknown Agency

[Figure: monitoring dashboard over the Twitter firehose]
– Topics: Justin Bieber, Charlie Sheen, Egyptian uprising, Jordan unrest, China unrest
– Regions: North, Tibet, West, Southeast
– Count tweets related to Wael Ghonim: 146 in past 5 mins, 3267 in past 12 hours
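Counts such as "146 in past 5 mins" and "3267 in past 12 hours" are sliding-window counts over matched tweets. The sketch below is one simple way to compute them and is an assumption, not the system's design: the WindowCounter class is hypothetical and expects timestamps (in seconds) to arrive in nondecreasing order.

```python
from bisect import bisect_left, bisect_right

class WindowCounter:
    """Keeps timestamps of matched tweets and counts those in a trailing window."""

    def __init__(self):
        self.timestamps = []  # seconds; must be appended in nondecreasing order

    def add(self, timestamp):
        self.timestamps.append(timestamp)

    def count(self, now, window_seconds):
        """Number of tweets with now - window_seconds <= timestamp <= now."""
        start = bisect_left(self.timestamps, now - window_seconds)
        end = bisect_right(self.timestamps, now)
        return end - start

counter = WindowCounter()
for ts in [0, 100, 250, 290, 300]:  # made-up arrival times
    counter.add(ts)

print(counter.count(now=300, window_seconds=60))   # 3
print(counter.count(now=300, window_seconds=300))  # 5
```

Keeping every timestamp does not scale to a firehose; a real deployment would more likely maintain per-minute buckets and sum the buckets inside each window.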

Bought by Walmart in May 2011

Page 29: Social Media, Data Integration, and  Human Computation

The Walmart Acquisition

Deal reported to be 250-300M
Kosmix became @WalmartLabs
– based in San Bruno
– local office in India
– plan new offices in China and Brazil
100 persons today, actively hiring

Page 30: Social Media, Data Integration, and  Human Computation

Why?

400+ B in revenue, only 5-10B online vs. 34B of Amazon
Major problems if won’t catch up within 5-10 years
– see Borders

@WalmartLabs can help in many ways
– Provides a core of technical people, attract more
– Improve traditional e-commerce
  – SEO, SEM, search on walmart.com
  – build a vast product taxonomy
– Helps build the e-commerce of the future
  – social, local, and mobile
  – a good way to catch up and leapfrog Amazon

Page 31: Social Media, Data Integration, and  Human Computation

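Improve Traditional E-Commerce

[Figure: product data pipeline]
– Sources: product data from thousands of vendors, in-house data, Web data
– Processing: information extraction, entity disambiguation, entity merging, ...
– Product taxonomy: all products > cars > US cars > Ford, Chevrolet; all products > books
– Applications: search, ads
– Infrastructure: file system, RDBMS, Hadoop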

Page 32: Social Media, Data Integration, and  Human Computation

Help Build the E-Commerce of the Future: Social, Local, and Mobile

O2O (Online 2 Offline) emerging as a major trend
– increasingly tighter integration of online and offline parts
– e.g., Groupon, Living Social

Social, local, and mobile commerce examples
– gift recommendation:
  – “I love salt!”
  – “Your friend has just tweeted about the movie SALT. Would you like to buy something related for her birthday?”
– personalized “Groupon” with vendors:
  – “You seem to be interested in gourmet coffee. If 50 persons sign up to buy the new DeLonghi coffee maker, you can get that for a 50% discount.”
– stocking a local store
– a Siri-like shopping assistant

Page 33: Social Media, Data Integration, and  Human Computation

Wrapping Up

Social media has become a major frontier on the Web
Integrating social data is fundamentally much harder than integrating “traditional” data
– lack of context
– dynamic environment, new concepts appear quickly
– quality issues, lots of spam
– quick spread of information, user activities
– fast data
– solution will change over time, need human in the loop to monitor

Must integrate “traditional” data well, then bootstrap
– giant taxonomy critical

Crowdsourcing becomes indispensable
– but raises interesting challenges