Big Data/DIG: Domain-Specific Insight Graphs by Pedro Szekely of ISI/USC


Big Data/DIG Domain-Specific Insight Graphs

Pedro Szekely

University of Southern California

www.isi.edu/~szekely

Connecting The Dots Using the Web To Solve Hard Problems

Hard Problems
State of the Art
Our Solution
Impact

Hard Problems

Healthcare
Research investment
Human trafficking
…

Human Trafficking

Illegal drugs

Arms trafficking

Human trafficking

Illegal Industries

$32 billion profit per year

14 Average Age of Entry To Prostitution in the US

$150,000 Pimp’s Profit Per Child Per Year

$45,000,000 Advertising Budget On the Web

Human Trafficking on the Web

Thousands of Web sites

Millions of pages

Hard Problems
State of the Art
Our Solution
Impact

Google Finds “DOTS”

Recipe “Dot”

Nutrition “Dot”

Google finds dots

User finds connections

System Objectives

1.  find all the dots

2.  find all the connections

Hard Problems
State of the Art
Our Solution
Impact

1.  Downloads all relevant pages
2.  Extracts & cleans the data
3.  Discovers connections
4.  Builds unified database
5.  Creates query & analysis portal

1.  Go to Web site
2.  Download page
3.  Follow links
4.  Wait, then repeat

24/7

Web Crawling Software

2,000 Pages/Hour -- 50,000,000 pages Total
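The four crawling steps above can be sketched as a simple breadth-first loop. This is only an illustration, not the project's crawler: the function names, the regex-based link extraction, and the 1.8-second delay are assumptions (1.8 s per page corresponds to 2,000 pages/hour only if a single crawler fetches pages one at a time, which the slides do not state). A production crawler would also handle errors, honor robots.txt, and restrict the domains it visits.

```python
import re
import time
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen

# Naive link extraction; a real crawler would use an HTML parser.
LINK_RE = re.compile(r'href="([^"#]+)"')


def fetch(url):
    """Download one page and return its text (no error handling in this sketch)."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl(seed_url, fetch_page=fetch, max_pages=100, delay_seconds=1.8):
    """Breadth-first crawl starting at seed_url; returns {url: html}."""
    queue = deque([seed_url])
    seen = {seed_url}
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()                # 1. go to Web site
        html = fetch_page(url)               # 2. download page
        pages[url] = html
        for href in LINK_RE.findall(html):   # 3. follow links
            link = urljoin(url, href)
            if link not in seen:
                seen.add(link)
                queue.append(link)
        time.sleep(delay_seconds)            # 4. wait, then repeat
    return pages
```

Passing `fetch_page` as a parameter makes it possible to try the loop against an in-memory dictionary of pages before pointing it at a real site.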

Data Extraction

“YOU don't wanna miss out on ME :) Perfect lil booty Green eyes Long curly black hair Im a Irish, Armenian and Filipino mixed princess :) ❤ Kim ❤ 7○7~7two7~7four77 ❤ HH 80 roses ❤ Hour 120 roses ❤ 15 mins 60 roses”

name: Kim

eye-color: green

hair-color: black

phone: 707-727-7477

rate: $60/15min, $80/30min, $120/60min
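To make the mapping from ad text to fields concrete, here is a small rule-based sketch that recovers the phone number, rates, eye color, and hair color from the ad above. It is not the project's extractor (the slides describe learned extractors); the function names, regular expressions, and the reading of "HH" as 30 minutes and "roses" as dollars are assumptions chosen to reproduce the fields shown on the slide. Name extraction is left out.

```python
import re

# Spelled-out digits that advertisers use to obfuscate phone numbers.
WORD_DIGITS = {"zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
               "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9"}


def extract_phone(text):
    """Return the first token that decodes to a 10-digit number, or None."""
    for token in text.split():
        t = token.lower()
        for word, digit in WORD_DIGITS.items():
            t = t.replace(word, digit)
        t = t.replace("○", "0").replace("o", "0")   # look-alike characters
        digits = re.sub(r"\D", "", t)
        if len(digits) == 10:
            return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"
    return None


def extract_rates(text):
    """Return {minutes: price}, assuming 'HH' means 30 minutes."""
    rates = {}
    pattern = r"\b(hh|hour|\d+\s*mins?)\s+(\d+)\s+roses\b"
    for label, price in re.findall(pattern, text, flags=re.I):
        label = label.lower()
        if label == "hh":
            minutes = 30
        elif label == "hour":
            minutes = 60
        else:
            minutes = int(re.match(r"\d+", label).group())
        rates[minutes] = int(price)
    return rates


def extract_ad(text):
    """Collect the fields shown on the slide into one record."""
    eye = re.search(r"(\w+)\s+eyes\b", text, flags=re.I)
    hair = re.search(r"(\w+)\s+hair\b", text, flags=re.I)
    return {
        "eye-color": eye.group(1).lower() if eye else None,
        "hair-color": hair.group(1).lower() if hair else None,
        "phone": extract_phone(text),
        "rates": extract_rates(text),
    }
```

Hand-written rules like these break as soon as the wording changes, which is the motivation for learning extractors from annotated examples instead.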

Crowd-Sourced Annotations

“YOU don't wanna miss out on ME :) Perfect lil booty Green eyes Long curly black hair Im a Irish, Armenian and Filipino mixed princess :)”

Green: O eye color  O hair color
black: O eye color  O hair color

2 cents/sentence

Automatic Construction of Extractors

5,000 annotations

Machine Learning

Ready-to-use Extraction Software

$100, 1 day

Technology: Conditional Random Fields
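A linear-chain conditional random field scores a whole label sequence as a sum of per-token (emission) weights and label-to-label (transition) weights, then picks the best sequence with Viterbi decoding. The sketch below shows only that decoding step. The labels, features, and every weight are hypothetical and hand-set for illustration; in a real system they would be learned from the crowd-sourced annotations, and training is not shown.

```python
LABELS = ["O", "EYE", "HAIR"]
COLORS = {"green", "black", "blue", "brown", "blonde", "red"}

# Hypothetical, hand-set weights; a trained CRF would learn these.
EMISSION = {
    ("bias", "O"): 1.0,
    ("is_color", "EYE"): 0.5,
    ("is_color", "HAIR"): 0.5,
    ("next=eyes", "EYE"): 2.0,
    ("next=hair", "HAIR"): 2.0,
}
TRANSITION = {("EYE", "EYE"): -1.0, ("HAIR", "HAIR"): -1.0}


def features(tokens, i):
    """Features of token i: a bias, a color flag, and the following word."""
    feats = ["bias"]
    if tokens[i] in COLORS:
        feats.append("is_color")
    nxt = tokens[i + 1] if i + 1 < len(tokens) else "<end>"
    feats.append(f"next={nxt}")
    return feats


def emission(tokens, i, label):
    return sum(EMISSION.get((f, label), 0.0) for f in features(tokens, i))


def viterbi(tokens):
    """Return the highest-scoring label sequence for lowercase tokens."""
    if not tokens:
        return []
    # best[i][label] = (score of best path ending in label at i, previous label)
    best = [{lab: (emission(tokens, 0, lab), None) for lab in LABELS}]
    for i in range(1, len(tokens)):
        column = {}
        for lab in LABELS:
            prev = max(LABELS, key=lambda p: best[i - 1][p][0]
                       + TRANSITION.get((p, lab), 0.0))
            score = (best[i - 1][prev][0] + TRANSITION.get((prev, lab), 0.0)
                     + emission(tokens, i, lab))
            column[lab] = (score, prev)
        best.append(column)
    # Backtrack from the best final label.
    label = max(LABELS, key=lambda lab: best[-1][lab][0])
    path = [label]
    for i in range(len(tokens) - 1, 0, -1):
        label = best[i][label][1]
        path.append(label)
    return path[::-1]
```

With these weights, `viterbi("green eyes long curly black hair".split())` labels "green" as EYE and "black" as HAIR, and everything else as O.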

Data Cleaning

AD  Weight
1   130
2   480
3   133lbs
4   BBW
5   52 kg
6   110 pounds

AD  Weight (Kg)
1   59
2
3   60
4
5   52
6   50
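The cleaning step shown in the two tables can be reproduced with a short normalization function. This is a sketch under stated assumptions, not the project's code: values without a unit are treated as pounds (consistent with 130 becoming 59), and the plausibility range of 35 to 150 kg is a made-up threshold chosen so that 480 is rejected.

```python
import re

LB_TO_KG = 0.45359237
PLAUSIBLE_KG = (35, 150)   # assumed range; values outside it are discarded

WEIGHT_RE = re.compile(r"\s*(\d+(?:\.\d+)?)\s*(kg|kgs|lb|lbs|pounds?)?\s*", re.I)


def clean_weight(raw):
    """Return the weight in whole kilograms, or None if it cannot be trusted."""
    match = WEIGHT_RE.fullmatch(raw)
    if not match:
        return None                           # e.g. "BBW" is not a number
    value = float(match.group(1))
    unit = (match.group(2) or "lb").lower()   # assumption: no unit means pounds
    kg = value if unit.startswith("kg") else value * LB_TO_KG
    low, high = PLAUSIBLE_KG
    if not low <= kg <= high:
        return None                           # implausible, e.g. 480 lb
    return round(kg)
```

Applied to the six raw values in the first table, it returns 59, None, 60, None, 52, and 50, matching the cleaned table.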

Using Extracted Data to Connect the Dots

Mary Lucy

222-0000 777-0000

Police Database

Bad Guy: 777-0000

Technology: Karma Information Integration Toolkit
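The idea of connecting extracted data to an external database comes down to a join on a shared attribute, here the phone number. The sketch below is plain Python and does not use or imitate Karma's API. The record layout and the pairing of names with phone numbers are hypothetical, since the slide lists the names and numbers without showing which ad carries which number.

```python
from collections import defaultdict


def link_ads_to_records(ads, police_records):
    """Map each police record to the ad names that share its phone number."""
    names_by_phone = defaultdict(list)
    for ad in ads:
        for phone in ad["phones"]:
            names_by_phone[phone].append(ad["name"])
    return {
        record["person"]: sorted(names_by_phone.get(record["phone"], []))
        for record in police_records
    }


# Hypothetical data: which ad lists which number is assumed for illustration.
ads = [
    {"name": "Mary", "phones": ["222-0000", "777-0000"]},
    {"name": "Lucy", "phones": ["777-0000"]},
]
police_records = [{"person": "Bad Guy", "phone": "777-0000"}]

links = link_ads_to_records(ads, police_records)
```

Under this assumed pairing, both ads are linked to the police record through 777-0000, even though neither ad mentions the person by name.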

Using Text Similarity to Connect the Dots

E M I L Y SEXY. ** wHiTe/lATin girl ** bUsTy SWEET. LoTs Of fUn. Call Me. O_U_T_C___A___L_L_S

L A Y L A SEXY. ** wHiTe girl ** bUsTy SWEET. LoTs Of fUn. Call Me. O____U____T____C___A___L____L____S

L I L A SEXY. ** WhiTe girl ** bUsTy SWEET. LoTs Of fUn. Call Me. O_U_T_C___A___L_L_S

Technology: MinHash/LSH
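MinHash estimates how much two texts overlap by comparing short signatures instead of the full texts, and locality-sensitive hashing (LSH) groups signatures into buckets so that only likely matches are compared. The sketch below illustrates both on ads like the ones above. The normalization (lowercase, letters only), the 5-character shingles, 128 hash functions, and 32 bands are assumptions, not the project's settings.

```python
import hashlib
import re
from itertools import combinations


def shingles(text, k=5):
    """Character k-grams after lowercasing and dropping non-letters."""
    s = re.sub(r"[^a-z]", "", text.lower())
    return {s[i:i + k] for i in range(len(s) - k + 1)}


def _hash(seed, shingle):
    """One member of a family of hash functions, selected by seed."""
    digest = hashlib.sha1(f"{seed}:{shingle}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(text, num_hashes=128):
    """For each hash function, keep the minimum hash over all shingles."""
    grams = shingles(text)
    if not grams:
        raise ValueError("text is too short to shingle")
    return [min(_hash(seed, g) for g in grams) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching positions estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def lsh_candidate_pairs(signatures, bands=32):
    """Pairs of document indices whose signatures agree on at least one band."""
    rows = len(signatures[0]) // bands
    buckets = {}
    for doc_id, sig in enumerate(signatures):
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets.setdefault(key, []).append(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(ids, 2))
    return pairs
```

Stripping spacing, case, and punctuation before shingling is what lets differently decorated versions of the same ad produce overlapping shingles; the number of bands trades recall against the number of candidate pairs to check.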

Using Image Similarity to Connect the Dots

20 Million Images

Technology: Deep Learning

Create Unified Database

50 Million Ads

Technologies: Karma, Hadoop, Hive, Elastic-Search

20 Computers, 2 Hours

4 Billion Records
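As a rough sanity check on these figures, the implied throughput can be worked out directly. The calculation assumes that the 4 billion records are all produced by the 20 computers within the 2 hours and that they all derive from the 50 million ads; the slides do not spell out either relationship.

```python
records = 4_000_000_000
machines = 20
hours = 2
ads = 50_000_000

# Records handled per machine per second over the 2-hour build.
records_per_machine_second = records / (machines * hours * 3600)

# Average number of records per ad in the unified database.
records_per_ad = records / ads
```

Under those assumptions each machine handles a little under 28,000 records per second, and each ad contributes 80 records on average.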

Hard Problems
State of the Art
Our Solution
Impact

Deployed to Law Enforcement and NGOs

Organizations

University of Southern California
Columbia University
InferLink
NASA JPL
Next Century

Researchers

Pedro Szekely (PI)
Shih-Fu Chang
Tao Chen
Kevin Knight
Craig Knoblock
Daniel Marcu
Chris Mattmann
Steve Minton
Prem Natarajan
Andrew Philpot
Mike Tamayo

Engineers

Brian Amanatullah
Rachel Artiss
David Flynt
Dipsy Kapoor

Students

Jason Slepicka
Amandeep Singh
Chengye Yin
Subessware Karunamoorthy