Information Excellence 2012 Feb: Komli, Srinivasan H S, Making Data Repetitions Work

Confidential Making Data Repetitions Work for You Srinivasan H Sengamedu Komli Labs [email protected] Information Excellence Summit, February 25, 2012 Bangalore http://Informationexcellence.wordpress.com


Confidential

Making Data Repetitions Work for You

Srinivasan H Sengamedu
Komli Labs

[email protected]

Information Excellence Summit, February 25, 2012, Bangalore
http://Informationexcellence.wordpress.com


Srinivasan H Sengamedu

Bio: Srinivasan H Sengamedu (SHS) is the Head of Komli Labs, where he works on real-time bidding, user modeling, and other areas related to computational advertising.

Earlier, he was Director of Audience and Search Sciences at Yahoo Labs, Bangalore, where he worked on information extraction, machine-learned ranking, pornography detection in images, comment spam detection, etc. Most of these technologies power various Yahoo products.

He received his PhD from the Indian Institute of Science, Bangalore, and has held visiting positions at UCSD and NUS.

He has published over 100 papers and has more than 30 approved or filed patent applications. He is generally excited about creating and productizing advanced technologies.


A lot of information on the web

Web pages
Images
Videos
Blogs
Mails


Not all information is new!

Web pages about the same product, business, etc.
Near-duplicate images
Similar comments, tweets, etc.


Leveraging redundant information

Classical use
– Compression
• Lossless compression (LZW)
• Perceptually lossless compression (JPEG, MP3)
– Co-occurrence
• Pointwise Mutual Information

Redundancy ≈ Confidence
Leveraging redundancy requires care.


Pointwise Mutual Information
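
The formula on this slide did not survive in the transcript. For reference, the standard definition is PMI(x, y) = log [ p(x, y) / (p(x) p(y)) ]. The following is a minimal sketch, assuming the probabilities are estimated from document-level co-occurrence counts; the function name `pmi` and the toy corpus are illustrative, not taken from the talk.

```python
import math

def pmi(docs, x, y):
    """Pointwise mutual information (base 2) of terms x and y.

    docs is a list of sets of terms; probabilities are estimated as the
    fraction of documents containing a term (an assumption of this sketch).
    """
    n = len(docs)
    count_x = sum(1 for d in docs if x in d)
    count_y = sum(1 for d in docs if y in d)
    count_xy = sum(1 for d in docs if x in d and y in d)
    if count_xy == 0:
        # The terms never co-occur, so the ratio is zero and PMI is -infinity.
        return float("-inf")
    p_x, p_y, p_xy = count_x / n, count_y / n, count_xy / n
    return math.log2(p_xy / (p_x * p_y))

# Toy corpus (illustrative only).
docs = [{"new", "york"}, {"new", "york"}, {"new", "car"}, {"old", "car"}]
print(pmi(docs, "new", "york"))  # positive: the terms co-occur more than chance
print(pmi(docs, "old", "york"))  # -inf: the terms never co-occur
```

A positive value means the two terms appear together more often than independence would predict, which is the sense in which repetition signals confidence.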


Any other uses of redundancy?

Commercial Spam Detection
– Min-closed sequences

Information Extraction
– Strong Similarity

Near-duplicate images
– Image Signatures

Face recognition
– Consistency Learning


Identifying Near-Duplicate Subsequences


Two Spam Comments

Happy to see she is progressing well. Happy 2011 to everyone....My friend Vanessa, a 25 yrs lady, has announced her wedding with a millionaire young man Ronald who is the CEO of a MNC. It's amazing, she said she just posted her profile on a millionaire d'ating s'ite called ----------Celeb Mingle.C○M-------- - and received his chat invitations a few days later. Then, everything went so well that I can't believe it's true!Every love story will unfold on it's own. Also happy to see that most Americans reject the blame-the-conservatives crap that some (not all) liberals from all social strata were trying to promote for political gain.

Texas and Israel forever. Happy 2011 everyone....This has got to be a better year!...My friend Vanessa, a 25 yrs lady, has announced her wedding with a millionaire young man Ronald who is the CEO of a MNC. It's amazing, she said she just posted her profile on a millionaire d'ating site called ----------Rich'Friends.Org----- - and received his chat invitations a few days later. I can't believe it's true! Every love story will unfold on it's own. you can start your own wealthy love story for real at there too !many famous and wealthy people had a profile there ,why not me ? Taking out the world's trash. Oooraaaah!-----


Sequence-based Spam Detection

Motivation: Commercial spammers repeat variations of the spam content and embed it in good content. These usually avoid detection by spam filters.

Technical Challenge: Mine frequent subsequences efficiently.
– The general problem is NP-hard.
– The algorithms in the literature do not scale to web-scale data.
– The spam patterns change every few hours.

Basic Ideas
– A new sequence mining algorithm that scales to internet scale and is faster than those in the literature, even on other public data sets such as Gazelle.
– A new framework for spam detection using frequent subsequences.
– Experimental studies to measure the efficacy of the subsequence mining approach in detecting spam. We also study the life cycle of a typical spam pattern and use it to tune our mining parameters.

Results
Experiments on news comment data show:
– Coverage > 70%
– Editorial savings of a factor of ~30


mcPrism

The main ingredients in the algorithm:
– A modified DFS on the lexicographically ordered sequence tree. The tree is pruned whenever we encounter a prefix-l-closed node.
– The set of prefix-l-closed nodes is pruned by an inclusion check.
– Prime Block Encodings for fast computation of joins. We enhance the encoding scheme to handle gap and closure constraints.
– On-the-fly closure checking. We use the bidirectional closure checking and the backscan pruning schemes of BIDE. This is done using an enhancement of the block encoding scheme.
– This enhancement also solves an open problem: how to use block encodings to speed up closed sequence mining.
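
To make the "fast computation of joins" point concrete, here is a minimal sketch of the general idea behind prime block encodings: a block of positions is stored as a product of primes, so intersecting two blocks reduces to a greatest common divisor. It assumes a block size of 8 mapped to the first 8 primes, and the function names are illustrative. It shows the basic encoding only, not mcPrism's handling of gap or closure constraints.

```python
from math import gcd

# One prime per position in a block of 8 (block size is an assumption here).
PRIMES = (2, 3, 5, 7, 11, 13, 17, 19)

def encode_block(positions):
    """Encode a set of distinct positions (0..7) as a product of primes."""
    code = 1
    for p in set(positions):
        code *= PRIMES[p]
    return code

def decode_block(code):
    """Recover the set of positions from a prime-product code."""
    return {i for i, prime in enumerate(PRIMES) if code % prime == 0}

def join_blocks(code_a, code_b):
    """Positions present in both blocks: the gcd keeps only shared primes."""
    return gcd(code_a, code_b)

a = encode_block({0, 2, 5})   # 2 * 5 * 13
b = encode_block({2, 3, 5})   # 5 * 7 * 13
print(decode_block(join_blocks(a, b)))  # positions common to both blocks
```

A code of 1 represents the empty block, so a join with no shared positions decodes to the empty set.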


Commercial Spam Detection – Results

Subsequence: happy 2011 friend yrs lady announced wedding it' amazing posted received chat invitations days believe it' true love story unfold it' own

Match 1: Happy 2011 to everyone....My friend Vanessa, a 25 yrs lady, has announced her wedding with a millionaire young man Ronald who is the CEO of a MNC. It's amazing, she said she just posted her profile on a millionaire d'ating s'ite called ----------Celeb Mingle.C○M--------- and received his chat invitations a few days later. Then, everything went so well that I can't believe it's true! Every love story will unfold on it's own...=====Happy to see she is progressing well. Also happy to see that most Americans reject the blame-the-conservatives crap that some (not all) liberals from all social strata were trying to promote for political gain.

Match 2: Happy 2011 everyone....This has got to be a better year!...My friend Vanessa, a 25 yrs lady, has announced her wedding with a millionaire young man Ronald who is the CEO of a MNC. It's amazing, she said she just posted her profile on a millionaire d'ating site called ----------Rich'Friends.Org----- - and received his chat invitations a few days later. Then, everything went so well that I can't believe it's true! Every love story will unfold on it's own. you can start your own wealthy love story for real at there too !many famous and wealthy people had a profile there ,why not me ?Texas and Israel forever. Taking out the world's trash. Oooraaaah!-----

Total Matches: 35; Only 15 marked spam by existing classifiers/editors
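
The matching step above can be sketched as follows: a comment matches a mined pattern if the pattern's tokens appear in the comment in the same order, with arbitrary other tokens in between. This is a minimal sketch under stated assumptions: the tokenizer and the tiny stopword list are hypothetical stand-ins, and it covers only matching against an already-mined pattern, not the mining algorithm itself.

```python
import re

# Hypothetical stopword list; the talk does not specify its preprocessing.
STOPWORDS = {"a", "an", "and", "the", "to", "my", "her", "his", "has", "is"}

def tokenize(text):
    """Lowercase, split into word tokens, and drop stopwords."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [w for w in words if w not in STOPWORDS]

def contains_subsequence(pattern, tokens):
    """True if pattern occurs in tokens in order (gaps allowed)."""
    remaining = iter(tokens)
    # 'p in remaining' advances the iterator past the first match of p,
    # so later pattern tokens can only match later positions.
    return all(p in remaining for p in pattern)

def support(pattern, comments):
    """Number of comments that contain the pattern as a subsequence."""
    return sum(contains_subsequence(pattern, tokenize(c)) for c in comments)

pattern = ["friend", "announced", "wedding", "chat", "invitations"]
comments = [
    "My friend Vanessa has announced her wedding and received his chat invitations",
    "chat invitations wedding announced friend",  # same words, wrong order
    "Happy to see she is progressing well.",
]
print(support(pattern, comments))
```

Because the surrounding "good" content is simply skipped over, the spam core is still found when it is embedded in unrelated text, as in the two matches shown.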


The sequences are discriminative


Identifying Near-Duplicate Strings


Content Matching Approach

Key idea: Leverage redundant content across template-based sites for automatic information extraction.

Seed Database

Name | Address
Chinese Mirrch | 120 Lexington Ave, New York, NY 10016
Tiffin Wallah | 127 E 28th St New York, NY 10079

Web page


Baseline Similarity Measure

Use q-grams to handle spelling errors

Weak Similarity = Cosine-similarity between IDF-weighted q-grams.

String | 3-grams
chinese mirch | { chi, hin, ine, nes, ese, se#, e#m, #mi, mir, irc, rch }
chinese mirrch | { chi, hin, ine, nes, ese, se#, e#m, #mi, mir, irr, rrc, rch }

• Weight of a q-gram (attribute-specific) = sum of the IDFs of the words it appears in.
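
The weak similarity measure can be sketched as follows. This is a minimal illustration, assuming spaces are replaced by `#` as in the 3-gram examples and that q-gram weights are supplied by the caller as a dictionary; computing those weights from word IDFs is not shown, and unlisted q-grams default to a weight of 1.0. The function names are illustrative.

```python
import math
from collections import Counter

def qgrams(s, q=3):
    """Overlapping character q-grams, with spaces replaced by '#'."""
    s = s.lower().replace(" ", "#")
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def weak_similarity(a, b, weights=None, q=3):
    """Cosine similarity between weighted q-gram count vectors of a and b."""
    weights = weights or {}
    va, vb = Counter(qgrams(a, q)), Counter(qgrams(b, q))

    def w(gram):
        # Caller-supplied weight (e.g. derived from IDFs), else 1.0.
        return weights.get(gram, 1.0)

    dot = sum(va[g] * vb[g] * w(g) ** 2 for g in va.keys() & vb.keys())
    norm_a = math.sqrt(sum((c * w(g)) ** 2 for g, c in va.items()))
    norm_b = math.sqrt(sum((c * w(g)) ** 2 for g, c in vb.items()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

print(qgrams("chinese mirch"))
print(weak_similarity("chinese mirch", "chinese mirrch"))
```

The misspelling changes only a few q-grams, so the two strings stay close even though they differ as whole words.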


Strong Similarity

Address (Seed) | Address (Site) | WS
120 Lexington Avenue New York, NY 10016 | 120 Lexington Ave (between 28th and 29th St) New York, NY 10016 | 0.53
312 W 34th Street New York, NY 10001 | 312 W 34th St (between 8th and 9th Ave) New York, NY 10001 | 0.49

Strong similarity is defined between two sets of strings.
1. Calculate the matching pattern between weakly similar pairs in the two sets.
2. Pick matching patterns with sufficient “support”.
3. Use only parts of a string selected by the matching pattern in the final similarity calculation.

1. Variations are systematic and site-dependent.
2. Cannot be handled by term weighting.


Support & Strong Similarity

Address (Seed) | Address (Site) | Matching Pattern | Matching Segments
120 Lexington Avenue New York, NY 10016 | 120 Lexington Ave (between 28th and 29th St) New York, NY 10016 | 103 103 | 120 Lexington New York, NY 10016
312 W 34th Street New York, NY 10001 | 312 W 34th St (between 8th and 9th Ave) New York, NY 10001 | 103 103 | 312 W 34th New York, NY 10001

Matching Pattern: 103 103
Support(103 103) = |{“120 Lexington New York, NY 10016”, “312 W 34th New York, NY 10001”}| = 2 (100% support)

Address’ (Seed) | Address’ (Site) | SS
120 Lexington New York, NY 10016 | 120 Lexington New York, NY 10016 | 1
312 W 34th New York, NY 10001 | 312 W 34th New York, NY 10001 | 1


Need for Support of a Matching Pattern

Address (Seed) | Address (Site) | Matching Pattern | Matching Segments
120 Lexington Avenue New York, NY 10016 | 1075 Fifth Ave New York, NY 10128 | 010 010 | New York, NY
312 W 34th Street New York, NY 10001 | 1167 Madison Ave New York, NY 10128 | 010 010 | New York, NY

Matching Pattern: 010 010
Support(010 010) = |{“New York, NY”}| = 1 (50% support)
Hence Strong Similarity = Weak Similarity
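
The support computation in these two examples can be sketched as follows. This is a minimal illustration that treats each matching pattern as an opaque label and counts the distinct matching segments it selects across the weakly similar pairs; how patterns and segments are derived from a pair of strings is not shown, and the function name is illustrative.

```python
from collections import defaultdict

def pattern_support(matches):
    """Support of each matching pattern.

    matches holds one (pattern, matching_segment) tuple per weakly similar
    pair. Returns {pattern: (distinct_segments, fraction_of_pairs)}.
    """
    segments = defaultdict(set)
    for pattern, segment in matches:
        segments[pattern].add(segment)
    total = len(matches)
    return {p: (len(s), len(s) / total) for p, s in segments.items()}

# True matches: each pair yields a different segment, so support is high.
true_pairs = [
    ("103 103", "120 Lexington New York, NY 10016"),
    ("103 103", "312 W 34th New York, NY 10001"),
]
# Spurious matches: every pair shares only the same city fragment.
spurious_pairs = [
    ("010 010", "New York, NY"),
    ("010 010", "New York, NY"),
]
print(pattern_support(true_pairs))
print(pattern_support(spurious_pairs))
```

Counting distinct segments is what separates the cases: a pattern that always selects the same fragment carries little evidence that the underlying records actually match.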


Strong Similarity Scores

SS boosts the similarity scores of true positives (TPs) over a wide range of WS scores without boosting those of false positives (FPs).
SS is not always 1, even for true positives.
SS scores are very high for most true positives.

String 1 | String 2 | WS | SS
980 n michigan ave 14th floor chicago il | 980 n michigan ave chicago il 60611 | 0.57 | 1
1100 e north ave west chicago il 60185 | 300 w north ave west chicago il 60185 | 0.74 | 0.74


Identifying Near-Duplicate Images


Near-Duplicates on the Web


Approach

Feature
– DCT/FMT transform
– Choose low-frequency coefficients

Signature
– Median-based quantization
– Signature size depends on number of coefficients

Performance
– Large signature: near-duplicate detection
– Small signature: image similarity
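
The feature and signature steps can be sketched as follows. This is a minimal illustration using a direct 2-D DCT only (FMT is not shown), assuming a small grayscale image given as a list of rows. Dropping the DC coefficient, the default `k=4`, and comparing signatures by Hamming distance are assumptions of this sketch rather than details from the talk.

```python
import math
from statistics import median

def low_freq_dct(img, k):
    """Top-left k x k block of (unnormalized) 2-D DCT-II coefficients."""
    h, w = len(img), len(img[0])
    coeffs = []
    for u in range(k):
        for v in range(k):
            total = 0.0
            for y in range(h):
                cos_y = math.cos(math.pi * (2 * y + 1) * u / (2 * h))
                for x in range(w):
                    cos_x = math.cos(math.pi * (2 * x + 1) * v / (2 * w))
                    total += img[y][x] * cos_y * cos_x
            coeffs.append(total)
    return coeffs

def signature(img, k=4):
    """Binary signature: 1 where a coefficient exceeds the median."""
    coeffs = low_freq_dct(img, k)[1:]  # drop the DC term (assumption)
    m = median(coeffs)
    return [1 if c > m else 0 for c in coeffs]

def hamming(sig_a, sig_b):
    """Number of positions where two signatures differ."""
    return sum(a != b for a, b in zip(sig_a, sig_b))

# Synthetic 8 x 8 image and a contrast-scaled copy (illustrative data).
img = [[(x * y + 3 * x) % 17 for x in range(8)] for y in range(8)]
scaled = [[2 * p for p in row] for row in img]
print(hamming(signature(img), signature(scaled)))
```

A larger `k` yields a longer signature that is stricter about what counts as a near-duplicate, while a smaller `k` gives a coarser signature better suited to general similarity.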


FMT Detections


Signature-based Image Retrieval


Large-scale Face Recognition


Face Recognition

Face recognition was an important open problem in computer vision.
Availability of text and image/video data has provided new directions in web-scale face recognition.
If an image occurs in a news article, the named entities in the article can be associated with the faces in the image. This provides weak labels.
With large amounts of data, such weak signals can be boosted.
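
One simple way to picture how weak labels get boosted is co-occurrence voting. The sketch below is illustrative only and makes a strong assumption: faces have already been grouped so that the same person carries the same hypothetical `face_id` across images. It is not the consistency learning method referenced in the talk.

```python
from collections import Counter, defaultdict

def boost_weak_labels(observations):
    """Pick, for each face, the name it co-occurs with most often.

    observations holds (face_id, names_in_article) pairs, one per image
    occurrence. face_id is a hypothetical cluster identifier.
    """
    votes = defaultdict(Counter)
    for face_id, names in observations:
        # Each article votes at most once per name for this face.
        votes[face_id].update(set(names))
    return {face: counts.most_common(1)[0][0]
            for face, counts in votes.items() if counts}

# Illustrative data: each article mentions several people.
observations = [
    ("face_1", ["Alice", "Bob"]),
    ("face_1", ["Alice", "Carol"]),
    ("face_1", ["Alice"]),
    ("face_2", ["Bob", "Alice"]),
    ("face_2", ["Bob"]),
]
print(boost_weak_labels(observations))
```

Any single article is ambiguous about which name belongs to which face, but the name that keeps recurring with a face across many articles stands out.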


Conclusions

There is an information explosion, but the information has lots of near-duplicates.
Spotting near-duplicates has lots of advantages but is a challenge.
Large datasets present an equally large opportunity (“Unreasonable effectiveness of data …”).


References

Ravi Kant, Srinivasan H. Sengamedu, Krishnan S. Kumar: Comment spam detection by sequence mining, WSDM 2012.
Pankaj Gulhane, Rajeev Rastogi, Srinivasan H. Sengamedu, Ashwin Tengli: Exploiting Content Redundancy for Web Information Extraction, PVLDB, 2010.
Srinivasan H. Sengamedu, Neela Sawant: Finding near-duplicate images on the web using fingerprints, ACM Multimedia 2008.
Ming Zhao, Jay Yagnik, Hartwig Adam, David Bau: Large Scale Learning and Recognition of Faces in Web Videos, FG 2008.


Questions/Comments?

[email protected]

