Content Analytics for Better Search

Otis Gospodnetić, Sematext International

Agenda

Intro: Otis & Sematext

Basic Search

Taming Search Results

Key Phrases

Beyond Search

About Otis Gospodnetić

Member: Apache Lucene/Solr/Nutch/Mahout

Author: Lucene in Action 1 & 2

Entrepreneur: Simpy, Lucene Consulting, Sematext Int'l since 2007

About Sematext

Consulting, development, support:

Big Data (Hadoop, HBase, Voldemort...)

Search (Lucene, Solr, Elastic Search...)

Web Crawling (Nutch)

Machine Learning (Mahout)

Basic Search

Taming Search Results

Related searches (high query volume)

Search results clustering (fuzzy)

Named Entity Recognition (NER)

Faceted search (structured input)

10 days of data (5K/min)

Example: Related Searches

Example: Results Clustering

Example: Named Entities





Sorry, no screenshot, but I know sites use this!
Really, I do!

:)

Example: Faceted Search

Content Analysis: Key Phrases

Related searches

Search results clustering

Named Entity Recognition (NER)

Faceted search

Key Phrases

Collocations

Statistically Improbable Phrases (SIPs)

10 days of data (5K/min)

Example: Key Phrases & Search

Definitions: Collocations

Collocations are phrases whose words are seen together more often than you would expect, given how frequent each individual word is in the text.

Source: http://sematext.com/demo/kpe/

See: http://en.wikipedia.org/wiki/Collocation
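One common way to make this definition concrete is pointwise mutual information (PMI), which compares how often a word pair is observed with how often it would be expected if the two words were independent. The sketch below is only an illustration under that assumption; the slides do not say which measure the Key Phrase Extractor uses, and the function name `collocation_scores` and the `min_count` parameter are hypothetical.

```python
import math
from collections import Counter


def collocation_scores(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (PMI).

    PMI is an assumed scoring choice for illustration, not necessarily
    the measure used by the demo referenced in the slides.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = max(len(tokens) - 1, 1)

    scores = {}
    for (w1, w2), count in bigrams.items():
        # Rare pairs give unreliable estimates, so skip them.
        if count < min_count:
            continue
        p_pair = count / n_bigrams
        p_w1 = unigrams[w1] / n_unigrams
        p_w2 = unigrams[w2] / n_unigrams
        # Positive PMI: the pair occurs more often than chance predicts.
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


tokens = "new york is big and new york is busy but the city is old".split()
for pair, score in sorted(collocation_scores(tokens).items(),
                          key=lambda item: item[1], reverse=True):
    print(" ".join(pair), round(score, 2))
```

On real text the token list would come from a proper tokenizer, and a higher `min_count` helps keep one-off pairs from dominating the ranking.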

Definitions: SIPs

Statistically Improbable Phrases are phrases that appear in a text more often than you would expect given how often they appear in another text. In this demo we extract SIPs by comparing texts from two different time periods.

Source: http://sematext.com/demo/kpe/

See: http://en.wikipedia.org/wiki/Statistically_Improbable_Phrases
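The comparison between two texts can be sketched as a log ratio of smoothed phrase frequencies: a phrase scores high when it is relatively frequent in the text of interest and rare in the reference text. This is a minimal sketch under assumed choices (two-word phrases, add-one smoothing); the names `sip_scores`, `bigram_counts`, and `smoothing` are hypothetical and do not describe the actual implementation behind the demo.

```python
import math
from collections import Counter


def bigram_counts(tokens):
    """Count adjacent word pairs (two-word phrases are an assumption)."""
    return Counter(zip(tokens, tokens[1:]))


def sip_scores(foreground_tokens, background_tokens, smoothing=1.0):
    """Score foreground phrases by how improbable they are in the background."""
    fg = bigram_counts(foreground_tokens)
    bg = bigram_counts(background_tokens)
    # Additive smoothing keeps phrases unseen in the background finite.
    vocab_size = len(set(fg) | set(bg))
    fg_total = sum(fg.values()) + smoothing * vocab_size
    bg_total = sum(bg.values()) + smoothing * vocab_size

    scores = {}
    for phrase, count in fg.items():
        p_fg = (count + smoothing) / fg_total
        p_bg = (bg[phrase] + smoothing) / bg_total
        scores[phrase] = math.log(p_fg / p_bg)
    return scores


foreground = "volcano ash cloud closes airports as volcano ash cloud spreads".split()
background = "rain cloud closes roads as rain cloud spreads".split()
scores = sip_scores(foreground, background)
for phrase in sorted(scores, key=scores.get, reverse=True)[:3]:
    print(" ".join(phrase), round(scores[phrase], 2))
```

Phrases shared by both texts, such as "cloud closes", end up near or below zero, while phrases specific to the foreground rise to the top.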

Language Models

Hybrid Key Phrases

Beyond Search

Content analysis

Trend spotting / Buzz monitoring

Social media

Customer reviews / Brand

Book Content Analysis

SIPs at Amazon

Amazon SIPs are the most distinctive phrases in the text of books in the Search Inside! program. To identify SIPs, our computers scan the text of all books in the Search Inside! program. If they find a phrase that occurs a large number of times in a particular book relative to all Search Inside! books, that phrase is a SIP in that book.

SIPs are not necessarily improbable within a particular book, but they are improbable relative to all books in Search Inside!. For example, most SIPs for a book on taxes are tax related. But because we display SIPs in order of their improbability score, the first SIPs will be on tax topics that this book mentions more often than other tax books. For works of fiction, SIPs tend to be distinctive word combinations that often hint at important plot elements.

News Content Analysis








Source: http://sematext.com/demo/kpe/

SIPs & News Topic Trending

The text for the new (or you can think of it as "current") period goes from now to up to 7 days back. The text for the old (or "past") period is for the 7 days before that.

now <- new text -> (now - 7 days) <- old text -> (now - 14 days)

Customer Experience

Mindshare Technologies (MT) is a Voice of the Customer company that helps companies make operational improvements based on customer feedback. MT's client list includes many of the world's largest restaurant chains, hotels, car rental agencies, and telecommunications companies. Much of the feedback we collect is from surveys that contain open-ended questions where customers can leave comments. MT has used the Key Phrase Extractor to unlock the value contained in these comments. We are able to identify common problems experienced by customers and are even able to detect emerging topics that are starting to catch fire. Mindshare's clients are able to leverage this information and make operational changes that improve customer experiences.

Lessons

GIGO

Language-awareness (POS)

Filtering (England v)

Contact

sematext.com

blog.sematext.com

@sematext

@otisg



Copyright 2010 Sematext Int'l. All rights reserved.