Real time semantic search engine for social tv streams

Textalytics: Meaning-as-a-Service. Real Time Semantic Search for Social TV streams. César de Pablo Sánchez, Daedalus. 8/11 2013, Big Data Spain (Madrid)

description

Social TV, the use of social networks to comment on TV programs, is a growing phenomenon. TV channels and brands are turning to social networks to look for real-time insights about their programs. Understanding the global conversation about a program is useful to both. For broadcasters, acquiring insights while a program is on air enables them to produce new content formats that include the social conversation. For brands, it helps to prevent reputation crises and increase the reach of their marketing efforts. Viewers, who increasingly use second-screen devices, should benefit from tools that help them understand opinions around the main content and connect with peers during TV programs or live events.

We present a system that combines natural language processing (Textalytics API) and a scalable semi-structured database/search engine (SenseiDB) to provide semantic and faceted search, real-time analytics and support for visualizations in this kind of application.

In the first part, we will present some of the useful NLP methods that can be used to tame unstructured big data like Twitter or Facebook comments. We will describe tasks such as text categorization, sentiment analysis and named entity recognition. We will also see how this data can be related to external data such as Linked Data resources. While the description will be general, the examples will use the Textalytics API.

Then we will present how this data can be ingested and made available for search in real time using a semi-structured database like SenseiDB. We will present key features of SenseiDB, including high-performance real-time indexing with simultaneous querying, distribution, and support for full-text and faceted search. We will also discuss how facets can be stretched to provide real-time analytics and enable semantic search. Finally, we will discuss advantages, problems and current limitations of SenseiDB.

Takeaway points:
- Analyzing and searching text in social streams
- Integrating text analytics services (Textalytics) and a semi-structured database (SenseiDB)
- Key features of SenseiDB

Transcript of Real time semantic search engine for social tv streams

Page 1: Real time semantic search engine for social tv streams

Textalytics: Meaning-as-a-Service

Real Time Semantic Search for Social TV streams

César de Pablo Sánchez

Daedalus

8/11 2013

Big Data Spain (Madrid)

Page 4: Real time semantic search engine for social tv streams

The plot

1. What's Social TV?

2. Monitoring Social TV conversations. A preliminary architecture

3. Understanding the buzz. Textalytics

4. Organizing the mess. SenseiDB

5. Lessons learned

Page 5: Real time semantic search engine for social tv streams

Social TV

Second Screen

Transmedia

Page 6: Real time semantic search engine for social tv streams

Not just TV

Sports

Elections

Alerts

Page 7: Real time semantic search engine for social tv streams

Big Data?

Volume

Velocity

Variety

Page 8: Real time semantic search engine for social tv streams

Users?

Viewers

Channels

Brands

Page 9: Real time semantic search engine for social tv streams

Viewers?

Participate

Vote

Influence

Confirm beliefs

Keep updated

Belong to group

Page 10: Real time semantic search engine for social tv streams

Viewers?

Participate

Influence

Confirm beliefs

Keep updated

Page 11: Real time semantic search engine for social tv streams

Channels?

Understand

React

Measure

Page 12: Real time semantic search engine for social tv streams

Channels?

Understand

React

Page 13: Real time semantic search engine for social tv streams

Brands?

Select programs

Reputation

Find public

Page 14: Real time semantic search engine for social tv streams

Reputation

Find public

Brands?

Example from Bluefin Labs

Page 15: Real time semantic search engine for social tv streams

Monitoring Social TV conversations.

The architecture

Page 16: Real time semantic search engine for social tv streams

[Architecture diagram; labels: tracker, gateway, HTTP Stream, pipeline, Pull, EPG]

Page 17: Real time semantic search engine for social tv streams

Understanding the buzz. Textalytics API

Page 18: Real time semantic search engine for social tv streams

Core API

Topics Extraction

Text Classification

Sentiment Analysis

Language Identification

Lemmatization, POS and Parsing

Speech Recognition and Speaker Diarization

Semantic Linked Data Viewer

Spell, Grammar and Style

User Demographics

Page 19: Real time semantic search engine for social tv streams

Language identification

● Given a text, identify a list of languages, or just one
● 62 languages
● Uses language n-gram signatures
● Social TV
  – Filter: TV hashtags often imply the language
  – Sometimes hashtags are multilingual, but not relevant for users
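The slide only says that language n-gram signatures are used, so the following is a minimal sketch of that general idea and not the Textalytics implementation. It builds character trigram profiles from two tiny sample sentences (an assumption; real signatures would be trained on large corpora for each of the 62 languages) and ranks languages by cosine similarity. The names `ngram_profile`, `cosine` and `identify` are hypothetical.

```python
from collections import Counter
from math import sqrt


def ngram_profile(text, n=3):
    """Count character n-grams of a whitespace-normalized, lowercased text."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count profiles."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


# Toy training samples (assumption): far too small for real use.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and this is what "
          "they are watching on television tonight",
    "es": "el rápido zorro marrón salta sobre el perro perezoso y esto es "
          "lo que están viendo en la televisión esta noche",
}
SIGNATURES = {lang: ngram_profile(sample) for lang, sample in SAMPLES.items()}


def identify(text, signatures=SIGNATURES, top=1):
    """Return the `top` best-matching language codes, best first."""
    profile = ngram_profile(text)
    ranked = sorted(((cosine(profile, sig), lang) for lang, sig in signatures.items()),
                    reverse=True)
    return [lang for _, lang in ranked[:top]]
```

Asking for `top=1` gives the "just one" behaviour, while a larger `top` returns a ranked language list. For Social TV, such a function could act as the filter that drops tweets whose language does not match the program's audience.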

Page 21: Real time semantic search engine for social tv streams

Text Classification

● Theme labels – IPTC
● Relevance
● Multiple labels
● Tailored for short text (tweets)
● Define your own models and categories
● Social TV – filter on topic content
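To make the "multiple labels with relevance" and "define your own categories" points concrete, here is a deliberately naive keyword-based stand-in. It is not how Textalytics classifies text and it does not use IPTC codes. The model format, the category names and the `classify` function are all assumptions made for illustration.

```python
def classify(text, model, min_relevance=0.0):
    """Return (category, relevance) pairs, most relevant first.

    `model` maps a category name to a set of lowercase keywords.
    Relevance is the fraction of tokens that hit a category keyword.
    """
    tokens = [t.strip(".,!?#@").lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    labels = []
    for category, keywords in model.items():
        hits = sum(1 for t in tokens if t in keywords)
        if hits:
            relevance = hits / len(tokens)
            if relevance >= min_relevance:
                labels.append((category, relevance))
    return sorted(labels, key=lambda pair: -pair[1])


# A user-defined model with two hypothetical categories.
MODEL = {
    "cooking": {"chef", "recipe", "kitchen"},
    "sport": {"goal", "match"},
}
```

A Social TV pipeline could keep only the tweets whose top label matches the program's topic, for example by calling `classify(tweet, MODEL, min_relevance=0.2)` and discarding tweets that return no labels.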

Page 22: Real time semantic search engine for social tv streams

Sentiment analysis

● Document-level classification
● Positive/Negative/Neutral
● Subjective/Objective
● Tailored for short texts
● Handles Twitter jargon – RT, @, hashtags, emoticons, spelling errors, disfluencies
● Other features
  – Entity-level sentiment
  – Segment-level sentiment
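As an illustration of what handling Twitter jargon can involve before document-level polarity is decided, the sketch below strips retweet markers, mentions and URLs, unwraps hashtags, scores emoticons, and tolerates letter elongations such as "loooove". The tiny lexicons and the scoring rule are assumptions made for this example and do not describe the Textalytics model.

```python
import re

# Toy lexicons (assumption): a real system would use much larger resources.
POSITIVE = {"great", "love", "good", "amazing"}
NEGATIVE = {"bad", "boring", "awful", "hate"}
EMOTICONS = {":)": 1, ":-)": 1, ":D": 1, ":(": -1, ":-(": -1}


def normalize(tweet):
    """Strip Twitter-specific noise and return whitespace-separated tokens."""
    text = re.sub(r"https?://\S+", " ", tweet)   # drop URLs
    text = re.sub(r"^\s*RT\b", " ", text)        # drop the retweet marker
    text = re.sub(r"@\w+:?", " ", text)          # drop user mentions
    text = re.sub(r"#(\w+)", r"\1", text)        # keep the hashtag word
    return text.split()


def word_score(word):
    """Score a word, also trying versions with repeated letters squeezed."""
    candidates = (word,
                  re.sub(r"(\w)\1{2,}", r"\1\1", word),  # "goooood" -> "good"
                  re.sub(r"(\w)\1+", r"\1", word))       # "loooove" -> "love"
    for candidate in candidates:
        if candidate in POSITIVE:
            return 1
        if candidate in NEGATIVE:
            return -1
    return 0


def polarity(tweet):
    """Return 'positive', 'negative' or 'neutral' for the whole tweet."""
    score = 0
    for token in normalize(tweet):
        if token in EMOTICONS:
            score += EMOTICONS[token]
        else:
            score += word_score(token.strip(".,!?;:").lower())
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"
```

Entity-level or segment-level sentiment would need the same scoring applied to the text span around each entity or segment, which this sketch does not attempt.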

Page 23: Real time semantic search engine for social tv streams

Topics Extraction

➔ People: Ben Bernanke, Mariano Rajoy…

➔ Companies, organizations: BBVA, Bankia, Goldman Sachs, Coca-Cola, Federal Reserve…

➔ Financial entities: Ibex35, Dax Xetra…

➔ Locations: London, USA, Paris…

➔ Concepts: risk premium, prime minister, parliamentary address, stock index, economic situation…

➔ Time references: today, yesterday, around 11 in the morning…

➔ Money amounts: 104 dollars, 1 euro…

● 12 main types
● Ontology with > 200 types
● Instances – BBVA
● Classes – bank
● fictional/historic
● Social TV:
  – populate custom dictionaries: programs, celebrities, fictional characters
  – relationship

Page 24: Real time semantic search engine for social tv streams

Entity Linking

● Linking entities to their 'real' representation
● Linking to several Linked Open Data (LOD) sources

Page 26: Real time semantic search engine for social tv streams

API

● NLP and Semantics API
● Multilingual: EN, ES (FR, IT, PT, CA)
● REST service: JSON and XML
● Combines the best of all worlds
  – Deep language analysis
  – Comprehensive resources: linguistics and databases
  – Ontology
  – Rule-based methods
  – Statistics and machine learning methods

Page 27: Real time semantic search engine for social tv streams

● High-level semantic API – close to business scenarios

● Core API – building blocks

[Diagram labels: Topics, Sentiment, Classification, Linked Data, POS; Configuration and Linguistic Resources; Media Analysis API; Semantic Publishing API]

Page 28: Real time semantic search engine for social tv streams

Organizing the mess. SenseiDB

Page 29: Real time semantic search engine for social tv streams

SenseiDB

● Open source, distributed, realtime, semi-structured database

● From LinkedIn SNA: powering the LinkedIn home page and LinkedIn Signal
● Integrates other open source technologies:
  – Zoie: Lucene-based search engine
  – Bobo: faceted search
  – Apache Kafka: pub-sub system

● http://www.senseidb.com/

Page 30: Real time semantic search engine for social tv streams

SenseiDB features

● 'Hybrid' information retrieval and database
● Full-text search
● Structured and faceted search
● Fast real-time updates with low latency and high throughput
  – pull model
● Single table/collection
● BQL, a SQL-like language
● Eventual consistency
● Distributed: sharding and partitioning
● Hadoop integration

Page 31: Real time semantic search engine for social tv streams

Faceted search

● Amazon.com?
● Identify relevant attributes to use as filters
● Predefined facets
  – Define a table schema
  – Define fields as facets (facet schema)
● Efficient: in memory

Page 32: Real time semantic search engine for social tv streams

Faceted search in depth

● Field types
  – Basic: string, int, short, long, float, double, char
  – Complex: date and text (analyzed, term vectors)
● Facet types
  – Simple: 1 row, 1 value
  – Hierarchical: path c>b>a
  – Range: define ranges
  – Multi: 1 row, n values
  – Histogram: define bins and their size
  – TimeRange: for real-time data
  – Custom
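The differences between facet types are easier to see as counting rules. The sketch below is plain Python over a list of dict rows, not SenseiDB or Bobo code, and it covers four of the types listed above. The field names (`lang`, `hashtags`, `followers`, `category`) are hypothetical.

```python
from collections import Counter


def simple_facet(rows, field):
    """Simple facet: each row contributes exactly one value."""
    return Counter(row[field] for row in rows)


def multi_facet(rows, field):
    """Multi facet: each row contributes every value in a list field."""
    return Counter(value for row in rows for value in row[field])


def range_facet(rows, field, ranges):
    """Range facet: count rows falling in each half-open [low, high) range."""
    counts = Counter()
    for row in rows:
        for low, high in ranges:
            if low <= row[field] < high:
                counts[(low, high)] += 1
    return counts


def path_facet(rows, field, sep=">"):
    """Hierarchical facet: a row counts for its path and every ancestor."""
    counts = Counter()
    for row in rows:
        parts = row[field].split(sep)
        for depth in range(1, len(parts) + 1):
            counts[sep.join(parts[:depth])] += 1
    return counts


ROWS = [
    {"lang": "es", "hashtags": ["TopChef", "cena"], "followers": 120,
     "category": "tv>cooking>TopChef"},
    {"lang": "es", "hashtags": ["TopChef"], "followers": 15000,
     "category": "tv>cooking>TopChef"},
    {"lang": "en", "hashtags": [], "followers": 40, "category": "tv>news"},
]
```

A histogram facet would follow the same pattern as `range_facet` with equally sized bins, and a time-range facet would apply it to timestamps.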

Page 33: Real time semantic search engine for social tv streams

Real time indexing

● Data events: add and delete
● Data streams: a succession of data events
● Gateways
  – Read data events from data streams
  – File
  – JDBC
  – JMS
  – Kafka
  – Custom: Twitter
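The gateway idea can be sketched as a generator that turns a raw stream into add and delete events, which an index then applies one by one. This is an illustrative Python model only. SenseiDB gateways are not written this way, and the `DataEvent` class, the JSON-lines input and the `{"delete": {"id": ...}}` shape are assumptions.

```python
import json
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass
class DataEvent:
    op: str                      # "add" or "delete"
    doc_id: str
    doc: Optional[dict] = None   # payload, only present for "add"


def json_lines_gateway(lines: Iterable[str]) -> Iterator[DataEvent]:
    """Read a data stream (one JSON object per line) and emit data events."""
    for line in lines:
        if not line.strip():
            continue  # ignore keep-alive blank lines
        raw = json.loads(line)
        if "delete" in raw:
            yield DataEvent("delete", str(raw["delete"]["id"]))
        else:
            yield DataEvent("add", str(raw["id"]), raw)


class InMemoryIndex:
    """Stand-in for the search index: applies events as they arrive."""

    def __init__(self):
        self.docs = {}

    def apply(self, event: DataEvent) -> None:
        if event.op == "add":
            self.docs[event.doc_id] = event.doc
        elif event.op == "delete":
            self.docs.pop(event.doc_id, None)


def consume(index: InMemoryIndex, lines: Iterable[str]) -> InMemoryIndex:
    """Pull events from the gateway and apply them to the index."""
    for event in json_lines_gateway(lines):
        index.apply(event)
    return index
```

Because the consumer pulls from the gateway, swapping the source (file, queue, or a custom Twitter reader) only means providing a different iterable of lines.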

Page 34: Real time semantic search engine for social tv streams

BQL – search, filter and facets

● Search – common boolean and phrase operators

● Filters – WHERE conditions
● Facets – support basic analytics tasks defined on them
● Relevance
  – Default: recency
  – Ad hoc: may be defined in the query

Page 35: Real time semantic search engine for social tv streams

BQL Query Example on Tweets

SELECT *

WHERE hashtags IN ("TopChef")

BROWSE BY

hashtags, user_screen_name, urls

Page 36: Real time semantic search engine for social tv streams

Tweet Query example

Page 37: Real time semantic search engine for social tv streams

Query examples

SELECT *

WHERE QUERY IS "relaxing cup of coffee"

Page 38: Real time semantic search engine for social tv streams

Query examples

SELECT *

WHERE QUERY IS "relaxing cup of coffee"

BROWSE BY entities, sentiment

Page 39: Real time semantic search engine for social tv streams

Query examples

SELECT *

WHERE QUERY IS "relaxing cup of coffee"

AND time IN LAST 2 hours

BROWSE BY entities, sentiment
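The three query slides add one clause at a time: a full-text query, then facets to browse, then a time window. A small helper that assembles such strings makes the pattern explicit. It only produces the clause shapes shown on the slides. The function `build_bql` is hypothetical and is not part of any SenseiDB client.

```python
def build_bql(query=None, filters=None, last=None, browse_by=None):
    """Compose a BQL string from the clause types shown on the slides.

    query     -- full-text query string (QUERY IS "...")
    filters   -- dict mapping a field to a list of accepted values (IN)
    last      -- time window such as "2 hours" (time IN LAST ...)
    browse_by -- list of facet names (BROWSE BY ...)
    """
    conditions = []
    if query is not None:
        if '"' in query:
            raise ValueError("double quotes are not supported in this sketch")
        conditions.append(f'QUERY IS "{query}"')
    for field, values in (filters or {}).items():
        quoted = ", ".join(f'"{value}"' for value in values)
        conditions.append(f"{field} IN ({quoted})")
    if last:
        conditions.append(f"time IN LAST {last}")

    bql = "SELECT *"
    if conditions:
        bql += " WHERE " + " AND ".join(conditions)
    if browse_by:
        bql += " BROWSE BY " + ", ".join(browse_by)
    return bql
```

For instance, `build_bql(query="relaxing cup of coffee", last="2 hours", browse_by=["entities", "sentiment"])` yields the last query above on a single line.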

Page 40: Real time semantic search engine for social tv streams

Using facets for semantic search

● Define a facet for:
  – entities/concepts → tweets about Chicote (include all variants + user + hashtags)
  – each entity type → navigate by type (popular people)
  – classification/sentiment/emotions → positive tweets about Chicote
  – users or hashtags → popular users / popular mentions / correlated hashtags
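The trick described here is to store the output of text analysis as facet fields, so that a semantic question such as "positive tweets about Chicote" becomes an ordinary facet filter. The sketch below models that with plain Python dicts. The alias table (including the handle and hashtag), the field names and both functions are assumptions for illustration. In the real system the entity and sentiment fields would come from the analysis service and the filtering from the search engine.

```python
from collections import Counter

# Hypothetical alias dictionary: name variants, user handle and hashtag
# all resolve to one canonical entity.
ALIASES = {
    "chicote": "Alberto Chicote",
    "@albertochicote": "Alberto Chicote",
    "#chicote": "Alberto Chicote",
}


def to_document(text, sentiment):
    """Turn an analyzed tweet into a flat document with facet fields."""
    tokens = [t.strip(".,!?:;") for t in text.lower().split()]
    entities = sorted({ALIASES[t] for t in tokens if t in ALIASES})
    hashtags = [t.lstrip("#") for t in tokens if t.startswith("#")]
    return {"text": text, "entities": entities,
            "hashtags": hashtags, "sentiment": sentiment}


def browse(docs, entity=None, sentiment=None, facet="hashtags"):
    """Filter on the semantic facets and count values of another facet."""
    selected = [d for d in docs
                if (entity is None or entity in d["entities"])
                and (sentiment is None or d["sentiment"] == sentiment)]
    counts = Counter(value for d in selected for value in d[facet])
    return selected, counts


DOCS = [
    to_document("Chicote is great tonight #TopChef", "positive"),
    to_document("@albertochicote too harsh #TopChef #drama", "negative"),
    to_document("Loving the desserts #TopChef", "positive"),
]
```

Calling `browse(DOCS, entity="Alberto Chicote")` returns both tweets that mention him under any variant, and the facet counts give the hashtags that co-occur with him.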

Page 41: Real time semantic search engine for social tv streams

Architecture

Page 42: Real time semantic search engine for social tv streams

Scalability

● ZooKeeper to keep track of replicas
● Low indexing latency (no batch commit)
● Low search latency, even with indexing bursts
● Horizontally scalable: shards
● Shards may be replicated N times
● Elastic: nodes can be added to accommodate growth
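A toy model helps to see how shards and replicas interact: documents are routed to one shard by a stable hash of their id, every replica of that shard receives the write, and a search fans out to one replica per shard and merges the hits. This is a simplified sketch under those assumptions. It is not SenseiDB's routing or its ZooKeeper coordination, and the `Cluster` class is hypothetical.

```python
import zlib


class Cluster:
    """Toy cluster: `num_shards` partitions, each stored on `replicas` nodes."""

    def __init__(self, num_shards, replicas=2):
        self.shards = [[{} for _ in range(replicas)] for _ in range(num_shards)]

    def shard_of(self, doc_id):
        """Stable hash partitioning: the same id always maps to the same shard."""
        return zlib.crc32(doc_id.encode("utf-8")) % len(self.shards)

    def index(self, doc_id, doc):
        """Write the document to every replica of its shard."""
        for replica in self.shards[self.shard_of(doc_id)]:
            replica[doc_id] = doc

    def search(self, predicate, replica=0):
        """Fan out to one replica of each shard and merge the matching docs."""
        hits = []
        for shard in self.shards:
            node = shard[replica % len(shard)]
            hits.extend(doc for doc in node.values() if predicate(doc))
        return hits
```

Adding shards spreads the indexing load, while adding replicas spreads the query load, since any replica of a shard can answer for it.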

Page 43: Real time semantic search engine for social tv streams

Other features

● Batch indexing via Hadoop – ETL
● Simple analytics by batch indexing
● Customized relevance models
● MapReduce functions over facets
  – Sum, avg, min, max
  – DistinctCount
● Activity values – volatile values – likes
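The aggregation functions listed above amount to grouping rows by a facet value and reducing a numeric column per group. The following is a plain Python sketch of that computation, not SenseiDB's MapReduce interface. The function names and the sample fields (`program`, `retweets`, `user`) are assumptions.

```python
from collections import defaultdict


def aggregate(rows, facet, metric):
    """Group rows by a facet value and compute sum, avg, min and max."""
    groups = defaultdict(list)
    for row in rows:                      # map: emit (facet value, metric)
        groups[row[facet]].append(row[metric])
    return {                              # reduce: one summary per facet value
        value: {"sum": sum(nums), "avg": sum(nums) / len(nums),
                "min": min(nums), "max": max(nums)}
        for value, nums in groups.items()
    }


def distinct_count(rows, facet, field):
    """Count the distinct values of `field` within each facet value."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[facet]].add(row[field])
    return {value: len(items) for value, items in seen.items()}


ROWS = [
    {"program": "TopChef", "retweets": 3, "user": "a"},
    {"program": "TopChef", "retweets": 1, "user": "b"},
    {"program": "TopChef", "retweets": 2, "user": "a"},
    {"program": "News", "retweets": 5, "user": "c"},
]
```

With these rows, `distinct_count(ROWS, "program", "user")` approximates "unique commenters per program", the kind of simple real-time analytic the slide refers to.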

Page 44: Real time semantic search engine for social tv streams

Lessons learned

Page 45: Real time semantic search engine for social tv streams

Conclusions

● SenseiDB is fast at searching/indexing – no variance

● A couple of nodes are enough to handle the Spanish Social TV volume

● Love the query language (BQL) and its time operators
● Supports real-time exploration

Page 46: Real time semantic search engine for social tv streams

Limitations

● SenseiDB
  – Documentation is still scarce
  – Single table model: flat users and reputation
  – Tricks to store complex facets
  – Manageability
● Social TV Tracker
  – Group and disambiguate entity mentions across tweets
  – Relevance is tricky: ad hoc

Page 47: Real time semantic search engine for social tv streams

Comparison

● Solr
  – Near-real-time updates (soft commits)
  – Simple facets
  – Popular, great tools
● ElasticSearch
  – Batch/realtime commits
  – Online facets
  – Aggregation after facets
  – Much better plugin system
● Storm, S4?

Page 48: Real time semantic search engine for social tv streams

Thanks and QA

@zdepablo #bigdata #socialtv #2ndscreen #nlp @textalytics