Real time semantic search engine for social tv streams

Textalytics: Meaning-as-a-Service

Real Time Semantic Search for Social TV Streams

César de Pablo Sánchez, Daedalus

8/11 2013

Big Data Spain (Madrid)

The plot

1. What's Social TV?

2. Monitoring Social TV conversations. A preliminary architecture

3. Understanding the buzz. Textalytics

4. Organizing the mess. SenseiDB

5. Lessons learned

Social TV

SecondScreen

Transmedia

Not just TV

Sports

Elections

Alerts

Big Data?

Volume

Velocity

Variety

Users?

Viewers

Channels

Brands

Viewers?

Participate

Vote

Influence

Confirm beliefs

Keep updated

Belong to group

Channels?

Understand

React

Measure

Brands?

Select programs

Reputation

Find public

Example from Bluefin Labs

Monitoring Social TV conversations.

The architecture

[Architecture diagram. Labels: tracker, gateway, HTTP stream, pipeline, pull, EPG]

Understanding the buzz. Textalytics API

Core API

Topics Extraction

Text Classification

Sentiment Analysis

Language Identification

Lemmatization, POS and Parsing

Speech Recognition and Speaker Diarization

Semantic Linked Data Viewer

Spell, Grammar and Style

User Demographics

Language identification

● Given a text, identify a ranked list of languages, or just the most likely one
● 62 languages
● Uses language n-gram signatures
● Social TV
  – Filter: a TV hashtag often implies a language
  – Sometimes hashtags are multilingual, but that is not relevant for users
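A minimal sketch of language identification with character n-gram signatures, assuming a rank-based "out-of-place" comparison. This is one common way to use n-gram profiles and not necessarily what Textalytics implements. The two toy signatures, `MAX_PENALTY` and all function names are hypothetical.

```python
from collections import Counter

MAX_PENALTY = 300  # also the profile size; an arbitrary choice for this sketch


def ngram_profile(text, n=3, top=MAX_PENALTY):
    """Return the most frequent character n-grams, most frequent first."""
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter(padded[i:i + n] for i in range(len(padded) - n + 1))
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [gram for gram, _ in ranked[:top]]


def out_of_place(profile, signature):
    """Sum of rank differences; n-grams missing from the signature pay MAX_PENALTY."""
    position = {gram: rank for rank, gram in enumerate(signature)}
    return sum(
        abs(rank - position[gram]) if gram in position else MAX_PENALTY
        for rank, gram in enumerate(profile)
    )


# Toy signatures built from one sentence each; real ones need large corpora.
SIGNATURES = {
    "en": ngram_profile(
        "the quick brown fox jumps over the lazy dog and the cat watches the show"
    ),
    "es": ngram_profile(
        "el rápido zorro marrón salta sobre el perro perezoso y el gato mira el programa"
    ),
}


def rank_languages(text):
    """Return language codes ordered from most to least likely."""
    profile = ngram_profile(text)
    return sorted(SIGNATURES, key=lambda lang: out_of_place(profile, SIGNATURES[lang]))
```

Calling `rank_languages(tweet_text)` returns the full ranked list; taking the first element gives the "just one" answer mentioned on the slide.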

Text Classification

● Theme labels – IPTC
● Relevance
● Multiple labels
● Tailored for short text (tweets)

● Define your own models and categories

● Social TV – filter on topic content

Sentiment analysis

● Document-level classification
  – Positive/Negative/Neutral
  – Subjective/Objective
● Tailored for short texts
● Handles Twitter jargon: RT, @, hashtags, emoticons, spelling errors, disfluencies
● Other features
  – Entity-level sentiment
  – Segment-level sentiment
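The Twitter jargon listed above is usually separated from the plain words before any sentiment model sees the text. The following is an illustrative sketch of such a preprocessing step, not the Textalytics pipeline. The emoticon table and the function name `preprocess_tweet` are assumptions made for the example.

```python
import re

# Tiny illustrative emoticon lexicon; a real one would be much larger.
EMOTICONS = {":)": "positive", ":-)": "positive", ":D": "positive",
             ":(": "negative", ":-(": "negative"}


def preprocess_tweet(text):
    """Split a tweet into plain words plus Twitter-specific signals."""
    tokens = text.split()
    is_retweet = bool(tokens) and tokens[0] == "RT"
    if is_retweet:
        tokens = tokens[1:]
    result = {"retweet": is_retweet, "mentions": [], "hashtags": [],
              "emoticons": [], "words": []}
    for token in tokens:
        if token in EMOTICONS:
            result["emoticons"].append(EMOTICONS[token])
        elif token.startswith("@") and len(token) > 1:
            result["mentions"].append(token[1:].rstrip(":,.!?"))
        elif token.startswith("#") and len(token) > 1:
            result["hashtags"].append(token[1:].rstrip(":,.!?").lower())
        else:
            # Squeeze letters repeated 3+ times ("loooove" becomes "loove").
            word = re.sub(r"(.)\1{2,}", r"\1\1", token.lower()).strip(".,!?")
            if word:
                result["words"].append(word)
    return result
```

The `words` list would go to the classifier, while the emoticon polarities and hashtags can be used as extra features.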

Topics Extraction

➔ People: Ben Bernanke, Mariano Rajoy…

➔ Companies, organizations: BBVA, Bankia, Goldman Sachs, Coca-Cola, Federal Reserve…

➔ Economic entities: Ibex35, Dax Xetra…

➔ Locations: London, USA, Paris…

➔ Concepts: risk premium, prime minister, parliamentary address, stock index, economic situation…

➔ Time references: today, yesterday, around 11 in the morning…

➔ Money amounts: 104 dollars, 1 euro…

● 12 main types
● Ontology with > 200 types
  – Instances: BBVA
  – Classes: bank
  – Fictional/historic
● Social TV:
  – Populate custom dictionaries: programs, celebrities, fictional characters
  – Relationships

Entity Linking

● Linking entities to their 'real' representation
● Linking to several LOD (Linked Open Data) sources

API

● NLP and Semantics API
● Multilingual: EN, ES (FR, IT, PT, CA)
● REST service: JSON and XML
● Combines the best of all worlds
  – Deep language analysis
  – Comprehensive resources: linguistic resources and databases
  – Ontology
  – Rule-based methods
  – Statistical and machine learning methods
● High-level semantic API – close to business scenarios
● Core API – building blocks

[Diagram labels: Topics, Sentiment, Classif., Linked Data, POS; Configuration and Linguistic Resources; Media Analysis API; Semantic Publishing API; …]

Organizing the mess. SenseiDB

SenseiDB

● Open source, distributed, realtime, semi-structured database

● From LinkedIn SNA: powers the LinkedIn home page and LinkedIn Signal
● Integrates other open source technologies:
  – Zoie – Lucene-based search engine
  – Bobo – faceted search
  – Apache Kafka – pub-sub system

● http://www.senseidb.com/

SenseiDB features

● 'Hybrid' information retrieval and database
● Full-text search
● Structured and faceted search
● Fast real-time updates with low latency and high throughput
  – pull model
● Single table/collection
● BQL – a SQL-like language
● Eventual consistency
● Distributed – sharding and partitioning
● Hadoop integration

Faceted search

● Amazon.com?
● Identify relevant attributes to use as filters
● Predefined facets
● Define a table schema
● Define fields as facets
  – facet schema
● Efficient – in memory

Faceted search in depth

● Field types
  – Basic: string, int, short, long, float, double, char
  – Complex: date and text (analyzed, term vectors)
● Facet types
  – Simple: 1 row – 1 value
  – Hierarchical (path): c>b>a
  – Range: define ranges
  – Multi: 1 row – n values
  – Histogram: define bins and their size
  – TimeRange: for real-time data
  – Custom
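To make two of the facet types concrete, here is a small sketch of how a row's value could be expanded into facet values for a hierarchical (path) facet and a range facet. It illustrates the idea only and is not SenseiDB code. The field names, the root-to-leaf `>` path order and the bucket label format are assumptions.

```python
from bisect import bisect_right
from collections import Counter


def path_facet_values(path, sep=">"):
    """A hierarchical value counts for itself and for every ancestor prefix."""
    parts = [part.strip() for part in path.split(sep)]
    return [sep.join(parts[:i]) for i in range(1, len(parts) + 1)]


def range_facet_value(number, bounds):
    """Map a number to a half-open bucket label, given sorted lower bounds."""
    index = bisect_right(bounds, number) - 1
    if index < 0:
        return None  # below the first bound
    lower = bounds[index]
    upper = bounds[index + 1] if index + 1 < len(bounds) else "*"
    return f"[{lower} TO {upper})"


# Hypothetical rows: a program category path and a follower count per tweet.
rows = [
    {"category": "tv>cooking>topchef", "followers": 120},
    {"category": "tv>cooking>masterchef", "followers": 5400},
    {"category": "tv>sports", "followers": 80},
]

path_counts = Counter(
    value for row in rows for value in path_facet_values(row["category"])
)
range_counts = Counter(
    range_facet_value(row["followers"], [0, 100, 1000]) for row in rows
)
```

With the path facet a user can drill down from `tv` to `tv>cooking`, while the range facet groups authors into follower buckets.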

Real time indexing

● Data events – add and delete
● Data streams – succession of data events
● Gateways
  – Read data events from data streams
  – File
  – JDBC
  – JMS
  – Kafka
  – Custom: Twitter
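The event model above can be sketched as a gateway that yields add and delete events, plus an index that applies each event as soon as it arrives. The JSON event shape, `RealtimeIndex` and `file_gateway` are hypothetical names for this illustration and do not reflect the SenseiDB gateway interfaces.

```python
import json
from collections import defaultdict


class RealtimeIndex:
    """Toy in-memory inverted index updated one data event at a time."""

    def __init__(self):
        self.docs = {}
        self.postings = defaultdict(set)

    def apply(self, event):
        doc_id = event["id"]
        if doc_id in self.docs:  # drop the stale version first
            for term in self.docs.pop(doc_id)["text"].lower().split():
                self.postings[term].discard(doc_id)
        if event["type"] == "add":
            self.docs[doc_id] = event["data"]
            for term in event["data"]["text"].lower().split():
                self.postings[term].add(doc_id)

    def search(self, term):
        return sorted(self.postings.get(term.lower(), set()))


def file_gateway(lines):
    """Hypothetical file gateway: turns JSON lines into data events."""
    for line in lines:
        if line.strip():
            yield json.loads(line)


stream = [
    '{"type": "add", "id": 1, "data": {"text": "Watching TopChef tonight"}}',
    '{"type": "add", "id": 2, "data": {"text": "TopChef final was great"}}',
    '{"type": "delete", "id": 1}',
]
index = RealtimeIndex()
for event in file_gateway(stream):
    index.apply(event)  # searchable immediately, no batch commit
```

Swapping `file_gateway` for a generator that reads from Kafka or the Twitter stream would leave the indexing loop unchanged, which is the point of the gateway abstraction.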

BQL – search, filter and facets

● Search – common boolean and phrase operators
● Filters – WHERE conditions
● Facets support basic analytics tasks defined on facets
● Relevance
  – Default: recency
  – Ad hoc: may be defined in the query

BQL Query Example on Tweets

SELECT *

WHERE hashtags IN ("TopChef")

BROWSE BY hashtags, user_screen_name, urls
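To show what the query above asks for, here is a rough Python emulation over an in-memory list: keep the rows whose `hashtags` contain one of the requested values, then count the values of each browsed field among the hits. It does not talk to SenseiDB. The function `select_and_browse` and the sample tweets are made up for the illustration.

```python
from collections import Counter


def select_and_browse(rows, field, allowed, browse_by, top=10):
    """Emulate SELECT * WHERE field IN (...) BROWSE BY ... on a list of dicts."""
    allowed = set(allowed)
    hits = [row for row in rows if allowed & set(row.get(field, []))]
    facets = {}
    for name in browse_by:
        counts = Counter()
        for row in hits:
            value = row.get(name, [])
            # Multi-valued fields contribute every value; others contribute one.
            counts.update(value if isinstance(value, list) else [value])
        facets[name] = counts.most_common(top)
    return hits, facets


tweets = [
    {"user_screen_name": "ana", "hashtags": ["TopChef", "tv"], "urls": []},
    {"user_screen_name": "luis", "hashtags": ["TopChef"],
     "urls": ["http://example.com/a"]},
    {"user_screen_name": "ana", "hashtags": ["MasterChef"], "urls": []},
]

hits, facets = select_and_browse(
    tweets, "hashtags", ["TopChef"], ["hashtags", "user_screen_name", "urls"]
)
```

The facet counts are computed only over the matching tweets, which is what makes them useful for finding co-occurring hashtags, active users and shared links.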

Tweet Query example

Query examples

SELECT *

WHERE QUERY IS "relaxing cup of coffee"

Query examples

SELECT *


BROWSE by entities, sentiment

Query examples

SELECT *


AND time IN LAST 2 hours

BROWSE by entities, sentiment

Using facets for semantic search

● Define a facet for:
  – entities/concepts → tweets about Chicote – include all variants + user + hashtags
  – each entity type → navigate by type – popular people
  – classification/sentiment/emotions → positive tweets about Chicote
  – users or hashtags → popular users / popular mentions / correlated hashtags
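One way to get "all variants + user + hashtags" into a single entity facet is to map every surface form to a canonical entity at indexing time. The sketch below assumes a hand-made dictionary. The variants listed, including the user handle, are hypothetical examples and not data from the talk.

```python
import re

# Hypothetical variant dictionary: surface form (lowercase) -> canonical entity.
ENTITY_VARIANTS = {
    "alberto chicote": "Chicote",
    "chicote": "Chicote",
    "@albertochicote": "Chicote",
    "#chicote": "Chicote",
    "#topchef": "Top Chef",
    "top chef": "Top Chef",
}


def entity_facet_values(text, variants=ENTITY_VARIANTS):
    """Return the canonical entities whose variants appear as whole tokens."""
    lowered = text.lower()
    found = set()
    for variant, entity in variants.items():
        # Forbid a word character, @ or # right before, and a word character after.
        pattern = r"(?<![\w@#])" + re.escape(variant) + r"(?!\w)"
        if re.search(pattern, lowered):
            found.add(entity)
    return sorted(found)
```

Storing the result in a multi-valued `entities` field lets a single facet value cover name mentions, the user account and the hashtag.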

Architecture

Scalability

● ZooKeeper to keep track of replicas
● Low indexing latency (no batch commit)
● Low search latency, even with indexing bursts
● Horizontally scalable – shards
● Shards may be replicated N times
● Elastic – nodes can be added to accommodate growth
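Horizontal scaling over shards generally means routing each document to one shard and merging per-shard results at query time. The sketch below assumes hash-based routing and recency ordering. It is a generic scatter-gather illustration, not SenseiDB's partitioning or broker logic, and the function names are hypothetical.

```python
import heapq
import zlib


def shard_for(doc_id, num_shards):
    """Stable hash routing: the same id always lands on the same shard."""
    return zlib.crc32(str(doc_id).encode("utf-8")) % num_shards


def scatter_gather(shards, predicate, limit):
    """Ask every shard for its newest matches, then merge the partial lists."""
    partials = [
        sorted((doc for doc in shard if predicate(doc)),
               key=lambda doc: doc["time"], reverse=True)[:limit]
        for shard in shards
    ]
    merged = heapq.merge(*partials, key=lambda doc: doc["time"], reverse=True)
    return list(merged)[:limit]


# Distribute ten toy documents over three shards.
shards = [[] for _ in range(3)]
for i in range(10):
    shards[shard_for(i, 3)].append({"id": i, "time": i})

newest_even = scatter_gather(shards, lambda doc: doc["time"] % 2 == 0, limit=3)
```

Because each shard returns at most `limit` hits, the merge step stays cheap no matter how many documents a shard holds.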

Other features

● Batch indexing via Hadoop – ETL
● Simple analytics by batch indexing
● Customized relevance models
● MapReduce functions over facets
  – Sum, avg, min, max
  – DistinctCount
● Activity values – volatile values – likes
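The aggregate functions over facets can be pictured as a map step that builds partial aggregates per partition and a reduce step that merges them. This is a plain Python sketch of that pattern, not the SenseiDB MapReduce API. The field names and sample rows are invented.

```python
def map_partition(rows, facet, metric, distinct_field):
    """Build partial aggregates per facet value for one partition."""
    partial = {}
    for row in rows:
        value = row[metric]
        acc = partial.get(row[facet])
        if acc is None:
            acc = {"sum": 0, "count": 0, "min": value, "max": value,
                   "distinct": set()}
            partial[row[facet]] = acc
        acc["sum"] += value
        acc["count"] += 1
        acc["min"] = min(acc["min"], value)
        acc["max"] = max(acc["max"], value)
        acc["distinct"].add(row[distinct_field])
    return partial


def reduce_partials(partials):
    """Merge partial aggregates and derive avg and distinct count."""
    merged = {}
    for partial in partials:
        for key, acc in partial.items():
            if key not in merged:
                merged[key] = {**acc, "distinct": set(acc["distinct"])}
                continue
            total = merged[key]
            total["sum"] += acc["sum"]
            total["count"] += acc["count"]
            total["min"] = min(total["min"], acc["min"])
            total["max"] = max(total["max"], acc["max"])
            total["distinct"] |= acc["distinct"]
    return {
        key: {"sum": t["sum"], "avg": t["sum"] / t["count"], "min": t["min"],
              "max": t["max"], "distinct_count": len(t["distinct"])}
        for key, t in merged.items()
    }


partition_a = [{"sentiment": "positive", "retweets": 3, "user": "ana"},
               {"sentiment": "negative", "retweets": 1, "user": "luis"}]
partition_b = [{"sentiment": "positive", "retweets": 5, "user": "ana"},
               {"sentiment": "positive", "retweets": 1, "user": "eva"}]

stats = reduce_partials(
    map_partition(part, "sentiment", "retweets", "user")
    for part in (partition_a, partition_b)
)
```

Keeping the set of distinct users in the partial result is what allows an exact distinct count after merging; sums and counts alone would not be enough.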

Lessons learned

Conclusions

● SenseiDB is fast at searching/indexing – no variance

● A couple of nodes are enough to handle the Spanish Social TV volume
● We love the query language (BQL) and its time operators
● Supports real-time exploration

Limitations

● SenseiDB
  – Documentation is still scarce
  – Single table model – flat users and reputation
  – Tricks needed to store complex facets
  – Manageability
● Social TV Tracker
  – Group and disambiguate entity mentions across tweets
  – Relevance is tricky – ad hoc

Comparison

● Solr
  – Near-real-time updates – soft commits
  – Simple facets
  – Popular – great tools
● Storm, S4?
● ElasticSearch
  – Batch/real-time commits
  – Online facets
  – Aggregation after facets
  – Much better plugin system

Thanks and QA

@zdepablo #bigdata #socialtv #2ndscreen

#nlp @textalytics
