Real time semantic search engine for social tv streams
Textalytics: Meaning-as-a-Service
Real Time Semantic Search for Social TV streams
César de Pablo Sánchez, Daedalus
8/11 2013
Big Data Spain (Madrid)
The plot
1. What's Social TV?
2. Monitoring Social TV conversations. A preliminary architecture
3. Understanding the buzz. Textalytics
4. Organizing the mess. SenseiDB
5. Lessons learned
Social TV
Second screen
Transmedia
Not just TV
Sports
Elections
Alerts
Big Data?
Volume
Velocity
Variety
Users?
Viewers
Channels
Brands
Viewers?
Participate
Vote
Influence
Confirm beliefs
Keep updated
Belong to group
Channels?
Understand
React
Measure
Brands?
Select programs
Reputation
Find an audience
Example from Bluefin Labs
Monitoring Social TV conversations.
The architecture
[Architecture diagram. Labels: tracker, gateway, HTTP stream, pipeline, pull, EPG]
Understanding the buzz. Textalytics API
Core API
Topics Extraction
Text Classification
Sentiment Analysis
Language Identification
Lemmatization, POS and Parsing
Speech Recognition and Speaker Diarization
Semantic Linked Data Viewer
Spell, Grammar and Style
User Demographics
Language identification
● Given a text, identify a list of candidate languages, or just one
● 62 languages
● Uses language n-gram signatures
● Social TV
  – Filter: a TV hashtag often implies the language
  – Hashtags are sometimes multilingual, but that is not relevant for users
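The n-gram signature idea above can be sketched as follows. This is a minimal illustration, not the Textalytics implementation: it assumes character trigrams and a rank-distance comparison, and the two training sentences are toy data (real signatures need large corpora per language). The function names are hypothetical.

```python
from collections import Counter

def ngram_profile(text, n=3, top=300):
    """Return the most frequent character n-grams of `text`, most frequent first."""
    padded = f" {' '.join(text.lower().split())} "
    counts = Counter(padded[i:i + n] for i in range(len(padded) - n + 1))
    return [gram for gram, _ in counts.most_common(top)]

def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams unknown to the language get a fixed penalty."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    penalty = len(lang_profile)
    return sum(abs(i - rank[g]) if g in rank else penalty
               for i, g in enumerate(doc_profile))

def rank_languages(text, signatures):
    """Return language codes ordered from best to worst match."""
    doc = ngram_profile(text)
    return sorted(signatures, key=lambda lang: out_of_place(doc, signatures[lang]))

# Toy training samples (assumption: illustrative only).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat is watching the show tonight",
    "es": "el perro y el gato están viendo el programa de la tele esta noche con la familia",
}
SIGNATURES = {lang: ngram_profile(sample) for lang, sample in SAMPLES.items()}

print(rank_languages("the cat is watching the show", SIGNATURES))
```

Returning the whole ranked list covers the "language list" case; taking the first element covers the "just one" case.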
Text Classification
● Theme labels – IPTC
● Relevance
● Multiple labels
● Tailored for short text (tweets)
● Define your own models and categories
● Social TV – filter on topic content
Sentiment analysis
● Document-level classification
● Positive/Negative/Neutral
● Subjective/Objective
● Tailored for short texts
● Handles Twitter jargon – RT, @, hashtags, emoticons, spelling errors, disfluencies
● Other features
  – Entity-level sentiment
  – Segment-level sentiment
Topics Extraction
➔ People: Ben Bernanke, Mariano Rajoy…
➔ Companies, organizations: BBVA, Bankia, Goldman Sachs, Coca-Cola, Federal Reserve…
➔ Economic entities: Ibex35, Dax Xetra…
➔ Locations: London, USA, Paris…
➔ Concepts: risk premium, President of the Government, parliamentary speech, stock index, economic situation…
➔ Time references: today, yesterday, around 11 in the morning…
➔ Money amounts: 104 dollars, 1 euro…
● 12 main types
● Ontology with > 200 types
● Instances – BBVA
● Classes – bank
● Fictional/historic
● Social TV:
  – Populate custom dictionaries – programs, celebrities, fictional characters
  – Relationships
Entity Linking
● Linking entities to their 'real' representation
● Linking to several LOD (Linked Open Data) sources
API
● NLP and Semantics API
● Multilingual: EN, ES (FR, IT, PT, CA)
● REST service: JSON and XML
● Combines the best of all worlds
  – Deep language analysis
  – Comprehensive resources: linguistic resources and databases
  – Ontology
  – Rule-based methods
  – Statistical and machine learning methods
● High-level semantic API – close to business scenarios
● Core API – building blocks
[API diagram. Labels: Topics, Sentiment, Classification, Linked Data, POS; Configuration and Linguistic Resources; Media Analysis API, Semantic Publishing API, …]
Organizing the mess. SenseiDB
SenseiDB
● Open source, distributed, real-time, semi-structured database
● From LinkedIn SNA: powers the LinkedIn homepage and LinkedIn Signal
● Integrates other open source technologies:
  – Zoie – Lucene-based search engine
  – Bobo – faceted search
  – Apache Kafka – pub-sub system
● http://www.senseidb.com/
SenseiDB features
● 'Hybrid' information retrieval and database system
● Full-text search
● Structured and faceted search
● Fast real-time updates with low latency and high throughput
  – pull model
● Single table/collection
● BQL – a SQL-like language
● Eventual consistency
● Distributed – sharding and partitioning
● Hadoop integration
Faceted search
● Amazon.com?
● Identify relevant attributes to use as filters
● Predefined facets
● Define a table schema
● Define fields as facets
  – facet schema
● Efficient – in memory
Faceted search in depth
● Field types
  – Basic: string, int, short, long, float, double, char
  – Complex: date and text (analyzed, term vectors)
● Facet types
  – Simple: 1 row – 1 value
  – Hierarchical: path c>b>a
  – Range: define ranges
  – Multi: 1 row – n values
  – Histogram: define bins and their size
  – TimeRange: for real-time data
  – Custom
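To make the facet types above concrete, here is a small in-memory sketch of how simple, multi, range and hierarchical facet counts could be computed over a set of rows. It is an illustration only: the function names and the tweet fields are hypothetical, and SenseiDB itself declares facets in a schema rather than computing them this way.

```python
from collections import Counter

def simple_facet(rows, field):
    """Simple facet: each row holds exactly one value."""
    return Counter(row[field] for row in rows)

def multi_facet(rows, field):
    """Multi facet: each row holds a list of values."""
    return Counter(value for row in rows for value in row[field])

def range_facet(rows, field, ranges):
    """Range facet: count rows whose value falls in each [low, high) range."""
    counts = Counter()
    for row in rows:
        for low, high in ranges:
            if low <= row[field] < high:
                counts[(low, high)] += 1
    return counts

def path_facet(rows, field, sep=">"):
    """Hierarchical facet: count a row under every prefix of its path."""
    counts = Counter()
    for row in rows:
        parts = row[field].split(sep)
        for depth in range(1, len(parts) + 1):
            counts[sep.join(parts[:depth])] += 1
    return counts

# Hypothetical rows; field names are assumptions for this example.
TWEETS = [
    {"lang": "es", "hashtags": ["TopChef", "tv"], "retweets": 3, "topic": "tv>cooking"},
    {"lang": "es", "hashtags": ["TopChef"], "retweets": 120, "topic": "tv>cooking"},
    {"lang": "en", "hashtags": [], "retweets": 0, "topic": "tv>sports"},
]

print(simple_facet(TWEETS, "lang"))
print(multi_facet(TWEETS, "hashtags"))
print(range_facet(TWEETS, "retweets", [(0, 10), (10, 1000)]))
print(path_facet(TWEETS, "topic"))
```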
Real time indexing
● Data events – add and delete
● Data streams – succession of data events
● Gateways
  – Read data events from data streams
  – File
  – JDBC
  – JMS
  – Kafka
  – Custom: Twitter
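The event, stream and gateway vocabulary above can be sketched in a few lines. This is not SenseiDB's gateway API (which is Java); the `DataEvent` class, the JSON line format and the function names are assumptions chosen to show the idea: a gateway turns a raw stream into add/delete events, and the index applies them in order.

```python
import json
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional

@dataclass
class DataEvent:
    op: str                      # "add" or "delete"
    doc_id: str
    doc: Optional[dict] = None   # payload, only present for "add"

def json_lines_gateway(lines: Iterable[str]) -> Iterator[DataEvent]:
    """Hypothetical file-style gateway: one JSON-encoded event per line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        raw = json.loads(line)
        yield DataEvent(raw["op"], raw["id"], raw.get("doc"))

def apply_events(index: dict, events: Iterable[DataEvent]) -> dict:
    """Apply events in order to an in-memory index keyed by document id."""
    for event in events:
        if event.op == "add":
            index[event.doc_id] = event.doc
        elif event.op == "delete":
            index.pop(event.doc_id, None)
        else:
            raise ValueError(f"unknown event type: {event.op!r}")
    return index

# A data stream is just a succession of events.
STREAM = [
    '{"op": "add", "id": "1", "doc": {"text": "Watching #TopChef"}}',
    '{"op": "add", "id": "2", "doc": {"text": "Goal!"}}',
    '{"op": "delete", "id": "2"}',
]
index = apply_events({}, json_lines_gateway(STREAM))
print(index)
```

A custom gateway, such as the Twitter one mentioned above, would replace `json_lines_gateway` while the rest stays the same.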
BQL – search, filter and facets
● Search – common boolean and phrase operators
● Filters – WHERE conditions
● Facets – support basic analytics tasks defined on facets
● Relevance
  – Default – recency
  – Ad hoc – may be defined in the query
BQL Query Example on Tweets
SELECT *
WHERE hashtags IN ("TopChef")
BROWSE BY
hashtags, user_screen_name, urls
Tweet Query example
Query examples
SELECT *
WHERE QUERY IS "relaxing cup of coffee"
Query examples
SELECT *
WHERE QUERY IS "relaxing cup of coffee"
BROWSE BY entities, sentiment
Query examples
SELECT *
WHERE QUERY IS "relaxing cup of coffee"
AND time IN LAST 2 hours
BROWSE BY entities, sentiment
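The queries above follow a regular shape: an optional full-text clause, optional filters, an optional time window, and the facets to browse. A small helper that composes such strings might look like the sketch below. `build_bql` is a hypothetical function, not part of SenseiDB, and it only emits the clauses that appear on these slides; it refuses text containing double quotes instead of guessing at escaping rules.

```python
def build_bql(query=None, filters=(), last=None, browse_by=()):
    """Compose a BQL string from the clauses shown on the slides.

    query     -- full-text query, rendered as QUERY IS "..."
    filters   -- (field, values) pairs, rendered as field IN ("...", ...)
    last      -- time window such as "2 hours", rendered as time IN LAST ...
    browse_by -- facet names, rendered as BROWSE BY a, b
    """
    def quote(text):
        if '"' in text:
            raise ValueError("double quotes are not handled in this sketch")
        return f'"{text}"'

    conditions = []
    if query is not None:
        conditions.append(f"QUERY IS {quote(query)}")
    for field, values in filters:
        conditions.append(f"{field} IN ({', '.join(quote(v) for v in values)})")
    if last is not None:
        conditions.append(f"time IN LAST {last}")

    parts = ["SELECT *"]
    if conditions:
        parts.append("WHERE " + " AND ".join(conditions))
    if browse_by:
        parts.append("BROWSE BY " + ", ".join(browse_by))
    return " ".join(parts)

print(build_bql(query="relaxing cup of coffee", last="2 hours",
                browse_by=["entities", "sentiment"]))
```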
Using facets for semantic search
● Define a facet for:
  – entities/concepts → tweets about Chicote – include all variants + user + hashtags
  – each entity type → navigate by type – popular people
  – classification/sentiment/emotions → positive tweets about Chicote
  – users or hashtags → popular users / popular mentions / correlated hashtags
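One way to read the list above: the semantic annotations of each tweet are flattened into facet fields of a single document, so that "positive tweets about Chicote" becomes a plain filter on two facets. The sketch below assumes a hypothetical annotation format (entities with `surface`, `canonical` and `type` keys, plus a `sentiment` label); it is not the Textalytics response format, and the field names are illustrative.

```python
def to_flat_document(tweet, annotations):
    """Flatten one tweet and its annotations into facet fields."""
    entities = annotations.get("entities", [])
    doc = {
        "id": tweet["id"],
        "text": tweet["text"],
        "user_screen_name": tweet["user"],
        "hashtags": tweet.get("hashtags", []),
        # All surface variants collapse to one canonical entity name.
        "entities": sorted({e["canonical"] for e in entities}),
        "sentiment": annotations.get("sentiment", "none"),
    }
    # One multi-valued facet per entity type, e.g. "entities_person".
    for entity in entities:
        field = f"entities_{entity['type']}"
        values = doc.setdefault(field, [])
        if entity["canonical"] not in values:
            values.append(entity["canonical"])
    return doc

def positive_about(docs, entity):
    """Ids of documents that mention `entity` with positive sentiment."""
    return [d["id"] for d in docs
            if entity in d["entities"] and d["sentiment"] == "positive"]

# Hypothetical tweets and annotations.
DOCS = [
    to_flat_document(
        {"id": "1", "text": "Chicote is great tonight", "user": "ana", "hashtags": ["TopChef"]},
        {"sentiment": "positive",
         "entities": [{"surface": "Chicote", "canonical": "Alberto Chicote", "type": "person"}]},
    ),
    to_flat_document(
        {"id": "2", "text": "#chicote too harsh", "user": "luis", "hashtags": ["chicote"]},
        {"sentiment": "negative",
         "entities": [{"surface": "#chicote", "canonical": "Alberto Chicote", "type": "person"}]},
    ),
]

print(positive_about(DOCS, "Alberto Chicote"))
```

Keeping one flat document per tweet matches the single-table model described earlier.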
Architecture
Scalability
● ZooKeeper to keep track of replicas
● Low indexing latency (no batch commit)
● Low search latency – even with indexing bursts
● Horizontally scalable – shards
● Shards may be replicated N times
● Elastic – nodes can be added to accommodate growth
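The shard and replica layout described above can be illustrated with a toy in-memory cluster. This is a sketch under stated assumptions, not how SenseiDB routes data: documents are assigned to shards by a CRC32 hash of their id, every replica of a shard receives the same writes, and a search reads one replica per shard and merges the hits. The class and function names are hypothetical.

```python
import zlib

def shard_for(doc_id, num_shards):
    """Stable shard assignment from the document id (assumed routing rule)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards

class Cluster:
    def __init__(self, num_shards, replicas):
        # shards[s][r] is replica r of shard s: a dict of doc_id -> text
        self.shards = [[{} for _ in range(replicas)] for _ in range(num_shards)]
        self._turn = 0

    def index(self, doc_id, text):
        """Write the document to every replica of its shard."""
        for replica in self.shards[shard_for(doc_id, len(self.shards))]:
            replica[doc_id] = text

    def search(self, term):
        """Query one replica per shard (rotating between calls) and merge hits."""
        self._turn += 1
        hits = []
        for replicas in self.shards:
            replica = replicas[self._turn % len(replicas)]
            hits.extend(doc_id for doc_id, text in replica.items()
                        if term.lower() in text.lower())
        return sorted(hits)

cluster = Cluster(num_shards=3, replicas=2)
for i in range(10):
    cluster.index(f"tweet-{i}", "TopChef" if i % 2 == 0 else "football")

print(cluster.search("topchef"))
```

Adding shards spreads the index over more nodes, while adding replicas spreads the query load.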
Other features
● Batch indexing via Hadoop – ETL
● Simple analytics by batch indexing
● Customized relevance models
● MapReduce functions over facets
  – Sum, avg, min, max
  – DistinctCount
● Activity values – volatile values – likes
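The MapReduce functions over facets listed above can be pictured as a two-step computation: each shard produces partial aggregates per facet value (map), and the partials are merged into final figures (reduce). The sketch below shows that pattern for sum, avg, min and max; it is an illustration with hypothetical function and field names, not SenseiDB's own API. DistinctCount would follow the same pattern by merging sets instead of numbers.

```python
def map_shard(rows, facet, metric):
    """Map step: per-shard partial aggregates (sum, count, min, max) per facet value."""
    partial = {}
    for row in rows:
        key, value = row[facet], row[metric]
        if key in partial:
            total, count, low, high = partial[key]
            partial[key] = (total + value, count + 1, min(low, value), max(high, value))
        else:
            partial[key] = (value, 1, value, value)
    return partial

def reduce_partials(partials):
    """Reduce step: merge the partial aggregates from all shards."""
    merged = {}
    for partial in partials:
        for key, (total, count, low, high) in partial.items():
            if key in merged:
                t, c, lo, hi = merged[key]
                merged[key] = (t + total, c + count, min(lo, low), max(hi, high))
            else:
                merged[key] = (total, count, low, high)
    # avg is derived at the end from the merged sum and count.
    return {key: {"sum": t, "avg": t / c, "min": lo, "max": hi}
            for key, (t, c, lo, hi) in merged.items()}

# Hypothetical rows held by two shards.
SHARD_A = [{"sentiment": "positive", "retweets": 4},
           {"sentiment": "negative", "retweets": 1}]
SHARD_B = [{"sentiment": "positive", "retweets": 10}]

result = reduce_partials(map_shard(rows, "sentiment", "retweets")
                         for rows in (SHARD_A, SHARD_B))
print(result)
```

Carrying the count alongside the sum is what lets the average be merged correctly across shards.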
Lessons learned
Conclusions
● SenseiDB is fast at searching and indexing – no variance
● A couple of nodes are enough to handle the Spanish Social TV volume
● We love the query language (BQL) and its time operators
● Supports real-time exploration
Limitations
● SenseiDB
  – Documentation is still scarce
  – Single-table model – flat users and reputation
  – Tricks to store complex facets
  – Manageability
● Social TV Tracker
  – Group and disambiguate entity mentions across tweets
  – Relevance is tricky – ad hoc
Comparison
● Solr
  – Near-real-time updates – soft commits
  – Simple facets
  – Popular – great tools
● Storm, S4?
● ElasticSearch
  – Batch/real-time commits
  – Online facets
  – Aggregation after facets
  – Much better plugin system
Thanks and Q&A
@zdepablo #bigdata #socialtv #2ndscreen #nlp @textalytics