Miha Grčar (Department of Knowledge Technologies, Jožef Stefan Institute)
Achim Klein (University of Hohenheim)
Core Information Processing Technologies
Technical Presentation & Demos
Luxembourg, November 2011
Technical WPs
Architecture diagram (work packages):
Management: WP10
Architecture, Integration & Scaling Strategy: WP2 & WP7
Dissemination & Exploitation: WP9
WP3, WP4, WP6: Data Acquisition, Ontology Infrastructure, Information Extraction, Sentiment Analysis, Decision Support Infrastructure, Domain-independent GUI (Open Source)
WP5: Information Integration; Data, Information & Knowledge Base
WP1 & WP8: UC#1 Market Surveillance, UC#2 Reputational Risk Management, UC#3 Online Retail Brokerage
We are here: Data Acquisition
FIRST Y1 Review Meeting
Data acquisition pipeline (Dacq)
Pipeline diagram: RSS readers (one reader per site) → load balancing → parallel processing pipelines
Each processing pipeline: Boilerplate remover → Language detector → Duplicate detector → Natural language preproc. → Semantic annotator → ZeroMQ emitter
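The topology in the diagram can be sketched in a few lines of Python. This is a minimal illustration, not the Dacq implementation: the stage functions are stand-ins that only record that they ran, the emitter is a plain callback instead of a ZeroMQ socket, and round-robin dispatch is an assumed load-balancing policy. All names (`make_stage`, `Pipeline`, `run`) are hypothetical.

```python
from itertools import cycle

# Stage order as shown in the diagram; the emitter is handled separately.
STAGE_NAMES = [
    "boilerplate_remover",
    "language_detector",
    "duplicate_detector",
    "nl_preprocessor",
    "semantic_annotator",
]

def make_stage(name):
    """Return a stand-in stage that only records that it ran."""
    def stage(doc):
        doc.setdefault("trace", []).append(name)
        return doc  # a real stage could return None to drop the document
    return stage

class Pipeline:
    def __init__(self, stages, emit):
        self.stages = stages
        self.emit = emit  # stand-in for the ZeroMQ emitter (assumption)

    def process(self, doc):
        for stage in self.stages:
            doc = stage(doc)
            if doc is None:
                return  # document was dropped by a stage
        self.emit(doc)

def run(readers, pipelines):
    """Dispatch documents from all readers over the pipelines, round-robin."""
    targets = cycle(pipelines)
    for reader in readers:
        for doc in reader:
            next(targets).process(doc)

emitted = []
pipelines = [
    Pipeline([make_stage(n) for n in STAGE_NAMES], emitted.append)
    for _ in range(3)
]
# One "reader" per site; here each reader is just a list of documents.
readers = [
    [{"site": "site-a", "text": "first item"},
     {"site": "site-a", "text": "second item"}],
    [{"site": "site-b", "text": "third item"}],
]
run(readers, pipelines)
```

Because every pipeline runs the same stages, adding capacity in this sketch only means adding another `Pipeline` instance to the list.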
Data acquisition pipeline (Dacq)
Demo video (3:20)
Boilerplate removal
Demo video (1:30)
Language detection
Motivation: language-specific text analysis components
Relatively simple problem
Solutions based on word or character sequences (language models)
Side effects: removes "garbage" and can be used to identify the code page
Our implementation is based on frequencies of character sequences
Demo video (0:45)
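As a rough illustration of detection based on character-sequence frequencies, the sketch below builds a character trigram frequency profile per language and picks the profile closest to the input text. The trigram length, the cosine similarity measure, the tiny training strings and the rejection threshold are all assumptions chosen for the example; the slides do not specify these details.

```python
from collections import Counter
from math import sqrt

def ngram_profile(text, n=3):
    """Relative frequencies of character n-grams in whitespace-normalized text."""
    text = " ".join(text.lower().split())
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()}

def cosine(p, q):
    """Cosine similarity between two sparse frequency profiles."""
    dot = sum(v * q.get(gram, 0.0) for gram, v in p.items())
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

# Toy training data (assumption); real profiles would come from large corpora.
TRAINING = {
    "en": "the market is expected to rise and the investors are confident about the shares",
    "de": "der markt wird voraussichtlich steigen und die anleger sind zuversichtlich",
}
PROFILES = {lang: ngram_profile(text) for lang, text in TRAINING.items()}

def detect(text, min_similarity=0.1):
    """Return the best-matching language, or None if nothing is similar enough."""
    profile = ngram_profile(text)
    lang, score = max(
        ((lang, cosine(profile, ref)) for lang, ref in PROFILES.items()),
        key=lambda pair: pair[1],
    )
    return lang if score >= min_similarity else None
```

Returning `None` below the threshold is how this sketch reproduces the "removes garbage" side effect: text that resembles no known language profile is rejected.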
Near-duplicate detection
Why is this a difficult problem?
We are dealing with millions of documents: we cannot afford to compare every document with every document
We are also looking for near-duplicates, not only exact matches
Overlooked boilerplate "produces" false near-duplicates
Demo video (1:00)
Near-duplicate detection
Existing approaches like SimHash, shingling and sketching, SpotSigs…
Apart from SpotSigs, they require "clean" documents
Hard to interpret similarity value (how many characters, words, sentences?)
Developing a novel solution to remove boilerplate and detect duplicates [with clear interpretation] in the same framework
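To make one of the existing approaches concrete, here is a minimal sketch of word shingling with Jaccard similarity, plus an inverted index so that a new document is compared only with documents sharing at least one shingle rather than with every indexed document. It illustrates shingling only, not the novel solution mentioned above; the shingle size, the threshold and the `NearDuplicateIndex` class are assumptions made for the example.

```python
from collections import defaultdict

def shingles(text, k=3):
    """Set of k-word shingles; short texts yield a single shorter shingle."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

class NearDuplicateIndex:
    def __init__(self, k=3, threshold=0.5):
        self.k = k
        self.threshold = threshold
        self.docs = {}                    # doc_id -> shingle set
        self.postings = defaultdict(set)  # shingle -> ids of docs containing it

    def add(self, doc_id, text):
        """Return ids of indexed near-duplicates, then index the document."""
        sh = shingles(text, self.k)
        # Only documents sharing at least one shingle are candidates.
        candidates = set()
        for s in sh:
            candidates |= self.postings.get(s, set())
        matches = sorted(
            other for other in candidates
            if jaccard(sh, self.docs[other]) >= self.threshold
        )
        self.docs[doc_id] = sh
        for s in sh:
            self.postings[s].add(doc_id)
        return matches

index = NearDuplicateIndex()
first = index.add("a", "Microsoft shares rose sharply after the company reported strong quarterly earnings")
second = index.add("b", "Microsoft shares rose sharply after the company reported strong quarterly results")
third = index.add("c", "The central bank left interest rates unchanged on Thursday")
```

The Jaccard value here is a ratio of shared shingles, which shows the interpretation problem raised on the slide: it does not directly say how many characters, words or sentences two documents have in common.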
Technical WPs
We are here: Ontology Infrastructure, Information Extraction
FIRST ontology
Classes (diagram): SentimentObject, FinancialInstrument, Index, Stock_Index, Stock, Company, Country
Instances (diagram): Seed indices, Constituents (stocks), Companies, Countries
FIRST ontology
:NASDAQ_100 a :Stock_Index ;
    rdfs:label "NASDAQ-100" .
:MICROSOFT a :Stock ;
    rdfs:label "MICROSOFT CORP COM USD0.00000625" ;
    :memberOf :NASDAQ_100 .
:MICROSOFT_CORP a :Company ;
    rdfs:label "Microsoft Corp." ;
    :issues :MICROSOFT .
:USA a :Country ;
    rdfs:label "USA" .
:MICROSOFT_CORP :locatedIn :USA .
:MICROSOFT_CORP :hasGazetteer :MICROSOFT_CORP_Gazetteer .
:MICROSOFT_CORP_Gazetteer a :Gazetteer ;
    :hasTerm "Microsoft Corp" ;
    :hasTerm "Microsoft Corporation" ;
    :hasStopWord "CORP" ;
    :hasStopWord "CORPORATION" .
Microsoft Corporation is engaged in developing, licensing and supporting a range of software products and services. Microsoft also designs and sells hardware, and delivers online advertising to the customers.
Microsoft Corp
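A gazetteer like the one above can drive a simple term matcher. The sketch below is an assumption-laden illustration, not the project's annotator: it matches terms case-insensitively on word boundaries, prefers longer terms, and treats stop words only as tokens that must never trigger a match on their own. The exact role of `:hasStopWord` in the FIRST annotator is not stated on the slides, and the `GAZETTEER` dictionary layout and `annotate` function are hypothetical.

```python
import re

# Hand-copied from the ontology fragment; the dict layout is hypothetical.
GAZETTEER = {
    "uri": ":MICROSOFT_CORP",
    "terms": ["Microsoft Corp", "Microsoft Corporation"],
    "stop_words": ["CORP", "CORPORATION"],
}

def annotate(text, gazetteer):
    """Return sorted (start, end, matched_text, uri) tuples for gazetteer terms."""
    stop = {w.lower() for w in gazetteer["stop_words"]}
    hits = []
    # Longest terms first, so the most specific match claims a span.
    for term in sorted(gazetteer["terms"], key=len, reverse=True):
        tokens = term.split()
        # Assumption: a term consisting only of stop words never matches.
        if all(t.lower() in stop for t in tokens):
            continue
        pattern = r"\b" + r"\s+".join(re.escape(t) for t in tokens) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            overlaps = any(m.start() < end and start < m.end()
                           for start, end, _, _ in hits)
            if not overlaps:
                hits.append((m.start(), m.end(), m.group(0), gazetteer["uri"]))
    return sorted(hits)

text = ("Microsoft Corporation is engaged in developing, licensing and "
        "supporting a range of software products and services.")
annotations = annotate(text, GAZETTEER)
```

In this sketch the bare word "Microsoft" is not annotated because it is not listed as a term; a fuller matcher could presumably use the stop words to derive such shorter variants.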
FIRST ontology
Classes (diagram): Sentiment Object, Company, Financial Instrument, Indicator, MacroIndicator, MicroIndicator, Technical, Fundamental, Feature, Volatility, Price, Reputation, CorrelationDefinition, OrientationPhrase
Properties (diagram): objectHasCorrelationDefinition, correlationDefinitionInfluencesObject, featureHasCorrelationDefinition, correlationDefinitionInfluencesFeature, indicatorHasCorrelationDefinition, correlationDefinitionInfluencesIndicator
Annotation pipeline
Pipeline diagram: RSS readers feeding processing pipelines (Boilerplate remover, Language detector, Duplicate detector, Natural language preproc., Semantic annotator, ZeroMQ emitter), labeled: Ontology-based semantic annotation
Demo video (3:00)
Technical WPs
We are here: Sentiment Analysis
Sentiment Analysis
Object: Sentiment in financial web texts
Problem: Classification of sentiment orientation with respect to expected future …
  price change of financial instruments
  volatility change of financial instruments
  reputation change of companies
Approach: Knowledge-based sentiment classification
  Starting at the sentence level
  Specific to features of objects (e.g., reputation of a company)
Example
Ambiguity: "The low clarity of messages implies that quite often people would be likely to disagree on the classification" [Das and Chen 2007].
Identification and differentiation of objects (and features)
Relationships of indicators (e.g., earnings) and objects
Short term: uptrend
Support for the SPX remains at 848 and then 789, with resistance at 912 and then 935. Short term momentum was overbought during the rally early in the week and is now displaying a positive divergence at friday's lows. Should the market fail to hold this pivot (SPX 840) in the days and weeks ahead the uptrend is likely over.
Long term: bear market
The Cycle wave bear market of October 2007 continues. Thus far, equity markets worldwide have declined on average about 50%. The opportunity still remains for the US and World economies to avoid a devastating Supercycle bear market like that of 1929-1932.
http://caldaroew.spaces.live.com/Blog/cns!D2CB8C5EBA2ADE86!27847.entry
Manual Sentiment Annotation
Topic
FIRST Knowledge-based Sentiment Analysis Approach
1. Identify all sentiment objects and features (rules, ontology)
2. Extract all sentence-level sentiments (rules, ontology)
3. Classify sentiment orientation {positive, negative} for all sentence-level sentiments
4. Aggregate
   All sentiments in one document → Scoring → Document-level sentiment score
   All sentiment scores for a given day → Averaging → Sentiment Index [-1,1]
Example sentence: Support for the SPX remains at 848 and then 789, with resistance at 912 …
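The aggregation step can be sketched as follows. The slides name the steps (scoring per document, averaging per day, index in [-1,1]) but not the scoring formula, so the formula here is an assumption: the document score is the mean of sentence orientations encoded as +1 and -1, which keeps both the score and the daily index within [-1,1]. The function names are hypothetical.

```python
def document_score(orientations):
    """Mean of sentence-level orientations (+1 positive, -1 negative).

    Returns None for a document without any sentiment sentences.
    The scoring formula is an assumption, not taken from the slides.
    """
    if not orientations:
        return None
    return sum(orientations) / len(orientations)

def sentiment_index(doc_scores):
    """Average of the document-level scores for one day, in [-1, 1]."""
    scores = [s for s in doc_scores if s is not None]
    if not scores:
        return None
    return sum(scores) / len(scores)

# Three documents published on the same day; the third has no sentiment sentences.
day_scores = [
    document_score([+1, -1, -1, -1]),
    document_score([+1, +1]),
    document_score([]),
]
index_value = sentiment_index(day_scores)
```

Documents without sentiment sentences are skipped rather than counted as neutral, which is one possible design choice among several.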
Sentence-level Sentiment Classification
a) directly
   Example: "I expect the S&P 500 to rise"
   positive sentiment
   Addressed by rules
b) indirectly, via an indicator
   Example: "I think U.S. interest rates will rise"
   negative sentiment
   Addressed by ontology
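The two cases can be illustrated with a toy classifier. Everything in it is a simplifying assumption: the orientation phrases, the object list and the indicator table are made up for the two example sentences, and the indicator table stands in for correlation definitions that would come from the ontology (here, rising interest rates are taken to imply a negative outlook for the price).

```python
import re

# Hypothetical lexicon; real phrases would come from rules and the ontology.
ORIENTATION_PHRASES = {"rise": "positive", "fall": "negative"}
OBJECTS = ["S&P 500"]
# Hypothetical stand-in for correlation definitions:
# True means the indicator moves inversely to the object's price.
INDICATOR_IS_INVERSE = {"interest rates": True}

def contains(text, phrase):
    """Case-insensitive match of a phrase that is not embedded in a longer word."""
    pattern = r"(?<!\w)" + re.escape(phrase) + r"(?!\w)"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

def flip(orientation):
    return "negative" if orientation == "positive" else "positive"

def classify(sentence):
    """Return 'positive', 'negative', or None if no sentiment is found."""
    orientation = next(
        (o for phrase, o in ORIENTATION_PHRASES.items() if contains(sentence, phrase)),
        None,
    )
    if orientation is None:
        return None
    # a) direct: the sentence talks about the object itself (rules).
    if any(contains(sentence, obj) for obj in OBJECTS):
        return orientation
    # b) indirect: the sentence talks about an indicator (ontology lookup).
    for indicator, inverse in INDICATOR_IS_INVERSE.items():
        if contains(sentence, indicator):
            return flip(orientation) if inverse else orientation
    return None
```

The indirect case is where the ontology matters: the sentence itself is about a rise, and only the correlation between the indicator and the object turns it into a negative sentiment.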
Example Text: S&P 500
http://business.financialpost.com/2011/10/04/economic-uncertainty-could-fan-volatility/
Oct 4, 2011 – 3:24 PM ET
The fourth quarter began on Monday with the broad S&P 500 on the precipice of a bear market and investors lacking confidence in either European or U.S. policymakers being able to stem the disquiet surrounding the debt crisis.
Wall Street typically defines a bear market as a drop of 20 percent or more from a recent high.
Volatility is at its most persistently elevated since the financial crisis of 2008, as measured by the popular VIX, or CBOE Volatility Index. Barring a knock-out U.S. earnings period in the next month, it could remain high, and investors should brace for wild swings and more down days.
Sentiment Sentences on Price Change of S&P 500
Negative sentiment about the future price change of the S&P 500
Sentiment Sentences on Volatility Change of S&P 500
Positive sentiment about the future volatility change of the S&P 500
Document-level Sentiment with respect to multiple objects/features
Initial Experiment Results
Accuracy of knowledge-based sentiment classification vs. standard machine learning methods
  Small manually classified corpus
  Result: 7% more accurate
Portfolio selection experiment
  Use sentiment to select Dow Jones stocks
  Result: Excess returns seem possible
More information in paper: "Extracting Investor Sentiment from Weblog Texts: A Knowledge-based Approach", published in IEEE CEC 2011 conference proceedings
Main Y1 Achievements
Data acquisition software running
Sentiment analysis for web texts
  Sentence-level, and specific to features of objects
  Initial experiment results are promising
Ontology available (~4000 instances)
Sentence-level annotated corpus available (900 documents and growing)
Delivered as D3.1 and D4.1
Book chapter on data acquisition in preparation
Paper on sentiment extraction (best paper award at CEC 2011 conference)
Next Steps
Improve ontology and gazetteers
Use corpus to improve sentiment classification
Increase throughput of sentiment extraction
Thank you