Core Information Processing Technologies Technical Presentation & Demos

Miha Grčar (Department of Knowledge Technologies, Jožef Stefan Institute)

Achim Klein (University of Hohenheim)

Luxembourg, November 2011

Technical WPs

[Figure: overview of the FIRST technical work packages]

WP2 & WP7: Architecture, Integration & Scaling Strategy

WP3, WP4, WP6: Data Acquisition, Ontology Infrastructure, Information Extraction, Sentiment Analysis, Decision Support Infrastructure, Domain-independent GUI (Open Source)

Information Integration; Data, Information & Knowledge Base

WP1 & WP8: UC#1 Market Surveillance, UC#2 Reputational Risk Management, UC#3 Online Retail Brokerage

We are here: Data Acquisition

FIRST Y1 Review Meeting

Data acquisition pipeline (Dacq)

[Figure: Dacq architecture]

RSS readers (one reader per site) → load balancing → parallel processing pipelines

Each processing pipeline: Boilerplate remover → Language detector → Duplicate detector → Natural language preproc. → Semantic annotator → ZeroMQ emitter
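The chaining of stages in one processing pipeline can be sketched as follows. This is a minimal illustration, not the Dacq implementation: the `Document` layout, the stage functions, the five-word boilerplate heuristic, and the list-based emitter (a stand-in for the ZeroMQ emitter) are all assumptions made for the example, and the language detector and semantic annotator stages are left out for brevity.

```python
import hashlib
from typing import Callable, Iterable, Iterator, List, Optional

Document = dict
Stage = Callable[[Document], Optional[Document]]

def run_pipeline(docs: Iterable[Document], stages: List[Stage]) -> Iterator[Document]:
    """Pass each document through the stages in order; a stage returns None to drop it."""
    for doc in docs:
        for stage in stages:
            doc = stage(doc)
            if doc is None:
                break
        else:
            yield doc

def boilerplate_remover(doc: Document) -> Document:
    # Toy heuristic (assumption): keep only lines with at least five words.
    lines = [line for line in doc["text"].splitlines() if len(line.split()) >= 5]
    return {**doc, "text": "\n".join(lines)}

def make_duplicate_detector() -> Stage:
    seen = set()
    def duplicate_detector(doc: Document) -> Optional[Document]:
        # Exact-match stand-in; the slides also target near-duplicates.
        digest = hashlib.sha1(doc["text"].encode("utf-8")).hexdigest()
        if digest in seen:
            return None
        seen.add(digest)
        return doc
    return duplicate_detector

def make_emitter(outbox: list) -> Stage:
    def emitter(doc: Document) -> Document:
        # Stand-in for the ZeroMQ emitter: collects documents in a list.
        outbox.append(doc)
        return doc
    return emitter

feed = [
    {"url": "a", "text": "Home | Login\nStocks rallied on Monday as investors cheered earnings."},
    {"url": "b", "text": "Stocks rallied on Monday as investors cheered earnings.\nShare this"},
    {"url": "c", "text": "Bond yields fell after the central bank statement today."},
]
outbox = []
stages = [boilerplate_remover, make_duplicate_detector(), make_emitter(outbox)]
processed = list(run_pipeline(feed, stages))
print([doc["url"] for doc in processed])  # ['a', 'c']
```

Document "b" is dropped because, once the short navigation lines are stripped, its text is identical to that of document "a".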

Demo video (3:20)

Boilerplate removal

Demo video (1:30)

Language detection

Motivation: language-specific text analysis components

Relatively simple problem

Solutions based on word or character sequences (language models)

Side effects: removes “garbage” and can be used to identify code page

Our implementation based on frequencies of character sequences
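A minimal sketch of language detection from character-sequence frequencies is given below. It is not the project's implementation: the use of trigrams, cosine similarity, the `min_score` threshold, and the tiny training sentences are assumptions chosen to keep the example self-contained. Input that matches no profile is returned as `None`, which mirrors the “garbage” removal side effect.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """Count overlapping character n-grams of a lower-cased, space-padded text."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def cosine(a, b):
    """Cosine similarity between two n-gram frequency profiles."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Toy training texts (illustrative only; real profiles need far more data).
SAMPLES = {
    "en": "the market is expected to rise and the investors are watching the index",
    "de": "der markt wird voraussichtlich steigen und die anleger beobachten den index",
    "sl": "trg naj bi se dvignil in vlagatelji spremljajo gibanje indeksa",
}
PROFILES = {lang: char_ngrams(text) for lang, text in SAMPLES.items()}

def detect_language(text, min_score=0.1):
    """Return the best-matching language code, or None if nothing matches well."""
    grams = char_ngrams(text)
    lang, score = max(
        ((code, cosine(grams, profile)) for code, profile in PROFILES.items()),
        key=lambda pair: pair[1],
    )
    return lang if score >= min_score else None

print(detect_language("the investors expect the index to rise"))  # en
print(detect_language("die anleger beobachten den markt"))        # de
print(detect_language("#### 1234 @@@@"))                          # None
```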

Demo video (0:45)

Near-duplicate detection

Why is this a difficult problem?

We are dealing with millions of documents, so we cannot afford to compare every document with every other document

We are also looking for near-duplicates, not only exact matches

Overlooked boilerplate “produces” false near-duplicates

Demo video (1:00)

Near-duplicate detection

Existing approaches like SimHash, shingling and sketching, SpotSigs…

Apart from SpotSigs, they require “clean” documents

Hard to interpret similarity value (how many characters, words, sentences?)

Developing a novel solution to remove boilerplate and detect duplicates [with clear interpretation] in the same framework
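To make the existing approaches more concrete, here is a minimal sketch of word shingling with Jaccard similarity. It illustrates one of the established techniques named above, not the novel solution under development; the shingle size `k`, the threshold, and the sample documents are assumptions for the example.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) of a text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity: size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def near_duplicates(docs, threshold=0.3, k=3):
    """Return pairs of document ids whose shingle sets overlap by at least `threshold`.

    Compares all pairs, so it only illustrates the measure; it is not a scalable index.
    """
    sets = {doc_id: shingles(text, k) for doc_id, text in docs.items()}
    ids = sorted(sets)
    return [
        (x, y)
        for i, x in enumerate(ids)
        for y in ids[i + 1:]
        if jaccard(sets[x], sets[y]) >= threshold
    ]

docs = {
    "a": "stocks rallied on monday as investors cheered strong earnings",
    "b": "stocks rallied on tuesday as investors cheered strong earnings",
    "c": "bond yields fell after the central bank statement",
}
print(jaccard(shingles(docs["a"]), shingles(docs["b"])))  # 0.4
print(near_duplicates(docs))                              # [('a', 'b')]
```

Documents "a" and "b" differ in a single word out of nine, yet their similarity is only 0.4, which shows why the raw value is hard to translate into a number of changed words or sentences.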

FIRST ontology

Classes: SentimentObject, FinancialInstrument, Index, Stock_Index, Stock, Company, Country

Seed indices → Constituents (stocks) → Companies → Countries

FIRST ontology

:NASDAQ_100 a :Stock_Index ;
    rdfs:label "NASDAQ-100" .

:MICROSOFT a :Stock ;
    rdfs:label "MICROSOFT CORP COM USD0.00000625" ;
    :memberOf :NASDAQ_100 .

:MICROSOFT_CORP a :Company ;
    rdfs:label "Microsoft Corp." ;
    :issues :MICROSOFT .

:USA a :Country ;
    rdfs:label "USA" .

:MICROSOFT_CORP :locatedIn :USA .

:MICROSOFT_CORP :hasGazetteer :MICROSOFT_CORP_Gazetteer .

:MICROSOFT_CORP_Gazetteer
    :hasTerm "Microsoft Corp" ;
    :hasTerm "Microsoft Corporation" ;
    :hasStopWord "CORP" ;
    :hasStopWord "CORPORATION" ;
    a :Gazetteer .

Microsoft Corporation is engaged in developing, licensing and supporting a range of software products and services. Microsoft also designs and sells hardware, and delivers online advertising to the customers.

Microsoft Corp
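How gazetteer terms could be used to annotate text with ontology instances is sketched below. The matching strategy (case-insensitive, whole-word, longest term first) and the dictionary layout are assumptions for illustration; the slide does not say how `:hasStopWord` is applied, so stop words are deliberately not modelled here.

```python
import re

# Gazetteer terms copied from the Turtle snippet; the dict layout is an assumption.
GAZETTEERS = {
    ":MICROSOFT_CORP": ["Microsoft Corp", "Microsoft Corporation"],
}

def annotate(text):
    """Return (start, end, matched_text, instance) tuples, preferring longer terms."""
    entries = sorted(
        ((term, inst) for inst, terms in GAZETTEERS.items() for term in terms),
        key=lambda entry: len(entry[0]),
        reverse=True,
    )
    annotations, taken = [], []
    for term, inst in entries:
        pattern = r"\b" + re.escape(term) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            # Skip matches that overlap a span already claimed by a longer term.
            if all(m.end() <= start or m.start() >= end for start, end in taken):
                taken.append((m.start(), m.end()))
                annotations.append((m.start(), m.end(), m.group(0), inst))
    return sorted(annotations)

text = ("Microsoft Corporation is engaged in developing, licensing and "
        "supporting a range of software products and services.")
print(annotate(text))
# [(0, 21, 'Microsoft Corporation', ':MICROSOFT_CORP')]
```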

FIRST ontology

[Figure: ontology classes and relations]

Classes: Sentiment Object, Company, Financial Instrument, Indicator, MacroIndicator, MicroIndicator, Technical, Fundamental, Feature, CorrelationDefinition, Volatility, Reputation, OrientationPhrase

Relations: correlationDefinitionInfluencesIndicator, indicatorHasCorrelationDefinition, objectHasCorrelationDefinition, correlationDefinitionInfluencesObject

Annotation pipeline

Ontology-based semantic annotation

Demo video (3:00)

Sentiment Analysis

Object: Sentiment in financial web texts

Problem: Classification of sentiment orientation with respect to expected future …

    price change of financial instruments

    volatility change of financial instruments

    reputation change of companies

Approach: Knowledge-based sentiment classification

    Starting at the sentence level

    Specific to features of objects (e.g., reputation of a company)

Example

Ambiguity: “The low clarity of messages implies that quite often people would be likely to disagree on the classification” [Das and Chen 2007].

Identification and differentiation of objects (and features)

Relationships of indicators (e.g., earnings) and objects

Short term: uptrend

Support for the SPX remains at 848 and then 789, with resistance at 912 and then 935. Short term momentum was overbought during the rally early in the week and is now displaying a positive divergence at friday's lows. Should the market fail to hold this pivot (SPX 840) in the days and weeks ahead the uptrend is likely over.

Long term: bear market

The Cycle wave bear market of October 2007 continues. Thus far, equity markets worldwide have declined on average about 50%. The opportunity still remains for the US and World economies to avoid a devastating Supercycle bear market like that of 1929-1932.

http://caldaroew.spaces.live.com/Blog/cns!D2CB8C5EBA2ADE86!27847.entry

Manual Sentiment Annotation

FIRST Knowledge-based Sentiment Analysis Approach

1. Identify all sentiment objects and features

2. Extract all sentiments in one document

3. Classify sentiment orientation {positive, negative} for all sentence-level sentiments

4. Aggregate

    All sentence-level sentiments → Scoring → Document-level sentiment score

    All sentiment scores for a given day → Averaging → Sentiment Index [-1,1]

Resources: Rules, Ontology

Example sentence: Support for the SPX remains at 848 and then 789, with resistance at 912 …
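The aggregation step can be sketched as follows. The slides name the two operations (scoring per document, averaging per day) but not the scoring formula, so the formula used here, (positives - negatives) / total, is an assumption, as are the function names and the sample data.

```python
from collections import defaultdict
from statistics import mean

def document_score(orientations):
    """Score one document from its sentence-level orientations.

    Assumed formula: (positives - negatives) / total, which lies in [-1, 1].
    Returns None for a document without any sentiment sentences.
    """
    if not orientations:
        return None
    pos = sum(1 for o in orientations if o == "positive")
    neg = sum(1 for o in orientations if o == "negative")
    return (pos - neg) / len(orientations)

def sentiment_index(documents):
    """Average the document-level scores per day.

    `documents` is an iterable of (day, orientations) pairs; returns {day: index}.
    """
    by_day = defaultdict(list)
    for day, orientations in documents:
        score = document_score(orientations)
        if score is not None:
            by_day[day].append(score)
    return {day: mean(scores) for day, scores in by_day.items()}

documents = [
    ("2011-10-04", ["negative", "negative", "positive"]),  # score -1/3
    ("2011-10-04", ["positive", "positive"]),              # score  1.0
    ("2011-10-05", ["negative"]),                          # score -1.0
]
index = sentiment_index(documents)
print(index["2011-10-05"])  # -1.0
```

For 2011-10-04 the index is the mean of -1/3 and 1.0, roughly 0.33.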

Sentence-level Sentiment Classification

a) Directly

Example: “I expect the S&P 500 to rise”

positive sentiment

Addressed by rules

b) Indirectly, via an indicator

Example: “I think U.S. interest rates will rise”

negative sentiment

Addressed by ontology
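The two cases can be illustrated with a toy classifier. Everything in it is hypothetical: the direction lexicon stands in for the rules, and the `INDICATOR_INFLUENCE` table stands in for the ontology's correlation definitions, with the negative sign for interest rates chosen to match the example above.

```python
import re

# Hypothetical direction lexicon standing in for the rule base.
DIRECTION = {
    "rise": 1, "rises": 1, "increase": 1, "climb": 1,
    "fall": -1, "falls": -1, "drop": -1, "decline": -1,
}

# Hypothetical ontology fragment: known objects, and the sign with which an
# indicator is assumed to influence the price of those objects.
OBJECTS = {"s&p 500"}
INDICATOR_INFLUENCE = {"interest rates": -1}

def classify(sentence):
    """Return 'positive', 'negative', or None if no sentiment is recognised."""
    text = sentence.lower()
    direction = next(
        (DIRECTION[w] for w in re.findall(r"[a-z]+", text) if w in DIRECTION), None
    )
    if direction is None:
        return None
    if any(obj in text for obj in OBJECTS):
        sign = direction  # a) direct statement about the object
    else:
        influence = next(
            (s for ind, s in INDICATOR_INFLUENCE.items() if ind in text), None
        )
        if influence is None:
            return None
        sign = direction * influence  # b) indirect, via an indicator
    return "positive" if sign > 0 else "negative"

print(classify("I expect the S&P 500 to rise"))           # positive
print(classify("I think U.S. interest rates will rise"))  # negative
```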

Example Text: S&P 500

http://business.financialpost.com/2011/10/04/economic-uncertainty-could-fan-volatility/

Oct 4, 2011, 3:24 PM ET

The fourth quarter began on Monday with the broad S&P 500 on the precipice of a bear market and investors lacking confidence in either European or U.S. policymakers being able to stem the disquiet surrounding the debt crisis.

Wall Street typically defines a bear market as a drop of 20 percent or more from a recent high.

Volatility is at its most persistently elevated since the financial crisis of 2008, as measured by the popular VIX, or CBOE Volatility Index. Barring a knock-out U.S. earnings period in the next month, it could remain high, and investors should brace for wild swings and more down days.

Sentiment Sentences on Price Change of S&P 500

Negative sentiment about the future price change of the S&P 500

Sentiment Sentences on Volatility Change of S&P 500

Positive sentiment about the future volatility change of the S&P 500

Document-level Sentiment with respect to multiple objects/features

Initial Experiment Results

Accuracy of knowledge-based sentiment classification vs. standard machine learning methods

    Small manually classified corpus

    Result: 7% more accurate

Portfolio selection experiment

    Use sentiment to select Dow Jones stocks

    Result: Excess returns seem possible

More information in paper: “Extracting Investor Sentiment from Weblog Texts: A Knowledge-based Approach”, published in IEEE CEC 2011 conference proceedings

Main Y1 Achievements

Data acquisition software running

Sentiment analysis for web texts

    Sentence-level, and specific to features of objects

    Initial experiment results are promising

Ontology available (~4000 instances)

Sentence-level annotated corpus available (900 documents and growing)

Delivered as D3.1 and D4.1

Book chapter on data acquisition in preparation

Paper on sentiment extraction (best paper award at CEC 2011 conference)

Next Steps

Improve ontology and gazetteers

Use corpus to improve sentiment classification

Increase throughput of sentiment extraction

Thank you
