Data acquisition and FIRST datasets

Miha Grčar, Jožef Stefan Institute

FIRST Y3 Review Meeting

Activity in Y3

Ontology evolution

Data acquisition software (DacqPipe)

FIRST dataset of news & blogs

Luxembourg, Nov 2013

Ontology evolution

• Semantic & lexical resources, IDMS API
• Topic detection & tracking, active learning
• (Nearly) static part: FIRST ontology (indices, stocks, companies, geo-entities, actors…; sentiment vocabulary; topic taxonomies)
• Dynamic part

Ontology evolution

• Semantic & lexical resources, IDMS API
• Topic detection & tracking, active learning*
• (Nearly) static part: FIRST ontology
• Dynamic part: models for sentiment classification*, models for canyon flow visualization
• “Knowledge base”

* Smailović, Grčar, Lavrač, Žnidaršič: Stream-based active learning for sentiment analysis in the financial domain (to appear)

Data acquisition pipeline (DacqPipe)

• Resembles big data streaming architectures such as Twitter Storm
• Running continuously since April 2011
• Several scientific contributions: boilerplate remover & gold standard dataset; ontology & ontology-based information extractor
• Executable available at http://first.ijs.si/software/DacqPipeJun2013.zip
• Source code: https://github.com/project-first/dacqpipe


[Diagram: DacqPipe architecture. Stages: read & parse, clean, syntactic analysis, semantic preprocessing, store. Components shown: RSS readers, HTML tokenizers, boilerplate remover & duplicate detector, language detector, filter, NLP pipe, OBIE, DB writer, DB, 0MQ channel (emit).]







Boilerplate removal


Streaming setting

Hypothesis: Web pages at similar Web addresses share common boilerplate, while the main content is unique.

URL tree: documents arriving in the stream are organized by their URL, e.g. http://www.bbc.co.uk/sports/story2371.html. For each text block, such as “About us”, the question is: how many times did I see “About us” in this part of the tree?
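The URL-tree idea can be illustrated with a minimal sketch. It is not the DacqPipe implementation: the class names, the `min_docs` and `min_share` thresholds, and the choice of the deepest sufficiently populated node as reference are all assumptions made for illustration. Each node counts in how many documents of its subtree a text block has appeared; a block that shows up in a large share of those documents is treated as boilerplate.

```python
from collections import Counter, defaultdict
from urllib.parse import urlparse


class UrlTreeNode:
    """One node of the URL tree: a host name or a path segment."""

    def __init__(self):
        self.children = defaultdict(UrlTreeNode)
        self.doc_count = 0             # documents seen in this subtree
        self.block_counts = Counter()  # documents in this subtree containing a block


class UrlTree:
    def __init__(self, min_docs=3, min_share=0.5):
        # Both thresholds are hypothetical values chosen for the example.
        self.root = UrlTreeNode()
        self.min_docs = min_docs
        self.min_share = min_share

    def _path_nodes(self, url):
        """Return the nodes from host down to the document's parent folder."""
        parts = urlparse(url)
        folders = [s for s in parts.path.split("/") if s][:-1]  # drop the file name
        node, nodes = self.root, []
        for segment in [parts.netloc] + folders:
            node = node.children[segment]
            nodes.append(node)
        return nodes

    def process(self, url, blocks):
        """Return the blocks judged to be content, then update the counts."""
        nodes = self._path_nodes(url)
        # Deepest node that has already seen enough documents, if any.
        ref = next((n for n in reversed(nodes) if n.doc_count >= self.min_docs), None)
        content = [
            b for b in blocks
            if ref is None or ref.block_counts[b] / ref.doc_count < self.min_share
        ]
        for n in nodes:
            n.doc_count += 1
            n.block_counts.update(set(blocks))
        return content


tree = UrlTree()
for i in range(3):
    tree.process(f"http://www.bbc.co.uk/sports/story{i}.html", ["About us", f"Story {i}"])
print(tree.process("http://www.bbc.co.uk/sports/story2371.html", ["About us", "Match report"]))
# prints ['Match report']
```

Because each document is classified before the counts are updated, the sketch works on a stream: early documents pass through unfiltered until the relevant part of the tree has seen enough pages.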

Evaluation

Dataset
• 569,583 time-stamped documents (stream)
• 292,053 documents after URL normalization
• Oct 24 – Dec 19, 2011; 31 Web sites
• Part of the FIRST dataset of news & blogs

Gold standard
• 56,436 documents annotated with manually designed regexes tailored for specific Web sites


Evaluation

Gold standard dataset: http://first.ijs.si/urltreedataset


Conclusion: Final results of WP3

Data acquisition pipeline software (DacqPipe)
• Since April 2011
• https://github.com/project-first/dacqpipe

FIRST dataset of news & blogs
• 219 Web sites; ~15 million unique documents
• http://first.ijs.si/FIRSTDataset

FIRST ontology
• Semantic + lexical part
• Information extraction + sentiment analysis
• http://first.ijs.si/FIRSTOntology/y3

Achim Klein (UHOH), 20 November, Luxembourg

Technical Presentations and Demos: Sentiment Analysis

Knowledge-based Sentiment Extraction

a) Direct sentiment
• Example: “I expect the S&P 500 to rise” → positive sentiment
• Addressed by rules

b) Indirect sentiment, using indicators
• Example: “I think U.S. interest rates will rise” → negative sentiment
• Addressed by ontology
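The difference between the two cases can be shown with a toy sketch. It is not the project's rule engine or ontology: the word lists, the `classify` function and all correlation signs other than the one for interest rates (taken from the example above) are hypothetical. A direction word gives the direct sentiment; for an indicator, the direction is multiplied by the indicator's assumed correlation sign.

```python
import re

# Direction words (illustrative list, not the project's vocabulary).
DIRECTION = {"rise": +1, "increase": +1, "fall": -1, "decrease": -1}

# Objects for which a rise is directly good news (the rule-based case).
DIRECT_OBJECTS = {"s&p 500"}

# Indicators with a correlation sign (the ontology-based case).
# "interest rates" follows the slide example; the other two are hypothetical.
INDICATORS = {"interest rates": -1, "unemployment": -1, "sales": +1}


def classify(sentence):
    """Return 'positive', 'negative' or 'neutral' for a single sentence."""
    text = sentence.lower()
    words = set(re.findall(r"[a-z]+", text))
    direction = sum(sign for word, sign in DIRECTION.items() if word in words)
    if direction == 0:
        return "neutral"
    if any(obj in text for obj in DIRECT_OBJECTS):
        return "positive" if direction > 0 else "negative"
    for indicator, correlation in INDICATORS.items():
        if indicator in text:
            return "positive" if direction * correlation > 0 else "negative"
    return "neutral"


print(classify("I expect the S&P 500 to rise"))          # prints positive
print(classify("I think U.S. interest rates will rise"))  # prints negative
```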

UC Retail Brokerage/Market Surveillance: Economic Indicators

• Debt to Equity, Dividend Yield, Earnings to Price Ratio, New Products, Profit Margin, Sales, …
• Interest Rate, Inflation, M2 Change Rate, Durable Goods Orders, Unemployment, Private Housing, New Building Permits, …
• Advance/Decline Ratio, Bear Flag, Break Out, Double Bottom, RSI, Support, Resistance, …

Example Insights: Unemployment Indicator

[Chart: Unemployment Indicator volume, 1/1/2013 to 10/1/2013. Annotations: official US unemployment statistics release dates; record Greek unemployment numbers released.]

UC Reputational Risk: Reputation Indicators (Y3)

Positive and negative sample indicators per reputation topic:

Reputation indicator           Positive correlation                 Negative correlation
Social Responsibility          Charity, Donation, Education         Crime, Bullying, Slave
Human Resources                Professional, Talent, Manpower       Lay off, Job cuts, Wrongdoers
Business Behavior              Transparent, Responsible, Campaign   Debt, Foreclosure, Price-fixing
Corporate Governance           Accountable, Tier 1 ratio, AML       Breach, Shady funds, Law suit
Exposure on Critical Markets   Subsidy, Liquidity, Customers        Subprime Mortgage, CDS spread

Total number of indicators: 1451

Reputation Sentiment Classification Performance

                     Precision   Recall
Without indicators   67.7%       23.7%
With indicators      71.2%       44.9%

Higher recall of (indirect) sentiments by means of indicators

Reputational Insights: JPMorgan

[Chart: volume of Corporate Governance indicators for JPMorgan, 13.09.2013 to 04.11.2013]

• Corporate Governance, 19.09.2013: “Scandals cost JPMorgan $1 billion in fines” [REUTERS]
  http://www.reuters.com/article/2013/09/19/us-jpmorgan-whale-idUSBRE98I0JL20130919
• Corporate Governance, 11.10.2013: “JPMorgan’s Dimon Posts First Loss on $7.2 Billion Legal Cost” [BLOOMBERG]
  http://www.bloomberg.com/news/2013-10-11/jpmorgan-reports-380-million-third-quarter-loss-on-legal-costs.html

Fuzzy Sentiment Classification

1. Extract sentiment objects, e.g. “Apple’s earnings are rising”; “Sales might decrease because of the financial crisis”
2. Classify sentiment per object in each sentence
3. Generate machine-learning input: sentiments and words of all sentences that refer to the same object
4. Two separate document-level machine-learning fuzzy classifiers with 5 degrees of … (1) positive, (2) negative

Enhanced Gold Standard Corpus (Y3): Retail Brokerage/Market Surveillance

Corpus size: 409 (Y2), 1021 (Y3)

Improved hybrid sentiment classifier performance:

      Precision   Recall
Y2    62.6%       54.2%
Y3    69.0%       62.4%

Main Results

• Deep knowledge-based sentiment analysis: specific to a feature of an object using rules (e.g., reputation of a company)
• Economic and reputation indicators improve classifier performance and provide valuable insights for users
• Glass-box approach with drill-down capabilities
• Best paper award at IEEE
• Fuzzy classifier with 5 degrees of positivity and negativity for better decision making
• Fuzzy-level Gold Standard Corpus
• Analyzed >3 million documents
• Open source available: git://github.com/project-first/semanticinformationextraction.git

Thank you

Luxembourg, November 20th, 2013

WP6 Technical Presentation & Demos

Marko Bohanec, Miha Grčar, Jan Muntermann, Michael Siering

WP6 Status End of Y2

[Timeline: months 19 to 36]

• Mainly presenting basic stand-alone prototypes
• Presentation of the first models
• First visualisation components


WP6 Achievements Y3

[Timeline: months 19 to 36]

T 6.2 / T 6.3 Machine Learning & Qualitative Models
• Refinements of qualitative models based on domain experts’ feedback
• Highly scalable implementations
• FIRST pipeline integration
• Delivery of D6.3 in M33

T 6.4 Visualisation Components
• Development of additional and revised visualisation components based on domain experts’ feedback
• Highly scalable implementations
• Delivery of D6.4 in M34


Agenda

• Qualitative and quantitative models: Reputational Risk Management, Market Surveillance
• Visualizations: Retail Brokerage

Reputational Risk Problem Formulation (1/2)

General Area: Production and distribution of investment products and services by banks and other financial institutions.

Specific Use Case: Assessment of reputational risk (RI) based on assessments of MPS counterparties.

Reputational Risk: Risk arising from negative perception on the part of customers, counterparties, shareholders, investors, debt-holders, market analysts, other relevant parties or regulators that can adversely affect a bank’s ability to maintain existing, or establish new, business relationships and continued access to sources of funding.


Reputational Risk Problem Formulation (2/2)

Goal: to develop
• a multi-criteria model for the assessment of MPS reputational risk (RIM)
• that serves as the main component of the corresponding DSS

Approach: expert modeling, qualitative multi-attribute modeling (method DEX)

[Diagram labels: time series; financial, trading data; sentiment data; evaluation; novelties]


RIM: Main Components

[Diagram: bank data (product, customer, counterpart) → basic data processing → qualitative evaluation (DEXi model), yielding qRI1 → aggregation (Customer → Product → Counterpart → Bank, using relative customer numbers and relative product volumes) → RI]


RIM: Basic Data Processing

[Diagram: processing of bank data on counterpart, customer and product. Blocks: sentiment, performance, mismatching, product volumes, customer volumes. Labels shown: SLP, SSP, SPP, BP, SRI, w, f, ΔB, ΔM, RP, P, qS, qP, qM, V1, VC, RV1C, qRV1C, VP, TA, NP, TN, RVP, RNP.]


RIM: Qualitative Evaluation

Aim: qualitative assessment of Reputational Index for one customer and product

Model: qualitative hierarchical rule-based DEX model

[Diagram: model hierarchy with attributes qS, qP, qM, qRV1C, qPM and qRI1]

Excerpt from the DEXi rule tables (RIM_D63.dxi, 10.7.13, page 2).

End of a three-input rule table (column headers not preserved in the transcript):

 #   Input 1      Input 2       Input 3         Output
61   very-neg     medium:high   medium:high     high
62   *            very-high     >=high          very-high
63   >=low-neg    very-high     >=medium        very-high
64   >=high-neg   >=high        very-high       very-high
65   very-neg     >=medium      very-high       very-high
66   very-neg     very-high     >=medium-low    very-high

Rule table for qPM:

 #   qP           qM            qPM
 1   <=low        in-line       in-line
 2   in-line      low:medium    low
 3   <=medium     low           low
 4   medium       <=low         low
 5   >=medium     in-line       low
 6   in-line      high          medium
 7   low:medium   medium        medium
 8   >=high       low           medium
 9   <=low        very-high     high
10   low          >=high        high
11   low:high     high          high
12   high         medium:high   high
13   >=high       medium        high
14   >=medium     very-high     very-high
15   very-high    >=high        very-high
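To make the notation of the qPM rule table concrete, the following sketch evaluates its 15 rules in Python. It is an illustration, not DEXi itself. The scale ordering in-line < low < medium < high < very-high, the use of the same scale for qP, qM and qPM, and the names `matches` and `evaluate_qpm` are assumptions; the rules are copied from the table. Under the assumed ordering, every combination of qP and qM is covered by at least one rule.

```python
# Assumed ordering of the qualitative scale (not stated in the slides).
SCALE = ["in-line", "low", "medium", "high", "very-high"]

# (qP condition, qM condition, qPM value), rules 1 to 15 of the table.
RULES = [
    ("<=low", "in-line", "in-line"),
    ("in-line", "low:medium", "low"),
    ("<=medium", "low", "low"),
    ("medium", "<=low", "low"),
    (">=medium", "in-line", "low"),
    ("in-line", "high", "medium"),
    ("low:medium", "medium", "medium"),
    (">=high", "low", "medium"),
    ("<=low", "very-high", "high"),
    ("low", ">=high", "high"),
    ("low:high", "high", "high"),
    ("high", "medium:high", "high"),
    (">=high", "medium", "high"),
    (">=medium", "very-high", "very-high"),
    ("very-high", ">=high", "very-high"),
]


def matches(condition, value):
    """Check a value against '*', '<=x', '>=x', an interval 'a:b' or a single value."""
    rank = SCALE.index(value)
    if condition == "*":
        return True
    if condition.startswith("<="):
        return rank <= SCALE.index(condition[2:])
    if condition.startswith(">="):
        return rank >= SCALE.index(condition[2:])
    if ":" in condition:
        low, high = condition.split(":")
        return SCALE.index(low) <= rank <= SCALE.index(high)
    return condition == value


def evaluate_qpm(qp, qm):
    """Return the qPM value of the first rule that matches (qP, qM)."""
    for qp_cond, qm_cond, result in RULES:
        if matches(qp_cond, qp) and matches(qm_cond, qm):
            return result
    raise ValueError(f"no rule covers qP={qp}, qM={qm}")


print(evaluate_qpm("in-line", "high"))   # prints medium (rule 6)
print(evaluate_qpm("high", "low"))       # prints medium (rule 8)
```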


RIM: Aggregation

Aim: gradually aggregate qRI1 into the overall Reputation Risk Index (RI):
• hierarchical aggregation: Customer → Product → Counterpart → Bank
• taking into account relative product volumes and relative customer numbers

[Diagram: aggregation steps C/P → PRODUCT, PRODUCT → COUNTERPART, COUNTERPART → BANK, starting from qRI1]

RIM Reports: Topmost Level (Bank)


RIM Summary

• Developed and implemented a decision support model component for the assessment of bank reputational risk
• Approach: expert modeling using a variety of modeling methods (qualitative, quantitative, hierarchical, relational)
• Novel aspects: taking into account sentiment assessments of counterparts; advancing the present RI assessment model
• Benefits for the users: obtaining a comprehensive RI as time series for different groups (customers, products, counterparts, bank); ability to analyse and explain assessments at different levels by drilling down through the RIM hierarchy



Problem Formulation: Market Surveillance

Pump & Dump market manipulation: manipulation of the share price by disseminating false positive information in order to profit from the increased price level.


Pump & Dump Example (1/2)

“Shares can multiply dramatically in value over short time periods.”

“Thursday's pick is a story straight out of Hollywood!”

“SAPX - Wake Up, Put It On Your Screen NOW”

“Could this company be the next blockbuster?”

Source: http://newsletter.hotstocked.com/newsletters/view/Could_this_company_be_the_next_blockbuster_-92301


Pump & Dump Example (2/2)

Seven Arts Entertainment, Inc. (SAPX)

[Chart: SAPX stock chart with “Shares Purchased” and “Shares Sold” marked. Source: Yahoo! Finance]

Pump & Dump campaign, July 24th – 28th, 2011: > 30 different recommendations


How to address Pump & Dump Manipulations?

Qualitative Modeling
• Based on expert knowledge
• Qualitative attributes
• Decision problem divided into sub problems
• Goal: daily assessments

Quantitative Modeling
• Based on machine learning algorithms
• Quantitative attributes
• Goal: assessment of single documents

[Diagram: attribute hierarchy of the qualitative model. Attributes shown: Country Black List, Industry Black List, Company Black List, Age, Bankrupt, Trading Volume, Number of Trades, Market Capitalization, Market Segments, Sentiment, Content, Black List, History, Market, Trading, News, Company, Financial Instruments, Comp_FinInst; root: Pump & Dump.]


Qualitative Multi-Attribute Model Development

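[Diagram: attribute hierarchy of the qualitative multi-attribute model. Attributes shown: Country Black List, Industry Black List, Company Black List, Age, Bankrupt, Trading Volume, Number of Trades, Market Segment, Sentiment, Content, Black List, History, Market, Trading, News, Company, Financial Instrument.]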




From initial DEXi Model (M24) to Processing of Data Stream (M33)

• The initial model structure was developed and distributed as DEXi files; these models can be applied within the DEXi environment only (M24).
• To enable the models to process large-scale data streams, a Java-based prototype was implemented (M33).

[Diagram labels: definition of data sources; regulatory authorities' web pages]


Model Configuration and Evaluation (M24)

Evaluation based on predefined configuration:

P&D value   Number   Percentage
v-high         482         0.66
high         57588        78.75
med          12342        16.88
low           2498         3.42
v-low          215         0.29
Sum          73125       100

Model configuration:
• V-high: 3 configurations
• High: 9
• Medium: 7
• Low: 5
• V-low: 1

Evaluation:
• 1700 OTC-traded companies
• Dataset: 01.2012 to 06.2013 (370 trading days)
• On average 157 alerts per day for v-high and high


Reconfiguration of the Rules in Y3



Model Configuration and Evaluation (M33)

Evaluation results based on reconfigured model:

P&D value   Number   Percentage
v-high         982          0.8
high         22215         18.8
med          92049         78.0
low           2779          2.4
v-low           57          0.0
Sum         118082        100

Configuration:
• V-high: 3 configurations
• High: 7
• Medium: 8
• Low: 6
• V-low: 1

Evaluation:
• 1700 OTC-traded companies
• Dataset: 01.2012 to 09.2013 (435 trading days)
• On average 53 alerts per day for v-high and high
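The alert rates quoted for the two configurations follow directly from the reported counts. The short calculation below reproduces them; the counts and numbers of trading days are taken from the M24 and M33 slides, while the variable and function names are only illustrative.

```python
# Counts of daily P&D assessments per value, as reported for M24 and M33.
M24 = {"counts": {"v-high": 482, "high": 57588, "med": 12342, "low": 2498, "v-low": 215},
       "trading_days": 370}
M33 = {"counts": {"v-high": 982, "high": 22215, "med": 92049, "low": 2779, "v-low": 57},
       "trading_days": 435}


def summarize(config):
    """Return (total assessments, % flagged as alerts, alerts per trading day)."""
    counts = config["counts"]
    total = sum(counts.values())
    alerts = counts["v-high"] + counts["high"]  # only these two values raise an alert
    return total, round(100 * alerts / total, 1), round(alerts / config["trading_days"])


print(summarize(M24))  # prints (73125, 79.4, 157)
print(summarize(M33))  # prints (118082, 19.6, 53)
```

After the reconfiguration, the share of assessments rated high or v-high drops from about 79% to about 20%, which is what reduces the average from 157 to 53 alerts per day.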

Research Impact

Alic, I.; Siering, M.; Bohanec, M. (2013): Hot Stock or Not? A Qualitative Multi-Attribute Model to Detect Financial Market Manipulation. In: Proceedings of the 26th Bled eConference; Bled, Slovenia





Research Objective: Development of a Pump & Dump Classifier

• Learning phase: labeled documents → training algorithm (Support Vector Machine) → classifier
• Application phase: new documents → classifier → predictions (suspicious / not suspicious)

Steps:
• Evaluation of Training Documents
• Evaluation of Machine Learning Algorithms
• Integration in FIRST Pipeline


Evaluation of Training Documents

• Event study: capital market reaction during / after pump and dump campaign
• Significant abnormal returns during campaign
• Price decrease after campaign has ended

Siering, M. (2013): All Pump, No Dump? The Impact of Internet Deception on Stock Markets. In: Proceedings of the 21st European Conference on Information Systems; Utrecht, Netherlands


Evaluation of different machine learning algorithms
• Neural Network: reduced feature set
• SVM: parameter optimisation according to Hsu et al. (2003)

Evaluation of Machine Learning Algorithms

                            Class suspicious            Class non-suspicious
Learner          Accuracy   Precision  Recall  F1       Precision  Recall  F1
Decision Tree    95.10      94.65      95.60   95.12    95.56      94.60   95.08
Naïve Bayes      97.30      96.28      98.40   97.33    98.36      96.20   97.27
k-NN, k=1        78.10      80.09      74.80   77.35    76.36      81.40   78.80
k-NN, k=2        73.60      68.67      86.80   76.68    82.07      60.40   69.59
k-NN, k=3        75.30      77.32      71.60   74.35    73.56      79.00   76.18
k-NN, k=4        74.20      71.68      80.00   75.61    77.38      68.40   72.61
k-NN, k=5        75.10      77.34      71.00   74.03    73.20      79.20   76.08
Neural Network   97.00      96.81      97.20   97.00    97.19      96.80   96.99
SVM              99.30      99.20      99.40   99.30    99.40      99.20   99.30

Hsu, C. W., Chang, C. C., & Lin, C. J. (2003). A practical guide to support vector classification. National Taiwan University, http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf (accessed on 10/16/2011) .
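The F1 columns in the results table are the harmonic mean of the corresponding precision and recall. As a quick consistency check, the snippet below recomputes F1 for the class "suspicious" for three of the learners; the figures are copied from the table, and the helper name `f1` is just a label for the standard formula.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) for class "suspicious", taken from the table.
reported = {
    "Decision Tree": (94.65, 95.60, 95.12),
    "Naive Bayes": (96.28, 98.40, 97.33),
    "SVM": (99.20, 99.40, 99.30),
}

for learner, (precision, recall, reported_f1) in reported.items():
    recomputed = f1(precision, recall)
    # Each recomputed value agrees with the reported one to two decimals.
    print(f"{learner}: recomputed {recomputed:.2f}, reported {reported_f1:.2f}")
```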

[Chart: computing requirements of different learners (10-fold cross validation, in sec.) for Decision Tree, Naive Bayes, k-NN, Neural Network and Support Vector Machine; axis from 0 to 500]


Integration in FIRST Pipeline


[Diagram: internal data sources and external data (Web) sources; HTML pages; pipeline components HTML preprocessing, boilerplate removal, language detection and near-duplicate removal; clean text; database; quantitative Pump and Dump model]

Integration of Quantitative and Qualitative Models: Multi Classifier Approach

[Diagram: attribute hierarchy of the qualitative model. Attributes shown: Country Black List, Industry Black List, Company Black List, Age, Bankrupt, Trading Volume, Number of Trades, Market Segment, Sentiment, Content, Black List, History, Market, Trading, News, Company, Financial Instrument.]



Market Surveillance Summary

• Developed and implemented a multi-classifier decision support component for the assessment of information-based market manipulation
• Approach: expert modeling using a variety of modeling methods (qualitative, quantitative, hierarchical, relational) and machine learning
• Novel aspects: taking into account sentiment assessments of published documents; multi-classifier component integrates qualitative and quantitative model
• Benefits for the users: obtaining the ability to monitor information-based market manipulation in market segments with a large number of financial instruments



Visualisation Components

Video: http://first.ijs.si/Videos/y3wp6demo/

Online Demo: http://first.ijs.si/visapi/

Retail Brokerage Summary

• Developed and implemented visualisation components providing the basis for data/document-driven DSS in the Retail Brokerage use case scenario
• Approach: clustering of document topics, aggregation of document sentiments and publication statistics
• Novel aspects: visualisation components that condense social media activity; aggregation of media topics to explore social media contents
• Benefits for the users: explore activity, sentiment and topics of social media in a user-friendly way


Thank you for your attention!

Questions?

