Data acquisition and FIRST datasets

Miha Grčar, Jožef Stefan Institute Data acquisition and FIRST datasets FIRST Y3 Review Meeting

Data acquisition and FIRST datasets. Miha Grčar, Jožef Stefan Institute. Activity in Y3: ontology evolution; data acquisition software (DacqPipe); FIRST dataset of news & blogs.

Transcript of Data acquisition and FIRST datasets

Page 1: Data acquisition and FIRST datasets

Miha Grčar, Jožef Stefan Institute

Data acquisition and FIRST datasets

FIRST Y3 Review Meeting

Page 2: Data acquisition and FIRST datasets

Activity in Y3

Ontology evolution

Data acquisition software (DacqPipe)

FIRST dataset of news & blogs

Luxembourg, Nov 2013, FIRST Y3 Review Meeting

Page 3: Data acquisition and FIRST datasets

Ontology evolution

Semantic & lexical resources, IDMS API

Topic detection & tracking; Active learning

FIRST ontology

Indices, stocks, companies, geo-entities, actors…; Sentiment vocabulary; Topic taxonomies

(Nearly) Static part | Dynamic part

Page 4: Data acquisition and FIRST datasets

Ontology evolution

Semantic & lexical resources, IDMS API

Topic detection & tracking; Active learning*

FIRST ontology

Models for sentiment classification*

Models for canyon flow visualization

(Nearly) Static part | Dynamic part

“Knowledge base”

* Smailović, Grčar, Lavrač, Žnidaršič: Stream-based active learning for sentiment analysis in the financial domain (to appear)

Page 5: Data acquisition and FIRST datasets

Data acquisition pipeline (DacqPipe)

• Resembles big data streaming architectures such as Twitter Storm
• Running continuously since April 2011
• Several scientific contributions: boilerplate remover & gold standard dataset; ontology & ontology-based information extractor
• Executable available at http://first.ijs.si/software/DacqPipeJun2013.zip
• Source code: https://github.com/project-first/dacqpipe

[Diagram: DacqPipe architecture. Components shown: RSS reader, HTML tokenizer, boilerplate remover & duplicate detector, language detector, filter, NLP pipe, OBIE, DB writer, DB, 0MQ channel (emit). Stage labels: Read & parse, Clean, Syntactic analysis, Semantic preprocessing, Store.]
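The staged, streaming structure of the diagram can be illustrated with chained Python generators. This is only a minimal sketch of the idea, not DacqPipe's actual code: the function names, the dict-based document format, and the toy cleaning and tokenizing logic are all assumptions made for the example.

```python
from typing import Callable, Iterable, Iterator, List

Doc = dict  # hypothetical document representation for this sketch


def read_and_parse(raw_pages: Iterable[str]) -> Iterator[Doc]:
    """Stand-in for the 'Read & parse' stage (RSS reader, HTML tokenizer)."""
    for html in raw_pages:
        yield {"html": html, "text": html}


def clean(docs: Iterable[Doc]) -> Iterator[Doc]:
    """Stand-in for the 'Clean' stage: drops empty pages and exact duplicates."""
    seen = set()
    for doc in docs:
        text = " ".join(doc["text"].split())  # normalise whitespace
        if text and text not in seen:
            seen.add(text)
            yield {**doc, "text": text}


def analyse(docs: Iterable[Doc]) -> Iterator[Doc]:
    """Stand-in for the NLP pipe: whitespace tokenization only."""
    for doc in docs:
        yield {**doc, "tokens": doc["text"].lower().split()}


def run_pipeline(raw_pages: Iterable[str],
                 stages: List[Callable[[Iterable], Iterator]]) -> List[Doc]:
    """Chain the stages lazily; the final list stands in for the DB writer."""
    stream = raw_pages
    for stage in stages:
        stream = stage(stream)
    return list(stream)


stored = run_pipeline(
    ["Stocks  rise", "Stocks rise", "", "Bonds fall"],
    [read_and_parse, clean, analyse],
)
```

Each stage consumes and yields documents one at a time, so a new document can flow through the whole chain without waiting for the rest of the stream.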

Page 6: Data acquisition and FIRST datasets

Data acquisition pipeline (DacqPipe)

[Diagram: the same DacqPipe architecture diagram, shown without the boilerplate remover & duplicate detector stage.]

Page 7: Data acquisition and FIRST datasets

Boilerplate removal


Page 8: Data acquisition and FIRST datasets

Streaming setting


Page 9: Data acquisition and FIRST datasets

Hypothesis

Web pages at similar Web addresses share common boilerplate, while main content is unique


Page 10: Data acquisition and FIRST datasets

URL Tree

[Diagram: documents from the stream are inserted into a URL tree, e.g. http://www.bbc.co.uk/sports/story2371.html. For a text block such as “About us”, the question asked is: how many times did I see “About us” in this part of the tree?]
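The URL-tree idea can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the boilerplate remover shipped with DacqPipe: the class names, the rule that the last path segment is the document name, and the `min_count` threshold are all hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse


class UrlTreeNode:
    def __init__(self):
        self.children = {}
        self.block_counts = defaultdict(int)  # text block -> documents seen


class UrlTree:
    def __init__(self):
        self.root = UrlTreeNode()

    @staticmethod
    def _path(url):
        # Assumption: host first, then directories; the last segment is the document.
        parts = urlparse(url)
        segments = [s for s in parts.path.split("/") if s]
        return [parts.netloc] + segments[:-1]

    def insert(self, url, blocks):
        """Record each distinct text block at every node on the URL's path."""
        node = self.root
        for key in self._path(url):
            node = node.children.setdefault(key, UrlTreeNode())
            for block in set(blocks):
                node.block_counts[block] += 1

    def count(self, url, block):
        """How many documents in this part of the tree contained the block?"""
        node = self.root
        for key in self._path(url):
            node = node.children.get(key)
            if node is None:
                return 0
        return node.block_counts.get(block, 0)


def looks_like_boilerplate(tree, url, block, min_count=2):
    # Hypothetical decision rule: a block repeated under similar URLs is boilerplate.
    return tree.count(url, block) >= min_count


tree = UrlTree()
tree.insert("http://www.bbc.co.uk/sports/story2371.html", ["About us", "Team A wins the cup"])
tree.insert("http://www.bbc.co.uk/sports/story2372.html", ["About us", "Player B is injured"])
tree.insert("http://www.bbc.co.uk/business/report1.html", ["About us", "Markets fall"])

url = "http://www.bbc.co.uk/sports/story2371.html"
about_us_seen = tree.count(url, "About us")
```

In a streaming setting the counts grow as documents arrive, so a block that looks unique at first may later be recognised as boilerplate.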

Page 11: Data acquisition and FIRST datasets

Evaluation

Dataset
• 569,583 time-stamped documents (stream)
• 292,053 documents after URL normalization
• Oct 24 – Dec 19, 2011; 31 Web sites
• Part of the FIRST dataset of news & blogs

Gold standard
• 56,436 documents annotated with manually designed regexes tailored to specific Web sites

Page 12: Data acquisition and FIRST datasets

Evaluation

[Figure: evaluation results chart, with an annotation labelled “Reset”.]

Page 13: Data acquisition and FIRST datasets

Gold standard dataset: http://first.ijs.si/urltreedataset

Page 14: Data acquisition and FIRST datasets

Conclusion: Final results of WP3

Data acquisition pipeline software (DacqPipe)
• Since April 2011
• https://github.com/project-first/dacqpipe

FIRST dataset of news & blogs
• 219 Web sites; ~15 million unique documents
• http://first.ijs.si/FIRSTDataset

FIRST ontology
• Semantic + lexical part
• Information extraction + sentiment analysis
• http://first.ijs.si/FIRSTOntology/y3

Page 15: Data acquisition and FIRST datasets

Achim Klein (UHOH), 20 November, Luxembourg

Technical Presentations and Demos: Sentiment Analysis

Page 16: Data acquisition and FIRST datasets

Knowledge-based Sentiment Extraction

a) Direct sentiment
Example: “I expect the S&P 500 to rise” → positive sentiment
Addressed by rules

b) Indirect sentiment, using indicators
Example: “I think U.S. interest rates will rise” → negative sentiment
Addressed by ontology
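The two cases can be illustrated with a toy classifier. This is a sketch only: the regular expressions, the small indicator dictionary standing in for the ontology, and the object list are assumptions for the example, not the project's actual rules or ontology.

```python
import re

# Hypothetical direction rules: which way is something said to move?
DIRECTION_RULES = [
    (re.compile(r"\b(rise|increase|climb)\b", re.I), +1),
    (re.compile(r"\b(fall|decrease|drop)\b", re.I), -1),
]

# Toy stand-in for the ontology: effect on sentiment when the indicator rises.
INDICATOR_POLARITY = {"interest rates": -1, "unemployment": -1}

# Toy list of sentiment objects handled directly by rules.
SENTIMENT_OBJECTS = ["S&P 500"]


def direction(sentence):
    for pattern, sign in DIRECTION_RULES:
        if pattern.search(sentence):
            return sign
    return 0


def label(score):
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"


def classify(sentence):
    """Return (kind, sentiment), where kind is 'direct', 'indirect' or 'none'."""
    lowered = sentence.lower()
    move = direction(sentence)
    for indicator, polarity in INDICATOR_POLARITY.items():
        if indicator in lowered:
            # Indirect: the indicator's movement is mapped through its polarity.
            return ("indirect", label(move * polarity))
    for obj in SENTIMENT_OBJECTS:
        if obj.lower() in lowered:
            return ("direct", label(move))
    return ("none", "neutral")


direct_result = classify("I expect the S&P 500 to rise")
indirect_result = classify("I think U.S. interest rates will rise")
```

The same verb ("rise") yields opposite sentiment in the two examples, which is why the indirect case needs background knowledge about the indicator.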

Page 17: Data acquisition and FIRST datasets

UC Retail Brokerage/Market Surveillance: Economic Indicators

• Debt to Equity, Dividend Yield, Earnings to Price Ratio, New Products, Profit Margin, Sales, …
• Interest Rate, Inflation, M2 Change Rate, Durable Goods Orders, Unemployment, Private Housing, New Building Permits, …
• Advance/Decline Ratio, Bear Flag, Break Out, Double Bottom, RSI, Support, Resistance, …

Page 18: Data acquisition and FIRST datasets

Example Insights: Unemployment Indicator

[Figure: Unemployment Indicator Volume over time, 1/1/2013 to 10/1/2013 (y-axis 0 to 180). Annotations: official US unemployment statistics release dates; record Greek unemployment numbers released.]

Page 19: Data acquisition and FIRST datasets

UC Reputational Risk: Reputation Indicators (Y3)

Positive and negative sample indicators per reputation topic:

• Social Responsibility. Positive correlation: Charity, Donation, Education. Negative correlation: Crime, Bullying, Slave.
• Human Resources. Positive correlation: Professional, Talent, Manpower. Negative correlation: Lay off, Job cuts, Wrongdoers.
• Business Behavior. Positive correlation: Transparent, Responsible, Campaign. Negative correlation: Debt, Foreclosure, Price-fixing.
• Corporate Governance. Positive correlation: Accountable, Tier 1 ratio, AML. Negative correlation: Breach, Shady funds, Law suit.
• Exposure on Critical Markets. Positive correlation: Subsidy, Liquidity, Customers. Negative correlation: Subprime Mortgage, CDS spread.

Total number of indicators: 1451

Page 20: Data acquisition and FIRST datasets

Reputation Sentiment Classification Performance

                      Precision   Recall
Without indicators    67.7%       23.7%
With indicators       71.2%       44.9%

Higher recall of (indirect) sentiments by means of indicators

Page 21: Data acquisition and FIRST datasets

Reputational Insights: JPMorgan

[Figure: volume of Corporate Governance indicators over time, 13.09.2013 to 04.11.2013 (y-axis 0 to 300). Annotated events:]
• Corporate Governance, 19.09.2013: “Scandals cost JPMorgan $1 billion in fines” [REUTERS]
• Corporate Governance, 11.10.2013: “JPMorgan’s Dimon Posts First Loss on $7.2 Billion Legal Cost” [BLOOMBERG]

Page 22: Data acquisition and FIRST datasets

Fuzzy Sentiment Classification

1. Extract sentiment objects
“Apple’s earnings are rising”
“Sales might decrease because of the financial crisis”

2. Classify sentiment per object in each sentence

3. Generate machine-learning input: sentiments and words of all sentences that refer to the same object

4. Two separate document-level machine-learning fuzzy classifiers with 5 degrees of … (1) positive, (2) negative
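Step 3 (collecting, per object, the sentiments and words of all sentences that refer to it) can be sketched as a simple grouping. The input format, the function name, and the assignment of both example sentences to the object "Apple" are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def build_ml_input(sentence_annotations):
    """Group sentence-level sentiments and words by sentiment object.

    sentence_annotations: iterable of (object, sentiment, sentence) triples,
    i.e. the assumed output of steps 1 and 2.
    """
    grouped = defaultdict(lambda: {"sentiments": [], "words": Counter()})
    for obj, sentiment, sentence in sentence_annotations:
        grouped[obj]["sentiments"].append(sentiment)
        grouped[obj]["words"].update(sentence.lower().split())
    return dict(grouped)


# Made-up annotations for the two example sentences.
annotations = [
    ("Apple", "positive", "Apple's earnings are rising"),
    ("Apple", "negative", "Sales might decrease because of the financial crisis"),
]

features = build_ml_input(annotations)
```

The per-object record in `features` is the kind of input a document-level classifier in step 4 could consume.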

Page 23: Data acquisition and FIRST datasets

Enhanced Gold Standard Corpus (Y3): Retail Brokerage/Market Surveillance

Corpus size
      Y2     Y3
      409    1021

Improved hybrid sentiment classifier performance
             Y2       Y3
Precision    62.6%    69.0%
Recall       54.2%    62.4%

Page 24: Data acquisition and FIRST datasets

Main Results

Deep knowledge-based sentiment analysis
• Specific to a feature of an object using rules (e.g., reputation of a company)
• Economic and reputation indicators improve classifier performance and provide valuable insights for users
• Glass-box approach with drill-down capabilities
• Best paper award at IEEE

Fuzzy classifier with 5 degrees of positivity and negativity for better decision making
• Fuzzy-level Gold Standard Corpus
• Analyzed >3 million documents
• Open source available: git://github.com/project-first/semanticinformationextraction.git

Page 25: Data acquisition and FIRST datasets

Thank you

Page 26: Data acquisition and FIRST datasets

Luxembourg, November 20th, 2013

WP6 Technical Presentation & Demos

Marko Bohanec, Miha Grčar, Jan Muntermann, Michael Siering

Page 27: Data acquisition and FIRST datasets

WP6 Status End of Y2

[Timeline: project months 19 to 36]

• Mainly presenting basic stand-alone prototypes
• Presentation of the first models
• First visualisation components

Page 28: Data acquisition and FIRST datasets

WP6 Achievements Y3

[Timeline: project months 19 to 36]

T 6.2 / T 6.3 Machine Learning & Qualitative Models
• Refinements of qualitative models based on domain experts’ feedback
• Highly scalable implementations
• FIRST pipeline integration
• Delivery of D6.3 in M33

T 6.4 Visualisation Components
• Development of additional and revised visualisation components based on domain experts’ feedback
• Highly scalable implementations
• Delivery of D6.4 in M34

Page 29: Data acquisition and FIRST datasets

Agenda

Qualitative and quantitative models
• Reputational Risk Management
• Market Surveillance

Visualizations
• Retail Brokerage

Page 30: Data acquisition and FIRST datasets

Reputational Risk Problem Formulation (1/2)

General Area: Production and distribution of investment products and services by banks and other financial institutions.

Specific Use Case: Assessment of reputational risk (RI) based on assessments of MPS counterparties.

Reputational Risk: Risk arising from negative perception on the part of customers, counterparties, shareholders, investors, debt-holders, market analysts, other relevant parties or regulators that can adversely affect a bank’s ability to maintain existing, or establish new, business relationships and continued access to sources of funding.

Page 31: Data acquisition and FIRST datasets

Reputational Risk Problem Formulation (2/2)

Goal: to develop
• a multi-criteria model for the assessment of MPS reputational risk (RIM)
• that serves as the main component of a corresponding DSS

Approach: expert modeling, qualitative multi-attribute modeling (method DEX)

[Diagram labels: time series; financial, trading data; sentiment data; evaluation; novelties]

Page 32: Data acquisition and FIRST datasets

RIM: Main Components

[Diagram: RIM main components. Inputs: PRODUCT, CUSTOMER, COUNTERPART, bank data. Steps: basic data processing; qualitative evaluation (DEXi model), producing qRI1; aggregation (→ Customer → Product → Counterpart → Bank), using relative CUSTOMER numbers and relative PRODUCT volumes. Output: RI.]

Page 33: Data acquisition and FIRST datasets

RIM: Basic Data Processing

[Diagram: RIM basic data processing. Input groups: COUNTERPART, CUSTOMER, PRODUCT, BANK data. Processing blocks: PRODUCT VOLUMES, CUSTOMER VOLUMES, MISMATCHING, SENTIMENT, PERFORMANCE. Variables shown: SLP, SSP, SPP, BP, SRI, w, f, ΔB, ΔM, P, RP, qS, qP, qM, V1, VC, RV1C, qRV1C, VP, TA, NP, TN, RVP, RNP.]

Page 34: Data acquisition and FIRST datasets

RIM: Qualitative Evaluation

Aim: qualitative assessment of Reputational Index for one customer and product

Model: qualitative hierarchical rule-based DEX model

[Diagram: model attributes qS, qP, qM, qRV1C, qPM, qRI1]

DEXi RIM_D63.dxi, 10.7.13, Page 2

Rules 61 to 66:
61   very-neg      medium:high   medium:high    high
62   *             very-high     >=high         very-high
63   >=low-neg     very-high     >=medium       very-high
64   >=high-neg    >=high        very-high      very-high
65   very-neg      >=medium      very-high      very-high
66   very-neg      very-high     >=medium-low   very-high

      qP            qM            qPM
1     <=low         in-line       in-line
2     in-line       low:medium    low
3     <=medium      low           low
4     medium        <=low         low
5     >=medium      in-line       low
6     in-line       high          medium
7     low:medium    medium        medium
8     >=high        low           medium
9     <=low         very-high     high
10    low           >=high        high
11    low:high      high          high
12    high          medium:high   high
13    >=high        medium        high
14    >=medium      very-high     very-high
15    very-high     >=high        very-high
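The qP/qM → qPM rule table can be read as a lookup in which `<=x`, `>=x` and `a:b` denote value ranges on an ordered scale. The sketch below evaluates the fifteen rules in that way; the scale ordering (in-line < low < medium < high < very-high) and the first-match strategy are assumptions for illustration, and the code is not part of DEXi.

```python
# Assumed ordering of the qualitative scale shared by qP and qM.
SCALE = ["in-line", "low", "medium", "high", "very-high"]
RANK = {value: i for i, value in enumerate(SCALE)}

# (qP condition, qM condition, qPM result), rules 1 to 15 as listed on the slide.
RULES = [
    ("<=low", "in-line", "in-line"),
    ("in-line", "low:medium", "low"),
    ("<=medium", "low", "low"),
    ("medium", "<=low", "low"),
    (">=medium", "in-line", "low"),
    ("in-line", "high", "medium"),
    ("low:medium", "medium", "medium"),
    (">=high", "low", "medium"),
    ("<=low", "very-high", "high"),
    ("low", ">=high", "high"),
    ("low:high", "high", "high"),
    ("high", "medium:high", "high"),
    (">=high", "medium", "high"),
    (">=medium", "very-high", "very-high"),
    ("very-high", ">=high", "very-high"),
]


def matches(condition, value):
    """Check a single value against a rule condition."""
    rank = RANK[value]
    if condition == "*":
        return True
    if condition.startswith("<="):
        return rank <= RANK[condition[2:]]
    if condition.startswith(">="):
        return rank >= RANK[condition[2:]]
    if ":" in condition:
        low, high = condition.split(":")
        return RANK[low] <= rank <= RANK[high]
    return condition == value


def evaluate_qpm(qp, qm):
    """Return qPM for the first rule whose conditions match (qP, qM)."""
    for cond_p, cond_m, result in RULES:
        if matches(cond_p, qp) and matches(cond_m, qm):
            return result
    raise ValueError(f"no rule matches qP={qp!r}, qM={qm!r}")


example = evaluate_qpm("high", "low")
```

Because every rule is an explicit row, each assessment can be traced back to the rule that produced it.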


Page 35: Data acquisition and FIRST datasets

RIM: Aggregation

Aim: gradually aggregate qRI1 into the overall Reputation Risk Index (RI):
• hierarchical aggregation: Customer → Product → Counterpart → Bank
• taking into account relative product volumes and relative customer numbers

[Diagram: qRI1 aggregated in three steps: C/P → PRODUCT, PRODUCT → COUNTERPART, COUNTERPART → BANK]

Page 36: Data acquisition and FIRST datasets

RIM Reports: Topmost Level (Bank)


Page 37: Data acquisition and FIRST datasets

RIM Summary

Developed and implemented a decision support model component for the assessment of bank reputational risk

Approach: expert modeling using a variety of modeling methods (qualitative, quantitative, hierarchical, relational)

Novel aspects:
• taking into account sentiment assessments of counterparts
• advancing the present RI assessment model

Benefits for the users:
• obtaining a comprehensive RI as time series for different groups (customers, products, counterparts, bank)
• ability to analyse and explain assessments at different levels by drilling down through the RIM hierarchy

Page 38: Data acquisition and FIRST datasets

Agenda

Qualitative and quantitative models
• Reputational Risk Management
• Market Surveillance

Visualizations
• Retail Brokerage

Page 39: Data acquisition and FIRST datasets

Problem Formulation: Market Surveillance

Pump & Dump market manipulation: manipulation of the share price by disseminating false positive information in order to profit from the increased price level.

Page 40: Data acquisition and FIRST datasets

Pump & Dump Example (1/2)

“Shares can multiply dramatically in value over short time periods.”

“Thursday's pick is a story straight out of Hollywood!”

“SAPX - Wake Up, Put It On Your Screen NOW”

“Could this company be the next blockbuster?”

Source: http://newsletter.hotstocked.com/newsletters/view/Could_this_company_be_the_next_blockbuster_-92301

Page 41: Data acquisition and FIRST datasets

Pump & Dump Example (2/2)

Seven Arts Entertainment, Inc. (SAPX)

[Figure: SAPX chart with “Shares Purchased” and “Shares Sold” markers. Source: Yahoo! Finance]

Pump & Dump campaign, July 24th – 28th, 2011
> 30 different recommendations

Page 42: Data acquisition and FIRST datasets

How to address Pump & Dump Manipulations?

Qualitative Modeling
• Based on expert knowledge
• Qualitative attributes
• Decision problem divided into sub problems
• Goal: daily assessments

Quantitative Modeling
• Based on machine learning algorithms
• Quantitative attributes
• Goal: assessment of single documents

[Diagram: qualitative attribute tree for Pump & Dump. Attributes: Country Black List, Industry Black List, Company Black List, Age, Bankrupt, Trading Volume, Number of Trades, Market Capitalization, Market Segments, Sentiment, Content. Aggregates: Black List, History, Market, Trading, News, Company, Financial Instruments, Comp_FinInst. Root: Pump & Dump.]

Page 43: Data acquisition and FIRST datasets

Qualitative Multi-Attribute Model Development

[Diagram: qualitative attribute tree for Pump & Dump. Attributes: Country Black List, Industry Black List, Company Black List, Age, Bankrupt, Trading Volume, Number of Trades, Market Capitalization, Market Segment, Sentiment, Content. Aggregates: Black List, History, Market, Trading, News, Company, Financial Instrument, Comp_FinInst. Root: Pump & Dump.]

Page 44: Data acquisition and FIRST datasets

From initial DEXi Model (M24) to Processing of Data Stream (M33)

• The initial model structure was developed and distributed as DEXi files.
• At that stage, the models could be applied only within the DEXi environment (M24).
• To enable the models to process large-scale data streams, a Java-based prototype was implemented (M33).

Page 45: Data acquisition and FIRST datasets

Definition of Data Sources

Regulatory Authorities web pages

Page 46: Data acquisition and FIRST datasets

Model Configuration and Evaluation (M24)

Model configuration
• V-high: 3 configurations
• High: 9
• Medium: 7
• Low: 5
• V-low: 1

Evaluation based on predefined configuration:

P&D value   Number of P&D values   Percentage
v-high      482                    0.66
high        57588                  78.75
med         12342                  16.88
low         2498                   3.42
v-low       215                    0.29
Sum         73125                  100

Evaluation
• 1700 OTC-traded companies
• Dataset: 01.2012 to 06.2013 (370 trading days)
• On average 157 alerts per day for v-high and high

Page 47: Data acquisition and FIRST datasets

Reconfiguration of the Rules in Y3


Page 48: Data acquisition and FIRST datasets

Model Configuration and Evaluation (M33)

Configuration
• V-high: 3 configurations
• High: 7
• Medium: 8
• Low: 6
• V-low: 1

Evaluation results based on reconfigured model:

P&D value   Number of P&D values   Percentage
v-high      982                    0.8
high        22215                  18.8
med         92049                  78.0
low         2779                   2.4
v-low       57                     0.0
Sum         118082                 100

Evaluation
• 1700 OTC-traded companies
• Dataset: 01.2012 to 09.2013 (435 trading days)
• On average 53 alerts per day for v-high and high
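The reported percentages and alert rates follow directly from the counts. As a quick check, the sketch below recomputes them for the M24 and M33 evaluations; the function and variable names are illustrative, and the counts and trading-day figures are taken from the two evaluation slides.

```python
def summarise(counts, trading_days):
    """Return (total, percentage per class, average v-high/high alerts per day)."""
    total = sum(counts.values())
    shares = {name: round(100 * n / total, 2) for name, n in counts.items()}
    alerts_per_day = (counts["v-high"] + counts["high"]) / trading_days
    return total, shares, alerts_per_day


m24_counts = {"v-high": 482, "high": 57588, "med": 12342, "low": 2498, "v-low": 215}
m33_counts = {"v-high": 982, "high": 22215, "med": 92049, "low": 2779, "v-low": 57}

m24_total, m24_shares, m24_alerts = summarise(m24_counts, trading_days=370)
m33_total, m33_shares, m33_alerts = summarise(m33_counts, trading_days=435)
```

With these inputs the v-high and high classes together give about 157 alerts per day for M24 (58,070 / 370) and about 53 for M33 (23,197 / 435), so the reconfigured rules cut the daily alert volume to roughly a third.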

Page 49: Data acquisition and FIRST datasets

Research Impact

Alic, I.; Siering, M.; Bohanec, M. (2013): Hot Stock or Not? A Qualitative Multi-Attribute Model to Detect Financial Market Manipulation. In: Proceedings of the 26th Bled eConference; Bled, Slovenia

Page 50: Data acquisition and FIRST datasets

How to address Pump & Dump Manipulations?

Qualitative Modeling
• Based on expert knowledge
• Qualitative attributes
• Decision problem divided into sub problems
• Goal: daily assessments

Quantitative Modeling
• Based on machine learning algorithms
• Quantitative attributes
• Goal: assessment of documents

[Diagram: qualitative attribute tree for Pump & Dump, as on the earlier slide with the same title.]

Page 51: Data acquisition and FIRST datasets

Research Objective: Development of a Pump & Dump Classifier

Learning phase: Labeled documents → Training algorithm (Support Vector Machine) → Classifier

Application phase: New documents → Classifier → Predictions (suspicious? / not suspicious?)

Steps: Evaluation of Training Documents; Evaluation of Machine Learning Algorithms; Integration in FIRST Pipeline
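The two phases can be illustrated with a small self-contained text classifier. The slide names a Support Vector Machine as the training algorithm; to keep the example dependency-free, this sketch swaps in a multinomial Naive Bayes with Laplace smoothing (one of the learners compared in the evaluation). The training texts, labels, and function names are made up for illustration.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Learning phase: estimate class and word counts from labeled documents."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labeled_docs:
        words = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab


def predict(model, text):
    """Application phase: return the most probable label for a new document."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # class prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word in vocab:  # ignore words never seen in training
                # Laplace-smoothed word likelihood
                score += math.log(
                    (word_counts[label][word] + 1) / (total_words + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training documents.
training_docs = [
    ("shares can multiply dramatically buy now", "suspicious"),
    ("hot pick next blockbuster buy now", "suspicious"),
    ("company reports quarterly earnings", "not suspicious"),
    ("board announces dividend for shareholders", "not suspicious"),
]

model = train(training_docs)
prediction = predict(model, "buy this hot pick now")
```

Training happens once on labeled documents; the resulting model is then applied to each new document arriving from the stream.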


Page 52: Data acquisition and FIRST datasets

Evaluation of Training Documents

Event study: capital market reaction during / after pump and dump campaign
• significant abnormal returns during campaign
• price decrease after campaign has ended

Siering, M. (2013): All Pump, No Dump? The Impact of Internet Deception on Stock Markets. In: Proceedings of the 21st European Conference on Information Systems; Utrecht, Netherlands

Page 53: Data acquisition and FIRST datasets

Evaluation of Machine Learning Algorithms

• Evaluation of different machine learning algorithms
• Neural Network: reduced feature set
• SVM: parameter optimisation according to Hsu et al. (2003)

                             Class suspicious              Class non-suspicious
Learner          Accuracy    Precision  Recall   F1        Precision  Recall   F1
Decision Tree    95.10       94.65      95.60    95.12     95.56      94.60    95.08
Naïve Bayes      97.30       96.28      98.40    97.33     98.36      96.20    97.27
k-NN, k=1        78.10       80.09      74.80    77.35     76.36      81.40    78.80
k-NN, k=2        73.60       68.67      86.80    76.68     82.07      60.40    69.59
k-NN, k=3        75.30       77.32      71.60    74.35     73.56      79.00    76.18
k-NN, k=4        74.20       71.68      80.00    75.61     77.38      68.40    72.61
k-NN, k=5        75.10       77.34      71.00    74.03     73.20      79.20    76.08
Neural Network   97.00       96.81      97.20    97.00     97.19      96.80    96.99
SVM              99.30       99.20      99.40    99.30     99.40      99.20    99.30

[Figure: computing requirements of different learners (10-fold cross validation, in sec.; y-axis 0 to 500) for Decision Tree, Naive Bayes, k-NN, Neural Network, Support Vector Machine]

Hsu, C. W., Chang, C. C., & Lin, C. J. (2003). A practical guide to support vector classification. National Taiwan University, http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf (accessed on 10/16/2011).
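The F1 values in the table are the harmonic mean of precision and recall, F1 = 2PR / (P + R). As a sanity check, the sketch below recomputes F1 for the "suspicious" class of three learners from the reported precision and recall; the dictionary layout and names are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) for the class "suspicious", in percent.
reported = {
    "Decision Tree": (94.65, 95.60, 95.12),
    "Naive Bayes": (96.28, 98.40, 97.33),
    "SVM": (99.20, 99.40, 99.30),
}

recomputed = {
    name: round(f1_score(p, r), 2) for name, (p, r, _) in reported.items()
}
```

The recomputed values agree with the reported F1 column to two decimal places.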

Page 54: Data acquisition and FIRST datasets

Integration in FIRST Pipeline

[Diagram: FIRST pipeline with the quantitative Pump and Dump model. Components shown: external data (Web) sources, internal data sources, data sources, HTML pages, HTML preprocessing, boilerplate removal, language detection, near-duplicate removal, clean text, database, quantitative Pump and Dump model.]

Page 55: Data acquisition and FIRST datasets

Integration of Quantitative and Qualitative Models: Multi Classifier Approach

[Diagram: qualitative attribute tree for Pump & Dump. Attributes: Country Black List, Industry Black List, Company Black List, Age, Bankrupt, Trading Volume, Number of Trades, Market Capitalization, Market Segment, Sentiment, Content. Aggregates: Black List, History, Market, Trading, News, Company, Financial Instrument, Comp_FinInst. Root: Pump & Dump.]

Page 56: Data acquisition and FIRST datasets

Market Surveillance Summary

Developed and implemented a multi-classifier decision support component for the assessment of information-based market manipulation

Approach: expert modeling using a variety of modeling methods (qualitative, quantitative, hierarchical, relational) and machine learning

Novel aspects:
• taking into account sentiment assessments of published documents
• multi-classifier component integrates qualitative and quantitative model

Benefits for the users:
• obtaining the ability to monitor information-based market manipulation in market segments with a large number of financial instruments

Page 57: Data acquisition and FIRST datasets

Agenda

Qualitative and quantitative models
• Reputational Risk Management
• Market Surveillance

Visualizations
• Retail Brokerage

Page 58: Data acquisition and FIRST datasets

Visualisation Components

Video

Online Demo


Page 59: Data acquisition and FIRST datasets

Retail Brokerage Summary

Developed and implemented visualisation components providing the basis for data/document-driven DSS in the Retail Brokerage use case scenario.

Approach: clustering of document topics, aggregation of document sentiments and publication statistics.

Novel aspects:
• Visualisation components that condense social media activity
• Aggregation of media topics to explore social media contents

Benefits for the users:
• Explore activity, sentiment and topics of social media in a user-friendly way

Page 60: Data acquisition and FIRST datasets

Thank you for your attention!

Questions?
