Web and Social Media research at SZTAKI Zsolt Fekete, Andras Benczur Institute for Computer Science and Control Hungarian Academy of Sciences and Eötvös University Budapest benczur @ sztaki.mta.hu http://datamining.sztaki.hu 14 June 2013 Web and Social Media TÁMOP-4.2.2.C-11/1/KONV-2012-0013


Page 1:

Web and Social Media research at SZTAKI

Zsolt Fekete, Andras Benczur

Institute for Computer Science and Control, Hungarian Academy of Sciences and Eötvös University Budapest

benczur @ sztaki.mta.hu

http://datamining.sztaki.hu

14 June 2013

TÁMOP-4.2.2.C-11/1/KONV-2012-0013

Page 2:

Informatics Laboratory
• Data Mining and Search Group
  o Zsolt Fekete, head
• Data Warehouse and Business Intelligence group
  o Csaba Sidlo, head
• Groups within the lab
  o Lajos Ronyai, Theory of Computing group
  o Daniel Marx, ERC Starting Grant winner, Parameterized Complexity
  o Andras Kornai, Human Language Technologies

Hardware
• 50-node old dual-core Hadoop
• 5-node new Hadoop/HBASE
• 260TB net Isilon

Big Data “Momentum” group
Awarded by the President of the Hungarian Academy of Sciences in 2012

Page 3:

SZTAKI Text Mining Center
• Funded by the President of the Hungarian Academy of Sciences
• Led by Prof. Laszlo Monostori, Research Laboratory on Engineering & Management Intelligence
  o Informatics Laboratory (András Benczúr)
  o Laboratory of Parallel and Distributed Systems (Péter Kacsuk)
  o Internet Technology Department (István Tétényi)
  o Department of Distributed Systems (László Kovács)
• Topics:
  o trend monitoring; novelty recognition; concept-flow, concept-mapping;
  o analysis, monitoring and visualization of themes, professional relations, joint authorship, citations, etc.
  o opinion extraction; semantic annotation; domain ontology development;
  o identification and resolution of names of persons and organizations;
  o plagiarism detection

Page 4:

Connection to FuturICT.hu Work Plan
• Science of Science
  o SZTAKI Text Mining Center
  o Web classification
  o Metadata extraction
  o SZTAKI Plagiarism Detection toolkit
• Fully Distributed Learning (and Networks)
  o Recommender systems
  o Distributed and streaming architectures
  o Network influence in recommender systems

Page 5:

Automatic metadata extraction
• Articles in PDF form
• Extracting
  o Title
  o Authors
  o References
  o Etc.
• Techniques used
  o Computing features (text, visual info)
  o Machine learning: SVM, CRF
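The feature-computation step can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the `Line` record, the feature names, and the sample lines are hypothetical, and the slide does not say which features the group actually uses. The resulting vectors are the kind of input an SVM or CRF would be trained on.

```python
import re
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Line:
    """Hypothetical record for one text line extracted from a PDF page."""
    text: str
    font_size: float  # in points
    y_top: float      # distance from the top of the page, scaled to 0..1


def line_features(line: Line, max_font: float) -> Dict[str, float]:
    """Combine textual and visual cues into one feature vector."""
    tokens = line.text.split()
    n = max(len(tokens), 1)
    return {
        # visual cues
        "rel_font_size": line.font_size / max_font,
        "y_top": line.y_top,
        # textual cues
        "n_tokens": float(len(tokens)),
        "frac_capitalized": sum(t[:1].isupper() for t in tokens) / n,
        "has_comma": float("," in line.text),
        "starts_with_ref_marker": float(bool(re.match(r"^\[\d+\]", line.text))),
        "has_year": float(bool(re.search(r"\b(19|20)\d{2}\b", line.text))),
    }


def page_features(lines: List[Line]) -> List[Dict[str, float]]:
    """Compute features for every line, relative to the largest font on the page."""
    max_font = max(l.font_size for l in lines)
    return [line_features(l, max_font) for l in lines]


# Made-up sample page: a title, an author line, and a reference entry.
sample = [
    Line("A Hypothetical Paper Title", 18.0, 0.08),
    Line("A. Author, B. Author, C. Author", 11.0, 0.14),
    Line("[1] D. Author. Some earlier work. 2010", 9.0, 0.85),
]
feats = page_features(sample)
```

In this sketch a title tends to stand out by relative font size and position, while a reference entry is marked by a bracketed number and a year. A sequence model such as a CRF could additionally use the labels of neighboring lines.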

Page 6: Web and Social Media research at SZTAKI Zsolt Fekete, Andras Benczur Insitute for Computer Science and Control Hungarian Academy of Sciences and Eötvös.

• Save resources, select quality and topic• Legal regulation (porn, illicit content)• Web scale data (Test: ClueWeb09 25TB –

0.5 Billion English language docs)

JulienPhilippeMasanes

RigauxInternet Memory Paris

Cross-Lingual Web Spam Classification. Garzó, Daróczy, Kiss, Siklósi, Benczúr. WebQuality 2013 (@WWW)The classification power of Web features. Erdelyi, Benczur, Daroczy, Garzo, Kiss, Siklosi Internet Mathematics, under revision

Crosslingual Web Classification

Page 7:

Crosslingual Web Classification
• Expensive human labeling task language by language?
• How can models be “translated”?

Terms in the English model are translated into Portuguese to classify in the target language.

The strongest positive and negative predictions are then used for training a model in the target language.
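The two steps on this slide (translate the model's terms, then train on the most confident predictions) can be sketched as follows. This is a toy illustration under stated assumptions, not the method of the cited papers: the bilingual dictionary, the term weights, the sample documents, and the smoothed log-odds learner are all hypothetical stand-ins.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def translate_model(weights: Dict[str, float],
                    dictionary: Dict[str, List[str]]) -> Dict[str, float]:
    """Map source-language term weights onto target-language terms."""
    translated: Dict[str, float] = {}
    for term, w in weights.items():
        for target_term in dictionary.get(term, []):
            translated[target_term] = translated.get(target_term, 0.0) + w
    return translated


def score(doc: str, weights: Dict[str, float]) -> float:
    """Linear score of a whitespace-tokenized document."""
    return sum(weights.get(tok, 0.0) for tok in doc.lower().split())


def select_confident(docs: List[str], weights: Dict[str, float],
                     k: int) -> Tuple[List[str], List[str]]:
    """Return the k highest- and k lowest-scoring documents as pseudo-labels."""
    ranked = sorted(docs, key=lambda d: score(d, weights))
    return ranked[-k:], ranked[:k]  # (pseudo-positive, pseudo-negative)


def train_log_odds(positive: List[str], negative: List[str]) -> Dict[str, float]:
    """Fit add-one-smoothed log-odds term weights from pseudo-labeled documents."""
    pos = Counter(t for d in positive for t in d.lower().split())
    neg = Counter(t for d in negative for t in d.lower().split())
    vocab = set(pos) | set(neg)
    n_pos, n_neg = sum(pos.values()), sum(neg.values())
    return {
        t: math.log((pos[t] + 1) / (n_pos + len(vocab)))
           - math.log((neg[t] + 1) / (n_neg + len(vocab)))
        for t in vocab
    }


# Hypothetical English spam model (positive weight = spam-like term).
english_weights = {"casino": 2.0, "free": 1.0, "university": -1.5, "research": -1.0}
# Hypothetical English-to-Portuguese dictionary (accents omitted for simplicity).
en_to_pt = {"casino": ["cassino"], "free": ["gratuito"],
            "university": ["universidade"], "research": ["pesquisa"]}
# Unlabeled target-language documents.
pt_docs = [
    "cassino gratuito oferta agora",
    "universidade pesquisa artigo",
    "oferta gratuito clique",
    "artigo pesquisa revista",
    "jornal do dia",
]

translated = translate_model(english_weights, en_to_pt)
pseudo_pos, pseudo_neg = select_confident(pt_docs, translated, k=2)
target_model = train_log_odds(pseudo_pos, pseudo_neg)
```

The point of the second step is visible in the toy data: "oferta" and "artigo" have no entry in the dictionary, yet the target-language model assigns them weights because they co-occur with translated terms in the confidently classified documents.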

Page 8:

KopFIRE: Technology in the cloud
• BonFIRE FP7 Future Internet Research and Experimentation testbed
• KOPI: A plagiarism detection toolkit
  o http://kopi.sztaki.hu/
  o Translation plagiarism (English and Hungarian)
  o Now serving Wikipedia
  o Service puts very heavy load on search index (sentence based checks, existing suboptimal code)
  o Index ported to several distributed key-value stores
  o New alpha version now fed with Web data

Page 9:

Search for events in time

Page 10:

Search for events in time

Page 11:

Search for events in time

Page 12:

SZTAKI Full Text Search Technology

Page 13:

Trend analysis
• Temporal data (e.g. blogs)
• Visualizing trends
  o Words
  o Groups of words
• Challenges
  o Big data techniques
  o Temporal text indexing
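As a small illustration of what a trend series for a word or group of words might look like, the sketch below computes the relative frequency of a word group per month over dated posts. The function name, the monthly bucketing, and the sample posts are assumptions; the slide does not describe the actual indexing or visualization pipeline, and a web-scale version would need the distributed techniques listed under the challenges.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple


def trend(posts: Iterable[Tuple[str, str]], words: Set[str]) -> Dict[str, float]:
    """Relative frequency of a word group per month.

    posts: (ISO date 'YYYY-MM-DD', text) pairs.
    words: lowercase terms that make up the group to track.
    """
    hits: Counter = Counter()
    totals: Counter = Counter()
    for date, text in posts:
        month = date[:7]  # 'YYYY-MM' bucket
        tokens = text.lower().split()
        totals[month] += len(tokens)
        hits[month] += sum(tok in words for tok in tokens)
    # Months are returned in chronological order; empty months are skipped.
    return {m: hits[m] / totals[m] for m in sorted(totals) if totals[m]}


# Made-up blog posts for illustration.
sample_posts = [
    ("2013-04-02", "election debate tonight"),
    ("2013-04-20", "weather is nice"),
    ("2013-05-03", "election results election day"),
]
series = trend(sample_posts, {"election", "vote"})
```

Here `series` maps each month to the share of tokens that belong to the tracked group, which is the kind of per-period value a trend chart would plot.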

Page 14:

Network Influence in Recommenders

Page 15:

Mobility Data Stream processing (Orange D4D)

Page 16:

Stream Processing Architecture Overview

Goal is to hide Storm details from user
• Streaming infrastructure pluggable (could combine with Stratosphere)
• Persistence layer pluggable
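One way to read the pluggability goal is that user code depends only on two narrow interfaces, one for the streaming engine and one for persistence. The sketch below is hypothetical throughout: the class names, the record format, and the in-memory implementations are assumptions, and none of it is the Storm or Stratosphere API.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, Iterable, List, Tuple

Record = Tuple[str, int]  # hypothetical (key, value) record


class Persistence(ABC):
    """Pluggable persistence layer: the user never sees the concrete store."""

    @abstractmethod
    def add(self, key: str, value: int) -> None: ...

    @abstractmethod
    def get(self, key: str) -> int: ...


class InMemoryPersistence(Persistence):
    """Stand-in for a real store; keeps running sums in a dict."""

    def __init__(self) -> None:
        self._store: Dict[str, int] = {}

    def add(self, key: str, value: int) -> None:
        self._store[key] = self._store.get(key, 0) + value

    def get(self, key: str) -> int:
        return self._store.get(key, 0)


class StreamEngine(ABC):
    """Pluggable streaming infrastructure: hides engine-specific details."""

    @abstractmethod
    def run(self, source: Iterable[str],
            process: Callable[[str], List[Record]],
            sink: Persistence) -> None: ...


class LocalEngine(StreamEngine):
    """Single-process stand-in; a distributed engine would implement the same interface."""

    def run(self, source, process, sink):
        for event in source:
            for key, value in process(event):
                sink.add(key, value)


def count_by_cell(event: str) -> List[Record]:
    """User logic only: an event is 'user_id,cell_id'; emit one count per cell."""
    _, cell = event.split(",")
    return [(cell, 1)]


store = InMemoryPersistence()
LocalEngine().run(["u1,A", "u2,A", "u3,B"], count_by_cell, store)
```

The user supplies only `count_by_cell`. Swapping `LocalEngine` or `InMemoryPersistence` for another implementation of the same interface would leave that function unchanged, which is the property the slide describes.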

Page 17:

Conclusions
• SZTAKI covers a chain of research topics
  o Web data acquisition
  o cleansing and metadata extraction
  o search, temporal analytics
  o influence detection
  o recommendation
• Science of Science
  o SZTAKI Text Mining Center
  o Multilingual classification for quality, genre, spam
  o Metadata extraction from PDF publications over the Web
  o SZTAKI Plagiarism Detection toolkit
• Fully Distributed Learning and Networks
  o Distributed and streaming architectures
  o Network influence in recommender systems

Page 18:

Recent publications
• Pálovics, Benczúr. Temporal influence over the Last.fm social network. IEEE ASONAM 2013
• Garzó et al. Cross-Lingual Web Spam Classification. WebQuality 2013
• Erdélyi et al. The classification power of Web features. Internet Mathematics, under revision
• L. Kocsis, A. György, A. N. Bán. BoostingTree: Parallel Selection of Weak Learners in Boosting, with Application to Ranking. Machine Learning, to appear
• Garzó et al. Real-time streaming mobility analytics. NetMob 2013
• Göbölös-Szabó, Prytkova, Spaniol, Weikum. Cross-Lingual Data Quality for Knowledge Base Acceleration across Wikipedia Editions. QDB 2012
• Eom, Frahm, Benczúr, Shepelyansky. Time evolution of Wikipedia network ranking. arXiv, 2013
• C. Sidló, A. Garzó, A. Molnár, A. A. Benczúr. Infrastructures and Bounds for Distributed Entity Resolution. In Proc. QDB, in conjunction with VLDB 2011

Page 19:

Questions?

Zsolt Fekete
Head, Data Mining and Search
member of the “Big Data” lab

http://datamining.sztaki.hu/

[email protected]

14 June 2013