Towards Large Scale Semantic Annotation Built on MapReduce Architecture

Michal Laclavík, Martin Šeleng, Ladislav Hluchý

Institute of Informatics, Slovak Academy of Sciences in Bratislava

ICCS 2008, Cracow, June 23-25, 2008


Motivation

• Semantic Annotation or Tagging
  – Deliver formal understanding of text documents; one of the main focuses of the Semantic Web
  – Documents on the Web or in the enterprise are to be understood by a computer
  – The aim is to understand both content and context


Semantic Annotation

• Similar to Information Extraction

• Finding metadata about entities, their properties and their relations

• Ontologies

• Manual tools

• (Semi-)automatic tools
  – Usually tested on only a few hundred documents
• Needs:
  – To deliver applications on the Web or in enterprises, we need to annotate at large scale
  – The Semantic Web can be exploited only if metadata understood by a computer reaches critical mass
• Examples:
  – Geographical locations, People, Organizations


MapReduce

• Google's approach to large-scale information processing
• Runs on commodity PCs
• The application developer needs to implement only the Map and Reduce methods
• Inputs and outputs are ordered key-value pairs
• Fault tolerant, easy to use, scalable to hundreds of thousands of computers
• Hadoop
  – Open source implementation by Apache
  – Yahoo! is using it on 10 000 cores in a production environment
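The programming model above can be illustrated with a small single-process sketch. This is not Hadoop code: `run_mapreduce`, `count_words_map` and `count_words_reduce` are hypothetical names, the word-count task is only a stand-in example, and the shuffle step that a real cluster performs over the network is simulated here with a sort and a group-by.

```python
from itertools import groupby
from operator import itemgetter


def run_mapreduce(records, map_fn, reduce_fn):
    """Simulate map -> shuffle/sort -> reduce in a single process."""
    # Map phase: every input (key, value) pair may emit any number of pairs.
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))
    # Shuffle phase: sort by key so that equal keys become adjacent.
    intermediate.sort(key=itemgetter(0))
    output = []
    for key, group in groupby(intermediate, key=itemgetter(0)):
        values = [v for _, v in group]
        # Reduce phase: one call per distinct key.
        output.extend(reduce_fn(key, values))
    return output


def count_words_map(file_name, line):
    # Emit (word, 1) for every whitespace-separated token.
    return [(word.lower(), 1) for word in line.split()]


def count_words_reduce(word, counts):
    # Sum all counts emitted for the same word.
    return [(word, sum(counts))]


if __name__ == "__main__":
    lines = [("a.txt", "Apple hired Michal"), ("b.txt", "apple moved")]
    print(run_mapreduce(lines, count_words_map, count_words_reduce))
```

The developer writes only the two small functions; the grouping, ordering and distribution of work are left to the framework, which is what makes the model easy to scale.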


Ontea: Pattern Based Annotation

• Information extraction and semantic annotation using patterns
• Finds objects and properties in text
• Results can be transformed to RDF/OWL
• Similar to C-PANKOW, KIM or GATE
• Very simple solution, well suited to languages where advanced NLP is not available
• Applicable in enterprise applications


#  Text                      Instance                    Pattern (regular expression)
1  Apple, Inc.               Company: Apple              Company: ([A-Za-z0-9]+)[, ]+(Inc|Ltd)
2  Mountain View, CA 94043   Settlement: Mountain View   Settlement: ([A-Z][a-z]+[ ]*[A-Za-z]*)[ ]+[A-Z]{2}[ ]*[0-9]{5}
3  [email protected]         Email: [email protected]    Email: [-_.a-z0-9]+@[-_.a-zA-Z0-9]+\.[a-z]{2,8}
4  Mr. Michal Laclavik       Person: Michal Laclavik     Person: (Mr.|Mrs.|Dr.) ([A-Z][a-z]+ [A-Z][a-z]+)
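How such patterns turn text into typed instances can be sketched as follows. The two regular expressions are copied from rows 1 and 4 of the table; everything else is an assumption of this sketch rather than Ontea's actual API: the `PATTERNS` list, the `annotate` function, the choice of which capture group holds the instance, and the `Type:Instance` output format.

```python
import re

# (type, regular expression, index of the capture group holding the instance).
# The regexes come from rows 1 and 4 of the table; the group index and the
# tuple layout are assumptions made for this illustration.
PATTERNS = [
    ("Company", r"([A-Za-z0-9]+)[, ]+(Inc|Ltd)", 1),
    ("Person", r"(Mr.|Mrs.|Dr.) ([A-Z][a-z]+ [A-Z][a-z]+)", 2),
]


def annotate(text):
    """Return 'Type:Instance' strings for every pattern match in text."""
    found = []
    for type_name, regex, group in PATTERNS:
        for match in re.finditer(regex, text):
            found.append(f"{type_name}:{match.group(group)}")
    return found


if __name__ == "__main__":
    print(annotate("Mr. Michal Laclavik visited Apple, Inc. last week."))
```

Note that the dots in `Mr.`, `Mrs.` and `Dr.` are unescaped in the pattern as given, so they match any character rather than a literal full stop.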

[Class diagram: Ontea core and transform classes]

«interface» ontea.core.Pattern: annotate()
ontea.core.PatternRegExp: PatternRegExp() (three constructors), annotate(), getName(), getPattern(), getType(), getThreshold()
ontea.core.PatternSet: PatternSet(), getPatternSet()
ontea.core.Result: Result(), getIndividual(), setIndividual(), getPattern(), getType(), getRelevance(), setRelevance(), equals(), hashCode()
ontea.core.ResultRegExp: ResultRegExp(), getFoundText()
ontea.core.ResultOnto: ResultOnto() (two constructors), getURI(), getLocalName(), toString()
«interface» ontea.transform.ResultTransformer: transform() (two overloads)
ontea.transform.LuceneRelevance: LuceneRelevance(), transform() (two overloads)
ontea.transform.SesameIndividual...: SesameIndividualSearchAndCreate(), transform() (two overloads)
Association role shown in the diagram: +pattern


Ontea in Hadoop

• Map function - Pattern.annotate()
  – Input: lines of text
  – Output: key-value pairs, e.g.
    • file_name => Organization:Apple
    • Organization:Apple => address:Mountain View
• Map function - transformers
  – E.g. lemmatization transformer
  – Input: Settlement:Bratislave, Settlement:Bratislava
  – Output: Settlement:Bratislava
• Reduce function
  – Input: key-value pairs (objects and properties)
  – Output: as needed, e.g. objects and their relations to files, with properties (e.g. in RDF/OWL)
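The three steps on this slide can be sketched in plain Python, without Hadoop. All names here are hypothetical: `SETTLEMENT` is a toy pattern invented for the example (not one of Ontea's), `LEMMAS` is a one-entry lookup standing in for a real lemmatizer, and the reduce step simply groups distinct annotations per file instead of producing RDF/OWL.

```python
import re
from collections import defaultdict

# Hypothetical lemma lookup standing in for a real lemmatizer.
LEMMAS = {"Bratislave": "Bratislava"}

# Hypothetical pattern for illustration only; it is not one of Ontea's patterns.
SETTLEMENT = re.compile(r"\b(?:in|to) ([A-Z][a-z]+)")


def map_annotate(file_name, line):
    """Map step: emit (file_name, 'Settlement:<name>') for each match."""
    return [
        (file_name, f"Settlement:{m.group(1)}")
        for m in SETTLEMENT.finditer(line)
    ]


def map_transform(key, value):
    """Transformer step: replace the instance by its lemma when one is known."""
    type_name, instance = value.split(":", 1)
    return (key, f"{type_name}:{LEMMAS.get(instance, instance)}")


def reduce_collect(pairs):
    """Reduce step: group the distinct annotations found for each key."""
    grouped = defaultdict(set)
    for key, value in pairs:
        grouped[key].add(value)
    return {key: sorted(values) for key, values in grouped.items()}


if __name__ == "__main__":
    lines = [
        ("mail1.txt", "We meet in Bratislave on Monday."),
        ("mail1.txt", "I travel to Bratislava tomorrow."),
    ]
    mapped = [pair for name, line in lines for pair in map_annotate(name, line)]
    transformed = [map_transform(k, v) for k, v in mapped]
    print(reduce_collect(transformed))
```

Because the transformer maps both inflected forms to the same lemma, the reduce step sees a single annotation for the file, mirroring the Bratislave/Bratislava example on the slide.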


Results & Conclusion

• It works, it is portable, it is faster
• 12 times faster on 16 cores
• http://ontea.sourceforge.net/


Description                       Enron corpus (88MB)   Personal email (770MB)
Time on single machine            2min, 5sec            3hours, 37mins, 4sec
Time on 8-node Hadoop cluster     1min, 6sec            18mins, 4sec
Performance increased             1.9 times             12 times
Launched map tasks                45                    187
Launched reduce tasks             1                     1
Data-local map tasks              44                    186
Map input records                 2,205,910             10,656,904
Map output records                23,571                37,571
Map input bytes                   88,171,505            770,924,437
Map output bytes                  1,257,795             1,959,363
Combine input records             23,571                37,571
Combine output records            10,214                3,511
Reduce input groups               7,445                 861
Reduce input records              10,214                3,511
Reduce output records             7,445                 861
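The "Performance increased" row follows directly from the two timing rows. A short check, using only the times reported in the table (the helper names `to_seconds` and `speedup` are introduced here for illustration):

```python
def to_seconds(hours=0, minutes=0, seconds=0):
    """Convert a duration given as h/min/s into seconds."""
    return hours * 3600 + minutes * 60 + seconds


def speedup(single_machine_s, cluster_s):
    """Ratio of single-machine time to cluster time."""
    return single_machine_s / cluster_s


# Times taken from the results table.
enron = speedup(
    to_seconds(minutes=2, seconds=5),
    to_seconds(minutes=1, seconds=6),
)
personal = speedup(
    to_seconds(hours=3, minutes=37, seconds=4),
    to_seconds(minutes=18, seconds=4),
)

if __name__ == "__main__":
    print(f"Enron corpus: {enron:.1f}x")
    print(f"Personal email: {personal:.1f}x")
```

The much smaller gain on the 88MB Enron corpus, compared with the 770MB personal email set, is consistent with fixed job start-up costs taking up a large share of a run that lasts only about a minute.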