ICCS 2008, Cracow June 23-25, 2008 1
Towards Large Scale Semantic Annotation Built on MapReduce Architecture
Michal Laclavík, Martin Šeleng, Ladislav Hluchý
Institute of Informatics, Slovak Academy of Sciences in Bratislava
Motivation
• Semantic annotation, or tagging
– Delivers a formal understanding of text documents; one of the main focuses of the Semantic Web
– Documents on the Web or in the enterprise need to be understood by computers
– The goal is to understand both content and context
Semantic Annotation
• Similar to information extraction
• Finds metadata about entities, their properties, and their relations
• Ontologies
• Manual tools
• (Semi-)automatic tools
– Usually tested on only a few hundred documents
• Needs:
– To deliver applications on the Web or in enterprises, we need to annotate at large scale
– The Semantic Web can be exploited only if computer-understandable metadata reaches critical mass
• Examples:
– Geographical locations, people, organizations
MapReduce
• Google's approach to large-scale information processing
• Runs on commodity PCs
• The application developer needs to implement only the Map and Reduce methods
• Inputs and outputs are ordered key-value pairs
• Fault tolerant, easy to use, scalable to thousands of computers
• Hadoop
– Open-source implementation by Apache
– Yahoo! is using it on 10,000 cores in a production environment
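The programming model on this slide can be tried out without a cluster. The following is a minimal single-process sketch in plain Java, not Hadoop's API: the class `MiniMapReduce`, its `Mapper` and `Reducer` interfaces, and the word-count job are hypothetical names chosen for illustration. It only shows the division of labour, in which the developer supplies the map and reduce functions and the framework groups values by key in between.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiConsumer;

// Hypothetical single-process illustration of the map/shuffle/reduce model.
// This is NOT the Hadoop API; every name here is invented for the sketch.
class MiniMapReduce {

    /** User-supplied map step: one input record in, zero or more (key, value) pairs out. */
    interface Mapper {
        void map(String record, BiConsumer<String, Integer> emit);
    }

    /** User-supplied reduce step: one key plus all of its values in, one result out. */
    interface Reducer {
        int reduce(String key, List<Integer> values);
    }

    /** Runs map over every record, groups values by key, then reduces each group. */
    static Map<String, Integer> run(List<String> records, Mapper mapper, Reducer reducer) {
        // "Shuffle": a TreeMap keeps keys sorted, so groups are reduced in key order.
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (String record : records) {
            mapper.map(record,
                    (k, v) -> groups.computeIfAbsent(k, x -> new ArrayList<>()).add(v));
        }
        Map<String, Integer> output = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : groups.entrySet()) {
            output.put(e.getKey(), reducer.reduce(e.getKey(), e.getValue()));
        }
        return output;
    }

    /** Example job: count word occurrences across all input lines. */
    static Map<String, Integer> wordCount(List<String> lines) {
        Mapper mapper = (line, emit) -> {
            for (String word : line.toLowerCase().split("\\s+")) {
                if (!word.isEmpty()) {
                    emit.accept(word, 1);
                }
            }
        };
        Reducer reducer = (word, ones) -> ones.stream().mapToInt(Integer::intValue).sum();
        return run(lines, mapper, reducer);
    }

    public static void main(String[] args) {
        // Prints {map=3, reduce=2}
        System.out.println(wordCount(Arrays.asList("map reduce map", "reduce map")));
    }
}
```

In a real deployment the grouping step, task scheduling, and fault tolerance are handled by the framework across many machines; the sketch keeps everything in memory only to make the two user-written functions visible.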
Ontea: Pattern Based Annotation
• Information extraction and semantic annotation using patterns
• Finds objects and properties in text
• Results can be transformed to RDF/OWL
• Similar to C-PANKOW, KIM, or GATE
• A very simple solution, good for languages where advanced NLP is not available
• Applicable in enterprise applications
#  Text                     Instance                   Pattern (regular expression)
1  Apple, Inc.              Company: Apple             Company: ([A-Za-z0-9]+)[, ]+(Inc|Ltd)
2  Mountain View, CA 94043  Settlement: Mountain View  Settlement: ([A-Z][a-z]+[ ]*[A-Za-z]*)[ ]+[A-Z]{2}[ ]*[0-9]{5}
3  [email protected]        Email: [email protected]   Email: [-_.a-z0-9]+@[-_.a-zA-Z0-9]+\.[a-z]{2,8}
4  Mr. Michal Laclavik      Person: Michal Laclavik    Person: (Mr.|Mrs.|Dr.) ([A-Z][a-z]+ [A-Z][a-z]+)
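To see how such patterns turn text into typed instances, here is a small sketch that applies the four regular expressions with `java.util.regex`. It does not use Ontea's own classes: `PatternTableSketch`, `Rule`, and `annotate` are hypothetical names. Two assumptions are made. First, the settlement separator is written as `[, ]+` so that the comma after "View" in the sample text matches; the separator class is split across a line break in the source, so its exact content is uncertain. Second, `info@example.org` is a made-up address, because the sample email is obscured in the source.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: applies the table's regular expressions with java.util.regex.
// It is not Ontea's implementation.
class PatternTableSketch {

    /** One table row: entity type, regex, and the capturing group that holds the instance. */
    static final class Rule {
        final String type;
        final Pattern regex;
        final int group;

        Rule(String type, String regex, int group) {
            this.type = type;
            this.regex = Pattern.compile(regex);
            this.group = group;
        }
    }

    static final List<Rule> RULES = Arrays.asList(
            new Rule("Company", "([A-Za-z0-9]+)[, ]+(Inc|Ltd)", 1),
            // Assumption: separator written as [, ]+ so the comma after "View" matches.
            new Rule("Settlement", "([A-Z][a-z]+[ ]*[A-Za-z]*)[, ]+[A-Z]{2}[ ]*[0-9]{5}", 1),
            // Group 0 is the whole match; this pattern has no capturing group.
            new Rule("Email", "[-_.a-z0-9]+@[-_.a-zA-Z0-9]+\\.[a-z]{2,8}", 0),
            // The title is group 1, the name is group 2.
            new Rule("Person", "(Mr.|Mrs.|Dr.) ([A-Z][a-z]+ [A-Z][a-z]+)", 2));

    /** Returns "Type: instance" strings for every match of every rule, in rule order. */
    static List<String> annotate(String text) {
        List<String> found = new ArrayList<>();
        for (Rule rule : RULES) {
            Matcher m = rule.regex.matcher(text);
            while (m.find()) {
                found.add(rule.type + ": " + m.group(rule.group));
            }
        }
        return found;
    }

    public static void main(String[] args) {
        for (String sample : Arrays.asList("Apple, Inc.", "Mountain View, CA 94043",
                "info@example.org", "Mr. Michal Laclavik")) {
            System.out.println(sample + " -> " + annotate(sample));
        }
    }
}
```

Note that the dots in `(Mr.|Mrs.|Dr.)` are unescaped, as in the table, so they match any character; escaping them as `\.` would make the titles stricter.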
[Figure: UML class diagram of Ontea. Classes and methods shown:]
• «interface» ontea.core.Pattern
– annotate()
• ontea.core.PatternRegExp
– PatternRegExp() (listed three times), annotate(), getName(), getPattern(), getType(), getThreshold()
• ontea.core.PatternSet
– PatternSet(), getPatternSet()
• ontea.core.Result
– Result(), getIndividual(), setIndividual(), getPattern(), getType(), getRelevance(), setRelevance(), equals(), hashCode()
• ontea.core.ResultRegExp
– ResultRegExp(), getFoundText()
• ontea.core.ResultOnto
– ResultOnto() (listed twice), getURI(), getLocalName(), toString()
• «interface» ontea.transform.ResultTransformer
– transform() (listed twice)
• ontea.transform.LuceneRelevance
– LuceneRelevance(), transform() (listed twice)
• ontea.transform.SesameIndividual...
– SesameIndividualSearchAndCreate(), transform() (listed twice)
• Association label shown in the diagram: pattern
Ontea in Hadoop
• Map function: Pattern.annotate()
– Input: lines of text
– Output: key-value pairs, e.g.
    • file_name => organization:Apple
    • Organization:Apple => address:Mountain View
• Map function: transformers
– E.g. a lemmatization transformer
– Input: Settlement:Bratislave, Settlement:Bratislava
– Output: Settlement:Bratislava
• Reduce function
– Input: key-value pairs (objects and properties)
– Output: as needed, e.g. objects and their relations to files, with properties (e.g. in RDF/OWL)
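The three steps on this slide can be walked through in a single process. The sketch below uses neither Hadoop nor Ontea classes, and all of its names (`OnteaFlowSketch`, `annotate`, `lemmatize`, `run`) are hypothetical. The company regex comes from the pattern table; the settlement regex `in ([A-Z][a-z]+)` is a deliberately naive stand-in, and the lemmatizer is reduced to a one-entry lookup table. Both are assumptions made only so that the Bratislave/Bratislava example can be reproduced.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical single-process walk-through of the map -> transform -> reduce flow.
// All names are invented for illustration; this is not Ontea's Hadoop job.
class OnteaFlowSketch {

    static final Pattern COMPANY = Pattern.compile("([A-Za-z0-9]+)[, ]+(Inc|Ltd)");
    // Assumption: a naive settlement pattern, not one of Ontea's.
    static final Pattern SETTLEMENT = Pattern.compile("in ([A-Z][a-z]+)");
    // Assumption: a lookup table standing in for a real lemmatizer.
    static final Map<String, String> LEMMAS = new HashMap<>();
    static {
        LEMMAS.put("Bratislave", "Bratislava");
    }

    /** Map step 1: annotate one line of text, producing "Type:instance" values. */
    static List<String> annotate(String line) {
        List<String> out = new ArrayList<>();
        collect(COMPANY, "Organization", line, out);
        collect(SETTLEMENT, "Settlement", line, out);
        return out;
    }

    private static void collect(Pattern p, String type, String line, List<String> out) {
        Matcher m = p.matcher(line);
        while (m.find()) {
            out.add(type + ":" + m.group(1));
        }
    }

    /** Map step 2: transformer that rewrites an instance to its base form. */
    static String lemmatize(String annotation) {
        int colon = annotation.indexOf(':');
        String type = annotation.substring(0, colon);
        String instance = annotation.substring(colon + 1);
        return type + ":" + LEMMAS.getOrDefault(instance, instance);
    }

    /** Reduce step: collect the distinct annotations under each file name (the key). */
    static Map<String, Set<String>> run(String[][] fileLines) {
        Map<String, Set<String>> byFile = new TreeMap<>();
        for (String[] fl : fileLines) {          // fl[0] = file name, fl[1] = line of text
            for (String annotation : annotate(fl[1])) {
                byFile.computeIfAbsent(fl[0], k -> new TreeSet<>()).add(lemmatize(annotation));
            }
        }
        return byFile;
    }

    public static void main(String[] args) {
        String[][] input = {
            {"a.txt", "Apple, Inc. opened an office in Bratislave"},
            {"b.txt", "The office in Bratislava works with Apple Inc"}
        };
        System.out.println(run(input));
    }
}
```

After the transformer step both files carry the same `Settlement:Bratislava` value, which is what lets a later stage treat the two mentions as one object.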
Results & Conclusion
• It works, it is portable, it is faster
• 12 times faster on 16 cores
• http://ontea.sourceforge.net/
Description                      Enron corpus (88 MB)   Personal email (770 MB)
Time on single machine           2 min 5 s              3 h 37 min 4 s
Time on 8-node Hadoop cluster    1 min 6 s              18 min 4 s
Performance increase             1.9 times              12 times
Launched map tasks               45                     187
Launched reduce tasks            1                      1
Data-local map tasks             44                     186
Map input records                2,205,910              10,656,904
Map output records               23,571                 37,571
Map input bytes                  88,171,505             770,924,437
Map output bytes                 1,257,795              1,959,363
Combine input records            23,571                 37,571
Combine output records           10,214                 3,511
Reduce input groups              7,445                  861
Reduce input records             10,214                 3,511
Reduce output records            7,445                  861
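As a check on the "Performance increase" row, the speedups can be recomputed from the two timing rows by converting each time to seconds and dividing the single-machine time by the cluster time:

```latex
\begin{align*}
\text{Enron corpus:}\quad
  & \frac{2\cdot 60 + 5}{1\cdot 60 + 6} = \frac{125\ \text{s}}{66\ \text{s}} \approx 1.9 \\[4pt]
\text{Personal email:}\quad
  & \frac{3\cdot 3600 + 37\cdot 60 + 4}{18\cdot 60 + 4}
    = \frac{13024\ \text{s}}{1084\ \text{s}} \approx 12.0
\end{align*}
```

Both values agree with the figures reported in the table.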