Information processing Michal Laclavík, Ladislav Hluchý (Email research, information extraction,...
-
Upload
adam-lyons -
Category
Documents
-
view
221 -
download
1
Transcript of Information processing Michal Laclavík, Ladislav Hluchý (Email research, information extraction,...
Information processingInformation processing
Michal Laclavík, Ladislav Hluchý
(Email research, information extraction, information retrieval, contextual recommendation)
Bratislava, 10th November 2011 2
Primary Research Team & CapabilitiesPrimary Research Team & Capabilities
Dept. of Parallel and Distributed ComputingResearch and Development Areas:
– Large-scale HPCN, Grid and MapReduce applications– Intelligent and Knowledge oriented Technologies
Experience from IST:– 3 project in FP5: ANFAS, CrosGRID, Pellucid– 6 project in FP6: EGEE II, K-Wf Grid, DEGREE
(coordinator), EGEE, int.eu.grid, MEDIGRID– 4 projects in FP7: Commius, Admire, Secricom, EGEE III
Several National Projects (SPVV, VEGA, APVT)IKT Group Focus:
– Information Processing (Large Scale)– Graph Processing – Information Extraction and Retrieval– Semantic Web– Knowledge oriented Technologies– Parallel and Distributed Information Processing
Solutions:– SGDB: Simple Graph Database– gSemSearch: Graph based Semantic Search– Ontea: Pattern-based Semantic Annotation– ACoMA: KM tool in Email– EMBET: Recommendation System– Experts on MapReduce and IR (Nutch, Solr, Lucene)
Director & leader of PDC: Dr. Dipl. Ing. Ladislav Hluchý
URL: http://ikt.ui.sav.sk
Approach and SolutionsApproach and Solutions
Large scale Text and Graph data processingLarge scale Text and Graph data processing
Core Technology• Web crawling
– Nutch + plugins
• Full text indexing and search– lucene, Sorl
• Information Extraction– Ontea, GATE
• All above large scale– Hadoop, S4
• Graph processing and Querying– Simple Graph Database (SGDB)
– gSemSearch
– Neo4j
– Blueprints
Bratislava, 10th November 2011 4
Underlined are the technologies developed by IISAS
Ontea: Information Extraction ToolOntea: Information Extraction Tool
Regex patternsGazetteersResuls
Key-value pairs Structured into trees graphs
Transformers, ConfigurationAutomatic loading of extractors
Visual Annotation Tool Integration with external tools
GATE, Stemers, Hadoop …Multilingual tests
English, Slovak, Spanish, Italian
Bratislava, 10th November 2011 5
http://ontea.sf.net
• Use of Social Network from email• Includes extracted objects• Full text of extracted objects• Related objects discovered and
ordered by spread activation on social network graph
• Faceted search, navigation
Email Search PrototypeEmail Search Prototype
Bratislava, 10th November 2011 6
gSemSearch: Graph based Semantic SearchgSemSearch: Graph based Semantic Search
• Graph/Network of interacting (interconnected) entities• Discovering relation in the Graph (network) using spread of activation algorithm• Showing relations of concrete type, e.g. telephone numbers related to a person• Navigation over related entities• Full-text search of the entities• User interface for search• User interaction with data (merging,
deleting entities) with immediate impact on discovered relations
• Tested on Email Enron Corpus– Email Social Network Search– http://ikt.ui.sav.sk/esns/
Bratislava, 10th November 2011 7
SGDB: Simple Graph DatabaseSGDB: Simple Graph Database
• Storage for graphs• Optimized for graph traversing and spread of activation• Faster then Neo4j for graph traversing operations• Supports Blueprints API• https://simplegdb.svn.sourceforge.net/svnroot/simplegdb/Sgdb3
• Graph Database Benchmarks– Graph Traversal Benchmark for Graph Databases
– http://ups.savba.sk/~marek/gbench.html
– Blueprints API - possibility to test compliant Graph databases
Bratislava, 10th November 2011 8
Future Direction: Relations Discovery in Large Graph DataFuture Direction: Relations Discovery in Large Graph Data
• Motivation– Graph/Network data are everywhere: social networks, web, LinkedData,
transactions, communication (email, phone). – Also text can be converted to graph. – Interconnecting graph data and searching for relations is crucial.
• Approach– Forming semantic trees and graphs from text, web, communication, databases
and LinkedData– User interaction with graph data in order to achieve integration and data
cleansing– Users will do it, if user effort have immediate impact on search results
Bratislava, 10th November 2011 9
Ontea: Pattern based information extraction and Ontea: Pattern based information extraction and semantic annotationsemantic annotation
Text processing
Ontea: Information Extraction (Features)Ontea: Information Extraction (Features)
Regex patternsVisual Annotation Tool Integration with external tools
GATE, Stemers, Hadoop …Gazetteers IE System configurationAutomatic loading of extractorsPatternsMultilingual tests
Spanish Slovak English Italian
Bratislava, 10th November 2011 11
Information Extraction ModelInformation Extraction Model
Address and product patternsAddress and product patterns
ExtractionExtraction
ProcessingProcessing
3 words macro3 words macro
ZIP macroZIP macro
Street number macroStreet number macro
Street name macroStreet name macro
City name macroCity name macro
Country macroCountry macro
Address patternsAddress patterns
Bratislava, 10th November 2011 12
SegmentationSegmentation
• Sentences • Paragraphs• Objects (Address, Product ..)
Bratislava, 10th November 2011 13
GazetteerCan extract information, which
cannot be properly extracted by regular expression patterns (like given names, product names, etc.)
Gazetteer extraction approach is combined with regular expressions based extrac-tion. For example personal full names can be extracted with higher precision.
Gazetteer is easy to update, because it is configured by simple text files.
Information Extraction: Gazetteers configurationInformation Extraction: Gazetteers configuration
Bratislava, 10th November 2011 14
Gazetteer listssimple text files with keywords
Gazetteer configurationsimple text file with<list file>:<IE result type>
Information extractor rules
Information Extraction: Rules configurationInformation Extraction: Rules configuration
IE System configuration– IE dynamically loads and run its
components (XMLRegexExtractor, Gazetteer, RuleTransformer) according to setting in IE rules file
– IE Components are executing consecutively and operate on a set of information extraction results
Bratislava, 10th November 2011 15
Information extractor rules file
IE result setModified
IE result setIE component
Regex basedIE component
GazetteerIE component
Result set transformerIE component
Semantic AnnotationSemantic Annotation
Bratislava, 10th November 2011 16
TheThe conceptconcept InformationExtractor - IEInformationExtractor - IE produces produces a set of extraction resultsa set of extraction results
SemanticAnnotator - SASemanticAnnotator - SA consumes consumes the IE result set and builds a trees the IE result set and builds a trees convertible to Ontology instances or convertible to Ontology instances or objects according to XML schema e.g. objects according to XML schema e.g. Core ComponentsCore Components
SA first builds an intermediate tree of IE SA first builds an intermediate tree of IE results on which it operatesresults on which it operates
The tree is upon its creation not compliant The tree is upon its creation not compliant to Core Components specification and to Core Components specification and needs to be transformedneeds to be transformed
Therefore we have Therefore we have tree transformerstree transformers which transform the IE result tree to a treeswhich transform the IE result tree to a trees
Semantic AnnotationSemantic Annotation
• Tree transformers– Input is a tree of IE results and output is the modified tree of IE results
– Tree transformers are executing consecutively and operate on a tree of information extraction results
– Tree transformers, which delete, create,rename, move, switch and order nodesare configured in the SA rules file
Bratislava, 10th November 2011 17
Treetransformer
Social NetworksSocial Networks
Social nework reconstruction:probabilistic inference using spreading
activationrelies on the output of the information
extractor (IE) in the form of complex objects
Bratislava, 10th November 2011 18
Preliminary results on a set of Preliminary results on a set of 50 Spanish emails (phone/name):50 Spanish emails (phone/name):Precision 60% Precision 60% (due to lower recall in IE)(due to lower recall in IE)Precision 85% Precision 85% (achievable with better IE)(achievable with better IE)self-healing self-healing (with new incoming emails)(with new incoming emails)
Social NetworksSocial Networks
Bratislava, 10th November 2011 19
Results as XML or HTML: Results as XML or HTML: (via XSL Transformations)(via XSL Transformations)
Future:Future:
DataSource for Search DataSource for Search for Partner modulefor Partner module
Improve the recall of Improve the recall of Information ExtractorInformation Extractor
Exploit multi-pass algorithm and named entity recognition: things Exploit multi-pass algorithm and named entity recognition: things learned in the first pass will be used in the next, e.g. possible names learned in the first pass will be used in the next, e.g. possible names with initials, etc.with initials, etc.
Build an enhanced statistical reasoning procedure on top of the Build an enhanced statistical reasoning procedure on top of the present Social Network Extractor/Correlatorpresent Social Network Extractor/Correlator
Email ResearchEmail Research
Acoma
Bratislava, 10th November 2011 21
Acoma ArchitectureAcoma Architecture
• Connected to email protocols on desktop or server• No need to change working practices
– Emails are received and send as before
• Received email is processed by Acoma and enriched with useful information
• Extensible with OSGi modulesS
erverD
esktop Mail Client
Browser
Mail Server
POP3IMAP
Acoma
Se
rve
rD
es
kto
p Mail Client
Browser
Mail ServerSMTP
Acoma
Information Processing and Extraction
Mail Server
Modified
Co
nn
ector to
Em
ail Infrastru
cture
System Connectors
Hint Recomendation
Mo
du
le 1
Mo
du
le 2
Mo
du
le n
Mail Client
Browser
Bratislava, 10th November 2011 22
System ConnectorsSystem Connectors
• Connection of Acoma to existing systems– Document Archives– Internet or Intranet Systems– Databases
• Access or import of data • Key-value pair transformation
Meta-Connector
Web Connector
SpreadSheet Connector
Database Connector
Internet
Key-value
TransformedKey-value
Bratislava, 10th November 2011 23
Acoma architecture : Message Post ProcessingAcoma architecture : Message Post Processing
• Useful hints with links are included in enriched email
• Links lead to internal or external systems (Internet, Intranet)
Bratislava, 10th November 2011 24
Business objects in EmailsBusiness objects in Emails
• Study on 6 organizations show:– Objects can be identified by patterns and gazeteers– It is possible to define set of common objects
• Objects identified:– Organization:
• org:Name, org:RegNo, org:TaxNo– Person:
• person:Name, person:Function– Contact:
• contact:Phone, contact:Email, contact:Webpage– Address:
• address:ZIP, address:Street, address:Settlement– Product:
• product:Name, product:Module, product:Component, product:BOID– Document:
• doc:Invoice, doc:Order, doc:Contract, doc:ChangeRequest– Inventory:
• inventory:ResID, inventory:ResType– Other business object
• ID: BOID
Social Networks and Graph DataSocial Networks and Graph Data
• Relations among objects• Support for search
Bratislava, 10th November 2011 25
• Use of Social Network from email• Includes extracted objects• Full text of extracted objects• Related objects discovered and
ordered by spread activation on social network graph
• Faceted search, navigation
Email Search PrototypeEmail Search Prototype
Bratislava, 10th November 2011 26
Context based Recommendation, Knowledge SharingContext based Recommendation, Knowledge Sharing
EMBET, Acoma
28
Objective: Recommend and provide user information or knowledge in context
EMBET: proactive information and knowledge provisionEMBET: proactive information and knowledge provision
• Collaboration among users• Knowledge sharing• Active knowledge provision• Reuse of knowledge: notes and other
resources
http://ups.savba.sk/kwfgrid/uaa/http://ups.savba.sk/kwfgrid/uaa/Bratislava, 10th November 2011
29
EMBET: AchievementsEMBET: Achievements
• Software with following functionality
– User Problem description– Displaying Knowledge– Adding Knowledge – Knowledge Reuse– Permanent Notes
Storage– Voting on Notes
• EMBET architecture: Core, GUI
• Context detection
• Context Matching to display information & knowledge
• Plain text analysis using Advanced Semantic Annotation Algorithms – OnTeA
• Theory of different context matching algorithms
Bratislava, 10th November 2011
Bratislava, 10th November 2011 30
Acoma: Hint RecommendationAcoma: Hint Recommendation
Information Retrieval and Information ExtractionInformation Retrieval and Information Extraction
lectures
IR LecturesIR Lectures
• Introduction to Information Retrieval• Text Operations, Text Analysis, stemming• Crawling, link processing• IR Models, Indexing techniques• IR Software libraries and systems• Ranking by Graph Algorithms (PageRank, HITS, …) and Searching• Information Extraction• Regular Expressions• Large Scale Data Processing on MapReduce Architecture• Multimedia Information Retrieval• Evaluation Techniques, Precision, Recall• Google• Semantics and IR, Semantic Web Standards
32Bratislava, 10th November 2011
Lectures conditionsLectures conditions
• Every students gets project focused on – Crawling– Indexing– Ranking– Information Extraction– Large Scale information Processing
• They have to consult project 3 times during semester
• Availability of data from day one• Lectures are available at:
– http://vi.ikt.ui.sav.sk/Témy_prednášok
33
Spracovanie odkazov
Indexovač
Usporiadanie
Vyhľadávač
Bázadokumentov
Odkazy
Index dokumentov
Sťahovač
Textové operácie
Otázka
Užívateľ
Zoznam dokumentov
Internet
Bratislava, 10th November 2011