© Brigitte Jörg iConnectEU Workshop – October 16th, 2008 Brussels Project Results Knowledge Base...
Project Results
© Brigitte Jörg iConnectEU Workshop – October 16th, 2008 Brussels
Knowledge Base for RTD Competencies in IST
– Results from a European SSA Project –
Brigitte Jörg
German Research Center for Artificial Intelligence
Language Technology Lab, Saarbrücken, Germany
Introduction of Speaker
Brigitte Jörg
M.A. Information Science
Information Systems, Business Administration
Project Manager, Researcher: DFKI GmbH, Language Technology Lab, Saarbrücken, Germany
CERIF TG Leader, Board Member: euroCRIS
Contact: brigitte.joerg @ dfki.de
http://www.dfki.de/~brigitte/
Presentation Outline
Introduction of the Project
Information Repository
Data Collection / Data Integration / Data Cleaning
Analytic Tools
Evaluation and Results
Conclusion / Beyond the Project
Project Information
Funding Organization: European Commission
Funding Program: Sixth Framework Programme (FP6: IST (3rd Call))
Project Type: Specific Support Action (SSA)
Duration: 32 Months (April 2005 – November 2007)
Project Co-ordination: DFKI GmbH
Technical Co-ordination: Jozef Stefan Institute (IJS)
Technology Partners: DFKI, IJS, Ontotext, STFC
Project Consortium: 15 partners from EU MS, NMS and ACC
Project Consortium
Deutsches Forschungszentrum für Künstliche Intelligenz, Germany
Institute Jozef Stefan, Slovenia
Ontotext Lab, Sirma AI EAD, Bulgaria
RTD Talos, Cyprus
Institute of Information Theory and Automation, Czech Republic
Archimedes Foundation, Estonia
Comp. and Autom. Research Inst., Hung. Academy of Sc., Hungary
Institute of Mathematics and Computer Science, University of Latvia
Lithuanian Innovation Centre, Lithuania
Projects in Motion, Malta
Technical University of Silesia, Poland
National Institute for R&D in Informatics, Romania
Slovak University of Technology, Slovakia
TUBITAK, Turkey
The Science and Technology Facilities Council, UK (formerly CCLRC, UK)
Technology Partners
DFKI (Co-ordinator): "LT World" Portal, Information Extraction, Semantic Web
Jozef Stefan Institute (Technical Co-ordinator): "Project Intelligence", Data Mining, Social Network Analysis
Ontotext: "KIM Semantic Annotation Platform"
euroCRIS: "CERIF" Standard, Access to Data
Project Objectives
Set up and populate an information portal on IST research
Provide information about RTD actors and their expertise
Provide innovative and automated services
To promote RTD competencies in specific fields
To support partner search for IST proposals and commercial projects
Repository Features
Information Repository Entities (based on the CERIF 2004 Standard*):
  Organisations
  Persons
  Projects
  Publications
Data Collection: Import (based on CERIF XML) from
  National CRISs (Current Research Information Systems)
  National Collections (no system behind)
  Web Crawlings
  Community Support
* CERIF: Common European Research Information Format, http://www.euroCRIS.org/
Repository Challenges
Data Integration from Heterogeneous Sources
  CERIF-based databases (MSSQL Server; MS Access; EPSRC database)
  MSWord documents; MSExcel documents
  Raw Text files; HTML files; XML files
  Data crawled from the Web; from CERIF-based CRISs; from public CRISs
Data Integration into ONE single dataset to enable Analysis at European Level
Overall Data Cleaning with Supervised Machine Learning Methods (Active Learning)
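To illustrate what active learning means for duplicate cleaning, here is a minimal sketch, not the project's actual implementation. It assumes a single string-similarity score per record pair and a simple threshold model. The function names, the record pairs, and the `ask_human` stand-in for the annotator are all hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def refit_threshold(labeled, default):
    """Place the threshold midway between labelled matches and non-matches."""
    match_scores = [score for score, is_match in labeled if is_match]
    other_scores = [score for score, is_match in labeled if not is_match]
    if not match_scores or not other_scores:
        return default  # not enough evidence yet, keep the current threshold
    return (min(match_scores) + max(other_scores)) / 2


def active_learning(pairs, ask_human, rounds, threshold=0.5):
    """Repeatedly query the pair the current model is least sure about."""
    unlabeled = list(pairs)
    labeled = []
    for _ in range(min(rounds, len(unlabeled))):
        # Uncertainty sampling: the score closest to the threshold is the hardest call.
        pair = min(unlabeled, key=lambda p: abs(similarity(*p) - threshold))
        unlabeled.remove(pair)
        labeled.append((similarity(*pair), ask_human(pair)))
        threshold = refit_threshold(labeled, threshold)
    return threshold, labeled


# Hypothetical record pairs and a stand-in for the human annotator.
candidate_pairs = [
    ("Example Research Centre", "Example Research Center"),
    ("Example Research Centre", "Sample University"),
    ("Sample University", "Sample Univ."),
    ("Institute of Examples", "Sample University"),
]
true_matches = {candidate_pairs[0], candidate_pairs[2]}
threshold, labeled = active_learning(
    candidate_pairs, ask_human=lambda pair: pair in true_matches, rounds=3
)
print(f"learned threshold: {threshold:.2f} after {len(labeled)} labels")
```

The point of the loop is that the human only labels the pairs that are hardest for the current model, which keeps the manual effort small on large datasets.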
European Research Dataset (entries)
European Research: 55078 Orgs, 30489 Proj, 58164 Exp, 165795 Pubs
Bulgaria: 794 Orgs, 73 Proj, 10940 Exp, 19023 Pubs
Cyprus: 29 Orgs
Czech Republic: 183 Orgs, 163 Proj, 164 Exp
Estonia: 75 Orgs, 1256 Proj, 6726 Exp, 51376 Pubs
Hungary: 2665 Orgs, 1297 Proj, 2425 Exp
Latvia: 106 Orgs, 830 Proj, 701 Exp
Lithuania: 102 Orgs
Malta: 58 Orgs, 27 Proj, 898 Exp, 180 Pubs
Poland: 1451 Orgs, 2179 Proj, 7392 Exp, 16086 Pubs
Romania: 169 Orgs, 68 Proj, 87 Exp
Serbia: 60 Orgs, 2278 Exp, 79130 Pubs
Slovenia: 1723 Orgs, 3748 Proj, 11655 Exp
Slovakia: 56 Orgs, 432 Proj, 683 Exp
Turkey: 285 Orgs
EPRI-start: 286 Orgs, 275 Exp
Cordis FP5+FP6: 48988 Orgs, 20436 Proj, 13941 Exp
Community: 61 Orgs, 41 Proj, 435 Exp
January 2008
Collection Method Analysis
From National CRISs / Collections:
  complete & comprehensive
  often bi-lingual
  quickly and easily exported / transformed into CERIF XML
  mostly technical contact/expertise available
Crawled from public CRISs / CERIF-based CRISs:
  complete as far as publicly available
  needs data transformation / re-structuring efforts into CERIF XML
  technical expertise not related to domain knowledge
  depends on static website structures
Crawled from the Web (Google Scholar Publication Data):
  not usable for quality analysis
Community Contributions:
  a lot of interest
  entries incomplete, only basic personal data, not many relations
Repository Analysis before Data Integration
Analysis of Obvious Errors:
  Duplicate records inherent in single datasets
  Even more duplicate records after merging of datasets
  Most obvious duplicates for organisations and persons
  no significant number of duplicate projects
  publications have been ignored
Duplicate records are a known problem!
Repository Analysis after Data Integration
Heuristic Analysis of Random Samples in National Datasets / Cordis Datasets
  most obvious duplicates found inside Cordis FP5 and FP6 datasets and across Cordis FP5 and FP6 datasets (the largest sets!)
  not so many duplicates found in national datasets
  a lot of duplicate person records across all datasets
  no duplicate records found in project datasets
  only some duplicate records across project datasets
  publications have not been examined
Decision with Respect to the IST World Scope
  do not touch project records
  ignore publication records
  let the community resolve person records (IST World Community)
  concentrate on cleaning organisation records
Data Cleaning Application
Problems with Organisation Records
Most entries had slightly different names caused by additional special characters or character modifications:
  Capitalization, Lowercase Letters
  Blanks, extra Spaces
  Hyphens
  Quotes
  Comma in Different Places
  Article in Name
  Full stop in Name
  Incomplete Names
  English Translation
  Word Order
  Language Specific Characters (Jorg instead of Jörg)
  Special Characters (wrong encoding, e.g. &, ?)
Mixture of Organisation Names and Department Names
Differences in Addresses
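Several of the name variations listed for organisation records (capitalization, extra blanks, hyphens, quotes, commas, full stops, leading articles, language-specific characters) can be neutralised by normalising names before they are compared. The following is a minimal sketch of such a step, not the cleaning procedure used in the project. The function name and the article list are hypothetical.

```python
import re
import unicodedata

# Hypothetical list of leading articles to drop; extend per language as needed.
LEADING_ARTICLES = ("the ",)


def normalize_org_name(name: str) -> str:
    """Reduce an organisation name to a comparison key (illustrative only)."""
    # Decompose accented letters and drop the combining marks (ö -> o).
    text = unicodedata.normalize("NFKD", name)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.lower()
    # Replace hyphens, quotes, commas, full stops and other punctuation with blanks.
    text = re.sub(r"[\W_]+", " ", text)
    # Collapse runs of blanks and trim both ends.
    text = " ".join(text.split())
    for article in LEADING_ARTICLES:
        if text.startswith(article):
            text = text[len(article):]
    return text


key_a = normalize_org_name('The  "Jozef-Stefan" Institute.')
key_b = normalize_org_name("jozef stefan institute")
print(key_a == key_b)
```

Incomplete names, translated names, changed word order, and names mixed with department names are not addressed by this kind of normalisation and would still need a similarity measure or human judgement.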
Active Learning Application
Evaluation of Automated Matching Results in the CORDIS FP6 dataset
Human evaluation of 1000 organisation record pairs:
  30 Matches correct
  934 Non-Matches correct
  1 Match incorrect
  35 Non-Matches incorrect
Integration approach worked well; can be used for large-scale integration tasks
Result: semi-automated identification of 4000 duplicates with high accuracy and reasonable recall
  97% precision
  46% recall
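The reported precision and recall can be reproduced from the four evaluation counts. This is a short worked calculation. It assumes that "Non-Matches incorrect" means pairs that were predicted as non-matches but were judged to be matches (false negatives). The variable names are illustrative.

```python
# Counts from the human evaluation of 1000 organisation record pairs.
true_pos = 30    # matches correct
true_neg = 934   # non-matches correct
false_pos = 1    # match incorrect
false_neg = 35   # non-matches incorrect (assumed: missed matches)

total = true_pos + true_neg + false_pos + false_neg
precision = true_pos / (true_pos + false_pos)   # 30 / 31
recall = true_pos / (true_pos + false_neg)      # 30 / 65
accuracy = (true_pos + true_neg) / total        # 964 / 1000

print(f"pairs={total} precision={precision:.0%} recall={recall:.0%} accuracy={accuracy:.1%}")
```

Under this reading, almost every pair flagged as a duplicate was a true duplicate, while roughly half of the true duplicates in the sample were not found.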
Analytic Tools
publicly available at: http://www.ist-world.org/
Advanced Tools
  Competence Diagram
  Collaboration Diagram
Experimental Tools
  Collaboration Trends
  Competence Trends
  Consortia Prediction
  Semantic Search
How to Analyze or Generate a Diagram
(1) define a query in the IST World Portal
(2) get a list of result records matching the query
(3) generate diagrams based on results
Competence Diagram
Query: IST SSA projects within FP6
Aim: investigate the thematic range of SSA projects in FP6
Thematic Areas (Blue Clouds): SEMANTIC, HEALTH, LEGAL, CHANGING, ROADMAP, SOFTWARE
Projects (Red Dots): Linked with Full Record in Repository
Competence Diagram
Query: IST SSA projects within FP6
Aim: investigate the thematic range of SSA projects in FP6
Goals (List of Keywords): DEMENTIA, PEOPLE, MEDICAL, STANDARDS, …
Configuration of Result Space: 40% of result list, 30 topics
Competence Diagram
Query: IST SSA projects within FP6
Aim: investigate the thematic range of SSA projects in FP6
Goals
Configuration of Result Space: 40% of result list, 30 topics
Themes
Collaboration Diagram
Query: IST SSA projects within FP6
Aim: investigate the collaboration of SSA partners in FP6
Number of joint partners
Configuration of Result Space: 20% of result list
Project
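One way to read the "Number of joint partners" label in the collaboration diagram is as an edge weight between two projects. The sketch below computes such weights from a query result. It is an assumption about how the diagram could be derived, not a description of the IST World implementation. The project and organisation names are made up.

```python
from itertools import combinations

# Hypothetical query result: project -> set of partner organisations.
projects = {
    "P1": {"OrgA", "OrgB", "OrgC"},
    "P2": {"OrgB", "OrgC", "OrgD"},
    "P3": {"OrgD", "OrgE"},
    "P4": {"OrgF"},
}


def joint_partner_edges(projects):
    """Return {(project, project): number of shared partners} for linked pairs."""
    edges = {}
    for left, right in combinations(sorted(projects), 2):
        shared = projects[left] & projects[right]
        if shared:  # projects without a common partner stay unconnected
            edges[(left, right)] = len(shared)
    return edges


edges = joint_partner_edges(projects)
for (left, right), weight in edges.items():
    print(f"{left} -- {right}: {weight} joint partner(s)")
```

Note that any such count depends on organisation records having been deduplicated first, since two spellings of the same partner would otherwise not be recognised as shared.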
Evaluation of Analytic Tools
… are very powerful
… are themselves a powerful means of dissemination
… strongly depend on the data behind them
More evaluation details and results can be found in the CRIS 2008 Proceedings at http://www.eurocris.org
Overall Conclusion
Data Collection:
  Data should be updated at their origin to avoid repetition of data cleaning with updates
    independent of their collection method
    updates have to happen in the processes
    needs backwards-communication with data providers
  CRISs support systematic data collections and updates
  A Lingua Franca for communication and interchange between systems is needed for
    large-scale integration
    large-scale analyses across single sets
  CERIF was crucial for IST World
  Crawlings/CRISs do not easily distinguish between topics (IST only)
  Web Crawlings (GoogleScholar) considerably lacked quality
Automated Data Integration:
  Semi-automatically learned models can be re-used with new data
Overall Conclusion
Evaluation of large Datasets:
  very difficult
  needs expert knowledge
Analytics and Tools:
  depend heavily on quality data
  are very powerful for investigation of large datasets
  are much appreciated by the community (many registered users)
Common Interest: Very High!
  even from outside the project: Hungary, Serbia, Croatia, Russia, …
  epriStart project
Needs professional Authority: legalization; not available within the scope of a project
Beyond the Project
IST World is public: http://www.ist-world.org/
Registration is free
Create your own Profile, Competence Map, Collaboration Map
FP7 Data are currently being prepared
Continuation is planned …