LattesMiner: a Multilingual DSL for Information Extraction from Lattes Platform
![Page 1: LattesMiner: a Multilingual DSL for Information Extraction from Lattes Platform](https://reader036.fdocuments.net/reader036/viewer/2022062310/568160c8550346895dcff749/html5/thumbnails/1.jpg)
Alexandre Donizeti Alves
Horacio Hideki Yanasse
Nei Yoshihiro Soma
October 24, 2011
11th Workshop on Domain-Specific Modeling
![Page 2: LattesMiner: a Multilingual DSL for Information Extraction from Lattes Platform](https://reader036.fdocuments.net/reader036/viewer/2022062310/568160c8550346895dcff749/html5/thumbnails/2.jpg)
Introduction
The Lattes Platform is an information system deployed
by CNPq (National Council for Scientific and
Technological Development) to manage
information on science, technology and innovation
related to researchers and institutions in Brazil
This platform is the main source of
information available on Brazilian researchers
![Page 3: LattesMiner: a Multilingual DSL for Information Extraction from Lattes Platform](https://reader036.fdocuments.net/reader036/viewer/2022062310/568160c8550346895dcff749/html5/thumbnails/3.jpg)
Introduction: Lattes Platform
http://lattes.cnpq.br
![Page 4: LattesMiner: a Multilingual DSL for Information Extraction from Lattes Platform](https://reader036.fdocuments.net/reader036/viewer/2022062310/568160c8550346895dcff749/html5/thumbnails/4.jpg)
Introduction
The Lattes CV system, a curricular information
system, is the main component of the platform
Currently, the Lattes CV system stores around
2,000,000 curricula of researchers, lecturers,
students and professionals from diverse areas of
knowledge
![Page 5: LattesMiner: a Multilingual DSL for Information Extraction from Lattes Platform](https://reader036.fdocuments.net/reader036/viewer/2022062310/568160c8550346895dcff749/html5/thumbnails/5.jpg)
Introduction: Lattes CV system
http://buscatextual.cnpq.br/buscatextual
Jorge Almeida Guimaraes
![Page 6: LattesMiner: a Multilingual DSL for Information Extraction from Lattes Platform](https://reader036.fdocuments.net/reader036/viewer/2022062310/568160c8550346895dcff749/html5/thumbnails/6.jpg)
Introduction: Lattes curriculum (English)
![Page 7: LattesMiner: a Multilingual DSL for Information Extraction from Lattes Platform](https://reader036.fdocuments.net/reader036/viewer/2022062310/568160c8550346895dcff749/html5/thumbnails/7.jpg)
Introduction: Lattes curriculum (English)
![Page 8: LattesMiner: a Multilingual DSL for Information Extraction from Lattes Platform](https://reader036.fdocuments.net/reader036/viewer/2022062310/568160c8550346895dcff749/html5/thumbnails/8.jpg)
Introduction: Lattes curriculum (Portuguese)
![Page 9: LattesMiner: a Multilingual DSL for Information Extraction from Lattes Platform](https://reader036.fdocuments.net/reader036/viewer/2022062310/568160c8550346895dcff749/html5/thumbnails/9.jpg)
Introduction
In recent years, many studies have used data
extracted from the Lattes Platform about
researchers from different areas of knowledge
A common problem in these studies is
that the curricula and the extracted information
had to be obtained manually
![Page 10: LattesMiner: a Multilingual DSL for Information Extraction from Lattes Platform](https://reader036.fdocuments.net/reader036/viewer/2022062310/568160c8550346895dcff749/html5/thumbnails/10.jpg)
Introduction
Therefore, this system has great potential as a
source for high-quality information extraction
![Page 11: LattesMiner: a Multilingual DSL for Information Extraction from Lattes Platform](https://reader036.fdocuments.net/reader036/viewer/2022062310/568160c8550346895dcff749/html5/thumbnails/11.jpg)
LattesMiner
LattesMiner is an internal multilingual DSL for automatic
information extraction from Lattes curricula
It consists of a set of classes written in Java that
allow developers to implement their own applications
at a high level of abstraction and with expressive power
![Page 12: LattesMiner: a Multilingual DSL for Information Extraction from Lattes Platform](https://reader036.fdocuments.net/reader036/viewer/2022062310/568160c8550346895dcff749/html5/thumbnails/12.jpg)
LattesMiner
Data Discovery is used to find the ID numbers of the researchers.
Usually, only the name of the researcher is available.
Data Acquisition is responsible for downloading the Lattes curricula
of the researchers from the Lattes CV system on the Web.
Data Extraction is the main component of LattesMiner. It is
responsible for extracting data from the HTML files, using
information extraction based on regular expressions.
The extracted data can be stored in XML files or in any database
using the Data Structure component.
The Data Visualization component is responsible for the identification
and visualization of academic social networks. These networks are
identified by checking the relationships between researchers.
The Data Analysis component is responsible for analyzing the
extracted data and the relationships identified.
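The regular-expression technique mentioned for Data Extraction can be illustrated with a minimal sketch. It is not the LattesMiner implementation: the class name RegexExtractionSketch, the method extractTitles and the HTML markup (a span with class "article-title") are all hypothetical, and real Lattes pages use different markup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of regex-based extraction from an HTML fragment.
// The markup below is invented for illustration; real Lattes pages differ.
class RegexExtractionSketch {

    // Captures the text inside <span class="article-title">...</span>.
    // The reluctant quantifier (.*?) stops at the first closing tag.
    private static final Pattern TITLE =
        Pattern.compile("<span class=\"article-title\">(.*?)</span>", Pattern.DOTALL);

    static List<String> extractTitles(String html) {
        List<String> titles = new ArrayList<String>();
        Matcher m = TITLE.matcher(html);
        while (m.find()) {
            // Group 1 holds the captured title; trim surrounding whitespace.
            titles.add(m.group(1).trim());
        }
        return titles;
    }

    public static void main(String[] args) {
        String html = "<div><span class=\"article-title\"> First title </span>"
                    + "<span class=\"article-title\">Second title</span></div>";
        for (String title : extractTitles(html)) {
            System.out.println(title);
        }
    }
}
```

Regular expressions of this kind are tied to the page markup, so a real extractor would need one pattern per curriculum section and would have to be updated when the markup changes.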
![Page 13: LattesMiner: a Multilingual DSL for Information Extraction from Lattes Platform](https://reader036.fdocuments.net/reader036/viewer/2022062310/568160c8550346895dcff749/html5/thumbnails/13.jpg)
LattesMiner
Class diagram (packages and their classes):
lattes.miner: LattesMiner
lattes.miner.en: Biodata, Board
lattes.miner.br: Perfil, Banca
lattes.miner.ie: BiodataIE, BoardIE
lattes.miner.dao: BiodataDao, BoardDao
The LattesMiner class is composed of instances of the classes Biodata
and Board, in addition to many others not presented here.
![Page 14: LattesMiner: a Multilingual DSL for Information Extraction from Lattes Platform](https://reader036.fdocuments.net/reader036/viewer/2022062310/568160c8550346895dcff749/html5/thumbnails/14.jpg)
LattesMiner
LattesMiner is built as a fluent interface, which
provides a compact yet easy-to-read representation of
the domain problem
Fluent interfaces are implemented using method
chaining
LattesMiner makes use of static factory methods and
static imports
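As a rough illustration of how method chaining and a static factory method combine into a fluent interface, consider the sketch below. The class CurriculumQuery and its methods are hypothetical and only mimic the style of the DSL; they are not the actual LattesMiner classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a fluent interface; not the actual LattesMiner classes.
class CurriculumQuery {
    private final String id;
    private final List<String> sections = new ArrayList<String>();

    // Private constructor: instances are created only through the factory method.
    private CurriculumQuery(String id) {
        this.id = id;
    }

    // Static factory method. With "import static", callers can write
    // load("0000") without naming the class or using "new".
    static CurriculumQuery load(String id) {
        return new CurriculumQuery(id);
    }

    // Each step records the requested section and returns "this",
    // which is what makes chaining possible.
    CurriculumQuery biodata() {
        sections.add("biodata");
        return this;
    }

    CurriculumQuery boards() {
        sections.add("boards");
        return this;
    }

    // Terminal step: reports which sections were requested for which ID.
    String describe() {
        return id + ":" + sections;
    }

    public static void main(String[] args) {
        System.out.println(load("0000").biodata().boards().describe());
    }
}
```

The chained call reads almost like a sentence in the problem domain, which is the readability benefit attributed to fluent interfaces above.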
![Page 15: LattesMiner: a Multilingual DSL for Information Extraction from Lattes Platform](https://reader036.fdocuments.net/reader036/viewer/2022062310/568160c8550346895dcff749/html5/thumbnails/15.jpg)
Case Study
http://plsql1.cnpq.br/divulg/RESULTADO_PQ_102003.curso
The following examples consider researchers in the Computer Science area
who hold a CNPq Research Productivity Scholarship.
The list contains the names of all the researchers.
However, their corresponding ID numbers are not provided.
![Page 16: LattesMiner: a Multilingual DSL for Information Extraction from Lattes Platform](https://reader036.fdocuments.net/reader036/viewer/2022062310/568160c8550346895dcff749/html5/thumbnails/16.jpg)
Listing 1
import java.util.*;
import lattes.util.Util;
import static lattes.miner.LattesMiner.*;

public class Listing1 {
    public static void main(String[] args) {
        List<String> list = new ArrayList<String>();
        // Look up the ID of each researcher by name
        for (String name : Util.getList("names.txt"))
            list.add(search(name));
        Util.setList(list, "ids.txt");
    }
}
Java application code
![Page 17: LattesMiner: a Multilingual DSL for Information Extraction from Lattes Platform](https://reader036.fdocuments.net/reader036/viewer/2022062310/568160c8550346895dcff749/html5/thumbnails/17.jpg)
Listing 2
dir("cvs");
for (String id : Util.getList("ids.txt"))
    download(id).save();
Code fragment used to download the Lattes curricula of
the researchers.
![Page 18: LattesMiner: a Multilingual DSL for Information Extraction from Lattes Platform](https://reader036.fdocuments.net/reader036/viewer/2022062310/568160c8550346895dcff749/html5/thumbnails/18.jpg)
Listing 3
props("mysql");
for (String id : Util.getList("ids.txt")) {
    load(id).biodata().address();
    publications(JOURNAL).save();
}
This listing shows how to extract data from the Lattes curricula
of the researchers.
![Page 19: LattesMiner: a Multilingual DSL for Information Extraction from Lattes Platform](https://reader036.fdocuments.net/reader036/viewer/2022062310/568160c8550346895dcff749/html5/thumbnails/19.jpg)
Listing 4
for (String id : Util.getList("ids.txt")) {
    // Portuguese
    for (Banca b : carregar(id).bancas().getBancas()) {
        if (b.ano() == 2010)
            System.out.println(b.aluno());
    }
    // English
    for (Board b : load(id).boards().getBoards()) {
        if (b.year() == 2010)
            System.out.println(b.student());
    }
}
Code fragment illustrating how LattesMiner is used to extract information in different languages.
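One plausible way to offer the same data under two vocabularies is to let the Portuguese class delegate to the English one. The sketch below is an assumption about how this could be done, not a description of how LattesMiner does it. It reuses the names Board, Banca, year, student, ano and aluno from the listing, but the class bodies and the sample values are invented.

```java
// Hypothetical sketch: a Portuguese-facing class that delegates to an
// English-facing one. The bodies are invented; only the names come from the listing.
class MultilingualSketch {
    public static void main(String[] args) {
        Board board = new Board(2010, "Student A");
        Banca banca = new Banca(board);
        // Both views expose the same underlying data.
        System.out.println(board.student() + " / " + banca.aluno());
    }
}

// English vocabulary.
class Board {
    private final int year;
    private final String student;

    Board(int year, String student) {
        this.year = year;
        this.student = student;
    }

    int year() { return year; }
    String student() { return student; }
}

// Portuguese vocabulary: every method forwards to the wrapped Board.
class Banca {
    private final Board board;

    Banca(Board board) {
        this.board = board;
    }

    int ano() { return board.year(); }
    String aluno() { return board.student(); }
}
```

With delegation, the extraction logic exists once and each additional language only adds thin wrapper classes.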
![Page 20: LattesMiner: a Multilingual DSL for Information Extraction from Lattes Platform](https://reader036.fdocuments.net/reader036/viewer/2022062310/568160c8550346895dcff749/html5/thumbnails/20.jpg)
Results
SUCUPIRA is a system for identifying and visualizing academic social networks.
It shows here the geographical distribution of the five researchers
who have published the most articles in scientific journals.
![Page 21: LattesMiner: a Multilingual DSL for Information Extraction from Lattes Platform](https://reader036.fdocuments.net/reader036/viewer/2022062310/568160c8550346895dcff749/html5/thumbnails/21.jpg)
Results
This is a contact graph of the five researchers who have published the most in scientific journals.
The graph depicts an academic social network of the five researchers.
Nodes are labeled with
the name of the researcher
The color of each edge represents the number of relationships between researchers.
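The number of relationships behind each edge can be computed with a simple pair count. The sketch below assumes that a relationship means a shared publication, which the slides do not state. The class CoauthorshipSketch, the method edgeWeights and the author labels are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: counts shared publications per pair of researchers.
// Treating "relationship" as co-authorship is an assumption.
class CoauthorshipSketch {

    // Returns, for every unordered pair of authors, the number of publications
    // on which both appear. Keys have the form "A - B" with A before B.
    static Map<String, Integer> edgeWeights(List<List<String>> publications) {
        Map<String, Integer> weights = new TreeMap<String, Integer>();
        for (List<String> authors : publications) {
            for (int i = 0; i < authors.size(); i++) {
                for (int j = i + 1; j < authors.size(); j++) {
                    String a = authors.get(i);
                    String b = authors.get(j);
                    // Order the pair so that A-B and B-A map to the same edge.
                    String key = a.compareTo(b) <= 0 ? a + " - " + b : b + " - " + a;
                    Integer old = weights.get(key);
                    weights.put(key, old == null ? 1 : old + 1);
                }
            }
        }
        return weights;
    }

    public static void main(String[] args) {
        List<List<String>> publications = Arrays.asList(
            Arrays.asList("A", "B"),
            Arrays.asList("B", "A", "C"));
        for (Map.Entry<String, Integer> e : edgeWeights(publications).entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue());
        }
    }
}
```

A visualization component could then map each count to an edge color.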
![Page 22: LattesMiner: a Multilingual DSL for Information Extraction from Lattes Platform](https://reader036.fdocuments.net/reader036/viewer/2022062310/568160c8550346895dcff749/html5/thumbnails/22.jpg)
Conclusions
Currently, the Lattes curricula are available in HTML
format
Applications written with LattesMiner, however, do not
depend on the data format, because users program
against a high-level abstraction
If the data format changes in the future, the DSL
interface remains the same
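The idea of a stable interface over a changeable data format can be sketched as follows. This is an assumption about how such isolation might look, not the LattesMiner design: NameExtractor, HtmlNameExtractor, XmlNameExtractor and the sample markup are all hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: client code depends only on the NameExtractor interface,
// so a change of input format means swapping the implementation.
class FormatIndependenceSketch {

    // Client code: unaware of whether the document is HTML or XML.
    static String report(NameExtractor extractor, String document) {
        return "Researcher: " + extractor.name(document);
    }

    public static void main(String[] args) {
        System.out.println(report(new HtmlNameExtractor(), "<h1>Researcher A</h1>"));
        System.out.println(report(new XmlNameExtractor(), "<researcher name=\"Researcher A\"/>"));
    }
}

interface NameExtractor {
    // Returns the researcher name found in the document, or null if none is found.
    String name(String document);
}

// Reads the name from an invented HTML layout.
class HtmlNameExtractor implements NameExtractor {
    private static final Pattern NAME = Pattern.compile("<h1>(.*?)</h1>");

    public String name(String document) {
        Matcher m = NAME.matcher(document);
        return m.find() ? m.group(1) : null;
    }
}

// Reads the name from an invented XML layout.
class XmlNameExtractor implements NameExtractor {
    private static final Pattern NAME = Pattern.compile("<researcher name=\"(.*?)\"");

    public String name(String document) {
        Matcher m = NAME.matcher(document);
        return m.find() ? m.group(1) : null;
    }
}
```

Only the extractor classes know the markup, so the report method stays unchanged when the format does.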
![Page 23: LattesMiner: a Multilingual DSL for Information Extraction from Lattes Platform](https://reader036.fdocuments.net/reader036/viewer/2022062310/568160c8550346895dcff749/html5/thumbnails/23.jpg)
Conclusions
An advantage of LattesMiner is that it can search by
the name of the researcher
LattesMiner is multilingual
Another advantage is that the extracted data can
be stored in a structured format (XML or
database), allowing these data to be easily used by
other applications
![Page 24: LattesMiner: a Multilingual DSL for Information Extraction from Lattes Platform](https://reader036.fdocuments.net/reader036/viewer/2022062310/568160c8550346895dcff749/html5/thumbnails/24.jpg)
Future work
The next step, already being implemented
in the LattesMiner DSL, is statistical analysis of
the data
![Page 25: LattesMiner: a Multilingual DSL for Information Extraction from Lattes Platform](https://reader036.fdocuments.net/reader036/viewer/2022062310/568160c8550346895dcff749/html5/thumbnails/25.jpg)
ACKNOWLEDGMENTS