Methods for accessing permanent information and their...
Gianmaria Silvello
Information Management Systems Research Group
Department of Information Engineering University of Padua
Workshop: Postdoctoral Research in Informatics
08 July 2015, Centro congressi “A. Luciani”, Padova
http://www.dei.unipd.it/~silvello/
Methods for accessing permanent information
and their evaluation
Methods for Efficient Access to Semi-Structured Data, Gianmaria Silvello
Main Aspects of Interest
- Data Model
- Data Access and Sharing
- Evaluation
[Figure: RDF class diagram. Classes: Resource, Namespace, IdentifiableResource, Concept, User, Evaluation Activity, Track, Run, Measure, Measurement, Quality Parameter, Statistic, DescriptiveStatistic. Properties: ims:expressAssessment, ims:isMeasuredBy, ims:isAssignedTo, ims:isEvaluatedBy, ims:evaluates, ims:isAssociatedTo, ims:describes, ims:isComposedBy, ims:submittedTo, ims:consistsOf, ims:isPartOf, rdfs:subClassOf.]
Structure, Access, Query, Evaluation
- Data: Structured (relational database), Semi-Structured, Unstructured
- Search/Access Paradigm: Exact-Match, Best-Match, Hybrid
- Query: XPath/XQuery, SPARQL, Natural Language Keywords
- Evaluation
- Efficiency: Time, Space
- Effectiveness: Precision, Recall, Accuracy, …
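The effectiveness measures named on this slide can be illustrated with a minimal sketch. It uses the standard set-based definitions of precision and recall; the document identifiers are hypothetical and only serve the example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved and a set of relevant items."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant items that were actually retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical document identifiers, used only for illustration.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d2", "d4", "d7"]

p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")
```

Precision rewards a result set with little noise, recall rewards a result set that misses little; both ignore the order of the results.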
The use of semi-structured data (XML)
- The use of XML is widespread in many sectors of everyday life
- Cultural heritage data: libraries, archives and museums
- Health data: protein sequences, pharmaceutical research
- Geographical data
- Linguistic data: Treebank, part-of-speech tagging and annotation
- Heterogeneous scientific datasets
NESTOR: Set-Based Approach to Access Hierarchical Data
[Figure: a tree with nodes a1 to a6, shown next to its representation as sets A1 to A6 in the Nested Sets Model and in the Inverse Nested Sets Model]
M. Agosti, N. Ferro, and G. Silvello (2011). Handling Hierarchically Structured Resources Addressing Interoperability Issues in Digital Libraries. Studies in Computational Intelligence vol. 375, pp. 17–49. Springer
XML:
<a1>text
  <a2>
    <a4> text </a4><a5> text </a5><a6> text </a6>
  </a2>
  <a3> text </a3>
</a1>
N. Ferro and G. Silvello (2013). NESTOR: A Formal Model for Digital Archives. Information Processing & Management (IP&M), 49(6):1206–1240, Elsevier
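The idea of mapping a tree onto nested sets can be sketched as follows. This is a simplified reading, not the formal definition from the NESTOR papers: it assumes that in the Nested Sets Model a node's set holds the node and all its descendants, and that in the Inverse Nested Sets Model it holds the node and all its ancestors. The tree is the one from the XML fragment above; the function names are illustrative.

```python
# Tree from the XML fragment: a1 has children a2 and a3; a2 has children a4, a5, a6.
children = {
    "a1": ["a2", "a3"],
    "a2": ["a4", "a5", "a6"],
    "a3": [], "a4": [], "a5": [], "a6": [],
}

def nested_sets(children, root):
    """NSM-style mapping (assumed): node -> the node plus all its descendants."""
    sets = {}
    def visit(node):
        members = {node}
        for child in children[node]:
            members |= visit(child)
        sets[node] = members
        return members
    visit(root)
    return sets

def inverse_nested_sets(children, root):
    """INSM-style mapping (assumed): node -> the node plus all its ancestors."""
    sets = {}
    def visit(node, ancestors):
        sets[node] = ancestors | {node}
        for child in children[node]:
            visit(child, sets[node])
    visit(root, set())
    return sets

nsm = nested_sets(children, "a1")
insm = inverse_nested_sets(children, "a1")

print(sorted(nsm["a2"]))                     # ['a2', 'a4', 'a5', 'a6']
print(sorted(insm["a5"]))                    # ['a1', 'a2', 'a5']
print(nsm["a4"] < nsm["a2"] < nsm["a1"])     # True: child sets are included in parent sets
print(insm["a1"] < insm["a2"] < insm["a4"])  # True: the inclusion order is reversed
```

In both mappings the parent-child relation of the tree becomes set inclusion, which is what allows hierarchical queries to be answered with set operations.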
Efficient Implementations of NESTOR
Set-Wise and Element-Wise Query Primitives
NESTOR Model: Nested Sets Model (NSM), Inverse Nested Sets Model (INSM)
Data Structures: Direct Data Structure (DDS), Inverse Data Structure (IDS), Hybrid Data Structure (HDS)
XPath Axes: Descendants, Ancestors, Children, Parent

Structure     DDS      IDS      HDS
Descendants   O(m)     O(1)     O(m)
Ancestors     O(1)     O(m)     O(m)
Parent        O(1)     O(m)     O(1)
Children      O(1)     O(1)     O(1)

Content       DDS      IDS      HDS
Descendants   O(1)     O(m+n)   O(m+n)
Ancestors     O(m+n)   O(1)     O(m+n)
Parent        O(n)     O(m+n)   O(n)
Children      O(n)     O(n)     O(1)
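To make the four XPath axes concrete, here is a naive sketch that answers them with set inclusion over NSM-style sets (node plus descendants, an assumption about the model). It is not the DDS, IDS or HDS implementation and does not achieve the complexities in the tables; the sets are written out by hand for the example tree used earlier.

```python
# NSM-style sets for the example tree (a1 -> a2, a3; a2 -> a4, a5, a6).
nsm = {
    "a1": {"a1", "a2", "a3", "a4", "a5", "a6"},
    "a2": {"a2", "a4", "a5", "a6"},
    "a3": {"a3"},
    "a4": {"a4"},
    "a5": {"a5"},
    "a6": {"a6"},
}

def descendants(node):
    """Everything in the node's set except the node itself."""
    return nsm[node] - {node}

def ancestors(node):
    """Nodes whose set strictly contains the node's set."""
    return {other for other, members in nsm.items() if nsm[node] < members}

def parent(node):
    """The ancestor with the smallest set, or None for the root."""
    anc = ancestors(node)
    return min(anc, key=lambda a: len(nsm[a])) if anc else None

def children(node):
    """Descendants whose parent is the node."""
    return {d for d in descendants(node) if parent(d) == node}

print(sorted(descendants("a2")), sorted(ancestors("a5")), parent("a5"), sorted(children("a1")))
```

Each axis here scans all sets, so the sketch only shows what the primitives compute; the data structures compared in the tables exist precisely to avoid such scans.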
Efficiency Evaluation: Space and Time
25 XPath-Based Queries over Wikipedia from INEX (à la TPC benchmark)
[Figure: INEX Evaluation Time. Execution times (msec, log scale) for the 25 structure query templates; systems compared: DDS, IDS, HDS, Xalan, Jaxen, JXPath, BaseX]
[Figure panels: Descendants Element-Wise Primitive - Collaborative Knowledge; Descendants; Union Descendants; Intersection Descendants]
Efficiency Evaluation: Space and Time
[Figures: Average Index Building Time (Wikipedia XML files); Average Occupied Memory (Wikipedia XML file)]
The Other Side of Evaluation: Effectiveness
SELECT name FROM hotel WHERE city='Padua'
I need a comfortable accommodation in Padua, Italy
- User-oriented evaluation
- From structured queries to information needs
- From sets of results to ranked lists ordered by relevance
Sheraton Hotel
Methis Hotel
Best Western Premier
Toscanelli
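The contrast between the two paradigms on this slide can be sketched with a small example. The hotel names come from the slide; the descriptions and the term-overlap scoring function are hypothetical and far simpler than a real retrieval model.

```python
import sqlite3

# Hotel names come from the slide; the descriptions are hypothetical.
hotels = [
    ("Sheraton Hotel", "Padua", "comfortable rooms near the motorway"),
    ("Methis Hotel", "Padua", "comfortable accommodation close to the centre of Padua"),
    ("Best Western Premier", "Padua", "business hotel with conference rooms"),
    ("Toscanelli", "Padua", "small hotel in the old town"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hotel (name TEXT, city TEXT, description TEXT)")
conn.executemany("INSERT INTO hotel VALUES (?, ?, ?)", hotels)

# Exact match: an unordered set of rows that satisfy the predicate.
exact = {row[0] for row in conn.execute("SELECT name FROM hotel WHERE city = 'Padua'")}

# Best match: score every hotel against the information need, then rank.
need = "I need a comfortable accommodation in Padua, Italy"
query_terms = set(need.lower().replace(",", "").split())

def score(description):
    """Number of query terms that also occur in the description (toy scoring)."""
    return len(query_terms & set(description.lower().split()))

ranked = sorted(hotels, key=lambda h: score(h[2]), reverse=True)

print(sorted(exact))
print([name for name, _, _ in ranked])
```

The SQL query returns every hotel in Padua with no preference among them, while the ranked list orders the same hotels by how well they match the stated need.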
Effectiveness-Oriented Evaluation
- Evaluation is a demanding activity carried out in international evaluation campaigns to share the effort and compare the experiments
- We designed a visual analytics tool to ease evaluation work and reduce the effort required to carry out performance, failure and what-if analyses
M. Angelini, N. Ferro, G. Santucci, and G. Silvello (2014). A Visual Tool for Information Retrieval Performance Evaluation and Failure Analysis, Journal of Visual Languages and Computing, 25(4):394–413, Elsevier.
Effectiveness-Oriented Evaluation
- The most common effectiveness measures evaluate the utility achieved by the user
- We propose the Twist measure to evaluate the effort required of the user
N. Ferro, G. Silvello, H. Keskustalo, A. Pirkola and K. Järvelin (2015). The Twist Measure for IR Evaluation: Taking User’s Effort into Account, Journal of the Association for Information Science and Technology, John Wiley & Sons, Inc. (in print).
[Figure: four runs from TREC 10, 2001, Web; each panel plots CRP and DCG against Rank.
(a) Huge effort: jscbtawtl4, topic 539; Twist = 0.1899, nDCG = 0.5855; Full-scale like run
(b) High effort: hum01t, topic 544; Twist = 0.3654, nDCG = 0.7693; Typical run - type B
(c) Medium effort: jscbtawtl4, topic 544; Twist = 0.6435, nDCG = 0.8641; Typical run - type A
(d) Low effort: pir1wt2, topic 504; Twist = 0.8755, nDCG = 0.8195; Excellent run - type B
Summary panel: "Effort vs Gain: TREC 10, 2001, Web", with axes Twist and nDCG, showing the four run/topic pairs.]
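The DCG and nDCG values reported in the figure belong to the family of utility-oriented measures. The following is a minimal sketch of DCG with a logarithmic discount and of a simplified nDCG; the log base of 2 and the gain values are assumptions, and the normalisation uses only the gains of the run itself rather than the ideal ranking over all judged documents. Twist is not implemented because its definition does not appear on the slide.

```python
import math

def dcg(gains, base=2):
    """Discounted cumulated gain; ranks below `base` are not discounted."""
    total = 0.0
    for rank, gain in enumerate(gains, start=1):
        total += gain / max(1.0, math.log(rank, base))
    return total

def ndcg(gains, base=2):
    """DCG divided by the DCG of the same gains sorted in decreasing order (simplified)."""
    ideal = dcg(sorted(gains, reverse=True), base)
    return dcg(gains, base) / ideal if ideal > 0 else 0.0

# Hypothetical graded relevance of the documents at ranks 1 to 4.
run = [3, 0, 2, 1]
print(f"DCG={dcg(run):.4f} nDCG={ndcg(run):.4f}")
```

A run that places its most relevant documents first reaches an nDCG of 1 under this simplification, whatever the effort the user spends scanning the list, which is the gap an effort-oriented measure is meant to address.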
Experimental Evaluation and Reproducibility
- Experimental evaluation enables repeatability, reproducibility and generalization of the experiments
N. Ferro and G. Silvello (2015). Rank-Biased Precision Reloaded: Reproducibility and Generalization. Proc. of the 37th European Conference on Information Retrieval (ECIR 2015), LNCS 9022, pp. 768–780. Springer
- Experiments and findings are connected to scientific papers describing them: actionable papers
[Figure labels: repeatability, reproducibility]
Actionable Papers
<a href="http://direct.dei.unipd.it/017c333a-4b7c-4267-926d-f15fe3554efd">
<img src="http://direct.dei.unipd.it/017c333a-4b7c-4267-926d-f15fe3554efd/177bcef2-00a0-4f59-b781-f285610f1c6f
Data Citation
G. Silvello (2015). A Methodology for Citing Linked Open Data Subsets, D-Lib Magazine, 21(1/2).
P. Buneman and G. Silvello (2010). A Rule-Based Citation System for Structured and Evolving Datasets. Bulletin of the Technical Committee on Data Engineering, 3(3):33–41
<iuphar>
  <name>IUPHAR-DB</name>
  <citation>Rule0</citation>
  [...]
  <gpcr>
    <name>G protein-coupled receptors</name>
    <citation>Rule1</citation>
    [...]
    <family>
      <id>29</id>
      <name>Glucagon receptor family</name>
      <citation>Rule2</citation>
      <receptor>
        <id>247</id>
        <name>GHRH</name>
        [...]
        <agonists> <ligand> [...] </ligand> </agonists>
        [...]
      </receptor>
      [...]
    </family>
    [...]
  </gpcr>
  <ionchannels> [...] </ionchannels>
</iuphar>
Rules:
iuphar[name=$.d,url=$.u, version=$.v]
iuphar[]/gpcr[name=$.n]
iuphar[]/gpcr[]/family[name=$.f,id=$.i]/contributors[]/contributor[name=$?c]
Instantiation of the variables:
{database=$d, version=$v, contributors=$c, db-family=$n, family=$f, idFamily=$i}
The rules are recursively processed by the system and then transformed into a conjunction of XPaths.
The interpretation of the XPaths generates the citation.
The citation that gets generated (example):
{ database=IUPHAR-DB: the IUPHAR database || url=http://www.iuphar-db.org/ || version=15 || dbFamily=G protein-coupled receptors || family=Glucagon receptor family || idFamily=29 || contributor={Laurence J. Miller;;Daniel J. Drucker;;[...];;Rebecca Hills;;}}
[Slide callouts: The first rule interpreted by the system; The second rule interpreted by the system; The third rule interpreted by the system]
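The step from rules to XPaths to a citation can be sketched as follows. This is not the rule-based system from the paper: the rules are replaced by hand-written paths, the document is a cut-down hypothetical version of the IUPHAR fragment, and the placement of url, version and contributors as child elements is assumed from the rules. It uses the limited XPath subset supported by Python's `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, cut-down document: the layout of <url>, <version> and
# <contributors> is assumed from the rules, not taken from the real database.
DOC = """
<iuphar>
  <name>IUPHAR-DB</name><url>http://www.iuphar-db.org/</url><version>15</version>
  <gpcr>
    <name>G protein-coupled receptors</name>
    <family>
      <id>29</id>
      <name>Glucagon receptor family</name>
      <contributors>
        <contributor><name>Laurence J. Miller</name></contributor>
        <contributor><name>Daniel J. Drucker</name></contributor>
      </contributors>
    </family>
  </gpcr>
</iuphar>
"""

# Each citation field is bound to one path, playing the role of a rule variable.
PATHS = {
    "database": "./name",
    "url": "./url",
    "version": "./version",
    "dbFamily": "./gpcr/name",
    "family": "./gpcr/family[id='29']/name",
    "idFamily": "./gpcr/family[id='29']/id",
    "contributor": "./gpcr/family[id='29']/contributors/contributor/name",
}

def cite(root, paths):
    """Evaluate every path against the document and collect the field values."""
    citation = {}
    for field, path in paths.items():
        values = [el.text.strip() for el in root.findall(path)]
        citation[field] = values[0] if len(values) == 1 else values
    return citation

root = ET.fromstring(DOC)
citation = cite(root, PATHS)
print(" || ".join(f"{field}={value}" for field, value in citation.items()))
```

Because the citation is computed from the data by path evaluation, citing a different family only requires changing the identifier in the paths, not rewriting the citation text.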
Human-readable reference:
John Doe and Marco Rossi, "SystemA performances at CLEF 2009", 08 July 2014, <http://example.org/CLEF2009-systemA>

Machine-readable reference:
Subject                Property     Object          Name
ex:cit-sysA-CLEF2009   dc:creator   "John Doe"      <http://example.org/CLEF2009-systemA>
ex:cit-sysA-CLEF2009   dc:creator   "Marco Rossi"   <http://example.org/CLEF2009-systemA>
ex:cit-sysA-CLEF2009   dc:date      "2014-07-08"    <http://example.org/CLEF2009-systemA>
ex:cit-sysA-CLEF2009   dc:title     "SystemA ..."   <http://example.org/CLEF2009-systemA>
[Figure: graph of cit-sysA-CLEF2009 linked by dc:creator to John Doe and Marco Rossi, by dc:date to 2014-07-08, and by dc:title to SystemA performances at CLEF 2009]
Copyright © 2015 Gianmaria Silvello
Original cited LOD subset:
Subject       Property          Object        Name
ex:systemA    ex:produce        ex:expA       ex:n1
ex:expA       ex:measure        ex:measureA   ex:n2
ex:expA       ex:submitted-to   ex:CLEF2009   ex:n3
ex:measureA   ex:name           "precision"   ex:n4
ex:measureA   ex:value          "0.7"         ex:n5

Machine-readable citation meta-graph:
Subject   Property               Object   Name
ex:n1     schema:is-related-to   ex:n2    ex:cit-sysA-CLEF2009
ex:n1     schema:is-related-to   ex:n3    ex:cit-sysA-CLEF2009
ex:n2     schema:is-related-to   ex:n4    ex:cit-sysA-CLEF2009
ex:n2     schema:is-related-to   ex:n5    ex:cit-sysA-CLEF2009
[Figures: the cited subset drawn as a graph (ex:systemA, ex:expA, ex:CLEF2009, ex:measureA, precision, 0.70) and the meta-graph over the named graphs n1 to n5 connected by schema:is-related-to]
[Slide panels: Hierarchical Data (XML) Citation; Graph Data (RDF) Citation]
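The relation between the cited subset and its meta-graph can be sketched with plain tuples. The quads are copied from the tables above; the `resolve` function is a hypothetical helper that shows how a citation name leads back to the cited triples, and it is not part of the published methodology.

```python
# Quads (subject, property, object, graph name) copied from the tables above.
cited_subset = [
    ("ex:systemA", "ex:produce", "ex:expA", "ex:n1"),
    ("ex:expA", "ex:measure", "ex:measureA", "ex:n2"),
    ("ex:expA", "ex:submitted-to", "ex:CLEF2009", "ex:n3"),
    ("ex:measureA", "ex:name", '"precision"', "ex:n4"),
    ("ex:measureA", "ex:value", '"0.7"', "ex:n5"),
]
meta_graph = [
    ("ex:n1", "schema:is-related-to", "ex:n2", "ex:cit-sysA-CLEF2009"),
    ("ex:n1", "schema:is-related-to", "ex:n3", "ex:cit-sysA-CLEF2009"),
    ("ex:n2", "schema:is-related-to", "ex:n4", "ex:cit-sysA-CLEF2009"),
    ("ex:n2", "schema:is-related-to", "ex:n5", "ex:cit-sysA-CLEF2009"),
]

def resolve(citation, meta_graph, data):
    """Return the triples of every named graph mentioned in the citation's meta-graph."""
    names = set()
    for subject, _, obj, graph in meta_graph:
        if graph == citation:
            names.update((subject, obj))
    return [(s, p, o) for s, p, o, graph in data if graph in names]

triples = resolve("ex:cit-sysA-CLEF2009", meta_graph, cited_subset)
for triple in triples:
    print(*triple)
```

Each cited triple sits in its own named graph, and the meta-graph groups those graph names under the citation identifier, so the citation can be dereferenced to exactly the subset it refers to.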
Future Directions
- Data citation methods for evolving datasets
- Effectiveness-oriented evaluation of keyword-based systems over structured data
- Data modeling, sharing and enriching via the Linked (Open) Data paradigm
Selected Publications
- N. Ferro, G. Silvello, H. Keskustalo, A. Pirkola and K. Järvelin (2015). The Twist Measure for IR Evaluation: Taking User’s Effort into Account, Journal of the Association for Information Science and Technology, in print.
- G. Silvello (2015). A Methodology for Citing Linked Open Data Subsets, D-Lib Magazine, 21(1/2). DOI: 10.1045/january2015-silvello
- M. Angelini, N. Ferro, G. Santucci, and G. Silvello (2014). A Visual Tool for Information Retrieval Performance Evaluation and Failure Analysis, Journal of Visual Languages and Computing, 25(4):394–413.
- E. Di Buccio, G. Di Nunzio, and G. Silvello (2014). A Linked Open Data Approach for Geolinguistics Applications, International Journal of Metadata, Semantics and Ontologies, 9(1):29–41.
- N. Ferro and G. Silvello (2013). NESTOR: A Formal Model for Digital Archives. Information Processing & Management (IP&M), 49(6):1206–1240.
- E. Di Buccio, G. Di Nunzio and G. Silvello (2013). A Curated and Evolving Linguistic Linked Dataset. Semantic Web, 4(3):265–270, P. Hitzler and K. Janowicz eds.
- P. Buneman and G. Silvello (2010). A Rule-Based Citation System for Structured and Evolving Datasets. IEEE Bulletin of the Technical Committee on Data Engineering, 3(3):33–41.
- N. Ferro and G. Silvello (2015). Rank-Biased Precision Reloaded: Reproducibility and Generalization. In Proc. of the 37th European Conference on Information Retrieval (ECIR 2015), LNCS 9022, pp. 768–780. Springer.
- M. Angelini, N. Ferro, G. Santucci and G. Silvello (2015). Tutorial: Visual Analytics for Information Retrieval Evaluation (VAIRË 2015). In Proc. of the 37th European Conference on Information Retrieval (ECIR 2015), LNCS 9022, pp. 809–812. Springer.
- N. Ferro and G. Silvello (2014). CLEF 15th Birthday: What Can We Learn From Ad Hoc Retrieval? In Proc. of the Information Access Evaluation. Multilinguality, Multimodality, and Interaction - 5th International Conference of the Cross-Language Evaluation Forum (CLEF 2014), LNCS 8685, pp. 32–44. Springer.
- M. Angelini, N. Ferro, G. Santucci and G. Silvello (2014). A Visual Interactive Environment for Making Sense of Experimental Data. In Proc. of the 36th European Conference on Information Retrieval (ECIR 2014), LNCS 8416, pp. 767–770. Springer.
- E. Di Buccio, G. M. Di Nunzio and G. Silvello (2013). A Geolinguistic Web Application Based on Linked Open Data. In Proc. of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’13), pp. 1101–1102. ACM, New York, NY, USA.
- N. Ferro and G. Silvello (2013). Formal Models for Digital Archives: NESTOR and the 5S. In Proc. of the Research and Advanced Technology for Digital Libraries - International Conference on Theory and Practice of Digital Libraries (TPDL 2013), LNCS 8092, pp. 192–203. Springer.
- M. Angelini, N. Ferro, G. Granato, G. Santucci and G. Silvello (2012). Information retrieval failure analysis: Visual analytics as a support for interactive “what-if” investigation. In IEEE Conference on Visual Analytics Science and Technology, VAST 2012, pp. 204–206. IEEE Computer Society, USA.
- M. Angelini, N. Ferro, G. Santucci, and G. Silvello (2012). Visual Interactive Failure Analysis. In Proc. of the Fourth Information Interaction in Context Symposium (IIiX 2012). ACM Press, New York, USA.
- M. Agosti, N. Ferro, and G. Silvello (2011). Handling Hierarchically Structured Resources Addressing Interoperability Issues in Digital Libraries. In Learning Structure and Schemas from Documents. Studies in Computational Intelligence vol. 375, pp. 17–49.
- N. Ferro and G. Silvello (2009). The NESTOR Framework: How to Handle Hierarchical Data Structures. In Proc. of the 13th European Conference on Research and Advanced Technology for Digital Libraries (ECDL 2009), LNCS 5741, pp. 215–226. Springer-Verlag.
- M. Agosti, N. Ferro and G. Silvello (2009). Access and Exchange of Hierarchically Structured Resources on the Web with the NESTOR Framework. In Proc. of the IEEE/WIC/ACM International Joint Conferences on Web Intelligence and Intelligent Agent Technologies, 2009, pp. 659–662. IEEE Computer Society.