Non-textual ranking in Digital Libraries
![Page 1: Non-textual ranking in Digital Libraries](https://reader034.fdocuments.net/reader034/viewer/2022052621/5584467cd8b42a6a6d8b45eb/html5/thumbnails/1.jpg)
1
Non-textual ranking in
digital libraries
Philipp Mayr
Hochschule Darmstadt, Fachbereich Media
Jour fixe ISE, 18 November 2009
Slides in cooperation with Peter Mutschke & Philipp Schaer
![Page 2: Non-textual ranking in Digital Libraries](https://reader034.fdocuments.net/reader034/viewer/2022052621/5584467cd8b42a6a6d8b45eb/html5/thumbnails/2.jpg)
2
Agenda
• Introduction
• Ranking in DL
• IRM project
• Non-textual ranking in IRM
• Results
• Conclusion & Demo
![Page 3: Non-textual ranking in Digital Libraries](https://reader034.fdocuments.net/reader034/viewer/2022052621/5584467cd8b42a6a6d8b45eb/html5/thumbnails/3.jpg)
3
Background I
Database perspective:
• Large and heterogeneous document sets for
subject specific questions
• Various relevant and accessible databases
for a topic
• Focus on bibliographic databases (metadata)
• journal articles
• monographs
![Page 4: Non-textual ranking in Digital Libraries](https://reader034.fdocuments.net/reader034/viewer/2022052621/5584467cd8b42a6a6d8b45eb/html5/thumbnails/4.jpg)
4
Background II
User perspective:
• Relevant, high-quality documents (relevance ranking)
• Comprehensive search: documents from other fields
• Flexible search systems: alternative search strategies
and techniques (e.g. Berrypicking)
• Added value, e.g. direct access to full texts or metrics
like citation counts in Google Scholar
![Page 5: Non-textual ranking in Digital Libraries](https://reader034.fdocuments.net/reader034/viewer/2022052621/5584467cd8b42a6a6d8b45eb/html5/thumbnails/5.jpg)
5
Digital Libraries
• Indexing & Abstracting databases
• Library catalogues
• Full texts
• Links to online resources
• Data
• Digital objects
• …
![Page 6: Non-textual ranking in Digital Libraries](https://reader034.fdocuments.net/reader034/viewer/2022052621/5584467cd8b42a6a6d8b45eb/html5/thumbnails/6.jpg)
6
Ranking
Models:
• Exact match vs. best match (e.g. tf-idf)
• Sorting vs. ranking
Textual vs. non-textual ranking
![Page 7: Non-textual ranking in Digital Libraries](https://reader034.fdocuments.net/reader034/viewer/2022052621/5584467cd8b42a6a6d8b45eb/html5/thumbnails/7.jpg)
7
Ranking: non-textual
Link analysis (PageRank, HITS)
Relevance feedback (user feedback)
Popularity (documents accessed)
![Page 8: Non-textual ranking in Digital Libraries](https://reader034.fdocuments.net/reader034/viewer/2022052621/5584467cd8b42a6a6d8b45eb/html5/thumbnails/8.jpg)
8
Ranking in Digital Libraries
![Page 9: Non-textual ranking in Digital Libraries](https://reader034.fdocuments.net/reader034/viewer/2022052621/5584467cd8b42a6a6d8b45eb/html5/thumbnails/9.jpg)
9
Ranking in Digital Libraries
![Page 10: Non-textual ranking in Digital Libraries](https://reader034.fdocuments.net/reader034/viewer/2022052621/5584467cd8b42a6a6d8b45eb/html5/thumbnails/10.jpg)
10
Ranking in Digital Libraries
![Page 11: Non-textual ranking in Digital Libraries](https://reader034.fdocuments.net/reader034/viewer/2022052621/5584467cd8b42a6a6d8b45eb/html5/thumbnails/11.jpg)
11
Non-textual ranking in DL
Problematic in DL:
• Link analysis (PageRank, HITS)
• Relevance feedback (user feedback)
• Popularity (documents accessed)
BUT we have:
• High quality metadata
• Controlled vocabularies
• Maintained (curated) collections
![Page 12: Non-textual ranking in Digital Libraries](https://reader034.fdocuments.net/reader034/viewer/2022052621/5584467cd8b42a6a6d8b45eb/html5/thumbnails/12.jpg)
12
Project IRM
![Page 13: Non-textual ranking in Digital Libraries](https://reader034.fdocuments.net/reader034/viewer/2022052621/5584467cd8b42a6a6d8b45eb/html5/thumbnails/13.jpg)
13
Value Added Services for IR Systems
Major problem areas of scholarly IR systems (Krause 2007):
1. search term vagueness
2. information overload by large result sets
IRM services → structural attributes of the science system:
• (1) Search Term Recommender: more appropriate terms from
controlled vocabulary (co-word analysis)
• (2a) Bradfordizing: re-ranking by core journals (bibliometrics)
• (2b) Author Centrality: re-ranking by centrality in co-authorship
networks (network analysis)
![Page 14: Non-textual ranking in Digital Libraries](https://reader034.fdocuments.net/reader034/viewer/2022052621/5584467cd8b42a6a6d8b45eb/html5/thumbnails/14.jpg)
14
Search Term Recommender (Petras 2006)
Search Term Service: recommending strongly associated terms from controlled vocabulary
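A minimal sketch of the co-word idea behind such a recommender, assuming every record carries a set of free-text terms and a set of controlled-vocabulary descriptors. The Jaccard association measure, the function name `recommend_terms` and the toy records are illustrative assumptions, not the measure or data used in the IRM service.

```python
from collections import Counter, defaultdict

def recommend_terms(records, query_term, top_n=3):
    """Rank controlled terms by Jaccard association with a free-text term.

    records: iterable of (free_terms, controlled_terms) pairs of sets.
    """
    free_df = Counter()           # number of records containing each free-text term
    ctrl_df = Counter()           # number of records containing each controlled term
    co_df = defaultdict(Counter)  # co-occurrences: free term -> controlled term
    for free_terms, controlled_terms in records:
        for c in controlled_terms:
            ctrl_df[c] += 1
        for f in free_terms:
            free_df[f] += 1
            for c in controlled_terms:
                co_df[f][c] += 1
    scores = {}
    for c, both in co_df[query_term].items():
        union = free_df[query_term] + ctrl_df[c] - both
        scores[c] = both / union
    # Strongest association first; ties broken alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]

# Hypothetical toy records: (title terms, descriptors).
records = [
    ({"unemployment", "youth"}, {"labor market", "adolescent"}),
    ({"unemployment", "policy"}, {"labor market", "employment policy"}),
    ({"youth", "violence"}, {"adolescent", "violence"}),
]
suggestions = recommend_terms(records, "unemployment")
```

In this toy collection the descriptor that always co-occurs with the query term is suggested first.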
![Page 15: Non-textual ranking in Digital Libraries](https://reader034.fdocuments.net/reader034/viewer/2022052621/5584467cd8b42a6a6d8b45eb/html5/thumbnails/15.jpg)
15
Bradfordizing (White 1981, Mayr 2009)
Bradford's Law of Scattering (Bradford 1948): idealized example for 450 articles
• Nucleus/Core: 150 papers in 3 journals
• Zone 2: 150 papers in 9 journals
• Zone 3: 150 papers in 27 journals
Ranking by Bradfordizing: sorting the core journal papers / core books on top
• Bradfordized list of journals in informetrics
• Applied to monographs: publisher as sorting criterion
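The zone split in the idealized example can be sketched as follows. This is a minimal illustration, assuming zones are cut wherever the cumulative article count passes one third of the total; the function name `bradford_zones` and the synthetic journal names are assumptions, not part of the original slides.

```python
from collections import Counter

def bradford_zones(journal_of_article, n_zones=3):
    """Split journals into zones holding roughly equal numbers of articles.

    journal_of_article: list with one journal identifier per article.
    Returns a list of zones (lists of journals), most productive journals first.
    """
    counts = Counter(journal_of_article)
    target = sum(counts.values()) / n_zones
    zones = [[] for _ in range(n_zones)]
    cumulative = 0
    zone = 0
    for journal, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        # Advance once the current zone already holds its share of articles.
        while zone < n_zones - 1 and cumulative >= target * (zone + 1):
            zone += 1
        zones[zone].append(journal)
        cumulative += n
    return zones

# Synthetic data reproducing the idealized 450-article example.
articles = []
for i in range(3):
    articles += [f"core-{i}"] * 50
for i in range(9):
    articles += [f"mid-{i}"] * (17 if i < 6 else 16)
for i in range(27):
    articles += [f"tail-{i:02d}"] * (6 if i < 15 else 5)

zones = bradford_zones(articles)
zone_sizes = [len(z) for z in zones]
```

Each zone holds 150 articles, while the number of journals needed to supply them grows by a factor of 3 from zone to zone.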
![Page 16: Non-textual ranking in Digital Libraries](https://reader034.fdocuments.net/reader034/viewer/2022052621/5584467cd8b42a6a6d8b45eb/html5/thumbnails/16.jpg)
16
Author Centrality (Mutschke 2001, 2004)
Ranking by Author Centrality: sorting central author papers on top
![Page 17: Non-textual ranking in Digital Libraries](https://reader034.fdocuments.net/reader034/viewer/2022052621/5584467cd8b42a6a6d8b45eb/html5/thumbnails/17.jpg)
17
Scenarios for combined ranking services
[Figure: two diagrams. Iterative use: Result Set, Core Journal Papers, Central Author Papers, Relevant Papers. Simultaneous use: Result Set, Central Author Papers, Core Journal Papers.]
![Page 18: Non-textual ranking in Digital Libraries](https://reader034.fdocuments.net/reader034/viewer/2022052621/5584467cd8b42a6a6d8b45eb/html5/thumbnails/18.jpg)
18
Combination Matrix
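| Combination | Author Centrality | Bradfordizing | STR |
|---|---|---|---|
| 0 | - | - | - |
| 1 | 1 | - | - |
| 2 | - | 1 | - |
| 3 | - | - | 1 |
| 4 | 1 | 2 | - |
| 5 | 1 | 1 | - |
| 6 | 2 | 1 | - |
| 7 | 2 | - | 1 |
| 8 | - | 2 | 1 |
| 9 | 2 | 3 | 1 |
| 10 | 3 | 2 | 1 |
| 11 | 2 | 2 | 1 |

(number = order; same number in a row = simultaneous use)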
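One way to read the matrix is as an execution plan: services are grouped by their order number, and services sharing a number form one simultaneous step. The sketch below encodes the twelve rows as data; the names `COMBINATIONS` and `execution_plan` are illustrative assumptions and say nothing about how the prototype is implemented.

```python
from itertools import groupby

SERVICES = ("Author Centrality", "Bradfordizing", "STR")

# Rows of the combination matrix; None marks a service that is not used.
COMBINATIONS = {
    0: (None, None, None),
    1: (1, None, None),
    2: (None, 1, None),
    3: (None, None, 1),
    4: (1, 2, None),
    5: (1, 1, None),
    6: (2, 1, None),
    7: (2, None, 1),
    8: (None, 2, 1),
    9: (2, 3, 1),
    10: (3, 2, 1),
    11: (2, 2, 1),
}

def execution_plan(combination):
    """Return a list of steps; each step lists the services applied together."""
    used = [
        (order, name)
        for name, order in zip(SERVICES, COMBINATIONS[combination])
        if order is not None
    ]
    used.sort()
    return [
        [name for _, name in group]
        for _, group in groupby(used, key=lambda pair: pair[0])
    ]

plan_9 = execution_plan(9)    # three iterative steps
plan_11 = execution_plan(11)  # STR first, then both re-rankings together
```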
![Page 19: Non-textual ranking in Digital Libraries](https://reader034.fdocuments.net/reader034/viewer/2022052621/5584467cd8b42a6a6d8b45eb/html5/thumbnails/19.jpg)
19
Main Research Issue:
Contribution to retrieval quality and usability
• Precision:
– Do central authors (core journals) provide more relevant hits?
– Do highly associated cowords have any positive effects?
• Value-adding effects:
– Do central authors (core journals) provide OTHER relevant hits?
– Do coword-relationships provide OTHER relevant search terms?
• Mashup effects:
– Do combinations of the services enhance the effects?
![Page 20: Non-textual ranking in Digital Libraries](https://reader034.fdocuments.net/reader034/viewer/2022052621/5584467cd8b42a6a6d8b45eb/html5/thumbnails/20.jpg)
20
Evaluation Design
• precision in existing evaluation data:
– CLEF 2003-2007: 125 topics; 65,297 SOLIS documents
– KoMoHe 2007: 39 topics; 31,155 SOLIS documents
• plausibility tests:
– author centrality / journal coreness ↔ precision
– Bradfordizing ↔ author centrality
• precision tests with users (Online-Assessment-Tool)
• usability tests with users (acceptance)
![Page 21: Non-textual ranking in Digital Libraries](https://reader034.fdocuments.net/reader034/viewer/2022052621/5584467cd8b42a6a6d8b45eb/html5/thumbnails/21.jpg)
21
Prototype Architecture
2,235,769 documents from
![Page 22: Non-textual ranking in Digital Libraries](https://reader034.fdocuments.net/reader034/viewer/2022052621/5584467cd8b42a6a6d8b45eb/html5/thumbnails/22.jpg)
22
Motivation: non-textual approaches in DL
• Larger document sets for subject specific
searches need to be concentrated again
(compensation, structuring)
• Exploring alternative ranking approaches
which can provide insights into document spaces
and enhance retrieval
• Plausibility that the nucleus of a literature or
central authors are useful for users
searching large document spaces
![Page 23: Non-textual ranking in Digital Libraries](https://reader034.fdocuments.net/reader034/viewer/2022052621/5584467cd8b42a6a6d8b45eb/html5/thumbnails/23.jpg)
23
Bradfordizing
Basis: Bradford Law of Scattering
Approach: Usage of the document distributions
(scattering) in scientific journal and monograph
publications.
Core journals on research topics -> bibliometric
approach
• Identification of „core journals“ and core
publishers
• ISSN and ISBN as identifiers
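Re-ranking a result list by journal productivity can be sketched as below. This is a simplified illustration, assuming each hit carries an ISSN and that journals are ordered by their number of hits in the current result set; the function name `bradfordize`, the record layout and the ISSN values are hypothetical.

```python
from collections import Counter

def bradfordize(results):
    """Re-rank hits so that papers from the most productive journals come first.

    results: list of dicts with at least an "issn" key, in original rank order.
    """
    freq = Counter(doc["issn"] for doc in results)
    # sorted() is stable: hits with equal journal frequency keep their original order.
    return sorted(results, key=lambda doc: -freq[doc["issn"]])

# Hypothetical result list; the ISSNs are placeholders, not real journals.
results = [
    {"id": "d1", "issn": "1111-1111"},
    {"id": "d2", "issn": "2222-2222"},
    {"id": "d3", "issn": "2222-2222"},
    {"id": "d4", "issn": "3333-3333"},
    {"id": "d5", "issn": "2222-2222"},
    {"id": "d6", "issn": "1111-1111"},
]
reranked_ids = [doc["id"] for doc in bradfordize(results)]
```

For monographs the same sketch would apply with a publisher identifier derived from the ISBN in place of the ISSN.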
![Page 24: Non-textual ranking in Digital Libraries](https://reader034.fdocuments.net/reader034/viewer/2022052621/5584467cd8b42a6a6d8b45eb/html5/thumbnails/24.jpg)
24
Author Centrality
Basis: Graph theory, network analysis
Approach: Usage of the interaction
(communication) pattern -> coauthorship relations
in a research community
• Identification of „experts“
• Identification of networked, „central“ persons
• Different centrality measures
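A minimal sketch of ranking by author centrality, using closeness as one of the possible measures. It assumes every paper has at least one author, scores a paper by its most central author, and normalizes closeness as reachable authors divided by summed distances; these choices and all names in the code are assumptions for illustration only.

```python
from collections import defaultdict, deque

def coauthor_graph(papers):
    """Build an undirected co-authorship graph from {paper_id: [authors]}."""
    graph = defaultdict(set)
    for authors in papers.values():
        for a in authors:
            graph[a].update(b for b in authors if b != a)
    return graph

def closeness(graph, source):
    """Number of reachable authors divided by the sum of their distances."""
    dist = {source: 0}
    queue = deque([source])
    while queue:  # breadth-first search over co-authorship links
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0

def rank_by_author_centrality(papers):
    graph = coauthor_graph(papers)
    score = {author: closeness(graph, author) for author in graph}
    # A paper inherits the score of its most central author.
    return sorted(papers, key=lambda p: -max(score[a] for a in papers[p]))

# Hypothetical toy collection: authors A to E form a chain of collaborations.
papers = {
    "p1": ["A", "B"],
    "p2": ["B", "C"],
    "p3": ["C", "D"],
    "p4": ["D", "E"],
    "p5": ["A"],
}
ranking = rank_by_author_centrality(papers)
```

Papers involving the author in the middle of the chain are sorted on top; the single-author paper by a peripheral author comes last.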
![Page 25: Non-textual ranking in Digital Libraries](https://reader034.fdocuments.net/reader034/viewer/2022052621/5584467cd8b42a6a6d8b45eb/html5/thumbnails/25.jpg)
25
Results
![Page 26: Non-textual ranking in Digital Libraries](https://reader034.fdocuments.net/reader034/viewer/2022052621/5584467cd8b42a6a6d8b45eb/html5/thumbnails/26.jpg)
26
Results: Bradfordizing
Bradford distributions appear in all subject domains and also
for queries in databases. It follows that Bradfordizing can be
used for re-sorting results, generally for topic specific queries
in bibliographic databases.
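Example: articles and journals in the three zones (core, z2, z3) for 5 topics

| topic | core | core-j | z2 | z2-j | z3 | z3-j |
|---|---|---|---|---|---|---|
| 1 | 45 | 3 | 46 | 11 | 52 | 42 |
| 2 | 85 | 6 | 88 | 20 | 87 | 63 |
| 3 | 72 | 9 | 66 | 19 | 72 | 55 |
| 4 | 99 | 6 | 97 | 15 | 92 | 61 |
| 5 | 66 | 2 | 71 | 14 | 65 | 50 |
| average | 73.4 | 5.2 | 73.6 | 15.8 | 74 | 54.2 |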
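As a worked check, the growth factor between zones can be computed from the journal counts in the example (on average 5.2, 15.8 and 54.2 journals per zone). The helper name `multipliers` is an assumption; the numbers are taken from the example.

```python
def multipliers(zone_counts):
    """Ratio of journal counts between consecutive zones."""
    return [b / a for a, b in zip(zone_counts, zone_counts[1:])]

# Journals per zone (core, z2, z3) for the five example topics.
journals = {
    1: (3, 11, 42),
    2: (6, 20, 63),
    3: (9, 19, 55),
    4: (6, 15, 61),
    5: (2, 14, 50),
}
per_topic = {topic: multipliers(counts) for topic, counts in journals.items()}

# Same ratios for the averaged journal counts.
average = multipliers((5.2, 15.8, 54.2))
```

For the averages the factors are about 3.0 and 3.4, close to the factor of 3 in the idealized 3 : 9 : 27 example, while individual topics vary considerably.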
![Page 27: Non-textual ranking in Digital Libraries](https://reader034.fdocuments.net/reader034/viewer/2022052621/5584467cd8b42a6a6d8b45eb/html5/thumbnails/27.jpg)
27
CLEF topics 2006
[Figure: log-log plot (axes: journals and articles, 1 to 1000) of the zones core, z2 and z3 for CLEF topics 151 to 175, SOLIS database (German literature)]
![Page 28: Non-textual ranking in Digital Libraries](https://reader034.fdocuments.net/reader034/viewer/2022052621/5584467cd8b42a6a6d8b45eb/html5/thumbnails/28.jpg)
28
Results: Bradfordizing
Results from qualitative interviews with information
professionals
1. Spontaneous naming of core journals can be difficult
- no core journals named for 50% of the topics
- high plausibility of the bradfordized journals
2. The majority attests a positive relevance effect for core journals
Highest added value can be expected for novice researchers and
students in a scientific field
Perhaps zone 3 (periphery, long tail) is most valuable for
senior researchers in a field
![Page 29: Non-textual ranking in Digital Libraries](https://reader034.fdocuments.net/reader034/viewer/2022052621/5584467cd8b42a6a6d8b45eb/html5/thumbnails/29.jpg)
29
Results: Bradfordizing
CLEF, articles: improvement in precision (%)

| Year | core vs. Z3 | core vs. Z2 | Z2 vs. Z3 | core vs. baseline |
|---|---|---|---|---|
| 2003 | 86.56 (*) | 34.57 (*) | 38.63 (*) | 32.65 (*) |
| 2004 | 69.23 (*) | 22.45 | 38.20 | 26.25 (*) |
| 2005 | 78.03 (*) | 29.05 (*) | 37.95 (*) | 29.52 (*) |
| 2006 | 17.63 | 7.66 | 9.27 | 8.46 |
| 2007 | 28.18 (*) | 8.31 | 18.35 | 11.77 |
| average | 55.93 (*) | 20.41 (*) | 28.48 (*) | 21.73 (*) |

KoMoHe, articles: improvement in precision (%)

| Test | core vs. Z3 | core vs. Z2 | Z2 vs. Z3 | core vs. baseline |
|---|---|---|---|---|
| Test1 | 18.82 | 11.75 | 6.32 | 9.84 |
| Test2 | 11.58 | 6.16 | 5.11 | 6.12 |
| Test3 | 19.32 (*) | 8.67 (*) | 9.80 (*) | 9.00 (*) |
| average | 16.57 (*) | 8.86 | 7.08 (*) | 8.32 (*) |

* Significant based on the Wilcoxon signed-rank test and the paired t-test

Comparing average precision between the zones shows:
• Core more relevant than Zone 2 and Zone 3
• Zone 2 more relevant than Zone 3
• Mostly significant improvements (t-test, Wilcoxon)
• Smaller improvements for KoMoHe
• Continuous shift of relevance
Relevance-related distributions
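The improvement figures can be read as relative differences in average precision. The slides do not state the formula, so the sketch below is an assumption; it is, however, consistent with the reported CLEF 2003 and 2004 rows, where the core-vs-Z2 and Z2-vs-Z3 improvements compound to the core-vs-Z3 value. The function names and the precision values 0.30 and 0.20 are hypothetical.

```python
def relative_improvement(p_better, p_worse):
    """Relative improvement of one average precision over another, in percent."""
    return 100.0 * (p_better - p_worse) / p_worse

def compound(first_pct, second_pct):
    """Combine two successive relative improvements (in percent)."""
    return ((1 + first_pct / 100) * (1 + second_pct / 100) - 1) * 100

# Hypothetical precision values, only to show the formula.
example = relative_improvement(0.30, 0.20)

# Consistency check against reported rows: core vs. Z2 and Z2 vs. Z3
# should compound to roughly the reported core vs. Z3 value.
clef_2003 = compound(34.57, 38.63)  # reported core vs. Z3: 86.56
clef_2004 = compound(22.45, 38.20)  # reported core vs. Z3: 69.23
```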
![Page 30: Non-textual ranking in Digital Libraries](https://reader034.fdocuments.net/reader034/viewer/2022052621/5584467cd8b42a6a6d8b45eb/html5/thumbnails/30.jpg)
30
Results: Author centrality
![Page 31: Non-textual ranking in Digital Libraries](https://reader034.fdocuments.net/reader034/viewer/2022052621/5584467cd8b42a6a6d8b45eb/html5/thumbnails/31.jpg)
31
Heuristic evaluation of the ex-post ranking model
Qualitative evaluations: user-evaluated applications (Jugendinstitut 1997, ASI-Tagung 2003)
Retrieval test:

| Query | Result set sorted by PY | Result set sorted by IDF | Result set sorted by ACL | Informational added value (ACL) |
|---|---|---|---|---|
| Youth – violence | 0.25 | 0.60 | 0.55 | 92 |
| Right-wing extremism – East Germany | 0.35 | 0.45 | 0.60 | 122 |
| Women – personnel policy | 0.35 | 0.60 | 0.65 | 100 |
| Resistance – Third Reich | 0.40 | 0.65 | 0.95 | 138 |
| Forced labour – World War II | 0.55 | 0.65 | 0.70 | 92 |
| Elites – FRG | 0.40 | 0.70 | 0.85 | 107 |
| Poverty – city | 0.30 | 0.35 | 0.55 | 157 |
| Labour movement – 19th/20th century | 0.55 | 0.55 | 0.90 | 164 |
| Value change – youth | 0.40 | 0.50 | 0.30 | 50 |
| Terrorism – democracy | 0.20 | 0.35 | 0.60 | 129 |
| Average | 0.38 | 0.54 | 0.67 | 115 |

PY = publication year, IDF = inverse document frequency, ACL = author closeness
Results (hypotheses):
Rankings by author centrality:
• higher precision than traditional rankings [?]
• high informational added value (other documents) [?]
![Page 32: Non-textual ranking in Digital Libraries](https://reader034.fdocuments.net/reader034/viewer/2022052621/5584467cd8b42a6a6d8b45eb/html5/thumbnails/32.jpg)
32
Evaluation of Author Centrality on CLEF Data
• moderate positive relationship between rate of networking and
precision
• precision of TF-IDF rankings (0.60) significantly higher than that of author
centrality based rankings (0.31), BUT:
• very little overlap between the documents at the top of the ranking lists: 90% of
relevant hits provided by author centrality did not appear at the top of
TF-IDF rankings
→ added precision of 28%
• author centrality seems to favor relevant documents OTHER
than those favored by traditional rankings
• value-adding effect:
a different view of the information space
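The overlap argument can be illustrated with a small sketch that lists the relevant hits in the top k of one ranking that are missing from the top k of another. It does not reproduce how the 90% or 28% figures were computed; the function name `added_relevant`, the cutoff `k` and the toy rankings are assumptions.

```python
def added_relevant(ranking_a, ranking_b, relevant, k=10):
    """Relevant hits in the top k of ranking_a that are not in the top k of ranking_b.

    Returns the list of such hits and their share among the relevant
    top-k hits of ranking_a.
    """
    top_a = [d for d in ranking_a[:k] if d in relevant]
    top_b = set(ranking_b[:k])
    new = [d for d in top_a if d not in top_b]
    share = len(new) / len(top_a) if top_a else 0.0
    return new, share

# Hypothetical toy rankings and relevance judgements.
by_centrality = ["d7", "d2", "d9", "d4"]
by_tfidf = ["d2", "d1", "d3", "d5"]
relevant = {"d1", "d2", "d7", "d9"}

new_hits, share = added_relevant(by_centrality, by_tfidf, relevant, k=4)
```

A high share means that the second ranking surfaces relevant documents the first one does not show at the top, even if its own precision is lower.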
![Page 33: Non-textual ranking in Digital Libraries](https://reader034.fdocuments.net/reader034/viewer/2022052621/5584467cd8b42a6a6d8b45eb/html5/thumbnails/33.jpg)
33
Central actors in an author's environment (expert search)
Ego = Ulrich Teichler
Keyword = higher education or studies
Ranking of co-actors by
centrality
![Page 34: Non-textual ranking in Digital Libraries](https://reader034.fdocuments.net/reader034/viewer/2022052621/5584467cd8b42a6a6d8b45eb/html5/thumbnails/34.jpg)
34
Central authors in a document collection (example: right-wing extremism)
Keyword = right-wing extremism or antisemitism or
racism or xenophobia or
ethnocentrism or fascism or neofascism
from 1996 onwards
2833
SOLIS/FORIS
records
1851 networked
actors
(65 Giant)
![Page 35: Non-textual ranking in Digital Libraries](https://reader034.fdocuments.net/reader034/viewer/2022052621/5584467cd8b42a6a6d8b45eb/html5/thumbnails/35.jpg)
35
Conclusion
• Methods are non-textual
• Scientometric/bibliometric approach
• Network analysis
• The methods can successfully be applied
(holds true in different domains, databases and
document types)
• Added value can be demonstrated
(significant precision improvement)
• Users are intuitively and empirically satisfied
• High plausibility of the methods
![Page 36: Non-textual ranking in Digital Libraries](https://reader034.fdocuments.net/reader034/viewer/2022052621/5584467cd8b42a6a6d8b45eb/html5/thumbnails/36.jpg)
36
Demo
![Page 37: Non-textual ranking in Digital Libraries](https://reader034.fdocuments.net/reader034/viewer/2022052621/5584467cd8b42a6a6d8b45eb/html5/thumbnails/37.jpg)
37
Demo
![Page 38: Non-textual ranking in Digital Libraries](https://reader034.fdocuments.net/reader034/viewer/2022052621/5584467cd8b42a6a6d8b45eb/html5/thumbnails/38.jpg)
38
Demo
Link to the prototype
http://multiweb.gesis.org/GrailsSTR/testSTR/index
![Page 39: Non-textual ranking in Digital Libraries](https://reader034.fdocuments.net/reader034/viewer/2022052621/5584467cd8b42a6a6d8b45eb/html5/thumbnails/39.jpg)
39
Dr. Philipp Mayr
F14, Raum 39b
(06151) 16-9394
mailto:[email protected]
or
mailto:[email protected]
http://www.gesis.org/index.php?id=2479