Relevance redefined

Deviance Nerd Feeler Declared Refine Even Candid Reefer Eleven Fleeced Reindeer Van Freelance Never Died Deliverance Need Ref Dereference And Live End Free Deliverance Deliverance Nerd Fee Deface Vender Reline Refaced Veneered Nil Refaced Relined Even Vile Acne Feeder Nerd Declare Define Nerve Cleared Define Never Fancied Reverend Lee Dance Fender Relieve Canned Refereed Vile Canned Refereed Evil Canned Deliverer Fee Canned Relieved Free Freelance Need Drive Irrelevance Feed End File And Decree Nerve Card Need Relief Even Revealed Nice Fender

Relevance redefined

Lukas Koster

Library of the University of Amsterdam

@lukask

IGeLU 2014 - Oxford

Main discovery tool feedback issues

Content

Not enough

Too many

Wrong types

No ‘full text’

Relevance

Not #1

Too many

Known item!?

WTF?

Main issues reported as feedback/survey results on discovery tools are about content and relevance.

Funny thing is that the use of facets for refining is somehow not very popular?

Usual responses to feedback issues

Change the front end!

Tabs - Facets - Filters - - Font

Positions

More/less content!

More of the same same same

Improve relevance ranking algorithms!

Very shhhophisticated - Very shhhecret

Usual responses to feedback issues: front end, content, relevance (ranking).

Front end UI changes: these are just about cosmetics and perception: more tabs with specific data sources, element positioning, etc.

More or less content can have various effects: either more or less relevant results. But usually it is still the same traditional content types.

Algorithm: libraries and customers influence only a small part of it. Most of it is in the software, which is confidential and not transparent, for competitive reasons.

Before

Example: before. University of Amsterdam Primo: originally the Google experience: one box (apart from advanced search), all sources, one blended results list.

After

Example: after. University of Amsterdam Primo now: three tabs: All, Local catalogue, Primo Central

Same old

Same old UX tricks

Same old Content types

Same old View on relevance

Basically these changes are not actual changes at all: it’s all cosmetic.

UX/UI changes: perception, not actual improvement of relevance.

Content: usually still the same resource types and search indexes.

Relevance: still seen from a system and result set perspective.

iNTERLiNKED

[Diagram: the words SEARCH, RANK, CONTENT, CONTEXT and RELEVANCE, interlinked]

But every aspect is dependent on all others: search, rank, content, context, relevance. No search without context. Search is executed in a specific limited content index. Ranking is performed on the results within this limited index. Relevance is completely defined by a user's context.

Relevance

Context

+

Content

Objectively, we can say that relevance is determined by context and content.

Relevance=Relative:Subjective:Contextual

Person Context

Role

Task

Goal

Need

Workflow

System Algorithm

Content

Index

Query

Collection

Configuration

Clash between personal context and system/collection. Personal context, defined by a person's specific needs in a specific role for a specific task/goal at a specific time, culminates in a specific query, which consists of a limited number of words in character string format. The system doesn’t know the personal context; it only has the indexed content, made up of specific collections indexed in a certain way with specific system configurations, and the string-based query to run through that structure.

Relevance

Recall: the fraction of relevant instances that are retrieved

Recall = retrieved relevant instances / total relevant instances

Precision: the fraction of retrieved instances that are relevant

Precision = retrieved relevant instances / total retrieved instances

Basic concepts used for determining relevance of result sets: Recall and Precision. These cannot be used to determine the actual relevance of specific results! That depends on context and can only be determined by the user.

Total: 1000 items

Relevant: 300

Retrieved: 180

Retrieved relevant: 120

Retrieved irrelevant: 60

Irrelevant: 700

Recall: 120/300 = 0.4

Precision: 120/180 ≈ 0.67

Relevance

Recall and Precision example:

Searched index: 1000 items

Relevant for query: 300

Retrieved items: 180

Retrieved relevant items: 120

Retrieved irrelevant items: 60

Recall = 120/300 = ⅖ (0.4)

Precision = 120/180 = ⅔ (≈ 0.67)
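The arithmetic of this example can be checked with a few lines of Python. This is only a minimal sketch: the item ids are made up for illustration, and the sets simply reproduce the counts from the slide (300 relevant items, 180 retrieved, 120 of those relevant).

```python
def recall(retrieved: set, relevant: set) -> float:
    """Fraction of the relevant items that were actually retrieved."""
    return len(retrieved & relevant) / len(relevant)


def precision(retrieved: set, relevant: set) -> float:
    """Fraction of the retrieved items that are actually relevant."""
    return len(retrieved & relevant) / len(retrieved)


# Hypothetical item ids 0..999 stand in for the 1000 indexed items.
relevant = set(range(300))                            # 300 relevant items
retrieved = set(range(120)) | set(range(300, 360))    # 120 relevant + 60 irrelevant

print(recall(retrieved, relevant))                    # 0.4
print(round(precision(retrieved, relevant), 2))       # 0.67
```

Note that both functions need to know the full set of relevant items, which in a real discovery tool nobody knows in advance.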

Relevance ranking is NOT Relevance

Relevance = Finding appropriate items

Recall, Precision

Relevance ranking = Determining most relevant within retrieved set

Term Frequency, Inverse Document Frequency, Proximity, Value Score

Retrieved set may not contain any relevant items at all, but can still be ordered according to relevance.

Relevance is NOT relevance ranking!

Relevance is finding/retrieving appropriate items, using the words in the query and, if available, context information. Recall and Precision are used to measure the degree of relevance of a result set.

Relevance ranking is determining the most relevant items in a result set based on the query terms and the content of retrieved items, using a number of standard measures: TF, IDF, Proximity. Value score: a specific Primo algorithm that looks at number of words, type of words, etc.

Also possible: local boosting. This method does not take into account any content relevance, but just uses brute force to promote items from specific (local) data sources.
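To make the difference between ranking and relevance concrete, here is a minimal sketch of textbook TF-IDF ranking with an optional brute-force boost per item. It is not Primo's algorithm, which is not public; the function name `tf_idf_rank`, the `boosts` parameter, the smoothing (`+ 1.0`) and the sample records are all assumptions for illustration. Proximity and the Primo value score are left out.

```python
import math
from collections import Counter


def tf_idf_rank(query, docs, boosts=None):
    """Rank docs (id -> text) for a query; returns (id, score) pairs, best first."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if tf[term]:
                idf = math.log(n_docs / df[term]) + 1.0   # smoothed IDF (assumption)
                score += (tf[term] / len(tokens)) * idf
        # "Local boosting": multiply by a factor that ignores content entirely.
        score *= (boosts or {}).get(doc_id, 1.0)
        scores[doc_id] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical records, for illustration only.
docs = {
    "a": "jane eyre a novel by currer bell",
    "b": "eyre peninsula travel guide",
    "c": "organic chemistry handbook",
}

print([doc_id for doc_id, _ in tf_idf_rank("jane eyre", docs)])
print([doc_id for doc_id, _ in tf_idf_rank("jane eyre", docs, boosts={"b": 2.0})])
```

Without a boost the novel comes first; with the boost the travel guide is promoted above it, although nothing about its content has changed. In both cases every item gets a position in the list, whether or not it is relevant to the person searching.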

Primo Central search and ranking enhancement - July 8, 2014

As part of our continuing efforts to enhance search and ranking performance in Primo, we changed the way Primo finds matches for keyword searches within indexed full text. As part of this approach Primo lowers the ranking of, or excludes, items of low relevance from the result set that were previously included. You may find as part of this change that the number of results for some searches is reduced, although result lists have become more meaningful.

Official Ex Libris announcement, July 8, 2014. Combined with improvements to known item search/incomplete query terms in Primo 4.7. Something changed!? This announcement implies a mixing up of getting relevant results and relevance ranking. Some results are actually excluded.

Only for full text resources. Only in Primo Central. Not clear if this is independent of software version/SP?

It is unclear to libraries and customers what is modified in relevance/search/ranking, and how: an example of the non-transparent nature of discovery tools' relevance algorithms. There were a number of complaints on the Primo mailing list about this.

The System Perspective

Objectivizing a subjective experience

Let's look at the traditional system perspective on relevance. It's trying to make a subjective process into an objective one.

Recall issues

Discovery tool index limits recall scope in advance

Relevance is calculated on:

available

selected

indexed

(scholarly) content

By vendors

By libraries

Everything

System

First let’s have a look at some recall issues in discovery tools.

Recall is limited in advance, because only a limited set of items of certain content types is available for searching. A lot of relevant content is not considered at all.

Decided by vendors, publishers and libraries:

In Primo Central: by Ex Libris agreements with publishers, metadata vendors.

In Primo Central: libraries decide what is enabled, what is subscribed, what is free for search.

In Primo Local: libraries decide which (parts of) collections are indexed.

Recall issues

NOT indexed:

Not accessible

Not subscribed

Not enabled

Unusual resource types

Connections

Not digital

Not indexed, thus not searched:

Content not accessible to index vendors, libraries.

Unusual resource types: theatre performances, television interviews, research project information, historical events.

Not physical, tangible content.

Connections: influenced by, collaborating with, temporal, genre.

May not fit in bibliographical/proprietary format (MARC, DC, PNX)

Recall issues

Indexed, but NOT found:

By author name (string based)

By subject (string based, languages)

Related but unlinked items (chapter in book)

Content that IS indexed, but can't be found:

Author names: only the strings that are indexed (textual variations of the name, pseudonyms, etc.) can be matched. Only items with the explicit author search term are found.

Subject: strings, individually indexed 'as is' from data sources, in multiple languages. Only items with the explicit specific subject search term are found.

Related: a chapter may be indexed with a textual reference to the book it is a part of. The book (relevant for delivery) is not retrieved, and neither is a link to that item.

Author

Author name example. Charlotte Brontë used the (male) pseudonym/pen name Currer Bell for Jane Eyre. (Left screenshot: Wikipedia.) In this case there are no links between both names, so the very relevant Charlotte Brontë stuff is not retrieved. (Right screenshot: University of Amsterdam Primo.)
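The recall problem with name strings can be sketched as follows. The records, the `person:1` style identifiers and the small authority table are hypothetical stand-ins for an identifier based authority file; nothing here reflects how Primo actually stores authors.

```python
# Hypothetical index records: an author name string plus an authority identifier.
RECORDS = [
    {"title": "Jane Eyre (1847 edition)", "author_name": "Currer Bell", "author_id": "person:1"},
    {"title": "Villette", "author_name": "Charlotte Brontë", "author_id": "person:1"},
    {"title": "Shirley", "author_name": "Brontë, Charlotte", "author_id": "person:1"},
    {"title": "The Bell Jar", "author_name": "Sylvia Plath", "author_id": "person:2"},
]

# Hypothetical authority file: one identifier per person, with all known name forms.
AUTHORITY = {
    "person:1": {"Charlotte Brontë", "Brontë, Charlotte", "Currer Bell"},
    "person:2": {"Sylvia Plath"},
}


def search_by_string(name):
    """String based search: only exact matches on the indexed name string."""
    return [r["title"] for r in RECORDS if r["author_name"] == name]


def search_by_identity(name):
    """Identifier based search: resolve the name to a person first, then match ids."""
    person_ids = {pid for pid, names in AUTHORITY.items() if name in names}
    return [r["title"] for r in RECORDS if r["author_id"] in person_ids]


print(search_by_string("Currer Bell"))    # only the item indexed under the pen name
print(search_by_identity("Currer Bell"))  # all items by the same person
```

The string search finds one item; the identifier based search finds all three items by the same person, whichever name form was indexed.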

Subject

Subject example. Topic/discipline “philosophy” (English) does not find stuff with Dutch “filosofie” (which also appears to be Czech).

Chapter

Connections example.

Chapter written by UvA researcher, in local institutional repository, harvested in local Primo.

Book in Aleph catalogue, harvested in local Primo.

The book is not retrieved as an item to present delivery options directly.

Precision issues

Discovery tool limits precision by ambiguous indexing

Next: some precision issues. Problems caused by using strings instead of identifiers/concepts.

Precision issues

Indexed and/but erroneously found

By author name (string based)

By subject (string based, languages)

Query too broad

Indexed irrelevant items that are retrieved erroneously:

Author: common names result in items of all authors with that name.

Subject: similar terms with different/ambiguous meanings give noise (VOC).

Broad query (few terms) gives too much noise.

Author

Example of author names.

J. (Johanna, Jan, Joop, etc.) de Vries is a very common Dutch name. Results consist of all items by different authors.

Subject

Example of subjects.

Ambiguous/multilingual topic VOC: physics (Volatile Organic Compounds), music (Vocals), history (Verenigde Oostindische Compagnie, Dutch East India Company).
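The precision problem with an ambiguous subject string can be sketched in the same way. The records, the concept identifiers and the mapping of concepts to disciplines are invented for illustration; the disciplines follow the slide. The sketch assumes the system knows the searcher's discipline, which is exactly the piece of context that is usually missing.

```python
# Hypothetical records: the same subject string, but different underlying concepts.
RECORDS = [
    {"title": "Indoor air quality", "subject": "VOC", "concept": "concept:volatile-organic-compounds"},
    {"title": "Choral arrangements", "subject": "VOC", "concept": "concept:vocals"},
    {"title": "Trade with Batavia", "subject": "VOC", "concept": "concept:verenigde-oostindische-compagnie"},
]

# Hypothetical mapping from concept identifier to discipline.
CONCEPT_DISCIPLINE = {
    "concept:volatile-organic-compounds": "physics",
    "concept:vocals": "music",
    "concept:verenigde-oostindische-compagnie": "history",
}


def search_by_string(term):
    """String based search: every record carrying the string is retrieved."""
    return [r["title"] for r in RECORDS if r["subject"] == term]


def search_by_concept(term, discipline):
    """Concept based search: keep only records whose concept fits the user's discipline."""
    return [
        r["title"]
        for r in RECORDS
        if r["subject"] == term and CONCEPT_DISCIPLINE[r["concept"]] == discipline
    ]


print(search_by_string("VOC"))               # three unrelated items
print(search_by_concept("VOC", "history"))   # only the East India Company item
```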

Too broad

Example of too broad search terms.

Way too many results with a very common search term.

Recall and precision issues

Content of index

Quality of search index units

Lack of connections (isolated string items)

Algorithms for retrieving and ranking not transparent

Summary of Recall and Precision issues in discovery tools and relevance:

Content of index: resource types, connections, data.

Search index units (individual search index fields): strings, isolated items.

Cause: system perspective with legacy data. There is no way to determine if all relevant items have been retrieved.

Research cycle

http://commons.wikimedia.org/wiki/File:Research_cycle.png

Intermezzo: a closer look at context: workflow and use case. Example: research cycle (Cameron Neylon). There are many different versions of this cycle. What is important: the nature of someone’s information need differs depending on the stage: broad or focused, in several dimensions.

Context example - theatre research

Play

Author

Text

Productions

Use case: theatre play researcher. A theatre play is written by an author and is represented as text, but most importantly it is performed (or not) for an audience.


Play

Author

Text

Productions

Background

Connections

Influences

Period

For the Author there is biographical information. Important things are background, connections with others (artists, funders, relatives, etc.), influences, and the period in which they live and work. Libraries/discovery tools may have some biographical information.


Play

Author

Text

Productions

Background

Connections

Influences

Period

Versions

Translations

Editions

Text: there may be several versions, translations, editions etc. FRBR can be used to model this. This belongs to the traditional library domain.


Play

Author

Text

Productions

Background

Connections

Influences

Period

Versions

Translations

Editions

Performances

Reception

Theatres

Producers

Actors

Visitor stats

Directors

Props

Posters

Recordings

Photos

Costumes

Productions and performances: a whole different world. People are involved in a number of different roles. Different productions, various actual performances, physical props, costumes, audio and video recordings, etc. Reception, both of the play as such and of the various productions, is always related to the period.

What’s in a discovery tool? Could be anything, but within individual texts/items, not as separately retrievable items, and certainly not as connections/cross-references/related information.

Authors: in authority files or individual biographical databases.

Text/Editions: treated separately as individual items.

Productions: maybe, depending on types of indexed (local) databases.

Reception: individual reviews possibly.

Relevance - New perspective

Instead of

SYSTEM

Collections, Indexed content, Query

Context - Workflow - Goals - Environment of

USER

It is time to switch perspectives, from collection based System algorithms to context based User needs.

Is this technically possible, feasible?

Extend Content?

Know Context?

Important questions:

Is it even possible, feasible to extend content without limits, to interpret personal context?

Can commercial vendors and publishers benefit?

Relevance Redefined and Primo

What is already possible?

Content

Additional content types

Additional indexed fields

Third nodes (not merged)

External links (not searchable, link out only)

Context

Discipline (for ranking, not searching)

Algorithm improvements (for current items)

Let's look at this from the Primo perspective.

What is already possible in current version of Primo?

Content

Other resource types can be added, both in Primo Central, by Ex Libris, and in Primo local, by individual libraries.

Indexed fields: extend the PNX search section with extra entries (for authors or subjects, for instance; this needs normalization rules) and locally defined fields.

Third nodes: external data sources via API, like distributed federated search (EBSCO, WorldCat), but results are unclear and can't be merged very well.

Context

Users can enter their Discipline and Degree, but this is only used for ranking, not for retrieving.


What is missing?

Content

Internal links

Integrated Primo Central/Local

External links

External indexes

Normalised/multilingual authors/subjects

Context

Context

What is still missing/not possible in Primo?

Internal links: chapter-book(s), article-journal(s), article-datasets, qualitative relationships etc.

Primo Central-Primo Local: two separate indexes to be searched, no deduplication etc.

External links, for instance to related content in external databases not indexed in Primo: theatre performances, research information, etc.

External indexes: non-Primo data sources searchable (maybe with Third Nodes, but not merged).

Normalised/multilingual indexes: there is no use of identifiers instead of string indexes.


Options

Content

Universal record format: RDF!

Identifier based authorities: VIAF! MACS? (DBpedia?)

Global metadata index!

Transparent algorithms!

Context

What would Google do?

What options can we distinguish for future Primo development?

Record format: not proprietary, but universal: RDF. RDF also requires identifiers + relations (triples).

Existing authorities: VIAF, LCSH, MACS etc. (RDF/Linked data).

Global metadata index: not silos for separate discovery layers, but an open, global, unified format. Could be decentralized, distributed; managed by multiple parties.

Transparent algorithms: to make it clear how relevance is computed.
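What "identifiers + relations (triples)" could mean for the chapter-book problem mentioned earlier can be sketched with plain Python tuples. The `example.org` URIs, the title and the name are hypothetical; `isPartOf`, `creator` and `title` are Dublin Core terms. A real implementation would use an RDF library and a triple store rather than a Python set.

```python
DCT = "http://purl.org/dc/terms/"   # Dublin Core terms namespace
EX = "http://example.org/"          # hypothetical namespace, for illustration only

# Each triple is (subject, predicate, object): identifiers linked by named relations.
TRIPLES = {
    (EX + "chapter/1", DCT + "isPartOf", EX + "book/1"),
    (EX + "chapter/1", DCT + "creator", EX + "person/1"),
    (EX + "chapter/1", DCT + "title", "A chapter in the repository"),
    (EX + "book/1", DCT + "title", "A book in the catalogue"),
}


def objects(subject, predicate):
    """Return all objects linked to a subject by a given predicate."""
    return sorted(o for s, p, o in TRIPLES if s == subject and p == predicate)


def containing_titles(item):
    """Follow isPartOf links from an item to the titles of the items that contain it."""
    titles = []
    for parent in objects(item, DCT + "isPartOf"):
        titles.extend(objects(parent, DCT + "title"))
    return titles


print(containing_titles(EX + "chapter/1"))   # ['A book in the catalogue']
```

Because the chapter points at the book by identifier instead of by a textual reference, the book can be retrieved together with the chapter and offered for delivery.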

New features announced by Ex Libris on earlier occasions:

URIs in Primo PNX Links Section

Knowledge Graph type additional info (Wikipedia, …)

Announced during conference: Primo/third generation discovery, with related information and serendipity, using identifiers, external sources, linked data.

Context: Google: next slide.

A word about Google vs Primo

Google knows

IP addresses

Account

Searches

Clicks

Location

Primo makes an educated guess

Discipline?

Query type

The difference between Google and library discovery.

Google knows a lot about the user, and can target search results at the user's history, location, email etc.

Library discovery tools do not have that knowledge. They have to guess.

VIAF

Example of identifier based person authority files.

VIAF consolidates names from a large number of authoritative sources. It also has related names.

MACS

Multilingual ACcess to Subjects

Since 1997

Manual linking between strings

New future?

The European Library...

http://www.nb.admin.ch/nb_professionnel/projektarbeit/00729/00733/index.html?lang=en

Example of multilingual subjects.

MACS, since 1997, manually maintained, input from four national libraries. Used in The European Library.

Discussed at the IFLA 2014 Linked Data for Libraries Satellite Meeting in Paris. There are plans for extending and adjusting MACS for future, automated, linked data concepts. This would be a very important development.

The European Library uses MACS. Multilingual AND disambiguated

Wikipedia/DBpedia

SLUB Dresden local Primo add-on SLUB-Semantics, using multilingual and disambiguated topics from Wikipedia/DBpedia

But, wait a minute...

RDF?

Identifiers?

Global index?

Transparency?

What are we talking about here? What would be the consequences of applying these suggestions?

Ŧ ᶙ©Ѥ

**** the system

Open independent transparent web based connected data infrastructure

Linked Open Data

Should libraries, vendors invest in data infrastructure instead of systems?

Discovery layers should be separated (decoupled) from proprietary systems, closed data stores and indexes. The main focus should be a global data infrastructure, which can be accomplished with RDF/LOD. Tools and services are then built on top of this global infrastructure.

This is exactly what Linked Open Data is all about.

Main issue here: would this be commercially beneficial for current discovery layer vendors?

And should libraries focus on data infrastructure instead of systems?

No, if you look closely, it doesn't say what your mind thinks ;-)

NISO Open Discovery Initiative

“Transparency in discovery” 2014

(http://www.niso.org/workrooms/odi/)

“... facilitate increased transparency in the content coverage of index-based discovery services … Full transparency will enable libraries to objectively evaluate discovery services …”

NISO Open Discovery Initiative report 2014 objectives. Transparency in discovery: sounds promising.


In scope:

Quantity of content

Form of content

Do not favor or disfavor items from any given content source or material type

Specific metadata fields indexed

Whether controlled vocabularies or ontologies are included

NISO ODI topics declared “in scope”. Most of these topics confirm suggestions made in this presentation.


Out of scope:

“Relevancy ranking” (may fall within the realm of proprietary technologies used competitively to differentiate commercial offerings)

APIs exposed by discovery service (initially, reluctantly)

However: NISO ODI topics declared “out of scope”

Relevance ranking

APIs (system independent access to data, more or less)

These are exactly the things that are most important for transparency in discovery.


Nothing about:

Content linking/identifiers

Normalised/multilingual authority files

Relevancy ranking

System independent data infrastructure

NISO ODI ignores all issues that improve relevance in discovery.


Stakeholders/Working group members:

Content providers

Discovery service providers

Libraries

Who’s missing?

The most important stakeholders are missing from the NISO ODI committees: the end users.

Relevance redefined

Context

User needs

User input

User feedback

Content

Open connected data infrastructure

Systems (Primo) Services

Algorithms Transparency

SOA - Service Oriented Architecture + Context

Conclusion/recommendation:

Instead of closed systems with limited content, a transition to a new three-component environment is required:

- content (open global data infrastructure)
- context (user needs, input, feedback)
- services, systems that access the content and context layers in transparent ways

SOA! Service Oriented Architecture + Context

How this can be achieved is still to be investigated. However, SOA is already widely implemented elsewhere. Linked Open Data is technically possible; we only need the will to cooperate. Context is the hardest part to realize. But it is not impossible.
