Intelligent Tools for a Digital Library · Deep learning Machine reasoning Neural architecture...

Intelligent Tools for a Digital Library Debarshi Kumar Sanyal National Digital Library of India IIT Kharagpur Smart and Green Information and Communication Technology (SGICT), Short Term Course (STC), Jadavpur University, 17 May 2018


Page 1:

Intelligent Tools for a Digital Library

Debarshi Kumar Sanyal

National Digital Library of India, IIT Kharagpur

Smart and Green Information and Communication Technology (SGICT), Short Term Course (STC), Jadavpur University, 17 May 2018

Page 2:

Digital Library

• Repository with discovery service of digital resources

– Usually books, articles, digitized manuscripts, historical records, newspapers, photographs, paintings, maps, music, films, question papers, syllabi, presentations, audio-visual lectures, software, datasets

– Can contain only metadata or metadata + content

• Maintained by publishers, educational institutes, governments, non-government agencies, individuals, etc.

• Some are reservoirs of Big Scholarly Data

Page 3:
Page 4:
Page 5:
Page 6:

24 x 7-ENABLED IMMERSIVE E-LEARNING FOR ALL LEARNERS AT ALL LEVELS IN ALL AREAS

https://ndl.iitkgp.ac.in/

Page 7:

National Digital Library of India (NDLI)

• Educational digital library

• Contains mainly metadata

• Contains > 17M resources

• Contains metadata of books, research papers, theses, audio-visual lectures, software, datasets, syllabi, question papers and model answers, etc.

• Content in more than 100 languages, including English, Bengali, Hindi, Tamil, Telugu, Marathi, etc.

• Keyword-based search and advanced faceted search

Page 8:
Page 9:

21 Mar 2016

Page 10:

Some Research Areas

• Metadata engineering

– Automatic metadata extraction

– Author name disambiguation

• Search and retrieval

– Surrogates for access-restricted resources

– Semantic search

– Figure search

– Customized viewer

– Recommender systems

NDLI Research

Page 11:

Metadata Engineering

Page 12:

Metadata Acquisition

• NDLI is a huge metadata repository

• Currently, metadata is acquired in a semi-automatic manner.

– Collected from publishers / libraries

– Manual and automatic post-processing done to adapt to NDLI schema

Page 13:

Automatic Metadata Extraction

• Can we extract metadata automatically for NDLI?

• Challenges

– Various resource types

– Multiple languages

– OCR needed for digitized resources

– Variable scan quality

– Unsatisfactory OCR quality for most Indic languages

– Hard to produce semantic metadata like pedagogical objective

Page 14:

Metadata from Scientific Papers

Developed by Sumana Dey, Staff @ NDLI

"figure": [
  {"caption": "Fig. 2: The Framework of Cloud Computing", "page": "2", "path": "img/Figure2-1.png"},
  {"caption": "Fig. 3: Architecture of Mobile Cloud Computing", "page": "3", "path": "img/Figure3-1.png"},
  {"caption": "Fig. 1: Mobile Cloud Computing", "page": "0", "path": "img/Figure1-1.png"}],
"table": [
  {"caption": "TABLE I: Challenges and Solutions of Mobile Cloud Computing", "page": "3", "path": "img/TableI-1.png"}],
"metadata": {
  "dc.title": "A Review on Mobile Cloud Computing: Issues, Challenges and Solutions",
  "dc.title.alternative": [],
  "dc.contributer.author": ["Mandeep Kaur Saggi", "Amandeep Singh Bhatia"],
  "dc.contributer.editor": [],
  "dc.language.iso": ["en"],
  "dc.description.abstract": "Mobile Cloud Computing (MCC) is a combination of mobile computing and cloud computing. It has become one of the Major Research issue in the industry. Although there are so, many research studies in mobile computing and cloud computing, convergence of these two areas grant further academic efforts towards flourishing MCC. …

Page 15:

Metadata Extraction from Books

Page 16:

Author Name Ambiguity

• Same author might write under different names

• Same name might refer to multiple authors (a few surnames are very common in South Asia)

• Other sources of noise are spelling mistakes, abbreviated names, etc.

=?

Page 17:

Author Name Disambiguation (AND)

• Author names in NDLI are currently treated as strings.

– Therefore, hyperlinking author names to respective works not feasible

– Author centric analytics not straightforward

– For research papers, citation tracking not feasible

• Can we disambiguate the author names in NDLI?

• Challenges

– A few surnames very common in South-Asia

– Authors from multiple cultural backgrounds have different name conventions

– Multilingual resources

– Metadata often too sparse to disambiguate.

Page 18:

AND: Formal problem definition (1/2)

• We are given a set of M metadata records C = {c1, c2, …, cM}. Each metadata record ci contains at least author names and work title. It can also contain author affiliations, author email ids, venue of publication, abstract, keywords, references in the article, etc.

– Example of a metadata record

– Manning, Christopher, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. "The Stanford CoreNLP natural language processing toolkit." In Proceedings of 52nd annual meeting of the association for computational linguistics: system demonstrations, pp. 55-60. 2014.

Page 19:

AND: Formal problem definition (2/2)

• Each author name is a reference to a real author. From the metadata records, we have to extract author references R = {r1, r2, …, rN} (N ≥ M).

• Our next goal is to map R into K disjoint clusters A = {a1, a2, …, aK} (N ≥ K) such that each cluster ai contains all and only the references to the same real author. K may not be known a priori.

[Figure: the references in R (r1 to r4: S Chattopadhyay; r5 to r7: P Smith) are mapped to the author clusters a1 to a4 in A.]
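
The two steps of the definition (extract the references R, then partition them into clusters A) can be sketched in Python. This is only an illustration under stated assumptions: the record layout (`authors`, `title`), the helpers `extract_references` and `is_valid_clustering`, and the sample records are all invented for the example and are not NDLI's schema or method.

```python
from typing import Dict, List, Tuple

# Hypothetical metadata records C; only author names and a title are required.
records: List[Dict] = [
    {"title": "Paper A", "authors": ["S Chattopadhyay", "P Smith"]},
    {"title": "Paper B", "authors": ["S Chattopadhyay"]},
    {"title": "Paper C", "authors": ["P Smith"]},
]

def extract_references(records: List[Dict]) -> List[Tuple[int, str]]:
    """Return the reference set R: one (record index, name string) per author mention."""
    return [(i, name) for i, rec in enumerate(records) for name in rec["authors"]]

def is_valid_clustering(references, clusters) -> bool:
    """Check that the clusters are disjoint and together cover every reference in R."""
    seen = [ref for cluster in clusters for ref in cluster]
    return len(seen) == len(set(seen)) and set(seen) == set(references)

R = extract_references(records)  # N references from M records, so N >= M

# One candidate answer A (an assumption, not ground truth): the two
# "S Chattopadhyay" mentions are different people, the two "P Smith"
# mentions are the same person.
A = [[R[0]], [R[2]], [R[1], R[3]]]
```

A disambiguation method has to choose A from metadata evidence alone; the check above only verifies that a proposed A is a partition of R, not that it is correct.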

Page 20:

Author Name Disambiguation: Techniques

• Approaches

– Author grouping

– Author assignment

• Machine learning methods

– Unsupervised

– Supervised

• Evidence explored

– Citation information

– Web information

– Implicit information

Page 21:

Author Name Disambiguation: Example


Page 22:

Author Name Disambiguation on

• Create author blocks

– A block contains authors with the same last name and first initial (LN-FI)

• Within a block, create a similarity profile of a pair of papers using metadata like

– Author name, author email id, author affiliation, co-author names (LN-FI), journal name, year of publication, MeSH terms, etc.

• Train a random forest to

– Output 1 if the similarity profile comes from two papers by the same real-world person

– Output 0 otherwise

• Use above trained classifier on test instances in each block
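
The blocking and similarity-profile steps above can be sketched as follows. This is a simplified illustration, not the feature set of the cited paper: the mention fields, the four features, and the helpers `block_key` and `similarity_profile` are assumptions chosen for the example.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical author mentions, one per (paper, author); field names are illustrative.
mentions = [
    {"id": 1, "name": "Sanjay Chattopadhyay", "email": "sc@example.org",
     "coauthors": {"Smith P"}, "venue": "JCDL", "year": 2015},
    {"id": 2, "name": "S. Chattopadhyay", "email": "sc@example.org",
     "coauthors": {"Smith P", "Roy A"}, "venue": "JCDL", "year": 2016},
    {"id": 3, "name": "Peter Smith", "email": "",
     "coauthors": set(), "venue": "WWW", "year": 2010},
]

def block_key(name: str) -> str:
    """LN-FI key: last name plus first initial, lower-cased."""
    parts = name.replace(".", " ").split()
    return f"{parts[-1].lower()}-{parts[0][0].lower()}"

def similarity_profile(a: dict, b: dict) -> list:
    """Feature vector for a pair of mentions from the same block."""
    return [
        int(bool(a["email"]) and a["email"] == b["email"]),  # same e-mail id
        len(a["coauthors"] & b["coauthors"]),                 # shared co-authors
        int(a["venue"] == b["venue"]),                        # same venue
        abs(a["year"] - b["year"]),                           # gap in years
    ]

# Step 1: group mentions into LN-FI blocks.
blocks = defaultdict(list)
for m in mentions:
    blocks[block_key(m["name"])].append(m)

# Step 2: build one similarity profile per pair inside each block.
profiles = {
    (a["id"], b["id"]): similarity_profile(a, b)
    for block in blocks.values()
    for a, b in combinations(block, 2)
}
```

Profiles labelled 1 (same person) or 0 (different persons) would then serve as training rows for a random forest classifier, for example scikit-learn's `RandomForestClassifier`; that step is omitted here to keep the sketch free of dependencies.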

Treeratpituk, Pucktada, and C. Lee Giles. "Disambiguating authors in academic publications using random forests." Proc. ACM/IEEE-CS JCDL, ACM, 2009.

Page 23:

Search & Retrieval

Page 24:

Surrogates for Access-Restricted Scholarly Articles

• NDLI stores metadata for IEEE/ACM/Springer publications.

• Access to full text requires subscription to container library.

• Sometimes authors store a free version in a preprint server, sometimes an open-access conference paper closely resembles a journal version behind a paywall.

– Call a pair of very similar documents by the same author(s) surrogates

– Retrieve surrogates when original paper is access-restricted
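
A minimal sketch of a surrogate test, assuming that "very similar" is approximated by Jaccard similarity of title words and that one shared author is enough. Both choices, the 0.6 threshold, and the sample documents are hypothetical; the slides do not say how Surrogator measures similarity.

```python
import re

def tokens(text: str) -> set:
    """Lower-cased word tokens of a title."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def is_surrogate(doc: dict, candidate: dict, threshold: float = 0.6) -> bool:
    """Hypothetical test: at least one shared author and sufficiently similar titles."""
    same_author = bool(set(doc["authors"]) & set(candidate["authors"]))
    similar = jaccard(tokens(doc["title"]), tokens(candidate["title"])) >= threshold
    return same_author and similar

# Invented example documents.
paywalled = {"title": "Mobile Cloud Computing: Issues and Challenges",
             "authors": ["a kumar", "b das"]}
preprint = {"title": "Mobile cloud computing: issues, challenges and solutions",
            "authors": ["a kumar"]}
unrelated = {"title": "Figure retrieval from scanned books",
             "authors": ["a kumar"]}

found = is_surrogate(paywalled, preprint)
rejected = is_surrogate(paywalled, unrelated)
```

A real system would likely compare abstracts or full text as well, since titles of a conference paper and its journal version can differ substantially.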

Page 25:

Surrogator: Interface

Page 26:

Surrogator: Architecture

Page 27:

Surrogator

Santosh, T Y S S, D K Sanyal, and P K Bhowmick. “Surrogator: Enriching a Digital Library with Surrogate Resources.” 5th ACM IKDD CoDS and 23rd COMAD (Demo track), Goa, India, January 11 - 13, 2018.

Santosh, T Y S S, D K Sanyal, P K Bhowmick, and P P Das. “Surrogator: A Tool to Enrich a Digital Library with Open Access Surrogate Resources.” Proc. ACM/IEEE-CS JCDL 2018 (Poster). Texas, USA, June 3 - 7, 2018.

Page 28:

Beyond Lexical Search

• Lexical or word matching-based search cannot read user intent.

• Semantic search aims to give what the user wants, rather than what the user said.

Page 29:

Semantic Web

• Semantic Web is an extension of the traditional Web in which information is given well-defined meaning.

• The Semantic Web will contain resources with relations among each other.

Guha, Ramanathan, Rob McCool, and Eric Miller. "Semantic search." Proc. WWW, ACM, 2003.

Page 30:

Semantic Search

• Semantic Search attempts to augment and improve traditional (keyword-based) search results by using data from the Semantic Web.

• Data represented as a directed labelled graph, wherein each node corresponds to a resource and each arc is labelled with a property type (also a resource).

Google Knowledge Graph

Page 31:

Semantic Search in the Web

Page 32:

Semantic Search in the Web

Page 33:

Semantic Model for NDLI Metadata

• Semantic model to represent authors, works and other elements (present in metadata) as interconnected entities.

– Handcrafted ontology / auto-generated ontology from NDLI metadata schema can be used.

[Figure: “Satyajit Ray” isAuthorOf “Sonar Kella” and “Our Films, Their Films”; each work hasAuthor “Satyajit Ray”.]
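
The interconnected entities in the figure can be written down as subject-property-object triples. The sketch below is one simple in-memory representation; the `triples` list, the `graph` index, and the `objects` helper are illustrative names, not part of any NDLI ontology.

```python
from collections import defaultdict

# Each triple is (subject, property, object); the property labels the arc.
triples = [
    ("Satyajit Ray", "isAuthorOf", "Sonar Kella"),
    ("Satyajit Ray", "isAuthorOf", "Our Films, Their Films"),
    ("Sonar Kella", "hasAuthor", "Satyajit Ray"),
    ("Our Films, Their Films", "hasAuthor", "Satyajit Ray"),
]

# Index the graph by (subject, property) for simple lookups.
graph = defaultdict(list)
for subject, prop, obj in triples:
    graph[(subject, prop)].append(obj)

def objects(subject: str, prop: str) -> list:
    """Return all nodes reached from `subject` along arcs labelled `prop`."""
    return graph.get((subject, prop), [])

works = objects("Satyajit Ray", "isAuthorOf")
```

With authors stored as nodes rather than strings, a query such as "all works by this author" becomes a graph lookup instead of a string match.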

Page 34:

Lexical search (2015)

Semantic search (2017)

Page 35:

Need for Semantics in

• Concept queries

– “natural language interface”

– “ontology construction”

– “dynamic programming segmentation”

Xiong, Chenyan, Russell Power, and Jamie Callan. "Explicit semantic ranking for academic search via knowledge graph embedding." Proc. WWW, 2017.

Page 36:

Knowledge Graph in

• Create a Knowledge Graph (KG) G = (V, E)

– Collect concept entities (Entity set V)

– Select top-ranked noun phrases (keyphrases) from article title, abstract, introduction, conclusion & citation contexts

– Connect concept entities to other objects via edges (E)

– Author edges: Link to author whose paper mentions the entity

– Context edges: Link to co-occurring concept entities

– Desc edges: Link to descriptions in Freebase

– Venue edges: Link to venues that published papers with the entity in its title
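
The edge types listed above can be sketched with plain dictionaries. This is a toy illustration, not the pipeline of the cited paper: the two papers, their pre-extracted `concepts`, and the `edges` layout are assumptions, keyphrase ranking is skipped, and Desc edges are omitted because they need an external knowledge base.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical papers with already-extracted keyphrases (concept entities).
papers = [
    {"authors": ["A. Rao"], "venue": "WWW",
     "title": "Ontology construction for digital libraries",
     "concepts": ["ontology construction", "digital library"]},
    {"authors": ["B. Sen"], "venue": "JCDL",
     "title": "Semantic search in a digital library",
     "concepts": ["semantic search", "digital library"]},
]

# edges[edge_type][entity] -> set of linked objects
edges = {
    "author": defaultdict(set),
    "context": defaultdict(set),
    "venue": defaultdict(set),
}

for paper in papers:
    for concept in paper["concepts"]:
        # Author edges: authors whose paper mentions the entity.
        edges["author"][concept].update(paper["authors"])
        # Venue edges: venues of papers whose title contains the entity.
        if concept in paper["title"].lower():
            edges["venue"][concept].add(paper["venue"])
    # Context edges: concept entities that co-occur in the same paper.
    for a, b in combinations(paper["concepts"], 2):
        edges["context"][a].add(b)
        edges["context"][b].add(a)

entities = set(edges["author"])  # entity set V
```

Note that the exact substring test misses "digital library" in the title "…digital libraries"; a real system would normalise or lemmatise phrases before matching.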

Page 37:

Deep learning

Machine reasoning

Neural architecture

Compositional attention network

Deep learning (also known as deep structured learning or hierarchical learning) is part of a broader family of machine learning methods based on learning data representations, as opposed to task-specific algorithms.

Page 38:

KG Embedding in

• Build KG from all documents in corpus.

• Find the embedding of each entity in KG.

– Embeddings are trained based on neighbours in the KG.

– Graph structure around an entity conveys semantics of the entity.

– Intuitively, entities with similar neighbours are usually related.

v ∈ V ↦ f(v) ∈ ℝ^d : Embedding (vector of dimension d)
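
The intuition that entities with similar neighbours are related can be shown with a deliberately simple stand-in for a trained embedding: each entity gets a binary vector over its KG neighbours, and relatedness is the cosine of two vectors. Learned KG embeddings are dense and trained, which this sketch does not attempt; the entities and neighbour sets are invented.

```python
import math

# Hypothetical KG: entity -> set of neighbouring objects.
neighbours = {
    "semantic search": {"knowledge graph", "ranking", "query"},
    "entity search": {"knowledge graph", "ranking", "entity linking"},
    "ocr": {"scanned document", "image"},
}

# The vocabulary of all neighbour objects fixes the vector dimension d.
vocab = sorted(set().union(*neighbours.values()))

def vector(entity: str) -> list:
    """Binary neighbourhood vector of dimension d = len(vocab)."""
    return [1.0 if n in neighbours[entity] else 0.0 for n in vocab]

def cosine(u: list, v: list) -> float:
    """Cosine similarity of two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

close = cosine(vector("semantic search"), vector("entity search"))
far = cosine(vector("semantic search"), vector("ocr"))
```

Here "semantic search" and "entity search" share two of three neighbours and score higher than the unrelated pair, which shares none.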

Page 39:

Explicit Semantic Ranking in

Page 40:

Semantic Search in NDLI

• Can we build a semantic search engine over full text (wherever available) in NDLI?

• Challenges

– Parsing documents of diverse disciplines

– Constructing suitable representation of extracted information

– Building the right query interface

Page 41:

Q/A with Books & Articles

• Search need not be keyword-based.

• One should be able to ask questions and get relevant answers.

• Can we have a Q/A interface to NDLI?

Page 42:

Figure Search

• “A figure is worth a thousand words”

• Many figure search engines available for biomed publications.

• Can we design a figure search engine (wherever full text available) for NDLI?

• Challenges

– Extracting figures from scanned documents is difficult.

– Annotating figures (especially for multilingual content) is not easy.

Sanyal, D K, S Chattopadhyay, and R Chatterjee. “Figure Retrieval from Biomedical Literature: An Overview of Techniques, Tools and Challenges.” Machine Learning in Bio-Signal Analysis and Diagnostic Imaging, Elsevier (In Press).

Page 43:

Customized Viewers

• Various customized browsers possible over library contents.

Page 44:

Customized Viewer for Shodhganga Collection

• Shodhganga is a digital repository of Indian Electronic Theses and Dissertations.

• NDLI indexes Shodhganga (> 36K records).

• Our tool Posterity helps visualize and analyse the Shodhganga metadata.

– It can show the academic descendants of a researcher

– It shows various indices (like number of direct students) to characterise a researcher’s mentorship.

Page 45:

Processed Shodhganga

advisorId | researcherId | advisor | researcher | department | institution

51505 | 16148 | datta, asis | gunnery, shobha | school of life sciences | jawaharlal nehru university

Page 46:

Posterity

Developed by Sumana Dey, Staff @ NDLI

Page 47:
Page 48:

Recommender Systems

Page 49:

Recommender Systems: Approaches

• Recommend related resources

– Could be books, articles, videos, datasets, audio lectures, etc.

• Primarily 3 approaches

– Content-based filtering

– Uses item metadata and user’s preferences.

– Collaborative filtering

– Predicts what users will like based on their similarity to other users.

– Hybrid

– E.g., Apply above methods separately, then choose top-K from each.
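
The three approaches can be sketched together. The scoring functions below are deliberately naive stand-ins (subject overlap for content-based filtering, co-access counts weighted by user overlap for collaborative filtering); the item catalogue, access history, and all function names are hypothetical.

```python
def content_scores(liked_subjects: set, items: dict) -> dict:
    """Content-based: overlap between item subjects and the user's preferences."""
    return {item: len(subjects & liked_subjects) for item, subjects in items.items()}

def collaborative_scores(user: str, history: dict) -> dict:
    """Collaborative: score unseen items by the similarity of the users who accessed them."""
    mine = history[user]
    scores = {}
    for other, theirs in history.items():
        if other == user:
            continue
        overlap = len(mine & theirs)  # similarity to the other user
        for item in theirs - mine:
            scores[item] = scores.get(item, 0) + overlap
    return scores

def top_k(scores: dict, k: int, exclude: set) -> list:
    """Top-k items by score (ties broken by name), skipping seen and zero-score items."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, s in ranked if s > 0 and item not in exclude][:k]

def hybrid(user: str, liked_subjects: set, items: dict, history: dict, k: int = 1) -> list:
    """Hybrid: run both methods separately, then merge the top-k of each."""
    seen = history[user]
    picks = top_k(content_scores(liked_subjects, items), k, seen)
    for item in top_k(collaborative_scores(user, history), k, seen):
        if item not in picks:
            picks.append(item)
    return picks

# Invented catalogue (item -> subject tags) and access history (user -> items).
items = {"book_ml": {"ml"}, "video_ir": {"ir"},
         "thesis_ml": {"ml", "nlp"}, "paper_db": {"db"}}
history = {"u1": {"book_ml"}, "u2": {"book_ml", "paper_db"}, "u3": {"video_ir"}}

recs = hybrid("u1", {"ml", "nlp"}, items, history, k=1)
```

The content-based pick comes from subject metadata alone, while the collaborative pick needs access logs from other users, so the two can surface quite different resources.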

Page 50:

Recommender System for NDLI

• Can we build a recommender system for NDLI?

• Challenges

– Diverse resource types

– Resources in multiple languages

– Diverse user types

– Only metadata available; not full text (barrier to content-based filtering)

– Sparse query logs (cold start problem in collaborative filtering)

Page 51:

Lots of Possibilities!

• Many more tools possible

– Cross-lingual and multi-lingual search facilities needed.

– Interfaces for differently-abled users required.

– User experience tracking and enhancement is a plus.

– Data analytics over content could give astonishing insights into knowledge heritage.

Page 52: