Digital Libraries: What Should We Expect from Search Engines Dr. John M. Lervik, CEO FAST ECDL 2003.

44
Digital Libraries: What Should We Expect from Search Engines Dr. John M. Lervik, CEO FAST ECDL 2003

Transcript of Digital Libraries: What Should We Expect from Search Engines Dr. John M. Lervik, CEO FAST ECDL 2003.

Digital Libraries: What Should We Expect from Search Engines

Dr. John M. Lervik, CEO FAST ECDL 2003

Fast Search & Transfer (FAST)

San Francisco Tokyo

Chicago

Norway

Munich

Rome

LondonParis

BostonWashington

• Fast Search & Transfer (FAST)– Search technology company founded 1997 (background from NTNU)

– ~ 200 employees

– Profitable and well funded (> $90m cash)

– Oslo Stock Exchange: 'FAST.NO'

• Strong focus on publishing

– Publishers: Elsevier,

LexisNexis, etc.

– Digital Libraries: Norwegian Nat’l Library,

Univ Library Bielefeld, etc.

FAST is the creator of the real-time integrated search and filter solution behind the scenes at

some of the world's best known companies with the world's most demanding search problems

FAST is the creator of the real-time integrated search and filter solution behind the scenes at

some of the world's best known companies with the world's most demanding search problems

Outline

• What is a search engine?

• Third Generation Search Engines: Architecture

• Third Generation Search Engines: Relevance

• Example: Scirus

• Summary

Digital Library Challenges

• Digital libraries face an information management challenge– Huge and increasing amount of digital data– Data/content aggregation, data store (repository), information retrieval &

discovery, etc

• Increasing amount of digital data– Media types: Books, magazines, CDs, ...– Media formats: Text/numbers (incl metadata), audio files, images, video– Must support various access patterns, copyright, etc

• Need flexible and efficient interfaces between information and users

• Search engine as digital information management platform

Third Generation Search Engines: Architecture

Traditional View of Search Engines- 1st and 2nd Generation

Unstructured Data

Analyze Content

Analyze Query

Matching

Result Set

User Query

Search Engines vs Databases...

• 3-D scalability

– Data volume: 10 – 100 TB

– Number of users: > 1,000 QPS

– Latency: < 1 sec data input and query latency

• Data & content diversity

– Both structured and unstructured data

– Manage multiple data & content repositories

• Understand content and users

– Linguistic methodology to understand textual content

– Query analysis

• On-the-fly data analysis

– Reduce dependency on upfront data modeling

– Powerful content discovery and result navigationDatabases: Transaction processing, Structured data, Upfront data modeling

Search Engines: Aggregate & Retrieve data, Structured & Unstructured data, On-the-fly data analysis

Search Engines now also provide: • Data consistency• Fault tolerance/high availability• Distributed architectures• Low latency incremental indexing• ...

Search EngineHow It Works

CO

NN

ECTO

RS

Pipeline

SEARCH

QU

ERY &

RESU

LTPR

OC

ESSING

FILTER

Query

Results

Alert

VerticalApplications

Portals

CustomFront-Ends

MobileDevices

DATABASECONNECTOR

FILETRAVERSER

WEBCRAWLER

ContentPush

DO

CU

MEN

TPR

OC

ESSING

Pipeline

WebContent

Files,Documents

Databases

CustomApplications

CO

NN

ECTO

RS

TUNING, ADMINISTRATION

Index Files

Pipeline

Multimedia

Open, modular, scalable architecture

Search EngineHow It Works

• Connect to content sources and get data– Web pages (e.g. XML, HTML, WML): Crawler– Files, documents (e.g. Word, Excel, pdf): File traverser– Database content (e.g. Oracle, DB2): Database connectors– Applications (e.g. Notes, Exchange, CMS/DMS): Application connectors

CO

NN

ECTO

RS

Pipeline

SEARCH

QU

ER

Y &

RE

SU

LT

PR

OC

ES

SIN

G

FILTER

Query

Results

Alert

VerticalApplications

Portals

CustomFront-Ends

MobileDevices

DATABASECONNECTOR

FILETRAVERSER

WEBCRAWLER

ContentPush

DO

CU

MEN

TPR

OC

ESSING

Pipeline

WebContent

Files,Documents

Databases

CustomApplications

CO

NN

ECTO

RS

TUNING, ADMINISTRATION

Index Files

Multimedia

Search EngineHow It Works

• Analyze and index content to make it searchable– Convert and process content through pre-processing pipeline:

• Lemmatization, entity extraction, taxonomy classification, ontology• Custom logic (e.g. adding special tags)

– Write content to index files

WebContent

CO

NN

ECTO

RS

Pipeline

SEARCH

QU

ERY /R

ESULT

PRO

CESSIN

G

FILTER

Query

Results

Alert

VerticalApplications

Portals

CustomFront-Ends

MobileDevices

DATABASECONNECTOR

FILETRAVERSER

WEBCRAWLER

DO

CU

MEN

TPR

OC

ESSING

Pipeline

CO

NN

ECTO

RS

TUNING, ADMINISTRATION

Index Files

Files,Documents

Databases

CustomApplications

ContentPush

Pipeline

Multimedia

Search EngineHow It Works

• Analyze query– Use query language or query API– Convert and process query through query pipeline:

• Linguistic processing• Custom logic (e.g. query term modification/addition)

WebContent

CO

NN

ECTO

RS

Pipeline

SEARCH

QU

ERY

PRO

CESSIN

G

FILTER

Query

Results

Alert

VerticalApplications

Portals

CustomFront-Ends

MobileDevices

DATABASECONNECTOR

FILETRAVERSER

WEBCRAWLER

ContentPush

DO

CU

MEN

TPR

OC

ESSING

Pipeline

CO

NN

ECTO

RS

TUNING, ADMINISTRATION

Index Files

Files,Documents

Databases

CustomApplications

Multimedia

Search EngineHow It Works

• Match query to content index– Query- and content adaptive matching– Exploit all information and structure in the data

CO

NN

ECTO

RS

Pipeline

SEARCH

QU

ERY /R

ESULT

PRO

CESSIN

G

FILTER

Query

Results

Alert

VerticalApplications

Portals

CustomFront-Ends

MobileDevices

DATABASECONNECTOR

FILETRAVERSER

WEBCRAWLER

DO

CU

MEN

TPR

OC

ESSING

Pipeline

CO

NN

ECTO

RS

TUNING, ADMINISTRATION

Index Files

WebContent

ContentPush

Files,Documents

Databases

CustomApplications

Pipeline

Multimedia

CO

NN

ECTO

RS

Search EngineHow It Works

• Return results to user– Convert and process results through result pipeline:

• Resort, filter for security, organize for dynamic drilldown– Pass results on to application (generated or through API) – Push results to alert engine and then external environment (e.g. mail, queue)

WebContent

Pipeline

SEARCH

RESU

LTPR

OC

ESSING

FILTER

Query

Results

Alert

VerticalApplications

Portals

CustomFront-Ends

MobileDevices

DATABASECONNECTOR

FILETRAVERSER

WEBCRAWLER

ContentPush

DO

CU

MEN

TPR

OC

ESSING

Pipeline

CO

NN

ECTO

RS

TUNING, ADMINISTRATION

Index Files

Files,Documents

Databases

CustomApplications

Multimedia

Search Engine FeaturesRelevant, Organized Information

• Linguistic Analysis– Auto-language detection– Natural language processing– Approximate matching (spelling)– Lemmatization (grammar)– Entity extraction, anti-phrasing– Multiple dictionaries, thesauri

• Taxonomy and Classification– Structured, unstructured data– Supervised, unsupervised categorization– Dynamic classification– Auto-taxonomy generation (terms, Web)– Taxonomy toolkit– Ontologies

• Tuning relevancy– Absolute and relative query boosting– Relative document boosting– Custom processing logic (pre-index, query)– Business Manager’s Control Panel

• Powerful Search Language– Exact matches, wildcards, multiple terms– “more like this” (query by example), “near”– Text, integer, Boolean expressions– Integer comparisons (>, , =, <, , )– Infinite level of parentheses

• Flexible Search and Sort– Range searching– Default sort, sort by field– Static, dynamic teasers, any field– Full inclusion, exclusion URI control– Robot aware

• Navigation and Drill-Down Tools– Structure, unstructured data– Dynamic drill-down (faceted browsing)– Results-based binning– Statistical analysis

Search Engines: An Ideal Information Management Platform

• Software system for overall information management– Universal data aggregation: public & proprietary

– Central content repository: source data & metadata, ...

– Efficient information access (through seach interface) – including push (alert)

– Powerful data mining and discovery

Third Generation Search Engines: Relevance

Relevance Model3rd Generation Search Engines

• Algorithmic– Mining

• Preparing unstructured information for intelligent information discovery• Information/entity extraction, structural document analysis, classification, NLP, ...

– Matching• Tuning of recall: Spell-check, query analysis, linguistic analysis ...

– Ranking• Tuning of precision• Assigning of key parameters – The CASQF framework• Relevancy benchmark framework – Optimization through machine learning - NLP

– Navigation / Information discovery• Extensive result analysis through LiveAnalytics• Adaptive answering/discovery capability

Not only query results; Also supporting query Result driven info discovery

• Rule Based– Rules: Override algorithmic search, preferred content– Automatic detection of “correct answers”

RelevanceMining: Content Refinement

• Algorithmic– Mining

• Preparing unstructured information for intelligent information discovery• Information/entity extraction and structural document analysis, classification,

NLP, ...– Matching

• Tuning of recall: Spell-check, query analysis, linguistic analysis ...– Ranking

• Tuning of precision• Assigning of key parameters – The CASQF framework• Relevancy benchmark framework – Optimization through machine learning - NLP

– Navigation / Information discovery• Extensive result analysis through LiveAnalytics• Adaptive answering/discovery capability

Not only query results; Also supporting query Result driven info discovery

• Rule Based– Rules: Override algorithmic search, preferred content– Automatic detection of “correct answers”

RelevanceMining: Entity Extraction

• What is it?– Techniques to automatically detect, extract and normalize ”entities” from

documents’ text, e.g., names of people or companies, noun phrases, product names or codes, adresses, chemical compounds, acronyms, ...

• Why bother?– Make unstructured data more structured

• Enhance relevance (precision searching:”it” vs. ”IT”, boosting based on entities, improved vectorization, ...)

• Enables navigation, i.e., attributes to drill down along– Dictionary compilation

• Spellchecking, classification, query associations, etc.– Improve classification quality: entities are more specific and less ambiguous

• How is it done?– Realized as document processors that annotate/modify the indexed document

and/or otherwise persist the entities.– Guided by dictionaries, syntactical cues, grammars and/or document structure

Entities...

PARIS (Reuters) - Venus Williams raced into the second round of the $11.25 million French Open Monday, brushing aside Bianka Lamade, 6-3, 6-3, in 65 minutes.

The Wimbledon and U.S. Open champion, seeded second, breezed past the German on a blustery center court to become the first seed to advance at Roland Garros. "I love being here, I love the French Open and more than anything I'd love to do well here," the American said.

A first round loser last year, Williams is hoping to progress beyond the quarter-finals for the first time in her career.

... As Opposed to Words

11, 25, 3, 3, 6, 6, 65, a, a, advance, american, and, and, anything, aside, at, become, being, beyond, bianka, blustery, breezed, brushing, career, center, champion, court, d, do, finals, first, first, first, for, french, french, garros, german, her, here, here, hoping, i, i, i, in, in, into, is, lamade, last, loser, love, love, love, million, minutes, monday, more, of, on, open, open, open, paris, past, progress, quarter, raced, reuters, roland, round, round, s, said, second, second, seed, seeded, than, the, the, the, the, the, the, the, the, the, time, to, to, to, to, u, venus, well, williams, williams, wimbledon, year

(Wordlist of article above)

Automatically Extracted Entities

RelevanceMining: Structural Document Analysis

Investigations in E. coli

B. C. AbracadabraDepartment of Molecular MedicineUniversity of Wisconsin

S. MiheevAnalytical LaboratoryRussian Academy of ScienecesMoscow

Journal of Cancer Research

Issue 5, 2003 -12

Abstract

1. Introduction

9. References

2. Materials and Methods

In this study we investigate………

[1] Abracadabra, B.C., Discovery of E. coli for Genetic Research, Conf. Canc. 1997, 231-245

[2] Tomason T., Latest Developments in Cancer Research, Int J Med, 1999, 23, 12-16

[3] Zoralek Q.W., Geneteics. A Personal View. Int. Conf. Gen., 1993, 3-12

• Benefits– Sensitive to

• Content• Structure• Formatting

– Document classification• Improve precision in

information discovery

– Document segmentation• Segmentation used in

tunable relevance model• Enabling contextual

entity extraction

RelevanceMining: Structural Document Analysis

Investigations in E. coli

B. C. AbracadabraDepartment of Molecular MedicineUniversity of Wisconsin

S. MiheevAnalytical LaboratoryRussian Academy of ScienecesMoscow

Journal of Cancer Research

Issue 5, 2003 -12

Abstract

1. Introduction

9. References

2. Materials and Methods

In this study we investigate………

[1] Abracadabra, B.C., Discovery of E. coli for Genetic Research, Conf. Canc. 1997, 231-245

[2] Tomason T., Latest Developments in Cancer Research, Int J Med, 1999, 23, 12-16

[3] Zoralek Q.W., Geneteics. A Personal View. Int. Conf. Gen., 1993, 3-12

Journal Title

Article Title

Section Heading

“Readable” Text paragraph

Bibliography heading

Bibliography line

Author Section

TextBlockTypes

ComplexBlockTypes

DocType

RelevanceMatching: Recall optimization

• Algorithmic– Mining

• Preparing unstructured information for intelligent information discovery• Information/entity extraction, structural document analysis, classification, NLP...

– Matching• Tuning of recall: Spell-check, query analysis, linguistic analysis ...

– Ranking• Tuning of precision• Assigning of key parameters – The CASQF framework• Relevancy benchmark framework – Optimization through machine learning - NLP

– Navigation / Information discovery• Extensive result analysis through LiveAnalytics• Adaptive answering/discovery capability

Not only query results; Also supporting query Result driven info discovery

• Rule-Based– Rules: Override algorithmic search, preferred content– Automatic detection of “correct answers”

RelevanceMatching: Reasons for inadequate matching

• Ortographical analysis (incl. typos + other spelling variations)– Typos (in queries, documents)

– Official variants: E.g. German (Dutch) spelling reform

• Morphological analysis (incl. all forms of a given word)– Handled via linguistic normalization (lemmatization)

• Syntactic analysis (including handling natural language)– Entity/phrase extraction, anti-phrases, etc.

• Semantic analysis (understanding the intention behind the words) – Handled by a combination of general and specific thesauri and ontologies,

as well as automatic phrasing, and anti-phrasing, etc.

helwett packardhewletpackardhewlett packarthewlitt-packardhewlett pachardhewlett pacardhewlett packhardhewlett packerthewlet packerdhewelt packardhewlet pakardhewelet packardhawlet packardhwelett packardheweltt packardhewellet packard

hewlet packardhewlettpackardhewlitt packardhewllet packardhewlwtt packardhewett packardhewitt packardhewlette packardhewlett packerdhewelett packardhawlett packardhewlet-packardhewlett pakardhewllett packardhewlit packardhewlett parkard

Reasons for Inadequate Matching Ortographical Analysis

hewlatt packardhelwet packardhewleet packardhewlwt packardhewlwtt-packardhelett packardHewlittpackardhawlett-packardhewlett parckardhewlet packarthulett packardhewlert packardhewlet pacardhewletpackard comhewlet pachard

....

• And quite a few more variations...

• Taking this into account lead to a 500% increase in recall!

• This observation applies to all queries involving proper names, product names, technical terms, etc

RelevanceMatching: Linguistic Query Analysis

TokenizerPhrasing

Spellchecker Anti-Phrasing&

Stopwords Normalization

Baseform red.Synonyms Temp.

Morph/synExpansion

Customer’sQT

Lemmatization + Synonym: Reduction to base form, represented symbolically:

<lang> <concept> <lemma>012 319827 002

For lemmatization / synonymdictionaries added by customerwithout wanting to re-index:

Query Expansion

Language specificstop word lists

Character Normalization- configurable for other languages- can be switched off at query time

E.g. special thesaurus support- for narrower & broader terms- for special phonetic search- ….

TokenizerMake sure- special characters are treated correctly- on demand: no lower casing- etc..

NLQ

Natural Language Analysis

Adaptive Query EvaluationSelect: Content Ranking profiles

Adaptive Query Evaluation

RelevanceMatching: Adaptive Query Evaluation

General Queries

Hybrid Queries

Specific Queries

Content Format Reference

‘New York’

‘HP printer driver LP 6j’

‘C source code downloads’

RelevanceRanking: The CASQF Framework

• Algorithmic– Mining

• Preparing unstructured information for intelligent information discovery• Information/entity extraction and structural document analysis, classification, NLP...

– Matching• Tuning of recall: Spell-check, query analysis, linguistic analysis ...

– Ranking• Tuning of precision• Assigning of key parameters – The CASQF framework• Relevancy benchmark framework – Optimization through machine learning - NLP

– Navigation / Information discovery• Extensive result analysis through LiveAnalytics• Adaptive answering/discovery capability

Not only query results; Also supporting query Result driven info discovery

• Rule Based– Rules: Override algorithmic search, preferred content– Automatic detection of “correct answers”

Relevance Ranking: The CASQF Framework

• Completeness– How well does the query match superior contexts like the title or the url?– Example: query=”Mexico”, Is ”Mexico” or ”University of New Mexico” best?

• Authority– Is the document considered an authority for this query?– Examples: Web link cardinality, article references, product revenue, page

impressions, ...

• Statistics– How well does the contents of this document on overall match the query?– Examples: Proximity, context weights, tf-idf, degree of linguistic normalization,++

• Quality– What is the quality of the document? – Examples: Homepage?, Entry point to product group?, Press release?, ...

• Freshness– How fresh is the document compared to the time of the query?

Orthogonal attributes aggregated from search relevance primitives

• CASQF Attributes– Completeness– Authority– Statistics– Quality– Freshness

• Correlations– Complete &

Fresh?– Authority &

Frequent query– ….

Transfer functions Attribute weights Correlations

Relevance Ranking: Machine Learning Optimization

Auto-language detectionApproximate matchingLemmatization (grammar)Phrase detectionAnti-phrasing (stop words)Multiple dictionariesthesauriNatural language

(who, what, where, etc.)Proximity searchSpatial relevanceContextual tagging…

RelevancePrimitives

Relevance Navigation & Information Discovery

• Algorithmic– Mining

• Preparing unstructured information for intelligent information discovery• Information/entity extraction and structural document analysis, classification, NLP...

– Matching• Tuning of recall: Spell-check, query analysis, linguistic analysis ...

– Ranking• Tuning of precision• Assigning of key parameters – The CASQF framework• Relevancy benchmark framework – Optimization through machine learning - NLP

– Navigation / Information discovery• Extensive result analysis through LiveAnalytics• Adaptive answering/discovery capability

Not only query results; Also supporting query Result driven info discovery

• Rule Based– Rules: override algorithmic search, preferred content– Automatic detection of “correct answers”

Form of Result Sets

• Traditional: Results sets are typically lists of document identifiers

• 3rd generation: Result set depending on the query intentions– Traditional result set lists

– Dynamic clustering: Supervised and unsupervised

– Dynamic drill-down: Live analytics

– ...

2 ways to search: - “I know what I want, but I don’t know where it is” - “I’m not sure what I’m looking for but I know how to get there”

2 ways to search: - “I know what I want, but I don’t know where it is” - “I’m not sure what I’m looking for but I know how to get there”

Intelligent OrganizationIntelligent Organization

The search bar

The hierarchical tree

Medline Demo- 12 million journal articles from > 4,000 journals

Dynamic Drill-Down in Auto-Extracted Entities

Dynamic Drill-Down

• MESH keywords• Publication year• Journal Title• Author(s)• Substances• Etc

LiveAnalytics™: Tool for Dynamic Drill-Down

• Multi-dimensional navigation and binning tool– Multiple ways to drill down through structured data

– E.g. database rows, metadata, automatically extracted attributes (entities)

• Data properties may be:– Textual: author, publication, …

– Numeric: date, price, …

– Free text: abstract, ...

• Source may be historical or live feeds

Relevance Rule Based

• Algorithmic– Mining

• Preparing unstructured information for intelligent information discovery• Information/entity extraction and structural document analysis, classification, NLP...

– Matching• Tuning of recall: Spell-check, query analysis, linguistic analysis ...

– Ranking• Tuning of precision• Assigning of key parameters – The CASQF framework• Relevancy benchmark framework – Optimization through machine learning - NLP

– Navigation / Information discovery• Extensive result analysis through LiveAnalytics• Adaptive answering/discovery capability

Not only query results; Also supporting query Result driven info discovery

• Rule Based– Rules: override algorithmic search, preferred content– Automatic detection of “correct answers”

Relevance ModelRule Based: Business Logic Example

Example: Scirus

Scirus

Scirus is the leading online search enginefor scientific content

ProprietaryDatabases

ValueAdded

Functionalities

ScientificWeb Pages

Twice winner of SEW Best Specialty

Search Engine award

120 million Web pages(.edu, .gov, .org, .com, …)

18M articlerecords (Medline, SciencDirect, …)

• Large-scale content aggregation• Automatic content & page classificat.• Query refinements (1-D drill-down)

Summary

• Search engines can do more than just search…– Total information management system– Open, scalable and modular architecture: Allows for customization– Adapts to content and queries– Powerful data discovery and navigation

• Many exciting technology developments to come– More advanced content and query analysis– Adaptive query- & content-sensitive matching– Dynamic result set presentation and navigation– …

Questions?