Digital Libraries: What Should We Expect from Search Engines Dr. John M. Lervik, CEO FAST ECDL 2003.
-
Upload
coleen-alexander -
Category
Documents
-
view
215 -
download
1
Transcript of Digital Libraries: What Should We Expect from Search Engines Dr. John M. Lervik, CEO FAST ECDL 2003.
Fast Search & Transfer (FAST)
San Francisco Tokyo
Chicago
Norway
Munich
Rome
LondonParis
BostonWashington
• Fast Search & Transfer (FAST)– Search technology company founded 1997 (background from NTNU)
– ~ 200 employees
– Profitable and well funded (> $90m cash)
– Oslo Stock Exchange: 'FAST.NO'
• Strong focus on publishing
– Publishers: Elsevier,
LexisNexis, etc.
– Digital Libraries: Norwegian Nat’l Library,
Univ Library Bielefeld, etc.
FAST is the creator of the real-time integrated search and filter solution behind the scenes at
some of the world's best known companies with the world's most demanding search problems
FAST is the creator of the real-time integrated search and filter solution behind the scenes at
some of the world's best known companies with the world's most demanding search problems
Outline
• What is a search engine?
• Third Generation Search Engines: Architecture
• Third Generation Search Engines: Relevance
• Example: Scirus
• Summary
Digital Library Challenges
• Digital libraries face an information management challenge– Huge and increasing amount of digital data– Data/content aggregation, data store (repository), information retrieval &
discovery, etc
• Increasing amount of digital data– Media types: Books, magazines, CDs, ...– Media formats: Text/numbers (incl metadata), audio files, images, video– Must support various access patterns, copyright, etc
• Need flexible and efficient interfaces between information and users
• Search engine as digital information management platform
Traditional View of Search Engines- 1st and 2nd Generation
Unstructured Data
Analyze Content
Analyze Query
Matching
Result Set
User Query
Search Engines vs Databases...
• 3-D scalability
– Data volume: 10 – 100 TB
– Number of users: > 1,000 QPS
– Latency: < 1 sec data input and query latency
• Data & content diversity
– Both structured and unstructured data
– Manage multiple data & content repositories
• Understand content and users
– Linguistic methodology to understand textual content
– Query analysis
• On-the-fly data analysis
– Reduce dependency on upfront data modeling
– Powerful content discovery and result navigationDatabases: Transaction processing, Structured data, Upfront data modeling
Search Engines: Aggregate & Retrieve data, Structured & Unstructured data, On-the-fly data analysis
Search Engines now also provide: • Data consistency• Fault tolerance/high availability• Distributed architectures• Low latency incremental indexing• ...
Search EngineHow It Works
CO
NN
ECTO
RS
Pipeline
SEARCH
QU
ERY &
RESU
LTPR
OC
ESSING
FILTER
Query
Results
Alert
VerticalApplications
Portals
CustomFront-Ends
MobileDevices
DATABASECONNECTOR
FILETRAVERSER
WEBCRAWLER
ContentPush
DO
CU
MEN
TPR
OC
ESSING
Pipeline
WebContent
Files,Documents
Databases
CustomApplications
CO
NN
ECTO
RS
TUNING, ADMINISTRATION
Index Files
Pipeline
Multimedia
Open, modular, scalable architecture
Search EngineHow It Works
• Connect to content sources and get data– Web pages (e.g. XML, HTML, WML): Crawler– Files, documents (e.g. Word, Excel, pdf): File traverser– Database content (e.g. Oracle, DB2): Database connectors– Applications (e.g. Notes, Exchange, CMS/DMS): Application connectors
CO
NN
ECTO
RS
Pipeline
SEARCH
QU
ER
Y &
RE
SU
LT
PR
OC
ES
SIN
G
FILTER
Query
Results
Alert
VerticalApplications
Portals
CustomFront-Ends
MobileDevices
DATABASECONNECTOR
FILETRAVERSER
WEBCRAWLER
ContentPush
DO
CU
MEN
TPR
OC
ESSING
Pipeline
WebContent
Files,Documents
Databases
CustomApplications
CO
NN
ECTO
RS
TUNING, ADMINISTRATION
Index Files
Multimedia
Search EngineHow It Works
• Analyze and index content to make it searchable– Convert and process content through pre-processing pipeline:
• Lemmatization, entity extraction, taxonomy classification, ontology• Custom logic (e.g. adding special tags)
– Write content to index files
WebContent
CO
NN
ECTO
RS
Pipeline
SEARCH
QU
ERY /R
ESULT
PRO
CESSIN
G
FILTER
Query
Results
Alert
VerticalApplications
Portals
CustomFront-Ends
MobileDevices
DATABASECONNECTOR
FILETRAVERSER
WEBCRAWLER
DO
CU
MEN
TPR
OC
ESSING
Pipeline
CO
NN
ECTO
RS
TUNING, ADMINISTRATION
Index Files
Files,Documents
Databases
CustomApplications
ContentPush
Pipeline
Multimedia
Search EngineHow It Works
• Analyze query– Use query language or query API– Convert and process query through query pipeline:
• Linguistic processing• Custom logic (e.g. query term modification/addition)
WebContent
CO
NN
ECTO
RS
Pipeline
SEARCH
QU
ERY
PRO
CESSIN
G
FILTER
Query
Results
Alert
VerticalApplications
Portals
CustomFront-Ends
MobileDevices
DATABASECONNECTOR
FILETRAVERSER
WEBCRAWLER
ContentPush
DO
CU
MEN
TPR
OC
ESSING
Pipeline
CO
NN
ECTO
RS
TUNING, ADMINISTRATION
Index Files
Files,Documents
Databases
CustomApplications
Multimedia
Search EngineHow It Works
• Match query to content index– Query- and content adaptive matching– Exploit all information and structure in the data
CO
NN
ECTO
RS
Pipeline
SEARCH
QU
ERY /R
ESULT
PRO
CESSIN
G
FILTER
Query
Results
Alert
VerticalApplications
Portals
CustomFront-Ends
MobileDevices
DATABASECONNECTOR
FILETRAVERSER
WEBCRAWLER
DO
CU
MEN
TPR
OC
ESSING
Pipeline
CO
NN
ECTO
RS
TUNING, ADMINISTRATION
Index Files
WebContent
ContentPush
Files,Documents
Databases
CustomApplications
Pipeline
Multimedia
CO
NN
ECTO
RS
Search EngineHow It Works
• Return results to user– Convert and process results through result pipeline:
• Resort, filter for security, organize for dynamic drilldown– Pass results on to application (generated or through API) – Push results to alert engine and then external environment (e.g. mail, queue)
WebContent
Pipeline
SEARCH
RESU
LTPR
OC
ESSING
FILTER
Query
Results
Alert
VerticalApplications
Portals
CustomFront-Ends
MobileDevices
DATABASECONNECTOR
FILETRAVERSER
WEBCRAWLER
ContentPush
DO
CU
MEN
TPR
OC
ESSING
Pipeline
CO
NN
ECTO
RS
TUNING, ADMINISTRATION
Index Files
Files,Documents
Databases
CustomApplications
Multimedia
Search Engine FeaturesRelevant, Organized Information
• Linguistic Analysis– Auto-language detection– Natural language processing– Approximate matching (spelling)– Lemmatization (grammar)– Entity extraction, anti-phrasing– Multiple dictionaries, thesauri
• Taxonomy and Classification– Structured, unstructured data– Supervised, unsupervised categorization– Dynamic classification– Auto-taxonomy generation (terms, Web)– Taxonomy toolkit– Ontologies
• Tuning relevancy– Absolute and relative query boosting– Relative document boosting– Custom processing logic (pre-index, query)– Business Manager’s Control Panel
• Powerful Search Language– Exact matches, wildcards, multiple terms– “more like this” (query by example), “near”– Text, integer, Boolean expressions– Integer comparisons (>, , =, <, , )– Infinite level of parentheses
• Flexible Search and Sort– Range searching– Default sort, sort by field– Static, dynamic teasers, any field– Full inclusion, exclusion URI control– Robot aware
• Navigation and Drill-Down Tools– Structure, unstructured data– Dynamic drill-down (faceted browsing)– Results-based binning– Statistical analysis
Search Engines: An Ideal Information Management Platform
• Software system for overall information management– Universal data aggregation: public & proprietary
– Central content repository: source data & metadata, ...
– Efficient information access (through seach interface) – including push (alert)
– Powerful data mining and discovery
Relevance Model3rd Generation Search Engines
• Algorithmic– Mining
• Preparing unstructured information for intelligent information discovery• Information/entity extraction, structural document analysis, classification, NLP, ...
– Matching• Tuning of recall: Spell-check, query analysis, linguistic analysis ...
– Ranking• Tuning of precision• Assigning of key parameters – The CASQF framework• Relevancy benchmark framework – Optimization through machine learning - NLP
– Navigation / Information discovery• Extensive result analysis through LiveAnalytics• Adaptive answering/discovery capability
Not only query results; Also supporting query Result driven info discovery
• Rule Based– Rules: Override algorithmic search, preferred content– Automatic detection of “correct answers”
RelevanceMining: Content Refinement
• Algorithmic– Mining
• Preparing unstructured information for intelligent information discovery• Information/entity extraction and structural document analysis, classification,
NLP, ...– Matching
• Tuning of recall: Spell-check, query analysis, linguistic analysis ...– Ranking
• Tuning of precision• Assigning of key parameters – The CASQF framework• Relevancy benchmark framework – Optimization through machine learning - NLP
– Navigation / Information discovery• Extensive result analysis through LiveAnalytics• Adaptive answering/discovery capability
Not only query results; Also supporting query Result driven info discovery
• Rule Based– Rules: Override algorithmic search, preferred content– Automatic detection of “correct answers”
RelevanceMining: Entity Extraction
• What is it?– Techniques to automatically detect, extract and normalize ”entities” from
documents’ text, e.g., names of people or companies, noun phrases, product names or codes, adresses, chemical compounds, acronyms, ...
• Why bother?– Make unstructured data more structured
• Enhance relevance (precision searching:”it” vs. ”IT”, boosting based on entities, improved vectorization, ...)
• Enables navigation, i.e., attributes to drill down along– Dictionary compilation
• Spellchecking, classification, query associations, etc.– Improve classification quality: entities are more specific and less ambiguous
• How is it done?– Realized as document processors that annotate/modify the indexed document
and/or otherwise persist the entities.– Guided by dictionaries, syntactical cues, grammars and/or document structure
Entities...
PARIS (Reuters) - Venus Williams raced into the second round of the $11.25 million French Open Monday, brushing aside Bianka Lamade, 6-3, 6-3, in 65 minutes.
The Wimbledon and U.S. Open champion, seeded second, breezed past the German on a blustery center court to become the first seed to advance at Roland Garros. "I love being here, I love the French Open and more than anything I'd love to do well here," the American said.
A first round loser last year, Williams is hoping to progress beyond the quarter-finals for the first time in her career.
... As Opposed to Words
11, 25, 3, 3, 6, 6, 65, a, a, advance, american, and, and, anything, aside, at, become, being, beyond, bianka, blustery, breezed, brushing, career, center, champion, court, d, do, finals, first, first, first, for, french, french, garros, german, her, here, here, hoping, i, i, i, in, in, into, is, lamade, last, loser, love, love, love, million, minutes, monday, more, of, on, open, open, open, paris, past, progress, quarter, raced, reuters, roland, round, round, s, said, second, second, seed, seeded, than, the, the, the, the, the, the, the, the, the, time, to, to, to, to, u, venus, well, williams, williams, wimbledon, year
(Wordlist of article above)
RelevanceMining: Structural Document Analysis
Investigations in E. coli
B. C. AbracadabraDepartment of Molecular MedicineUniversity of Wisconsin
S. MiheevAnalytical LaboratoryRussian Academy of ScienecesMoscow
Journal of Cancer Research
Issue 5, 2003 -12
Abstract
1. Introduction
9. References
2. Materials and Methods
In this study we investigate………
[1] Abracadabra, B.C., Discovery of E. coli for Genetic Research, Conf. Canc. 1997, 231-245
[2] Tomason T., Latest Developments in Cancer Research, Int J Med, 1999, 23, 12-16
[3] Zoralek Q.W., Geneteics. A Personal View. Int. Conf. Gen., 1993, 3-12
• Benefits– Sensitive to
• Content• Structure• Formatting
– Document classification• Improve precision in
information discovery
– Document segmentation• Segmentation used in
tunable relevance model• Enabling contextual
entity extraction
RelevanceMining: Structural Document Analysis
Investigations in E. coli
B. C. AbracadabraDepartment of Molecular MedicineUniversity of Wisconsin
S. MiheevAnalytical LaboratoryRussian Academy of ScienecesMoscow
Journal of Cancer Research
Issue 5, 2003 -12
Abstract
1. Introduction
9. References
2. Materials and Methods
In this study we investigate………
[1] Abracadabra, B.C., Discovery of E. coli for Genetic Research, Conf. Canc. 1997, 231-245
[2] Tomason T., Latest Developments in Cancer Research, Int J Med, 1999, 23, 12-16
[3] Zoralek Q.W., Geneteics. A Personal View. Int. Conf. Gen., 1993, 3-12
Journal Title
Article Title
Section Heading
“Readable” Text paragraph
Bibliography heading
Bibliography line
Author Section
TextBlockTypes
ComplexBlockTypes
DocType
RelevanceMatching: Recall optimization
• Algorithmic– Mining
• Preparing unstructured information for intelligent information discovery• Information/entity extraction, structural document analysis, classification, NLP...
– Matching• Tuning of recall: Spell-check, query analysis, linguistic analysis ...
– Ranking• Tuning of precision• Assigning of key parameters – The CASQF framework• Relevancy benchmark framework – Optimization through machine learning - NLP
– Navigation / Information discovery• Extensive result analysis through LiveAnalytics• Adaptive answering/discovery capability
Not only query results; Also supporting query Result driven info discovery
• Rule-Based– Rules: Override algorithmic search, preferred content– Automatic detection of “correct answers”
RelevanceMatching: Reasons for inadequate matching
• Ortographical analysis (incl. typos + other spelling variations)– Typos (in queries, documents)
– Official variants: E.g. German (Dutch) spelling reform
• Morphological analysis (incl. all forms of a given word)– Handled via linguistic normalization (lemmatization)
• Syntactic analysis (including handling natural language)– Entity/phrase extraction, anti-phrases, etc.
• Semantic analysis (understanding the intention behind the words) – Handled by a combination of general and specific thesauri and ontologies,
as well as automatic phrasing, and anti-phrasing, etc.
helwett packardhewletpackardhewlett packarthewlitt-packardhewlett pachardhewlett pacardhewlett packhardhewlett packerthewlet packerdhewelt packardhewlet pakardhewelet packardhawlet packardhwelett packardheweltt packardhewellet packard
hewlet packardhewlettpackardhewlitt packardhewllet packardhewlwtt packardhewett packardhewitt packardhewlette packardhewlett packerdhewelett packardhawlett packardhewlet-packardhewlett pakardhewllett packardhewlit packardhewlett parkard
Reasons for Inadequate Matching Ortographical Analysis
hewlatt packardhelwet packardhewleet packardhewlwt packardhewlwtt-packardhelett packardHewlittpackardhawlett-packardhewlett parckardhewlet packarthulett packardhewlert packardhewlet pacardhewletpackard comhewlet pachard
....
• And quite a few more variations...
• Taking this into account lead to a 500% increase in recall!
• This observation applies to all queries involving proper names, product names, technical terms, etc
RelevanceMatching: Linguistic Query Analysis
TokenizerPhrasing
Spellchecker Anti-Phrasing&
Stopwords Normalization
Baseform red.Synonyms Temp.
Morph/synExpansion
Customer’sQT
Lemmatization + Synonym: Reduction to base form, represented symbolically:
<lang> <concept> <lemma>012 319827 002
For lemmatization / synonymdictionaries added by customerwithout wanting to re-index:
Query Expansion
Language specificstop word lists
Character Normalization- configurable for other languages- can be switched off at query time
E.g. special thesaurus support- for narrower & broader terms- for special phonetic search- ….
TokenizerMake sure- special characters are treated correctly- on demand: no lower casing- etc..
NLQ
Natural Language Analysis
Adaptive Query EvaluationSelect: Content Ranking profiles
Adaptive Query Evaluation
RelevanceMatching: Adaptive Query Evaluation
General Queries
Hybrid Queries
Specific Queries
Content Format Reference
‘New York’
‘HP printer driver LP 6j’
‘C source code downloads’
RelevanceRanking: The CASQF Framework
• Algorithmic– Mining
• Preparing unstructured information for intelligent information discovery• Information/entity extraction and structural document analysis, classification, NLP...
– Matching• Tuning of recall: Spell-check, query analysis, linguistic analysis ...
– Ranking• Tuning of precision• Assigning of key parameters – The CASQF framework• Relevancy benchmark framework – Optimization through machine learning - NLP
– Navigation / Information discovery• Extensive result analysis through LiveAnalytics• Adaptive answering/discovery capability
Not only query results; Also supporting query Result driven info discovery
• Rule Based– Rules: Override algorithmic search, preferred content– Automatic detection of “correct answers”
Relevance Ranking: The CASQF Framework
• Completeness– How well does the query match superior contexts like the title or the url?– Example: query=”Mexico”, Is ”Mexico” or ”University of New Mexico” best?
• Authority– Is the document considered an authority for this query?– Examples: Web link cardinality, article references, product revenue, page
impressions, ...
• Statistics– How well does the contents of this document on overall match the query?– Examples: Proximity, context weights, tf-idf, degree of linguistic normalization,++
• Quality– What is the quality of the document? – Examples: Homepage?, Entry point to product group?, Press release?, ...
• Freshness– How fresh is the document compared to the time of the query?
Orthogonal attributes aggregated from search relevance primitives
• CASQF Attributes– Completeness– Authority– Statistics– Quality– Freshness
• Correlations– Complete &
Fresh?– Authority &
Frequent query– ….
Transfer functions Attribute weights Correlations
Relevance Ranking: Machine Learning Optimization
Auto-language detectionApproximate matchingLemmatization (grammar)Phrase detectionAnti-phrasing (stop words)Multiple dictionariesthesauriNatural language
(who, what, where, etc.)Proximity searchSpatial relevanceContextual tagging…
RelevancePrimitives
Relevance Navigation & Information Discovery
• Algorithmic– Mining
• Preparing unstructured information for intelligent information discovery• Information/entity extraction and structural document analysis, classification, NLP...
– Matching• Tuning of recall: Spell-check, query analysis, linguistic analysis ...
– Ranking• Tuning of precision• Assigning of key parameters – The CASQF framework• Relevancy benchmark framework – Optimization through machine learning - NLP
– Navigation / Information discovery• Extensive result analysis through LiveAnalytics• Adaptive answering/discovery capability
Not only query results; Also supporting query Result driven info discovery
• Rule Based– Rules: override algorithmic search, preferred content– Automatic detection of “correct answers”
Form of Result Sets
• Traditional: Results sets are typically lists of document identifiers
• 3rd generation: Result set depending on the query intentions– Traditional result set lists
– Dynamic clustering: Supervised and unsupervised
– Dynamic drill-down: Live analytics
– ...
2 ways to search: - “I know what I want, but I don’t know where it is” - “I’m not sure what I’m looking for but I know how to get there”
2 ways to search: - “I know what I want, but I don’t know where it is” - “I’m not sure what I’m looking for but I know how to get there”
Intelligent OrganizationIntelligent Organization
The search bar
The hierarchical tree
Medline Demo- 12 million journal articles from > 4,000 journals
Dynamic Drill-Down in Auto-Extracted Entities
LiveAnalytics™: Tool for Dynamic Drill-Down
• Multi-dimensional navigation and binning tool– Multiple ways to drill down through structured data
– E.g. database rows, metadata, automatically extracted attributes (entities)
• Data properties may be:– Textual: author, publication, …
– Numeric: date, price, …
– Free text: abstract, ...
• Source may be historical or live feeds
Relevance Rule Based
• Algorithmic– Mining
• Preparing unstructured information for intelligent information discovery• Information/entity extraction and structural document analysis, classification, NLP...
– Matching• Tuning of recall: Spell-check, query analysis, linguistic analysis ...
– Ranking• Tuning of precision• Assigning of key parameters – The CASQF framework• Relevancy benchmark framework – Optimization through machine learning - NLP
– Navigation / Information discovery• Extensive result analysis through LiveAnalytics• Adaptive answering/discovery capability
Not only query results; Also supporting query Result driven info discovery
• Rule Based– Rules: override algorithmic search, preferred content– Automatic detection of “correct answers”
Scirus
Scirus is the leading online search enginefor scientific content
ProprietaryDatabases
ValueAdded
Functionalities
ScientificWeb Pages
Twice winner of SEW Best Specialty
Search Engine award
120 million Web pages(.edu, .gov, .org, .com, …)
18M articlerecords (Medline, SciencDirect, …)
• Large-scale content aggregation• Automatic content & page classificat.• Query refinements (1-D drill-down)
Summary
• Search engines can do more than just search…– Total information management system– Open, scalable and modular architecture: Allows for customization– Adapts to content and queries– Powerful data discovery and navigation
• Many exciting technology developments to come– More advanced content and query analysis– Adaptive query- & content-sensitive matching– Dynamic result set presentation and navigation– …