Books and Webs: Pulling the Down Rows
![Page 1: Books and Webs: Pulling the Down Rows](https://reader038.fdocuments.net/reader038/viewer/2022110119/5558532dd8b42a993b8b4a99/html5/thumbnails/1.jpg)
Peter Brantley Internet Archive The Presidio 11.09
![Page 2: Books and Webs: Pulling the Down Rows](https://reader038.fdocuments.net/reader038/viewer/2022110119/5558532dd8b42a993b8b4a99/html5/thumbnails/2.jpg)
Essential premise :
combining web search with book search is an
engineering challenge
![Page 3: Books and Webs: Pulling the Down Rows](https://reader038.fdocuments.net/reader038/viewer/2022110119/5558532dd8b42a993b8b4a99/html5/thumbnails/3.jpg)
I. Presenting combined search
![Page 4: Books and Webs: Pulling the Down Rows](https://reader038.fdocuments.net/reader038/viewer/2022110119/5558532dd8b42a993b8b4a99/html5/thumbnails/4.jpg)
For several years, I served the University of California as the Director of Technology for the California Digital Library.
(the digital library group for the UC system)
![Page 5: Books and Webs: Pulling the Down Rows](https://reader038.fdocuments.net/reader038/viewer/2022110119/5558532dd8b42a993b8b4a99/html5/thumbnails/5.jpg)
We held various conversations over time with Google engineers in similar spaces ...
grappling with the indexing, search, and user interface issues with combined but disparate content pools (books, journals, web, image, video).
(an important issue for digital libraries)
![Page 6: Books and Webs: Pulling the Down Rows](https://reader038.fdocuments.net/reader038/viewer/2022110119/5558532dd8b42a993b8b4a99/html5/thumbnails/6.jpg)
In academic info markets, “metasearch” (distributed queries with central resolution) contested for primacy with search over aggregated content.
To an extent, only LANL and commercial search pursued aggregation at scale.
Aggregation wins.
![Page 7: Books and Webs: Pulling the Down Rows](https://reader038.fdocuments.net/reader038/viewer/2022110119/5558532dd8b42a993b8b4a99/html5/thumbnails/7.jpg)
“Google is undertaking the most radical change to its search results ever, introducing a "Universal Search" system that will blend listings from its news, video, images, local and book search engines among those it gathers from crawling web pages.”
“With Universal Search, Google will hit a range of its vertical search engines, then decide if the relevancy of a result from book search is higher than a match from web page search.”
Danny Sullivan, “Google 2.0”, May 16 2007, Search Engine Land
![Page 8: Books and Webs: Pulling the Down Rows](https://reader038.fdocuments.net/reader038/viewer/2022110119/5558532dd8b42a993b8b4a99/html5/thumbnails/8.jpg)
Simple search box ... but
User search intentionality for books vs. web can differ
“mark twain hawai’i”
![Page 9: Books and Webs: Pulling the Down Rows](https://reader038.fdocuments.net/reader038/viewer/2022110119/5558532dd8b42a993b8b4a99/html5/thumbnails/9.jpg)
Google Scholar is a vertical search engine.
Explicit opt-in discovery service for STM journal content, utilized in HE academia.
Many concerns with combining the Scholar product with Big Daddy. User search goals differ; content distinct; different indexing.
![Page 10: Books and Webs: Pulling the Down Rows](https://reader038.fdocuments.net/reader038/viewer/2022110119/5558532dd8b42a993b8b4a99/html5/thumbnails/10.jpg)
From 2007 – early 2009, I was the Director of the Digital Library Federation. I made a request of Google to update members on GBS status at DLF’s Fall Forum, Nov. 2008.
They issued an explicit request for HE CS/EE attention to the problem of integrating book and web search. Paraphrasing: “Not a well solved problem”.
![Page 11: Books and Webs: Pulling the Down Rows](https://reader038.fdocuments.net/reader038/viewer/2022110119/5558532dd8b42a993b8b4a99/html5/thumbnails/11.jpg)
Some comparisons between web pages
and books.
![Page 12: Books and Webs: Pulling the Down Rows](https://reader038.fdocuments.net/reader038/viewer/2022110119/5558532dd8b42a993b8b4a99/html5/thumbnails/12.jpg)
web: short doc (web page) length
books: long doc (book) length
![Page 13: Books and Webs: Pulling the Down Rows](https://reader038.fdocuments.net/reader038/viewer/2022110119/5558532dd8b42a993b8b4a99/html5/thumbnails/13.jpg)
web: high data density (per doc size)
books: highly variant data density (e.g. fiction vs. non-fiction)
![Page 14: Books and Webs: Pulling the Down Rows](https://reader038.fdocuments.net/reader038/viewer/2022110119/5558532dd8b42a993b8b4a99/html5/thumbnails/14.jpg)
web: trillions of unique web pages
books: (low) millions of unique books
![Page 15: Books and Webs: Pulling the Down Rows](https://reader038.fdocuments.net/reader038/viewer/2022110119/5558532dd8b42a993b8b4a99/html5/thumbnails/15.jpg)
web: many complex media types
books: text and image media
![Page 16: Books and Webs: Pulling the Down Rows](https://reader038.fdocuments.net/reader038/viewer/2022110119/5558532dd8b42a993b8b4a99/html5/thumbnails/16.jpg)
web: dynamic over time (avg. TTL of web pages is short)
books: static over time (print books permanently fixed)
![Page 17: Books and Webs: Pulling the Down Rows](https://reader038.fdocuments.net/reader038/viewer/2022110119/5558532dd8b42a993b8b4a99/html5/thumbnails/17.jpg)
web: single instances (web pages)
books: duplicate instances (copies), similar instances (editions), in multiple languages
![Page 18: Books and Webs: Pulling the Down Rows](https://reader038.fdocuments.net/reader038/viewer/2022110119/5558532dd8b42a993b8b4a99/html5/thumbnails/18.jpg)
web: hyperlinked in/out (useful in relevance)
books: normally quiescent (sometimes citations)
![Page 19: Books and Webs: Pulling the Down Rows](https://reader038.fdocuments.net/reader038/viewer/2022110119/5558532dd8b42a993b8b4a99/html5/thumbnails/19.jpg)
web: designed component structure {page hierarchy > web site}
books: artificial component structure {page images > book}
![Page 20: Books and Webs: Pulling the Down Rows](https://reader038.fdocuments.net/reader038/viewer/2022110119/5558532dd8b42a993b8b4a99/html5/thumbnails/20.jpg)
Bibliographic data cf. full text (book) data:
The Melvyl Recommender Project Full Text Extension
(Supplementary Report) California Digital Library
October 2006
Funded by the Andrew W. Mellon Foundation
![Page 21: Books and Webs: Pulling the Down Rows](https://reader038.fdocuments.net/reader038/viewer/2022110119/5558532dd8b42a993b8b4a99/html5/thumbnails/21.jpg)
Project Lead Peter Brantley, Director of Technology
Implementation Team
Kirk Hastings, Text Systems Designer
Martin Haye, Programmer (Contractor)
Steve Toub, Web Design Manager
Colleen Whitney, Programmer and Coordinator
Assessment Team
Jane Lee, Assessment Analyst
Felicia Poe, Assessment Coordinator
Lisa Schiff, Digital Ingest Programmer
![Page 22: Books and Webs: Pulling the Down Rows](https://reader038.fdocuments.net/reader038/viewer/2022110119/5558532dd8b42a993b8b4a99/html5/thumbnails/22.jpg)
Often many different editions of popular books. Can easily artificially boost search (n_copies).
e.g. “Moby Dick” published 100s of times (and in many languages)
Depending on publication date: either public domain (dep. on country) or in-copyright (out-of-print or in-print)
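One way to keep many editions of one work from inflating its weight is to group records under a shared work key before counting or ranking. This is a minimal sketch, not the CDL or Google approach: the field names (`title`, `author`, `year`), the helper names, and the sample records are all hypothetical. The key is deliberately crude and would miss variant titles, translations, and differently ordered author names.

```python
import re
from collections import defaultdict

def work_key(title, author):
    """Crude work identity: lowercase, punctuation collapsed to spaces."""
    def norm(s):
        return re.sub(r"[^a-z0-9]+", " ", s.lower()).strip()
    return (norm(title), norm(author))

def collapse_editions(records):
    """Group edition records under one key per work."""
    works = defaultdict(list)
    for rec in records:
        works[work_key(rec["title"], rec["author"])].append(rec)
    return works

# Hypothetical sample records, for illustration only.
records = [
    {"title": "Moby-Dick", "author": "Herman Melville", "year": 1851},
    {"title": "Moby Dick", "author": "Herman Melville", "year": 1930},
    {"title": "MOBY DICK", "author": "Herman Melville", "year": 1992},
]
works = collapse_editions(records)
for key, editions in works.items():
    print(key, "editions:", len(editions))
```

A ranker could then score each work once and show the editions as a group, rather than letting n_copies act as a boost.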
![Page 23: Books and Webs: Pulling the Down Rows](https://reader038.fdocuments.net/reader038/viewer/2022110119/5558532dd8b42a993b8b4a99/html5/thumbnails/23.jpg)
In CDL tests, for texts vs. bib records:
Search scoring for full text documents was typically 10 to 100 times larger than for metadata-only records.
(Probably about the same magnitude of difference compared with representative web pages.)
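A common way to stop a source with systematically larger raw scores from dominating a merged result list is to rescale each source before interleaving. The sketch below assumes per-source min-max scaling. The slides do not say which method CDL or Google used, `normalize` and `merge` are hypothetical names, and the scores are invented.

```python
def normalize(results):
    """Min-max scale (doc_id, score) pairs into [0, 1] within one source."""
    if not results:
        return []
    scores = [score for _, score in results]
    lo, hi = min(scores), max(scores)
    span = (hi - lo) or 1.0  # avoid dividing by zero when all scores are equal
    return [(doc, (score - lo) / span) for doc, score in results]

def merge(web, books):
    """Interleave two result lists by their rescaled scores."""
    combined = normalize(web) + normalize(books)
    return sorted(combined, key=lambda pair: pair[1], reverse=True)

# Invented raw scores: the book scores are roughly 100 times the web scores.
web = [("w1", 2.0), ("w2", 1.0)]
books = [("b1", 150.0), ("b2", 30.0), ("b3", 90.0)]
for doc, score in merge(web, books):
    print(doc, round(score, 2))
```

Min-max scaling discards the absolute scale, so it removes the bias but also any real difference in relevance between sources. That is one reason blending remains an engineering problem.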
![Page 24: Books and Webs: Pulling the Down Rows](https://reader038.fdocuments.net/reader038/viewer/2022110119/5558532dd8b42a993b8b4a99/html5/thumbnails/24.jpg)
Easy for a single work to overwhelm web pages in relevance for a well-fitting query.
E.g. “English working class labor industrial”
The making of the English working class. Author: E P Thompson Publisher: New York, Pantheon Books [1964, ©1963]
![Page 25: Books and Webs: Pulling the Down Rows](https://reader038.fdocuments.net/reader038/viewer/2022110119/5558532dd8b42a993b8b4a99/html5/thumbnails/25.jpg)
Books are long strings of many words, split into n_sized chunks for parsing.
Term indexing based on overlapping and variant length “word vectors”
“battle” “of” “britain”
“battle of” “britain”
“battle” “of britain”
“battle of britain”
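The overlapping, variable-length terms shown above can be generated as contiguous word n-grams. This is a minimal sketch, assuming whitespace tokenization and a hypothetical `max_len` cap. A production indexer would tokenize more carefully.

```python
def word_ngrams(text, max_len=3):
    """Return every contiguous word sequence of length 1..max_len."""
    words = text.lower().split()
    grams = []
    for n in range(1, max_len + 1):
        for start in range(len(words) - n + 1):
            grams.append(" ".join(words[start:start + n]))
    return grams

terms = word_ngrams("Battle of Britain")
print(terms)
# ['battle', 'of', 'britain', 'battle of', 'of britain', 'battle of britain']
```

The number of terms grows quickly with text length and `max_len`, which is part of why book-length documents strain an index built this way.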
![Page 26: Books and Webs: Pulling the Down Rows](https://reader038.fdocuments.net/reader038/viewer/2022110119/5558532dd8b42a993b8b4a99/html5/thumbnails/26.jpg)
{Search Term} and {Document} weights
1. How often is a search term found within a given sized chunk of text?
2. How many chunks of text is the term found within?
3. How many chunks of text does the document contain?
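The three questions above map onto the ingredients of a TF-IDF style weight computed per chunk. The sketch assumes fixed-size word chunks and a smoothed inverse chunk frequency. The function names, the chunk size, and the exact formula are illustrative assumptions, not the weighting used in the CDL tests.

```python
import math

def chunk(words, size):
    """Split a word list into consecutive chunks of at most `size` words."""
    return [words[i:i + size] for i in range(0, len(words), size)]

def term_stats(text, term, size=4):
    chunks = chunk(text.lower().split(), size)
    per_chunk = [c.count(term) for c in chunks]          # question 1
    chunks_with_term = sum(1 for n in per_chunk if n)    # question 2
    total_chunks = len(chunks)                           # question 3
    return per_chunk, chunks_with_term, total_chunks

def chunk_weight(count_in_chunk, chunks_with_term, total_chunks):
    """Term frequency times a smoothed inverse chunk frequency."""
    icf = math.log((1 + total_chunks) / (1 + chunks_with_term)) + 1
    return count_in_chunk * icf

text = "the whale the sea the ship a whale again and the end"
per_chunk, with_term, total = term_stats(text, "whale")
print(per_chunk, with_term, total)
print([round(chunk_weight(n, with_term, total), 3) for n in per_chunk])
```

A long book yields thousands of chunks, so the third quantity alone can differ by orders of magnitude between a book and a web page.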
![Page 27: Books and Webs: Pulling the Down Rows](https://reader038.fdocuments.net/reader038/viewer/2022110119/5558532dd8b42a993b8b4a99/html5/thumbnails/27.jpg)
Which is better?
1. Adequate matches over many fields,
2. Better matches in fewer fields.
Metrics vary between books and web. One learns from one’s mistakes. More books, more mistakes.
![Page 28: Books and Webs: Pulling the Down Rows](https://reader038.fdocuments.net/reader038/viewer/2022110119/5558532dd8b42a993b8b4a99/html5/thumbnails/28.jpg)
1. Books are sooo much longer than web pages.
2. Books produce 1000’s more chunks than web.
3. Term weighting is very complex for long docs.
4. Indexes must be integrated for web and books.
5. But source term indexes are biased differently.
![Page 29: Books and Webs: Pulling the Down Rows](https://reader038.fdocuments.net/reader038/viewer/2022110119/5558532dd8b42a993b8b4a99/html5/thumbnails/29.jpg)
II. What you get from books
![Page 30: Books and Webs: Pulling the Down Rows](https://reader038.fdocuments.net/reader038/viewer/2022110119/5558532dd8b42a993b8b4a99/html5/thumbnails/30.jpg)
The dialectic between books and web provides benefits from their integration (no matter the pain).
Books enrich general web search, not just via the data within books, but also by books-as-data.
![Page 31: Books and Webs: Pulling the Down Rows](https://reader038.fdocuments.net/reader038/viewer/2022110119/5558532dd8b42a993b8b4a99/html5/thumbnails/31.jpg)
All search is made smarter by analysis.
1. structure
2. contextualization
3. relatedness
4. normalization
5. association
![Page 32: Books and Webs: Pulling the Down Rows](https://reader038.fdocuments.net/reader038/viewer/2022110119/5558532dd8b42a993b8b4a99/html5/thumbnails/32.jpg)
Because of digitization, books have complications cf. web pages, a result of OCR.
1. Language detection
2. Determining which words get indexed (minus stop words like “of” “a” “the” etc.)
3. OCR mistakes hamper word recognition
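Language detection and stop-word filtering interact, because the stop list depends on the language. The sketch below uses a deliberately crude heuristic: it guesses the language by counting stop-word hits, then drops that language's stop words. The tiny word lists and the function names are assumptions for illustration. Real systems use much larger lists and statistical language identification.

```python
# Tiny illustrative stop lists; real ones are far longer.
STOP_WORDS = {
    "en": {"the", "of", "a", "and", "in", "to"},
    "fr": {"le", "la", "les", "de", "et", "un"},
}

def guess_language(words):
    """Pick the language whose stop words occur most often in the text."""
    hits = {lang: sum(w in stops for w in words)
            for lang, stops in STOP_WORDS.items()}
    return max(hits, key=hits.get)

def index_terms(text):
    """Return the guessed language and the words left after stop-word removal."""
    words = text.lower().split()
    lang = guess_language(words)
    return lang, [w for w in words if w not in STOP_WORDS[lang]]

print(index_terms("The making of the English working class"))
print(index_terms("le rire et la peur de la foule"))
```

Dropping stop words also removes them from phrases such as “battle of britain”, so phrase indexing and stop lists have to be designed together.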
![Page 33: Books and Webs: Pulling the Down Rows](https://reader038.fdocuments.net/reader038/viewer/2022110119/5558532dd8b42a993b8b4a99/html5/thumbnails/33.jpg)
Common OCR traps:
embedded languages
Latin or archaic spelling
complex scripts (e.g. captions)
hyphenated words
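The hyphenated-word trap can be illustrated with a small repair pass. When a word is split across a line break, the sketch joins the halves only if the joined form is in a known vocabulary, and otherwise keeps the hyphen. The `vocabulary` set and the function name are hypothetical, and real OCR post-processing is considerably more involved.

```python
import re

def dehyphenate(text, vocabulary):
    """Rejoin words split by a hyphen at a line break when the result is a known word."""
    def repair(match):
        left, right = match.group(1), match.group(2)
        joined = left + right
        if joined.lower() in vocabulary:
            return joined              # e.g. "typo-" + "graphical"
        return left + "-" + right      # keep a genuine hyphenated compound
    return re.sub(r"(\w+)-\s*\n\s*(\w+)", repair, text)

page = "typo-\ngraphical errors in well-\nfitting queries"
print(dehyphenate(page, {"typographical"}))
```

The vocabulary lookup is the weak point. A larger corpus gives a better vocabulary, which is one way more books help.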
![Page 34: Books and Webs: Pulling the Down Rows](https://reader038.fdocuments.net/reader038/viewer/2022110119/5558532dd8b42a993b8b4a99/html5/thumbnails/34.jpg)
ricain ricaine ricaines ricana ricanai ricains rical rically ricals
ricanant ricanante ricane ricamente ricanement ricanements rican ricanes ricans
![Page 35: Books and Webs: Pulling the Down Rows](https://reader038.fdocuments.net/reader038/viewer/2022110119/5558532dd8b42a993b8b4a99/html5/thumbnails/35.jpg)
More words from more books, more spelling mistakes.
This is a good thing!
Leads to improved spelling correction (in multiple languages) and more sensitive translation.
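Why more text helps spelling correction can be seen in a frequency-based corrector: candidates one edit away from an unknown word are ranked by how often they occur in the corpus. This is a minimal sketch of that well-known approach, limited to lowercase ASCII letters and a single edit. The sample counts are invented, and it is not presented as how any particular search engine does it.

```python
import string
from collections import Counter

def edits1(word):
    """All strings one delete, transpose, replace, or insert away from `word`."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word, counts):
    """Return the most frequent known word within one edit, else the word itself."""
    if word in counts:
        return word
    candidates = [w for w in edits1(word) if w in counts]
    return max(candidates, key=counts.get) if candidates else word

# Invented corpus counts, for illustration only.
counts = Counter("the whale the whale the while whales".split())
print(correct("whalc", counts), correct("whle", counts), correct("xyzzy", counts))
```

Both the candidate set and the ranking come from the corpus, so a larger and more varied corpus directly improves the corrections.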
![Page 36: Books and Webs: Pulling the Down Rows](https://reader038.fdocuments.net/reader038/viewer/2022110119/5558532dd8b42a993b8b4a99/html5/thumbnails/36.jpg)
“Our understanding of language is, in large part, built inductively from statistical analysis of large samples of language as used ‘in the wild,’ and the larger the sample, the better our understanding.”
- Hank Bromley, IA
![Page 37: Books and Webs: Pulling the Down Rows](https://reader038.fdocuments.net/reader038/viewer/2022110119/5558532dd8b42a993b8b4a99/html5/thumbnails/37.jpg)
“Before the 1930’s, and even 40’s or 50’s in some parts, at harvest time, a horse or mule drawn wagon would go through the field, straddling two rows of corn. Adults working on each side of the wagon would pull the corn from the standing corn stalks and toss it into the wagon. The unfortunate younger ones would have to pull corn from the down rows – stoop labor in its worst form.” - JDB
![Page 38: Books and Webs: Pulling the Down Rows](https://reader038.fdocuments.net/reader038/viewer/2022110119/5558532dd8b42a993b8b4a99/html5/thumbnails/38.jpg)
Statistical analysis (of which terms tend to appear in the vicinity of which others) is useful not only for context-sensitive OCR, but more significantly, for building semantic maps and other kinds of knowledge representation.
“dead as a door nail” – the term “door nail” is not commonly found elsewhere.
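One standard way to quantify which terms tend to appear near which others is pointwise mutual information (PMI) over a sliding window. The sketch below is an assumption about how such an analysis might be set up. The window size, the function name, and the three-sentence corpus are invented for illustration.

```python
import math
from collections import Counter

def pmi_table(sentences, window=2):
    """PMI for word pairs that co-occur within `window` words of each other."""
    word_counts = Counter()
    pair_counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        word_counts.update(words)
        for i, w in enumerate(words):
            for v in words[i + 1:i + 1 + window]:
                pair_counts[tuple(sorted((w, v)))] += 1
    total_words = sum(word_counts.values())
    total_pairs = sum(pair_counts.values())
    pmi = {}
    for (w, v), n in pair_counts.items():
        p_pair = n / total_pairs
        p_w = word_counts[w] / total_words
        p_v = word_counts[v] / total_words
        pmi[(w, v)] = math.log2(p_pair / (p_w * p_v))
    return pmi

corpus = ["dead as a door nail", "a door in the wall", "the wall was a ruin"]
pmi = pmi_table(corpus)
print(round(pmi[("door", "nail")], 2), round(pmi[("a", "door")], 2))
```

In this toy corpus the rare pairing of “door” and “nail” scores higher than the more frequent pairing of “a” and “door”, because PMI rewards pairs that occur together more than their individual frequencies would predict.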
![Page 39: Books and Webs: Pulling the Down Rows](https://reader038.fdocuments.net/reader038/viewer/2022110119/5558532dd8b42a993b8b4a99/html5/thumbnails/39.jpg)
Analysis via co-occurrence enables one to construct a better general search engine by enhancing the ability to distinguish among multiple meanings of a given word based on the context in which the word occurs.
![Page 40: Books and Webs: Pulling the Down Rows](https://reader038.fdocuments.net/reader038/viewer/2022110119/5558532dd8b42a993b8b4a99/html5/thumbnails/40.jpg)
LSA (latent semantic analysis) is a CS term referring to a technique in “natural language processing ... of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms.”
- Wikipedia.org
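To make the definition concrete, here is a small pure-Python sketch of the core of LSA. It builds term-count vectors, takes the top eigenvectors of the document-by-document matrix (equivalent to a truncated SVD of the term-document matrix), and compares documents in the reduced concept space. The power-iteration solver, the function names, the choice of two concepts, and the five toy documents are all assumptions for illustration. Real systems use a numerical library and weighted counts.

```python
import math
import random

def gram_matrix(docs):
    """Document-by-document dot products of raw term-count vectors (A^T A)."""
    bags = [doc.lower().split() for doc in docs]
    vocab = sorted({w for bag in bags for w in bag})
    vectors = [[bag.count(w) for w in vocab] for bag in bags]
    return [[sum(x * y for x, y in zip(u, v)) for v in vectors] for u in vectors]

def top_eigenpairs(matrix, k, iterations=300, seed=0):
    """Power iteration with deflation on a symmetric positive semidefinite matrix."""
    rng = random.Random(seed)
    n = len(matrix)
    m = [row[:] for row in matrix]
    pairs = []
    for _ in range(k):
        v = [rng.random() + 0.1 for _ in range(n)]
        for _ in range(iterations):
            w = [sum(m[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w)) or 1.0
            v = [x / norm for x in w]
        value = sum(v[i] * sum(m[i][j] * v[j] for j in range(n)) for i in range(n))
        pairs.append((value, v))
        for i in range(n):               # deflate: remove the component just found
            for j in range(n):
                m[i][j] -= value * v[i] * v[j]
    return pairs

def concept_coordinates(docs, k=2):
    """Coordinates of each document in the top-k concept space (V_k * Sigma_k)."""
    pairs = top_eigenpairs(gram_matrix(docs), k)
    return [[math.sqrt(max(val, 0.0)) * vec[d] for val, vec in pairs]
            for d in range(len(docs))]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

docs = ["whale ship", "ship sea", "sea voyage", "labor class", "class industry"]
coords = concept_coordinates(docs)
print(round(cosine(coords[0], coords[2]), 3), round(cosine(coords[0], coords[3]), 3))
```

“whale ship” and “sea voyage” share no words, yet they come out nearly parallel in concept space because “ship sea” links them. The labor documents stay nearly orthogonal to them.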
![Page 41: Books and Webs: Pulling the Down Rows](https://reader038.fdocuments.net/reader038/viewer/2022110119/5558532dd8b42a993b8b4a99/html5/thumbnails/41.jpg)
(LSI = LSA in context of info retrieval (IR).)
“Clustering is a way to group documents based on their conceptual similarity to each other ... . This is very useful when dealing with an unknown collection of unstructured text.”
![Page 42: Books and Webs: Pulling the Down Rows](https://reader038.fdocuments.net/reader038/viewer/2022110119/5558532dd8b42a993b8b4a99/html5/thumbnails/42.jpg)
“Because it uses a strictly mathematical approach, LSI is inherently independent of language. This enables LSI to elicit the semantic content of information written in any language without requiring the use of auxiliary structures, such as dictionaries and thesauri.”
![Page 43: Books and Webs: Pulling the Down Rows](https://reader038.fdocuments.net/reader038/viewer/2022110119/5558532dd8b42a993b8b4a99/html5/thumbnails/43.jpg)
“[Q]ueries can be made in one language, such as English, and conceptually similar results will be returned even if they are composed of an entirely different language or of multiple languages.”
![Page 44: Books and Webs: Pulling the Down Rows](https://reader038.fdocuments.net/reader038/viewer/2022110119/5558532dd8b42a993b8b4a99/html5/thumbnails/44.jpg)
“LSI automatically adapts to new and changing terminology, and it has been shown to be very tolerant of noise (i.e., misspelled words, typographical errors, unreadable characters, etc.).
“This is especially important for applications using text derived from Optical Character Recognition (OCR) ...” - Wikipedia.org
![Page 45: Books and Webs: Pulling the Down Rows](https://reader038.fdocuments.net/reader038/viewer/2022110119/5558532dd8b42a993b8b4a99/html5/thumbnails/45.jpg)
The More Data, The Better ...
The More Books, The Better Web Search.
![Page 46: Books and Webs: Pulling the Down Rows](https://reader038.fdocuments.net/reader038/viewer/2022110119/5558532dd8b42a993b8b4a99/html5/thumbnails/46.jpg)
Contact information:
peter brantley
internet archive
@naypinya (twitter)
peter @ archive.org