Books and Webs: Pulling the Down Rows

Peter Brantley, Internet Archive, The Presidio, 11.09

Essential premise: combining web search with book search is an engineering challenge.

I. Presenting combined search

For several years, I served the University of California as the Director of Technology for the California Digital Library.

(the digital library group for the UC system)

We held various conversations over time with Google engineers in similar spaces ...

grappling with the indexing, search, and user interface issues with combined but disparate content pools (books, journals, web, image, video).

(an important issue for digital libraries)

In academic information markets, “metasearch” (distributed queries with central resolution) competed for primacy with search over aggregated content.

To an extent, only LANL and commercial search pursued aggregation at scale.

Aggregation wins.

“Google is undertaking the most radical change to its search results ever, introducing a "Universal Search" system that will blend listings from its news, video, images, local and book search engines among those it gathers from crawling web pages.”

“With Universal Search, Google will hit a range of its vertical search engines, then decide if the relevancy of a result from book search is higher than a match from web page search.”

Danny Sullivan, “Google 2.0”, May 16 2007, Search Engine Land

Simple search box ... but

User search intentionality for books vs. web can differ

“mark twain hawai’i”

Google Scholar is a vertical search engine.

Explicit opt-in discovery service for STM journal content, used in higher-education (HE) academia.

Many concerns with combining the Scholar product with Big Daddy. User search goals differ; content distinct; different indexing.

From 2007 to early 2009, I was the Director of the Digital Library Federation. I asked Google to update members on GBS status at DLF’s Fall Forum, Nov. 2008.

They issued an explicit request for HE CS/EE attention to the problem of integrating book and web search. Paraphrasing: “Not a well solved problem”.

Some comparisons between web pages and books.

web: short doc (web page) length

books: long doc (book) length

web: high data density (per doc size)

books: highly variant data density (e.g. fiction vs. non-fiction)

web: trillions of unique web pages

books: (low) millions of unique books

web: many complex media types

books: text and image media

web: dynamic over time (avg. TTL of web pages is short)

books: static over time (print books permanently fixed)

web: single instances (web pages)

books: duplicate instances (copies), similar instances (editions), in multiple languages

web: hyperlinked in/out (useful in relevance)

books: normally quiescent (sometimes citations)

web: designed component structure {page hierarchy > web site}

books: artificial component structure {page images > book}

Bibliographic data cf. full text (book) data:

The Melvyl Recommender Project Full Text Extension

(Supplementary Report) California Digital Library

October 2006

Funded by the Andrew W. Mellon Foundation

Project Lead   Peter Brantley, Director of Technology

Implementation Team
  Kirk Hastings, Text Systems Designer
  Martin Haye, Programmer (Contractor)
  Steve Toub, Web Design Manager
  Colleen Whitney, Programmer and Coordinator

Assessment Team
  Jane Lee, Assessment Analyst
  Felicia Poe, Assessment Coordinator
  Lisa Schiff, Digital Ingest Programmer

Popular books often exist in many different editions, which can easily and artificially boost search scores (n_copies).

e.g. “Moby Dick” published 100s of times (and in many languages)
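One way to keep n_copies from inflating a result list is to collapse editions into a single work before ranking. The sketch below is only an illustration under stated assumptions: the `work_key` normalization, the `collapse_editions` helper, and the record fields (`title`, `author`, `score`) are hypothetical, and real edition matching (translations, variant author forms) is much harder than this.

```python
import re
from collections import defaultdict

def work_key(title, author):
    """Hypothetical work identifier: lowercase and strip punctuation."""
    def norm(s):
        return re.sub(r"[^a-z0-9]+", " ", s.lower()).strip()
    return (norm(title), norm(author))

def collapse_editions(hits):
    """Keep the best-scoring hit per work and record how many editions matched."""
    best = {}
    counts = defaultdict(int)
    for hit in hits:
        key = work_key(hit["title"], hit["author"])
        counts[key] += 1
        if key not in best or hit["score"] > best[key]["score"]:
            best[key] = hit
    collapsed = [{**hit, "editions": counts[key]} for key, hit in best.items()]
    return sorted(collapsed, key=lambda h: h["score"], reverse=True)

# Illustrative records and scores, not real data.
hits = [
    {"title": "Moby Dick", "author": "Herman Melville", "score": 2.0},
    {"title": "Moby-Dick", "author": "Herman Melville", "score": 3.0},
    {"title": "MOBY DICK.", "author": "Herman Melville", "score": 1.5},
    {"title": "Typee", "author": "Herman Melville", "score": 1.0},
]
for hit in collapse_editions(hits):
    print(hit["title"], hit["score"], hit["editions"])
```

The three Moby Dick records collapse into one entry that keeps the highest score instead of summing all three. A translated edition with a different title would not be caught by this key.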

Depending on publication date: either public domain (depending on country) or in-copyright (out-of-print or in-print)

In CDL tests, for texts vs. bib records:

Search scoring for full text documents was typically 10 to 100 times larger than for metadata-only records.

(Probably a similar order of magnitude compared with representative web pages.)

Easy for a single work to overwhelm web pages in relevance for a well-fitting query.

E.g. “English working class labor industrial”

  The making of the English working class.
  Author: E P Thompson
  Publisher: New York, Pantheon Books
  [1964, ©1963]

Books are long strings of many words, split into n_sized chunks for parsing.

Term indexing is based on overlapping, variable-length “word vectors”.

“battle” “of” “britain” “battle of” “britain” “battle” “of britain” “battle of britain”
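A minimal sketch of how such overlapping, variable-length terms could be generated, assuming simple whitespace tokenization. The `word_ngrams` function and its `max_n` parameter are illustrative names, not taken from any particular indexer.

```python
def word_ngrams(text, max_n=3):
    """Return every contiguous run of 1..max_n words, shortest runs first."""
    words = text.lower().split()
    grams = []
    for n in range(1, max_n + 1):
        for start in range(len(words) - n + 1):
            grams.append(" ".join(words[start:start + n]))
    return grams

print(word_ngrams("Battle of Britain"))
# ['battle', 'of', 'britain', 'battle of', 'of britain', 'battle of britain']
```

Every segmentation shown on the slide is a combination of these six terms.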

{Search Term} and {Document} weights

1.  How often is a search term found within a given sized chunk of text?

2.  How many chunks of text is the term found within?

3.  How many chunks of text does the document contain?
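The three questions above map onto three counts. The sketch below computes them for one document, assuming fixed-size, non-overlapping chunks and whitespace tokenization. The function names, the chunk size, and the smoothed logarithmic `rarity` weight are illustrative choices, not the scoring used by any specific engine.

```python
import math

def split_into_chunks(words, size):
    """Cut a word list into consecutive, non-overlapping chunks of `size` words."""
    return [words[i:i + size] for i in range(0, len(words), size)]

def term_stats(text, term, chunk_size=5):
    words = text.lower().split()
    chunks = split_into_chunks(words, chunk_size)
    hits_per_chunk = [chunk.count(term) for chunk in chunks]      # question 1
    chunks_with_term = sum(1 for hits in hits_per_chunk if hits)  # question 2
    total_chunks = len(chunks)                                    # question 3
    # Smoothed inverse chunk frequency: terms found in fewer chunks weigh more.
    rarity = math.log((1 + total_chunks) / (1 + chunks_with_term)) + 1
    return {
        "hits_per_chunk": hits_per_chunk,
        "chunks_with_term": chunks_with_term,
        "total_chunks": total_chunks,
        "rarity": rarity,
    }

text = ("the whale swam and the whale dove while the ship "
        "sailed on past the quiet harbor")
print(term_stats(text, "whale"))
print(term_stats(text, "the"))
```

A book yields thousands of chunks where a web page yields a handful, so the third count alone differs by orders of magnitude between the two kinds of document.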

Which is better?

1.  Adequate matches over many fields, or
2.  Better matches in fewer fields.

Metrics vary between books and web. One learns from one’s mistakes. More books, more mistakes.

1.  Books are sooo much longer than web pages.
2.  Books produce 1000s more chunks than web pages.
3.  Term weighting is very complex for long docs.
4.  Indexes must be integrated for web and books.
5.  But source term indexes are biased differently.
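To make the bias problem concrete: if book scores run 10 to 100 times higher than web scores, a naive merge ranks every book above every web page. The sketch below rescales each source to the range 0 to 1 before merging. Min-max rescaling is just one simple option chosen for illustration, not a description of how any production engine blends results, and the document ids and scores are invented.

```python
def normalize(scores):
    """Rescale one source's scores to the range 0..1 (min-max)."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def merge(*sources):
    """Normalize each source separately, then rank all documents together."""
    merged = {}
    for scores in sources:
        merged.update(normalize(scores))
    return sorted(merged, key=merged.get, reverse=True)

# Illustrative raw scores: the book index scores far higher than the web index.
web = {"web:a": 1.2, "web:b": 0.8, "web:c": 0.5}
books = {"book:x": 90.0, "book:y": 40.0, "book:z": 12.0}

raw = sorted({**web, **books}, key={**web, **books}.get, reverse=True)
print(raw[:3])            # all three are books
print(merge(web, books))  # web and book results interleave
```

Rescaling hides a real question, which is whether the best book really is as relevant as the best web page for this query. That judgment is where the engineering challenge lies.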

II. What you get from books

The dialectic between books and web provides benefits from their integration (no matter the pain).

Books enrich general web search, not just via the data within books, but also by books-as-data.

All search is made smarter by analysis.

1.  structure
2.  contextualization
3.  relatedness
4.  normalization
5.  association

Because they are digitized, books bring complications that web pages do not, as a result of OCR.

1.  Language detection
2.  Determining which words get indexed (excluding stop words like “of”, “a”, “the”, etc.)
3.  OCR mistakes hamper word recognition
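The first two steps are linked, because stop words differ by language. A rough sketch, assuming tiny hand-picked stop lists and a language guess based on stop-word overlap. Both the lists and the `guess_language` heuristic are illustrative only; real systems use far larger lists and statistical language identification.

```python
# Tiny illustrative stop lists; real lists are much longer.
STOP_WORDS = {
    "en": {"the", "of", "a", "and", "in", "to"},
    "fr": {"le", "la", "les", "de", "et", "un", "une"},
}

def guess_language(words):
    """Pick the language whose stop words occur most often in the text."""
    return max(STOP_WORDS, key=lambda lang: sum(w in STOP_WORDS[lang] for w in words))

def index_terms(text):
    """Return the guessed language and the words left after stop-word removal."""
    words = text.lower().split()
    lang = guess_language(words)
    return lang, [w for w in words if w not in STOP_WORDS[lang]]

print(index_terms("The making of the English working class"))
# ('en', ['making', 'english', 'working', 'class'])
print(index_terms("la guerre et la paix"))
# ('fr', ['guerre', 'paix'])
```

An OCR error in a stop word (say “tbe” for “the”) slips past both steps, which is how step 3 compounds the first two.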

Common OCR traps:

  embedded languages
  Latin or archaic spelling
  complex scripts (e.g. captions)
  hyphenated words

  ricain   ricaine   ricaines   ricana   ricanai   ricains   rical   rically   ricals

  ricanant   ricanante   ricane   ricamente   ricanement   ricanements   rican   ricanes   ricans

More words from more books, more spelling mistakes.

This is a good thing!

Leads to improved spelling correction (in multiple languages) and more sensitive translation.
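Why more mistakes help: a large corpus shows which spellings are common and which are rare near-misses of a common one. The sketch below is a frequency-based corrector in the style of well-known edit-distance spelling correctors, restricted to a single edit and the lowercase ASCII alphabet. The word counts are invented for illustration.

```python
import string
from collections import Counter

def edits1(word):
    """All strings one delete, swap, replace, or insert away from `word`."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    swaps = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + swaps + replaces + inserts)

def correct(word, counts):
    """Return the most frequent corpus word among `word` and its one-edit neighbors."""
    known = [w for w in edits1(word) | {word} if w in counts]
    return max(known, key=counts.get) if known else word

# Invented counts: one common spelling and two rare OCR variants.
counts = Counter({"american": 950, "amerlcan": 3, "americau": 2})

print(correct("amerlcan", counts))  # american
print(correct("americau", counts))  # american
print(correct("american", counts))  # american
```

The rare variants are only recognizable as errors because the corpus is large enough to make the correct form overwhelmingly more frequent. Errors two edits away, such as “rn” read for “m”, need a wider search than this sketch performs.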

“Our understanding of language is, in large part, built inductively from statistical analysis of large samples of language as used ‘in the wild,’ and the larger the sample, the better our understanding.”

- Hank Bromley, IA

“Before the 1930’s, and even 40’s or 50’s in some parts, at harvest time, a horse or mule drawn wagon would go through the field, straddling two rows of corn. Adults working on each side of the wagon would pull the corn from the standing corn stalks and toss it into the wagon. The unfortunate younger ones would have to pull corn from the down rows – stoop labor in its worst form.” - JDB

Statistical analysis (of which terms tend to appear in the vicinity of which others) is useful not only for context-sensitive OCR but, more significantly, for building semantic maps and other kinds of knowledge representation.

“dead as a door nail” – the term “door nail” is not commonly found elsewhere.

Analysis via co-occurrence enables one to construct a better general search engine by enhancing the ability to distinguish among multiple meanings of a given word based on the context in which the word occurs.
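A minimal sketch of co-occurrence counting, assuming whitespace tokenization and a symmetric window of two words on either side. The `cooccurrence` function, the window size, and the sample sentences are illustrative.

```python
from collections import Counter, defaultdict

def cooccurrence(sentences, window=2):
    """Count, for each word, which other words appear within `window` positions."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for i, word in enumerate(words):
            lo, hi = max(0, i - window), min(len(words), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][words[j]] += 1
    return counts

sentences = [
    "dead as a door nail",
    "old marley was as dead as a door nail",
    "he shut the door",
]
counts = cooccurrence(sentences)
print(counts["nail"]["door"])  # 2
print(counts["door"]["shut"])  # 1
```

Here “nail” never appears without “door” nearby, while “door” also keeps other company. Over millions of books, such asymmetries are what let an engine tell a fixed phrase from a free combination, or one sense of a word from another.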

Latent semantic analysis (LSA) is a CS term referring to a technique in “natural language processing ... of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms.”

- Wikipedia.org

(LSI, latent semantic indexing, is LSA in the context of information retrieval (IR).)

“Clustering is a way to group documents based on their conceptual similarity to each other ... . This is very useful when dealing with an unknown collection of unstructured text.”

“Because it uses a strictly mathematical approach, LSI is inherently independent of language. This enables LSI to elicit the semantic content of information written in any language without requiring the use of auxiliary structures, such as dictionaries and thesauri.”

“[Q]ueries can be made in one language, such as English, and conceptually similar results will be returned even if they are composed of an entirely different language or of multiple languages.”

“LSI automatically adapts to new and changing terminology, and it has been shown to be very tolerant of noise (i.e., misspelled words, typographical errors, unreadable characters, etc.).”

“This is especially important for applications using text derived from Optical Character Recognition (OCR) ...” - Wikipedia.org

The More Data, The Better ...

The More Books, The Better Web Search.

Contact information:

peter brantley
internet archive
@naypinya (twitter)
peter @ archive.org
