Publisher Name Authority Project: An Attempt to Enhance Data Mining for Collection Analysis & Comparison
Lynn Silipigni Connaway, Consulting Research Scientist III
Akeisha Heard, Technical Intern
XXV Annual Charleston Conference, 04 November 2005
Introduction
Research Goals
Develop a service to support advanced collection intelligence
Cluster collected objects based on their issuing entity
• As can be determined via metadata about the objects
• Gain intelligence about the nature of individual publishers
• Collection intelligence
• Acquisition patterns
• User behavior
Research Objectives
Resolve
• ISBN prefixes to publisher name
• Variant publisher names to a preferred form
Capture and make available for use various attributes of individual publishers
• Location of publisher
• Language(s) of materials published
• Genre(s)/format(s) of materials published
• Dominant subject domain(s) of the publisher's output
• Parent company and subsidiaries
Theoretical Foundation: Authority Control
Adhere to authorized form
• Personal names
• Corporate entities
Why no authorized form for publishing entities?
Pragmatic Foundation: Collection Development
Identified publisher series
• Retrospective conversion project (1984)
Family tree
• Which publishers are related?
Approval plans
• Which publishers publish which subjects?
Pragmatic Foundation: OCLC WorldCat Data Mining
Collection Analysis
• Which libraries have the most items by a publisher in a particular subject area?
• How do library holdings by publisher compare?
E-books for a particular STM publisher (2000)
• Cataloged as reproductions
• 2 publishers!
Pragmatic Foundation: Citation Analysis
Sweetland (1989)
• Reader functions of citations
• Information retrieval via citation databases
• Document retrieval
• Includes interlibrary loan verification
• Bibliometrics
• Faculty and researcher productivity measure
Other functions
• Creation of references/bibliographies
Pragmatic Foundation: Education for Librarians
Collection development & acquisitions librarian education
• Subject focuses of publishers
• Parent and subsidiary relationships
Specialized Corporate Authority Files
ACOLIT (Ruggeri, 2004)
• Names, uniform titles, Italian and international Catholic institutions, Catholic religious communities, and institutions
• Related to the Catholic Church, Papal State, and Vatican City State
COPAR (Boddaert, 2004)
• French official corporate bodies
• Mainly national and preceding the French Revolution
CORELI (Boddaert, 2004)
• Religious corporate bodies from 3 French ancient specialized catalogues
Specialized Corporate Authority Files
Chinese Modern Author Authority Database (Hu, Tam & Lo, 2004)
• Chinese authors of expanded works and Chinese corporate bodies since 1912
Chinese Name Authority Database (Hu, Tam & Lo, 2004)
• Mainly Taiwanese personal names with some Taiwanese corporate bodies
Specialized Corporate Authority Files
Case study by Elias & Fair (1983)
• Standard Oil Co.’s Media Query File
• No authority control
• 3 professionals in 6 months averaged 12 telephone calls/day from reporters
• Decided against canonical list for media names
• Noted 20 unique variants for Wall Street Journal including WSJ, Wall St. Jnl, Wall Street Jnl
Specialized Corporate Authority Files
Case study by French, Powell & Schulman (1997, 2000)
• Smithsonian Astrophysical Observatory’s Astrophysics Data System database
• Programmatically identify author affiliations and map variant names to canonical name
• Investigated various techniques separately and iteratively to bring variants together including:
• Lexical cleanup
• Data clustering algorithms
• Approximate string-matching
• Reduced number of unique strings by 55%
• Required manual review of clusters
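The combination of lexical cleanup and approximate string-matching named above can be illustrated with a small Python sketch. It is not the method used by French, Powell & Schulman; the function names, the greedy grouping strategy, and the 0.8 threshold are assumptions chosen for illustration, and the sample strings are the Wall Street Journal variants quoted in the Elias & Fair case study.

```python
import re
from difflib import SequenceMatcher


def lexical_cleanup(name: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    name = re.sub(r"[^\w\s]", " ", name.lower())
    return " ".join(name.split())


def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two cleaned strings."""
    return SequenceMatcher(None, lexical_cleanup(a), lexical_cleanup(b)).ratio()


def group_variants(names, threshold=0.8):
    """Greedy grouping (an assumed strategy): attach each name to the first
    group whose first member is similar enough, else start a new group."""
    groups = []
    for name in names:
        for group in groups:
            if similarity(name, group[0]) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups


variants = ["Wall Street Journal", "Wall Street Jnl", "wall street journal.", "WSJ"]
for group in group_variants(variants):
    print(group)
```

In this sketch the initialism "WSJ" ends up in a group of its own, since character-level similarity alone does not connect it to the spelled-out name. That is one reason clusters of this kind still need manual review.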
Database Quality
Literature: Database Quality
Review by O’Neill & Vizine-Goetz (1988)
• Busch (1981)
• < 35% of 141 OCLC libraries routinely reported errors
• Pollock & Zamora (1983)
• Noted misspellings comprise 90-96% of errors & include:
• Omission
• Insertion
• Substitution
• Transposition
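The four misspelling types can be made concrete with a short Python sketch that labels a single-edit error relative to the correct form. This is an illustrative assumption, not a procedure taken from Pollock & Zamora; the function name and the "other" fallback for multi-edit errors are hypothetical.

```python
def classify_single_error(correct: str, observed: str) -> str:
    """Label a one-edit misspelling of `correct`.

    Returns 'none', 'omission', 'insertion', 'substitution',
    'transposition', or 'other' (more than one edit).
    """
    if correct == observed:
        return "none"
    # Skip the shared prefix; the first difference starts the error.
    i = 0
    while i < min(len(correct), len(observed)) and correct[i] == observed[i]:
        i += 1
    c_rest, o_rest = correct[i:], observed[i:]
    if len(observed) == len(correct) - 1 and c_rest[1:] == o_rest:
        return "omission"        # one character dropped
    if len(observed) == len(correct) + 1 and o_rest[1:] == c_rest:
        return "insertion"       # one extra character
    if len(observed) == len(correct):
        if c_rest[1:] == o_rest[1:]:
            return "substitution"  # one character replaced
        if (len(c_rest) >= 2 and c_rest[0] == o_rest[1]
                and c_rest[1] == o_rest[0] and c_rest[2:] == o_rest[2:]):
            return "transposition"  # two adjacent characters swapped
    return "other"


print(classify_single_error("Tallahassee", "Tallahasee"))
print(classify_single_error("Methuen", "Metheun"))
```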
Literature: Database Quality
Intner (1989)
• Reviewed 215 matching records in OCLC and RLIN
• Errors relating to publishers:

Error type                    OCLC count (total)  OCLC %  RLIN count (total)  RLIN %
Application of AACR2 & LCRI   64 (205)            31.2    52 (191)            27.2
MARC tagging in 260 field     4 (25)              16.0    3 (26)              11.5
Typographic errors            4 (32)              12.5    6 (45)              13.3
Literature: Database Quality
Romero (1994)
• Evaluated cataloging of library science students
• Noted 221 errors (28.22%) in the publisher description area
Issues: Historical Practices
Different rules for abbreviations
• LC Rule Interpretation B.14
• State postal (2-letter) abbreviation if it appears in the item along with the place
• Anglo-American Cataloguing Rules, Revised (2002)
• Abbreviations included in Appendix B.14
Issues: Historical Practices
ALA Catalog Rules (1941)
• Multiple places of publication and publishers and neither or first is prominent
• Include first listed first, indicate omission
• Multiple places of publication and publishers and first is not prominent
• Include prominent first
• Include first listed second
• Unknown place of publication – [n.p.]
Issues: Historical Practices
Anglo-American Cataloging Rules (1967)
• Multiple places of publication and publishers and neither or first is prominent
• Include first listed only, omit others
• Multiple places of publication and publishers and first is not prominent
• Include prominent only, omit others
• Unknown place of publication – [n. p.]
Issues: Historical Practices
Anglo-American Cataloguing Rules, Revised (2002)
• Multiple places of publication and publishers and neither or first is prominent
• Include first listed only, omit others
• Multiple places of publication and publishers and first is not prominent
• Include first listed first
• Include prominent second
• Unknown place of publication – [S.l.]
Issues: Historical and Local Practices
“u.a.”
• At least one German institution uses “u.a.” as mark of omission
• Means “et al.”
• Not an AACR2r rule
• Local practice?
• Is local practice/policy an error?
Issues: Historical and Local Practices
WorldCat enhanced records
• Eliminate or lessen the probability of these issues
Examining Quality of WorldCat
WorldCat: Publisher Name Selection Criteria
Fixed field lang = “eng”
WorldCat by Language
• English: 61%
• Non-English: 39%
WorldCat: ISBN Validation Errors
WorldCat records with ISBNs: 22.69%
ISBNs by Language
• English: 55%
• Non-English: 45%
WorldCat: ISBN Validation Errors
English Language
Valid 7,561,445 99.90%
Invalid 7,600 0.10%
All Languages
Valid 13,147,325 99.88%
Invalid 15,654 0.12%
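The slides do not say how ISBNs were validated. A minimal Python sketch of one common check, assuming 10-digit ISBNs (the form in use in 2005) and the standard modulus-11 check digit, could look like this. The function name is hypothetical.

```python
import re


def is_valid_isbn10(raw: str) -> bool:
    """Check an ISBN-10 with the modulus-11 check digit.

    Hyphens and spaces are ignored; 'X' stands for 10 in the last position.
    """
    isbn = raw.replace("-", "").replace(" ", "").upper()
    if not re.fullmatch(r"[0-9]{9}[0-9X]", isbn):
        return False
    total = 0
    for position, char in enumerate(isbn):
        value = 10 if char == "X" else int(char)
        total += (10 - position) * value  # weights run 10, 9, ..., 1
    return total % 11 == 0


print(is_valid_isbn10("0-19-852663-6"))  # a number carrying the 019 prefix
print(is_valid_isbn10("0-19-852663-5"))  # same number with a wrong check digit
```

A check of this kind only confirms that the number is well formed; it cannot tell whether a valid ISBN was attached to the wrong record.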
WorldCat: MARC Tagging Errors
Examined English language records based on some known issues and manual evaluation
Total MARC tagging errors found: 11,874 (0.03%)
WorldCat Tagging Errors
• MARC 260 vs 300 tagging: 55%
• Dates tagging: 43%
• Other: 2%
WorldCat: MARC Tagging Errors
MARC 260 vs 300 tagging
• In 260 field, information from 300 field in $a, $b, $c and/or $e
Dates tagging
• Date in $a or $b
• Five digit year
• “cm” follows year
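The date-related symptoms listed above lend themselves to simple pattern checks. The Python sketch below is an assumption about how such checks might be written, not the project's actual procedure. The dictionary representation of a 260 field, the regular expressions, and the function name are all hypothetical, and hits would still need manual evaluation.

```python
import re

# Heuristic patterns (assumed, not taken from the slides).
YEAR_IN_TEXT = re.compile(r"\b(?:1[5-9]|20)\d{2}\b")   # plausible 4-digit year
FIVE_DIGIT_YEAR = re.compile(r"(?<!\d)\d{5}(?!\d)")     # exactly five digits
CM_AFTER_YEAR = re.compile(r"\b\d{4}\b.*?cm\b", re.IGNORECASE)


def flag_260_problems(subfields: dict) -> list:
    """Return a list of suspected tagging problems for one 260 field.

    `subfields` maps subfield codes ('a', 'b', 'c') to their text.
    """
    problems = []
    for code in ("a", "b"):
        if YEAR_IN_TEXT.search(subfields.get(code, "")):
            problems.append(f"date in ${code}")
    date = subfields.get("c", "")
    if FIVE_DIGIT_YEAR.search(date):
        problems.append("five-digit year")
    if CM_AFTER_YEAR.search(date):
        problems.append('"cm" follows year')
    return problems


print(flag_260_problems({"a": "New York :", "b": "Wiley, 1998.", "c": ""}))
print(flag_260_problems({"a": "London :", "b": "Methuen,", "c": "1998. 23 cm."}))
```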
WorldCat: Typographical Errors
Used “Typographical Errors in Library Databases” to identify and quantify English language WorldCat errors (Ballard, 2005)
• Total errors: 26,599 (0.08%)
• Require manual examination to determine if actual errors
• Searching for Institi*
• Misspelled:
• American Institite of Physics
• British Standards Institition
• Spelled correctly:
• Institiúid Ard-Léinn Bhaile Átha Cliath (Dublin Institute for Advanced Studies)
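The Institi* example shows why truncated searches over-match. A small Python sketch, using only the three names quoted above plus one correctly spelled control string, reproduces the effect. The function name and the regular-expression approach are assumptions for illustration; the actual WorldCat search mechanism is not described in the slides.

```python
import re


def truncation_hits(names, stem):
    """Return names containing a word that starts with `stem`
    (case-insensitive), mimicking a truncated search such as Institi*."""
    pattern = re.compile(r"\b" + re.escape(stem) + r"\w*", re.IGNORECASE)
    return [name for name in names if pattern.search(name)]


names = [
    "American Institite of Physics",
    "British Standards Institition",
    "Institiúid Ard-Léinn Bhaile Átha Cliath",
    "American Institute of Physics",  # control: correctly spelled
]
for hit in truncation_hits(names, "Institi"):
    print(hit)
```

The search returns the two misspellings and the correctly spelled Irish name alike, so each hit has to be examined by hand before it is counted as an error.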
WorldCat: Typographical Errors
Top words (10.4%):

Word                       Probability According to Ballard  Error Type     WorldCat Count
Worchester                 Highest                           Insertion      398
Metheun                    High                              Transposition  355
Universt*                  Highest                           Omission       299
Unives*                    Highest                           Omission       275
Westminister [and] Press   Highest                           Insertion      266
Niagr*                     High                              Omission       260
Phildel*                   High                              Omission       235
Tallahasee                 High                              Omission       234
John Hopkins Press         Highest                           Omission       227
Institi*                   High                              Substitution   226
WorldCat: Typographical Errors
“Westminister”
• Only included on Ballard list in combination with other words
• Total errors in WorldCat: 628 (2.36%)
• Require manual review
Where are we now?
WorldCat: MARC 260 Evaluation
Top 10 terms in 260 $b in WorldCat
Term Count
press 2,094,111
co 1,664,005
university 1,550,435
dept 1,084,647
pub 984,234
research 853,954
service 710,314
institute 660,346
office 649,794
chu ban she 620,735
WorldCat: MARC 260 Evaluation
University Press names in 260 $b in WorldCat
Term Count
oxford 35,804
hopkins 22,564
cambridge 21,951
harvard 17,069
cornell 11,305
stanford 10,900
purdue 5,468
yale 5,076
princeton 4,746
rutgers 3,854
Clustering
Attempting programmatic clustering of publishers using ISBN prefixes
• Data clustering (The Free Dictionary)
• "The science of extracting useful information from large data sets or databases"
• Classification of similar objects into different groups
• Partitioning of a data set into subsets (clusters)
• Data in each subset (ideally) share some common trait
WorldCat: Clustering Example
Used ISBN prefix 019 (Oxford University Press)
• Total WorldCat records: 58,004,317
• Records with ISBN prefix 019: 84,276 (0.15%)
• Non-unique publisher names from ISBN prefix records: 91,528

                                         One or more 019 ISBN  All 019 ISBNs
NACO normalized unique publisher names   1,550                 1,386
Number of clusters                       919                   799
Non-singleton clusters                   222 (24.16%)          205 (25.66%)
Largest cluster                          82 text strings       81 text strings
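The measures reported for the 019 example (unique normalized names, number of clusters, non-singleton clusters, largest cluster) can be illustrated with a Python sketch. Everything in it is an assumption: `simple_normalize` is only a rough stand-in for the NACO normalization rules, the single-link clustering with a 0.8 similarity threshold is not the project's algorithm, and the sample publisher strings are invented for illustration.

```python
import unicodedata
from difflib import SequenceMatcher


def simple_normalize(name: str) -> str:
    """Rough stand-in for NACO normalization (not the full rule set):
    strip diacritics, uppercase, turn punctuation into spaces,
    collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    cleaned = "".join(c if c.isalnum() else " " for c in no_marks.upper())
    return " ".join(cleaned.split())


def cluster_names(raw_names, threshold=0.8):
    """Single-link clustering of the unique normalized names."""
    unique = sorted({simple_normalize(n) for n in raw_names} - {""})
    parent = list(range(len(unique)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(unique)):
        for j in range(i + 1, len(unique)):
            if SequenceMatcher(None, unique[i], unique[j]).ratio() >= threshold:
                parent[find(j)] = find(i)

    clusters = {}
    for i, name in enumerate(unique):
        clusters.setdefault(find(i), []).append(name)
    return list(clusters.values())


def summarize(clusters):
    """Compute the same kinds of counts as the table above."""
    return {
        "unique_names": sum(len(c) for c in clusters),
        "clusters": len(clusters),
        "non_singleton": sum(1 for c in clusters if len(c) > 1),
        "largest": max((len(c) for c in clusters), default=0),
    }


sample = [
    "Oxford University Press", "Oxford University Press,",
    "Oxford Univ. Press", "OXFORD UNIVERSITY PRESS", "Clarendon Press",
]
print(summarize(cluster_names(sample)))
```

The pairwise comparison is quadratic in the number of unique names, which is workable for a single prefix with around 1,500 names but would need a different approach at full WorldCat scale.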
Challenges: Publisher Name Authority File
Quality issue
• Level of acceptance for cluster
• What is acceptable?
Subsidiaries and Relationships
• Oxford & Auckland
• Examined manually to determine relationship
Form of name
• What is acceptable?
• Likely to use the most prominent form of name
Questions and Discussion
Contact Information: [email protected]@oclc.org
Project Web Site: http://www.oclc.org/research/projects/publisherns/