Publisher Name Authority Project: An Attempt to Enhance Data Mining for Collection Analysis & Comparison

Lynn Silipigni Connaway
Consulting Research Scientist III

Akeisha Heard
Technical Intern

XXV Annual Charleston Conference
04 November 2005

Introduction

Research Goals

Develop a service to support advanced collection intelligence

Cluster collected objects based on their issuing entity
  • As can be determined via metadata about the objects
  • Gain intelligence about the nature of individual publishers
      • Collection intelligence
      • Acquisition patterns
      • User behavior

Research Objectives

Resolve
  • ISBN prefixes to publisher name
  • Variant publisher names to a preferred form

Capture and make available for use various attributes of individual publishers
  • Location of publisher
  • Language(s) of materials published
  • Genre(s)/format(s) of materials published
  • Dominant subject domain(s) of the publisher's output
  • Parent company and subsidiaries
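
One way to picture these objectives is as a single record per publisher plus two lookup indexes. The following is a minimal sketch only: the class name, field names, and the variant strings "Oxford Univ. Press" and "OUP" are assumptions for illustration, not the project's actual data model. The prefix 019 for Oxford University Press is taken from the clustering example in this talk.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Optional, Set, Tuple


@dataclass
class PublisherAuthorityRecord:
    """Hypothetical authority record holding the attributes listed above."""
    preferred_name: str
    variant_names: Set[str] = field(default_factory=set)
    isbn_prefixes: Set[str] = field(default_factory=set)
    location: Optional[str] = None
    languages: Set[str] = field(default_factory=set)
    genres_formats: Set[str] = field(default_factory=set)
    subject_domains: Set[str] = field(default_factory=set)
    parent_company: Optional[str] = None
    subsidiaries: Set[str] = field(default_factory=set)


def build_indexes(
    records: Iterable[PublisherAuthorityRecord],
) -> Tuple[Dict[str, PublisherAuthorityRecord], Dict[str, PublisherAuthorityRecord]]:
    """Return (ISBN prefix -> record, case-folded name -> record)."""
    by_prefix: Dict[str, PublisherAuthorityRecord] = {}
    by_name: Dict[str, PublisherAuthorityRecord] = {}
    for record in records:
        for prefix in record.isbn_prefixes:
            by_prefix[prefix] = record
        # The preferred form and every variant resolve to the same record.
        for name in {record.preferred_name} | record.variant_names:
            by_name[name.casefold()] = record
    return by_prefix, by_name


# Illustrative data only; the variant strings are made up for the example.
oup = PublisherAuthorityRecord(
    preferred_name="Oxford University Press",
    variant_names={"Oxford Univ. Press", "OUP"},
    isbn_prefixes={"019"},
)
by_prefix, by_name = build_indexes([oup])
print(by_prefix["019"].preferred_name)
print(by_name["oup"].preferred_name)
```

Both lookups resolve to the same preferred form, which is the behavior the two "Resolve" objectives describe.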

Theoretical Foundation: Authority Control

Adhere to authorized form
  • Personal names
  • Corporate entities

Why no authorized form for publishing entities?

Pragmatic Foundation: Collection Development

Identified publisher series
  • Retrospective conversion project (1984)

Family tree
  • Which publishers are related?

Approval plans
  • Which publishers publish which subjects?

Pragmatic Foundation: OCLC WorldCat Data Mining

Collection Analysis
  • Which libraries have the most items by a publisher in a particular subject area?
  • How do library holdings by publisher compare?

E-books for a particular STM publisher (2000)
  • Cataloged as reproductions
      • 2 publishers!

Pragmatic Foundation: Citation Analysis

Sweetland (1989)
  • Reader functions of citations
      • Information retrieval via citation databases
      • Document retrieval
          • Includes interlibrary loan verification
      • Bibliometrics
      • Faculty and researcher productivity measure

Other functions
  • Creation of references/bibliographies

Pragmatic Foundation: Education for Librarians

Collection development & acquisitions librarian education
  • Subject focuses of publishers
  • Parent and subsidiary relationships

Specialized Corporate Authority Files

ACOLIT (Ruggeri, 2004)
  • Names, uniform titles, Italian and international Catholic institutions, Catholic religious communities, and institutions
      • Related to the Catholic Church, Papal State, and Vatican City State

COPAR (Boddaert, 2004)
  • French official corporate bodies
      • Mainly national and preceding the French Revolution

CORELI (Boddaert, 2004)
  • Religious corporate bodies from 3 French ancient specialized catalogues

Specialized Corporate Authority Files

Chinese Modern Author Authority Database (Hu, Tam & Lo, 2004)
  • Chinese authors of expanded works and Chinese corporate bodies since 1912

Chinese Name Authority Database (Hu, Tam & Lo, 2004)
  • Mainly Taiwanese personal names with some Taiwanese corporate bodies

Specialized Corporate Authority Files

Case study by Elias & Fair (1983)
  • Standard Oil Co.’s Media Query File
  • No authority control
  • 3 professionals in 6 months averaged 12 telephone calls/day from reporters
  • Decided against canonical list for media names
  • Noted 20 unique variants for Wall Street Journal including WSJ, Wall St. Jnl, Wall Street Jnl

Specialized Corporate Authority Files

Case study by French, Powell & Schulman (1997, 2000)
  • Smithsonian Astrophysical Observatory’s Astrophysics Data System database
  • Programmatically identify author affiliations and map variant names to canonical name
  • Investigated various techniques separately and iteratively to bring variants together including:
      • Lexical cleanup
      • Data clustering algorithms
      • Approximate string-matching
  • Reduced number of unique strings by 55%
  • Required manual review of clusters
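
The combination of lexical cleanup and approximate string-matching can be sketched roughly as follows. This is an illustration under stated assumptions, not the method French, Powell & Schulman used: the greedy single-pass grouping, the `difflib` similarity ratio, the 0.85 threshold, and the function names are all choices made for this example.

```python
import re
from difflib import SequenceMatcher
from typing import List, Tuple


def lexical_cleanup(name: str) -> str:
    """Case-fold, replace punctuation with spaces, collapse whitespace."""
    lowered = name.casefold()
    no_punct = re.sub(r"[^\w\s]", " ", lowered)
    return " ".join(no_punct.split())


def cluster_variants(names: List[str], threshold: float = 0.85) -> List[List[str]]:
    """Greedy clustering: each name joins the first cluster whose
    representative (the cleaned form of its first member) is similar enough,
    otherwise it starts a new cluster. The threshold is an assumed value."""
    clusters: List[Tuple[str, List[str]]] = []
    for name in names:
        cleaned = lexical_cleanup(name)
        for representative, members in clusters:
            if SequenceMatcher(None, representative, cleaned).ratio() >= threshold:
                members.append(name)
                break
        else:
            clusters.append((cleaned, [name]))
    return [members for _, members in clusters]


sample = [
    "Oxford University Press",
    "Oxford University Press.",
    "OXFORD UNIVERSITY PRESS",
    "Cambridge University Press",
]
for cluster in cluster_variants(sample):
    print(cluster)
```

A greedy pass like this depends on input order and on the threshold, which is one reason the resulting clusters would still need the manual review noted above.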

Database Quality

Literature: Database Quality

Review by O’Neill & Vizine-Goetz (1988)
  • Busch (1981)
      • < 35% of 141 OCLC libraries routinely reported errors
  • Pollock & Zamora (1983)
      • Noted misspellings comprise 90-96% of errors & include:
          • Omission
          • Insertion
          • Substitution
          • Transposition
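
The four misspelling types can be told apart mechanically when a word contains exactly one error. The sketch below is an illustration of the taxonomy, not code from Pollock & Zamora; the function name is an assumption, and the example pairs are well-known place and publisher misspellings.

```python
from typing import Optional


def classify_single_error(correct: str, observed: str) -> Optional[str]:
    """Name the single-character error that turns `correct` into `observed`.
    Returns None when the strings are identical and "other" when the
    difference is not exactly one of the four error types."""
    if correct == observed:
        return None
    len_c, len_o = len(correct), len(observed)
    if len_o == len_c - 1:
        # Omission: dropping one character of the correct form gives the typo.
        for i in range(len_c):
            if correct[:i] + correct[i + 1:] == observed:
                return "omission"
    elif len_o == len_c + 1:
        # Insertion: dropping one character of the typo gives the correct form.
        for i in range(len_o):
            if observed[:i] + observed[i + 1:] == correct:
                return "insertion"
    elif len_o == len_c:
        diffs = [i for i in range(len_c) if correct[i] != observed[i]]
        if len(diffs) == 1:
            return "substitution"
        if (
            len(diffs) == 2
            and diffs[1] == diffs[0] + 1
            and correct[diffs[0]] == observed[diffs[1]]
            and correct[diffs[1]] == observed[diffs[0]]
        ):
            return "transposition"
    return "other"


pairs = [
    ("Tallahassee", "Tallahasee"),
    ("Worcester", "Worchester"),
    ("Institute", "Institite"),
    ("Methuen", "Metheun"),
]
for correct, observed in pairs:
    print(observed, "->", classify_single_error(correct, observed))
```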

Literature: Database Quality

Intner (1989)
  • Reviewed 215 matching records in OCLC and RLIN
  • Errors relating to publishers:

Error type                     OCLC count (total)   OCLC %   RLIN count (total)   RLIN %
Application of AACR2 & LCRI    64 (205)             31.2     52 (191)             27.2
MARC tagging in 260 field      4 (25)               16.0     3 (26)               11.5
Typographic errors             4 (32)               12.5     6 (45)               13.3

Literature: Database Quality

Romero (1994)
  • Evaluated cataloging of library science students
      • Noted 221 errors (28.22%) in the publisher description area

Issues: Historical Practices

Different rules for abbreviations
  • LC Rule Interpretation B.14
      • State postal (2-letter) abbreviation if it appears in the item along with the place
  • Anglo-American Cataloguing Rules, Revised (2002)
      • Abbreviations included in Appendix B.14

Issues: Historical Practices

ALA Catalog Rules (1941)
  • Multiple places of publication and publishers and neither or first is prominent
      • Include first listed first, indicate omission
  • Multiple places of publication and publishers and first is not prominent
      • Include prominent first
      • Include first listed second
  • Unknown place of publication – [n.p.]

Issues: Historical Practices

Anglo-American Cataloging Rules (1967)
  • Multiple places of publication and publishers and neither or first is prominent
      • Include first listed only, omit others
  • Multiple places of publication and publishers and first is not prominent
      • Include prominent only, omit others
  • Unknown place of publication – [n. p.]

Issues: Historical Practices

Anglo-American Cataloguing Rules, Revised (2002)
  • Multiple places of publication and publishers and neither or first is prominent
      • Include first listed only, omit others
  • Multiple places of publication and publishers and first is not prominent
      • Include first listed first
      • Include prominent second
  • Unknown place of publication – [S.l.]

Issues: Historical and Local Practices

“u.a.”
  • At least one German institution uses “u.a.” as mark of omission
      • Means “et al.”
  • Not an AACR2r rule
      • Local practice?
  • Is local practice/policy an error?

Issues: Historical and Local Practices

WorldCat enhanced records
  • Eliminate or lessen the probability of these issues

Examining Quality of WorldCat

WorldCat: Publisher Name Selection Criteria

Fixed field lang = “eng”

WorldCat by Language
  • English: 61%
  • Non-English: 39%

WorldCat: ISBN Validation Errors

WorldCat records with ISBNs: 22.69%

ISBNs by Language
  • English: 55%
  • Non-English: 45%

WorldCat: ISBN Validation Errors

                    Count         %
English Language
  Valid             7,561,445     99.90%
  Invalid           7,600         0.10%
All Languages
  Valid             13,147,325    99.88%
  Invalid           15,654        0.12%
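
The slides do not say how validity was tested. Assuming it means the standard ISBN-10 check-digit test (the format in use in 2005), a minimal sketch looks like this; the function names are assumptions, and the second function only reproduces the percentages in the table above.

```python
def is_valid_isbn10(isbn: str) -> bool:
    """Check an ISBN-10: the ten characters, weighted 10 down to 1,
    must sum to a multiple of 11. 'X' stands for 10 in the last position."""
    chars = isbn.replace("-", "").replace(" ", "").upper()
    if len(chars) != 10:
        return False
    total = 0
    for position, ch in enumerate(chars):
        if ch == "X" and position == 9:
            value = 10
        elif ch.isdigit():
            value = int(ch)
        else:
            return False
        total += (10 - position) * value
    return total % 11 == 0


def invalid_share(invalid: int, valid: int) -> float:
    """Percentage of invalid ISBNs among all ISBNs examined."""
    return 100 * invalid / (invalid + valid)


print(is_valid_isbn10("0-19-852663-6"))
print(is_valid_isbn10("0-19-852663-5"))
print(round(invalid_share(7_600, 7_561_445), 2))
print(round(invalid_share(15_654, 13_147_325), 2))
```

A check-digit test catches most single-digit and transposition typos, but it cannot tell whether a well-formed ISBN is attached to the wrong record.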

WorldCat: MARC Tagging Errors

Examined English language records based on some known issues and manual evaluation

Total MARC tagging errors found: 11,874 (0.03%)

WorldCat Tagging Errors
  • MARC 260 vs 300 tagging: 55%
  • Dates tagging: 43%
  • Other: 2%

WorldCat: MARC Tagging Errors

MARC 260 vs 300 tagging
  • In 260 field, information from 300 field in $a, $b, $c and/or $e

Dates tagging
  • Date in $a or $b
  • Five digit year
  • “cm” follows year
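
The date-tagging patterns above lend themselves to simple pattern checks. The sketch below assumes a 260 field is available as a dictionary of subfield code to text; that representation, the regular expressions, and the function name are assumptions for illustration, not the project's actual checks.

```python
import re
from typing import Dict, List

# A standalone four-digit number, treated here as a probable year.
FOUR_DIGIT = re.compile(r"(?<!\d)\d{4}(?!\d)")
# A standalone five-digit number in the date subfield.
FIVE_DIGIT = re.compile(r"(?<!\d)\d{5}(?!\d)")
# A year followed later in the same subfield by "cm" (physical description
# text from the 300 field that ended up in the 260 field).
CM_AFTER_YEAR = re.compile(r"\d{4}.*\bcm\b", re.IGNORECASE)


def flag_260_problems(subfields: Dict[str, str]) -> List[str]:
    """Return labels for suspected date-tagging problems in one 260 field."""
    problems = []
    for code in ("a", "b"):
        if FOUR_DIGIT.search(subfields.get(code, "")):
            problems.append(f"date in ${code}")
    date_text = subfields.get("c", "")
    if FIVE_DIGIT.search(date_text):
        problems.append("five digit year")
    if CM_AFTER_YEAR.search(date_text):
        problems.append('"cm" follows year')
    return problems


# Made-up records for illustration.
print(flag_260_problems({"a": "New York :", "b": "Wiley,", "c": "1999."}))
print(flag_260_problems({"a": "London, 1987 :", "b": "Methuen,"}))
print(flag_260_problems({"c": "1998. 24 cm."}))
```

A four-digit number in $a or $b is only a candidate error, since a publisher name can legitimately contain one, so flagged records would still need manual evaluation.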

WorldCat: Typographical Errors

Used “Typographical Errors in Library Databases” to identify and quantify English language WorldCat errors (Ballard, 2005)
  • Total errors: 26,599 (0.08%)
  • Require manual examination to determine if actual errors
  • Searching for Institi*
      • Misspelled:
          • American Institite of Physics
          • British Standards Institition
      • Spelled correctly:
          • Institiúid Ard-Léinn Bhaile Átha Cliath (Dublin Institute for Advanced Studies)
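
The need for manual examination is easy to reproduce with a truncation search. The sketch below treats `Institi*` as "any word beginning with Institi"; the function name is an assumption, and the publisher strings are the ones quoted above plus one correctly spelled control.

```python
import re
from typing import List


def truncation_search(stem: str, publisher_names: List[str]) -> List[str]:
    """Return names containing a word that starts with `stem`
    (the equivalent of searching for stem*), ignoring case."""
    pattern = re.compile(r"\b" + re.escape(stem), re.IGNORECASE)
    return [name for name in publisher_names if pattern.search(name)]


names = [
    "American Institite of Physics",
    "British Standards Institition",
    "Institiúid Ard-Léinn Bhaile Átha Cliath",
    "American Institute of Physics",
]
for hit in truncation_search("Institi", names):
    print(hit)
```

The search returns both misspellings and the correctly spelled Irish name, so a hit count alone overstates the number of true errors.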

WorldCat: Typographical Errors

Top words (10.4%):

Word                        Probability According to Ballard   Error Type      WorldCat Count
Worchester                  Highest                            Insertion       398
Metheun                     High                               Transposition   355
Universt*                   Highest                            Omission        299
Unives*                     Highest                            Omission        275
Westminister [and] Press    Highest                            Insertion       266
Niagr*                      High                               Omission        260
Phildel*                    High                               Omission        235
Tallahasee                  High                               Omission        234
John Hopkins Press          Highest                            Omission        227
Institi*                    High                               Substitution    226

WorldCat: Typographical Errors

“Westminister”
  • Only included on Ballard list in combination with other words
  • Total errors in WorldCat: 628 (2.36%)
  • Require manual review

Where are we now?

WorldCat: MARC 260 Evaluation

Top 10 terms in 260 $b in WorldCat

Term           Count
press          2,094,111
co             1,664,005
university     1,550,435
dept           1,084,647
pub            984,234
research       853,954
service        710,314
institute      660,346
office         649,794
chu ban she    620,735
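
A term count of this kind can be approximated with a simple word tally over the 260 $b strings. This is a sketch under assumptions: the tokenization rule and the function name are choices made here, the sample strings are made up, and single-word counting would not by itself produce a phrase such as "chu ban she", so the tokenization behind the table above evidently differed.

```python
import re
from collections import Counter
from typing import List, Tuple


def top_terms(publisher_statements: List[str], n: int = 10) -> List[Tuple[str, int]]:
    """Count case-folded alphabetic words across publisher statements
    and return the n most frequent."""
    counts: Counter = Counter()
    for statement in publisher_statements:
        # [^\W\d_]+ matches runs of letters, including accented ones.
        counts.update(re.findall(r"[^\W\d_]+", statement.casefold()))
    return counts.most_common(n)


sample_260b = [
    "Oxford University Press,",
    "Harvard University Press,",
    "Macmillan Co.,",
]
for term, count in top_terms(sample_260b, 3):
    print(term, count)
```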

WorldCat: MARC 260 Evaluation

University Press names in 260 $b in WorldCat

Term         Count
oxford       35,804
hopkins      22,564
cambridge    21,951
harvard      17,069
cornell      11,305
stanford     10,900
purdue       5,468
yale         5,076
princeton    4,746
rutgers      3,854

Clustering

Attempting programmatic clustering of publishers using ISBN prefixes
  • Data clustering (The Free Dictionary)
      • "The science of extracting useful information from large data sets or databases"
      • Classification of similar objects into different groups
      • Partitioning of a data set into subsets (clusters)
          • Data in each subset (ideally) share some common trait

WorldCat: Clustering Example

Used ISBN prefix 019 (Oxford University Press)
  • Total WorldCat records: 58,004,317
  • Records with ISBN prefix 019: 84,276 (0.15%)
  • Non-unique publisher names from ISBN prefix records: 91,528

                                          One or more 019 ISBN   All 019 ISBNs
NACO normalized unique publisher names    1,550                  1,386
Number of clusters                        919                    799
Non-singleton clusters                    222 (24.16%)           205 (25.66%)
Largest cluster                           82 text strings        81 text strings
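
The first steps of this example (select records by ISBN prefix, normalize the publisher strings, collect the unique normalized forms) can be sketched as below. The normalization is a simplified approximation in the spirit of NACO normalization, not the full NACO rules, and the clustering algorithm that reduced 1,550 names to 919 clusters is not described in the slides, so it is not reproduced. Function names and sample records are assumptions.

```python
import re
import unicodedata
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def naco_like_normalize(name: str) -> str:
    """Simplified normalization: strip diacritics, uppercase,
    replace punctuation with spaces, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    spaced = re.sub(r"[^\w\s]", " ", no_marks.upper())
    return " ".join(spaced.split())


def names_for_prefix(
    records: Iterable[Tuple[str, str]], prefix: str
) -> Dict[str, Set[str]]:
    """Map each normalized publisher name to the raw strings behind it,
    for (isbn, publisher) records whose ISBN starts with `prefix`."""
    groups: Dict[str, Set[str]] = defaultdict(set)
    for isbn, publisher in records:
        if isbn.replace("-", "").startswith(prefix):
            groups[naco_like_normalize(publisher)].add(publisher)
    return dict(groups)


def non_singleton_share(cluster_sizes: List[int]) -> float:
    """Percentage of clusters that contain more than one string."""
    return 100 * sum(1 for size in cluster_sizes if size > 1) / len(cluster_sizes)


# Made-up sample records; the last one has a non-019 ISBN and is skipped.
records = [
    ("0-19-852663-6", "Oxford University Press,"),
    ("0198526636", "Oxford Univ. Press"),
    ("0-19-852663-6", "OXFORD UNIVERSITY PRESS"),
    ("0-306-40615-2", "Example Press"),
]
groups = names_for_prefix(records, "019")
for normalized, raw_forms in groups.items():
    print(normalized, sorted(raw_forms))
print(round(100 * 222 / 919, 2), round(100 * 205 / 799, 2))
```

Normalization alone merges strings that differ only in case and punctuation; abbreviations such as "Univ." remain separate, which is where a further clustering step and manual review come in.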

Challenges: Publisher Name Authority File

Quality issue
  • Level of acceptance for cluster
      • What is acceptable?

Subsidiaries and Relationships
  • Oxford & Auckland
  • Examined manually to determine relationship

Form of name
  • What is acceptable?
      • Likely to use the most prominent form of name

Questions and Discussion

Contact Information: [email protected]@oclc.org

Project Web Site: http://www.oclc.org/research/projects/publisherns/