20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

Crowdsourcing, Family History, and Long Tails for Libraries http://slidesha.re/1qzB8vv Frederick Zarndt [email protected] Secretary, IFLA Newspapers Section Photo held by John Oxley Library, State Library of Queensland. Original from Courier-mail, Brisbane, Queensland, Australia.

description

In all of its many flavors, crowdsourcing works. It works for cultural heritage organizations too. During this presentation we look at various aspects of crowdsourced OCR text correction, commenting, and tagging for digitized historical newspapers at the National Library of Australia’s Trove, the California Digital Newspaper Collection (CDNC), and the Cambridge Public Library in Cambridge, Massachusetts, as well as the astounding number of historical birth, death, marriage, census, and other records transcribed by “crowd” volunteers at FamilySearch. Aspects covered include demographics, experiences, motivation, quality, preferred data, economics, and marketing. You will see that crowdsourcing is not only feasible but also practical and desirable. You will wonder why your own cultural heritage organization hasn't begun its own crowdsourcing project!

Transcript of 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

Page 1: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

Crowdsourcing, Family History, and Long Tails for Libraries

http://slidesha.re/1qzB8vv

Frederick Zarndt [email protected]

Secretary, IFLA Newspapers Section

Photo held by John Oxley Library, State Library of Queensland. Original from Courier-mail, Brisbane, Queensland, Australia.

Page 2: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

Crowdsourcing is the practice of obtaining needed services, ideas, or content by soliciting contributions from a large group of people, and especially from an online community, rather than from traditional employees or suppliers. ... [It] is different from ordinary outsourcing since it is a task or problem that is outsourced to an undefined public rather than a specific, named group.

Wikipedia contributors, "Crowdsourcing," Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/wiki/Crowdsourcing (accessed March 17, 2013)

Page 3: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

“crowdsourcing” was coined by Jeff Howe in “The rise of crowdsourcing”, published in Wired magazine, June 2006.

Page 4: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

web trends for “crowdsourcing”

Jan-2006 to Jun-2014

Page 5: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

• On the date of publication of Jeff Howe’s Wired magazine article, 1-Jun-2006, Wikipedia did not have an entry (list) of crowdsourcing projects*.

• On 25-Jan-2010 Wikipedia’s list of crowdsourcing projects had 35 entries*.

• On 17-Mar-2013 Wikipedia’s list of crowdsourcing projects had 158 entries+.

* From the Internet Archive’s Wayback Machine.

+ Wikipedia contributors, "List of crowdsourcing projects," Wikipedia, The Free Encyclopedia, https://en.wikipedia.org/wiki/List_of_crowdsourcing_projects (accessed March 17, 2013).

Page 6: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

Amazon Mechanical Turk was launched Nov 2005

Alexa global / country rank of Amazon Mechanical Turk (June 2014): 6,465 / 2,046

crowdsourcing

Page 7: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

crowdsourcing

Each day 200,000,000 reCAPTCHAs are solved by humans around the world

Page 8: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

Galaxy Zoo was first launched July 2007

Alexa global / country traffic rank of Galaxy Zoo (June 2014): 606,971 / 100,298

citizen science

Page 9: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

Kickstarter was first launched in 2009

Alexa global / country traffic rank of Kickstarter (June 2014): 782 / 326

60,000+ projects successfully funded with more than USD $1,000,000,000

crowd funding

Page 10: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

crowd collaboration

Page 11: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

FamilySearch Indexing was first launched (beta) 2004

Alexa global / country traffic rank of FamilySearch (June 2014): 4,385 / 1,321

Page 12: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

Project Gutenberg was first launched Dec 1971

Alexa global / country traffic rank of Project Gutenberg (June 2014): 6,615 / 4,066

Page 13: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]
Page 14: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

Alexa global / country traffic rank of National Library of Finland 2,535,854 (31-Oct-2012) / 199 (2-Apr-2012)

Page 15: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

so what? why should a library care about crowdsourcing?

Time Life Pictures

Getty Images

Page 16: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

“user engagement refers to the quality of the user experience that emphasizes the positive aspects of the interaction with a web application, and in particular the phenomena associated with wanting to use that web application longer and frequently”

Elad Yom-Tov, Mounia Lalmas, Georges Dupret, Ricardo Baeza-Yates, Pinar Donmez, and Janette Lehmann. 2012. The effect of links on networked user engagement. In Proceedings of the 21st international conference companion on World Wide Web (WWW '12 Companion). ACM, New York, NY, USA, 641-642. DOI=10.1145/2187980.2188167 http://doi.acm.org/10.1145/2187980.2188167

Page 17: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

“in addition to increasing search accuracy or lowering the costs of document transcription, crowdsourcing is the single greatest advancement in getting people using and interacting with library collections”

Paraphrased from Trevor Owens’s blog http://www.trevorowens.org/2012/03/crowdsourcing-cultural-heritage-the-objectives-are-upside-down/ (accessed June 2013).

Page 18: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

“While [the National Library of Australia’s] Trove offers a range of user engagement features, and use of each of these features continues to grow, it is Trove’s newspaper text correction features that have attracted the highest level of user engagement.”

Marie-Louise Ayres. 2013. ‘Singing for their supper’: Trove, Australian newspapers, and the crowd. Paper presented at IFLA WLIC 2013, Singapore. Accessed June 2014 IFLA Library http://library.ifla.org/id/eprint/245.

Page 19: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

Alexa global / country rank of National Library of Australia (June 2014): 10,964 / 249

Trove gets ~78% of all National Library web traffic.

Page 20: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

National Library of Australia

• Online since 2008

• More than 13,000,000 newspaper pages / 127,437,967 articles (May 2014)

• Top text corrector 2,625,205+ lines (May 2014)

• 2,682,119 lines corrected each month (average for 1st 5 months 2014)

• 129,046,297 lines corrected as of May 2014, up from 66,527,535 lines corrected May 2012

• 129,300 registered / 8,218 active users (May 2014)

Page 21: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

[Chart: unique visits and page views, 2013 monthly averages; axis ticks at 1, 1,000, and 1,000,000. Categories: Australian Newspapers; Books; Pictures and photos; Journal Articles; Music sound and video; Maps; Archived websites; Diaries, letters, archives; People and organisations.]

Page 22: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

[Chart: unique visits and page views, 2013 monthly averages; axis from 0 to 6,000,000. Categories: Australian Newspapers; Books; Pictures and photos; Journal Articles; Music sound and video; Maps; Archived websites; Diaries, letters, archives; People and organisations.]

Page 23: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]
Page 24: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

California Digital Newspaper Collection

• CDNC began digitizing newspapers in 2005 as part of NDNP

• Newspapers digitized to article-level as well as to page-level as required by NDNP

• Hosted on Veridian beginning 2009

• Collection size 61,412 issues, 545,955 pages, 6,364,529 articles (May 2014)

Page 25: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

OCR text correction

• OCR text correction added Aug 2011

• Corrections are done line by line

• 2,246 registered / 1,266 active users (Jun 2014)

• 2,656,497+ lines of text corrected (Jun 2014)

• ~2% of the collection corrected, 98% to go!

• Top corrector 717,855 lines > 2x 2nd corrector

Page 26: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]
Page 27: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

Cambridge Public Library Historic Newspaper Collection

• Cambridge Historic Newspapers online since Jan 2012.

• Cambridge, Massachusetts Public Library digitized local newspapers (http://cambridge.dlconsulting.com/)

• Newspapers digitized to article-level

• Collection size 6,346 issues, 59,070 pages, 669,406 articles (May 2014)

• Collection includes 13,099 obituary cards

Page 28: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

[Chart: 2013 monthly averages for Historic Cambridge Newspapers (1846-1923), Cambridge City Directories (1848-1910), and Cambridge Chronicle (August 2005 to present); values shown are 0%, 10%, and 90%.]

Page 29: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

why correct text? here’s why…

Image copyright Karl R Lilliendahl

Photographer

Page 30: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

Deaths. lln»rieff, Esq. of <c .. Qn. Sunday, the till. greatly Drandrellt, of Orms4\irJi.- ~ ; ;✓ ' • * On ijfr r inn l j j j i l F i i j ' 1 1 f H a v o d i v y d , Carnarvonshire, S ; **" *- ' « ' March Oxford, F. Tfovmeud, Uerald. » • V . •On Tncsdav last , Mr. Charles. IWilinson, this 8 ; had vf thesis#,, a week ago, which tcrminate<i'iu his death. . / ' ■ O'i Sunday, dJst nit. at. A s b t C n v H a l l , m a r L a n c a s t e r , Mr.,Geo. Worn ick, many years house'steward hit late Once The Hamilton and Brandon. He locked himself h»oWn'r«wte<: soon. twelve o'clock" that dny, and fii»-d a loaded pistol "through Ins bead, 1 which instantaneously killed him. Coronet's Verdict, shot himself in a temporary fit of Friday week,

raw OCR text

Excerpt from The British Newspaper Archive, Chester Courant, Tuesday 6-Apr-1819, page 3.

newspaper image

Page 31: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

User rank   Lines corrected* (CDNC)   Lines corrected* (NLA Trove)
1           646,873                   2,455,338
2           236,323                   1,822,422
3           111,749                   1,448,370
4           100,749                   1,265,217
5           99,999                    1,174,835
6           87,720                    1,069,669
7           82,768                    1,058,179
8           63,786                    1,020,462
9           57,441                    949,694
10          56,458                    886,315

*numbers from Mar 2014

Page 32: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

User rank   Lines corrected Jun 2014   Lines corrected Oct 2012
1           717,855                    242,965
2           271,972                    87,515
3           120,220                    31,318
4           113,787                    24,144
5           109,999                    23,184
6           99,999                     19,240
7           94,742                     18,898
8           65,637                     16,875
9           63,786                     11,784
10          59,724                     9,762

Page 33: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

uncorrected OCR accuracy by newspaper title

Title                                                  Years         OCR character accuracy   ~OCR word accuracy*
PRP Pacific Rural Press                                1871 - 1922   92.6%                    68.1%
SFC San Francisco Call                                 1890 - 1913   92.6%                    68.1%
LAH Los Angeles Herald                                 1873 - 1910   88.7%                    54.9%
LH Livermore Herald                                    1877 - 1899   88.6%                    54.6%
DAC Daily Alta California                              1841 - 1891   88.2%                    53.4%
CFJ California Farmer and Journal of Useful Sciences   1855 - 1880   86.5%                    48.4%
SN Sausalito News                                      1885 - 1922   70.4%                    17.3%

*Word accuracy assumes average word length is 5 characters

Page 34: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

corrected OCR accuracy by newspaper title

Title                                                  Years         OCR character accuracy   Corrected accuracy
PRP Pacific Rural Press                                1871 - 1922   92.6%                    99.3%
SFC San Francisco Call                                 1890 - 1913   92.6%                    99.6%
LAH Los Angeles Herald                                 1873 - 1910   88.7%                    99.1%
LH Livermore Herald                                    1877 - 1899   88.6%                    99.9%
DAC Daily Alta California                              1841 - 1891   88.2%                    99.9%
CFJ California Farmer and Journal of Useful Sciences   1855 - 1880   86.5%                    98.3%
SN Sausalito News                                      1885 - 1922   70.4%                    100.0%

Page 35: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

corrected OCR accuracy by newspaper title

Title   Years         OCR character accuracy   ~OCR word accuracy*   Corrected accuracy   ~Corrected word accuracy*
PRP     1871 - 1922   92.6%                    68.1%                 99.3%                96.5%
SFC     1890 - 1913   92.6%                    68.1%                 99.6%                98.0%
LAH     1873 - 1910   88.7%                    54.9%                 99.1%                95.6%
LH      1877 - 1899   88.6%                    54.6%                 99.9%                99.5%
DAC     1841 - 1891   88.2%                    53.4%                 99.9%                99.5%
CFJ     1855 - 1880   86.5%                    48.4%                 98.3%                91.8%
SN      1885 - 1922   70.4%                    17.3%                 100.0%               100.0%

*Word accuracy assumes average word length is 5 characters

Page 36: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

correction accuracy by user

User   Average OCR accuracy   Correction accuracy
A      70.4%                  100.0%
B      87.1%                  99.5%
C      95.4%                  99.5%
D      86.5%                  98.3%
E      95.3%                  100.0%
F      91.0%                  100.0%
G      91.0%                  99.8%
H      90.5%                  99.0%
I      96.6%                  99.8%
J      94.8%                  100.0%
K      86.8%                  99.3%

Page 37: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

that’s interesting, but who wants to correct OCR text? it’s

Page 38: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

Graphic from Kaufmann et al. “More than fun and money. Worker Motivation in Crowdsourcing – A Study on Mechanical Turk.”

Motivation

Page 39: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

Motivation Genealogists and family historians

• National Library of Australia’s 2012 Trove status report showed that ~50% of Trove users are family historians

• National Library of New Zealand survey found that ~50% of PapersPast users are genealogists

PAPERSPAST

Page 40: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

• 72% visit UDN (Utah Digital Newspapers) for genealogical research

• 20% visit for various other types of historical research

• 87% find obituaries useful

• Over 60% find the other genealogical article types (birth and wedding announcements) useful

• Only 7% do not find genealogical articles useful

• Many are writing family histories and consequently also look for general background information

• Older content is much more highly valued than more recent content (see more detailed explanation that follows)

• 44% find smaller, rural papers more useful, while only 15% find larger, metropolitan papers more useful

Motivation 2012 user survey

John Herbert and Randy Olsen. Small town papers: still delivering the news. WLIC 2012, Helsinki Finland. http://conference.ifla.org/past-wlic/2012/119-herbert-en.pdf

Page 41: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

• CDNC and Cambridge Public Library published a user survey in Mar 2013

• 604 (CDNC) / 32 (Cambridge Public Library) responses

• Surveys are (mostly) identical except for organization name

Motivation 2013 user survey

Page 42: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

User demographic Genealogists and family historians

Page 43: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

User demographic No spring chickens

Page 44: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

User demographic Reasons for use

Page 45: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

User demographic Types of information

Page 46: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

• “I enjoy the correction - it’s a great way to learn more about past history and things of interest whilst doing a ‘service to the community’ by correcting text for the benefit of others.”

• “I have recently retired from IT and thought that I could be of some assistance to the project. It benefits me and other people. It helps with family research.”

Rose Holley. March 2009. Many Hands Make Light Work. National Library of Australia. Accessed June 2014 http://www.nla.gov.au/ndp/project_details/documents/ANDP_ManyHands.pdf.

Motivation Trove users’ report

Page 47: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

“The ‘typical’ Trove user is a very well educated, highly paid, English speaking employed woman aged fifty or over, with a significant or primary interest in family or local history, who visits the Trove website very frequently. Users of Trove newspapers are older than the average Trove user; only 13% of newspaper users are under 40 years of age.”

Marie-Louise Ayres. ‘Singing for their supper’: Trove, Australian newspapers, and the crowd. WLIC 2013, Singapore. http://library.ifla.org/245/1/153-ayres-en.pdf.

Motivation Engaged users: Who are they?

Page 48: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

“Many of Trove’s user engagement features are very popular. More than 100,000 users have registered to date, and more than 2 million tags and nearly 60,000 comments had been added… [Trove] text correction, however, stands head and shoulders above any other user engagement features.”

Motivation Engaged users: What do they do?

Marie-Louise Ayres. ‘Singing for their supper’: Trove, Australian newspapers, and the crowd. WLIC 2013, Singapore. http://library.ifla.org/245/1/153-ayres-en.pdf.

Page 49: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

“when someone transcribes a document, they are actually better fulfilling the mission of a cultural heritage organization than someone who simply stops by to flip through the pages”

Paraphrased from Trevor Owens’s blog http://www.trevorowens.org/2012/03/crowdsourcing-cultural-heritage-the-objectives-are-upside-down/ (accessed June 2013).

Motivation Engaged users

Page 50: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

“I am interested in all kinds of history. I have pursued genealogy as a hobby for many years. I correct text at CDNC because I see it as a constructive way to contribute to a worthwhile project. Because I am interested in history, I enjoy it.”

Wesley, California

Personal communications with CDNC text correctors.

Motivation CDNC users’ report

Page 51: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

“I only correct the text on articles of local interest - nothing at state, national or international level, no advertisements, etc. The objective is to be able to help researchers to locate local people, places, organizations and events using the on-line search at CDNC. I correct local news & gossip, personal items, real estate transactions, superior court proceedings, county and local board of supervisors meetings, obituaries, birth notices, marriages, yachting news, etc.”

Ann, California

Personal communications with CDNC text correctors.

Motivation CDNC users’ report

Page 52: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

“I am correcting text for the Coronado Tent City Program for 1903. It is important to correct any problems with personal names and other information so that researchers will be able to search by keyword and be assured of retrieving desired results. ... type fonts cause a great deal of difficulty in digitizing the text and can cause problems for searchers. Also, many of the guests' names at Tent City and Hotel Del Coronado were taken from the registration books and reported in the Program. This led to many problems in spelling of last names and the editors were not careful to be consistent in the spellings. This Program is an important resource since it provides an excellent picture of daily life in Tent City and captures much of the history of Coronado itself.”

Gene, California

Personal communications with CDNC text correctors.

Motivation CDNC users’ report

Page 53: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

“I have always been interested in history, especially the development of the American West, and nothing brings it alive better than newspapers of the time. I believe them to be an invaluable source of knowledge for us and future generations.”

David, United Kingdom

Personal communications with CDNC text correctors.

Motivation CDNC users’ report

Page 54: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

“CDNC is an excellent source of information matching my personal interest in such topics as sea history, development of shipbuilding, clippers and other ships etc. ... Unfortunately, the quality of text ... is rather poor I’m afraid. This is why I started to do all corrections necessary for myself ... and to leave the corrected text for use of others. ... I am not doing this very regularly as this is just my hobby and pleasure.”

Jerzey, Poland

Personal communications with CDNC text correctors.

Motivation CDNC users’ report

Page 55: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

“As an amateur historical researcher my time for research is very limited. Making time to travel to archives, libraries, and historical societies does not happen as often as I would like. The Cambridge Public Library’s online newspaper collection has been an invaluable resource and it is fun. I am very grateful for all the help I have received over the years from so many research organizations. Correcting text has several benefits. It makes it much more likely that I will find a story if I decide to search for it in the future. It is a way of saying ‘thank you’ to the Cambridge Library for having such a great resource available and maybe I can make the next person’s research a little easier. It is my own little historical preservation project.”

Cambridge Historical Newspapers Text Corrector

Personal communications with CDNC text correctors.

Motivation Cambridge users’ report

Page 56: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

so old, boring, easily entertained people correct text. convince me there are real benefits.

Page 57: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

Economic benefits

Public domain photo courtesy of US Navy

Page 58: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

Economics

Financial value of outsourced OCR text correction for newspapers?

The Assumptions

• 25 to 50 characters per line in a newspaper column: Assume 40 characters per line (CDNC sample average)

• Outsourced text transcription or correction costs USD $0.35 to $1.20 per 1000 characters: Assume $0.50 per 1000 characters

Page 59: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

CDNC: 2,656,497 lines x 40 characters per line x 1/1000 x $0.50 = $53,130

Trove: 129,046,297 lines x 40 characters per line x 1/1000 x $0.50 = $2,580,926
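The outsourcing estimate above can be reproduced with a few lines of Python. This is only a sketch of the slide's arithmetic under its stated assumptions (40 characters per line, USD $0.50 per 1000 characters); the function name outsourced_cost is illustrative and does not come from the presentation.

```python
def outsourced_cost(lines, chars_per_line=40, usd_per_1000_chars=0.50):
    """Estimated USD cost of paying a vendor to correct `lines` lines of text."""
    return lines * chars_per_line / 1000 * usd_per_1000_chars


# Line counts taken from the slides (CDNC Jun 2014, Trove May 2014).
cdnc = outsourced_cost(2_656_497)
trove = outsourced_cost(129_046_297)

print(f"CDNC:  ${cdnc:,.0f}")
print(f"Trove: ${trove:,.0f}")
```

Changing usd_per_1000_chars to 0.35 or 1.20 gives the low and high ends of the price range quoted in the assumptions.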

Economics

Page 60: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

Financial value of in-house OCR text correction?

The Assumptions

• Correction takes 15 seconds per line

• Cost is hourly wage plus benefits of lowest level employee, $10 for CDNC, $41.88* for Australia

* AUD $40.38 = USD $41.88 is the actual labor value assumed by the National Library of Australia to calculate avoided costs due to crowdsourced OCR text correction in its 2012 Trove Status Report.

Economics

Page 61: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

CDNC: 2,656,497 lines x 15 seconds per line x 1/3600 hrs per second x $10.00 per hr = $110,687

Trove: 129,046,297 lines x 15 seconds per line x 1/3600 hrs per second x $41.88 per hr = $22,518,579
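The in-house estimate follows the same pattern, with labor time in place of a per-character price. A minimal sketch, assuming the slide's 15 seconds per line and its two hourly rates; in_house_cost is a hypothetical helper name.

```python
def in_house_cost(lines, hourly_rate_usd, seconds_per_line=15):
    """Estimated USD labor cost of staff correcting `lines` lines of text."""
    hours = lines * seconds_per_line / 3600  # 3600 seconds per hour
    return hours * hourly_rate_usd


# Hourly rates from the slide: $10.00 for CDNC, $41.88 for Australia.
cdnc_labor = in_house_cost(2_656_497, hourly_rate_usd=10.00)
trove_labor = in_house_cost(129_046_297, hourly_rate_usd=41.88)

print(f"CDNC:  ${cdnc_labor:,.0f}")
print(f"Trove: ${trove_labor:,.0f}")
```

Because the result scales linearly with both inputs, a slower or faster correction speed than 15 seconds per line changes the estimate in direct proportion.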

Economics

Page 62: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

Accuracy

“His Accuracy Depends on Ours!" Office for Emergency Management. Office of War Information. Domestic Operations Branch. Bureau of Special Services. [Photo held at US National Archives and Records Administration]

Page 63: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

Accuracy

• Edwin Klijn (Koninklijke Bibliotheek, the Netherlands) reports raw OCR character accuracies of 68% for early 20th century newspapers

• Rose Holley (National Library of Australia) reports raw OCR character accuracy varied from 71% to 98% on a sample of Trove digitized newspapers

Rose Holley. How good can it get? Analysing and improving OCR accuracy in large scale historic newspaper digitisation programs. D-Lib Magazine. Mar/Apr 2009. Accessed June 2014 http://www.dlib.org/dlib/march09/holley/03holley.html.

Edwin Klijn. The current state-of-art in newspaper digitization. D-Lib Magazine. Jan/Feb 2008. Accessed June 2014 http://www.dlib.org/dlib/january08/klijn/01klijn.html.

Public domain graphic courtesy of Wikimedia Commons.

Page 64: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

Accuracy

MAPPING TEXTS* assesses digitization quality of digital newspapers by comparing the number of words recognized to the total number of words scanned

* Mapping texts is a collaboration between the University of North Texas and Stanford University aimed at experimenting with new methods for finding and analyzing meaningful patterns embedded in massive collections of digital newspapers.

Page 65: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

How does low text accuracy affect search recall?

The Facts

• Average uncorrected OCR character accuracy of the CDNC sample data is ~89%

• Average length of an English word is 5 characters

• Average word accuracy is 89% x 89% x 89% x 89% x 89% = 55.8% - round up to 60% or 6 out of 10 words correct
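The word accuracy calculation above amounts to raising the character accuracy to the power of the word length. The sketch below makes that explicit; it carries the same implicit assumption as the slide, namely that errors in individual characters are independent of one another. The name word_accuracy is illustrative.

```python
def word_accuracy(char_accuracy, word_length=5):
    """Probability that every character in a word is recognized correctly,
    assuming character errors are independent of each other."""
    return char_accuracy ** word_length


raw = word_accuracy(0.89)          # ~89% uncorrected character accuracy
corrected = word_accuracy(0.994)   # ~99.4% corrected character accuracy

print(f"raw text:       {raw:.1%}")
print(f"corrected text: {corrected:.1%}")
```

The same function, called with a different word_length, gives the figures for longer names.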

Accuracy

Page 66: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

Search recall no text correction

[Graphic: ten instances of “ARNDT”, each marked as either found or not found.]

Page 67: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

Accuracy

The Facts

• Average corrected character accuracy of the CDNC sample data is ~99.4%

• Average word accuracy of CDNC corrected text is 99.4% x 99.4% x 99.4% x 99.4% x 99.4% = 97.0%

Page 68: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

Search recall with text correction

[Graphic: ten instances of “ARNDT”, each marked as either found or not found.]

Page 69: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

A search for “Arndt” at Chronicling America gives 10,267 results*

• If Chronicling America text accuracy is 55.8% (same as uncorrected CDNC sample), then 8,133 instances of “Arndt” were not found

• If text accuracy is 97.0%, then 317 instances of “Arndt” were not found
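The two estimates above appear to treat the hits returned by the search as the correctly recognized fraction of all true occurrences, so the missed count is found x (1 - accuracy) / accuracy. That reading is an inference from the numbers, not something the slide states, and estimated_missed is a hypothetical name.

```python
def estimated_missed(found, word_accuracy):
    """Estimate occurrences a search did not return.

    Assumes the `found` hits are the fraction `word_accuracy` of all true
    occurrences, so the total is found / word_accuracy.
    """
    return found * (1 - word_accuracy) / word_accuracy


found = 10_267  # "Arndt" results at Chronicling America, 31 Oct 2012

missed_raw = estimated_missed(found, 0.558)
missed_corrected = estimated_missed(found, 0.970)

print(f"missed at 55.8% word accuracy: about {missed_raw:,.0f}")
print(f"missed at 97.0% word accuracy: about {missed_corrected:,.0f}")
```

Depending on how the last digit is rounded, the results can differ by one from the figures on the slide.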

Accuracy

* Search performed 31 Oct 2012

Page 70: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

Accuracy

Suppose the word/name is longer than 5 characters?

The Facts

• Assume that average uncorrected / corrected OCR character accuracy is ~89% / ~99%, same as CDNC.

Name         Name length   Raw text accuracy   Corrected text accuracy
Eklund       6             49.7%               94.2%
Kennedy      7             44.2%               93.2%
Espinosa     8             39.4%               92.3%
Bonaparte    9             35%                 91.4%
Chatterjee   10            31.2%               90.4%

Page 71: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

Accuracy

Name         Number of search results   Missing results with raw text accuracy   Missing results with corrected text accuracy
Eklund       2,951                      2,987                                    182
Kennedy      360,723                    455,392                                  26,111
Espinosa     1,918                      2,950                                    160
Bonaparte    44,664                     82,947                                   4,203
Chatterjee   19                         42                                       2

Chronicling America searches done 19-Mar-2013 (6,025,474 pages from 1836 to 1922).
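The table above can be approximated by combining name length with the search counts. This is a sketch only: it assumes independent character errors, the ~89% / ~99% character accuracies from the slides, and that the hits found are the correctly recognized share of all occurrences. Because the slide rounds intermediate percentages, the printed values may differ slightly from the table.

```python
def word_accuracy(char_accuracy, length):
    """Chance that all `length` characters of a word are recognized correctly."""
    return char_accuracy ** length


def estimated_missed(found, accuracy):
    """Occurrences not returned, if `found` is the fraction `accuracy` of all."""
    return found * (1 - accuracy) / accuracy


# Search result counts copied from the slide (Chronicling America, 19-Mar-2013).
results = {"Eklund": 2951, "Kennedy": 360723, "Espinosa": 1918,
           "Bonaparte": 44664, "Chatterjee": 19}

RAW, CORRECTED = 0.89, 0.99  # approximate character accuracies from the slides

table = {}
for name, found in results.items():
    raw_missed = estimated_missed(found, word_accuracy(RAW, len(name)))
    fixed_missed = estimated_missed(found, word_accuracy(CORRECTED, len(name)))
    table[name] = (raw_missed, fixed_missed)
    print(f"{name:<11}{found:>9,}{raw_missed:>10,.0f}{fixed_missed:>9,.0f}")
```

The longer the name, the faster recall drops for raw text, which is why long surnames benefit most from correction.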

Page 72: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

but you left out long tails…

Public domain illustration from "On The Genesis of Species" by St. George Mivart

Page 73: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]
Page 74: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

the long tail* of crowdsourced OCR text correction

a probability distribution has a long tail if a larger share of the population rests within its tail than it would under a normal distribution

the most productive users represent a small fraction of the total user population and ~50% of total production; said a different way, the many users who are individually less productive are, taken together, as important as the most productive users

* The phrase “long tail” was popularized by Chris Anderson in the October 2004 Wired magazine article The Long Tail and by Clay Shirky’s February 2003 essay “Power laws, web logs, and inequality”.
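One way to check for this pattern in a correction project is to count how many of the top correctors it takes to reach half of all corrected lines. The sketch below uses made-up per-user counts purely for illustration (they are not CDNC or Trove data), and users_for_share is a hypothetical helper.

```python
def users_for_share(lines_per_user, share=0.5):
    """Smallest number of top users whose combined lines reach `share` of the total."""
    ranked = sorted(lines_per_user, reverse=True)
    target = share * sum(ranked)
    running = 0
    for count, lines in enumerate(ranked, start=1):
        running += lines
        if running >= target:
            return count
    return len(ranked)


# Hypothetical data: a few heavy correctors and a long tail of occasional ones.
sample = [5000, 2000, 1000, 500] + [100] * 40 + [10] * 300

head = users_for_share(sample)
print(f"{head} of {len(sample)} users account for half of all corrected lines")
```

With real per-user counts exported from a correction system, the same function shows how large the head is and how much of the work the remaining users in the tail contribute.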

Page 75: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

User rank   Lines corrected* (CDNC)   Lines corrected* (NLA Trove)
1           646,873                   2,455,338
2           236,323                   1,822,422
3           111,749                   1,448,370
4           100,749                   1,265,217
5           99,999                    1,174,835
6           87,720                    1,069,669
7           82,768                    1,058,179
8           63,786                    1,020,462
9           57,441                    949,694
10          56,458                    886,315

*numbers from Mar 2014

Page 76: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

OCR text correction long tails

[Charts: CDNC lines corrected by text corrector (axis 0 to 300,000; top corrector 242,965) and NLA lines corrected by text corrector (axis 0 to 3,000,000; top corrector 1,456,906). Each chart is divided into two regions labeled 50%.]

Page 77: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

Future considerations

• How to market / advertise crowdsourcing?

• How to motivate crowdsourcers?

• Is authentication / identity of crowdsourcers an issue?

• How to administer crowdsourced data?

Photo of Aleister Crowley [Public domain] from Wikimedia Commons

Page 78: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

Conclusions

Conclusion of the Sonata for piano #32, opus 111 by Ludwig van Beethoven

• Lots of crowdsourcing in cultural heritage organizations and elsewhere

• Benefits are multi-faceted: Economic, data accuracy, user engagement, increased web traffic

Page 79: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

are we finished now?

Image copyright Dan Heller

www.danheller.com

Page 80: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

Resources

Public domain photo “A useful instruction for young sailors from the Royal Hospital School, Greenwich” from the National Maritime Museum.

Page 81: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

Correct California newspapers at http://cdnc.ucr.edu

Correct Cambridge MA newspapers http://bit.ly/cambridgepublic

Correct Australian newspapers http://trove.nla.gov.au

Correct Virginia newspapers http://virginiachronicle.com

Try crowdsourcing!

Page 82: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

Other resources

Mapping Texts at http://mappingtexts.stanford.edu/

Wragge Labs at http://wraggelabs.com/

Wikipedia list of crowdsourcing projects https://en.wikipedia.org/wiki/List_of_crowdsourcing_projects

Wikipedia list of digitized newspapers http://en.wikipedia.org/wiki/List_of_online_newspaper_archives

Page 83: 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

?

Photo held by John Oxley Library, State Library of Queensland. Original from Courier-mail, Brisbane, Queensland, Australia.

Frederick Zarndt [email protected]

Secretary, IFLA Newspapers Section