Open Library at the API Workshop

Hello.

MITH API Workshop

George Oates
Maryland, February 2011

Monday, April 11, 2011

Some rights reserved by mattdork


I work at the Internet Archive, leading the Open Library project. We recently moved into this church in the Richmond in San Francisco. We’re turning it into a library.

http://creativecommons.org/licenses/by-nc-sa/2.0/


http://www.flickr.com/photos/dork/



We’re based in San Francisco, California, where I happen to have been living for about 5 years.

Universal Access to All Knowledge


Since 1996, the non-profit Internet Archive has been building a digital library of Internet sites and other things in digital form. archive.org has a ton of texts, video, software, live music... all sorts of things.

Our mission is Universal Access to all Knowledge. Not a bad reason to get out of bed each day...

Some rights reserved by heather


It’s not your traditional non-profit... Lots of the staff are technologists and developers.

http://creativecommons.org/licenses/by-nc-nd/2.0/


http://www.flickr.com/photos/heather/


archive.org

We have many computers. They store over:
- 100,000 hours of TV from channels all over the world
- 250,000 moving images or video
- 500,000 audio recordings
- 2.5 million scanned texts
- 150,000,000,000 web pages

By rkumar


Just the other day we had 2.88 petabytes of hard drives delivered. That’s enough storage for about 2 billion books.




Another major part of what we do is scanning books. This is a picture of one of the scanning centers in San Francisco. We currently employ about 200 staff scanning books.


And today, we have millions of free texts available online. That includes:
- over 1 million books
- 150 million pages scanned
- 1,000 books scanned EVERY day
- 24 scanning centers in 5 countries, and we hope for more.


We’re also scanning microfilm, which is much faster than individual books. Here’s an example: the record of the population census from 1790 to 1930, scanned from microfilm from the collections of the Allen County Public Library and originally from the United States National Archives and Records Administration.


Examples of Cross Writing from Boston Public Library


Over 1 million free books that you can read on archive.org today, and access through the Open Library site, by checking the little “Only eBooks” box as you search.


As well as being able to download these books in a variety of different formats, from PDF to TXT and more, we also have a web-based book reader, which you can use to read our scanned texts within your web browser, without the need for any additional software. At the end of 2010, we released a new version of our open source, browser-based BookReader.

I’ve actually come to Wellington direct from a meeting in San Francisco called Books in Browser, held at the Internet Archive last week. It was there that we announced an upcoming new release of our bookreader, which will hopefully go live in the next few weeks... Here are some screenshots...


The main reason we wanted to improve on the current design was to try to build an “app-level quality” book reading experience right in the browser. This included several improvements for touch interfaces in browsers on devices like the iPad.

From a straightforward design perspective, there were also improvements to be made on usability and simple stuff like making the book bigger in the browser window.


This is a screenshot with the toolbar open, where you can see new features like a navigation bar at the bottom that allows you to scroll through the book, a “read to me” feature which plays the book in a computer-y voice, and highlights what’s being read. Also, if we know a table of contents for the book, each chapter is mapped along the navigation bar.

We’ve also rewritten the full text search engine, and I’ll talk more about that a bit later.

By rkumar


Apologies for the slightly blurry picture, but this is my boss, Brewster Kahle, who founded the Internet Archive back in 1996. He’s playing with a touchscreen which is displaying the new bookreader. The screen’s been installed in one of the reading desks that used to sit in the reading room of the Christian Science church before it became our new home. A big part of the bookreader redesign was to evolve an app-level quality book reading experience within a web browser. If you have an iPad, I’d encourage you to try it!




The Open Library project was launched back in 2007. In May 2010, we launched a total site redesign. Just last week, we released a revised home page, building on our new Lending program, and generally trying to do a better job of communicating that you can come to Open Library to find something to read for free, or a book to borrow. We also added activity graphs to try to show that there’s stuff happening, all day, every day.

A “Wikipedia for Books”


There are a few different ways to describe what Open Library is, but I think the explanation that makes the most sense is “a Wikipedia for Books”.


Scrolling down the home page...


We have a lending library of some 10,000 20th Century books. You can also access another 80,000 books if you’re (literally) sitting in one of the 150 or so libraries participating in our “In-Library Lending” program. Each participating library contributes eBooks into the in-library pool, and you can borrow anything in the pool, once you’re sitting in one of the libraries.


Yay! Graphs going up! (That peak you can see across the graphs is our lending launch. For more info, read “Get Thee to a Library!” http://blog.openlibrary.org/2011/02/22/get-thee-to-a-library/)


Snapshot of the various combinations of links we can provide to get you to books... For books we can’t lend through our own lending program, we’ve connected to Overdrive... We’re hoping to make the vendors you can buy from more dynamic, and open up the sources for online free texts. Right now, it’s just the Internet Archive texts that we link to in full.

lending ebooks

• map / OpenStreetMap


You can browse a map of (mainly North American) libraries participating in the In-Library Lending program. If you’re interested in joining, please contact us!

borrow page

• screen


Here’s what a page looks like to borrow a book. You can see 3 options: In Browser, PDF, and ePub.

In-browser is available immediately. You need to download/install Adobe Digital Editions to read PDF or ePub versions.

Developer Resources


Open Library
http://openlibrary.org/developers


Python, Postgres, SOLR, JSON, REST



http://github.com/openlibrary

We certainly have our code online at github, but we rarely receive patches. I’m OK with this, at least for now.


JSON/RDF
http://openlibrary.org/developers




{"description": {"type": "/type/text", "value": "Published in 1845, this pre-eminent American slave narrative powerfully details the life of the internationally famous abolitionist Frederick Douglass from his birth into slavery in 1818 to his escape to the North in 1838\u2014how he endured the daily physical and spiritual brutalities of his owners and drivers, how he learned to read and write, and how he grew into a man who could only live free or die."}, "created": {"type": "/type/datetime", "value": "2009-10-16T05:15:16.306558"}, "title": "Narrative of the life of Frederick Douglass, an American slave", "covers": [5658658], "subject_places": ["United States", "Maryland"], "last_modified": {"type": "/type/datetime", "value": "2011-02-26T02:29:58.442342"}, "subject_people": ["Frederick Douglass (1818-1895)", "Frederick Douglass (1817?-1895)", "Harriet A. Jacobs (1813-1897)"], "key": "/works/OL69181W", "authors": [{"type": {"key": "/type/author_role"}, "author": {"key": "/authors/OL23684A"}}], "latest_revision": 9, "subject_times": ["19th century"], "type": {"key": "/type/work"}, "subjects": ["Biography", "Abolitionists", "African American abolitionists", "Slaves", "Slavery", "United States", "African Americans", "Women slaves", "Social conditions", "Antislavery movements", "History", "Accessible book", "OverDrive", "Biography & Autobiography", "Nonfiction", "Classic Literature", "Fiction", "Protected DAISY"], "revision": 9}


JSON blob

• http://openlibrary.org/works/OL69181W/

• http://openlibrary.org/works/OL69181W.json

• http://openlibrary.org/works/OL69181W.rdf
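The three addresses above follow one pattern: the work key, plus an optional extension that picks the representation. Here is a minimal Python sketch of that pattern, assuming only the URL shapes shown on this slide. The helper names `work_url` and `summarize_work` are made up for illustration, and the sample record is a trimmed copy of the JSON blob above so the sketch runs without a network connection.

```python
import json

BASE = "http://openlibrary.org"


def work_url(work_key, fmt=None):
    """Build the address of a work record.

    work_key is a key such as "/works/OL69181W". fmt is None for the
    HTML page, or "json" / "rdf" for the machine-readable versions.
    """
    if fmt is None:
        return BASE + work_key + "/"
    return "{}{}.{}".format(BASE, work_key, fmt)


def summarize_work(record):
    """Pull a few fields out of a parsed work record (a dict)."""
    return {
        "key": record["key"],
        "title": record["title"],
        "authors": [a["author"]["key"] for a in record.get("authors", [])],
        "subjects": record.get("subjects", []),
    }


# Trimmed copy of the JSON blob shown above (most fields left out).
SAMPLE = json.loads("""
{"title": "Narrative of the life of Frederick Douglass, an American slave",
 "key": "/works/OL69181W",
 "authors": [{"type": {"key": "/type/author_role"},
              "author": {"key": "/authors/OL23684A"}}],
 "subjects": ["Biography", "Abolitionists"],
 "revision": 9}
""")

summary = summarize_work(SAMPLE)
print(work_url(summary["key"], "json"))
print(summary["title"], summary["authors"])
```

Note that authors come back as keys rather than names, so a second request (for the author record) would be needed to show a human-readable name.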


HTML, JSON, RDF

http://openlibrary.org/works/OL69181W/Narrative_of_the_life_of_Frederick_Douglass_an_American_slave

http://openlibrary.org/works/OL69181W.json

http://openlibrary.org/works/OL69181W.rdf

Data Dumps
http://archive.org/details/ol_data




archive.org/details/ol_data

There’s a copy of everything we’re using on the Internet Archive too.

API
http://openlibrary.org/developers/api

Open Library has a RESTful API, best used to link into Open Library data in JSON, YAML and RDF/XML.

• Books: http://openlibrary.org/dev/docs/api/books
• Covers: http://openlibrary.org/dev/docs/api/covers
• Search inside: http://openlibrary.org/dev/docs/api/search_inside
• Subjects: http://openlibrary.org/dev/docs/api/subjects
• Recent Changes: http://openlibrary.org/dev/docs/api/recentchanges
• Lists: http://openlibrary.org/dev/docs/api/lists


Request:

GET http://openlibrary.org/people/george08/lists.json

Response:

{
  "links": {
    "self": "/people/george08/lists.json",
    "next": "/people/george08/lists.json?limit=5&offset=5"
  },
  "size": 12,
  "entries": [
    {
      "url": "/people/george08/lists/OL13L",
      "full_url": "/people/george08/lists/OL13L/Various_Seeds_for_Testing",
      "name": "Various Seeds for Testing",
      "last_update": "2010-12-21T00:46:17.712513",
      "seed_count": 13,
      "edition_count": 13181
    },
    {
      "url": "/people/george08/lists/OL97L",
      "full_url": "/people/george08/lists/OL97L/Time_Travel",
      "name": "Time Travel",
      "last_update": "2010-12-17T18:27:14.781336",
      "seed_count": 5,
      "edition_count": 838
    },
    { ... },
    { ... },
    { ... }
  ]
}

http://openlibrary.org/dev/docs/api/lists





We built lists for a couple of reasons: 1, to help people collect things together, and 2, to make it easy to get at smaller sets of records.
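To get at one of those smaller sets from code, a client can walk the lists response page by page. The Python sketch below follows the `links.next` value seen in the response above. It assumes that the last page simply leaves `next` out, which the slide does not confirm. The names `fetch_json` and `iter_lists` are invented for this example, and the canned `PAGES` dictionary stands in for the server so the sketch runs offline.

```python
import json
from urllib.request import urlopen

BASE = "http://openlibrary.org"


def fetch_json(path):
    """Fetch a path such as "/people/george08/lists.json" and parse it."""
    with urlopen(BASE + path) as response:
        return json.loads(response.read().decode("utf-8"))


def iter_lists(first_path, fetch=fetch_json):
    """Yield every list entry, following links.next until there is none."""
    path = first_path
    while path:
        page = fetch(path)
        for entry in page.get("entries", []):
            yield entry
        # Assumption: the final page has no "next" link.
        path = page.get("links", {}).get("next")


# Offline stand-in: two canned pages shaped like the response above.
PAGES = {
    "/people/george08/lists.json": {
        "links": {"self": "/people/george08/lists.json",
                  "next": "/people/george08/lists.json?limit=5&offset=5"},
        "size": 12,
        "entries": [
            {"url": "/people/george08/lists/OL13L",
             "name": "Various Seeds for Testing", "seed_count": 13},
            {"url": "/people/george08/lists/OL97L",
             "name": "Time Travel", "seed_count": 5},
        ],
    },
    "/people/george08/lists.json?limit=5&offset=5": {
        "links": {"self": "/people/george08/lists.json?limit=5&offset=5"},
        "size": 12,
        "entries": [],
    },
}

names = [entry["name"]
         for entry in iter_lists("/people/george08/lists.json", fetch=PAGES.get)]
print(names)
```

Passing the fetch function in as a parameter keeps the paging logic separate from the network call; drop the `fetch=` argument to talk to the live site.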

Covers
http://openlibrary.org/developers/api




http://covers.openlibrary.org/b/$key/$value-$size.jpg


Where:

• key can be any one of ISBN, OCLC, LCCN, OLID and ID (case-insensitive)
• value is the value of the chosen key
• size can be one of S, M and L for small, medium and large respectively.


http://covers.openlibrary.org/b/olid/OL7440033M-S.jpg (we use this)

http://covers.openlibrary.org/b/isbn/0385472579-S.jpg


http://covers.openlibrary.org/b/lccn/93005405-S.jpg

http://covers.openlibrary.org/b/oclc/28419896-S.jpg
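The cover pattern is simple enough to wrap in a few lines. This Python sketch builds cover addresses from the key, value and size described above. The function name `cover_url` is an invention for this example, and the validation is a convenience added here rather than something the service requires.

```python
COVERS_BASE = "http://covers.openlibrary.org/b"
COVER_KEYS = {"isbn", "oclc", "lccn", "olid", "id"}
COVER_SIZES = {"S", "M", "L"}


def cover_url(key, value, size="S"):
    """Build a cover image address from the $key/$value-$size pattern."""
    key = key.lower()  # keys are case-insensitive, so normalise them
    if key not in COVER_KEYS:
        raise ValueError("unknown cover key: {!r}".format(key))
    if size not in COVER_SIZES:
        raise ValueError("size must be one of S, M or L")
    return "{}/{}/{}-{}.jpg".format(COVERS_BASE, key, value, size)


print(cover_url("olid", "OL7440033M"))
print(cover_url("ISBN", "0385472579", "M"))
```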












Yay!


DOUBLE Yay!



One of quite a few examples of Open Library in the wild is the National Library of Australia’s new search engine, Trove.


You can see there that there are links to Open Library books wherever one can be sourced.

There are a growing number of sites making use of Open Library data... and that’s what we’re all about - data in, data out. The more interconnections we can make with other systems, the easier it will be for people to land where they want to go inside Open Library.


This is ImportBot. He gets new catalog records from the Library of Congress and puts them into Open Library every Tuesday. We also import records from Amazon, and from the Internet Archive. ImportBot looks for recently scanned books, and creates new records (or merges them with existing ones) just a few minutes after the record is created on the Internet Archive.


You can see ImportBot working away, just like you can see the Wiki’s edit history for every person who edits something.


Another quick note on data in before I move on...

We’ve been experimenting with a couple of other “surgical” bots that look across the catalog and connect edition records directly to other services by stamping identifiers from other systems into Open Library. This is a bot written by a developer called Ben Gimpert. It takes a file mapping ISBNs to Goodreads IDs, looks for ISBN matches in OL, and then adds the Goodreads ID to those records. This allows us to construct links to Goodreads, and to make the Goodreads ID available through the API.
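The core of a stamping bot like this is a small join between two data sets. The Python sketch below shows that join on in-memory records. It is not Ben Gimpert's code: the function name `stamp_identifiers`, the record field names (`isbn_10`, `identifiers`) and the Goodreads ID `12345` are all assumptions made for illustration, and the edition key and ISBN are borrowed from the URL examples elsewhere in this talk.

```python
def stamp_identifiers(editions, isbn_to_goodreads):
    """Add a Goodreads ID to every edition whose ISBN is in the mapping.

    editions is a list of dicts; isbn_to_goodreads maps ISBN strings to
    Goodreads ID strings. Returns the keys of the editions that changed,
    so a bot could save only those records.
    """
    changed = []
    for edition in editions:
        for isbn in edition.get("isbn_10", []):
            goodreads_id = isbn_to_goodreads.get(isbn)
            if goodreads_id is None:
                continue
            ids = edition.setdefault("identifiers", {}).setdefault("goodreads", [])
            if goodreads_id not in ids:  # never stamp the same ID twice
                ids.append(goodreads_id)
                changed.append(edition["key"])
    return changed


# Hypothetical data: field names and the Goodreads ID are made up.
editions = [
    {"key": "/books/OL7440033M", "isbn_10": ["0385472579"]},
    {"key": "/books/OL0000000M", "isbn_10": ["0000000000"]},
]
mapping = {"0385472579": "12345"}

first_run = stamp_identifiers(editions, mapping)
second_run = stamp_identifiers(editions, mapping)
print(first_run, second_run)
```

Running the function twice changes nothing the second time, which matters for a bot that may be restarted partway through a batch.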


You can see we’ve added a little widget on the page that connects to Goodreads. If you have an account, you can add our records to your lists on Goodreads. There’s also a LibraryThing ID, added by a similar batch bot update.

Writing bots to do things like this is the sort of development we’d like to open up to external developers too...

BookReader
http://openlibrary.org/dev/docs/ia







The Library of Congress is using our Bookreader on read.gov. There are quite a few other examples of the IA Bookreader out there on the web. Hopefully the redesign (with touch interactions etc) will attract new people too...


Princeton Digital Library

Internet Archive
http://openlibrary.org/dev/docs/ia




http://archive.org/help

Raw Full Text > 4 million documents

with metadata




Stanford NLP thing

http://nlp.stanford.edu/

We’ve just begun experimenting with some of the software made by the Stanford Natural Language Processing Group, which includes members of both the Linguistics Department and the Computer Science Department. One idea is to fold this software into the scanning process, so we can do a first pass on entity extraction on the full text of a book, to extract things like names, places and common subjects...



But then of course, you can do cool stuff like this :)

Challenges


Tension? http://flic.kr/p/6zyU3U

The Taxonomy vs Folksonomy debate may be represented thusly.


1) Books are for use.

2) Every reader his [or her] book.

3) Every book its reader.

4) Save the time of the User.

5) The library is a growing organism.


So, on the basis of the idea of our current catalog being a substrate, as Ranganathan suggests in his five laws of library science...


So... Open Library is a virtual space. Its organization isn’t constrained like a physical catalog. In fact, the more connections you can make into one of our “virtual index cards” the more ways people have to discover and navigate its contents.

http://www.flickr.com/photos/brixton/1394845916/

http://flic.kr/p/6pmtQL

But, librarians are (very clever) humans too. And everyone who’s responsible for putting books into a traditional catalogue must work within patterns. Patterns that have grown semantically remarkable and deeply complex.

http://www.lib.cam.ac.uk/exhibitions/Fantasy_to_Federation/Bellin1753.jpg


Unknown author 403
Unknown Author 358
Author unknown 254
No Author 145
Author Unknown 59
No Author. 54
Author 20
No author. 16
No author 12
unknown author 8
Unknown Author Unknown 7
no author 7
No Author Stated 7
(No Author) 6
No author noted 5
No author noted. 4
no author listed 4
(no author) 4
Author Not Stated 4
Author. 4
No author specified 3
Miscellaneous Author 3
no Author 3
Author One 3
Multi-Author 3
No Author Listed 3
No Stated Author 3
Author Anonymous 2
(no author given) 2
Author 2
Author Wright 2
Unkown Author 2
No author stated 2
Mms suspense author 2
Author Test 2
TEST AUTHOR 2

http://openlibrary.org/search?author=author

Duplicate authors (and editions) are an issue... This is an example search for author records with “author” in their names... you can see the variety of ways that catalogers have noted unknown authors...




http://www.flickr.com/photos/blackbeltjones/4294354526/

We’ve noticed a TON of minor variations in the way cataloguers enter data... Trivial to us, but very hard for computers to differentiate.
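To make the “hard for computers” point concrete, here is a rough Python sketch of one way to fold the unknown-author variants together. It is a heuristic invented for this write-up, not something Open Library runs: a name counts as a placeholder when every word in it, after lowercasing and stripping punctuation, comes from a small hand-picked word list. The counts are copied from the search results on the earlier slide.

```python
import re
from collections import Counter

# Hand-picked vocabulary (an assumption; includes the "unkown" typo seen above).
PLACEHOLDER_WORDS = {
    "unknown", "unkown", "no", "not", "author", "stated", "noted",
    "listed", "specified", "given", "anonymous",
}


def normalize(name):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9\s]", " ", name.lower())
    return " ".join(cleaned.split())


def is_placeholder_author(name):
    """True when the name is built only from placeholder words."""
    words = normalize(name).split()
    return bool(words) and all(word in PLACEHOLDER_WORDS for word in words)


# A few rows from the search results, with their record counts.
SLIDE_COUNTS = {
    "Unknown author": 403,
    "Unknown Author": 358,
    "Author unknown": 254,
    "Author Unknown": 59,
    "unknown author": 8,
    "Author Wright": 2,
}

merged = Counter()
for name, count in SLIDE_COUNTS.items():
    key = "(placeholder)" if is_placeholder_author(name) else name
    merged[key] += count

print(merged)
```

Even this small rule has gaps: it leaves “Miscellaneous Author” and “Author One” alone, and deciding whether those are placeholders or real names still takes a human eye.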

http://www.flickr.com/photos/eyeliam/2562666943/


Substrate: any surface on which a plant or animal lives or on which a material sticks

Some rights reserved by Brynja Eldon


We have a repository that mostly contains records created by professionals. I find it useful to consider these records as a substrate, something that can be reacted upon.



http://www.flickr.com/photos/brynja_eldon/


What if we consider the source Open Library records like that?

Some rights reserved by Brynja Eldon


Now that we’ve begun to reveal this substrate, how will people react to it? What reactions has it caused so far?






Handwritten scribbles and scrawls; annotations; corrections

Some rights reserved by jared


What if a catalog looks like this? Is crystalline? What if it is unconstrained by the need to sort, say, alphabetically?

From the artist of this image, Jared Tarbell: “Lines like crystals form at perpendicular angles to existing lines. A complex form emerges. 1000 classic computational substrate, color palette stolen from Jackson Pollock: A simple perpendicular growth rule creates intricate city-like structures. The simple rule, the complex results, the enormous potential for modification; this has got to be one of my all time favorite self-discovered algorithms. Lines likes crystals grow on a computational substrate.”

http://creativecommons.org/licenses/by/2.0/


http://www.flickr.com/photos/generated/



What happens when you introduce turbulence into the catalog? Here are a few examples of the sorts of edits we’re seeing... at a rate of about 100,000 edits per month.

http://www.flickr.com/photos/rreis/4859722551/sizes/l/

000s of edits per month



if you don’t stimulate an organism, it atrophies


Activity/History


One of the key components to any happy social system is the visibility of other people, and a sense of activity. This is one of the key elements we’re focussed on in the redesign. This particular list shows all edits by humans on Open Library, and actually, turns out to be a handy way to spot check what’s happening. You’ll notice too, there’s a special tab for the variety of edits that we run across the system using bots. Bot edits are often pretty mechanical and repetitive, and we found that the bots obscure the humans if you just mush everything up in a big list, so we separated them.

Activity/History
Live Data



Solutions?


http://www.flickr.com/photos/emdot/400280705/

Shelf


I really like how Raymond described his book yesterday, that as soon as he’d written it, it began to decay... Concrete, decay



http://www.flickr.com/photos/arenamontanus/352130655/

Network


Plastic, self-healing



Minimum Viable Record


Now, I want to try a little exercise. I’m going to hand out an index card to all of you, and ask you to nominate 5 fields that you think are enough to describe a book. I’ll collate the results and report back later.

http://dotspotting.stamen.com/


Stamen Design in SF. Got funding from Knight Foundation to build Citytracking. Challenge is a “hodgepodge of bits—including APIs [2] and official sources, scraped websites, sometimes-reusable data formats and datasets, visualizations, embeddable widgets etc.—is fractured, overly technical and obscure, held in the knowledge base of a relatively small number of people, and requires considerable expertise to harness.”

http://dotspotting.stamen.com/about


“...the first part of this project is to start from scratch, in a 'clean room' environment. We've started from a baseline that's really straightforward, tackling the simplest part: getting dots on maps, without legacy code or any baggage. Just that, to start. Dots on maps.

But “dots on maps” implies a few other things: getting the locations, putting them on there, working with them, and, crucially, getting them out in a format that people can work with.”







Open Publication Distribution System (OPDS)
http://bookserver.archive.org/catalog/new


This is an example of trying something very bare bones, to try to help systems intercommunicate more easily. (Open Library plans to publish OPDS feeds soon.) The Open Publication Distribution System (OPDS) Catalog specification is a syndication format for electronic publications based on Atom (RFC 4287) and HTTP (RFC 2616).

American notes for general circulation [microform]
February 25, 2011 10:22 AM
Author: Dickens, Charles, 1812-1870
Publisher: New York : Harper
Year published: 1842
Book contributor: Canadiana.org
Language: en
Download Ebook: (PDF) (EPUB)
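Because an OPDS catalog is an Atom feed, a few lines with a standard XML parser are enough to read it. The Python sketch below parses a hand-written feed fragment modelled on the Dickens entry above. The fragment is not copied from the real bookserver feed: the `example.org` addresses, the entry id and the timestamp's time zone are placeholders, and the acquisition link relation is taken from the OPDS specification rather than from this talk.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
ACQUISITION = "http://opds-spec.org/acquisition"

# Hand-written sample, shaped like an OPDS catalog entry (placeholder URLs).
SAMPLE_FEED = """
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>New books</title>
  <entry>
    <id>urn:example:american-notes</id>
    <title>American notes for general circulation [microform]</title>
    <author><name>Dickens, Charles, 1812-1870</name></author>
    <updated>2011-02-25T10:22:00Z</updated>
    <link rel="http://opds-spec.org/acquisition" type="application/pdf"
          href="http://example.org/american-notes.pdf"/>
    <link rel="http://opds-spec.org/acquisition" type="application/epub+zip"
          href="http://example.org/american-notes.epub"/>
  </entry>
</feed>
"""


def parse_entries(feed_xml):
    """Return title, author and download links for each entry in a feed."""
    root = ET.fromstring(feed_xml)
    books = []
    for entry in root.findall(ATOM + "entry"):
        # Keep only acquisition links, keyed by their media type.
        downloads = {
            link.get("type"): link.get("href")
            for link in entry.findall(ATOM + "link")
            if link.get("rel", "").startswith(ACQUISITION)
        }
        books.append({
            "title": entry.findtext(ATOM + "title"),
            "author": entry.findtext(ATOM + "author/" + ATOM + "name"),
            "downloads": downloads,
        })
    return books


books = parse_entries(SAMPLE_FEED)
print(books[0]["title"], sorted(books[0]["downloads"]))
```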



Individuals can also add new books with a few details like Title, Author, Publisher and Publish Date. That’s enough for a stub, and then people are invited to add more details.
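As a thought experiment, the four-field stub could be modelled like this in Python. The class name `StubRecord`, its field names and the `missing` helper are hypothetical and do not reflect Open Library's actual schema; the sample values come from the Dickens record shown earlier.

```python
from dataclasses import dataclass, field

REQUIRED = ("title", "author", "publisher", "publish_date")


@dataclass
class StubRecord:
    """A minimum viable book record: four fields, everything else optional."""
    title: str
    author: str
    publisher: str
    publish_date: str
    extra: dict = field(default_factory=dict)  # details people add later

    def missing(self):
        """List the required fields that are still empty."""
        return [name for name in REQUIRED if not getattr(self, name).strip()]


stub = StubRecord(
    title="American notes for general circulation",
    author="Dickens, Charles, 1812-1870",
    publisher="New York : Harper",
    publish_date="1842",
)
stub.extra["language"] = "en"  # an invited later addition

incomplete = StubRecord("Untitled pamphlet", "", "Unknown press", "")
print(stub.missing(), incomplete.missing())
```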

Canonical ID?


Canonical ID?
Collect them.



Another experiment we’re looking forward to trying is about identifiers. We’re not particularly concerned about canonical identifiers. Perhaps it’s a waste of time to wait for one, so instead, we’re going to try and attach as many ID types to our records as we can. (This list is just a braindump - not active yet.) The idea is that people could add a URL or actual identifier and Open Library would just do the right thing. A suggestion (after this presentation was delivered) was that people could ping Open Library with an identifier, not even knowing what TYPE of ID it is. Perhaps Open Library could help “triangulate” this query towards a book record. “Record laundering.”
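The “triangulation” idea can be sketched as a function that takes a bare identifier and returns the types it could plausibly be. The Python below is speculative and is not how Open Library works. The OLID pattern is inferred from the examples in this talk, the ISBN-10 check uses the standard checksum, and any all-digit string is treated as an ambiguous OCLC, LCCN or internal ID. The names `guess_id_types` and `isbn10_is_valid` are invented here.

```python
import re

# Edition identifiers in this talk look like "OL7440033M".
OLID_PATTERN = re.compile(r"OL\d+M", re.IGNORECASE)


def isbn10_is_valid(text):
    """Check the ISBN-10 checksum (weights 10 down to 1, sum divisible by 11)."""
    digits = text.replace("-", "").replace(" ", "").upper()
    if not re.fullmatch(r"\d{9}[\dX]", digits):
        return False
    total = sum((10 - i) * (10 if ch == "X" else int(ch))
                for i, ch in enumerate(digits))
    return total % 11 == 0


def guess_id_types(value):
    """Return the identifier types a bare value might be, most specific first."""
    value = value.strip()
    guesses = []
    if OLID_PATTERN.fullmatch(value):
        guesses.append("olid")
    if isbn10_is_valid(value):
        guesses.append("isbn")
    if value.isdigit():
        # Plain numbers are ambiguous: try each numeric scheme in turn.
        guesses.extend(["oclc", "lccn", "id"])
    return guesses


for candidate in ("OL7440033M", "0385472579", "28419896", "hello"):
    print(candidate, guess_id_types(candidate))
```

A real service would then try each guess against the catalog and see which ones land on a record. The type guess only narrows the search.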

Canonical ID?
Exchange them.


http://openlibrary.org/books/olid/OL7440033M

http://openlibrary.org/books/isbn/0385472579


http://openlibrary.org/books/lccn/93005405

http://openlibrary.org/books/oclc/28419896

http://openlibrary.org/books/id/240727

http://openlibrary.org/books/amazon/...

http://openlibrary.org/books/bookmooch/...

http://openlibrary.org/books/goodreads/...

http://openlibrary.org/books/ocaid/...

http://openlibrary.org/books/librarything/...

http://openlibrary.org/books/paperback_swap/...

http://openlibrary.org/books/Your ID Here/...


You can already ping Open Library with an ID other than the Open Library identifier to see if we have any matches.
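A minimal Python sketch of building those lookup addresses follows. It accepts only the five identifier types that appear above with concrete values (olid, isbn, lccn, oclc, id), because the slide does not say which of the others are live. The function name `lookup_url` is invented for this example.

```python
from urllib.parse import quote

BOOKS_BASE = "http://openlibrary.org/books"

# Types shown on the slide with real example values.
EXAMPLE_TYPES = ("olid", "isbn", "lccn", "oclc", "id")


def lookup_url(id_type, value, allowed=EXAMPLE_TYPES):
    """Build the /books/<type>/<value> address for an identifier lookup."""
    if id_type not in allowed:
        raise ValueError("unsupported identifier type: {!r}".format(id_type))
    # Percent-encode the value so odd characters cannot break the path.
    return "{}/{}/{}".format(BOOKS_BASE, id_type, quote(str(value), safe=""))


print(lookup_url("isbn", "0385472579"))
print(lookup_url("oclc", 28419896))
```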

http://openlibrary.org/books/




http://covers.openlibrary.org/b/id/240727-S.jpg






































Your ID


Your ID

Everyone else’s


Make nodes, not cards

Some rights reserved by yobink

Network, not sequence


Thanks!
George [email protected]

@openlibrary


