
In Search of Lost Catalog Enrichments

Tobias Rademacher

Universitätsbibliothek Leipzig

27 June 2012


Outline

Introduction
• Catalog enrichments
• Indexing and enterprise search platform

Integration of the search service
• Thin integration layer
• Automated testing (TDD & BDD)

Integration of SWBplus
• Harvesting
• Indexing the SWBplus enrichments

Conclusion


Catalog enrichments

• Tables of contents
• Abstracts
• Reviews
• etc.


SWBplus

• Südwestdeutscher Bibliotheksverbund (Southwest German Library Network)
• Links catalog enrichments to the catalogs of the network's member libraries
• Digitization of "tables of contents, blurbs and reviews"¹ (scans)
• Goal: to offer qualitatively 'more complete' search options
• Cross-network exchange of digitized (scanned) tables of contents
• Cooperation with publishers: "publisher information, covers, sample chapters, etc."²

¹ SWBplus 2012
² SWBplus 2012


SWBplus Website



Apache Solr: Features I

• "open source enterprise search platform"³
• Full-text search
• Comprehensive query capabilities
• Open interfaces
• Faceted search and filtering

³ Apache Solr 2012


Apache Solr: Features II

• Configurable text analysis (↔ indexing)
• Rich document parsing → Apache Tika
• Open interfaces
• Management server/monitoring (JMX)
• Internally: Lucene engine


Schema: Solr schema for catalog enrichments

• 'Straightforward'
• Fields according to the requirements
• Declaration of "copy fields"


Schema: Solr schema for catalog enrichments

    <field name="id" type="string" indexed="true" stored="true" required="true"/>
    <field name="ppn" type="string" indexed="true" stored="true" required="true"/>
    <field name="abstract" type="text_general" indexed="true" stored="true"/>
    <field name="tableofcontents" type="text_general" indexed="true" stored="true"/>

    <uniqueKey>id</uniqueKey>

    <copyField source="abstract" dest="text"/>
    <copyField source="tableofcontents" dest="text"/>


Content streaming: ExtractingRequestHandler

• Rich document parsing: Apache Tika (Apache PDFBox)
• SolrJ ContentStreamUpdateRequest (Java class)


Content streaming: ExtractingRequestHandler

    SolrServer server = new HttpSolrServer(TEST_SERVER_URL);
    // Fingerprint of the file, used as the document ID further down.
    String hashSum = calculateDigestFor(extractableContent);

    ContentStreamUpdateRequest updateRequest
        = new ContentStreamUpdateRequest("/update/extract");

    updateRequest.addFile(extractableContent);

    updateRequest.setParam("literal.id", extractableContent.getName());
    // Map the body text extracted by Tika onto the tableofcontents field.
    updateRequest.setParam("fmap.content", "tableofcontents"); // !!!!
    updateRequest.setParam("ppn", testMapping.get("ppn"));
    updateRequest.setParam("id", hashSum);
    NamedList<Object> response = server.request(updateRequest);
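The `calculateDigestFor` helper is not shown on the slide. The following is a minimal Python sketch of how such a fingerprint could be computed. The function name `calculate_digest_for` and the choice of SHA-1 are assumptions; the slides do not name the digest algorithm.

```python
import hashlib


def calculate_digest_for(path, chunk_size=64 * 1024):
    """Return a hex digest of the file at `path` (SHA-1 is an assumption)."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        # Read in chunks so large scans do not have to fit into memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Because the digest depends only on the file's bytes, re-harvesting an unchanged PDF would yield the same ID.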


Requirements

• Lean
• Search result → JSON format
  • PPN
  • The indexed content itself (for abstracts and tables of contents respectively)
  • Where applicable, an ID (e.g. a digest of the original text)
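A lean JSON result along these lines could be produced as sketched below. The field names follow the Solr schema from the earlier slide; the function name `to_json_response` and the `results` envelope key are hypothetical.

```python
import json


def to_json_response(hits):
    """Serialize search hits, keeping only the fields the requirements list."""
    wanted = ("id", "ppn", "abstract", "tableofcontents")
    slim = [
        {key: hit[key] for key in wanted if key in hit}
        for hit in hits
    ]
    # ensure_ascii=False keeps umlauts readable in the output.
    return json.dumps({"results": slim}, ensure_ascii=False, sort_keys=True)


hits = [{"id": "a1", "ppn": "123456789", "abstract": "Über Kataloge", "score": 1.0}]
print(to_json_response(hits))
```

Anything the index returns beyond the listed fields (here `score`) is dropped, which keeps the payload small.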


MVC: Model-View-Controller pattern


Evaluation

• Solr API
• Frameworks
• Ruby on Rails 3.x vs. Django 1.x
• Ruby 1.9.x vs. Python 2.x
• → existing know-how


Django framework

• Python web framework
• MVC and DRY principle
• Modularization & pluggability


Django project: project structure

Site
  urls.py
  settings.py

Catalog Enrichment Module
  views.py
  models.py

features
  abstracts.features
  abstracts.py


Python Solr library: Sunburnt

• vs. Haystack → ORM/models
• 'Pythonic'
• Querying API
• Object mapping of search results (Python class)
• Allows a search to be restricted to specific index fields
• Faceting
• Pagination with Django
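At the HTTP level, restricting a search to specific fields comes down to a few Solr request parameters. The sketch below builds such a request URL with the Python standard library only; it is not Sunburnt's API. The function name `build_select_url` and the server address are placeholders.

```python
from urllib.parse import urlencode


def build_select_url(base_url, field, term, return_fields):
    """Build a Solr /select URL that searches one field and limits the output."""
    # Quote the term as a phrase; escape backslashes and embedded quotes.
    escaped = term.replace("\\", "\\\\").replace('"', '\\"')
    params = {
        "q": '%s:"%s"' % (field, escaped),  # search only in this field
        "fl": ",".join(return_fields),      # return only these fields
        "wt": "json",                       # ask for a JSON response
    }
    return "%s/select?%s" % (base_url.rstrip("/"), urlencode(params))


url = build_select_url(
    "http://localhost:8983/solr",  # placeholder address
    "abstract",
    "Bibliothek",
    ["ppn", "abstract", "id"],
)
print(url)
```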


Content negotiation & DRY: Rubyish

    def index
      @people = Person.find(:all)

      respond_to do |format|
        format.html
        format.json { render :json => @people.to_json }
        format.xml  { render :xml => @people.to_xml }
      end
    end


Content negotiation & Django

Pythonic

• Django decorators & MultiResponse class
• Bennett 2008: Another take on content negotiation
• Content-negotiation framework for Django⁴

⁴ Oxford-University 2012
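Whatever library is used, the core of content negotiation is choosing a renderer from the request's `Accept` header. The following is a simplified sketch of that idea in plain Python, not the implementation used by django-conneg. The names `parse_accept`, `choose_format` and `RENDERERS` are hypothetical, and wildcards and `q=0` exclusions are not handled.

```python
def parse_accept(header):
    """Return media types from an Accept header, best q-value first."""
    ranked = []
    for position, part in enumerate(header.split(",")):
        pieces = [piece.strip() for piece in part.split(";")]
        media_type, quality = pieces[0], 1.0
        for parameter in pieces[1:]:
            if parameter.startswith("q="):
                try:
                    quality = float(parameter[2:])
                except ValueError:
                    quality = 0.0
        if media_type:
            # Sort by descending quality, then by position in the header.
            ranked.append((-quality, position, media_type))
    return [media_type for _, _, media_type in sorted(ranked)]


def choose_format(header, renderers, default="html"):
    """Map the best acceptable media type to a renderer name."""
    for media_type in parse_accept(header):
        if media_type in renderers:
            return renderers[media_type]
    return default


RENDERERS = {"text/html": "html", "application/json": "json"}
print(choose_format("application/json, text/html;q=0.8", RENDERERS))
```

A view can then dispatch on the returned name, so the same query logic serves both HTML and JSON clients.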


Content negotiation: Pythonic

    class IndexView(JSONView, HTMLView):
        def get(self, request):
            # ...
            response = self.render(request, context, 'index')
            response['X-Renderer-Format'] = response.renderer.format
            return response


Code: view snippet

    from django.template import Context
    from django_conneg.views import HTMLView, JSONView
    from sunburnt import SolrInterface


    class CatalogEnrichment:
        def __init__(self, id, ppn, abstracts, tableofcontents, **other_kwargs):
            self.id = id
            self.ppn = ppn
            self.abstracts = abstracts
            self.tableofcontents = tableofcontents

        def __repr__(self):
            # Use the instance attributes, not the built-in id().
            return 'CatalogEnrichment("%s", "%s")' % (self.id, self.ppn)


    class QueryAbstractsView(JSONView, HTMLView):
        def get(self, request):
            solr = SolrInterface(SOLR_SERVER_URL)
            # Query the index, return only three fields, map hits to objects.
            result = (solr.query(request.GET.get('fulltext'))
                          .field_limit(["ppn", "abstract", "id"])
                          .execute(constructor=CatalogEnrichment))
            response = self.render(request,
                                   Context({"solr-result": result}),
                                   'query-abstracts')
            response['X-Renderer-Format'] = response.renderer.format
            return response


Django tests: unit & integration tests

• Automated
• "Test harness" → regression testing, refactoring
• Unit tests → micro level
• Integration tests → intermediate level
• Code-centric
• Fixtures and test data
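A unit test at the micro level might look like the sketch below. It uses the standard `unittest` module, on which Django's test classes build. The `CatalogEnrichment` class here is a reduced stand-in written for this example, and the test names are hypothetical.

```python
import unittest


class CatalogEnrichment(object):
    """Reduced stand-in for the result class used by the search view."""

    def __init__(self, id, ppn, **other_kwargs):
        self.id = id
        self.ppn = ppn

    def __repr__(self):
        return 'CatalogEnrichment("%s", "%s")' % (self.id, self.ppn)


class CatalogEnrichmentTest(unittest.TestCase):
    def test_repr_contains_id_and_ppn(self):
        enrichment = CatalogEnrichment(id="a1", ppn="123456789")
        self.assertEqual(repr(enrichment), 'CatalogEnrichment("a1", "123456789")')

    def test_unknown_fields_are_ignored(self):
        # Extra keyword arguments from the index must not break construction.
        enrichment = CatalogEnrichment(id="a1", ppn="1", score=0.5)
        self.assertFalse(hasattr(enrichment, "score"))
```

Saved as a module, the tests can be run with `python -m unittest`. A failure after a refactoring then points directly at the regression.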


BDD: Lettuce

• User stories/use cases ↔ test-driven development
• Falcão 2012: "BDD⁵ tool for python, 100 % inspired on cucumber."
• Gherkin language → features
• Python → steps (implementation)

⁵ Behavior-Driven Development


SWBplus harvest: harvesting catalog enrichments

• No web service
• No (REST) API

Approach → harvesting of PDF files


Harvest worker: crawler script

• The data extract must be known (list)
• Crawler script
  • Parallelization of the HTTP GET requests
  • Metadata extraction
    • From the file name itself
    • Selection of HTML elements via XPath/CSS selectors
    • Optional: collect fingerprints (message digest)
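The crawler steps can be sketched in Python as follows. The original script is written in Ruby with EventMachine; this sketch uses a thread pool instead of fibers. The file-name convention (`<ppn>.pdf`), the SHA-1 fingerprint and all function names are assumptions, and the fetch function is injected so that the example runs without network access.

```python
import hashlib
import posixpath
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlsplit


def metadata_from_url(url):
    """Derive metadata from the file name (the naming scheme is assumed)."""
    file_name = posixpath.basename(urlsplit(url).path)
    stem, extension = posixpath.splitext(file_name)
    return {"file_name": file_name, "ppn": stem, "type": extension.lstrip(".")}


def harvest(urls, fetch, concurrency=4):
    """Fetch all URLs in parallel and return one record per URL."""
    def process(url):
        url = url.strip()
        body = fetch(url)
        record = metadata_from_url(url)
        # Optional fingerprint of the downloaded bytes.
        record["digest"] = hashlib.sha1(body).hexdigest()
        return record

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # map() preserves the input order of the URLs.
        return list(pool.map(process, urls))


def fake_fetch(url):
    """Stand-in for an HTTP GET so the sketch runs offline."""
    return ("content of " + url).encode("utf-8")


records = harvest(["http://example.org/scans/123456789.pdf\n"], fake_fetch)
print(records[0]["ppn"], records[0]["type"])
```

For a real harvest, `fake_fetch` would be replaced by a function that performs the GET request and returns the response body as bytes.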


Harvest worker: crawler script

[Diagram with the steps: GET file, GET http, consume file request, process http request]


Crawler script: Ruby

    EM.run do
      EM::Synchrony::FiberIterator.new(urls_subset, @concurrency).each do |url|
        url.strip!
        begin
          file_name = URI.parse(url).path
          file_name.gsub!(/^\//, "")
          file_request = EventMachine::HttpRequest.new(url).get
          consume_file_request(file_request, file_name, url, pending)
          unless @ppn_output.empty?
            # gsub (not gsub!) always returns a string, even without a match.
            htm_url = url.gsub(/\.pdf/, ".htm")
            htm_request = EventMachine::HttpRequest.new(htm_url).get
            consume_http_request(htm_request, file_name, htm_url, pending)
          end
        rescue => e
          handle_error(e)
        end
      end
    end


Indexing: keeper script

• Solr ExtractingRequestHandler via HTTP POST requests
• Script-based
  • Python (httplib, twisted.web)
  • Ruby (net/http, em/http)
  • Shell (curl)
• Via SolrJ (Java application)
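Whichever client is chosen, the request to the ExtractingRequestHandler has the same shape: the file is posted to `/update/extract`, metadata is passed as `literal.<field>` parameters, and `fmap.content` routes the extracted text to a schema field. The sketch below only builds the request URL; the function name, the server address and the values are placeholders.

```python
from urllib.parse import urlencode


def build_extract_url(solr_uri, doc_id, ppn, target_field, commit=False):
    """Build the URL for posting one file to Solr's extract handler."""
    params = [
        ("literal.id", doc_id),          # literal.<field> sets a field value
        ("literal.ppn", ppn),
        ("fmap.content", target_field),  # route extracted text to this field
    ]
    if commit:
        params.append(("commit", "true"))
    return "%s/update/extract?%s" % (solr_uri.rstrip("/"), urlencode(params))


print(build_extract_url("http://localhost:8983/solr", "a1", "123456789", "abstract"))
# → http://localhost:8983/solr/update/extract?literal.id=a1&literal.ppn=123456789&fmap.content=abstract
```

The PDF itself is then sent as the body of the POST request, for example as a multipart upload with curl's `-F` option.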


Keeper script: Ruby

    EM::Synchrony::FiberIterator.new(urls_subset, @concurrency).each do
      |resource_name|
      metadata = @ppn_list[resource_name]
      begin
        type = metadata[resource_name].to_sym
        if @types_selection_map.key?(type)
          parameters = build_parameters(resource_name, type, metadata)
          source = File.join(@source_dir, resource_name)
          # Every interpolation needs its own "#{...}".
          index_request_uri = "#{@solr_uri}#{@solr_extract_handler}?#{parameters}"
          index_request =
            EventMachine::HttpRequest.new(index_request_uri).post(:file => source)
          consume_index_request(index_request, source, pending)
        else
          message = "Lorem ipsum"
          @logger.error(message)
        end
      rescue => e
        handle_error(e)
      end
    end


Reflection: critical paths

• Performance and behavior under load
• Character sets & encoding (multi-language)
• An existing OCR layer is required (PDF files)
• Quality of the OCR layer
  • 'ii' instead of 'ü' (umlaut problem)
  • 'ä' instead of 'a' (diacritics)
  • Ligatures (ß)
  • Mixed languages (logographic characters & alphabetic script)
  • Word breaks (hyphenation)
• 'Test early and fail often.'
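Some of these OCR artifacts can be softened before indexing. The sketch below normalizes Unicode and rejoins words that were hyphenated at a line end. It is a heuristic under stated assumptions: a hyphen followed by a lowercase letter on the next line is treated as a word break, which can also remove legitimate hyphens. It does not repair misrecognized characters such as 'ii' for 'ü'. The function name is hypothetical.

```python
import re
import unicodedata

# A hyphen at the end of a line followed by a lowercase letter on the next line.
_LINE_BREAK_HYPHEN = re.compile(r"(\w)-\s*\n\s*([a-zäöüß])")


def clean_ocr_text(text):
    """Normalize Unicode and undo end-of-line hyphenation (heuristic)."""
    # NFC composes 'a' + combining diaeresis into a single 'ä' code point.
    text = unicodedata.normalize("NFC", text)
    text = _LINE_BREAK_HYPHEN.sub(r"\1\2", text)
    # Collapse remaining whitespace runs, including line breaks.
    return re.sub(r"\s+", " ", text).strip()


print(clean_ocr_text("Inhalts-\nverzeichnis und Universita\u0308t"))
# → Inhaltsverzeichnis und Universität
```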


Conclusion

• Search index creation based on the SWBplus digitizations
  • Implementation of a custom (lean) service
  • SWBplus 'harvester'
  • Digitization → search index


References I

Apache Solr (2012). URL: http://lucene.apache.org/solr/ (accessed 25 June 2012).

Bennett, James (2008). Another take on content negotiation. URL: http://is.gd/gSkIj0 (accessed 25 June 2012).

Django Project (2012). URL: https://www.djangoproject.com/ (accessed 25 June 2012).

Event Machine (2012). URL: http://eventmachine.rubyforge.org/EventMachine.html (accessed 25 June 2012).

Falcão, Gabriel (2012). Lettuce. URL: https://www.djangoproject.com/ (accessed 25 June 2012).

Gamma, Erich, Richard Helm and Ralph Johnson (2001). Entwurfsmuster. Elemente wiederverwendbarer objektorientierter Software. Munich: Addison-Wesley.



References II

Grigorik, Ilya (2012a). EM-HTTP-Request. URL: https://github.com/igrigorik/em-http-request (accessed 25 June 2012).

Grigorik, Ilya (2012b). EM-Synchrony. URL: https://github.com/igrigorik/em-synchrony (accessed 25 June 2012).

Oxford University Computing Services (2012). Content-negotiation framework for Django. URL: https://github.com/oucs/django-conneg (accessed 25 June 2012).

SWBplus (2012). URL: http://www.bsz-bw.de/digitalebibliothek/swbplus.html (accessed 25 June 2012).
