In Search of Lost Catalog Enrichments

Tobias Rademacher

Universitätsbibliothek Leipzig

27 June 2012


Outline

Introduction
• Catalog enrichments
• Indexing and enterprise search platform

Integration of the search service
• Thin integration layer
• Automated testing (TDD & BDD)

Integration of SWBplus
• Harvesting
• Indexing the SWBplus enrichments

Conclusion


Catalog Enrichments

• Tables of contents
• Abstracts
• Reviews
• etc.


SWBplus

• Südwestdeutscher Bibliotheksverbund (Southwest German Library Network)
• Links catalog enrichments to the catalogs of the member libraries
• Digitization of "tables of contents, blurbs and reviews" [1] (scans)
• Goal: to offer qualitatively 'fuller' search capabilities
• Exchange of digitized (scanned) tables of contents across library networks
• Cooperation with publishers: "publisher information, covers, sample chapters etc." [2]

[1] SWBplus 2012
[2] SWBplus 2012


SWBplus Website



Apache Solr: Features I

• "open source enterprise search platform" [3]
• Full-text search
• Comprehensive query capabilities
• Open interfaces
• Faceted search and filtering

[3] Apache Solr 2012


Apache Solr: Features II

• Configurable text analysis (↔ indexing)
• Rich document parsing → Apache Tika
• Open interfaces
• Management server/monitoring (JMX)
• Internally: Lucene engine


Schema: Solr Schema for Catalog Enrichments

• 'Straightforward'
• Fields defined according to the requirements
• Declaration of "copy fields"


Schema: Solr Schema for Catalog Enrichments

    <!-- Identifier and PPN are mandatory for every enrichment. -->
    <field name="id" type="string" indexed="true" stored="true" required="true"/>
    <field name="ppn" type="string" indexed="true" stored="true" required="true"/>
    <!-- Full-text fields, one per enrichment type. -->
    <field name="abstract" type="text_general" indexed="true" stored="true"/>
    <field name="tableofcontents" type="text_general" indexed="true" stored="true"/>

    <uniqueKey>id</uniqueKey>

    <!-- Copy both full-text fields into the shared field "text". -->
    <copyField source="abstract" dest="text"/>
    <copyField source="tableofcontents" dest="text"/>
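
Because the schema copies both `abstract` and `tableofcontents` into `text`, one query against `text` can cover both enrichment types. The following is a minimal sketch of how such a query URL could be built. It assumes that a catch-all `text` field is declared elsewhere in the schema, uses a hypothetical base URL, and is written for Python 3; `build_select_url` is an illustrative helper, not part of Solr.

```python
from urllib.parse import urlencode

# Hypothetical base URL of the Solr instance; adjust to the actual deployment.
SOLR_BASE_URL = "http://localhost:8983/solr"


def build_select_url(term, fields=("id", "ppn"), rows=10):
    """Build a /select URL that searches the catch-all `text` field."""
    params = [
        ("q", "text:%s" % term),   # query the copyField destination
        ("fl", ",".join(fields)),  # restrict the returned fields
        ("rows", str(rows)),       # number of hits to return
        ("wt", "json"),            # ask Solr for a JSON response
    ]
    return "%s/select?%s" % (SOLR_BASE_URL, urlencode(params))
```

The sketch only URL-encodes the term. Characters with a special meaning in the Solr query syntax would still need escaping before the term is used in production.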


Content Streaming: ExtractingRequestHandler

• Rich document parsing: Apache Tika (Apache PDFBox)
• SolrJ ContentStreamUpdateRequest (Java class)


Content Streaming: ExtractingRequestHandler

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
    import org.apache.solr.common.util.NamedList;

    // Connect to Solr and compute a digest of the file to be indexed.
    SolrServer server = new HttpSolrServer(TEST_SERVER_URL);
    String hashSum = calculateDigestFor(extractableContent);

    // Send the file to the ExtractingRequestHandler (Solr Cell).
    ContentStreamUpdateRequest updateRequest
        = new ContentStreamUpdateRequest("/update/extract");

    updateRequest.addFile(extractableContent);

    updateRequest.setParam("literal.id", extractableContent.getName());
    // Map the extracted body text to the tableofcontents field (important!).
    updateRequest.setParam("fmap.content", "tableofcontents");
    // Note: Solr Cell sets field values through literal.<field> parameters,
    // so the plain "ppn" and "id" parameters below may need that prefix.
    updateRequest.setParam("ppn", testMapping.get("ppn"));
    updateRequest.setParam("id", hashSum);
    NamedList<Object> response = server.request(updateRequest);


Requirements

• Leanness
• Search result → JSON format
    • PPN
    • The indexed content itself (for abstracts and tables of contents respectively)
    • Optionally an ID (e.g. a digest of the original text)
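
The JSON requirement can be illustrated with a short sketch. It assumes that Solr documents arrive as plain dictionaries and that the response wraps the hits in a `results` key; both the helper name `to_json_result` and that key are hypothetical and not taken from the talk.

```python
import json


def to_json_result(docs, content_field):
    """Reduce Solr documents to the fields the lean service should return."""
    slim = [
        {
            "ppn": doc["ppn"],                       # always required
            content_field: doc.get(content_field),   # abstract or tableofcontents
            "id": doc.get("id"),                     # optional digest-based ID
        }
        for doc in docs
    ]
    return json.dumps({"results": slim}, sort_keys=True)
```

Calling `to_json_result(docs, "abstract")` drops every other field (for example a relevance score), which keeps the payload small.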


MVC: Model-View-Controller Pattern


Evaluation

• Solr API
• Frameworks
• Ruby on Rails 3.x vs. Django 1.x
• Ruby 1.9.x vs. Python 2.x
• → Existing know-how


Django: Django Framework

• Python web framework
• MVC and DRY principle
• Modularization & pluggability


Django Project: Project Structure

Site
    urls.py
    settings.py
    Catalog Enrichment Module
        views.py
        models.py
        features
            abstracts.features
            abstracts.py


Python Solr Library: Sunburnt

• vs. Haystack → ORM/models
• 'Pythonic'
• Querying API
• Object mapping of search results (Python class)
• Allows a search to be restricted to specific index fields
• Faceting
• Pagination with Django
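
Solr pages through a result set with its `start` and `rows` parameters. As a rough sketch of the arithmetic behind pagination, the following helpers translate a page number into those two values; the function names are hypothetical and independent of Sunburnt and Django.

```python
def solr_window(page, per_page=10):
    """Translate a 1-based page number into Solr's start/rows parameters."""
    if page < 1 or per_page < 1:
        raise ValueError("page and per_page must be positive")
    return {"start": (page - 1) * per_page, "rows": per_page}


def page_count(num_found, per_page=10):
    """Number of pages needed for num_found hits (ceiling division)."""
    return (num_found + per_page - 1) // per_page
```

For example, page 3 with 20 hits per page starts at offset 40, and 41 hits need three such pages.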


Content Negotiation & DRY: Rubyish

    def index
      @people = Person.find(:all)

      # One action serves several formats, chosen by the request.
      respond_to do |format|
        format.html
        format.json { render :json => @people.to_json }
        format.xml  { render :xml => @people.to_xml }
      end
    end


Content Negotiation & Django: Pythonic

• Django decorators & MultiResponse class
• Bennett 2008: Another take on content negotiation
• Content-negotiation framework for Django [4]

[4] Oxford University 2012


Content Negotiation: Pythonic

    from django_conneg.views import HTMLView, JSONView


    class IndexView(JSONView, HTMLView):
        def get(self, request):
            # ...
            # The renderer is chosen from the formats the base classes offer.
            response = self.render(request, context, 'index')
            response['X-Renderer-Format'] = response.renderer.format
            return response
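
The idea behind content negotiation can be sketched without any framework: the client lists acceptable media types in the `Accept` header, and the server picks the first one it can render. The sketch below is a simplified illustration, not django-conneg's implementation; `RENDERERS`, `parse_accept` and `choose_format` are hypothetical names, and wildcard subtypes such as `text/*` are deliberately ignored.

```python
def parse_accept(header):
    """Return the media types of an Accept header, best quality first."""
    ranges = []
    for part in header.split(","):
        fields = part.strip().split(";")
        media_type = fields[0].strip()
        if not media_type:
            continue
        quality = 1.0  # default when no q parameter is given
        for param in fields[1:]:
            name, _, value = param.strip().partition("=")
            if name == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        ranges.append((quality, media_type))
    # The sort is stable, so equal qualities keep their header order.
    ranges.sort(key=lambda item: item[0], reverse=True)
    return [media_type for quality, media_type in ranges if quality > 0]


# Hypothetical mapping from media type to renderer name.
RENDERERS = {"text/html": "html", "application/json": "json"}


def choose_format(header, default="html"):
    """Pick the renderer for the most preferred supported media type."""
    for media_type in parse_accept(header):
        if media_type in RENDERERS:
            return RENDERERS[media_type]
        if media_type == "*/*":
            return default
    return default
```

A request with `Accept: text/html;q=0.5, application/json` would be answered as JSON, because the JSON entry carries the higher (implicit) quality value.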


Code: View Snippet

    from django.template import Context
    from django_conneg.views import HTMLView, JSONView
    from sunburnt import SolrInterface


    class CatalogEnrichment:
        """Plain object that is filled from the fields of one Solr document."""

        # The content fields default to None because a query with field_limit
        # returns only a subset of them.
        def __init__(self, id, ppn, abstract=None, tableofcontents=None,
                     **other_kwargs):
            self.id = id
            self.ppn = ppn
            self.abstract = abstract
            self.tableofcontents = tableofcontents

        def __repr__(self):
            return 'CatalogEnrichment("%s", "%s")' % (self.id, self.ppn)


    class QueryAbstractsView(JSONView, HTMLView):
        def get(self, request):
            # SOLR_SERVER_URL is expected to be defined in the project settings.
            solr = SolrInterface(SOLR_SERVER_URL)
            # Full-text query, restricted to the fields the view needs.
            result = solr.query(request.GET.get('fulltext')) \
                         .field_limit(["ppn", "abstract", "id"]) \
                         .execute(constructor=CatalogEnrichment)
            response = self.render(request,
                                   Context({"solr-result": result}),
                                   'query-abstracts')
            response['X-Renderer-Format'] = response.renderer.format
            return response


Django Tests: Django Unit & Integration Tests

• Automated
• "Test harness" → regression testing, refactoring
• Unit tests → micro level
• Integration tests → intermediate level
• Code-centric
• Fixtures and test data
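
A unit test at the micro level might look like the following sketch. It uses the standard `unittest` module, on which Django's test classes build. The function under test, `enrichment_id`, is a hypothetical ID scheme (SHA-1 of the text) chosen only for illustration; the talk does not specify the digest that was used.

```python
import hashlib
import unittest


def enrichment_id(text):
    """Hypothetical ID scheme: SHA-1 hex digest of the UTF-8 encoded text."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


class EnrichmentIdTest(unittest.TestCase):
    def test_same_text_gives_same_id(self):
        self.assertEqual(enrichment_id("Inhalt"), enrichment_id("Inhalt"))

    def test_different_text_gives_different_id(self):
        self.assertNotEqual(enrichment_id("Inhalt"), enrichment_id("Abstract"))

    def test_id_is_hex_digest(self):
        self.assertRegex(enrichment_id("Inhalt"), r"^[0-9a-f]{40}$")


def run_tests():
    """Run the test case programmatically and return the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(EnrichmentIdTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Such tests form the "test harness" that makes later refactoring safe: any change that alters the ID scheme fails immediately.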


BDD: Lettuce

• User stories/use cases ↔ test-driven development
• Falcão 2012: "BDD [5] tool for python, 100 % inspired on cucumber."
• Gherkin language → features
• Python → steps (implementation)

[5] Behavior Driven Development
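
The split between Gherkin features and Python steps can be illustrated with a toy runner. This is not Lettuce's API or implementation; it only mimics the idea that each line of a scenario is matched against a regular expression and dispatched to a Python function. All names (`step`, `run_scenario`, the `world` dictionary) are defined here for the sketch.

```python
import re

STEPS = []  # registry of (compiled pattern, function) pairs


def step(pattern):
    """Register a function as the implementation of a Gherkin-style line."""
    def decorator(func):
        STEPS.append((re.compile(pattern), func))
        return func
    return decorator


def run_scenario(lines, world):
    """Execute each line with the first step whose pattern matches it."""
    for line in lines:
        # Drop the leading Given/When/Then/And keyword.
        text = re.sub(r"^(Given|When|Then|And)\s+", "", line.strip())
        for pattern, func in STEPS:
            match = pattern.fullmatch(text)
            if match:
                func(world, *match.groups())
                break
        else:
            raise LookupError("no step matches: %r" % line)
    return world


@step(r'I search the abstracts for "(.+)"')
def search_abstracts(world, term):
    # Stand-in for a request to the search service.
    world["hits"] = [d for d in world["docs"] if term in d["abstract"]]


@step(r"I should get (\d+) results?")
def check_hits(world, count):
    assert len(world["hits"]) == int(count)
```

The feature text stays readable for non-programmers, while the step functions carry the implementation.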


SWBplus Harvest: Harvesting Catalog Enrichments

• No web service
• No (REST) API

Approach
→ Harvesting of PDF files


Harvest Worker: Crawler Script

• The data extract must be known (list)
• Crawler script
    • Parallelization of the HTTP GET requests
    • Metadata extraction
        • From the file name itself
        • Selection of HTML elements via XPath/CSS selectors
        • Optional: collecting fingerprints (message digest)
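
The crawler steps listed above can be sketched in Python 3 as follows. The sketch swaps in a thread pool for parallelization, which is a different technique from the evented Ruby script used in the talk. It assumes that the HTML companion page differs from the PDF URL only in its extension and uses SHA-1 as fingerprint; the helper names and the injectable `fetch` function are hypothetical.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse


def file_name_from_url(url):
    """Derive the local file name from the URL path (leading slash removed)."""
    return urlparse(url.strip()).path.lstrip("/")


def htm_url_for(pdf_url):
    """Companion HTML page, assuming it differs only in the extension."""
    if pdf_url.endswith(".pdf"):
        return pdf_url[:-len(".pdf")] + ".htm"
    return pdf_url


def fingerprint(data):
    """Message digest of the downloaded bytes."""
    return hashlib.sha1(data).hexdigest()


def harvest(urls, fetch, concurrency=4):
    """Fetch all URLs in parallel; `fetch` maps a URL to bytes."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        bodies = list(pool.map(fetch, urls))
    return {
        file_name_from_url(url): fingerprint(body)
        for url, body in zip(urls, bodies)
    }
```

Passing `fetch` as a parameter keeps the sketch testable without network access; a real run would supply a function that performs the HTTP GET request.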


Harvest Worker: Crawler Script

[Diagram with the labels: GET file, GET http, consume file request, process http request]


Crawler Script: Ruby

    require "uri"
    require "em-synchrony"
    require "em-synchrony/em-http"
    require "em-synchrony/fiber_iterator"

    # em-synchrony's own examples wrap this in EM.synchrony, which runs the
    # block inside a Fiber.
    EM.run do
      # Process the URLs with at most @concurrency requests in flight.
      EM::Synchrony::FiberIterator.new(urls_subset, @concurrency).each do |url|
        url.strip!
        begin
          # The file name is the URL path without its leading slash.
          file_name = URI.parse(url).path
          file_name.gsub!(/^\//, "")
          file_request = EventMachine::HttpRequest.new(url).get
          consume_file_request(file_request, file_name, url, pending)
          unless @ppn_output.empty?
            # gsub (not gsub!) so that a URL without ".pdf" does not yield nil.
            htm_url = url.gsub(/\.pdf/, ".htm")
            htm_request = EventMachine::HttpRequest.new(htm_url).get
            consume_http_request(htm_request, file_name, htm_url, pending)
          end
        rescue => e
          handle_error(e)
        end
      end
    end


Indexing: Keeper Script

• Solr ExtractingRequestHandler via HTTP POST requests
• Script-based
    • Python (httplib, twisted.web)
    • Ruby (net/http, em/http)
    • Shell (curl)
• Via SolrJ (Java application)
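
Whichever scripting language sends the POST request, the request URL for the ExtractingRequestHandler follows the same pattern: `literal.<field>` parameters supply metadata values and `fmap.content` names the field that receives the extracted text. The following Python 3 sketch builds such a URL; `build_extract_url` is a hypothetical helper, and the base URL in the usage note is an assumption.

```python
from urllib.parse import urlencode


def build_extract_url(solr_uri, literals, content_field,
                      handler="/update/extract"):
    """Build the request URL for Solr's ExtractingRequestHandler."""
    # Metadata that is not part of the file goes into literal.<field>.
    params = [("literal.%s" % name, value)
              for name, value in sorted(literals.items())]
    # Map the extracted body text to the target index field.
    params.append(("fmap.content", content_field))
    params.append(("commit", "true"))
    return "%s%s?%s" % (solr_uri.rstrip("/"), handler, urlencode(params))
```

For instance, `build_extract_url("http://localhost:8983/solr", {"ppn": "123456789", "id": "abc"}, "tableofcontents")` yields a URL to which the PDF file is then posted as the request body.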


Keeper Script: Ruby

    # Iterate over the resource names (hash keys) with bounded concurrency.
    EM::Synchrony::FiberIterator.new(urls_subset.keys, @concurrency).each do |resource_name|
      metadata = @ppn_list[resource_name]
      begin
        type = metadata[resource_name].to_sym
        # Index only the enrichment types that were selected.
        if @types_selection_map.key?(type)
          parameters = build_parameters(resource_name, type, metadata)
          source = File.join(@source_dir, resource_name)
          index_request_uri = "#{@solr_uri}#{@solr_extract_handler}?#{parameters}"
          # POST the file to the ExtractingRequestHandler.
          index_request =
            EventMachine::HttpRequest.new(index_request_uri).post(:file => source)
          consume_index_request(index_request, source, pending)
        else
          message = "Lorem ipsum" # placeholder log message
          @logger.error(message)
        end
      rescue => e
        handle_error(e)
      end
    end


Reflection: Critical Paths

• Performance and behavior under load
• Character sets & encoding (multi-language)
• An existing OCR layer is required (PDF files)
• Quality of the OCR layer
    • 'ii' instead of 'ü' (umlaut problem)
    • 'ä' instead of 'a' (diacritics)
    • Ligatures (ß)
    • Mixed languages (logographic characters & alphabetic script)
    • Word division (hyphenation)
• 'Test early and fail often.'
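
Two of the OCR problems named above, ligatures and hyphenation at line ends, can be reduced with a small normalization step before indexing. The following is a cautious sketch and not the approach used in the talk; `normalize_ocr_text` is a hypothetical helper. Unicode NFKC normalization folds typographic ligatures such as U+FB01 into plain letters but leaves 'ß' and umlauts untouched.

```python
import re
import unicodedata


def normalize_ocr_text(text):
    """Apply two conservative clean-ups to OCR output before indexing."""
    # NFKC folds compatibility characters, e.g. the ligature U+FB01 into "fi".
    text = unicodedata.normalize("NFKC", text)
    # Re-join words that were split by a hyphen at the end of a line.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    return text
```

The hyphen rule is a heuristic: it also joins genuine hyphenated compounds that happen to break at a line end, so its effect should be checked against real scans. Misrecognized characters such as 'ii' for 'ü' cannot be repaired this way.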


Conclusion

• Creation of a search index based on the digitized material from SWBplus
• Implementation of a dedicated (lean) service
• SWBplus 'harvester'
• Digitization → search index


References I

Apache Solr (2012). URL: http://lucene.apache.org/solr/ (accessed 25 June 2012).

Bennett, James (2008). Another take on content negotiation. URL: http://is.gd/gSkIj0 (accessed 25 June 2012).

Django Project (2012). URL: https://www.djangoproject.com/ (accessed 25 June 2012).

Event Machine (2012). URL: http://eventmachine.rubyforge.org/EventMachine.html (accessed 25 June 2012).

Falcão, Gabriel (2012). Lettuce. URL: https://www.djangoproject.com/ (accessed 25 June 2012).

Gamma, Erich, Richard Helm and Ralph Johnson (2001). Entwurfsmuster. Elemente wiederverwendbarer objektorientierter Software. München: Addison-Wesley.


References II

Grigorik, Ilya (2012a). EM-HTTP-Request. URL: https://github.com/igrigorik/em-http-request (accessed 25 June 2012).

Grigorik, Ilya (2012b). EM-Synchrony. URL: https://github.com/igrigorik/em-synchrony (accessed 25 June 2012).

Oxford University Computing Services (2012). Content-negotiation framework for Django. URL: https://github.com/oucs/django-conneg (accessed 25 June 2012).

SWBplus (2012). URL: http://www.bsz-bw.de/digitalebibliothek/swbplus.html (accessed 25 June 2012).
