Jax 2012 - Apache Solr as Enterprise Search Platform
Transcript of Jax 2012 - Apache Solr as Enterprise Search Platform
![Page 1: Jax 2012 - Apache Solr as Enterprise Search Platform](https://reader033.fdocuments.net/reader033/viewer/2022060109/5558c30bd8b42a235c8b45af/html5/thumbnails/1.jpg)
Apache Solr as an Enterprise Search Platform
![Page 2: Jax 2012 - Apache Solr as Enterprise Search Platform](https://reader033.fdocuments.net/reader033/viewer/2022060109/5558c30bd8b42a235c8b45af/html5/thumbnails/2.jpg)
Markus Klose - SHI
• Project management
• Requirements engineering
• Certified Solr Trainer
• Enterprise solutions
• Infrastructure software
• Consulting / implementation
![Page 3: Jax 2012 - Apache Solr as Enterprise Search Platform](https://reader033.fdocuments.net/reader033/viewer/2022060109/5558c30bd8b42a235c8b45af/html5/thumbnails/3.jpg)
Agenda
• Enterprise Search
• Solr Basics
• Challenges & solutions
• Outlook
![Page 4: Jax 2012 - Apache Solr as Enterprise Search Platform](https://reader033.fdocuments.net/reader033/viewer/2022060109/5558c30bd8b42a235c8b45af/html5/thumbnails/4.jpg)
Enterprise Search
85% of all companies have access to less than 50% of their data (Google)
![Page 5: Jax 2012 - Apache Solr as Enterprise Search Platform](https://reader033.fdocuments.net/reader033/viewer/2022060109/5558c30bd8b42a235c8b45af/html5/thumbnails/5.jpg)
Enterprise Search
![Page 6: Jax 2012 - Apache Solr as Enterprise Search Platform](https://reader033.fdocuments.net/reader033/viewer/2022060109/5558c30bd8b42a235c8b45af/html5/thumbnails/6.jpg)
Enterprise Search with Solr
• Open source vs. commercial
• Solr
– Relevance algorithm (TF-IDF)
– No vendor lock-in
– Access to the source code
– Active community
– No license fees / costs
– Performance
![Page 7: Jax 2012 - Apache Solr as Enterprise Search Platform](https://reader033.fdocuments.net/reader033/viewer/2022060109/5558c30bd8b42a235c8b45af/html5/thumbnails/7.jpg)
Solr Basics
• Solr …
– … is a framework for search applications
– … uses Lucene
– … provides infrastructure (cache, analyzers, etc.)
– … is configurable (customizing)
– … runs in all common servlet containers
– … current version: 3.6
![Page 8: Jax 2012 - Apache Solr as Enterprise Search Platform](https://reader033.fdocuments.net/reader033/viewer/2022060109/5558c30bd8b42a235c8b45af/html5/thumbnails/8.jpg)
Solr Basics
• Solr architecture
– Configurations
– RequestHandler
– ResponseWriter
– UpdateHandler
– ReplicationHandler
– …
![Page 9: Jax 2012 - Apache Solr as Enterprise Search Platform](https://reader033.fdocuments.net/reader033/viewer/2022060109/5558c30bd8b42a235c8b45af/html5/thumbnails/9.jpg)
Solr Basics
Configuration
• solr.xml
– Configuration of multiple cores
• solrconfig.xml
– Handlers / SearchComponents, etc.
– Caching / index settings
• schema.xml
– Fields / types / analysis
![Page 10: Jax 2012 - Apache Solr as Enterprise Search Platform](https://reader033.fdocuments.net/reader033/viewer/2022060109/5558c30bd8b42a235c8b45af/html5/thumbnails/10.jpg)
Solr Basics
HTTP Requests
• Indexing
– http://host:8983/solr/update/csv?stream.file=data.csv&stream.contentType=text/plain;charset=utf-8
• Search
– http://host:8983/solr/select?q=baseball&fq=type:pdf&sort=title asc
• Administration (SWAP)
– http://host:8983/solr/admin/cores?action=SWAP&core=live&other=test
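The request URLs above contain characters such as spaces and colons that should be URL-encoded when they are built programmatically. The following is a minimal sketch using only the Python standard library; the helper name `solr_url` is hypothetical, and `host:8983` is the placeholder from the slide, not a real server.

```python
from urllib.parse import urlencode

BASE = "http://host:8983/solr"  # placeholder host taken from the slide


def solr_url(path, **params):
    """Return BASE/path with the given parameters URL-encoded."""
    return f"{BASE}/{path}?{urlencode(params)}"


# Search request: query, filter query, and sort order from the slide.
search_url = solr_url("select", q="baseball", fq="type:pdf", sort="title asc")

# Core administration: swap the "live" and "test" cores.
swap_url = solr_url("admin/cores", action="SWAP", core="live", other="test")

print(search_url)
print(swap_url)
```

The sketch only builds the URLs; sending them (for example with `urllib.request.urlopen`) requires a running Solr instance.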
![Page 11: Jax 2012 - Apache Solr as Enterprise Search Platform](https://reader033.fdocuments.net/reader033/viewer/2022060109/5558c30bd8b42a235c8b45af/html5/thumbnails/11.jpg)
Solr Basics
Solr ecosystem
– Hadoop: distributed file system
– Mahout: data mining
– Tika: metadata indexing
– Nutch: web crawler
– ManifoldCF: repository connector
– Pypes: processing pipeline (Python)
– RabbitMQ: messaging system
![Page 12: Jax 2012 - Apache Solr as Enterprise Search Platform](https://reader033.fdocuments.net/reader033/viewer/2022060109/5558c30bd8b42a235c8b45af/html5/thumbnails/12.jpg)
Challenges
• Connecting various data sources
• Distributed / heterogeneous systems
• Permissions
• Relevance / precision & recall
• Multilingual support
• Unified search
• etc.
![Page 13: Jax 2012 - Apache Solr as Enterprise Search Platform](https://reader033.fdocuments.net/reader033/viewer/2022060109/5558c30bd8b42a235c8b45af/html5/thumbnails/13.jpg)
Connecting various data sources
• Indexing - Solr
• Indexing - DataImportHandler
• Indexing - clients
• Indexing - external tools
![Page 14: Jax 2012 - Apache Solr as Enterprise Search Platform](https://reader033.fdocuments.net/reader033/viewer/2022060109/5558c30bd8b42a235c8b45af/html5/thumbnails/14.jpg)
Indexing - Solr
![Page 15: Jax 2012 - Apache Solr as Enterprise Search Platform](https://reader033.fdocuments.net/reader033/viewer/2022060109/5558c30bd8b42a235c8b45af/html5/thumbnails/15.jpg)
Indexing - DIH
• Components
– DataSource
– EntityProcessor
– Transformer
• Use cases
– Databases
– Feeds (RSS/ATOM) & XML files
– Rich content
– Mail servers
![Page 16: Jax 2012 - Apache Solr as Enterprise Search Platform](https://reader033.fdocuments.net/reader033/viewer/2022060109/5558c30bd8b42a235c8b45af/html5/thumbnails/16.jpg)
Indexing – Clients
• Java (SolrJ)
• JavaScript
• PHP
• Ruby
• C# (SolrNet)
• Python
Apache Solr PHP Client
![Page 17: Jax 2012 - Apache Solr as Enterprise Search Platform](https://reader033.fdocuments.net/reader033/viewer/2022060109/5558c30bd8b42a235c8b45af/html5/thumbnails/17.jpg)
Indexing – External tools
• Nutch
• Heritrix
• ManifoldCF
– … Sharepoint, Documentum …
• Google Connector Framework
![Page 18: Jax 2012 - Apache Solr as Enterprise Search Platform](https://reader033.fdocuments.net/reader033/viewer/2022060109/5558c30bd8b42a235c8b45af/html5/thumbnails/18.jpg)
Distributed systems / scalability
• Replication
• Sharding
• Unique IDs
![Page 19: Jax 2012 - Apache Solr as Enterprise Search Platform](https://reader033.fdocuments.net/reader033/viewer/2022060109/5558c30bd8b42a235c8b45af/html5/thumbnails/19.jpg)
Base architecture
• A single instance handles both indexing and search
[Diagram: one Solr instance receiving both indexing and search requests]
![Page 20: Jax 2012 - Apache Solr as Enterprise Search Platform](https://reader033.fdocuments.net/reader033/viewer/2022060109/5558c30bd8b42a235c8b45af/html5/thumbnails/20.jpg)
Replication
• High volume of search requests
• 1 master with N slaves
• Delta replication is possible
• Configuration files can be replicated
[Diagram: one master (indexing) replicating to Slave 1 and Slave 2 (search)]
![Page 21: Jax 2012 - Apache Solr as Enterprise Search Platform](https://reader033.fdocuments.net/reader033/viewer/2022060109/5558c30bd8b42a235c8b45af/html5/thumbnails/21.jpg)
Master-slave configuration
![Page 22: Jax 2012 - Apache Solr as Enterprise Search Platform](https://reader033.fdocuments.net/reader033/viewer/2022060109/5558c30bd8b42a235c8b45af/html5/thumbnails/22.jpg)
Sharding
• Distribution of large data volumes
• Solr searches across all shards and merges the results
• No global TF-IDF
[Diagram: Shard 1 and Shard 2; labels: Indexing, Searching]
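The "no global TF-IDF" point follows from each shard computing document frequencies from its own documents only. The sketch below illustrates the effect with the classic Lucene-style formula idf = 1 + ln(numDocs / (docFreq + 1)); the shard sizes and document frequencies are invented for illustration.

```python
import math


def idf(num_docs, doc_freq):
    """Classic Lucene-style inverse document frequency."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Hypothetical statistics for the same term on two shards of equal size.
shard1 = {"num_docs": 1000, "doc_freq": 9}    # term is rare on shard 1
shard2 = {"num_docs": 1000, "doc_freq": 499}  # term is common on shard 2

idf_shard1 = idf(**shard1)
idf_shard2 = idf(**shard2)

# What a single, unsharded index holding all documents would compute.
idf_global = idf(
    shard1["num_docs"] + shard2["num_docs"],
    shard1["doc_freq"] + shard2["doc_freq"],
)

print(f"shard 1: {idf_shard1:.3f}")
print(f"shard 2: {idf_shard2:.3f}")
print(f"global:  {idf_global:.3f}")
```

Because the merged result list compares scores that were computed with different IDF values, a document from the shard where the term is rare is ranked higher than it would be in a single index. Distributing documents evenly (for example randomly) across shards keeps the per-shard statistics close to each other.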
![Page 23: Jax 2012 - Apache Solr as Enterprise Search Platform](https://reader033.fdocuments.net/reader033/viewer/2022060109/5558c30bd8b42a235c8b45af/html5/thumbnails/23.jpg)
Sharding & Replication
• Flexible scenario
• Large data volumes and a high volume of search requests
[Diagram: Master 1 and Master 2 (indexing); Slave 11, Slave 12, Slave 21, and Slave 22 (search)]
![Page 24: Jax 2012 - Apache Solr as Enterprise Search Platform](https://reader033.fdocuments.net/reader033/viewer/2022060109/5558c30bd8b42a235c8b45af/html5/thumbnails/24.jpg)
Unique IDs
• Updates / deletes / distributed systems
• Solr field type solr.UUIDField
• Use basic types
• Typical mistakes
– ID not unique -> fewer documents in the index
– ID not reproducible -> multiple versions in the index
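The second mistake (an ID that cannot be reproduced) can be illustrated with a small sketch. It assumes the ID is derived on the client side from a stable property of the document; the source name and path are hypothetical, and a plain dictionary stands in for the index, where adding a document under an existing ID overwrites it.

```python
import uuid


def reproducible_id(source, path):
    """Derive the same UUID every time from the document's origin."""
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"{source}/{path}"))


def random_id():
    """A fresh random UUID on every call: unique, but not reproducible."""
    return str(uuid.uuid4())


def index_twice(make_id):
    """Index two versions of the same document; same ID overwrites."""
    index = {}
    for version in (1, 2):
        index[make_id()] = {"path": "docs/report.pdf", "version": version}
    return index


stable = index_twice(lambda: reproducible_id("fileshare", "docs/report.pdf"))
unstable = index_twice(random_id)

print(len(stable))    # one entry, the newer version replaced the older
print(len(unstable))  # two entries, both versions stay in the index
```

With a reproducible ID, re-indexing acts as an update. With a random ID per indexing run, every run adds another copy.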
![Page 25: Jax 2012 - Apache Solr as Enterprise Search Platform](https://reader033.fdocuments.net/reader033/viewer/2022060109/5558c30bd8b42a235c8b45af/html5/thumbnails/25.jpg)
Heterogeneous systems / multilingual support
• Deduplication
• Solr configuration
– Dismax/eDismax
• Schema configuration
– Analysis (tokenizer / filter)
– Dynamic fields
– Copy fields
![Page 26: Jax 2012 - Apache Solr as Enterprise Search Platform](https://reader033.fdocuments.net/reader033/viewer/2022060109/5558c30bd8b42a235c8b45af/html5/thumbnails/26.jpg)
Deduplication
• Duplicate documents in the index
• schema.xml
• solrconfig.xml
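The idea behind deduplication is to compute a signature from selected fields and treat documents with the same signature as the same document. In Solr this is configured via a signature field in schema.xml and an update processor in solrconfig.xml; the sketch below only imitates the principle in plain Python. The field names and the choice of MD5 are assumptions for illustration.

```python
import hashlib

SIGNATURE_FIELDS = ("title", "body")  # hypothetical fields to compare


def signature(doc, fields=SIGNATURE_FIELDS):
    """Hash the selected field values into one hex digest."""
    digest = hashlib.md5()
    for name in fields:
        digest.update(str(doc.get(name, "")).encode("utf-8"))
        digest.update(b"\x00")  # separator so ("ab", "c") != ("a", "bc")
    return digest.hexdigest()


def deduplicate(docs):
    """Keep one document per signature; later documents overwrite earlier ones."""
    unique = {}
    for doc in docs:
        unique[signature(doc)] = doc
    return list(unique.values())


docs = [
    {"id": "1", "title": "Solr", "body": "Enterprise search"},
    {"id": "2", "title": "Solr", "body": "Enterprise search"},
    {"id": "3", "title": "Lucene", "body": "Search library"},
]
unique_docs = deduplicate(docs)
print([d["id"] for d in unique_docs])
```

An exact hash only catches identical field content. Detecting near-duplicates needs a fuzzy signature, which is outside the scope of this sketch.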
![Page 27: Jax 2012 - Apache Solr as Enterprise Search Platform](https://reader033.fdocuments.net/reader033/viewer/2022060109/5558c30bd8b42a235c8b45af/html5/thumbnails/27.jpg)
Dismax / eDismax
• DisMax – Disjunction Maximum
• Extremely versatile
• Always tries to return something
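"Disjunction maximum" describes how per-field scores are combined: the best-matching field determines the score, and the remaining fields contribute only through a tie-breaker factor. A minimal sketch of that combination rule, with invented per-field scores:

```python
def dismax_score(field_scores, tie=0.0):
    """Best field score plus `tie` times the sum of all other field scores."""
    best = max(field_scores)
    return best + tie * (sum(field_scores) - best)


# Hypothetical scores of one document for the same term in three fields,
# e.g. title, body, and author.
scores = [0.8, 0.3, 0.1]

pure_max = dismax_score(scores)            # only the best field counts
with_tie = dismax_score(scores, tie=0.1)   # other fields add a little
as_sum = dismax_score(scores, tie=1.0)     # degenerates to a plain sum

print(pure_max, with_tie, as_sum)
```

A tie value of 0 ranks purely by the best field, a value of 1 behaves like summing all fields, and small values in between are a common compromise.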
![Page 28: Jax 2012 - Apache Solr as Enterprise Search Platform](https://reader033.fdocuments.net/reader033/viewer/2022060109/5558c30bd8b42a235c8b45af/html5/thumbnails/28.jpg)
Analysis
• Field-centric processing of the content
– Tokenizer
– TokenFilter
– CharFilter
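The three component types run in a fixed order: char filters work on the raw text, the tokenizer splits it into tokens, and token filters then modify the token stream. The following sketch imitates such a chain in plain Python. The concrete steps (HTML stripping, word splitting, lowercasing, stop word removal) and the stop word list are assumptions chosen for illustration, not the configuration from the talk.

```python
import re

STOP_WORDS = frozenset({"the", "a", "of"})  # hypothetical stop word list


def strip_html(text):
    """Char filter: replace markup with spaces before tokenizing."""
    return re.sub(r"<[^>]+>", " ", text)


def tokenize(text):
    """Tokenizer: split the text into word tokens."""
    return re.findall(r"\w+", text)


def lowercase(tokens):
    """Token filter: normalize case."""
    return [t.lower() for t in tokens]


def remove_stop_words(tokens):
    """Token filter: drop very common words."""
    return [t for t in tokens if t not in STOP_WORDS]


def analyze(text):
    """Run char filter -> tokenizer -> token filters, in that order."""
    tokens = tokenize(strip_html(text))
    for token_filter in (lowercase, remove_stop_words):
        tokens = token_filter(tokens)
    return tokens


print(analyze("<b>The</b> Power of Solr"))
```

The order of the token filters matters: lowercasing runs first so that "The" is recognized as a stop word.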
![Page 29: Jax 2012 - Apache Solr as Enterprise Search Platform](https://reader033.fdocuments.net/reader033/viewer/2022060109/5558c30bd8b42a235c8b45af/html5/thumbnails/29.jpg)
Schema configuration
• Dynamic fields
• Copy fields
• Default values
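Dynamic fields assign a type to any field whose name matches a wildcard pattern, and copy fields duplicate a field's value into another field at index time. The sketch below mimics both mechanisms in plain Python; the patterns, type names, and field names are hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical dynamic field patterns and the types they map to.
DYNAMIC_FIELDS = {"*_s": "string", "*_i": "int", "*_txt": "text"}

# Hypothetical copy rules: (source field, destination field).
COPY_FIELDS = [("title", "text"), ("body", "text")]


def resolve_type(field_name):
    """Return the type of the first matching pattern, or None."""
    for pattern, field_type in DYNAMIC_FIELDS.items():
        if fnmatchcase(field_name, pattern):
            return field_type
    return None


def apply_copy_fields(doc, rules=COPY_FIELDS):
    """Return a copy of doc with source values appended to the destinations."""
    result = dict(doc)
    for source, destination in rules:
        if source in doc:
            result.setdefault(destination, []).append(doc[source])
    return result


print(resolve_type("author_s"))
print(apply_copy_fields({"title": "Solr", "body": "Search"}))
```

A catch-all destination such as `text` gives a single field to search across all languages or sources, while the original fields keep their own analysis.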
![Page 30: Jax 2012 - Apache Solr as Enterprise Search Platform](https://reader033.fdocuments.net/reader033/viewer/2022060109/5558c30bd8b42a235c8b45af/html5/thumbnails/30.jpg)
Unified search / permissions
• AutoSuggest
• Facets
• DidYouMean
• Clustering / Field Collapsing
• Permissions
![Page 31: Jax 2012 - Apache Solr as Enterprise Search Platform](https://reader033.fdocuments.net/reader033/viewer/2022060109/5558c30bd8b42a235c8b45af/html5/thumbnails/31.jpg)
AutoSuggest
• Suggests the term to search for
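One common way to produce such suggestions is a prefix lookup over the sorted terms of the index. The sketch below shows that idea with a binary search; the term list is invented, and a real setup would read the terms from the index.

```python
from bisect import bisect_left


def suggest(sorted_terms, prefix, limit=5):
    """Return up to `limit` terms that start with `prefix`."""
    start = bisect_left(sorted_terms, prefix)  # first term >= prefix
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break  # sorted input: no later term can match
        matches.append(term)
        if len(matches) == limit:
            break
    return matches


# Hypothetical index terms; the list must be sorted.
terms = sorted(["search", "server", "shard", "solr", "solrj", "sort"])

print(suggest(terms, "sol"))
print(suggest(terms, "s", limit=2))
```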
![Page 32: Jax 2012 - Apache Solr as Enterprise Search Platform](https://reader033.fdocuments.net/reader033/viewer/2022060109/5558c30bd8b42a235c8b45af/html5/thumbnails/32.jpg)
Facets
• Grouping of the result set
• Navigation element
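Conceptually, a facet is a count of result documents per distinct value of a field. The sketch below computes such counts over an invented result set; in Solr the counts come back with the search response when faceting is enabled for a field.

```python
from collections import Counter


def facet_counts(docs, field):
    """Count how many documents carry each value of `field`."""
    return Counter(doc[field] for doc in docs if field in doc)


# Hypothetical result set for some query.
results = [
    {"title": "Solr in Action", "type": "pdf"},
    {"title": "Release notes", "type": "doc"},
    {"title": "JAX slides", "type": "pdf"},
]

type_facet = facet_counts(results, "type")
print(type_facet.most_common())
```

Each value with its count can be rendered as a clickable navigation entry that adds a matching filter query to the next request.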
![Page 33: Jax 2012 - Apache Solr as Enterprise Search Platform](https://reader033.fdocuments.net/reader033/viewer/2022060109/5558c30bd8b42a235c8b45af/html5/thumbnails/33.jpg)
DidYouMean
• Word suggestions based on the index
• "Did you mean" functionality
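The principle is to compare the misspelled query term with terms that actually occur in the index and offer the closest one. The sketch below uses the string similarity from Python's standard `difflib` module as a stand-in for the distance measure; the index terms and the cutoff are assumptions.

```python
import difflib

# Hypothetical terms taken from the index.
INDEX_TERMS = ["search", "solr", "server", "shard"]


def did_you_mean(word, terms=INDEX_TERMS, cutoff=0.6):
    """Return the most similar index term, or None if nothing is close enough."""
    if word in terms:
        return None  # the word exists in the index, no suggestion needed
    matches = difflib.get_close_matches(word, terms, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(did_you_mean("serch"))
print(did_you_mean("solr"))
```

Because candidates come from the index, every suggestion is guaranteed to be a term that can produce hits.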
![Page 34: Jax 2012 - Apache Solr as Enterprise Search Platform](https://reader033.fdocuments.net/reader033/viewer/2022060109/5558c30bd8b42a235c8b45af/html5/thumbnails/34.jpg)
Clustering
• Alternative presentation of the result list
![Page 35: Jax 2012 - Apache Solr as Enterprise Search Platform](https://reader033.fdocuments.net/reader033/viewer/2022060109/5558c30bd8b42a235c8b45af/html5/thumbnails/35.jpg)
Search - Permissions
• No standard
• Example: Active Directory at SHI
– Index: additional information
– Search: additional filter query
[Diagram: the client sends q=jax to Auth.jsp; Auth.jsp sends q=jax&fq=… to Solr, where fq=allow:"12-33-45-7" AND -deny:"12-33-45-7"; the response is passed back to the client]
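The scheme above can be sketched as a small function that an intermediate layer (Auth.jsp in the diagram) could apply to every incoming query. The field names `allow` and `deny` and the token `12-33-45-7` come from the slide; the function names are hypothetical, and how the token is obtained from the directory service is left out.

```python
def permission_filter(token):
    """Build the filter query for one permission token of the current user."""
    return f'allow:"{token}" AND -deny:"{token}"'


def secured_params(user_query, token):
    """Keep the user's query untouched and add the permission filter."""
    return {"q": user_query, "fq": permission_filter(token)}


params = secured_params("jax", "12-33-45-7")
print(params)
```

Putting the restriction into `fq` rather than `q` keeps it out of the relevance score and lets Solr cache the filter separately from the user's query.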
![Page 36: Jax 2012 - Apache Solr as Enterprise Search Platform](https://reader033.fdocuments.net/reader033/viewer/2022060109/5558c30bd8b42a235c8b45af/html5/thumbnails/36.jpg)
Relevance / Precision & Recall
• TF-IDF
• Sorting / function queries
• Boosting
• Syntax
![Page 37: Jax 2012 - Apache Solr as Enterprise Search Platform](https://reader033.fdocuments.net/reader033/viewer/2022060109/5558c30bd8b42a235c8b45af/html5/thumbnails/37.jpg)
TF-IDF
• Scoring in 2 phases
– Boolean model
– Vector space model
• Relevance algorithm
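The two phases can be sketched as follows: the Boolean model first decides which documents match at all, and only those are then ranked with a TF-IDF weight. This is a deliberately simplified sketch with tf = sqrt(freq) and idf = 1 + ln(numDocs / (docFreq + 1)); it omits the normalization and boost factors of the full Lucene formula, and the documents are invented.

```python
import math

# Hypothetical, already analyzed documents.
DOCS = {
    "d1": ["solr", "search", "solr"],
    "d2": ["search", "platform"],
    "d3": ["java", "conference"],
}


def matches(tokens, query_terms):
    """Phase 1 (Boolean model): OR semantics, any query term may match."""
    return any(term in tokens for term in query_terms)


def score(tokens, query_terms, docs):
    """Phase 2 (vector space model): sum of tf * idf over the query terms."""
    total = 0.0
    for term in query_terms:
        freq = tokens.count(term)
        doc_freq = sum(1 for other in docs.values() if term in other)
        idf = 1.0 + math.log(len(docs) / (doc_freq + 1))
        total += math.sqrt(freq) * idf
    return total


def search(query_terms, docs=DOCS):
    """Filter with the Boolean model, then rank by descending score."""
    hits = {doc_id: t for doc_id, t in docs.items() if matches(t, query_terms)}
    return sorted(hits, key=lambda d: score(hits[d], query_terms, docs), reverse=True)


print(search(["solr", "search"]))
```

Document d3 never reaches the scoring phase, and d1 ranks above d2 because it contains the rarer term "solr" twice.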
![Page 38: Jax 2012 - Apache Solr as Enterprise Search Platform](https://reader033.fdocuments.net/reader033/viewer/2022060109/5558c30bd8b42a235c8b45af/html5/thumbnails/38.jpg)
Sorting / Function Queries
• Sorting
– Default is score
– Constant scoring for *:*, range queries, and fq
– Example: sort=titel asc,author desc
• Function queries
– Influence the ranking (bf/boost parameters or sort)
– Example: recip(ms(NOW,mydatefield),3.16e-11,1,1)
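The function in the example is defined as recip(x, m, a, b) = a / (m·x + b), with x being the document's age in milliseconds. Since one year is roughly 3.16e10 ms, the constant 3.16e-11 makes m·x equal to about 1 per year of age. The sketch below recomputes the boost for a few ages; the ages themselves are just sample values.

```python
MS_PER_YEAR = 365.25 * 24 * 3600 * 1000  # about 3.16e10 milliseconds


def recip(x, m, a, b):
    """Reciprocal function: a / (m * x + b)."""
    return a / (m * x + b)


def date_boost(age_ms):
    """Boost as in the slide's example: recip(age, 3.16e-11, 1, 1)."""
    return recip(age_ms, 3.16e-11, 1, 1)


for years in (0, 1, 2, 10):
    print(years, round(date_boost(years * MS_PER_YEAR), 3))
```

A brand-new document gets a boost of 1, a one-year-old document about 1/2, a two-year-old document about 1/3, so newer documents are favored without older ones dropping to zero.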
![Page 39: Jax 2012 - Apache Solr as Enterprise Search Platform](https://reader033.fdocuments.net/reader033/viewer/2022060109/5558c30bd8b42a235c8b45af/html5/thumbnails/39.jpg)
Syntax
• Query -> q
• Filter query -> fq
• Boolean operators -> OR, AND, NOT, +, -
• Phrases -> "Harrison Ford"~5
• Wildcards -> fi?m, film*
• Fuzzy -> Hale~0.9
• Boost -> q=star OR trek^4.0
• Range -> preis:[1 TO 10] (inclusive) or preis:{1 TO 10} (exclusive)
![Page 40: Jax 2012 - Apache Solr as Enterprise Search Platform](https://reader033.fdocuments.net/reader033/viewer/2022060109/5558c30bd8b42a235c8b45af/html5/thumbnails/40.jpg)
Outlook
• SolrCloud
– Distributed search with centralized configuration
• Near Real Time Search
– Alternative commit strategy
• JOIN
– "Linking" of documents
![Page 41: Jax 2012 - Apache Solr as Enterprise Search Platform](https://reader033.fdocuments.net/reader033/viewer/2022060109/5558c30bd8b42a235c8b45af/html5/thumbnails/41.jpg)
Further information
• Solr
– Wiki (http://wiki.apache.org/solr)
– Jira (https://issues.apache.org/jira/browse/SOLR)
– Mailing list (http://lucene.apache.org/solr/mailing_lists.html)
• Websites
– SHI (http://www.shi-gmbh.com/blog)
– Lucid Imagination (http://www.lucidimagination.com)
![Page 42: Jax 2012 - Apache Solr as Enterprise Search Platform](https://reader033.fdocuments.net/reader033/viewer/2022060109/5558c30bd8b42a235c8b45af/html5/thumbnails/42.jpg)
Demo / Q & A
Thank you for your interest