Suche mit Apache Lucene & Co.

Christian Meder, Bernhard Pflugfelder (inovex GmbH)

Page 1: Suche mit Apache Lucene & Co.

Page 2: Suche mit Apache Lucene & Co.

Speaker: Christian Meder

‣  CTO@inovex

Background:
‣  open source (free software)
‣  Linux
‣  Web
‣  Java
‣  Android

Page 3: Suche mit Apache Lucene & Co.

Speaker: Bernhard Pflugfelder

‣  Big Data Engineer@inovex

Background:
‣  Lucene
‣  Solr
‣  Text Mining Technologies, Information Retrieval
‣  Hadoop
‣  Java

Page 4: Suche mit Apache Lucene & Co.

Session I Agenda

‣  09:00 - 09:30  Introduction, Search in a nutshell
‣  09:30 - 10:00  Solr Exercise 1: Installation, Web Admin Interface
‣  10:00 - 10:30  Solr Exercise 2: Indexing, Queries I
‣  10:30 - 11:00  Coffee Break
‣  11:30 - 12:00  Solr Exercise 3: Data ingestion XML / SQL, Queries II

Page 5: Suche mit Apache Lucene & Co.

Session II Agenda

‣  12:00 - 12:30  Solr Exercise 4: Schema, Data types, Analyzers, Stemming
‣  12:30 - 13:30  Lunch
‣  13:30 - 14:00  Solr Exercise 5: Facet search, Filter search, Interval search
‣  14:00 - 14:30  Solr Exercise 6: Dismax, Autosuggestion, MoreLikeThis

Page 6: Suche mit Apache Lucene & Co.

Session III Agenda

‣  14:30 - 15:00  ES Exercise 1: Installation, Indexing, Queries I
‣  15:00 - 15:30  Coffee Break
‣  15:30 - 16:00  ES Exercise 2: Schema, Data types, Analyzers, Queries II
‣  16:00 - 16:30  ES Exercise 3: Data ingestion SQL / XML
‣  16:30 - 17:00  ES Exercise 4: Facet search, Filter search, Interval search

Page 7: Suche mit Apache Lucene & Co.

Search tag cloud

Introduction

7

Page 8: Suche mit Apache Lucene & Co.

‣  Classical search applications are applications focusing on information or document retrieval

‣  Requirement: find information the user asks for!

‣  Some examples:

‣  Web search

‣  Enterprise search

‣  Document search (within DMS or CMS)

‣  Search on portals and archives

‣  Product search

‣  Specialized searches for people, companies, etc.

Classical search applications

Introduction

8

Page 9: Suche mit Apache Lucene & Co.

Introduction: Where search is in, Enterprise Search

Page 10: Suche mit Apache Lucene & Co.

Introduction: Where search is in, Online shops

Page 11: Suche mit Apache Lucene & Co.

Introduction: Where search is in, Semantic search @ Google

Page 12: Suche mit Apache Lucene & Co.

Introduction: Where search is in, Navigation & Information access

Page 13: Suche mit Apache Lucene & Co.

Introduction: Search-based applications, Data Analysis

http://datarpm.com/product

Page 14: Suche mit Apache Lucene & Co.

Introduction: NoSQL Database, Document store

‣  Can you think of other scenarios where search applications would also do a good job?
‣  Recall the key capabilities of search technologies:
   ‣  Persistence
   ‣  Flexible data model
   ‣  Unstructured data, but not only
   ‣  Extremely quick access to data
   ‣  Horizontal scalability

There are plenty of application scenarios out there where search technologies should be considered!

Page 15: Suche mit Apache Lucene & Co.

Hot open source search technologies

Projects

15

http://lucene.apache.org

http://lucene.apache.org/solr/

http://www.elasticsearch.org

Page 16: Suche mit Apache Lucene & Co.

Projects: Lucene Overview

Lucene is an open source, pure Java API for information retrieval.

‣  Originally developed by Doug Cutting in 1999; joined the Apache Software Foundation in 2001 (as part of Jakarta) and became a top-level project (TLP) in 2005
‣  Licensed under the Apache License 2.0
‣  Pure Java library, with ports and bindings for other platforms:
   ‣  Lucene.NET (http://lucenenet.apache.org)
   ‣  PyLucene (http://lucene.apache.org/pylucene/)
   ‣  and more: http://wiki.apache.org/lucene-java/LuceneImplementations
‣  Large and very active developer community, well documented and supported (38 active committers!)
‣  Current stable release: 4.2.1
‣  Widely used and adopted for commercial / non-commercial projects: http://wiki.apache.org/lucene-java/PoweredBy

http://lucene.apache.org/

Page 17: Suche mit Apache Lucene & Co.

Projects: Lucene Highlights

‣  Scalable, High-Performance Indexing
   ‣  over 95GB/hour on modern hardware
   ‣  small RAM requirements
   ‣  incremental indexing as fast as batch indexing
   ‣  index size roughly 20-30% the size of text indexed
‣  Powerful, Accurate and Efficient Search Algorithms
   ‣  ranked searching: best results returned first
   ‣  many powerful query types
   ‣  fielded searching (e.g., title, author, contents)
   ‣  date-range searching
   ‣  sorting by any field
   ‣  multiple-index searching with merged results
   ‣  allows simultaneous update and searching

[From http://lucene.apache.org/core/features.html]

Page 18: Suche mit Apache Lucene & Co.

Projects: Solr Overview

Solr is a standalone enterprise search server & document store based on Lucene.

‣  Created by Yonik Seeley at CNET Networks in 2004
‣  Entered the Apache Incubator in 2006 and graduated in 2007 as a subproject of Lucene
‣  Licensed under the Apache License 2.0
‣  Seeley and others founded Lucid Imagination -> LucidWorks
‣  Large and very active developer community, well documented and supported (strong relationship to the Lucene community as well)
‣  Current stable release: 4.2.1
‣  Widely used and adopted for commercial / non-commercial projects: http://wiki.apache.org/solr/PublicServers

http://lucene.apache.org/solr/

Page 19: Suche mit Apache Lucene & Co.

Projects: Solr Highlights

‣  Architectural highlights
   ‣  Extensible Plugin Architecture
   ‣  SolrCloud: distributed indexing and search architecture
   ‣  Efficient Replication to other Solr Search Servers
   ‣  Configurable Query Result, Filter, and Document cache instances
‣  Access & Monitoring
   ‣  Standards Based Open Interfaces
   ‣  XML, JSON and HTTP
   ‣  REST-like API
   ‣  Comprehensive HTML Administration Interfaces
   ‣  Server statistics exposed over JMX for monitoring

Page 20: Suche mit Apache Lucene & Co.

Projects: Solr Highlights

‣  Data model
   ‣  Lucene’s document oriented index data structure
   ‣  Schema for field types and fields of documents
‣  Analysis & Indexing highlights
   ‣  Out-of-the-box support for JSON, XML, CSV/delimited-text, DBMS
   ‣  Support of PDF, DOC, XLS, PPT, HTML
   ‣  Declarative Lucene Analyzer specification
   ‣  Many additional text analysis components including word splitting, regex and sounds-like filters
   ‣  External file-based configuration of stopword lists, synonym lists, and protected word lists

Page 21: Suche mit Apache Lucene & Co.

Projects: Solr Highlights

‣  Search highlights
   ‣  Facet search and filtering (values, queries, date/time ranges)
   ‣  Geospatial search (e.g. local search)
   ‣  Configurable caching
   ‣  Sorting (by any number of fields, or by complex functions of numeric fields)
   ‣  Autocomplete
   ‣  Highlighted context snippets
   ‣  Spelling suggestions for user queries
   ‣  More Like This suggestions for a given document
   ‣  Function Query
   ‣  Advanced query parser for high relevancy results from user-entered queries

Page 22: Suche mit Apache Lucene & Co.

Projects: Solr Clients & Tools

‣  Solr clients in various languages are freely available:
   ‣  Java, Scala, Ruby, Python, .NET, JavaScript (AJAX), …
   ‣  http://wiki.apache.org/solr/IntegratingSolr
‣  Very helpful tools:
   ‣  Grep (log file analysis)
   ‣  Luke (index analysis)
   ‣  Solrmeter (performance analysis)
   ‣  Scalable Performance Monitoring for Solr (monitoring)

Page 23: Suche mit Apache Lucene & Co.

Projects: Solr Documentation

Documentation                            URL
Getting started                          http://lucene.apache.org/solr/4_0_0/tutorial.html
Release documentation                    http://lucene.apache.org/solr/4_0_0/
Javadocs                                 http://lucene.apache.org/solr/4_0_0/solr-core/index.html
Solr Wiki                                http://wiki.apache.org/solr/
Mailing lists                            http://lucene.apache.org/solr/discussion.html
Apache Solr 3 Enterprise Search Server   http://link.packtpub.com/2LjDxE
Apache Solr 3.1 Cookbook                 http://www.packtpub.com/solr-3-1-enterprise-search-server-cookbook/book
LucidWorks Technical Support             http://support.lucidworks.com/home

Page 24: Suche mit Apache Lucene & Co.

Projects: Solr Pros & Cons

+  Solr is a mature technology widely used in commercial applications
   ‣  Easy integration into third-party applications
   ‣  Big community, good documentation, good support
   ‣  If you have a Solr problem, most likely someone else has had it already
   ‣  Very helpful tools for analysis and monitoring
+  Solr provides a large bundle of features:
   ‣  Lots of analyzers and specific query types
   ‣  Individual relevance boosting
   ‣  Admin interface
-  Because Solr can do so much, it is a heavyweight technology:
   ‣  much to configure
   ‣  most of the configuration is static, with no API access
   ‣  includes redundant functionality (e.g. similar request handlers)

Page 25: Suche mit Apache Lucene & Co.

Search Architecture Projects

25

Page 26: Suche mit Apache Lucene & Co.

‣  Installation

‣  Administration

‣  Solr Web Admin Interface

Solr Exercise I

26

Page 27: Suche mit Apache Lucene & Co.

Solr Exercise I: Overview

‣  Solr is a pure Java application
‣  Solr is built upon:
   ‣  Lucene
   ‣  Zookeeper
   ‣  Guava libraries
   ‣  HttpComponents, SLF4J, various Commons libraries
‣  Solr source code is available at:
   ‣  http://svn.apache.org/viewcvs.cgi/lucene/dev/ (Web access)
   ‣  http://svn.apache.org/repos/asf/lucene/dev/ (anonymous access)
‣  Solr needs a servlet container such as Jetty, Tomcat or Glassfish to run
‣  The distribution ships with an embedded Jetty for easy experimenting and testing

Page 28: Suche mit Apache Lucene & Co.

Solr Exercise I: Installation

Run Solr on embedded Jetty:
1.  Unpack the Solr distribution to your desired location (= SOLR_MAIN)
2.  Change to directory SOLR_MAIN/example
3.  Start the example Solr instance: java -jar start.jar

To verify the installation, open your browser and go to the Solr Admin page: http://localhost:8983/solr

Page 29: Suche mit Apache Lucene & Co.

Solr Exercise I: Core vs. Collection

‣  Solr Core (aka Core)
   ‣  basically an isolated running instance of a Solr index
   ‣  each Core has its own solrconfig.xml, schema.xml and index data
   ‣  search results are not computed across Cores; each Core is queried on its own
‣  Solr Collection (aka Collection)
   ‣  Logical index distributed over multiple machines
   ‣  Physical partitioning using sharding
   ‣  Part of SolrCloud (Scalability, High Availability)

Page 30: Suche mit Apache Lucene & Co.

Solr Exercise I: Solr Home

Recommended layout of the Solr Home directory:
‣  solr.xml
   ‣  primary configuration file Solr looks for when starting
   ‣  this file specifies the list of SolrCores it should load
‣  Solr Core Instance Directories
   ‣  contain the configuration and data of a SolrCore
‣  lib/
   ‣  shared lib directory for the Solr instance
‣  zoo.cfg
   ‣  Zookeeper configuration when using SolrCloud
‣  How to tell Solr where SOLR_HOME is located?
   ‣  Use the Java system property: solr.solr.home
   ‣  e.g. java -Dsolr.solr.home=/some/dir -jar start.jar

Page 31: Suche mit Apache Lucene & Co.

Solr Exercise I: Instance Directory

Recommended layout of a Solr Core Instance Directory:
‣  conf/
   ‣  This directory is mandatory and must contain your solrconfig.xml and schema.xml.
   ‣  Any other optional configuration files would also be kept here.
‣  data/
   ‣  This directory is the default location where Solr will keep your index, and is used by the replication scripts for dealing with snapshots.
   ‣  You can override this location in the conf/solrconfig.xml.
‣  lib/
   ‣  This directory is optional. If it exists, Solr will load any Jars found in this directory and use them to resolve any "plugins" specified in your solrconfig.xml or schema.xml (i.e. Analyzers, Request Handlers, etc.).

Page 32: Suche mit Apache Lucene & Co.

Solr includes an Admin Web interface providing you with:

‣  General configuration details

‣  Core-specific configuration details

‣  Log information

‣  Run queries

‣  Document field / Term statistics

‣  Document fields

‣  Cache statistics

‣  Server cluster information

Access it via http://localhost:8983/solr

Solr Exercise I

32

Admin Web interface

http://lucene.apache.org/solr/

Page 33: Suche mit Apache Lucene & Co.

‣  Indexing the first XML data

‣  Try first simple queries

‣  Different query types

‣  Get result score

‣  Highlighting

Solr Exercise II

33

Page 34: Suche mit Apache Lucene & Co.

Search Basics: Index-based search (Solr Exercise II)

[Figure: index-based search. Documents are indexed into a token representation; the query goes through query analysis into a token representation; query evaluation compares the two representations.]

Page 35: Suche mit Apache Lucene & Co.

Search Basics: Inverted index (Solr Exercise II)

‣  An inverted index is an index data structure that
   ‣  stores mappings from tokens to their locations (e.g. documents)
   ‣  allows fast access to those documents that contain specific tokens
‣  The purpose of an inverted index is to allow fast full text searches

Page 36: Suche mit Apache Lucene & Co.

Search Basics: Data model (Solr Exercise II)

[Figure: data model. An Index contains Documents; each Document contains Fields; each Field has a Name and a Value.]

Page 37: Suche mit Apache Lucene & Co.

Search Basics: Data model (Solr Exercise II)

Doc 1: Penn State Football … football
Doc 2: Football players … State

Posting Table

Posting id   word       doc     offset
1            football   Doc 1   3
                        Doc 1   67
                        Doc 2   1
2            penn       Doc 1   1
3            players    Doc 2   2
4            state      Doc 1   2
                        Doc 2   13
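The posting table above can be reproduced with a few lines of Python. This is only a minimal sketch, not Lucene's actual index format: the function name `build_inverted_index` is made up for illustration, tokens are split on whitespace, and offsets are counted in words, so the elided text ("…") of the slide is not represented.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased token to a list of (doc_id, word_position) postings."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for position, token in enumerate(text.lower().split(), start=1):
            index[token].append((doc_id, position))
    return index


# Shortened versions of the two example documents (assumption: elided text dropped).
docs = {
    "Doc 1": "Penn State Football",
    "Doc 2": "Football players State",
}
index = build_inverted_index(docs)
for token in sorted(index):
    print(token, index[token])
```

Looking up a token is then a single dictionary access, which is why full text search over an inverted index is fast.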

Page 38: Suche mit Apache Lucene & Co.

Search Basics: Term selection (Solr Exercise II)

‣  How to select important terms?
‣  Simple method: using middle-frequency words

[Figure: frequency and informativity of terms plotted against term rank (1, 2, 3, …), each ranging from Min. to Max.]

Page 39: Suche mit Apache Lucene & Co.

Search Basics: Term selection (Solr Exercise II)

‣  tf = term frequency
   ‣  frequency of a term/keyword in a document
   ‣  The higher the tf, the higher the importance (weight) for the doc.
‣  df = document frequency
   ‣  no. of documents containing the term
   ‣  distribution of the term
‣  idf = inverse document frequency
   ‣  the unevenness of term distribution in the corpus
   ‣  the specificity of a term to a document
   ‣  The more evenly a term is distributed, the less specific it is to a document

weight(t,D) = tf(t,D) * idf(t)
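As a worked example of weight(t,D) = tf(t,D) * idf(t), the sketch below computes the weights for a tiny corpus. The slide does not fix a formula for idf, so the common variant idf(t) = log(N / df(t)) is assumed here; Lucene's own similarity uses a different, smoothed formula. The function name and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf_weights(docs):
    """Return {doc_id: {term: tf * idf}} with idf(t) = log(N / df(t)) (assumed variant)."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))  # count each term once per document
    weights = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        weights[doc_id] = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
    return weights


docs = {
    "d1": "penn state football",
    "d2": "football players",
    "d3": "football state football",
}
weights = tf_idf_weights(docs)
print(weights["d1"])
```

A term such as "football" that occurs in every document gets idf = log(1) = 0 and therefore no weight, whereas the rare term "penn" gets the highest weight in d1.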

Page 40: Suche mit Apache Lucene & Co.

‣  1-word query: The documents to be retrieved are those that include the word

‣  Retrieve the inverted list for the word

‣  Sort in decreasing order of the weight of the word

‣  Multi-word query? -  Combining several lists

-  How to combine matches of these different lists?

-  How to interpret the weight? (IR model)

Solr Exercise II

40

Search Basics Querying
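One simple answer to the multi-word question is to add up the per-term weights of every document that appears in at least one of the inverted lists (OR semantics). The sketch below assumes a hypothetical `weighted_index` that already stores a weight per term and document; it illustrates the idea only and is not how Lucene scores internally.

```python
from collections import defaultdict


def score_query(query_terms, weighted_index):
    """Sum the weights of all query terms per document and rank by total weight."""
    scores = defaultdict(float)
    for term in query_terms:
        # Retrieve the inverted list for the term (empty if the term is unknown).
        for doc_id, weight in weighted_index.get(term, {}).items():
            scores[doc_id] += weight
    # Highest score first; ties broken by document id for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical weights: {term: {doc_id: weight}}
weighted_index = {
    "penn": {"Doc 1": 1.0},
    "football": {"Doc 1": 0.5, "Doc 2": 0.5},
}
ranking = score_query(["penn", "football"], weighted_index)
print(ranking)
```

For AND semantics one would intersect the document sets of the lists before summing; how the summed weight is interpreted is exactly what the IR model on the next slide defines.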

Page 41: Suche mit Apache Lucene & Co.

Search Basics: Vector-space model

‣  Vector space = all the terms encountered: <t1, t2, t3, …, tn>
‣  Document D = <a1, a2, a3, …, an>, where ai = weight of ti in D
‣  Query Q = <b1, b2, b3, …, bn>, where bi = weight of ti in Q
‣  R(D,Q) = Sim(D,Q)
   ‣  Cosine Similarity (TF*IDF)
   ‣  Okapi BM25

[Figure: document vector D and query vector Q drawn in the plane spanned by the term axes t1 and t2]
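For the cosine case, Sim(D,Q) is the dot product of the two weight vectors divided by the product of their lengths. A minimal sketch, assuming both vectors are given as plain lists over the same term order (the function name is chosen for illustration):

```python
import math


def cosine_similarity(d, q):
    """Cosine of the angle between document vector d and query vector q."""
    dot = sum(a * b for a, b in zip(d, q))
    norm_d = math.sqrt(sum(a * a for a in d))
    norm_q = math.sqrt(sum(b * b for b in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # an empty vector is treated as not similar to anything
    return dot / (norm_d * norm_q)


print(cosine_similarity([1, 1], [1, 0]))
```

Vectors pointing in the same direction score 1, vectors without any shared term score 0, independent of document length.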

Page 42: Suche mit Apache Lucene & Co.

Solr Indexing: Update Request handlers (Solr Exercise II)

‣  The Solr UpdateRequestHandler defines the logic for handling index update actions for a specific data source or data format
‣  UpdateRequestHandlers must be defined in solrconfig.xml and are mapped to a specific URL path in order to be accessible via HTTP
‣  Solr supports several file types out of the box through specific update handlers:
   ‣  Standard UpdateRequestHandler
      ‣  supporting XML, XSLT, JSON, CSV and javabin
   ‣  DataImportHandler
‣  Indexing events: Add/Replace, Commit, Soft Commit, Delete

<requestHandler name="/update" class="solr.UpdateRequestHandler"/>

Page 43: Suche mit Apache Lucene & Co.

Solr Indexing: XML Add (Solr Exercise II)

curl http://localhost:8983/solr/jax2013/update -H 'Content-Type:text/xml' --data-binary '
<add>
  <doc>
    <field name="id">etext78942</field>
    <field name="title">Solr textbook</field>
    <field name="subject">search technology</field>
    <field name="author">Bernhard Pflugfelder</field>
  </doc>
  <!-- optionally more <doc> ... </doc> elements -->
</add>'
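The same add message can be generated programmatically instead of being written by hand. The sketch below builds the XML with Python's standard library and prepares, but does not send, the HTTP request; the helper names are made up for illustration, and the core name `jax2013` is taken from the curl example.

```python
import urllib.request
import xml.etree.ElementTree as ET


def build_add_xml(doc_fields):
    """Build an <add><doc>...</doc></add> update message for a single document."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in doc_fields.items():
        field = ET.SubElement(doc, "field", name=name)
        field.text = value  # ElementTree escapes special characters on output
    return ET.tostring(add, encoding="unicode")


def build_update_request(payload, url="http://localhost:8983/solr/jax2013/update"):
    """Prepare the POST request; call urllib.request.urlopen(request) to send it."""
    return urllib.request.Request(
        url,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "text/xml"},
        method="POST",
    )


payload = build_add_xml({"id": "etext78942", "title": "Solr textbook"})
request = build_update_request(payload)
print(payload)
```

Generating the message this way avoids the quoting mistakes that easily creep into hand-written shell commands.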

Page 44: Suche mit Apache Lucene & Co.

Solr Indexing: XML Update (Solr Exercise II)

curl http://localhost:8983/solr/jax2013/update -H 'Content-Type:text/xml' --data-binary '
<add>
  <doc>
    <field name="id">etext78942</field>
    <field name="author" update="set">Christian Meder</field>
    <field name="subject" update="add">open source</field>
  </doc>
</add>'

Page 45: Suche mit Apache Lucene & Co.

Solr Indexing: XML Delete (Solr Exercise II)

curl http://localhost:8983/solr/jax2013/update -H 'Content-Type:text/xml' --data-binary '
<delete>
  <id>etext78942</id>
  <query>author:meder</query>
</delete>'

Page 46: Suche mit Apache Lucene & Co.

Solr Indexing: XML Commit (Solr Exercise II)

curl http://localhost:8983/solr/jax2013/update -H 'Content-Type:text/xml' --data-binary '<commit waitSearcher="false"/>'

curl 'http://localhost:8983/solr/jax2013/update?optimize=true&waitFlush=false'

Page 47: Suche mit Apache Lucene & Co.

Solr Indexing: JSON Add / Delete / Commit (Solr Exercise II)

‣  Multiple index actions in one JSON message

curl http://localhost:8983/solr/jax2013/update/json -H 'Content-type:application/json' -d '
{
  "add": { "commitWithin": 5000, "doc": { "f1": "v1", "f1": "v2" } },
  "commit": {},
  "delete": { "id": "ID" },
  "delete": { "query": "QUERY" },
  "delete": { "query": "QUERY", "commitWithin": 500 }
}'

Page 48: Suche mit Apache Lucene & Co.

Solr Indexing: JSON Atomic updates (Solr Exercise II)

‣  Commands add, set and inc

curl http://localhost:8983/solr/jax2013/update/json -H 'Content-type:application/json' -d '
[
  {
    "id": "etext78942",
    "title": {"set": "solr 4.2.1 textbook"},
    "viewcount": {"inc": 3},
    "author": {"add": "Bernhard Pflugfelder"}
  }
]'
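An atomic update message like the one above is just a JSON list of documents in which each changed field maps an operation to a value. A small sketch for building such a message; the helper `atomic_update` is hypothetical and only checks that the operation is one of the three commands named on the slide.

```python
import json

ALLOWED_OPERATIONS = {"set", "add", "inc"}


def atomic_update(doc_id, **changes):
    """Build a JSON atomic update; each change is given as field=(operation, value)."""
    update = {"id": doc_id}
    for field, (operation, value) in changes.items():
        if operation not in ALLOWED_OPERATIONS:
            raise ValueError(f"unsupported operation: {operation}")
        update[field] = {operation: value}
    return json.dumps([update])


message = atomic_update(
    "etext78942",
    title=("set", "solr 4.2.1 textbook"),
    viewcount=("inc", 3),
    author=("add", "Bernhard Pflugfelder"),
)
print(message)
```

The resulting string can be sent as the request body to the /update/json handler shown in the curl command.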

Page 49: Suche mit Apache Lucene & Co.

Solr Indexing: Try out (Solr Exercise II)

cd SOLR_MAIN/example/exampledocs
curl 'http://localhost:8983/solr/collection1/update/json?commit=true' --data-binary @books.json -H 'Content-type:application/json'

cd SOLR_MAIN/example/exampledocs
java -jar post.jar -h
java -jar post.jar *.xml

Page 50: Suche mit Apache Lucene & Co.

Solr Queries: Query Syntax (Solr Exercise II)

‣  q=+content:goethe +content:schiller
‣  q=+content:goethe -content:schiller
‣  q=title:faust
‣  q=title:faust AND -content:goethe
‣  q=content:"romeo and juliet"
‣  q=title:water*
‣  q=title:water~0.5
‣  q=created:[1995-12-31T23:59:59.999Z TO 2007-03-06T00:00:00Z]
‣  q=viewcount:[20 TO 50]
‣  q=viewcount:[100 TO *]

Send any of these queries with, for example:

curl -XPOST 'http://localhost:8983/solr/jax2013/select' --data-urlencode 'q=title:faust'

Page 51: Suche mit Apache Lucene & Co.

Solr Queries: Common parameters (Solr Exercise II)

Param name    Param value        Description
q             string             The user query string
start         number             Offset into the list of matching documents
rows          number             Number of documents returned
fq            string             A filter query
fl            string,string,…    Fields returned for each document
debugQuery    true / false       Include debug info in the response

curl -XPOST 'http://localhost:8983/solr/collection1/select' --data-urlencode 'q=+solr -elasticsearch' -d 'start=20&rows=40&fl=*,score'
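Characters such as "+" and spaces have a special meaning in URLs and form data, so query parameters should be encoded before they are sent. The sketch below assembles the common parameters from the table with Python's standard library; the parameter values are the ones from the example request, and nothing is sent to a server.

```python
from urllib.parse import parse_qs, urlencode

# Common Solr query parameters (values taken from the example request above).
params = {
    "q": "+solr -elasticsearch",
    "start": 20,
    "rows": 40,
    "fl": "*,score",
    "debugQuery": "false",
}

query_string = urlencode(params)  # percent-encodes "+" and encodes spaces as "+"
select_url = "http://localhost:8983/solr/collection1/select?" + query_string
print(select_url)
```

Without the encoding, the leading "+" of the query would be decoded as a space on the server side and the required-term operator would be lost.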

Page 52: Suche mit Apache Lucene & Co.

Highlighting: Overview (Solr Exercise II)

Param name                          Param value        Description
hl                                  true / false       Switch highlighting on / off
hl.q                                string             Alternative highlighting query
hl.fl                               string,string,…    Fields used for highlighting
hl.snippets                         number             Maximum number of snippets
hl.fragsize                         number             Number of characters per snippet
hl.simple.pre / hl.simple.post      string             Text inserted before / after a match

curl -XPOST 'http://localhost:8983/solr/collection1/select' --data-urlencode 'q=+solr -elasticsearch' -d 'start=20&rows=40&fl=*,score&hl=true&hl.fl=title,abstract'

Page 53: Suche mit Apache Lucene & Co.

‣  DataImportHandler SQL

‣  DataImportHandler XML

Solr Exercise III

53

Page 54: Suche mit Apache Lucene & Co.

DataImportHandler: Overview (Solr Exercise III)

‣  The DataImportHandler (DIH) makes it possible to:
   ‣  index data from relational databases
   ‣  compose documents from multiple columns and tables
   ‣  run bulk imports or incremental updates using the Delta Query mechanism
   ‣  schedule full imports and delta imports
   ‣  index data from XML/HTML using XPath expressions
‣  The DataImportHandler is part of Solr Contrib
‣  It is defined in solrconfig.xml:

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">/home/username/data-config.xml</str>
  </lst>
</requestHandler>

Page 55: Suche mit Apache Lucene & Co.

DataImportHandler: Commands (Solr Exercise III)

‣  http://localhost:8983/solr/dataimport?command=full-import
‣  http://localhost:8983/solr/dataimport?command=delta-import
‣  http://localhost:8983/solr/dataimport?command=status
‣  http://localhost:8983/solr/dataimport?command=reload-config
‣  http://localhost:8983/solr/dataimport?command=abort

Page 56: Suche mit Apache Lucene & Co.

DataImportHandler: Configuration (Solr Exercise III)

‣  The data-config.xml defines the data source and which data is used to populate Solr documents during import
‣  It defines the tags:
   ‣  dataSource
   ‣  document
   ‣  entity
‣  The entity defines a specific data selection resulting in a Solr document
‣  The query returns the data needed to populate the fields of the Solr document

<dataConfig>
  <dataSource … />
  <document name="products">
    <entity name="item" query="select * from item" />
  </document>
</dataConfig>

Page 57: Suche mit Apache Lucene & Co.

DataImportHandler: DataSource (Solr Exercise III)

‣  MySQL

<dataSource name="jdbc" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/dbname" user="db_username" password="db_password"/>

‣  Oracle

<dataSource name="jdbc" driver="oracle.jdbc.driver.OracleDriver" url="jdbc:oracle:thin:@//hostname:port/SID" user="db_username" password="db_password"/>

‣  Multiple data sources can be used within one DIH config; they are distinguished by the name property
‣  Each entity definition must then reference its data source by that name as well (entity attribute dataSource)

Page 58: Suche mit Apache Lucene & Co.

DataImportHandler: SQL full-import (Solr Exercise III)

<dataConfig>
  <dataSource … />
  <document name="products">
    <entity name="item" query="select * from item">
      <field column="ID" name="id" />
      <field column="NAME" name="name" />
      <field column="MANU" name="manu" />
      <field column="WEIGHT" name="weight" />
      <field column="PRICE" name="price" />
      <field column="POPULARITY" name="popularity" />
      <field column="INSTOCK" name="inStock" />
      <field column="INCLUDES" name="includes" />
    </entity>
  </document>
</dataConfig>

Page 59: Suche mit Apache Lucene & Co.

DataImportHandler: SQL full-import (Solr Exercise III)

<dataConfig>
  <dataSource … />
  <document>
    <entity name="item" query="select * from item">
      <entity name="feature"
              query="select description as features from feature where item_id='${item.ID}'"/>
      <entity name="item_category"
              query="select CATEGORY_ID from item_category where item_id='${item.ID}'">
        <entity name="category"
                query="select description as cat from category where id = '${item_category.CATEGORY_ID}'"/>
      </entity>
    </entity>
  </document>
</dataConfig>

Page 60: Suche mit Apache Lucene & Co.

DataImportHandler: SQL Delta-Import (Solr Exercise III)

‣  Incremental update of the changed content of a relational database
   ‣  Avoids indexing already indexed data again
‣  http://localhost:8983/solr/dataimport?command=delta-import
‣  Provide up to three specific queries per entity (parentDeltaQuery applies to child entities only):
   ‣  The deltaImportQuery returns the data needed to populate fields when running a delta-import
   ‣  The deltaQuery returns the primary keys of those rows of the current entity that have changed since the last index time
   ‣  The parentDeltaQuery uses the changed rows of the current table (fetched with deltaQuery) to find the changed rows in the parent table. This is necessary because whenever a row in the child table changes, the document containing that field has to be regenerated.
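The interplay of deltaQuery and deltaImportQuery can be tried out without Solr. The sketch below mimics the two steps against an in-memory SQLite table; the table contents, the timestamp values and the variable names are made up for illustration, and DIH itself of course runs the configured SQL through JDBC.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item (id INTEGER PRIMARY KEY, name TEXT, last_modified TEXT)")
conn.executemany(
    "INSERT INTO item VALUES (?, ?, ?)",
    [
        (1, "disk", "2013-04-01 10:00:00"),
        (2, "cable", "2013-04-20 09:30:00"),
        (3, "mouse", "2013-04-21 12:00:00"),
    ],
)

# Plays the role of ${dih.last_index_time}: the time of the previous import.
last_index_time = "2013-04-15 00:00:00"

# Step 1 (deltaQuery): primary keys of rows changed since the last import.
changed_ids = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM item WHERE last_modified > ? ORDER BY id", (last_index_time,)
    )
]

# Step 2 (deltaImportQuery): fetch the full row for each changed key (${dih.delta.id}).
delta_docs = [
    conn.execute("SELECT id, name FROM item WHERE id = ?", (item_id,)).fetchone()
    for item_id in changed_ids
]
print(changed_ids, delta_docs)
```

Only the two rows modified after the last import are fetched again; the unchanged row is never touched.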

Page 61: Suche mit Apache Lucene & Co.

DataImportHandler: SQL Delta-Import (Solr Exercise III)

<entity name="item" pk="ID"
        query="select * from item"
        deltaImportQuery="select * from item where ID='${dih.delta.id}'"
        deltaQuery="select id from item where last_modified &gt; '${dih.last_index_time}'">
  <entity name="feature" pk="ITEM_ID"
          query="select description as features from feature where item_id='${item.ID}'" />
  <entity name="item_category" pk="ITEM_ID, CATEGORY_ID"
          query="select CATEGORY_ID from item_category where ITEM_ID='${item.ID}'">
    <entity name="category" pk="ID"
            query="select description as cat from category where id = '${item_category.CATEGORY_ID}'" />
  </entity>
</entity>

Page 62: Suche mit Apache Lucene & Co.

DataImportHandler: Other DataSources (Solr Exercise III)

‣  HTTP source

<dataConfig>
  <dataSource type="HttpDataSource" />
  …
</dataConfig>

‣  XML File source

<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  …
</dataConfig>

Page 63: Suche mit Apache Lucene & Co.

DataImportHandler: XML full-import (Solr Exercise III)

‣  The entity defines the location of the XML file
‣  Solr document fields are populated by evaluating XPath expressions

<entity name="page" processor="XPathEntityProcessor"
        stream="true" forEach="/RDF/etext/" url="../../catalog.rdf.xml"
        transformer="RegexTransformer,DateFormatTransformer">
  <field column="id" xpath="/RDF/etext/@id" />
  <field column="title" xpath="/RDF/etext/title" />
  <field column="alternative" xpath="/RDF/etext/alternative" />
  <field column="author" xpath="/RDF/etext/creator" />
  <field column="multi_author" xpath="/RDF/etext/creator/Bag/li" />
  <field column="subject" xpath="//LCSH/value" />
  <field column="viewcount" xpath="/RDF/etext/downloads/nonNegativeInteger/value" />
  <field column="created" xpath="/RDF/etext/created/W3CDTF/value" dateTimeFormat="yyyy-MM-dd" />
</entity>
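What the XPathEntityProcessor does can be illustrated in plain Python: iterate over every record element (the forEach expression) and fill each document field from a path relative to that record. The sample XML below is a heavily simplified, hypothetical stand-in for catalog.rdf.xml (no namespaces, two records), and the field mapping is reduced to two fields.

```python
import xml.etree.ElementTree as ET

SAMPLE_XML = """<RDF>
  <etext id="etext78942">
    <title>Solr textbook</title>
    <creator>Bernhard Pflugfelder</creator>
  </etext>
  <etext id="etext1">
    <title>Faust</title>
    <creator>Goethe</creator>
  </etext>
</RDF>"""

# Solr field name -> path relative to the record element (assumed mapping).
FIELD_PATHS = {"title": "title", "author": "creator"}


def extract_documents(xml_text):
    """Create one document (dict) per <etext> record, similar to forEach + xpath."""
    root = ET.fromstring(xml_text)
    documents = []
    for record in root.findall("etext"):
        document = {"id": record.get("id")}
        for field, path in FIELD_PATHS.items():
            document[field] = record.findtext(path)
        documents.append(document)
    return documents


documents = extract_documents(SAMPLE_XML)
print(documents)
```

Each dictionary corresponds to one Solr document; a missing element simply yields None for that field.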

Page 64: Suche mit Apache Lucene & Co.

‣  Schema

‣  Data types

‣  Analyzers, Tokenizers

Solr Exercise IV

64

Page 65: Suche mit Apache Lucene & Co.

‣  Defines document representation by specifying fields ‣  with a specific field type

‣  with specific field type properties

‣  Dynamic fields

‣  CopyField

‣  Define analyzers:

‣  Tokenizers

‣  Filters

‣  Synonym lists, stop word lists

‣  additional text analysis

‣  Assign analyzers to the Text-based data types (solr.TextField)

‣  Example schema.xml

Schema Solr Exercise IV

65

Overview

Page 66: Suche mit Apache Lucene & Co.

‣  Field types ‣  int, long, float, double, boolean

‣  string, date, binary

‣  derived from solr.TextField

‣  text_general, text_de, text_en, …

‣  Field type properties

‣  indexed (true / false)

‣  stored (true / false)

‣  multiValued (true / false)

‣  termVectors (true / false)

Schema

Fields Solr Exercise IV

66

Page 67: Suche mit Apache Lucene & Co.

Break stream of characters into tokens / terms

‣  Normalization (e.g. case)

‣  Stopwords

‣  Stemming

‣  Lemmatizer / Decomposer

‣  Part of Speech Tagger

‣  Information Extraction

Analyzing / Tokenization

Overview Solr Exercise IV

67

Page 68: Suche mit Apache Lucene & Co.

Analyzing / Tokenization: Stopwords (Solr Exercise IV)

‣  Function words carry little useful information for searching, e.g. of, in, about, with, I, although, …
‣  A stopword list contains the stopwords, which are not used as index terms:
   ‣  Prepositions
   ‣  Articles
   ‣  Pronouns
   ‣  Some adverbs and adjectives
   ‣  Some frequent words (e.g. document)
‣  The removal of stopwords usually improves search quality
‣  Solr provides default stopword lists for various languages

Page 69: Suche mit Apache Lucene & Co.

Analyzing / Tokenization: Stemming (Solr Exercise IV)

‣  Apply strict algorithmic normalization of inflected forms (e.g. Porter)
‣  Strategy: removing some endings of words
   ‣  Example: computer, compute, computes, computing, computed, computation are all normalized to comput
‣  But: a naive rule that maps going -> go also maps king -> k
‣  Stemming might work well for English
‣  However, be careful when using stemming, especially for German
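The effect of tokenization, lower-casing, stopword removal and suffix stripping can be seen in a toy analysis chain. This is deliberately naive and is not the Porter algorithm: the stopword set and the suffix list are assumptions chosen so that the examples from the slides come out, including the over-stemming of "king".

```python
import re

STOPWORDS = {"of", "in", "about", "with", "i", "although", "the", "a", "is"}
# Checked in this order; the first matching suffix is removed.
SUFFIXES = ("ation", "ing", "ed", "es", "er", "e", "s")


def naive_stem(token):
    """Strip the first matching suffix; no linguistic checks at all."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix):
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Tokenize, lowercase, drop stopwords, then stem the remaining tokens."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [naive_stem(token) for token in tokens if token not in STOPWORDS]


print(analyze("The king is computing"))
for word in ("computer", "compute", "computes", "computing", "computed", "computation"):
    print(word, "->", naive_stem(word))
```

Real stemmers add conditions on the remaining stem precisely to avoid cases like king -> k, which is one reason to test an analyzer on real data before relying on it.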

Page 70: Suche mit Apache Lucene & Co.

Analyzing / Tokenization: Define an analyzer (Solr Exercise IV)

<fieldType name="<name>" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- tokenizer and filters for indexing -->
    <tokenizer class="CLASS" PARAMS />
    <filter class="CLASS" PARAMS />
  </analyzer>
  <analyzer type="query">
    <!-- tokenizer and filters for search -->
    <tokenizer class="CLASS" PARAMS />
    <filter class="CLASS" PARAMS />
  </analyzer>
</fieldType>

Page 71: Suche mit Apache Lucene & Co.

‣  TokenizerFactories ‣  solr.StandardTokenizerFactory

‣  solr.WhitespaceTokenizerFactory

‣  solr.KeywordTokenizerFactory

‣  TokenFilterFactories

‣  solr.LowerCaseFilterFactory

‣  solr.TrimFilterFactory

‣  solr.StopFilterFactory

‣  solr.WordDelimiterFilterFactory

‣  solr.SynonymFilterFactory

‣  solr.EdgeNGramFilterFactory

Analyzing / Tokenization

Tokenizers & Filters Solr Exercise IV

71

Page 72: Suche mit Apache Lucene & Co.

‣  English ‣  solr.PorterStemFilterFactory

‣  solr.SnowballPorterFilterFactory

‣  solr.EnglishMinimalStemFilterFactory

‣  German

‣  solr.SnowballPorterFilterFactory

‣  solr.GermanLightStemFilterFactory

‣  solr.GermanMinimalStemFilterFactory

‣  More information at http://wiki.apache.org/solr/LanguageAnalysis

Analyzing / Tokenization

Language analysis Solr Exercise IV

72

Page 73: Suche mit Apache Lucene & Co.

‣  Faceted search

‣  Filter query

‣  MoreLikeThis query

Solr Exercise V

73

Page 74: Suche mit Apache Lucene & Co.

Faceted search Overview Solr Exercise V

74

Page 75: Suche mit Apache Lucene & Co.

Faceted search: Motivation (Solr Exercise V)

‣  "The statement of one participant in a usability test of a faceted search solution, carried out as part of this study, points the way:
‣  'With this filter here, I get the feeling that even a mundane search can be real fun.'"
‣  Source: Faceted Search: Die neue Suche im Usability-Test (available as a free download at http://usability.de)

Page 76: Suche mit Apache Lucene & Co.

‣  Faceted search (aka faceted navigation) organizes search results by categories or dimensions, giving the user the possibility to drill down into the search results

‣  Facets can be authors, titles, tags, dates, languages, file types …

‣  Typically, meta data describing concepts and meaning of documents are useful as facets

‣  Facets can be shown with counts

Faceted search

Overview Solr Exercise V

76

Page 77: Suche mit Apache Lucene & Co.

Faceted search: Solr Faceting (Solr Exercise V)

‣  Solr provides a faceting mechanism out of the box, including the return of counts
‣  Important: facet fields must be defined with indexed=true
‣  Facet fields are often analyzed differently than search fields. It is therefore common to define separate document fields for faceting in schema.xml
‣  Facet fields should not be tokenized, lower-cased or stemmed
‣  Facet fields can be of type
   ‣  int, long, float, double, boolean
   ‣  solr.TextField
   ‣  date
‣  For performance reasons also define
   ‣  stored=false
   ‣  omitNorms=true (norms are only needed for scoring, not for faceting)

<field name="facet_author" type="string" indexed="true" stored="false" omitNorms="true" />

Page 78: Suche mit Apache Lucene & Co.

Faceted search: Solr Faceting (Solr Exercise V)

‣  Solr provides two basic mechanisms to build facets
   ‣  Arbitrary faceting (facet.query=query)
   ‣  Field value faceting (facet.field=fieldname)
‣  For field value faceting, one of two faceting methods can be chosen
   ‣  Enum Based Field Queries (facet.method=enum)
   ‣  Field Cache (facet.method=fc)
‣  Other common parameters

Param name        Param value      Description
facet             true / false     Switch faceting on / off
facet.prefix      string           Facet values must start with this prefix
facet.sort        count / index    Sort order of the facet values
facet.limit       number           Maximum number of facet values returned
facet.mincount    number           Minimum count for a facet value to be included
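Conceptually, field value faceting counts how often each value of a field occurs in the result set and then applies the parameters from the table. The sketch below imitates that on a list of dictionaries; the function `facet_counts` and the sample documents are hypothetical, and Solr computes these counts on the index rather than on the returned documents.

```python
from collections import Counter


def facet_counts(docs, field, prefix="", mincount=1, limit=None, sort="count"):
    """Count field values, mimicking facet.prefix, facet.mincount, facet.limit, facet.sort."""
    counts = Counter()
    for doc in docs:
        values = doc.get(field, [])
        if isinstance(values, str):
            values = [values]  # treat single-valued fields like multi-valued ones
        counts.update(value for value in values if value.startswith(prefix))
    items = [(value, count) for value, count in counts.items() if count >= mincount]
    if sort == "count":
        items.sort(key=lambda item: (-item[1], item[0]))  # highest count first
    else:
        items.sort(key=lambda item: item[0])  # "index": lexicographic order
    return items[:limit]


docs = [
    {"author": "Goethe"},
    {"author": "Schiller"},
    {"author": "Goethe"},
    {"author": ["Goethe", "Shakespeare"]},
]
print(facet_counts(docs, "author"))
```

Because the values are counted as they are, any lower-casing or stemming of the field would show up directly in the facet labels, which is why facet fields are usually kept unanalyzed.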

Page 79: Suche mit Apache Lucene & Co.

Faceted search

Date faceting

Solr Exercise V

79

Param name         Param value       Description
facet.date         fieldname         Name of the field of type date used for date faceting
facet.date.start   date expression   The lower bound of the first date facet interval
facet.date.end     date expression   The upper bound of the last date facet interval
facet.date.gap     date expression   The size of each date facet interval

q=*:*&rows=0&wt=xml&indent=true&facet=true&facet.date=created&facet.date.start=1996-01-31T23:00:00Z&facet.date.end=2013-04-21T00:00:00Z&facet.date.gap=%2B1YEAR

Page 80: Suche mit Apache Lucene & Co.

Faceted search

Range faceting Solr Exercise V

80

Param name          Param value   Description
facet.range         fieldname     Name of a field with a numeric field type
facet.range.start   number        The lower bound of the first range interval
facet.range.end     number        The upper bound of the last range interval
facet.range.gap     number        The size of each range interval

q=*:*&rows=0&wt=xml&indent=true&facet=true&facet.range=viewcount&facet.range.start=0&facet.range.end=150&facet.range.gap=20
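The intervals produced by the range parameters above can be worked out by hand. The following Python sketch is only an illustration, not Solr code: the function name range_buckets is made up, and the hardend flag mirrors Solr's facet.range.hardend parameter (assumed here to default to false, in which case the last interval may extend past facet.range.end).

```python
def range_buckets(start, end, gap, hardend=False):
    """Return the (lower, upper) bounds of each range facet interval."""
    buckets = []
    lower = start
    while lower < end:
        upper = lower + gap
        if hardend and upper > end:
            upper = end  # clip the last interval at the end value
        buckets.append((lower, upper))
        lower = upper
    return buckets


# Values taken from the example query: start=0, end=150, gap=20
buckets = range_buckets(0, 150, 20)
clipped = range_buckets(0, 150, 20, hardend=True)
```

Since 150 is not a multiple of 20, the two variants differ only in the upper bound of the last interval.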

Page 81: Suche mit Apache Lucene & Co.

‣  Filter queries restrict the result set of the original query to a specific subset

‣  The scores of the documents are not influenced by filter queries

‣  Examples

‣  access permissions (ACLs)

‣  categories or tags

‣  Importantly, the results of filter queries are cached by default

‣  Solr uses a separate in-memory filter cache

‣  Thus, filter queries will be evaluated very fast if they are cached

‣  Complex, often used queries are good candidates for filter queries

‣  Keep in mind that the required size of the filter cache depends on the search scenario and must therefore be tuned explicitly

Filter query Overview Solr Exercise V

81

Page 82: Suche mit Apache Lucene & Co.

‣  Filter queries are defined by the query parameter fq

‣  Caching can be switched off for an individual filter query with the local parameter {!cache=false}

Filter query Examples Solr Exercise V

82

q=content:arthur&fq=subject:fantasy&fl=title,author&rows=5

q=content:arthur&fq=subject:fantasy&fq=viewcount:[* TO 100]&fl=title,author&rows=5

q=content:arthur&fq=subject:fantasy&fq={!cache=false}viewcount:[* TO 100]&fl=title,author&rows=5
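Characters such as spaces, brackets and braces in filter queries need URL encoding before they are sent to Solr. As a minimal sketch (the helper name solr_params is hypothetical, the parameter values are taken from the examples above), the query string with several fq parameters could be assembled in Python like this:

```python
from urllib.parse import urlencode, parse_qs


def solr_params(q, filter_queries, fl, rows):
    """Build an encoded Solr query string with one fq parameter per filter."""
    params = [("q", q)]
    params += [("fq", fq) for fq in filter_queries]
    params += [("fl", ",".join(fl)), ("rows", str(rows))]
    return urlencode(params)


query_string = solr_params(
    "content:arthur",
    ["subject:fantasy", "{!cache=false}viewcount:[* TO 100]"],
    ["title", "author"],
    5,
)
# Decode again to check that nothing was lost in the encoding
decoded = parse_qs(query_string)
```

The resulting string can be appended to the select URL of the core after the question mark.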

Page 83: Suche mit Apache Lucene & Co.

‣  Idea of MoreLikeThis

‣  MoreLikeThis constructs a query based on the terms of a given set of fields

‣  Matching documents are "similar" based on the chosen set of fields

‣  Fields used by MoreLikeThis should define termVectors="true"

MoreLikeThis Overview Solr Exercise V

83

Param name   Param value   Description
mlt.fl       fieldnames    Fields to be used by MLT
mlt.mintf    number        Minimum term frequency
mlt.mindf    number        Minimum document frequency
mlt.minwl    number        Minimum word length
mlt.maxwl    number        Maximum word length
mlt.maxqt    number        Maximum number of query terms

q=content:schiller&mlt=true&mlt.fl=subject&mlt.mindf=50&mlt.mintf=1

Page 84: Suche mit Apache Lucene & Co.

‣  Advanced queries:

‣  Dismax query parser

‣  Sorting

‣  Grouping

‣  Autosuggestion

Solr Exercise VI

84

Page 85: Suche mit Apache Lucene & Co.

‣  Motivation

‣  The standard Solr query parser only supports simple query control

‣  One field can be defined as default search field

‣  Supports only boolean combinations of subqueries (AND / OR)

‣  Requires strict query syntax, e.g. to perform phrase queries

‣  Dismax (and eDismax) query parsers are more robust query parsers offering various additional query parameters and controls to optimize queries

‣  These additional query parameters and controls are hidden from the user

‣  Dismax stands for Disjunction Max

‣  Disjunction means that multiple fields can be searched simultaneously with different field weights

‣  Max means that the maximum score of the field matches is taken as the document score (instead of the sum)
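The "max instead of sum" idea can be shown with a little arithmetic. The sketch below assumes the scoring rule of Lucene's DisjunctionMaxQuery, where the best field score is taken and the other field scores are added only in proportion to a tie breaker (the tie parameter of the DisMax parser). The function name and the field scores are invented for illustration.

```python
def dismax_score(field_scores, tie=0.0):
    """Combine per-field scores: best field plus tie * (sum of the other fields)."""
    if not field_scores:
        return 0.0
    values = list(field_scores.values())
    best = max(values)
    others = sum(values) - best
    return best + tie * others


# Hypothetical per-field scores of one document for the query "schiller"
scores = {"author": 4.0, "content": 1.5}

pure_max = dismax_score(scores)             # tie = 0.0: only the best field counts
plain_sum = dismax_score(scores, tie=1.0)   # tie = 1.0: behaves like a sum
blended = dismax_score(scores, tie=0.1)     # in between
```

With tie=0.0 a document that matches strongly in one field is not outranked by a document that matches weakly in many fields.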

DisMax Parser Overview Solr Exercise VI

85

Page 86: Suche mit Apache Lucene & Co.

Param name   Description
q.alt        Alternative query executed if the user query is not specified or blank
qf           The query fields to be searched. Each field can be defined with an individual field weight.
mm           Minimum number of query words that must match for a document to count as a match
pf           Defines phrase fields. Boosts documents that have the search terms in close proximity within the phrase fields.
ps           The phrase slop affecting the boosting of phrase queries evaluated on the pf fields
qs           The phrase slop for user-defined phrase queries
bq           A raw query that is added to the user query to influence scoring
bf           Function queries that are added to the user query to influence scoring

DisMax Parser

Parameters Solr Exercise VI

86

Page 87: Suche mit Apache Lucene & Co.

DisMax Parser

Examples Solr Exercise VI

87

http://localhost:8983/solr/jax2013/select?q=schiller&defType=dismax&qf=author^20.0+content^0.3

http://localhost:8983/solr/jax2013/select?q=schiller&defType=dismax&qf=author^20.0+content^0.3&bq=subject:drama^5.0

Page 88: Suche mit Apache Lucene & Co.

‣  Sorting ranks (= orders) the document results based on given criteria

‣  By default, ranking is based on the document score

‣  The sort parameter allows ranking the document results by an arbitrary field or even a function

‣  Sort fields must be defined as indexed=true and multiValued=false

‣  Syntax: …&sort=fieldname [asc/desc],fieldname [asc/desc],…

Sorting Overview Solr Exercise VI

88

http://localhost:8983/solr/jax2013/select?q=schiller&defType=dismax&qf=author^20.0+content^0.3&sort=viewcount+desc


Page 89: Suche mit Apache Lucene & Co.

Grouping Overview

Solr Exercise VI

89

Page 90: Suche mit Apache Lucene & Co.

‣  Motivation

‣  Documents with a common value for some field are partitioned into groups

‣  Documents with the same field value are collapsed into a single result

Grouping Parameters Solr Exercise VI

90

Query parameter   Query value            Description
group             true / false           Switch grouping on / off
group.field       fieldname              Field to group on
rows              number                 Number of groups returned
start             number                 Offset into the list of returned groups
group.limit       number                 Number of documents returned for each group
group.offset      number                 Offset into the list of returned documents per group
sort              fieldname [asc/desc]   Sort groups on some field
group.sort        fieldname [asc/desc]   Sort the documents of every group on some field

Page 91: Suche mit Apache Lucene & Co.

Autosuggestion Overview Solr Exercise VI

91

Page 92: Suche mit Apache Lucene & Co.

‣  Autosuggestion (aka Autocomplete) is a common search feature that supports the user by providing query suggestions during typing

‣  Autosuggestion functionality can include

‣  the search index

‣  separate word lists

‣  synonyms / black lists

‣  grouping suggestions

‣  Fuzziness

‣  Whatever mechanism is actually used to provide autosuggestion, it must return suggestions very quickly.

‣  Solr provides different mechanisms to build autosuggestion functionality:

‣  using facet search

‣  using standard search (standard query parser)

‣  using spellchecker Solr plugin

Autosuggestion Overview Solr Exercise VI

92

Page 93: Suche mit Apache Lucene & Co.

‣  Define a new field title_auto used for autosuggestion

‣  Define the field type text_auto providing specific analysis for autosuggestion

‣  How to get suggestions for a user query?

Autosuggestion Using faceting Solr Exercise VI

93

<field name="title" type="text_general" indexed="true" stored="true" />
<field name="title_auto" type="text_auto" indexed="true" stored="true" />
<copyField source="title" dest="title_auto" />

<fieldType name="text_auto" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

q=*:*&facet=true&facet.field=title_auto&facet.mincount=1&facet.prefix=schi

Page 94: Suche mit Apache Lucene & Co.

‣  Again, define the new field title_auto as on the previous slide

‣  Next, redefine the field type text_auto as follows

‣  Now, you can use the standard Solr query parser to get suggestions

Autosuggestion Using standard search

Solr Exercise VI

94

<fieldType name="text_auto" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" side="front" />
  </analyzer>
</fieldType>

q=title_auto:query&q.op=AND&rows=5&fl=title

q=title_auto:query&q.op=AND&rows=0&facet=true&facet.field=tag&facet.mincount=1&facet.limit=5
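To see why a plain term query on title_auto can return suggestions, it helps to look at what the analysis chain above puts into the index. The following Python sketch only imitates the chain (keyword tokenizer, lower-casing, front edge n-grams with minGramSize=1 and maxGramSize=25); the function name is made up and the sketch is not the Lucene implementation.

```python
def edge_ngrams(value, min_gram=1, max_gram=25):
    """Imitate keyword tokenizer + lower-case filter + front edge n-gram filter."""
    token = value.lower()  # the whole field value stays one token
    longest = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, longest + 1)]


grams = edge_ngrams("Schiller")
# Every prefix of the title is indexed, so a typed prefix such as "schi"
# matches as an ordinary term.
```

Because the n-grams are produced at index time, the lookup at query time is a cheap exact term match. The price is a larger index.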

Page 95: Suche mit Apache Lucene & Co.

Elasticsearch is a “distributed-from-scratch” search server based on Lucene

Created by Shay Banon with a first version made public in 02/2010:

ElasticSearch itself was born out of my frustration with the fact that there isn’t really a good, open source, solution for distributed search engine out there, which also combines what I expect of search engines after building Compass (and on that, I will blog later…). I have been working on this for the past several months, pouring my search and distributed knowledge into this (and portions of my heart and time ;) )

[http://www.elasticsearch.org/blog/2010/02/08/youknowforsearch.html]

Overview Projects

95

http://www.elasticsearch.org/

Page 96: Suche mit Apache Lucene & Co.

‣  Current stable version 0.20.6

‣  Licensed under the Apache License 2.0

‣  Small group of core developers, but strong support from valuable Lucene committers

‣  Already a promising list of users (small and big companies)

‣  github, soundcloud, stackoverflow, mozilla, klout

‣  http://www.elasticsearch.org/users/

Overview Projects

96


Page 97: Suche mit Apache Lucene & Co.

‣  Pure Java application

‣  Search, indexing and scoring are done by Lucene

‣  Document-oriented

‣  Schema-less

‣  Well, ElasticSearch might be schema-less, but Lucene isn't!

‣  ElasticSearch therefore detects field types automatically

‣  However, a schema is still needed! Why?

‣  HTTP & JSON API for all interactions

‣  Indexing / Updating

‣  Searching

‣  Administration / Monitoring

‣  Distribution is a fundamental feature of ElasticSearch!

Highlights Projects

97


Page 98: Suche mit Apache Lucene & Co.

‣  Facet search and filtering (values, queries, date/time ranges)

‣  Lots of query types

‣  Script filters

‣  Geospatial search via the GeoShape query

‣  Configurable caching for

‣  Filters

‣  Field data

‣  NRT search with separate API

‣  Sorting, Highlighting

‣  MoreLikeThis based on document or field

‣  Multi Tenancy:

‣  Define multiple indices that, for example, handle documents differently during indexing

‣  Still, you can search over them with one query

Highlights Projects

98


Page 99: Suche mit Apache Lucene & Co.

‣  The ElasticSearch Gateway module stores indices and metadata to:

‣  Local FS, Shared FS, Hadoop, Amazon S3

‣  River interface:

‣  Pluggable service to constantly pull data

‣  Managed via a specific REST endpoint

‣  Implementations for CouchDB, MongoDB

‣  Lucene analyzers are specified in elasticsearch.yml or via the API

‣  Bulk indexing

‣  Default: single document indexing

‣  Bulk indexing via specific REST endpoints

Highlights Projects

99


Page 100: Suche mit Apache Lucene & Co.

+  Simple but effective architecture

+  Ease of use, even when using distributed search

+  High maturity, even though ES is young

+  Modern technologies used

+  HTTP and JSON only

-  Shard splitting is not trivial

-  Still a small community and a small group of core developers

-  Compared to Solr:

-  Fewer query types

-  Fewer possibilities for boosting

-  Fewer analyzers

-  Missing features such as clustering, autocomplete, spell checking

Pros & Cons Projects

100


Page 101: Suche mit Apache Lucene & Co.

‣  Installation

‣  Indexing

‣  Queries I

ES Exercise I

101

Page 102: Suche mit Apache Lucene & Co.

‣  On Linux systems

‣  On Windows systems

‣  Run

Installation ES Exercise I

102

unzip elasticsearch-0.20.6.zip
cd elasticsearch-0.20.6
bin/elasticsearch -f

[unzip elasticsearch-0.20.6.zip]
cd elasticsearch-0.20.6
bin\elasticsearch.bat -f

curl -X GET http://localhost:9200/

Page 103: Suche mit Apache Lucene & Co.

‣  On Linux systems

‣  Run

‣  Shutdown

Installation ES Exercise I

103

unzip elasticsearch-0.20.6.zip
cd elasticsearch-0.20.6
bin/elasticsearch -p path/to/pidfile

curl -X GET http://localhost:9200/

curl -XPOST 'http://localhost:9200/_shutdown'
curl -XPOST 'http://localhost:9200/_cluster/nodes/_shutdown'

Page 104: Suche mit Apache Lucene & Co.

‣  bin/

‣  script elasticsearch [elasticsearch.bat] to start the elasticsearch server

‣  script plugin [plugin.bat] to install plugins

‣  config/

‣  contains the global configuration

‣  server config file elasticsearch.yml

‣  logging config file logging.yml

‣  data/

‣  standard directory containing index data

‣  configurable by path.data

ES_HOME ES Exercise I

104


Page 105: Suche mit Apache Lucene & Co.

‣  lib/

‣  shared library directory

‣  place additional libraries here

‣  logs/

‣  log files are placed here when the default log configuration is used

‣  configurable by path.logs in elasticsearch.yml

ES_HOME ES Exercise I

105


Page 106: Suche mit Apache Lucene & Co.

‣  cluster

‣  one or more nodes form a cluster

‣  usually distributed over various machines

‣  one master node that is automatically chosen

‣  node

‣  running instance of elasticsearch

‣  a node automatically discovers other nodes at start up

‣  node discovery is done either using unicast or multicast messages

‣  index

‣  separate document database with its own mapping and types

‣  is partitioned into one or more primary and replica shards

Terminology ES Exercise I

106


Page 107: Suche mit Apache Lucene & Co.

‣  mapping

‣  schema definition defining types with their associated fields

‣  field types and properties

‣  shard

‣  low level data structure of elasticsearch

‣  single Lucene index

‣  managed automatically by elasticsearch

‣  primary shard

‣  every document is stored in exactly one primary shard

‣  all primary shards together make up the documents of the index

‣  default: 5 primary shards
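How a document ends up in exactly one primary shard can be pictured as hash-based routing. The Python sketch below is an assumption-laden illustration: elasticsearch uses its own hash function, so zlib.crc32 is only a stand-in, and pick_shard is a made-up name.

```python
import zlib


def pick_shard(doc_id, number_of_primary_shards=5):
    """Map a document ID to one primary shard (illustrative hash, not the one ES uses)."""
    return zlib.crc32(doc_id.encode("utf-8")) % number_of_primary_shards


# The same ID always lands on the same shard, so lookups by ID know where to go
assignments = {doc_id: pick_shard(doc_id) for doc_id in ["1", "2", "42", "book-7"]}
```

Because the shard number is derived from the number of primary shards, changing that number later would move most documents to a different shard. This is one way to read the "shard splitting is not trivial" remark in the pros and cons list.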

Terminology ES Exercise I

107


Page 108: Suche mit Apache Lucene & Co.

‣  replica shard

‣  each primary shard is replicated 0 or more times

‣  replica shards are distributed automatically

‣  replica shards are used for search and primary shard fail-over

‣  type

‣  within an index zero or more types can be defined

‣  a type defines a certain set of fields, similar to a table structure

‣  types are defined in the mapping

Terminology ES Exercise I

108


Page 109: Suche mit Apache Lucene & Co.

‣  Index API

‣  index (PUT/POST)

‣  update (PUT/POST)

‣  delete (DELETE)

‣  delete by query (DELETE)

‣  Documents are defined as JSON objects

‣  index and type are defined in the url path

‣  automatic creation of an index and mapping

‣  action.auto_create_index

‣  index.mapper.dynamic

‣  elasticsearch automatically identifies field types based on the JSON input

‣  automatic ID generation

Index API ES Exercise I

109


Page 110: Suche mit Apache Lucene & Co.

‣  Index a book

‣  Index a book with defining a named type

Index API ES Exercise I

110

$ curl -XPUT 'http://localhost:9200/books/book/1' -d '{
  "author" : "bernhard pflugfelder",
  "post_date" : "2013-04-22T14:12:12",
  "title" : "my first book",
  "abstract" : "this book is about elasticsearch"
}'

$ curl -XPUT 'http://localhost:9200/books/book/1' -d '{
  "book" : {
    "author" : "bernhard pflugfelder",
    "post_date" : "2013-04-22T14:12:12",
    "title" : "my first book",
    "abstract" : "this book is about elasticsearch"
  }
}'

Page 111: Suche mit Apache Lucene & Co.

‣  Index a book with automatic ID generation

‣  Result

Index API ES Exercise I

111

$ curl -XPOST 'http://localhost:9200/books/book/' -d '{
  "author" : "bernhard pflugfelder",
  "post_date" : "2013-04-22T14:12:12",
  "title" : "my first book",
  "abstract" : "this book is about elasticsearch"
}'

{
  "ok" : true,
  "_index" : "books",
  "_type" : "book",
  "_id" : "6a8ca01c-7896-48e9-81cc-9f70661fcb32",
  "_version" : 1
}
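Hand-written JSON bodies are easy to get wrong (a trailing comma is enough to make them invalid), so it can help to let a program serialize the document. The following Python sketch only builds the HTTP request and does not send it. The helper name build_index_request is hypothetical; the host, index and type names are the ones used on the slides.

```python
import json
import urllib.request


def build_index_request(index, doc_type, doc, doc_id=None,
                        host="http://localhost:9200"):
    """Build (but do not send) an index request: PUT with an ID, POST without one."""
    body = json.dumps(doc).encode("utf-8")  # always valid JSON
    if doc_id is None:
        url = f"{host}/{index}/{doc_type}/"  # elasticsearch generates the ID
        method = "POST"
    else:
        url = f"{host}/{index}/{doc_type}/{doc_id}"
        method = "PUT"
    return urllib.request.Request(url, data=body, method=method)


book = {
    "author": "bernhard pflugfelder",
    "post_date": "2013-04-22T14:12:12",
    "title": "my first book",
    "abstract": "this book is about elasticsearch",
}
with_id = build_index_request("books", "book", book, doc_id=1)
auto_id = build_index_request("books", "book", book)
```

Passing one of the request objects to urllib.request.urlopen would send it to a running node.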

Page 112: Suche mit Apache Lucene & Co.

‣  Update operations are done by providing a script that manipulates the field structure

‣  The update process consists of the following steps:

‣  fetch the requested document

‣  apply the script

‣  index the result as a new document

‣  Only the source field _source can be updated

‣  _source is always stored in the index

‣  stores the actual JSON used at index time

‣  can be disabled for every type separately

‣  can be compressed (from version 0.90 compression is done automatically)
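The three update steps (fetch, apply the script, index again) can be mimicked with a small in-memory model. This Python sketch is a toy, not the elasticsearch implementation: the dictionary stands in for the index, a Python function stands in for the update script, and all names are invented.

```python
import copy


def update_document(store, doc_id, script):
    """Toy model of an update: fetch _source, apply the script, index the result."""
    current = store[doc_id]                       # 1. fetch the requested document
    source = copy.deepcopy(current["_source"])
    script(source)                                # 2. apply the script to _source
    store[doc_id] = {                             # 3. index the result as a new document
        "_source": source,
        "_version": current["_version"] + 1,
    }
    return store[doc_id]


def set_tag(source):
    source["tag"] = "search"


store = {"1": {"_source": {"title": "my first book"}, "_version": 1}}
updated = update_document(store, "1", set_tag)
```

The model also shows why _source must be available: without the original JSON there is nothing to apply the script to.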

Index API ES Exercise I

112

{ "book" : { "_source" : { "enabled" : false } } }

Page 113: Suche mit Apache Lucene & Co.

‣  Create a new field tag

‣  Replace the value of field tag

‣  Add an additional value for the field tag

Index API ES Exercise I

113

curl -XPOST 'localhost:9200/books/book/1/_update' -d '{
  "script" : "ctx._source.tag = \"search\""
}'

curl -XPOST 'localhost:9200/books/book/1/_update' -d '{
  "script" : "ctx._source.tag = \"search technologies\""
}'

curl -XPOST 'localhost:9200/books/book/1/_update' -d '{
  "script" : "ctx._source.tags += tag",
  "params" : { "tag" : "open source technologies" }
}'

Page 114: Suche mit Apache Lucene & Co.

‣  Delete a document based on its unique ID

‣  Delete a document based on a search query

Index API ES Exercise I

114

curl -XDELETE 'http://localhost:9200/books/book/1'

curl -XDELETE 'http://localhost:9200/books/book/_query' -d '{
  "term" : { "author" : "bernhard pflugfelder" }
}'

Page 115: Suche mit Apache Lucene & Co.

‣  Term query

‣  Terms query

Search API ES Exercise I

115

$ curl -XGET 'http://localhost:9200/books/book/_search' -d '{
  "query" : { "term" : { "author" : "bernhard" } }
}'

$ curl -XGET 'http://localhost:9200/books/book/_search' -d '{
  "query" : {
    "terms" : {
      "author" : [ "bernhard", "pflugfelder" ],
      "minimum_match" : 1
    }
  }
}'

Page 116: Suche mit Apache Lucene & Co.

‣  Match queries accept text, numeric and date values

‣  Match queries are applied per field and automatically choose the proper analyzer

‣  Types of match queries

‣  boolean (default)

‣  phrase match

‣  phrase prefix match

‣  multi match (two or more fields are searched)

Search API ES Exercise I

116


Page 117: Suche mit Apache Lucene & Co.

‣  Simple syntax

‣  Extended syntax

Search API ES Exercise I

117

$ curl -XGET 'http://localhost:9200/books/book/_search' -d '{
  "query" : { "match" : { "abstract" : "about elasticsearch" } }
}'

{ "match" : { "abstract" : { "query" : "about elasticsearch", "operator" : "and" } } }

Param name   Param value   Description
operator     "and", "or"   Boolean operator used to combine the query terms
fuzziness    0.0 - 1.0     Adds fuzziness to the original terms

Page 118: Suche mit Apache Lucene & Co.

‣  Simple syntax

‣  Extended syntax

Search API ES Exercise I

118

$ curl -XGET 'http://localhost:9200/books/book/_search' -d '{
  "query" : { "match_phrase" : { "abstract" : "about elasticsearch" } }
}'

{ "match_phrase" : { "abstract" : { "query" : "about elasticsearch", "operator" : "and" } } }

Param name   Param value   Description
slop         number        Phrase sloppiness
analyzer     string        Name of the analyzer to be used for the query

Page 119: Suche mit Apache Lucene & Co.

‣  Mapping (aka schema)

‣  Field types

‣  Analyzers

‣  Queries II

ES Exercise II

119

Page 120: Suche mit Apache Lucene & Co.

‣  The schema mapping defines the index structure and document representation

‣  Elasticsearch works without an explicit schema ("schema-less")

‣  However, automatic inference is dangerous in many situations

‣  Thus, defining an explicit schema is the preferred way

‣  A mapping consists of:

‣  type name

‣  list of fields (i.e. properties)

‣  each property defines a field type and, optionally, field attributes

‣  Mappings are formatted in JSON

‣  Mappings are managed using the Mapping API (PUT / POST / GET)

Mapping ES Exercise II

120


Page 121: Suche mit Apache Lucene & Co.

‣  Define a mapping for type book

‣  Retrieve the current mapping for type book

Mapping ES Exercise II

121

cat > book.json <<'EOF'
{ "books" : {
    "properties" : {
      "id" : { "type" : "string" },
      "title" : { "type" : "string" },
      "author" : { "type" : "string" },
      "subject" : { "type" : "string" },
      "view_count" : { "type" : "integer" },
      "created" : { "type" : "date", "format" : "dateOptionalTime" }
    }
} }
EOF
curl -XPUT 'localhost:9200/gutenberg/books/_mapping' -d @book.json

curl 'localhost:9200/gutenberg/books/_mapping?pretty=1'

Page 122: Suche mit Apache Lucene & Co.

‣  Field types ‣  string, date

‣  number

‣  byte, short, integer, long, float, double

‣  boolean, binary (BASE64)

‣  Common field attributes

Mapping ES Exercise II

122

Name         Value      Description
index_name   string     Field name stored within the index
index        yes / no   Whether the field shall be searchable
store        yes / no   Whether the original values shall be stored
analyzer     string     Analyzer used for that field
null_value   value      Default field value if no value is assigned to a document

Page 123: Suche mit Apache Lucene & Co.

Analyzers ES Exercise II

123

‣  Analyzers are defined either

‣  in elasticsearch.yml or elasticsearch.json

‣  by the Index API

‣  Common analyzers

‣  standard

‣  whitespace

‣  stop

‣  keyword

‣  language

‣  snowball

curl 'localhost:9200/_analyze?analyzer=standard' -d 'elasticsearch is groovy!'
curl 'localhost:9200/_analyze?analyzer=whitespace' -d 'elasticsearch is groovy!'
curl 'localhost:9200/_analyze?analyzer=stop' -d 'elasticsearch is groovy!'
curl 'localhost:9200/_analyze?analyzer=keyword' -d 'elasticsearch is groovy!'
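Without a running node, the differences between some of these analyzers can be approximated in a few lines of Python. The functions below are rough imitations written for this illustration, not the Lucene analyzers: the stop word set is a small assumed English list, and the standard analyzer is left out because its exact behaviour depends on the version.

```python
import re

# Small English stop word list, assumed for illustration only
STOP_WORDS = {"a", "an", "and", "are", "as", "at", "be", "but", "by", "for",
              "if", "in", "into", "is", "it", "no", "not", "of", "on", "or",
              "such", "that", "the", "their", "then", "there", "these",
              "they", "this", "to", "was", "will", "with"}


def whitespace_analyzer(text):
    """Split on whitespace only; case and punctuation are kept."""
    return text.split()


def keyword_analyzer(text):
    """Keep the whole input as a single token."""
    return [text]


def stop_analyzer(text):
    """Lower-case, split on non-letters, drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


text = "elasticsearch is groovy!"
tokens = {
    "whitespace": whitespace_analyzer(text),
    "keyword": keyword_analyzer(text),
    "stop": stop_analyzer(text),
}
```

The _analyze calls above remain the authoritative check, since they show what a node actually produces.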

Page 124: Suche mit Apache Lucene & Co.

Analyzers ES Exercise II

124

discovery.zen.multicast.enabled: false
http:
  max_content_length: 100000
index:
  number_of_shards: 1
  analysis:
    analyzer:
      default:
        type: standard
      lowercase_analyzer:
        type: custom
        tokenizer: standard
        filter: [standard, lowercase]

Page 125: Suche mit Apache Lucene & Co.

‣  Elasticsearch provides two highlighting algorithms

‣  fast vector highlighter

‣  highlighter (standard implementation)

‣  Requirement to use fast vector highlighter

Highlighting ES Exercise II

125

{ "books" : {
    "properties" : {
      "title" : { "type" : "string", "term_vector" : "with_positions_offsets" }
    }
} }

{
  "query" : {...},
  "highlight" : {
    "pre_tags" : ["<tag1>", "<tag2>"],
    "post_tags" : ["</tag1>", "</tag2>"],
    "fields" : { "_all" : {} }
  }
}

Page 126: Suche mit Apache Lucene & Co.

‣  Faceted search

‣  Filter query

‣  Sorting

‣  More Like This

ES Exercise III

126

Page 127: Suche mit Apache Lucene & Co.

‣  Elasticsearch provides the following facet mechanisms:

‣  Group results by a field value

‣  Group by numeric or date ranges

‣  Group numeric or date values in equally sized buckets (histogram)

‣  Group results around a coordinate based on the geo distance

‣  Basic facet definition

‣  Facet types: terms, range, histogram, date_histogram, geo_distance

Faceted search ES Exercise III

127

{ "facets" : { "<FACET NAME>" : { "<FACET TYPE>" : { ... }, "global" : true }}}


Page 128: Suche mit Apache Lucene & Co.

Faceted search ES Exercise III

128

curl -X POST 'http://localhost:9200/gutenberg/books/_search?pretty=1' -d '{
  "from" : 0,
  "size" : 10,
  "query" : { "match" : { "author" : "schiller" } },
  "facets" : {
    "tagsFacet" : { "terms" : { "field" : "subject", "size" : 10 } }
  }
}'

Page 129: Suche mit Apache Lucene & Co.

Faceted search ES Exercise III

129

{
  "query" : { "match_all" : {} },
  "facets" : {
    "range1" : {
      "range" : {
        "view_count" : [
          { "to" : 50 },
          { "from" : 20, "to" : 70 },
          { "from" : 70, "to" : 120 },
          { "from" : 150 }
        ]
      }
    }
  }
}

Page 130: Suche mit Apache Lucene & Co.

‣  The histogram facet works on any numeric field

‣  Field values are rounded to fit in the respective bucket

‣  The property interval defines the bucket size

Faceted search ES Exercise III

130

{
  "query" : { "match_all" : {} },
  "facets" : {
    "histo1" : {
      "histogram" : { "field" : "view_count", "interval" : 100 }
    }
  }
}
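The rounding of field values into buckets can be reproduced with integer arithmetic. This is a minimal Python sketch under the assumption that a value falls into the bucket whose key is the value rounded down to a multiple of the interval; it is limited to non-negative values, and both the function name and the sample view counts are invented.

```python
from collections import Counter


def histogram(values, interval):
    """Count values per bucket; the bucket key is the value rounded down to the interval."""
    counts = Counter((value // interval) * interval for value in values)
    return dict(sorted(counts.items()))


# Hypothetical view_count values, bucketed with interval 100 as in the request
view_counts = [5, 99, 100, 180, 250]
buckets = histogram(view_counts, 100)
```

Each key is the lower bound of its bucket, so the key 100 covers the values 100 to 199.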

Page 131: Suche mit Apache Lucene & Co.

‣  Elasticsearch also provides filter queries, which are cached internally for optimal performance

‣  A filter query can be applied to an already computed search result, as in this example

Filter query ES Exercise III

131

curl -XPOST 'localhost:9200/gutenberg/books/_search?pretty=1' -d '{
  "query" : { "term" : { "title" : "schiller" } },
  "filter" : { "term" : { "subject" : "drama" } },
  "facets" : {
    "tag" : { "terms" : { "field" : "subject" } }
  }
}'

Page 132: Suche mit Apache Lucene & Co.

‣  Alternatively, the filter query is applied up front, while the user query is being executed

‣  Difference to previous filter query?

Filter query ES Exercise III

132

curl -XPOST 'localhost:9200/books/_search?pretty=1' -d '{
  "query" : {
    "filtered" : {
      "query" : { "term" : { "author" : "schiller" } },
      "filter" : { "range" : { "view_count" : { "from" : 50, "to" : 100 } } }
    }
  }
}'
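One way to approach the question about the difference between the two filter placements is to simulate both. The sketch assumes the behaviour of elasticsearch versions of this era: a top-level filter narrows only the returned hits, after facets have been counted, while the filter of a filtered query narrows the set that facets are counted on. The documents and function names are invented for illustration.

```python
from collections import Counter

DOCS = [
    {"title": "schiller", "subject": "drama"},
    {"title": "schiller", "subject": "poetry"},
    {"title": "schiller", "subject": "drama"},
    {"title": "goethe", "subject": "drama"},
]


def search(docs, query, post_filter=None, filtered_by=None):
    """Simulate hits and facet counts for the two filter placements."""
    matched = [d for d in docs if query(d)]
    if filtered_by is not None:
        # filtered query: the filter is part of the query, facets see the result
        matched = [d for d in matched if filtered_by(d)]
    facets = dict(Counter(d["subject"] for d in matched))
    hits = matched
    if post_filter is not None:
        # top-level filter: applied to the hits only, after facet counting
        hits = [d for d in matched if post_filter(d)]
    return hits, facets


def is_schiller(doc):
    return doc["title"] == "schiller"


def is_drama(doc):
    return doc["subject"] == "drama"


hits_a, facets_a = search(DOCS, is_schiller, post_filter=is_drama)
hits_b, facets_b = search(DOCS, is_schiller, filtered_by=is_drama)
```

Both variants return the same hits, but only the first still shows counts for the other subjects, which is what a drill-down navigation usually needs.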

Page 133: Suche mit Apache Lucene & Co.

‣  Sorting is done based on one or multiple fields

‣  With multiple sort fields, results are sorted by the first field, and later fields break ties

‣  ascending / descending sorting

‣  _score sorts by the document score

Sorting ES Exercise III

133

curl -XPOST 'localhost:9200/gutenberg/books/_search?pretty=1' -d '{
  "sort" : [
    { "view_count" : { "order" : "desc" } },
    "_score"
  ],
  "query" : { "term" : { "title" : "schiller" } }
}'

Page 134: Suche mit Apache Lucene & Co.

mlt query ES Exercise III

134

curl -XPOST 'localhost:9200/gutenberg/books/_search?pretty=1' -d '{
  "query" : {
    "more_like_this" : {
      "fields" : ["title", "subject"],
      "like_text" : "text like this one",
      "min_term_freq" : 1,
      "max_query_terms" : 12
    }
  }
}'

Name                     Value          Description
fields                   fieldname(s)   List of fields used for mlt
like_text                string         The text to find similar documents for
min_term_freq            number         Minimum term frequency
max_query_terms          number         Maximum number of query terms
min_doc_freq             number         Minimum document frequency
max_doc_freq             number         Maximum document frequency
percent_terms_to_match   0.0 - 1.0      Fraction of terms that must match