How to use Elasticsearch Analyzers by EmergiNet

Analyzers. Pablo Musa, EmergiNet. May 5, 2014

description

Presentation (originally in Brazilian Portuguese) about how to use Elasticsearch analyzers to improve your searches. The content was presented at the Elasticsearch Meetups in Rio de Janeiro and Porto Alegre by Pablo Musa from EmergiNet.

Transcript of How to use Elasticsearch Analyzers by EmergiNet

Page 1: How to use  Elasticsearch Analyzers by EmergiNet

Analyzers

Pablo Musa

EmergiNet

May 5, 2014

Page 2: How to use  Elasticsearch Analyzers by EmergiNet

Outline

1 Motivation

2 Elasticsearch and EmergiNet

3 Basic Concepts

4 Creating an Analyzer

5 Common Problems

6 Other Work

Pablo Musa (EmergiNet) Analyzers May 5, 2014 2 / 26

Page 3: How to use  Elasticsearch Analyzers by EmergiNet

Motivation: Use Case

Shopping site

“Full text search” in SQL is complex and slow

Need for a search system that is:

- faster

- more precise

- simpler to develop


Page 4: How to use  Elasticsearch Analyzers by EmergiNet

Elasticsearch

Fast (on average 100x faster)

Excellent results

Easy to consume

- Very simple installation, and scalable

- Simple RESTful API using JSON

- “The schema is automatic”


Page 5: How to use  Elasticsearch Analyzers by EmergiNet

Elasticsearch and EmergiNet: The default is not always the best

Nobody knows your data better than you do

Custom mapping

EmergiNet: consulting or project execution

Optimize the application and add features

1 Sorting

2 Aggregations

3 Auto-Complete, Suggester

4 Helping with SEO (Search Engine Optimization)


Page 6: How to use  Elasticsearch Analyzers by EmergiNet

Elasticsearch: Empty Index

{
  "settings": {
    "analysis": {
      "filter": {},
      "analyzer": {
        "my_analyzer": {
          "type": "",
          "char_filter": [],
          "tokenizer": "",
          "filter": []
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "title": {
          "type": "",
          "index": "",
          "analyzer": ""
        }
      }
    }
  }
}

“Empty” analysis and mappings: an example of the structure to be filled in.


Page 7: How to use  Elasticsearch Analyzers by EmergiNet

Stages of an analyzer

1 Tidy up

2 Split

3 Normalize

Elasticsearch offers predefined analyzers. For example: standard, simple, whitespace, language
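
The three stages can be pictured as a small function pipeline. The sketch below is plain Python, not Elasticsearch code; the `Analyzer` class and the sample stage functions are hypothetical names chosen only for this illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List

CharFilter = Callable[[str], str]                # stage 1: string -> string
Tokenizer = Callable[[str], List[str]]           # stage 2: string -> tokens
TokenFilter = Callable[[List[str]], List[str]]   # stage 3: tokens -> tokens


@dataclass
class Analyzer:
    """Hypothetical stand-in for an analyzer: zero or more character
    filters, exactly one tokenizer, zero or more token filters."""
    tokenizer: Tokenizer
    char_filters: List[CharFilter] = field(default_factory=list)
    token_filters: List[TokenFilter] = field(default_factory=list)

    def analyze(self, text: str) -> List[str]:
        for char_filter in self.char_filters:     # 1. tidy up
            text = char_filter(text)
        tokens = self.tokenizer(text)             # 2. split
        for token_filter in self.token_filters:   # 3. normalize
            tokens = token_filter(tokens)
        return tokens


# Made-up stages: replace "&", split on whitespace, lowercase every token.
demo = Analyzer(
    tokenizer=str.split,
    char_filters=[lambda s: s.replace("&", " and ")],
    token_filters=[lambda tokens: [t.lower() for t in tokens]],
)
print(demo.analyze("Telhas & Tijolos"))  # ['telhas', 'and', 'tijolos']
```

Only the tokenizer is a required argument, which mirrors the slides: stages 1 and 3 are optional, stage 2 is mandatory.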


Page 8: How to use  Elasticsearch Analyzers by EmergiNet

Tidy up: Character Filters

“Pre-processing”

Cleans the string

Optional

Currently there are 3 types:

- mapping (e.g. "ph" => "f")

- html_strip (removes tags and maps entities, "&aacute;" => "á")

- pattern_replace (regular expression)
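
A rough Python approximation of what each of the three character filter types does to a string. These functions are simplified illustrations (for instance, the tag removal is a naive regular expression), not the Elasticsearch implementations, and their names are made up for this sketch.

```python
import html
import re


def mapping_filter(text: str, rules: dict) -> str:
    """Replace each key with its value, like the "ph" => "f" example."""
    for source, target in rules.items():
        text = text.replace(source, target)
    return text


def html_strip_filter(text: str) -> str:
    """Drop tags with a naive regex, then decode entities such as &aacute;."""
    without_tags = re.sub(r"<[^>]+>", "", text)
    return html.unescape(without_tags)


def pattern_replace_filter(text: str, pattern: str, replacement: str) -> str:
    """Regular-expression substitution over the whole string."""
    return re.sub(pattern, replacement, text)


print(mapping_filter("pharmacia", {"ph": "f"}))         # farmacia
print(html_strip_filter("<b>Ol&aacute;</b> mundo"))     # Olá mundo
print(pattern_replace_filter("05-05-2014", r"-", "/"))  # 05/05/2014
```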


Page 9: How to use  Elasticsearch Analyzers by EmergiNet

Tidy up: Analysis with Character Filters

"analysis": {
  "filter": {},
  "analyzer": {
    "my_analyzer": {
      "type": "custom",
      "char_filter": ["html_strip"],
      "tokenizer": "",
      "filter": []
    }
  }
}

Analysis with character filter function only.


Page 10: How to use  Elasticsearch Analyzers by EmergiNet

Split: Tokenizers

“Processing”

Splits the string into individual terms

Required

Currently there are 10 types:

- standard

- keyword

- whitespace

- ngram, edge ngram

- letter, lowercase (opt), pattern, uax_url_email, path_hierarchy
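
To make the difference between some of these tokenizers concrete, here is a minimal Python sketch of four of them. The function names and the default `min_gram`/`max_gram` values are assumptions made for this example; the real tokenizers have more options and handle more edge cases.

```python
import re
from typing import List


def keyword_tokenizer(text: str) -> List[str]:
    """Emit the whole input as a single token."""
    return [text]


def whitespace_tokenizer(text: str) -> List[str]:
    """Split on runs of whitespace only."""
    return text.split()


def letter_tokenizer(text: str) -> List[str]:
    """Split whenever a non-letter character is found."""
    return re.findall(r"[^\W\d_]+", text)


def edge_ngram_tokenizer(text: str, min_gram: int = 1, max_gram: int = 3) -> List[str]:
    """Emit prefixes of the input, from min_gram to max_gram characters."""
    return [text[:n] for n in range(min_gram, min(max_gram, len(text)) + 1)]


sample = "Sao Paulo, SP-01"
print(keyword_tokenizer(sample))     # ['Sao Paulo, SP-01']
print(whitespace_tokenizer(sample))  # ['Sao', 'Paulo,', 'SP-01']
print(letter_tokenizer(sample))      # ['Sao', 'Paulo', 'SP']
print(edge_ngram_tokenizer("casa"))  # ['c', 'ca', 'cas']
```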


Page 11: How to use  Elasticsearch Analyzers by EmergiNet

Split: Analysis with Character Filters and Tokenizers

"analysis": {
  "filter": {},
  "analyzer": {
    "my_analyzer": {
      "type": "custom",
      "char_filter": ["html_strip"],
      "tokenizer": "standard",
      "filter": []
    }
  }
}

Analysis with character filter and tokenizer function.


Page 12: How to use  Elasticsearch Analyzers by EmergiNet

Normalize: Token Filters

“Post-processing”

Normalizes the tokens (modifying or removing them)

Optional

Currently there are 33 types:

- asciifolding

- lowercase, uppercase

- stop

- stemmer

- ngram, edge ngram, length, snowball, synonym, ...
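
A small Python sketch of three token filters, showing one that modifies tokens (lowercase), one that rewrites characters (an approximation of asciifolding based on Unicode decomposition), and one that removes tokens (length). Function names and the `min_len`/`max_len` defaults are assumptions for illustration only.

```python
import unicodedata
from typing import List


def lowercase_filter(tokens: List[str]) -> List[str]:
    return [token.lower() for token in tokens]


def ascii_folding_filter(tokens: List[str]) -> List[str]:
    """Approximate ASCII folding: decompose accented characters (NFKD)
    and drop the combining marks."""
    folded = []
    for token in tokens:
        decomposed = unicodedata.normalize("NFKD", token)
        folded.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return folded


def length_filter(tokens: List[str], min_len: int = 2, max_len: int = 20) -> List[str]:
    """Remove tokens that are shorter or longer than the given bounds."""
    return [t for t in tokens if min_len <= len(t) <= max_len]


tokens = ["Ação", "É", "Rápida"]
tokens = lowercase_filter(tokens)      # ['ação', 'é', 'rápida']
tokens = ascii_folding_filter(tokens)  # ['acao', 'e', 'rapida']
tokens = length_filter(tokens)         # ['acao', 'rapida']
print(tokens)
```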


Page 13: How to use  Elasticsearch Analyzers by EmergiNet

Normalize: Analysis Complete

"analysis": {
  "filter": {},
  "analyzer": {
    "my_analyzer": {
      "type": "custom",
      "char_filter": ["html_strip"],
      "tokenizer": "standard",
      "filter": ["lowercase", "asciifolding"]
    }
  }
}

Analysis using all functions.


Page 14: How to use  Elasticsearch Analyzers by EmergiNet

Normalize: stop token filter

Stop Words

Removes unwanted words

It is based on a list of words, and the filter must be created manually

"stop_noise": {
  "type": "stop",
  "stopwords_path": "sw.txt"
}

"stop_noise": {
  "type": "stop",
  "stopwords": ["o", "a", "no", "na", "de", "da", "as", "os"]
}

Stop word token filter definitions (word list in a file, or inline). ignore_case and remove_trailing are boolean settings.
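
The effect of a stop filter, and of its `ignore_case` setting, can be sketched in a few lines of Python. The `stop_filter` function is a hypothetical helper written for this example; the word set is the inline list from the slide.

```python
from typing import Iterable, List

# The inline stop word list from the slide.
STOP_NOISE = {"o", "a", "no", "na", "de", "da", "as", "os"}


def stop_filter(tokens: Iterable[str], stopwords=STOP_NOISE, ignore_case: bool = False) -> List[str]:
    """Drop every token found in the stop word list.

    With ignore_case=True the comparison is case-insensitive, but the
    surviving tokens keep their original casing.
    """
    if ignore_case:
        lowered = {word.lower() for word in stopwords}
        return [t for t in tokens if t.lower() not in lowered]
    return [t for t in tokens if t not in stopwords]


tokens = ["O", "meetup", "de", "maio"]
print(stop_filter(tokens))                    # ['O', 'meetup', 'maio']
print(stop_filter(tokens, ignore_case=True))  # ['meetup', 'maio']
```

The first call misses the capitalized "O", which is one reason to run a lowercase filter before the stop filter when `ignore_case` is left off.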


Page 15: How to use  Elasticsearch Analyzers by EmergiNet

Normalize: Analysis Complete with stop words

"analysis": {
  "filter": {
    "stop_noise": {
      "type": "stop",
      "stopwords_path": "sw.txt"
    }
  },
  "analyzer": {
    "my_analyzer": {
      "type": "custom",
      "char_filter": ["html_strip"],
      "tokenizer": "standard",
      "filter": ["lowercase", "stop_noise", "asciifolding"]
    }
  }
}

Analysis using all functions and my own stop words filter.


Page 16: How to use  Elasticsearch Analyzers by EmergiNet

Normalize: stemmer token filter

Stemmer (derivations)

“Locks” words to a stem ("jogar" => "joga" or "jogar" => "jog")

It is based on an existing set of stemmers, but the filter must be defined manually

"my_stemmer": {
  "type": "stemmer",
  "name": "light_portuguese"
}

Stemmer token filter definition. minimal_portuguese and portuguese are other Portuguese options.
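
To illustrate the idea behind "jogar" => "jog", here is a deliberately naive suffix-stripping function in Python. It is a toy: the suffix list is invented for this example and it is not the light_portuguese algorithm (nor any other real stemmer). It only shows how different inflections can collapse to the same indexed term.

```python
# Toy suffix list, invented for this illustration only.
TOY_SUFFIXES = ("ando", "amos", "ar", "ou", "a", "o")


def toy_stem(word: str, min_stem: int = 3) -> str:
    """Strip the first matching suffix, keeping at least min_stem characters."""
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


for word in ("jogar", "jogando", "jogou", "joga"):
    print(word, "=>", toy_stem(word))  # each of the four words becomes "jog"
```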


Page 17: How to use  Elasticsearch Analyzers by EmergiNet

Normalize: Analysis Complete with stop words and stemmer

"analysis": {
  "filter": {
    "stop_noise": {
      "type": "stop",
      "stopwords_path": "sw.txt"
    },
    "light_pt": {
      "type": "stemmer",
      "name": "light_portuguese"
    }
  },
  "analyzer": {
    "my_analyzer": {
      "type": "custom",
      "char_filter": ["html_strip"],
      "tokenizer": "standard",
      "filter": ["lowercase", "stop_noise", "asciifolding", "light_pt"]
    }
  }
}

Analysis using all functions, with my own stop words and light portuguese stemmer filters.


Page 18: How to use  Elasticsearch Analyzers by EmergiNet

One Field Mapping

"mappings": {
  "my_type": {
    "properties": {
      "title": {
        "type": "string",
        "index": "analyzed",
        "analyzer": "my_analyzer"
      }
    }
  }
}

Simple mapping with one string field using my analyzer.


Page 19: How to use  Elasticsearch Analyzers by EmergiNet

Problems

Sorting

Aggregation

SEO (Search Engine Optimization)


Page 20: How to use  Elasticsearch Analyzers by EmergiNet

Problems: Sorting

Sorting on analyzed fields produces seemingly random results

"Telha" < "casa"

New analyzer

"sort": {
  "type": "custom",
  "tokenizer": "keyword",
  "filter": ["lowercase", "asciifolding"]
}

Sort analyzer. Makes use of lowercase and asciifolding filters and the keyword tokenizer.
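
The "Telha" < "casa" problem and the effect of the sort analyzer can be reproduced in plain Python. The `sort_key` function below is a hypothetical approximation of the analyzer (whole value kept as one token, lowercased, accents folded); the sample titles are made up.

```python
import unicodedata


def sort_key(value: str) -> str:
    """Mimic the "sort" analyzer: keyword tokenizer (whole value kept as
    one token), then lowercase, then an approximation of asciifolding."""
    decomposed = unicodedata.normalize("NFKD", value.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))


titles = ["casa", "Telha", "Árvore", "tijolo"]
print(sorted(titles))                # ['Telha', 'casa', 'tijolo', 'Árvore']
print(sorted(titles, key=sort_key))  # ['Árvore', 'casa', 'Telha', 'tijolo']
```

Without the key, uppercase letters sort before lowercase ones and accented letters sort last, which is what makes the raw ordering look random to a user.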


Page 21: How to use  Elasticsearch Analyzers by EmergiNet

Problems: Aggregation

How it works: "sao", "paulo", "rio"

What we want: "Sao Paulo"

In other words, we do not want analysis
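
A short Python sketch of why aggregating on an analyzed field gives per-term buckets instead of per-value buckets. It only simulates the counting with `collections.Counter`; the city list is sample data invented for this example.

```python
from collections import Counter

cities = ["Sao Paulo", "Rio de Janeiro", "Sao Paulo"]

# Analyzed field: each document contributes its individual lowercase terms.
analyzed_buckets = Counter(
    term for city in cities for term in set(city.lower().split())
)

# not_analyzed field: each document contributes its exact, whole value.
raw_buckets = Counter(cities)

print(dict(analyzed_buckets))     # buckets such as "sao", "paulo", "rio"
print(raw_buckets.most_common())  # [('Sao Paulo', 2), ('Rio de Janeiro', 1)]
```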


Page 22: How to use  Elasticsearch Analyzers by EmergiNet

Problems: Search Engine Optimization

The stemmer is undesirable here

New analyzer

"url_analyzer": {
  "type": "custom",
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase", "stop_noise", "asciifolding"]
}

URL analyzer for SEO. It will not be used in mappings.


Page 23: How to use  Elasticsearch Analyzers by EmergiNet

Problems: Search Engine Optimization

We do not need to map it to a field

analyze API

curl -XPOST "http://localhost:9200/my_index/_analyze?analyzer=url_analyzer" -d 'O Meetup Elasticsearch RJ sera no dia 05 de maio as 18h.'

> meetup elasticsearch rj sera dia 05 maio 18h

analyze API Example.
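
The token list shown for the example sentence can be reproduced offline with a rough Python approximation of url_analyzer. Two assumptions are made here: that sw.txt holds the same words as the inline stop word list from the stop filter slide, and that the tokens are then joined into a URL slug (the `slug` helper is hypothetical; the slides do not show how the tokens are used).

```python
import html
import re
import unicodedata

# Assumption: sw.txt holds the same words as the inline list shown earlier.
STOP_NOISE = {"o", "a", "no", "na", "de", "da", "as", "os"}


def url_tokens(text: str) -> list:
    """Approximate url_analyzer: html_strip, a simplified standard
    tokenizer (runs of word characters), lowercase, stop_noise, asciifolding."""
    text = html.unescape(re.sub(r"<[^>]+>", "", text))
    tokens = [t.lower() for t in re.findall(r"\w+", text)]
    tokens = [t for t in tokens if t not in STOP_NOISE]
    folded = (unicodedata.normalize("NFKD", t) for t in tokens)
    return ["".join(c for c in t if not unicodedata.combining(c)) for t in folded]


def slug(text: str) -> str:
    """Hypothetical helper: join the analyzed tokens into a URL slug."""
    return "-".join(url_tokens(text))


sentence = "O Meetup Elasticsearch RJ sera no dia 05 de maio as 18h."
print(" ".join(url_tokens(sentence)))  # meetup elasticsearch rj sera dia 05 maio 18h
print(slug(sentence))                  # meetup-elasticsearch-rj-sera-dia-05-maio-18h
```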


Page 24: How to use  Elasticsearch Analyzers by EmergiNet

Result

{
  "settings": {
    "analysis": {
      "filter": {
        "stop_noise": {
          "type": "stop",
          "stopwords_path": "sw.txt"
        },
        "light_pt": {
          "type": "stemmer",
          "name": "light_portuguese"
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase", "stop_noise", "asciifolding", "light_pt"]
        },
        "sort": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["lowercase", "asciifolding"]
        },
        "url_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase", "stop_noise", "asciifolding"]
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "title": {
          "type": "string",
          "index": "analyzed",
          "analyzer": "my_analyzer",
          "fields": {
            "sort": {
              "type": "string",
              "index": "analyzed",
              "analyzer": "sort"
            },
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        }
      }
    }
  }
}

Complete mapping for one field, using sub-fields for text search, sorting, and aggregation.
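
A sketch of a search request body that uses each of the three fields for its purpose: `title` for full-text matching, `title.sort` for ordering, and `title.raw` for exact-value buckets. The field names come from the mapping; the query text, the aggregation name `by_title`, and the bucket size are made up for this example. The body is only built and printed here, not sent to a cluster.

```python
import json

# Field names come from the mapping; the query text, aggregation name
# ("by_title") and size are made up for this example.
search_body = {
    # Full-text search on the field analyzed with my_analyzer.
    "query": {"match": {"title": "telha"}},
    # Case- and accent-insensitive ordering through the "sort" sub-field.
    "sort": [{"title.sort": {"order": "asc"}}],
    # Exact, unanalyzed values as aggregation buckets.
    "aggs": {"by_title": {"terms": {"field": "title.raw", "size": 10}}},
}

print(json.dumps(search_body, indent=2))
```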


Page 25: How to use  Elasticsearch Analyzers by EmergiNet

Other Work

Boost

Parent/Child

Log storage (Logstash + Kibana)

Infrastructure consulting for ELK


Page 26: How to use  Elasticsearch Analyzers by EmergiNet

Thank you

www.emergi.net - [email protected]

“Keep it simple, but not simpler.”