How to use Elasticsearch Analyzers by EmergiNet

Page 1

Analyzers

Pablo Musa

EmergiNet

May 5, 2014

Page 2

Outline

1 Motivation
2 Elasticsearch and EmergiNet
3 Basic Concepts
4 Creating an Analyzer
5 Common Problems
6 Other Work


Page 3

Motivation: Use Case

Online shopping site

“Full-text search” in SQL is complex and slow

Need for a search system that is:

- faster
- more accurate
- simpler to develop


Page 4

Elasticsearch

Fast (100x faster on average)

Excellent results

Easy to use:

- Very simple and scalable installation
- Simple RESTful API using JSON
- “The schema is automatic”


Page 5

Elasticsearch and EmergiNet: The default is not always the best

No one knows your data better than you do

Custom mapping

EmergiNet: consulting or project delivery

Optimize the application and add features:

1 Sorting
2 Aggregations
3 Auto-Complete, Suggester
4 Help with SEO (Search Engine Optimization)


Page 6

Elasticsearch: Empty Index

{
  "settings": {
    "analysis": {
      "filter": {},
      "analyzer": {
        "my_analyzer": {
          "type": "",
          "char_filter": [],
          "tokenizer": "",
          "filter": []
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "title": {
          "type": "",
          "index": "",
          "analyzer": ""
        }
      }
    }
  }
}

“Empty” analysis and mappings. Example of the structure to be filled in.


Page 7

Stages of an analyzer

1 Clean up
2 Split
3 Normalize

Elasticsearch offers predefined analyzers. For example: standard, simple, whitespace, language
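The three stages can be pictured as a small function pipeline. The sketch below is only an illustration in plain Python, not Elasticsearch code: the `analyze` and `lowercase` names and the stand-in stages (`str.strip`, `str.split`) are assumptions chosen for brevity.

```python
def analyze(text, char_filters, tokenizer, token_filters):
    """Run text through the three analyzer stages, in order."""
    for char_filter in char_filters:      # 1. clean up the raw string
        text = char_filter(text)
    tokens = tokenizer(text)              # 2. split the string into terms
    for token_filter in token_filters:    # 3. normalize (change or drop terms)
        tokens = token_filter(tokens)
    return tokens


def lowercase(tokens):
    """Token filter: lowercase every term."""
    return [token.lower() for token in tokens]


# Stand-in stages: strip outer spaces, split on whitespace, lowercase.
print(analyze("  Casa de Praia  ", [str.strip], str.split, [lowercase]))
# ['casa', 'de', 'praia']
```

Only the tokenizer is a single required value; the other two stages are lists that may be empty, which mirrors the optional/mandatory split described on the following slides.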


Page 8

Clean Up: Character Filters

“Pre-processing”

Cleans the string

Optional

Currently there are 3 types:

- mapping (e.g. "ph" => "f")
- html strip (removes tags and decodes entities, "&aacute;" => "á")
- pattern replace (regular expression)
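To make the three character filter types concrete, here is a rough Python approximation of what each one does to a string. The function names are hypothetical and the behaviour is simplified (for instance, the regular expression used for tag removal is far cruder than a real HTML stripper).

```python
import re
from html import unescape


def mapping_filter(text, rules):
    """Replace each source substring with its target, e.g. "ph" => "f"."""
    for source, target in rules.items():
        text = text.replace(source, target)
    return text


def html_strip(text):
    """Drop anything that looks like a tag, then decode HTML entities."""
    return unescape(re.sub(r"<[^>]+>", "", text))


def pattern_replace(text, pattern, replacement):
    """Rewrite the string with a regular expression."""
    return re.sub(pattern, replacement, text)


print(mapping_filter("pharmacia", {"ph": "f"}))               # farmacia
print(html_strip("<p>caf&eacute;</p>"))                       # café
print(pattern_replace("123-456", r"(\d+)-(\d+)", r"\1_\2"))   # 123_456
```

All three operate on the whole string before any splitting happens, which is why they sit in the "clean up" stage.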


Page 9

Clean Up: Analysis with Character Filters

"analysis": {
  "filter": {},
  "analyzer": {
    "my_analyzer": {
      "type": "custom",
      "char_filter": [
        "html_strip"
      ],
      "tokenizer": "",
      "filter": []
    }
  }
}

Analysis with character filter function only.


Page 10

Split: Tokenizers

“Processing”

Splits the string into individual terms

Mandatory

Currently there are 10 types:

- standard
- keyword
- whitespace
- ngram, edge ngram
- letter, lowercase (opt), pattern, uax url email, path hierarchy
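The difference between tokenizers is easiest to see side by side. The following Python sketch imitates three of them with hypothetical helper functions; it is a simplification for illustration, and the real tokenizers have more options and edge cases.

```python
def keyword(text):
    """Keep the whole input as a single term."""
    return [text]


def whitespace(text):
    """Split on runs of whitespace only."""
    return text.split()


def edge_ngram(text, min_gram, max_gram):
    """Emit prefixes of the input, from min_gram to max_gram characters."""
    longest = min(max_gram, len(text))
    return [text[:size] for size in range(min_gram, longest + 1)]


print(keyword("Rio de Janeiro"))     # ['Rio de Janeiro']
print(whitespace("Rio de Janeiro"))  # ['Rio', 'de', 'Janeiro']
print(edge_ngram("telha", 1, 3))     # ['t', 'te', 'tel']
```

The keyword behaviour (one term per value) is what the sort analyzer later in the deck relies on, and prefix terms like those from edge ngram are a common basis for auto-complete.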


Page 11

Split: Analysis with Character Filters and Tokenizers

"analysis": {
  "filter": {},
  "analyzer": {
    "my_analyzer": {
      "type": "custom",
      "char_filter": [
        "html_strip"
      ],
      "tokenizer": "standard",
      "filter": []
    }
  }
}

Analysis with character filter and tokenizer function.


Page 12

Normalize: Token Filters

“Post-processing”

Normalizes the tokens (changes or removes them)

Optional

Currently there are 33 types:

- ascii folding
- lowercase, uppercase
- stop
- stemmer
- ngram, edge ngram, length, snowball, synonym, ...


Page 13

Normalize: Analysis Complete

"analysis": {
  "filter": {},
  "analyzer": {
    "my_analyzer": {
      "type": "custom",
      "char_filter": [
        "html_strip"
      ],
      "tokenizer": "standard",
      "filter": [
        "lowercase",
        "asciifolding"
      ]
    }
  }
}

Analysis using all functions.
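As a rough picture of what the `lowercase` and `asciifolding` filters do to each term, the sketch below approximates ASCII folding with Unicode decomposition from the Python standard library. This is an assumption made for illustration: the real filter covers more characters than this approach does.

```python
import unicodedata


def asciifolding(token):
    """Approximate ASCII folding: decompose, then drop combining accents."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize(tokens):
    """Apply the two token filters in the order listed in the analyzer."""
    return [asciifolding(token.lower()) for token in tokens]


print(normalize(["Promoção", "VERÃO", "Açaí"]))
# ['promocao', 'verao', 'acai']
```

With both filters in place, a search for "promocao" and a document containing "Promoção" end up with the same term.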


Page 14

Normalize: stop token filter

Stop Words

Removes unwanted words

It is based on a word list and must be defined manually

"stop_noise": {
  "type": "stop",
  "stopwords_path": "sw.txt"
}

"stop_noise": {
  "type": "stop",
  "stopwords": ["o", "a", "no", "na", "de", "da", "as", "os"]
}

Stop word token filter definition. ignore_case and remove_trailing are boolean settings.
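A stop filter is essentially a membership test against the word list. The Python sketch below uses a hypothetical `stop_filter` function with the inline word list from the slide; the `ignore_case` argument only imitates the idea of the setting of the same name.

```python
def stop_filter(tokens, stopwords, ignore_case=False):
    """Drop every token that appears in the stop word list."""
    if ignore_case:
        lowered = {word.lower() for word in stopwords}
        return [token for token in tokens if token.lower() not in lowered]
    exact = set(stopwords)
    return [token for token in tokens if token not in exact]


STOPWORDS = ["o", "a", "no", "na", "de", "da", "as", "os"]
tokens = ["O", "preco", "da", "telha"]

print(stop_filter(tokens, STOPWORDS))                    # ['O', 'preco', 'telha']
print(stop_filter(tokens, STOPWORDS, ignore_case=True))  # ['preco', 'telha']
```

The first call shows why the order of token filters matters: with a case-sensitive list, a capitalized "O" survives unless the tokens are lowercased before the stop filter runs.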


Page 15

Normalize: Analysis Complete with stop words

"analysis": {
  "filter": {
    "stop_noise": {
      "type": "stop",
      "stopwords_path": "sw.txt"
    }
  },
  "analyzer": {
    "my_analyzer": {
      "type": "custom",
      "char_filter": [
        "html_strip"
      ],
      "tokenizer": "standard",
      "filter": [
        "lowercase",
        "stop_noise",
        "asciifolding"
      ]
    }
  }
}

Analysis using all functions and my own stop words filter.


Page 16

Normalize: stemmer token filter

Stemmer (word derivations)

Reduces words to a stem ("jogar" => "joga" or "jogar" => "jog")

It is based on an existing rule set, but the filter must be defined manually

"my_stemmer": {
  "type": "stemmer",
  "name": "light_portuguese"
}

Stemmer token filter definition. minimal_portuguese and portuguese are other Portuguese options.


Page 17

Normalize: Analysis Complete with stop words and stemmer

"analysis": {
  "filter": {
    "stop_noise": {
      "type": "stop",
      "stopwords_path": "sw.txt"
    },
    "light_pt": {
      "type": "stemmer",
      "name": "light_portuguese"
    }
  },
  "analyzer": {
    "my_analyzer": {
      "type": "custom",
      "char_filter": [
        "html_strip"
      ],
      "tokenizer": "standard",
      "filter": [
        "lowercase",
        "stop_noise",
        "asciifolding",
        "light_pt"
      ]
    }
  }
}

Analysis using all functions, with my own stop words and light Portuguese stemmer filters.


Page 18

One Field Mapping

"mappings": {
  "my_type": {
    "properties": {
      "title": {
        "type": "string",
        "index": "analyzed",
        "analyzer": "my_analyzer"
      }
    }
  }
}

Simple mapping with one string field using my analyzer.


Page 19

Problems

Sorting

Aggregation

SEO (Search Engine Optimization)


Page 20

Problems: Sorting

Sorting on analyzed fields gives seemingly random results

"Telha" < "casa"

New analyzer

"sort": {
  "type": "custom",
  "tokenizer": "keyword",
  "filter": [
    "lowercase",
    "asciifolding"
  ]
}

Sort analyzer. Makes use of lowercase and asciifolding filters and the keyword tokenizer.
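The "Telha" < "casa" surprise comes from comparing raw character codes, where uppercase letters sort before lowercase ones. The Python sketch below reproduces the effect and the fix; `sort_key` is a hypothetical helper that imitates the keyword tokenizer plus lowercase and ASCII folding, with the folding approximated by Unicode decomposition.

```python
import unicodedata


def sort_key(value):
    """One key per whole value: lowercase it and strip accents."""
    decomposed = unicodedata.normalize("NFKD", value.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


titles = ["casa", "Telha", "Árvore"]

print(sorted(titles))                # ['Telha', 'casa', 'Árvore']
print(sorted(titles, key=sort_key))  # ['Árvore', 'casa', 'Telha']
```

Keeping the value as a single term matters as much as the normalization: a field split into several terms has no single value to order by.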


Page 21

Problems: Aggregation

How it works: "sao", "paulo", "rio"

What we want: "Sao Paulo"

In other words, we do not want analysis
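The effect can be imitated with a simple counter. In this Python sketch the list of city values is made up for illustration, and counting terms with `collections.Counter` is only a stand-in for what a terms aggregation reports.

```python
from collections import Counter

# Hypothetical field values from three documents.
cities = ["Sao Paulo", "Sao Paulo", "Rio"]

# Analyzed field: buckets are built from the individual terms.
analyzed = Counter(term for city in cities for term in city.lower().split())

# Field without analysis: buckets are built from the original values.
raw = Counter(cities)

print(dict(analyzed))  # {'sao': 2, 'paulo': 2, 'rio': 1}
print(dict(raw))       # {'Sao Paulo': 2, 'Rio': 1}
```

The second result is the one a city facet needs, which is why the field used for aggregation should keep each value whole.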


Page 22

Problems: Search Engine Optimization

The stemmer is a bad fit for URLs

New analyzer

"url_analyzer": {
  "type": "custom",
  "char_filter": [
    "html_strip"
  ],
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    "stop_noise",
    "asciifolding"
  ]
}

URL analyzer for SEO. It will not be used in mappings.


Page 23

Problems: Search Engine Optimization

We do not need to map it to a field

analyze API

curl -XPOST "http://localhost:9200/my_index/_analyze?analyzer=url_analyzer" -d '{
  "O Meetup Elasticsearch RJ sera no dia 05 de maio as 18h."
}'

> meetup elasticsearch rj sera dia 05 maio 18h

analyze API Example.
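To show how such terms could feed an SEO-friendly URL, the sketch below imitates the lowercase, stop word and ASCII folding steps locally in Python and joins the result into a slug. Everything here is an assumption for illustration: the stop word set stands in for the unknown contents of sw.txt, `url_terms` and `slug` are hypothetical helpers, and joining with hyphens is just one possible URL convention.

```python
import re
import unicodedata

# Assumed stand-in for the contents of sw.txt.
STOPWORDS = {"o", "a", "no", "na", "de", "da", "as", "os"}


def url_terms(text):
    """Split into word-like terms, lowercase, drop stop words, fold accents."""
    terms = []
    for token in re.findall(r"\w+", text):
        token = token.lower()
        if token in STOPWORDS:
            continue
        decomposed = unicodedata.normalize("NFKD", token)
        terms.append("".join(ch for ch in decomposed if not unicodedata.combining(ch)))
    return terms


def slug(text):
    """Join the surviving terms with hyphens to form a URL fragment."""
    return "-".join(url_terms(text))


print(slug("O Meetup Elasticsearch RJ sera no dia 05 de maio as 18h."))
# meetup-elasticsearch-rj-sera-dia-05-maio-18h
```

In a real application the terms would come from the analyze API response rather than from a local re-implementation, so the URL and the index always agree on how text is normalized.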


Page 24

Result

{
  "settings": {
    "analysis": {
      "filter": {
        "stop_noise": {
          "type": "stop",
          "stopwords_path": "sw.txt"
        },
        "light_pt": {
          "type": "stemmer",
          "name": "light_portuguese"
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": [
            "html_strip"
          ],
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "stop_noise",
            "asciifolding",
            "light_pt"
          ]
        },
        "sort": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        },
        "url_analyzer": {
          "type": "custom",
          "char_filter": [
            "html_strip"
          ],
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "stop_noise",
            "asciifolding"
          ]
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "title": {
          "type": "string",
          "index": "analyzed",
          "analyzer": "my_analyzer",
          "fields": {
            "sort": {
              "type": "string",
              "index": "analyzed",
              "analyzer": "sort"
            },
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        }
      }
    }
  }
}

Complete mapping for one field, using sub-fields for text search, sorting, and aggregation.
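With this mapping, each sub-field is addressed by its dotted path. The Python sketch below builds one possible search body that uses all three; the query text and the aggregation name `titles` are made up for illustration, and the body is only printed, not sent to a cluster.

```python
import json

search_body = {
    # Full-text search goes through my_analyzer on the main field.
    "query": {"match": {"title": "telhas"}},
    # Sorting uses the sub-field analyzed as one lowercase, folded term.
    "sort": [{"title.sort": {"order": "asc"}}],
    # Aggregation uses the sub-field that is not analyzed.
    "aggs": {"titles": {"terms": {"field": "title.raw"}}},
}

print(json.dumps(search_body, indent=2))
```

The document is indexed once, and the same `title` value serves three purposes without the application having to store three copies itself.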


Page 25

Other Work

Boost

Parent/Child

Log storage (Logstash + Kibana)

Infrastructure consulting for ELK


Page 26

Thank you

www.emergi.net - [email protected]

“Keep it simple, but not simpler.”