MAKING YOUR ELASTIC CLUSTER PERFORM
Created by @jettroCoenradie
WHY USE ELASTICSEARCH
START WITH ELASTICSEARCH
curl 'localhost:9200?pretty'
{
  "name" : "Tatterdemalion",
  "cluster_name" : "elasticsearch",
  "version" : {
    "number" : "2.3.1",
    "build_hash" : "bd980929010aef404e7cb0843e61d0665269fc39",
    "build_timestamp" : "2016-04-04T12:25:05Z",
    "build_snapshot" : false,
    "lucene_version" : "5.5.0"
  },
  "tagline" : "You Know, for Search"
}
curl -XPOST 'localhost:9200/conferences/conference/1?pretty' -d '
{
"name": "Codemotion Amsterdam",
"location": "Kromhouthal"
}'
THAT WAS EASY!
DESIGN YOUR CLUSTER
How to install and configure?
How many nodes?
What hardware?
INSTALLATION
Just download, unzip and run
Use package manager: yum, apt
Use ansible, chef or puppet
CONFIGURATION
/etc/default/elasticsearch
/etc/elasticsearch/elasticsearch.yml
/ETC/DEFAULT/ELASTICSEARCH
# Heap size defaults to 256m min, 1g max
# Set ES_HEAP_SIZE to 50% of available RAM, but no more than 31g
ES_HEAP_SIZE=2g
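The sizing rule in the comment above can be sketched as a small shell helper (the function name `heap_for_ram` is illustrative; pass the total RAM of the machine in Gb):

```shell
# Sketch of the heap sizing rule: 50% of available RAM,
# capped at 31g so the JVM keeps using compressed object pointers.
heap_for_ram() {
  half=$(( $1 / 2 ))             # 50% of total RAM in Gb
  [ "$half" -gt 31 ] && half=31  # never cross the 31g limit
  echo "${half}g"
}

heap_for_ram 4    # → 2g
heap_for_ram 128  # → 31g
```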
/ETC/ELASTICSEARCH/ELASTICSEARCH.YML
cluster.name: playground
node.name: node-1
discovery.zen.ping.unicast.hosts: ["node-1", "node-2", "node-3"]
discovery.zen.minimum_master_nodes: 2
path.repo: /opt/es_snapshots
script.inline: true
How many nodes do I need?
Development / non-critical
Small production
Large production
What hardware do I need?
HARDWARE
Prefer cores over clock speed
Choose between 8-64Gb RAM
Prefer SSD
DESIGN YOUR INDICES
How many shards?
How many replicas?
Time based indices?
What does an index look like?
How many shards do I need?
Amount of docs or terms
Indexing speeds
Not bigger than 50Gb
Why not a lot of shards?
Start small and test
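Starting small might look like this: create the index with explicit shard and replica counts instead of relying on the defaults (a sketch; the index name `meetups` and the counts are illustrative):

```shell
curl -XPUT 'http://localhost:9200/meetups' -d '
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  }
}'
```

Then index a realistic amount of data and measure before adding shards: the shard count of an existing index cannot be changed without reindexing.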
How many replicas do I need?
Should I use Types?
Should I use Aliases?
Working with time based indices?
Option to change shards per time period
Use index templates
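A sketch of such a template for time based indices, in the Elasticsearch 2.x template syntax (the template name and settings are illustrative):

```shell
curl -XPUT 'http://localhost:9200/_template/meetups_template' -d '
{
  "template": "meetups-*",
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  }
}'
```

Every new index matching `meetups-*` picks these settings up at creation time, so you can change the shard count for the next time period without touching old indices.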
DESIGN YOUR MAPPING
Do I need a mapping?
What do analyzers do?
Do I need an analyzer?
The default uses dynamic type mapping
Make your mapping explicit: date, geo_point, long
disable dynamic type mapping
PUT /_settings
{ "index.mapper.dynamic": false }
A mapping is persistent: you can only add new fields, not change existing ones.
Use multi field mapping: name
PUT /conferences
{
  "mappings": {
    "conference": {
      "properties": {
        "name": {
          "type": "string",
          "analyzer": "standard",
          "fields": {
            "raw": { "type": "string", "index": "not_analyzed" }
          }
        }
      }
    }
  }
}
An analyzer creates terms out of data
Has three components:
Character filter - replace & with and
Tokenizer - on whitespace, regexp, ngrams
Filters - ascii folding, language specific, lowercase, stop words
CHOOSE THE RIGHT ANALYZER FOR THE JOB
Custom using tokenizer and filters combinations
Use the multi field approach for special analyzers.
Do not analyze if you don't need it.
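The three components can be combined into a custom analyzer. A sketch using the examples from the previous slide (the names `and_replace` and `my_analyzer`, and the index name, are illustrative):

```shell
curl -XPUT 'http://localhost:9200/conferences_v2' -d '
{
  "settings": {
    "analysis": {
      "char_filter": {
        "and_replace": { "type": "mapping", "mappings": ["& => and"] }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["and_replace"],
          "tokenizer": "standard",
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  }
}'
```

Verify the resulting terms with the _analyze API before indexing real data.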
INDEXING DOCUMENTS
How to improve indexing performance?
What happens when we index a document?
TIPS
DISABLE OR DECREASE REFRESH RATE
curl -XPUT 'http://localhost:9200/meetups/_settings' -d '
{ "index" : { "refresh_interval" : "-1" } }'
INDEX WITHOUT REPLICAS
curl -XPUT 'http://localhost:9200/meetups/_settings' -d '
{ "index" : { "number_of_replicas" : 0 } }'
USE BULK
Bulk requests should be between 5-15Mb
Round robin requests over nodes
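A minimal bulk request (index, type, and field names are assumed); note the newline-delimited format with an action line before each document:

```shell
curl -XPOST 'http://localhost:9200/_bulk' -d '
{ "index": { "_index": "meetups", "_type": "meetup" } }
{ "name": "Elasticsearch NL", "venue": { "city": "Amsterdam" } }
{ "index": { "_index": "meetups", "_type": "meetup" } }
{ "name": "Codemotion", "venue": { "city": "Amsterdam" } }
'
```

The body must end with a newline; grow the batch size until indexing throughput stops improving, staying within the 5-15Mb guideline.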
QUERYING DOCUMENTS
How to make queries faster?
curl -XGET 'http://localhost:9200/_search'
curl -XGET 'http://localhost:9200/meetups/_search?q=venue.city:amsterdam%20AND%20description:elasticsearch&pretty'
curl -XGET "http://localhost:9200/meetups/_search" -d '
{
  "query": {
    "bool": {
      "must": [
        { "match": { "venue.city": "amsterdam" } },
        { "match": { "description": "elasticsearch" } }
      ]
    }
  }
}'
How to make a query faster?
curl -XGET "http://localhost:9200/meetups/_search" -d '
{
  "query": {
    "bool": {
      "must": [
        { "match": { "description": "elasticsearch" } }
      ],
      "filter": {
        "term": { "venue.city.raw": "Amsterdam" }
      }
    }
  }
}'
Query context            | Filter context
How well does it match?  | Does it match?
Calculates score         | true/false
Not cacheable            | Cacheable
Use filter context if you do not need a score
Don't ask for hits if you do not use them
Request only the fields that you need
curl -XGET "http://localhost:9200/meetups/_search" -d'
{
"_source": {
"include": ["venue.*", "group.name", "name"]
},
"query": {
"simple_query_string": {
"query": "elastic OR elasticsearch"
}
}
}'
Use the Profile API to learn about query performance
"profile": true
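Adding the flag to a search body might look like this (index and field names assumed):

```shell
curl -XGET 'http://localhost:9200/meetups/_search' -d '
{
  "profile": true,
  "query": { "match": { "description": "elasticsearch" } }
}'
```

The response then contains a per-shard timing breakdown of the query components.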
ANALYTICS FROM DOCUMENTS
Why use not_analyzed fields?
Aggregations: maybe the reason why Elasticsearch became so popular
curl -XGET "http://localhost:9200/meetups/_search" -d '
{
  "size": 0,
  "aggs": {
    "byCity": {
      "terms": { "field": "venue.city.raw", "size": 10 }
    }
  }
}'
"buckets": [
  { "key": "Amsterdam", "doc_count": 8 },
  { "key": "Ede", "doc_count": 1 },
  { "key": "Leidschendam", "doc_count": 1 },
  { "key": "Rotterdam", "doc_count": 1 }
]
Inverted index not suitable for aggregations
DOC_VALUES
Stored on disk during indexing
All fields except analyzed strings
FIELDDATA
For analyzed strings
Stored in the heap
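This is why aggregating on the not_analyzed `.raw` field from the earlier multi-field mapping is preferable: it is backed by on-disk doc_values instead of heap fielddata. A sketch (in 2.x, doc_values is already the default for not_analyzed strings; it is shown explicitly here, and the index name is illustrative):

```shell
curl -XPUT 'http://localhost:9200/conferences_v2' -d '
{
  "mappings": {
    "conference": {
      "properties": {
        "name": {
          "type": "string",
          "fields": {
            "raw": { "type": "string", "index": "not_analyzed", "doc_values": true }
          }
        }
      }
    }
  }
}'
```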
MONITORING THE CLUSTER
How can I see what Elasticsearch is doing?
What numbers are important?
GET /_cluster/health
{
  "cluster_name": "playground",
  "status": "yellow",
  "timed_out": false,
  "number_of_nodes": 1,
  "number_of_data_nodes": 1,
  "active_primary_shards": 55,
  "active_shards": 55,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 16,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 77.46478873239437
}
GET /_cluster/health?level=indices
"indices": {
  "conferences": {
    "status": "yellow",
    "number_of_shards": 5,
    "number_of_replicas": 1,
    "active_primary_shards": 5,
    "active_shards": 5,
    "relocating_shards": 0,
    "initializing_shards": 0,
    "unassigned_shards": 5
  }
}
indices - field_data, filter_cache
os - cpu, memory, load
process - file descriptors, cpu, memory
jvm - memory, garbage collection
thread_pool - threads, rejected
fs - disk space
GET /_nodes/stats?human
_CAT API
GET /_cat/health?v
epoch timestamp cluster status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1462882362 14:12:42 playground yellow 1 1 56 56 0 0
GET /_cat/indices?v
health status index pri rep docs.count docs.deleted store.size pri.store.size
green open gridshore-logs-2016.01.19 5 0 1007 0 1.2mb
green open .kibana 1 0 100 1 100.2kb
yellow open topbeat-2016.05.04 5 1 170264 0 44.4mb
green open meetups-20160509113909 1 0 11 0 67.3kb
GET /_cat/fielddata?v
id host ip node total
aqj9L-DPR86J8CgYitcHsA 127.0.0.1 127.0.0.1 node-JC 0b
CLUSTER LOGS
How to configure what is logged?
LOGGING
Can be changed dynamically
PUT /_cluster/settings
{
  "transient" : { "logger.discovery" : "DEBUG" }
}
SLOWLOG
PUT /meetups/_settings
{
  "index.search.slowlog.threshold.query.warn" : "10s",
  "index.search.slowlog.threshold.fetch.debug": "500ms",
  "index.indexing.slowlog.threshold.index.info": "5s"
}
PUT /_cluster/settings
{
  "transient" : {
    "logger.index.search.slowlog" : "DEBUG",
    "logger.index.indexing.slowlog" : "WARN"
  }
}
[2016-05-11 16:25:02,105][DEBUG][index.search.slowlog.query]
[meetups-20160509113909]took[518.5micros],took_millis[0],
types[], stats[], search_type[QUERY_AND_FETCH], total_shards[1]
, source[{"size":0,"aggs":{"byCity":{"terms":{"field":
"venue.city.raw","size":10}}}}], extra_source[],
QUESTIONS?
Twitter: @jettroCoenradie
Github: https://github.com/jettro
Blog: https://amsterdam.luminis.eu/news/
Licence: http://creativecommons.org/licenses/by-nc-sa/3.0/