MAKING YOUR ELASTIC CLUSTER PERFORM
Created by @jettroCoenradie
WHY USE ELASTICSEARCH
START WITH ELASTICSEARCH
curl 'localhost:9200?pretty'
{
  "name" : "Tatterdemalion",
  "cluster_name" : "elasticsearch",
  "version" : {
    "number" : "2.3.1",
    "build_hash" : "bd980929010aef404e7cb0843e61d0665269fc39",
    "build_timestamp" : "2016-04-04T12:25:05Z",
    "build_snapshot" : false,
    "lucene_version" : "5.5.0"
  },
  "tagline" : "You Know, for Search"
}
curl -XPOST 'localhost:9200/conferences/conference/1?pretty' -d '
{
"name": "Codemotion Amsterdam",
"location": "Kromhouthal"
}'
THAT WAS EASY!
DESIGN YOUR CLUSTER
How to install and configure?
How many nodes?
What hardware?
INSTALLATION
Just download, unzip and run
Use package manager: yum, apt
Use ansible, chef or puppet
CONFIGURATION
/etc/default/elasticsearch
/etc/elasticsearch/elasticsearch.yml
/ETC/DEFAULT/ELASTICSEARCH
# Heap size defaults to 256m min, 1g max
# Set ES_HEAP_SIZE to 50% of available RAM, but no more than 31g
ES_HEAP_SIZE=2g
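The sizing rule in the comment above can be sketched as a small shell helper (the function name `heap_for_ram` is illustrative; pass the total RAM of the machine in Gb):

```shell
# Sketch of the heap sizing rule: 50% of available RAM,
# capped at 31g so the JVM keeps using compressed object pointers.
heap_for_ram() {
  half=$(( $1 / 2 ))             # 50% of total RAM in Gb
  [ "$half" -gt 31 ] && half=31  # never cross the 31g limit
  echo "${half}g"
}

heap_for_ram 4    # → 2g
heap_for_ram 128  # → 31g
```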
/ETC/ELASTICSEARCH/ELASTICSEARCH.YML
cluster.name: playground
node.name: node-1
discovery.zen.ping.unicast.hosts: ["node-1", "node-2", "node-3"]
discovery.zen.minimum_master_nodes: 2
path.repo: /opt/es_snapshots
script.inline: true
How many nodes do I need?
Development / non-critical
Small production
Large production
What hardware do I need?
HARDWARE
Prefer cores over clock speed
Choose between 8-64Gb RAM
Prefer SSD
DESIGN YOUR INDICES
How many shards?
How many replicas?
Time based indices?
What does an index look like?
How many shards do I need?
Amount of docs or terms
Indexing speeds
Not bigger than 50Gb
Why not a lot of shards?
Start small and test
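Starting small might look like this: create the index with explicit shard and replica counts instead of relying on the defaults (a sketch; the index name `meetups` and the counts are illustrative):

```shell
curl -XPUT 'http://localhost:9200/meetups' -d '
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  }
}'
```

Then index a realistic amount of data and measure before adding shards: the shard count of an existing index cannot be changed without reindexing.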
How many replicas do I need?
Should I use Types?
Should I use Aliases?
Working with time based indices?
Option to change shards per time period
Use index templates
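A sketch of such a template for time based indices, in the Elasticsearch 2.x template syntax (the template name and settings are illustrative):

```shell
curl -XPUT 'http://localhost:9200/_template/meetups_template' -d '
{
  "template": "meetups-*",
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  }
}'
```

Every new index matching `meetups-*` picks these settings up at creation time, so you can change the shard count for the next time period without touching old indices.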
DESIGN YOUR MAPPING
Do I need a mapping?
What do analyzers do?
Do I need an analyzer?
The default uses dynamic type mapping
Make your mapping explicit: date, geo_point, long
disable dynamic type mapping
PUT /_settings
{ "index.mapper.dynamic": false }
A mapping is persistent: you can only add new fields, not change existing ones.
Use multi field mapping: name
PUT /conferences
{
  "mappings": {
    "conference": {
      "properties": {
        "name": {
          "type": "string",
          "analyzer": "standard",
          "fields": {
            "raw": { "type": "string", "index": "not_analyzed" }
          }
        }
      }
    }
  }
}
An analyzer creates terms out of data
Has three components:
Character filter - replace & with and
Tokenizer - on whitespace, regexp, ngrams
Filters - ascii folding, language specific, lowercase, stop words
CHOOSE THE RIGHT ANALYZER FOR THE JOB
Custom using tokenizer and filters combinations
Use the multi field approach for special analyzers.
Do not analyze if you don't need it.
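The three components can be combined into a custom analyzer. A sketch using the examples from the previous slide (the names `and_replace` and `my_analyzer`, and the index name, are illustrative):

```shell
curl -XPUT 'http://localhost:9200/conferences_v2' -d '
{
  "settings": {
    "analysis": {
      "char_filter": {
        "and_replace": { "type": "mapping", "mappings": ["& => and"] }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["and_replace"],
          "tokenizer": "standard",
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  }
}'
```

Verify the resulting terms with the _analyze API before indexing real data.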
INDEXING DOCUMENTS
How to improve indexing performance?
What happens when we index a document?
TIPS
DISABLE OR DECREASE REFRESH RATE
curl -XPUT 'http://localhost:9200/meetups/_settings' -d '
{ "index" : { "refresh_interval" : "-1" } }'
INDEX WITHOUT REPLICAS
curl -XPUT 'http://localhost:9200/meetups/_settings' -d '
{ "index" : { "number_of_replicas" : 0 } }'
USE BULK
Bulk requests should be between 5-15Mb
Round robin requests over nodes
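A minimal bulk request (index, type, and field names are assumed); note the newline-delimited format with an action line before each document:

```shell
curl -XPOST 'http://localhost:9200/_bulk' -d '
{ "index": { "_index": "meetups", "_type": "meetup" } }
{ "name": "Elasticsearch NL", "venue": { "city": "Amsterdam" } }
{ "index": { "_index": "meetups", "_type": "meetup" } }
{ "name": "Codemotion", "venue": { "city": "Amsterdam" } }
'
```

The body must end with a newline; grow the batch size until indexing throughput stops improving, staying within the 5-15Mb guideline.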
QUERYING DOCUMENTS
How to make queries faster?
curl -XGET 'http://localhost:9200/_search'
curl -XGET 'http://localhost:9200/meetups/_search?q=venue.city:amsterdam%20AND%20description:elasticsearch&pretty'
curl -XGET "http://localhost:9200/meetups/_search" -d '
{
  "query": {
    "bool": {
      "must": [
        { "match": { "venue.city": "amsterdam" } },
        { "match": { "description": "elasticsearch" } }
      ]
    }
  }
}'
How to make a query faster?
curl -XGET "http://localhost:9200/meetups/_search" -d '
{
  "query": {
    "bool": {
      "must": [
        { "match": { "description": "elasticsearch" } }
      ],
      "filter": {
        "term": { "venue.city.raw": "Amsterdam" }
      }
    }
  }
}'
Query context            | Filter context
How well does it match?  | Does it match?
Calculates score         | true/false
Not cacheable            | Cacheable
Use filter context if you do not need a score
Don't ask for hits if you do not use them
Request only the fields that you need
curl -XGET "http://localhost:9200/meetups/_search" -d'
{
"_source": {
"include": ["venue.*", "group.name", "name"]
},
"query": {
"simple_query_string": {
"query": "elastic OR elasticsearch"
}
}
}'
Use the Profile API to learn about query performance
"profile": true
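Adding the flag to a search body might look like this (index and field names assumed):

```shell
curl -XGET 'http://localhost:9200/meetups/_search' -d '
{
  "profile": true,
  "query": { "match": { "description": "elasticsearch" } }
}'
```

The response then contains a per-shard timing breakdown of the query components.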
ANALYTICS FROM DOCUMENTS
Why use not_analyzed fields?
Aggregations: maybe the reason why Elasticsearch became so popular
curl -XGET "http://localhost:9200/meetups/_search" -d '
{
  "size": 0,
  "aggs": {
    "byCity": {
      "terms": { "field": "venue.city.raw", "size": 10 }
    }
  }
}'
"buckets": [
  { "key": "Amsterdam", "doc_count": 8 },
  { "key": "Ede", "doc_count": 1 },
  { "key": "Leidschendam", "doc_count": 1 },
  { "key": "Rotterdam", "doc_count": 1 }
]
Inverted index not suitable for aggregations
DOC_VALUES
Stored on disk during indexing
All fields except analyzed strings
FIELDDATA
For analyzed strings
Stored in the heap
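This is why aggregating on the not_analyzed `.raw` field from the earlier multi-field mapping is preferable: it is backed by on-disk doc_values instead of heap fielddata. A sketch (in 2.x, doc_values is already the default for not_analyzed strings; it is shown explicitly here, and the index name is illustrative):

```shell
curl -XPUT 'http://localhost:9200/conferences_v2' -d '
{
  "mappings": {
    "conference": {
      "properties": {
        "name": {
          "type": "string",
          "fields": {
            "raw": { "type": "string", "index": "not_analyzed", "doc_values": true }
          }
        }
      }
    }
  }
}'
```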
MONITORING THE CLUSTER
How can I see what Elasticsearch is doing?
What numbers are important?
GET /_cluster/health
{
  "cluster_name": "playground",
  "status": "yellow",
  "timed_out": false,
  "number_of_nodes": 1,
  "number_of_data_nodes": 1,
  "active_primary_shards": 55,
  "active_shards": 55,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 16,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 77.46478873239437
}
GET /_cluster/health?level=indices
"indices": {
  "conferences": {
    "status": "yellow",
    "number_of_shards": 5,
    "number_of_replicas": 1,
    "active_primary_shards": 5,
    "active_shards": 5,
    "relocating_shards": 0,
    "initializing_shards": 0,
    "unassigned_shards": 5
  }
}
indices - field_data, filter_cache
os - cpu, memory, load
process - file descriptors, cpu, memory
jvm - memory, garbage collection
thread_pool - threads, rejected
fs - disk space
GET /_nodes/stats?human
_CAT API
GET /_cat/health?v
epoch timestamp cluster status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1462882362 14:12:42 playground yellow 1 1 56 56 0 0
GET /_cat/indices?v
health status index pri rep docs.count docs.deleted store.size pri.store.size
green open gridshore-logs-2016.01.19 5 0 1007 0 1.2mb
green open .kibana 1 0 100 1 100.2kb
yellow open topbeat-2016.05.04 5 1 170264 0 44.4mb
green open meetups-20160509113909 1 0 11 0 67.3kb
GET /_cat/fielddata?v
id host ip node total
aqj9L-DPR86J8CgYitcHsA 127.0.0.1 127.0.0.1 node-JC 0b
CLUSTER LOGS
How to configure what is logged?
LOGGING
Can be changed dynamically
PUT /_cluster/settings
{
  "transient" : { "logger.discovery" : "DEBUG" }
}
SLOWLOG
PUT /meetups/_settings
{
  "index.search.slowlog.threshold.query.warn" : "10s",
  "index.search.slowlog.threshold.fetch.debug": "500ms",
  "index.indexing.slowlog.threshold.index.info": "5s"
}
PUT /_cluster/settings
{
  "transient" : {
    "logger.index.search.slowlog" : "DEBUG",
    "logger.index.indexing.slowlog" : "WARN"
  }
}
[2016-05-11 16:25:02,105][DEBUG][index.search.slowlog.query]
[meetups-20160509113909]took[518.5micros],took_millis[0],
types[], stats[], search_type[QUERY_AND_FETCH], total_shards[1]
, source[{"size":0,"aggs":{"byCity":{"terms":{"field":
"venue.city.raw","size":10}}}}], extra_source[],
QUESTIONS?
Twitter: @jettroCoenradie
Github: https://github.com/jettro
Blog: https://amsterdam.luminis.eu/news/
Licence: http://creativecommons.org/licenses/by-nc-sa/3.0/