Query DSL In Elasticsearch

Narayan Kumar, Software Consultant, Knoldus Software LLP

Agenda

Overview of Elasticsearch

What is Query DSL?

Queries VS Filters

Types of queries

Demo

Overview of Elasticsearch

Elasticsearch: a real-time search and analytics engine

open-source

distributed

multi tenancy

scales massively

high availability

schema free

RESTful API

JSON over HTTP

Lucene based

fault tolerance

P1. Distributed: Elasticsearch distributes our data across a cluster using shards (example: laptop).

P2. Scales massively: we can grow an ES cluster smoothly from a small size to a big cluster, scaling both horizontally and vertically.

P3. High availability: Elasticsearch duplicates data through replicas, so when a node goes down, replica shards on the other nodes in the cluster take over for that node's primary shards and your data isn't lost.

P4. RESTful API: ES provides a RESTful API, so we can easily interact with the ES cluster and perform ES operations.

P5. JSON over HTTP: ES is JSON in and JSON out. We write ES queries in JSON format and ES returns results in JSON format as well.

P6. Schema free: if you have not defined a mapping for your type, ES automatically detects the field types and generates a mapping for it.

P7. Multi tenancy: multiple applications can access the same index without any modification, compared to an RDBMS-based application, and we can easily share ES indices across applications. Examples: Kibana, Logstash.

What is Query DSL?

It is a rich, flexible query language.

Elasticsearch provides a full Query DSL based on JSON to define queries.

We can think of the Query DSL as an AST (abstract syntax tree) of queries, consisting of two types of clauses.

Leaf query clauses: These look for a particular value in a particular field, such as the match, term or range queries.

Compound query clauses: These wrap other leaf or compound queries and are used to combine multiple queries in a logical fashion.
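To make the AST idea concrete, here is a minimal Python sketch that treats a query as nested dictionaries and walks it to list the leaf clauses. The COMPOUND set and the leaf_clauses helper are hypothetical names introduced for illustration, and the set covers only a few compound queries.

```python
# Assumption: only these clause names are treated as compound in this sketch.
COMPOUND = {"bool", "dis_max", "constant_score", "boosting"}

def leaf_clauses(clause):
    """Return the names of the leaf clauses nested inside one query clause.

    `clause` is the content of the "query" key, e.g. {"bool": {...}}.
    """
    leaves = []
    for name, body in clause.items():
        if name not in COMPOUND:
            leaves.append(name)          # a leaf such as match, term or range
            continue
        for child in body.values():      # must, filter, queries, positive, ...
            children = child if isinstance(child, list) else [child]
            for c in children:
                if isinstance(c, dict):  # skip plain parameters like negative_boost
                    leaves.extend(leaf_clauses(c))
    return leaves

query = {
    "bool": {
        "must": {"match": {"body": "starbucks"}},
        "filter": [
            {"term": {"verb": "post"}},
            {"range": {"actor.friendsCount": {"gte": 10, "lte": 500}}},
        ],
    }
}
print(leaf_clauses(query))
```

The bool clause is the only compound node here; match, term and range are its leaves.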

Queries VS Filters

Queries

full text search

relevance scoring

heavier

not cacheable

Filters

exact matching

binary yes / no

fast

cacheable

P1. From a performance perspective, apply the filter first and then run the query over the filtered data.

Types of queries

Match All Query

Full text queries

Term level queries

Compound queries

P1. There are two types of context: query context and filter context.

P2. The behaviour of a query clause depends on whether it is used in query context or in filter context.

P3. Query context: a query clause used in query context answers the question "How well does this document match this query clause?" It calculates a score for each document.

P4. Filter context: in filter context, a query clause answers the question "Does this document match this query clause?" The answer is a simple Yes or No; no scores are calculated.
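As a sketch of the two contexts side by side, the Python snippet below builds a search body in which the match clause runs in query context (scored) and the term clause runs in filter context (yes/no only). The search_body helper is a hypothetical name; the field names body and verb are the ones used in this deck's examples.

```python
import json

def search_body(text, verb):
    """Build a request body combining a scored clause with a non-scored filter."""
    return {
        "query": {
            "bool": {
                # Query context: contributes to _score.
                "must": {"match": {"body": text}},
                # Filter context: only includes or excludes documents.
                "filter": {"term": {"verb": verb}},
            }
        }
    }

body = search_body("starbucks", "post")
print(json.dumps(body, indent=2))
```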

Match All Query

The simplest query, which matches all documents and gives each of them a _score of 1.0.

Example:

"query": { "match_all": {} }

P1. It is the default query when no query is specified, for example: index/type/_search.

Full text queries

The high-level full text queries are usually used for running full text queries on full text fields like the body of an email.

These are full text queries:

match_query

multi_match query

common_terms query

query_string query

simple_query_string

P1. They understand how the field being queried is analyzed and will apply each field's analyzer (or search_analyzer) to the query string before executing.

P2. First of all, I want to differentiate between full text values and exact values.

Full text queries (continued)

match_query: A family of match queries that accepts text/numerics/dates, analyzes them, and constructs a query.

{ "query": { "match": { "body": { "query": "i spent at starbucks", "operator": "and" } } }}

multi_match: The multi_match query builds on the match query to allow multi-field queries.

{ "query": { "multi_match": { "query": "share post", "fields": [ "verb" ] } }}


common_terms query: The common terms query is a modern alternative to stopwords which improves the precision and recall of search results (by taking stopwords into account), without sacrificing performance.

"common": { "body": { "query": "i am spent at starbucks", "cutoff_frequency": 0.001, "low_freq_operator": "and" } }

query_string query: A query that uses a query parser in order to parse its content.

"query": { "query_string": { "query": "(verb:post) AND (body:i am today OR body:came to starbucks)" } }

P1. The common terms query divides the query terms into two groups: more important (i.e. low frequency terms) and less important (i.e. high frequency terms which would previously have been stopwords).

P2. The query_string query is useful when we pass complex queries as URL parameters and want to perform boolean operations in them.
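The grouping step of the common terms query can be sketched in a few lines of Python. This is a simplified illustration, not the Lucene implementation: it assumes cutoff_frequency is a relative document frequency between 0 and 1, and the document counts below are made up.

```python
def split_terms(terms, doc_freq, total_docs, cutoff_frequency):
    """Split query terms into (low_frequency, high_frequency) groups."""
    low, high = [], []
    for term in terms:
        relative = doc_freq.get(term, 0) / total_docs
        if relative > cutoff_frequency:
            high.append(term)   # behaves like a stopword
        else:
            low.append(term)    # carries most of the meaning
    return low, high

# Hypothetical document frequencies for an index of 10,000 documents.
doc_freq = {"i": 900, "am": 850, "at": 800, "spent": 4, "starbucks": 1}
low, high = split_terms("i am spent at starbucks".split(), doc_freq, 10000, 0.001)
print(low, high)
```

With these numbers, "spent" and "starbucks" land in the low frequency group that drives matching, while "i", "am" and "at" are treated as high frequency terms.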


simple_query_string query: A query that uses the SimpleQueryParser to parse its content. The simple_query_string query will never throw an exception, and discards invalid parts of the query.

"query": { "simple_query_string": { "query": "\"at starbucks\" | today -starbucks", "fields": [ "body" ], "flags": "OR|NOT|PHRASE" } }

P1. It supports multiple fields, so a query can run against several fields at the same time.

P2. simple_query_string supports multiple flags to specify which parsing features should be enabled. They are specified as a |-delimited string with the flags parameter, which helps us write more optimized simple queries.

Term level queries

The term-level queries operate on the exact terms that are stored in the inverted index. These queries are usually used for structured data like numbers, dates, and enums, rather than full text fields.

These are term level queries:

term_query

terms_query

range_query

exists_query

prefix_query

wildcard_query

regexp_query

fuzzy_query

type_query

ids_query

P1. The missing query is deprecated, so use an exists query inside the must_not clause of a bool query instead of the missing query.
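The replacement for the missing query mentioned in the notes can be sketched as a small Python helper that builds the equivalent bool query. The helper name missing is hypothetical; the field name is the one used in this deck's exists example.

```python
import json

def missing(field):
    """Match documents that have no value for `field`."""
    return {"bool": {"must_not": {"exists": {"field": field}}}}

clause = missing("actor.links.href")
print(json.dumps({"query": clause}))
```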

Term level queries (continued)

term_query: The term query finds documents that contain the exact term specified in the inverted index.

"term": { "actor.postedTime": "2010-11-17T03:55:57.000Z" }

terms_query: Filters documents that have fields that match any of the provided terms.

"terms": { "verb": [ "share", "post" ] }

P1. The term query looks for the exact term in the field's inverted index; it doesn't know anything about the field's analyzer. This makes it useful for looking up values in not_analyzed string fields, or in numeric or date fields.

P2. When we write a term query in filter context, it generates a bitset such as [1,0,1,0], which describes which documents match the query.
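The bitset idea from the notes can be illustrated with a toy Python model. It is only a sketch of the concept, not how Lucene stores bitsets internally; the four sample documents and the helper names are made up.

```python
# Hypothetical documents, in index order.
docs = [
    {"verb": "post", "friendsCount": 20},
    {"verb": "share", "friendsCount": 5},
    {"verb": "post", "friendsCount": 700},
    {"verb": "like", "friendsCount": 50},
]

def term_bitset(docs, field, value):
    """One bit per document: 1 if the field holds exactly `value`."""
    return [1 if doc.get(field) == value else 0 for doc in docs]

def range_bitset(docs, field, gte, lte):
    """One bit per document: 1 if gte <= field value <= lte."""
    return [1 if gte <= doc[field] <= lte else 0 for doc in docs]

def and_bitsets(a, b):
    """Combine two filters by intersecting their bitsets."""
    return [x & y for x, y in zip(a, b)]

posts = term_bitset(docs, "verb", "post")
in_range = range_bitset(docs, "friendsCount", 10, 500)
both = and_bitsets(posts, in_range)
print(posts, in_range, both)
```

Because a filter only answers yes or no per document, its result fits in one bit per document, which is what makes filters cheap to combine and to cache.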


range_query: Matches documents with fields that have terms within a certain range.

"range": { "actor.friendsCount": { "gte": 10, "lte": 500 } }

exists_query: Returns documents that have at least one non-null value in the original field.

"exists": { "field": "actor.links.href" }


prefix_query: Matches documents that have fields containing terms with a specified prefix.

"prefix": { "body": "rt" }

wildcard_query: Matches documents that have fields matching a wildcard expression.

"wildcard": { "actor.preferredUsername": "ba*" }

regexp_query: The regexp query allows you to use regular expression term queries.

"regexp": { "actor.preferredUsername": "ba.*lan" }

P1. Supported wildcards are *, which matches any character sequence (including the empty one), and ?, which matches any single character. Note that this query can be slow, as it needs to iterate over many terms. To prevent extremely slow wildcard queries, a wildcard term should not start with one of the wildcards * or ?.

P2. The performance of a regexp query depends heavily on the regular expression chosen. Where possible, start the expression with a long literal prefix to optimize the regexp query.

P3. Regular expressions are dangerous because it's easy to accidentally create an innocuous looking one that requires an exponential number of internal determinized automaton states (and corresponding RAM and CPU) for Lucene to execute.
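The wildcard semantics described in the notes can be sketched in Python by translating a pattern into a regular expression and testing it against a list of terms. This is an illustration using Python's re module, not Lucene's automaton-based matcher; the helper names and sample terms are hypothetical.

```python
import re

def wildcard_to_regex(pattern):
    """Translate * (any sequence) and ? (any single character) into a regex."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))  # everything else is a literal
    return re.compile("".join(parts), re.DOTALL)

def matching_terms(pattern, terms):
    """Return the terms that the whole pattern matches."""
    rx = wildcard_to_regex(pattern)
    return [t for t in terms if rx.fullmatch(t)]

terms = ["ba", "balan", "bob", "kabal"]
print(matching_terms("ba*", terms))
print(matching_terms("b?b", terms))
print(matching_terms("*bal", terms))
```

A pattern such as "ba*" has a literal prefix that could narrow the candidate terms, while "*bal" gives no prefix to start from, which is why leading wildcards tend to be the slow case.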

Compound Queries

Compound query: Compound queries wrap other compound or leaf queries, either to combine their results and scores, to change their behaviour, or to switch from query to filter context.

The queries in this group are:

constant_score query

bool query

dis_max query

function_score query

boosting query

indices query

and, or, not

filtered query

limit query

P1. The filtered query was deprecated in version 2.0.0.

Compound queries (continued)

dis_max query: A query that generates the union of documents produced by its subqueries.

"dis_max": { "queries": [ { "term": { "verb": "share"}}, { "term": { "verb": "post"}} ] }

boosting query: The boosting query can be used to effectively demote results that match a given query.

"boosting": { "positive": {"term": { "verb": "post" } }, "negative": { "range": { "actor.friendsCount": {"from": 10,"to": 500 } } }, "negative_boost": 0.5 }

P1. The dis_max query scores each document with the maximum score for that document as produced by any subquery, plus a tie-breaking increment for any additional matching subqueries.

P2. It accepts multiple queries and returns any documents which match any of the query clauses. While the bool query combines the scores from all matching queries, the dis_max query uses the score of the single best-matching query clause.

P3. Unlike the must_not clause in the bool query, the boosting query still selects documents that contain undesirable terms, but reduces their overall score.
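The scoring differences described in these notes can be written out as a small Python sketch. The functions are simplified models that take already-computed clause scores as input and ignore any normalization Elasticsearch applies; the function names and sample scores are hypothetical.

```python
def bool_score(scores):
    """bool: scores of all matching clauses are added together (simplified)."""
    return sum(scores)

def dis_max_score(scores, tie_breaker=0.0):
    """dis_max: best clause score plus tie_breaker times the other scores."""
    best = max(scores)
    return best + tie_breaker * (sum(scores) - best)

def boosting_score(positive_score, matches_negative, negative_boost):
    """boosting: demote, rather than exclude, documents matching the negative query."""
    if matches_negative:
        return positive_score * negative_boost
    return positive_score

clause_scores = [2.0, 1.0]  # made-up scores of two matching subqueries
print(bool_score(clause_scores))
print(dis_max_score(clause_scores))
print(dis_max_score(clause_scores, tie_breaker=0.5))
print(boosting_score(2.0, True, 0.5))
```

With a tie_breaker of 0 the dis_max score is just the best clause; raising it toward 1 moves the result toward the bool-style sum.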


bool query: A query that matches documents matching boolean combinations of other queries.

"bool" : { "must" : { "term" : { "verb": "post" } }, "filter": { "term" : { "actor.displayName": "rajni" } }, "must_not" : { "range" : { "actor.friendsCount" : { "from" : 10, "to": 500 } } }, "should" : [ { "term" : { "actor.twitterTimeZone": "casablanca" } }, { "term" : { "generator.displayName": "twitter for iPhone" } } ] }

P1. The bool query takes a more-matches-is-better approach, so the score from each matching must or should clause will be added together to provide the final _score for each document.

P2. The must and should clauses have their scores combined (the more matching clauses, the better), while the must_not and filter clauses are executed in filter context.


constant_score query: A query which wraps another query, but executes it in filter context. All matching documents are given the same constant _score.

"constant_score": { "filter": { "range": { "actor.friendsCount": { "from": 10, "to": 500 } } } }

P1. The constant_score query executes the wrapped query in filter context, so all matching documents are given the same constant _score.

Other DSL Queries

Joining queries: Performing full SQL-style joins in a distributed system like Elasticsearch is prohibitively expensive, so Elasticsearch offers limited forms of join instead. Examples: nested query, has_parent query, has_child query.

Geo queries: These queries operate on geo_point and geo_shape fields.

Specialized queries: These queries do not fit into any other group. They are used for specific requirements, for example the template query and the script query.

Span queries: These are typically used to implement very specific queries on legal documents or patents.

Thank you
