神刀安全网

Elasticsearch Queries: A Thorough Guide

Even though search is the primary function of Elasticsearch, getting search right can be tough and sometimes even confusing. To help, this guide will take you through the ins and outs of search queries and set you up for future searching success.

Lucene queries

Elasticsearch is built on Lucene, the search library from Apache, and exposes Lucene’s query syntax. It’s such an integral part of Elasticsearch that when you query the root of an Elasticsearch cluster, it will tell you the Lucene version:

{
  "status" : 200,
  "name" : "Ikthalon",
  "cluster_name" : "elasticsearch",
  "version" : {
    "number" : "1.7.5",
    "build_hash" : "00f95f4ffca6de89d68b7ccaf80d148f1f70e4d4",
    "build_timestamp" : "2016-02-02T09:55:30Z",
    "build_snapshot" : false,
    "lucene_version" : "4.10.4"
  },
  "tagline" : "You Know, for Search"
}

Knowing the Lucene syntax and operators will go a long way in helping you build queries. The syntax is used in both the simple query string query and the standard query string query. Here are some of the basics:

Boolean Operators

As with most computer languages, Elasticsearch supports the AND, OR, and NOT operators:

  • jack AND jill — Will return events that contain both jack and jill
  • ahab NOT moby — Will return events that contain ahab but not moby
  • tom OR jerry — Will return events that contain tom or jerry, or both

Fields

You might be looking for events where a specific field contains certain terms. You specify that as follows:

  • name:"Ned Stark"

Be careful with values that contain spaces, such as “Ned Stark.” You’ll need to enclose them in double quotes to ensure that the whole value is searched as one phrase.

Ranges

You can search for fields within a specific range, using square brackets for inclusive range searches and curly braces for exclusive range searches:

  • age:[3 TO 10] — Will return events with age between 3 and 10
  • price:{100 TO 400} — Will return events with prices greater than 100 and less than 400
  • name:[Adam TO Ziggy] — Will return names between and including Adam and Ziggy

As you can see in the examples above, you can use ranges in non-numerical fields like strings and dates as well.
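To make the difference between the two bracket styles concrete, here is a minimal Python sketch that evaluates a range expression against a single value. The helper names parse_range and in_range are made up for this illustration, and the plain comparison is a simplification: Elasticsearch compares indexed terms, so results on analyzed string fields can differ.

```python
import re

# "[" or "{", a lower bound, the word TO, an upper bound, "]" or "}".
RANGE_RE = re.compile(r"^([\[{])\s*(\S+)\s+TO\s+(\S+)\s*([\]}])$")


def parse_range(expr):
    """Split a range such as '[3 TO 10]' into bounds and inclusiveness flags."""
    m = RANGE_RE.match(expr)
    if m is None:
        raise ValueError(f"not a range expression: {expr!r}")
    open_b, low, high, close_b = m.groups()
    return low, high, open_b == "[", close_b == "]"


def in_range(value, expr, cast=float):
    """Return True if value falls inside the range (hypothetical helper)."""
    low, high, incl_low, incl_high = parse_range(expr)
    low, high, value = cast(low), cast(high), cast(value)
    above = value >= low if incl_low else value > low
    below = value <= high if incl_high else value < high
    return above and below


print(in_range(10, "[3 TO 10]"))        # square brackets include the bound
print(in_range(100, "{100 TO 400}"))    # curly braces exclude the bound
print(in_range("Bob", "[Adam TO Ziggy]", cast=str))
```

Passing cast=str switches the comparison to plain string ordering, which is how the name example is treated here.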

Wildcards, Regexes and Fuzzy Searching

Search would not be search without wildcards. You can use the * character for multiple character wildcards or the ? character for single character wildcards:

  • Ma?s — Will match Mars, Mass, and Maps
  • Ma*s — Will match Mars, Matches, and Massachusetts

Regexes give you even more power. Just place your regex between forward slashes (/):

  • /m[ea]n/ — Will match both men and man
  • /<.+>/ — Will match text that resembles an HTML tag
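If you want to try patterns out without a cluster, the following Python sketch imitates wildcard and regex matching against single terms. It is only an approximation: fnmatchcase and Python's re module stand in for Lucene's own matchers, whose syntax is not identical, and the two function names are made up for this example. The regex helper uses fullmatch on the assumption that a Lucene regex has to match the whole term.

```python
import re
from fnmatch import fnmatchcase


def wildcard_match(term, pattern):
    """Check a term against a pattern that uses ? and * (hypothetical helper)."""
    return fnmatchcase(term, pattern)


def regexp_match(term, pattern):
    """Check a term against a regex that has to cover the whole term."""
    return re.fullmatch(pattern, term) is not None


terms = ["Mars", "Mass", "Maps", "Matches", "Massachusetts"]
print([t for t in terms if wildcard_match(t, "Ma?s")])
print([t for t in terms if wildcard_match(t, "Ma*s")])
print(regexp_match("man", "m[ea]n"), regexp_match("woman", "m[ea]n"))
```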

Fuzzy searching uses the Damerau-Levenshtein Distance to match terms that are similar in spelling. This is great when your data set has misspelled words. Use the tilde (~) along with a number to specify how big the distance between words can be:

  • john~2 — Will match, amongst others, jean, johns, jhon, and horn
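To see why those words fall within a distance of 2, here is a naive Python sketch of the distance calculation. It implements the optimal string alignment variant of Damerau-Levenshtein, where swapping two adjacent characters counts as one edit. The function name is made up, and this is an illustration of the metric only, not of how Elasticsearch evaluates fuzzy queries internally.

```python
def osa_distance(a, b):
    """Count insertions, deletions, substitutions and adjacent swaps."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete every character of a
    for j in range(cols):
        d[0][j] = j  # insert every character of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Two adjacent characters swapped count as a single edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


for word in ["jean", "johns", "jhon", "horn", "jonathan"]:
    print(word, osa_distance("john", word))
```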

URI Search

The easiest way to search your Elasticsearch cluster is through URI search. You can pass a simple query to Elasticsearch using the q query parameter. The following query will search your whole cluster for documents with a name field equal to “travis”:

  • curl "localhost:9200/_search?q=name:travis"

With the Lucene syntax, you can build quite impressive searches. Usually you’ll have to URL-encode characters such as spaces (the encoding has been omitted in these examples for clarity):

  • curl "localhost:9200/_search?q=name:john~1 AND (age:[30 TO 40} OR surname:K*) AND -city"

A number of options are available that allow you to customize the URI search, specifically in terms of which analyzer to use (analyzer), whether the query should be fault-tolerant (lenient), and whether an explanation of the scoring should be provided (explain).
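Since the examples above leave out the URL encoding, here is one way to produce it with Python's standard library. This is a sketch under a few assumptions: the function name build_search_url is made up, the host is the localhost:9200 used throughout this guide, and extra keyword arguments are simply passed along as URI search options such as explain.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_search_url(query, host="localhost:9200", **options):
    """Return a URI search URL with the query string properly encoded."""
    params = {"q": query, **options}
    return f"http://{host}/_search?{urlencode(params)}"


query = "name:john~1 AND (age:[30 TO 40} OR surname:K*) AND -city"
url = build_search_url(query, explain="true")
print(url)

# Decoding the URL again gives back the original query text.
print(parse_qs(urlsplit(url).query)["q"][0])
```

The resulting URL can be handed to curl in place of the unencoded one.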

Although the URI search is a simple and efficient way to query your cluster, you’ll quickly find that it doesn’t support all of the features offered to you by Elasticsearch. The full power of Elasticsearch is exposed through Request Body Search. Using Request Body Search allows you to build a complex search request using various elements and query clauses that will match, filter, and order as well as manipulate documents based on multiple criteria.

The Request Body Search

Request Body Search uses a JSON document that contains various elements to create a search on your Elasticsearch cluster. Not only can you specify search criteria, you can also specify the range and number of documents that you expect back, the fields that you want, and various other options.

The first element of a search is the query element that uses Query DSL. Using Query DSL can sometimes be confusing because the DSL can be used to combine and build up query clauses into a query that can be nested deeply. Since most of the Elasticsearch documentation only refers to clauses in isolation, it’s easy to lose sight of where clauses should be placed.

To use the Query DSL, you need to include a “query” element in your search body and populate it with a query built using the DSL:

  • {"query": { "match": { "_all": "meaning" } } }

In this case, the “query” element contains a “match” query clause that looks for the term “meaning” in all of the fields in all of the documents in your cluster.

The query element is used along with other elements in the search body:

{
  "query": {
    "match": { "_all": "meaning" }
  },
  "fields": ["name", "surname", "age"],
  "from": 100,
  "size": 20
}

Here, we’re using the “fields” element to restrict which fields should be returned and the “from” and “size” elements to tell Elasticsearch we’re looking for documents 100 to 119 (starting at 100 and counting 20 documents).
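The arithmetic behind "from" and "size" is easy to get wrong by one page, so here is a small Python sketch that derives both values from a page number. The helper names and the zero-based page numbering are assumptions made for this example. The "fields" element is only added when a list of fields is supplied.

```python
import json


def page_window(page, page_size):
    """Return the from/size pair for a zero-based page number."""
    if page < 0 or page_size <= 0:
        raise ValueError("page must be >= 0 and page_size must be > 0")
    return {"from": page * page_size, "size": page_size}


def search_body(query, page, page_size, fields=None):
    """Combine a query clause with paging and an optional field list."""
    body = {"query": query, **page_window(page, page_size)}
    if fields:
        body["fields"] = list(fields)
    return body


body = search_body(
    {"match": {"_all": "meaning"}},
    page=5,
    page_size=20,
    fields=["name", "surname", "age"],
)
print(json.dumps(body, indent=2))
```

Page 5 with 20 documents per page starts at offset 100, which matches the request body shown above.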

The Query DSL

The Query DSL can be invoked using most of Elasticsearch’s search APIs. For simplicity, we’ll look only at the Search API that uses the _search endpoint. When calling the search API, you can specify the index and/or type on which you want to search. You can even search on multiple indices and types by separating their names with commas or using wildcards to match multiple indices and types:

Search on all the Logstash indices:

  • curl localhost:9200/logstash-*/_search

Search in the current and legacy indices, in the documents type:

  • curl localhost:9200/current,legacy/documents/_search

Search in the clients indices, in the bigcorp and smallco types:

  • curl localhost:9200/clients/bigcorp,smallco/_search
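The three paths above follow one pattern, which the following Python sketch spells out. The function name search_path is made up, and using _all as the index placeholder when only types are given is an assumption about Elasticsearch versions that still support types in the URL.

```python
def search_path(indices=(), types=()):
    """Build the path of a _search request from index and type names."""
    if types and not indices:
        # Assumption: _all stands in for "every index" when only types are given.
        indices = ["_all"]
    parts = []
    if indices:
        parts.append(",".join(indices))
    if types:
        parts.append(",".join(types))
    parts.append("_search")
    return "/" + "/".join(parts)


print(search_path(["logstash-*"]))
print(search_path(["current", "legacy"], ["documents"]))
print(search_path(["clients"], ["bigcorp", "smallco"]))
```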

We’ll be using Request Body Searches, so searches should be invoked as follows:

  • curl localhost:9200/_search -d '{"query":{"match": {"_all":"meaning"}}}'

Compound Queries

Although there are multiple query clause types, the ones you’ll use the most are compound queries, because they combine multiple clauses to build up complex queries.

The Bool Query is probably used the most because it can combine the features of some of the other compound query clauses such as the And, Or, Filter, and Not clauses. It is used so much that these four clauses have been deprecated in various versions in favor of the Bool query. Using it is best explained with an example:

curl localhost:9200/_search -d '{
  "query": {
    "bool": {
      "must": {
        "fuzzy": {
          "name": { "value": "john", "fuzziness": 2 }
        }
      },
      "must_not": {
        "match": { "_all": "city" }
      },
      "should": [
        { "range": { "age": { "gte": 30, "lte": 40 } } },
        { "wildcard": { "surname": "K*" } }
      ]
    }
  }
}'

Within the query element, we’ve added the bool clause that indicates that this will be a boolean query. There’s quite a lot going on in there, so let’s cover it clause-by-clause, starting at the top:

must

All queries within this clause must match a document for it to be returned by Elasticsearch. Think of this as your AND queries. The query we used here is the fuzzy query, and it will match any documents that have a name field that matches “john” in a fuzzy way. The extra “fuzziness” parameter tells Elasticsearch that it should use a Damerau-Levenshtein Distance of 2 to determine the fuzziness.

must_not

Any documents that match the query within this clause will be excluded from the result set. This is the NOT or minus (-) operator of the query DSL. In this case, we do a simple match query, looking for documents that contain the term “city.” Using _all as the field name indicates that the term can appear in any of the document’s fields.

should

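Up until now, we have been dealing with absolutes: must and must_not. Should is less absolute and is the closest equivalent to the OR operator. On its own, a should clause returns any documents that match one or more of its queries. When it sits next to a must clause, as it does here, its queries become optional by default: they raise the score of the documents that match them but do not exclude the documents that don’t. To require at least one of them to match, add "minimum_should_match": 1 to the bool query. The first query that we provided looks for documents where the age field is between 30 and 40. The second query does a wildcard search on the surname field, looking for values that start with “K.”

The query contained three different clauses. Elasticsearch will only return documents that satisfy the must clause and do not match the must_not clause, and it will rank the ones that also match a should query higher. These queries can be nested, so you can build up very complex queries by specifying a bool query as a must, must_not, should or filter query.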
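When bool queries get nested, building them in code can be easier than writing the JSON by hand. Here is a minimal Python sketch of that idea. The bool_query helper and its parameter names are made up for this example. It writes the fuzzy clause in its long form and the range with gte and lte bounds, and it sets minimum_should_match to 1, a bool query parameter that requires at least one of the should clauses to match.

```python
import json


def bool_query(must=(), must_not=(), should=(), filters=(),
               minimum_should_match=None):
    """Assemble a bool query from lists of clauses (hypothetical helper)."""
    clauses = {}
    for name, items in (("must", must), ("must_not", must_not),
                        ("should", should), ("filter", filters)):
        if items:
            clauses[name] = list(items)
    if minimum_should_match is not None:
        clauses["minimum_should_match"] = minimum_should_match
    return {"bool": clauses}


body = {
    "query": bool_query(
        must=[{"fuzzy": {"name": {"value": "john", "fuzziness": 2}}}],
        must_not=[{"match": {"_all": "city"}}],
        should=[
            {"range": {"age": {"gte": 30, "lte": 40}}},
            {"wildcard": {"surname": "K*"}},
        ],
        minimum_should_match=1,
    )
}
print(json.dumps(body, indent=2))
```

Because bool_query returns an ordinary clause, its result can itself be passed in as a must, must_not, should or filter entry of another bool_query call.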

filter

One clause type we have not discussed for a compound query is the filter clause. Here is an example where we use one:

curl localhost:9200/_search -d '{
  "query": {
    "bool": {
      "must": { "match_all": {} },
      "filter": {
        "term": { "email": "joe@bloggs.com" }
      }
    }
  }
}'

The match_all query in the must clause tells Elasticsearch that it should return all of the documents. This might not seem to be a very useful search, but it comes in handy when you use it in conjunction with a filter as we have done here. The filter we have specified is a term query, asking for all documents that contain an email field with the value “joe@bloggs.com.” Filters are not used in the calculation of scores, so the only scoring clause is match_all, and every returned document gets a score of 1.

One thing to note is that this query will not work as expected if the email field is analyzed, which is the default for string fields in Elasticsearch. The reason behind this is a topic best discussed in another blog post, but it comes down to the fact that Elasticsearch analyzes field values when they are indexed, whereas a term query is not analyzed and looks for the exact value. In this case, the standard analyzer breaks the email address up into the tokens joe and bloggs.com, so no indexed term equals “joe@bloggs.com” and the filter matches nothing. An analyzed query such as match would still find the document through any one of those tokens.

Filters Versus Queries

People who have used Elasticsearch before version 2 will be familiar with filters and queries. You used to build up a query body using both filters and queries. The difference between the two was that filters were generally faster because they check only if a document matches at all and not whether it matches well. In other words, filters give a boolean answer whereas queries return a calculated score of how well a document matches a query. Various performance enhancements were associated with filters due to their simplified nature.

Since version 2 of Elasticsearch, filters and queries have been merged and any query clause can be used as either a filter or a query (depending on the context). As with version 1, filters are cached and should be used if scoring does not matter.

Scoring

We have mentioned the fact that Elasticsearch returns a score along with all of the matching documents from a query:

> curl "localhost:9200/_search?q=application"
{
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": {
    "total": 1,
    "max_score": 2.3,
    "hits": [
      {
        "_index": "logstash-2016.04.04",
        "_type": "logs",
        "_id": "1",
        "_score": 2.3,
        "_source": { "message": "Log message from my application" }
      }
    ]
  }
}

This score is calculated against the documents in Elasticsearch based on the provided queries. Factors such as the length of a field, how often the specified term appears in the field, and (in the case of wildcard and fuzzy searches) how closely the term matches the specified value all influence the score. The calculated score is then used to order documents, usually from the highest score to lowest, and the highest scoring documents are then returned to the client. There are various ways to influence the scores of different queries such as the boost parameter. This is especially useful if you want certain queries in a complex query to carry more weight than others and you are looking for the most significant documents.
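As an illustration of the boost parameter, the Python sketch below builds a query in which a match on one field counts three times as much as a match on another. The boosted_match helper and the title field are made up for this example, while message is the field from the sample response above. The boost of 3.0 is an arbitrary value, not a recommendation.

```python
import json


def boosted_match(field, text, boost=1.0):
    """A match clause in its long form, which has room for a boost value."""
    return {"match": {field: {"query": text, "boost": boost}}}


body = {
    "query": {
        "bool": {
            "should": [
                boosted_match("title", "application", boost=3.0),
                boosted_match("message", "application"),
            ]
        }
    }
}
print(json.dumps(body, indent=2))
```

Documents that match in either field are returned, and the boost only changes how much each match contributes to the final score.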

When using a query in a filter context (as explained earlier), no score is calculated. This provides the enhanced performance usually associated with using filters but does not provide the ordering and significance features that come with scoring.

Conclusion

The hardest thing about Elasticsearch is the depth and breadth of the available features. We have tried to cover the essential elements in as much detail as possible without drowning you in information. Ask any questions you might have in the comments, and look out for more in-depth posts covering some of the features we have mentioned.

About Jurgens du Toit

Jurgens tries to write good code for a living. He even succeeds at it sometimes. When he isn’t writing code, he’s wrangling data as a hobby. Sometimes the data wins, but we don’t talk about that. Ruby and Elasticsearch are his weapons of choice, but his ADD always allows for new interests. He’s also the community maintainer for a number of Logstash inputs.
