Elasticsearch // 2016-04-15
Over the last year or so I’ve been looking quite a bit at Elasticsearch for use as a general purpose time series database for operational data.
Whilst there are definitely a lot of options in this space, most notably InfluxDB, I keep coming back to Elasticsearch for large volumes of data with a heavy analytical workload.
More than a few times, I’ve been asked to explain what Elasticsearch looks like to the would-be developer/operations person. This isn’t too surprising: the documentation isn’t great at giving a real-world architectural overview, and having one really helps put the rest of the documentation in context.
Having been asked again today, I’ve decided to write one up here – so I can save some time explaining the same thing over and over.
It’s important to note for the would-be reader that this is my imperfect understanding of Elasticsearch. If you spot any glaringly obvious errors, please let me know and I will update this accordingly. You’ll have my eternal thanks for helping me grok this a little better.
So, as they say: the show must go on!
Elasticsearch is a search engine built on top of Apache Lucene. It’s great for full-text search (duh) as well as analytical workloads with ad hoc queries and aggregations.
In NoSQL parlance, it would be classified as a document-oriented database, and that’s primarily how your application will interact with it.
You insert your DOCUMENT into an INDEX.
- An index is a namespace for documents and is analogous to a database table
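- Unlike a table, you don’t have to define a schema up front; Elasticsearch will infer field types from the documents you index
- You define which fields are indexed for search, and how, at this level (the index’s mapping)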
An INDEX is backed by one or more SHARDS
- When you create an index, you specify the number of primary shards your data will be split across.
- Elasticsearch uses a hash function to determine what shard your document is stored in and accessed from.
- You cannot change the number of primary shards after the index is created.
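To see why the primary shard count is fixed, here is a minimal sketch of hash-based routing. It assumes the shard is chosen as the hash of a routing value (the document ID by default) modulo the number of primary shards. `zlib.crc32` is only a stand-in for whatever hash Elasticsearch uses internally, and `pick_shard` is a made-up name for illustration.

```python
import zlib


def pick_shard(routing_value: str, num_primary_shards: int) -> int:
    """Map a routing value (assumed: the document ID) to a shard number."""
    # crc32 is a stand-in hash: it is stable across runs, unlike Python's hash().
    digest = zlib.crc32(routing_value.encode("utf-8"))
    return digest % num_primary_shards


# Route the same documents with 5 and then 6 primary shards.
doc_ids = [f"doc-{n}" for n in range(1000)]
with_5 = {d: pick_shard(d, 5) for d in doc_ids}
with_6 = {d: pick_shard(d, 6) for d in doc_ids}

# Count the documents that would no longer be found where they were stored.
moved = sum(1 for d in doc_ids if with_5[d] != with_6[d])
print(f"{moved} of {len(doc_ids)} documents would map to a different shard")
```

Because the shard number depends on the shard count, changing the count would send lookups for existing documents to the wrong shard, which is why you end up reindexing instead.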
A SHARD can have one or more REPLICA SHARDS
- For each primary shard in your index you will have X replica copies (defined by the index settings)
- If the node hosting the primary shard fails, a replica shard will be promoted to primary
- Replica shards are used for scaling out read performance – the more replica shards you have, the more reads you can service.
- Unlike primary shards, you can change the number of replica shards any time
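As a rough sketch of how those two numbers show up in practice, the snippet below builds the request bodies you would send when creating an index and when later changing its replica count. `number_of_shards` and `number_of_replicas` are the Elasticsearch setting names; the helper functions and the example values are my own, purely for illustration.

```python
import json


def create_index_body(primary_shards: int, replicas: int) -> dict:
    """Body for creating an index: the primary count is fixed from here on."""
    return {
        "settings": {
            "number_of_shards": primary_shards,
            "number_of_replicas": replicas,
        }
    }


def update_replicas_body(replicas: int) -> dict:
    """Body for an index settings update: only the replica count changes."""
    return {"index": {"number_of_replicas": replicas}}


def total_shard_copies(primary_shards: int, replicas: int) -> int:
    """Each primary plus its replicas is one shard copy the cluster must host."""
    return primary_shards * (1 + replicas)


print(json.dumps(create_index_body(5, 1)))
print(json.dumps(update_replicas_body(2)))
print(total_shard_copies(5, 1), "shard copies before,",
      total_shard_copies(5, 2), "after adding a replica")
```

The total is worth keeping an eye on: every extra replica multiplies the number of shard copies your nodes have to hold.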
ELASTICSEARCH runs as a CLUSTER made up of NODES
- Nodes automatically form a cluster when correctly configured
- Elasticsearch will automatically distribute (and move) shards as needed by your index configuration
- Your application can talk to any node in the cluster and they will forward your request to a node with the data to service your request
- If you have a busy cluster, you can deploy client nodes (sometimes called coordinating-only nodes): these are nodes that don’t store shards and are used only to route incoming requests
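A SHARD is a Lucene index
- Every time your query needs to access a shard, that shard’s Lucene index has to be open on the node serving it
- For good performance your Lucene index should fit in memory: Lucene reads its files through the operating system’s file system cache, and anything that doesn’t fit has to be fetched from disk on access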
Care and feeding of your cluster
I won’t cover setting up and maintaining quorum in the cluster, because that’s pretty well covered elsewhere. If you’re running on AWS there’s even a managed product available which helps simplify things a lot.
For the keen observer, that last bullet point raises some interesting constraints when managing your cluster.
Elasticsearch runs Lucene inside its own JVM, but remember that Lucene’s index data lives outside the JVM heap, in the operating system’s file system cache. Whatever memory you give to the Elasticsearch heap is not available for caching index data, so make sure you leave enough memory free for Lucene.
Don’t allocate more than 32GB heap to Elasticsearch. Due to how Java addresses memory (above roughly 32GB the JVM can no longer use compressed object pointers), this will slow things down heaps (see what I did there?).
Read this. No really.
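To make those two memory constraints concrete, here is a small back-of-the-envelope sketch. It assumes the commonly cited rule of thumb of giving about half of a node’s RAM to the heap, and it caps the heap at 31GB to stay safely under the limit mentioned above. Both numbers and the function name are assumptions for illustration, not official sizing guidance.

```python
HEAP_CEILING_GB = 31   # assumption: stay a little under the 32GB limit
HEAP_FRACTION = 0.5    # assumption: the common "half for heap" rule of thumb


def suggest_memory_split(total_ram_gb: float) -> tuple:
    """Return (heap_gb, left_for_lucene_gb) for a node with the given RAM."""
    heap_gb = min(total_ram_gb * HEAP_FRACTION, HEAP_CEILING_GB)
    # Everything not given to the heap stays available to the file system
    # cache, which is where Lucene's index data ends up being cached.
    return heap_gb, total_ram_gb - heap_gb


for ram in (16, 64, 128):
    heap, lucene = suggest_memory_split(ram)
    print(f"{ram}GB node: {heap:g}GB heap, {lucene:g}GB left for Lucene")
```

Note how on the larger nodes the heap stops growing and all the extra memory goes to index data instead.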
Working with Elasticsearch
In short, deploying Elasticsearch for your search application requires some careful planning on the ingest side.
Primarily you want to make sure your shards fit in memory on your nodes and your cluster has enough memory to keep commonly accessed shards in memory. Otherwise your queries will be slow because the system is spending all its time unloading and loading datasets, which defeats the point.
Read the “Designing for Scale” part of the Elasticsearch guide.
Your best bet is to split the data across indexes by a meaningful criterion. For time series data, time itself is the natural fit (for example, one index per day).
Elasticsearch has a feature called index templates, which are super useful for dynamically creating indexes with specific settings. Your application directs each write to the correct index by name, the index is created on first write with the template’s settings, and the template can automatically add the new index to an index alias for reads.
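Here is a minimal sketch of how that might look for daily indexes. The `metrics` prefix, the alias name and the helper functions are hypothetical. The template body uses the `template` pattern key as I understand it from the Elasticsearch versions current at the time of writing; later versions call it `index_patterns`, so check the documentation for your version.

```python
from datetime import datetime, timezone


def index_for(prefix: str, ts: datetime) -> str:
    """Name of the daily index a document with this timestamp is written to."""
    return f"{prefix}-{ts:%Y.%m.%d}"


def template_body(prefix: str, alias: str, primary_shards: int, replicas: int) -> dict:
    """Index template applied to every new index whose name matches the prefix."""
    return {
        "template": f"{prefix}-*",  # assumption: "index_patterns" in later versions
        "settings": {
            "number_of_shards": primary_shards,
            "number_of_replicas": replicas,
        },
        # Every index created from this template joins the read alias.
        "aliases": {alias: {}},
    }


event_time = datetime(2016, 4, 15, 9, 30, tzinfo=timezone.utc)
print(index_for("metrics", event_time))
print(template_body("metrics", "metrics-read", 2, 1))
```

Writes go to the dated index name, reads go to the alias, and old data can be dropped a whole index at a time.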
Elasticsearch is a great tool, but requires you to plan ahead. Hopefully I’ve given you a good introduction to how things hang together and where the sharp edges are.
I highly recommend reading through the entire Elasticsearch: The Definitive Guide. With the above information in mind ahead of time, I think it makes for a much more cohesive read.
Good luck and happy hacking!