ELK stands for Elasticsearch, Logstash and Kibana. It provides an open source data analytics platform covering searching/analysing, transforming/enriching and visualising data.
The main components are:
- Elasticsearch: the distributed search and analytics engine where the data is indexed and stored.
- Logstash: the data processing pipeline used to parse, transform and enrich the data.
- Kibana: the web interface used to visualise and explore the data.
In addition to these three components, the ELK stack also includes Beats, a family of lightweight data shippers.
In this series of posts, we give an introduction to how you can start using the ELK stack to centrally store and manage your data/logs. Our showcase scenario is indexing and centrally storing the logs generated by the Bro IDS. We will deploy a two-node Elasticsearch cluster. These nodes will ingest data coming from a single-node processing pipeline running Logstash and a queueing system (Kafka/Redis). For shipping the logs from the source, we will use Filebeat.
As mentioned above, Elasticsearch is a distributed full-text search engine. It is also a document-based/NoSQL data store. In Elasticsearch terminology, a document is a JSON object containing a list of fields. In relational database terminology, a document is similar to a row in a SQL table. An index is a collection of documents (a table in SQL terms). It is composed of shards and replicas. What you need to know about shards is that they are the basic building block of the Lucene index; in practice you don’t need to worry about them. In contrast, replicas are more relevant from a user perspective. They are used to increase failover and search performance/speed. By default, every Elasticsearch shard/index has one replica, but this configuration can always be changed.

Every Elasticsearch index includes a mapping defining some properties of the index and the fields of the documents it contains. It also defines how these fields are analysed/indexed by Lucene. In relational database terminology, a mapping is the schema definition of the index/table. By default, Elasticsearch will always try to automatically guess the data type of the document fields it sees; in Elasticsearch terminology, this is called dynamic mapping. Elasticsearch also offers a REST API to let you define your own mapping and interact with the data.
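To make dynamic mapping a bit more concrete, here is a small illustration you can try once the cluster described below is up and running. The index name (test-index) and the fields are just made-up examples: we index one document over the REST API and then ask Elasticsearch which data types it guessed.

# Index a sample document; the index and its mapping are created on the fly
curl -XPUT 'localhost:9200/test-index/logs/1?pretty' -d '
{
  "message" : "connection established",
  "status" : 200
}'
# Retrieve the mapping Elasticsearch generated dynamically for this index
curl -XGET 'localhost:9200/test-index/_mapping?pretty'

In the response of the second request you should see that Elasticsearch mapped status to a numeric type and message to a text field, without us having defined anything upfront.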
Enough theoretical talk. The installation of Elasticsearch is straightforward: thanks to the folks at elastic.co, there is a binary package for all major Linux distributions. In our case, we are installing Elasticsearch on an Ubuntu server system. The following commands do the job. We need to run them on all the nodes (here two) that will be part of our Elasticsearch cluster.
wget -qO - https://artifacts.elastic.co/GPG-KEY-elasticsearch | sudo apt-key add -
echo "deb https://artifacts.elastic.co/packages/5.x/apt stable main" | sudo tee -a /etc/apt/sources.list.d/elastic-5.x.list
sudo apt-get update
sudo apt-get install openjdk-8-jdk elasticsearch
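The package installs Elasticsearch but, on systemd-based distributions, does not start or enable the service automatically; you may want to register it to start at boot (the exact commands can vary slightly depending on your distribution):

sudo systemctl daemon-reload
sudo systemctl enable elasticsearch.service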
After running these commands you have successfully installed Elasticsearch. The next step is to configure your Elasticsearch cluster nodes. The main configuration file is /etc/elasticsearch/elasticsearch.yml. The following are the main configuration lines that need to be added/updated.
# Cluster name, should be the same on all the nodes
cluster.name: binor-elk
# Node name, should be unique per node
node.name: "node-01"
# The network interfaces the Elasticsearch service will listen on. We configure it to listen on localhost and all the local network interfaces.
network.host: [_local_, _site_]
# List of the IPs of the nodes in this cluster.
discovery.zen.ping.unicast.hosts: [ "10.3.0.41", "10.3.0.42" ]
The above are the minimal/basic configuration changes required to start your cluster. Depending on the hardware capabilities of the hosts running your Elasticsearch cluster, you can also tune the JVM settings: you can increase the heap memory available to the Elasticsearch process, and you can configure Elasticsearch to lock its memory and avoid swapping. A minimal example is shown below; more details can be found at 1 and 2.
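As a rough sketch (the 2g heap size here is just a placeholder you should adapt to the RAM of your hosts), these settings live in two files:

# /etc/elasticsearch/jvm.options -- give the JVM a fixed heap (min and max should match)
-Xms2g
-Xmx2g

# /etc/elasticsearch/elasticsearch.yml -- lock the process memory to avoid swapping
bootstrap.memory_lock: true

Note that memory locking only takes effect if the service is allowed to lock memory; on systemd hosts this typically means also raising the memlock limit for the elasticsearch unit.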
After the above configuration, your Elasticsearch cluster is ready. You can start it by running systemctl start elasticsearch.service on the two nodes of the cluster. To verify that everything is working as expected, you can run the following HTTP request:
curl -XGET 'localhost:9200/?pretty'
You should then see a response similar to the following:
{
  "name" : "node-01",
  "cluster_name" : "binor-elk",
  "cluster_uuid" : "o6KIJ5o6TNq0sbO3QiDo4A",
  "version" : {
    "number" : "5.1.1",
    "build_hash" : "5395e21",
    "build_date" : "2016-12-06T21:36:15.409Z",
    "build_snapshot" : false,
    "lucene_version" : "6.3.0"
  },
  "tagline" : "You Know, for Search"
}
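Since this is a two-node cluster, it is also worth confirming that both nodes have actually joined. A simple way is the cluster health API; the number_of_nodes field in its output should be 2:

curl -XGET 'localhost:9200/_cluster/health?pretty'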
Now our Elasticsearch cluster is ready to receive data. But before sending any, we will define an index/mapping template. In particular, we want to make sure that fields holding IP addresses and floating-point (double) values are properly handled by the Elasticsearch indexer.
curl -XPUT http://127.0.0.1:9200/_template/logstash -d '
{
  "order" : 0,
  "template" : "logstash-*",
  "settings" : {
    "index.number_of_shards" : 5,
    "index.number_of_replicas" : 1,
    "index.query.default_field" : "message"
  },
  "mappings" : {
    "_default_" : {
      "properties" : {
        "src_ip" : {"type" : "ip"},
        "dst_ip" : {"type" : "ip"},
        "conn_duration" : {"type" : "double"}
      }
    }
  },
  "aliases" : { }
}
'
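If the template was accepted, Elasticsearch replies with {"acknowledged":true}. You can also double-check that it was stored with:

curl -XGET 'localhost:9200/_template/logstash?pretty'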
In the above definition, we are telling Elasticsearch to always treat the fields src_ip and dst_ip as of type ip. This is a built-in Elasticsearch field data type that provides some special search capabilities, such as filtering by IP range or subnet. The mapping we define here is a basic one; we will update it later.
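As a quick, hypothetical illustration of what the ip type gives you (the index pattern and subnet below are just placeholders), once documents are flowing into a logstash-* index you could retrieve all connections coming from a given subnet with a term query using CIDR notation:

curl -XGET 'localhost:9200/logstash-*/_search?pretty' -d '
{
  "query" : {
    "term" : { "src_ip" : "10.3.0.0/24" }
  }
}'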
This concludes the setup of our Elasticsearch cluster. In the next blog posts we will talk about the data processing pipeline and how to configure Logstash to parse and ship logs to Elasticsearch.