Reducing Data Volume
The Enriching Data chapter discussed how data collected at a remote point could be enriched before being sent to a central collector. However, sometimes there is a lot of data relative to the available bandwidth, and we would like to use the distributed processing power of the nodes to reduce the amount of data sent to the collector. This can be an effective way to minimize licensing costs when using Splunk, for instance, but it is also a way of working with edge sites that only have mobile data available.
Filtering
One strategy is to discard events that are not considered important. filter has three variations. The first matches patterns against field values:
- filter:
    patterns:
      - severity: high
      - source: ^GAUTENG-
The second uses a conditional expression:
- filter:
    condition: speed > 1
A very useful technique is to send only changed values, using stream to watch for changes:
- stream:
    operation: delta
    watch: throughput
- filter:
    condition: delta != 0
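For illustration, here are some hypothetical throughput readings, assuming delta holds the change since the previous reading:

# input (hypothetical readings):
# {"throughput":100}
# {"throughput":100}
# {"throughput":120}
# repeated identical readings produce delta == 0 and are dropped by the filter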
Discarding and Renaming
We may only be interested in certain data fields. The third variation of filter is schema:
- filter:
    schema:
      - source
      - destination
      - sent_kilobytes_per_sec
filter schema will only pass through events that have the specified fields, and will discard any other fields. It is useful both to document the event structure and to get rid of any temporary fields generated during processing.
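For example, a hypothetical event carrying a leftover temporary field would be trimmed like this:

# input (illustrative values):
# {"source":"A","destination":"B","sent_kilobytes_per_sec":10,"tmp":1}
# output:
# {"source":"A","destination":"B","sent_kilobytes_per_sec":10}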
Alternatively, one can always remove such fields explicitly with remove.
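A minimal sketch (the exact argument form of remove is an assumption here, not something this section specifies):

# assumed syntax: a list of field names to drop
- remove:
    - delta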
When JSON is used as data transport, field names are a significant part of the payload size, so renaming fields can make a difference:
- rename:
    - source: s
    - destination: d
    - sent_kilobytes_per_sec: sent
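For illustration, a hypothetical event before and after this step:

# input (hypothetical event):
# {"source":"GAUTENG-1","destination":"CPT-4","sent_kilobytes_per_sec":840}
# output:
# {"s":"GAUTENG-1","d":"CPT-4","sent":840}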
This naturally leads to the next section:
Using a More Compact Data Format
CSV is a very efficient data transfer format because, unlike JSON, each row does not repeat the column names. collapse will convert the fields of a JSON event into CSV data. However, you would typically need to convert this back into JSON to store it in Elasticsearch (for example). Having to create a Logstash filter to do this is tedious, so collapse provides some conveniences.
With CSV output you can ask for the column names and types to be written to a separate field, and optionally ask for this header to be written only when the fields change:
# input:
# {"a":1,"b":"hello"}
# {"a":2,"b":"goodbye"}
- collapse:
    output-field: d
    csv:
      header-field: h
      header-field-types: true
      header-field-on-change: true
# output:
# {"d":"1,hello","h":"a:num,b:str"}
# {"d":"2,goodbye"}
The reverse operation expand can happen on a server-side pipe. It will take the output of the remote pipe and restore the original events.
- expand:
    input-field: d
    remove: true
    delim: ',' # default
    csv:
      header-field: h
      header-field-types: true
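For example, feeding the collapsed events from above through this step should restore the original events (illustrative, based on the behaviour described):

# input (output of the collapse example above):
# {"d":"1,hello","h":"a:num,b:str"}
# {"d":"2,goodbye"}
# expected output:
# {"a":1,"b":"hello"}
# {"a":2,"b":"goodbye"}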
If there is a corresponding pipe on the server, then you can move any enrichments to that pipe as well.