Reducing Data Volume
The Enriching Data chapter discussed how data collected at a remote point can be enriched before being sent to a central collector. However, when there is a significant amount of data relative to the available bandwidth, it is useful to reduce the amount of data sent to the collector by using the distributed processing power of the nodes. This is an effective way to minimize licensing costs (e.g., Splunk) and to work with edge sites where only mobile data is available.
Filtering
The first method involves discarding events that are not considered important. filter has three variations for categorizing events:
- Pattern matches on field values:

  - filter:
      patterns:
      - severity: high
      - source: ^GAUTENG-
- Conditional expressions:

  - filter:
      condition: speed > 1
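For illustration (the field values here are invented), a condition filter simply drops events that do not satisfy the expression:

  # Input:  {"vehicle":"A","speed":0.4}, {"vehicle":"B","speed":3.2}
  # Output: {"vehicle":"B","speed":3.2}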
A variation of this uses stream to watch for changes, a highly effective technique for sending only changed values:
- stream:
    operation: delta
    watch: throughput
- filter:
    condition: delta != 0
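As a rough sketch (assuming stream adds the computed difference as a delta field on each event, which is what the filter condition above tests), unchanged readings are simply not forwarded:

  # Input:  {"throughput":100}, {"throughput":100}, {"throughput":120}
  # The second event has a delta of 0 and is dropped; only readings that
  # actually changed travel to the collector.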
Discarding and Renaming
We may only be interested in certain data fields, which brings us to the third variation of filter:
- Schema:

  - filter:
      schema:
      - source
      - destination
      - sent_kilobytes_per_sec
filter.schema only passes through events with the specified fields and discards any other fields.
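For example (field values invented), any fields not listed in the schema are stripped before the event is sent on:

  # Input:  {"source":"GAUTENG-01","destination":"JHB-02","sent_kilobytes_per_sec":120,"tmp_tag":"x"}
  # Output: {"source":"GAUTENG-01","destination":"JHB-02","sent_kilobytes_per_sec":120}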
We recommend documenting the event structure and getting rid of any temporary fields generated during processing. To do this, remember that it is always possible to use remove explicitly.
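A minimal sketch of such a cleanup step is shown below; the exact argument form of remove may differ in your version, and tmp_tag is a hypothetical temporary field created earlier in the Pipe:

  # drop a temporary working field before the event leaves the node
  # (sketch only; check the actual syntax of remove for your version)
  - remove:
    - tmp_tag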
When JSON is used as the data transport, field names are a significant part of the payload size. It is therefore useful to rename fields:
- rename:
  - source: s
  - destination: d
  - sent_kilobytes_per_sec: sent
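The effect on a single event (values invented) is a noticeably smaller JSON payload:

  # Input:  {"source":"GAUTENG-01","destination":"JHB-02","sent_kilobytes_per_sec":120}
  # Output: {"s":"GAUTENG-01","d":"JHB-02","sent":120}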
Using a More Compact Data Format
CSV is a very efficient data transfer format because rows do not repeat column names. collapse will convert the fields of a JSON event into CSV data.
In most cases, this CSV data would need to be converted back into JSON for storage in analytics engines like Elasticsearch. Creating a Logstash filter to perform this conversion can be tedious, so collapse provides some workarounds, as seen below.
The CSV output allows the column names and types to be written to a field. There is also an option to specify that this header only be written when the fields change:
# Input: {"a":1,"b":"hello"}, {"a":2,"b":"goodbye"}
- collapse:
    output-field: d
    csv:
      header-field: h
      header-field-types: true
      header-field-on-change: true
# Output: {"d":"1,hello","h":"a:num,b:str"}, {"d":"2,goodbye"}
The reverse operation, expand, can take place in a server-side Pipe. It takes the output of the remote Pipe and restores the original events:
- expand:
    input-field: d
    remove: true
    delim: ','   # default
    csv:
      header-field: h
      header-field-types: true
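Continuing the earlier collapse example, the intent is that the server side ends up with the original events again (a sketch; we assume the server re-applies the last header it saw to events that omit it):

  # Input:  {"d":"1,hello","h":"a:num,b:str"}, {"d":"2,goodbye"}
  # Output: {"a":1,"b":"hello"}, {"a":2,"b":"goodbye"}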
If there is a corresponding Pipe on the server, any enrichments can also be moved from the remote node to that Pipe.
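As a sketch only, a server-side Pipe could first expand the compact events and then apply the enrichment steps that previously ran at the edge (the comment below is a placeholder; the actual enrichment steps come from the Enriching Data chapter):

  - expand:
      input-field: d
      remove: true
      csv:
        header-field: h
        header-field-types: true
  # ... enrichment steps from the Enriching Data chapter go here,
  # now running on the server rather than on the remote node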