Generating Events
The generate action requires special discussion, because it is a way to create custom events and alerts that are aware of the history of the data. A stream of JSON events is read and passed through; generate saves these records in a SQLite database so that it can use the full power of SQL to run historical queries over aggregates such as averages and maximums.
- generate:
    bbox.linkutilisation.incoming:
      add:
        - severity: warning
        - type: alert
        - title: "Line Utilisation incoming over 90%"
        - text: "average incoming ${avg_incoming}kb close to max incoming ${max_incoming}kb: ratio: ${threshold:1}"
      let:
        avg_incoming: AVG(incomingBytesPerInterval, 5m)
        max_incoming: MAX(incomingBytesPerInterval, 60m)
        threshold: 90.0/100.0
      when: (avg_incoming / max_incoming) > threshold
In short, we take the 5 minute average of the incoming byte rate and the 60 minute maximum, and create an alert when the ratio of the average to the maximum is greater than some threshold.
Under let, we define variables, which may be simple expressions or may contain aggregate functions like AVG or MAX. All of the input fields in the JSON events can be used in these expressions. when is the condition that must be true for the alert to be generated.
Note that the title and text may contain ${var:1}, where var is any defined variable or known field. (The ":1" means "one decimal place", which helps to keep reports clean.)
severity can be 'info', 'warning' or 'error'. The name of the alert is an aggregation key that uniquely identifies the type of alert.
This produces the following alert events:
{
  "type":"alert",
  "text":"average incoming 84kb close to max incoming 84kb: ratio: 0.9",
  "title":"Line Utilisation incoming over 90%",
  "aggregation_key":"bbox.linkutilisation.incoming",
  "severity":"warning",
  "@timestamp":"2018-05-01T12:00:00.000Z"
}
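The time-windowed aggregates in let are where the SQLite store mentioned above comes in. As a rough illustration of the idea only, the sketch below shows what an aggregate like AVG(incomingBytesPerInterval, 5m) amounts to once events are sitting in a SQLite table; the table name, column names and schema here are assumptions for illustration, not generate's actual internal layout.

import sqlite3

# Conceptual sketch only: generate keeps events in SQLite, so a windowed
# aggregate is essentially a SQL query over recent rows. Schema is assumed.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (ts INTEGER, incomingBytesPerInterval REAL)")

def avg_incoming_5m(now_ts):
    # AVG(incomingBytesPerInterval, 5m): average over events from the last 300 seconds
    row = con.execute(
        "SELECT AVG(incomingBytesPerInterval) FROM events WHERE ts >= ?",
        (now_ts - 5 * 60,),
    ).fetchone()
    return row[0]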
Filtering Alerts
If you only want to see the alerts, use filter to only pass through events that have a field type with the value "alert":
- filter:
    patterns:
      - type: alert
Here is a similar example where the latest DNS timeToRespond is compared against avg_lookup_hour.
- generate:
    alert.dns.benchmark:
      #test: true
      add:
        - type: alert
        - severity: warning
        - title: "DNS Lookup over ${threshold}% ${destinationHostname} (${min})"
        - text: "Lookup time ${timeToRespond} greater than hour average ${avg_lookup_hour}: ratio: ${ratio:1}"
      notification: 5m
      any: destinationHostname
      let:
        avg_lookup_hour: AVG(timeToRespond,60m)
        threshold: '150'
        ratio: timeToRespond/avg_lookup_hour
      when: ratio > (threshold/100.0)
But there is an interesting twist. Say our input records look like this:
{"time":"2018-05-01 12:00:00.050","destinationHostname":"example.com","timeToRespond":200.000000,"min":1}
{"time":"2018-05-01 12:00:00.100","destinationHostname":"frodo.co.za","timeToRespond":70.000000,"min":1}
{"time":"2018-05-01 12:00:00.150","destinationHostname":"panoptix.io","timeToRespond":50.000000,"min":1}
....
Resolving different hosts can take different amounts of time, and generally people only want to know when there is a sudden relative change in resolution time for a particular host. any: destinationHostname ensures that we track the average time to respond individually for each host.
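To picture the effect of any:, here is a minimal Python sketch of per-host rolling averages. This is not how generate is implemented (it uses SQLite, as described earlier); it simply shows that each destinationHostname gets its own window and its own comparison, with the 1.5 threshold corresponding to '150' above.

from collections import defaultdict, deque

# Conceptual sketch of `any: destinationHostname`: one rolling 60 minute window
# per host, rather than a single shared one.
windows = defaultdict(deque)  # destinationHostname -> recent (timestamp, timeToRespond)

def on_event(event, now, threshold=1.5):
    win = windows[event["destinationHostname"]]
    win.append((now, event["timeToRespond"]))
    while win and win[0][0] < now - 3600:      # keep only the last 60 minutes
        win.popleft()
    avg_lookup_hour = sum(t for _, t in win) / len(win)
    ratio = event["timeToRespond"] / avg_lookup_hour
    if ratio > threshold:
        return dict(event, type="alert", severity="warning",
                    aggregation_key="alert.dns.benchmark")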
Grouping Events by ID
Sometimes a tool used as a probe does not produce single events, but a set of JSON records.
For a pipe that performs a traceroute, its data could look something like this, simplified here to only show the fields of interest.
{... "scanUUID":"b41657c3-8ba1-42a8-a750-66291015e26a","hopNumber":1 ...}
{... "scanUUID":"b41657c3-8ba1-42a8-a750-66291015e26a","hopNumber":4 ...}
{... "scanUUID":"b41657c3-8ba1-42a8-a750-66291015e26a","hopNumber":7 ...}
{... "scanUUID":"b41657c3-8ba1-42a8-a750-66291015e26a","hopNumber":10 ...}
{... "scanUUID":"d654f821-fcff-48d4-a297-4ecc261d6154","hopNumber":19 ...}
{... "scanUUID":"d654f821-fcff-48d4-a297-4ecc261d6154","hopNumber":6 ...}
{... "scanUUID":"d654f821-fcff-48d4-a297-4ecc261d6154","hopNumber":7 ...}
{... "scanUUID":"d654f821-fcff-48d4-a297-4ecc261d6154","hopNumber":12 ...}
What the records have in common is a unique scanUUID. We want to look at the maximum hopNumber for a particular id, and compare the values for the last and current scans. Previously the aggregate functions took time intervals; here they take a field together with a record index:
- generate:
    alert.bbox.hopCount:
      add:
        - severity: warning
        - type: alert
        - title: "Change of Hop Count"
        - text: "From ${hopCount1} to ${hopCount2}"
      at_end: true
      let:
        hopCount1: MAX(hopNumber,scanUUID:1)
        hopCount2: MAX(hopNumber,scanUUID:0)
      when: ABS(hopCount1-hopCount2) > 2
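A rough way to read MAX(hopNumber,scanUUID:n) is: group the records by scanUUID, take the per-scan maximum, and count scans back from the most recent, so :0 is the current scan and :1 the previous one. The Python sketch below illustrates that reading only; it is not generate's implementation.

from collections import OrderedDict

# Conceptual sketch of a record-index aggregate: per-scan maxima, indexed from
# the latest scan backwards (0 = current, 1 = previous).
scans = OrderedDict()  # scanUUID -> max hopNumber seen for that scan

def on_record(rec):
    uuid = rec["scanUUID"]
    scans[uuid] = max(scans.get(uuid, 0), rec["hopNumber"])

def hop_count(n_back):
    uuids = list(scans)
    return scans[uuids[-1 - n_back]]

def hop_count_changed():
    return len(scans) >= 2 and abs(hop_count(1) - hop_count(0)) > 2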
More than One Alert
You may define multiple alerts watching the same input events:
- generate:
    bbox.temperature:
      add:
        - severity: warning
        - type: alert
        - title: "High Bbox Temperature detected"
        - text: "temperature is ${temperature:1}"
      let:
        temperature: AVG(cpuTemperature,15m)
      when: temperature > 70
    bbox.memusage:
      add:
        - severity: warning
        - type: alert
        - title: "High Bbox Memory Usage detected"
        - text: "memory used ${memUsedPerc}%"
      let:
        memoryUsageAvg: MAX(memoryUsage,15m)
        totalMemoryAvg: MAX(totalMemory,15m)
        memUsedPerc: (memoryUsageAvg/totalMemoryAvg)*100
      when: memUsedPerc > 90
    bbox.diskusage:
      add:
        - severity: warning
        - type: alert
        - title: "High Bbox Disk Usage detected"
- text: "memory usage is ${usagePercentage}"
      let:
        usagePercentage: AVG(rootPartitionUsagePercentage,15m)
      when: usagePercentage > 80
    bbox.loadaverage:
      add:
        - severity: warning
        - type: alert
        - title: "High Bbox Load detected"
        - text: "load average ${loadAverage5mAvg:1}"
      let:
        loadAverage5mAvg: AVG(loadAverage5m,60m)
      when: loadAverage5mAvg > 3
    bbox.uptime:
      add:
        - severity: warning
        - type: alert
        - title: "Rebooted"
        - text: "uptime of ${uptime}"
      let:
        uptime: AVG(uptimeSeconds,120s)
      when: uptime < 120
Enriching Events
The main requirement of an alert is that it should clearly indicate where it came from. The above alert definitions would be fairly useless without some idea of the site that is in trouble, and there are 'magic' numbers like 3 and 80 all over.
In the case of a line saturation alert, it would be good to parameterize the alert so that we can change the threshold easily.
This is a full pipe definition, where threshold has been defined in context. Note that the variable previously called threshold inside the generate action has been more accurately renamed ratio:
name: line_saturation
input:
  file: incoming.json
context:
  threshold: 90
actions:
  - generate:
      bbox.linkutilisation.incoming:
        add:
          - severity: warning
          - type: alert
          - title: "{{name}} Line Utilisation incoming over {{threshold}}%"
          - text: "average incoming ${avg_incoming}kb close to max incoming ${max_incoming}kb: ratio: ${ratio:1} (site {{name}})"
        let:
          avg_incoming: AVG(incomingBytesPerInterval, 5m)
          max_incoming: MAX(incomingBytesPerInterval, 60m)
          ratio: '{{threshold}}/100.0'
        when: (avg_incoming / max_incoming) > ratio
  - filter:
      patterns:
        - type: alert
output:
  write: console
You will note that context variables expand with {{threshold}}, not ${threshold}. This is because they are constants that are defined when the pipe is created, whereas 'dollar curlies' are for field values and for values calculated from them for each incoming event (just as with the add action). {{name}} always exists, and by default is set to the agent name.
This is clearer, because we don't repeat the magic number 90. But the real power comes from contexts... you may change threshold for all agents in the global context, change it for a particular agent in agent context, and use tag context to set it for particular groups of agents.
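One way to picture the two expansion stages described above is sketched here; the context values and the substitution code are purely illustrative (hypothetical agent name, simplified formatting), not how hotrod expands templates internally. 'Double curlies' are resolved once from the context when the pipe is created; 'dollar curlies' are resolved from each event.

import re

# Stage 1: context values ("double curlies") are substituted once, at pipe creation.
context = {"name": "site-01", "threshold": 90}   # hypothetical values
title_template = "{{name}} Line Utilisation incoming over {{threshold}}%"
title = re.sub(r"\{\{(\w+)\}\}", lambda m: str(context[m.group(1)]), title_template)
# title is now "site-01 Line Utilisation incoming over 90%"

# Stage 2: field values ("dollar curlies") are substituted for every incoming event.
event_vars = {"avg_incoming": 84, "max_incoming": 84}
text_template = "average incoming ${avg_incoming}kb close to max incoming ${max_incoming}kb"
text = re.sub(r"\$\{(\w+)(?::\d+)?\}", lambda m: str(event_vars[m.group(1)]), text_template)
# text is now "average incoming 84kb close to max incoming 84kb"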