
Generating Events

The generate action requires special discussion because it is a way to create custom events and alerts that are aware of the history of the data.

A stream of JSON events is read and passed through. generate saves these records in a SQLite database so it can use the full power of SQL to generate historical queries over aggregates such as averages and maximums.

- generate:
    bbox.linkutilisation.incoming:
      add:
        - severity: warning
        - type: alert
        - title: "Line Utilisation incoming over 90%"
        - text: "average incoming ${avg_incoming}kb close to max incoming ${max_incoming}kb: ratio: ${threshold:1}"
      let:
        avg_incoming: AVG(incomingBytesPerInterval, 5m)
        max_incoming: MAX(incomingBytesPerInterval, 60m)
        threshold: 90.0/100.0
      when: (avg_incoming / max_incoming) > threshold
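
For reference, the records flowing into this pipe would need to carry the fields referenced in these expressions. A couple of illustrative input records (field names taken from the definition above; the timestamp field and all values are invented for this sketch) might look like:

{"@timestamp":"2018-05-01T11:55:00.000Z","incomingBytesPerInterval":78}
{"@timestamp":"2018-05-01T12:00:00.000Z","incomingBytesPerInterval":84}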

In short, we take the 5 minute average of the incoming byte rate and the 60 minute maximum, and create an alert when the ratio of the average to the maximum is greater than some threshold.

Under let, we define variables, which may be simple expressions or may contain aggregate functions like AVG or MAX.

All of the input fields in the JSON events can be used in these expressions.

when is the condition that must be true for the alert to be generated.

Note that the title and text may contain ${var:1}, where var is any defined variable or known field. (The ":1" means "one decimal place" which helps to keep reports clean)

severity can be 'info', 'warning' or 'error'. The name of the alert is an aggregation key that uniquely identifies the type of alert.

This produces the following alert events:

{
  "type":"alert",
  "text":"average incoming 84kb close to max incoming 84kb: ratio: 0.9",
  "title":"Line Utilisation incoming over 90%",
  "aggregation_key":"bbox.linkutilisation.incoming",
  "severity":"warning",
  "@timestamp":"2018-05-01T12:00:00.000Z"
}
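
To read this output against the definition: the computed value of avg_incoming / max_incoming is 84/84 = 1.0, which is greater than the threshold of 0.9, so the alert fires. Note that "ratio: 0.9" in the text is actually the threshold, since the template interpolates ${threshold:1}; this naming is tidied up in the Enriching Events section below.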

Filtering Alerts

If you only want to see the alerts, use filter to pass through only those events whose type field has the value "alert":

- filter:
    patterns:
      - type: alert

Here is a similar example where the latest DNS timeToRespond is compared against avg_lookup_hour.

- generate:
    alert.dns.benchmark:
      #test: true
      add:
        - type: alert
        - severity: warning
        - title: "DNS Lookup over ${threshold}% ${destinationHostname} (${min})"
        - text: "Lookup time ${timeToRespond} greater than hour average ${avg_lookup_hour}: ratio: ${ratio:1}"
      notification: 5m
      any: destinationHostname
      let:
        avg_lookup_hour: AVG(timeToRespond,60m)
        threshold: '150'
        ratio: timeToRespond/avg_lookup_hour
      when: ratio > (threshold/100.0)

But there is an interesting twist. Say our input records look like this:

{"time":"2018-05-01 12:00:00.050","destinationHostname":"example.com","timeToRespond":200.000000,"min":1}
{"time":"2018-05-01 12:00:00.100","destinationHostname":"frodo.co.za","timeToRespond":70.000000,"min":1}
{"time":"2018-05-01 12:00:00.150","destinationHostname":"panoptix.io","timeToRespond":50.000000,"min":1}
....

Lookups for different hosts naturally take different amounts of time, and generally people only want to know when there is a sudden relative change in resolution time for a particular host. any: destinationHostname ensures that the average time to respond is tracked individually for each host.
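
For illustration, if example.com suddenly took much longer to resolve than its own hourly average, the generated alert might look roughly like this (the values are invented, and the layout simply follows the earlier alert example rather than being exact tool output):

{
  "type":"alert",
  "text":"Lookup time 200 greater than hour average 120: ratio: 1.7",
  "title":"DNS Lookup over 150% example.com (1)",
  "aggregation_key":"alert.dns.benchmark",
  "severity":"warning",
  "@timestamp":"2018-05-01T12:00:00.000Z"
}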

Grouping Events by ID

Sometimes a tool used as a probe does not produce single events, but a set of JSON records.

For a pipe that performs a traceroute, the data could look something like this, simplified here to show only the fields of interest:

{... "scanUUID":"b41657c3-8ba1-42a8-a750-66291015e26a","hopNumber":1 ...}
{... "scanUUID":"b41657c3-8ba1-42a8-a750-66291015e26a","hopNumber":4 ...}
{... "scanUUID":"b41657c3-8ba1-42a8-a750-66291015e26a","hopNumber":7 ...}
{... "scanUUID":"b41657c3-8ba1-42a8-a750-66291015e26a","hopNumber":10 ...}

{... "scanUUID":"d654f821-fcff-48d4-a297-4ecc261d6154","hopNumber":19 ...}
{... "scanUUID":"d654f821-fcff-48d4-a297-4ecc261d6154","hopNumber":6 ...}
{... "scanUUID":"d654f821-fcff-48d4-a297-4ecc261d6154","hopNumber":7 ...}
{... "scanUUID":"d654f821-fcff-48d4-a297-4ecc261d6154","hopNumber":12 ...}

What the records have in common is a unique scanUUID. We want to look at the maximum hopNumber for a particular id and compare the numbers from the last and the current scan.

Previously the aggregate functions took time intervals, and here they take a field with a record index:

- generate:
    alert.bbox.hopCount:
      add:
        - severity: warning
        - type: alert
        - title: "Change of Hop Count"
        - text: "From ${hopCount1} to ${hopCount2}"
      at_end: true
      let:
        hopCount1: MAX(hopNumber,scanUUID:1)
        hopCount2: MAX(hopNumber,scanUUID:0)
      when: ABS(hopCount1-hopCount2) > 2
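
Applying this to the two scans shown above (and assuming record index 0 refers to the current scan and 1 to the previous one, which matches the "From ... to ..." wording): the first scan's maximum hopNumber is 10, the second's is 19, so ABS(10-19) = 9 is greater than 2 and an alert would be produced, roughly like this (timestamp invented):

{
  "type":"alert",
  "text":"From 10 to 19",
  "title":"Change of Hop Count",
  "aggregation_key":"alert.bbox.hopCount",
  "severity":"warning",
  "@timestamp":"2018-05-01T12:00:00.000Z"
}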

More than one Alert

You may define multiple alerts watching the same input events:

- generate:
    bbox.temperature:
      add:
        - severity: warning
        - type: alert
        - title: "High Bbox Temperature detected"
        - text: "temperature is ${temperature:1}"
      let:
        temperature: AVG(cpuTemperature,15m)
      when: temperature > 70

    bbox.memusage:
      add:
        - severity: warning
        - type: alert
        - title: "High Bbox Memory Usage detected"
        - text: "memory used ${memUsedPerc}%"
      let:
        memoryUsageAvg: MAX(memoryUsage,15m)
        totalMemoryAvg: MAX(totalMemory,15m)
        memUsedPerc: (memoryUsageAvg/totalMemoryAvg)*100
      when: memUsedPerc > 90

    bbox.diskusage:
      add:
        - severity: warning
        - type: alert
        - title: "High Bbox Disk Usage detected"
        - text: "disk usage is ${usagePercentage}"
      let:
        usagePercentage: AVG(rootPartitionUsagePercentage,15m)
      when: usagePercentage > 80

    bbox.loadaverage:
      add:
        - severity: warning
        - type: alert
        - title: "High Bbox Load detected"
        - text: "load average ${loadAverage5mAvg:1}"
      let:
        loadAverage5mAvg: AVG(loadAverage5m,60m)
      when: loadAverage5mAvg > 3

    bbox.uptime:
      add:
        - severity: warning
        - type: alert
        - title: "Rebooted"
        - text: "uptime of ${uptime}"
      let:
        uptime: AVG(uptimeSeconds,120s)
      when: uptime > 120
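
All of these definitions watch the same stream of system-metric events, so a single input record needs to carry every field they reference. A hypothetical record (all values invented) might look something like:

{"@timestamp":"2018-05-01T12:00:00.000Z","cpuTemperature":72.5,"memoryUsage":3800,"totalMemory":4096,"rootPartitionUsagePercentage":83,"loadAverage5m":3.4,"uptimeSeconds":86400}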

Enriching Events

The main requirement of an alert is that it should clearly indicate where it came from. The above alert definitions would be fairly useless without some idea of the site that is in trouble, and there are 'magic' numbers like 3 and 80 all over.

In the case of a line saturation alert, it would be good to parameterize the alert so that we can change the threshold easily.

Here is a full pipe definition where threshold has been defined in context. Note that the variable previously called threshold has been renamed ratio, which describes it more accurately:

name: line_saturation
input:
  file: incoming.json
context:
  threshold: 90
actions:
  - generate:
      bbox.linkutilisation.incoming:
        add:
          - severity: warning
          - type: alert
          - title: "{{name}} Line Utilisation incoming over {{threshold}}%"
          - text: "average incoming ${avg_incoming}kb close to max incoming ${max_incoming}kb: ratio: ${ratio:1} (site {{name}})"
        let:
          avg_incoming: AVG(incomingBytesPerInterval, 5m)
          max_incoming: MAX(incomingBytesPerInterval, 60m)
          ratio: '{{threshold}}/100.0'
        when: (avg_incoming / max_incoming) > ratio
  - filter:
      patterns:
        - type: alert
output:
  write: console

You will note that context variables expand with {{threshold}}, not ${threshold}. This is because they are constants that are defined when the pipe is created, and 'dollar curlies' are for field values and values calculated from them for each incoming event (just as with the add action).

{{name}} always exists and will be set to the agent name.
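
For example, with threshold set to 90 in context and a hypothetical agent named branch-01, the title is fixed at the moment the pipe is created, while the ${...} values are filled in per incoming event, so an emitted alert might look roughly like:

{
  "type":"alert",
  "text":"average incoming 84kb close to max incoming 84kb: ratio: 0.9 (site branch-01)",
  "title":"branch-01 Line Utilisation incoming over 90%",
  "aggregation_key":"bbox.linkutilisation.incoming",
  "severity":"warning",
  "@timestamp":"2018-05-01T12:00:00.000Z"
}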

This is clearer, because we don't repeat the magic number 90. But the real power comes from contexts... you may change threshold for all agents in the global context, change it for a particular agent in agent context, and use tag context to set it for particular groups of agents.