Generating Events

The generate action is particularly important: it can be used to create custom events and alerts that take data history into account.

A stream of JSON events is read and passed through. generate saves these records in a SQLite database, which is then used for historical queries. These queries are aggregated using functions such as averages and maximums:

- generate:
    bbox.linkutilisation.incoming:
      add:
        - severity: warning
        - type: alert
        - title: "Line Utilisation incoming over 90%"
        - text: "average incoming ${avg_incoming}kb close to max incoming ${max_incoming}kb: ratio: ${threshold:1}"
      let:
        avg_incoming: AVG(incomingBytesPerInterval, 5m)
        max_incoming: MAX(incomingBytesPerInterval, 60m)
        threshold: 90.0/100.0
      when: (avg_incoming / max_incoming) > threshold

To summarize: from the stream of JSON events we determine the 5-minute average and the 60-minute maximum of the incoming byte rate. An alert is created whenever the ratio of that average to the maximum is greater than a specified threshold; the when condition must be true for an alert to be generated.

Variables are defined under let. These variables may be simple expressions or contain aggregate functions like AVG or MAX.

All of the input JSON event fields can be used in these expressions.
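
For example, a let block can mix aggregate functions with plain expressions over the raw input fields. In this sketch (the variable name spare is purely illustrative), incomingBytesPerInterval is read directly from each event:

let:
  # aggregates over the event history
  avg_incoming: AVG(incomingBytesPerInterval, 5m)
  max_incoming: MAX(incomingBytesPerInterval, 60m)
  # a simple expression using the raw field from the current event
  spare: max_incoming - incomingBytesPerInterval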

note

The title and text may contain ${var:1}, where var is any defined variable or known field. The :1 means "one decimal place" which helps to keep reports clean.
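
For instance, with made-up values: if ratio evaluates to 0.8765, a text template like the one below renders as ratio: 0.9 in the generated alert:

- text: "ratio: ${ratio:1}"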

severity can be one of info, warning or error. The alert name (bbox.linkutilisation.incoming above) is the aggregation key and the unique identifier for this type of alert.

The resulting alert events look like this:

{
  "type": "alert",
  "text": "average incoming 84kb close to max incoming 84kb: ratio: 0.9",
  "title": "Line Utilisation incoming over 90%",
  "aggregation_key": "bbox.linkutilisation.incoming",
  "severity": "warning",
  "@timestamp": "2018-05-01T12:00:00.000Z"
}

Filtering Alerts

To see only alerts, use filter to pass through just those events where the field type has the value alert:

- filter:
    patterns:
      - type: alert

Below is a similar example that compares the latest DNS timeToRespond to the hourly average avg_lookup_hour:

- generate:
    alert.dns.benchmark:
      #test: true
      add:
        - type: alert
        - severity: warning
        - title: "DNS Lookup over ${threshold}% ${destinationHostname} (${min})"
        - text: "Lookup time ${timeToRespond} greater than hour average ${avg_lookup_hour}: ratio: ${ratio:1}"
      notification: 5m
      any: destinationHostname
      let:
        avg_lookup_hour: AVG(timeToRespond,60m)
        threshold: '150'
        ratio: timeToRespond/avg_lookup_hour
      when: ratio > (threshold/100.0)

There is a twist here. Let's say our input records look like this:

{"time":"2018-05-01 12:00:00.050","destinationHostname":"example.com","timeToRespond":200.000000,"min":1}
{"time":"2018-05-01 12:00:00.100","destinationHostname":"frodo.co.za","timeToRespond":70.000000,"min":1}
{"time":"2018-05-01 12:00:00.150","destinationHostname":"panoptix.io","timeToRespond":50.000000,"min":1}
...

Resolving different hosts can take very different amounts of time. What matters is a sudden relative change in resolution time for a particular host. Using any: destinationHostname tracks the average response time for each host separately.
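
As an illustration, suppose example.com suddenly takes 200ms to resolve while its own hourly average is 80ms. Assuming the output follows the same shape as the earlier alert event (with the alert name as aggregation_key), the generated alert would look roughly like this:

{
  "type": "alert",
  "text": "Lookup time 200 greater than hour average 80: ratio: 2.5",
  "title": "DNS Lookup over 150% example.com (1)",
  "aggregation_key": "alert.dns.benchmark",
  "severity": "warning",
  "@timestamp": "2018-05-01T12:00:00.050Z"
}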

Grouping Events by ID

A tool used as a probe may occasionally produce a set of JSON records instead of single events.

A Pipe performing a traceroute could produce data that resembles the example shown below (simplified to only show the fields of interest):

{... "scanUUID":"b41657c3-8ba1-42a8-a750-66291015e26a","hopNumber":1 ...}
{... "scanUUID":"b41657c3-8ba1-42a8-a750-66291015e26a","hopNumber":4 ...}
{... "scanUUID":"b41657c3-8ba1-42a8-a750-66291015e26a","hopNumber":7 ...}
...
{... "scanUUID":"d654f821-fcff-48d4-a297-4ecc261d6154","hopNumber":19 ...}
{... "scanUUID":"d654f821-fcff-48d4-a297-4ecc261d6154","hopNumber":7 ...}
{... "scanUUID":"d654f821-fcff-48d4-a297-4ecc261d6154","hopNumber":12 ...}

All records belonging to the same scan share a scanUUID. Let's look at the maximum hopNumber for a particular ID and compare it between the most recent scan and the one before it.

In the previous examples the aggregate functions took time intervals; here they take a grouping field and an index, where scanUUID:0 refers to the most recent scan and scanUUID:1 to the previous one:

- generate:
    alert.bbox.hopCount:
      add:
        - severity: warning
        - type: alert
        - title: "Change of Hop Count"
        - text: "From ${hopCount1} to ${hopCount2}"
      at_end: true
      let:
        hopCount1: MAX(hopNumber,scanUUID:1)
        hopCount2: MAX(hopNumber,scanUUID:0)
      when: ABS(hopCount1 - hopCount2) > 2
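
With the sample records above (maximum hop 7 for the first scanUUID and 19 for the second), the difference exceeds 2, so an alert along the following lines would be generated (illustrative values, assuming the same output shape as the earlier alert events):

{
  "type": "alert",
  "text": "From 7 to 19",
  "title": "Change of Hop Count",
  "aggregation_key": "alert.bbox.hopCount",
  "severity": "warning",
  "@timestamp": "2018-05-01T12:00:00.000Z"
}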

More than one Alert

It’s possible to define multiple alerts that watch the same input events:

- generate:
    bbox.temperature:
      add:
        - severity: warning
        - type: alert
        - title: "High Bbox Temperature detected"
        - text: "temperature is ${temperature:1}"
      let:
        temperature: AVG(cpuTemperature,15m)
      when: temperature > 70

    bbox.memusage:
      add:
        - severity: warning
        - type: alert
        - title: "High Bbox Memory Usage detected"
        - text: "memory used ${memUsedPerc}%"
      let:
        memoryUsageAvg: MAX(memoryUsage,15m)
        totalMemoryAvg: MAX(totalMemory,15m)
        memUsedPerc: (memoryUsageAvg/totalMemoryAvg)*100
      when: memUsedPerc > 90

    bbox.diskusage:
      add:
        - severity: warning
        - type: alert
        - title: "High Bbox Disk Usage detected"
        - text: "disk usage is ${usagePercentage}%"
      let:
        usagePercentage: AVG(rootPartitionUsagePercentage,15m)
      when: usagePercentage > 80

    bbox.loadaverage:
      add:
        - severity: warning
        - type: alert
        - title: "High Bbox Load detected"
        - text: "load average ${loadAverage5mAvg:1}"
      let:
        loadAverage5mAvg: AVG(loadAverage5m,60m)
      when: loadAverage5mAvg > 3

    bbox.uptime:
      add:
        - severity: warning
        - type: alert
        - title: "Rebooted"
        - text: "uptime of ${uptime}"
      let:
        uptime: AVG(uptimeSeconds,120s)
      when: uptime < 120

Enriching Events

The main requirement for an alert is a clear indication of its origin: we need to know which site is experiencing the issue. The definitions above also hard-code threshold numbers such as 3 and 80, which we may want to tune.

Adding a parameter to a line saturation alert allows us to easily change the threshold.

Below is a full Pipe definition in which threshold is defined in the context. Note that the let variable previously called threshold has been renamed to ratio:

name: line_saturation

input:
  file: incoming.json

context:
  threshold: 90

actions:
- generate:
    bbox.linkutilisation.incoming:
      add:
        - severity: warning
        - type: alert
        - title: "{{name}} Line Utilisation incoming over {{threshold}}%"
        - text: "average incoming ${avg_incoming}kb close to max incoming ${max_incoming}kb: ratio: ${ratio:1} (site {{name}})"
      let:
        avg_incoming: AVG(incomingBytesPerInterval, 5m)
        max_incoming: MAX(incomingBytesPerInterval, 60m)
        ratio: '{{threshold}}/100.0'
      when: (avg_incoming / max_incoming) > ratio

- filter:
    patterns:
      - type: alert

output:
  write: console

Context variables expand with {{threshold}} instead of ${threshold}. They are constants, defined when the Pipe is created, whereas the ${} expansions are for field values and values calculated from each incoming event, as used in the add action.

{{name}} is set to the Agent name and is always present.

This example is clearer because the threshold number (90) is not repeated. The real gain, however, comes from Contexts: threshold can be changed for all Agents in the global Context, or for a particular Agent in that Agent's Context. A Tag Context can also be set for specified groups of Agents.
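
For example, to tune this alert the only thing that needs to change is the context value; the expressions and templates stay untouched (a sketch based on the Pipe definition above):

context:
  threshold: 95

The alert title for an Agent would then expand to "{{name}} Line Utilisation incoming over 95%", with {{name}} replaced by that Agent's name.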