Enriching Data
Raw data is converted to JSON format (see working with data). It is then enriched, reshaped, and annotated for easy consumption and storage:
- add: to add constant-value fields
- script: to conditionally add scripted fields
- enrich: for general CSV lookups
- inputs as actions: can be used for arbitrary retrieval
Fields with Agent-specific Values
Data needs to be tagged with the location at the source. There are standard Context variables that are different for each Agent, and more can be added to the Pipe Contexts:
- add:
    output-fields:
    - site: '{{name}}'
    - pipe: '{{pipe}}'
Generated Fields
All data needs a timestamp. The time action adds one, recording the processing time at the Agent:
- time:
    output-field: '@timestamp'
A sequence number can be added to each event. Note that it is not persistent across restarts:
- script:
    let:
    - seq: 'count()'
However, uuid() is a more effective way to generate such a field, as it gives each event a unique ID.
Calculated Fields
Use the script action to calculate values for fields. For example, script.let can anonymize data using hash functions:
- script:
    let:
    - name_hash: md5(name)
    - address_hash: md5(address)
- remove:
  - name
  - address
This prepares data for storage and for further processing outside of private networks.
Hashes are one-way functions, so the original values cannot be recovered from them. If the values must remain recoverable, sensitive fields can be encrypted instead using encrypt().
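To make the anonymization step concrete, here is a minimal Python sketch of what the script and remove actions above achieve together. It is only an illustration of the idea, not the Agent's implementation: the anonymize helper is hypothetical, and we assume md5() yields a hex digest of the field's text.

```python
import hashlib


def anonymize(event, fields):
    """Return a copy of the event where each named field is replaced
    by an MD5 hex digest stored under '<field>_hash'.

    Hypothetical helper; assumes md5() in the pipe returns a hex digest.
    """
    out = dict(event)
    for field in fields:
        if field in out:
            # Remove the sensitive value and keep only its digest.
            value = str(out.pop(field))
            out[field + "_hash"] = hashlib.md5(value.encode("utf-8")).hexdigest()
    return out


event = {"name": "Alice", "address": "1 Main Road", "id": 23}
print(anonymize(event, ["name", "address"]))
```

The same input always produces the same digest, so hashed fields can still be used for grouping and joining without exposing the original values.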
Find the available scripting functions here.
Conditional Fields
If condition is defined and true, script will add fields.
It is easier to use set rather than let when adding literal strings. Because the add and script actions default to never overwriting existing fields, the following snippet adds the field quality to the event: quality is good if a > 1, and bad for any other value.
- script:
    condition: a > 1
    set:
    - quality: good
- script:
    set:
    - quality: bad
For a more elegant solution, use the cond function:
- script:
    let:
    - quality: cond(a > 1,"good","bad")
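The two-step version works only because an existing field is never overwritten. The following Python sketch models that behaviour under stated assumptions: script_set and classify are hypothetical names, and a missing field a is treated as failing the condition.

```python
def script_set(event, fields, condition=None):
    """Model of a script action with `set`: add fields only when the
    condition holds (or is absent), never overwriting existing fields."""
    if condition is not None and not condition(event):
        return event
    out = dict(event)
    for key, value in fields.items():
        out.setdefault(key, value)  # keeps any value already present
    return out


def classify(event):
    # First action: conditional, sets quality to "good" when a > 1.
    event = script_set(event, {"quality": "good"},
                       condition=lambda e: e.get("a", 0) > 1)
    # Second action: unconditional, but only fills in a missing quality.
    return script_set(event, {"quality": "bad"})


print(classify({"a": 5}))
print(classify({"a": 1}))
```

The cond form collapses both steps into a single expression, much like `"good" if a > 1 else "bad"` in Python.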
Table Lookup
enrich is an efficient way to enrich data with tables read from a CSV file. If a field value in an event matches the value in a column, we can use the value of another column on the same row to create a new field.
Note that the lookup files need to be attached to the Pipe using a files: section. An example of this can be found at the end of this section.
A sample lookup table, names.csv:
id,name,nick,office
23,Alice,bbye,head
12,Bob,wkr,kzn
13,John,nomo,wcape
If iden in the event matches the id in the table, we can set nice_name to the value of name:
# Input: {"iden":12}
- enrich:
  - lookup-file: names.csv
    match:
    - type: num
      event-field: iden
      lookup-field: id
    add:
      event-field: nice_name
      lookup-field: name
# Output: {"iden":12,"nice_name":"Bob"}
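For readers who prefer to see the lookup spelled out, this Python sketch reproduces the same match against the sample table. It is a rough model rather than the Agent's implementation: the enrich function and its parameter names are hypothetical, and we assume a num match compares values numerically.

```python
import csv
import io

NAMES_CSV = """id,name,nick,office
23,Alice,bbye,head
12,Bob,wkr,kzn
13,John,nomo,wcape
"""


def enrich(event, table, event_field, lookup_field, add_field, add_from):
    """Copy row[add_from] into event[add_field] for the first row whose
    lookup_field equals the event's event_field (compared as numbers)."""
    out = dict(event)
    try:
        wanted = float(event[event_field])
    except (KeyError, TypeError, ValueError):
        return out  # nothing to match on, so the event passes through
    for row in table:
        if float(row[lookup_field]) == wanted:
            out[add_field] = row[add_from]
            break
    return out


table = list(csv.DictReader(io.StringIO(NAMES_CSV)))
print(enrich({"iden": 12}, table, "iden", "id", "nice_name", "name"))
# prints {'iden': 12, 'nice_name': 'Bob'}
```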
A type must be specified for each match. The available types are:
- str: text values
- num: numbers
- ip: IPv4 addresses
- cidr: IPv4 address ranges. For example: '192.168.1.0/16'
- num-list: numbers separated by commas. For example: '10,20,30'
- str-list: strings separated by commas. For example: 'office,home'
- num-range: numeric ranges. For example: '10-23'
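One way to picture these types is as different comparison rules between the event value and the cell in the lookup column. The Python sketch below is our reading of the list above, not the Agent's actual logic: the matches function is hypothetical, range bounds are assumed to be inclusive, and the CIDR example is parsed leniently because it has host bits set.

```python
import ipaddress


def matches(match_type, event_value, cell):
    """Return True if event_value matches the lookup cell for the given type."""
    if match_type == "str":
        return str(event_value) == cell
    if match_type == "num":
        return float(event_value) == float(cell)
    if match_type == "ip":
        return ipaddress.ip_address(event_value) == ipaddress.ip_address(cell)
    if match_type == "cidr":
        # strict=False accepts '192.168.1.0/16' and treats it as 192.168.0.0/16.
        network = ipaddress.ip_network(cell, strict=False)
        return ipaddress.ip_address(event_value) in network
    if match_type == "num-list":
        return float(event_value) in [float(x) for x in cell.split(",")]
    if match_type == "str-list":
        return str(event_value) in [x.strip() for x in cell.split(",")]
    if match_type == "num-range":
        # Assumes non-negative bounds written as 'low-high', both inclusive.
        low, high = (float(x) for x in cell.split("-"))
        return low <= float(event_value) <= high
    raise ValueError("unknown match type: " + match_type)


print(matches("cidr", "192.168.7.9", "192.168.1.0/16"))
print(matches("num-range", 24, "10-23"))
```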
You may need to satisfy multiple matches:
# Input: {"iden":12,"office":"kzn"}
- enrich:
  - lookup-file: names.csv
    match:
    - type: num
      event-field: iden
      lookup-field: id
    - type: str
      event-field: office
      lookup-field: office
    add:
      event-field: nice_name
      lookup-field: name
# Output: {"iden":12,"office":"kzn","nice_name":"Bob"}
Adding multiple values with enrich can be tedious, since the match must be repeated:
# Input: {"iden":12}
- enrich:
  - lookup-file: names.csv
    match:
    - type: num
      event-field: iden
      lookup-field: id
    add:
      event-field: nice_name
      lookup-field: name
  - lookup-file: names.csv
    match:
    - type: num
      event-field: iden
      lookup-field: id
    add:
      event-field: nickname
      lookup-field: nick
# Output: {"iden":12,"nice_name":"Bob","nickname":"wkr"}
There is a convenient shortcut. Here, the fields to be added must have the same names as the lookup columns:
# Input: {"iden":12}
- enrich:
  - lookup-file: names.csv
    match:
    - type: num
      event-field: iden
      lookup-field: id
    add:
      event-fields:
      - name: <unknown>
      - nick: ''
# Output: {"iden":12,"name":"Bob","nick":"wkr"}
Each entry under event-fields gives a field name, which must match the column of the same name in the CSV file. The value (after the colon) is the default value.
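The role of the defaults is easiest to see when no row matches. The sketch below models the shortcut in Python; it assumes (this is our reading, not something stated above) that the default is used when the lookup finds no matching row. The enrich_many function is hypothetical, and values are compared as strings to keep the example short.

```python
TABLE = [
    {"id": "23", "name": "Alice", "nick": "bbye", "office": "head"},
    {"id": "12", "name": "Bob", "nick": "wkr", "office": "kzn"},
    {"id": "13", "name": "John", "nomo": "nomo", "nick": "nomo", "office": "wcape"},
]


def enrich_many(event, table, event_field, lookup_field, defaults):
    """Add every field named in `defaults` from the matching row, or fall
    back to the given default when no row matches (assumed behaviour)."""
    out = dict(event)
    wanted = str(event.get(event_field))
    row = next((r for r in table if r[lookup_field] == wanted), None)
    for field, default in defaults.items():
        out[field] = row[field] if row is not None else default
    return out


defaults = {"name": "<unknown>", "nick": ""}
print(enrich_many({"iden": 12}, TABLE, "iden", "id", defaults))
print(enrich_many({"iden": 99}, TABLE, "iden", "id", defaults))
```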
The lookup CSV file will be reloaded if it is modified. This allows other Pipes to modify the enrichment globally.
A complete example illustrating the inclusion of a fruits.csv file in an optional lookups subdirectory:
name: simple_echo_with_enrich
files:
- lookups/fruits.csv
input:
  echo:
    json: true
    event: |
      { "this": "a", "that": "b" }
actions:
- enrich:
    lookup-file: fruits.csv
    add:
      event-field: fruit
      lookup-field: output_column
    match:
    - type: str
      event-field: this
      lookup-field: input_column
output:
  print: STDOUT
# Output: {"that":"b","this":"a","fruit":"Apple"}
Here is the lookups/fruits.csv file:
input_column,output_column
a,"Apple"
b,"Banana"
Enriching with Input
Using inputs as actions is a powerful technique. Let's say we have an HTTP endpoint that is given a name and returns the city where the person lives as {"city":"NAME"}. Events containing name receive a city field:
name: http-enrich
input:
  exec:
    command: echo '{"name":"Joe"}'
    raw: true
actions:
- input:
    http-poll:
      address: http://127.0.0.1:3030
      query:
      - name: ${name}
      raw: true
output:
  write: console
# Output: {"city":"Johannesburg","name":"Joe"}
Much of the functionality on Unix-like systems is provided through the CLI, so we can also execute commands as actions. The host command can perform either a forward or reverse DNS lookup:
name: host-enrich
input:
  exec:
    command: echo '{"ip":"98.137.246.7"}'
    raw: true
actions:
- exec:
    command: host ${ip}
    result:
      stdout-field: host
- raw:
    extract:
      input-field: host
      pattern: '(\S+)\.$'
output:
  write: console
# Output:
# {"ip":"98.137.246.7","host":"media-router-fp1.prod1.media.vip.gq1.yahoo.com"}
The only extra step is extracting the hostname from the end of the command output.
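The extraction pattern can be tried on its own before putting it in a Pipe. The Python sketch below applies the same regular expression to a sample line in the format host prints for a reverse lookup. The sample string is illustrative rather than captured output, and extract_hostname is a hypothetical helper.

```python
import re

# Illustrative reverse-lookup line in the format printed by `host`.
HOST_OUTPUT = ("7.246.137.98.in-addr.arpa domain name pointer "
               "media-router-fp1.prod1.media.vip.gq1.yahoo.com.")


def extract_hostname(text):
    """Return the last whitespace-free token, without its trailing dot,
    or None if the text does not end in a dot."""
    match = re.search(r"(\S+)\.$", text.strip())
    return match.group(1) if match else None


print(extract_hostname(HOST_OUTPUT))
# prints media-router-fp1.prod1.media.vip.gq1.yahoo.com
```

The pattern relies on the fully qualified name ending with a dot, so output that lacks one (for example, an error message without a trailing period) produces no match.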
The script function ip2asn is a more appropriate way to get the actual ASN. This function uses the Team Cymru service and, in this case, returns YAHOO-GQ1, US.
Use input: redis to look up a field in a Redis hash. This is particularly effective when the lookups are simple and often repeated.