The HRT Beat | Tech Blog

HeraclesQL: A Python DSL for Writing Alerts

Published

Oct 28, 2025

I’m on the Systems Dev team at HRT. My team is responsible for software development that supports Systems, Network, and Datacenter Engineers who maintain the computers HRT relies on for trading and research. While my team has a large variety of responsibilities, I’m focused on developing tools which improve monitoring and alerting. In this post, I’ll be doing a deep dive on my team’s newly open-sourced project: HeraclesQL.

At HRT, VictoriaMetrics is an important part of our monitoring stack. It centers around a timeseries database that stores metrics about services and devices. VictoriaMetrics has its own query language for fetching timeseries data called MetricsQL. Over time, we’ve written thousands of alerts using MetricsQL queries.

While we love VictoriaMetrics and MetricsQL, we’ve encountered some pain points with MetricsQL as the number of teams using VictoriaMetrics has grown. These break down into three categories: excessive code duplication, poor expressiveness in complex queries, and subtle footguns.

HeraclesQL is designed to solve these problems while preserving what makes MetricsQL great. It is a Domain Specific Language (DSL) for writing MetricsQL queries in Python. If you’re familiar with SQLAlchemy for SQL, it can be thought of as filling a similar role. Under the hood, HeraclesQL Python code is essentially a type-safe builder which generates MetricsQL. We’ve provided it as a Python package on PyPI and made it available on GitHub as open source software under the MIT license. Some features include:

MetricsQL-like syntax — HeraclesQL will be immediately familiar to anyone who’s written MetricsQL or PromQL.
Custom Functions, Parameterizable Expressions, and Variables — No more copy and pasting behavior between alerts.
Static Type Safety — MyPy and your editor will catch common problems before they occur.
Meta-alerts — Generate alerts about your alerts to avoid common pitfalls.

Before we do our deep dive, let’s briefly discuss MetricsQL.

A Brief Intro to MetricsQL

MetricsQL is the language for querying the VictoriaMetrics timeseries database. It’s largely based on PromQL, which is the query language for the Prometheus timeseries database. MetricsQL and PromQL are extremely similar, so if you know one you essentially know the other. Both VictoriaMetrics and Prometheus are monitoring platforms, which means they typically store metrics about services or devices.

Data is stored in the form of numerically-valued timeseries. Each timeseries is identified by a set of string key-value pairs called labels. Queries use string matching operations to select timeseries based on the value of labels. The result of a query is a vector that contains zero or more timeseries.

Let’s look at some examples:

A simple MetricsQL query

my_cool_metric{job="my_scrape_job"}

This query selects a vector of timeseries with the name my_cool_metric and the label job="my_scrape_job". The parts of the query inside the curly braces are called matchers. There’s no way to know in advance how many timeseries this query will select.
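To make the data model concrete, here is a minimal Python sketch of how equality matchers narrow down a set of stored timeseries. The Series type, the select helper, and the host label are hypothetical names invented for this illustration; this is not how VictoriaMetrics is implemented.

```python
from typing import NamedTuple


class Series(NamedTuple):
    """A timeseries identified by its labels (hypothetical, for illustration)."""

    labels: dict[str, str]
    value: float


def select(series: list[Series], name: str, **matchers: str) -> list[Series]:
    """Return every series whose metric name and labels equal the matchers."""
    # The metric name is itself stored as the special __name__ label.
    wanted = {"__name__": name, **matchers}
    return [
        s
        for s in series
        if all(s.labels.get(key) == val for key, val in wanted.items())
    ]


stored = [
    Series({"__name__": "my_cool_metric", "job": "my_scrape_job", "host": "a"}, 3.0),
    Series({"__name__": "my_cool_metric", "job": "my_scrape_job", "host": "b"}, 12.0),
    Series({"__name__": "my_cool_metric", "job": "other_job", "host": "a"}, 7.0),
]

# Equivalent in spirit to my_cool_metric{job="my_scrape_job"}
result = select(stored, "my_cool_metric", job="my_scrape_job")
```

Two of the three stored series match here, but the query text alone gives no hint of that count. It depends entirely on what happens to be stored.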

MetricsQL supports many built-in functions and operators. For example, a query can use the sum function to calculate a timeseries which is the sum of all timeseries matching a selector:

MetricsQL aggregation

sum(http_request_duration{method="POST"})

Since VictoriaMetrics is a timeseries database, many functions are designed to calculate values over time. For example, the increase function calculates how much a counter has grown over a specified lookback:

MetricsQL rollup

increase(http_request_count_total{path=~"/api/v2/.*"}[5m])

There’s too much behavior to detail in this post, so I’ll stop here. The VictoriaMetrics docs are a great resource for learning more about MetricsQL.

The primary purpose of MetricsQL is to write alerts which notify users of anomalous conditions in a service or device. To this end, VictoriaMetrics supports configured alerting rules. Here’s an example:

VictoriaMetrics alerting rule

alert: MyUsefulAlert
expr: my_cool_metric{job="my_scrape_job"} > 10
labels:
    some_cool_label: with_a_useful_value

An alert will be generated if the query defined in the expr field returns any timeseries. One alert will be generated per timeseries.
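The "one alert per timeseries" behavior can be sketched in a few lines of Python. The evaluate_rule function and the host label are hypothetical and only model the idea; a real rule evaluator also tracks pending and firing state and `for` durations, which this sketch ignores.

```python
def evaluate_rule(alert_name, vector, threshold, extra_labels):
    """Return one alert (a dict of labels) per series whose value exceeds threshold."""
    # The comparison acts as a filter: series that fail it drop out of the result.
    firing = [(labels, value) for labels, value in vector if value > threshold]
    # Every surviving series becomes its own alert and keeps its own labels.
    return [
        {**labels, **extra_labels, "alertname": alert_name}
        for labels, _ in firing
    ]


# Pretend this is what my_cool_metric{job="my_scrape_job"} currently returns.
vector = [
    ({"job": "my_scrape_job", "host": "a"}, 3.0),
    ({"job": "my_scrape_job", "host": "b"}, 12.0),
    ({"job": "my_scrape_job", "host": "c"}, 42.0),
]

alerts = evaluate_rule(
    "MyUsefulAlert",
    vector,
    threshold=10,
    extra_labels={"some_cool_label": "with_a_useful_value"},
)
```

With this input, hosts b and c each produce a separate alert, while host a produces none.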

HeraclesQL

The HeraclesQL Python package provides the core DSL, config generation for VictoriaMetrics, and some useful utilities like embedded testing and development tools.

We want HeraclesQL to be immediately familiar to anybody who’s written PromQL or MetricsQL. To achieve that, we’ve designed HeraclesQL to feel like writing MetricsQL. It’s easier to see what I mean by looking at our example of a simple MetricsQL query:

A simple MetricsQL query

my_cool_metric{job="my_scrape_job"}

Now, here’s what it looks like in HeraclesQL:

Simple MetricsQL query in HeraclesQL

v.my_cool_metric(job="my_scrape_job")

HeraclesQL supports all of MetricsQL’s operators. For example:

MetricsQL query with binary operators

(left_metric{foo="bar"} + on(hostname) right_metric) and some_other_metric

Can be written in HeraclesQL like this:

Binary operators in HeraclesQL

(v.left_metric(foo="bar") + v.right_metric).on("hostname") & v.some_other_metric

Heracles’ API is fully compatible with MyPy and Pyright, so both your editor and MyPy can warn you about typing issues statically:

Poorly typed HeraclesQL expression

# example.py
ql.rate(v.my_metric(foo="bar"))

$ mypy example.py
example.py:78: error: Argument 1 to "rate" has incompatible type "SelectedInstantVector"; expected "RangeVector"  [arg-type]

Static typing is important to HeraclesQL’s design because we want HeraclesQL to offer an experience that’s even better than writing MetricsQL. Getting warnings from your editor makes it easier to iterate on queries without resorting to trial and error.
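To show why this kind of error can be caught statically, here is a minimal sketch of the underlying idea: give instant vectors and range vectors distinct Python types, and annotate functions with the type they accept. The class names mirror those in the MyPy output above, but the `over` method and everything else here is a simplified assumption, not HeraclesQL's actual implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RangeVector:
    """A selector with a lookback window, e.g. my_metric[5m]."""

    query: str


@dataclass(frozen=True)
class InstantVector:
    """A vector holding a single sample per timeseries."""

    query: str

    def over(self, window: str) -> RangeVector:
        """Attach a lookback window, turning this into a RangeVector."""
        return RangeVector(f"{self.query}[{window}]")


def rate(arg: RangeVector) -> InstantVector:
    """Only accepts a RangeVector, so a type checker rejects anything else."""
    return InstantVector(f"rate({arg.query})")


metric = InstantVector('my_metric{foo="bar"}')

ok = rate(metric.over("5m"))  # passes type checking
# rate(metric)                # a checker such as MyPy would flag this argument
```

Because the mistake shows up in the type signature, it is reported before any query reaches the database.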

How HeraclesQL Improves on MetricsQL

We designed HeraclesQL to address a number of persistent problems we’ve had with writing alerts for Prometheus and VictoriaMetrics. These fall into three categories:

Code Duplication
Expressiveness
Footguns

Let’s take a look at each one in detail.

Code Duplication

Code duplication was the biggest motivation for HeraclesQL. At HRT, we have thousands of alerts across dozens of teams. Nothing can be shared between alerts when using plain MetricsQL — if two teams have alerts on disk usage with different thresholds, they both would need to write almost identical alerts.

Before developing HeraclesQL, we considered a few options to reduce duplication. These were all some form of text-templating, and they all suffered from similar problems. Namely, text-templating treats the query as text, not a structured language. This means that templates have to be written very carefully to avoid introducing errors on misuse. Text templating also becomes quite unwieldy for complex templates.
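A small example of the kind of error text-templating invites, using Python's standard string.Template. The metric names are made up for illustration.

```python
from string import Template

# A text template for "ratio of two things".
ratio = Template("$numerator / $denominator")

# Works as intended when both arguments are plain metric names.
simple = ratio.substitute(
    numerator="errors_total",
    denominator="requests_total",
)

# Silently changes meaning once an argument is itself an expression:
# division binds tighter than addition, so only requests_a_total is
# used as the denominator.
broken = ratio.substitute(
    numerator="errors_total",
    denominator="requests_a_total + requests_b_total",
)
```

The template has no idea that its denominator is an expression, so it cannot add the parentheses that would preserve the intended meaning. Every template author has to remember to do that by hand.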

HeraclesQL allows type-safe templating inside expressions. Imagine that we want to write an alert that will fire when disk exhaustion is imminent. We might write something like this:

Root filesystem 10% remaining

predict_linear(
  (
    node_filesystem_avail_bytes{mountpoint="/"}
      /
    node_filesystem_size_bytes{mountpoint="/"}
  )[12h:],
  1h
)
  <
0.10

This expression returns a non-empty vector if a linear extrapolation of the fraction of available filesystem space, projected one hour ahead, is less than 10%.

But what if we want another alert that fires when we’re minutes away from exhaustion? With MetricsQL, our only option is to write another alert with a nearly duplicate expression:

Root filesystem 1% remaining

predict_linear(
  (
    node_filesystem_avail_bytes{mountpoint="/"}
      /
    node_filesystem_size_bytes{mountpoint="/"}
  )[1h:],
  15m
)
  <
0.01

This is a simple case, but this can cause a huge amount of code duplication for more complicated alerts.

In HeraclesQL, we can use a function to avoid duplication:

Disk space running out in HeraclesQL

from heracles import ql

v = ql.Selector()

def disk_space_running_out(
    lookback: ql.Duration,
    extrapolate: ql.Duration,
    threshold: float,
) -> ql.InstantVector:
    return (
        ql.predict_linear(
            (
                v.node_filesystem_avail_bytes(mountpoint="/")
                / v.node_filesystem_size_bytes(mountpoint="/")
            )[lookback:],
            extrapolate,
        )
        < threshold
    )


# first alert expression
disk_space_running_out(12 * ql.Hour, 1 * ql.Hour, 0.10)

# second alert expression
disk_space_running_out(1 * ql.Hour, 15 * ql.Minute, 0.01)

Now we can make as many variants of this expression as we want without duplicating any code!

HeraclesQL’s internal representation of an expression is essentially the abstract syntax tree of the corresponding MetricsQL expression. The HeraclesQL value is not used to query VictoriaMetrics directly. Instead, it’s used to generate MetricsQL. For example, if we wrote:

Rendering a HeraclesQL query

print(disk_space_running_out(12 * ql.Hour, 1 * ql.Hour, 0.10).render())

We’d print out MetricsQL:

MetricsQL generated from HeraclesQL

predict_linear(
  (
    node_filesystem_avail_bytes{mountpoint="/"}
      /
    node_filesystem_size_bytes{mountpoint="/"}
  )[12h:],
  3600000
)
  <
0.10

Generating MetricsQL is useful because it allows Heracles code to generate native VictoriaMetrics rule configuration files!
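For readers curious what "an AST that renders to MetricsQL" might look like, here is a deliberately tiny sketch. The Expr, Selector, Number, and BinaryOp classes and the free_fraction_below function are hypothetical and much simpler than HeraclesQL's real internals. The point is only that parameters become part of a tree and the query text is produced at the very end.

```python
from dataclasses import dataclass


class Expr:
    """Base class for nodes in a toy expression tree."""

    def render(self) -> str:
        raise NotImplementedError


@dataclass(frozen=True)
class Selector(Expr):
    name: str
    matchers: tuple = ()  # pairs of (label, value)

    def render(self) -> str:
        if not self.matchers:
            return self.name
        inner = ",".join(f'{label}="{value}"' for label, value in self.matchers)
        return f"{self.name}{{{inner}}}"


@dataclass(frozen=True)
class Number(Expr):
    value: float

    def render(self) -> str:
        return repr(self.value)


@dataclass(frozen=True)
class BinaryOp(Expr):
    op: str
    left: Expr
    right: Expr

    def render(self) -> str:
        # Always parenthesize, so nesting can never change the meaning.
        return f"({self.left.render()} {self.op} {self.right.render()})"


def free_fraction_below(mountpoint: str, threshold: float) -> Expr:
    """Build a tree for 'available / size < threshold' on one mountpoint."""
    matchers = (("mountpoint", mountpoint),)
    ratio = BinaryOp(
        "/",
        Selector("node_filesystem_avail_bytes", matchers),
        Selector("node_filesystem_size_bytes", matchers),
    )
    return BinaryOp("<", ratio, Number(threshold))


rendered = free_fraction_below("/", 0.1).render()
```

Since the tree knows where each sub-expression begins and ends, the renderer can add parentheses itself, which is exactly what text-templating cannot do.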

Expressiveness

While MetricsQL is concise for simple queries, it can easily become difficult to understand. Even using ‘WITH’ templating, there’s not much flexibility to define variables or avoid repetition. Here’s an alert to use as an example:

A complicated MetricsQL query

- alert: APIPoorThroughput
  expr: |-
    (
      (
        rate(http_request_duration_sum{path=~"/api/v1/.*"}[5m])
          /
        rate(http_request_duration_count_total[5m])
      )
        /
      changes_prometheus(
        (
          increase(http_response_size_bytes_sum[5m])
            /
          increase(http_response_size_bytes_count_total[5m])
        )
          *
        1M
      )
    )
      >
    0.005

This alert calculates the 5m average request duration per megabyte of response size.

Everybody has their own threshold for readability, but this alert is beginning to reach mine. I can understand it by reading it closely, but it’s difficult to grok right away. Since HeraclesQL queries are just Python code, it’s possible to use variables, branches, loops, and functions to make the query’s intent more clear. We can rewrite this alert using HeraclesQL’s configuration package:

APIPoorThroughput alert in HeraclesQL

from heracles import ql, config

rules = config.RuleBundle("hrtbeat.rules.yml")
v = rules.vectors()


@rules.alert()
def api_poor_throughput() -> config.Alert:
    lookback = 5 * ql.Minute

    # define a variable which calculates just the request duration
    average_request_duration = ql.increase(
        v.http_request_duration_sum(path=ql.RE("/api/v1/.*"))[lookback]
    ) / ql.increase(v.http_request_duration_count_total[lookback])

    # separately, calculate average response size
    average_response_size = (
        ql.increase(v.http_response_size_bytes_sum[lookback])
        / ql.increase(v.http_response_size_bytes_count_total[lookback])
    ) * (1000 * 1000)

    # use the two variables above to calculate the request duration in terms of response size
    request_duration_per_response_size = (
        average_request_duration / average_response_size
    )
    # finally, return an alert which tests the request_duration_per_response_size against the threshold
    return config.SimpleAlert(expr=request_duration_per_response_size > 0.005)

The HeraclesQL version of this alert has several advantages:

lookback is a variable so it doesn’t have to be repeated multiple times
average_request_duration and average_response_size can be calculated separately and stored in named variables
The final alert’s expression clearly explains what it does to a reader without needing any comments

With HeraclesQL, we can go even further than this – what if we want different thresholds for a few different paths? HeraclesQL’s config library lets us use parameterizable functions to register the same alert multiple times:

Defining multiple HeraclesQL alerts from one function

from heracles import ql, config

rules = config.RuleBundle("hrtbeat.rules.yml")
v = rules.vectors()


@rules.alert(
    "HandlerALowThroughput",
    path_prefix="/api/v1/handler_a",
    threshold=0.005,
)
@rules.alert(
    "HandlerBLowThroughput",
    path_prefix="/api/v1/handler_b",
    threshold=0.0001,
)
def api_poor_throughput(
    path_prefix: str,
    threshold: float,
    lookback: ql.Duration = 5 * ql.Minute,
) -> config.Alert:
    # define a variable which calculates just the request duration
    average_request_duration = ql.increase(
        v.http_request_duration_sum(path=ql.RE(f"{path_prefix}.*"))[lookback]
    ) / ql.increase(v.http_request_duration_count_total[lookback])

    # separately, calculate average response size
    average_response_size = (
        ql.increase(v.http_response_size_bytes_sum[lookback])
        / ql.increase(v.http_response_size_bytes_count_total[lookback])
    ) * (1000 * 1000)

    # use the two variables above to calculate the request duration in terms of
    # response size
    request_duration_per_response_size = (
        average_request_duration / average_response_size
    )

    # finally, return an alert which tests the
    # request_duration_per_response_size against the threshold
    return config.SimpleAlert(
        expr=request_duration_per_response_size > threshold,
    )

Named variables can also be defined outside of rule functions and referenced multiple times. By default, this will replicate that variable’s sub-expression wherever it is referenced. Often, it’s desirable to create a recording rule from this kind of shared expression. In HeraclesQL, recording rules can be defined using a similar API to alerts, but this can often be overkill for simple cases. Thus, HeraclesQL provides a less verbose way to wrap an expression to produce a recording rule:

A simple recording rule in HeraclesQL

from heracles import ql, config

rules = config.RuleBundle("hrtbeat.rules.yml")
v = rules.vectors()

lookback: ql.Duration = 5 * ql.Minute

# wrap our expression in rules.record to let heracles know that this should
# produce a recording rule. The resulting variable is a normal InstantVector
average_request_duration = rules.record(
    ql.increase(v.http_request_duration_sum[lookback])
    / ql.increase(v.http_request_duration_count_total[lookback]),
    "my_service:avg_request_duration:5m",
)


@rules.alert(
    "HandlerALowThroughput",
    path_prefix="/api/v1/handler_a",
    threshold=0.005,
)
@rules.alert(
    "HandlerBLowThroughput",
    path_prefix="/api/v1/handler_b",
    threshold=0.0001,
)
def api_poor_throughput(
    path_prefix: str,
    threshold: float,
) -> config.Alert:
    average_response_size = (
        ql.increase(v.http_response_size_bytes_sum[lookback])
        / ql.increase(v.http_response_size_bytes_count_total[lookback])
    ) * (1000 * 1000)

    # use the average_request_duration variable. When we generate MetricsQL, 
    # this variable will render as a reference to
    # my_service:avg_request_duration:5m
    request_duration_per_response_size = (
        average_request_duration(path=ql.RE(f"{path_prefix}.*")) / average_response_size
    )

    return config.SimpleAlert(
        expr=request_duration_per_response_size > threshold,
    )

@rules.alert()
def api_request_duration_very_long() -> config.Alert:
    return config.SimpleAlert(expr=average_request_duration > 10)

In both rules, average_request_duration will render as a reference to the recording rule.

Footguns

MetricsQL is extremely powerful, but with that power comes complexity. There’s a great deal of subtle behavior which isn’t always obvious. In our experience, the most common mistakes involve unexpectedly missing timeseries. Take a simple expression, for example:

A potentially broken MetricsQL alert

increase(
  my_service_background_job_started_count{job_name="very-important"}[10m]
)
  <
1

increase(
  my_service_background_job_started_count{job_name="very-important"}[10m]
)
  <
1

If my very-important job doesn’t run at least once in a 10m span, this will generate a vector with at least one timeseries and raise an alert. But what happens if my_service_background_job_started_count isn’t in VictoriaMetrics? In that case, the expression will always return an empty vector because the selector my_service_background_job_started_count{job_name="very-important"} will return nothing. I will not get any alerts even though my job definitely isn’t running!

This kind of bug can easily occur for metrics which are registered dynamically. If my_service registers new job_name values only when a job starts, this metric would never be generated until the job runs for the first time.

To avoid this, it’s best practice to add an absence check to selectors that reference timeseries that may be dynamically produced. However, this is easily overlooked. In practice, we found that alerts at HRT rarely handled timeseries absence correctly.

HeraclesQL provides a way to automatically add this check:

AlertForMissingData context in HeraclesQL

from heracles import ql, config

rules = config.RuleBundle("hrtbeat.rules.yml")
v = rules.vectors()

# use config.AlertForMissingData to generate a meta-alert that verifies
# all the timeseries we expect to select are actually selected
with rules.context(config.AlertForMissingData()):

    @rules.alert()
    def very_important_job_not_running() -> config.Alert:
        return config.SimpleAlert(
            # we don't need to include an absence check in our expression,
            # config.AlertForMissingData will do it for us
            expr=ql.increase(
                v.my_service_background_job_started_count(job_name="very-important")[
                    10 * ql.Minute
                ]
            )
            > 0,
        )

Before looking at what AlertForMissingData does, we need to take a small detour to explain what rules.context is doing. In addition to registering alerts, the RuleBundle object is able to perform post-processing. All HeraclesQL vector objects store the entire AST of the query, so post-processors are able to walk the AST and generate meta-queries. The context manager above is used to register AlertForMissingData as a post-processor for every alert defined within the context.

To understand what this does, it’s easiest to look at the output HeraclesQL’s config generation produces:

Generated output

  rules:
  - alert: VeryImportantJobNotRunning
    expr: |-
      increase(
        my_service_background_job_started_count{job_name="very-important"}[10m]
      )
        >
      0.0
  - alert: VeryImportantJobNotRunningDataMissing
    expr: absent(my_service_background_job_started_count{job_name="very-important"})

In addition to our alert, Heracles has generated a VeryImportantJobNotRunningDataMissing alert which fires when my_service_background_job_started_count is absent. AlertForMissingData generated this alert by traversing the AST of our first alert’s expression and generating an absence check for each root-level vector selector.
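The traversal described above can be sketched with a toy AST. The Selector and Op classes and the missing_data_rules function are hypothetical stand-ins written for this post, not HeraclesQL's real post-processor API. The sketch also simplifies by giving every generated rule the same name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Selector:
    """Leaf node: an already-rendered vector selector."""

    text: str


@dataclass(frozen=True)
class Op:
    """Inner node: a function call or operator with child expressions."""

    name: str
    children: tuple


def selectors(node):
    """Yield every Selector in the tree, depth first."""
    if isinstance(node, Selector):
        yield node
    elif isinstance(node, Op):
        for child in node.children:
            yield from selectors(child)
    # Anything else (for example a plain number) has no selectors.


def missing_data_rules(alert_name, expr):
    """Build one absence-check rule per distinct selector used by the alert."""
    # dict.fromkeys de-duplicates while keeping first-seen order.
    distinct = dict.fromkeys(s.text for s in selectors(expr))
    return [
        {"alert": f"{alert_name}DataMissing", "expr": f"absent({text})"}
        for text in distinct
    ]


expr = Op(
    ">",
    (
        Op(
            "increase",
            (Selector('my_service_background_job_started_count{job_name="very-important"}'),),
        ),
        0,
    ),
)

meta = missing_data_rules("VeryImportantJobNotRunning", expr)
```

The alert author never writes the absence check. It falls out of the structure of the expression they already wrote.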

Not all common footguns are so straightforward. For example, in MetricsQL, binary operations inherently perform setwise operations. Something as simple as this can have multiple meanings:

A simple binary operation

vector_x + vector_y

The ‘+’ operator can be thought of as working in two steps: First, it finds all the timeseries in vector_y which have matching labels to timeseries in vector_x, keeping only the intersection. Next, it adds the values of the remaining timeseries to produce a new vector. The net result is an operation which can mean either “add” or “filter” depending on the context. When writing a query, it’s easy to accidentally filter when you only meant to add (in fact, HeraclesQL’s standard library implements generic joins using the + operator).
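Here is a minimal model of that two-step behavior in plain Python, where each vector is a dict keyed by its label set. The add function and the host label are illustrative assumptions. Real vector matching also supports modifiers such as on() and ignoring(), which this sketch leaves out.

```python
def add(left, right):
    """Model a one-to-one '+': only label sets present on both sides survive."""
    return {
        labels: left[labels] + right[labels]
        for labels in left
        if labels in right  # step 1: keep the intersection
    }  # step 2: add the values of what is left


vector_x = {
    frozenset({("host", "a")}): 1.0,
    frozenset({("host", "b")}): 2.0,
}
vector_y = {
    frozenset({("host", "a")}): 10.0,
}

result = add(vector_x, vector_y)

# host "b" vanished without any error: the addition also acted as a filter.
cardinality_changed = len(result) != len(vector_x)
```

Nothing in the expression itself says whether dropping host b was intended, which is why this kind of bug is easy to miss in review.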

For these ambiguous cases, HeraclesQL provides ‘annotations’ which can be attached to queries to signal intent. In this case, we can write something like this:

Annotated binary operation

from heracles import ql, config
from heracles.ql import assertions

v = ql.Selector()

(v.vector_x + v.vector_y).annotate(assertions.no_cardinality_change)

When used with the AlertForAssertions context, this will generate a meta-alert which fires when the assertion is not true:

Generated invalid data meta-alert

  rules:
  - alert: ExampleAlert
    expr: vector_x + vector_y
  - alert: ExampleAlertInvalidData
    expr: absent(count(vector_x) == count(vector_x + vector_y))

These are just a few examples of meta-alerts in HeraclesQL. Both contexts and annotations are just Python code that processes the expression’s AST. It’s possible to write custom implementations that can do just about anything you can imagine!

Pulling it Together – Jenkins Alerts

HeraclesQL’s config library brings all of these features together to make a powerful code-as-configuration system. As a more realistic example, let’s write a small library for alerting based on the outcome of Jenkins jobs using the Jenkins Prometheus plugin: https://plugins.jenkins.io/prometheus/

We’ll work under the assumption that we have multiple teams that want to alert on the status of their jobs. Jobs can do many things — they can be periodic jobs, they can be builds, and they can be manually triggered one-offs. Using plain VictoriaMetrics config, we’re likely to end up with a lot of near-duplicate alerts.

A basic alert might look like this:

Example jenkins alert

rules:
- alert: JenkinsJobFailing
  expr: default_jenkins_builds_last_build_result{job_name="Org/Team/Job"} == 0
  for: 10m
  annotations:
    summary: Job {{ $labels.job_name }} has been failing for the last 10m

However, there’s already a problem — Jenkins registers metrics dynamically. If the job doesn’t exist for some reason (like bad configuration or a branch in a repo being deleted), this alert will not fire. So ideally we’ll check for absence as well:

Improved jenkins alert

rules:
- alert: JenkinsJobFailing
  expr: |-
    (default_jenkins_builds_last_build_result{job_name="Org/Team/Job"} == 0)
      or
    absent(default_jenkins_builds_last_build_result{job_name="Org/Team/Job"})
  for: 10m
  annotations:
    summary: Job {{ $labels.job_name }} has been failing for the last 10m

Hopefully everyone always remembers to add that!

There’s also one more complication: The Jenkins Prometheus plugin exports two metrics describing the last build result — default_jenkins_builds_last_build_result and default_jenkins_builds_last_build_result_ordinal. Both metrics are gauges that represent the status as a numerical value. The non-ordinal version uses 0 for a set of failed states and 1 for success. The ordinal version represents each state as a separate value (0-4). Confusingly, success is 0!

With many teams writing alerts for Jenkins jobs, this causes a few annoyances:

Some alerts will use default_jenkins_builds_last_build_result and some will use default_jenkins_builds_last_build_result_ordinal even if they detect the same condition.
It’s easy to mistakenly check for the wrong values from default_jenkins_builds_last_build_result_ordinal since they differ from the values of default_jenkins_builds_last_build_result and appear in code as bare numbers.

We found ourselves in this situation at HRT. Some alerts used _ordinal, some alerts forgot the absent check, and some alerts checked for the wrong result codes.
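One way to take the guesswork out of those bare numbers is to name them. The sketch below assumes the ordinal metric follows the ordering of Jenkins’ build results (success, unstable, failure, not built, aborted). Only the 0-4 range and "success is 0" are stated above, so treat the remaining names as an assumption to verify against the plugin. OrdinalResult and is_failing are hypothetical helpers.

```python
from enum import IntEnum


class OrdinalResult(IntEnum):
    """Assumed meaning of each value of the *_ordinal metric."""

    SUCCESS = 0
    UNSTABLE = 1
    FAILURE = 2
    NOT_BUILT = 3
    ABORTED = 4


def is_failing(ordinal: float) -> bool:
    """Treat anything worse than UNSTABLE as failing (a policy choice for this sketch)."""
    return ordinal > OrdinalResult.UNSTABLE


# Thresholds now read as names instead of magic numbers.
failing_threshold = int(OrdinalResult.UNSTABLE)
```

With named values, a reviewer can tell at a glance whether an alert treats an unstable build as a failure, rather than having to look up what 1 means.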

If you’ve read the previous section, you may already have some ideas about how Heracles can improve this. At the very least, we can use AlertForMissingData to ensure that nobody forgets the absent check anymore:

Jenkins alert in HeraclesQL

from heracles import ql, config

rules = config.RuleBundle("hrtbeat.rules.yml")
v = rules.vectors()

# Alerts defined inside this context also get the missing-data check.
with rules.context(config.AlertForMissingData()):

    @rules.alert()
    def jenkins_job_failing() -> config.Alert:
        return config.SimpleAlert(
            # The non-ordinal metric is 1 on success and 0 on failure.
            expr=(v.default_jenkins_builds_last_build_result(jenkins_job="Org/Team/Job") < 1),
            for_=15 * ql.Minute,
            annotations={
                "summary": "Job is failing for 15m",
            },
        )

Next, we can add parameters to this alerting function so that we can define multiple alerts by stacking @rules.alert decorators:

Alerts for multiple jobs

from heracles import ql, config

rules = config.RuleBundle("hrtbeat.rules.yml")
v = rules.vectors()

with rules.context(config.AlertForMissingData()):

    # One alert per decorator; its arguments are passed to the function below.
    @rules.alert("JobFrequent", 15 * ql.Minute)
    @rules.alert("JobInfrequent", 12 * ql.Hour)
    @rules.alert("JobImportant", 20 * ql.Minute)
    def jenkins_job_failing(job: str, duration: ql.Duration) -> config.Alert:
        return config.SimpleAlert(
            # Ordinal results above 1 are FAILURE, NOT_BUILT, and ABORTED.
            expr=(
                v.default_jenkins_builds_last_build_result_ordinal(
                    jenkins_job=f"Org/Team/{job}"
                )
                > 1
            ),
            for_=duration,
            annotations={
                "summary": f"{job} is failing for {duration}",
            },
        )

There’s still one major limitation with this implementation. As written, every alert uses the same threshold: it fires only when the job’s ordinal result is greater than 1 (FAILURE, NOT_BUILT, or ABORTED). While that’s probably fine for most alerts, it might not be for some important ones. For example, perhaps JobImportant’s alert is supposed to fire even when the job’s ordinal result is 1 (for UNSTABLE). Since we can’t use the same threshold for every alert, we need to add a parameter for build status and generate a more complex expression:

Alerts for multiple jobs with different statuses

import functools
from typing import Iterable

from heracles import ql, config

rules = config.RuleBundle("hrtbeat.rules.yml")
v = rules.vectors()

with rules.context(config.AlertForMissingData()):

    @rules.alert("JobFrequent", duration=15 * ql.Minute)
    @rules.alert("JobInfrequent", duration=12 * ql.Hour)
    @rules.alert("JobImportant", duration=20 * ql.Minute, failure_status=(1, 2, 3, 4))
    def jenkins_job_failing(
        job: str,
        duration: ql.Duration,
        # Default matches the earlier `> 1` check: FAILURE, NOT_BUILT, ABORTED.
        failure_status: Iterable[int] = (2, 3, 4),
    ) -> config.Alert:
        selector = v.default_jenkins_builds_last_build_result_ordinal(
            jenkins_job=f"Org/Team/{job}"
        )
        # Build one comparison per status, then join them all with `or`.
        sub_exprs = [selector == s for s in failure_status]
        final_expr = functools.reduce(ql.InstantVector.or_, sub_exprs)

        return config.SimpleAlert(
            expr=final_expr,
            for_=duration,
            annotations={
                "summary": f"{job} is failing for {duration}",
            },
        )

Great! Now the expression is written in terms of a set of explicit statuses we want to alert on. By overriding the value of failure_status for JobImportant, we are able to alert when it is in anything other than a success status. However, now our alert definitions depend on passing around opaque integer statuses. That’s a step backwards in readability and makes it more likely that new alerts will use the wrong statuses by mistake.

Luckily, HeraclesQL config is Python code! We can just add a new type:

Modelling Jenkins statuses in Python

import enum
from typing import Iterable

class JenkinsJobResult(int, enum.Enum):
    success = 0
    unstable = 1
    failure = 2
    not_built = 3
    aborted = 4

    @staticmethod
    def success_statuses() -> Iterable["JenkinsJobResult"]:
        return (JenkinsJobResult.success, JenkinsJobResult.unstable)

    @staticmethod
    def failure_statuses() -> Iterable["JenkinsJobResult"]:
        return (
            JenkinsJobResult.failure,
            JenkinsJobResult.not_built,
            JenkinsJobResult.aborted,
        )

And we can update the alert to use the new enum:

Using the new Jenkins status model

import functools
from typing import Iterable

from heracles import ql, config

rules = config.RuleBundle("hrtbeat.rules.yml")
v = rules.vectors()

with rules.context(config.AlertForMissingData()):

    @rules.alert("JobFrequent", duration=15 * ql.Minute)
    @rules.alert("JobInfrequent", duration=12 * ql.Hour)
    @rules.alert(
        "JobImportant",
        duration=20 * ql.Minute,
        failure_status=(
            JenkinsJobResult.unstable,
            JenkinsJobResult.failure,
            JenkinsJobResult.not_built,
            JenkinsJobResult.aborted,
        ),
    )
    def jenkins_job_failing(
        job: str,
        duration: ql.Duration,
        failure_status: Iterable[
            JenkinsJobResult
        ] = JenkinsJobResult.failure_statuses(),
    ) -> config.Alert:
        selector = v.default_jenkins_builds_last_build_result_ordinal(
            jenkins_job=f"Org/Team/{job}"
        )
        # Compare against each status's numeric value, then join with `or`.
        sub_exprs = [selector == s.value for s in failure_status]
        final_expr = functools.reduce(ql.InstantVector.or_, sub_exprs)

        return config.SimpleAlert(
            expr=final_expr,
            for_=duration,
            annotations={
                "summary": f"{job} is failing for {duration}",
            },
        )

Much better! Now users don’t have to remember the mapping between logical status and numeric code.
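To make the folding step concrete, here is a minimal sketch of the same idea using plain Python strings instead of HeraclesQL expression objects. The render_status_expr helper is hypothetical and exists only for this illustration, and the exact text HeraclesQL generates may well differ. It only shows how one comparison per status is reduced into a single `or` chain.

```python
import enum
import functools
from typing import Iterable


class JenkinsJobResult(int, enum.Enum):
    success = 0
    unstable = 1
    failure = 2
    not_built = 3
    aborted = 4


def render_status_expr(selector: str, statuses: Iterable[JenkinsJobResult]) -> str:
    """Hypothetical helper: fold one `selector == N` per status into an `or` chain."""
    comparisons = [f"({selector} == {s.value})" for s in statuses]
    # Like the reduce in the alert above, this raises TypeError on an empty list.
    return functools.reduce(lambda left, right: f"{left} or {right}", comparisons)


selector = (
    'default_jenkins_builds_last_build_result_ordinal{jenkins_job="Org/Team/JobImportant"}'
)
expr = render_status_expr(
    selector, (JenkinsJobResult.failure, JenkinsJobResult.aborted)
)
print(expr)
```

Because callers pass enum members rather than integers, a reader of the alert definition sees failure and aborted instead of 2 and 4, while the generated query still compares against the numeric codes.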

What if we want to share this alert between multiple teams? We could just add more decorators to jenkins_job_failing, but that’ll get hard to read pretty fast. Luckily, I have one more trick up my sleeve. HeraclesQL has an alert extension syntax:

Extending a rule

rules.alert(
    "JobFrequentFailing",
    # `job` is passed explicitly because jenkins_job_failing requires it.
    rules.extends(jenkins_job_failing, job="JobFrequent", duration=15 * ql.Minute),
)

Now we can update our rules so that new alerts are more declarative:

Declarative alert generation

from heracles import ql, config

# JenkinsJobResult and jenkins_job_failing are defined in the previous examples.

rules = config.RuleBundle("hrtbeat.rules.yml")
v = rules.vectors()

# One entry per job; keys are the keyword arguments of jenkins_job_failing.
jenkins_alerts = {
    "JobFrequent": {"duration": 15 * ql.Minute},
    "JobInfrequent": {"duration": 12 * ql.Hour},
    "JobImportant": {
        "duration": 20 * ql.Minute,
        "failure_status": (
            JenkinsJobResult.unstable,
            JenkinsJobResult.failure,
            JenkinsJobResult.not_built,
            JenkinsJobResult.aborted,
        ),
    },
}

with rules.context(config.AlertForMissingData()):
    for job, params in jenkins_alerts.items():
        rules.alert(
            f"{job}Failing",
            # `job` is passed explicitly because jenkins_job_failing requires it.
            rules.extends(jenkins_job_failing, job=job, **params),
        )

And that’s it! New teams can even create new modules and extend this alert from there. This is much easier to extend than a VictoriaMetrics rule config file with a growing list of nearly duplicated rules. It’s also much more difficult to make mistakes — we’re able to use Python’s type system to avoid magic numbers and use HeraclesQL’s config APIs to automatically add meta-alerts to catch common mistakes.
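For readers who want to see the table-driven pattern in isolation, here is a small sketch that runs without HeraclesQL installed. ToyRuleBundle and its register method are hypothetical stand-ins written for this example, not the HeraclesQL API, and alerts are modelled as plain dictionaries. The point is only that one shared factory plus a table of parameters is enough to produce every alert.

```python
from typing import Any, Callable, Dict, Tuple

Alert = Dict[str, Any]


class ToyRuleBundle:
    """Hypothetical registry used only for illustration (not HeraclesQL)."""

    def __init__(self) -> None:
        self.alerts: Dict[str, Alert] = {}

    def register(self, name: str, factory: Callable[..., Alert], **params: Any) -> None:
        # Call the shared factory once per table entry and store the result.
        self.alerts[name] = factory(**params)


def jenkins_job_failing(
    job: str, duration_minutes: int, failure_status: Tuple[int, ...] = (2, 3, 4)
) -> Alert:
    selector = (
        f'default_jenkins_builds_last_build_result_ordinal{{jenkins_job="Org/Team/{job}"}}'
    )
    expr = " or ".join(f"({selector} == {status})" for status in failure_status)
    return {"expr": expr, "for": f"{duration_minutes}m"}


# The table is the only thing a team edits when adding a new job.
jenkins_alerts = {
    "JobFrequent": {"duration_minutes": 15},
    "JobInfrequent": {"duration_minutes": 720},
    "JobImportant": {"duration_minutes": 20, "failure_status": (1, 2, 3, 4)},
}

rules = ToyRuleBundle()
for job, params in jenkins_alerts.items():
    rules.register(f"{job}Failing", jenkins_job_failing, job=job, **params)

for name, alert in rules.alerts.items():
    print(name, alert["for"])
```

Adding a fourth job means adding one dictionary entry. The expression logic lives in exactly one place, so a fix to the factory reaches every alert.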

Development and Testing

Our goal with HeraclesQL is to make the experience of developing alerts better. This extends beyond the code you write to the experience of writing and testing it. To that end, I’d like to highlight two development tools we’re working on: Delos and Hermes.

Delos

Delos is a small development environment for writing HeraclesQL queries. When writing MetricsQL, it’s normal to iterate on queries inside the VictoriaMetrics UI or a similar tool which can graph query results on the fly. Developers lose this ability when writing HeraclesQL because those UIs don’t support HeraclesQL. Delos fills this gap by proxying a live instance of VictoriaMetrics with a query populated from a HeraclesQL Python project.

Delos is still somewhat experimental – it works, but there are plenty of rough edges. For this reason, it’s not included in the wheel just yet (but it can be installed directly as a Go module). Delos has a very simple implementation. The UI is provided by an iframe that embeds the VictoriaMetrics UI from a target VictoriaMetrics instance. Your HeraclesQL query is injected into the UI via a small HTTP API in the Delos server, which executes a Python script that generates MetricsQL from your query.

Using Delos is easy:

Delos interactive mode

$ delos --vm-url <a-vm-instance>
Found virtual env at '/home/you/.venv'
Found python at '/home/you/.venv/bin/python'
 -- Delos Interactive Mode --
-> http://localhost:5000/dev <-
QUERY:
42.0
Enter command:
    'q' - exit
    'e' - edit

Press ‘e’ to open the Delos query file. If you’re using a Python virtual env, this file will be embedded in that environment so you can reference modules from your project.

Delos query file

# Use this file to define the query which Delos will display!
# This script executes inside the provided venv, so code from your project is directly callable.
#
# Delos expects a variable named 'VECTOR' which contains a query. `VECTOR` can also be a function.
from heracles import ql

VECTOR = ql.ScalarLiteral(42)

So for example, to view an alert query, do this:

Referencing project modules from Delos

# Use this file to define the query which Delos will display!
# This script executes inside the provided venv, so code from your project is directly callable.
#
# Delos expects a variable named 'VECTOR' which contains a query. `VECTOR` can also be a function.
from heracles import ql
from my_heracles_rules import my_team_rules

VECTOR = my_team_rules.MyUsefulAlert.expr

Save this file, and Delos will automatically populate the query into the UI. When you edit MyUsefulAlert, Delos will automatically update the query in the UI. We think Delos provides a better experience than manually writing a rules.yaml file because you can iterate on alerts directly from the definitions. With a rules.yaml, you still have to copy alerts between the VictoriaMetrics UI and the rules.yaml file.
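The query-file convention described above (a module-level VECTOR that is either a query or a function returning one) could be resolved by a loader roughly like the one below. This is a sketch under assumptions, not the script Delos actually runs: load_vector is a hypothetical name, and plain values stand in for HeraclesQL expressions so the example runs without the package installed.

```python
import inspect
import runpy
import tempfile
from pathlib import Path
from typing import Any


def load_vector(query_file: str) -> Any:
    """Run the query file and return its VECTOR, calling it if it is a function."""
    namespace = runpy.run_path(query_file)
    if "VECTOR" not in namespace:
        raise KeyError(f"{query_file} does not define VECTOR")
    vector = namespace["VECTOR"]
    # isfunction is stricter than callable(), which could also match
    # expression objects that happen to define __call__.
    return vector() if inspect.isfunction(vector) else vector


# Demo with plain values standing in for HeraclesQL expressions.
with tempfile.TemporaryDirectory() as tmp:
    value_file = Path(tmp) / "value_query.py"
    value_file.write_text("VECTOR = 42.0\n")
    function_file = Path(tmp) / "function_query.py"
    function_file.write_text("def VECTOR():\n    return 'up == 0'\n")

    value_result = load_vector(str(value_file))
    function_result = load_vector(str(function_file))

print(value_result, function_result)
```

Running the file in the project's virtual env is what makes imports such as my_heracles_rules resolve, which matches the behaviour described for the Delos query file.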

Hermes

If you’re familiar with writing PromQL or MetricsQL alerts, you’re probably familiar with promtool test rules and vmalert-tool unittest. These commands allow rules to be tested against a real timeseries database with synthetic data. It’s still possible to use these tools with HeraclesQL generated alerts, but we wanted to provide a better experience.

Hermes re-implements the behavior of vmalert-tool with a programmatic interface. Python code can generate synthetic timeseries, write them to the Hermes client, execute a query, and read back the results. This can be used to implement unit tests for HeraclesQL queries that actually test against a real VictoriaMetrics server.

Unlike vmalert-tool tests, Hermes tests can use imperative code to generate input timeseries. Similarly, Hermes returns the raw query result to the client, allowing imperative logic for checking conditions of the result. This makes it much easier to test complex conditions.
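As a rough illustration of what imperative inputs and imperative checks can look like, here is a sketch in plain Python. The Hermes client API isn’t shown in this post, so the example deliberately avoids it: build_series and longest_run_seconds are hypothetical helpers, and the check runs directly on the generated samples rather than on a result returned from a VictoriaMetrics server.

```python
from typing import Callable, List, Tuple

Sample = Tuple[int, float]  # (unix timestamp in seconds, value)


def build_series(start: int, step: int, values: List[float]) -> List[Sample]:
    """Turn a list of values into evenly spaced samples."""
    return [(start + i * step, value) for i, value in enumerate(values)]


def longest_run_seconds(
    samples: List[Sample], predicate: Callable[[float], bool]
) -> int:
    """Length of the longest unbroken stretch of samples matching predicate."""
    longest = 0
    run_start = None
    for timestamp, value in samples:
        if predicate(value):
            if run_start is None:
                run_start = timestamp
            longest = max(longest, timestamp - run_start)
        else:
            run_start = None
    return longest


# Input: a job that succeeds (1) for 10 minutes, then fails (0) for 20.
samples = build_series(start=1_700_000_000, step=60, values=[1.0] * 10 + [0.0] * 20)

# Check: the failing stretch is long enough to satisfy a 15 minute `for`.
failing_for = longest_run_seconds(samples, lambda value: value < 1)
print(failing_for >= 15 * 60)
```

In a real test, the generated samples would be written to the server and the assertion would be made against the query result. The loops and helper functions are the part that a declarative test file can’t easily express.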

Hermes is still early in its development, so it has some rough edges. However, we think it’ll be a great mechanism for fully native HeraclesQL testing as it improves.

Wrapping Up

We’re very excited about the potential of HeraclesQL. We’re already using it for many alerts inside HRT, and we plan to eventually replace all of our MetricsQL alerts with it. HeraclesQL isn’t done yet, either — we plan to continue developing the project, especially now that it’s available to the community as an open source project!

Future Possibilities

Prometheus Support

HeraclesQL supports MetricsQL because we only use VictoriaMetrics at HRT. However, it’d be entirely possible to support Prometheus’ PromQL if there’s community interest! HeraclesQL was written so that the core types would be easy to swap.

Execution Support

HeraclesQL supports generating query strings and generating static configs. This is what’s necessary for alerting, but VictoriaMetrics can be used in many other cases. In the future, we could add support for executing HeraclesQL queries in the same way one can execute a SQLAlchemy query.
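Until something like that exists, a generated query string can already be run by hand, because VictoriaMetrics exposes the Prometheus-compatible /api/v1/query endpoint. The sketch below only builds the request URL and makes no network call. The host name is a placeholder, instant_query_url is a hypothetical helper, and a cluster deployment would use a different URL prefix.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def instant_query_url(base_url: str, query: str) -> str:
    """Build a URL for the Prometheus-compatible instant query endpoint."""
    params = urlencode({"query": query})
    return f"{base_url.rstrip('/')}/api/v1/query?{params}"


# A MetricsQL string such as one produced from a HeraclesQL expression.
query = 'default_jenkins_builds_last_build_result{jenkins_job="Org/Team/Job"} < 1'

# Placeholder host; 8428 is the default port of single-node VictoriaMetrics.
url = instant_query_url("http://victoriametrics.example:8428/", query)
print(url)

# Round-trip check: decoding the URL gives back the original query text.
decoded = parse_qs(urlparse(url).query)["query"][0]
print(decoded == query)
```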