I’m on the Systems Dev team at HRT. My team is responsible for software development that supports Systems, Network, and Datacenter Engineers who maintain the computers HRT relies on for trading and research. While my team has a large variety of responsibilities, I’m focused on developing tools which improve monitoring and alerting. In this post, I’ll be doing a deep dive on my team’s newly open-sourced project: HeraclesQL.
At HRT, VictoriaMetrics is an important part of our monitoring stack. It centers around a timeseries database that stores metrics about services and devices. VictoriaMetrics has its own query language for fetching timeseries data called MetricsQL. Over time, we’ve written thousands of alerts using MetricsQL queries.
While we love VictoriaMetrics and MetricsQL, we’ve encountered some pain points with MetricsQL as the number of teams using VictoriaMetrics has grown. These break down into three categories: excessive code duplication, poor expressiveness in complex queries, and subtle footguns.
HeraclesQL is designed to solve these problems while preserving what makes MetricsQL great. It is a Domain Specific Language (DSL) for writing MetricsQL queries in Python. If you’re familiar with SQLAlchemy for SQL, it can be thought of as filling a similar role. Under the hood, HeraclesQL Python code is essentially a type-safe builder which generates MetricsQL. We’ve published it as a Python package on PyPI and made it available on GitHub as open source software under the MIT license. Some features include:
- MetricsQL-like syntax — HeraclesQL will be immediately familiar to anyone who’s written MetricsQL or PromQL.
- Custom Functions, Parameterizable Expressions, and Variables — No more copy and pasting behavior between alerts.
- Static Type Safety — MyPy and your editor will catch common problems before they occur.
- Meta-alerts — Generate alerts about your alerts to avoid common pitfalls.
Before we do our deep dive, let’s briefly discuss MetricsQL.
A Brief Intro to MetricsQL
MetricsQL is the language for querying the VictoriaMetrics timeseries database. It’s largely based on PromQL, which is the query language for the Prometheus timeseries database. MetricsQL and PromQL are extremely similar, so if you know one you essentially know the other. Both VictoriaMetrics and Prometheus are monitoring platforms, which means they typically store metrics about services or devices.
Data is stored in the form of numerically-valued timeseries. Each timeseries is identified by a set of string key-value pairs called labels. Queries use string matching operations to select timeseries based on the value of labels. The result of a query is a vector that contains zero or more timeseries.
Let’s look at some examples:
my_cool_metric{job="my_scrape_job"}
This query selects a vector of timeseries with the name my_cool_metric and the label job="my_scrape_job". The parts of the query inside the curly braces are called matchers. There’s no way to know in advance how many timeseries this query will select.
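To make the selection model concrete, here is a minimal sketch in plain Python. It is not how VictoriaMetrics is implemented; the `Series`, `make_series`, and `select` names are made up for illustration, and only exact-match matchers are modeled.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Series:
    """A toy timeseries: a label set plus its most recent value."""
    labels: Tuple[Tuple[str, str], ...]  # sorted (key, value) pairs
    value: float


def make_series(value: float, **labels: str) -> Series:
    return Series(tuple(sorted(labels.items())), value)


def select(store: List[Series], name: str, **matchers: str) -> List[Series]:
    """Return every series whose name and labels equal the given matchers."""
    wanted = {"__name__": name, **matchers}
    return [
        s for s in store
        if all(dict(s.labels).get(k) == v for k, v in wanted.items())
    ]


store = [
    make_series(3.0, __name__="my_cool_metric", job="my_scrape_job", instance="a"),
    make_series(7.0, __name__="my_cool_metric", job="my_scrape_job", instance="b"),
    make_series(1.0, __name__="my_cool_metric", job="other_job", instance="a"),
]

# Two series match: the matchers say nothing about `instance`,
# so the size of the result depends entirely on what is stored.
result = select(store, "my_cool_metric", job="my_scrape_job")
```

The same query returns zero, one, or many timeseries depending on the data, which is why the size of a result can't be known from the query alone.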
MetricsQL supports many built-in functions and operators. For example, a query can use the sum function to calculate a timeseries which is the sum of all timeseries matching a selector:
sum(http_request_duration{method="POST"})
Since VictoriaMetrics is a timeseries database, many functions are designed to calculate values over time. For example, the rate function can calculate a timeseries which contains the growth rate of a timeseries over a specified lookback:
rate(http_request_count_total{path=~"/api/v2/.*"}[5m])
There’s too much behavior to detail in this post, so I’ll stop here. The VictoriaMetrics docs are a great resource for learning more about MetricsQL.
The primary purpose of MetricsQL is to write alerts which notify users of anomalous conditions in a service or device. To this end, VictoriaMetrics supports configured alerting rules. Here’s an example:
alert: MyUsefulAlert
expr: my_cool_metric{job="my_scrape_job"} > 10
labels:
  some_cool_label: with_a_useful_value
An alert will be generated if the query defined in the expr field returns any timeseries. One alert will be generated per timeseries.
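The "one alert per returned timeseries" behavior can be sketched in a few lines of Python. This is a simplified model, not vmalert's implementation; `greater_than` and `evaluate_rule` are hypothetical helpers and the sample values are invented.

```python
from typing import Dict, List, Tuple

Labels = Tuple[Tuple[str, str], ...]
Vector = Dict[Labels, float]


def greater_than(vector: Vector, threshold: float) -> Vector:
    """Comparison operators act as filters: only matching series survive."""
    return {labels: v for labels, v in vector.items() if v > threshold}


def evaluate_rule(name: str, vector: Vector, extra_labels: Dict[str, str]) -> List[Dict[str, str]]:
    """Produce one alert (a label set) per timeseries in the query result."""
    alerts = []
    for labels in vector:
        alert = dict(labels)
        alert.update(extra_labels)
        alert["alertname"] = name
        alerts.append(alert)
    return alerts


selected: Vector = {
    (("instance", "a"), ("job", "my_scrape_job")): 12.0,
    (("instance", "b"), ("job", "my_scrape_job")): 4.0,
    (("instance", "c"), ("job", "my_scrape_job")): 25.0,
}

alerts = evaluate_rule(
    "MyUsefulAlert",
    greater_than(selected, 10),
    {"some_cool_label": "with_a_useful_value"},
)
```

Instances a and c exceed the threshold, so two alerts are produced. An empty result produces no alerts at all, which matters later when we talk about footguns.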
HeraclesQL
The HeraclesQL Python package provides the core DSL, config generation for VictoriaMetrics, and some useful utilities like embedded testing and development tools.
We want HeraclesQL to be immediately familiar to anybody who’s written PromQL or MetricsQL. To achieve that, we’ve designed HeraclesQL to feel like writing MetricsQL. It’s easier to see what I mean by looking at our example of a simple MetricsQL query:
my_cool_metric{job="my_scrape_job"}
Now, here’s what it looks like in HeraclesQL:
v.my_cool_metric(job="my_scrape_job")
HeraclesQL supports all of MetricsQL’s operators. For example:
(left_metric{foo="bar"} + on(hostname) right_metric) and some_other_metric
can be written in HeraclesQL like this:
(v.left_metric(foo="bar") + v.right_metric).on("hostname") & v.some_other_metric
Heracles’ API is completely MyPy and Pyright compliant, so both your editor and MyPy can warn you about typing issues statically:
# example.py
ql.rate(v.my_metric(foo="bar"))
$ mypy example.py
example.py:78: error: Argument 1 to "rate" has incompatible type "SelectedInstantVector"; expected "RangeVector" [arg-type]
Static typing is important to HeraclesQL’s design because we want HeraclesQL to offer an experience that’s even better than writing MetricsQL. Getting warnings from your editor makes it easier to iterate on queries without resorting to trial and error.
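To show why distinct vector types make this kind of check possible, here is a toy model in plain Python. The class and function names mirror the error message above, but they are my own stand-ins and not HeraclesQL’s actual classes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstantVector:
    query: str

    def __getitem__(self, lookback: str) -> "RangeVector":
        # Indexing with a lookback turns an instant vector into a range vector.
        return RangeVector(f"{self.query}[{lookback}]")


@dataclass(frozen=True)
class RangeVector:
    query: str


def rate(vector: RangeVector) -> InstantVector:
    # The annotation lets a type checker reject InstantVector arguments;
    # the runtime check mirrors it for anyone running without one.
    if not isinstance(vector, RangeVector):
        raise TypeError("rate() expects a range vector, e.g. metric[5m]")
    return InstantVector(f"rate({vector.query})")


selector = InstantVector('my_metric{foo="bar"}')
ok = rate(selector["5m"])
# rate(selector)  # a type checker flags this: InstantVector is not a RangeVector
```

Because the lookback changes the type, forgetting it becomes a type error rather than a query that fails at evaluation time.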
How HeraclesQL Improves on MetricsQL
We designed HeraclesQL to address a number of persistent problems we’ve had with writing alerts for Prometheus and VictoriaMetrics. These fall into three categories:
- Code Duplication
- Expressiveness
- Footguns
Let’s take a look at each one in detail.
Code Duplication
Code duplication was the biggest motivation for HeraclesQL. At HRT, we have thousands of alerts across dozens of teams. Nothing can be shared between alerts when using plain MetricsQL — if two teams have alerts on disk usage with different thresholds, they both would need to write almost identical alerts.
Before developing HeraclesQL, we considered a few options to reduce duplication. These were all some form of text-templating, and they all suffered from similar problems. Namely, text-templating treats the query as text, not a structured language. This means that templates have to be written very carefully to avoid introducing errors on misuse. Text templating also becomes quite unwieldy for complex templates.
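As a small illustration of the problem, here is a sketch using Python’s standard string.Template. The template and parameter names are invented for this example and are not one of the approaches we actually evaluated.

```python
from string import Template

# A text template for a disk-exhaustion query. The template engine knows
# nothing about MetricsQL, so it cannot validate what gets substituted.
template = Template("predict_linear($expr[$lookback:], $extrapolate) < $threshold")

# Careful use produces a valid query.
good = template.substitute(
    expr="(avail / size)", lookback="12h", extrapolate="1h", threshold="0.10"
)

# Careless use also "succeeds": the missing parentheses change what the
# subquery applies to, and the malformed duration and threshold are only
# discovered when the database tries to parse the result.
bad = template.substitute(
    expr="avail / size", lookback="12 hours", extrapolate="1h", threshold="ten percent"
)

print(good)
print(bad)
```

Both substitutions complete without error, so every template author has to guard against misuse by hand.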
HeraclesQL allows type-safe templating inside expressions. Imagine that we want to write an alert that will fire when disk exhaustion is imminent. We might write something like this:
predict_linear(
  (
    node_filesystem_avail_bytes{mountpoint="/"}
    /
    node_filesystem_size_bytes{mountpoint="/"}
  )[12h:],
  1h
)
<
0.10
This expression returns a non-empty vector if the linear extrapolation of the fraction of available filesystem space, one hour from now, is less than 10%.
But what if we want another alert that fires when we’re minutes away from exhaustion? With MetricsQL, our only option is to write another alert with a nearly duplicate expression:
predict_linear(
  (
    node_filesystem_avail_bytes{mountpoint="/"}
    /
    node_filesystem_size_bytes{mountpoint="/"}
  )[1h:],
  15m
)
<
0.01
This is a simple case, but the same pattern causes a huge amount of code duplication for more complicated alerts.
In HeraclesQL, we can use a function to avoid duplication:
from heracles import ql
v = ql.Selector()
def disk_space_running_out(
    lookback: ql.Duration,
    extrapolate: ql.Duration,
    threshold: float,
) -> ql.InstantVector:
    return (
        ql.predict_linear(
            (
                v.node_filesystem_avail_bytes(mountpoint="/")
                / v.node_filesystem_size_bytes(mountpoint="/")
            )[lookback:],
            extrapolate,
        )
        < threshold
    )
# first alert expression
disk_space_running_out(12 * ql.Hour, 1 * ql.Hour, 0.10)
# second alert expression
disk_space_running_out(1 * ql.Hour, 15 * ql.Minute, 0.01)
Now we can make as many variants of this expression as we want without duplicating any code!
HeraclesQL’s internal representation of an expression is essentially the abstract syntax tree of the corresponding MetricsQL expression. The HeraclesQL value is not used to query VictoriaMetrics directly. Instead, it’s used to generate MetricsQL. For example, if we wrote:
print(disk_space_running_out(12 * ql.Hour, 1 * ql.Hour, 0.10).render())
We’d print out MetricsQL:
predict_linear(
  (
    node_filesystem_avail_bytes{mountpoint="/"}
    /
    node_filesystem_size_bytes{mountpoint="/"}
  )[12h:],
  3600000
)
<
0.10
Generating MetricsQL is useful because it allows Heracles code to generate native VictoriaMetrics rule configuration files!
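To give a feel for the "AST that renders to MetricsQL" idea, here is a deliberately tiny sketch. The node classes below are hypothetical and far simpler than HeraclesQL’s real types, and the rendering is single-line and fully parenthesized rather than pretty-printed.

```python
from dataclasses import dataclass
from typing import Tuple


class Node:
    def render(self) -> str:
        raise NotImplementedError


@dataclass(frozen=True)
class Selector(Node):
    name: str
    matchers: Tuple[Tuple[str, str], ...] = ()

    def render(self) -> str:
        if not self.matchers:
            return self.name
        inner = ", ".join(f'{k}="{v}"' for k, v in self.matchers)
        return f"{self.name}{{{inner}}}"


@dataclass(frozen=True)
class Literal(Node):
    text: str  # a number or duration, kept as text for simplicity

    def render(self) -> str:
        return self.text


@dataclass(frozen=True)
class BinaryOp(Node):
    op: str
    left: Node
    right: Node

    def render(self) -> str:
        return f"({self.left.render()} {self.op} {self.right.render()})"


@dataclass(frozen=True)
class Subquery(Node):
    inner: Node
    lookback: str

    def render(self) -> str:
        return f"{self.inner.render()}[{self.lookback}:]"


@dataclass(frozen=True)
class Call(Node):
    func: str
    args: Tuple[Node, ...]

    def render(self) -> str:
        return f"{self.func}({', '.join(a.render() for a in self.args)})"


def disk_space_running_out(lookback: str, extrapolate: str, threshold: str) -> Node:
    root = (("mountpoint", "/"),)
    ratio = BinaryOp(
        "/",
        Selector("node_filesystem_avail_bytes", root),
        Selector("node_filesystem_size_bytes", root),
    )
    prediction = Call("predict_linear", (Subquery(ratio, lookback), Literal(extrapolate)))
    return BinaryOp("<", prediction, Literal(threshold))


rendered = disk_space_running_out("12h", "1h", "0.10").render()
print(rendered)
```

The function returns a tree, not a string. Text only appears at the very end when render() is called, which is what leaves room for tools to inspect or transform the tree first.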
Expressiveness
While MetricsQL is concise for simple queries, it can easily become difficult to understand. Even using ‘WITH’ templating, there’s not much flexibility to define variables or avoid repetition. Here’s an alert to use as an example:
- alert: APIPoorThroughput
  expr: |-
    (
      (
        rate(http_request_duration_sum{path=~"/api/v1/.*"}[5m])
        /
        rate(http_request_duration_count_total[5m])
      )
      /
      (
        (
          increase(http_response_size_bytes_sum[5m])
          /
          increase(http_response_size_bytes_count_total[5m])
        )
        *
        1M
      )
    )
    >
    0.005
This alert calculates the 5m average request duration per megabyte of response size.
Everybody has their own threshold for readability, but this alert is beginning to reach mine. I can understand it by reading it closely, but it’s difficult to grok right away. Since HeraclesQL queries are just Python code, it’s possible to use variables, branches, loops, and functions to make the query’s intent more clear. We can rewrite this alert using HeraclesQL’s configuration package:
from heracles import ql, config
rules = config.RuleBundle("hrtbeat.rules.yml")
v = rules.vectors()
@rules.alert()
def api_poor_throughput() -> config.Alert:
    lookback = 5 * ql.Minute
    # define a variable which calculates just the request duration
    average_request_duration = ql.increase(
        v.http_request_duration_sum(path=ql.RE("/api/v1/.*"))[lookback]
    ) / ql.increase(v.http_request_duration_count_total[lookback])
    # separately, calculate average response size
    average_response_size = (
        ql.increase(v.http_response_size_bytes_sum[lookback])
        / ql.increase(v.http_response_size_bytes_count_total[lookback])
    ) * (1000 * 1000)
    # use the two variables above to calculate the request duration in terms of response size
    request_duration_per_response_size = (
        average_request_duration / average_response_size
    )
    # finally, return an alert which tests the request_duration_per_response_size against the threshold
    return config.SimpleAlert(expr=request_duration_per_response_size > 0.005)
The HeraclesQL version of this alert has several advantages:
- lookback is a variable so it doesn’t have to be repeated multiple times
- average_request_duration and average_response_size can be calculated separately and stored in named variables
- The final alert’s expression clearly explains what it does to a reader without needing any comments
With HeraclesQL, we can go even further than this – what if we want different thresholds for a few different paths? HeraclesQL’s config library lets us use parameterizable functions to register the same alert multiple times:
from heracles import ql, config
rules = config.RuleBundle("hrtbeat.rules.yml")
v = rules.selector()
@rules.alert(
    "HandlerALowThroughput",
    path_prefix="/api/v1/handler_a",
    threshold=0.005,
)
@rules.alert(
    "HandlerBLowThroughput",
    path_prefix="/api/v1/handler_b",
    threshold=0.0001,
)
def api_poor_throughput(
    path_prefix: str,
    threshold: float,
    lookback: ql.Duration = 5 * ql.Minute,
) -> config.Alert:
    # define a variable which calculates just the request duration
    average_request_duration = ql.increase(
        v.http_request_duration_sum(path=ql.RE(f"{path_prefix}.*"))[lookback]
    ) / ql.increase(v.http_request_duration_count_total[lookback])
    # separately, calculate average response size
    average_response_size = (
        ql.increase(v.http_response_size_bytes_sum[lookback])
        / ql.increase(v.http_response_size_bytes_count_total[lookback])
    ) * (1000 * 1000)
    # use the two variables above to calculate the request duration in terms of
    # response size
    request_duration_per_response_size = (
        average_request_duration / average_response_size
    )
    # finally, return an alert which tests the request_duration_per_response_size against the threshold
    return config.SimpleAlert(
        expr=request_duration_per_response_size > threshold,
    )
Named variables can also be defined outside of rule functions and referenced multiple times. By default, this will replicate that variable’s sub-expression wherever it is referenced. Often, it’s desirable to create a recording rule from this kind of shared expression. In HeraclesQL, recording rules can be defined using a similar API to alerts, but this can often be overkill for simple cases. Thus, HeraclesQL provides a less verbose way to wrap an expression to produce a recording rule:
from heracles import ql, config
rules = config.RuleBundle("hrtbeat.rules.yml")
v = rules.vectors()
lookback: ql.Duration = 5 * ql.Minute
# wrap our expression in rules.record to let heracles know that this should
# produce a recording rule. The resulting variable is a normal InstantVector
average_request_duration = rules.record(
    ql.increase(v.http_request_duration_sum[lookback])
    / ql.increase(v.http_request_duration_count_total[lookback]),
    "my_service:avg_request_duration:5m",
)
@rules.alert(
    "HandlerALowThroughput",
    path_prefix="/api/v1/handler_a",
    threshold=0.005,
)
@rules.alert(
    "HandlerBLowThroughput",
    path_prefix="/api/v1/handler_b",
    threshold=0.0001,
)
def api_poor_throughput(
    path_prefix: str,
    threshold: float,
) -> config.Alert:
    average_response_size = (
        ql.increase(v.http_response_size_bytes_sum[lookback])
        / ql.increase(v.http_response_size_bytes_count_total[lookback])
    ) * (1000 * 1000)
    # use the average_request_duration variable. When we generate MetricsQL,
    # this variable will render as a reference to
    # my_service:avg_request_duration:5m
    request_duration_per_response_size = (
        average_request_duration(path=ql.RE(f"{path_prefix}.*")) / average_response_size
    )
    return config.SimpleAlert(
        expr=request_duration_per_response_size > threshold,
    )
@rules.alert()
def api_request_duration_very_long() -> config.Alert:
    return config.SimpleAlert(expr=average_request_duration > 10)
In both rules, average_request_duration will render as a reference to the recording rule.
Footguns
MetricsQL is extremely powerful, but with that power comes complexity. There’s a great deal of subtle behavior which isn’t always obvious. In our experience, the most common mistakes involve unexpectedly missing timeseries. Take a simple expression, for example:
increase(
  my_service_background_job_started_count{job_name="very-important"}[10m]
)
<
1
If my very-important job doesn’t run at least once in a 10m span, this will generate a vector with at least one timeseries and raise an alert. But what happens if my_service_background_job_started_count isn’t in VictoriaMetrics? In that case, the expression will always return an empty vector because the selector my_service_background_job_started_count{job_name="very-important"} will return nothing. I will not get any alerts even though my job definitely isn’t running!
This kind of bug can easily occur for metrics which are registered dynamically. If my_service registers new job_name values only when a job starts, this metric would never be generated until the job runs for the first time.
To avoid this, it’s best practice to add an absence check to selectors that reference timeseries that may be dynamically produced. However, this is easily overlooked. In practice, we found that alerts at HRT rarely handled timeseries absence correctly.
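The failure mode is easy to reproduce with a toy model of vectors as dictionaries. This is a simplification that ignores MetricsQL’s full label-matching rules; `less_than`, `absent`, and `or_` are illustrative helpers, not a real API.

```python
from typing import Dict, Tuple

Labels = Tuple[Tuple[str, str], ...]
Vector = Dict[Labels, float]


def less_than(vector: Vector, threshold: float) -> Vector:
    """Comparison filters the vector; an empty input stays empty."""
    return {labels: v for labels, v in vector.items() if v < threshold}


def absent(vector: Vector) -> Vector:
    """Like absent(): a single series with value 1 when the input is empty."""
    return {(): 1.0} if not vector else {}


def or_(left: Vector, right: Vector) -> Vector:
    """Simplified `or`: everything on the left plus unseen label sets from the right."""
    merged = dict(right)
    merged.update(left)
    return merged


job = (("job_name", "very-important"),)
stalled: Vector = {job: 0.0}     # metric exists, but the job has not run
never_registered: Vector = {}    # metric was never exported at all

naive_stalled = less_than(stalled, 1)            # non-empty -> alert fires
naive_missing = less_than(never_registered, 1)   # empty -> silence

guarded_missing = or_(less_than(never_registered, 1), absent(never_registered))
```

The naive expression goes quiet in exactly the case where the metric is missing, while the guarded version still returns a series and fires.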
HeraclesQL provides a way to automatically add this check:
from heracles import ql, config
rules = config.RuleBundle("hrtbeat.rules.yml")
v = rules.vectors()
# use config.AlertForMissingData to generate a meta-alert that verifies
# all the timeseries we expect to select are actually selected
with rules.context(config.AlertForMissingData()):
    @rules.alert()
    def very_important_job_not_running() -> config.Alert:
        return config.SimpleAlert(
            # we don't need to include an absence check in our expression,
            # config.AlertForMissingData will do it for us
            expr=ql.increase(
                v.my_service_background_job_started_count(job_name="very-important")[
                    10 * ql.Minute
                ]
            )
            < 1,
        )
Before understanding what AlertForMissingData does, we need to take a small detour to explain what rules.context is doing. In addition to registering alerts, the RuleBundle object is able to perform post-processing. All HeraclesQL vector objects store the entire AST of the query, so post-processors are able to walk the AST and generate meta-queries. The context manager above is used to register AlertForMissingData as a post-processor for every alert defined within the context.
To understand what this does, it’s easiest to look at the output HeraclesQL’s config generation produces:
rules:
  - alert: VeryImportantJobNotRunning
    expr: |-
      increase(
        my_service_background_job_started_count{job_name="very-important"}[10m]
      )
      <
      1.0
  - alert: VeryImportantJobNotRunningDataMissing
    expr: absent(my_service_background_job_started_count{job_name="very-important"})
In addition to our alert, Heracles has generated a VeryImportantJobNotRunningDataMissing alert which fires when my_service_background_job_started_count is absent. AlertForMissingData generated this alert by traversing the AST of our first alert’s expression and generating an absence check for each root-level vector selector.
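Here is a rough sketch of what such a post-processor could look like, using a toy AST. The node classes and the `missing_data_alert` function are assumptions made for illustration; HeraclesQL’s real AlertForMissingData works on its own AST types.

```python
from dataclasses import dataclass
from typing import Iterator, Tuple


class Node:
    def children(self) -> Tuple["Node", ...]:
        return ()


@dataclass(frozen=True)
class Selector(Node):
    text: str  # an already-rendered selector, e.g. metric{label="value"}


@dataclass(frozen=True)
class Number(Node):
    value: float


@dataclass(frozen=True)
class Range(Node):
    inner: Node
    lookback: str

    def children(self) -> Tuple[Node, ...]:
        return (self.inner,)


@dataclass(frozen=True)
class Call(Node):
    func: str
    args: Tuple[Node, ...]

    def children(self) -> Tuple[Node, ...]:
        return self.args


@dataclass(frozen=True)
class Compare(Node):
    op: str
    left: Node
    right: Node

    def children(self) -> Tuple[Node, ...]:
        return (self.left, self.right)


def walk_selectors(node: Node) -> Iterator[Selector]:
    """Depth-first traversal that yields every selector in the tree."""
    if isinstance(node, Selector):
        yield node
    for child in node.children():
        yield from walk_selectors(child)


def missing_data_alert(alert_name: str, expr: Node) -> Tuple[str, str]:
    """Build a (name, expression) pair that fires if any selector is absent."""
    seen = []
    for selector in walk_selectors(expr):
        if selector.text not in seen:
            seen.append(selector.text)
    return f"{alert_name}DataMissing", " or ".join(f"absent({t})" for t in seen)


expr = Compare(
    "<",
    Call("increase", (Range(Selector('my_service_background_job_started_count{job_name="very-important"}'), "10m"),)),
    Number(1),
)
meta_name, meta_expr = missing_data_alert("VeryImportantJobNotRunning", expr)
```

The alert author never writes the absence check. It falls out of the expression’s structure, so it can’t be forgotten.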
Not all common footguns are so straightforward. For example, in MetricsQL, binary operations inherently perform setwise operations. Something as simple as this can have multiple meanings:
vector_x + vector_y
The ‘+’ operator can be thought of as working in two steps: First, it finds all the timeseries in vector_y which have matching labels to timeseries in vector_x, keeping only the intersection. Next, it adds the values of the remaining timeseries to produce a new vector. The net result is an operation which can mean either “add” or “filter” depending on the context. When writing a query, it’s easy to accidentally filter when you only meant to add (in fact, HeraclesQL’s standard library implements generic joins using the + operator).
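The two steps are easy to see in a small Python model where a vector is a dictionary keyed by label set. It only models one-to-one matching on identical labels, and the `add` helper and sample data are invented for the example.

```python
from typing import Dict, Tuple

Labels = Tuple[Tuple[str, str], ...]
Vector = Dict[Labels, float]


def add(left: Vector, right: Vector) -> Vector:
    """Keep only label sets present on both sides, then add their values."""
    return {
        labels: value + right[labels]
        for labels, value in left.items()
        if labels in right
    }


vector_x: Vector = {
    (("host", "a"),): 1.0,
    (("host", "b"),): 2.0,
    (("host", "c"),): 3.0,
}
vector_y: Vector = {
    (("host", "a"),): 10.0,
    (("host", "b"),): 20.0,
}

total = add(vector_x, vector_y)

# host "c" silently disappears because vector_y has no matching series.
cardinality_preserved = len(total) == len(vector_x)
```

Nothing in the query text says whether dropping host c was intended, which is the ambiguity described above.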
For these ambiguous cases, HeraclesQL provides ‘annotations’ which can be attached to queries to signal intent. In this case, we can write something like this:
from heracles import ql, config
from heracles.ql import assertions
v = ql.Selector()
(v.vector_x + v.vector_y).annotate(assertions.no_cardinality_change)
When used with the AlertForAssertions context, this will generate a meta-alert which fires when the assertion is not true:
rules:
  - alert: ExampleAlert
    expr: vector_x + vector_y
  - alert: ExampleAlertInvalidData
    expr: absent(count(vector_x) == count(vector_x + vector_y))
These are just a few examples of meta-alerts in HeraclesQL. Both contexts and annotations are just Python code that processes the expression’s AST. It’s possible to write custom implementations that can do just about anything you can imagine!
Pulling it Together – Jenkins Alerts
HeraclesQL’s config library brings all of these features together to make a powerful code-as-configuration system. As a more realistic example, let’s write a small library for alerting based on the outcome of Jenkins jobs using the Jenkins Prometheus plugin: https://plugins.jenkins.io/prometheus/
We’ll work under the assumption that we have multiple teams that want to alert on the status of their jobs. Jobs can do many things — they can be periodic jobs, they can be builds, and they can be manually triggered one-offs. Using plain VictoriaMetrics config, we’re likely to end up with a lot of near-duplicate alerts.
A basic alert might look like this:
rules:
  - alert: JenkinsJobFailing
    expr: default_jenkins_builds_last_build_result{job_name="Org/Team/Job"} == 0
    for: 10m
    annotations:
      summary: Job {{ $labels.job_name }} is failing for the last 10m
However, there’s already a problem: Jenkins registers metrics dynamically. If the job doesn’t exist for some reason (like bad configuration or a branch in a repo being deleted), this alert will not fire. So ideally we’ll check for absence as well:
rules:
  - alert: JenkinsJobFailing
    expr: |-
      (default_jenkins_builds_last_build_result{job_name="Org/Team/Job"} == 0)
      or
      absent(default_jenkins_builds_last_build_result{job_name="Org/Team/Job"})
    for: 10m
    annotations:
      summary: Job {{ $labels.job_name }} is failing for the last 10m
Hopefully everyone always remembers to add that!
There’s also one more complication: The Jenkins Prometheus plugin exports two metrics describing the last build result — default_jenkins_builds_last_build_result and default_jenkins_builds_last_build_result_ordinal. Both metrics are gauges that represent the status as a numerical value. The non-ordinal version uses 0 for a set of failed states and 1 for success. The ordinal version represents each state as a separate value (0-4). Confusingly, success is 0!
With many teams writing alerts for Jenkins jobs, this causes a few annoyances:
- Some alerts will use default_jenkins_builds_last_build_result and some will use default_jenkins_builds_last_build_result_ordinal even if they semantically find the same condition
- It’s easy to mistakenly check for the wrong values from default_jenkins_builds_last_build_result_ordinal since they’re different from the values of default_jenkins_builds_last_build_result and they’re just numeric values in code.
We found ourselves in this situation at HRT. Some alerts used _ordinal, some alerts forgot the absent check, and some alerts checked for the wrong result codes.
If you’ve read the previous section, you may already have some ideas about how Heracles can improve this. At the very least, we can use AlertForMissingData to ensure that nobody forgets the absent check anymore:
from heracles import ql, config
rules = config.RuleBundle("hrtbeat.rules.yml")
v = rules.vectors()
with rules.context(config.AlertForMissingData()):
    @rules.alert()
    def jenkins_job_failing() -> config.Alert:
        return config.SimpleAlert(
            expr=(v.default_jenkins_builds_last_build_result(jenkins_job="Org/Team/Job") < 1),
            for_=15 * ql.Minute,
            annotations={
                "summary": "Job is failing for 15m",
            },
        )
Next, we can add parameters to this alerting function so that we can define multiple alerts by repeating the @rules.alert decorator:
from heracles import ql, config
rules = config.RuleBundle("hrtbeat.rules.yml")
v = rules.selector()
with rules.context(config.AlertForMissingData()):
    @rules.alert("JobFrequent", duration=15 * ql.Minute)
    @rules.alert("JobInfrequent", duration=12 * ql.Hour)
    @rules.alert("JobImportant", duration=20 * ql.Minute)
    def jenkins_job_failing(job: str, duration: ql.Duration) -> config.Alert:
        return config.SimpleAlert(
            expr=v.default_jenkins_builds_last_build_result_ordinal(
                jenkins_job=f"Org/Team/{job}"
            ) > 1,
            for_=duration,
            annotations={
                "summary": f"{job} is failing for {duration}",
            },
        )
There’s still one major limitation with this implementation. As written, every alert fires only when the job’s ordinal result is greater than 1, which means FAILURE, NOT_BUILT, or ABORTED. While that’s probably fine for most alerts, it might not be for some important alerts. For example, perhaps JobImportant’s alert is supposed to fire even when the job’s ordinal result is 1 (for UNSTABLE). Since we can’t use the same threshold for every alert, we need to add a parameter for build status and generate a more complex expression:
import functools
from typing import Iterable
from heracles import ql, config
rules = config.RuleBundle("hrtbeat.rules.yml")
v = rules.vectors()
with rules.context(config.AlertForMissingData()):
    @rules.alert("JobFrequent", duration=15 * ql.Minute)
    @rules.alert("JobInfrequent", duration=12 * ql.Hour)
    @rules.alert("JobImportant", duration=20 * ql.Minute, failure_status=(1, 2, 3, 4))
    def jenkins_job_failing(
        job: str,
        duration: ql.Duration,
        failure_status: Iterable[int] = (2, 3, 4),
    ) -> config.Alert:
        selector = v.default_jenkins_builds_last_build_result_ordinal(jenkins_job=f"Org/Team/{job}")
        # build one equality check per status, then OR them all together
        sub_exprs = [selector == s for s in failure_status]
        final_expr = functools.reduce(ql.InstantVector.or_, sub_exprs)
        return config.SimpleAlert(
            expr=final_expr,
            for_=duration,
            annotations={
                "summary": f"{job} is failing for {duration}",
            },
        )
Great! Now the expression is written in terms of a set of explicit statuses we want to alert on. By overriding the value of failure_status for JobImportant, we are able to alert when it is in anything other than a success status. However, now our alert definitions depend on passing around opaque integer statuses. That’s a step backwards in readability and makes it more likely that new alerts will use the wrong statuses by mistake.
Luckily, HeraclesQL config is Python code! We can just add a new type:
import enum
from typing import Iterable
class JenkinsJobResult(int, enum.Enum):
    success = 0
    unstable = 1
    failure = 2
    not_built = 3
    aborted = 4
    @staticmethod
    def success_statuses() -> Iterable["JenkinsJobResult"]:
        return (JenkinsJobResult.success, JenkinsJobResult.unstable)
    @staticmethod
    def failure_statuses() -> Iterable["JenkinsJobResult"]:
        return (
            JenkinsJobResult.failure,
            JenkinsJobResult.not_built,
            JenkinsJobResult.aborted,
        )
And we can update the alert to use the new enum:
import functools
from typing import Iterable
from heracles import ql, config
rules = config.RuleBundle("hrtbeat.rules.yml")
v = rules.vectors()
with rules.context(config.AlertForMissingData()):
    @rules.alert("JobFrequent", duration=15 * ql.Minute)
    @rules.alert("JobInfrequent", duration=12 * ql.Hour)
    @rules.alert(
        "JobImportant",
        duration=20 * ql.Minute,
        failure_status=(
            JenkinsJobResult.unstable,
            JenkinsJobResult.failure,
            JenkinsJobResult.not_built,
            JenkinsJobResult.aborted,
        ),
    )
    def jenkins_job_failing(
        job: str,
        duration: ql.Duration,
        failure_status: Iterable[
            JenkinsJobResult
        ] = JenkinsJobResult.failure_statuses(),
    ) -> config.Alert:
        selector = v.default_jenkins_builds_last_build_result_ordinal(jenkins_job=f"Org/Team/{job}")
        sub_exprs = [selector == s.value for s in failure_status]
        final_expr = functools.reduce(ql.InstantVector.or_, sub_exprs)
        return config.SimpleAlert(
            expr=final_expr,
            for_=duration,
            annotations={
                "summary": f"{job} is failing for {duration}",
            },
        )
Much better! Now users no longer have to remember the mapping between logical status and numeric code.
What if we want to share this alert between multiple teams? We can just add more decorators to jenkins_job_failing, but that’ll get hard to read pretty fast. Luckily, I have one more trick up my sleeve: HeraclesQL has an alert extension syntax:
rules.alert(
    "JobFrequentFailing", rules.extends(jenkins_job_failing, duration=15 * ql.Minute)
)
Now we can update our rules so that new alerts are more declarative:
from heracles import ql, config
rules = config.RuleBundle("hrtbeat.rules.yml")
v = rules.vectors()
jenkins_alerts = {
    "JobFrequent": {"duration": 15 * ql.Minute},
    "JobInfrequent": {"duration": 12 * ql.Hour},
    "JobImportant": {
        "duration": 20 * ql.Minute,
        "failure_status": (
            JenkinsJobResult.unstable,
            JenkinsJobResult.failure,
            JenkinsJobResult.not_built,
            JenkinsJobResult.aborted,
        ),
    },
}
with rules.context(config.AlertForMissingData()):
    for job, params in jenkins_alerts.items():
        rules.alert(
            f"{job}Failing",
            rules.extends(jenkins_job_failing, **params),
        )
And that’s it! New teams can even create new modules and extend this alert from there. This is much easier to extend than a VictoriaMetrics rule config file with a growing list of nearly duplicated rules. It’s also much more difficult to make mistakes: we’re able to use Python’s type system to avoid magic numbers and use HeraclesQL’s config APIs to automatically add meta-alerts to catch common mistakes.
Development and Testing
Our goal with HeraclesQL is to make the experience of developing alerts better. This extends beyond just the code you write to the experience of writing and testing code. To that end, I’d like to highlight two development tools we’re working on: Delos and Hermes.
Delos
Delos is a small development environment for writing HeraclesQL queries. When writing MetricsQL, it’s normal to iterate on queries inside the VictoriaMetrics UI or a similar tool which can graph query results on the fly. Developers lose this ability when writing HeraclesQL because those UIs don’t support HeraclesQL. Delos fills this gap by proxying a live instance of VictoriaMetrics with a query populated from a HeraclesQL Python project.

Delos is still somewhat experimental – it works, but there are plenty of rough edges. For this reason, it’s not included in the wheel just yet (but it can be installed directly as a Go module). Delos has a very simple implementation. The UI is provided by an iframe that embeds the VictoriaMetrics UI from a target VictoriaMetrics instance. Your HeraclesQL query is injected into the UI via a small HTTP API in the Delos server, which executes a Python script that generates MetricsQL from your query.
Using Delos is easy:
$ delos --vm-url <a-vm-instance>
Found virtual env at '/home/you/.venv'
Found python at '/home/you/.venv/bin/python'
-- Delos Interactive Mode --
-> http://localhost:5000/dev <-
QUERY:
42.0
Enter command:
'q' - exit
'e' - edit
Press ‘e’ to open the Delos query file. If you’re using a Python virtual env, this file will be embedded in that environment so you can reference modules from your project.
# Use this file to define the query which Delos will display!
# This script executes inside the provided venv, so code from your project is directly callable.
#
# Delos expects a variable named 'VECTOR' which contains a query. `VECTOR` can also be a function.
from heracles import ql
VECTOR = ql.ScalarLiteral(42)
So for example, to view an alert query, do this:
# Use this file to define the query which Delos will display!
# This script executes inside the provided venv, so code from your project is directly callable.
#
# Delos expects a variable named 'VECTOR' which contains a query. `VECTOR` can also be a function.
from heracles import ql
from my_heracles_rules import my_team_rules
VECTOR = my_team_rules.MyUsefulAlert.expr
Save this file, and Delos will automatically populate the query into the UI. When you edit MyUsefulAlert, Delos will automatically update the query in the UI. We think Delos provides a better experience than manually writing a rules.yaml file because you can iterate on alerts directly from the definitions. With a rules.yaml, you still have to copy alerts between the VictoriaMetrics UI and the rules.yaml file.
Hermes
If you’re familiar with writing PromQL or MetricsQL alerts, you’re probably familiar with promtool test and vmalert-tool test. These commands allow rules to be tested against a real timeseries database with synthetic data. It’s still possible to use these tools with HeraclesQL generated alerts, but we wanted to provide a better experience.
Hermes re-implements the behavior of vmalert-tool with a programmatic interface. Python code can generate synthetic timeseries, write them to the Hermes client, execute a query, and read back the results. This can be used to implement unit tests for HeraclesQL queries that actually test against a real VictoriaMetrics server.
Unlike vmalert-tool tests, Hermes tests can use imperative code to generate input timeseries. Similarly, Hermes returns the raw query result to the client, allowing imperative logic for checking conditions of the result. This makes it much easier to test complex conditions.
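To illustrate the style of test this enables, here is a self-contained sketch. It does not use the Hermes API. The `synthetic_counter` and `increase` helpers are stand-ins I made up, and `increase` is a simplified last-minus-first calculation that replaces a real query against VictoriaMetrics.

```python
from typing import List, Tuple

Sample = Tuple[int, float]  # (timestamp in seconds, counter value)


def synthetic_counter(start: int, end: int, step: int, runs_every: int) -> List[Sample]:
    """Generate a counter that increments every `runs_every` seconds (0 = never)."""
    samples: List[Sample] = []
    value = 0.0
    for ts in range(start, end + 1, step):
        if runs_every and ts != start and (ts - start) % runs_every == 0:
            value += 1
        samples.append((ts, value))
    return samples


def increase(samples: List[Sample], now: int, window: int) -> float:
    """Simplified increase(): last value minus first value inside the window."""
    in_window = [v for ts, v in samples if now - window <= ts <= now]
    return in_window[-1] - in_window[0] if len(in_window) >= 2 else 0.0


def alert_fires(samples: List[Sample], now: int) -> bool:
    # Mirrors `increase(job_started_count[10m]) < 1`
    return increase(samples, now, window=600) < 1


healthy = synthetic_counter(0, 3600, step=60, runs_every=300)
stalled = synthetic_counter(0, 3600, step=60, runs_every=0)

assert not alert_fires(healthy, now=3600)
assert alert_fires(stalled, now=3600)
```

Because the input is generated by ordinary code, a test can sweep over many job intervals or inject gaps with a loop instead of hand-writing each series.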
Hermes is still early in its development, so it has some rough edges. However, we think it’ll be a great mechanism for fully native HeraclesQL testing as it improves.
Wrapping Up
We’re very excited about the potential of HeraclesQL. We’re already using it for many alerts inside HRT, and we plan to eventually replace all of our MetricsQL alerts with it. HeraclesQL isn’t done yet, either — we plan to continue developing the project, especially now that it’s available to the community as an open source project!
Future Possibilities
Prometheus Support
HeraclesQL supports MetricsQL because we only use VictoriaMetrics at HRT. However, it’d be entirely possible to support Prometheus’ PromQL if there’s community interest! HeraclesQL was written so that the core types would be easy to swap.
Execution Support
HeraclesQL supports generating query strings and generating static configs. This is what’s necessary for alerting, but VictoriaMetrics can be used in many other cases. In the future, we could add support for executing HeraclesQL queries in the same way one can execute a SQLAlchemy query.
