The HRT Beat | Tech Blog

Scaling Prometheus to Billions of Samples a Second

Oct 20, 2021

When scaling out infrastructure, it becomes increasingly important to have unified visibility into how the systems are operating. HRT is always working to expand the capabilities of our research and live trading environments, and one of the ways we do this well is by using Prometheus. Prometheus is an open-source tool which collects time series data using a pull model to scrape data from targets, and then provides a flexible query language which leverages the key/value pairs used to define the data’s dimensionality. We’ve adopted Prometheus as a primary element in how we monitor our global infrastructure because it’s an attractive solution for both host-level and service-level metrics, scaling up immensely on a single instance. Of course, with any service, there’s a limit to how much you can scale up on a single server within a given set of hardware constraints. We’ll explore how HRT approached this single-server scaling problem, and some of the unique challenges we overcame while doing it.

Let’s start by taking a look at the scaling options recommended by the Prometheus community, and discuss why these weren’t great options for HRT. The two primary built-in ways of scaling beyond one server are federation (splitting Prometheus ingestion across multiple datacenters or services, and then aggregating a portion of the data with a “global” instance) and sharding the data across multiple instances, which are, again, aggregated through federation. Running one Prometheus instance per datacenter fits the infrastructure of a distributed trading environment quite well, but doesn’t align as well with a large research cluster. For our research environment, sharding the data makes more sense. Federation, however, would restrict our ability to do ad-hoc global aggregation, which we weren’t willing to sacrifice. By its nature, federation only allows for running queries on the data pulled into the global instance, where we again run into a limit on how much data can be ingested by a single Prometheus instance.

So, what solution did we come up with that allows us to both scale horizontally and still be able to run queries across all data at once? We leveraged a combination of sharding and remote read, producing a flow that looks something like this:

At its core, we have two types of Prometheus instances in our environment: the global aggregator nodes that receive queries (henceforth referred to as query nodes), and the nodes that actually scrape data from all of our targets, sharded across several instances (circled in the diagram to represent a complete dataset). The query nodes have no scrape targets of their own, and instead just read data from the configured remote read targets. The sole job of these instances is to process queries, including recording rules, alerts, and ad-hoc queries. The shards, on the other hand, are dedicated mostly to ingesting large amounts of data, much more than was feasible on a single server.
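As a rough illustration of the query-node side, the sketch below builds the `remote_read` section of a Prometheus configuration pointing at each shard. The shard hostnames and the helper name `query_node_config` are hypothetical; `remote_read`, `url`, `read_recent`, and the `/api/v1/read` path are standard Prometheus settings, but whether HRT sets `read_recent` is an assumption (it seems necessary for a node that holds no local data).

```python
import json


def query_node_config(shard_urls, read_recent=True):
    """Build the remote_read section for a query node.

    The query node defines no scrape_configs of its own: every sample
    it sees comes from the shards listed here.
    """
    return {
        "remote_read": [
            {
                # Prometheus serves its remote read API at /api/v1/read.
                "url": f"{base.rstrip('/')}/api/v1/read",
                # Assumption: also ask the shards for recent data, since
                # the query node has nothing in local storage.
                "read_recent": read_recent,
            }
            for base in shard_urls
        ]
    }


# Hypothetical shard addresses, for illustration only.
SHARDS = ["http://prom-shard-0:9090", "http://prom-shard-1:9090"]
config = query_node_config(SHARDS)

# JSON is a subset of YAML, so this output can be pasted into a config file.
print(json.dumps(config, indent=2))
```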

This architecture introduces several new improvements and issues to address. First, we’ll look at the improvements.

Scaling Data Collection

The first and most obvious win is that we can now scale horizontally to accommodate the ingestion of more data. By sharding the data, our ingestion limits are now a function of how many servers we’ve sharded the data across, rather than how much resource overhead remains on a single server. As we hit scalability limits, the question is no longer “can we upgrade the hardware,” but instead, “where do we want to rack the new server(s).” The latter is much easier to address. The addition of new servers also doesn’t introduce any downtime for existing instances like hardware changes would. 
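The post doesn’t say how targets are divided among shards, so the following is only a minimal sketch of one common approach: hash each target’s address and take the result modulo the number of shards. The function names and target addresses are made up for illustration, and the hash is not claimed to match any particular Prometheus relabeling behavior.

```python
import hashlib


def shard_for(target: str, num_shards: int) -> int:
    """Map a scrape target to a shard index using a stable hash."""
    digest = hashlib.md5(target.encode("utf-8")).digest()
    # Interpret the last 8 bytes as an unsigned integer, then reduce.
    return int.from_bytes(digest[-8:], "big") % num_shards


def assign(targets, num_shards):
    """Group targets by the shard responsible for scraping them."""
    shards = {i: [] for i in range(num_shards)}
    for target in targets:
        shards[shard_for(target, num_shards)].append(target)
    return shards


# Hypothetical node exporter targets.
targets = [f"node{i:03d}:9100" for i in range(12)]
for shard, members in assign(targets, 3).items():
    print(shard, members)
```

One trade-off of plain modulo hashing is that changing the shard count moves many targets to a different shard, so historical data for a given target may end up split across instances.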

Massive Recording Rules

Another, more subtle improvement is that we now have a mechanism to process some very large queries which would have timed out under the previous architecture. The goal of recording rules is to allow us to precompute some queries while data is ingested, storing the results in a new time series which is often faster to query than executing the same query across a time range. A common use of these recording rules at HRT is to measure CPU utilization across our entire research cluster. A query like this utilizes hundreds of samples for every server, and it quickly becomes infeasible to query usage metrics over any length of time. In our case, even an instantaneous query became problematic! This is where sharding comes into play, as we’re able to execute a recording rule for a portion of the data on each shard, and then run another aggregation on those results from the query node. Implementation-wise, each shard attaches a temporary, unique “shard” label to the time series it records; the query node then runs a similar aggregation recording rule to combine those results into a single time series, removing the temporary label in the process.
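The two-stage aggregation can be simulated in a few lines. This is a sketch with made-up utilization numbers, not HRT’s actual rules: each shard records a partial result tagged with its shard label, and the query node combines the partials and drops the label. One detail worth noting is that sums and counts compose cleanly across shards while averages do not, so a cluster-wide average has to be rebuilt from per-shard sums and counts.

```python
def shard_rule(samples, shard):
    """Per-shard recording rule: partial sum and count, tagged by shard."""
    return {"shard": shard, "sum": sum(samples), "count": len(samples)}


def query_node_rule(partials):
    """Query-node rule: merge partials into one value, dropping the shard label."""
    total = sum(p["sum"] for p in partials)
    count = sum(p["count"] for p in partials)
    return total / count


# Made-up per-host CPU utilization, grouped by the shard that scrapes it.
shard_data = {
    "0": [0.20, 0.40, 0.60],
    "1": [0.80],
    "2": [0.10, 0.30],
}

partials = [shard_rule(samples, shard) for shard, samples in shard_data.items()]
cluster_avg = query_node_rule(partials)

# Averaging the per-shard averages weights shards equally and is wrong
# whenever shards hold different numbers of hosts.
naive_avg = sum(sum(s) / len(s) for s in shard_data.values()) / len(shard_data)

print(f"combined: {cluster_avg:.4f}  average of averages: {naive_avg:.4f}")
```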

Restart Times

The last major win I’ll mention is in the amount of time it takes to restart the Prometheus service itself. When Prometheus restarts, one of the longest portions of the startup process is replaying the Write-Ahead Log (WAL). This step scales with the amount of data in the log, and in our case, began to exceed one hour. By sharding our data across multiple instances, restart times are reduced as a function of how many shards we have. This is even shorter for our query nodes, which don’t have any data in the WAL.

All of this may sound enticing, but now I’d like to cover some of the downsides of adopting this architecture (and how we addressed them).

Complexity

Running a heterogeneous Prometheus deployment where each server may serve a unique purpose naturally introduces a lot of complexity. From an administration standpoint, we now have more servers to maintain and configure. Our config file template is complicated to say the least, and the aforementioned feature of sharding recording rules requires some delicate rule management. This is by no means a showstopper, and can be alleviated with solid documentation and regular training sessions, but it bears consideration.

Along with increased administrative overhead, we also introduce more points of failure. One benefit of sharding the data is that the failure of a single service or piece of hardware only results in a partial outage of data. Even a partial outage is something we want to avoid when maintaining a live trading environment or 24/7 research cluster, so we’ve introduced high availability (HA) and redundancy where possible. This includes things like drive-failure-tolerant storage, remote data snapshots, and HA shards and query nodes. Here’s a peek at what our architecture looks like in production:

The overall concept is identical to the stripped-down diagram shown earlier, but now we’ve introduced a duplicate of every Prometheus server, as well as Consul. What we’ve done here is introduce several Consul service pools: one for all the query nodes and one for each shard pair. Each Consul service has health checks to automatically remove a host from the service pool should the Prometheus instance go down, and to add hosts as we continue to scale out the cluster. Although it’s not an area we’ve explored, if query load becomes too intense, this also gives us the capability to distribute load across more servers by expanding each shard’s service pool.
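The behavior of a health-checked service pool can be sketched without any Consul specifics. Everything below is hypothetical (pool names, hostnames, and the `is_healthy` callback); in a real deployment the check might probe an endpoint such as Prometheus’s `/-/healthy`, but the post doesn’t say which check is used.

```python
from typing import Callable, Dict, List


def healthy_members(pool: List[str], is_healthy: Callable[[str], bool]) -> List[str]:
    """Return only the hosts in a pool that currently pass their health check."""
    return [host for host in pool if is_healthy(host)]


def resolve(pools: Dict[str, List[str]], is_healthy: Callable[[str], bool]) -> Dict[str, List[str]]:
    """Resolve every service pool to its live members."""
    return {name: healthy_members(members, is_healthy) for name, members in pools.items()}


# Hypothetical layout: one pool for query nodes, one per HA shard pair.
pools = {
    "prom-query": ["query-a", "query-b"],
    "prom-shard-0": ["shard-0a", "shard-0b"],
    "prom-shard-1": ["shard-1a", "shard-1b"],
}

# Simulate one shard replica going down.
down = {"shard-0b"}
live = resolve(pools, lambda host: host not in down)

print(live["prom-shard-0"])  # ['shard-0a']
```

Because each shard pool still has a live member, a query node reading through the pool continues to see a complete dataset.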

Is It Slower?

An obvious downside to reading from remote storage is that it’s not local! This means that getting the necessary data to execute a query requires a network hop, and generally has a higher latency than reading from local storage. This was a major concern of ours when testing this setup, and initially we saw mixed results for query times when compared with a single Prometheus instance configured to use local storage. However, we proposed a few changes upstream to enable reading from remote sources concurrently, and worked with the Prometheus community to implement them. These changes significantly improved our average query performance, theoretically allowing it to scale with the number of shards. We’re now bottlenecked by whichever Prometheus instance is slowest to return results.
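To see why concurrent reads leave the slowest shard as the bottleneck, here is a small Python illustration of the fan-out pattern. It is not the upstream Prometheus change (Prometheus is written in Go); the fake shard readers and their delays are invented for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def read_all(shard_readers, query):
    """Send the same query to every shard at once and merge the results."""
    with ThreadPoolExecutor(max_workers=len(shard_readers)) as pool:
        futures = [pool.submit(reader, query) for reader in shard_readers]
        merged = []
        # Collect in submission order; total wait is about the slowest
        # reader rather than the sum of all of them.
        for future in futures:
            merged.extend(future.result())
    return merged


def fake_shard(name, delay):
    """Build a stand-in for a remote read against one shard."""
    def reader(query):
        time.sleep(delay)  # simulated network and query latency
        return [f'{query}{{shard="{name}"}}']
    return reader


readers = [fake_shard("0", 0.05), fake_shard("1", 0.10), fake_shard("2", 0.02)]

start = time.perf_counter()
series = read_all(readers, "up")
elapsed = time.perf_counter() - start

print(series)
print(f"elapsed: {elapsed:.2f}s (sequential would be about 0.17s)")
```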

Remote Read Support

The final issues that we ran into were around remote read support. These were mostly quality-of-life features that aren’t yet supported in Prometheus, such as passing through requests for the /api/v1/targets, /api/v1/labels, and /api/v1/label/<label_name>/values API endpoints. Without these, we can’t get a unified list of targets from a query node, and we don’t get autofill suggestions natively in the Prometheus UI. The former is not necessarily ever going to be a supported feature, as it arguably goes beyond the purpose of a remote storage source.

Although our Prometheus architecture is somewhat unique and leverages techniques not found in upstream community documentation, HRT has built a robust and horizontally-scalable system for ingesting large amounts of time series data. At the time of this writing, HRT is ingesting roughly 150 million unique series, with many more just over the horizon. Importantly, we’ve retained the ability to query any number of those time series ad-hoc. This design has allowed us to maximize our visibility and improve alerting coverage, while eliminating much of the concern around future growth.
