Prometheus v2.30 was released a few days ago, and it is an exciting update. Rather than focusing on new features, it brings several enhancements to configurability and resource usage efficiency, along with a number of bug fixes. We will take a look at all of these changes in this article.
So, without further ado, let’s start!
Faster server restart times via snapshotting
With this release, we will see faster server restart times. For large Prometheus servers that track many concurrent time series, a server restart can take multiple minutes. The Prometheus server must rebuild its latest in-memory state for recent time-series data from the write-ahead log (WAL) on disk. This process can be pretty slow because Prometheus effectively needs to replay the entire ingestion process for every sample in the WAL in order to reconstruct the final time series chunks in memory.
Prometheus v2.30 introduces an experimental snapshotting feature, enabled via the flag
--enable-feature=memory-snapshot-on-shutdown. With this feature enabled, Prometheus writes out a raw snapshot of its current in-memory state upon shutdown, which can be re-read into memory much more efficiently when the server restarts. First experiments with this feature have reduced restart times by 50-80% in one example.
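As a minimal sketch, starting a server with the snapshot feature enabled might look like this (assuming a local prometheus binary and configuration file):

```shell
# Enable the experimental shutdown snapshot feature (Prometheus v2.30+).
prometheus \
  --config.file=prometheus.yml \
  --enable-feature=memory-snapshot-on-shutdown
```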
Keep in mind that Prometheus only writes out this snapshot when it completes an orderly shutdown, not periodically during regular operation (to reduce the overall write load while Prometheus is running). So this new feature only speeds up restarts after clean shutdowns. After a crash or unclean shutdown, the most recent data is still replayed from the WAL as slowly as before.
Controlling scrape intervals and timeouts via relabeling
Target relabeling, together with special target labels, already allows Prometheus users to control certain scrape behaviours, such as the scrape address, the HTTP path, or the HTTP parameters sent along with the scrape request. Prometheus v2.30 extends this configurability to two more parameters:
Scrape intervals: a new
__scrape_interval__ meta label such as
__scrape_interval__="15s" controls how frequently a target is scraped.
Scrape timeouts: a new
__scrape_timeout__ meta label such as
__scrape_timeout__="15s" controls how long a scrape of the target may take before it times out.
Moreover, this now allows setting per-target scrape intervals within the same scrape configuration section. Since target labels control the behaviour, service discovery mechanisms can control it more dynamically. For example, you could temporarily tell Prometheus to scrape specific targets more frequently to get higher-resolution data in specific scenarios, or less frequently to cause less load.
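As a sketch, a relabeling rule that overrides the scrape interval for one specific target might look like this (the job name and target address below are hypothetical):

```yaml
scrape_configs:
  - job_name: 'node'       # hypothetical job name
    scrape_interval: 30s   # default interval for targets in this job
    static_configs:
      - targets: ['demo.example.com:9100']  # hypothetical target
    relabel_configs:
      # Scrape this one target every 5s instead of the job default
      # by setting the __scrape_interval__ meta label during relabeling.
      - source_labels: [__address__]
        regex: 'demo\.example\.com:9100'
        target_label: __scrape_interval__
        replacement: '5s'
```

With a service discovery mechanism in place, the replacement value could instead be derived from a discovered label (for example, a Kubernetes annotation), making the interval fully dynamic.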
Improvement of storage efficiency by tuning timestamp tolerances
The latest release brings an improvement in storage efficiency. Prometheus uses a double-delta compression algorithm for storing sample timestamps within each time series. This compression performs best when the intervals between subsequent timestamps are completely regular. Although Prometheus already tries to scrape targets at regular intervals, actual scrape timestamps can deviate slightly (by a few milliseconds) from the intended schedule, a problem that was exacerbated by a Go regression that caused more timer jitter. In previous versions, Prometheus already allowed setting the
--scrape.adjust-timestamps boolean flag to adjust scrape timestamps by up to 2ms to align them with the intended scrape schedule and thus achieve better timestamp compression.
Prometheus v2.30 adds support for tuning this scrape timestamp tolerance duration via the experimental flag --scrape.timestamp-tolerance.
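As a sketch, starting a server with a larger tolerance might look like this (assuming a local prometheus binary and configuration file, using the --scrape.timestamp-tolerance flag added in v2.30; adjust the duration to your needs):

```shell
# Allow Prometheus to adjust scrape timestamps by up to 10ms
# to align them with the intended schedule (experimental).
prometheus \
  --config.file=prometheus.yml \
  --scrape.timestamp-tolerance=10ms
```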
Prometheus v2.30 also brings other minor improvements, such as reducing memory usage during WAL loading by 24% and CPU usage by 19%. We also get two more optional per-scrape metrics, scrape_timeout_seconds and scrape_sample_limit, in addition to up and friends.
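The additional per-scrape metrics are optional and gated behind a feature flag; enabling them might look like this (assuming the extra-scrape-metrics feature flag and a local prometheus binary):

```shell
# Expose the additional per-scrape metrics (scrape_timeout_seconds,
# scrape_sample_limit) alongside "up" and friends.
prometheus \
  --config.file=prometheus.yml \
  --enable-feature=extra-scrape-metrics
```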
The latest release also brings a couple of bug fixes. Firstly, for exemplars, a panic when resizing the exemplar storage from 0 to a non-zero size has been fixed. For TSDB,
prometheus_tsdb_head_active_appenders is now correctly decremented when an append has no samples. promtool rules backfill now returns 1 if the backfill was unsuccessful and avoids creating overlapping blocks. Finally, for config, a panic when reloading a configuration with a null relabel action has been fixed.
Throughout the article, we have seen how Prometheus v2.30 brings a lot of essential enhancements that will help users have a better experience with the product. Try out the new version here and contribute to the project by clicking here. Have an excellent experience with Prometheus, and I will see you all in the next one.
You can find more of our blogs below. Happy learning!