What’s new in Cortex v1.10.0?

by | 15.08.2021 | Changelog

Cortex v1.10 is out. This article walks through the key changes, new features, enhancements and bug fixes in the release. First, a quick look at what Cortex is and what it does.

Cortex provides horizontally scalable, highly available, multi-tenant, long-term storage for Prometheus. It is a CNCF incubation project used in several production systems, including Weave Cloud and Grafana Cloud. Cortex is primarily used as a remote write destination for Prometheus, with a Prometheus-compatible query API.

Let’s start.

Highlights

Here are the highlights of this release. Chunks storage is now deprecated and in maintenance mode. Version 1.10 adds support for exemplars (in-memory only). Many new limits have been added to help protect installations against overload. The Alertmanager sharding feature is now considered complete. The release also ships ARM binaries and packages (but not container images yet).

Changes

This release prevents path traversal attacks from users who are able to control the X-Scope-OrgID HTTP header (CVE-2021-36157). Users only control this header when Cortex is not fronted by an auth proxy that validates tenant IDs. Strict JSON unmarshalling is now enabled for the pkg/util/validation.Limits struct: the custom UnmarshalJSON() now fails if the input has unknown fields. Cortex chunks storage is deprecated and now in maintenance mode, so all Cortex users are encouraged to migrate to the blocks storage. No new features will be added to the chunks storage. The default Cortex configuration still runs the chunks engine; see the blocks storage doc for how to configure Cortex to run with the blocks storage.
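To make the path traversal risk concrete, here is a minimal Python sketch of tenant ID validation. It is not Cortex's actual implementation: the function name `is_safe_tenant_id`, the allowed character set and the length cap are all assumptions chosen for illustration.

```python
import re

# Hypothetical allow-list: letters, digits and a few punctuation characters.
# "/" and "\" are deliberately excluded, so an ID can never span path segments.
_ALLOWED = re.compile(r"[A-Za-z0-9!\-_.*'()]+")
MAX_LEN = 150  # assumed cap, for illustration only


def is_safe_tenant_id(tenant_id: str) -> bool:
    """Return True if the ID is safe to use as a single path segment."""
    if tenant_id in (".", ".."):
        return False  # would resolve to the current or parent directory
    if not 0 < len(tenant_id) <= MAX_LEN:
        return False
    return _ALLOWED.fullmatch(tenant_id) is not None
```

A value such as `../other-tenant` is rejected because it contains a path separator, which is the kind of input an unvalidated X-Scope-OrgID header could otherwise smuggle into a storage path.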

The example Kubernetes manifests (stored at k8s/) have been removed due to a lack of proper support and maintenance. For the querier and ruler, the -store.query-chunk-limit CLI flag (and its YAML config option max_chunks_per_query) is deprecated in favour of -querier.max-fetched-chunks-per-query (and its YAML config option max_fetched_chunks_per_query). The new limit specifies the maximum number of chunks that can be fetched in a single query from ingesters and long-term storage. Because the limit is applied independently when querying ingesters and when querying long-term storage, the total number of chunks actually fetched can be up to 2x the limit.
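The "2x the limit" behaviour follows from the limit being checked once per source rather than on the combined total. The sketch below is only an illustration of that arithmetic; the function names are hypothetical and the error handling is simplified.

```python
def fetch_from_source(available_chunks: int, limit: int) -> int:
    """Return the chunks fetched from one source, or raise if over the limit."""
    if available_chunks > limit:
        raise ValueError(f"chunk limit exceeded: {available_chunks} > {limit}")
    return available_chunks


def total_fetched(ingester_chunks: int, storage_chunks: int, limit: int) -> int:
    """Apply the same limit to each source independently, then sum."""
    return (fetch_from_source(ingester_chunks, limit)
            + fetch_from_source(storage_chunks, limit))
```

With a limit of 1000, a query that fetches 1000 chunks from ingesters and 1000 from long-term storage passes both checks, so 2000 chunks are fetched in total.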

The Alertmanager's experimental receivers firewall can now be configured on a per-tenant basis. The following CLI flags (and their respective YAML config options) have been renamed and moved to the limits config section:

  • -alertmanager.receivers-firewall.block.cidr-networks is renamed to -alertmanager.receivers-firewall-block-cidr-networks
  • -alertmanager.receivers-firewall.block.private-addresses is renamed to -alertmanager.receivers-firewall-block-private-addresses
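When upgrading, the renamed flags have to be updated wherever Cortex is launched. A small helper along these lines could rewrite an argument list; this is a sketch, the helper name `migrate_args` is hypothetical, and only the two renames listed above are covered.

```python
from typing import List

# Old flag name -> new flag name, as listed in the release notes.
RENAMED_FLAGS = {
    "-alertmanager.receivers-firewall.block.cidr-networks":
        "-alertmanager.receivers-firewall-block-cidr-networks",
    "-alertmanager.receivers-firewall.block.private-addresses":
        "-alertmanager.receivers-firewall-block-private-addresses",
}


def migrate_args(args: List[str]) -> List[str]:
    """Rewrite renamed flags, leaving values and other flags untouched."""
    migrated = []
    for arg in args:
        # Split "-flag=value" into name and value; bare "-flag" also works.
        name, sep, value = arg.partition("=")
        migrated.append(RENAMED_FLAGS.get(name, name) + sep + value)
    return migrated
```

Note that the release notes also move these settings to the limits config section, so YAML configuration needs a separate manual review.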

The default value of -server.grpc.keepalive.min-time-between-pings changes from 5m to 10s, and the default of -server.grpc.keepalive.ping-without-stream-allowed changes to true. For the ingester, the default value of -ingester.active-series-metrics-enabled is now true. This incurs a small increase in memory usage, between 1.2% and 1.6% as measured on ingesters with 1.3M active series. On the dependency side, go-redis is updated from v8.2.3 to v8.9.0.

New Features

The querier gains a new -querier.max-fetched-series-per-query flag. When Cortex runs with blocks storage, the max series per query limit is enforced in the querier and applies to the unique series received from ingesters and store-gateways (long-term storage). The querier and ruler also gain a new -querier.max-fetched-chunk-bytes-per-query flag. When Cortex runs with blocks storage, the max chunk bytes limit is enforced in the querier and ruler, and limits the total size in bytes of all aggregated chunks returned from ingesters and storage for a query.
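The key detail of the series limit is that it counts unique series across all sources, so a series returned by both an ingester and a store-gateway is counted once. The following Python sketch illustrates that idea under simplifying assumptions; the names `merge_unique_series` and `SeriesLimitError` are hypothetical and this is not the querier's real code.

```python
class SeriesLimitError(Exception):
    """Raised when a query would exceed the unique series limit."""


def merge_unique_series(sources, max_series):
    """Merge series from several sources, enforcing a cap on unique series.

    Each series is a dict of labels. It is stored as a frozenset of
    (label, value) pairs, so duplicates across sources count only once.
    """
    unique = set()
    for source in sources:
        for labels in source:
            unique.add(frozenset(labels.items()))
            if len(unique) > max_series:
                raise SeriesLimitError(
                    f"query would fetch more than {max_series} unique series"
                )
    return unique
```

For example, if ingesters return series A and B and the store-gateway returns A again, the query holds two unique series and passes a limit of 2.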

Alertmanager now supports negative matchers and time-based muting (see the upstream release notes). Rate limits have also been added to notifiers. The rate limit used by all integrations is configured with -alertmanager.notification-rate-limit, while per-integration rate limits are set via the -alertmanager.notification-rate-limit-per-integration parameter. Both shared and per-integration limits can be overridden using the overrides mechanism. These limits are applied on individual (per-tenant) alertmanagers. Rate-limited notifications count as failed notifications, and can be monitored via the new cortex_alertmanager_notification_rate_limited_total metric. The new -alertmanager.max-config-size-bytes limit controls the size of configuration files that Cortex users can upload to Alertmanager via the API. This limit is configurable per tenant.
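To illustrate how a shared rate limit with per-integration overrides might behave, here is a minimal token-bucket sketch. It is an assumption-laden illustration, not the Alertmanager's implementation: the `TokenBucket` class, the `burst` parameter and the `limit_for` helper are all hypothetical, and time is passed in explicitly to keep the example deterministic.

```python
class TokenBucket:
    """A simple token bucket: `rate` tokens per second, up to `burst` tokens."""

    def __init__(self, rate: float, burst: int, now: float = 0.0):
        self.rate = rate
        self.burst = burst
        self.tokens = float(burst)
        self.last = now

    def allow(self, now: float) -> bool:
        """Refill based on elapsed time, then try to spend one token."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(float(self.burst), self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # caller treats this as a failed (rate-limited) notification


def limit_for(integration: str, shared_rate: float, per_integration: dict) -> float:
    """Pick the per-integration rate if one is set, else the shared rate."""
    return per_integration.get(integration, shared_rate)
```

In this sketch, a notification that is refused by `allow` would be reported as failed and counted in a rate-limited counter, mirroring the behaviour described for cortex_alertmanager_notification_rate_limited_total.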

The new -alertmanager.max-templates-count and -alertmanager.max-template-size-bytes options control the number and size of templates uploaded to Alertmanager via the API. These limits are configurable per tenant. A new -debug.block-profile-rate flag enables profiling of goroutine blocking events.

The experimental sharding feature for Alertmanager is now considered complete. Detailed information about the configuration options can be found here for alertmanager and here for alertmanager storage. To use the feature:

  • Configure a remote storage backend for Alertmanager to store state using -alertmanager-storage.backend and the flags related to that backend. The local and configdb storage backends are not supported.
  • Configure a ring store using -alertmanager.sharding-ring.store and set the flags relevant to the chosen store type.
  • Enable the feature using -alertmanager.sharding-enabled.

Note the earlier addition of the -alertmanager.persist-interval configuration option, which sets the interval between persisting the current alertmanager state (notification log and silences) to object storage. Refer to the configuration file reference for more information.
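The sharding prerequisites can be turned into a quick pre-flight check. The sketch below assumes the flags have already been parsed into a dict of flag name to string value; the function name `check_sharding_flags` and the wording of the messages are hypothetical.

```python
# Storage backends that the release notes say are not supported with sharding.
UNSUPPORTED_BACKENDS = {"local", "configdb"}


def check_sharding_flags(flags: dict) -> list:
    """Return a list of problems; an empty list means the basics are in place."""
    problems = []

    backend = flags.get("-alertmanager-storage.backend")
    if not backend:
        problems.append("-alertmanager-storage.backend is not set")
    elif backend in UNSUPPORTED_BACKENDS:
        problems.append(f"storage backend '{backend}' is not supported with sharding")

    if not flags.get("-alertmanager.sharding-ring.store"):
        problems.append("-alertmanager.sharding-ring.store is not set")

    if flags.get("-alertmanager.sharding-enabled") != "true":
        problems.append("-alertmanager.sharding-enabled is not true")

    return problems
```

The check only covers the three steps listed above. Backend-specific and store-specific flags still need to be verified against the configuration reference.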

Enhancements

Alertmanager now cleans up persisted state objects from remote storage when a tenant configuration is deleted. For storage, OpenCensus can now be disabled within the GCS client (e.g. -gcs.enable-opencensus=false). For etcd, a username and password have been added to the etcd config.

Alertmanager also introduces new metrics to monitor operation when using -alertmanager.sharding-enabled:

cortex_alertmanager_state_fetch_replica_state_total

cortex_alertmanager_state_fetch_replica_state_failed_total

cortex_alertmanager_state_initial_sync_total

cortex_alertmanager_state_initial_sync_completed_total

cortex_alertmanager_state_initial_sync_duration_seconds

cortex_alertmanager_state_persist_total

cortex_alertmanager_state_persist_failed_total

Blocks storage now supports ingesting and querying exemplars. They are enabled by setting the new CLI flag -blocks-storage.tsdb.max-exemplars=<n> or the config option blocks_storage.tsdb.max_exemplars to a positive value. The distributor gets a distributors ring status section on the admin page. Zone-awareness support has been added to alertmanager for use when sharding is enabled: with zone-awareness enabled, alerts are replicated across availability zones. Tracing spans now carry a tenant_ids tag.

For the ring and query-frontend, Cortex now avoids using [automatic private IPs (APIPA)](https://www.pcmag.com/encyclopedia/term/apipa) when discovering the IP address from the interface, both during registration of the instance in the ring and in the query-frontend when it is used with query-scheduler. APIPA addresses are still used as a last resort, with a log message indicating their use. Memberlist introduces new metrics to aid troubleshooting tombstone convergence:
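The address selection described above (prefer a regular address, fall back to APIPA with a log line) can be sketched with Python's standard `ipaddress` module. The function name `pick_address` is hypothetical and the logic is a simplification of interface discovery, which in practice inspects network interfaces rather than a plain list.

```python
import ipaddress
import logging


def pick_address(candidates):
    """Return the first non-link-local address, or an APIPA one as last resort."""
    fallback = None
    for raw in candidates:
        addr = ipaddress.ip_address(raw)
        # 169.254.0.0/16 (APIPA) is reported as link-local by the ipaddress module.
        if addr.is_link_local:
            if fallback is None:
                fallback = raw
            continue
        return raw
    if fallback is not None:
        logging.warning("using automatic private IP %s as a last resort", fallback)
    return fallback
```

Note that `is_link_local` is also true for IPv6 link-local addresses (fe80::/10), so this sketch skips those as well.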

memberlist_client_kv_store_value_tombstones

memberlist_client_kv_store_value_tombstones_removed_total

memberlist_client_messages_to_broadcast_dropped_total

Alertmanager also gains the -alertmanager.max-dispatcher-aggregation-groups option to control the maximum number of active dispatcher groups (per tenant, also overridable). When the limit is reached, the dispatcher produces a log message and increments the cortex_alertmanager_dispatcher_aggregation_group_limit_reached_total metric. In addition, -alertmanager.max-alerts-count and -alertmanager.max-alerts-size-bytes control the maximum number of alerts and the total size of alerts that a single user can have in Alertmanager’s memory. Adding more alerts fails with a log message and increments the cortex_alertmanager_alerts_insert_limited_total metric (per user). These limits can be overridden with per-tenant overrides. Current values are tracked in the cortex_alertmanager_alerts_limiter_current_alerts and cortex_alertmanager_alerts_limiter_current_alerts_size_bytes metrics.
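As a rough mental model of the alerts limits, the sketch below tracks a per-user alert count and total size, and refuses inserts that would exceed either cap. The class name `AlertsLimiter` and its attributes are hypothetical; the real limiter also has to handle alert updates and removals, which are omitted here.

```python
class AlertsLimiter:
    """Toy per-user limiter on the number and total size of alerts in memory."""

    def __init__(self, max_alerts: int, max_size_bytes: int):
        self.max_alerts = max_alerts
        self.max_size_bytes = max_size_bytes
        self.current_alerts = 0
        self.current_size_bytes = 0
        self.insert_limited_total = 0  # stands in for the "insert limited" counter

    def try_insert(self, alert_size_bytes: int) -> bool:
        """Insert an alert if both limits allow it; otherwise count the refusal."""
        over_count = self.current_alerts + 1 > self.max_alerts
        over_size = self.current_size_bytes + alert_size_bytes > self.max_size_bytes
        if over_count or over_size:
            self.insert_limited_total += 1
            return False
        self.current_alerts += 1
        self.current_size_bytes += alert_size_bytes
        return True
```

The two `current_*` attributes play the role of the gauges that expose current usage, while `insert_limited_total` plays the role of the counter incremented on each refused insert.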

Store-gateway adds -store-gateway.sharding-ring.wait-stability-min-duration and -store-gateway.sharding-ring.wait-stability-max-duration to wait for ring stability at startup. The ruler adds a rule_group label to the metrics cortex_prometheus_rule_group_iterations_total and cortex_prometheus_rule_group_iterations_missed_total. It also adds new metrics tracking the total number of queries and push requests sent to ingesters, as well as failed queries and push requests. Failures are only counted for internal errors, not for user errors such as limits or invalid queries. This contrasts with the existing cortex_prometheus_rule_evaluation_failures_total, which is also incremented when a query or samples appending fails due to user errors. The new ruler metrics are:

cortex_ruler_write_requests_total

cortex_ruler_write_requests_failed_total

cortex_ruler_queries_total

cortex_ruler_queries_failed_total
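The distinction between internal errors and user errors in the ruler metrics can be illustrated with a small wrapper. This is a sketch only: `UserError` and `RulerQueryMetrics` are hypothetical names, and plain integers stand in for Prometheus counters.

```python
class UserError(Exception):
    """Stands in for user-caused failures such as a limit or an invalid query."""


class RulerQueryMetrics:
    """Toy counters mirroring the total/failed split of the ruler metrics."""

    def __init__(self):
        self.queries_total = 0
        self.queries_failed_total = 0

    def run(self, query_fn):
        """Run query_fn, counting only non-user errors as failures."""
        self.queries_total += 1
        try:
            return query_fn()
        except UserError:
            raise  # user error: counted in the total, not as a failure
        except Exception:
            self.queries_failed_total += 1  # internal error
            raise
```

Alerting on the failed counter in this model only fires for problems on the operator's side, which appears to be the intent behind separating it from the rule evaluation failures metric.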

The ingester adds the option -ingester.ignore-series-limit-for-metric-names, a comma-separated list of metric names that are ignored by the max series per metric limit. Instrumentation has been added to the Redis client, with the metric cortex_rediscache_request_duration_seconds. In the blocksconvert tool, the scanner adds support for DynamoDB (v9 schema only) and retries failed uploads, and Cassandra support has been added.
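For the ingester option that exempts certain metric names from the max series per metric limit, the following sketch shows one way such a check could work. The helper names and the exact comparison are assumptions for illustration, not the ingester's code.

```python
def parse_ignored_metrics(flag_value: str) -> set:
    """Parse a comma-separated list of metric names, ignoring empty entries."""
    return {name.strip() for name in flag_value.split(",") if name.strip()}


def can_add_series(metric_name: str, current_series: int,
                   max_series_per_metric: int, ignored: set) -> bool:
    """Allow a new series if the metric is exempt or still under its limit."""
    if metric_name in ignored:
        return True  # exempt metrics skip the per-metric limit entirely
    return current_series < max_series_per_metric
```

For example, with the flag value `up,node_info`, a new series for `up` is accepted even when the metric already holds the maximum number of series.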

Bug Fixes

This release fixes a number of bugs. In the purger, the “Invalid null value in condition for column range” error, caused by a nil value in the range for the WriteBatch query, is fixed. In the ingester, an infrequent panic caused by a race condition between TSDB mmap-ed head chunks truncation and queries is fixed. The Alertmanager status page now works when clustering via gossip is disabled or sharding is enabled. In the ruler, the /ruler/rule_groups endpoint now works when used with an object store, and the evaluation delay is now honoured for the ALERTS and ALERTS_FOR_STATE series. On Redis Cluster, Cortex now makes multiple Get requests instead of MGet. Also in the ingester, an issue where runtime limits erroneously overrode default limits is fixed.

The ruler now starts up correctly in single-binary mode when the new ruler_storage is used. In the querier, queries no longer fail with the “at least 1 healthy replica required, could only find 0” error right after scaling up store-gateways, before they become ACTIVE in the ring. An issue where samples in a chunk might get skipped by the batch iterator is also fixed. In the store-gateway, when blocks sharding is enabled, a cold startup no longer loads all blocks in each store-gateway; only the blocks owned by the store-gateway replica are loaded. For memberlist, the default configuration value for -memberlist.retransmit-factor is now set correctly when none is provided. This should improve the propagation delay of the ring state (including, but not limited to, tombstones). Note that this fix has no effect if the value is already set explicitly in the configuration.

Conclusion

This article covered the changes, new features, enhancements and bug fixes that this release brings to Cortex. Version 1.10.0 is quite a big update, and you can try it out by downloading Cortex from here. Read more of our blogs here.
