Cortex v1.10 is out. In this article we will walk through the crucial changes, along with the enhancements, fixes and the many new features this release includes. But first, let us see what Cortex is and what it does.
Cortex provides horizontally scalable, highly available, multi-tenant, long-term storage for Prometheus. It is a CNCF incubation project used in several production systems, including Weave Cloud and Grafana Cloud. Cortex is primarily used as a remote write destination for Prometheus, offering a Prometheus-compatible query API.
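As a quick illustration of that remote-write role, a Prometheus configuration pointing at Cortex might look like the following sketch; the URL and tenant ID are illustrative placeholders, not values from this release:

```yaml
# Prometheus side: forward all scraped samples to Cortex.
remote_write:
  - url: http://cortex.example.local/api/v1/push  # hypothetical Cortex push endpoint
    headers:
      # Cortex is multi-tenant; the tenant is conveyed in this header
      # when no auth proxy injects it for you.
      X-Scope-OrgID: tenant-1
```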
Firstly, let us explore the highlights of this release. Chunks storage is now deprecated and in maintenance mode. Version 1.10 adds support for exemplars, in memory only for now. The devs have added many new limits to help protect installations against overload. Also, with this update, the sharding feature in Alertmanager is now considered complete. The release additionally ships ARM binaries and packages (but not container images yet).
The new update prevents path traversal attacks from users able to control the HTTP header X-Scope-OrgID (CVE-2021-36157). Note that users only have control over this header when Cortex is not fronted by an auth proxy that validates the tenant IDs. The release also enables strict JSON unmarshalling for the
pkg/util/validation.Limits struct. The custom
UnmarshalJSON() will now fail if the input has unknown fields. As mentioned above, Cortex chunks storage is deprecated and now in maintenance mode, so the devs encourage all Cortex users to migrate to the blocks storage; no new features will be added to the chunks storage. The default Cortex configuration still runs the chunks engine. For more information, please have a look at the blocks storage doc on how to configure Cortex to run with the blocks storage.
With this update, the example Kubernetes manifests (stored at k8s/) have been removed due to a lack of proper support and maintenance. There is a significant change in the querier/ruler: the
-store.query-chunk-limit CLI flag (and its respective YAML config option
max_chunks_per_query) is deprecated in favour of
-querier.max-fetched-chunks-per-query (and its respective YAML config option
max_fetched_chunks_per_query). The new limit specifies the maximum number of chunks that a single query can fetch from ingesters and long-term storage. Because the limit is applied independently when querying ingesters and long-term storage, the total number of actually fetched chunks could be up to 2x the configured value.
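In YAML terms, the renamed option would be set in the limits section along these lines (the value is an arbitrary example, not a recommendation):

```yaml
limits:
  # Replaces the deprecated max_chunks_per_query option.
  # Applied independently to ingesters and long-term storage,
  # so the actual number of fetched chunks may reach twice this value.
  max_fetched_chunks_per_query: 2000000
```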
The new update allows the Alertmanager to configure the experimental receivers firewall on a per-tenant basis. Several CLI flags (and their respective YAML config options) have also been renamed and moved to the limits config section; see the upstream changelog for the full list of renames.
The default value of
-server.grpc.keepalive.min-time-between-pings has changed from 5m to 10s, and
-server.grpc.keepalive.ping-without-stream-allowed now defaults to true. For the ingester, the default value of
-ingester.active-series-metrics-enabled has changed to true. This incurs a slight increase in memory usage, between 1.2% and 1.6%, as measured on ingesters with 1.3M active series. On the dependency side, go-redis has been updated from v8.2.3 to v8.9.0.
The new release brings a new feature for the querier with the addition of a new -querier.max-fetched-series-per-query flag. When Cortex runs with blocks storage, the max series per query limit is enforced in the querier and applies to unique series received from ingesters and store-gateways (long-term storage). For the querier/ruler, there is also a new
-querier.max-fetched-chunk-bytes-per-query flag. When Cortex runs with blocks storage, the max chunk bytes limit is enforced in the querier and ruler, and limits the size, in bytes, of all aggregated chunks returned from ingesters and storage for a query.
Alertmanager now supports negative matchers and time-based muting (see the upstream release notes). Rate limits have also been added to notifiers. The rate limits used by all integrations can be configured with
-alertmanager.notification-rate-limit, while per-integration rate limits can be specified via the
-alertmanager.notification-rate-limit-per-integration parameter. Both shared and per-integration limits can be overwritten using the overrides mechanism. The new update applies these limits on individual (per-tenant) alertmanagers. Rate-limited notifications count as failed notifications and can be monitored via the
new cortex_alertmanager_notification_rate_limited_total metric. Again, there is a new
-alertmanager.max-config-size-bytes limit to control the size of configuration files that Cortex users can upload to Alertmanager via the API. This limit is configurable per-tenant.
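As a sketch, per-tenant notification rate limits could be expressed through the overrides mechanism roughly as follows. The flag names come from the release notes, but the exact YAML keys, tenant name and values here are illustrative assumptions:

```yaml
# Hypothetical per-tenant overrides file.
overrides:
  tenant-1:
    # Assumed YAML equivalent of -alertmanager.notification-rate-limit
    # (notifications per second across all integrations).
    alertmanager_notification_rate_limit: 10
    # Assumed YAML equivalent of
    # -alertmanager.notification-rate-limit-per-integration.
    alertmanager_notification_rate_limit_per_integration:
      webhook: 2
```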
In this new release, the devs have also added the
-alertmanager.max-template-size-bytes option to control the size of templates uploaded to Alertmanager via the API. These limits are configurable per-tenant. Also, there is the addition of a
-debug.block-profile-rate flag to enable goroutine blocking events profiling. With this update, the experimental sharding feature for Alertmanager is considered complete. Detailed information about the configuration options can be found in the Alertmanager and Alertmanager storage documentation. To use the feature, we must configure a remote storage backend for Alertmanager to store its state, using
-alertmanager-storage.backend and the flags related to that backend. Keep in mind that this release does not support the
configdb storage backend. We must also configure a ring store using
-alertmanager.sharding-ring.store, set the flags relevant to the chosen store type, and enable the feature using
-alertmanager.sharding-enabled. Note the earlier addition of a new configuration option,
-alertmanager.persist-interval, which sets the interval between persisting the current alertmanager state (notification log and silences) to object storage. Refer to the configuration file reference for more information.
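Putting the steps above together, enabling Alertmanager sharding might look roughly like the following set of flags. The flag names are from the release notes; s3 and consul are example backend choices, and the backend-specific flags are elided:

```sh
# Sketch only: run the alertmanager target with sharding enabled.
cortex -target=alertmanager \
  -alertmanager-storage.backend=s3 \
  -alertmanager.sharding-ring.store=consul \
  -alertmanager.sharding-enabled=true \
  -alertmanager.persist-interval=15m
```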
The new release brings a new enhancement for Alertmanager, which will now clean up persisted state objects from remote storage when a tenant configuration is deleted. For storage, there is a new ability to disable OpenCensus within the GCS client (e.g.
-gcs.enable-opencensus=false). Again, for etcd, the new update adds username and password options to the etcd config.
Another enhancement for Alertmanager is the introduction of new metrics to monitor operation when using -alertmanager.sharding-enabled.
Blocks storage got an enhancement with support for ingesting and querying exemplars. We can enable them by setting the new CLI flag
-blocks-storage.tsdb.max-exemplars=<n> or the config option
blocks_storage.tsdb.max_exemplars to a positive value. The distributor admin page has also been enhanced with a distributors ring status section. Alertmanager additionally gained zone-awareness support for use when sharding is enabled; with zone-awareness enabled, alerts are replicated across availability zones. Enhancements like the addition of the tenant_ids tag to tracing spans will also make troubleshooting easier.
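For example, the exemplar support described above could be switched on with a config fragment like this (option name from the release notes; the value is arbitrary):

```yaml
blocks_storage:
  tsdb:
    # A positive value enables in-memory exemplar storage
    # with this capacity; exemplars are not persisted to blocks.
    max_exemplars: 100000
```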
For the ring and query-frontend, [automatic private IPs (APIPA)](https://www.pcmag.com/encyclopedia/term/apipa) are now avoided when discovering the IP address from the network interface during registration of an instance in the ring, or by the query-frontend when used with the query-scheduler. APIPA addresses are used only as a last resort, with a log message indicating their usage. This is an essential enhancement. There is another one for memberlist, where new metrics have been introduced to aid troubleshooting of tombstone convergence.
The new release brings other enhancements for Alertmanager with the addition of the
-alertmanager.max-dispatcher-aggregation-groups option to control the maximum number of active dispatcher groups in Alertmanager (per tenant, also overrideable). When the limit is reached, the dispatcher produces a log message and increases the
cortex_alertmanager_dispatcher_aggregation_group_limit_reached_total metric. Another enhancement is the addition of
-alertmanager.max-alerts-size-bytes to control the total size of alerts (alongside a companion limit on the number of alerts) that a single user can have in Alertmanager's memory. Adding more alerts will fail with a log message and increment the
cortex_alertmanager_alerts_insert_limited_total metric (per-user). These limits can be overridden using per-tenant overrides, and the current values can be tracked via dedicated metrics.
The store-gateway has been enhanced with
-store-gateway.sharding-ring.wait-stability-max-duration support, to wait for ring stability at startup. The new update enhances the ruler with the addition of the
rule_group label to the
cortex_prometheus_rule_group_iterations_missed_total metric, and with new metrics tracking the total number of queries and push requests sent to ingesters, as well as failed queries and push requests. Failures are only counted for internal errors, not user errors such as limits or invalid queries. This contrasts with the existing
cortex_prometheus_rule_evaluation_failures_total, which can also be incremented when a query or sample append fails due to user errors.
Version 1.10.0 also enhances the ingester with the addition of the
-ingester.ignore-series-limit-for-metric-names option, which takes a comma-separated list of metric names that will be ignored by the max series per metric limit. There is also the addition of instrumentation to the Redis client, with the
cortex_rediscache_request_duration_seconds metric. The scanner gained additional support for DynamoDB (v9 schema only) and retrying of failed uploads. The last enhancement in this update is the addition of Cassandra support.
The new update fixes a lot of bugs. For the purger, the "Invalid null value in condition for column range" error, caused by a nil value in the range for the WriteBatch query, has been fixed. For the ingester, an infrequent panic caused by a race condition between TSDB mmap-ed head chunks truncation and queries has been fixed. For Alertmanager, the status page has been fixed for the cases where clustering via gossip is disabled or sharding is enabled. For the ruler, the
/ruler/rule_groups endpoint now works when used with an object store, and the evaluation delay is now honoured for the
ALERTS_FOR_STATE series. On Redis Cluster, multiple Get requests are now made instead of MGet. Also for the ingester, an issue where runtime limits erroneously overrode default limits has been fixed.
For the ruler, startup in single-binary mode has been fixed when the new ruler_storage is used. The querier gets a fix for queries failing with an "at least 1 healthy replica required, could only find 0" error right after scaling up store-gateways, until they become ACTIVE in the ring. The new update also fixes an issue where samples in a chunk could be skipped by the batch iterator. The store-gateway gets a bug fix so that, on a cold startup with block sharding enabled, each store-gateway no longer loads all blocks but only the blocks owned by that store-gateway replica. For memberlist, the default configuration value for -memberlist.retransmit-factor is now set when none is provided, which improves the propagation delay of the ring state (including, but not limited to, tombstones). Keep in mind that this fix has no effect if the value was already explicitly configured.
Throughout the article, we have discussed the various changes, bug fixes, enhancements and new features that this new release brings to Cortex. Version 1.10.0 is quite a big update, and you can try it out by downloading Cortex from the project's releases page. Read more of our blogs here.