Spinning up a highly available Prometheus setup with Thanos

By Gcore

March 20, 2023


The Problem

Prometheus has become one of the standard tools in any monitoring stack thanks to its simple, reliable architecture and ease of use. Even so, the tool has some shortcomings at scale. When you try to scale Prometheus, one major issue you quickly run into is cross-shard visibility.

Prometheus encourages a functional sharding approach. Even a single Prometheus server provides enough scalability to free users from the complexity of horizontal sharding in virtually all use cases.

While this is a great deployment model, you often want to access all the data through the same API or UI – that is, a global view. For example, you can render multiple queries in a Grafana graph, but each query can be done only against a single Prometheus server.

The Solution

Thanos is an open-source, highly available Prometheus setup with long-term storage capabilities that seeks to act as a “silver bullet” for some of the shortcomings that plague vanilla Prometheus setups. Thanos allows users to aggregate Prometheus data natively by querying the Prometheus API directly, compact it efficiently and, most importantly, de-duplicate it.

Thanos’ architecture introduces a central query layer across all the servers via a sidecar component that sits alongside each Prometheus server and a central Querier component that responds to PromQL queries. This makes up a Thanos deployment.

Background

Following the KISS and Unix philosophies, Thanos is made of a set of components with each filling a specific role.

Sidecar: connects to Prometheus, reads its data for query and/or uploads it to cloud storage.
Store Gateway: serves metrics inside of a cloud storage bucket.
Compactor: compacts, downsamples and applies retention on the data stored in the cloud storage bucket.
Receiver: receives data from Prometheus’ remote-write WAL, exposes it and/or uploads it to cloud storage.
Ruler/Rule: evaluates recording and alerting rules against data in Thanos for exposition and/or upload.
Querier/Query: implements Prometheus’ v1 API to aggregate data from the underlying components.

[Diagram of the components listed above]

Thanos integrates with existing Prometheus servers through a Sidecar process, which runs on the same machine or in the same pod as the Prometheus server.

The purpose of the Sidecar is to back up Prometheus data into an object storage bucket and to give other Thanos components access to the Prometheus metrics via a gRPC API.

The Sidecar makes use of the Prometheus reload endpoint. Make sure it’s enabled with the flag --web.enable-lifecycle.
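As a quick check that the lifecycle API is available, the reload endpoint can be called by hand. The following is a minimal sketch, assuming Prometheus listens on localhost:9090; the PROM_URL variable and the two helper functions are illustrative names, not part of Prometheus or Thanos.

```shell
# Assumption: Prometheus is reachable at this address; override PROM_URL if not.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Print the reload endpoint for a base URL (a trailing slash is stripped).
reload_url() {
  printf '%s/-/reload\n' "${1%/}"
}

# Ask Prometheus to reload its configuration.
# This only succeeds when Prometheus runs with --web.enable-lifecycle.
reload_prometheus() {
  curl --fail --silent --show-error -X POST "$(reload_url "$1")"
}

# Usage (requires a running Prometheus):
#   reload_prometheus "$PROM_URL"
```

If the flag is missing, the request is expected to be rejected, which makes this a simple way to verify the setting before adding the Sidecar.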

Installing Thanos

Prerequisites

To install Thanos you’ll need:

One or more Prometheus v2.2.1+ installations with a persistent disk.
Optional object storage.

The easiest way to deploy Thanos for the purposes of this tutorial is to deploy the Thanos sidecar along with Prometheus using the official Helm chart.

To deploy both, put the values in a file named values.yaml, change the --namespace value, and run the following command:

helm upgrade --version="8.6.0" --install --namespace="my-lovely-namespace" --values values.yaml  prometheus-thanos-sidecar stable/prometheus

Note that you need to replace two placeholders in the values: BUCKET_REPLACE_ME and CLUSTER_NAME. Also, adjust all the other values according to your infrastructure requirements.
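The placeholder substitution can be scripted rather than done by hand. This is a sketch only: render_values is a hypothetical helper, the bucket and cluster names in the usage comment are made-up examples, and it assumes the replacement values contain no "|" or "&" characters, which sed would treat specially.

```shell
# Substitute the two placeholders in a values file and print the result.
# Usage: render_values <values-file> <bucket-name> <cluster-name>
render_values() {
  sed -e "s|BUCKET_REPLACE_ME|$2|g" \
      -e "s|CLUSTER_NAME|$3|g" \
      "$1"
}

# Example (hypothetical names):
#   render_values values.yaml my-thanos-bucket prod-eu > values.rendered.yaml
```

The rendered file can then be passed to the helm command through --values in place of the template.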

External Storage

The following configures the sidecar to write Prometheus’ data into a configured object storage:

# --tsdb.path:            TSDB data directory of Prometheus
# --prometheus.url:       be sure that the sidecar can use this URL!
# --objstore.config-file: storage configuration for uploading data
thanos sidecar \
    --tsdb.path            /var/prometheus \
    --prometheus.url       "http://localhost:9090" \
    --objstore.config-file bucket_config.yaml

The format of the YAML file depends on the provider you choose. Example configurations and an up-to-date list of the storage types Thanos supports are available in the Thanos storage documentation.

Rolling this out has little to zero impact on the running Prometheus instance. It is a good start to ensure you are backing up your data while figuring out the other pieces of Thanos.
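For illustration, the bucket_config.yaml passed to --objstore.config-file might look like the following for an S3-compatible provider. This is a sketch under assumptions: the bucket name, endpoint and credentials are placeholders invented for this example, and other providers use different fields.

```shell
# Write a minimal S3-style object storage configuration.
# All values below are placeholders; replace them with your own.
cat > bucket_config.yaml <<'EOF'
type: S3
config:
  bucket: "my-thanos-bucket"
  endpoint: "s3.example.com"
  access_key: "ACCESS_KEY_REPLACE_ME"
  secret_key: "SECRET_KEY_REPLACE_ME"
EOF

# The file contains credentials, so restrict its permissions.
chmod 600 bucket_config.yaml
```

Keeping the credentials in a separate file, rather than on the command line, keeps them out of the process list and shell history.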

Deduplicating data from Prometheus HA pairs

The Query component is also capable of deduplicating data collected from Prometheus HA pairs. This requires configuring Prometheus’s global.external_labels configuration block to identify the role of a given Prometheus instance.

A typical choice is simply the label name “replica” while letting the value be whatever you wish. For example, you might set up the following in Prometheus’s configuration file:

global:
  external_labels:
    region: eu-west
    monitor: infrastructure
    replica: A
# ...

In a Kubernetes stateful deployment, the replica label can also be the pod name.

Reload your Prometheus instances, and then, in Query, define replica as the label on which de-duplication should occur:

# --query.replica-label sets the replica label for de-duplication.
# Repeat the flag to support multiple replica labels.
thanos query \
    --http-address        0.0.0.0:19192 \
    --store               1.2.3.4:19090 \
    --store               1.2.3.5:19090 \
    --query.replica-label replica \
    --query.replica-label replicaX

Go to the configured HTTP address, and you should now be able to query across all Prometheus instances and receive de-duplicated data.
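The same check can be made from the command line, since Query exposes the Prometheus v1 HTTP API and accepts a dedup parameter on queries. The sketch below assumes Query is reachable on localhost at the port configured above; QUERY_URL, query_endpoint and thanos_instant_query are illustrative names.

```shell
# Assumption: Thanos Query is reachable on the HTTP address configured earlier.
QUERY_URL="${QUERY_URL:-http://localhost:19192}"

# Print the instant-query endpoint for a base URL (a trailing slash is stripped).
query_endpoint() {
  printf '%s/api/v1/query\n' "${1%/}"
}

# Run an instant query; the third argument toggles de-duplication (default: true).
# Usage: thanos_instant_query <base-url> <promql> [true|false]
thanos_instant_query() {
  curl --fail --silent --show-error --get \
    --data-urlencode "query=$2" \
    --data-urlencode "dedup=${3:-true}" \
    "$(query_endpoint "$1")"
}

# Examples (require a running Query instance):
#   thanos_instant_query "$QUERY_URL" 'up' true    # HA replicas merged
#   thanos_instant_query "$QUERY_URL" 'up' false   # each replica returned separately
```

Comparing the two responses is a simple way to confirm that the replica label is being picked up.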

Next Steps

At this point, you should have an idea of how Thanos approaches the task of solving Prometheus’s shortcomings. Thanos extends Prometheus with the sidecar component, which introduces a central query layer, provides a long-term metrics store, and makes it possible to de-duplicate your metric data.

I hope this overview has helped you gain valuable context surrounding Thanos and the issues it solves. Thanks for reading!

