High Availability Kubernetes Monitoring using Prometheus and Thanos

By Gcore

13 min read

Introduction

The need for Prometheus High Availability

Kubernetes adoption has grown severalfold in the past few months, and it is now clear that Kubernetes is the de facto standard for container orchestration. Prometheus, in turn, is considered an excellent choice for monitoring both containerized and non-containerized workloads. Monitoring is an essential aspect of any infrastructure, so the monitoring setup itself should be highly available and scalable enough to match the needs of an ever-growing infrastructure, especially in the case of Kubernetes.

Therefore, today we will deploy a clustered Prometheus setup that is not only resilient to node failures but also archives data for future reference. The setup is also scalable enough to span multiple Kubernetes clusters under the same monitoring umbrella.

Present scenario

The majority of Prometheus deployments use persistent volumes for their pods and scale Prometheus using a federated setup. However, not all data can be aggregated through federation, and you often need a separate mechanism to manage Prometheus configuration as you add servers.

The Solution

Thanos aims to solve the above problems. With Thanos, we can not only run multiple Prometheus instances and deduplicate data across them, but also archive data in long-term storage such as GCS or S3.

Implementation

Thanos Architecture

Image Source: https://thanos.io/quick-tutorial.md/

Thanos consists of the following components:

Thanos Sidecar: This is the main component that runs along Prometheus. It reads and archives data on the object store. Moreover, it manages Prometheus’ configuration and lifecycle. To distinguish each Prometheus instance, the sidecar component injects external labels into the Prometheus configuration. This component is capable of running queries on Prometheus servers’ PromQL interface. Sidecar components also listen on Thanos gRPC protocol and translate queries between gRPC and REST.
Thanos Store: This component implements the Store API on top of historical data in an object storage bucket. It acts primarily as an API gateway and therefore does not need significant amounts of local disk space. It joins a Thanos cluster on startup and advertises the data it can access. It keeps a small amount of information about all remote blocks on local disk and keeps it in-sync with the bucket. This data is generally safe to delete across restarts at the cost of increased startup times.
Thanos Query: The Query component listens on HTTP and translates queries to the Thanos gRPC format. It aggregates query results from different sources and can read data from both Sidecar and Store. In an HA setup, it also deduplicates the results.

Run-time deduplication of HA groups

Prometheus is stateful and does not allow replicating its database. This means that increasing availability by running multiple Prometheus replicas is not straightforward. Simple load balancing will not work: after a crash, for example, a replica may be back up, but querying it will show a small gap for the period it was down. A second replica may have been up during that period, but it could be down at another moment (e.g. during a rolling restart), so load balancing on top of the replicas will not work well.

Thanos Querier instead pulls data from both replicas and deduplicates those signals, filling any gaps transparently to the Querier's consumer.
Thanos Compact: The compactor component of Thanos applies the compaction procedure of the Prometheus 2.0 storage engine to block data stored in object storage. It is generally not semantically concurrency safe and must be deployed as a singleton against a bucket.
It is also responsible for downsampling of data – performing 5m downsampling after 40 hours and 1h downsampling after 10 days.
Thanos Ruler: It basically does the same thing as Prometheus’ rules. The only difference is that it can communicate with Thanos components.
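
The manifests later in this tutorial do not deploy the compactor, so here is a minimal sketch of what a singleton compactor could look like. It assumes the monitoring namespace, the prometheus-long-term bucket, and the thanos-gcs-credentials secret that are set up in the sections below. The name thanos-compactor and the emptyDir scratch volume are illustrative choices, not part of the original setup.

```yaml
# Hypothetical manifest: the name thanos-compactor and the emptyDir
# scratch volume are assumptions made for this sketch.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: thanos-compactor
  namespace: monitoring
  labels:
    app: thanos-compactor
spec:
  replicas: 1                 # the compactor must run as a singleton per bucket
  serviceName: thanos-compactor
  selector:
    matchLabels:
      app: thanos-compactor
  template:
    metadata:
      labels:
        app: thanos-compactor
    spec:
      containers:
        - name: thanos
          image: quay.io/thanos/thanos:v0.8.0
          args:
            - "compact"
            - "--log.level=info"
            - "--data-dir=/data"
            - "--objstore.config={type: GCS, config: {bucket: prometheus-long-term}}"
            - "--wait"        # keep running instead of exiting after one pass
          env:
            - name: GOOGLE_APPLICATION_CREDENTIALS
              value: /etc/secret/thanos-gcs-credentials.json
          volumeMounts:
            - name: data
              mountPath: /data
            - name: thanos-gcs-credentials
              mountPath: /etc/secret
              readOnly: true
      volumes:
        - name: data
          emptyDir: {}
        - name: thanos-gcs-credentials
          secret:
            secretName: thanos-gcs-credentials
```

Keeping replicas at 1 matters here: two compactors working on the same bucket can conflict, which is why the component description above calls for a singleton.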

Configuration

Prerequisite

In order to completely understand this tutorial, the following are needed:

Working knowledge of Kubernetes and using kubectl
A running Kubernetes cluster with at least 3 nodes
An Ingress controller and Ingress objects (this demo uses the Nginx Ingress Controller). This is not mandatory, but it is highly recommended in order to reduce the number of external endpoints created.
Creating credentials to be used by Thanos components to access object store (in this case GCS bucket)
Create 2 GCS buckets and name them as prometheus-long-term and thanos-ruler
Create a service account with the role as Storage Object Admin
Download the key file as json credentials and name it as thanos-gcs-credentials.json
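Create a Kubernetes secret from the credentials. The secret lives in the monitoring namespace, so that namespace must already exist when you run the command (create it first with kubectl create namespace monitoring if needed)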
kubectl create secret generic thanos-gcs-credentials --from-file=thanos-gcs-credentials.json -n monitoring
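
If you prefer to keep everything declarative, the same secret can be sketched as a manifest. This is an equivalent under the assumption that you paste the contents of the downloaded key file into the placeholder; the key name must stay thanos-gcs-credentials.json because the manifests below point GOOGLE_APPLICATION_CREDENTIALS at that file name.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: thanos-gcs-credentials
  namespace: monitoring
type: Opaque
stringData:
  # Placeholder: replace with the JSON contents of the service account key file.
  thanos-gcs-credentials.json: |
    <contents of thanos-gcs-credentials.json>
```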

Deploying various components

Deploying Prometheus Service Account, ClusterRole and ClusterRoleBinding

apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: monitoring
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: monitoring
  namespace: monitoring
rules:
- apiGroups: [""]
  resources:
  - nodes
  - nodes/proxy
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
- apiGroups: [""]
  resources:
  - configmaps
  verbs: ["get"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: monitoring
subjects:
  - kind: ServiceAccount
    name: monitoring
    namespace: monitoring
roleRef:
  kind: ClusterRole
  name: monitoring
  apiGroup: rbac.authorization.k8s.io

The above manifest creates the monitoring namespace and service accounts, clusterrole and clusterrolebinding needed by Prometheus.

Deploying Prometheus Configuration configmap

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-server-conf
  labels:
    name: prometheus-server-conf
  namespace: monitoring
data:
  prometheus.yaml.tmpl: |-
    global:
      scrape_interval: 5s
      evaluation_interval: 5s
      external_labels:
        cluster: prometheus-ha
        # Each Prometheus has to have unique labels.
        replica: $(POD_NAME)
    rule_files:
      - /etc/prometheus/rules/*rules.yaml
    alerting:
      # We want our alerts to be deduplicated
      # from different replicas.
      alert_relabel_configs:
      - regex: replica
        action: labeldrop
      alertmanagers:
        - scheme: http
          path_prefix: /
          static_configs:
            - targets: ['alertmanager:9093']
    scrape_configs:
    - job_name: kubernetes-nodes-cadvisor
      scrape_interval: 10s
      scrape_timeout: 10s
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      kubernetes_sd_configs:
        - role: node
      relabel_configs:
        - action: labelmap
          regex: __meta_kubernetes_node_label_(.+)
        # Only for Kubernetes ^1.7.3.
        # See: https://github.com/prometheus/prometheus/issues/2916
        - target_label: __address__
          replacement: kubernetes.default.svc:443
        - source_labels: [__meta_kubernetes_node_name]
          regex: (.+)
          target_label: __metrics_path__
          replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
      metric_relabel_configs:
        - action: replace
          source_labels: [id]
          regex: '^/machine\.slice/machine-rkt\\x2d([^\\]+)\\.+/([^/]+)\.service$'
          target_label: rkt_container_name
          replacement: '${2}-${1}'
        - action: replace
          source_labels: [id]
          regex: '^/system\.slice/(.+)\.service$'
          target_label: systemd_service_name
          replacement: '${1}'
    - job_name: 'kubernetes-pods'
      kubernetes_sd_configs:
        - role: pod
      relabel_configs:
        - action: labelmap
          regex: __meta_kubernetes_pod_label_(.+)
        - source_labels: [__meta_kubernetes_namespace]
          action: replace
          target_label: kubernetes_namespace
        - source_labels: [__meta_kubernetes_pod_name]
          action: replace
          target_label: kubernetes_pod_name
        - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
          action: keep
          regex: true
        - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
          action: replace
          target_label: __scheme__
          regex: (https?)
        - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
          action: replace
          target_label: __metrics_path__
          regex: (.+)
        # The port comes from the prometheus.io/port pod annotation.
        - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
          action: replace
          target_label: __address__
          regex: ([^:]+)(?::\d+)?;(\d+)
          replacement: $1:$2
    - job_name: 'kubernetes-apiservers'
      kubernetes_sd_configs:
        - role: endpoints
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
        - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
          action: keep
          regex: default;kubernetes;https
    - job_name: 'kubernetes-service-endpoints'
      kubernetes_sd_configs:
        - role: endpoints
      relabel_configs:
        - action: labelmap
          regex: __meta_kubernetes_service_label_(.+)
        - source_labels: [__meta_kubernetes_namespace]
          action: replace
          target_label: kubernetes_namespace
        - source_labels: [__meta_kubernetes_service_name]
          action: replace
          target_label: kubernetes_name
        - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
          action: keep
          regex: true
        - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
          action: replace
          target_label: __scheme__
          regex: (https?)
        - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
          action: replace
          target_label: __metrics_path__
          regex: (.+)
        - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
          action: replace
          target_label: __address__
          regex: (.+)(?::\d+);(\d+)
          replacement: $1:$2

The above ConfigMap creates a Prometheus configuration file template. The Thanos sidecar reads this template and generates the actual configuration file, which is then consumed by the Prometheus container running in the same pod. It is extremely important to include the external_labels section in the config file so that the Querier can deduplicate data based on it.
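
To make the substitution concrete, here is roughly what the top of the generated file should look like inside the first replica. This is an illustrative fragment, assuming the pod is named prometheus-0 (StatefulSet pods are named after the set plus an ordinal) and that the sidecar writes the rendered file to /etc/prometheus-shared/prometheus.yaml as configured in the StatefulSet below.

```yaml
# /etc/prometheus-shared/prometheus.yaml as rendered in pod prometheus-0 (illustrative)
global:
  scrape_interval: 5s
  evaluation_interval: 5s
  external_labels:
    cluster: prometheus-ha
    replica: prometheus-0   # was $(POD_NAME) in the template
```

The other two replicas get replica: prometheus-1 and replica: prometheus-2, which is what lets the Querier tell the copies apart and merge them.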

Deploying Prometheus Rules configmap

This will create our alert rules, which will be relayed to Alertmanager for delivery.

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-rules
  labels:
    name: prometheus-rules
  namespace: monitoring
data:
  alert-rules.yaml: |-
    groups:
      - name: Deployment
        rules:
        - alert: Deployment at 0 Replicas
          annotations:
            summary: Deployment {{$labels.deployment}} in {{$labels.namespace}} is currently having no pods running
          expr: |
            sum(kube_deployment_status_replicas{pod_template_hash=""}) by (deployment,namespace)  < 1
          for: 1m
          labels:
            team: devops
        - alert: HPA Scaling Limited
          annotations:
            summary: HPA named {{$labels.hpa}} in {{$labels.namespace}} namespace has reached scaling limited state
          expr: |
            (sum(kube_hpa_status_condition{condition="ScalingLimited",status="true"}) by (hpa,namespace)) == 1
          for: 1m
          labels:
            team: devops
        - alert: HPA at MaxCapacity
          annotations:
            summary: HPA named {{$labels.hpa}} in {{$labels.namespace}} namespace is running at Max Capacity
          expr: |
            ((sum(kube_hpa_spec_max_replicas) by (hpa,namespace)) - (sum(kube_hpa_status_current_replicas) by (hpa,namespace))) == 0
          for: 1m
          labels:
            team: devops
      - name: Pods
        rules:
        - alert: Container restarted
          annotations:
            summary: Container named {{$labels.container}} in {{$labels.pod}} in {{$labels.namespace}} was restarted
          expr: |
            sum(increase(kube_pod_container_status_restarts_total{namespace!="kube-system",pod_template_hash=""}[1m])) by (pod,namespace,container) > 0
          for: 0m
          labels:
            team: dev
        - alert: High Memory Usage of Container
          annotations:
            # The expression groups by container_name and pod_name, so the summary uses those labels.
            summary: Container named {{$labels.container_name}} in {{$labels.pod_name}} in {{$labels.namespace}} is using more than 75% of Memory Limit
          expr: |
            ((( sum(container_memory_usage_bytes{image!="",container_name!="POD", namespace!="kube-system"}) by (namespace,container_name,pod_name)  / sum(container_spec_memory_limit_bytes{image!="",container_name!="POD",namespace!="kube-system"}) by (namespace,container_name,pod_name) ) * 100 ) < +Inf ) > 75
          for: 5m
          labels:
            team: dev
        - alert: High CPU Usage of Container
          annotations:
            summary: Container named {{$labels.container_name}} in {{$labels.pod_name}} in {{$labels.namespace}} is using more than 75% of CPU Limit
          expr: |
            ((sum(irate(container_cpu_usage_seconds_total{image!="",container_name!="POD", namespace!="kube-system"}[30s])) by (namespace,container_name,pod_name) / sum(container_spec_cpu_quota{image!="",container_name!="POD", namespace!="kube-system"} / container_spec_cpu_period{image!="",container_name!="POD", namespace!="kube-system"}) by (namespace,container_name,pod_name) ) * 100)  > 75
          for: 5m
          labels:
            team: dev
      - name: Nodes
        rules:
        - alert: High Node Memory Usage
          annotations:
            summary: Node {{$labels.kubernetes_io_hostname}} has more than 80% memory used. Plan Capacity.
          expr: |
            (sum (container_memory_working_set_bytes{id="/",container_name!="POD"}) by (kubernetes_io_hostname) / sum (machine_memory_bytes{}) by (kubernetes_io_hostname) * 100) > 80
          for: 5m
          labels:
            team: devops
        - alert: High Node CPU Usage
          annotations:
            summary: Node {{$labels.kubernetes_io_hostname}} has more than 80% allocatable cpu used. Plan Capacity.
          expr: |
            (sum(rate(container_cpu_usage_seconds_total{id="/", container_name!="POD"}[1m])) by (kubernetes_io_hostname) / sum(machine_cpu_cores) by (kubernetes_io_hostname)  * 100) > 80
          for: 5m
          labels:
            team: devops
        - alert: High Node Disk Usage
          annotations:
            summary: Node {{$labels.kubernetes_io_hostname}} has more than 85% disk used. Plan Capacity.
          expr: |
            (sum(container_fs_usage_bytes{device=~"^/dev/[sv]d[a-z][1-9]$",id="/",container_name!="POD"}) by (kubernetes_io_hostname) / sum(container_fs_limit_bytes{container_name!="POD",device=~"^/dev/[sv]d[a-z][1-9]$",id="/"}) by (kubernetes_io_hostname)) * 100 > 85
          for: 5m
          labels:
            team: devops
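
Before shipping rules like these, it can help to check them offline. The following is a sketch of a unit test file for the first alert, assuming you have extracted the rules into a local file named alert-rules.yaml and have a promtool binary that supports promtool test rules (Prometheus 2.5 or later, which is newer than the v2.4.3 image used in this tutorial). The test file name, the deployment name web, and the default namespace are made up for the example.

```yaml
# deployment-test.yaml (hypothetical file name)
rule_files:
  - alert-rules.yaml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # A deployment that reports zero replicas for several minutes.
      - series: 'kube_deployment_status_replicas{deployment="web",namespace="default"}'
        values: '0 0 0 0'
    alert_rule_test:
      - eval_time: 3m
        alertname: Deployment at 0 Replicas
        exp_alerts:
          - exp_labels:
              team: devops
              deployment: web
              namespace: default
            exp_annotations:
              summary: Deployment web in default is currently having no pods running
```

Run it with promtool test rules deployment-test.yaml. The input series has no pod_template_hash label, which is what the pod_template_hash="" matcher in the rule selects.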

Deploying Prometheus Stateful Set

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast
  namespace: monitoring
provisioner: kubernetes.io/gce-pd
allowVolumeExpansion: true
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: prometheus
  namespace: monitoring
spec:
  replicas: 3
  serviceName: prometheus-service
  # apps/v1 requires an explicit selector that matches the pod template labels.
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
        thanos-store-api: "true"
    spec:
      serviceAccountName: monitoring
      containers:
        - name: prometheus
          image: prom/prometheus:v2.4.3
          args:
            - "--config.file=/etc/prometheus-shared/prometheus.yaml"
            - "--storage.tsdb.path=/prometheus/"
            - "--web.enable-lifecycle"
            - "--storage.tsdb.no-lockfile"
            - "--storage.tsdb.min-block-duration=2h"
            - "--storage.tsdb.max-block-duration=2h"
          ports:
            - name: prometheus
              containerPort: 9090
          volumeMounts:
            - name: prometheus-storage
              mountPath: /prometheus/
            - name: prometheus-config-shared
              mountPath: /etc/prometheus-shared/
            - name: prometheus-rules
              mountPath: /etc/prometheus/rules
        - name: thanos
          image: quay.io/thanos/thanos:v0.8.0
          args:
            - "sidecar"
            - "--log.level=debug"
            - "--tsdb.path=/prometheus"
            - "--prometheus.url=http://127.0.0.1:9090"
            - "--objstore.config={type: GCS, config: {bucket: prometheus-long-term}}"
            - "--reloader.config-file=/etc/prometheus/prometheus.yaml.tmpl"
            - "--reloader.config-envsubst-file=/etc/prometheus-shared/prometheus.yaml"
            - "--reloader.rule-dir=/etc/prometheus/rules/"
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: GOOGLE_APPLICATION_CREDENTIALS
              value: /etc/secret/thanos-gcs-credentials.json
          ports:
            - name: http-sidecar
              containerPort: 10902
            - name: grpc
              containerPort: 10901
          livenessProbe:
            httpGet:
              port: 10902
              path: /-/healthy
          readinessProbe:
            httpGet:
              port: 10902
              path: /-/ready
          volumeMounts:
            - name: prometheus-storage
              mountPath: /prometheus
            - name: prometheus-config-shared
              mountPath: /etc/prometheus-shared/
            - name: prometheus-config
              mountPath: /etc/prometheus
            - name: prometheus-rules
              mountPath: /etc/prometheus/rules
            - name: thanos-gcs-credentials
              mountPath: /etc/secret
              readOnly: false
      securityContext:
        fsGroup: 2000
        runAsNonRoot: true
        runAsUser: 1000
      volumes:
        - name: prometheus-config
          configMap:
            defaultMode: 420
            name: prometheus-server-conf
        - name: prometheus-config-shared
          emptyDir: {}
        - name: prometheus-rules
          configMap:
            name: prometheus-rules
        - name: thanos-gcs-credentials
          secret:
            secretName: thanos-gcs-credentials
  volumeClaimTemplates:
  - metadata:
      name: prometheus-storage
      namespace: monitoring
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: fast
      resources:
        requests:
          storage: 20Gi

It is important to understand the following about the manifest provided above:

Prometheus is deployed as a StatefulSet with 3 replicas, and each replica dynamically provisions its own persistent volume.
The Prometheus configuration is generated by the Thanos sidecar container from the template file we created above.
Thanos handles data compaction, so local compaction in Prometheus is switched off by setting --storage.tsdb.min-block-duration=2h and --storage.tsdb.max-block-duration=2h to the same value.
The Prometheus pods are labelled thanos-store-api: "true" so that each pod gets discovered by the headless service, which we will create next. It is this headless service that the Thanos Querier uses to query data across all Prometheus instances. We also apply the same label to the Thanos Store and Thanos Ruler components so that they too are discovered by the Querier and can be used for querying metrics.
The path to the GCS bucket credentials is provided through the GOOGLE_APPLICATION_CREDENTIALS environment variable, and the credentials file is mounted at that path from the secret we created as part of the prerequisites.
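
The manifests in this tutorial pass the bucket configuration inline through --objstore.config. Thanos also accepts the same content from a file via --objstore.config-file, which is easier to read once the configuration grows. As a sketch, the GCS configuration used here would look like this as a file, followed by an S3 variant for comparison. The S3 bucket name, endpoint, and key placeholders are hypothetical.

```yaml
# GCS: equivalent to the inline --objstore.config used in this tutorial
type: GCS
config:
  bucket: prometheus-long-term
---
# S3: illustrative only, all values below are placeholders
type: S3
config:
  bucket: my-thanos-bucket
  endpoint: s3.us-east-1.amazonaws.com
  access_key: <ACCESS_KEY>
  secret_key: <SECRET_KEY>
```

Only one of the two documents would go into a given file. If the file carries keys, as in the S3 case, mounting it from a Kubernetes secret keeps them out of the pod spec.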

Deploying Prometheus Services

apiVersion: v1
kind: Service
metadata:
  name: prometheus-0-service
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9090"
  namespace: monitoring
  labels:
    name: prometheus
spec:
  selector:
    statefulset.kubernetes.io/pod-name: prometheus-0
  ports:
    - name: prometheus
      port: 8080
      targetPort: prometheus
---
apiVersion: v1
kind: Service
metadata:
  name: prometheus-1-service
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9090"
  namespace: monitoring
  labels:
    name: prometheus
spec:
  selector:
    statefulset.kubernetes.io/pod-name: prometheus-1
  ports:
    - name: prometheus
      port: 8080
      targetPort: prometheus
---
apiVersion: v1
kind: Service
metadata:
  name: prometheus-2-service
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9090"
  namespace: monitoring
  labels:
    name: prometheus
spec:
  selector:
    statefulset.kubernetes.io/pod-name: prometheus-2
  ports:
    - name: prometheus
      port: 8080
      targetPort: prometheus
---
# This service creates an SRV record for the querier to find the store APIs
apiVersion: v1
kind: Service
metadata:
  name: thanos-store-gateway
  namespace: monitoring
spec:
  type: ClusterIP
  clusterIP: None
  ports:
    - name: grpc
      port: 10901
      targetPort: grpc
  selector:
    thanos-store-api: "true"

We create a separate service for each Prometheus pod in the StatefulSet. These are not strictly needed and exist only for debugging purposes. The purpose of the thanos-store-gateway headless service has been explained above. We will later expose the Prometheus services using an Ingress object.
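
The prometheus.io/scrape and prometheus.io/port annotations on these services are what the kubernetes-service-endpoints scrape job in our configuration keys on. The same mechanism works for your own workloads. Below is a sketch for a hypothetical application; the name my-app, its namespace, and port 8080 are assumptions made for the example.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app              # hypothetical application
  namespace: default
  annotations:
    prometheus.io/scrape: "true"     # required: the job keeps only annotated services
    prometheus.io/port: "8080"       # port to scrape
    prometheus.io/path: "/metrics"   # optional: overrides the metrics path
    prometheus.io/scheme: "http"     # optional: http or https
spec:
  selector:
    app: my-app
  ports:
    - name: http
      port: 8080
      targetPort: 8080
```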

Deploying Thanos Querier

apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: thanos-querier
  namespace: monitoring
  labels:
    app: thanos-querier
spec:
  replicas: 1
  selector:
    matchLabels:
      app: thanos-querier
  template:
    metadata:
      labels:
        app: thanos-querier
    spec:
      containers:
      - name: thanos
        image: quay.io/thanos/thanos:v0.8.0
        args:
        - query
        - --log.level=debug
        - --query.replica-label=replica
        - --store=dnssrv+thanos-store-gateway:10901
        ports:
        - name: http
          containerPort: 10902
        - name: grpc
          containerPort: 10901
        livenessProbe:
          httpGet:
            port: http
            path: /-/healthy
        readinessProbe:
          httpGet:
            port: http
            path: /-/ready
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: thanos-querier
  name: thanos-querier
  namespace: monitoring
spec:
  ports:
  - port: 9090
    protocol: TCP
    targetPort: http
    name: http
  selector:
    app: thanos-querier

This is one of the main components of the Thanos deployment. Note the following:

The container argument --store=dnssrv+thanos-store-gateway:10901 discovers all components from which metric data should be queried.
The thanos-querier service provides a web interface for running PromQL queries. It also has the option to deduplicate data across various Prometheus clusters.
This is the endpoint we give Grafana as the datasource for all dashboards.
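
Because the Querier speaks the Prometheus HTTP API, Grafana can treat it as an ordinary Prometheus datasource. A minimal provisioning sketch, assuming Grafana runs in the same cluster and reaches the thanos-querier service on port 9090 as defined above; the datasource name Thanos is an arbitrary choice.

```yaml
# Grafana datasource provisioning file (illustrative)
apiVersion: 1
datasources:
  - name: Thanos
    type: prometheus
    access: proxy
    url: http://thanos-querier.monitoring.svc.cluster.local:9090
    isDefault: true
```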

Deploying Thanos Store Gateway

apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: thanos-store-gateway
  namespace: monitoring
  labels:
    app: thanos-store-gateway
spec:
  replicas: 1
  selector:
    matchLabels:
      app: thanos-store-gateway
  serviceName: thanos-store-gateway
  template:
    metadata:
      labels:
        app: thanos-store-gateway
        thanos-store-api: "true"
    spec:
      containers:
        - name: thanos
          image: quay.io/thanos/thanos:v0.8.0
          args:
          - "store"
          - "--log.level=debug"
          - "--data-dir=/data"
          - "--objstore.config={type: GCS, config: {bucket: prometheus-long-term}}"
          - "--index-cache-size=500MB"
          - "--chunk-pool-size=500MB"
          env:
            - name: GOOGLE_APPLICATION_CREDENTIALS
              value: /etc/secret/thanos-gcs-credentials.json
          ports:
          - name: http
            containerPort: 10902
          - name: grpc
            containerPort: 10901
          livenessProbe:
            httpGet:
              port: 10902
              path: /-/healthy
          readinessProbe:
            httpGet:
              port: 10902
              path: /-/ready
          volumeMounts:
            - name: thanos-gcs-credentials
              mountPath: /etc/secret
              readOnly: false
      volumes:
        - name: thanos-gcs-credentials
          secret:
            secretName: thanos-gcs-credentials

This will create the store component which serves metrics from object storage to the Querier.

Deploying Thanos Ruler

apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: thanos-ruler-rules
  namespace: monitoring
data:
  alert_down_services.rules.yaml: |
    groups:
    - name: metamonitoring
      rules:
      - alert: PrometheusReplicaDown
        annotations:
          message: Prometheus replica in cluster {{$labels.cluster}} has disappeared from Prometheus target discovery.
        expr: |
          sum(up{cluster="prometheus-ha", instance=~".*:9090", job="kubernetes-service-endpoints"}) by (job,cluster) < 3
        for: 15s
        labels:
          severity: critical
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    app: thanos-ruler
  name: thanos-ruler
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: thanos-ruler
  serviceName: thanos-ruler
  template:
    metadata:
      labels:
        app: thanos-ruler
        thanos-store-api: "true"
    spec:
      containers:
        - name: thanos
          image: quay.io/thanos/thanos:v0.8.0
          args:
            - rule
            - --log.level=debug
            - --data-dir=/data
            - --eval-interval=15s
            - --rule-file=/etc/thanos-ruler/*.rules.yaml
            - --alertmanagers.url=http://alertmanager:9093
            - --query=thanos-querier:9090
            - "--objstore.config={type: GCS, config: {bucket: thanos-ruler}}"
            - --label=ruler_cluster="prometheus-ha"
            - --label=replica="$(POD_NAME)"
          env:
            - name: GOOGLE_APPLICATION_CREDENTIALS
              value: /etc/secret/thanos-gcs-credentials.json
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
          ports:
            - name: http
              containerPort: 10902
            - name: grpc
              containerPort: 10901
          livenessProbe:
            httpGet:
              port: http
              path: /-/healthy
          readinessProbe:
            httpGet:
              port: http
              path: /-/ready
          volumeMounts:
            - mountPath: /etc/thanos-ruler
              name: config
            - name: thanos-gcs-credentials
              mountPath: /etc/secret
              readOnly: false
      volumes:
        - configMap:
            name: thanos-ruler-rules
          name: config
        - name: thanos-gcs-credentials
          secret:
            secretName: thanos-gcs-credentials
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: thanos-ruler
  name: thanos-ruler
  namespace: monitoring
spec:
  ports:
    - port: 9090
      protocol: TCP
      targetPort: http
      name: http
  selector:
    app: thanos-ruler

Now, if you start an interactive shell in the same namespace as our workloads and check which pods thanos-store-gateway resolves to, you will see something like this:

root@my-shell-95cb5df57-4q6w8:/# nslookup thanos-store-gateway
Server:		10.63.240.10
Address:	10.63.240.10#53

Name:	thanos-store-gateway.monitoring.svc.cluster.local
Address: 10.60.25.2
Name:	thanos-store-gateway.monitoring.svc.cluster.local
Address: 10.60.25.4
Name:	thanos-store-gateway.monitoring.svc.cluster.local
Address: 10.60.30.2
Name:	thanos-store-gateway.monitoring.svc.cluster.local
Address: 10.60.30.8
Name:	thanos-store-gateway.monitoring.svc.cluster.local
Address: 10.60.31.2
root@my-shell-95cb5df57-4q6w8:/# exit

The IPs returned above correspond to our Prometheus pods, thanos-store and thanos-ruler. This can be verified as follows:

$ kubectl get pods -o wide -l thanos-store-api="true"
NAME                     READY   STATUS    RESTARTS   AGE    IP           NODE                              NOMINATED NODE   READINESS GATES
prometheus-0             2/2     Running   0          100m   10.60.31.2   gke-demo-1-pool-1-649cbe02-jdnv   <none>           <none>
prometheus-1             2/2     Running   0          14h    10.60.30.2   gke-demo-1-pool-1-7533d618-kxkd   <none>           <none>
prometheus-2             2/2     Running   0          31h    10.60.25.2   gke-demo-1-pool-1-4e9889dd-27gc   <none>           <none>
thanos-ruler-0           1/1     Running   0          100m   10.60.30.8   gke-demo-1-pool-1-7533d618-kxkd   <none>           <none>
thanos-store-gateway-0   1/1     Running   0          14h    10.60.25.4   gke-demo-1-pool-1-4e9889dd-27gc   <none>           <none>

Deploying Alertmanager

apiVersion: v1kind: Namespacemetadata:  name: monitoring---kind: ConfigMapapiVersion: v1metadata:  name: alertmanager  namespace: monitoringdata:  config.yml: |-    global:      resolve_timeout: 5m      slack_api_url: "<your_slack_hook>"      victorops_api_url: "<your_victorops_hook>"    templates:    - '/etc/alertmanager-templates/*.tmpl'    route:      group_by: ['alertname', 'cluster', 'service']      group_wait: 10s      group_interval: 1m      repeat_interval: 5m        receiver: default       routes:      - match:          team: devops        receiver: devops        continue: true       - match:           team: dev        receiver: dev        continue: true    receivers:    - name: 'default'    - name: 'devops'      victorops_configs:      - api_key: '<YOUR_API_KEY>'        routing_key: 'devops'        message_type: 'CRITICAL'        entity_display_name: '{{ .CommonLabels.alertname }}'        state_message: 'Alert: {{ .CommonLabels.alertname }}. Summary:{{ .CommonAnnotations.summary }}. RawData: {{ .CommonLabels }}'      slack_configs:      - channel: '#k8-alerts'        send_resolved: true    - name: 'dev'      victorops_configs:      - api_key: '<YOUR_API_KEY>'        routing_key: 'dev'        message_type: 'CRITICAL'        entity_display_name: '{{ .CommonLabels.alertname }}'        state_message: 'Alert: {{ .CommonLabels.alertname }}. Summary:{{ .CommonAnnotations.summary }}. 
RawData: {{ .CommonLabels }}'      slack_configs:      - channel: '#k8-alerts'        send_resolved: true---apiVersion: extensions/v1beta1kind: Deploymentmetadata:  name: alertmanager  namespace: monitoringspec:  replicas: 1  selector:    matchLabels:      app: alertmanager  template:    metadata:      name: alertmanager      labels:        app: alertmanager    spec:      containers:      - name: alertmanager        image: prom/alertmanager:v0.15.3        args:          - '--config.file=/etc/alertmanager/config.yml'          - '--storage.path=/alertmanager'        ports:        - name: alertmanager          containerPort: 9093        volumeMounts:        - name: config-volume          mountPath: /etc/alertmanager        - name: alertmanager          mountPath: /alertmanager      volumes:      - name: config-volume        configMap:          name: alertmanager      - name: alertmanager        emptyDir: {}---apiVersion: v1kind: Servicemetadata:  annotations:    prometheus.io/scrape: 'true'    prometheus.io/path: '/metrics'  labels:    name: alertmanager  name: alertmanager  namespace: monitoringspec:  selector:    app: alertmanager  ports:  - name: alertmanager    protocol: TCP    port: 9093    targetPort: 9093

This creates the Alertmanager Deployment and Service, which deliver all alerts generated by the Prometheus rules.
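
The routes in config.yml only match alerts that carry a team label. As a minimal sketch (the group name, alert name, expression, and duration are illustrative assumptions, not rules taken from this set-up), a Prometheus rule that would be routed to the devops receiver might look like this:

```yaml
groups:
- name: example-routing            # hypothetical group name
  rules:
  - alert: InstanceDown            # hypothetical alert name
    expr: up == 0                  # target failed its last scrape
    for: 5m
    labels:
      team: devops                 # matched by the 'devops' route in config.yml
    annotations:
      # surfaced through .CommonAnnotations.summary in state_message
      summary: "Instance {{ $labels.instance }} has been down for 5 minutes"
```

Alerts without a team label fall through to the default receiver, which has no notification configs and therefore sends nothing.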

Deploying kube-state-metrics

apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
# kubernetes versions before 1.8.0 should use rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: kube-state-metrics
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kube-state-metrics
subjects:
- kind: ServiceAccount
  name: kube-state-metrics
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
# kubernetes versions before 1.8.0 should use rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: kube-state-metrics
rules:
- apiGroups: [""]
  resources:
  - configmaps
  - secrets
  - nodes
  - pods
  - services
  - resourcequotas
  - replicationcontrollers
  - limitranges
  - persistentvolumeclaims
  - persistentvolumes
  - namespaces
  - endpoints
  verbs: ["list", "watch"]
- apiGroups: ["extensions"]
  resources:
  - daemonsets
  - deployments
  - replicasets
  verbs: ["list", "watch"]
- apiGroups: ["apps"]
  resources:
  - statefulsets
  verbs: ["list", "watch"]
- apiGroups: ["batch"]
  resources:
  - cronjobs
  - jobs
  verbs: ["list", "watch"]
- apiGroups: ["autoscaling"]
  resources:
  - horizontalpodautoscalers
  verbs: ["list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
# kubernetes versions before 1.8.0 should use rbac.authorization.k8s.io/v1beta1
kind: RoleBinding
metadata:
  name: kube-state-metrics
  namespace: monitoring
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: kube-state-metrics-resizer
subjects:
- kind: ServiceAccount
  name: kube-state-metrics
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
# kubernetes versions before 1.8.0 should use rbac.authorization.k8s.io/v1beta1
kind: Role
metadata:
  namespace: monitoring
  name: kube-state-metrics-resizer
rules:
- apiGroups: [""]
  resources:
  - pods
  verbs: ["get"]
- apiGroups: ["extensions"]
  resources:
  - deployments
  resourceNames: ["kube-state-metrics"]
  verbs: ["get", "update"]
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kube-state-metrics
  namespace: monitoring
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kube-state-metrics
  namespace: monitoring
spec:
  selector:
    matchLabels:
      k8s-app: kube-state-metrics
  replicas: 1
  template:
    metadata:
      labels:
        k8s-app: kube-state-metrics
    spec:
      serviceAccountName: kube-state-metrics
      containers:
      - name: kube-state-metrics
        image: quay.io/mxinden/kube-state-metrics:v1.4.0-gzip.3
        ports:
        - name: http-metrics
          containerPort: 8080
        - name: telemetry
          containerPort: 8081
        readinessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 5
          timeoutSeconds: 5
      - name: addon-resizer
        image: k8s.gcr.io/addon-resizer:1.8.3
        resources:
          limits:
            cpu: 150m
            memory: 50Mi
          requests:
            cpu: 150m
            memory: 50Mi
        env:
          - name: MY_POD_NAME
            valueFrom:
              fieldRef:
                fieldPath: metadata.name
          - name: MY_POD_NAMESPACE
            valueFrom:
              fieldRef:
                fieldPath: metadata.namespace
        command:
          - /pod_nanny
          - --container=kube-state-metrics
          - --cpu=100m
          - --extra-cpu=1m
          - --memory=100Mi
          - --extra-memory=2Mi
          - --threshold=5
          - --deployment=kube-state-metrics
---
apiVersion: v1
kind: Service
metadata:
  name: kube-state-metrics
  namespace: monitoring
  labels:
    k8s-app: kube-state-metrics
  annotations:
    prometheus.io/scrape: 'true'
spec:
  ports:
  - name: http-metrics
    port: 8080
    targetPort: http-metrics
    protocol: TCP
  - name: telemetry
    port: 8081
    targetPort: telemetry
    protocol: TCP
  selector:
    k8s-app: kube-state-metrics

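The kube-state-metrics deployment is needed because it watches the Kubernetes API server and exposes metrics about the state of cluster objects (Deployments, Pods, Nodes, Jobs and so on). The kubelet does not expose these natively, so they would otherwise not be available to Prometheus.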
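
To illustrate the kind of object-state metric this adds, here is a hedged sketch of an alerting rule built on two kube-state-metrics series. The group name, alert name, duration, and team label are assumptions chosen for illustration:

```yaml
groups:
- name: kube-state-metrics-examples      # hypothetical group name
  rules:
  - alert: DeploymentReplicasMismatch    # hypothetical alert name
    # desired replicas differ from the replicas that are actually available
    expr: kube_deployment_spec_replicas != kube_deployment_status_replicas_available
    for: 10m
    labels:
      team: devops
    annotations:
      summary: "Deployment {{ $labels.namespace }}/{{ $labels.deployment }} has unavailable replicas"
```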

Deploying Node-Exporter Daemonset

apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
---
# apps/v1 replaces extensions/v1beta1, which was removed in Kubernetes 1.16
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring
  labels:
    name: node-exporter
spec:
  # apps/v1 requires an explicit selector that matches the pod template labels
  selector:
    matchLabels:
      name: node-exporter
  template:
    metadata:
      labels:
        name: node-exporter
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9100"
    spec:
      hostPID: true
      hostIPC: true
      hostNetwork: true
      containers:
        - name: node-exporter
          image: prom/node-exporter:v0.16.0
          securityContext:
            privileged: true
          args:
            - --path.procfs=/host/proc
            - --path.sysfs=/host/sys
          ports:
            - containerPort: 9100
              protocol: TCP
          resources:
            limits:
              cpu: 100m
              memory: 100Mi
            requests:
              cpu: 10m
              memory: 100Mi
          volumeMounts:
            - name: dev
              mountPath: /host/dev
            - name: proc
              mountPath: /host/proc
            - name: sys
              mountPath: /host/sys
            - name: rootfs
              mountPath: /rootfs
      volumes:
        - name: proc
          hostPath:
            path: /proc
        - name: dev
          hostPath:
            path: /dev
        - name: sys
          hostPath:
            path: /sys
        - name: rootfs
          hostPath:
            path: /

The Node-Exporter DaemonSet runs a node-exporter pod on each node and exposes node-level metrics (CPU, memory, disk, network) that the Prometheus instances can scrape.
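
As a small sketch of how these node metrics might be used, the recording rule below precomputes per-node CPU utilisation from node_cpu_seconds_total. The group and rule names are illustrative assumptions:

```yaml
groups:
- name: node-exporter-examples                  # hypothetical group name
  rules:
  - record: instance:node_cpu_utilisation:rate5m   # hypothetical rule name
    # fraction of CPU time not spent idle, averaged over all cores of a node
    expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
```

Dashboards can then query the precomputed series instead of evaluating the rate expression on every refresh.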

Deploying Grafana

apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
---
# storage.k8s.io/v1 replaces v1beta1, which was removed in Kubernetes 1.22.
# StorageClass is cluster-scoped, so it takes no namespace.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast
provisioner: kubernetes.io/gce-pd
allowVolumeExpansion: true
---
# apps/v1 replaces apps/v1beta1, which was removed in Kubernetes 1.16
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: grafana
  namespace: monitoring
spec:
  replicas: 1
  serviceName: grafana
  # apps/v1 requires an explicit selector that matches the pod template labels
  selector:
    matchLabels:
      k8s-app: grafana
  template:
    metadata:
      labels:
        task: monitoring
        k8s-app: grafana
    spec:
      containers:
      - name: grafana
        image: k8s.gcr.io/heapster-grafana-amd64:v5.0.4
        ports:
        - containerPort: 3000
          protocol: TCP
        volumeMounts:
        - mountPath: /etc/ssl/certs
          name: ca-certificates
          readOnly: true
        - mountPath: /var
          name: grafana-storage
        env:
        - name: GF_SERVER_HTTP_PORT
          value: "3000"
          # The following env variables are required to make Grafana accessible via
          # the kubernetes api-server proxy. On production clusters, we recommend
          # removing these env variables, setup auth for grafana, and expose the grafana
          # service using a LoadBalancer or a public IP.
        - name: GF_AUTH_BASIC_ENABLED
          value: "false"
        - name: GF_AUTH_ANONYMOUS_ENABLED
          value: "true"
        - name: GF_AUTH_ANONYMOUS_ORG_ROLE
          value: Admin
        - name: GF_SERVER_ROOT_URL
          # If you're only using the API Server proxy, set this value instead:
          # value: /api/v1/namespaces/kube-system/services/monitoring-grafana/proxy
          value: /
      volumes:
      - name: ca-certificates
        hostPath:
          path: /etc/ssl/certs
  volumeClaimTemplates:
  - metadata:
      name: grafana-storage
      namespace: monitoring
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: fast
      resources:
        requests:
          storage: 5Gi
---
apiVersion: v1
kind: Service
metadata:
  labels:
    kubernetes.io/cluster-service: 'true'
    kubernetes.io/name: grafana
  name: grafana
  namespace: monitoring
spec:
  ports:
  - port: 3000
    targetPort: 3000
  selector:
    k8s-app: grafana

This creates the Grafana StatefulSet and Service, which will be exposed through our Ingress object. We should add Thanos Querier as the data source for Grafana. To do so:

Click on Add DataSource
Set Name: DS_PROMETHEUS
Set Type: Prometheus
Set URL: http://thanos-querier:9090
Save and Test.

You can now build your custom dashboards or simply import dashboards from grafana.net. Dashboards #315 and #1471 are good to start with.
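
The data source steps can also be captured as configuration instead of UI clicks. The sketch below uses Grafana's data source provisioning format (available from Grafana 5.0) and assumes the file is mounted under Grafana's provisioning/datasources directory. The exact path depends on the image, and this mount is not part of the manifests above:

```yaml
# Hypothetical file name: datasources.yaml
apiVersion: 1
datasources:
- name: DS_PROMETHEUS
  type: prometheus
  access: proxy                        # Grafana's backend proxies the queries
  url: http://thanos-querier:9090      # Thanos Querier service in the same namespace
  isDefault: true
```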

Deploying the Ingress Object

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: monitoring-ingress
  namespace: monitoring
  annotations:
    kubernetes.io/ingress.class: "nginx"
spec:
  rules:
  - host: grafana.<yourdomain>.com
    http:
      paths:
      - path: /
        backend:
          serviceName: grafana
          servicePort: 3000
  - host: prometheus-0.<yourdomain>.com
    http:
      paths:
      - path: /
        backend:
          serviceName: prometheus-0-service
          servicePort: 8080
  - host: prometheus-1.<yourdomain>.com
    http:
      paths:
      - path: /
        backend:
          serviceName: prometheus-1-service
          servicePort: 8080
  - host: prometheus-2.<yourdomain>.com
    http:
      paths:
      - path: /
        backend:
          serviceName: prometheus-2-service
          servicePort: 8080
  - host: alertmanager.<yourdomain>.com
    http:
      paths:
      - path: /
        backend:
          serviceName: alertmanager
          servicePort: 9093
  - host: thanos-querier.<yourdomain>.com
    http:
      paths:
      - path: /
        backend:
          serviceName: thanos-querier
          servicePort: 9090
  - host: thanos-ruler.<yourdomain>.com
    http:
      paths:
      - path: /
        backend:
          serviceName: thanos-ruler
          servicePort: 9090

This is the final piece of the puzzle. The Ingress exposes all our services outside the Kubernetes cluster so that we can access them. Make sure you replace <yourdomain> with a domain name that you control and whose DNS records you can point at the Ingress controller's service.
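
The extensions/v1beta1 Ingress API is no longer served from Kubernetes 1.22 onwards. On newer clusters the same rules would need the networking.k8s.io/v1 schema. The sketch below shows only the Grafana host and assumes an IngressClass named nginx exists in the cluster. The remaining hosts would follow the same pattern:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: monitoring-ingress
  namespace: monitoring
spec:
  ingressClassName: nginx          # assumption: an IngressClass called "nginx"
  rules:
  - host: grafana.<yourdomain>.com
    http:
      paths:
      - path: /
        pathType: Prefix           # required in networking.k8s.io/v1
        backend:
          service:
            name: grafana
            port:
              number: 3000
```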

You should now be able to access Thanos Querier at http://thanos-querier.<yourdomain>.com . It will look something like this:

Make sure deduplication is selected.

If you click on Stores, you can see all the active endpoints discovered by the thanos-store-gateway service.

You can now add Thanos Querier as the data source in Grafana and start creating dashboards.

Kubernetes Cluster Monitoring Dashboard

Kubernetes Node Monitoring Dashboard

Conclusion

Integrating Thanos with Prometheus lets us scale Prometheus horizontally. Since Thanos Querier can also pull metrics from other querier instances, you can pull metrics across clusters and visualize them in a single dashboard.

We are also able to archive metric data in an object store, which gives our monitoring system effectively unlimited storage, and to serve metrics from the object storage itself. A major part of the cost of this set-up can be attributed to the object storage (S3 or GCS). This can be reduced further by applying appropriate retention policies.
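
One place to apply such a retention policy is the Thanos Compactor, whose retention flags control how long each resolution is kept in the bucket. The fragment below is a sketch only: the durations are illustrative assumptions, and the other compactor arguments are omitted:

```yaml
# Fragment of a Thanos Compactor container spec; only the retention flags are shown.
args:
- "compact"
- "--retention.resolution-raw=30d"    # raw samples
- "--retention.resolution-5m=90d"     # 5-minute downsampled data
- "--retention.resolution-1h=365d"    # 1-hour downsampled data
```

Leaving a flag at its default of 0d keeps that resolution indefinitely.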

However, achieving all this requires quite a bit of configuration on your part. The manifests provided above have been tested in a production environment. Feel free to reach out should you have any questions around them.


How to get the size of a directory in Linux

Knowing how to check directory size in Linux is critical for managing storage space efficiently, whether you’re assessing the space used by a specific folder or preventing storage issues.

This guide covers commands and tools so you can easily calculate and analyze directory sizes in a Linux environment. We will guide you step by step through three methods: du, ncdu, and ls -la. They’re all effective and each offers different benefits.

What is a Linux directory?

A Linux directory is a special type of file that functions as a container for storing files and subdirectories. It plays a key role in organizing the Linux file system by creating a hierarchical structure. This arrangement simplifies file management, making it easier to locate, access, and organize related files.

#1 Get Linux directory size using the du command

The du command determines a directory’s size by reporting the disk space used by its files and subdirectories. The output can be presented in human-readable units such as kilobytes (KB), megabytes (MB), or gigabytes (GB).

Check the size of a specific directory in Linux

To get the size of a specific directory, open your terminal and type the following command:

du -sh /path/to/directory

In this command, replace /path/to/directory with the actual path of the directory you want to assess. The -s flag stands for “summary” and displays only the total size of the specified directory. The -h flag makes the output human-readable, showing sizes with unit suffixes such as K, M, or G.

Example: Here, we used the path /home/ubuntu/, where ubuntu is the name of our username directory. The du command returned 32K for this directory, indicating a size of 32 KB.

Check the size of all directories in Linux

To get the size of every subdirectory under a given directory, use the following command:

sudo du -h /path/to/directory

Without -s, du prints one line per subdirectory, ending with the total for the directory itself. To list individual files as well, add the -a flag. The sudo prefix is only needed when some subdirectories are not readable by your user.

Example: In this instance, we again used the path /home/ubuntu/, with ubuntu representing our username directory. Using the command du -h, we obtained an output listing each subdirectory within that path along with its size.

#2 Get Linux directory size using ncdu

If you’re looking for a more interactive and feature-rich approach to exploring directory sizes, consider using the ncdu (NCurses Disk Usage) tool. ncdu provides a visual representation of disk usage and allows you to navigate through directories, view size details, and identify large files with ease.

For Debian or Ubuntu, install it with this command:

sudo apt-get install ncdu

Once installed, run ncdu followed by the path to the directory you want to analyze:

ncdu /path/to/directory

This launches the ncdu interface, which shows a breakdown of file and subdirectory sizes. Use the arrow keys to navigate and explore various folders, and press q to exit the tool.

Example: To analyze the home directory, simply enter the ncdu command there and press Enter. The tool then displays an interactive listing of the directory’s contents and their sizes.

#3 Get Linux directory size using ls -la

You can alternatively use the ls command to list the files and directories within a directory. The options -l and -a modify the default behavior of ls as follows:

-l (long listing format): Displays detailed information for each file and directory, including file permissions, the number of links, owner, group, file size, the timestamp of the last modification, and the file or directory name.

-a (all files): Instructs ls to include all files, including hidden files and directories, whose names typically begin with a . (dot).

ls -la lists all files (including hidden ones) in long format. This command is especially useful when you want to inspect file attributes or see hidden files and directories. Note that for a directory, ls -la reports the size of the directory entry itself (often 4096 bytes), not the total size of its contents; use du when you need that total.

Example: When you enter the ls -la command and press Enter, you will see one line per file or directory. Each line includes:

File type and permissions (e.g., drwxr-xr-x): The first character indicates the file type: - for a regular file, d for a directory, and l for a symbolic link. The next nine characters are permissions in groups of three (rwx), where r = read, w = write, and x = execute. Permissions are shown for three classes of users: owner, group, and others.

Number of links (e.g., 2): For regular files, this usually indicates the number of hard links. For directories, it often reflects subdirectory links (e.g., the . and .. entries).

Owner and group (e.g., user group)

File size (e.g., 4096 or 1045 bytes)

Modification date and time (e.g., Jan 7 09:34)

File name (e.g., .bashrc, notes.txt, Documents): Files or directories that begin with a dot (.) are hidden (e.g., .bashrc).

Conclusion

That’s it! You can now determine the size of a directory in Linux. Measuring directory sizes is a crucial skill for efficient storage management. Whether you choose the straightforward du command, use the visual advantages of the ncdu tool, or opt for the file-level detail of ls -la, this knowledge helps you keep a Linux environment organized and efficient.

Looking to deploy Linux in the cloud? With Gcore Edge Cloud, you can choose from a wide range of pre-configured virtual machines suitable for Linux:

Affordable shared compute resources starting from €3.2 per month
Deploy across 50+ cloud regions with dedicated servers for low-latency applications
Secure apps and data with DDoS protection, WAF, and encryption at no additional cost

Get started today
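As a recap of the du method from this guide, the following is a minimal shell sketch that wraps the two most common checks into reusable functions. The function names dir_size and largest_entries are hypothetical helpers introduced here for illustration; only the du, sort, head, and cut utilities themselves are standard.

```shell
#!/bin/sh
# Hypothetical helper: print the total size of one directory
# in human-readable form (for example 32K), using du -sh.
dir_size() {
    # -s: summary only, -h: human-readable units
    # du separates size and path with a tab, which is cut's default delimiter
    du -sh -- "$1" | cut -f1
}

# Hypothetical helper: list the entries directly inside a directory,
# largest first. Sizes are printed in kilobytes (-k) rather than with -h
# so that a plain numeric sort orders them correctly.
# Note: the * glob does not match hidden (dot) entries.
largest_entries() {
    dir=$1
    count=${2:-10}   # how many entries to show, 10 by default
    du -sk -- "$dir"/* 2>/dev/null | sort -rn | head -n "$count"
}
```

For example, dir_size /home/ubuntu prints the total for that directory, and largest_entries /home/ubuntu 5 prints its five largest visible entries with sizes in KB. Sorting on raw kilobyte counts is a deliberate choice here, since sorting human-readable values requires a sort implementation that understands unit suffixes.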

How to Run Hugging Face Spaces on Gcore Inference at the Edge

Running machine learning models, especially large-scale models like GPT-3 or BERT, requires a lot of computing power and can introduce significant latency. This makes real-time applications resource-intensive and challenging to deliver. Running ML models at the edge is a lightweight approach offering significant advantages for latency, privacy, and resource optimization.

Gcore Inference at the Edge makes it simple to deploy and manage custom models efficiently, giving you the ability to deploy and scale your favorite Hugging Face models globally in just a few clicks. In this guide, we’ll walk you through how to use Gcore’s edge AI infrastructure to deploy a Hugging Face Space model, whether you’re developing NLP solutions or computer vision applications.

Step 1: Log In to the Gcore Customer Portal

Go to gcore.com and log in to the Gcore Customer Portal. If you don’t yet have an account, go ahead and create one. It’s free.

Step 2: Go to Inference at the Edge

In the Gcore Customer Portal, click Inference at the Edge from the left navigation menu. Then click Deploy custom model.

Step 3: Choose a Hugging Face Model

Open huggingface.co and browse the available models. Select the model you want to deploy. Navigate to the corresponding Hugging Face Space for the model. Click on Files in the Space and locate the Docker option. Copy the Docker image link and startup command from the Hugging Face Space.

Step 4: Deploy the Model on Gcore

Return to the Gcore Customer Portal deployment page and enter the following details:

Model image URL: registry.hf.space/ethux-mistral-pixtral-demo:latest
Startup command: python app.py
Container port: 7860

Configure the pod as follows:

GPU-optimized: 1x L40S
vCPUs: 16
RAM: 232 GiB

For optimal performance, choose any available region for routing placement. Name your deployment and click Deploy.

Step 5: Interact with Your Model

Once the model is up and running, you’ll be provided with an endpoint. You can now interact with the model via this endpoint to test and use your deployed model at the edge.

Powerful, Simple AI Deployment with Gcore

Gcore Inference at the Edge combines the ease of Hugging Face integration with the infrastructure needed for real-time, scalable, and global solutions. By using edge computing, you can optimize model performance and prepare your business for a world that increasingly demands fast, secure, and localized AI applications. Deploying models to the edge allows you to act on real-time insights and improve customer experiences. Whether you’re leading a team of developers or starting a new AI initiative, Gcore Inference at the Edge offers the tools you need.

Explore Gcore Inference at the Edge
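As a quick way to confirm that the endpoint from Step 5 is reachable, here is a minimal shell sketch using curl. It assumes the deployed Space serves its web interface over HTTP at the root path of the endpoint, which may not hold for every model. The function name check_endpoint is a hypothetical helper, and the endpoint URL you pass in is whatever the Gcore Customer Portal shows for your deployment.

```shell
#!/bin/sh
# Hypothetical helper: request the endpoint once and report the HTTP status.
# Succeeds (exit 0) only when the endpoint answers with 200.
check_endpoint() {
    url=$1
    if [ -z "$url" ]; then
        echo "usage: check_endpoint <endpoint-url>" >&2
        return 2
    fi
    # -s: silent, -o /dev/null: discard the body,
    # -w '%{http_code}': print only the status code,
    # --max-time 30: give up after 30 seconds
    status=$(curl -s -o /dev/null -w '%{http_code}' --max-time 30 "$url")
    echo "HTTP status: $status"
    [ "$status" = "200" ]
}
```

Call it as check_endpoint followed by your endpoint URL. A non-200 status right after deployment may simply mean the container is still starting, so it can be worth retrying after a short wait.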