Manage a GPU Kubernetes cluster

Day-to-day operations covered here include connecting with kubectl, managing node pools and autoscaling, configuring storage and logging, setting up OIDC authentication, upgrading the cluster version, and deleting the cluster.

Connect to the cluster

The kubeconfig file enables kubectl connections to the cluster. Download it from the cluster overview page:

Navigate to the cluster overview page.
Click Kubernetes config.
Save the downloaded k8sConfig.yml file.

Cluster overview page with the Kubernetes config button highlighted

Configure kubectl to use the downloaded configuration:

export KUBECONFIG=/path/to/k8sConfig.yml

Verify the connection:

kubectl cluster-info

Expected output:

Kubernetes control plane is running at https://<cluster-endpoint>:443
CoreDNS is running at https://<cluster-endpoint>:443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy

Check node status:

kubectl get nodes

Expected output:

NAME                STATUS   ROLES    AGE   VERSION
ndp1-c3-168-10-78   Ready    <none>   12d   v1.35.3

Nodes show Ready status when fully provisioned. kubectl must be within one minor version of the cluster — for example, kubectl v1.34 or v1.36 works with a v1.35 cluster. A version mismatch beyond that range produces a warning but commands still execute.

Kubeconfig certificates are renewed automatically every two years. A notification appears in the portal and is sent by email two weeks before expiry. Download a fresh kubeconfig after renewal.

Cluster overview

The cluster overview page shows the cluster name, status, Kubernetes version, networking configuration (network, subnetwork, CNI provider), and provides access to management tabs.

Tab	Description
Pools	View and manage node pools, add new pools
Logging	Configure log collection to OpenSearch Dashboards (paid feature)
Advanced DDoS Protection	Enable Advanced DDoS Protection for cluster nodes (requires Public IPv4 address; not available in all regions)
Advanced settings	OIDC authentication and Cluster Autoscaler configuration
User Actions	Audit log of cluster operations

Manage node pools

Node pools group worker nodes with identical configuration. Each pool can have a different instance type, enabling mixed workloads within a single cluster.

View pool details

The expanded pool card shows the pool name, autoscaling status, instance flavor, volume configuration, node list, and current status.

Navigate to the Pools tab.
Click a pool to expand its details.

Pool details showing node configuration, status, and node list

Scale node pools

Node pools scale automatically within the configured min/max limits. Autoscaling is active when Maximum nodes is greater than Minimum nodes. When they are equal, the pool maintains a fixed size.

A single node pool supports up to 100 nodes. To increase this limit, contact support. The number of pools per cluster is not limited. A single region supports up to 100 clusters.

To adjust the limits:

Navigate to the Pools tab.
Expand the target pool.
Click the … menu next to the pool header and select Edit Pool Settings.

Pool actions menu showing Edit Pool Settings and Delete options

Adjust Minimum nodes or Maximum nodes.
Click Save.

Edit pool settings dialog showing minimum and maximum nodes, labels, taints, and autohealing options

When reducing the minimum node count, the cluster removes nodes at random. There is no control over which specific nodes are terminated.

Scaling a pool to zero nodes is not supported. The minimum node count must be at least one. A cluster cannot exist without nodes.

Replace a node

The replace action deletes a node and provisions a new one with the same configuration. Use it to recover a node that is stuck, degraded, or behaving unexpectedly without modifying the pool settings.

Navigate to the Pools tab.
Expand the target pool.
Click the … menu next to the node and select Replace.
Confirm the replacement.

The node is terminated and a new node is provisioned in its place. Running pods are evicted before termination and rescheduled to other available nodes.

Add a GPU node pool

To add GPU nodes to an existing cluster:

Navigate to the Pools tab.
Click Add pool.
Enter a Pool name.
Set Minimum nodes and Maximum nodes.
Select GPU Bare metal instances as the node type.
Choose a GPU flavor from the dropdown. Available flavors depend on the region.
(Optional) Add labels for node selection in pod specs.
(Optional) Add taints to dedicate nodes to GPU workloads.
Click Add pool.

Add GPU pool dialog showing GPU Bare metal instances selection and available flavor options

GPU node pools may take several minutes to provision. Monitor the pool status in the Pools tab.

GPU bare metal nodes are stateless. Kubernetes may delete and recreate them during autoscaling, node failures, or cluster upgrades. Do not store critical data on node-local storage. Use an external storage service — Gcore Object Storage, File Shares, NFS, or a managed database — for any data that must persist across node replacements.

Schedule workloads on GPU nodes

Use node selectors or affinity rules to schedule pods on GPU nodes. The nodeSelector field targets nodes by their pool label:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
spec:
  containers:
  - name: training
    image: nvidia/cuda:12.0-base
    resources:
      limits:
        nvidia.com/gpu: 1
  nodeSelector:
    pool: gpu-pool

GPU bare metal nodes require the NVIDIA device plugin to expose GPU resources to the Kubernetes scheduler. The plugin runs as a DaemonSet on GPU nodes; verify it is running before scheduling GPU workloads:

kubectl get daemonset -n kube-system nvidia-device-plugin-daemonset

Use persistent storage

Gcore clusters include CSI-based storage classes backed by Cinder volumes. List available storage classes:

kubectl get storageclass

Expected output:

NAME                            PROVISIONER                RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
csi-sc-cinderplugin (default)   cinder.csi.gcorelabs.com   Delete          Immediate           true                   12d
ssd-hiiops                      cinder.csi.gcorelabs.com   Delete          Immediate           true                   12d
ssd-lowlatency                  cinder.csi.gcorelabs.com   Delete          Immediate           true                   12d
standard                        cinder.csi.gcorelabs.com   Delete          Immediate           true                   12d

Create a PersistentVolumeClaim to provision storage for a workload:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-storage
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
  storageClassName: ssd-hiiops

Apply it and verify the volume binds:

kubectl apply -f pvc.yaml
kubectl get pvc model-storage

Expected output:

NAME            STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
model-storage   Bound    pvc-56a60388-c19a-47fb-a87a-430246609cb4   100Gi      RWO            ssd-hiiops     5s

The node flavor — including CPU, RAM, and storage — cannot be changed after pool creation. To use a different flavor, create a new pool with the required configuration, migrate workloads, then delete the old pool.

Delete a node pool

Removing a pool terminates all nodes in that pool. Running pods are evicted and rescheduled to other available nodes if resources permit.

Pool deletion is irreversible. All nodes in the pool are terminated. Migrate workloads or confirm they are no longer needed before proceeding. Deleting the last pool in a cluster deletes the entire cluster.

Navigate to the Pools tab.
Click the … menu next to the pool and select Delete.
Confirm the deletion.

Configure cluster autoscaling

The Cluster Autoscaler adjusts cluster size based on workload demand. When pods are pending due to insufficient resources, the autoscaler adds nodes. When nodes are underutilized, it removes them. Autoscaling operates within the min/max limits set per pool.

The Cluster Autoscaler scales based on CPU and memory utilization only. GPU utilization is not used as a scaling signal.

To configure autoscaling behavior:

Navigate to the Advanced settings tab.
Expand Cluster Autoscaler.
Configure the parameters and click Save changes.

Cluster Autoscaler settings showing scan interval, expander, and scale-down parameters

Parameter	Description	Default
Cluster scan interval	How often the autoscaler checks for pending pods	10 seconds
Cluster expander	Strategy for selecting which node group to scale	random
Max node provision time	Maximum time to wait for a node to become ready	15 minutes
New pod scale up delay	Wait time before triggering scale-up for new pods	0 seconds
Post-addition scale-down delay	Wait time after adding a node before considering scale-down	5 minutes
Post-deletion scale-down delay	Wait time after deleting a node before the next deletion	0 seconds
Unneeded node delay	How long a node must be unneeded before removal	5 minutes
Scale-down utilization threshold	Node utilization below which it is considered for removal	99%
Max bulk deletion of empty nodes	Maximum empty nodes deleted at once	10
Max graceful termination time	Time allowed for pods to terminate gracefully	600 seconds

If the autoscaler cannot provision enough capacity, increase the maximum node count for the affected pool.

Configure logging

Cluster logging collects container and system logs, storing them in OpenSearch Dashboards for analysis and debugging. Logging is a paid feature.

Navigate to the Logging tab.
Toggle Enable Logging.
Select an existing log topic or create a new one.
Click Save.

Configure OIDC authentication

OIDC authentication enables Single Sign-On (SSO) for Kubernetes API access through an external identity provider such as Okta, Azure AD, or Keycloak.

Navigate to the Advanced settings tab.
Expand OIDC authentication.
Configure the identity provider settings and click Save changes.

OIDC authentication settings showing issuer URL, client ID, and claim configuration

Field	Description
Issuer URL	The OIDC issuer endpoint (e.g., `https://example.com`)
Client ID	Application client ID from the identity provider
Groups claim	JWT claim containing group membership (optional)
Groups prefix	Prefix added to group names (default: `oidc`)
Set required claims	Require specific claims in tokens
Signing algorithms	JWT signing algorithms to accept (default: RS256, optional)
Username claim	JWT claim used as the username (default: `sub`)
Username prefix	Prefix added to usernames (default: `oidc`)

Upgrade Kubernetes version

Kubernetes upgrades use a rolling update strategy: a new node on the target version is provisioned, workloads migrate to it, and the old node is removed. This repeats until all nodes run the new version, minimizing downtime.

Upgrades are one minor version at a time. For example, upgrading from v1.33 to v1.35 requires two separate upgrades: v1.33 → v1.34, then v1.34 → v1.35. Downgrade is not supported.

Before upgrading

Check that the cluster control plane version and node versions match:

kubectl version
kubectl get nodes

Verify that all nodes show the same version as the control plane before starting the upgrade. Review the changelog for deprecated APIs that may affect running workloads. Rolling upgrades require quota for one additional node of the same flavor during the process. For a 10-node pool, quota for 11 nodes is needed temporarily. To avoid requesting extra quota, scale the pool down by one node before starting the upgrade, then scale it back up after the upgrade completes.

Start the upgrade

When a newer Kubernetes version is available, an upgrade prompt appears in the cluster overview next to the Kubernetes Version field:

In the cluster overview, click the upgrade prompt next to Kubernetes Version.
Select the target version (next minor version only).
Confirm the upgrade.

Kubernetes version upgrade prompt on the cluster overview page

The upgrade cannot be paused or cancelled once started.

After upgrading

After the control plane upgrades, update the local kubectl client if the version skew exceeds one minor version:

kubectl version

Update the Cluster Autoscaler

The Cluster Autoscaler image must match the cluster’s Kubernetes minor version. Find the latest release for the new version on the releases page, then update the image:

kubectl -n kube-system set image deployment.apps/cluster-autoscaler \
  cluster-autoscaler=registry.k8s.io/autoscaling/cluster-autoscaler:v<X.XX.X>

Replace <X.XX.X> with the release that matches the upgraded cluster version. For example, if the cluster is now v1.35, use the latest 1.35.x Cluster Autoscaler release. A version mismatch causes the autoscaler to stop scaling correctly. Verify the rollout completed:

kubectl -n kube-system rollout status deployment/cluster-autoscaler

Update the NVIDIA device plugin (GPU bare metal pools only)

The NVIDIA device plugin DaemonSet must be updated after a cluster upgrade to maintain GPU resource scheduling. Find the latest compatible release on the releases page, then apply it:

kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v<X.X.X>/deployments/static/nvidia-device-plugin.yml

Replace <X.X.X> with the target plugin version. Verify the DaemonSet is running on all GPU nodes:

kubectl get daemonset -n kube-system nvidia-device-plugin-daemonset

All pods in the DaemonSet should show READY equal to DESIRED before scheduling GPU workloads.

View audit log

The User Actions tab provides an audit log of all cluster operations: creation, pool changes, configuration updates, and upgrades. Each entry shows the timestamp, action type, affected resource, event description, and the account that performed the action.

Navigate to the User Actions tab.
Use filters to narrow results by resource name, action type, resource type, or date range.

User Actions audit log showing date, action, resource, event, and changed by columns

Delete the cluster

Cluster deletion removes all nodes, pools, and associated resources. External resources such as Object Storage buckets and Persistent Volumes created outside the cluster are not affected.

Cluster deletion is irreversible. Export the kubeconfig and back up any required data before proceeding.

Navigate to the cluster overview.
Click Delete cluster.
Type Delete in the confirmation field.
Click Yes, delete.

Delete cluster confirmation dialog requiring 'Delete' text input

​Connect to the cluster

​Cluster overview

​Manage node pools

​View pool details

​Scale node pools

​Replace a node

​Add a GPU node pool

​Schedule workloads on GPU nodes

​Use persistent storage

​Delete a node pool

​Configure cluster autoscaling

​Configure logging

​Configure OIDC authentication

​Upgrade Kubernetes version

​Before upgrading

​Start the upgrade

​After upgrading

​Update the Cluster Autoscaler

​Update the NVIDIA device plugin (GPU bare metal pools only)

​View audit log

​Delete the cluster

Connect to the cluster

Cluster overview

Manage node pools

View pool details

Scale node pools

Replace a node

Add a GPU node pool

Schedule workloads on GPU nodes

Use persistent storage

Delete a node pool

Configure cluster autoscaling

Configure logging

Configure OIDC authentication

Upgrade Kubernetes version

Before upgrading

Start the upgrade

After upgrading

Update the Cluster Autoscaler

Update the NVIDIA device plugin (GPU bare metal pools only)

View audit log

Delete the cluster