> ## Documentation Index
> Fetch the complete documentation index at: https://gcore.com/docs/llms.txt
> Use this file to discover all available pages before exploring further.

# Manage a GPU Kubernetes cluster

Day-to-day operations covered here include connecting with kubectl, managing node pools and autoscaling, configuring storage and logging, setting up OIDC authentication, upgrading the cluster version, and deleting the cluster.

## Connect to the cluster

The kubeconfig file enables kubectl connections to the cluster. Download it from the cluster overview page:

1. Navigate to the cluster overview page.
2. Click **Kubernetes config**.
3. Save the downloaded `k8sConfig.yml` file.

<Frame>
  <img src="https://mintcdn.com/gcore/1suj17WkWME2a5by/images/docs/edge-ai/managed-kubernetes/manage-a-gpu-kubernetes-cluster/cluster-kubeconfig.png?fit=max&auto=format&n=1suj17WkWME2a5by&q=85&s=85911279235de195b30c66368f4907e7" alt="Cluster overview page with the Kubernetes config button highlighted" width="1920" height="893" data-path="images/docs/edge-ai/managed-kubernetes/manage-a-gpu-kubernetes-cluster/cluster-kubeconfig.png" />
</Frame>

Configure kubectl to use the downloaded configuration:

```bash theme={null}
export KUBECONFIG=/path/to/k8sConfig.yml
```

Verify the connection:

```bash theme={null}
kubectl cluster-info
```

Expected output:

```
Kubernetes control plane is running at https://<cluster-endpoint>:443
CoreDNS is running at https://<cluster-endpoint>:443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
```

Check node status:

```bash theme={null}
kubectl get nodes
```

Expected output:

```
NAME                STATUS   ROLES    AGE   VERSION
ndp1-c3-168-10-78   Ready    <none>   12d   v1.35.3
```

Nodes show Ready status when fully provisioned. kubectl must be within one minor version of the cluster — for example, kubectl v1.34 or v1.36 works with a v1.35 cluster. A version mismatch beyond that range produces a warning but commands still execute.

<Info>
  Kubeconfig certificates are renewed automatically every two years. A notification appears in the portal and is sent by email two weeks before expiry. Download a fresh kubeconfig after renewal.
</Info>

## Cluster overview

The cluster overview page shows the cluster name, status, Kubernetes version, networking configuration (network, subnetwork, CNI provider), and provides access to management tabs.

<Frame>
  <img src="https://mintcdn.com/gcore/1suj17WkWME2a5by/images/docs/edge-ai/managed-kubernetes/manage-a-gpu-kubernetes-cluster/cluster-overview.png?fit=max&auto=format&n=1suj17WkWME2a5by&q=85&s=1cd9cbc332b13986d92aa27d9e1a212e" alt="Cluster overview page showing status, networking info, and management tabs" width="1452" height="606" data-path="images/docs/edge-ai/managed-kubernetes/manage-a-gpu-kubernetes-cluster/cluster-overview.png" />
</Frame>

| Tab                      | Description                                                                                                    |
| ------------------------ | -------------------------------------------------------------------------------------------------------------- |
| Pools                    | View and manage node pools, add new pools                                                                      |
| Logging                  | Configure log collection to OpenSearch Dashboards (paid feature)                                               |
| Advanced DDoS Protection | Enable Advanced DDoS Protection for cluster nodes (requires Public IPv4 address; not available in all regions) |
| Advanced settings        | OIDC authentication and Cluster Autoscaler configuration                                                       |
| User Actions             | Audit log of cluster operations                                                                                |

## Manage node pools

Node pools group worker nodes with identical configuration. Each pool can have a different instance type, enabling mixed workloads within a single cluster.

### View pool details

The expanded pool card shows the pool name, autoscaling status, instance flavor, volume configuration, node list, and current status.

1. Navigate to the **Pools** tab.
2. Click a pool to expand its details.

<Frame>
  <img src="https://mintcdn.com/gcore/1suj17WkWME2a5by/images/docs/edge-ai/managed-kubernetes/manage-a-gpu-kubernetes-cluster/pool-details-expanded.png?fit=max&auto=format&n=1suj17WkWME2a5by&q=85&s=3a03b35cdfd16d2ff7c11af33699b1c3" alt="Pool details showing node configuration, status, and node list" width="1205" height="612" data-path="images/docs/edge-ai/managed-kubernetes/manage-a-gpu-kubernetes-cluster/pool-details-expanded.png" />
</Frame>

### Scale node pools

Node pools scale automatically within the configured min/max limits. Autoscaling is active when **Maximum nodes** is greater than **Minimum nodes**. When they are equal, the pool maintains a fixed size.

<Info>
  A single node pool supports up to 100 nodes. To increase this limit, contact support. The number of pools per cluster is not limited. A single region supports up to 100 clusters.
</Info>

To adjust the limits:

1. Navigate to the **Pools** tab.
2. Expand the target pool.
3. Click the **...** menu next to the pool header and select **Edit Pool Settings**.

<Frame>
  <img src="https://mintcdn.com/gcore/1suj17WkWME2a5by/images/docs/edge-ai/managed-kubernetes/manage-a-gpu-kubernetes-cluster/pool-actions-menu.png?fit=max&auto=format&n=1suj17WkWME2a5by&q=85&s=ac83d2c2f3bc6e31828695972fffd26c" alt="Pool actions menu showing Edit Pool Settings and Delete options" width="1387" height="569" data-path="images/docs/edge-ai/managed-kubernetes/manage-a-gpu-kubernetes-cluster/pool-actions-menu.png" />
</Frame>

4. Adjust **Minimum nodes** or **Maximum nodes**.
5. Click **Save**.

<Frame>
  <img src="https://mintcdn.com/gcore/1suj17WkWME2a5by/images/docs/edge-ai/managed-kubernetes/manage-a-gpu-kubernetes-cluster/edit-pool-settings.png?fit=max&auto=format&n=1suj17WkWME2a5by&q=85&s=689238578891fd6b4483a2587a64e98f" alt="Edit pool settings dialog showing minimum and maximum nodes, labels, taints, and autohealing options" width="926" height="813" data-path="images/docs/edge-ai/managed-kubernetes/manage-a-gpu-kubernetes-cluster/edit-pool-settings.png" />
</Frame>

<Warning>
  When reducing the minimum node count, the cluster removes nodes at random. There is no control over which specific nodes are terminated.
</Warning>

<Warning>
  Scaling a pool to zero nodes is not supported. The minimum node count must be at least one. A cluster cannot exist without nodes.
</Warning>

### Replace a node

The replace action deletes a node and provisions a new one with the same configuration. Use it to recover a node that is stuck, degraded, or behaving unexpectedly without modifying the pool settings.

1. Navigate to the **Pools** tab.
2. Expand the target pool.
3. Click the **...** menu next to the node and select **Replace**.
4. Confirm the replacement.

The node is terminated and a new node is provisioned in its place. Running pods are evicted before termination and rescheduled to other available nodes.

### Add a GPU node pool

To add GPU nodes to an existing cluster:

1. Navigate to the **Pools** tab.
2. Click **Add pool**.
3. Enter a **Pool name**.
4. Set **Minimum nodes** and **Maximum nodes**.
5. Select **GPU Bare metal instances** as the node type.
6. Choose a GPU flavor from the dropdown. Available flavors depend on the region.
7. (Optional) Add **labels** for node selection in pod specs.
8. (Optional) Add **taints** to dedicate nodes to GPU workloads.
9. Click **Add pool**.

<Frame>
  <img src="https://mintcdn.com/gcore/1suj17WkWME2a5by/images/docs/edge-ai/managed-kubernetes/manage-a-gpu-kubernetes-cluster/add-gpu-pool.png?fit=max&auto=format&n=1suj17WkWME2a5by&q=85&s=bb6a862903048fcdbff199768516d4d2" alt="Add GPU pool dialog showing GPU Bare metal instances selection and available flavor options" width="1920" height="893" data-path="images/docs/edge-ai/managed-kubernetes/manage-a-gpu-kubernetes-cluster/add-gpu-pool.png" />
</Frame>

<Info>
  GPU node pools may take several minutes to provision. Monitor the pool status in the **Pools** tab.
</Info>

<Warning>
  GPU bare metal nodes are stateless. Kubernetes may delete and recreate them during autoscaling, node failures, or cluster upgrades. Do not store critical data on node-local storage. Use an external storage service — Gcore Object Storage, File Shares, NFS, or a managed database — for any data that must persist across node replacements.
</Warning>

### Schedule workloads on GPU nodes

Use node selectors or affinity rules to schedule pods on GPU nodes. The `nodeSelector` field targets nodes by their pool label:

```yaml theme={null}
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
spec:
  containers:
  - name: training
    image: nvidia/cuda:12.0-base
    resources:
      limits:
        nvidia.com/gpu: 1
  nodeSelector:
    pool: gpu-pool
```

GPU bare metal nodes require the NVIDIA device plugin to expose GPU resources to the Kubernetes scheduler. The plugin runs as a DaemonSet on GPU nodes; verify it is running before scheduling GPU workloads:

```bash theme={null}
kubectl get daemonset -n kube-system nvidia-device-plugin-daemonset
```

### Use persistent storage

Gcore clusters include CSI-based storage classes backed by Cinder volumes. List available storage classes:

```bash theme={null}
kubectl get storageclass
```

Expected output:

```
NAME                            PROVISIONER                RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
csi-sc-cinderplugin (default)   cinder.csi.gcorelabs.com   Delete          Immediate           true                   12d
ssd-hiiops                      cinder.csi.gcorelabs.com   Delete          Immediate           true                   12d
ssd-lowlatency                  cinder.csi.gcorelabs.com   Delete          Immediate           true                   12d
standard                        cinder.csi.gcorelabs.com   Delete          Immediate           true                   12d
```

Create a PersistentVolumeClaim to provision storage for a workload:

```yaml theme={null}
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-storage
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
  storageClassName: ssd-hiiops
```

Apply it and verify the volume binds:

```bash theme={null}
kubectl apply -f pvc.yaml
kubectl get pvc model-storage
```

Expected output:

```
NAME            STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
model-storage   Bound    pvc-56a60388-c19a-47fb-a87a-430246609cb4   100Gi      RWO            ssd-hiiops     5s
```

<Warning>
  The node flavor — including CPU, RAM, and storage — cannot be changed after pool creation. To use a different flavor, create a new pool with the required configuration, migrate workloads, then delete the old pool.
</Warning>

### Delete a node pool

Removing a pool terminates all nodes in that pool. Running pods are evicted and rescheduled to other available nodes if resources permit.

<Warning>
  Pool deletion is irreversible. All nodes in the pool are terminated. Migrate workloads or confirm they are no longer needed before proceeding. Deleting the last pool in a cluster deletes the entire cluster.
</Warning>

1. Navigate to the **Pools** tab.
2. Click the **...** menu next to the pool and select **Delete**.
3. Confirm the deletion.

## Configure cluster autoscaling

The Cluster Autoscaler adjusts cluster size based on workload demand. When pods are pending due to insufficient resources, the autoscaler adds nodes. When nodes are underutilized, it removes them. Autoscaling operates within the min/max limits set per pool.

<Info>
  The Cluster Autoscaler scales based on CPU and memory utilization only. GPU utilization is not used as a scaling signal.
</Info>

To configure autoscaling behavior:

1. Navigate to the **Advanced settings** tab.
2. Expand **Cluster Autoscaler**.
3. Configure the parameters and click **Save changes**.

<Frame>
  <img src="https://mintcdn.com/gcore/1suj17WkWME2a5by/images/docs/edge-ai/managed-kubernetes/manage-a-gpu-kubernetes-cluster/autoscaler-settings.png?fit=max&auto=format&n=1suj17WkWME2a5by&q=85&s=bc90330dfead69d9c2376ab99513933a" alt="Cluster Autoscaler settings showing scan interval, expander, and scale-down parameters" width="1001" height="658" data-path="images/docs/edge-ai/managed-kubernetes/manage-a-gpu-kubernetes-cluster/autoscaler-settings.png" />
</Frame>

| Parameter                        | Description                                                 | Default     |
| -------------------------------- | ----------------------------------------------------------- | ----------- |
| Cluster scan interval            | How often the autoscaler checks for pending pods            | 10 seconds  |
| Cluster expander                 | Strategy for selecting which node group to scale            | random      |
| Max node provision time          | Maximum time to wait for a node to become ready             | 15 minutes  |
| New pod scale up delay           | Wait time before triggering scale-up for new pods           | 0 seconds   |
| Post-addition scale-down delay   | Wait time after adding a node before considering scale-down | 5 minutes   |
| Post-deletion scale-down delay   | Wait time after deleting a node before the next deletion    | 0 seconds   |
| Unneeded node delay              | How long a node must be unneeded before removal             | 5 minutes   |
| Scale-down utilization threshold | Node utilization below which it is considered for removal   | 99%         |
| Max bulk deletion of empty nodes | Maximum empty nodes deleted at once                         | 10          |
| Max graceful termination time    | Time allowed for pods to terminate gracefully               | 600 seconds |

If the autoscaler cannot provision enough capacity, increase the maximum node count for the affected pool.

## Configure logging

Cluster logging collects container and system logs, storing them in OpenSearch Dashboards for analysis and debugging. Logging is a paid feature.

1. Navigate to the **Logging** tab.
2. Toggle **Enable Logging**.
3. Select an existing log topic or create a new one.
4. Click **Save**.

## Configure OIDC authentication

OIDC authentication enables Single Sign-On (SSO) for Kubernetes API access through an external identity provider such as Okta, Azure AD, or Keycloak.

1. Navigate to the **Advanced settings** tab.
2. Expand **OIDC authentication**.
3. Configure the identity provider settings and click **Save changes**.

<Frame>
  <img src="https://mintcdn.com/gcore/1suj17WkWME2a5by/images/docs/edge-ai/managed-kubernetes/manage-a-gpu-kubernetes-cluster/oidc-settings.png?fit=max&auto=format&n=1suj17WkWME2a5by&q=85&s=64fe0042c6a9449d90be3585e508bfc4" alt="OIDC authentication settings showing issuer URL, client ID, and claim configuration" width="1400" height="900" data-path="images/docs/edge-ai/managed-kubernetes/manage-a-gpu-kubernetes-cluster/oidc-settings.png" />
</Frame>

| Field               | Description                                                 |
| ------------------- | ----------------------------------------------------------- |
| Issuer URL          | The OIDC issuer endpoint (e.g., `https://example.com`)      |
| Client ID           | Application client ID from the identity provider            |
| Groups claim        | JWT claim containing group membership (optional)            |
| Groups prefix       | Prefix added to group names (default: `oidc`)               |
| Set required claims | Require specific claims in tokens                           |
| Signing algorithms  | JWT signing algorithms to accept (default: RS256, optional) |
| Username claim      | JWT claim used as the username (default: `sub`)             |
| Username prefix     | Prefix added to usernames (default: `oidc`)                 |

## Upgrade Kubernetes version

Kubernetes upgrades use a rolling update strategy: a new node on the target version is provisioned, workloads migrate to it, and the old node is removed. This repeats until all nodes run the new version, minimizing downtime.

<Warning>
  Upgrades are one minor version at a time. For example, upgrading from v1.33 to v1.35 requires two separate upgrades: v1.33 → v1.34, then v1.34 → v1.35. Downgrade is not supported.
</Warning>

### Before upgrading

Check that the cluster control plane version and node versions match:

```bash theme={null}
kubectl version
kubectl get nodes
```

Verify that all nodes show the same version as the control plane before starting the upgrade. Review the [changelog](https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/) for deprecated APIs that may affect running workloads.

Rolling upgrades require quota for one additional node of the same flavor during the process. For a 10-node pool, quota for 11 nodes is needed temporarily. To avoid requesting extra quota, scale the pool down by one node before starting the upgrade, then scale it back up after the upgrade completes.

### Start the upgrade

When a newer Kubernetes version is available, an upgrade prompt appears in the cluster overview next to the **Kubernetes Version** field:

1. In the cluster overview, click the upgrade prompt next to **Kubernetes Version**.
2. Select the target version (next minor version only).
3. Confirm the upgrade.

<Frame>
  <img src="https://mintcdn.com/gcore/1suj17WkWME2a5by/images/docs/edge-ai/managed-kubernetes/manage-a-gpu-kubernetes-cluster/kubernetes-version-upgrade.png?fit=max&auto=format&n=1suj17WkWME2a5by&q=85&s=ccaf56dafbdf92ef249d2e06b651a500" alt="Kubernetes version upgrade prompt on the cluster overview page" width="1920" height="945" data-path="images/docs/edge-ai/managed-kubernetes/manage-a-gpu-kubernetes-cluster/kubernetes-version-upgrade.png" />
</Frame>

The upgrade cannot be paused or cancelled once started.

### After upgrading

After the control plane upgrades, update the local kubectl client if the version skew exceeds one minor version:

```bash theme={null}
kubectl version
```

#### Update the Cluster Autoscaler

The Cluster Autoscaler image must match the cluster's Kubernetes minor version. Find the latest release for the new version on the [releases page](https://github.com/kubernetes/autoscaler/releases), then update the image:

```bash theme={null}
kubectl -n kube-system set image deployment.apps/cluster-autoscaler \
  cluster-autoscaler=registry.k8s.io/autoscaling/cluster-autoscaler:v<X.XX.X>
```

Replace `<X.XX.X>` with the release that matches the upgraded cluster version. For example, if the cluster is now v1.35, use the latest `1.35.x` Cluster Autoscaler release. A version mismatch causes the autoscaler to stop scaling correctly.

Verify the rollout completed:

```bash theme={null}
kubectl -n kube-system rollout status deployment/cluster-autoscaler
```

#### Update the NVIDIA device plugin (GPU bare metal pools only)

The NVIDIA device plugin DaemonSet must be updated after a cluster upgrade to maintain GPU resource scheduling. Find the latest compatible release on the [releases page](https://github.com/NVIDIA/k8s-device-plugin/releases), then apply it:

```bash theme={null}
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v<X.X.X>/deployments/static/nvidia-device-plugin.yml
```

Replace `<X.X.X>` with the target plugin version. Verify the DaemonSet is running on all GPU nodes:

```bash theme={null}
kubectl get daemonset -n kube-system nvidia-device-plugin-daemonset
```

All pods in the DaemonSet should show `READY` equal to `DESIRED` before scheduling GPU workloads.

## View audit log

The **User Actions** tab provides an audit log of all cluster operations: creation, pool changes, configuration updates, and upgrades. Each entry shows the timestamp, action type, affected resource, event description, and the account that performed the action.

1. Navigate to the **User Actions** tab.
2. Use filters to narrow results by resource name, action type, resource type, or date range.

<Frame>
  <img src="https://mintcdn.com/gcore/1suj17WkWME2a5by/images/docs/edge-ai/managed-kubernetes/manage-a-gpu-kubernetes-cluster/user-actions-tab.png?fit=max&auto=format&n=1suj17WkWME2a5by&q=85&s=62443d8694031e0f31cc0204d3347d67" alt="User Actions audit log showing date, action, resource, event, and changed by columns" width="1426" height="585" data-path="images/docs/edge-ai/managed-kubernetes/manage-a-gpu-kubernetes-cluster/user-actions-tab.png" />
</Frame>

## Delete the cluster

Cluster deletion removes all nodes, pools, and associated resources. External resources such as Object Storage buckets and Persistent Volumes created outside the cluster are not affected.

<Warning>
  Cluster deletion is irreversible. Export the kubeconfig and back up any required data before proceeding.
</Warning>

1. Navigate to the cluster overview.
2. Click **Delete cluster**.
3. Type `Delete` in the confirmation field.
4. Click **Yes, delete**.

<Frame>
  <img src="https://mintcdn.com/gcore/1suj17WkWME2a5by/images/docs/edge-ai/managed-kubernetes/manage-a-gpu-kubernetes-cluster/delete-cluster-dialog.png?fit=max&auto=format&n=1suj17WkWME2a5by&q=85&s=636225f42b2ebffb87c1e81f70dba526" alt="Delete cluster confirmation dialog requiring 'Delete' text input" width="649" height="322" data-path="images/docs/edge-ai/managed-kubernetes/manage-a-gpu-kubernetes-cluster/delete-cluster-dialog.png" />
</Frame>
