Requirements

The state your Kubernetes cluster must reach before you deploy the CosmicAC GPU worker.

Before you deploy the CosmicAC GPU worker (cosmicac-wrk-server-k8s-nvidia), your Kubernetes cluster must meet the requirements on this page. The worker is a control-plane agent that connects to your cluster's Kubernetes API and creates GPU virtual machines (KubeVirt VMIs) and their supporting resources. It doesn't install the operators, drivers, or node configuration it relies on, so you set those up once and keep them running.

Use the readiness checklist that follows to confirm your cluster is ready, not to build one. Where this page shows a version or value, it comes from the CosmicAC reference cluster, so treat it as a reference point, not a hard requirement, unless the page marks it as one.

Readiness checklist (required outcomes)

Your cluster is ready when all of the following are true.

The NVIDIA GPU Operator isn't a readiness gate. The worker consumes GPUs through KubeVirt PCI passthrough, so the hard dependency is the passthrough path in Section 3, not the GPU Operator's health. On the CosmicAC reference cluster, VMIs run normally even when the GPU Operator's ClusterPolicy reports notReady. The GPU Operator is one convenient way to deliver the NVIDIA driver and Fabric Manager.

1. Cluster baseline

| Setting | Requirement | Notes |
| --- | --- | --- |
| Kubernetes version | A version supported by KubeVirt 1.7 and CDI 1.62 | Reference v1.33.x. Newer minor versions are fine if KubeVirt and CDI still support them. |
| Container runtime | Any CRI (reference CRI-O, containerd also fine) | The worker passes GPUs to VMs through PCI passthrough, so the cluster doesn't need the NVIDIA container runtime for this path. |
| Host OS (GPU nodes) | Linux with IOMMU and VFIO support | Reference Ubuntu 24.04 LTS (24.04.3, kernel 6.8.x). See Section 2. |
| CNI | Any standard CNI | Reference Calico. Workloads that use overlay networking layer Multus on top of your CNI (Section 8). Multus doesn't replace your CNI. |
| DNS and egress | Cluster DNS working. VMIs need working outbound egress | Needed so guests can docker login and pull images. node-local-dns is optional (used on the reference cluster for cache parity) and isn't a worker prerequisite. |

2. GPU node requirements

You need at least one GPU node. The CosmicAC reference cluster uses eight NVIDIA H100 80GB SXM GPUs and eight ConnectX-7 InfiniBand NICs per node, but a single GPU is the functional floor for a single-node workload.

2.1 Node labels

Each GPU node must carry the following labels.

| Label | Why it matters |
| --- | --- |
| nvidia.com/gpu.present=true | Node selector for the KubeVirt GPU device plugin daemonset (Section 3). Without it, the node doesn't advertise the passthrough GPU resource. |
| A KubeVirt VM-host marker (reference node-role.kubernetes.io/kubevirt) | Marks the node as a KubeVirt VM host so that VMIs are scheduled there. |

The reference cluster also carries a node-role.kubernetes.io/gpu label, but the worker's VMI spec doesn't select it. It's cosmetic for this worker and not required.

2.2 Whole-device passthrough (VFIO)

VMIs claim whole GPUs (and InfiniBand physical functions, or IB PFs) through PCI passthrough, so each GPU node must bind those devices to vfio-pci at the host level. Each GPU node must be in the following state.

  • IOMMU enabled and the host GPU driver kept off the passed-through devices.
  • Every GPU PCI ID and every IB PF PCI ID bound to vfio-pci.
  • NVSwitch left on its native driver, with Fabric Manager healthy so NVLink works across the passed-through GPUs.
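
One way to spot-check the binding on a node is to read each device's driver symlink in sysfs. The following is a minimal sketch, assuming the standard Linux sysfs layout (/sys/bus/pci/devices/<address>/driver). The function names pci_driver and is_vfio_bound, and the optional sysfs-root argument, are illustrative and not part of CosmicAC.

```shell
#!/usr/bin/env bash
# Print the kernel driver bound to a PCI device, or "none" if it is unbound.
# $1 = PCI address (for example 0000:1b:00.0, a made-up address)
# $2 = sysfs root (defaults to /sys; overridable so the helper can be tested)
pci_driver() {
  local addr="$1" root="${2:-/sys}"
  local link="${root}/bus/pci/devices/${addr}/driver"
  if [ -L "${link}" ]; then
    # The symlink target ends in the driver name, for example .../drivers/vfio-pci
    basename "$(readlink "${link}")"
  else
    echo "none"
  fi
}

# Return 0 only when the device is bound to vfio-pci.
is_vfio_bound() {
  [ "$(pci_driver "$@")" = "vfio-pci" ]
}
```

On a GPU node, `is_vfio_bound <pci-address> || echo "not on vfio-pci"` can be run for each GPU and IB PF address. The NVSwitch devices are expected to report their native driver instead.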

The NVIDIA Network Operator isn't required for the current VFIO and IB mechanism. IB PFs appear as plain passthrough PCI devices, not through the Network Operator.

2.3 Resource advertisement (verify)

After host preparation and the KubeVirt GPU device plugin (Section 3), each GPU node must advertise the GPU as an allocatable resource named nvidia.com/<gpuType>. For multi-node jobs, it must also advertise the IB PF as mellanox.com/<ibName>.

The reference values per node are as follows.

nvidia.com/GH100_H100_SXM5_80GB = 8
mellanox.com/cx7_ib_pf          = 8
nvidia.com/gpu                  = 0   # expected: the worker requests the passthrough name, not nvidia.com/gpu

nvidia.com/gpu can legitimately be 0. The worker requests nvidia.com/<gpu.type> (and mellanox.com/<ib_name>), not the generic nvidia.com/gpu, so the advertised name must match what the worker is configured to request.
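
To read a single advertised count without jq, kubectl's JSONPath output can address the resource directly, provided the dots in the resource name are escaped. The sketch below assumes kubectl is pointed at the cluster. The helper names escape_jsonpath_key and node_allocatable are illustrative.

```shell
#!/usr/bin/env bash
# Escape dots so kubectl's JSONPath treats the resource name as one map key.
escape_jsonpath_key() {
  printf '%s' "$1" | sed 's/\./\\./g'
}

# Print the allocatable count of one extended resource on one node.
# $1 = node name
# $2 = resource name (for example nvidia.com/GH100_H100_SXM5_80GB)
node_allocatable() {
  local key
  key="$(escape_jsonpath_key "$2")"
  kubectl get node "$1" -o jsonpath="{.status.allocatable.${key}}"
}
```

For example, `node_allocatable <gpu-node> nvidia.com/GH100_H100_SXM5_80GB` prints the passthrough GPU count for that node. An empty result means the node doesn't advertise the resource under that name.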

3. GPU and virtualization platform

This section lists the hard requirements for the worker's passthrough path, then the reference implementation that satisfies them.

3.1 Hard requirements

| Requirement | Detail |
| --- | --- |
| KubeVirt installed and healthy | Runs VMs as VirtualMachineInstance. Reference version v1.7.0. |
| KubeVirt HostDevices feature gate enabled | Required for PCI passthrough of GPUs and IB into VMIs. |
| CDI installed and healthy | Imports VM root disks from a registry into DataVolumes. Reference version v1.62.0. |
| KubeVirt GPU device plugin running | Advertises the passed-through GPU resource to KubeVirt. Reference daemonset nvidia-kubevirt-gpu-dp-daemonset, image docker.io/anurlan/kubevirt-gpu-device-plugin:v1.5.0-fm, args --fm-enabled=true --fm-address=127.0.0.1:6666, nodeSelector nvidia.com/gpu.present=true. |
| permittedHostDevices configured | The KubeVirt CR must permit your GPU PCI ID (and IB PCI ID for multi-node). See 3.3. |
| NVIDIA driver and Fabric Manager healthy on the host | Delivered by any means (GPU Operator, custom node image, or manual). Required for the GPUs to be usable and for NVLink. |

3.2 Reference implementation (not a hard gate)

On the CosmicAC reference cluster, the following components deliver these outcomes.

| Component | Namespace | Version | Notes |
| --- | --- | --- | --- |
| NVIDIA GPU Operator | gpu-operator | v25.3.1 | Delivers the NVIDIA driver, container toolkit, DCGM, and Fabric Manager components. Its ClusterPolicy health isn't a worker readiness gate (VMIs run even when it reports notReady). |
| Node Feature Discovery | gpu-operator | 0.17.3 | Ships with the GPU Operator chart. |

The GPU Operator installs RuntimeClasses (for example, nvidia and nvidia-cdi) on the reference cluster, but they aren't a prerequisite. The worker's VMIs don't set runtimeClassName.

3.3 KubeVirt permittedHostDevices (required)

KubeVirt allows passthrough only for PCI devices you explicitly permit. The KubeVirt CR (kubevirt/kubevirt) must permit your GPU and IB PFs in its permittedHostDevices. The reference value is as follows.

spec:
  configuration:
    permittedHostDevices:
      pciHostDevices:
        - pciVendorSelector: "10DE:2330"          # NVIDIA H100 SXM5 80GB
          resourceName: nvidia.com/GH100_H100_SXM5_80GB
          externalResourceProvider: true          # advertised by the KubeVirt GPU device plugin
        - pciVendorSelector: "15B3:1021"          # Mellanox ConnectX-7 IB PF
          resourceName: mellanox.com/cx7_ib_pf
          externalResourceProvider: false

On different hardware, replace the pciVendorSelector (PCI vendor:device) and resourceName to match your GPUs and NICs. The resourceName must match what the worker requests (Section 2.3).
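
To find the vendor:device pair for your own hardware, the bracketed IDs in `lspci -nn` output can be extracted on the node. The helper below is a small sketch. The name pci_selector is illustrative, and the upper-casing only mirrors the style of the reference value.

```shell
#!/usr/bin/env bash
# Read `lspci -nn` lines on stdin and print each device's vendor:device ID,
# upper-cased to match the style of the reference pciVendorSelector values.
# The class code (for example [0302]) has no colon, so it is not matched.
pci_selector() {
  grep -oE '\[[0-9a-fA-F]{4}:[0-9a-fA-F]{4}\]' | tr -d '[]' | tr 'a-f' 'A-F'
}
```

On a GPU node, `lspci -nn -d 10de: | pci_selector | sort -u` lists the distinct NVIDIA device IDs, and `lspci -nn -d 15b3: | pci_selector | sort -u` does the same for Mellanox devices. Exclude the NVSwitch IDs from permittedHostDevices, because NVSwitch stays on its native driver.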

4. Storage

The worker creates DataVolumes (VM root disks) and optional PersistentVolumeClaims (external data disks) with storageClassName: local-path, so your cluster must meet the following requirements.

  • A working StorageClass named local-path must exist. It doesn't have to be the cluster default. The WaitForFirstConsumer volume binding mode is preferred, so that the scheduler places the VMI on a GPU node before the disk is provisioned locally on that node.
  • CDI must have a scratch StorageClass set for image imports. Reference value scratchSpaceStorageClass: local-path with the HonorWaitForFirstConsumer feature gate enabled.

The reference cluster uses the local-path provisioner (rancher.io/local-path) with WaitForFirstConsumer. Provide enough local NVMe capacity on each GPU node. VM root disks are commonly 150Gi or larger, so size the storage to fit your model cache.
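
The storage requirements can be checked with two read-only queries. This is a sketch under two assumptions: kubectl is pointed at the cluster, and the CDI CR is named cdi, as in the verification commands in Section 9. The function names are illustrative.

```shell
#!/usr/bin/env bash
# Print the volume binding mode of a StorageClass ($1 = StorageClass name).
sc_binding_mode() {
  kubectl get storageclass "$1" -o jsonpath='{.volumeBindingMode}'
}

# Print the scratch StorageClass configured on the CDI CR (assumed name: cdi).
cdi_scratch_class() {
  kubectl get cdi cdi -o jsonpath='{.spec.config.scratchSpaceStorageClass}'
}

# Fail only when local-path is missing. A binding mode other than
# WaitForFirstConsumer is reported but not rejected, because it is
# preferred rather than required.
check_local_path() {
  local mode
  if ! mode="$(sc_binding_mode local-path)"; then
    echo "local-path StorageClass not found"
    return 1
  fi
  if [ "${mode}" = "WaitForFirstConsumer" ]; then
    echo "local-path: ${mode}"
  else
    echo "local-path: ${mode:-unset} (WaitForFirstConsumer is preferred)"
  fi
}
```

Running `check_local_path && cdi_scratch_class` prints the binding mode followed by the CDI scratch class, which is local-path on the reference cluster.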

5. Registry credentials (ghcr-pull-cred)

The worker pulls private images from GHCR at two stages. Each stage reads different keys from the same secret in your workload namespace.

| Consumer | When | Keys it reads |
| --- | --- | --- |
| CDI importer pod | Before the VM boots, imports the root-disk image into a DataVolume | accessKeyId, secretKey |
| In-guest docker-bootstrap.sh | After the VM boots, logs in to the registry and pulls the inference container | DOCKER_REGISTRY, DOCKER_USERNAME, DOCKER_TOKEN |

KubeVirt references the secret as the vmImagePullSecret (default name ghcr-pull-cred). The credentials are a GitHub personal access token (PAT) with at least the read:packages scope, SSO-authorized for the tetherto org. Create the secret with all five keys.

kubectl -n <workload-namespace> create secret generic ghcr-pull-cred \
  --from-literal=accessKeyId='<github-username>' \
  --from-literal=secretKey='<github-PAT>' \
  --from-literal=DOCKER_REGISTRY='ghcr.io' \
  --from-literal=DOCKER_USERNAME='<github-username>' \
  --from-literal=DOCKER_TOKEN='<github-PAT>'

The resulting secret holds these five keys.

| Key | Consumer |
| --- | --- |
| accessKeyId | CDI importer (registry username) |
| secretKey | CDI importer (registry token or PAT) |
| DOCKER_REGISTRY | In-guest docker-bootstrap.sh (for example, ghcr.io) |
| DOCKER_USERNAME | In-guest docker-bootstrap.sh |
| DOCKER_TOKEN | In-guest docker-bootstrap.sh |
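
Because the two consumers read different keys, a missing key only surfaces at the stage that needs it. A small check such as the following can confirm that all five keys are present before you deploy. It is a sketch, and the function name missing_secret_keys is illustrative.

```shell
#!/usr/bin/env bash
# Read secret key names on stdin (one per line) and report any of the five
# required keys that are absent. Returns 1 if at least one key is missing.
missing_secret_keys() {
  local present key missing=""
  present="$(cat)"
  for key in accessKeyId secretKey DOCKER_REGISTRY DOCKER_USERNAME DOCKER_TOKEN; do
    # -x matches whole lines only, so a longer key name cannot mask a missing key
    printf '%s\n' "${present}" | grep -qx -- "${key}" || missing="${missing} ${key}"
  done
  if [ -n "${missing}" ]; then
    echo "missing:${missing}"
    return 1
  fi
  echo "all five keys present"
}
```

The key names can be piped in from the cluster, for example `kubectl -n <workload-namespace> get secret ghcr-pull-cred -o json | jq -r '.data | keys[]' | missing_secret_keys`.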

5.1 Bootstrap ConfigMap (bootstrap-cm)

Workloads that boot the in-guest inference container mount a bootstrap-cm ConfigMap as a CD-ROM to install the in-guest docker bootstrap unit. It must exist in the workload namespace before the worker starts. The ConfigMap holds two keys.

  • docker-bootstrap.sh installs Docker, optionally prepares an external PVC disk, logs in to the registry, and pulls and runs the container.
  • docker-bootstrap.service is the systemd unit that runs the script on first boot.

The worker repo ships the canonical manifest at setup/cfg-create.yml. Its namespace must match your workload namespace.

kubectl apply -f setup/cfg-create.yml

The following is the complete ConfigMap.

apiVersion: v1
kind: ConfigMap
metadata:
  name: bootstrap-cm
  namespace: <workload-namespace>   # for example, cosmic-ac

data:
  docker-bootstrap.sh: |
    #!/bin/bash
    set -eu

    # Helper: wait for block device
    wait_for_dev() { for i in {1..60}; do [ -b "$1" ] && return 0; sleep 1; done; echo "ERROR: $1 not found"; lsblk -b; exit 1; }

    apt-get update && apt-get install -y curl ca-certificates docker.io parted e2fsprogs util-linux
    systemctl enable --now docker

    # Setup PVC disk
    ROOT_DISK="$(findmnt -n -o SOURCE / | sed 's/[0-9]*$//' | xargs basename)"
    [ -f /etc/default/docker-bootstrap ] && source /etc/default/docker-bootstrap
    MNT="${MNT:-}"

    if [ -n "${MNT}" ]; then
      PVC_DISK_NAME="$(lsblk -b -dn -o NAME,TYPE,SIZE | awk -v r="${ROOT_DISK}" '$2=="disk" && $1!=r && $1!~/^(sr|loop)/ && $3>=107374182400 {print $1,$3}' | sort -k2,2nr | head -n1 | awk '{print $1}')"
      [ -z "${PVC_DISK_NAME}" ] && { echo "ERROR: Could not find PVC disk (>=100GiB)"; lsblk -b; exit 1; }

      DISK="/dev/${PVC_DISK_NAME}"
      PART="${DISK}1"
      wait_for_dev "${DISK}"

      [ ! -b "${PART}" ] && { parted -s "${DISK}" mklabel gpt mkpart primary ext4 1MiB 100%; udevadm settle || true; }
      wait_for_dev "${PART}"
      blkid "${PART}" | grep -q 'TYPE=' || mkfs.ext4 -F "${PART}"

      mkdir -p "${MNT}"
      UUID="$(blkid -s UUID -o value "${PART}")"
      grep -q "${UUID}" /etc/fstab || echo "UUID=${UUID} ${MNT} ext4 defaults,nofail 0 2" >> /etc/fstab
      mount -a
    else
      echo "MNT not set, skipping external disk setup"
    fi

    # Docker login and pull
    SECRET_MNT="/mnt/registrycred"
    mkdir -p "${SECRET_MNT}"
    wait_for_dev /dev/sr0
    mount -t iso9660 -o ro /dev/sr0 "${SECRET_MNT}"

    REGISTRY="$(cat "${SECRET_MNT}/DOCKER_REGISTRY")"
    USERNAME="$(cat "${SECRET_MNT}/DOCKER_USERNAME")"
    DOCKER_CONFIG="/run/docker-config"
    mkdir -p "${DOCKER_CONFIG}" && chmod 700 "${DOCKER_CONFIG}"

    cat "${SECRET_MNT}/DOCKER_TOKEN" | docker login "${REGISTRY}" -u "${USERNAME}" --password-stdin
    IMAGE_FULL="${IMG:-${USERNAME}/cosmicac-wrk-agent-inference:latest}"
    docker pull "${IMAGE_FULL}"

    docker logout "${REGISTRY}" || true
    rm -rf "${DOCKER_CONFIG}" && umount "${SECRET_MNT}" || true

    # Run container if not already running
    CONTAINER_NAME="${NAME:-cosmicac-container}"
    if docker ps --format '{{.Names}}' | grep -q "^${CONTAINER_NAME}$"; then
      echo "Container ${CONTAINER_NAME} already running"
    elif docker ps -a --format '{{.Names}}' | grep -q "^${CONTAINER_NAME}$"; then
      docker start "${CONTAINER_NAME}"
    elif [ -n "${IMG:-}" ]; then
      BASE_CMD="docker run -d --name ${CONTAINER_NAME} --init --restart unless-stopped --gpus all --ipc=host"
      [ -n "${MNT}" ] && BASE_CMD="${BASE_CMD} -v ${MNT}:${MNT}"

      if [ "${JOB_TYPE:-}" = "GPU_CONTAINER" ]; then
        BASE_CMD="${BASE_CMD} --cap-add=NET_ADMIN --device /dev/net/tun"
      fi

      if [ -n "${ENVS:-}" ]; then
        IFS=';' read -ra ENV_ARRAY <<< "${ENVS}"
        for env_pair in "${ENV_ARRAY[@]}"; do
          [[ "$env_pair" != *=* ]] && continue
          ENV_KEY="${env_pair%%=*}"
          ENV_VAL="${env_pair#*=}"
          if [[ "$ENV_VAL" =~ ^[\{\[] ]]; then
            # Convert JS-style object to JSON (single sed pass)
            JSON_VAL=$(sed -E '
              s/([{,])([a-zA-Z0-9_]+):/\1"\2":/g;
              s/:([0-9]+)/: \1/g
            ' <<< "$ENV_VAL")
            BASE_CMD+=" -e ${ENV_KEY}='${JSON_VAL}'"
          else
            BASE_CMD+=" -e ${ENV_KEY}=${ENV_VAL}"
          fi
        done
      fi
      eval "${BASE_CMD} ${IMG}"
    fi

    echo "Bootstrap completed successfully."
  docker-bootstrap.service: |
    [Unit]
    Description=Deferred Docker Bootstrap (login + pull + run)
    Wants=network-online.target
    After=network-online.target multi-user.target

    [Service]
    Type=oneshot
    ExecStart=/usr/local/bin/docker-bootstrap.sh
    RemainAfterExit=yes

    [Install]
    WantedBy=multi-user.target

Verify it exists.

kubectl get configmap bootstrap-cm -n <workload-namespace>   # must exist

6. Required container images

Your cluster (and the VM guests) must be able to pull the following.

| Image | Used by |
| --- | --- |
| ghcr.io/tetherto/kubevirt-image:latest | VMI root disk (default vmImage), pulled by CDI |
| ghcr.io/tetherto/cosmicac-wrk-agent-inference:latest | In-guest inference container |
| docker.io/anurlan/kubevirt-gpu-device-plugin:v1.5.0-fm | KubeVirt GPU device plugin daemonset |
| bitnami/kubectl:latest | Subcluster manager pod (overlay networking) |
| debian:bookworm-slim, alpine:latest | OVS/VXLAN + WireGuard containers (overlay networking) |
| ghcr.io/k8snetworkplumbingwg/multus-cni:v4.2.4-thick | Multus daemonset (overlay networking) |
| ghcr.io/k8snetworkplumbingwg/whereabouts:latest | Whereabouts IPAM (overlay networking) |

The ghcr.io/tetherto/* images are private. Access them with the ghcr-pull-cred PAT from Section 5.

7. Worker access and RBAC

The worker authenticates to your API server with a kubeconfig. The identity behind that kubeconfig needs two sets of permissions.

Namespace-scoped permissions cover the resources the worker creates and manages in the workload namespace, and, for overlay networking, in kube-system.

| API group | Resources | Verbs |
| --- | --- | --- |
| "" (core) | configmaps, secrets, persistentvolumeclaims, pods, services | create, get, list, delete |
| apps | deployments, daemonsets | create, get, delete |
| cdi.kubevirt.io | datavolumes | create, get, delete |
| kubevirt.io | virtualmachineinstances | create, get, delete |
| k8s.cni.cncf.io | network-attachment-definitions | create, get, delete (overlay networking only) |

Cluster-scoped permissions let the worker report cluster capacity through getClusterResources (used by getInfo and the getClusterResources RPC), which lists all nodes and all pods cluster-wide to compute GPU, CPU, and memory totals and usage.

| API group | Resources | Verbs | Scope |
| --- | --- | --- | --- |
| "" (core) | nodes | get, list | cluster-wide |
| "" (core) | pods | list | all namespaces |

The following notes apply.

  • pods needs namespaced list and get (the worker polls VMI launcher pods for status) and cluster-wide list (for getClusterResources).
  • nodes is cluster-scoped, so it requires a ClusterRole.
  • For overlay networking, configmaps also needs update and patch. The worker rewrites the WireGuard peers ConfigMap as clients are added or removed.

The namespaced set binds through a Role and RoleBinding in the workload namespace. The cluster-scoped node and pod reads, and the kube-system objects that overlay networking uses, bind through a ClusterRole and ClusterRoleBinding. The grant follows least privilege, and the worker never needs cluster-admin.
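
As an illustration, the namespace-scoped rules in this section translate into a Role roughly like the one the following helper prints. It is a sketch, not the manifest the worker repo ships. The Role name cosmicac-worker and the function name render_worker_role are hypothetical, and the overlay-only update and patch verbs on configmaps are left out.

```shell
#!/usr/bin/env bash
# Print a namespaced Role covering the namespace-scoped permissions.
# $1 = workload namespace
render_worker_role() {
  cat <<EOF
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: cosmicac-worker              # hypothetical name
  namespace: ${1}
rules:
  - apiGroups: [""]
    resources: ["configmaps", "secrets", "persistentvolumeclaims", "pods", "services"]
    verbs: ["create", "get", "list", "delete"]
  - apiGroups: ["apps"]
    resources: ["deployments", "daemonsets"]
    verbs: ["create", "get", "delete"]
  - apiGroups: ["cdi.kubevirt.io"]
    resources: ["datavolumes"]
    verbs: ["create", "get", "delete"]
  - apiGroups: ["kubevirt.io"]
    resources: ["virtualmachineinstances"]
    verbs: ["create", "get", "delete"]
  - apiGroups: ["k8s.cni.cncf.io"]   # overlay networking only
    resources: ["network-attachment-definitions"]
    verbs: ["create", "get", "delete"]
EOF
}
```

`render_worker_role <workload-namespace> | kubectl apply -f -` would create the Role, which still needs a RoleBinding to the worker's identity. After binding, `kubectl auth can-i create virtualmachineinstances.kubevirt.io -n <workload-namespace> --as=<worker-identity>` is one way to confirm an individual grant.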

8. Overlay networking add-ons (optional capability)

These add-ons are required only for workloads that use per-instance overlay networking, which is an isolated OVS/VXLAN overlay reachable through a WireGuard gateway. Only a subset of workloads uses it today, but that share is expected to grow, so treat it as an optional requirement tier rather than a feature of one workload type. Skip this section if you don't need per-instance network isolation.

The required outcomes are as follows.

| Outcome | Detail |
| --- | --- |
| Multus CNI installed | Attaches the secondary overlay interface to the VMI through NetworkAttachmentDefinition. Reference image ghcr.io/k8snetworkplumbingwg/multus-cni:v4.2.4-thick. |
| Whereabouts IPAM installed | Cluster-wide IPAM for the overlay range. Reference ghcr.io/k8snetworkplumbingwg/whereabouts:latest. |
| OVS CNI available on nodes | Provides the ovs CNI type used by the NAD bridge (reference daemonsets ovs-cni-amd64 and ovs-node). |
| NAD CRD present | network-attachment-definitions.k8s.cni.cncf.io (provided by Multus). The worker creates NADs of CNI type ovs with whereabouts IPAM. |
| subcluster-manager-sa + RBAC | ServiceAccount in kube-system that can label nodes and watch pods. ClusterRole rules grant nodes get,list,watch,patch,update and pods get,list,watch, bound through a ClusterRoleBinding. |
| UDP LoadBalancer | The WireGuard gateway is a Service of type: LoadBalancer on UDP 51820 with externalTrafficPolicy: Local. Your LB must support UDP. The reference manifests carry a Gcore-specific annotation loadbalancer.gcore.com/floating-ip-cleanup. On another provider, use the equivalent or drop it. |

These add-ons layer on top of your primary CNI. They don't replace it.

9. Verification checklist

Run these checks from a workstation with kubectl (and virtctl) pointed at the cluster.

9.1 Base platform

# Nodes Ready + control plane healthy; k8s version supported by KubeVirt/CDI
kubectl get nodes -o wide
kubectl version -o json | jq -r '.serverVersion.gitVersion'

# Per GPU node: GPU (and IB) advertised under the name the worker requests
kubectl get nodes -o json | jq -r '
  .items[] | .metadata.name + "  gpu=" +
  (.status.allocatable["nvidia.com/GH100_H100_SXM5_80GB"] // "0") + "  ib=" +
  (.status.allocatable["mellanox.com/cx7_ib_pf"] // "0")'

# KubeVirt + CDI healthy; HostDevices feature gate on
kubectl -n kubevirt get kubevirt kubevirt -o jsonpath='{.status.observedKubeVirtVersion}{"\n"}'
kubectl -n kubevirt get kubevirt kubevirt -o jsonpath='{.spec.configuration.developerConfiguration.featureGates}{"\n"}'  # includes HostDevices
kubectl -n cdi get cdi cdi -o jsonpath='{.status.observedVersion}{"\n"}'

# KubeVirt GPU device plugin advertising the passthrough resource
kubectl -n kube-system get ds nvidia-kubevirt-gpu-dp-daemonset

# permittedHostDevices includes your GPU/IB
kubectl -n kubevirt get kubevirt kubevirt -o jsonpath='{.spec.configuration.permittedHostDevices}{"\n"}' | jq .

# A working local-path StorageClass exists (need not be the default)
kubectl get sc

# Workload-namespace prerequisites
kubectl -n <workload-namespace> get secret ghcr-pull-cred
kubectl -n <workload-namespace> get configmap bootstrap-cm

# Device-admission smoke test: a VMI that CLAIMS a full GPU is admitted + scheduled.
# NOTE: with no boot disk this only proves device admission + scheduling, NOT a
# working guest. For a real boot test, add a DataVolume root disk.
cat <<'EOF' | kubectl apply -f -
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: gpu-smoke
  namespace: default
spec:
  domain:
    devices:
      gpus:
        - deviceName: nvidia.com/GH100_H100_SXM5_80GB
          name: gpu0
    resources:
      requests: { memory: 8Gi }
  volumes: []
EOF
kubectl get vmi gpu-smoke -w     # reaches Scheduled/Running (device admitted), then:
kubectl delete vmi gpu-smoke

The checks pass when nodes are Ready, each GPU node reports the expected gpu= and ib= counts, KubeVirt is healthy with HostDevices on, CDI is healthy, a working local-path StorageClass exists, ghcr-pull-cred (and bootstrap-cm) are present, and the smoke VMI is admitted and scheduled.

9.2 Overlay networking (optional)

# Multus / Whereabouts / OVS daemonsets Ready in kube-system
kubectl -n kube-system get ds | grep -iE 'multus|whereabouts|ovs'

# NAD CRD present
kubectl get crd network-attachment-definitions.k8s.cni.cncf.io

# Subcluster manager SA + RBAC present
kubectl -n kube-system get sa subcluster-manager-sa
kubectl get clusterrole subcluster-manager-role
kubectl get clusterrolebinding subcluster-manager-binding

# LoadBalancer can allocate an external IP for UDP services (provider-specific)
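
One way to exercise the UDP LoadBalancer check is to create a throwaway Service shaped like the WireGuard gateway and watch whether it receives an external address. The helper below is a sketch. The Service name, the selector label, and the function name render_udp_probe_service are hypothetical, and the Service needs no backing pods for the address allocation itself to be observed.

```shell
#!/usr/bin/env bash
# Print a probe Service of type LoadBalancer on UDP 51820.
# $1 = Service name (hypothetical, for example wg-probe)
# $2 = namespace
render_udp_probe_service() {
  cat <<EOF
apiVersion: v1
kind: Service
metadata:
  name: ${1}
  namespace: ${2}
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local
  selector:
    app: ${1}                 # hypothetical label; no pods need to match
  ports:
    - name: wireguard
      protocol: UDP
      port: 51820
      targetPort: 51820
EOF
}
```

For example, `render_udp_probe_service wg-probe default | kubectl apply -f -`, then `kubectl -n default get svc wg-probe -w` until EXTERNAL-IP is populated, then `kubectl -n default delete svc wg-probe`. If the address stays pending, check whether your provider's load balancer supports UDP listeners.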
