Running HPL on Kubernetes with Volcano and MPI Operator

SiriusKoan

October 19, 2025

#Kubernetes #HPC #MPI

All files mentioned in this post are available in SiriusKoan/hpc-benchmarks-container.

What’s HPL

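HPL (High Performance Linpack) is a program that measures supercomputer performance by solving a system of linear equations. It is mainly used to rank the speed of the world's supercomputers (TOP500).

A supercomputer is made up of many smaller machines, and programs can use MPI (Message Passing Interface) to communicate across them. HPL supports MPI, which is why it works as a benchmarking tool.

HPL is configured through HPL.dat. An example: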

HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
4            # of problems sizes (N)
29 30 34 35  Ns
4            # of NBs
1 2 3 4      NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
4            Ps
4            Qs
16.0         threshold
3            # of panel fact
0 1 2        PFACTs (0=left, 1=Crout, 2=Right)
2            # of recursive stopping criterium
2 4          NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
3            # of recursive panel fact.
0 1 2        RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
0            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
0            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)
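
One detail of this file affects how large the MPI job must be: HPL needs at least P × Q MPI processes for each process grid, so the sample above (P = 4, Q = 4) expects 16 ranks. The following is a minimal shell sketch for reading that number out of a file. It assumes the standard layout shown above, where line 11 holds the Ps and line 12 the Qs; `hpl_min_procs` is a hypothetical helper name, not something HPL ships.

```shell
#!/bin/sh
# Hypothetical helper (not part of HPL): print the largest P*Q among the
# process grids listed in an HPL.dat file.
# Assumption: standard HPL.dat layout, Ps on line 11 and Qs on line 12.
hpl_min_procs() {
  awk '
    # Collect the leading numeric fields of the Ps line.
    NR == 11 { for (i = 1; i <= NF && $i ~ /^[0-9]+$/; i++) p[i] = $i; np = i - 1 }
    # Pair each Q with the P in the same position and keep the largest product.
    NR == 12 { for (i = 1; i <= np && $i ~ /^[0-9]+$/; i++) if (p[i] * $i > max) max = p[i] * $i }
    END { print max + 0 }
  ' "$1"
}

# Only runs when an HPL.dat sits in the current directory.
if [ -f HPL.dat ]; then
  echo "MPI processes needed: $(hpl_min_procs HPL.dat)"
fi
```

Comparing this number with the total slots given to mpirun is a quick sanity check before submitting a job.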

What’s Volcano

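Volcano is a Kubernetes-native batch scheduling system designed for HPC.

It fills gaps in the native Kubernetes scheduler by supporting scheduling policies and features that are common in HPC, such as gang scheduling (all workers of a job are deployed together, so a job never ends up with only some of its workers running), multi-level queues, preemption, and heterogeneous device support (e.g. GPUs).

Volcano also supports many HPC and AI/ML systems, such as the MPI mentioned above, Kubeflow (an ML development and deployment platform on Kubernetes), and TensorFlow.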

What’s MPI Operator

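MPI Operator is a Kubeflow component. As the name suggests, it is a controller for MPI jobs.

It injects the information MPI needs into each host, such as the SSH keys and the MPI hostfile, which makes it easier for developers to run MPI jobs on Kubernetes.

It also supports the common MPI implementations, including Intel MPI and Open MPI. See the official example, Pure MPI example.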

Containerize HPL

HPL currently has no ready-to-use Docker image, so I followed the approach in ExplorerRay/hpc-container-def and wrote a Dockerfile version.

FROM debian:trixie-slim AS build

# Environment variables for ignoring some harmless errors (from %environment)
ENV PMIX_MCA_gds="^ds12"
ENV PMIX_MCA_psec="^munge"

# Install dependencies and build HPL
RUN apt-get -y update && \
    DEBIAN_FRONTEND=noninteractive apt-get -y install \
        make g++ gcc gfortran wget openmpi-bin libopenmpi-dev libopenblas-dev && \
    apt-get clean autoclean && \
    apt-get autoremove --yes && \
    rm -rf /var/lib/apt/lists/*

RUN cd /opt && \
    wget https://www.netlib.org/benchmark/hpl/hpl-2.3.tar.gz && \
    tar -xzf hpl-2.3.tar.gz && \
    cd hpl-2.3/setup && \
    cp Make.Linux_PII_CBLAS ../Make.linux && \
    cd .. && \
    sed -i 's|^TOPdir       = $(HOME)/hpl|TOPdir       = /opt/hpl-2.3|' Make.linux && \
    sed -i 's|^ARCH         = .*$|ARCH         = linux|' Make.linux && \
    sed -i 's|^MPdir        = .*$|MPdir        = /usr/lib/x86_64-linux-gnu/openmpi|' Make.linux && \
    sed -i 's|^MPlib        = .*$|MPlib        = $(MPdir)/lib/libmpi.so|' Make.linux && \
    sed -i 's|^LAdir        = .*$|LAdir        = /usr/lib/x86_64-linux-gnu/openblas-pthread|' Make.linux && \
    sed -i 's|^LAlib        = .*$|LAlib        = $(LAdir)/libopenblas.a|' Make.linux && \
    sed -i 's|^CC           = .*$|CC           = /usr/bin/mpicc|' Make.linux && \
    sed -i 's|^LINKER       = .*$|LINKER       = /usr/bin/mpicc|' Make.linux && \
    echo "Building HPL Benchmarks..." && \
    make arch=linux && \
    rm /opt/hpl-2.3.tar.gz

FROM debian:trixie-slim

ENV PMIX_MCA_gds="^ds12"
ENV PMIX_MCA_psec="^munge"

WORKDIR /opt/hpl-2.3/bin/linux

RUN apt-get -y update && \
    DEBIAN_FRONTEND=noninteractive apt-get -y install \
        openmpi-bin libopenmpi-dev libopenblas-dev openssh-server && \
    apt-get clean autoclean && \
    apt-get autoremove --yes && \
    rm -rf /var/lib/apt/lists/*
COPY --from=build /opt/hpl-2.3/bin/linux /opt/hpl-2.3/bin/linux

CMD ["sleep", "infinity"]

This image has been uploaded to Docker Hub.

Running HPL with Volcano

To avoid rebuilding the container image every time the HPL parameters change, HPL.dat can be mounted into the container through a ConfigMap.

$ kubectl create cm hpl --from-file=HPL.dat

Next, prepare the YAML file for the Volcano Job:

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: hpl-mpi-job
spec:
  minAvailable: 1
  schedulerName: volcano
  plugins:
    mpi: ["--master=mpimaster","--worker=mpiworker","--port=22"]  ## MPI plugin register
  tasks:
    - replicas: 1
      name: mpimaster
      policies:
        - event: TaskCompleted
          action: CompleteJob
      template:
        spec:
          volumes:
            - name: hpl-config
              configMap:
                name: hpl
          containers:
            - command:
                - /bin/sh
                - -c
                - |
                  mkdir -p /var/run/sshd; /usr/sbin/sshd;
                  mpirun --allow-run-as-root --host "$MPI_HOST" ./xhpl;
              image: siriuskoan/hpl:latest
              name: mpimaster
              workingDir: /opt/hpl-2.3/bin/linux
              volumeMounts:
                - name: hpl-config
                  mountPath: /opt/hpl-2.3/bin/linux/HPL.dat
                  subPath: HPL.dat
          restartPolicy: OnFailure
    - replicas: 4
      name: mpiworker
      template:
        spec:
          volumes:
            - name: hpl-config
              configMap:
                name: hpl
          containers:
            - command:
                - /bin/sh
                - -c
                - |
                  mkdir -p /var/run/sshd; /usr/sbin/sshd -D;
              image: siriuskoan/hpl:latest
              name: mpiworker
              workingDir: /opt/hpl-2.3/bin/linux
              volumeMounts:
                - name: hpl-config
                  mountPath: /opt/hpl-2.3/bin/linux/HPL.dat
                  subPath: HPL.dat
              resources:
                requests:
                  cpu: 1
                limits:
                  cpu: 1
          restartPolicy: OnFailure

Volcano's MPI plugin sets MPI_HOST, so the MPI master knows directly which workers are in the topology.
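
Because the host list arrives as a single environment variable, the master's command can derive other values from it, for example the number of workers. A small sketch, assuming MPI_HOST is a comma-separated list of worker hostnames (which is what passing it to `--host` above implies); `count_hosts` and the sample hostnames are hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: count the entries in a comma-separated host list.
# Assumption: the list looks like "host-a,host-b,..." with no spaces.
count_hosts() {
  printf '%s\n' "$1" | tr ',' '\n' | grep -c .
}

# Sample value for illustration only; inside the master pod the plugin sets MPI_HOST.
sample_hosts="worker-0,worker-1,worker-2,worker-3"
echo "workers: $(count_hosts "$sample_hosts")"
```

Inside the master container the same helper could be called as `count_hosts "$MPI_HOST"`.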

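It is also worth setting resources on the workers so that Kubernetes assigns them a higher QoS class (Burstable instead of BestEffort here, since only CPU is specified).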

Finally, the following commands submit the job and show it running:

$ kubectl apply -f job.yaml
$ kubectl get vcjob

Limitation - Running with Multiple Cores

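Although Volcano takes care of topology discovery and sets MPI_HOST, it does not know how many resources the user has requested from Kubernetes for each worker. In other words, even if the requested CPU count is raised to 4, HPL still runs on only one CPU per worker, because each host in the list counts as a single slot.

I have not found a solution provided by Volcano itself, but MPI_HOST can be rewritten with ordinary Linux commands so that it carries the core count. Replace the mpirun command of the MPI master with the following line: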

mpirun --allow-run-as-root --host "$(echo $MPI_HOST | sed 's/,/:4,/g; s/$/:4/')" ./xhpl;

Here, 4 is the number of CPU cores requested by each worker.
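
The same rewrite can be wrapped in a small function so that the slot count becomes a parameter instead of being hard-coded in the sed expression. This is only a sketch, assuming MPI_HOST is a comma-separated host list and that mpirun is Open MPI, which reads `host:N` as N slots on that host; `with_slots` and the sample hostnames are hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: append ":<slots>" to every host in a comma-separated list.
# Usage: with_slots "<host1,host2,...>" <slots>
with_slots() {
  # First expression handles every host followed by a comma,
  # second expression handles the last host on the line.
  printf '%s\n' "$1" | sed "s/,/:$2,/g; s/\$/:$2/"
}

# Sample values for illustration only.
with_slots "worker-0,worker-1" 4
```

In the master's command this would be used as `mpirun --allow-run-as-root --host "$(with_slots "$MPI_HOST" 4)" ./xhpl`.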

Running HPL with MPI Operator

As with Volcano, first turn HPL.dat into a ConfigMap. Then prepare the following YAML:

apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: pi
spec:
  slotsPerWorker: 1
  runPolicy:
    cleanPodPolicy: Running
    ttlSecondsAfterFinished: 60
  sshAuthMountPath: /root/.ssh
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          volumes:
            - name: hpl-config
              configMap:
                name: hpl
          containers:
          - image: siriuskoan/hpl:latest
            name: mpi-launcher
            workingDir: /opt/hpl-2.3/bin/linux
            volumeMounts:
              - name: hpl-config
                mountPath: /opt/hpl-2.3/bin/linux/HPL.dat
                subPath: HPL.dat
            command:
              - /bin/sh
              - -c
              - |
                echo "Host *" >> /etc/ssh/ssh_config;
                echo "  StrictHostKeyChecking no" >> /etc/ssh/ssh_config;
                mpirun -np 4 -hostfile /etc/mpi/hostfile --allow-run-as-root ./xhpl;
            resources:
              limits:
                cpu: 1
    Worker:
      replicas: 4
      template:
        spec:
          volumes:
            - name: hpl-config
              configMap:
                name: hpl
          containers:
          - image: siriuskoan/hpl:latest
            name: mpi-worker
            workingDir: /opt/hpl-2.3/bin/linux
            volumeMounts:
              - name: hpl-config
                mountPath: /opt/hpl-2.3/bin/linux/HPL.dat
                subPath: HPL.dat
            command:
            - /bin/sh
            - -c
            - |
              ssh-keygen -A; mkdir -p /var/run/sshd; /usr/sbin/sshd -De -o StrictModes=no;
            resources:
              limits:
                cpu: 1

Finally, the following commands submit the job and show it running:

$ kubectl apply -f job.yaml
$ kubectl get mpijob

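As mentioned earlier, using multiple cores with Volcano means rewriting the host list by hand. MPI Operator conveniently handles this for you: set spec.slotsPerWorker to the number of CPUs and put the matching core count in the workers' resources.

Also, if the user is not root, spec.sshAuthMountPath must be changed to the .ssh directory under that user's home directory. Otherwise the SSH connection will not work.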
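
The `-np` value passed to mpirun has to grow along with the slots. Rather than hard-coding it, the total can be summed from the hostfile. This is a sketch, assuming the Open MPI hostfile format with one `<host> slots=<n>` entry per line; the sample file contents and the helper name `total_slots` are made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: sum the "slots=<n>" values in an Open MPI style hostfile.
total_slots() {
  awk '
    {
      for (i = 2; i <= NF; i++)
        if ($i ~ /^slots=[0-9]+$/) { sub(/^slots=/, "", $i); sum += $i }
    }
    END { print sum + 0 }
  ' "$1"
}

# Illustration only: a made-up hostfile for 4 workers with 4 slots each.
hostfile=$(mktemp)
printf '%s slots=4\n' worker-0 worker-1 worker-2 worker-3 > "$hostfile"
echo "mpirun -np $(total_slots "$hostfile")"
rm -f "$hostfile"
```

In the launcher this would be called on /etc/mpi/hostfile, the path already used in the mpirun command above.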

Limitations

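While testing MPI Operator, I ran into the following pitfalls:

- MPI Operator handles SSH by creating a Secret that holds the SSH keys and mounting it at spec.sshAuthMountPath. Kubernetes mounts these files with fairly permissive modes, however, and sshd by default rejects permissions more open than 700, so the connection fails. sshd has to be started with -o StrictModes=no so that it accepts the permissions as they are.
- sshd needs a host key (the server-side key pair) at startup. If the image does not contain one, generate it first with ssh-keygen -A.
- An SSH client normally verifies the host key of the remote machine, so this check has to be turned off manually in the ssh config (the client-side ssh, not sshd). Here that is done by appending StrictHostKeyChecking no with echo.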

Conclusion

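After a first pass at running jobs on both systems, I feel that each has room for improvement. The main drawback, in my view, is that both integrate poorly with SSH: users have to start sshd themselves, so neither fully reproduces the convenience of a regular HPC platform.

I also did not find any PMI-related support. The closest thing is an issue opened on MPI Operator in 2018, Investigate PMIx.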
