System Management

Docker 기반 Slurm Cluster 구성하기

kogun82 2024. 5. 21. 13:50

docker-compose를 이용해서 slurm cluster 구성 (연산 노드 자원 설정)

 

1. docker-compose yml 파일 작성

services:
  slurmjupyter:
        image: rancavil/slurm-jupyter:19.05.5-1
        hostname: slurmjupyter
        user: admin
        volumes:
                - shared-vol:/home/admin
                - /Users/kogun82/Documents/docker/cluster/store:/BiO
        ports:
                - 8888:8888
                - 3030:3030
  slurmmaster:
        image: rancavil/slurm-master:19.05.5-1
        deploy:
          resources:
            limits:
              memory: 196G
        hostname: slurmmaster
        user: admin
        volumes:
                - shared-vol:/home/admin
                - /Users/kogun82/Documents/docker/cluster/store:/BiO
                - /Users/kogun82/Documents/docker/cluster/slurm.conf:/etc/slurm-llnl/slurm.conf
        # 연산 노드의 자원 설정 추가
        environment:
                - /Users/kogun82/Documents/docker/cluster/slurm.conf:/etc/slurm-llnl/slurm.conf
        ports:
                - 6817:6817
                - 6818:6818
                - 6819:6819
  slurmnode1:
        image: rancavil/slurm-node:19.05.5-1
        deploy:
          resources:
            limits:
              memory: 196G
        hostname: slurmnode1
        user: admin
        volumes:
                - shared-vol:/home/admin
                - /Users/kogun82/Documents/docker/cluster/store:/BiO
        environment:
                - SLURM_NODENAME=slurmnode1
        links:
                - slurmmaster
  slurmnode2:
        image: rancavil/slurm-node:19.05.5-1
        deploy:
          resources:
            limits:
              memory: 196G
        hostname: slurmnode2
        user: admin
        volumes:
                - shared-vol:/home/admin
                - /Users/kogun82/Documents/docker/cluster/store:/BiO
        environment:
                - SLURM_NODENAME=slurmnode2
        links:
                - slurmmaster
  slurmnode3:
        image: rancavil/slurm-node:19.05.5-1
        deploy:
          resources:
            limits:
              memory: 196G
        hostname: slurmnode3
        user: admin
        volumes:
                - shared-vol:/home/admin
                - /Users/kogun82/Documents/docker/cluster/store:/BiO
        environment:
                - SLURM_NODENAME=slurmnode3
        links:
                - slurmmaster
volumes:
        shared-vol:

 

2. slurm.conf 파일 작성

# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
SlurmctldHost=slurmmaster
#
MpiDefault=none
ProctrackType=proctrack/linuxproc
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=root
StateSaveLocation=/var/spool
SwitchType=switch/none
TaskPlugin=task/affinity
TaskPluginParam=Sched

# TIMERS
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0

# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core

# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/none
AccountingStoreJobComment=YES
ClusterName=cluster
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=error
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdDebug=error
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log

# COMPUTE NODES
NodeName=slurmnode[1-10] CPUs=12 RealMemory=4096 State=UNKNOWN #cpu, mem 설정
PartitionName=slurmpar Nodes=slurmnode[1-10] Default=YES MaxTime=INFINITE State=UP

 

3. docker-compose 시작

docker-compose up -d

 

4. docker-compose 실행 확인

docker-compose ps

 

5. slrum node 확인

scontrol show node

 

6. sinfo 명령어로 노드 상태 확인 시 idle 상태가 아닌 경우 실행 가능한 상태로 변경

sudo scontrol update nodename=slurmnode1 state=resume

 

7. 테스트 python 코드

#!/usr/bin/env python3
  
import time
import os
import socket
from datetime import datetime as dt
if __name__ == '__main__':
    print('Process started {}'.format(dt.now()))
    print('NODE : {}'.format(socket.gethostname()))
    print('PID  : {}'.format(os.getpid()))
    print('Executing for 15 secs')
    time.sleep(15)
    print('Process finished {}\n'.format(dt.now()))

 

8. 작업 제출 bash shell 코드

#!/bin/bash
#
#SBATCH --job-name=test
#SBATCH --output=result.out
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=10
#SBATCH --mem=1G

sbcast -f test.py /tmp/test.py
srun python3 /tmp/test.py

 

9. 작업 제출

sbatch job.sh

 

10. 작업 상태 확인

smap

 

smap 명령어를 이용해 그래픽하게 작업 상태를 확인할 수 있다.