HPC with Slurm on GCP

A Slurm cluster can be created on GCP by following the instructions in the Git repository Slurm on Google Cloud Platform. Its Terraform script is in beta. Its Deployment Manager script is production quality, with an excellent security design.

Accessing with CLI

The newly created cluster has a dedicated login node. In the most secure configuration, no node is assigned a public IP, and the firewall allows only ICMP and TCP port 22. Follow the instructions in Accessing GCP node from CLI to access the Slurm cluster via SSH, SCP, rsync, etc.
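As an illustration of what CLI access to a node without a public IP can look like, the sketch below assembles a gcloud compute ssh command that tunnels through Identity-Aware Proxy. This is an assumption, not necessarily the method described in Accessing GCP node from CLI. The function name build_ssh_command and the instance, zone, and project values are hypothetical placeholders.

```python
import shlex
from typing import List


def build_ssh_command(instance: str, zone: str, project: str) -> List[str]:
    """Build a gcloud command that opens SSH through an IAP tunnel.

    Assumption: the cluster is reached via IAP TCP forwarding. The
    instance, zone, and project names are supplied by the caller.
    """
    for name, value in (("instance", instance), ("zone", zone), ("project", project)):
        if not value:
            raise ValueError(f"{name} must not be empty")
    return [
        "gcloud", "compute", "ssh", instance,
        f"--zone={zone}",
        f"--project={project}",
        "--tunnel-through-iap",
    ]


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    cmd = build_ssh_command("example-login0", "us-central1-a", "example-project")
    print(shlex.join(cmd))
```

The command is returned as an argument list rather than a single string so that it can be passed to subprocess.run without shell quoting issues.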

Enable GCSFuse

GCSFuse presents Cloud Storage objects as files in shared directories. This gives you access to petabytes of storage without pre-allocating any capacity. No explicit download or upload step is needed, and you do not need to hard-code any keys or passwords.
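Because the bucket appears as a directory, a job can read it with ordinary file I/O and no storage client library. The following is a minimal sketch under that assumption. The function name summarize_mount is hypothetical, and the mount point is passed in by the caller.

```python
from pathlib import Path


def summarize_mount(mount_point: str) -> dict:
    """Count the files under a mounted directory and sum their sizes.

    Uses only standard filesystem calls, so it behaves the same on a
    GCSFuse mount as on a local directory.
    """
    root = Path(mount_point)
    if not root.is_dir():
        raise NotADirectoryError(f"{mount_point} is not a directory")
    files = [p for p in root.rglob("*") if p.is_file()]
    return {
        "files": len(files),
        "bytes": sum(p.stat().st_size for p in files),
    }
```

On a compute node this could be called as summarize_mount("/data"), where /data stands for whatever local mount point the cluster configuration uses.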

Edit basic.tfvars if you are using the Terraform script, or slurm-cluster.yaml if you are using Deployment Manager, to create Slurm clusters. The steps for Terraform are:

Add "https://www.googleapis.com/auth/devstorage.full_control" to compute_node_scopes, for example:

  compute_node_scopes          = [
    "https://www.googleapis.com/auth/monitoring.write",
    "https://www.googleapis.com/auth/logging.write",
    "https://www.googleapis.com/auth/devstorage.full_control"
  ]

(Optional) If you are using a non-default service account, make sure it has the "Storage Admin" role. If you are using the default service account (i.e., compute_node_service_account = "default"), nothing needs to be done.

Edit network_storage for worker nodes, for example:

  network_storage = [
    {
      server_ip = "none"
      remote_mount = "dy-test-301718"
      local_mount = "/data"
      fs_type = "gcsfuse"
      mount_options = "file_mode=666,dir_mode=777,allow_other"
    }
  ]
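Once the nodes are up, one way to confirm that the storage scope from the first step took effect is to ask the Compute Engine metadata server which scopes the node's service account holds. The sketch below assumes it runs on a compute node and that the node uses the "default" service account entry of the metadata server. The helper names are hypothetical.

```python
import urllib.request

# Metadata server endpoint that lists the scopes granted to the
# instance's service account, one scope URL per line.
METADATA_URL = (
    "http://metadata.google.internal/computeMetadata/v1/"
    "instance/service-accounts/default/scopes"
)
REQUIRED_SCOPE = "https://www.googleapis.com/auth/devstorage.full_control"


def parse_scopes(text: str) -> set:
    """Turn the newline-separated metadata response into a set of scopes."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def has_required_scope(scopes: set) -> bool:
    """Return True if the storage scope added in compute_node_scopes is present."""
    return REQUIRED_SCOPE in scopes


def fetch_scopes(timeout: float = 2.0) -> set:
    """Query the metadata server; this only works from inside a GCE instance."""
    request = urllib.request.Request(
        METADATA_URL, headers={"Metadata-Flavor": "Google"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_scopes(response.read().decode("utf-8"))
```

On a node, has_required_scope(fetch_scopes()) returning False suggests the scope list was not applied and the GCSFuse mount is likely to fail with a permission error.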

This configuration can be found in our slurm-master repo, which holds the latest working code from the official repo.

You can fork this repo, add your changes to the tfvars file, and follow the README to set up CI/CD for your Slurm cluster infrastructure.

Monitoring

Google Stackdriver (now Cloud Monitoring) should be enabled for monitoring. The dashboard can be accessed from the Google Cloud Console, for example at https://console.cloud.google.com/monitoring/dashboards/resourceList/gce_instance?project=citc-slurm&timeDomain=1h.
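The dashboard URL follows a simple pattern of project ID plus time window, so a link for another project can be generated rather than edited by hand. This is a small sketch based only on the example URL above. The function name dashboard_url is hypothetical, and the console URL layout may change over time.

```python
from urllib.parse import urlencode

# Base path taken from the example dashboard URL in this document.
CONSOLE_BASE = (
    "https://console.cloud.google.com/monitoring/dashboards/"
    "resourceList/gce_instance"
)


def dashboard_url(project: str, time_domain: str = "1h") -> str:
    """Build the GCE instance dashboard URL for a project and time window."""
    if not project:
        raise ValueError("project must not be empty")
    query = urlencode({"project": project, "timeDomain": time_domain})
    return f"{CONSOLE_BASE}?{query}"
```

Calling dashboard_url("citc-slurm") reproduces the example link above.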