Kubeflow for Machine Learning
=============================

Kubeflow is a cloud-native platform for machine learning based on Google's internal ML pipelines. The complete documentation is at https://www.kubeflow.org/docs/. Kubeflow is under active development, with a new release roughly every month. The release notes are at https://github.com/kubeflow/kubeflow/releases/.

Setting up Kubeflow on GKE
--------------------------

Kubeflow can run on any environment with Kubernetes. When it is used for ML, the models, quotas and performance of the available GPUs become a major decision factor. Embassy Hosted Kubernetes does not have GPUs. GKE is tried first as it is the most mature environment for Kubernetes, Kubeflow and ML with GPU acceleration.

1. Create a GKE cluster. The choice of zone is important if GPUs are needed. See `GPU availability at europe-west`_ for the GPU accelerators in europe-west.
2. Follow the instructions at https://www.kubeflow.org/docs/started/k8s/kfctl-existing-arrikto/ to deploy Kubeflow. It requires a Kubernetes cluster with LoadBalancer support. Create MetalLB if needed.
3. Get the credentials for the newly created cluster: `gcloud container clusters get-credentials ${CLUSTER} --zone ${ZONE} --project ${PROJECT}`.
4. Get the IP address and open the Kubeflow dashboard with the following commands::

       IP_KUBEFLOW=$(kubectl get svc -n istio-system istio-ingressgateway -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
       open https://${IP_KUBEFLOW}

   Alternatively, use port forwarding on the local host::

       kubectl port-forward -n istio-system svc/istio-ingressgateway 8443:443
       open https://localhost:8443

Setting up Jupyter notebook
---------------------------

Jupyter notebooks can be created easily on the Kubeflow dashboard:

1. Click `Notebook Servers` to create as many notebook servers as you want.
2. Click `CONNECT` to start using a notebook server.

Note that the GPUs allocatable on each node in a cluster can be listed as documented in https://www.kubeflow.org/docs/notebooks/setup/, for example::

    kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"

    NAME                                       GPU
    gke-tsi-gpu-1-default-pool-c5d48ec2-8ckx
    gke-tsi-gpu-1-default-pool-c5d48ec2-lqmr
    gke-tsi-gpu-1-gpu-pool-1-695efd18-wlq9     1

If GPUs are to be used, a GPU-enabled Docker image and an extra resource limit need to be provided::

    {"nvidia.com/gpu": 1}

The generated StatefulSet YAML contains the resource limit for Nvidia GPUs, for example::

    spec:
      containers:
      - env:
        - name: NB_PREFIX
          value: /notebook/davidyuan/jupyter-3
        image: gcr.io/kubeflow-images-public/tensorflow-1.13.1-notebook-gpu:v0.5.0
        imagePullPolicy: IfNotPresent
        name: jupyter-3
        ports:
        - containerPort: 8888
          name: notebook-port
          protocol: TCP
        resources:
          limits:
            nvidia.com/gpu: "1"
          requests:
            cpu: 500m
            memory: 1Gi
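Once a notebook server with a GPU is running, it is worth confirming that TensorFlow can actually see the device. A minimal check from a notebook cell, assuming the TensorFlow 1.x GPU image shown above::

    import tensorflow as tf
    from tensorflow.python.client import device_lib

    # True only if the nvidia.com/gpu resource was allocated and the
    # driver DaemonSet is installed on the node.
    print(tf.test.is_gpu_available())

    # Lists the devices visible to TensorFlow, e.g. /device:GPU:0.
    print(device_lib.list_local_devices())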
Customising images for Notebook servers
---------------------------------------

It is a good idea to start customisation from an official image in GCR. Docker needs to be configured to access the images at https://gcr.io/kubeflow-images-public/::

    gcloud components install docker-credential-gcr
    gcloud auth configure-docker

If starting from a custom image instead, the following must be included in the Dockerfile so that Kubeflow can set the notebook's base URL, according to https://www.kubeflow.org/docs/notebooks/custom-notebook/::

    ENV NB_PREFIX /

    CMD ["sh","-c", "jupyter notebook --notebook-dir=/home/jovyan --ip=0.0.0.0 --no-browser --allow-root --port=8888 --NotebookApp.token='' --NotebookApp.password='' --NotebookApp.allow_origin='*' --NotebookApp.base_url=${NB_PREFIX}"]

Pipeline SDK
------------

Set up and activate a venv, under `~/.kube` for example, and ensure that its Python version is 3.5 or later. Install the SDK in the venv::

    cd ~/.kube
    virtualenv venv
    source venv/bin/activate
    python --version

    pip install https://storage.googleapis.com/ml-pipeline/release/latest/kfp.tar.gz --upgrade
    which dsl-compile

If IntelliJ is used as the IDE, ensure that the project structure is updated with the SDK in this venv.

When a Python script is run, it compiles itself into an Argo manifest in a ZIP archive, which can be uploaded to Kubeflow Pipelines::

    if __name__ == '__main__':
        kfp.compiler.Compiler().compile(align_and_vc, __file__ + '.zip')

Alternatively, compile the Python pipeline in the CLI::

    dsl-compile --py [path/to/python/file] --output [path/to/output/tar.gz]

Authentication
--------------

Authentication can be done via an external IdP (e.g. Google, LinkedIn, etc.), static users or LDAP. More details can be found at https://www.kubeflow.org/docs/started/k8s/kfctl-existing-arrikto/#accessing-kubeflow.

Accessing Git repository
------------------------

A Git client is already installed, and an SSH terminal is available. The `*.ipynb` notebooks can be easily checked in and out of Git repositories on persistent storage. A cheatsheet of the most used Git commands is available at:

* https://tsi-ccdoc.readthedocs.io/en/master/Tech-tips/DevOps-toolchain-docker.html?highlight=git%20commit#appendix-command-line-instructions

Accessing data via Tensorflow
-----------------------------

As documented in https://www.kubeflow.org/docs/pipelines/sdk/component-development/, the `tf.gfile` module supports both local and cloud storage paths::

    #!/usr/bin/env python3
    import argparse
    import os
    from pathlib import Path

    from tensorflow import gfile  # supports both local paths and Cloud Storage (GCS) or S3

    # Defining and parsing the command-line arguments
    parser = argparse.ArgumentParser(description='My program description')
    parser.add_argument('--input1-path', type=str,
                        help='Path of the local file or GCS blob containing the Input 1 data.')
    parser.add_argument('--param1', type=int, default=100,
                        help='Parameter 1.')
    parser.add_argument('--output1-path', type=str,
                        help='Path of the local file or GCS blob where the Output 1 data should be written.')
    parser.add_argument('--output1-path-file', type=str,
                        help='Path of the local file where the Output 1 URI data should be written.')
    args = parser.parse_args()

    gfile.MakeDirs(os.path.dirname(args.output1_path))

    # Opening the input/output files and performing the actual work
    with gfile.Open(args.input1_path, 'r') as input1_file, gfile.Open(args.output1_path, 'w') as output1_file:
        pass  # do_work(input1_file, output1_file, args.param1)

    # Writing args.output1_path to a file so that it will be passed to downstream tasks
    Path(args.output1_path_file).parent.mkdir(parents=True, exist_ok=True)
    Path(args.output1_path_file).write_text(args.output1_path)

This API fits relatively small files, in the GB range, read from cloud storage. There is no good solution yet for large files in the TB range or large volumes in the PB range. In addition, the https://github.com/tensorflow/io and https://github.com/google/nucleus libraries process specific file types. In particular, Nucleus handles special file types for genomic sequence processing; it branched from DeepVariant (https://github.com/google/deepvariant).
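A component script like the one above can be wired into a pipeline and compiled to an Argo manifest as described under `Pipeline SDK`_. Here is a minimal sketch; the image name and container paths are hypothetical placeholders, not part of the official example::

    import kfp
    from kfp import dsl


    @dsl.pipeline(name='example-pipeline', description='Wires the component above into a single step.')
    def example_pipeline(input1_path='gs://my-bucket/input1.txt', param1=100):
        step1 = dsl.ContainerOp(
            name='step1',
            # Hypothetical image containing the component script above.
            image='gcr.io/my-project/my-component:latest',
            command=['python3', '/app/program.py'],
            arguments=[
                '--input1-path', input1_path,
                '--param1', param1,
                '--output1-path', 'gs://my-bucket/output1.txt',
                '--output1-path-file', '/tmp/output1-path.txt',
            ],
            # Exposes the written output URI to downstream steps.
            file_outputs={'output1': '/tmp/output1-path.txt'},
        )


    if __name__ == '__main__':
        # Compiles to an Argo manifest that can be uploaded to the dashboard.
        kfp.compiler.Compiler().compile(example_pipeline, __file__ + '.zip')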
Accessing data via OneData
--------------------------

OneClient can be installed, but it cannot run without root privilege. Running a container with root privilege is a serious security risk that Kubeflow will never accept, so integrating OneData with Kubeflow via OneClient is out of the question::

    conda install -c onedata oneclient=18.02.2

    oneclient --help
    oneclient: error while loading shared libraries: libprotobuf.so.19: cannot open shared object file: No such file or directory

The only sensible option is for OneData to provide a provisioner, as other storage vendors do (https://github.com/kubernetes-incubator/external-storage); a common library by the Kubernetes SIG is available at https://github.com/kubernetes-sigs/sig-storage-lib-external-provisioner.

GPU quotas
----------

To find out the GPU quota in a region, apply filters on the service `Compute Engine API`, the metric `GPUs` and a location such as `europe-west1` at https://console.cloud.google.com/iam-admin/quotas.

Separate GPU node pool
----------------------

Always create separate GPU node pools in a cluster. When adding a GPU node pool to an existing cluster that already runs a non-GPU node pool, GKE automatically taints the GPU nodes with the following node taint:

* Key: `nvidia.com/gpu`
* Effect: `NoSchedule`

Additionally, GKE automatically applies the corresponding tolerations to Pods requesting GPUs by running the ExtendedResourceToleration admission controller. This causes only Pods requesting GPUs to be scheduled on GPU nodes, which enables more efficient autoscaling: your GPU nodes can quickly scale down if there are not enough Pods requesting GPUs.

Run the following commands to add a separate GPU pool to an existing cluster and to install the Nvidia drivers, for example::

    gcloud container node-pools create pool-gpu-1 --accelerator type=nvidia-tesla-p100,count=1 --zone ${ZONE} --cluster ${CLUSTER} --num-nodes 1 --min-nodes 0 --max-nodes 2 --enable-autoscaling

    kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
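A pipeline step then lands on the GPU pool simply by requesting the extended resource; no explicit toleration is needed thanks to the admission controller described above. A minimal sketch with the Pipelines SDK, where the training image is a hypothetical placeholder::

    from kfp import dsl


    @dsl.pipeline(name='gpu-example')
    def gpu_pipeline():
        train = dsl.ContainerOp(
            name='train',
            image='gcr.io/my-project/train-gpu:latest',  # hypothetical image
            command=['python3', 'train.py'],
        )
        # Equivalent to resources.limits {"nvidia.com/gpu": 1}, so the Pod
        # is scheduled onto a node in the GPU pool.
        train.set_gpu_limit(1)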
GPU availability at europe-west
-------------------------------

To see a list of all GPU accelerator types supported in each zone, run the following command::

    gcloud compute accelerator-types list | grep europe-west

    nvidia-tesla-k80        europe-west1-d   NVIDIA Tesla K80
    nvidia-tesla-p100       europe-west1-d   NVIDIA Tesla P100
    nvidia-tesla-p100-vws   europe-west1-d   NVIDIA Tesla P100 Virtual Workstation
    nvidia-tesla-k80        europe-west1-b   NVIDIA Tesla K80
    nvidia-tesla-p100       europe-west1-b   NVIDIA Tesla P100
    nvidia-tesla-p100-vws   europe-west1-b   NVIDIA Tesla P100 Virtual Workstation
    nvidia-tesla-p100       europe-west4-a   NVIDIA Tesla P100
    nvidia-tesla-p100-vws   europe-west4-a   NVIDIA Tesla P100 Virtual Workstation
    nvidia-tesla-v100       europe-west4-a   NVIDIA Tesla V100
    nvidia-tesla-p4         europe-west4-c   NVIDIA Tesla P4
    nvidia-tesla-p4-vws     europe-west4-c   NVIDIA Tesla P4 Virtual Workstation
    nvidia-tesla-t4         europe-west4-c   NVIDIA Tesla T4
    nvidia-tesla-t4-vws     europe-west4-c   NVIDIA Tesla T4 Virtual Workstation
    nvidia-tesla-v100       europe-west4-c   NVIDIA Tesla V100
    nvidia-tesla-p4         europe-west4-b   NVIDIA Tesla P4
    nvidia-tesla-p4-vws     europe-west4-b   NVIDIA Tesla P4 Virtual Workstation
    nvidia-tesla-t4         europe-west4-b   NVIDIA Tesla T4
    nvidia-tesla-t4-vws     europe-west4-b   NVIDIA Tesla T4 Virtual Workstation
    nvidia-tesla-v100       europe-west4-b   NVIDIA Tesla V100

In summary, the following models are available, as permanent or preemptible instances, at present:

#. NVIDIA K80
#. NVIDIA V100
#. NVIDIA P100
#. NVIDIA P4
#. NVIDIA T4

Note that the P100, P4 and T4 are also available as virtual workstations.

According to https://cloud.google.com/kubernetes-engine/docs/how-to/gpus, GKE node pools can be created with all of the GPUs above, provided that:

#. the Kubernetes version is 1.9 or later for the Container-Optimised OS node image, or 1.11.3 or later for the Ubuntu node image;
#. the GPU drivers are installed manually via a DaemonSet, see https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#installing_drivers.

GPUs are attached as accelerators to VMs or Kubernetes clusters, and the virtual infrastructure passes requests through to GPUs in the same region and zone. GPUs are now available in all regions and zones, but quotas vary. If a workload requires GPUs, the GPU quota needs to be checked when a region or zone is selected.

GPU pricing
-----------

GPU pricing appears to be the same in all regions. More details can be found at https://cloud.google.com/compute/all-pricing#gpus, together with the list of regions and zones at https://cloud.google.com/compute/docs/regions-zones/.

References
----------

* https://cloud.google.com/gpu/
* https://cloud.google.com/compute/docs/gpus/
* https://www.kubeflow.org/docs/started/k8s/kfctl-existing-arrikto/
* https://www.kubeflow.org/docs/
* https://onedata.org/docs/doc/using_onedata/onedatafs.html
* https://github.com/kubeflow/pipelines/