Friday, April 2, 2021

How to install a Kubernetes Cluster with NVIDIA GPU on AWS using DeepOps

Goal:

This article shares a step-by-step guide on how to install a Kubernetes Cluster with NVIDIA GPU on AWS using DeepOps.

Env:

AWS EC2 (G4dn)

Ubuntu 18.04

Solution:

Most of the steps are the same as in the previous blog post: How to install a Kubernetes Cluster with NVIDIA GPU on AWS.

That previous blog used kubeadm to manually install a Kubernetes cluster, installing the components below by hand: Docker, the NVIDIA Container Toolkit (nvidia-docker2), and the NVIDIA Device Plugin.

In this blog, we will let DeepOps do all of that work instead, following https://github.com/NVIDIA/deepops/tree/master/docs/k8s-cluster.

So basically we just need to replace section #4 of the previous blog with the steps below. (That is why the numbering here starts at step 4.)

4.1 Download DeepOps repo

On the EC2 machine:

git clone https://github.com/NVIDIA/deepops.git
cd deepops \
&& git checkout tags/20.10

4.2 Install ansible and other needed software

./scripts/setup.sh

4.3 Edit inventory and add nodes to the "KUBERNETES" section

vi config/inventory

Note: Since this is a single-node cluster, we need to add the same `hostname` to the [kube-master], [etcd], and [kube-node] sections.
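For example, a minimal single-node inventory could look like the sketch below (the hostname "gpu01" and IP "10.0.0.10" are placeholders; use your node's actual hostname and private IP):

[all]
gpu01    ansible_host=10.0.0.10

[kube-master]
gpu01

[etcd]
gpu01

[kube-node]
gpu01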

4.4 Verify the configuration

ansible all -m raw -a "hostname"
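If SSH access and the inventory are set up correctly, each host should report success and print its hostname, roughly like this (the node name and hostname are placeholders):

gpu01 | CHANGED | rc=0 >>
ip-xxx-xxx-xxx-xxx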

4.5 Install Kubernetes using Ansible and Kubespray

ansible-playbook -l k8s-cluster playbooks/k8s-cluster.yml

4.6 Test K8s cluster

kubectl get nodes
kubectl run gpu-test --rm -t -i --restart=Never --image=nvcr.io/nvidia/cuda:10.1-base-ubuntu18.04 --limits=nvidia.com/gpu=1 -- nvidia-smi
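If the GPU stack was installed correctly, the node should show up as Ready and the gpu-test pod should print the usual nvidia-smi table listing a Tesla T4 (the GPU on G4dn instances). A rough sketch of the expected output (the node name, versions and PCI address below are placeholders):

# kubectl get nodes
NAME    STATUS   ROLES    AGE   VERSION
gpu01   Ready    master   12m   v1.xx.x

# gpu-test pod output (abbreviated)
|   0  Tesla T4            Off  | 00000000:00:1E.0 Off |                    0 |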

Issues:

1.  There are 2 CoreDNS pods, with 1 pod stuck in Pending

# kubectl get pods -A |grep coredns
kube-system coredns-123 0/1 Pending 0 2m40s
kube-system coredns-456 1/1 Running 0 64m

If we describe this pending pod, we can see that the scheduling failure is caused by pod affinity/anti-affinity rules, since we have only 1 node in this K8s cluster.

# kubectl describe pod coredns-123 -n kube-system  |grep affinity
Warning FailedScheduling 73s default-scheduler 0/1 nodes are available: 1 node(s) didn't match pod affinity/anti-affinity, 1 node(s) didn't satisfy existing pods anti-affinity rules.
Warning FailedScheduling 73s default-scheduler 0/1 nodes are available: 1 node(s) didn't match pod affinity/anti-affinity, 1 node(s) didn't satisfy existing pods anti-affinity rules.

The CoreDNS deployment has 2 desired pods:

# kubectl describe deployment.apps -n kube-system coredns |grep desired
Replicas: 2 desired | 2 updated | 2 total | 1 available | 1 unavailable

My first thought was to resolve this by manually scaling down the CoreDNS deployment:

kubectl scale deployments.apps -n kube-system coredns --replicas=1

However, it did not work.

The reason is that, by default, the dns-autoscaler deployment is also installed and it keeps scaling CoreDNS back to its configured minimum of 2 replicas, so the final fix is to edit its ConfigMap:

kubectl edit configmap dns-autoscaler --namespace=kube-system

In the above ConfigMap, change "min":2 to "min":1.
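A sketch of what the edited ConfigMap might look like; only "min" is changed from 2 to 1, and the other values shown (coresPerReplica, nodesPerReplica, etc.) are illustrative defaults that should be left as they already are in your cluster:

apiVersion: v1
kind: ConfigMap
metadata:
  name: dns-autoscaler
  namespace: kube-system
data:
  linear: '{"coresPerReplica":256,"nodesPerReplica":16,"min":1,"preventSinglePointFailure":true}'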

After that, if you describe the CoreDNS deployment again, it will show that it has been scaled down to 1 replica:

# kubectl describe deployment.apps -n kube-system coredns
Replicas: 1 desired | 1 updated | 1 total | 0 available | 1 unavailable
Normal ScalingReplicaSet 21s (x2 over 12m) deployment-controller Scaled down replica set coredns-xxx to 1

Finally, you can delete the pending CoreDNS pod if it is still there:

kubectl delete pods coredns-123 -n kube-system

2.  CoreDNS pod crashes with the reason "OOMKilled"

If we describe the crashed pod, we can see the reason below:

    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Fri, 02 Apr 2021 21:32:12 +0000
      Finished:     Fri, 02 Apr 2021 21:32:21 +0000
    Ready:          False
    Restart Count:  3
    Limits:
      memory:  170Mi
    Requests:
      cpu:     100m
      memory:  70Mi

This is because, by default, the CoreDNS pod has a 170Mi memory limit, which may be too small for a big cluster. There are other reported occurrences of this issue as well.

The fix is straightforward: just increase the CoreDNS deployment's resource limits:

kubectl set resources deployment.v1.apps/coredns -n kube-system --limits=cpu=1000m,memory=1024Mi
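To confirm the new limits took effect, we can check the container resources in the deployment spec (the memory limit should now show 1024Mi):

kubectl get deployment coredns -n kube-system -o jsonpath='{.spec.template.spec.containers[0].resources.limits}'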

3.  Spark on Kubernetes Job in client mode keeps failing

The Spark Driver may keep printing the message below:
Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources.

The Spark Executors may keep crashing and restarting, but if we use "kubectl logs" to check the Executor pod, we can find the root cause:

Caused by: java.net.UnknownHostException: ip-xxx-xxx-xxx-xxx.cluster.local

This means the pod cannot resolve the hostname of the node.

To troubleshoot, we can spin up a "busybox" pod to test DNS:

a. Create busybox.yaml with the content below:

apiVersion: v1
kind: Pod
metadata:
  name: busybox
  namespace: default
spec:
  containers:
  - image: busybox
    command:
      - sleep
      - "3600"
    imagePullPolicy: IfNotPresent
    name: busybox
  restartPolicy: Always

b. Test the DNS resolution in the sample "busybox" pod:

kubectl create -f busybox.yaml
kubectl exec -ti busybox -- cat /etc/resolv.conf
kubectl exec -ti busybox -- nslookup ip-xxx-xxx-xxx-xxx.cluster.local
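The output will look roughly like the sketch below (IPs, search domains and hostnames are placeholders):

# kubectl exec -ti busybox -- cat /etc/resolv.conf
nameserver 169.254.25.10
search default.svc.cluster.local svc.cluster.local cluster.local

# kubectl exec -ti busybox -- nslookup ip-xxx-xxx-xxx-xxx.cluster.local
Server:    169.254.25.10
Address 1: 169.254.25.10
nslookup: can't resolve 'ip-xxx-xxx-xxx-xxx.cluster.local'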

From this output we can see that /etc/resolv.conf points to the default DNS server "169.254.25.10", which cannot resolve the hostname -f of the machine.

So what is this IP 169.254.25.10?

As we know, by default Kubespray enables the nodelocal DNS cache with the default IP 169.254.25.10.

So it creates a new IP address on this machine, which you can see with "ifconfig":

# ifconfig -a |grep 169.254.25.10
inet 169.254.25.10 netmask 255.255.255.255 broadcast 169.254.25.10

# ps -ef|grep 169.254.25.10|grep -v grep
root 111 222 0 xx:xx ? 00:00:45 /node-cache -localip 169.254.25.10 -conf /etc/coredns/Corefile -upstreamsvc coredns

# kubectl get pods -A |grep nodelocaldns
kube-system nodelocaldns-xxxxx 1/1 Running 0 161m


Eventually I found out the root cause:

The hostname and hostname -f commands on the EC2 machine return different results:

hostname returns "ip-xxx-xxx-xxx-xxx.ec2.internal", while hostname -f returns "ip-xxx-xxx-xxx-xxx.cluster.local".

This is because the entry below was added by Ansible to /etc/hosts:

# Ansible inventory hosts BEGIN
xxx.xxx.xxx.xxx ip-xxx-xxx-xxx-xxx.cluster.local ip-xxx-xxx-xxx-xxx ip-xxx-xxx-xxx-xxx.ec2.internal.cluster.local ip-xxx-xxx-xxx-xxx.ec2.internal

After removing the above entry from /etc/hosts, hostname and hostname -f now match: "ip-xxx-xxx-xxx-xxx.ec2.internal".

Basically, we just let the DNS server resolve the hostname instead of /etc/hosts.
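As a quick sanity check (the hostnames below are placeholders), both commands should now return the same ec2.internal name, and the busybox pod should be able to resolve it through the upstream DNS:

# hostname
ip-xxx-xxx-xxx-xxx.ec2.internal
# hostname -f
ip-xxx-xxx-xxx-xxx.ec2.internal
# kubectl exec -ti busybox -- nslookup ip-xxx-xxx-xxx-xxx.ec2.internal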

Now the Spark on Kubernetes job in client mode works fine.

