Building a Kubeflow 1.3 Machine Learning Platform on Rancher Kubernetes 1.17.17

This guide assumes the machines have NVIDIA GPUs and that a reasonably recent driver is already installed.

Install Docker

The installation follows [1]:

yum -y install yum-utils && \
yum-config-manager --add-repo=https://download.docker.com/linux/centos/docker-ce.repo && \
yum install -y https://download.docker.com/linux/centos/7/x86_64/stable/Packages/containerd.io-1.4.3-3.1.el7.x86_64.rpm && \
yum install docker-ce -y && \
systemctl --now enable docker
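
Before moving on to the GPU pieces, it may be worth a quick smoke test that Docker itself works (hello-world is Docker's standard test image and is pulled from Docker Hub):

sudo docker run --rm hello-world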

Install nvidia-docker2

The installation follows [2].

CentOS 7:

distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
&& curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo && \
yum clean expire-cache && \
yum install -y nvidia-docker2 && \
systemctl restart docker

Ubuntu:

distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
&& curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
&& curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update
sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker

Verify the installation:

docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi

To avoid passing --gpus on every run, edit the Docker configuration file /etc/docker/daemon.json:

{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "exec-opts": ["native.cgroupdriver=systemd"],
    "default-runtime": "nvidia"
}
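
Docker has to be restarted for the daemon.json change to take effect; afterwards you can confirm that nvidia is now the default runtime (the exact output wording varies slightly across Docker versions):

sudo systemctl restart docker
docker info | grep -i runtime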

Install Kubernetes

Besides kubeadm, there are many ways to install Kubernetes, such as microk8s and kind. The problem with the former is that all of its images live on gcr.io (Google Container Registry), which cannot be pulled without a proxy, and it offers no way to configure a mirror registry, so it is painful to use. The latter is Kubernetes IN Docker, a simulated pseudo-cluster; although easy to deploy, it is also out of scope here.

Using Rancher is arguably a better option: it provides an easy-to-use UI and friendly interactions, and it runs on Docker containers, so even if you wipe everything and start over, there are no unpredictable surprises.

As for the Kubernetes version, Kubeflow 1.3 was tested against Kubernetes 1.17 [3], so to avoid unnecessary trouble we also pick Kubernetes 1.17.

Before going further, adjust a few system settings, for example:

# Disable SELinux
sudo setenforce 0
# Disable swap
sudo swapoff -a
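
Both commands above only last until the next reboot. A rough sketch of making the changes persistent on CentOS 7 (the paths are the stock ones; adjust if your system differs):

# Keep SELinux disabled after reboot
sudo sed -i 's/^SELINUX=enforcing$/SELINUX=disabled/' /etc/selinux/config
# Comment out swap entries so swap stays off after reboot
sudo sed -ri 's/^([^#].*[[:space:]]swap[[:space:]].*)$/#\1/' /etc/fstab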

Also, the clocks and time zones on all machines must be consistent.
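
One way to keep clocks and time zones aligned, assuming CentOS 7 and that chrony is acceptable as the time-sync daemon (pick your own time zone):

sudo timedatectl set-timezone Asia/Shanghai
sudo yum install -y chrony
sudo systemctl enable --now chronyd
# Check synchronization status
chronyc tracking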

Terminology

The main terms involved in this part are:

  • Rancher Server: manages and provisions Kubernetes clusters. You interact with downstream Kubernetes clusters through the Rancher Server UI.
  • RKE (Rancher Kubernetes Engine): a certified Kubernetes distribution with its own CLI for creating and managing Kubernetes clusters. When you create a cluster in the Rancher UI, Rancher calls RKE to provision the Rancher-launched Kubernetes cluster.
  • kubectl: the Kubernetes command-line tool.

Install Rancher

Rancher must expose its service over HTTPS, so the best solution is to register a proper (sub)domain and obtain a certificate from a public CA. If the application process is too much hassle, you can generate a self-signed certificate instead (this causes some extra trouble later on, but all of it can be worked around). [4]

Generate a self-signed certificate

This step follows [5].

A one-shot script to generate a self-signed SSL certificate:

#!/bin/bash -e

help ()
{
    echo ' ================================================================ '
    echo ' --ssl-domain: the primary domain for the SSL certificate; defaults to www.rancher.local if not specified; can be ignored if the server is accessed by IP;'
    echo ' --ssl-trusted-ip: normally an SSL certificate only trusts requests made by domain name; if you also need to access the server by IP, add those IPs as extensions to the certificate, separated by commas;'
    echo ' --ssl-trusted-domain: to allow access via additional domains, add them as trusted domains (SSL_TRUSTED_DOMAIN), separated by commas;'
    echo ' --ssl-size: SSL key size in bits, default 2048;'
    echo ' --ssl-cn: country code (2-letter code), default CN;'
    echo ' Usage example:'
    echo ' ./create_self-signed-cert.sh --ssl-domain=www.test.com --ssl-trusted-domain=www.test2.com \ '
    echo ' --ssl-trusted-ip=1.1.1.1,2.2.2.2,3.3.3.3 --ssl-size=2048 --ssl-date=3650'
    echo ' ================================================================'
}

case "$1" in
    -h|--help) help; exit;;
esac

if [[ $1 == '' ]];then
    help;
    exit;
fi

CMDOPTS="$*"
for OPTS in $CMDOPTS;
do
    key=$(echo ${OPTS} | awk -F"=" '{print $1}' )
    value=$(echo ${OPTS} | awk -F"=" '{print $2}' )
    case "$key" in
        --ssl-domain) SSL_DOMAIN=$value ;;
        --ssl-trusted-ip) SSL_TRUSTED_IP=$value ;;
        --ssl-trusted-domain) SSL_TRUSTED_DOMAIN=$value ;;
        --ssl-size) SSL_SIZE=$value ;;
        --ssl-date) SSL_DATE=$value ;;
        --ca-date) CA_DATE=$value ;;
        --ssl-cn) CN=$value ;;
    esac
done

# CA settings
CA_DATE=${CA_DATE:-3650}
CA_KEY=${CA_KEY:-cakey.pem}
CA_CERT=${CA_CERT:-cacerts.pem}
CA_DOMAIN=cattle-ca

# SSL settings
SSL_CONFIG=${SSL_CONFIG:-$PWD/openssl.cnf}
SSL_DOMAIN=${SSL_DOMAIN:-'www.rancher.local'}
SSL_DATE=${SSL_DATE:-3650}
SSL_SIZE=${SSL_SIZE:-2048}

## Country code (2-letter code), default CN
CN=${CN:-CN}

SSL_KEY=$SSL_DOMAIN.key
SSL_CSR=$SSL_DOMAIN.csr
SSL_CERT=$SSL_DOMAIN.crt

echo -e "\033[32m ---------------------------- \033[0m"
echo -e "\033[32m | Generate SSL Cert | \033[0m"
echo -e "\033[32m ---------------------------- \033[0m"

if [[ -e ./${CA_KEY} ]]; then
    echo -e "\033[32m ====> 1. Existing CA key found; backing up ${CA_KEY} as ${CA_KEY}-bak, then recreating \033[0m"
    mv ${CA_KEY} "${CA_KEY}"-bak
    openssl genrsa -out ${CA_KEY} ${SSL_SIZE}
else
    echo -e "\033[32m ====> 1. Generating new CA key ${CA_KEY} \033[0m"
    openssl genrsa -out ${CA_KEY} ${SSL_SIZE}
fi

if [[ -e ./${CA_CERT} ]]; then
    echo -e "\033[32m ====> 2. Existing CA cert found; backing up ${CA_CERT} as ${CA_CERT}-bak, then recreating \033[0m"
    mv ${CA_CERT} "${CA_CERT}"-bak
    openssl req -x509 -sha256 -new -nodes -key ${CA_KEY} -days ${CA_DATE} -out ${CA_CERT} -subj "/C=${CN}/CN=${CA_DOMAIN}"
else
    echo -e "\033[32m ====> 2. Generating new CA cert ${CA_CERT} \033[0m"
    openssl req -x509 -sha256 -new -nodes -key ${CA_KEY} -days ${CA_DATE} -out ${CA_CERT} -subj "/C=${CN}/CN=${CA_DOMAIN}"
fi

echo -e "\033[32m ====> 3. Generating openssl config ${SSL_CONFIG} \033[0m"
cat > ${SSL_CONFIG} <<EOM
[req]
req_extensions = v3_req
distinguished_name = req_distinguished_name
[req_distinguished_name]
[ v3_req ]
basicConstraints = CA:FALSE
keyUsage = nonRepudiation, digitalSignature, keyEncipherment
extendedKeyUsage = clientAuth, serverAuth
EOM

if [[ -n ${SSL_TRUSTED_IP} || -n ${SSL_TRUSTED_DOMAIN} ]]; then
    cat >> ${SSL_CONFIG} <<EOM
subjectAltName = @alt_names
[alt_names]
EOM
    IFS=","
    dns=(${SSL_TRUSTED_DOMAIN})
    dns+=(${SSL_DOMAIN})
    for i in "${!dns[@]}"; do
        echo DNS.$((i+1)) = ${dns[$i]} >> ${SSL_CONFIG}
    done

    if [[ -n ${SSL_TRUSTED_IP} ]]; then
        ip=(${SSL_TRUSTED_IP})
        for i in "${!ip[@]}"; do
            echo IP.$((i+1)) = ${ip[$i]} >> ${SSL_CONFIG}
        done
    fi
fi

echo -e "\033[32m ====> 4. Generating server SSL KEY ${SSL_KEY} \033[0m"
openssl genrsa -out ${SSL_KEY} ${SSL_SIZE}

echo -e "\033[32m ====> 5. Generating server SSL CSR ${SSL_CSR} \033[0m"
openssl req -sha256 -new -key ${SSL_KEY} -out ${SSL_CSR} -subj "/C=${CN}/CN=${SSL_DOMAIN}" -config ${SSL_CONFIG}

echo -e "\033[32m ====> 6. Generating server SSL CERT ${SSL_CERT} \033[0m"
openssl x509 -sha256 -req -in ${SSL_CSR} -CA ${CA_CERT} \
    -CAkey ${CA_KEY} -CAcreateserial -out ${SSL_CERT} \
    -days ${SSL_DATE} -extensions v3_req \
    -extfile ${SSL_CONFIG}

echo -e "\033[32m ====> 7. Certificates generated \033[0m"
echo
echo -e "\033[32m ====> 8. Results in YAML format \033[0m"
echo "----------------------------------------------------------"
echo "ca_key: |"
cat $CA_KEY | sed 's/^/ /'
echo
echo "ca_cert: |"
cat $CA_CERT | sed 's/^/ /'
echo
echo "ssl_key: |"
cat $SSL_KEY | sed 's/^/ /'
echo
echo "ssl_csr: |"
cat $SSL_CSR | sed 's/^/ /'
echo
echo "ssl_cert: |"
cat $SSL_CERT | sed 's/^/ /'
echo

echo -e "\033[32m ====> 9. Appending CA cert to the cert file \033[0m"
cat ${CA_CERT} >> ${SSL_CERT}
echo "ssl_cert: |"
cat $SSL_CERT | sed 's/^/ /'
echo

echo -e "\033[32m ====> 10. Renaming server certificates \033[0m"
echo "cp ${SSL_DOMAIN}.key tls.key"
cp ${SSL_DOMAIN}.key tls.key
echo "cp ${SSL_DOMAIN}.crt tls.crt"
cp ${SSL_DOMAIN}.crt tls.crt

Save the code above as create_self-signed-cert.sh (or any other file name you like).

Script parameters:

--ssl-domain: the primary domain for the certificate; defaults to www.rancher.local if not specified; can be ignored if the server is accessed by IP;
--ssl-trusted-ip: normally an SSL certificate only trusts requests made by domain name; if you also need to access the server by IP, add those IPs as extensions to the certificate, separated by commas;
--ssl-trusted-domain: to allow access via additional domains, add them as trusted domains (TRUSTED_DOMAIN), separated by commas;
--ssl-size: SSL key size in bits, default 2048;
--ssl-cn: country code (2-letter code), default CN;
Usage example:
./create_self-signed-cert.sh --ssl-domain=www.test.com --ssl-trusted-domain=www.test2.com \
--ssl-trusted-ip=1.1.1.1,2.2.2.2,3.3.3.3 --ssl-size=2048 --ssl-date=3650

For example:

mkdir sslcert
cd sslcert
chmod +x create_self-signed-cert.sh
./create_self-signed-cert.sh --ssl-domain=ml.rancher.kna.cn
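
A quick sanity check on the output (run in the same sslcert directory; tls.crt and tls.key are produced by step 10 of the script):

openssl x509 -in tls.crt -noout -subject -dates
openssl rsa -in tls.key -check -noout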

Install Rancher

Here we use the single-node installation [6]: Rancher itself runs as a Docker container, and that Rancher instance is then used to build the Kubernetes cluster.

We do not use the approach of installing Rancher on top of an existing Kubernetes cluster, i.e. the high-availability installation [7].

For a detailed explanation of the parameters below, see [6:1].

docker run -d --privileged --restart=unless-stopped \
-p 80:80 -p 443:443 \
-v /path/to/sslcert/tls.crt:/etc/rancher/ssl/cert.pem \
-v /path/to/sslcert/tls.key:/etc/rancher/ssl/key.pem \
-v /path/to/sslcert/cacerts.pem:/etc/rancher/ssl/cacerts.pem \
-v /path/to/sslcert:/container/certs \
-v /path/to/rancher:/var/lib/rancher \
-e SSL_CERT_DIR="/container/certs" \
-v /data/var/log/rancher/auditlog:/var/log/auditlog \
-e AUDIT_LEVEL=1 \
rancher/rancher:v2.5.8

Configure the service

After that, wait a little while; barring surprises, ports 80 and 443 should now be reachable.

To make it reachable across the cluster, add a reverse proxy that exposes these two ports through huge01, configure the certificate there as well to enable HTTPS, and add hosts entries on each node (edit /etc/hosts), as sketched below.
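
For reference, a minimal sketch of the hosts entry each node would need (the domain is the one from the certificate example above; the IP is a placeholder for huge01's address and must be replaced with your own):

# /etc/hosts on every node (placeholder IP)
192.168.1.100   ml.rancher.kna.cn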

Create the Kubernetes cluster

Open Rancher in a browser and set the admin password.

If the password is ever forgotten, it can be reset with

$ docker exec <container_id> reset-password

After setting the initial password you land in the management UI; add a custom cluster.

Custom cluster

Set the cluster name, pick the Kubernetes version, and choose Flannel as the network provider (others may work too, but have not been tried); leave everything else at the defaults.

You then reach the cluster options page; follow the instructions there and run the generated docker command on the other machines to add them as nodes.

Each host can run multiple roles. Every cluster needs at least one etcd role, one Control Plane role, and one Worker role.

A good practice is to put the etcd and Control Plane roles on a separate, otherwise idle machine.

Add nodes

After a while, barring surprises, click Nodes and the newly added nodes should appear.

Set up kubectl

To avoid kubectl complaining with

Unable to connect to the server: x509: certificate relies on legacy Common Name field, use SANs or temporarily enable Common Name matching with GODEBUG=x509ignoreCN=0

(this error is related to the local Go runtime behavior), set an environment variable (add it to ~/.bashrc or wherever you prefer):

$ export GODEBUG=x509ignoreCN=0

Download the kubectl binary:

Download option 1: http://mirror.cnrancher.com/

Download option 2:

# Download
curl -LO https://dl.k8s.io/release/v1.17.17/bin/linux/amd64/kubectl
# Make it executable
chmod +x kubectl

Adjust the version number to your needs.

Then symlink it into a directory on $PATH, for example:

sudo ln -s $(pwd)/kubectl /usr/bin/kubectl

On the cluster page, click Kubeconfig File.

Then, on the main node, create the directory and config file:

mkdir ~/.kube
vim ~/.kube/config

and paste in the contents shown in the browser window, so that kubectl knows how to find the cluster.
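
A quick way to confirm that kubectl can now reach the cluster (the exact node list depends on your setup):

$ kubectl cluster-info
$ kubectl get nodes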

Set up GPU support

Save the following as nvidia-device-plugin.yml:

# Copyright (c) 2019, NVIDIA CORPORATION.  All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      # This annotation is deprecated. Kept here for backward compatibility
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      # This toleration is deprecated. Kept here for backward compatibility
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      - key: CriticalAddonsOnly
        operator: Exists
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      - key: node-role.kubernetes.io/master
        effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      containers:
      - image: nvcr.io/nvidia/k8s-device-plugin:v0.9.0
        name: nvidia-device-plugin-ctr
        args: ["--fail-on-init-error=false"]
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins

Then apply it:

$ kubectl apply -f nvidia-device-plugin.yml

You can track the pod creation status with:

$ kubectl get pods -n kube-system | grep nvidia
nvidia-device-plugin-daemonset-74kv8 1/1 Running 0 2d4h
nvidia-device-plugin-daemonset-75845 1/1 Running 0 2d4h
nvidia-device-plugin-daemonset-8nlsp 1/1 Running 0 2d4h
nvidia-device-plugin-daemonset-rnq8w 1/1 Running 0 2d4h

One pod is created per machine.
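
To confirm that the plugin actually registered the GPUs with Kubernetes, you can inspect the node resources; nvidia.com/gpu is the resource name the device plugin advertises:

$ kubectl describe nodes | grep -i "nvidia.com/gpu"

Each GPU node should then report a non-zero nvidia.com/gpu count under Capacity and Allocatable.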

Set up storage

The simplest option is local-path storage:

$ kubectl apply -f https://raw.githubusercontent.com/rancher/local-path-provisioner/master/deploy/local-path-storage.yaml

By default, data is stored under /opt/local-path-provisioner. To change that, as described at https://github.com/rancher/local-path-provisioner, clone the project (git clone https://github.com/rancher/local-path-provisioner.git --depth 1), edit deploy/local-path-storage.yaml accordingly, and then run

$ kubectl apply -f deploy/local-path-storage.yaml

You can check its status with:

$ kubectl -n local-path-storage get pod
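
Kubeflow's stateful components (MinIO, MySQL, Katib) create PVCs that typically rely on a default StorageClass. If local-path is the only provisioner in your cluster, a sketch of marking it as the default (this is the standard Kubernetes annotation):

$ kubectl patch storageclass local-path -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
$ kubectl get storageclass   # local-path should now show "(default)"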

Set up domain-to-IP mapping

The downside of a self-signed certificate is that, inside containers, the custom domain cannot be resolved just by editing hosts on the host machines.

You may run into:

ERROR: https://rancher.my.org/ping is not accessible (Could not resolve host: rancher.my.org)

To fix this, you can either run a DNS server in the environment, configure the correct domain-to-IP mappings there, and point every node's nameserver at it,

or use HostAliases to patch the key workloads (cattle-cluster-agent and cattle-node-agent) [8]:

kubectl -n cattle-system patch deployments cattle-cluster-agent --patch '{
    "spec": {
        "template": {
            "spec": {
                "hostAliases": [
                    {
                        "hostnames": [
                            "ml.r***a.cn"
                        ],
                        "ip": "10.1***3.17"
                    }
                ]
            }
        }
    }
}'

kubectl -n cattle-system patch daemonsets cattle-node-agent --patch '{
    "spec": {
        "template": {
            "spec": {
                "hostAliases": [
                    {
                        "hostnames": [
                            "ml.r***a.cn"
                        ],
                        "ip": "10.1***3.17"
                    }
                ]
            }
        }
    }
}'

Afterwards, you can track the status and progress with:

$ kubectl get pods -n cattle-system
NAME READY STATUS RESTARTS AGE
cattle-cluster-agent-84f4d9f7cc-xkcrq 1/1 Running 0 3h58m
cattle-node-agent-fdc5z 1/1 Running 0 4h41m
cattle-node-agent-jlpnl 1/1 Running 0 4h40m
kube-api-auth-xww7h 1/1 Running 0 2d

Set up Istio

Click the Default project.

Go to Resources -> Istio, keep the defaults, and click Enable.

You can use the following command to track the status and progress:

$ kubectl get pods -n istio-system
NAME READY STATUS RESTARTS AGE
authservice-0 1/1 Running 0 4h32m
cluster-local-gateway-66bcf8bc5d-rltpj 1/1 Running 0 4h31m
istio-citadel-66864ff6b8-znrjw 1/1 Running 0 3h12m
istio-galley-5bd9bf8b9c-8b9x6 1/1 Running 0 3h12m
istio-ingressgateway-85b49c758f-4khs7 1/1 Running 0 4h31m
istio-pilot-674bdcbbf9-8dpc8 2/2 Running 1 3h12m
istio-policy-6d9f4577db-mhxnz 2/2 Running 1 3h12m
istio-security-post-install-1.5.9-jfbkk 0/1 Completed 4 3h12m
istio-sidecar-injector-9bcfb645-vm54x 1/1 Running 0 3h12m
istio-telemetry-664b6dfd44-bhr2c 2/2 Running 6 3h12m
istio-tracing-cc6c8c677-7mrnl 1/1 Running 0 3h12m
istiod-5ff6cdbbcd-4vnhf 1/1 Running 0 4h31m
kiali-79c4c46468-bpl7l 1/1 Running 0 3h12m


Set kube-controller extra arguments

Rancher has no default certificate signer, so after installing Kubeflow as-is, the cache-server pod runs into the error

Unable to attach or mount volumes: unmounted volumes=[webhook-tls-certs], unattached volumes=[istio-data istio-envoy istio-podinfo kubeflow-pipelines-cache-token-7pwl7 webhook-tls-certs istiod-ca-cert]: timed out waiting for the condition

[9], because cert-manager has no Issuer it can use. It is therefore necessary to add two arguments before installing, as follows:

On the Global page, click Upgrade.

Choose Edit as YAML.

Add the following 3 lines under the kube-controller section [10]:

extra_args:
  cluster-signing-cert-file: "/etc/kubernetes/ssl/kube-ca.pem"
  cluster-signing-key-file: "/etc/kubernetes/ssl/kube-ca-key.pem"

Install Kubeflow

Kubeflow's installation requirements are quite demanding; most of the steps above were paving the way for it. The officially documented approach assumes unrestricted internet access, and for several reasons [11] it is very hard to follow from inside China, so after some exploration the following approach was chosen:

https://github.com/shikanon/kubeflow-manifests

This is a continuously updated set of Kubeflow installation manifests with images mirrored for China. Regardless of what its README.md says, as long as the previous steps are in good shape [12], clone it:

git clone https://github.com/shikanon/kubeflow-manifests.git --depth 1

and simply run python install.py.

If the installer's output shows no errors, you can monitor the status and progress of the pods it creates with:

$ kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
auth dex-6686f66f9b-sbbxf 1/1 Running 0 4h55m
cattle-prometheus exporter-kube-state-cluster-monitoring-5dd6d5c9fd-7rwq2 1/1 Running 0 3h35m
cattle-prometheus exporter-node-cluster-monitoring-7wvw2 1/1 Running 0 3h35m
cattle-prometheus exporter-node-cluster-monitoring-vhrk5 1/1 Running 0 3h35m
cattle-prometheus grafana-cluster-monitoring-75c5cd5995-ssr2d 2/2 Running 0 3h35m
cattle-prometheus prometheus-cluster-monitoring-0 5/5 Running 1 3h31m
cattle-prometheus prometheus-operator-monitoring-operator-f9b9567b-p6h4l 1/1 Running 0 3h35m
cattle-system cattle-cluster-agent-84f4d9f7cc-xkcrq 1/1 Running 0 4h14m
cattle-system cattle-node-agent-fdc5z 1/1 Running 0 4h57m
cattle-system cattle-node-agent-jlpnl 1/1 Running 0 4h56m
cattle-system kube-api-auth-xww7h 1/1 Running 0 2d1h
cert-manager cert-manager-9d5774b59-8hp2f 1/1 Running 0 4h55m
cert-manager cert-manager-cainjector-67c8c5c665-896r8 1/1 Running 0 4h55m
cert-manager cert-manager-webhook-75dc9757bd-zcxjp 1/1 Running 0 4h55m
ingress-nginx nginx-ingress-controller-9jzpx 1/1 Running 0 2d1h
ingress-nginx nginx-ingress-controller-bscxv 1/1 Running 0 2d1h
istio-system authservice-0 1/1 Running 0 4h55m
istio-system cluster-local-gateway-66bcf8bc5d-rltpj 1/1 Running 0 4h54m
istio-system istio-citadel-66864ff6b8-znrjw 1/1 Running 0 3h35m
istio-system istio-galley-5bd9bf8b9c-8b9x6 1/1 Running 0 3h35m
istio-system istio-ingressgateway-85b49c758f-4khs7 1/1 Running 0 4h53m
istio-system istio-pilot-674bdcbbf9-8dpc8 2/2 Running 1 3h35m
istio-system istio-policy-6d9f4577db-mhxnz 2/2 Running 1 3h35m
istio-system istio-security-post-install-1.5.9-jfbkk 0/1 Completed 4 3h35m
istio-system istio-sidecar-injector-9bcfb645-vm54x 1/1 Running 0 3h35m
istio-system istio-telemetry-664b6dfd44-bhr2c 2/2 Running 6 3h35m
istio-system istio-tracing-cc6c8c677-7mrnl 1/1 Running 0 3h35m
istio-system istiod-5ff6cdbbcd-4vnhf 1/1 Running 0 4h53m
istio-system kiali-79c4c46468-bpl7l 1/1 Running 0 3h35m
knative-eventing broker-controller-5c84984b97-4shtv 1/1 Running 0 4h55m
knative-eventing eventing-controller-54bfbd5446-4pknn 1/1 Running 0 4h55m
knative-eventing eventing-webhook-58f56d9cf4-2mrdn 1/1 Running 0 4h55m
knative-eventing imc-controller-769896c7db-8gzmc 1/1 Running 0 4h55m
knative-eventing imc-dispatcher-86954fb4cd-l9l98 1/1 Running 0 4h55m
knative-serving activator-75696c8c9-786pn 1/1 Running 0 4h55m
knative-serving autoscaler-6764f9b5c5-q9pd9 1/1 Running 0 4h55m
knative-serving controller-598fd8bfd7-8ng4j 1/1 Running 0 4h55m
knative-serving istio-webhook-785bb58cc6-xlnnk 1/1 Running 0 4h55m
knative-serving networking-istio-77fbcfcf9b-p7wr7 1/1 Running 0 4h55m
knative-serving webhook-865f54cf5f-rn7qq 1/1 Running 0 4h55m
kube-system coredns-6b84d75d99-2f5p4 1/1 Running 0 2d1h
kube-system coredns-6b84d75d99-8j8zs 1/1 Running 0 48m
kube-system coredns-autoscaler-5c4b6999d9-qq87l 1/1 Running 0 48m
kube-system kube-flannel-kjmlb 2/2 Running 0 2d1h
kube-system kube-flannel-vnhm7 2/2 Running 0 2d1h
kube-system metrics-server-7579449c57-2jqld 1/1 Running 0 2d1h
kube-system rke-coredns-addon-deploy-job-drwtd 0/1 Completed 0 49m
kubeflow-user-example-com ml-pipeline-ui-artifact-6d7ffcc4b6-dzfsk 2/2 Running 0 4h52m
kubeflow-user-example-com ml-pipeline-visualizationserver-84d577b989-7bhbm 2/2 Running 0 4h52m
kubeflow admission-webhook-deployment-54cf94d964-8qsh2 1/1 Running 0 4h51m
kubeflow cache-deployer-deployment-65cd55d4d9-d6dzd 2/2 Running 11 4h50m
kubeflow cache-server-f85c69486-rgzq6 2/2 Running 6 4h50m
kubeflow centraldashboard-7b7676d8bd-w5jw6 1/1 Running 0 4h53m
kubeflow jupyter-web-app-deployment-66f74586d9-pnv2r 1/1 Running 0 169m
kubeflow katib-controller-5467f8fdc8-rcc78 1/1 Running 0 4h50m
kubeflow katib-db-manager-646695754f-v2x82 1/1 Running 0 4h53m
kubeflow katib-mysql-5bb5bd9957-7zzct 1/1 Running 0 4h53m
kubeflow katib-ui-55fd4bd6f9-mnqg7 1/1 Running 0 4h53m
kubeflow kfserving-controller-manager-0 2/2 Running 0 4h53m
kubeflow kubeflow-pipelines-profile-controller-5698bf57cf-9t99q 1/1 Running 0 169m
kubeflow kubeflow-pipelines-profile-controller-5698bf57cf-wmbwc 1/1 Running 0 4h53m
kubeflow metacontroller-0 1/1 Running 0 4h53m
kubeflow metadata-envoy-deployment-76d65977f7-5bjkk 1/1 Running 0 4h53m
kubeflow metadata-grpc-deployment-697d9c6c67-5xbq6 2/2 Running 1 4h53m
kubeflow metadata-writer-58cdd57678-ns6kq 2/2 Running 0 4h53m
kubeflow minio-6d6784db95-82wtz 2/2 Running 0 169m
kubeflow ml-pipeline-85fc99f899-mn65n 2/2 Running 3 4h53m
kubeflow ml-pipeline-persistenceagent-65cb9594c7-gt8bw 2/2 Running 0 4h53m
kubeflow ml-pipeline-scheduledworkflow-7f8d8dfc69-spq9b 2/2 Running 0 4h53m
kubeflow ml-pipeline-ui-5c765cc7bd-hks2f 2/2 Running 0 4h53m
kubeflow ml-pipeline-viewer-crd-5b8df7f458-x62wv 2/2 Running 1 4h53m
kubeflow ml-pipeline-visualizationserver-56c5ff68d5-ndltc 2/2 Running 0 4h53m
kubeflow mpi-operator-789f88879-l5d7l 1/1 Running 0 4h53m
kubeflow mxnet-operator-7fff864957-5gcv2 1/1 Running 0 4h53m
kubeflow mysql-56b554ff66-k2qsl 2/2 Running 0 168m
kubeflow notebook-controller-deployment-74d9584477-jvvcc 1/1 Running 0 4h53m
kubeflow profiles-deployment-67b4666796-zjrpw 2/2 Running 0 4h53m
kubeflow pytorch-operator-fd86f7694-tgh8l 2/2 Running 0 4h53m
kubeflow tensorboard-controller-controller-manager-fd6bcffb4-lkz2g 3/3 Running 1 4h53m
kubeflow tensorboards-web-app-deployment-78d7b8b658-qg8kc 1/1 Running 0 4h53m
kubeflow tf-job-operator-7bc5cf4cc7-689d9 1/1 Running 0 4h53m
kubeflow volumes-web-app-deployment-68fcfc9775-mw7gz 1/1 Running 0 4h53m
kubeflow workflow-controller-5449754fb4-pc79x 2/2 Running 1 169m
kubeflow xgboost-operator-deployment-5c7bfd57cc-54hjq 2/2 Running 1 4h53m
local-path-storage local-path-provisioner-5bd6f65fdf-j575f 1/1 Running 0 2d

If all pods eventually reach the Running or Completed state, the deployment has succeeded. If some pods stay stuck in a creating state for a long time, you can simply run python install.py again.
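
Before re-running the installer, it is usually worth checking why a pod is stuck; two generic commands for that (the pod and deployment names below are just examples taken from the listing above):

# Show scheduling, image-pull, and mount events for a stuck pod
kubectl -n kubeflow describe pod cache-server-f85c69486-rgzq6
# Show logs of all containers of a deployment
kubectl -n kubeflow logs deploy/cache-server --all-containers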

To uninstall, run:

$ kubectl delete -f manifest1.3

Some problems you may run into

  1. The pods may not be visible in the Rancher UI

    This is because they have not been assigned to any project. Click the cluster name, then Namespaces, and you will see them; move them into the Default project (or a newly created project), and the namespaces and pod details will then show up under that project.

  2. The default username/password does not work / how to add new users and namespaces

    You can edit the patch/auth.yaml file; under

    staticPasswords:
    - email: "admin@example.com"
      # hash string is "password"
      hash: "$2y$12$X.oNHMsIfRSq35eRfiTYV.dPIYlWyPDRRc1.JVp0f3c.YqqJNW4uK"
      username: "admin"
      userID: "08a8684b-db88-4b73-90a9-3cd1661f5466"
    - email: myname@abc.cn
      hash: $2b$10$.zSuIlx1bl9PCyigEtebhuWG/PAhZlZoyokPdGObiE7jRUHUcQ0qW
      username: myname
      userID: 08a8684b-db88-4b73-90a9-3cd1661f5466

    you can add users. Passwords are stored as bcrypt hashes, which you can generate with the tool at https://passwordhashing.com/BCrypt.

    Note: every user corresponds to a namespace, so when you add a new user you also need to add a matching namespace, a few lines further down in the same file:

    ---
    apiVersion: kubeflow.org/v1beta1
    kind: Profile
    metadata:
      name: kubeflow-user-example-com
    spec:
      owner:
        kind: User
        name: admin@example.com
    ---
    apiVersion: kubeflow.org/v1beta1
    kind: Profile
    metadata:
      name: myname # this will be the namespace name
    spec:
      owner:
        kind: User
        name: myname@wps.cn # the email used to log in

    Finally, re-apply and restart:

    # Apply
    kubectl apply -f patch/auth.yaml
    # Restart
    kubectl rollout restart deployment dex -n auth

    Or edit the dex ConfigMap directly with the following command; the change is applied automatically after saving:

    kubectl edit configmap dex -n auth

Expose the Kubeflow service

Once the Kubeflow services are up, you can run

$ kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80

to temporarily forward port 80 of the in-cluster gateway to port 8080 on the local machine and access it over plain HTTP via localhost.

The default username is admin@example.com and the default password is password [13]; the password can be changed by editing the patch/auth.yaml file.

A password hash can be generated with [14]:

python3 -c 'from passlib.hash import bcrypt; import getpass; print(bcrypt.using(rounds=12, ident="2y").hash(getpass.getpass()))'

or with the online tool at https://passwordhashing.com/BCrypt.

To expose the service to networks other than localhost via NodePort / LoadBalancer / Ingress, HTTPS is required [15]; otherwise the main page opens, but notebooks, SSH connections, and so on cannot connect.

A simple workable approach is NodePort.

Find the SSL port

Use the following command [16]:

$ kubectl -n istio-system get service istio-ingressgateway
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
istio-ingressgateway NodePort 10.43.121.108 <none> 15021:31066/TCP,80:30000/TCP,443:32692/TCP,31400:32293/TCP,15443:31428/TCP 45h

As you can see, port 443 is mapped to 32692 on the host, but nothing is served on 443 yet; the next few steps generate and install a certificate to bring port 443 up.

Generate a self-signed certificate

Use the script from earlier to generate a new SSL certificate for Kubeflow to use:

./create_self-signed-cert.sh --ssl-domain=kub**a.cn

A CA-signed certificate can also be used directly, if you have one.

Manage the certificate with cert-manager

kubectl create --namespace istio-system secret tls kf-tls-cert --key /data/gang/kfcerts/kub***a.cn.key --cert /data/gang/kfcerts/kub***a.cn.crt

Configure Knative to use the custom domain

kubectl edit cm config-domain --namespace knative-serving

Under data, add a key named after your domain [17]; the value can be left empty unless you have special needs, e.g. kub***a.cn: ""
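
For reference, after the edit the ConfigMap's data section would look roughly like this (the masked domain mirrors the placeholder used throughout this post; substitute your own):

apiVersion: v1
kind: ConfigMap
metadata:
  name: config-domain
  namespace: knative-serving
data:
  kub***a.cn: ""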

Configure Kubeflow to use the certificate

Edit the manifest1.3/016-istio-1-9-0-kubeflow-istio-resources-base.yaml file:

vim manifest1.3/016-istio-1-9-0-kubeflow-istio-resources-base.yaml

In the kubeflow-gateway definition at the end of the file, add [18]:

- hosts:
  - '*'
  port:
    name: https
    number: 443
    protocol: HTTPS
  tls:
    mode: SIMPLE
    credentialName: kf-tls-cert

hosts can be set to the exact domain the freshly generated certificate is bound to (accepting only that domain), or to * to accept requests for any host.

Apply and restart

kubectl apply -f manifest1.3/016-istio-1-9-0-kubeflow-istio-resources-base.yaml
kubectl rollout restart deploy istio-ingressgateway -n istio-system

Now, from other machines, you can use

curl https://10.1***2:32692 -k

to reach the Kubeflow service securely. Since the certificate is self-signed, the -k flag skips certificate verification.

Result

Wipe and start over

Use the following script with care; it removes leftover files, services, network configuration, containers, and so on [19]:

#!/bin/bash

KUBE_SVC='
kubelet
kube-scheduler
kube-proxy
kube-controller-manager
kube-apiserver
'

for kube_svc in ${KUBE_SVC};
do
    # Stop the service
    if [[ `systemctl is-active ${kube_svc}` == 'active' ]]; then
        systemctl stop ${kube_svc}
    fi
    # Disable the service at boot
    if [[ `systemctl is-enabled ${kube_svc}` == 'enabled' ]]; then
        systemctl disable ${kube_svc}
    fi
done

# Stop all containers
docker stop $(docker ps -aq)

# Remove all containers
docker rm -f $(docker ps -qa)

# Remove all container volumes
docker volume rm $(docker volume ls -q)

# Unmount mounted directories
for mount in $(mount | grep tmpfs | grep '/var/lib/kubelet' | awk '{ print $3 }') /var/lib/kubelet /var/lib/rancher;
do
    umount $mount;
done

# Back up directories
mv /etc/kubernetes /etc/kubernetes-bak-$(date +"%Y%m%d%H%M")
mv /var/lib/etcd /var/lib/etcd-bak-$(date +"%Y%m%d%H%M")
mv /var/lib/rancher /var/lib/rancher-bak-$(date +"%Y%m%d%H%M")
mv /opt/rke /opt/rke-bak-$(date +"%Y%m%d%H%M")
rm -rf ~/.kube/
rm -rf /etc/kubernetes/
rm -rf /etc/systemd/system/kubelet.service.d
rm -rf /etc/systemd/system/kubelet.service
rm -rf /usr/bin/kube*
rm -rf /etc/cni
rm -rf /opt/cni
rm -rf /var/lib/etcd
rm -rf /var/etcd

# Remove leftover paths
rm -rf /etc/ceph \
    /etc/cni \
    /opt/cni \
    /run/secrets/kubernetes.io \
    /run/calico \
    /run/flannel \
    /var/lib/calico \
    /var/lib/cni \
    /var/lib/kubelet \
    /var/log/containers \
    /var/log/kube-audit \
    /var/log/pods \
    /var/run/calico \
    /usr/libexec/kubernetes

# Clean up network interfaces
no_del_net_inter='
lo
docker0
eth
ens
bond
'

network_interface=`ls /sys/class/net`

for net_inter in $network_interface;
do
    if ! echo "${no_del_net_inter}" | grep -qE ${net_inter:0:3}; then
        ip link delete $net_inter
    fi
done

# Clean up leftover processes
port_list='
80
443
6443
2376
2379
2380
8472
9099
10250
10254
'

for port in $port_list;
do
    pid=`netstat -atlnup | grep $port | awk '{print $7}' | awk -F '/' '{print $1}' | grep -v - | sort -rnk2 | uniq`
    if [[ -n $pid ]]; then
        kill -9 $pid
    fi
done

kube_pid=`ps -ef | grep -v grep | grep kube | awk '{print $2}'`

if [[ -n $kube_pid ]]; then
    kill -9 $kube_pid
fi

# Flush iptables
## Note: if the node has custom iptables rules, run the following with care
sudo iptables --flush
sudo iptables --flush --table nat
sudo iptables --flush --table filter
sudo iptables --table nat --delete-chain
sudo iptables --table filter --delete-chain
systemctl restart docker

In addition, the volume directories specified when the Rancher container was created can also be removed, as appropriate:

rm -rf /data/var/log/rancher/auditlog
rm -rf /path/to/rancher

References


  1. https://docs.docker.com/engine/install/centos/#install-using-the-repository

  2. https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#id2

  3. https://github.com/kubeflow/manifests#prerequisites

  4. https://docs.rancher.cn/docs/rancher2.5/installation/install-rancher-on-k8s/_index#3-选择您的-ssl-选项

  5. https://docs.rancher.cn/docs/rancher2.5/installation/resources/advanced/self-signed-ssl/_index

  6. https://docs.rancher.cn/docs/rancher2.5/installation/other-installation-methods/single-node-docker/advanced/_index

  7. https://docs.rancher.cn/docs/rancher2.5/installation/install-rancher-on-k8s/_index

  8. https://docs.rancher.cn/docs/rancher2/faq/install/_index/#error-httpsranchermyorgping-is-not-accessible-could-not-resolve-host-ranchermyorg

  9. https://github.com/shikanon/kubeflow-manifests/issues/20#issuecomment-843014942

  10. https://github.com/cockroachdb/cockroach/issues/28075#issuecomment-420497277

  11. Reason 1: some images are hosted on gcr.io and cannot be pulled; reason 2: some images are interim builds (tagged with a sha256 digest) and cannot be saved and re-imported; reason 3: the officially recommended channels are mostly deeply tied to specific cloud providers, and the remaining options all require unrestricted internet access.

  12. "In good shape" means that after running kubectl get pods -A, all pods are in the Running or Completed state.

  13. https://github.com/shikanon/kubeflow-manifests

  14. https://github.com/kubeflow/manifests#change-default-user-password

  15. https://github.com/kubeflow/manifests#nodeport--loadbalancer--ingress

  16. https://istio.io/latest/zh/docs/tasks/traffic-management/ingress/ingress-control/#determining-the-ingress-ip-and-ports

  17. https://knative.dev/docs/serving/using-a-tls-cert/#before-you-begin

  18. https://knative.dev/docs/serving/using-a-tls-cert/#manually-adding-a-tls-certificate

  19. https://docs.rancher.cn/docs/rancher2/cluster-admin/cleaning-cluster-nodes/_index

