Building a Kubeflow 1.3 Machine Learning Platform on Rancher Kubernetes 1.17.17

This guide assumes the machines have NVIDIA GPUs and that a sufficiently recent driver is already installed.

Install Docker

The installation follows [1]:

yum -y install yum-utils && \
yum-config-manager --add-repo=https://download.docker.com/linux/centos/docker-ce.repo && \
yum install -y https://download.docker.com/linux/centos/7/x86_64/stable/Packages/containerd.io-1.4.3-3.1.el7.x86_64.rpm && \
yum install docker-ce -y && \
systemctl --now enable docker

Install nvidia-docker2

The installation follows [2].

CentOS 7:

distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
&& curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo && \
yum clean expire-cache && \
yum install -y nvidia-docker2 && \
systemctl restart docker

Ubuntu:

distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
&& curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
&& curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update
sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker

Verify the installation:

docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi

To avoid specifying --gpus every time, edit the Docker configuration file /etc/docker/daemon.json:

{
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  },
  "exec-opts": ["native.cgroupdriver=systemd"],
  "default-runtime": "nvidia"
}
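
Restart Docker afterwards so the new default runtime takes effect. A quick sanity check (reusing the nvidia/cuda:11.0-base image from above) is that nvidia-smi now works even without --gpus:

# reload the new daemon.json
sudo systemctl restart docker
# should print the GPU table without --gpus, since nvidia is now the default runtime
docker run --rm nvidia/cuda:11.0-base nvidia-smi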

Install Kubernetes

Besides kubeadm, there are many other ways to install Kubernetes, such as microk8s and kind. The problem with the former is that all of its images are hosted on gcr.io (Google Container Registry), so they cannot be pulled without bypassing the GFW, and it provides no way to configure a mirror registry, which makes it rather inconvenient. The latter is Kubernetes IN Docker, a simulated pseudo-cluster; although easy to deploy, it is not under consideration either.

Using Rancher is arguably a better choice: it offers an easy-to-use UI and friendly interactions, it runs in Docker containers, and even if you wipe everything and start over, nothing unpredictable happens.

As for the Kubernetes version, Kubeflow 1.3 was tested against Kubernetes 1.17 [3], so to avoid unnecessary trouble we also pick Kubernetes 1.17.

Before going further, adjust a few system settings, for example:

# Disable SELinux
sudo setenforce 0
# Disable swap
sudo swapoff -a

In addition, the clocks on all machines, including the time zones, must be consistent.
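
One way to achieve that, assuming systemd-based nodes with an NTP client available (chronyd or systemd-timesyncd), is roughly:

# use the same time zone on every node (Asia/Shanghai is just an example)
sudo timedatectl set-timezone Asia/Shanghai
# enable NTP synchronization
sudo timedatectl set-ntp true
# verify time, time zone, and sync status
timedatectl status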

Terminology

The main terms used in this part are:

  • Rancher Server: used to manage and provision Kubernetes clusters. You interact with downstream Kubernetes clusters through the Rancher Server UI.
  • **RKE (Rancher Kubernetes Engine):** a certified Kubernetes distribution that ships with a CLI tool for creating and managing Kubernetes clusters. When you create a cluster in the Rancher UI, Rancher calls RKE to provision the Kubernetes cluster it launches.
  • kubectl: the Kubernetes command-line tool.

Install Rancher

Rancher must expose its service over HTTPS, so the best solution is to register a proper (sub)domain and obtain a certificate from a trusted CA. If the application process feels like too much hassle, you can generate a self-signed certificate instead (it causes some trouble later on, but all of it can be worked around). [4]

Generate a self-signed certificate

This process follows [5].

A one-shot script for generating a self-signed SSL certificate:

#!/bin/bash -e

help ()
{
echo ' ================================================================ '
echo ' --ssl-domain: the primary domain for the SSL certificate; defaults to www.rancher.local if not set; can be ignored if the server is accessed by IP;'
echo ' --ssl-trusted-ip: SSL certificates normally only trust requests by domain name; if the server must also be reachable by IP, list the IPs here as SANs, separated by commas;'
echo ' --ssl-trusted-domain: to allow access from additional domains, add them here (SSL_TRUSTED_DOMAIN), separated by commas;'
echo ' --ssl-size: SSL key size in bits, default 2048;'
echo ' --ssl-cn: country code (two-letter), default CN;'
echo ' Example:'
echo ' ./create_self-signed-cert.sh --ssl-domain=www.test.com --ssl-trusted-domain=www.test2.com \ '
echo ' --ssl-trusted-ip=1.1.1.1,2.2.2.2,3.3.3.3 --ssl-size=2048 --ssl-date=3650'
echo ' ================================================================'
}

case "$1" in
-h|--help) help; exit;;
esac

if [[ $1 == '' ]];then
help;
exit;
fi

CMDOPTS="$*"
for OPTS in $CMDOPTS;
do
key=$(echo ${OPTS} | awk -F"=" '{print $1}' )
value=$(echo ${OPTS} | awk -F"=" '{print $2}' )
case "$key" in
--ssl-domain) SSL_DOMAIN=$value ;;
--ssl-trusted-ip) SSL_TRUSTED_IP=$value ;;
--ssl-trusted-domain) SSL_TRUSTED_DOMAIN=$value ;;
--ssl-size) SSL_SIZE=$value ;;
--ssl-date) SSL_DATE=$value ;;
--ca-date) CA_DATE=$value ;;
--ssl-cn) CN=$value ;;
esac
done

# CA-related settings
CA_DATE=${CA_DATE:-3650}
CA_KEY=${CA_KEY:-cakey.pem}
CA_CERT=${CA_CERT:-cacerts.pem}
CA_DOMAIN=cattle-ca

# SSL-related settings
SSL_CONFIG=${SSL_CONFIG:-$PWD/openssl.cnf}
SSL_DOMAIN=${SSL_DOMAIN:-'www.rancher.local'}
SSL_DATE=${SSL_DATE:-3650}
SSL_SIZE=${SSL_SIZE:-2048}

## Country code (two-letter), default CN;
CN=${CN:-CN}

SSL_KEY=$SSL_DOMAIN.key
SSL_CSR=$SSL_DOMAIN.csr
SSL_CERT=$SSL_DOMAIN.crt

echo -e "\033[32m ---------------------------- \033[0m"
echo -e "\033[32m | 生成 SSL Cert | \033[0m"
echo -e "\033[32m ---------------------------- \033[0m"

if [[ -e ./${CA_KEY} ]]; then
echo -e "\033[32m ====> 1. 发现已存在CA私钥,备份"${CA_KEY}"为"${CA_KEY}"-bak,然后重新创建 \033[0m"
mv ${CA_KEY} "${CA_KEY}"-bak
openssl genrsa -out ${CA_KEY} ${SSL_SIZE}
else
echo -e "\033[32m ====> 1. 生成新的CA私钥 ${CA_KEY} \033[0m"
openssl genrsa -out ${CA_KEY} ${SSL_SIZE}
fi

if [[ -e ./${CA_CERT} ]]; then
echo -e "\033[32m ====> 2. 发现已存在CA证书,先备份"${CA_CERT}"为"${CA_CERT}"-bak,然后重新创建 \033[0m"
mv ${CA_CERT} "${CA_CERT}"-bak
openssl req -x509 -sha256 -new -nodes -key ${CA_KEY} -days ${CA_DATE} -out ${CA_CERT} -subj "/C=${CN}/CN=${CA_DOMAIN}"
else
echo -e "\033[32m ====> 2. 生成新的CA证书 ${CA_CERT} \033[0m"
openssl req -x509 -sha256 -new -nodes -key ${CA_KEY} -days ${CA_DATE} -out ${CA_CERT} -subj "/C=${CN}/CN=${CA_DOMAIN}"
fi

echo -e "\033[32m ====> 3. 生成Openssl配置文件 ${SSL_CONFIG} \033[0m"
cat > ${SSL_CONFIG} <<EOM
[req]
req_extensions = v3_req
distinguished_name = req_distinguished_name
[req_distinguished_name]
[ v3_req ]
basicConstraints = CA:FALSE
keyUsage = nonRepudiation, digitalSignature, keyEncipherment
extendedKeyUsage = clientAuth, serverAuth
EOM

if [[ -n ${SSL_TRUSTED_IP} || -n ${SSL_TRUSTED_DOMAIN} ]]; then
cat >> ${SSL_CONFIG} <<EOM
subjectAltName = @alt_names
[alt_names]
EOM
IFS=","
dns=(${SSL_TRUSTED_DOMAIN})
dns+=(${SSL_DOMAIN})
for i in "${!dns[@]}"; do
echo DNS.$((i+1)) = ${dns[$i]} >> ${SSL_CONFIG}
done

if [[ -n ${SSL_TRUSTED_IP} ]]; then
ip=(${SSL_TRUSTED_IP})
for i in "${!ip[@]}"; do
echo IP.$((i+1)) = ${ip[$i]} >> ${SSL_CONFIG}
done
fi
fi

echo -e "\033[32m ====> 4. 生成服务SSL KEY ${SSL_KEY} \033[0m"
openssl genrsa -out ${SSL_KEY} ${SSL_SIZE}

echo -e "\033[32m ====> 5. 生成服务SSL CSR ${SSL_CSR} \033[0m"
openssl req -sha256 -new -key ${SSL_KEY} -out ${SSL_CSR} -subj "/C=${CN}/CN=${SSL_DOMAIN}" -config ${SSL_CONFIG}

echo -e "\033[32m ====> 6. 生成服务SSL CERT ${SSL_CERT} \033[0m"
openssl x509 -sha256 -req -in ${SSL_CSR} -CA ${CA_CERT} \
-CAkey ${CA_KEY} -CAcreateserial -out ${SSL_CERT} \
-days ${SSL_DATE} -extensions v3_req \
-extfile ${SSL_CONFIG}

echo -e "\033[32m ====> 7. 证书制作完成 \033[0m"
echo
echo -e "\033[32m ====> 8. 以YAML格式输出结果 \033[0m"
echo "----------------------------------------------------------"
echo "ca_key: |"
cat $CA_KEY | sed 's/^/ /'
echo
echo "ca_cert: |"
cat $CA_CERT | sed 's/^/ /'
echo
echo "ssl_key: |"
cat $SSL_KEY | sed 's/^/ /'
echo
echo "ssl_csr: |"
cat $SSL_CSR | sed 's/^/ /'
echo
echo "ssl_cert: |"
cat $SSL_CERT | sed 's/^/ /'
echo

echo -e "\033[32m ====> 9. 附加CA证书到Cert文件 \033[0m"
cat ${CA_CERT} >> ${SSL_CERT}
echo "ssl_cert: |"
cat $SSL_CERT | sed 's/^/ /'
echo

echo -e "\033[32m ====> 10. 重命名服务证书 \033[0m"
echo "cp ${SSL_DOMAIN}.key tls.key"
cp ${SSL_DOMAIN}.key tls.key
echo "cp ${SSL_DOMAIN}.crt tls.crt"
cp ${SSL_DOMAIN}.crt tls.crt

Copy the code above and save it as create_self-signed-cert.sh (or any other name you like).

Script parameters:

--ssl-domain: the primary domain for the SSL certificate; defaults to www.rancher.local if not set; can be ignored if the server is accessed by IP;
--ssl-trusted-ip: SSL certificates normally only trust requests by domain name; if the server must also be reachable by IP, list the IPs here as SANs, separated by commas;
--ssl-trusted-domain: to allow access from additional domains, add them here (TRUSTED_DOMAIN), separated by commas;
--ssl-size: SSL key size in bits, default 2048;
--ssl-cn: country code (two-letter), default CN;
Example:
./create_self-signed-cert.sh --ssl-domain=www.test.com --ssl-trusted-domain=www.test2.com \
--ssl-trusted-ip=1.1.1.1,2.2.2.2,3.3.3.3 --ssl-size=2048 --ssl-date=3650

For example:

mkdir sslcert
cd sslcert
chmod +x create_self-signed-cert.sh
./create_self-signed-cert.sh --ssl-domain=rancher.xxx.cn
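
If the script runs through, the current directory contains the CA pair and the server certificate; the three files mounted by the Rancher container in the next step are cacerts.pem, tls.crt, and tls.key:

# still inside the sslcert directory
ls cacerts.pem tls.crt tls.key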

Install Rancher

Here we use the single-node installation [6]: Rancher itself runs from a Docker image, and Kubernetes is then built through Rancher.

We do not take the opposite route of installing Rancher on top of an existing Kubernetes cluster, i.e. the high-availability installation [7].

For detailed explanations of the parameters below, see [6:1].

docker run -d --privileged --restart=unless-stopped \
-p 80:80 -p 443:443 \
-v /path/to/sslcert/tls.crt:/etc/rancher/ssl/cert.pem \
-v /path/to/sslcert/tls.key:/etc/rancher/ssl/key.pem \
-v /path/to/sslcert/cacerts.pem:/etc/rancher/ssl/cacerts.pem \
-v /path/to/sslcert:/container/certs \
-v /path/to/rancher:/var/lib/rancher \
-e SSL_CERT_DIR="/container/certs" \
-v /data/var/log/rancher/auditlog:/var/log/auditlog \
-e AUDIT_LEVEL=1 \
rancher/rancher:v2.5.8

Configure the service

Once that is done, wait a moment; barring surprises, ports 80 and 443 will be reachable.

Kubernetes cluster

Open Rancher in a browser to reach the password-setup page.

If you forget the password, you can reset it with:

$ docker exec <container_id> reset-password

After setting the initial password, enter the management UI and add a custom cluster.

Custom cluster

Set the cluster name and pick the Kubernetes version; for the network provider, Flannel can be selected (others might work too, but were not tested); leave everything else at its defaults.

Then you reach the cluster options page: follow the instructions and run the generated docker command on the other machines to add them as nodes.

Each host can run multiple roles. Every cluster needs at least one etcd role, one Control Plane role, and one Worker role.

A best practice is to place the etcd and Control Plane roles alone on an otherwise idle machine.

Add nodes

After a while, barring surprises, click Nodes and you will see the nodes you added.

Set up kubectl

To avoid kubectl complaining with:

Unable to connect to the server: x509: certificate relies on legacy Common Name field, use SANs or temporarily enable Common Name matching with GODEBUG=x509ignoreCN=0

This error comes from the Go x509 library that kubectl is built with.

Set an environment variable (add it to ~/.bashrc or similar):

$ export GODEBUG=x509ignoreCN=0

Download the kubectl binary:

Download source 1: http://mirror.cnrancher.com/

Download source 2:

# Download
curl -LO https://dl.k8s.io/release/v1.17.17/bin/linux/amd64/kubectl
# Make it executable
chmod +x kubectl

The version number can be adjusted as needed.

Symlink it into a directory on your $PATH, for example:

sudo ln -s $(pwd)/kubectl /usr/bin/kubectl

On the cluster page, click Kubeconfig File.

On the main node, create the kubeconfig directory and file:

mkdir ~/.kube
vim ~/.kube/config

and fill the file with the content shown in the browser window; this is how kubectl knows how to reach the cluster.
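
A quick check that kubectl is wired up correctly:

kubectl version --client
kubectl get nodes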

Set up GPUs

Save the following as nvidia-device-plugin.yml.

Note: if you use the yml file provided officially by NVIDIA as-is, the master node may end up without this pod (the manifest below additionally tolerates the master taint).

# Copyright (c) 2019, NVIDIA CORPORATION.  All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      # This annotation is deprecated. Kept here for backward compatibility
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      # This toleration is deprecated. Kept here for backward compatibility
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      - key: CriticalAddonsOnly
        operator: Exists
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      - key: node-role.kubernetes.io/master
        effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      containers:
      - image: nvcr.io/nvidia/k8s-device-plugin:v0.9.0
        name: nvidia-device-plugin-ctr
        args: ["--fail-on-init-error=false"]
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins

Then run

$ kubectl apply -f nvidia-device-plugin.yml

You can track the pods' creation status with the following command or in the Rancher UI:

$ kubectl get pods -n kube-system | grep nvidia
nvidia-device-plugin-daemonset-74kv8 1/1 Running 0 2d4h
nvidia-device-plugin-daemonset-75845 1/1 Running 0 2d4h
nvidia-device-plugin-daemonset-8nlsp 1/1 Running 0 2d4h
nvidia-device-plugin-daemonset-rnq8w 1/1 Running 0 2d4h

One pod is created for each machine in the cluster.
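
To confirm the GPUs are actually advertised to the scheduler, check that nvidia.com/gpu appears in each GPU node's capacity and allocatable resources:

kubectl describe nodes | grep -i "nvidia.com/gpu"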

Set up storage

The simplest option is to set up local storage:

$ kubectl apply -f https://raw.githubusercontent.com/rancher/local-path-provisioner/master/deploy/local-path-storage.yaml

By default, data is stored under /opt/local-path-provisioner. To change that, clone the project as described at https://github.com/rancher/local-path-provisioner:

$ git clone https://github.com/rancher/local-path-provisioner.git --depth 1

Edit deploy/local-path-storage.yaml in the clone, then run

$ kubectl apply -f deploy/local-path-storage.yaml

You can check the status with the following command or in the Rancher UI:

$ kubectl -n local-path-storage get pod

Once it is up, set the local storage as the default StorageClass in the Rancher UI.

Alternatively, set the default with kubectl; see "Change the default StorageClass" in the Kubernetes documentation.
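
A minimal sketch of doing that with kubectl, assuming the class created above kept the provisioner's default name local-path:

kubectl patch storageclass local-path -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
# the default class is marked with "(default)"
kubectl get storageclass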

Set up domain-to-IP mapping

The downside of using a self-signed certificate for a made-up domain is that containers cannot resolve the custom domain simply by editing hosts.

You may run into errors like:

ERROR: https://rancher.my.org/ping is not accessible (Could not resolve host: rancher.my.org)

To fix this, you can run a DNS server in the environment with the correct domain-to-IP records and point each node's nameserver at it.

Or use HostAliases to patch the key workloads (cattle-cluster-agent and cattle-node-agent) [8]:

kubectl -n cattle-system patch deployments cattle-cluster-agent --patch '{
  "spec": {
    "template": {
      "spec": {
        "hostAliases": [
          {
            "hostnames": [
              "rancher.xxx.cn"
            ],
            "ip": "10.1***3.17"
          }
        ]
      }
    }
  }
}'

kubectl -n cattle-system patch daemonsets cattle-node-agent --patch '{
  "spec": {
    "template": {
      "spec": {
        "hostAliases": [
          {
            "hostnames": [
              "rancher.xxx.cn"
            ],
            "ip": "10.1***3.17"
          }
        ]
      }
    }
  }
}'

Afterwards, you can track the status and progress with the following command or in the Rancher UI:

$ kubectl get pods -n cattle-system
NAME READY STATUS RESTARTS AGE
cattle-cluster-agent-84f4d9f7cc-xkcrq 1/1 Running 0 3h58m
cattle-node-agent-fdc5z 1/1 Running 0 4h41m
cattle-node-agent-jlpnl 1/1 Running 0 4h40m
kube-api-auth-xww7h 1/1 Running 0 2d

Set up Istio

Click the Default project.

Go to Resources -> Istio, keep the defaults, and enable it.

You can track the status and progress with the following command or in the Rancher UI:

$ kubectl get pods -n istio-system
NAME READY STATUS RESTARTS AGE
authservice-0 1/1 Running 0 4h32m
cluster-local-gateway-66bcf8bc5d-rltpj 1/1 Running 0 4h31m
istio-citadel-66864ff6b8-znrjw 1/1 Running 0 3h12m
istio-galley-5bd9bf8b9c-8b9x6 1/1 Running 0 3h12m
istio-ingressgateway-85b49c758f-4khs7 1/1 Running 0 4h31m
istio-pilot-674bdcbbf9-8dpc8 2/2 Running 1 3h12m
istio-policy-6d9f4577db-mhxnz 2/2 Running 1 3h12m
istio-security-post-install-1.5.9-jfbkk 0/1 Completed 4 3h12m
istio-sidecar-injector-9bcfb645-vm54x 1/1 Running 0 3h12m
istio-telemetry-664b6dfd44-bhr2c 2/2 Running 6 3h12m
istio-tracing-cc6c8c677-7mrnl 1/1 Running 0 3h12m
istiod-5ff6cdbbcd-4vnhf 1/1 Running 0 4h31m
kiali-79c4c46468-bpl7l 1/1 Running 0 3h12m


Set kube-controller extra args

Rancher does not ship with a default certificate signer, so after installing Kubeflow directly, the cache-server pod runs into

Unable to attach or mount volumes: unmounted volumes=[webhook-tls-certs], unattached volumes=[istio-data istio-envoy istio-podinfo kubeflow-pipelines-cache-token-7pwl7 webhook-tls-certs istiod-ca-cert]: timed out waiting for the condition

errors [9]. The reason is that cert-manager's Issuer cannot get its certificates signed, so two parameters need to be added before installation, as follows:

On the Global page, click Upgrade for the cluster.

Choose Edit as YAML.

Under the kube-controller field, add the following three lines [10]:

extra_args:
  cluster-signing-cert-file: "/etc/kubernetes/ssl/kube-ca.pem"
  cluster-signing-key-file: "/etc/kubernetes/ssl/kube-ca-key.pem"
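
For orientation, the surrounding structure of the cluster YAML typically looks like the sketch below; in Rancher's edit-as-YAML view this usually sits under rancher_kubernetes_engine_config, so treat the exact nesting as an assumption and match it to your own file:

services:
  kube-controller:
    extra_args:
      cluster-signing-cert-file: "/etc/kubernetes/ssl/kube-ca.pem"
      cluster-signing-key-file: "/etc/kubernetes/ssl/kube-ca-key.pem"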

Install Kubeflow

Kubeflow's installation requirements are fairly demanding, and most of the steps above were paving the way for it. The officially documented approaches assume unrestricted internet access; for various reasons [11] they are hard to follow from inside China, so after some exploration the following route was chosen.

https://github.com/shikanon/kubeflow-manifests

This is a long-term-maintained set of Kubeflow installation manifests with images mirrored for mainland China. Ignore what its README.md says; as long as the steps above went fine [12], clone it,

git clone https://github.com/shikanon/kubeflow-manifests.git --depth 1

and simply run python install.py.

If the installation output shows no errors, you can monitor the creation status and progress of the remaining pods with:

$ kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
auth dex-6686f66f9b-sbbxf 1/1 Running 0 4h55m
cattle-prometheus exporter-kube-state-cluster-monitoring-5dd6d5c9fd-7rwq2 1/1 Running 0 3h35m
cattle-prometheus exporter-node-cluster-monitoring-7wvw2 1/1 Running 0 3h35m
cattle-prometheus exporter-node-cluster-monitoring-vhrk5 1/1 Running 0 3h35m
cattle-prometheus grafana-cluster-monitoring-75c5cd5995-ssr2d 2/2 Running 0 3h35m
cattle-prometheus prometheus-cluster-monitoring-0 5/5 Running 1 3h31m
cattle-prometheus prometheus-operator-monitoring-operator-f9b9567b-p6h4l 1/1 Running 0 3h35m
cattle-system cattle-cluster-agent-84f4d9f7cc-xkcrq 1/1 Running 0 4h14m
cattle-system cattle-node-agent-fdc5z 1/1 Running 0 4h57m
cattle-system cattle-node-agent-jlpnl 1/1 Running 0 4h56m
cattle-system kube-api-auth-xww7h 1/1 Running 0 2d1h
cert-manager cert-manager-9d5774b59-8hp2f 1/1 Running 0 4h55m
cert-manager cert-manager-cainjector-67c8c5c665-896r8 1/1 Running 0 4h55m
cert-manager cert-manager-webhook-75dc9757bd-zcxjp 1/1 Running 0 4h55m
ingress-nginx nginx-ingress-controller-9jzpx 1/1 Running 0 2d1h
ingress-nginx nginx-ingress-controller-bscxv 1/1 Running 0 2d1h
istio-system authservice-0 1/1 Running 0 4h55m
istio-system cluster-local-gateway-66bcf8bc5d-rltpj 1/1 Running 0 4h54m
istio-system istio-citadel-66864ff6b8-znrjw 1/1 Running 0 3h35m
istio-system istio-galley-5bd9bf8b9c-8b9x6 1/1 Running 0 3h35m
istio-system istio-ingressgateway-85b49c758f-4khs7 1/1 Running 0 4h53m
istio-system istio-pilot-674bdcbbf9-8dpc8 2/2 Running 1 3h35m
istio-system istio-policy-6d9f4577db-mhxnz 2/2 Running 1 3h35m
istio-system istio-security-post-install-1.5.9-jfbkk 0/1 Completed 4 3h35m
istio-system istio-sidecar-injector-9bcfb645-vm54x 1/1 Running 0 3h35m
istio-system istio-telemetry-664b6dfd44-bhr2c 2/2 Running 6 3h35m
istio-system istio-tracing-cc6c8c677-7mrnl 1/1 Running 0 3h35m
istio-system istiod-5ff6cdbbcd-4vnhf 1/1 Running 0 4h53m
istio-system kiali-79c4c46468-bpl7l 1/1 Running 0 3h35m
knative-eventing broker-controller-5c84984b97-4shtv 1/1 Running 0 4h55m
knative-eventing eventing-controller-54bfbd5446-4pknn 1/1 Running 0 4h55m
knative-eventing eventing-webhook-58f56d9cf4-2mrdn 1/1 Running 0 4h55m
knative-eventing imc-controller-769896c7db-8gzmc 1/1 Running 0 4h55m
knative-eventing imc-dispatcher-86954fb4cd-l9l98 1/1 Running 0 4h55m
knative-serving activator-75696c8c9-786pn 1/1 Running 0 4h55m
knative-serving autoscaler-6764f9b5c5-q9pd9 1/1 Running 0 4h55m
knative-serving controller-598fd8bfd7-8ng4j 1/1 Running 0 4h55m
knative-serving istio-webhook-785bb58cc6-xlnnk 1/1 Running 0 4h55m
knative-serving networking-istio-77fbcfcf9b-p7wr7 1/1 Running 0 4h55m
knative-serving webhook-865f54cf5f-rn7qq 1/1 Running 0 4h55m
kube-system coredns-6b84d75d99-2f5p4 1/1 Running 0 2d1h
kube-system coredns-6b84d75d99-8j8zs 1/1 Running 0 48m
kube-system coredns-autoscaler-5c4b6999d9-qq87l 1/1 Running 0 48m
kube-system kube-flannel-kjmlb 2/2 Running 0 2d1h
kube-system kube-flannel-vnhm7 2/2 Running 0 2d1h
kube-system metrics-server-7579449c57-2jqld 1/1 Running 0 2d1h
kube-system rke-coredns-addon-deploy-job-drwtd 0/1 Completed 0 49m
kubeflow-user-example-com ml-pipeline-ui-artifact-6d7ffcc4b6-dzfsk 2/2 Running 0 4h52m
kubeflow-user-example-com ml-pipeline-visualizationserver-84d577b989-7bhbm 2/2 Running 0 4h52m
kubeflow admission-webhook-deployment-54cf94d964-8qsh2 1/1 Running 0 4h51m
kubeflow cache-deployer-deployment-65cd55d4d9-d6dzd 2/2 Running 11 4h50m
kubeflow cache-server-f85c69486-rgzq6 2/2 Running 6 4h50m
kubeflow centraldashboard-7b7676d8bd-w5jw6 1/1 Running 0 4h53m
kubeflow jupyter-web-app-deployment-66f74586d9-pnv2r 1/1 Running 0 169m
kubeflow katib-controller-5467f8fdc8-rcc78 1/1 Running 0 4h50m
kubeflow katib-db-manager-646695754f-v2x82 1/1 Running 0 4h53m
kubeflow katib-mysql-5bb5bd9957-7zzct 1/1 Running 0 4h53m
kubeflow katib-ui-55fd4bd6f9-mnqg7 1/1 Running 0 4h53m
kubeflow kfserving-controller-manager-0 2/2 Running 0 4h53m
kubeflow kubeflow-pipelines-profile-controller-5698bf57cf-9t99q 1/1 Running 0 169m
kubeflow kubeflow-pipelines-profile-controller-5698bf57cf-wmbwc 1/1 Running 0 4h53m
kubeflow metacontroller-0 1/1 Running 0 4h53m
kubeflow metadata-envoy-deployment-76d65977f7-5bjkk 1/1 Running 0 4h53m
kubeflow metadata-grpc-deployment-697d9c6c67-5xbq6 2/2 Running 1 4h53m
kubeflow metadata-writer-58cdd57678-ns6kq 2/2 Running 0 4h53m
kubeflow minio-6d6784db95-82wtz 2/2 Running 0 169m
kubeflow ml-pipeline-85fc99f899-mn65n 2/2 Running 3 4h53m
kubeflow ml-pipeline-persistenceagent-65cb9594c7-gt8bw 2/2 Running 0 4h53m
kubeflow ml-pipeline-scheduledworkflow-7f8d8dfc69-spq9b 2/2 Running 0 4h53m
kubeflow ml-pipeline-ui-5c765cc7bd-hks2f 2/2 Running 0 4h53m
kubeflow ml-pipeline-viewer-crd-5b8df7f458-x62wv 2/2 Running 1 4h53m
kubeflow ml-pipeline-visualizationserver-56c5ff68d5-ndltc 2/2 Running 0 4h53m
kubeflow mpi-operator-789f88879-l5d7l 1/1 Running 0 4h53m
kubeflow mxnet-operator-7fff864957-5gcv2 1/1 Running 0 4h53m
kubeflow mysql-56b554ff66-k2qsl 2/2 Running 0 168m
kubeflow notebook-controller-deployment-74d9584477-jvvcc 1/1 Running 0 4h53m
kubeflow profiles-deployment-67b4666796-zjrpw 2/2 Running 0 4h53m
kubeflow pytorch-operator-fd86f7694-tgh8l 2/2 Running 0 4h53m
kubeflow tensorboard-controller-controller-manager-fd6bcffb4-lkz2g 3/3 Running 1 4h53m
kubeflow tensorboards-web-app-deployment-78d7b8b658-qg8kc 1/1 Running 0 4h53m
kubeflow tf-job-operator-7bc5cf4cc7-689d9 1/1 Running 0 4h53m
kubeflow volumes-web-app-deployment-68fcfc9775-mw7gz 1/1 Running 0 4h53m
kubeflow workflow-controller-5449754fb4-pc79x 2/2 Running 1 169m
kubeflow xgboost-operator-deployment-5c7bfd57cc-54hjq 2/2 Running 1 4h53m
local-path-storage local-path-provisioner-5bd6f65fdf-j575f 1/1 Running 0 2d

If all pods reach the Running or Completed state, the deployment succeeded. If some pods stay stuck in a creating state for a long time, you can re-run python install.py.

To remove the deployment, run

$ kubectl delete -f manifest1.3

Problems you may run into

  1. The pods may not be visible in the Rancher UI.

    That is because their namespaces have not been assigned to any project. Click the cluster name, then Namespaces, and you will see them; move them into the Default project (or a new project), and the namespaces and pod details will then show up under that project.

  2. The default username/password does not work / how to add new users and namespaces

    Edit the patch/auth.yaml file; in the section

    staticPasswords:
    - email: "admin@example.com"
      # hash string is "password"
      hash: "$2y$12$X.oNHMsIfRSq35eRfiTYV.dPIYlWyPDRRc1.JVp0f3c.YqqJNW4uK"
      username: "admin"
      userID: "08a8684b-db88-4b73-90a9-3cd1661f5466"
    - email: myname@abc.cn
      hash: $2b$10$.zSuIlx1bl9PCyigEtebhuWG/PAhZlZoyokPdGObiE7jRUHUcQ0qW
      username: myname
      userID: 08a8684b-db88-4b73-90a9-3cd1661f5466

    you can add more users. Passwords are stored as bcrypt hashes, which you can generate with the tool at https://passwordhashing.com/BCrypt.

    Note: every user corresponds to a namespace, so when adding a new user you also need to add a matching namespace, a few lines further down in the same file:

    ---
    apiVersion: kubeflow.org/v1beta1
    kind: Profile
    metadata:
      name: kubeflow-user-example-com
    spec:
      owner:
        kind: User
        name: admin@example.com

    Finally, re-apply and restart:

    # Apply
    kubectl apply -f patch/auth.yaml
    # Restart
    kubectl rollout restart deployment dex -n auth

    Or edit the ConfigMap in place with the following command; the change is applied automatically after saving.

    kubectl edit configmap dex -n auth

Expose the Kubeflow service

Once the Kubeflow services are up, you can run

$ kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80

to temporarily forward port 80 of the in-cluster gateway to port 8080 on this machine and access it over plain HTTP via localhost.

The password can be changed by editing the patch/auth.yaml file; the default username is admin@example.com and the password is password [13].

A password hash can be generated with [14]:

python3 -c 'from passlib.hash import bcrypt; import getpass; print(bcrypt.using(rounds=12, ident="2y").hash(getpass.getpass()))'

or with the online tool at https://passwordhashing.com/BCrypt.

If the service is exposed beyond localhost via NodePort / LoadBalancer / Ingress, HTTPS is mandatory [15]; otherwise the main page opens, but notebooks, SSH connections, and so on all fail.

A simple, workable option is NodePort.

Find the SSL port

Use the following command [16]:

$ kubectl -n istio-system get service istio-ingressgateway
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
istio-ingressgateway NodePort 10.43.121.108 <none> 15021:31066/TCP,80:30000/TCP,443:32692/TCP,31400:32293/TCP,15443:31428/TCP 45h

You can see that port 443 is mapped to 32692 on the host, but nothing is served on 443 yet; the next few steps generate and enable a certificate to bring port 443 up.

Generate a self-signed certificate

Use the script from earlier to generate a new SSL certificate for Kubeflow:

./create_self-signed-cert.sh --ssl-domain=kub**a.cn

A CA-signed certificate can also be used directly, if you have one.

Use cert-manager to manage the certificate

kubectl create --namespace istio-system secret tls kf-tls-cert --key /data/gang/kfcerts/kub***a.cn.key --cert /data/gang/kfcerts/kub***a.cn.crt

Configure the Knative cluster to use the custom domain

kubectl edit cm config-domain --namespace knative-serving

Under data, add a key named after the domain [17]; the value can be left empty unless you have special requirements, e.g. kub***a.cn: "".
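
After saving, the data section of the ConfigMap should look roughly like this (shown with the same masked domain as above; the ConfigMap may also contain other example keys):

apiVersion: v1
kind: ConfigMap
metadata:
  name: config-domain
  namespace: knative-serving
data:
  kub***a.cn: ""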

Configure Kubeflow to use this certificate

Edit the manifest1.3/016-istio-1-9-0-kubeflow-istio-resources-base.yaml file:

vim manifest1.3/016-istio-1-9-0-kubeflow-istio-resources-base.yaml

In the kubeflow-gateway definition at the end of the file, add [18]:

- hosts:
  - '*'
  port:
    name: https
    number: 443
    protocol: HTTPS
  tls:
    mode: SIMPLE
    credentialName: kf-tls-cert

hosts can be set to the exact domain bound to the certificate you just generated (accepting only that domain), or to * to accept requests under any host name.

Apply and restart

kubectl apply -f manifest1.3/016-istio-1-9-0-kubeflow-istio-resources-base.yaml
kubectl rollout restart deploy istio-ingressgateway -n istio-system

You can then reach the Kubeflow service securely from another machine with

curl https://10.1***2:32692 -k

Because the certificate is self-signed, the -k flag is used to skip certificate verification.

Result

Wipe and start over

Use the following script with caution; it deletes leftover files, services, network configuration, containers, and so on [19]:

#!/bin/bash

KUBE_SVC='
kubelet
kube-scheduler
kube-proxy
kube-controller-manager
kube-apiserver
'

for kube_svc in ${KUBE_SVC};
do
# Stop the service
if [[ `systemctl is-active ${kube_svc}` == 'active' ]]; then
systemctl stop ${kube_svc}
fi
# Disable the service at boot
if [[ `systemctl is-enabled ${kube_svc}` == 'enabled' ]]; then
systemctl disable ${kube_svc}
fi
done

# Stop all containers
docker stop $(docker ps -aq)

# Remove all containers
docker rm -f $(docker ps -qa)

# Remove all container volumes
docker volume rm $(docker volume ls -q)

# Unmount kubelet/rancher mounts
for mount in $(mount | grep tmpfs | grep '/var/lib/kubelet' | awk '{ print $3 }') /var/lib/kubelet /var/lib/rancher;
do
umount $mount;
done

# Back up directories
mv /etc/kubernetes /etc/kubernetes-bak-$(date +"%Y%m%d%H%M")
mv /var/lib/etcd /var/lib/etcd-bak-$(date +"%Y%m%d%H%M")
mv /var/lib/rancher /var/lib/rancher-bak-$(date +"%Y%m%d%H%M")
mv /opt/rke /opt/rke-bak-$(date +"%Y%m%d%H%M")
rm -rf ~/.kube/
rm -rf /etc/kubernetes/
rm -rf /etc/systemd/system/kubelet.service.d
rm -rf /etc/systemd/system/kubelet.service
rm -rf /usr/bin/kube*
rm -rf /etc/cni
rm -rf /opt/cni
rm -rf /var/lib/etcd
rm -rf /var/etcd

# Remove leftover paths
rm -rf /etc/ceph \
/etc/cni \
/opt/cni \
/run/secrets/kubernetes.io \
/run/calico \
/run/flannel \
/var/lib/calico \
/var/lib/cni \
/var/lib/kubelet \
/var/log/containers \
/var/log/kube-audit \
/var/log/pods \
/var/run/calico \
/usr/libexec/kubernetes

# Clean up network interfaces
no_del_net_inter='
lo
docker0
eth
ens
bond
'

network_interface=`ls /sys/class/net`

for net_inter in $network_interface;
do
if ! echo "${no_del_net_inter}" | grep -qE ${net_inter:0:3}; then
ip link delete $net_inter
fi
done

# Kill leftover processes
port_list='
80
443
6443
2376
2379
2380
8472
9099
10250
10254
'

for port in $port_list;
do
pid=`netstat -atlnup | grep $port | awk '{print $7}' | awk -F '/' '{print $1}' | grep -v - | sort -rnk2 | uniq`
if [[ -n $pid ]]; then
kill -9 $pid
fi
done

kube_pid=`ps -ef | grep -v grep | grep kube | awk '{print $2}'`

if [[ -n $kube_pid ]]; then
kill -9 $kube_pid
fi

# Flush iptables
## Note: if the node has custom iptables rules, run the following commands with care
sudo iptables --flush
sudo iptables --flush --table nat
sudo iptables --flush --table filter
sudo iptables --table nat --delete-chain
sudo iptables --table filter --delete-chain
systemctl restart docker

In addition, the volume directories specified when the Rancher container was created can be removed as appropriate:

rm -rf /data/var/log/rancher/auditlog
rm -rf /path/to/rancher

Custom images for the Kubeflow Jupyter server

A custom image must meet a few requirements for Kubeflow to use it correctly; see [20].

TensorFlow

The mapping between TensorFlow and CUDA versions is as follows [21]:

| Version | Python version | cuDNN | CUDA |
| --- | --- | --- | --- |
| tensorflow-2.6.0 | 3.6-3.9 | 8.1 | 11.2 |
| tensorflow-2.5.0 | 3.6-3.9 | 8.1 | 11.2 |
| tensorflow-2.4.0 | 3.6-3.8 | 8.0 | 11.0 |
| tensorflow-2.3.0 | 3.5-3.8 | 7.6 | 10.1 |
| tensorflow-2.2.0 | 3.5-3.8 | 7.6 | 10.1 |
| tensorflow-2.1.0 | 2.7, 3.5-3.7 | 7.6 | 10.1 |
| tensorflow-2.0.0 | 2.7, 3.3-3.7 | 7.4 | 10.0 |
| tensorflow_gpu-1.15.0 | 2.7, 3.3-3.7 | 7.4 | 10.0 |
| tensorflow_gpu-1.14.0 | 2.7, 3.3-3.7 | 7.4 | 10.0 |
| tensorflow_gpu-1.13.1 | 2.7, 3.3-3.7 | 7.4 | 10.0 |
| tensorflow_gpu-1.12.0 | 2.7, 3.3-3.6 | 7 | 9 |
| tensorflow_gpu-1.11.0 | 2.7, 3.3-3.6 | 7 | 9 |
| tensorflow_gpu-1.10.0 | 2.7, 3.3-3.6 | 7 | 9 |
| tensorflow_gpu-1.9.0 | 2.7, 3.3-3.6 | 7 | 9 |
| tensorflow_gpu-1.8.0 | 2.7, 3.3-3.6 | 7 | 9 |
| tensorflow_gpu-1.7.0 | 2.7, 3.3-3.6 | 7 | 9 |
| tensorflow_gpu-1.6.0 | 2.7, 3.3-3.6 | 7 | 9 |
| tensorflow_gpu-1.5.0 | 2.7, 3.3-3.6 | 7 | 9 |
| tensorflow_gpu-1.4.0 | 2.7, 3.3-3.6 | 6 | 8 |
| tensorflow_gpu-1.3.0 | 2.7, 3.3-3.6 | 6 | 8 |
| tensorflow_gpu-1.2.0 | 2.7, 3.3-3.6 | 5.1 | 8 |
| tensorflow_gpu-1.1.0 | 2.7, 3.3-3.6 | 5.1 | 8 |
| tensorflow_gpu-1.0.0 | 2.7, 3.3-3.6 | 5.1 | 8 |

Create a Dockerfile:

The content below can be changed freely to fit your needs, or you can exec into the running container later and add more tooling by hand.

FROM nvidia/cuda:10.0-cudnn7-runtime-ubuntu18.04 as base

# Pick up some TF dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
build-essential \
curl \
unzip


# See http://bugs.python.org/issue19846
ENV LANG C.UTF-8

RUN apt-get update && apt-get install -y \
python3 \
python3-pip

RUN python3 -m pip --no-cache-dir install --upgrade \
"pip<20.3" \
setuptools

# Some TF tools expect a "python" binary
RUN ln -s $(which python3) /usr/local/bin/python

# Options:
# tensorflow
# tensorflow-gpu
# tf-nightly
# tf-nightly-gpu
# Set --build-arg TF_PACKAGE_VERSION=1.11.0rc0 to install a specific version.
# Installs the latest version by default.
ARG TF_PACKAGE=tensorflow-gpu
ARG TF_PACKAGE_VERSION=1.15.5
RUN python3 -m pip install --no-cache-dir ${TF_PACKAGE}${TF_PACKAGE_VERSION:+==${TF_PACKAGE_VERSION}}

RUN python3 -m pip install --no-cache-dir jupyter matplotlib
# Pin ipykernel and nbformat; see https://github.com/ipython/ipykernel/issues/422
RUN python3 -m pip install --no-cache-dir jupyter_http_over_ws ipykernel==5.1.1 nbformat==4.4.0
RUN jupyter serverextension enable --py jupyter_http_over_ws

RUN mkdir -p /home/jovyan && chmod -R a+rwx /home/jovyan
RUN mkdir /.local && chmod a+rwx /.local
RUN apt-get update && apt-get install -y --no-install-recommends wget git
RUN apt-get autoremove -y

WORKDIR /home/jovyan
EXPOSE 8888


RUN python3 -m ipykernel.kernelspec

ENV NB_PREFIX /

CMD ["bash", "-c", "source /etc/bash.bashrc && jupyter notebook --notebook-dir=/home/jovyan --ip 0.0.0.0 --no-browser --allow-root --port=8888 --NotebookApp.token='' --NotebookApp.password='' --NotebookApp.allow_origin='*' --NotebookApp.base_url=${NB_PREFIX}"]

Build it into an image and push it to Docker Hub. Then, when creating a Jupyter server in Kubeflow, select Custom image and fill in your image; the server should be created successfully, and opening it you can confirm the GPU is available.
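
A minimal build-and-push sketch (the repository name yourname/kf-tf and the tag are placeholders; use your own Docker Hub account and naming):

docker build -t yourname/kf-tf:1.15.5-gpu-jupyter-cuda100 .
docker push yourname/kf-tf:1.15.5-gpu-jupyter-cuda100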

![](基于 Rancher Kubernetes 1.17.17 搭建 Kubeflow 1.3 机器学习平台.assets/1622085202812.png)

PyTorch

The mapping between PyTorch and CUDA versions is as follows [22][23]:

| CUDA Toolkit version | Available PyTorch versions |
| --- | --- |
| 7.5 | 0.4.1, 0.3.0, 0.2.0, 0.1.12-0.1.6 |
| 8.0 | 1.1.0, 1.0.0, 0.4.1 |
| 9.0 | 1.1.0, 1.0.1, 1.0.0, 0.4.1 |
| 9.2 | 1.7.1, 1.7.0, 1.6.0, 1.5.1, 1.5.0, 1.4.0, 1.2.0, 0.4.1 |
| 10.0 | 1.2.0, 1.1.0, 1.0.1, 1.0.0 |
| 10.1 | 1.7.1, 1.7.0, 1.6.0, 1.5.1, 1.5.0, 1.4.0, 1.3.0 |
| 10.2 | 1.7.1, 1.7.0, 1.6.0, 1.5.1, 1.5.0 |
| 11.0 | 1.7.1, 1.7.0 |
| 11.1 | 1.8.0 |

Create a Dockerfile:

The content below can be changed freely to fit your needs, or you can exec into the running container later and add more tooling by hand.

FROM nvidia/cuda:10.2-cudnn7-runtime-ubuntu18.04 as base

RUN apt-get update && apt-get install -y --no-install-recommends \
build-essential \
curl \
unzip


# See http://bugs.python.org/issue19846
ENV LANG C.UTF-8

RUN apt-get update && apt-get install -y \
python3 \
python3-pip

RUN python3 -m pip --no-cache-dir install --upgrade \
"pip<20.3" \
setuptools

RUN ln -s $(which python3) /usr/local/bin/python

RUN python3 -m pip install --no-cache-dir torch==1.8.0

RUN python3 -m pip install --no-cache-dir jupyter matplotlib
# Pin ipykernel and nbformat; see https://github.com/ipython/ipykernel/issues/422
RUN python3 -m pip install --no-cache-dir jupyter_http_over_ws ipykernel==5.1.1 nbformat==4.4.0
RUN jupyter serverextension enable --py jupyter_http_over_ws

RUN mkdir -p /home/jovyan && chmod -R a+rwx /home/jovyan
RUN mkdir /.local && chmod a+rwx /.local
RUN apt-get update && apt-get install -y --no-install-recommends wget git
RUN apt-get autoremove -y

WORKDIR /home/jovyan
EXPOSE 8888


RUN python3 -m ipykernel.kernelspec

ENV NB_PREFIX /

CMD ["bash", "-c", "source /etc/bash.bashrc && jupyter notebook --notebook-dir=/home/jovyan --ip 0.0.0.0 --no-browser --allow-root --port=8888 --NotebookApp.token='' --NotebookApp.password='' --NotebookApp.allow_origin='*' --NotebookApp.base_url=${NB_PREFIX}"]

Build it into an image and push it to Docker Hub. Then, when creating a Jupyter server in Kubeflow, select Custom image and fill in your image; the server should be created successfully, and opening it you can confirm the GPU is available.
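
Inside the new Jupyter server, a quick way to confirm PyTorch sees the GPU (run it in a notebook cell or terminal):

python3 -c 'import torch; print(torch.cuda.is_available()); print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else "no gpu")'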

Add custom images to the preset list

Edit the manifest1.3/022-jupyter-overlays-istio.yaml file:

$ vim manifest1.3/022-jupyter-overlays-istio.yaml

Under data -> spawnerFormDefaults -> image -> options, add the Kubeflow-ready images you built, e.g.:

spawnerFormDefaults:
  image:
    # The container Image for the user's Jupyter Notebook
    value: public.ecr.aws/j1r0q0g6/notebooks/notebook-servers/jupyter-scipy:v1.3.0
    # The list of available standard container Images
    options:
    - public.ecr.aws/j1r0q0g6/notebooks/notebook-servers/jupyter-scipy:v1.3.0
    - public.ecr.aws/j1r0q0g6/notebooks/notebook-servers/jupyter-pytorch-full:v1.3.0
    - public.ecr.aws/j1r0q0g6/notebooks/notebook-servers/jupyter-pytorch-cuda-full:v1.3.0
    - harbordocker/kf-pt:1.2.0-gpu-jupyter-cuda100
    - harbordocker/kf-pt:1.4.0-gpu-jupyter-cuda101
    - harbordocker/kf-pt:1.5.1-gpu-jupyter-cuda102
    - harbordocker/kf-pt:1.6.0-gpu-jupyter-cuda102
    - harbordocker/kf-pt:1.7.1-gpu-jupyter-cuda102
    - harbordocker/kf-pt:1.8.0-gpu-jupyter-cuda102
    - harbordocker/kf-pt:1.8.1-gpu-jupyter-cuda102
    - harbordocker/kf-pt:1.8.1-gpu-jupyter-cuda111
    - public.ecr.aws/j1r0q0g6/notebooks/notebook-servers/jupyter-tensorflow-full:v1.3.0
    - public.ecr.aws/j1r0q0g6/notebooks/notebook-servers/jupyter-tensorflow-cuda-full:v1.3.0
    - harbordocker/kf-tf:1.13.2-gpu-jupyter-cuda100
    - harbordocker/kf-tf:1.14.0-gpu-jupyter-cuda100
    - harbordocker/kf-tf:1.15.5-gpu-jupyter-cuda100
    - harbordocker/kf-tf:2.0.4-gpu-jupyter-cuda100
    - harbordocker/kf-tf:2.1.3-gpu-jupyter-cuda101
    - harbordocker/kf-tf:2.2.2-gpu-jupyter-cuda101
    - harbordocker/kf-tf:2.3.2-gpu-jupyter-cuda101
    - harbordocker/kf-tf:2.4.1-gpu-jupyter-cuda110
    - harbordocker/kf-tf:2.5.0-gpu-jupyter-cuda112

Apply the file again and the new images will show up on the page:

$ kubectl apply -f manifest1.3/022-jupyter-overlays-istio.yaml

References


  1. https://docs.docker.com/engine/install/centos/#install-using-the-repository ↩︎

  2. https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#id2 ↩︎

  3. https://github.com/kubeflow/manifests#prerequisites ↩︎

  4. https://docs.rancher.cn/docs/rancher2.5/installation/install-rancher-on-k8s/_index#3-选择您的-ssl-选项 ↩︎

  5. https://docs.rancher.cn/docs/rancher2.5/installation/resources/advanced/self-signed-ssl/_index ↩︎

  6. https://docs.rancher.cn/docs/rancher2.5/installation/other-installation-methods/single-node-docker/advanced/_index ↩︎ ↩︎

  7. https://docs.rancher.cn/docs/rancher2.5/installation/install-rancher-on-k8s/_index ↩︎

  8. https://docs.rancher.cn/docs/rancher2/faq/install/_index/#error-httpsranchermyorgping-is-not-accessible-could-not-resolve-host-ranchermyorg ↩︎

  9. https://github.com/shikanon/kubeflow-manifests/issues/20#issuecomment-843014942 ↩︎

  10. https://github.com/cockroachdb/cockroach/issues/28075#issuecomment-420497277 ↩︎

  11. Reason 1: some images are hosted on gcr.io and cannot be pulled; reason 2: some images are temporary builds (tagged with sha256 digests) and cannot be saved and re-imported; reason 3: most of the officially recommended channels are deeply tied to specific cloud providers, and the remaining methods all require bypassing the GFW. ↩︎

  12. "Went fine" means that after running kubectl get pods -A, all pods are in the Running or Completed state. ↩︎

  13. https://github.com/shikanon/kubeflow-manifests ↩︎

  14. https://github.com/kubeflow/manifests#change-default-user-password ↩︎

  15. https://github.com/kubeflow/manifests#nodeport--loadbalancer--ingress ↩︎

  16. https://istio.io/latest/zh/docs/tasks/traffic-management/ingress/ingress-control/#determining-the-ingress-ip-and-ports ↩︎

  17. https://knative.dev/docs/serving/using-a-tls-cert/#before-you-begin ↩︎

  18. https://knative.dev/docs/serving/using-a-tls-cert/#manually-adding-a-tls-certificate ↩︎

  19. https://docs.rancher.cn/docs/rancher2/cluster-admin/cleaning-cluster-nodes/_index ↩︎

  20. https://www.kubeflow.org/docs/components/notebooks/custom-notebook/ ↩︎

  21. https://www.tensorflow.org/install/source#gpu ↩︎

  22. https://blog.csdn.net/weixin_42069606/article/details/105198845 ↩︎

  23. https://pytorch.org/get-started/previous-versions/ ↩︎

