A Field Record of Migrating Stateful Applications on Kubernetes

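Making applications stateless is a prerequisite for breaking them into microservices. Kubernetes has become the mainstream platform, and the de facto standard, for running modern application architectures. Because its infrastructure management can be delegated to a cloud provider and it scales on demand, it has also attracted the attention of the high-performance computing (HPC) community. Containers in the cloud are usually designed to be stateless or to run short-lived tasks; to keep a container stateless, its data is normally kept in persistent storage such as a database, Redis, or object storage.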

Author | zouyee

Editor | zouyee

Reading level | Advanced

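There are, of course, stateful applications as well: databases, analytics, and machine learning (ML) and deep learning (DL) workloads that store or process data, for which that data is indispensable.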

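HPC workloads are typically long-running and stateful. Workloads such as simulations or optimization problems usually hold their data in memory, and on-disk checkpoints or backups are rarely up to date. A memory spike can trigger an out-of-memory (OOM) condition and get the pods killed; in the worst case, hours or even days of computation are lost. To avoid this, a stateful pod that runs into trouble should ideally be migrated to another node automatically. Container checkpointing takes a snapshot of a running container, and the checkpointed container can then be moved to another node.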

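Kubernetes uses preemption: when resources are tight, it evicts existing Pods from the cluster to make room for pending higher-priority Pods. Low-priority tasks are preempted often, and if a restarted task has to redo all of its computation, the lost work is expensive. One remedy is to raise the task's priority; the other is to provide checkpoint and restore.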

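Moving such stateful containers to a new machine is called stateful migration. Migrating a running container from one node to another takes three basic steps: checkpoint the container on the source node, transfer the checkpoint data to the target node, and restore the container on the target node. The container keeps its state across the migration.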

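Community status

Kubernetes does not currently support pod migration. At the end of January this year, the Kubernetes community accepted a proposal for container checkpointing, which is expected to ship in a future release. The planned roadmap is shown below.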

Stage	K8s release
alpha	v1.25
beta	v1.26
beta	v1.28

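Use cases

Checkpointing and restoring containers is useful in the following scenarios.

Speeding up slow-starting applications

If a service takes too long to start (for example, because it performs complex state initialization), it can be checkpointed once startup has finished and restored from that checkpoint image on later starts.

Restarting without losing state

When a machine has to reboot for an update, containers that are slow to start can be checkpointed before the reboot and restored from the checkpoint afterwards, without losing any state and without a long service interruption.

Preemption/eviction

As in the first use case, a container is checkpointed on one node and restored on another, for example one with more resources available.

Application snapshots

Save the application's state at several points and later roll back to any one of them.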

Technical background

CRIU

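CRIU (Checkpoint/Restore In Userspace) is a tool that provides checkpoint/restore functionality on Linux. It freezes a running application and checkpoints it as a collection of files on disk; those files can later be used to restore the application to its frozen state and let it continue running. CRIU works well with OpenVZ, LXC/LXD, Docker, and others.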

Checkpoint

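/proc is an in-memory file system that exposes information about the CPU, memory, partitions, I/O addresses, direct memory access (DMA) channels, running processes, and more. Linux uses /proc to read the kernel's internal data structures and to change kernel settings. Checkpointing relies heavily on /proc, which CRIU uses to obtain file descriptor information, pipe parameters, memory mappings, and so on. The process dumper carries out a checkpoint in the following steps:

Collect the process information and freeze the processes
Collect the tasks' resources and dump them (write them to the dump files)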
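
To make the role of /proc more concrete, the following Go sketch reads two of the sources mentioned above for the current process: the memory mappings in /proc/self/maps and the open file descriptors in /proc/self/fd. It is only an illustration of the kind of information a dumper collects, not CRIU's actual code; the `mapping` type and `parseMapsLine` helper are names invented for this example, and the program assumes a Linux host.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// mapping holds the fields of one /proc/<pid>/maps line that matter here.
// (Hypothetical type for this sketch.)
type mapping struct {
	start, end string // address range, hexadecimal
	perms      string // e.g. "r-xp"
	path       string // backing file; empty for anonymous memory
}

// parseMapsLine parses a line such as
// "00400000-0040b000 r-xp 00000000 08:01 1234 /bin/cat".
func parseMapsLine(line string) (mapping, bool) {
	fields := strings.Fields(line)
	if len(fields) < 5 {
		return mapping{}, false
	}
	bounds := strings.SplitN(fields[0], "-", 2)
	if len(bounds) != 2 {
		return mapping{}, false
	}
	m := mapping{start: bounds[0], end: bounds[1], perms: fields[1]}
	if len(fields) > 5 {
		m.path = strings.Join(fields[5:], " ")
	}
	return m, true
}

func main() {
	// Memory mappings of the current process (Linux only).
	f, err := os.Open("/proc/self/maps")
	if err != nil {
		fmt.Println("cannot read /proc/self/maps:", err)
		return
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for n := 0; n < 5 && sc.Scan(); n++ {
		if m, ok := parseMapsLine(sc.Text()); ok {
			fmt.Printf("map %s-%s %s %q\n", m.start, m.end, m.perms, m.path)
		}
	}

	// Open file descriptors and what they point to.
	entries, err := os.ReadDir("/proc/self/fd")
	if err != nil {
		fmt.Println("cannot read /proc/self/fd:", err)
		return
	}
	for _, e := range entries {
		target, err := os.Readlink("/proc/self/fd/" + e.Name())
		if err != nil {
			continue // the descriptor used for the listing is already closed
		}
		fmt.Printf("fd %s -> %s\n", e.Name(), target)
	}
}
```

A real dumper records far more than this (pipes, sockets, namespaces, the memory contents themselves), but the pattern of walking /proc entries per process is the same.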

Restore

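The restore process consists of the following main steps:

Resolve shared resources: CRIU reads the image files to work out which processes share which resources. A shared resource is restored by one process, and the others inherit it or obtain it some other way.
Fork the process tree: fork() is called to create the processes to be restored, but their state is not restored yet.
Restore basic task resources: open files, prepare namespaces, map memory regions, create sockets, and so on. A few kinds of resources are deferred to the next stage: the exact location of memory mappings, timers, credentials, and threads.
Switch to the restore context, restore the remaining resources, and resume execution.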

Container Runtime

A container is, in the end, just a set of processes, so CRIU essentially checkpoints and restores the container's processes.

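Docker live migration

Docker offers an experimental feature (available since Docker 1.13) that freezes a running container by checkpointing it, turning its state into files on disk. The container can later be resumed from the point at which it was frozen.

Building CRIU from source is tedious; it is easier to install it by following the instructions on the official site. Docker (moby) provides checkpoint support, but only in experimental mode, which is enabled by creating /etc/docker/daemon.json (if the file already exists, add the "experimental": true key to it instead).

<code style="margin-left:0">$ echo '{"experimental": true}' > /etc/docker/daemon.json
$ systemctl restart docker</code>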

Note that checkpointing used to fail on Docker 18 and later with the error below. The upstream discussion is in runc/pull/1840, and the problem has since been fixed:

<code style="margin-left:0">Error response from daemon: open /var/lib/docker/containers/[CONTAINER_ID]/checkpoints/[CHECKPOINT_ID]/config.json: no such file or directory</code>

a. Run a container

<code style="margin-left:0">docker run -d --name looper busybox /bin/sh -c 'i=0; while true; do echo $i; i=$(expr $i + 1); sleep 1; done'</code>

b. Check the output:

<code style="margin-left:0">docker logs looper</code>

c. Create a checkpoint

<code style="margin-left:0">docker checkpoint create looper checkpoint1</code>

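--leave-running=false: whether the container keeps running after the checkpoint completes (default false)
--checkpoint-dir DIR_PATH: store the checkpoint in the given directory

d. List the checkpoints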

<code style="margin-left:0">docker checkpoint ls CONTAINER</code>

e. Restore

There is no separate restore command; instead, pass the checkpoint option to container start:

<code style="margin-left:0">docker start --checkpoint checkpoint1 looper</code>

Note: when creating a checkpoint, --checkpoint-dir can be used to specify an absolute path.

Podman live migration

checkpoint&restore

a. Preparation

<code style="margin-left:0">. /etc/os-release
echo "deb https://download.opensuse.org/repositories/devel:/kubic:/libcontainers:/stable/xUbuntu_${VERSION_ID}/ /" | sudo tee /etc/apt/sources.list.d/devel:kubic:libcontainers:stable.list
curl -L "https://download.opensuse.org/repositories/devel:/kubic:/libcontainers:/stable/xUbuntu_${VERSION_ID}/Release.key" | sudo apt-key add -</code>

b. Install

<code style="margin-left:0">sudo apt update
sudo apt -y install podman</code>

c. Start a container

<code style="margin-left:0"># podman run -d --name looper busybox /bin/sh -c \
         'i=0; while true; do echo $i; i=$(expr $i + 1); sleep 1; done'</code>

d. Check the logs:

<code style="margin-left:0"># podman logs -l</code>

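Alternatively, run podman ps.

Run the command a few times and the counter can be seen increasing. The container can now be checkpointed and then restored:

<code style="margin-left:0"># podman container checkpoint -l
# podman container restore -l</code>

podman logs -l or podman ps confirms that the container was restored and that it continued from the point at which it was checkpointed.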

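Container live migration

Actually migrating a container from one system to another requires Podman 1.4.0 (June 2019) or later, the first release able to export a complete checkpoint that can then be moved.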

<code style="margin-left:0"># podman run -d --name looper busybox /bin/sh -c \
         'i=0; while true; do echo $i; i=$(expr $i + 1); sleep 1; done'
# podman container checkpoint -l --export=/tmp/chkpt.tar.gz
# scp /tmp/chkpt.tar.gz <destination-host>:/tmp</code>

Once the checkpoint archive has been transferred to the target system, the container can be restored from it.

<code style="margin-left:0"># podman container restore --import=/tmp/chkpt.tar.gz</code>

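The container now continues running from the exact point at which it was checkpointed on the original machine.

Multiple copies of the container, each with a different name, can also be restored from the same checkpoint.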

<code style="margin-left:0"># podman container restore --import=/tmp/chkpt.tar.gz -n looper1
# podman container restore --import=/tmp/chkpt.tar.gz -n looper2
# podman container restore --import=/tmp/chkpt.tar.gz -n looper3</code>

Each restored container starts running from the point in time at which the container was checkpointed.
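
The migration steps above can be summarized as a small command plan. The Go sketch below only builds and prints the commands; it does not run them. `migrationPlan` is a hypothetical helper written for this article, and it assumes the archive is copied to the same path on the destination host. The first two commands are meant for the source host, the restore commands for the destination.

```go
package main

import (
	"fmt"
	"strings"
)

// migrationPlan lists the commands for moving `container` to destHost and
// restoring one copy per entry in names (or a single copy under the
// original name when names is empty). It only builds the commands.
func migrationPlan(container, archive, destHost string, names []string) [][]string {
	plan := [][]string{
		// Source host: export a complete checkpoint archive.
		{"podman", "container", "checkpoint", "--export=" + archive, container},
		// Source host: copy the archive to the same path on the destination.
		{"scp", archive, destHost + ":" + archive},
	}
	restore := []string{"podman", "container", "restore", "--import=" + archive}
	if len(names) == 0 {
		return append(plan, restore)
	}
	for _, n := range names {
		// Copy the base command so the restore entries do not share storage.
		cmd := append(append([]string{}, restore...), "-n", n)
		plan = append(plan, cmd)
	}
	return plan
}

func main() {
	plan := migrationPlan("looper", "/tmp/chkpt.tar.gz", "destination-host",
		[]string{"looper1", "looper2"})
	for i, cmd := range plan {
		fmt.Printf("%d. %s\n", i+1, strings.Join(cmd, " "))
	}
}
```

Keeping the plan as data makes it easy to log or review before anything is executed, which is useful when the checkpoint stops the source container.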

Containerd live migration

checkpoint&restore

a. Preparation

Get the ctr command-line tool

<code style="margin-left:0">wget https://github.com/containerd/containerd/releases/download/v1.6.4/containerd-1.6.4-linux-amd64.tar.gz
tar xvf containerd-1.6.4-linux-amd64.tar.gz</code>

b. Pull the image

<code style="margin-left:0">ctr image pull docker.io/library/redis:alpine</code>

c. Create the container

<code style="margin-left:0">ctr run --runtime io.containerd.runc.v1 -d docker.io/library/redis:alpine redis</code>

d. Create a checkpoint

<code style="margin-left:0">ctr c checkpoint --rw --task redis checkpoint/redis:20211011</code>

e. Restore the container

<code style="margin-left:0">ctr c restore redis-debug checkpoint/redis:20211011</code>

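As this section shows, the mainstream container runtimes already support checkpointing and restoring containers, but Kubernetes itself does not yet expose the feature. Introducing it requires changing the cri-api interface, modifying the kubelet code, and then implementing the new CRI calls in the runtimes.

Design

Goals

The goal of this KEP is to introduce checkpointing (but not restore) into the CRI API. That includes extending the kubelet API to support checkpointing a single container, with debugging scenarios in mind. Although checkpoint and restore could be used to implement container migration, this KEP targets only the debugging scenario; restore is left for future work.

Implementation overview

For the debugging scenario, the aim is to checkpoint a running Pod without stopping its containers. The code for this use case is in kubernetes/kubernetes#104907.

The current plan is to introduce checkpoint and restore bottom-up. The first step extends the CRI API so that a container checkpoint can be triggered, and adds a kubelet endpoint that triggers it; the ContainerCheckpoint feature gate turns the functionality on or off. The PR above adds the kubelet endpoint, which is called like this: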

<code style="margin-left:0">curl -skv -X POST "https://localhost:10250/checkpoint/default/counters/wildfly"</code>

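The current implementation does not intend to support pod restore in the kubelet; restore is expected to happen outside Kubernetes.

Although this KEP adds only checkpoint support to the kubelet, the PR above extends the CRI API with both checkpoint and restore. Restore was added to the CRI API without a kubelet implementation to make development easier.

Note: the API as it stands does not actually define a restore call.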
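
For readers who prefer to call the kubelet endpoint from code rather than curl, here is a minimal Go sketch that builds the same POST request. `checkpointURL` and `newCheckpointRequest` are hypothetical helper names; the host, namespace, pod, and container values are the ones from the curl example. The sketch stops at constructing the request: actually sending it requires an HTTP client configured with credentials and TLS settings that the kubelet accepts, which are deployment-specific and not shown.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
)

// checkpointURL builds the kubelet checkpoint URL in the form
// https://<host:port>/checkpoint/<namespace>/<pod>/<container>.
func checkpointURL(hostPort, namespace, pod, container string) string {
	u := url.URL{
		Scheme: "https",
		Host:   hostPort,
		Path:   "/checkpoint/" + namespace + "/" + pod + "/" + container,
	}
	return u.String()
}

// newCheckpointRequest returns the POST request without sending it.
func newCheckpointRequest(hostPort, namespace, pod, container string) (*http.Request, error) {
	return http.NewRequest(http.MethodPost,
		checkpointURL(hostPort, namespace, pod, container), nil)
}

func main() {
	req, err := newCheckpointRequest("localhost:10250", "default", "counters", "wildfly")
	if err != nil {
		fmt.Println("building request failed:", err)
		return
	}
	// Print what would be sent; no network traffic happens here.
	fmt.Println(req.Method, req.URL.String())
}
```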

Implementation details

1. CRI API extension

New RPC:

<code style="margin-left:0">// CheckpointContainer checkpoints a container
    rpc CheckpointContainer(CheckpointContainerRequest) returns (CheckpointContainerResponse) {}</code>

Message definitions:

<code style="margin-left:0">message CheckpointContainerRequest {
    // ID of the container to be checkpointed.
    string container_id = 1;
    // Location of the checkpoint archive used for export
    string location = 2;
}

message CheckpointContainerResponse {}</code>

2. CRI runtime implementation

adrianreber has opened PRs for the cri-o, Podman, and containerd runtimes. The following uses containerd as the example; see containerd/pull/6965 for details.

<code style="margin-left:0">func (c *criService) CheckpointContainer(
  ctx context.Context,
  r *runtime.CheckpointContainerRequest,
) (*runtime.CheckpointContainerResponse, error) {
  start := time.Now()

  // Kubernetes has the possibility to request a file system local
  // checkpoint archive. If the given location starts with a '/' or
  // does not contain any slashes this assumes a local file.
  // Only slashes in the middle assumes a destination in the local image store.
  if strings.HasPrefix(r.GetLocation(), "/") || !strings.Contains(r.GetLocation(), "/") {
    return nil, fmt.Errorf(
      "local checkpoint archives (%s) are not supported",
      r.GetLocation(),
    )
  }

  container, err := c.containerStore.Get(r.GetContainerId())
  if err != nil {
    return nil, fmt.Errorf(
      "an error occurred when try to find container %q: %w",
      r.GetContainerId(),
      err,
    )
  }

  state := container.Status.Get().State()
  if state != runtime.ContainerState_CONTAINER_RUNNING {
    return nil, fmt.Errorf(
      "container %q is in %s state. only %s containers can be checkpointed",
      r.GetContainerId(),
      criContainerStateToString(state),
      criContainerStateToString(runtime.ContainerState_CONTAINER_RUNNING),
    )
  }

  i, err := container.Container.Info(ctx)
  if err != nil {
    return nil, fmt.Errorf("get container info: %w", err)
  }

  task, err := container.Container.Task(ctx, nil)
  if err != nil {
    return nil, fmt.Errorf(
      "failed to get task for container %q: %w",
      r.GetContainerId(),
      err,
    )
  }
  _, err = task.Checkpoint(
    ctx,
    []containerd.CheckpointTaskOpts{withCheckpointOpts(
      i.Runtime.Name,
      r.GetLocation(),
      c.getContainerRootDir(r.GetContainerId()),
    )}...,
  )
  if err != nil {
    return nil, fmt.Errorf(
      "checkpointing container %q failed: %w",
      r.GetContainerId(),
      err,
    )
  }

  containerCheckpointTimer.WithValues(i.Runtime.Name).UpdateSince(start)

  return &runtime.CheckpointContainerResponse{}, nil
}

func withCheckpointOpts(rt, location, rootDir string) containerd.CheckpointTaskOpts {
  return func(r *containerd.CheckpointTaskInfo) error {
    // There is a check in the RPC interface to ensure 'location'
    // contains an image destination in the local image store.
    r.Name = location
    // Kubernetes currently support checkpointing of container
    // as part of the Forensic Container Checkpointing KEP.
    // This implies that the container is never stopped
    leaveRunning := true

    switch rt {
    case plugin.RuntimeRuncV1, plugin.RuntimeRuncV2:
      if r.Options == nil {
        r.Options = &options.CheckpointOptions{}
      }
      opts, _ := r.Options.(*options.CheckpointOptions)

      opts.Exit = !leaveRunning
      opts.WorkPath = rootDir
    case plugin.RuntimeLinuxV1:
      if r.Options == nil {
        r.Options = &runctypes.CheckpointOptions{}
      }
      opts, _ := r.Options.(*runctypes.CheckpointOptions)

      opts.Exit = !leaveRunning
      opts.WorkPath = rootDir
    }
    return nil
  }
}</code>
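
The location check at the top of CheckpointContainer is easy to misread, so the sketch below isolates it. `isLocalArchive` is a name made up for this example; its condition is copied from the listing above, and the sample locations are illustrative.

```go
package main

import (
	"fmt"
	"strings"
)

// isLocalArchive mirrors the condition in the listing above: a location that
// starts with '/' or contains no '/' at all is treated as a local file, which
// this containerd implementation rejects. Anything else is taken to be a
// destination in the local image store.
func isLocalArchive(location string) bool {
	return strings.HasPrefix(location, "/") || !strings.Contains(location, "/")
}

func main() {
	for _, loc := range []string{
		"/var/lib/kubelet/checkpoints/redis.tar", // absolute path: local
		"redis.tar",                              // bare file name: local
		"checkpoint/redis:20211011",              // image reference: accepted
	} {
		fmt.Printf("%-40s local=%v\n", loc, isLocalArchive(loc))
	}
}
```

In other words, with this implementation the checkpoint ends up as an image such as checkpoint/redis:20211011 rather than as a tar archive on disk.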

3. Kubelet support

The kubelet server gains a new POST endpoint:

<code style="margin-left:0">"/checkpoint/{podNamespace}/{podID}/{containerName}"</code>

<code style="margin-left:0">// Only enable checkpoint API if the feature is enabled
  if utilfeature.DefaultFeatureGate.Enabled(features.ContainerCheckpoint) {
    s.addMetricsBucketMatcher("checkpoint")
    ws = &restful.WebService{}
    ws.Path("/checkpoint").Produces(restful.MIME_JSON)
    ws.Route(ws.POST("/{podNamespace}/{podID}/{containerName}").
      To(s.checkpoint).
      Operation("checkpoint"))
    s.restfulCont.Add(ws)
  }</code>

Checkpointing as part of the pod lifecycle goes back to kubernetes/issues/3949. The PR above implements only basic checkpointing, which is used as follows:

<code style="margin-left:0">1. curl -skv -X POST "https://localhost:10250/checkpoint/default/counters/wildfly"
2. Copy the archive from /var/lib/kubelet/checkpoints to another machine
3. Restore it with crictl restore --import=<archive></code>

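Migration walkthrough

Building on an earlier PoC by Jakob Schrettenbrunner, this section shows how to set up a Kubernetes cluster with pod migration support. The test cluster is built with minikube.

Start a single-node cluster with minikube: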

<code style="margin-left:0">minikube start --container-runtime=containerd --cni=cilium --wait=all</code>

a. Replace the containerd CRI plugin

<code style="margin-left:0">mkdir -p /root/go/src/github.com/containerd && cd /root/go/src/github.com/containerd
git clone https://github.com/elchead/containerd.git
cd containerd
git checkout checkpoint
make && make install
cp ./bin/containerd /usr/bin/containerd
systemctl start containerd</code>

b. Replace the kubelet

<code style="margin-left:0">wget https://github.com/elchead/kubernetes/releases/download/v8.1.0/kubelet
chmod +x ./kubelet
cp ./kubelet /usr/bin</code>

c. Start the pod

<code style="margin-left:0">apiVersion: v1
kind: Pod
metadata:
 name: migration-test
 labels:
   name: migration-test
spec:
 containers:
 - name: redis
   image: redis
   ports:
     - containerPort: 6379
   resources:
     limits:
       memory: "128Mi"
       cpu: "500m"
 nodeSelector:
   kubernetes.io/hostname: node1</code>

The pod created next, the migration target, is configured as follows:

<code style="margin-left:0">apiVersion: v1
kind: Pod
metadata:
 name: migration-test-migrated
 labels:
   name: migration-test
spec:
 clonePod: migration-test
 containers:
 - name: redis
   image: redis
   ports:
     - containerPort: 6379
   resources:
     limits:
       memory: "128Mi"
       cpu: "500m"</code>

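Compared with the original pod definition, the spec gains a clonePod field. The migration should be very fast. For now, the old pod is destroyed during the migration, while the cloned pod should keep running. Querying its endpoint with curl should return a number greater than 1.

Known limitations:

The image configured for the pod is not verified during restore; it is used as-is. This can cause problems, particularly with the latest tag.
Only the container's in-memory state is migrated; any other data must live in mounted volumes.
Volumes need the ReadWriteMany access mode, because they are mounted by several Pods at the same time.
No error handling is implemented yet.
kubelet authorization has to be set to AlwaysAllow, among other things.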
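
The first limitation can be partly mitigated before a migration by checking whether the pod's image reference is pinned. The Go sketch below is a simple heuristic written for this article, not part of the PoC: `usesMutableTag` is a hypothetical helper that flags only a missing tag or the literal latest tag, and treats digest references as pinned. Other tags can also be moved by whoever publishes the image, so a negative result is not a guarantee.

```go
package main

import (
	"fmt"
	"strings"
)

// usesMutableTag reports whether an image reference has no tag (which
// defaults to "latest") or the explicit tag "latest". References pinned by
// digest (name@sha256:...) are considered immutable.
func usesMutableTag(image string) bool {
	if strings.Contains(image, "@") {
		return false // pinned by digest
	}
	// Look for the tag only in the last path segment, so that a registry
	// port such as localhost:5000 is not mistaken for a tag.
	name := image
	if i := strings.LastIndex(image, "/"); i >= 0 {
		name = image[i+1:]
	}
	i := strings.LastIndex(name, ":")
	if i < 0 {
		return true // no tag given: implicit latest
	}
	return name[i+1:] == "latest"
}

func main() {
	for _, img := range []string{
		"redis", "redis:latest", "redis:7.0",
		"localhost:5000/redis", "redis@sha256:0123abcd",
	} {
		fmt.Printf("%-28s mutable=%v\n", img, usesMutableTag(img))
	}
}
```

The redis image used in the pod manifests above has no tag, so this check would flag it.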

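Outlook

The current implementation only offers a way to checkpoint a single container in a pod. Future versions may support checkpointing an entire pod. To do that, the container runtime needs a pod-level cgroup freeze so that all containers are checkpointed at the same point in time and no container keeps running while the others in the pod are being checkpointed. These issues are discussed in issue #3949.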
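
As a rough idea of what a pod-level freeze could build on, cgroup v2 exposes a cgroup.freeze file: writing 1 freezes every process in the cgroup and its descendants, and writing 0 thaws them. The Go sketch below wraps that file in two hypothetical helpers, `setFrozen` and `isFrozen`. It is an assumption about one possible building block, not how any runtime implements pod checkpointing, and the demo runs against a temporary directory that stands in for a real pod cgroup.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// setFrozen writes the requested state to <cgroupDir>/cgroup.freeze.
func setFrozen(cgroupDir string, frozen bool) error {
	value := "0"
	if frozen {
		value = "1"
	}
	return os.WriteFile(filepath.Join(cgroupDir, "cgroup.freeze"), []byte(value), 0o644)
}

// isFrozen reads back the requested state. On a real cgroup the "frozen"
// key in cgroup.events reports when the freeze has actually completed.
func isFrozen(cgroupDir string) (bool, error) {
	data, err := os.ReadFile(filepath.Join(cgroupDir, "cgroup.freeze"))
	if err != nil {
		return false, err
	}
	return strings.TrimSpace(string(data)) == "1", nil
}

// demo exercises both helpers against a temporary directory that stands in
// for a pod cgroup, so it needs neither root nor cgroup v2.
func demo() (bool, error) {
	dir, err := os.MkdirTemp("", "fake-pod-cgroup")
	if err != nil {
		return false, err
	}
	defer os.RemoveAll(dir)
	if err := setFrozen(dir, true); err != nil {
		return false, err
	}
	return isFrozen(dir)
}

func main() {
	frozen, err := demo()
	fmt.Println("frozen:", frozen, "err:", err)
}
```

Freezing the pod's parent cgroup first and then checkpointing each container would give all containers the same stopping point, which is the property the paragraph above asks for.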

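Given the author's limited time, perspective, and knowledge, this article inevitably contains errors and omissions; corrections and discussion from readers and practitioners are welcome.

References

1. https://github.com/kubernetes/kubernetes/issues/3949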

2. https://www.jianshu.com/p/6edcbe67c8e0

3. https://github.com/kubernetes/enhancements/pull/1990

4. https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2008-forensic-container-checkpointing#implementation

5. https://github.com/kubernetes/enhancements/pull/3264

6. https://github.com/kubernetes/kubernetes/pull/104907

7. https://surenraju.medium.com/migrate-running-containers-by-checkpoint-restoring-using-criu-6670dd26a822

8. https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/building_running_and_managing_containers/assembly_creating-and-restoring-container-checkpoints_building-running-and-managing-containers

9. https://criu.org/Docker

10. https://astobbe.me/posts/pod-migration/

11. https://github.com/containerd/containerd/pull/6965
