Managing a Stable Diffusion Installation Environment with Docker: NVIDIA GPU Support

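Stay close to the system's native environment and meet every change with no fixed move, the way a master without a set technique beats one with many: that is Docker's approach.

Learning AI-related technology inevitably means installing environments. Python's pip and the PyPI index were meant to handle package management, but they struggle with today's tangled version-switching needs, which has spawned a whole crop of additional environment managers such as conda. These tools evolve too quickly, though, and in the foreseeable future they will surely become stumbling blocks of their own.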

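Step 1: Gather information

To install stable-diffusion, first determine the versions of the main packages from the hardware and system requirements.

The workload will run on an NVIDIA RTX 5080 card, so the minimum CUDA version is 12.8.

NVIDIA's NGC provides a PyTorch image. The 25.xx releases of this image are built on libraries from CUDA 12.8 onward (the latest are based on CUDA 13.0). See the support information NVIDIA publishes for these images for details. Here I pick a not-too-new release, 25.03, which provides: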

torch-2.7.0
torchvision-0.22.0
cuda-python-12.8.0
Cython-3.0.12
numpy-1.26.4
opencv-4.10.0
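
Once the container is up, the list above can be checked against what is actually installed. The following is a minimal sketch, not part of the original setup: the helper names are hypothetical, and it assumes the pip distribution names match the names in the list. It compares by prefix because vendor builds often append a suffix to the version string.

```python
from importlib import metadata

# Versions quoted in the article for the 25.03 image.
# Assumption: the pip distribution names are exactly these keys.
EXPECTED = {"torch": "2.7.0", "torchvision": "0.22.0", "numpy": "1.26.4"}


def installed_versions(names):
    """Return {name: version string, or None if the package is not installed}."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def mismatches(expected, found):
    """Return {name: (wanted, actual)} for packages that are missing or differ."""
    problems = {}
    for name, wanted in expected.items():
        actual = found.get(name)
        # Prefix match: a build such as "2.7.0a0+<hash>" still counts as 2.7.0.
        if actual is None or not actual.startswith(wanted):
            problems[name] = (wanted, actual)
    return problems


if __name__ == "__main__":
    print(mismatches(EXPECTED, installed_versions(EXPECTED)))
```

Run inside the container, an empty dictionary would mean the three packages match the listed versions.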

Step 2: Build the skeleton

Create a Dockerfile:

FROM nvcr.io/nvidia/pytorch:25.03-py3

WORKDIR /workspace

RUN wget -O- https://github.com/CompVis/stable-diffusion/archive/refs/heads/main.tar.gz | tar zxvf -

WORKDIR /workspace/stable-diffusion-main

Build and run:

docker build -t my/sd .

docker run --rm -it --gpus all my/sd /bin/bash

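After a long image pull, I came back to the computer to check the progress.

Huh? It's stuck!

Step 3: Fight the error messages

Adding the missing dependencies

Look at the last line of output: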

=> => # Connecting to codeload.github.com (codeload.github.com)|54.251.140.56|:443...

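The first obstacle: the almighty GFW blocks the direct download of the source archive. Fortunately, --build-arg can pass a variable into the build container, and https_proxy is one of Docker's predefined build arguments, so the Dockerfile needs no ARG line for it.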

docker build -t my/sd --build-arg https_proxy=http://172.17.0.1:64540 .

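Everything goes smoothly and the container can now be run. To be on the safe side, pass the proxy in as an environment variable as well.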

docker run --rm -it --gpus all -e https_proxy=http://172.17.0.1:64540 my/sd /bin/bash
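
To confirm that the variable passed with -e is actually picked up inside the container, a quick check from Python can help. This is only a sketch with a hypothetical helper name; it uses the standard library's urllib.request.getproxies(), which reads the proxy environment variables, and says nothing about whether a particular download library honors them.

```python
import os
import urllib.request


def https_proxy_in_effect():
    """Return the HTTPS proxy URL that urllib sees, or None if none is set."""
    return urllib.request.getproxies().get("https")


if __name__ == "__main__":
    # Show both the raw variable and urllib's interpretation of it.
    print("https_proxy env:", os.environ.get("https_proxy"))
    print("proxy seen by urllib:", https_proxy_in_effect())
```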

Run the first program:

python -m scripts.txt2img

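An error appears. It is only the first; a whole series of errors stands between us and a full startup. Don't worry, let's knock them out one by one.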

ModuleNotFoundError: No module named 'omegaconf'

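A dependency is clearly missing. Run pip install omegaconf directly and wait for the success message, then rerun the script with python -m scripts.txt2img. Another missing-module message appears.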

ModuleNotFoundError: No module named 'imwatermark'

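The fix is the same, except that the imwatermark module is provided by the invisible-watermark package. Similar errors follow and are not repeated here. In the end a whole set of modules has to be installed. With reference to environment.yaml in the source tree, here is the consolidated list; add the following to the Dockerfile, then rebuild and run.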

RUN pip install \
        omegaconf \
        invisible-watermark \
        pytorch-lightning==1.4.2 \
        torchmetrics==0.6.0 \
        diffusers \
        transformers==4.44.2 "tokenizers>=0.13,<0.20" \
        kornia==0.6 && \
    pip install -e "git+https://github.com/CompVis/taming-transformers.git@master#egg=taming-transformers" && \
    pip install -e "git+https://github.com/openai/CLIP.git@main#egg=clip"
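
Rather than discovering missing modules one traceback at a time, the import names can be probed in a single pass. The sketch below is an illustration only: the function name is hypothetical, and the mapping from import name to pip package is assembled from the error messages and the package list above (the `taming` and `clip` import names are the ones those two repositories install).

```python
import importlib.util

# Import name -> pip package that provides it.
# Note that the two names differ for imwatermark / invisible-watermark.
MODULE_TO_PACKAGE = {
    "omegaconf": "omegaconf",
    "imwatermark": "invisible-watermark",
    "pytorch_lightning": "pytorch-lightning",
    "torchmetrics": "torchmetrics",
    "diffusers": "diffusers",
    "transformers": "transformers",
    "kornia": "kornia",
    "taming": "taming-transformers",
    "clip": "clip",
}


def missing_packages(module_to_package=MODULE_TO_PACKAGE):
    """Return the pip packages whose top-level module cannot be found."""
    missing = []
    for module, package in module_to_package.items():
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(module) is None:
            missing.append(package)
    return missing


if __name__ == "__main__":
    print("missing:", missing_packages())
```

This only checks that a module can be located, not that its version is compatible, so the pinned versions in the Dockerfile still matter.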

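At last the error is no longer a missing module.

Adding the model files

After restarting, the script begins downloading the model files, but the download is extremely slow.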

pytorch_model.bin:   0%|                                       | 703k/1.22G [01:03<30:25:47, 11.1kB/s]

Interrupt the process and exit the container. Download the required model files ahead of time on the host with the following commands.

export HF_ENDPOINT=https://hf-mirror.com

huggingface-cli download --local-dir CompVis/stable-diffusion-safety-checker --resume-download CompVis/stable-diffusion-safety-checker

huggingface-cli download --local-dir openai/clip-vit-large-patch14 --resume-download openai/clip-vit-large-patch14 preprocessor_config.json config.json pytorch_model.bin

This time, add two -v options at startup to mount the model directories at the corresponding locations in the container.

docker run --rm -it --gpus all -e https_proxy=http://172.17.0.1:64540 -v ./CompVis:/workspace/stable-diffusion-main/CompVis -v ./openai:/workspace/stable-diffusion-main/openai my/sd /bin/bash

Run python -m scripts.txt2img; the error is:

FileNotFoundError: [Errno 2] No such file or directory: 'models/ldm/stable-diffusion-v1/model.ckpt'

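This is the core model needed for image generation. Open https://hf-mirror.com/CompVis and choose the original model of the version you need. For version 1.4, for example, the download command is: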

huggingface-cli download --local-dir CompVis/stable-diffusion-v-1-4-original --resume-download CompVis/stable-diffusion-v-1-4-original sd-v1-4.ckpt

Run again, this time also mounting the checkpoint file at the path the script expects:

docker run --rm -it --gpus all -e https_proxy=http://172.17.0.1:64540 -v ./CompVis:/workspace/stable-diffusion-main/CompVis -v ./openai:/workspace/stable-diffusion-main/openai -v ./CompVis/stable-diffusion-v-1-4-original/sd-v1-4.ckpt:/workspace/stable-diffusion-main/models/ldm/stable-diffusion-v1/model.ckpt my/sd /bin/bash

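The `weights_only` change

The next error ends with:

Check the documentation of torch.load to learn more about types accepted by default with weights_only https://pytorch.org/docs/stable/generated/torch.load.html.

This error comes from an update to a dependency: starting with PyTorch 2.6, torch.load defaults to weights_only=True, which rejects the pickled objects in this checkpoint. One command works around it for now, provided the checkpoint comes from a source you trust. Add the following line to the Dockerfile, then rebuild and run.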

RUN grep -rl 'torch.load(ckpt, map_location="cpu")' scripts | xargs sed -i 's/torch.load(ckpt, map_location="cpu")/torch.load(ckpt, map_location="cpu", weights_only=False)/g'
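
For readers who prefer not to rely on grep and sed, the same literal substitution can be sketched in Python. This is an assumed equivalent rather than the author's method; the function names are hypothetical, and unlike grep -rl it only visits *.py files.

```python
from pathlib import Path

# The exact call the sed command above targets, and its replacement.
OLD = 'torch.load(ckpt, map_location="cpu")'
NEW = 'torch.load(ckpt, map_location="cpu", weights_only=False)'


def patch_text(source):
    """Replace every literal occurrence of OLD with NEW."""
    return source.replace(OLD, NEW)


def patch_tree(root):
    """Patch all *.py files under root in place; return the changed paths."""
    changed = []
    for path in sorted(Path(root).rglob("*.py")):
        text = path.read_text(encoding="utf-8")
        patched = patch_text(text)
        if patched != text:
            path.write_text(patched, encoding="utf-8")
            changed.append(str(path))
    return changed


if __name__ == "__main__":
    # Same target directory as the sed command.
    print(patch_tree("scripts"))
```

Because the replacement text no longer contains the original call, running the patch twice leaves the files unchanged.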

Run the script again, this time adding -v ./outputs:/workspace/stable-diffusion-main/outputs.

docker run --rm -it --gpus all -e https_proxy=http://172.17.0.1:64540 -v ./outputs:/workspace/stable-diffusion-main/outputs -v ./CompVis:/workspace/stable-diffusion-main/CompVis -v ./openai:/workspace/stable-diffusion-main/openai -v ./CompVis/stable-diffusion-v-1-4-original/sd-v1-4.ckpt:/workspace/stable-diffusion-main/models/ldm/stable-diffusion-v1/model.ckpt my/sd /bin/bash

At last a progress bar appears, and finally the success message below.

Your samples are ready and waiting for you here: 
outputs/txt2img-samples 
 
Enjoy.

Since the output directory is mounted, the generated files can be found directly in the host's outputs/ directory.

(Figure: the generated images)
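
By the end of this step the docker run command carries a proxy variable and four mounts, which makes it easy to mistype. One way to keep it manageable is to assemble it from a list. The sketch below is illustrative only: the function name and the MOUNTS structure are hypothetical, while the paths, image name and proxy address are the ones used in this article.

```python
import shlex

WORKDIR = "/workspace/stable-diffusion-main"

# (host path, container path) pairs, in the order used above.
MOUNTS = [
    ("./outputs", f"{WORKDIR}/outputs"),
    ("./CompVis", f"{WORKDIR}/CompVis"),
    ("./openai", f"{WORKDIR}/openai"),
    (
        "./CompVis/stable-diffusion-v-1-4-original/sd-v1-4.ckpt",
        f"{WORKDIR}/models/ldm/stable-diffusion-v1/model.ckpt",
    ),
]


def docker_run_command(image, mounts, proxy=None, shell="/bin/bash"):
    """Build the argument list for an interactive GPU container."""
    args = ["docker", "run", "--rm", "-it", "--gpus", "all"]
    if proxy:
        args += ["-e", f"https_proxy={proxy}"]
    for host_path, container_path in mounts:
        args += ["-v", f"{host_path}:{container_path}"]
    args += [image, shell]
    return args


if __name__ == "__main__":
    cmd = docker_run_command("my/sd", MOUNTS, proxy="http://172.17.0.1:64540")
    # shlex.join quotes each argument so the line can be pasted into a shell.
    print(shlex.join(cmd))
```

The list form can also be passed to subprocess.run directly, which avoids shell quoting altogether.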

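Step 4: Improve and optimize

Later work will optimize for speed and features, for example using compose to manage the runtime parameters and adding support for other software images.

The improved code is at: Lax/study-stable-diffusion-original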