Colossal-AI fails to accumulate gradients

Colossal-AI: Making large AI models cheaper, faster and more accessible

Official website
GitHub
Paper: Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training

Runtime environment

Running with Docker

FROM hpcaitech/cuda-conda:11.6

# metainformation
LABEL org.opencontainers.image.source = "https://github.com/hpcaitech/ColossalAI"
LABEL org.opencontainers.image.licenses = "Apache License 2.0"
LABEL org.opencontainers.image.base.name = "docker.io/library/hpcaitech/cuda-conda:11.3"

# install torch
RUN conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch

# install apex
RUN git clone https://github.com/NVIDIA/apex && \
    cd apex && \
    pip install packaging && \
    pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" --global-option="--fast_layer_norm" ./

# install colossalai
RUN git clone https://github.com/hpcaitech/ColossalAI.git \
    && cd ./ColossalAI \
    && CUDA_EXT=1 pip install -v --no-cache-dir .

# install titans
RUN pip install --no-cache-dir titans

# install tensornvme
RUN conda install cmake && \
    git clone https://github.com/hpcaitech/TensorNVMe.git && \
    cd TensorNVMe && \
    pip install -r requirements.txt && \
    pip install -v --no-cache-dir .
FROM hpcaitech/colossalai:0.2.5

RUN apt-get update && \
    apt-get install -y openssh-server && \
    apt-get install -y vim && \
    apt-get install -y wget && \
    apt-get install -y iputils-ping && \
    apt-get install -y net-tools && \
    apt-get install -y curl && \
    apt-get install -y siege && \
    apt-get install -y kmod build-essential flex bison dwarves libssl-dev libelf-dev bc rsync dkms

COPY . /workspace/
RUN pip install -r requirements.txt -i http://pypi.douban.com/simple/ --trusted-host pypi.douban.com
WORKDIR "/workspace/"

RUN echo "export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib" >> /etc/profile
RUN /bin/bash -c "source /etc/profile"
RUN ldconfig /usr/local/cuda/lib64/stubs/
RUN ldconfig

RUN ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa; cat ~/.ssh/id_rsa.pub > ~/.ssh/authorized_keys
RUN echo "Host *" > ~/.ssh/config && \
    echo "GSSAPIAuthentication no" >> ~/.ssh/config && \
    echo "StrictHostKeyChecking no" >> ~/.ssh/config && \
    echo "UserKnownHostsFile=/dev/null" >> ~/.ssh/config

RUN echo "export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib" >> ~/.bashrc
RUN echo "service ssh restart" >> ~/.bashrc
RUN /bin/bash -c "source ~/.bashrc"

Direct installation

  • Linux only
    • Stable release: pip install colossalai
    • With the PyTorch CUDA extensions: CUDA_EXT=1 pip install colossalai
    • Nightly build: pip install colossalai-nightly

Gradient accumulation

When training a model, limited GPU memory restricts the usable batch_size; accumulating gradients over several batches is a way to enlarge the effective batch_size.

  • Run the forward pass on several batches, accumulate the gradient from each backward pass, and perform only one parameter update
    • This amounts to splitting batch_size into a MICRO_batch_size and a MACRO_batch_size (e.g. an effective batch of 64 can be realized as 4 micro-batches of 16)
    • Every MICRO_batch_size samples, run one forward and one backward pass; the gradients accumulate in each layer parameter's grad, and no update is performed
    • Every MACRO_batch_size micro-batches, perform one gradient update; at that point the optimizer steps, the learning-rate scheduler steps, and the gradients are zeroed

Gradient-Accumulation

accum_iter = 4
for batch_idx, (inputs, labels) in enumerate(data_loader):
    # forward pass
    preds = model(inputs)
    loss = criterion(preds, labels)
    # scale the loss to the mean over the accumulated (macro) batch
    loss = loss / accum_iter
    # backward pass: gradients accumulate in .grad
    loss.backward()
    # weight update only every accum_iter batches (or at the end of the epoch)
    if ((batch_idx + 1) % accum_iter == 0) or (batch_idx + 1) == len(data_loader):
        optimizer.step()
        optimizer.zero_grad()

Could we instead store the losses of several batches and then run just one backward pass and one update? No.
Take stochastic gradient descent (SGD) as an example, with parameters $V$, learning rate $lr$, and gradient $grad$.
Parameter update: $V_t = V_{t-1} - lr \cdot grad_t$;
with gradient accumulation: $V_t = V_{t-1} - lr \cdot \sum_{i=0}^{N} grad_i$.

total_loss = 0
loss = criterion(outputs, labels)
total_loss += loss / accumulation_steps
if ...:
    total_loss.backward()

If you instead accumulate the losses of several batches,
the mathematics changes:
$\partial loss / \partial w$ turns into
$\partial \big( \sum_{t=1}^{N} loss_t \big) / \partial w$.
Note: once the losses of several batches have been summed, the inputs $x$ seen by each layer have already changed by the time the gradient is computed, so the resulting gradients are not the intended ones.
Forward pass: x -> f(x) -> y
Backward pass: x, y' -> f'(x) -> x'
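As a sanity check of the update formula above, here is a minimal sketch in plain PyTorch (the toy linear model, random data and sizes are all hypothetical) showing that accumulating the gradients of per-micro-batch losses scaled by 1/accum_iter reproduces the full-batch gradient:

import torch

# hypothetical toy setup: one linear layer, a "macro" batch of 8 samples
torch.manual_seed(0)
model = torch.nn.Linear(4, 1)
criterion = torch.nn.MSELoss()
x, y = torch.randn(8, 4), torch.randn(8, 1)

# reference: gradient of the full (macro) batch
model.zero_grad()
criterion(model(x), y).backward()
full_grad = model.weight.grad.clone()

# gradient accumulation over 4 micro-batches of 2 samples each
accum_iter = 4
model.zero_grad()
for xb, yb in zip(x.chunk(accum_iter), y.chunk(accum_iter)):
    loss = criterion(model(xb), yb) / accum_iter  # scale to the macro-batch mean
    loss.backward()                               # gradients add up in .grad

print(torch.allclose(model.weight.grad, full_grad, atol=1e-6))  # expected: True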

Following the logic of the correct PyTorch gradient-accumulation code, the Colossal-AI version differs in only one place:

accum_iter = 4
for batch_idx, (inputs, labels) in enumerate(data_loader):
    # forward pass
    preds = model(inputs)
    loss = criterion(preds, labels)
    # scale the loss to the mean over the accumulated (macro) batch
    loss = loss / accum_iter
    # backward pass
    optimizer.backward(loss)  # loss.backward() in plain PyTorch; this is the only difference
    # weight update
    if ((batch_idx + 1) % accum_iter == 0) or (batch_idx + 1) == len(data_loader):
        optimizer.step()
        optimizer.zero_grad()

However, a network trained this way has convergence problems.

The problem in Colossal-AI

The training loop above uses Colossal-AI's GeminiAdamOptimizer.

import torch
from typing import Any

from colossalai.nn.optimizer import HybridAdam
from colossalai.nn.optimizer.zero_optimizer import ZeroOptimizer

__all__ = ['GeminiAdamOptimizer']


class GeminiAdamOptimizer(ZeroOptimizer):

    def __init__(self, model: torch.nn.Module, **defaults: Any) -> None:
        optimizer = HybridAdam(model.parameters(), **defaults)
        super().__init__(optimizer, model, **defaults)

Under the hood it is built from ZeroOptimizer and HybridAdam: a HybridAdam instance wrapped in a ZeroOptimizer.

# ZeroOptimizer.step (self.optim is the wrapped HybridAdam)
def step(self, *args, **kwargs):
    self._maybe_move_fp32_params()
    self._set_grad_ptr()

    found_inf = self._check_overflow()
    if found_inf:
        self.optim_state = OptimState.UNSCALED  # no need to unscale grad
        self.grad_scaler.update(found_inf)  # update gradient scaler
        self._logger.info(f'Found overflow. Skip step')
        self._clear_global_norm()  # clear recorded norm
        self.zero_grad()  # reset all gradients
        self._update_fp16_params()
        return

    # get combined scale. combined scale = loss scale * clipping norm
    # so that gradient = gradient / combined scale
    combined_scale = self._get_combined_scale()
    self.grad_scaler.update(found_inf)

    ret = self.optim.step(div_scale=combined_scale, *args, **kwargs)
    self._register_states()
    self.zero_grad()
    self._update_fp16_params()
    return ret
  • Looking at ZeroOptimizer's backward implementation, nothing appears to be wrong
    • Here self.module is a ZeroDDP object, the class that implements the model's parallel execution
# ZeroOptimizer
def backward(self, loss: torch.Tensor):
    loss = self.loss_scale * loss
    self.optim_state = OptimState.SCALED
    self.module.backward(loss)
  • Next, the model side: the model is wrapped into a ZeroDDP object
    • Its backward function performs the loss backward pass exactly as in the plain PyTorch version
# ZeroDDP
def _pre_bacward(self):
    # set a visit label for all parameters
    # the label is used to check whether the parameter is correctly reduced
    for param in self.param2name:
        if not is_ddp_ignored(param):
            setattr(param, "_gemini_reduced", False)

def _post_backward(self):
    if self.chunk_manager.accessed_mem != 0:
        error_params = ["Reduction failed at followed parameters:"]
        for param in self.param2name:
            if not is_ddp_ignored(param) and not getattr(param, "_gemini_reduced"):
                error_params.append(self.param2name[param])
        error_str = "\n\t".join(error_params)
        raise RuntimeError("ZERO DDP error: the synchronization of gradients doesn't exit properly.",
                           "The most possible reason is that the model is not compatible with ZeroDDP.\n",
                           f"{error_str}")
    self._setup_grads_ptr()
    self._logger.debug(
        f'comp cuda demand time: {self.gemini_manager._comp_cuda_demand_time}, layout time: {self.gemini_manager._layout_time}, evict time: {self.gemini_manager._evict_time}, CPU->CUDA vol: {self.gemini_manager._h2d_volume}B, CUDA->CPU vol: {self.gemini_manager._d2h_volume}'
    )
    self.gemini_manager.post_iter()

def backward(self, loss: torch.Tensor):
    self._pre_bacward()
    with self.param_op_hook.switch_to_backward(), ColoParamOpHookManager.use_hooks(self.param_op_hook):
        loss.backward()
    self._post_backward()
    • The problem is in the forward function: self.module.zero_grad(set_to_none=True) clears the gradients
    • Here self.module is a plain torch.nn.Module object, i.e. the user's model
    • Since the gradients are wiped on every forward call, they never accumulate; with the loss still scaled by 1/accum_iter, training may then fail to converge (see the minimal reproduction after the code below)
# ZeroDDP
def _post_forward(self):
    """This function is only triggered for inference.
    """
    access_list = list(self.chunk_manager.accessed_chunks)
    # we need to scatter all accessed chunks and move them to their original places
    for chunk in access_list:
        if chunk.keep_gathered:
            self.chunk_manager.fake_release_chunk(chunk)
        else:
            assert chunk.can_release
            self.chunk_manager.release_chunk(chunk)
        first_param = next(iter(chunk.tensors_info))
        self.chunk_manager.move_chunk(chunk, self.grads_device[first_param])
    assert self.chunk_manager.accessed_mem == 0
    # reset all recorded attributes
    self.gemini_manager.reset_attributes()

def forward(self, *args, **kwargs):
    # check whether we are in a inference mode
    grad_flag = torch.is_grad_enabled()
    if not grad_flag:
        assert not self.gemini_manager.need_warmup or not self.gemini_manager.is_warmup(
        ), "You should run a completed iteration as your warmup iter"

    args, kwargs = _cast_float(args, torch.half), _cast_float(kwargs, torch.half)
    self.module.zero_grad(set_to_none=True)  # clears the gradients on every forward call
    self.gemini_manager.pre_iter(*args)
    with ColoParamOpHookManager.use_hooks(self.param_op_hook):
        outputs = self.module(*args, **kwargs)
    # scatter chunks in the inference mode
    if not grad_flag:
        self._post_forward()

    if self.force_outputs_fp32:
        return _cast_float(outputs, torch.float)
    return outputs
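To isolate the effect, the following is a minimal, self-contained reproduction in plain PyTorch rather than Colossal-AI; the ZeroGradInForward wrapper and the toy model/data are hypothetical stand-ins that only mimic the self.module.zero_grad(set_to_none=True) call shown above:

import torch


class ZeroGradInForward(torch.nn.Module):
    """Toy stand-in for ZeroDDP: wipes gradients at the start of every forward call."""

    def __init__(self, module: torch.nn.Module):
        super().__init__()
        self.module = module

    def forward(self, *args, **kwargs):
        self.module.zero_grad(set_to_none=True)  # mimics ZeroDDP.forward
        return self.module(*args, **kwargs)


torch.manual_seed(0)
inner = torch.nn.Linear(4, 1)
model = ZeroGradInForward(inner)
criterion = torch.nn.MSELoss()
x, y = torch.randn(8, 4), torch.randn(8, 1)
accum_iter = 4

# "accumulate" gradients over 4 micro-batches, as in the training loop above
for xb, yb in zip(x.chunk(accum_iter), y.chunk(accum_iter)):
    loss = criterion(model(xb), yb) / accum_iter
    loss.backward()
accumulated = inner.weight.grad.clone()

# gradient of the last micro-batch alone
inner.zero_grad(set_to_none=True)
(criterion(inner(x.chunk(accum_iter)[-1]), y.chunk(accum_iter)[-1]) / accum_iter).backward()
last_only = inner.weight.grad.clone()

# gradient of the full macro batch (what accumulation should reproduce)
inner.zero_grad(set_to_none=True)
criterion(inner(x), y).backward()
full = inner.weight.grad.clone()

print(torch.allclose(accumulated, last_only))  # True: only the last micro-batch survived
print(torch.allclose(accumulated, full))       # False: accumulation is silently broken

Instead of the macro-batch gradient, the optimizer ends up seeing only the last micro-batch's gradient scaled by 1/accum_iter, which is consistent with the convergence problems described earlier.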

Further notes

In PyTorch, both nn.Module and optim.Optimizer provide a zero_grad function. What is the difference?

def zero_grad(self, set_to_none: bool = True) -> None:
    r"""Sets gradients of all model parameters to zero. See similar function
    under :class:`torch.optim.Optimizer` for more context.

    Args:
        set_to_none (bool): instead of setting to zero, set the grads to None.
            See :meth:`torch.optim.Optimizer.zero_grad` for details.
    """
    if getattr(self, '_is_replica', False):
        warnings.warn(
            "Calling .zero_grad() from a module created with nn.DataParallel() has no effect. "
            "The parameters are copied (in a differentiable manner) from the original module. "
            "This means they are not leaf nodes in autograd and so don't accumulate gradients. "
            "If you need gradients in your forward method, consider using autograd.grad instead.")

    for p in self.parameters():
        if p.grad is not None:
            if set_to_none:
                p.grad = None
            else:
                if p.grad.grad_fn is not None:
                    p.grad.detach_()
                else:
                    p.grad.requires_grad_(False)
                p.grad.zero_()
def zero_grad(self, set_to_none: bool = True):
    r"""Sets the gradients of all optimized :class:`torch.Tensor` s to zero.

    Args:
        set_to_none (bool): instead of setting to zero, set the grads to None.
            This will in general have lower memory footprint, and can modestly improve performance.
            However, it changes certain behaviors. For example:
            1. When the user tries to access a gradient and perform manual ops on it,
            a None attribute or a Tensor full of 0s will behave differently.
            2. If the user requests ``zero_grad(set_to_none=True)`` followed by a backward pass, ``.grad``\ s
            are guaranteed to be None for params that did not receive a gradient.
            3. ``torch.optim`` optimizers have a different behavior if the gradient is 0 or None
            (in one case it does the step with a gradient of 0 and in the other it skips
            the step altogether).
    """
    foreach = self.defaults.get('foreach', False)

    if not hasattr(self, "_zero_grad_profile_name"):
        self._patch_step_function()
    if foreach:
        per_device_and_dtype_grads = defaultdict(lambda: defaultdict(list))
    with torch.autograd.profiler.record_function(self._zero_grad_profile_name):
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is not None:
                    if set_to_none:
                        p.grad = None
                    else:
                        if p.grad.grad_fn is not None:
                            p.grad.detach_()
                        else:
                            p.grad.requires_grad_(False)

                        if (not foreach or p.grad.is_sparse):
                            p.grad.zero_()
                        else:
                            per_device_and_dtype_grads[p.grad.device][p.grad.dtype].append(p.grad)
        if foreach:
            for _, per_dtype_grads in per_device_and_dtype_grads.items():
                for grads in per_dtype_grads.values():
                    torch._foreach_zero_(grads)

The difference matters because parts of a model are sometimes frozen during training, so only the parameters of the layers to be trained are passed to the optimizer; Optimizer.zero_grad therefore clears only those parameters, while Module.zero_grad clears the gradients of every parameter in the model.
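A minimal sketch of this consequence (toy two-layer model; the setup is hypothetical): Optimizer.zero_grad touches only the parameters that were registered with the optimizer, whereas Module.zero_grad clears every parameter of the module:

import torch

torch.manual_seed(0)
model = torch.nn.Sequential(torch.nn.Linear(4, 4), torch.nn.Linear(4, 1))

# hypothetical setup: only the second layer's parameters are handed to the optimizer
optimizer = torch.optim.SGD(model[1].parameters(), lr=0.1)

model(torch.randn(2, 4)).sum().backward()  # both layers now hold gradients

optimizer.zero_grad(set_to_none=True)  # clears only the optimizer's own parameters
print(model[0].weight.grad is None)    # False: the first layer's grad is untouched
print(model[1].weight.grad is None)    # True

model.zero_grad(set_to_none=True)      # clears every parameter of the module
print(model[0].weight.grad is None)    # True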

See also: Colossal-AI's optimizer classes and torch.optim.Optimizer.

References

  1. Zhihu column: 聊聊梯度累加 (On Gradient Accumulation)
  2. Colossal-AI source code