Colossal-AI fails to accumulate gradients

Colossal-AI: Making large AI models cheaper, faster and more accessible

Official website
GitHub
Paper: Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training

Runtime environment

Running with Docker

FROM hpcaitech/cuda-conda:11.6

# metainformation
LABEL org.opencontainers.image.source = "https://github.com/hpcaitech/ColossalAI"
LABEL org.opencontainers.image.licenses = "Apache License 2.0"
LABEL org.opencontainers.image.base.name = "docker.io/library/hpcaitech/cuda-conda:11.3"

# install torch
RUN conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch

# install apex
RUN git clone https://github.com/NVIDIA/apex && \
    cd apex && \
    pip install packaging && \
    pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" --global-option="--fast_layer_norm" ./

# install colossalai
RUN git clone https://github.com/hpcaitech/ColossalAI.git \
    && cd ./ColossalAI \
    && CUDA_EXT=1 pip install -v --no-cache-dir .

# install titans
RUN pip install --no-cache-dir titans

# install tensornvme
RUN conda install cmake && \
    git clone https://github.com/hpcaitech/TensorNVMe.git && \
    cd TensorNVMe && \
    pip install -r requirements.txt && \
    pip install -v --no-cache-dir .
FROM hpcaitech/colossalai:0.2.5

RUN apt-get update && \
    apt-get install -y openssh-server && \
    apt-get install -y vim && \
    apt-get install -y wget && \
    apt-get install -y iputils-ping && \
    apt-get install -y net-tools && \
    apt-get install -y curl && \
    apt-get install -y siege && \
    apt-get install -y kmod build-essential flex bison dwarves libssl-dev libelf-dev bc rsync dkms

COPY . /workspace/
RUN pip install -r requirements.txt -i http://pypi.douban.com/simple/ --trusted-host pypi.douban.com
WORKDIR "/workspace/"

RUN echo "export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib" >> /etc/profile
RUN /bin/bash -c "source /etc/profile"
RUN ldconfig /usr/local/cuda/lib64/stubs/
RUN ldconfig

RUN ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa; cat ~/.ssh/id_rsa.pub > ~/.ssh/authorized_keys
RUN echo "Host *" > ~/.ssh/config && \
    echo "GSSAPIAuthentication no" >> ~/.ssh/config && \
    echo "StrictHostKeyChecking no" >> ~/.ssh/config && \
    echo "UserKnownHostsFile=/dev/null" >> ~/.ssh/config

RUN echo "export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib" >> ~/.bashrc
RUN echo "service ssh restart" >> ~/.bashrc
RUN /bin/bash -c "source ~/.bashrc"

Direct installation

  • Linux only
    • Stable release: pip install colossalai
    • With the PyTorch CUDA extensions: CUDA_EXT=1 pip install colossalai
    • Nightly build: pip install colossalai-nightly

Gradient accumulation

When training a model, limited GPU memory restricts the usable batch_size; accumulating gradients over several batches is a way to enlarge the effective batch_size.

  • Run the forward pass on several batches, accumulate the gradient from each backward pass, and perform only one parameter update
    • This amounts to splitting batch_size into a MICRO_batch_size and a MACRO_batch_size (e.g. an effective batch of 64 can be realized as 4 micro-batches of 16)
    • Every MICRO_batch_size samples, run one forward and one backward pass; the gradients accumulate in each layer parameter's grad, and no update is performed
    • Every MACRO_batch_size micro-batches, perform one gradient update; at that point the optimizer steps, the learning-rate scheduler steps, and the gradients are zeroed

Gradient-Accumulation

accum_iter = 4
for batch_idx, (inputs, labels) in enumerate(data_loader):
    # forward pass
    preds = model(inputs)
    loss = criterion(preds, labels)
    # scale the loss to the mean over the accumulated (macro) batch
    loss = loss / accum_iter
    # backward pass: gradients accumulate in .grad
    loss.backward()
    # weight update only every accum_iter batches (or at the end of the epoch)
    if ((batch_idx + 1) % accum_iter == 0) or (batch_idx + 1) == len(data_loader):
        optimizer.step()
        optimizer.zero_grad()

Could we instead store the losses of several batches and then run just one backward pass and one update? No.
Take stochastic gradient descent (SGD) as an example, with parameters $V$, learning rate $lr$, and gradient $grad$.
Parameter update: $V_t = V_{t-1} - lr \cdot grad_t$;
with gradient accumulation: $V_t = V_{t-1} - lr \cdot \sum_{i=0}^{N} grad_i$.

total_loss = 0
loss = criterion(outputs, labels)
total_loss += loss / accumulation_steps
if ...:
    total_loss.backward()

If you instead accumulate the losses of several batches,
the mathematics changes:
$\partial loss / \partial w$ turns into
$\partial \big( \sum_{t=1}^{N} loss_t \big) / \partial w$.
Note: once the losses of several batches have been summed, the inputs $x$ seen by each layer have already changed by the time the gradient is computed, so the resulting gradients are not the intended ones.
Forward pass: x -> f(x) -> y
Backward pass: x, y' -> f'(x) -> x'
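As a sanity check of the update formula above, here is a minimal sketch in plain PyTorch (the toy linear model, random data and sizes are all hypothetical) showing that accumulating the gradients of per-micro-batch losses scaled by 1/accum_iter reproduces the full-batch gradient:

import torch

# hypothetical toy setup: one linear layer, a "macro" batch of 8 samples
torch.manual_seed(0)
model = torch.nn.Linear(4, 1)
criterion = torch.nn.MSELoss()
x, y = torch.randn(8, 4), torch.randn(8, 1)

# reference: gradient of the full (macro) batch
model.zero_grad()
criterion(model(x), y).backward()
full_grad = model.weight.grad.clone()

# gradient accumulation over 4 micro-batches of 2 samples each
accum_iter = 4
model.zero_grad()
for xb, yb in zip(x.chunk(accum_iter), y.chunk(accum_iter)):
    loss = criterion(model(xb), yb) / accum_iter  # scale to the macro-batch mean
    loss.backward()                               # gradients add up in .grad

print(torch.allclose(model.weight.grad, full_grad, atol=1e-6))  # expected: True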

Following the logic of the correct PyTorch gradient-accumulation code, the Colossal-AI version differs in only one place:

accum_iter = 4
for batch_idx, (inputs, labels) in enumerate(data_loader):
    # forward pass
    preds = model(inputs)
    loss = criterion(preds, labels)
    # scale the loss to the mean over the accumulated (macro) batch
    loss = loss / accum_iter
    # backward pass
    optimizer.backward(loss)  # loss.backward() in plain PyTorch; this is the only difference
    # weight update
    if ((batch_idx + 1) % accum_iter == 0) or (batch_idx + 1) == len(data_loader):
        optimizer.step()
        optimizer.zero_grad()

However, a network trained this way has convergence problems.

The problem in Colossal-AI

The training loop above uses Colossal-AI's GeminiAdamOptimizer.

import torch
from typing import Any

from colossalai.nn.optimizer import HybridAdam
from colossalai.nn.optimizer.zero_optimizer import ZeroOptimizer

__all__ = ['GeminiAdamOptimizer']


class GeminiAdamOptimizer(ZeroOptimizer):

    def __init__(self, model: torch.nn.Module, **defaults: Any) -> None:
        optimizer = HybridAdam(model.parameters(), **defaults)
        super().__init__(optimizer, model, **defaults)

Under the hood it is built from ZeroOptimizer and HybridAdam: a HybridAdam instance wrapped in a ZeroOptimizer.

# ZeroOptimizer.step (self.optim is the wrapped HybridAdam)
def step(self, *args, **kwargs):
    self._maybe_move_fp32_params()
    self._set_grad_ptr()

    found_inf = self._check_overflow()
    if found_inf:
        self.optim_state = OptimState.UNSCALED  # no need to unscale grad
        self.grad_scaler.update(found_inf)  # update gradient scaler
        self._logger.info(f'Found overflow. Skip step')
        self._clear_global_norm()  # clear recorded norm
        self.zero_grad()  # reset all gradients
        self._update_fp16_params()
        return

    # get combined scale. combined scale = loss scale * clipping norm
    # so that gradient = gradient / combined scale
    combined_scale = self._get_combined_scale()
    self.grad_scaler.update(found_inf)

    ret = self.optim.step(div_scale=combined_scale, *args, **kwargs)
    self._register_states()
    self.zero_grad()
    self._update_fp16_params()
    return ret
  • Looking at ZeroOptimizer's backward implementation, nothing appears to be wrong
    • Here self.module is a ZeroDDP object, the class that implements the model's parallel execution
# ZeroOptimizer
def backward(self, loss: torch.Tensor):
    loss = self.loss_scale * loss
    self.optim_state = OptimState.SCALED
    self.module.backward(loss)
  • Next, the model side: the model is wrapped into a ZeroDDP object
    • Its backward function performs the loss backward pass exactly as in the plain PyTorch version
# ZeroDDP
def _pre_bacward(self):
    # set a visit label for all parameters
    # the label is used to check whether the parameter is correctly reduced
    for param in self.param2name:
        if not is_ddp_ignored(param):
            setattr(param, "_gemini_reduced", False)

def _post_backward(self):
    if self.chunk_manager.accessed_mem != 0:
        error_params = ["Reduction failed at followed parameters:"]
        for param in self.param2name:
            if not is_ddp_ignored(param) and not getattr(param, "_gemini_reduced"):
                error_params.append(self.param2name[param])
        error_str = "\n\t".join(error_params)
        raise RuntimeError("ZERO DDP error: the synchronization of gradients doesn't exit properly.",
                           "The most possible reason is that the model is not compatible with ZeroDDP.\n",
                           f"{error_str}")
    self._setup_grads_ptr()
    self._logger.debug(
        f'comp cuda demand time: {self.gemini_manager._comp_cuda_demand_time}, layout time: {self.gemini_manager._layout_time}, evict time: {self.gemini_manager._evict_time}, CPU->CUDA vol: {self.gemini_manager._h2d_volume}B, CUDA->CPU vol: {self.gemini_manager._d2h_volume}'
    )
    self.gemini_manager.post_iter()

def backward(self, loss: torch.Tensor):
    self._pre_bacward()
    with self.param_op_hook.switch_to_backward(), ColoParamOpHookManager.use_hooks(self.param_op_hook):
        loss.backward()
    self._post_backward()
    • The problem is in the forward function: self.module.zero_grad(set_to_none=True) clears the gradients
    • Here self.module is a plain torch.nn.Module object, i.e. the user's model
    • Since the gradients are wiped on every forward call, they never accumulate; with the loss still scaled by 1/accum_iter, training may then fail to converge (see the minimal reproduction after the code below)
# ZeroDDP
def _post_forward(self):
    """This function is only triggered for inference.
    """
    access_list = list(self.chunk_manager.accessed_chunks)
    # we need to scatter all accessed chunks and move them to their original places
    for chunk in access_list:
        if chunk.keep_gathered:
            self.chunk_manager.fake_release_chunk(chunk)
        else:
            assert chunk.can_release
            self.chunk_manager.release_chunk(chunk)
        first_param = next(iter(chunk.tensors_info))
        self.chunk_manager.move_chunk(chunk, self.grads_device[first_param])
    assert self.chunk_manager.accessed_mem == 0
    # reset all recorded attributes
    self.gemini_manager.reset_attributes()

def forward(self, *args, **kwargs):
    # check whether we are in a inference mode
    grad_flag = torch.is_grad_enabled()
    if not grad_flag:
        assert not self.gemini_manager.need_warmup or not self.gemini_manager.is_warmup(
        ), "You should run a completed iteration as your warmup iter"

    args, kwargs = _cast_float(args, torch.half), _cast_float(kwargs, torch.half)
    self.module.zero_grad(set_to_none=True)  # clears the gradients on every forward call
    self.gemini_manager.pre_iter(*args)
    with ColoParamOpHookManager.use_hooks(self.param_op_hook):
        outputs = self.module(*args, **kwargs)
    # scatter chunks in the inference mode
    if not grad_flag:
        self._post_forward()

    if self.force_outputs_fp32:
        return _cast_float(outputs, torch.float)
    return outputs
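To isolate the effect, the following is a minimal, self-contained reproduction in plain PyTorch rather than Colossal-AI; the ZeroGradInForward wrapper and the toy model/data are hypothetical stand-ins that only mimic the self.module.zero_grad(set_to_none=True) call shown above:

import torch


class ZeroGradInForward(torch.nn.Module):
    """Toy stand-in for ZeroDDP: wipes gradients at the start of every forward call."""

    def __init__(self, module: torch.nn.Module):
        super().__init__()
        self.module = module

    def forward(self, *args, **kwargs):
        self.module.zero_grad(set_to_none=True)  # mimics ZeroDDP.forward
        return self.module(*args, **kwargs)


torch.manual_seed(0)
inner = torch.nn.Linear(4, 1)
model = ZeroGradInForward(inner)
criterion = torch.nn.MSELoss()
x, y = torch.randn(8, 4), torch.randn(8, 1)
accum_iter = 4

# "accumulate" gradients over 4 micro-batches, as in the training loop above
for xb, yb in zip(x.chunk(accum_iter), y.chunk(accum_iter)):
    loss = criterion(model(xb), yb) / accum_iter
    loss.backward()
accumulated = inner.weight.grad.clone()

# gradient of the last micro-batch alone
inner.zero_grad(set_to_none=True)
(criterion(inner(x.chunk(accum_iter)[-1]), y.chunk(accum_iter)[-1]) / accum_iter).backward()
last_only = inner.weight.grad.clone()

# gradient of the full macro batch (what accumulation should reproduce)
inner.zero_grad(set_to_none=True)
criterion(inner(x), y).backward()
full = inner.weight.grad.clone()

print(torch.allclose(accumulated, last_only))  # True: only the last micro-batch survived
print(torch.allclose(accumulated, full))       # False: accumulation is silently broken

Instead of the macro-batch gradient, the optimizer ends up seeing only the last micro-batch's gradient scaled by 1/accum_iter, which is consistent with the convergence problems described earlier.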

Further notes

In PyTorch, both nn.Module and optim.Optimizer provide a zero_grad function. What is the difference?

def zero_grad(self, set_to_none: bool = True) -> None:
    r"""Sets gradients of all model parameters to zero. See similar function
    under :class:`torch.optim.Optimizer` for more context.

    Args:
        set_to_none (bool): instead of setting to zero, set the grads to None.
            See :meth:`torch.optim.Optimizer.zero_grad` for details.
    """
    if getattr(self, '_is_replica', False):
        warnings.warn(
            "Calling .zero_grad() from a module created with nn.DataParallel() has no effect. "
            "The parameters are copied (in a differentiable manner) from the original module. "
            "This means they are not leaf nodes in autograd and so don't accumulate gradients. "
            "If you need gradients in your forward method, consider using autograd.grad instead.")

    for p in self.parameters():
        if p.grad is not None:
            if set_to_none:
                p.grad = None
            else:
                if p.grad.grad_fn is not None:
                    p.grad.detach_()
                else:
                    p.grad.requires_grad_(False)
                p.grad.zero_()
def zero_grad(self, set_to_none: bool = True):
    r"""Sets the gradients of all optimized :class:`torch.Tensor` s to zero.

    Args:
        set_to_none (bool): instead of setting to zero, set the grads to None.
            This will in general have lower memory footprint, and can modestly improve performance.
            However, it changes certain behaviors. For example:
            1. When the user tries to access a gradient and perform manual ops on it,
            a None attribute or a Tensor full of 0s will behave differently.
            2. If the user requests ``zero_grad(set_to_none=True)`` followed by a backward pass, ``.grad``\ s
            are guaranteed to be None for params that did not receive a gradient.
            3. ``torch.optim`` optimizers have a different behavior if the gradient is 0 or None
            (in one case it does the step with a gradient of 0 and in the other it skips
            the step altogether).
    """
    foreach = self.defaults.get('foreach', False)

    if not hasattr(self, "_zero_grad_profile_name"):
        self._patch_step_function()
    if foreach:
        per_device_and_dtype_grads = defaultdict(lambda: defaultdict(list))
    with torch.autograd.profiler.record_function(self._zero_grad_profile_name):
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is not None:
                    if set_to_none:
                        p.grad = None
                    else:
                        if p.grad.grad_fn is not None:
                            p.grad.detach_()
                        else:
                            p.grad.requires_grad_(False)

                        if (not foreach or p.grad.is_sparse):
                            p.grad.zero_()
                        else:
                            per_device_and_dtype_grads[p.grad.device][p.grad.dtype].append(p.grad)
        if foreach:
            for _, per_dtype_grads in per_device_and_dtype_grads.items():
                for grads in per_dtype_grads.values():
                    torch._foreach_zero_(grads)

The difference matters because parts of a model are sometimes frozen during training, so only the parameters of the layers to be trained are passed to the optimizer; Optimizer.zero_grad therefore clears only those parameters, while Module.zero_grad clears the gradients of every parameter in the model.
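A minimal sketch of this consequence (toy two-layer model; the setup is hypothetical): Optimizer.zero_grad touches only the parameters that were registered with the optimizer, whereas Module.zero_grad clears every parameter of the module:

import torch

torch.manual_seed(0)
model = torch.nn.Sequential(torch.nn.Linear(4, 4), torch.nn.Linear(4, 1))

# hypothetical setup: only the second layer's parameters are handed to the optimizer
optimizer = torch.optim.SGD(model[1].parameters(), lr=0.1)

model(torch.randn(2, 4)).sum().backward()  # both layers now hold gradients

optimizer.zero_grad(set_to_none=True)  # clears only the optimizer's own parameters
print(model[0].weight.grad is None)    # False: the first layer's grad is untouched
print(model[1].weight.grad is None)    # True

model.zero_grad(set_to_none=True)      # clears every parameter of the module
print(model[0].weight.grad is None)    # True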

See also: Colossal-AI's optimizer classes and torch.optim.Optimizer.

References

  1. Zhihu column: 聊聊梯度累加 (On Gradient Accumulation)
  2. Colossal-AI source code