Difficulties while working on a PoC
This is a write-up of difficulties and errors encountered while working on a SegFormer PoC workbook.
Model
ValueError: You passed along num_labels=1055 with an incompatible id to label map:{}
- Passing `train_ds.features["scene_category"].num_classes` to `num_labels` when `len(id2label)` is expected
- Solution: Use `len(id2label)` (see the sketch below)
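A minimal sketch of building the label maps and deriving `num_labels` from them; `categories` stands in for the notebook's list of 150 scene category names and is only a placeholder here:

```python
# Placeholder for the notebook's 150 scene category names (assumption for illustration)
categories = ["wall", "building", "sky"]

id2label = {i: name for i, name in enumerate(categories)}
label2id = {name: i for i, name in id2label.items()}

# Derive num_labels from the label map instead of
# train_ds.features["scene_category"].num_classes (which is 1055 here)
num_labels = len(id2label)
```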
RuntimeError: Error(s) in loading state_dict for SegformerForSemanticSegmentation: size mismatch for decode_head.classifier.weight: copying a param with shape torch.Size([150, 256, 1, 1]) from checkpoint, the shape in current model is torch.Size([151, 256, 1, 1]). size mismatch for decode_head.classifier.bias: copying a param with shape torch.Size([150]) from checkpoint, the shape in current model is torch.Size([151]). You may consider adding ignore_mismatched_sizes=True in the model from_pretrained method.
- Solution: Use `ignore_mismatched_sizes=True` (sketch below)
- New alert:
  - decode_head.classifier.weight: found shape torch.Size([150, 256, 1, 1]) in the checkpoint and torch.Size([151, 256, 1, 1]) in the model instantiated
  - decode_head.classifier.bias: found shape torch.Size([150]) in the checkpoint and torch.Size([151]) in the model instantiated
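A hedged sketch of the model load with the mismatch tolerated; the checkpoint name is an assumption, and `id2label`/`label2id` are the maps built above:

```python
from transformers import SegformerForSemanticSegmentation

model = SegformerForSemanticSegmentation.from_pretrained(
    "nvidia/segformer-b0-finetuned-ade-512-512",  # assumed checkpoint with a 150-class head
    num_labels=len(id2label),
    id2label=id2label,
    label2id=label2id,
    ignore_mismatched_sizes=True,  # re-initializes decode_head.classifier to the new size
)
```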
NotImplementedError: Cannot copy out of meta tensor; no data!
- When using `device_map=dev` in `from_pretrained()`
- Solution: Assign `accelerate.infer_auto_device_map(model)` to `model.hf_device_map` after the model is loaded (sketch below)
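A sketch of that workaround, assuming the device map is computed only after the weights are materialized instead of being passed to `from_pretrained()`:

```python
import accelerate
from transformers import SegformerForSemanticSegmentation

# Load without device_map so no parameter is left on the meta device
# (label-map kwargs from the previous sketch omitted for brevity)
model = SegformerForSemanticSegmentation.from_pretrained(
    "nvidia/segformer-b0-finetuned-ade-512-512"  # assumed checkpoint
)

# Compute and attach the device map once the weights actually exist
model.hf_device_map = accelerate.infer_auto_device_map(model)
```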
Train
HuggingFace Dataloader RuntimeError: cannot pin 'torch.cuda.FloatTensor' only dense CPU tensors can be pinned
- The dataloader expects CPU tensors that it can pin and move to the model's device, but the collator has already loaded the data to 'cuda', so 'cuda' tensors are loaded onto 'cuda' again
- Solution: Do not call `.to("cuda")` inside `collator_fn`; leave the batch on the CPU (sketch below)
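A minimal collator sketch under that constraint; it assumes each example already carries `pixel_values` and `labels` tensors (the field names used later in this write-up), and leaves device placement to the Trainer:

```python
import torch

def collator_fn(features):
    # Keep the batch on the CPU so pin_memory still works; the Trainer moves it
    # to the model's device afterwards.
    return {
        "pixel_values": torch.stack([f["pixel_values"] for f in features]),
        "labels": torch.stack([f["labels"] for f in features]),
        # no .to("cuda") here
    }
```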
OutOfMemoryError: CUDA out of memory. Tried to allocate 4.69 GiB (GPU 0; 14.75 GiB total capacity; 11.08 GiB already allocated; 2.48 GiB free; 11.23 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
- See PyTorch CUDA Memory management
- Solution in the environment: `environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:256"`
- Solution for training: `per_device_train_batch_size=batch_size` with `batch_size` reduced from 32 to 8
- Solution for evaluation: `per_device_eval_batch_size=batch_size` with `batch_size` reduced from 32 to 1
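A sketch combining both mitigations; the allocator setting has to be exported before the first CUDA allocation, and the `TrainingArguments` shown are only the ones relevant to memory (the output directory is an assumption):

```python
from os import environ

# Must be set before the first CUDA allocation
environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:256"

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="segformer-poc",        # assumed output directory
    per_device_train_batch_size=8,     # reduced from 32
    per_device_eval_batch_size=1,      # reduced from 32
)
```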
RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling cublasCreate(handle)
- Solution: Reduce `max_split_size_mb` in `environ["PYTORCH_CUDA_ALLOC_CONF"]` from `2048` to at most `1024`
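In code, assuming the setting is applied the same way as above:

```python
from os import environ

# 2048 triggered the cuBLAS allocation failure; stay at or below 1024
environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:1024"
```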
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
- The error occurs in the cross-entropy loss, maybe due to a wrong number of labels or wrong label indexing (`id2label` or `label2id`), see "CUDA runtime error (59): device-side assert triggered"
- Switch to CPU to get more meaningful error messages
- Solution: Switching to CPU leads to `IndexError: Target 150 is out of bounds.` (a minimal CPU repro is sketched below)
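The out-of-range target can be reproduced directly on CPU with a toy tensor, which is what makes the CPU run informative (shapes are illustrative only):

```python
import torch
import torch.nn.functional as F

# 150-class logits, but a target index of 150 (the valid range is 0-149)
logits = torch.randn(2, 150, 4, 4)
labels = torch.full((2, 4, 4), 150, dtype=torch.long)

# On CPU this raises "IndexError: Target 150 is out of bounds." instead of the
# asynchronous device-side assert seen on CUDA.
F.cross_entropy(logits, labels)
```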
IndexError: Target 150 is out of bounds.
- Occurs in `torch._C._nn.cross_entropy_loss`, see "CUDA runtime error (59): device-side assert triggered"
- Maybe because `len(categories)` (150) is smaller than `train_ds.features['scene_category'].num_classes` (1055)? -> No
- Testing with `max([(i["labels"].min().item(), i["labels"].max().item()) for i in test_ds.shard(10, 0)])` yields `(0, 150)`, so label indices reach 150 while a 150-class head only accepts targets 0 to 149
- Solution: Prepend a dummy class, `id2label = {**{0:'NONE'}, **{k:v for k,v in enumerate(categories, 1)}}`. Has to be used with `ignore_mismatched_sizes=True` in `from_pretrained()` (sketch below)
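A sketch of that workaround; `categories` is again the assumed list of 150 category names, and the resulting 151-entry maps go into the same `from_pretrained()` call with `ignore_mismatched_sizes=True` shown earlier:

```python
# Reserve index 0 for a dummy class so the observed label range (0, 150) stays in bounds
id2label = {**{0: "NONE"}, **{k: v for k, v in enumerate(categories, 1)}}
label2id = {v: k for k, v in id2label.items()}

assert len(id2label) == len(categories) + 1  # 151 classes instead of 150
```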
RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same
- When trying to debug and trace `CUDA error: device-side assert triggered` on CPU instead of CUDA
- Solution: Do not use `device_map` when running on CPU (sketch below)
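A sketch of the CPU debugging setup under that constraint; the checkpoint name is an assumption:

```python
from transformers import SegformerForSemanticSegmentation

# Load without device_map and place the model explicitly so parameters and
# inputs end up on the same (CPU) device while debugging.
model = SegformerForSemanticSegmentation.from_pretrained(
    "nvidia/segformer-b0-finetuned-ade-512-512"
).to("cpu")
```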
ValueError: Unsupported number of image dimensions: 2
- Occurring at random batches even though `PIL.mode='RGB'` is reported for every image in the batch (`['RGB', 'RGB', 'RGB', 'RGB', 'RGB', 'RGB', 'RGB', 'RGB']`) and the batch shapes look correct:
  - `'pixel_values'`: `torch.Size([<batch_size=8>, <chn_dim=3>, 512, 512])`
  - `'labels'`: `torch.Size([<batch_size=8>, 512, 512])`
- Maybe a wrong `PIL.mode` such as `RGBA` with 4 channels instead of `RGB` (the reported 2 dimensions would also fit a grayscale image), see "Unsupported number of image dimensions" while using image_utils from Transformers
- Solution (a bad one): Call `image.convert("RGB")` on every image within the on-the-fly transform function `train_transforms(example_batch)`, as in the sketch below
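A sketch of that on-the-fly transform; the `SegformerImageProcessor` instance named `feature_extractor` and the dataset columns `image` and `annotation` are assumptions about the notebook:

```python
def train_transforms(example_batch):
    # Force every image to RGB so stray grayscale/RGBA frames do not break
    # the image processor; "bad" because the conversion runs on the fly every epoch.
    images = [img.convert("RGB") for img in example_batch["image"]]
    targets = [seg for seg in example_batch["annotation"]]
    return feature_extractor(images, segmentation_maps=targets, return_tensors="pt")
```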