SegFormer Part 4, Quantization Difficulties and Errors Part 1

Difficulties while Quantizing

Load versions for QAT and compare SPACE/TIME

ValueError: SegformerForImageClassification does not support device_map='auto'. To implement support, the model class needs to implement the _no_split_modules attribute.

  • The same ValueError is raised for device_map='sequential'.

Test model with example input

RuntimeError: Input type (float) and bias type (c10::Half) should be the same

  • Occurs if the input has not been cast to half precision
  • Solution: copy() the input dict and call half() on its pixel_values
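
A minimal sketch of that fix, assuming inputs is the dict returned by the image processor:

inputs_half = dict(inputs)  # shallow copy, leaves the original dict untouched
inputs_half["pixel_values"] = inputs_half["pixel_values"].half()  # cast the image tensor to fp16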

RuntimeError: "slow_conv2d_cpu" not implemented for 'Half'

  • Occurs if the input is halved but still on the CPU
  • Raised from /usr/local/lib/python3.10/dist-packages/torch/nn/modules/conv.py
  • Solution: move inputs["pixel_values"] to the cuda device
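
Continuing the sketch above; half-precision convolutions are only implemented for CUDA, so the tensor has to move to the GPU:

inputs_half["pixel_values"] = inputs_half["pixel_values"].to("cuda")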

UserWarning: Input type into Linear4bit is torch.float16, but bnb_4bit_compute_type=torch.float32 (default). This will lead to slow inference or training speed.

  • Solution (TODO): use a bnb config, BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=<dtype>)
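
A sketch of what that config could look like, assuming fp16 inputs:

import torch
from transformers import BitsAndBytesConfig

# match the compute dtype to the fp16 inputs to avoid the slow fp32 fallback
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)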

Training

RuntimeError: Input type (float) and bias type (c10::Half) should be the same

ValueError: The model you want to train is loaded in 8-bit precision. if you want to fine-tune an 8-bit model, please make sure that you have installed bitsandbytes>=0.37.2.

  • Raised while calling Trainer() despite a sufficiently new bitsandbytes being installed and imported, e.g. %pip list | grep bitsandbytes yields bitsandbytes 0.41.1
  • Solution: TODO

UserWarning: You are calling save_pretrained to a 8-bit converted model you may likely encounter unexepected behaviors. If you want to save 8-bit models, make sure to have bitsandbytes>0.37.2 installed.

  • Warning (and error) raised when saving the 8-bit quantized model

NotImplementedError: You are calling save_pretrained on a 4-bit converted model. This is currently not supported

  • Error raised when saving the 4-bit quantized model

RuntimeError: Loading a quantized checkpoint into non-quantized Linear8bitLt is not supported. Please call module.cuda() before module.load_state_dict()

  • TODO; the error message itself suggests calling module.cuda() before load_state_dict()
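
A sketch that follows the error message's own hint; the layer sizes and quantized_state_dict are placeholders:

import bitsandbytes as bnb

layer = bnb.nn.Linear8bitLt(256, 256, has_fp16_weights=False)
layer.cuda()  # quantizes the weights first, as the error message demands
layer.load_state_dict(quantized_state_dict)  # placeholder: a previously saved 8-bit state dict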

Designing a device map
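
A minimal sketch of deriving a custom map with accelerate; the checkpoint and the memory budget are assumptions:

from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, SegformerForSemanticSegmentation

checkpoint = "nvidia/mit-b0"  # example; any SegFormer checkpoint works
config = AutoConfig.from_pretrained(checkpoint)
with init_empty_weights():
    empty_model = SegformerForSemanticSegmentation(config)  # skeleton only, no weights allocated
device_map = infer_auto_device_map(empty_model, max_memory={0: "10GiB", "cpu": "30GiB"})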


SegFormer Part 3, Quantization Description

Description of Quantization of pre-trained Image Transformers

Load versions for QAT and compare SPACE/TIME

8-bit quantization with bitsandbytes

From the LLM.int8() paper (source on GH); see also the 8-bit HF inference example.

  • optimizer
    • bnb.optim.Adam8bit(....)
    • bnb.nn.Embedding(..)
  • inference
    • linear = bnb.nn.Linear8bitLt(...)
    • modes: mixed-precision, int8, or the full LLM.int8() method
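
A sketch of those building blocks; model and the layer sizes are assumptions:

import bitsandbytes as bnb

# 8-bit optimizer as a drop-in replacement for torch.optim.Adam
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-4)

# 8-bit linear layer for inference; threshold=6.0 enables the mixed-precision
# outlier decomposition from the LLM.int8() paper
linear = bnb.nn.Linear8bitLt(512, 512, has_fp16_weights=False, threshold=6.0)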

BitsAndBytesConfig also offers configuration support.

from transformers import BitsAndBytesConfig

# e.g. BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
)
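
The config is then passed at load time. A sketch, with checkpoint as a placeholder; device_map="auto" is avoided here because Segformer lacks _no_split_modules (see Part 4):

from transformers import AutoModelForImageClassification

model = AutoModelForImageClassification.from_pretrained(
    checkpoint,
    quantization_config=nf4_config,
    device_map={"": 0},  # place everything on GPU 0 instead of device_map="auto"
)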

SegFormer Part 2, PoC Difficulties and Errors

Difficulties while working on a PoC

This is a write-up of difficulties and errors encountered while working on a SegFormer PoC workbook.

Model

ValueError: You passed along num_labels=1055 with an incompatible id to label map:{}

  • Caused by passing train_ds.features["scene_category"].num_classes to num_labels when len(id2label) is expected
  • Solution: use len(id2label)
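
A sketch of the corrected call; checkpoint, id2label, and label2id as defined in the workbook:

from transformers import SegformerForSemanticSegmentation

model = SegformerForSemanticSegmentation.from_pretrained(
    checkpoint,
    num_labels=len(id2label),  # must match the id-to-label map
    id2label=id2label,
    label2id=label2id,
)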

RuntimeError: Error(s) in loading state_dict for SegformerForSemanticSegmentation: size mismatch for decode_head.classifier.weight: copying a param with shape torch.Size([150, 256, 1, 1]) from checkpoint, the shape in current model is torch.Size([151, 256, 1, 1]). size mismatch for decode_head.classifier.bias: copying a param with shape torch.Size([150]) from checkpoint, the shape in current model is torch.Size([151]). You may consider adding ignore_mismatched_sizes=True in the model from_pretrained method.

  • Solution: Use ignore_mismatched_sizes=True
  • New alert:
    - decode_head.classifier.weight: found shape torch.Size([150, 256, 1, 1]) in the checkpoint and torch.Size([151, 256, 1, 1]) in the model instantiated
    - decode_head.classifier.bias: found shape torch.Size([150]) in the checkpoint and torch.Size([151]) in the model instantiated

NotImplementedError: Cannot copy out of meta tensor; no data!

  • Occurs when using device_map=dev in from_pretrained()
  • Solution: assign accelerate.infer_auto_device_map(model) to model.hf_device_map after the model is loaded
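
A sketch of that workaround:

from accelerate import infer_auto_device_map
from transformers import SegformerForSemanticSegmentation

model = SegformerForSemanticSegmentation.from_pretrained(checkpoint)  # no device_map here
model.hf_device_map = infer_auto_device_map(model)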

Train

HuggingFace Dataloader RuntimeError: cannot pin 'torch.cuda.FloatTensor' only dense CPU tensors can be pinned

  • The collator already moves the data to 'cuda', but the dataloader can only pin dense CPU tensors, so pinning the CUDA tensors fails
  • Solution: do not call .to("cuda") inside the collate function; Trainer moves each batch to the model device itself
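
A sketch of a device-agnostic collate function:

import torch

def collate_fn(batch):
    # stack the per-sample tensors and deliberately stay on the CPU so pinning works
    return {
        "pixel_values": torch.stack([item["pixel_values"] for item in batch]),
        "labels": torch.stack([item["labels"] for item in batch]),
    }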

OutOfMemoryError: CUDA out of memory. Tried to allocate 4.69 GiB (GPU 0; 14.75 GiB total capacity; 11.08 GiB already allocated; 2.48 GiB free; 11.23 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

  • PyTorch CUDA memory management
  • Solution in the environment: environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:256"
  • Solution for training: reduce per_device_train_batch_size from 32 to 8
  • Solution for evaluation: reduce per_device_eval_batch_size from 32 to 1
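
The corresponding TrainingArguments, as a sketch (output_dir is a placeholder):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="segformer-scene-parse",  # placeholder
    per_device_train_batch_size=8,  # reduced from 32
    per_device_eval_batch_size=1,   # reduced from 32
)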

RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling cublasCreate(handle)

  • Solution: reduce max_split_size_mb in environ["PYTORCH_CUDA_ALLOC_CONF"] from 2048 to at most 1024

RuntimeError: CUDA error: device-side assert triggered CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1. Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

  • Occurs in the cross-entropy loss, likely due to a wrong number of labels or broken label indexing (id2label/label2id); see CUDA runtime error (59): device-side assert triggered
  • Switch to the CPU to get more meaningful error messages
  • Result: on the CPU the same failure surfaces as IndexError: Target 150 is out of bounds. (next error)

IndexError: Target 150 is out of bounds.

  • Occurs in torch._C._nn.cross_entropy_loss; see CUDA runtime error (59): device-side assert triggered
  • Maybe because len(categories) (150) is smaller than train_ds.features['scene_category'].num_classes (1055) -> no
  • Testing with max([(i["labels"].min().item(), i["labels"].max().item()) for i in test_ds.shard(10, 0)]) yields (0, 150), i.e. label ids range from 0 to 150 (151 values)
  • Solution: prepend a dummy class, id2label = {**{0:'NONE'}, **{k:v for k,v in enumerate(categories, 1)}}; this has to be combined with ignore_mismatched_sizes=True in from_pretrained() (see the sketch below)
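
A sketch of the resulting label maps; categories is the 150-entry ADE20K label list:

id2label = {0: "NONE", **{i: c for i, c in enumerate(categories, 1)}}  # 151 entries
label2id = {v: k for k, v in id2label.items()}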

RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same

  • Encountered while trying to trace CUDA error: device-side assert triggered on the CPU instead of CUDA
  • Solution: do not use a device_map when running on the CPU

ValueError: Unsupported number of image dimensions: 2

  • Occurring at random batches with
    • PIL.mode='RGB' (['RGB', 'RGB', 'RGB', 'RGB', 'RGB', 'RGB', 'RGB', 'RGB'])
    • 'pixel_values': torch.Size([<batch_size=8>, <chn_dim=3>, 512, 512])
    • 'labels': torch.Size([<batch_size=8>, 512, 512])
  • Possibly a wrong PIL mode, e.g. RGBA with 4 channels instead of RGB; see "Unsupported number of image dimensions" while using image_utils from Transformers
  • Solution (suboptimal): call image.convert("RGB") on every image within the on-the-fly transform function train_transforms(example_batch), as sketched below
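
A sketch of that transform function, assuming image_processor is the SegFormer image processor:

def train_transforms(example_batch):
    # suboptimal: converts every image on the fly, even those already in RGB
    images = [img.convert("RGB") for img in example_batch["image"]]
    masks = list(example_batch["annotation"])
    return image_processor(images, masks)  # yields pixel_values and labels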

SegFormer Part 1, Description

Description

Model

Using the Nvidia SegFormer (b0-sized) encoder pre-trained-only checkpoint (nvidia/mit-b0)

SegFormer Model Architecture

Task

Using scene parsing with the dataset scene_parse_150, a subset of the MIT semantic segmentation dataset ADE20K

  • “segment the whole image densely into semantic classes (image regions), where each pixel is assigned a class label”
  • “mean of the pixel-wise accuracy and class-wise IoU as the final score”
  • structure
{
  'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=683x512 at 0x1FF32A3EDA0>,
  'annotation': <PIL.PngImagePlugin.PngImageFile image mode=L size=683x512 at 0x1FF32E5B978>,
  'scene_category': 0
}
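
A sketch of loading the dataset to reproduce that structure:

from datasets import load_dataset

train_ds = load_dataset("scene_parse_150", split="train")
print(train_ds[0])  # {'image': ..., 'annotation': ..., 'scene_category': ...}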

Execution order for model Trainer()

  1. Transform on-the-fly
     • Data gets prepared and augmented batch-wise (<dataset>.set_transform(<transform_fn>))
  2. Tokenize the transformed data (image_processor)
     • Inputs image, annotation (segmentation mask), and scene_category (label)
     • Outputs pixel_values and labels tensors
  3. Collate the tokenized batch data (data_collator=collate_fn)
     • Returns stacked tensors of the tokenized data batches
  4. Fine-tune the model with the prepared data
     • Also inputs id2label and label2id
     • Returns a tensor of pixel-wise logits
  5. Evaluate the model output (compute_metrics)
     • Compares the output logits to the input segmentation mask
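
A sketch of how these five steps could be wired together; train_transforms, collate_fn, compute_metrics, training_args, and the datasets are assumed from the surrounding posts:

from transformers import Trainer

train_ds.set_transform(train_transforms)  # steps 1 and 2 run on the fly per batch

trainer = Trainer(
    model=model,                      # step 4: fine-tune
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=test_ds,
    data_collator=collate_fn,         # step 3: collate
    compute_metrics=compute_metrics,  # step 5: evaluate
)
trainer.train()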

Pseudo downstream forward run

from torch import no_grad
from transformers import (
  AutoModelForImageClassification,
  AutoImageProcessor
)
checkpoint = "nvidia/mit-b0"  # the pre-trained-only b0 encoder
image_processor = AutoImageProcessor.from_pretrained(checkpoint)
model = AutoModelForImageClassification.from_pretrained(checkpoint)
# image: any PIL image, e.g. train_ds[0]["image"]
# preprocess and tokenize, return PyTorch tensors
inputs = image_processor(image.convert("RGB"), return_tensors="pt")
# forward pass only, no gradient tracking
with no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
pred_cls_idx = logits.argmax(-1).item()
print(f"{pred_cls_idx=}, {model.config.id2label[pred_cls_idx]=}")

Some weights of SegformerForSemanticSegmentation were not initialized

The following layers were not initialized because they are meant to be fine-tuned on the downstream task.

  • 'decode_head.classifier.weight'
  • 'decode_head.batch_norm.bias'
  • 'decode_head.linear_c.3.proj.bias'
  • 'decode_head.batch_norm.running_mean'
  • 'decode_head.batch_norm.weight'
  • 'decode_head.batch_norm.running_var'
  • 'decode_head.linear_c.0.proj.weight'
  • 'decode_head.linear_c.1.proj.weight'
  • 'decode_head.classifier.bias'
  • 'decode_head.linear_c.1.proj.bias'
  • 'decode_head.linear_c.3.proj.weight'
  • 'decode_head.linear_c.2.proj.bias'
  • 'decode_head.linear_c.2.proj.weight'
  • 'decode_head.linear_fuse.weight'
  • 'decode_head.batch_norm.num_batches_tracked'
  • 'decode_head.linear_c.0.proj.bias'

Regarding the following warning:

Some weights of SegformerForSemanticSegmentation were not initialized from the model checkpoint at [...] are newly initialized because the shapes did not match:
- decode_head.classifier.weight: found shape torch.Size([150, 256, 1, 1]) in the checkpoint and torch.Size([151, 256, 1, 1]) in the model instantiated
- decode_head.classifier.bias: found shape torch.Size([150]) in the checkpoint and torch.Size([151]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.