Collection of Tools for ML

May 27, 2024

E2E Automated ML Tools (AMLT)

SegFormer Part 4, Quantization Difficulties and Errors Part 1

May 5, 2024

Difficulties while Quantizing

Load versions for QAT and compare SPACE/TIME

ValueError: SegformerForImageClassification does not support device_map='auto'. To implement support, the modelclass needs to implement the _no_split_modules attribute. and ValueError: SegformerForImageClassification does not support device_map='sequential'. To implement support, the modelclass needs to implement the _no_split_modules attribute.

from_pretrained() supports device_map='auto' in general, but SegformerForImageClassification does not implement the _no_split_modules attribute
Solution: pass device_map=0 (cuda:0) to SegformerForImageClassification.from_pretrained()

Test model with example input

RuntimeError: Input type (float) and bias type (c10::Half) should be the same

Occurs if the input is not cast to half precision
Solution: copy() the input dict and call half() on pixel_values

RuntimeError: "slow_conv2d_cpu" not implemented for 'Half'

Occurs if the input is cast to half precision but still on the CPU
From /usr/local/lib/python3.10/dist-packages/torch/nn/modules/conv.py
Solution: move inputs["pixel_values"] to the CUDA device
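The two fixes above (cast pixel_values to half, then move them to the model's device) can be combined in one helper. This is a minimal sketch, not the notebook's code: match_model_inputs is a hypothetical name, and it assumes the inputs behave like a dict of tensors, as returned by the image processor with return_tensors="pt".

```python
import torch


def match_model_inputs(inputs, dtype=torch.float16, device="cpu"):
    """Return a shallow copy of `inputs` with pixel_values cast and moved."""
    prepared = dict(inputs)  # copy, so the original batch stays untouched
    prepared["pixel_values"] = prepared["pixel_values"].to(device=device, dtype=dtype)
    return prepared


# Stand-in for the image processor output (assumption: one 512x512 RGB image).
example = {"pixel_values": torch.rand(1, 3, 512, 512)}
device = "cuda:0" if torch.cuda.is_available() else "cpu"
prepared = match_model_inputs(example, dtype=torch.float16, device=device)
```

Doing the cast and the move in one .to() call avoids the intermediate state (half tensor on CPU) that triggers the "slow_conv2d_cpu" error.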

UserWarning: Input type into Linear4bit is torch.float16, but bnb_4bit_compute_type=torch.float32 (default). This will lead to slow inference or training speed.

Solution (TODO): Use bnb config BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=<dtype>)

Training

RuntimeError: Input type (float) and bias type (c10::Half) should be the same

Error when training int8 and int4 models without adapting the input dtype
PyTorch cpp c10::Half, Introducing the Half type!, Numpy Data types
Solution: collate_fn with tensor.half()
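A collate_fn along these lines would apply the half() cast per batch. It is a sketch under assumptions: each example is taken to be a dict with a pixel_values tensor and a labels tensor (a scalar class index here), and only the image tensor is cast, since the loss expects integer class indices.

```python
import torch


def collate_fn(examples):
    """Stack per-example tensors; cast only the image tensor to half."""
    pixel_values = torch.stack([ex["pixel_values"] for ex in examples]).half()
    labels = torch.stack([ex["labels"] for ex in examples]).long()
    return {"pixel_values": pixel_values, "labels": labels}


# Two dummy examples standing in for processed dataset rows.
batch = collate_fn([
    {"pixel_values": torch.rand(3, 8, 8), "labels": torch.tensor(3)}
    for _ in range(2)
])
```

The tensors stay on the CPU here; the Trainer moves each batch to the model's device on its own.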

ValueError: The model you want to train is loaded in 8-bit precision. if you want to fine-tune an 8-bit model, please make sure that you have installed bitsandbytes>=0.37.2.

Error while calling Trainer() despite having bitsandbytes>=0.37.2 installed and imported, e.g. %pip list | grep bitsandbytes yields bitsandbytes 0.41.1
Solution: TODO

UserWarning: You are calling save_pretrained to a 8-bit converted model you may likely encounter unexepected behaviors. If you want to save 8-bit models, make sure to have bitsandbytes>0.37.2 installed.

Warning and Error saving 8bit quantized model

NotImplementedError: You are calling save_pretrained on a 4-bit converted model. This is currently not supported

Error saving 4bit quantized model

RuntimeError: Loading a quantized checkpoint into non-quantized Linear8bitLt is not supported. Please call module.cuda() before module.load_state_dict()

TODO

Designing a device map

Designing a device map with HF Accelerate, supported defaults: "auto", "balanced", "balanced_low_0", "sequential"
accelerate.infer_auto_device_map
Solution: for now use device_map=0 or device_map={'':torch.cuda.current_device()}
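The workaround above can be wrapped so the same notebook also runs without a GPU. This is a sketch; single_device_map is a hypothetical helper name, and returning None on CPU-only machines is an assumption that follows the "do not use device_map for cpu" finding from the PoC notes.

```python
import torch


def single_device_map():
    """Place the whole model ('' = root module) on the current GPU, if any."""
    if torch.cuda.is_available():
        return {"": torch.cuda.current_device()}
    return None  # no device_map on CPU-only machines


device_map = single_device_map()
# Intended use (not executed here):
# SegformerForImageClassification.from_pretrained(checkpoint, device_map=device_map)
```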

SegFormer Part 3, Quantization Description

May 5, 2024

Description of Quantization of pre-trained Image Transformers

Load versions for QAT and compare SPACE/TIME

8-bit quantization with bitsandbytes

From LLM.int8() Paper, Source GH. 8-bit HF inference example

optimizer
- bnb.optim.Adam8bit(....)
- bnb.nn.Embedding(..)
inference
- linear = bnb.nn.Linear8bitLt(...)
- Modes: mixed-precision, int8
- or full LLM.int8() method

BitsAndBytesConfig also offers configuration support.

from transformers import BitsAndBytesConfig

# quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)  # needs `import torch`
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
)

Links

HF SegFormer, SegFormer Semantic Segmentation
HF Quantize Transformer Models
HF PEFT Parameter-Efficient Fine-Tuning
HF PEFT LoRA int8 Finetune-opt-bnb-peft.ipynb
HF Accelerate MP multi-GPUs/TPU/fp16
- wraps torch.distributed.run
Nvidia amp: Automatic Mixed Precision
Microsoft DeepSpeed CPU offloading
- HF DeepSpeed integration
HF Utilities for Image Processors
PyTorch performance tuning

SegFormer Part 2, PoC Difficulties and Errors

May 5, 2024

Difficulties while working on a PoC

This is a write-up of the difficulties and errors encountered while working on a SegFormer PoC workbook.

Model

ValueError: You passed along num_labels=1055 with an incompatible id to label map:{}

Passing train_ds.features["scene_category"].num_classes to num_labels when len(id2label) is expected
Solution: Use len(id2label)

RuntimeError: Error(s) in loading state_dict for SegformerForSemanticSegmentation: size mismatch for decode_head.classifier.weight: copying a param with shape torch.Size([150, 256, 1, 1]) from checkpoint, the shape in current model is torch.Size([151, 256, 1, 1]). size mismatch for decode_head.classifier.bias: copying a param with shape torch.Size([150]) from checkpoint, the shape in current model is torch.Size([151]). You may consider adding ignore_mismatched_sizes=True in the model from_pretrained method.

Solution: Use ignore_mismatched_sizes=True
New alert: - decode_head.classifier.weight: found shape torch.Size([150, 256, 1, 1]) in the checkpoint and torch.Size([151, 256, 1, 1]) in the model instantiated - decode_head.classifier.bias: found shape torch.Size([150]) in the checkpoint and torch.Size([151]) in the model instantiated

NotImplementedError: Cannot copy out of meta tensor; no data!

Occurs when using device_map=dev in from_pretrained().
Solution: Assign the result of accelerate.infer_auto_device_map(model) to model.hf_device_map after the model is loaded

Train

HuggingFace Dataloader RuntimeError: cannot pin 'torch.cuda.FloatTensor' only dense CPU tensors can be pinned

The DataLoader pins memory, which works for CPU tensors only, and the Trainer moves each batch to the model's device itself; tensors already moved to 'cuda' in the collator cannot be pinned
Solution: Do not call .to(cuda) inside collate_fn

OutOfMemoryError: CUDA out of memory. Tried to allocate 4.69 GiB (GPU 0; 14.75 GiB total capacity; 11.08 GiB already allocated; 2.48 GiB free; 11.23 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

PyTorch CUDA Memory management
Solution in environment: environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:256"
Solution for training: per_device_train_batch_size=batch_size with batch_size from 32 to 8
Solution for evaluation: per_device_eval_batch_size=batch_size with batch_size from 32 to 1
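The three settings above could be collected as follows. This is a sketch: batch_size_kwargs is a hypothetical name, and the environment variable should be set before CUDA is initialized, so ideally at the very top of the notebook.

```python
import os

# Set early: the allocator reads this when CUDA memory is first used.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:256"

# Batch sizes that fit on the ~15 GiB GPU from the error message above.
batch_size_kwargs = {
    "per_device_train_batch_size": 8,  # reduced from 32
    "per_device_eval_batch_size": 1,   # reduced from 32
}
# Intended use (not executed here):
# TrainingArguments(output_dir="out", **batch_size_kwargs)  # "out" is a placeholder
```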

RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling cublasCreate(handle)

Solution: Lower max_split_size_mb in environ["PYTORCH_CUDA_ALLOC_CONF"] from 2048 to at most 1024

RuntimeError: CUDA error: device-side assert triggered CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1. Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Error occurs in cross entropy, maybe wrong number of labels or label indexing, id2label or label2id, See CUDA runtime error (59) : device-side assert triggered
Switch to CPU to get more meaningful error messages
Solution: Switching to CPU leads to IndexError: Target 150 is out of bounds.
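The effect of switching to CPU can be reproduced in isolation. The snippet below is a minimal sketch with made-up tensors (150 classes, one target equal to 150); it is not the notebook's data, only an illustration of why the CPU path gives a readable message where CUDA only reports a device-side assert.

```python
import torch
import torch.nn.functional as F

num_labels = 150
logits = torch.randn(4, num_labels)    # CPU tensors on purpose
target = torch.tensor([0, 5, 150, 3])  # 150 is outside the valid range 0..149

message = None
try:
    F.cross_entropy(logits, target)
except (IndexError, RuntimeError) as err:
    message = str(err)  # names the offending target value

print(message)
```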

IndexError: Target 150 is out of bounds.

Occurs in torch._C._nn.cross_entropy_loss, See CUDA runtime error (59) : device-side assert triggered.
Maybe because len(categories) (150) smaller than train_ds.features['scene_category'].num_classes (1055) -> No.
Testing with max([(i["labels"].min().item(), i["labels"].max().item()) for i in test_ds.shard(10, 0)]) yields (0, 150)
Solution: Prepend dummy class id2label = {**{0:'NONE'}, **{k:v for k,v in enumerate(categories, 1)}}. Has to be used with ignore_mismatched_sizes=True in from_pretrained().
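Written out, the label maps from the solution above look like this. The categories list is a hypothetical stand-in for the 150 real class names; the construction of id2label is the one given in the solution line.

```python
# Hypothetical stand-in for the 150 ADE20k category names.
categories = [f"class_{i}" for i in range(150)]

# Shift the real classes to ids 1..150 and reserve id 0 for a dummy class.
id2label = {**{0: "NONE"}, **{k: v for k, v in enumerate(categories, 1)}}
label2id = {v: k for k, v in id2label.items()}
num_labels = len(id2label)
```

With 151 entries the classifier head no longer matches a 150-class checkpoint, which is why ignore_mismatched_sizes=True is needed and why the shape warning (150 vs. 151) appears.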

RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same

When trying to debug and trace CUDA error: device-side assert triggered with CPU instead of CUDA
Solution: Do not use device_map for cpu

ValueError: Unsupported number of image dimensions: 2

Occurring at random batches with
- PIL.mode='RGB' (['RGB', 'RGB', 'RGB', 'RGB', 'RGB', 'RGB', 'RGB', 'RGB'])
- 'pixel_values':torch.Size([<batch_size=8>, <chn_dim=3>, 512, 512])
- 'labels':torch.Size([<batch_size=8>, 512, 512])
Maybe false PIL.mode like RGBA with 4 channels instead of RGB, See “Unsupported number of image dimensions” while using image_utils from Transformers
Solution (bad one): Using image.convert("RGB") on every image within the on-the-fly transform function train_transforms(example_batch)
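A slightly cheaper variant of that workaround converts only images whose mode is not already RGB. This is a sketch; ensure_rgb is a hypothetical helper meant to be called at the start of train_transforms. One possible reading of the error, stated as an assumption: a grayscale (mode L) image becomes a 2D array, which would match "number of image dimensions: 2".

```python
from PIL import Image


def ensure_rgb(images):
    """Convert every PIL image that is not already RGB (e.g. L, RGBA, CMYK)."""
    return [img if img.mode == "RGB" else img.convert("RGB") for img in images]


# Dummy batch with mixed modes standing in for example_batch["image"].
batch = [Image.new("RGB", (4, 4)), Image.new("L", (4, 4)), Image.new("RGBA", (4, 4))]
modes = [img.mode for img in ensure_rgb(batch)]
```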

SegFormer Part 1, Description

May 5, 2024

Description

Model

Using Nvidia SegFormer (b0-sized) encoder pre-trained-only

“hierarchical Transformer encoder”, “lightweight all-MLP decode head” (for segmentation)
“pre-trained on ImageNet-1k, after which a decode head is added and fine-tuned altogether on a downstream dataset”
“SegformerForSemanticSegmentation adds the all-MLP decoder head on top”
Paper SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers
Paper Github
SegFormer Model Architecture


Task

Using scene-parsing with Dataset scene_parse_150, a subset of semantic segmentation dataset MIT ADE20k

“segment the whole image densely into semantic classes (image regions), where each pixel is assigned a class label”
“mean of the pixel-wise accuracy and class-wise IoU as the final score”
structure

{
  'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=683x512 at 0x1FF32A3EDA0>,
  'annotation': <PIL.PngImagePlugin.PngImageFile image mode=L size=683x512 at 0x1FF32E5B978>,
  'scene_category': 0
}
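The scoring rule quoted above (mean of pixel-wise accuracy and class-wise IoU) can be illustrated on a toy prediction. This is a simplified per-image sketch, not the benchmark's official evaluation: it averages IoU over the classes that occur in either mask and does not handle ignored labels.

```python
import torch


def pixel_accuracy(pred, target):
    """Fraction of pixels whose predicted class equals the annotation."""
    return (pred == target).float().mean().item()


def mean_iou(pred, target):
    """Mean intersection-over-union over classes present in pred or target."""
    ious = []
    for cls in torch.unique(torch.cat([pred.flatten(), target.flatten()])):
        inter = ((pred == cls) & (target == cls)).sum().item()
        union = ((pred == cls) | (target == cls)).sum().item()
        ious.append(inter / union)  # union > 0 because cls occurs somewhere
    return sum(ious) / len(ious)


pred = torch.tensor([[0, 0], [1, 1]])
target = torch.tensor([[0, 1], [1, 1]])
score = (pixel_accuracy(pred, target) + mean_iou(pred, target)) / 2
```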

Execution order for model `Trainer()`

Transform on-the-fly
- Data gets batch-wise prepared and augmented (<dataset>.set_transform(<transform_fn>))
Tokenize transformed data (image_processor)
- Inputs image, annotation (segmentation mask) and scene_category (label)
- Outputs pixel_values and labels tensors
Collate tokenized batch data (data_collator=collate_fn)
- Returns stacked tensor of tokenized data batches
Fine-tune model with prepared data
- Also inputs id2label and label2id
- Returns tensor of pixel-wise logits
Evaluate model output (compute_metrics)
- Compare output logits to input segmentation mask

Pseudo downstream forward run

from PIL import Image
from torch import no_grad
from transformers import (
  AutoModelForImageClassification,
  AutoImageProcessor
)
# assumptions for illustration: encoder-only b0 checkpoint and a local image file
checkpoint = "nvidia/mit-b0"
image = Image.open("example.jpg")
image_processor = AutoImageProcessor.from_pretrained(checkpoint)
model = AutoModelForImageClassification.from_pretrained(checkpoint)
# preprocess (resize, normalize), return PyTorch tensors
inputs = image_processor(image.convert("RGB"), return_tensors="pt")
# forward only, no gradients needed
with no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
# index of the highest-scoring class and its label
pred_cls_idx = logits.argmax(-1).item()
print(f"{pred_cls_idx=}, {model.config.id2label[pred_cls_idx]=}")

Some weights of SegformerForSemanticSegmentation were not initialized

The following layers were not initialized from the checkpoint because they are meant to be fine-tuned on a downstream task.

‘decode_head.classifier.weight’
‘decode_head.batch_norm.bias’
‘decode_head.linear_c.3.proj.bias’
‘decode_head.batch_norm.running_mean’
‘decode_head.batch_norm.weight’
‘decode_head.batch_norm.running_var’
‘decode_head.linear_c.0.proj.weight’
‘decode_head.linear_c.1.proj.weight’
‘decode_head.classifier.bias’
‘decode_head.linear_c.1.proj.bias’
‘decode_head.linear_c.3.proj.weight’
‘decode_head.linear_c.2.proj.bias’
‘decode_head.linear_c.2.proj.weight’
‘decode_head.linear_fuse.weight’
‘decode_head.batch_norm.num_batches_tracked’
‘decode_head.linear_c.0.proj.bias’

In regards to the following warning:

Some weights of SegformerForSemanticSegmentation were not initialized from the model checkpoint at [...] are newly initialized because the shapes did not match:
- decode_head.classifier.weight: found shape torch.Size([150, 256, 1, 1]) in the checkpoint and torch.Size([151, 256, 1, 1]) in the model instantiated
- decode_head.classifier.bias: found shape torch.Size([150]) in the checkpoint and torch.Size([151]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

SegFormer Quantization Part 1 Short Intro and Reason

August 6, 2023

SegFormer Quantization, Part 1 Introduction and Reason

Purpose

Quantization to reduce SPACE and its effect on TIME and model quality
Research different quantization schemes on pre-trained models
Using HuggingFace built-in or custom functions
If HuggingFace is insufficient use PyTorch or TensorFlow Hubs
If all else fails, fall back to low-level PyTorch

What

Overcoming and recording difficulties along the way
From PoC to MVP
As generic as possible using jupytext and papermill

How

Using SegFormer (HF) for Image Classification and Semantic Segmentation
PyTorch .half()
HuggingFace and bitsandbytes load_in_8bit and load_in_4bit
Custom functions like binarization
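As a small illustration of the SPACE side of the simplest option listed above, PyTorch .half(), the parameter memory of a module can be compared before and after the cast. This is a sketch on a toy nn.Linear layer, not on SegFormer; param_bytes is a hypothetical helper.

```python
import copy

import torch
from torch import nn


def param_bytes(model):
    """Total memory taken by the model's parameters, in bytes."""
    return sum(p.numel() * p.element_size() for p in model.parameters())


model_fp32 = nn.Linear(16, 8)                  # 16*8 weights + 8 biases
model_fp16 = copy.deepcopy(model_fp32).half()  # same parameters in float16
sizes = {"fp32": param_bytes(model_fp32), "fp16": param_bytes(model_fp16)}
```

Halving the bytes per parameter halves the footprint; the effect on TIME and model quality has to be measured separately.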

To come

Using PyTorch quantization capabilities like quant/dequant-Layers, dtype.qint32, quantize_fx, QConfigMapping
Task specific distribution of w/b, act and grad
Use learned task specific distribution as initialisation
