Quantization of pre-trained Image Transformers
Load quantized versions (e.g., from quantization-aware training, QAT) and compare their space/time costs; a rough comparison sketch follows.
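A minimal space/time comparison sketch, assuming transformers with the bitsandbytes integration and a CUDA GPU; the checkpoint name and the `load_in_8bit` shortcut are assumptions, and a real timing harness would add warm-up and repetition:

```python
import time
import torch
from transformers import AutoModelForImageClassification

ckpt = "google/vit-base-patch16-224"  # placeholder ViT checkpoint

# fp32 baseline vs. 8-bit quantized load (needs bitsandbytes + a CUDA GPU).
model_fp32 = AutoModelForImageClassification.from_pretrained(ckpt).to("cuda")
model_int8 = AutoModelForImageClassification.from_pretrained(
    ckpt, load_in_8bit=True, device_map="auto"
)

# SPACE: parameter memory footprint.
print(f"fp32: {model_fp32.get_memory_footprint() / 2**20:.0f} MiB")
print(f"int8: {model_int8.get_memory_footprint() / 2**20:.0f} MiB")

# TIME: single-image forward latency (int8 path expects fp16 activations).
x = torch.randn(1, 3, 224, 224, device="cuda")
for name, model, xi in [("fp32", model_fp32, x), ("int8", model_int8, x.half())]:
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    with torch.no_grad():
        model(pixel_values=xi)
    torch.cuda.synchronize()
    print(f"{name} latency: {time.perf_counter() - t0:.3f} s")
```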
8-bit quantization with bitsandbytes
From the LLM.int8() paper (source on GitHub); 8-bit Hugging Face inference example.
- optimizer: bnb.optim.Adam8bit(...), bnb.nn.Embedding(...)
- inference: linear = bnb.nn.Linear8bitLt(...)
- Modes: mixed-precision int8, or the full LLM.int8() method (sketched below)
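A minimal end-to-end sketch of these pieces, assuming bitsandbytes is installed and a CUDA GPU is available; the layer sizes, learning rate, and threshold value are illustrative placeholders:

```python
import torch
import bitsandbytes as bnb

# Training side: 8-bit Adam optimizer states plus a bnb embedding layer.
emb = bnb.nn.Embedding(10_000, 512).cuda()
head = torch.nn.Linear(512, 10).cuda()
opt = bnb.optim.Adam8bit(list(emb.parameters()) + list(head.parameters()), lr=1e-4)

tokens = torch.randint(0, 10_000, (8, 16), device="cuda")
loss = head(emb(tokens).mean(dim=1)).sum()
loss.backward()
opt.step()
opt.zero_grad()

# Inference side: int8 linear layer; threshold=6.0 turns on the
# LLM.int8() mixed-precision decomposition for outlier features.
linear = bnb.nn.Linear8bitLt(512, 512, has_fp16_weights=False, threshold=6.0).cuda()
out = linear(torch.randn(8, 512, device="cuda", dtype=torch.float16))
```

has_fp16_weights=False stores frozen int8 weights for inference, while a nonzero threshold routes outlier feature dimensions through fp16, which is the mixed-precision mode from the paper.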
BitsAndBytesConfig from transformers also offers configuration support:
```python
import torch
from transformers import BitsAndBytesConfig

# 4-bit with bf16 compute:
# quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
)
```
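The config is then passed at load time; a minimal sketch, assuming a bitsandbytes-compatible checkpoint (the model name is a placeholder):

```python
from transformers import AutoModelForCausalLM

# Placeholder checkpoint; any model covered by the bitsandbytes
# integration loads the same way.
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",
    quantization_config=nf4_config,
    device_map="auto",
)
```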
Links
- HF SegFormer, SegFormer Semantic Segmentation
- HF Quantize Transformer Models
- HF PEFT Parameter-Efficient Fine-Tuning
- HF PEFT LoRA int8 Finetune-opt-bnb-peft.ipynb
- HF Accelerate: mixed precision (MP) on multi-GPUs/TPU/fp16 (see the sketch after this list)
  - wraps torch.distributed.run
  - wraps Nvidia amp: Automatic Mixed Precision
- Microsoft DeepSpeed CPU offloading
- HF Utilities for Image Processors
- PyTorch performance tuning
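A minimal Accelerate sketch of the mixed-precision wrapping noted above, assuming a GPU for fp16; the model and data are toy placeholders. Launching with `accelerate launch script.py` is what provides the torch.distributed.run wrapping:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator(mixed_precision="fp16")  # amp-style mixed precision

# Toy model and data; placeholders for a real training setup.
model = torch.nn.Linear(128, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loader = DataLoader(
    TensorDataset(torch.randn(64, 128), torch.randint(0, 2, (64,))),
    batch_size=16,
)

# prepare() handles device placement, and DDP wrapping when run
# under `accelerate launch`.
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

for x, y in loader:
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(x), y)
    accelerator.backward(loss)  # applies grad scaling under fp16
    optimizer.step()
```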