Model Compression: The demands of running complex models on resource-constrained systems have made model compression techniques essential in machine learning. This survey explores model compression techniques and highlights key strategies for reducing model size and computational cost with minimal loss in accuracy. We group the methods into four categories: parameter pruning, quantization, knowledge distillation, and low-rank factorization.
Pruning removes unnecessary or low-importance connections and weights from a neural network. Structured pruning eliminates entire filters or channels, producing regular, hardware-friendly architectures, whereas unstructured pruning removes individual weights, yielding sparse matrices that typically require specialized hardware or sparse libraries for efficient computation.
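As a minimal sketch, the snippet below illustrates both styles with PyTorch's torch.nn.utils.prune utilities; the layer sizes and pruning ratios are illustrative assumptions, not values from this survey.

import torch.nn as nn
import torch.nn.utils.prune as prune

# Toy model; the sizes are placeholders.
model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))

# Unstructured pruning: zero the 50% of weights with the smallest magnitude.
prune.l1_unstructured(model[0], name="weight", amount=0.5)

# Structured pruning: zero the 25% of output rows with the smallest L2 norm.
prune.ln_structured(model[2], name="weight", amount=0.25, n=2, dim=0)

# Make the pruning permanent by removing the re-parametrization masks.
prune.remove(model[0], "weight")
prune.remove(model[2], "weight")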
Pruning is most effective when guided by importance criteria, ranging from simple magnitude thresholds to more sophisticated measures of each weight's contribution to the loss function. Quantization reduces the precision of weights and activations, often from 32-bit floating point to 8-bit integers or even binary values. This lowers memory use and speeds up computation but requires careful tuning to maintain accuracy. Techniques such as post-training quantization and quantization-aware training help limit the performance loss caused by reduced precision.
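As one hedged example, post-training dynamic quantization can be applied in PyTorch roughly as follows; the model definition is a placeholder and int8 is just one common precision choice.

import torch
import torch.nn as nn

# Placeholder model standing in for a trained float32 network.
model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
model.eval()

# Post-training dynamic quantization: Linear weights are stored as int8,
# and activations are quantized on the fly at inference time.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

Quantization-aware training would instead insert fake-quantization operations during training so the network learns to compensate for the reduced precision.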
Knowledge distillation transfers knowledge from a large, complex teacher model to a smaller, more lightweight student model. The student is trained to mimic the teacher's outputs, such as soft targets that capture the teacher's uncertainty and decision boundaries. With this approach, the student can approach the teacher's capability with significantly fewer parameters.
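A minimal sketch of the usual soft-target distillation loss follows; the temperature T and mixing weight alpha are assumed hyperparameters, not values prescribed by the survey.

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Soft-target term: KL divergence between temperature-softened distributions,
    # scaled by T^2 to keep gradient magnitudes comparable across temperatures.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    # Hard-target term: ordinary cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard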
Low-rank factorization decomposes weight matrices into products of smaller matrices, reducing both parameter count and computation. Methods such as Singular Value Decomposition (SVD) identify and exploit redundancy in weight matrices. The chosen factorization method and approximation rank determine the trade-off between compression and accuracy.
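As an illustrative sketch (the helper factorize_linear and the chosen rank are assumptions for demonstration), a fully connected layer can be replaced by two smaller layers via truncated SVD:

import torch
import torch.nn as nn

def factorize_linear(layer: nn.Linear, rank: int) -> nn.Sequential:
    # Weight matrix has shape (out_features, in_features).
    W = layer.weight.data
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * S[:rank]   # fold singular values into the left factor
    V_r = Vh[:rank, :]

    # Two smaller layers: in_features -> rank -> out_features.
    first = nn.Linear(layer.in_features, rank, bias=False)
    second = nn.Linear(rank, layer.out_features, bias=layer.bias is not None)
    first.weight.data = V_r
    second.weight.data = U_r
    if layer.bias is not None:
        second.bias.data = layer.bias.data
    return nn.Sequential(first, second)

Choosing a smaller rank gives higher compression but a coarser approximation of the original weights.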
Recent work explores hybrid approaches that combine multiple compression techniques, such as pruning and quantization, to achieve higher compression ratios. Automated machine learning (AutoML) techniques search for compression configurations tailored to specific models and hardware platforms. Ongoing innovations in compression algorithms and hardware accelerators continue to improve deployment efficiency, enabling broader adoption of AI across applications.
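A rough sketch of such a hybrid pipeline, chaining the magnitude pruning and dynamic quantization shown earlier (the sparsity level and dtype are illustrative assumptions):

import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))

# Step 1: magnitude pruning of every Linear layer (50% sparsity is illustrative).
for module in model:
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.5)
        prune.remove(module, "weight")

# Step 2: post-training dynamic quantization of the pruned model.
model.eval()
compressed = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)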