An Empirical Study on Model Pruning and Quantization
In machine learning, model compression is vital for resource-constrained Internet of Things (IoT) devices, such as unmanned aerial vehicles (UAVs) and wearable devices. Several state-of-the-art (SOTA) compression methods exist, but little work has been done to evaluate these techniques across different models and datasets.
In this paper, we present an in-depth study of two SOTA model compression methods, pruning and quantization. We apply these methods to AlexNet, ResNet18, VGG16BN, and VGG19BN on three well-known datasets: Fashion-MNIST, CIFAR-10, and UCI-HAR. From our study, we draw the following conclusions: applying pruning with retraining preserves performance (less than 0.5% degradation on average) while reducing model size (at a 10× compression rate) on spatial-domain datasets (e.g., images); performance on temporal-domain datasets (e.g., motion sensor data) degrades more (about 5.0% on average); and the performance of quantization depends on the pruning rate, the network architecture, and the clustering method. We also conduct comparative experiments on knowledge distillation. The results indicate that more prerequisites must be satisfied for knowledge distillation to achieve comparable performance.
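To make the two studied techniques concrete, the following is a minimal NumPy sketch of magnitude-based pruning (zeroing the smallest-magnitude weights) and weight-sharing quantization via 1-D k-means clustering, in the spirit of the methods evaluated above. The function names and the linear centroid initialization are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    flat = np.abs(weights).flatten()
    k = int(len(flat) * sparsity)
    if k == 0:
        return weights.copy(), np.ones_like(weights, dtype=bool)
    # Threshold at the k-th smallest absolute value.
    threshold = np.partition(flat, k - 1)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask, mask

def cluster_quantize(weights, n_clusters=4, n_iter=20):
    """Weight sharing: replace each weight with its nearest centroid (1-D k-means)."""
    w = weights.flatten()
    # Illustrative choice: initialize centroids linearly over the weight range.
    centroids = np.linspace(w.min(), w.max(), n_clusters)
    for _ in range(n_iter):
        assign = np.argmin(np.abs(w[:, None] - centroids[None, :]), axis=1)
        for c in range(n_clusters):
            if np.any(assign == c):
                centroids[c] = w[assign == c].mean()
    return centroids[assign].reshape(weights.shape)

# Example: prune half the weights, then quantize the survivors to 4 shared values.
w = np.array([[0.1, -0.5], [2.0, -0.05]])
pruned, mask = magnitude_prune(w, 0.5)
quantized = cluster_quantize(pruned, n_clusters=4)
```

In a full pipeline, pruning is followed by retraining with the mask held fixed so the surviving weights can recover the lost accuracy; only the centroid codebook and per-weight cluster indices need to be stored after quantization.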
Finally, we suggest several promising directions for future research.