Extremely Low-Bit Convolution Optimization for Quantized Neural Network on Modern Computer Architectures

Jan 1, 2020
Qingchang Han, Yongmin Hu, Fengwei Yu, Hailong Yang, Bing Liu, Peng Hu, Ruihao Gong, Yanfei Wang, Rui Wang, Zhongzhi Luan, Depei Qian
Abstract
With the continuous demand for higher accuracy of deep neural networks, model sizes have increased significantly. Quantization is one of the most widely used model compression methods, and it can effectively reduce model size without severe accuracy loss. Modern processors such as ARM CPUs and NVIDIA GPUs already provide low-bit arithmetic instructions. However, efficient and practical optimizations of convolution computation for extremely low bit widths are still lacking on ARM CPU (e.g., 2 ∼ 8-bit) and NVIDIA GPU (e.g., 4-bit and 8-bit). This paper explores performance optimization methods for extremely low-bit convolution on diverse architectures. On ARM CPU, we propose two instruction schemes for 2 ∼ 3-bit and 4 ∼ 8-bit convolution with corresponding register allocation methods. In addition, we re-design the GEMM computation with data padding and packing optimizations. We also implement the Winograd algorithm for convolution at specific bit widths (e.g., 4 ∼ 6-bit) to achieve higher performance. On NVIDIA GPU, we propose a data partition mechanism and multi-level memory access optimizations to better adapt the computation to the GPU thread and memory hierarchy. We also propose quantization fusion to eliminate unnecessary data accesses. The experimental results demonstrate that our implementations achieve better performance for extremely low-bit convolution than state-of-the-art frameworks and libraries such as ncnn and cuDNN. To the best of our knowledge, this is the first work that provides efficient implementations of extremely low-bit convolution covering 2 ∼ 8-bit on ARM CPU and 4-bit/8-bit on NVIDIA GPU.
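As one concrete illustration of the low-bit arithmetic instructions the abstract refers to, the sketch below uses CUDA's __dp4a intrinsic (available on sm_61 and later GPUs), which multiplies four packed 8-bit integers and accumulates into a 32-bit integer; such instructions form the inner loop of 8-bit GEMM/convolution kernels. This is a minimal sketch, not the paper's actual kernel: the function name dot_s8, the grid-stride loop, and the atomic reduction are illustrative choices, and n is assumed to be a multiple of 4.

```cuda
#include <cstdint>
#include <cuda_runtime.h>

// Minimal sketch (not the paper's kernel): an 8-bit dot product built on
// __dp4a, which computes a 4-way int8 dot product with 32-bit accumulation.
// Each thread consumes four int8 pairs per iteration via one packed 32-bit load.
__global__ void dot_s8(const int8_t* a, const int8_t* b, int n, int* out) {
    // Reinterpret the int8 arrays as 32-bit words holding 4 values each
    // (assumes n is a multiple of 4 and the pointers are 4-byte aligned).
    const int* a4 = reinterpret_cast<const int*>(a);
    const int* b4 = reinterpret_cast<const int*>(b);
    int acc = 0;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n / 4;
         i += gridDim.x * blockDim.x) {
        acc = __dp4a(a4[i], b4[i], acc);  // acc += sum_k a[4i+k] * b[4i+k]
    }
    atomicAdd(out, acc);  // reduce per-thread partial sums
}
```

For 4-bit operands, Turing-class GPUs expose Tensor Core instructions operating on packed int4 data; the data partition and quantization fusion optimizations described in the abstract layer on top of such hardware primitives.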
Type
Publication
49th International Conference on Parallel Processing - ICPP