A Systematic, Unified DNN Model Compression Framework using ADMM

A fundamental contribution of Yanzhi’s group is a systematic, unified DNN model compression framework (ECCV18, ASPLOS19, ICCV19, AAAI20-1, AAAI20-2, HPCA19, etc.) based on ADMM (Alternating Direction Method of Multipliers), a powerful optimization tool. It applies to weight pruning (both non-structured and various types of structured pruning) and to weight quantization. We further incorporate multiple enhancements such as a progressive framework, the multi-rho technique, purification and unused path removal, and automatic hyperparameter determination.
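The ADMM iteration behind weight pruning can be illustrated with a minimal sketch. It assumes a toy quadratic loss in place of a real DNN training loss, so the weight update has a closed form; in actual training that step would instead be solved approximately by gradient descent on the loss plus a quadratic penalty. The function names, the value of `rho`, and the iteration count are illustrative assumptions, not taken from the cited papers.

```python
import numpy as np

def project_topk(w, k):
    """Euclidean projection onto vectors with at most k nonzeros:
    keep the k largest-magnitude entries and zero the rest."""
    z = np.zeros_like(w)
    if k > 0:
        keep = np.argsort(np.abs(w))[-k:]
        z[keep] = w[keep]
    return z

def admm_prune(target, k, rho=1.0, iters=50):
    """Toy ADMM: minimize 0.5*||w - target||^2 subject to w having
    at most k nonzeros. `target` stands in for trained dense weights."""
    w = target.copy()
    z = project_topk(w, k)      # auxiliary variable that satisfies the constraint
    u = np.zeros_like(w)        # scaled dual variable
    for _ in range(iters):
        # W-step: minimize loss + (rho/2)*||w - z + u||^2 (closed form here)
        w = (target + rho * (z - u)) / (1.0 + rho)
        # Z-step: project w + u onto the sparsity constraint set
        z = project_topk(w + u, k)
        # U-step: dual update
        u = u + w - z
    return project_topk(w, k)   # final hard prune

dense = np.array([3.0, -0.1, 0.5, -2.0, 0.05])
sparse = admm_prune(dense, k=2)
```

Quantization fits the same template by replacing the projection with a mapping to the nearest allowed quantization level.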

The proposed framework achieves unprecedented model compression rates on representative DNNs and datasets. For non-structured weight pruning alone, we achieve 700X overall weight reduction in LeNet-5, 44X in AlexNet (ImageNet dataset), 34X in VGGNet, and 12X in ResNet-50 with no accuracy loss, a 3X to 56X improvement over competing methods (Arxiv). In addition, we achieve 6X and 4.3X lossless structured pruning on the CONV layers of AlexNet and ResNet-50, respectively, on the ImageNet dataset. For the first time, we show that fully binarized weight quantization (for all layers) can be lossless in accuracy on the MNIST and CIFAR-10 datasets, and that full binarization of ResNet on the ImageNet dataset is possible (ICML workshop’19). When (non-structured) weight pruning and quantization are combined, we achieve up to 6,635X reduction in weight storage without accuracy loss (Arxiv), outperforming the state of the art by two orders of magnitude.

Our most recent results (Arxiv) provide a comprehensive comparison between non-structured and structured weight pruning, with weight quantization in place. We reach the strong conclusion that non-structured weight pruning is not desirable on any hardware platform.

When Compression Meets Compiler — Towards Real-Time Execution of All DNNs on a Mobile Device

Recently, we have made another major contribution (ASPLOS20, AAAI20, ICML19, etc.) based on the ADMM solution framework. We identify the compiler as the bridge between algorithm-level compression of DNNs and hardware-level acceleration, one that maintains the highest possible degree of parallelism without compromising accuracy. Using mobile devices (embedded CPU/GPU) as an example, we have developed a combination of pattern and connectivity pruning techniques that offers both flexibility (and hence high accuracy) and regularity (and hence hardware parallelism and acceleration). Accuracy and hardware performance are no longer a tradeoff. Rather, DNN model compression can achieve the best of both non-structured and structured pruning while hiding their weaknesses, making it desirable at the theory, algorithm, compiler, and hardware levels.
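The idea of combining pattern and connectivity pruning can be sketched as follows. This is an illustration under stated assumptions rather than the method of the cited papers: the four-entry pattern library `PATTERNS`, the energy-based pattern choice, and the norm-based kernel removal are all hypothetical choices made for the example.

```python
import numpy as np

# Hypothetical pattern library: each mask keeps 4 of the 9 positions of a 3x3 kernel.
PATTERNS = np.array([
    [[1, 1, 0], [1, 1, 0], [0, 0, 0]],
    [[0, 1, 1], [0, 1, 1], [0, 0, 0]],
    [[0, 0, 0], [1, 1, 0], [1, 1, 0]],
    [[0, 0, 0], [0, 1, 1], [0, 1, 1]],
], dtype=np.float64)

def pattern_prune(weights):
    """weights: (out_ch, in_ch, 3, 3). For every kernel, apply the library
    pattern that preserves the most squared-weight energy."""
    # energy[p, o, i] = sum of squared weights kept by pattern p in kernel (o, i)
    energy = np.einsum("pxy,oixy->poi", PATTERNS, weights ** 2)
    best = energy.argmax(axis=0)                 # chosen pattern index per kernel
    return weights * PATTERNS[best], best

def connectivity_prune(weights, keep_ratio):
    """Remove whole kernels (input-output connections) with the smallest
    L2 norm, keeping roughly keep_ratio of all kernels."""
    norms = np.sqrt((weights ** 2).sum(axis=(2, 3)))   # (out_ch, in_ch)
    n_keep = max(1, int(round(keep_ratio * norms.size)))
    threshold = np.sort(norms, axis=None)[-n_keep]
    mask = norms >= threshold
    return weights * mask[:, :, None, None], mask

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 3, 3, 3))
w_pat, pattern_ids = pattern_prune(w)
w_final, kernel_mask = connectivity_prune(w_pat, keep_ratio=0.5)
```

Because every surviving kernel follows one of a few known patterns, a compiler can group kernels by `pattern_ids` and generate regular code for each group, which is where the regularity benefit comes from.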

We propose compiler-based code optimization techniques, such as filter/channel reordering, a compact storage format, register-level redundant load elimination, and fine-tuning, to fully reap the benefits enabled by pattern and connectivity pruning (ASPLOS20), thereby facilitating real-time acceleration of DNNs. The optimized compiler automatically generates C++ code for embedded CPUs and OpenCL code for embedded GPUs.

For mobile devices, we achieve what is undoubtedly the fastest DNN acceleration. For example, we achieve 18.9ms inference time for VGG-16, 26ms for ResNet-50, and 4ms for MobileNet-V2 on a commercial smartphone without accuracy loss (ASPLOS20). The execution time, and especially the energy efficiency, outperform a large number of FPGA and ASIC acceleration works. All DNNs can potentially run in real time on general-purpose, off-the-shelf mobile devices through our algorithm-compiler-hardware co-design framework.

Video on YOLO-V3 here?

Block-Circulant Matrix-based DNNs – A Model Compression Technique Beneficial to both Inference and Training

Besides the weight pruning and quantization techniques discussed above, we have proposed a holistic framework for incorporating structured matrices (block-circulant matrices) into DNNs (MICRO 2017, ICCAD 2017, AAAI 2018, AAAI 2019). It achieves (i) simultaneous reduction in weight storage and computational complexity, (ii) simultaneous speedup of training and inference, and (iii) generality, in that it can be adopted in both software and hardware implementations, on different platforms, and across neural network types, sizes, and scales. Characteristic (ii) is unique, since the other weight pruning/quantization techniques inevitably increase the training time. Beyond these algorithm-level achievements, our framework has a solid theoretical foundation (ICML 2017) proving that our approach converges to the same “effectiveness” as deep learning without compression. Unlike much prior work on weight pruning, our proposed framework is hardware friendly, especially for FPGAs (AAAI 2018, FPGA 2018, FPGA 2019, HPCA 2019) and ASICs (ISSCC 2019), because it is based on FFT (fast Fourier transform) and inverse FFT computations.
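To make the FFT connection concrete, the following minimal sketch multiplies a block-circulant weight matrix by a vector using only the defining vector of each block. The layout (each b x b block defined by its first column) and the function names are assumptions made for illustration; the dense expansion exists only to check the result.

```python
import numpy as np

def block_circulant_matvec(c, x):
    """c: (p, q, b) array holding the first column of each b x b circulant block.
    x: input vector of length q*b. Returns W @ x of length p*b without forming W."""
    p, q, b = c.shape
    xf = np.fft.rfft(x.reshape(q, b), axis=1)   # FFT of each input segment
    cf = np.fft.rfft(c, axis=2)                 # FFT of each defining vector
    # Circular convolution becomes an element-wise product in the frequency
    # domain; sum over the q blocks of a block row before one inverse FFT.
    yf = (cf * xf[None, :, :]).sum(axis=1)
    return np.fft.irfft(yf, n=b, axis=1).reshape(p * b)

def dense_from_blocks(c):
    """Expand to the equivalent dense (p*b, q*b) matrix, for verification only."""
    p, q, b = c.shape
    idx = (np.arange(b)[:, None] - np.arange(b)[None, :]) % b  # C[i, j] = c[(i - j) mod b]
    return np.block([[c[i, j][idx] for j in range(q)] for i in range(p)])

rng = np.random.default_rng(0)
c = rng.normal(size=(2, 3, 4))       # 2 x 3 blocks, block size 4
x = rng.normal(size=12)
y_fft = block_circulant_matvec(c, x)
y_dense = dense_from_blocks(c) @ x
```

Here the weights occupy p*q*b numbers instead of p*q*b*b, and each block product costs O(b log b) rather than O(b^2), which reflects the simultaneous storage and computation reduction described above.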

High-Performance and Energy-Efficient Implementation of DNNs on Hardware Platforms – Overview

Based on the systematic model compression framework, we develop model compression and acceleration techniques for various hardware platforms, accounting for their unique hardware characteristics. As shown in the figure, our research spans three dimensions and their effective coordination. The first is NN type: CNN, MLP, RNN (LSTM, GRU), BERT (a transformer model), DRL (deep reinforcement learning), etc. The second is application: image classification, speech recognition, object detection (multiple objects for UAV deployment), natural language processing, supervisory control (based on DRL), medical imaging (based on transfer learning), autonomous driving, etc. The third is hardware platform: GPU, FPGA, ASIC, mobile devices (as discussed above), and emerging hardware devices. As an additional important angle, we also aim to improve the generalizability of DNN model compression in transfer learning scenarios and its robustness to adversarial attacks and other forms of attack (ICCV 2019). Our ultimate goal is to develop DNN model compression and acceleration that is desirable at the theory, algorithm, compiler, and hardware levels.

FPGA Acceleration

Our work on DNN acceleration on FPGAs (AAAI 2018, FPGA 2018, FPGA 2019, HPCA 2019), carried out in collaboration with Peking University, mainly focuses on the block-circulant based framework thanks to its hardware friendliness. We are the first to incorporate the block-circulant framework into FPGA acceleration of DNNs, to use the FFT-IFFT decoupling technique and reconfigurability to enhance hardware performance (AAAI 2018), and to develop the corresponding weight quantization scheme (in the frequency domain). We further proposed quantizing each weight as the sum of two power-of-2 numbers (FPGA 2019) for further computation reduction without accuracy loss. Currently, we achieve (one of) the highest performance and energy efficiency in FPGA implementations of RNN-based speech recognition and YOLO-based UAV object detection (FPGA 2019, HPCA 2019).
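The following minimal sketch illustrates quantizing a weight to the sum of two power-of-2 numbers. The exponent range (`max_shift`), the inclusion of zero as a level, and the nearest-level rounding rule are assumptions made for the example and are not taken from the FPGA 2019 paper.

```python
def two_pow2_levels(max_shift=4):
    """All magnitudes of the form 2**-i + 2**-j with 0 <= i, j <= max_shift,
    plus 0.0 (including zero is an assumption of this sketch)."""
    terms = [2.0 ** -s for s in range(max_shift + 1)]
    levels = {a + b for a in terms for b in terms}
    levels.add(0.0)
    return sorted(levels)

def quantize(w, levels):
    """Map w to the nearest allowed magnitude, preserving its sign."""
    mag = min(levels, key=lambda v: abs(abs(w) - v))
    return mag if w >= 0 else -mag

def shift_add_multiply(x, i, j):
    """Compute x * (2**-i + 2**-j) for an integer x with two right shifts
    and one add (truncating), as fixed-point hardware could."""
    return (x >> i) + (x >> j)

levels = two_pow2_levels()
q = quantize(0.37, levels)            # 0.375 = 2**-2 + 2**-3
y = shift_add_multiply(64, 2, 3)      # 64 * 0.375 = 24
```

The appeal for hardware is that a multiplication by such a weight needs no multiplier, only two shifts and one addition.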

A table shall be here.

ASIC Acceleration

In collaboration with Tsinghua University, we have developed the first solid-state (65nm TSMC process) tapeout of a block-circulant based DNN acceleration framework (ISSCC 2019). It achieves 0.39-to-140.3TOPS/W energy efficiency with 1-to-12b flexible weight/activation quantization (in the frequency domain), making it one of the most energy-efficient ASIC implementations. We have also developed an ASIC DNN accelerator supporting weight pruning-induced sparsity and 1-to-8b flexible quantization. In our lab, we have developed the first circuit tapeout of stochastic computing-based DNNs, which achieves an ultra-small hardware footprint compared with conventional binary computing (HEIF paper).

A composite figure shall be here.

Emerging Devices including Superconducting Devices

The recently developed Adiabatic Quantum-Flux-Parametron (AQFP) superconducting technology achieves the highest energy efficiency among superconducting logic families (Nature SP 2019). We have identified that stochastic computing is especially suitable for AQFP. With help from Yokohama National University, Japan, in circuit fabrication and cryogenic testing, we have proposed and implemented the first DNN acceleration using superconducting technology (based on the stochastic computing technique), and have achieved the highest energy efficiency among all hardware devices: 10^4X to 10^5X higher than CMOS-based implementations (ISCA 2019).
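As a minimal sketch of why stochastic computing leads to very small circuits, the example below multiplies two numbers with a single XNOR per bit. It assumes the standard bipolar encoding, in which a value x in [-1, 1] is represented by a bitstream whose probability of a 1 is (x + 1) / 2; the encoding choice, stream length, and function names are assumptions of this sketch and not details from the cited papers.

```python
import random

def encode(x, n, rng):
    """Bipolar stochastic encoding: a length-n bitstream with P(bit = 1) = (x + 1) / 2."""
    p = (x + 1) / 2
    return [1 if rng.random() < p else 0 for _ in range(n)]

def decode(bits):
    """Recover the represented value from the fraction of 1s."""
    return 2 * sum(bits) / len(bits) - 1

def sc_multiply(a_bits, b_bits):
    """Bipolar multiplication: a bitwise XNOR of two independent streams."""
    return [1 - (a ^ b) for a, b in zip(a_bits, b_bits)]

rng = random.Random(0)
n = 20000
a_bits = encode(0.5, n, rng)
b_bits = encode(-0.4, n, rng)
product = decode(sc_multiply(a_bits, b_bits))   # approximates 0.5 * -0.4 = -0.2
```

The result is an estimate whose error shrinks as the stream gets longer, so precision is traded for hardware simplicity.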

A composite figure shall be here.

Our structured weight pruning and weight quantization framework for DNNs, which currently achieves the highest model compression rate, is also very helpful for other emerging hardware devices such as memristors (RRAM), because it minimizes the size of the RRAM crossbar at the same accuracy (ISLPED 2019, ASP-DAC 2020).