GdR ISIS Théorie du deep learning - June 28, 2021

Despite excellent prediction performance, state-of-the-art neural network architectures are very large, with up to several million weights. In particular, running them on systems with limited computational capacity (embedded systems) is difficult. For this reason, several works have focused on the compression of neural networks (NNs).

The most popular tensor approaches to compression [1] mainly aim at compressing the layers of convolutional networks, which can be viewed as tensors. Using a canonical polyadic decomposition (CPD), they replace multidimensional convolutions with one-dimensional ones. Another direction of research relates tensor decompositions to neural networks with product units (instead of summing units) [2]; this type of representation, however, is rarely used in practice.
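To give a rough sense of why a CPD compresses a convolutional layer, the sketch below builds a rank-R kernel from four factor matrices and compares parameter counts. The kernel layout (d x d x S x T), the sizes, and all variable names are illustrative assumptions, not values taken from [1].

```python
import numpy as np

# Illustrative sizes (assumptions): d x d spatial kernel,
# S input channels, T output channels, CPD rank R.
d, S, T, R = 3, 16, 32, 4

rng = np.random.default_rng(0)
A = rng.standard_normal((d, R))  # factor along the first spatial mode
B = rng.standard_normal((d, R))  # factor along the second spatial mode
C = rng.standard_normal((S, R))  # factor along the input-channel mode
D = rng.standard_normal((T, R))  # factor along the output-channel mode

# Rank-R CPD: K[i, j, s, t] = sum_r A[i, r] * B[j, r] * C[s, r] * D[t, r]
K = np.einsum('ir,jr,sr,tr->ijst', A, B, C, D)

full_params = K.size                           # d * d * S * T
cpd_params = A.size + B.size + C.size + D.size # R * (d + d + S + T)

print(f"full kernel: {full_params} weights")
print(f"CPD factors: {cpd_params} weights")
print(f"compression ratio: {full_params / cpd_params:.1f}x")
```

Each factor matrix acts along a single mode of the kernel, which is what allows the multidimensional convolution to be carried out as a sequence of one-dimensional operations.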

In this talk, we consider an entirely different approach. While keeping the traditional neural network structure (linear weights + nonlinear activation functions), we aim to add flexibility [3] to the activation functions (AFs). In particular, the activation functions are allowed to differ from node to node, as opposed to the fixed functions (e.g. ReLU) used in conventional architectures. Such an architecture is particularly interesting thanks to the identifiability (uniqueness) theory available in the polynomial case [4]. Identifiability properties may provide insight into the functioning of these NNs and help to enforce stability of the representation.
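As a minimal sketch of what a layer with node-specific activation functions might look like, the example below evaluates f(x) = W g(V^T x), where each hidden node r has its own polynomial g_r. The polynomial parametrization, the function name flexible_layer, and the toy values are assumptions made for illustration only.

```python
import numpy as np

def flexible_layer(x, V, W, coeffs):
    """Evaluate f(x) = W g(V^T x) with one polynomial AF per hidden node.

    coeffs[r, k] is the coefficient of t**k in g_r (hypothetical layout).
    """
    z = V.T @ x                              # pre-activations, shape (r,)
    powers = z[:, None] ** np.arange(coeffs.shape[1])
    g = (coeffs * powers).sum(axis=1)        # g_r(z_r) for every node r
    return W @ g

# Toy example: node 0 uses g_0(t) = t, node 1 uses g_1(t) = t**2.
V = np.eye(2)
W = np.eye(2)
coeffs = np.array([[0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0]])
y = flexible_layer(np.array([2.0, 3.0]), V, W, coeffs)
print(y)
```

With a fixed activation such as ReLU, every row of coeffs would effectively be replaced by the same function; here the rows are free parameters.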

Unlike existing methods for flexible AFs, which use conventional training techniques [3], we employ an original framework developed in the nonlinear system identification community. The work of [5] showed that an architecture with one hidden flexible layer can be identified via a CPD of a Jacobian tensor. However, this result is not directly applicable in the learning setup; in particular, it offers no simple way to estimate the activation functions. In this work, we propose a new method for compressing pretrained neural networks based on coupled matrix-tensor factorization. The proposed learning algorithm is based on a constrained alternating least squares (ALS) approach. Our method achieves good compression of large NN layers at the cost of a slight degradation in classification accuracy.
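The Jacobian-tensor structure mentioned above can be checked numerically. For f(x) = W g(V^T x), the Jacobian at a point x is W diag(g'(V^T x)) V^T, so stacking Jacobians evaluated at N points gives a third-order tensor whose CPD factors are W, V and a matrix H of derivative values. The sketch below only verifies this structure with finite differences; it is not the constrained ALS algorithm of the talk, and all sizes and names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r, N, deg = 4, 3, 2, 5, 3           # illustrative sizes (assumptions)
V = rng.standard_normal((m, r))           # input weights
W = rng.standard_normal((n, r))           # output weights
coeffs = rng.standard_normal((r, deg + 1))  # one polynomial AF per node

def g(z):
    # g_r(z_r) = sum_k coeffs[r, k] * z_r**k
    return (coeffs * z[:, None] ** np.arange(deg + 1)).sum(axis=1)

def g_prime(z):
    # derivative of each g_r, evaluated at z_r
    k = np.arange(1, deg + 1)
    return (coeffs[:, 1:] * k * z[:, None] ** (k - 1)).sum(axis=1)

def f(x):
    return W @ g(V.T @ x)

def numerical_jacobian(x, eps=1e-5):
    # central finite differences, one input coordinate at a time
    cols = [(f(x + eps * e) - f(x - eps * e)) / (2 * eps) for e in np.eye(m)]
    return np.stack(cols, axis=1)         # shape (n, m)

X = rng.standard_normal((N, m))           # sampling points
J_num = np.stack([numerical_jacobian(x) for x in X], axis=2)  # (n, m, N)

# CPD prediction: J[i, j, k] = sum_r W[i, r] * V[j, r] * H[k, r]
H = np.stack([g_prime(V.T @ x) for x in X])                   # (N, r)
J_cpd = np.einsum('ir,jr,kr->ijk', W, V, H)

print(np.allclose(J_num, J_cpd, atol=1e-5))
```

In a decoupling setting the direction is reversed: the Jacobian tensor is computed from a given function, and W, V and H are recovered from its CPD.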

References

[1] V. Lebedev, Y. Ganin, M. Rakhuba, I. Oseledets, and V. Lempitsky, Speeding-up convolutional neural networks using fine-tuned CP-decomposition, ICLR 2015, arXiv:1412.6553 (2015).

[2] N. Cohen, O. Sharir, and A. Shashua, On the Expressive Power of Deep Learning: A Tensor Analysis, in 29th Annual Conference on Learning Theory, New York, USA, 2016, pp. 698–728.

[3] A. Apicella, F. Donnarumma, F. Isgrò, and R. Prevete, A survey on modern trainable activation functions, Neural Networks, 138 (2021), pp. 14–32.

[4] P. Comon, Y. Qi, and K. Usevich, Identifiability of an x-rank decomposition of polynomial maps, SIAM Journal on Applied Algebra and Geometry, 1 (2017), pp. 388–414.

[5] P. Dreesen, M. Ishteva, and J. Schoukens, Decoupling multivariate polynomials using first-order information and tensor decompositions, SIAM Journal on Matrix Analysis and Applications, 36 (2015), pp. 864–879.

[6] Y. Zniyed, K. Usevich, S. Miron, and D. Brie, Learning nonlinearities in the decoupling problem with structured CPD, in 16th IFAC Symposium on System Identification, Padova, Italy, 2021.

This is joint work by Yassine Zniyed, Konstantin Usevich, Sebastian Miron, and David Brie.