Deep learning comprises
a parameterized “neural network” model;
a methodology for learning the model parameters from given data.
Neural networks model composite functions that alternate (affine) linear and nonlinear functions. The nonlinear “activation” functions are often sigmoid, hyperbolic tangent, or rectified linear units. The main parameters of a neural network are the matrices (weights) and vectors (biases) of the affine linear functions. Learning them from data is traditionally viewed as a nonlinear regression problem. It is principally solved by stochastic gradient descent-type methods.
Recently, deep learning was reinterpreted as a discretization of an optimal control problem (OCP) subject to constraints defined by a system of ordinary differential equations (ODEs). In this interpretation, the input data (at time t = 0) is viewed as being transformed by an unknown ODE to the corresponding output decision (at time t = 1). The goal is then to determine the optimal controls that correspond to model parameters defining the ODE.
This exciting viewpoint has not yet been fully exploited in the learning phase of deep learning. Current approaches can be interpreted from this vantage point as forward Euler methods, the most basic scheme for OCPs with ODE constraints. More advanced, stable, and computationally efficient numerical schemes were developed in numerical analysis for solving such problems.
The goal of this project is to reduce the number of floating-point operations (flops) required for the learning phase of deep learning. This will result in a proportional reduction of the energy consumption. To this end, we will leverage state-of-the-art technologies for ODE-constrained OCPs from applied mathematics and adapt them to deep learning. We anticipate halving the number of flops based on known order-of-magnitude efficiency gains when replacing forward Euler with more advanced schemes.
We will realize the goal through the following three advanced technologies from applied mathematics.
Formulation as a sparse, block linear system: We adopt the optimize-then-discretize paradigm of OCPs and combine it with an all-at-once discretization of the ODE. This results in a large-scale saddle point problem: an algebraic linear system with a particular block structure. We will solve it efficiently by structure-exploiting iterative Krylov methods combined with effective preconditioning. A special-purpose preconditioner will be designed for this system.
Sparsity-promoting regularization: Dropout is a classic regularization for deep learning to prevent overfitting. It randomly induces sparsity in the weights. In the OCP formulation, we can apply L1-regularization techniques to prioritize sparse weights. This will have a similar effect as dropout, with one advantage: The amount of sparsity and where it is applied is automatically controlled by the regularization term.
Tensor trains structure: Dropout can be viewed as a bias to obtain sparse models. Other data-sparse models can be imposed. For example, the weight matrices can be interpreted as flattenings of higher-order tensors, so a low-rank tensor trains structure can be imposed. We will impose this structure on the solution of the linear system and adapt the Krylov method to use only low-rank iterates.