Statistical theory for deep neural networks

Mathematically speaking, a neural network is a function mapping an input vector $\mathbf{x}\in\mathbb{R}^d$ to an output variable $y.$ Network functions are built by alternating matrix-vector multiplications with the action of a non-linear activation function $\sigma.$ Fitting a multilayer neural network means finding network parameters such that the network explains the input-output relation on the training data as well as possible.
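
As a minimal sketch of this construction, the following NumPy code evaluates a small fully connected network. The function names `relu` and `network`, the convention of adding the shift vector, the affine output layer and the toy weights are assumptions made for illustration only.

```python
import numpy as np

def relu(z):
    """ReLU activation, applied componentwise."""
    return np.maximum(z, 0.0)

def network(x, weights, shifts, sigma=relu):
    """Evaluate a fully connected network at the input vector x.

    weights[l] is the weight matrix and shifts[l] the shift vector of layer l.
    Hidden layers compute sigma(W h + v); the output layer is affine only
    (an assumption of this sketch).
    """
    h = np.asarray(x, dtype=float)
    for W, v in zip(weights[:-1], shifts[:-1]):
        h = sigma(W @ h + v)                 # matrix-vector product, shift, activation
    return weights[-1] @ h + shifts[-1]      # output layer without activation

# Toy network with one hidden layer of two units (hypothetical parameters).
weights = [np.array([[1.0, -1.0], [0.0, 2.0]]), np.array([[1.0, 1.0]])]
shifts = [np.array([0.0, -1.0]), np.array([0.5])]
y = network([1.0, 2.0], weights, shifts)
```

Fitting the network would then amount to choosing `weights` and `shifts` so that `network(x, weights, shifts)` is close to the observed outputs on the training data.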

A neural network can be represented as a directed graph, cf. Figure 1. The nodes in the graph (also called units) are arranged in layers. The input layer is the first layer and the output layer the last layer. The layers that lie in between are called hidden layers. Each node/unit in the graph stands for a scalar product of the incoming signal with a weight vector, which is then shifted and passed through the activation function. The number of hidden layers is called the depth of the network. A multilayer network (also called a deep network) is a network with more than one hidden layer.
Large databases and increasing computational power have recently resulted in astonishing performance of multilayer neural networks, or deep nets, across a broad range of learning tasks, including image and text classification, speech recognition and game playing. Figure 2 shows the performance of a deep network for object recognition. For this task, a so-called convolutional neural network (CNN) is used. The inputs of the CNN are the pixels of the image in (A). Some of the learned features in the first layer are displayed in (B). The CNN outputs the probabilities for the classes (C). In this case, it classifies the image correctly.
Although deep networks are a central topic in machine learning, they have so far received little attention from mathematicians. While the optimal estimation rates in high dimensions are slow due to the unavoidable curse of dimensionality, multilayer neural networks still perform well in high dimensions. It is thus natural to conjecture that multilayer neural networks form a flexible class of estimators which can avoid the curse of dimensionality by adapting to various low-dimensional structural constraints on the regression function and the design.

Recently, I finished a first preprint studying large multilayer neural networks. It is shown that estimators based on sparsely connected multilayer neural networks with ReLU activation function and properly chosen network architecture achieve the minimax estimation rates (up to $\log n$ factors) under a general composition assumption on the regression function. The framework includes many well-studied structural constraints such as (generalized) additive models. The mathematical analysis extends the recent progress in approximation theory and combines it with explicit control on the network architecture and the network parameters. Interestingly, the number of layers of the neural network architectures plays an important role, and the theory suggests scaling the network depth with the logarithm of the sample size.
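
To make the composition assumption concrete, consider a generalized additive model as an illustration. The symbols $h$, $f_i$ and $g_0, g_1, g_2$ are notation introduced for this sketch only:
$$f(x_1,\dots,x_d) = h\Big(\sum_{i=1}^d f_i(x_i)\Big) = (g_2\circ g_1\circ g_0)(x_1,\dots,x_d),$$
$$g_0(x_1,\dots,x_d) = \big(f_1(x_1),\dots,f_d(x_d)\big),\qquad g_1(y_1,\dots,y_d) = \sum_{i=1}^d y_i,\qquad g_2 = h.$$
Each component of $g_0$ and the function $g_2$ depend on a single variable, while $g_1$ depends on all $d$ variables but is linear and hence arbitrarily smooth. Under such a structure only univariate smoothness constraints remain, which suggests why rates that do not deteriorate with the dimension $d$ are plausible.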

Nonparametric Bayes

Statistics has long been divided into frequentist and Bayesian statistics, and this division persists today. Bayesian statistics is the historically older principle, dating back to the work of Thomas Bayes in the 18th century. Following the Bayesian paradigm, one specifies a prior distribution $\pi$ on the parameter space that models the prior belief about the underlying unknown parameter of interest, say $\theta.$ Given data $x$ with likelihood $p(x|\theta),$ the posterior distribution $$\pi(\theta | x) = \frac{p(x|\theta) \pi(\theta)}{\int p(x|\theta) \pi(\theta) d\theta}$$ can be used for point estimation and uncertainty quantification.
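
The posterior formula can be evaluated numerically on a grid, as in the following minimal sketch. The Bernoulli model, the uniform prior, the data (7 successes in 10 trials) and the helper name `grid_posterior` are assumptions chosen for illustration.

```python
import numpy as np

def grid_posterior(theta_grid, prior, likelihood):
    """Posterior density on an equally spaced grid via Bayes' formula.

    prior and likelihood hold pi(theta) and p(x | theta) at the grid points.
    """
    unnormalized = likelihood * prior
    step = theta_grid[1] - theta_grid[0]
    evidence = np.sum(unnormalized) * step   # Riemann sum for the integral in the denominator
    return unnormalized / evidence

# Hypothetical data: 7 successes in 10 Bernoulli(theta) trials, uniform prior on (0, 1).
theta = np.linspace(0.0005, 0.9995, 1000)    # midpoints of 1000 cells of width 0.001
successes, trials = 7, 10
likelihood = theta**successes * (1.0 - theta)**(trials - successes)
prior = np.ones_like(theta)
posterior = grid_posterior(theta, prior, likelihood)

step = theta[1] - theta[0]
posterior_mean = np.sum(theta * posterior) * step   # point estimate

# Equal-tailed 95% credible interval read off the numerical posterior cdf.
cdf = np.cumsum(posterior) * step
lower = theta[np.searchsorted(cdf, 0.025)]
upper = theta[np.searchsorted(cdf, 0.975)]
```

In this conjugate toy case the exact posterior is a Beta(8, 4) distribution with mean 2/3, so the grid approximation can be checked against a closed form.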
For parametric models, the Bernstein-von Mises theorem states that under weak assumptions on the prior and the model, frequentist and Bayesian inference match in the sense that confidence sets and credible sets are close for large sample sizes. For most nonparametric and high-dimensional models, the posterior distribution is much more difficult to analyze and depends even asymptotically on the prior. For a subjective Bayesian, the prior is given and models the prior belief. For models with complex parameter structure, precise knowledge of the full high-dimensional prior distribution typically cannot be assumed. Instead, practitioners pick priors within a collection of standard priors. It is then natural to interpret Bayes as a frequentist method and to study properties of the posterior assuming that there exists a true parameter generating the data.
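
The agreement of confidence and credible sets in a parametric model can be illustrated numerically. The sketch below assumes a normal location model with known variance and a normal prior, so that the posterior is available in closed form. The model, the prior, the true parameter and the function names are illustrative assumptions, and the example is a toy check rather than a statement of the theorem.

```python
import numpy as np

def intervals(x, sigma=1.0, prior_sd=1.0, z=1.96):
    """95% confidence and credible intervals for a normal mean.

    Model (assumed for this sketch): x_i ~ N(theta, sigma^2) with sigma known,
    prior theta ~ N(0, prior_sd^2).
    """
    n = x.size
    xbar = x.mean()
    half = z * sigma / np.sqrt(n)
    confidence = (xbar - half, xbar + half)

    # Conjugate update: posterior precision is prior precision plus n / sigma^2.
    post_precision = 1.0 / prior_sd**2 + n / sigma**2
    post_mean = (n * xbar / sigma**2) / post_precision
    half_post = z / np.sqrt(post_precision)
    credible = (post_mean - half_post, post_mean + half_post)
    return confidence, credible

def relative_gap(confidence, credible):
    """Largest endpoint difference, relative to the confidence interval length."""
    gap = max(abs(confidence[0] - credible[0]), abs(confidence[1] - credible[1]))
    return gap / (confidence[1] - confidence[0])

rng = np.random.default_rng(0)
theta_true = 2.0   # hypothetical true parameter
gaps = {}
for n in (10, 10_000):
    x = theta_true + rng.standard_normal(n)
    gaps[n] = relative_gap(*intervals(x))
```

In this setting the endpoint differences shrink faster than the interval length, so `gaps[10_000]` is much smaller than `gaps[10]`.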
In previous work, we studied posterior contraction in the high-dimensional linear regression model. It can be shown that for priors that put a lot of mass on sparse models, the posterior concentrates around the true regression model. The Laplace prior, which leads to the famous LASSO estimator, does not, however, lead to contraction of the full posterior around the truth. In another project, we studied so-called global-local shrinkage priors and derived sharp conditions on this class of priors under which the posterior adapts to the underlying sparsity.
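
The link between the Laplace prior and the LASSO can be made explicit in a short computation. The notation is an assumption of this sketch: a linear model $Y = X\theta + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0,\sigma^2 I_n)$ and independent Laplace priors on the coordinates of $\theta$ with parameter $\lambda > 0.$ Then
$$\pi(\theta | Y) \propto \exp\Big(-\frac{1}{2\sigma^2}\|Y - X\theta\|_2^2 - \lambda \|\theta\|_1\Big),$$
so the posterior mode is
$$\hat\theta = \arg\min_{\theta}\ \|Y - X\theta\|_2^2 + 2\sigma^2\lambda \|\theta\|_1,$$
which is the LASSO with penalty parameter $2\sigma^2\lambda.$ The statement above concerns the whole posterior distribution and not only its mode.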
Posterior contraction is typically derived with respect to an intrinsic norm that is induced by the underlying statistical model. We investigated the problem of deriving posterior concentration rates under different loss functions in nonparametric Bayes.

Confidence statements for qualitative constraints

If we reconstruct a function from data, the question always arises whether shape features such as maxima are artifacts of the reconstruction method or whether they are also present in the true underlying function. To answer such questions, one wants to assign confidence statements to qualitative constraints. As we do not know in advance where interesting features of the shape occur, we have to search over the whole domain.

One possibility is to derive a so-called multiscale statistic that combines local tests in a sophisticated way to account for the dependence among the individual tests. The mathematical challenge is then to prove convergence to a distribution-free limit from which quantiles can be obtained.
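
The idea of combining local tests over all locations and scales can be sketched in a deliberately simplified setting: detecting a non-zero local mean in Gaussian noise rather than a shape constraint. The penalty $\sqrt{2\log(en/|I|)}$ for an interval $I$, the Monte Carlo approximation of the quantile (used here in place of a limit distribution), the bump signal and all names are assumptions of this illustration.

```python
import numpy as np

def multiscale_statistic(y, sigma=1.0):
    """Maximum over all intervals of penalized, standardized local sums."""
    n = y.size
    cumsum = np.concatenate(([0.0], np.cumsum(y)))
    best = -np.inf
    for length in range(1, n + 1):
        local_sums = cumsum[length:] - cumsum[:-length]      # sums over all intervals of this length
        local_tests = np.abs(local_sums) / (sigma * np.sqrt(length))
        penalty = np.sqrt(2.0 * np.log(np.e * n / length))   # scale calibration (assumed form)
        best = max(best, local_tests.max() - penalty)
    return best

rng = np.random.default_rng(1)
n = 100

# Monte Carlo approximation of the 95% quantile under pure noise.
null_draws = np.array([multiscale_statistic(rng.standard_normal(n)) for _ in range(300)])
critical_value = np.quantile(null_draws, 0.95)

# Hypothetical signal: a bump of height 2 on [0.4, 0.6).
t = np.arange(n) / n
signal = np.where((t >= 0.4) & (t < 0.6), 2.0, 0.0)
observed = multiscale_statistic(signal + rng.standard_normal(n))
bump_detected = observed > critical_value
```

The scale-dependent penalty keeps the many small intervals from dominating the maximum, so that a single critical value can be used across all scales.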

In previous work, we studied multiscale statistics for deconvolution. In more recent work, we investigated the random coefficients model.

Asymptotic equivalence

Nonparametric statistics deals with a large zoo of different statistical models. Although the way the data are recorded might be quite different across the models, there is typically a lot of similarity when it comes to reconstruction/estimation of the hidden quantities. In many cases, for instance, we obtain the same convergence rates. The notion of asymptotic equivalence makes the similarity of statistical models more precise. Two statistical models are said to be asymptotically equivalent if they lead to the same asymptotic estimation theory (in a certain sense).
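
One standard way to make this precise is Le Cam's distance. The notation below is introduced for this sketch only. For two sequences of experiments $\mathcal{E}_n = (P_\theta^n : \theta\in\Theta)$ and $\mathcal{F}_n = (Q_\theta^n : \theta\in\Theta)$ with the same parameter space, the deficiency of $\mathcal{E}_n$ with respect to $\mathcal{F}_n$ is (under regularity conditions)
$$\delta(\mathcal{E}_n,\mathcal{F}_n) = \inf_{K}\ \sup_{\theta\in\Theta}\ \big\|K P_\theta^n - Q_\theta^n\big\|_{\mathrm{TV}},$$
where the infimum is taken over all Markov kernels $K.$ The Le Cam distance is
$$\Delta(\mathcal{E}_n,\mathcal{F}_n) = \max\big(\delta(\mathcal{E}_n,\mathcal{F}_n),\ \delta(\mathcal{F}_n,\mathcal{E}_n)\big),$$
and the two sequences are called asymptotically equivalent if $\Delta(\mathcal{E}_n,\mathcal{F}_n)\to 0$ as $n\to\infty.$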
Asymptotic equivalence can be quite useful for statistical theory. In several cases, difficult statistical models can be proven to be asymptotically equivalent to a simpler model and this allows us to work directly in the simpler model, avoiding for instance nasty discretization effects.
Establishing asymptotic equivalence is, however, quite hard, and each result needs a new proof strategy. Therefore, only a few results have been established so far. In previous work, we derived conditions under which nonparametric regression with dependent errors can be approximated by a continuous model. In a second project, we worked on asymptotic equivalence between the Gaussian white noise model and nonparametric density estimation.

Spot volatility estimation

The spot volatility describes the local variability of a financial asset or a portfolio. It is an important quantity for risk management and for the analysis of historical data. Unfortunately, the spot volatility cannot be observed directly and has to be inferred from the price process. If the price is recorded at high frequencies such as milliseconds, there are various market frictions that perturb the price process. If this so-called microstructure noise is ignored, the reconstruction of the spot volatility will be far too large. Models with microstructure noise are hard to analyze, as the microstructure noise dominates the signal on most frequencies.
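
The effect of ignoring microstructure noise can be seen in a small simulation. The sketch below uses the realized variance, a naive estimator of the integrated squared volatility, as a simple proxy. The volatility path, the noise level, the sample size and all names are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 23_400                                       # hypothetical number of observations
t = np.arange(1, n + 1) / n
sigma = 0.2 + 0.1 * np.sin(2 * np.pi * t)        # hypothetical spot volatility path

# Latent log-price: increments with local standard deviation sigma(t) * sqrt(1/n).
increments = sigma * np.sqrt(1.0 / n) * rng.standard_normal(n)
log_price = np.cumsum(increments)

# Observed price: latent price plus additive i.i.d. microstructure noise.
noise_sd = 0.005                                 # hypothetical noise level
observed = log_price + noise_sd * rng.standard_normal(n)

def realized_variance(path):
    """Sum of squared increments of the path."""
    return np.sum(np.diff(path) ** 2)

integrated_variance = np.mean(sigma**2)          # Riemann sum of sigma^2 over [0, 1]
rv_clean = realized_variance(log_price)          # close to the integrated variance
rv_noisy = realized_variance(observed)           # inflated by roughly 2 * n * noise_sd**2
```

Without noise the realized variance is close to the integrated variance, whereas with noise it is dominated by the noise term, which grows with the sampling frequency.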
I studied spot volatility estimation with additive microstructure noise in my dissertation. We proved that microstructure noise reduces the optimal convergence rates by a factor of 1/2. We also constructed reconstruction methods that achieve these convergence rates and implemented them in a software package for Matlab.