Privacy and Security in Bayesian Inference
There has been substantial progress in the development of statistical methods that are amenable to computation with modern cryptographic techniques such as homomorphic encryption. This has enabled fitting and/or prediction for models in areas ranging from classification and regression to genome-wide association studies. However, these techniques were devised to address specific models in specific settings, and the broader challenge of inference for arbitrary models and arbitrary data sets has received less attention. This talk will discuss work on an approach that enables, in theory, arbitrary low-dimensional Bayesian models to be fitted fully encrypted, keeping the model and prior secret from data owners and vice versa. Several illustrative examples will be presented, together with a discussion of some initial theoretical results on the behaviour of the new methodology.
Cross-validation, possibly V-fold (denoted VFCV for short in the following), is a versatile tool for hyper-parameter tuning in statistical inference. In particular, it is very popular in the machine learning community. The reasons for this success are a relatively low computational cost combined with good efficiency and wide applicability. Indeed, the rationale behind cross-validation relies on little more than the assumption that the sample is made of (nearly) independent and identically distributed random variables.
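As an illustration of classical V-fold cross-validation used for hyper-parameter tuning, here is a minimal sketch. It is not taken from the talk: the polynomial least-squares estimators, the simulated data, and the helper names `vfold_cv_risk` and `select_degree` are all assumptions made for the example.

```python
import numpy as np


def vfold_cv_risk(x, y, degree, n_folds=5, seed=0):
    """Estimate the quadratic prediction risk of a polynomial fit by V-fold CV."""
    rng = np.random.default_rng(seed)
    # Randomly partition the sample indices into V folds of near-equal size.
    folds = np.array_split(rng.permutation(len(x)), n_folds)
    fold_risks = []
    for held_out in folds:
        train = np.setdiff1d(np.arange(len(x)), held_out)
        # Fit on the V - 1 training folds (least-squares polynomial fit).
        coefs = np.polyfit(x[train], y[train], degree)
        # Evaluate the empirical risk on the held-out fold.
        residuals = y[held_out] - np.polyval(coefs, x[held_out])
        fold_risks.append(np.mean(residuals ** 2))
    return float(np.mean(fold_risks))


def select_degree(x, y, degrees, n_folds=5, seed=0):
    """Return the degree with the smallest V-fold CV risk, and all CV risks."""
    risks = {d: vfold_cv_risk(x, y, d, n_folds, seed) for d in degrees}
    return min(risks, key=risks.get), risks


# Simulated regression data with random design and heteroscedastic noise
# (an arbitrary choice made for illustration only).
rng = np.random.default_rng(42)
x = rng.uniform(-1.0, 1.0, size=200)
noise = rng.normal(size=200) * (0.1 + 0.2 * np.abs(x))
y = np.sin(3.0 * x) + noise

best_degree, cv_risks = select_degree(x, y, degrees=range(1, 9))
print("selected degree:", best_degree)
```

The same folds are reused for every candidate degree (fixed `seed`), so that the compared risk estimates differ only through the estimator and not through the random partition.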
Cross-validation of the risks of a collection of M-estimators for model selection can be seen through the prism of penalization. It is then quite transparent that, at least for a fixed value of the number of folds V, VFCV is asymptotically sub-optimal. It is also legitimate to think that it should be improvable in the non-asymptotic regime.
More precisely, the main drawback of VFCV is that it provides a biased estimate of the ideal penalty. Very interestingly, however, debiasing this estimate does not give substantially better performance in practice (in fact, it tends to deteriorate the results). This is due to a genuine second-order effect that favours a slight over-estimation of the ideal penalty. This phenomenon is sometimes called the over-penalization problem in the model selection literature, and it has so far lacked theoretical understanding.
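For readers unfamiliar with the penalization viewpoint, it can be written as follows in standard model selection notation. The notation is an assumption introduced here, not necessarily that of the talk: $\gamma$ denotes the contrast (loss), $P$ the data distribution, $P_n$ the empirical distribution, $\hat{s}_m$ the M-estimator on model $m$, and $C$ an over-penalization constant.

```latex
% Penalized model selection: minimize the empirical risk plus a penalty.
\hat{m} \in \operatorname*{arg\,min}_{m}
  \left\{ P_n \gamma(\hat{s}_m) + \operatorname{pen}(m) \right\}

% Ideal penalty: the gap between the true and the empirical risk,
% so that the penalized criterion equals the true risk P\gamma(\hat{s}_m).
\operatorname{pen}_{\mathrm{id}}(m)
  = P \gamma(\hat{s}_m) - P_n \gamma(\hat{s}_m)

% Over-penalization: a penalty slightly larger than the ideal one.
\operatorname{pen}(m) = C \, \operatorname{pen}_{\mathrm{id}}(m),
  \qquad C \text{ slightly larger than } 1
```

In this language, an unbiased estimate of the ideal penalty corresponds to $C = 1$, and the over-penalization phenomenon is the observation that a value of $C$ somewhat above 1 can perform better in practice.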
In this talk, we will first give a precise mathematical description of the over-penalization problem, through a formalism involving multiple (pseudo-)testing. Then we will propose a possible modification of VFCV and compare its theoretical guarantees with those of classical VFCV on a non-parametric regression problem with random design and heteroscedastic noise. At the heart of our analysis and algorithms, we derive and use some concentration inequalities for the excess risks of M-estimators. Such results require working at a finer scale than (minimax) rates of convergence, and are obtained through representation formulas for the excess risks in terms of maximizers of local suprema of the underlying empirical process. We will conclude by discussing encouraging experimental results and stating some open problems.
This talk is based on joint works with Amandine Dubois and Fabien Navarro.