**Project Heads**

Pavel Dvurechenskii, Klaus-Robert Müller, Shinichi Nakajima, Vladimir Spokoiny

**Project Members**

Vaios Laschos, Aleksei Kroshnin

**Project Duration**

01.01.2021 − 31.12.2022

**Located at**

WIAS

We are motivated by the goal of analyzing neuroimaging data (EEG/MEG/fMRI) to find correlates of brain activity for a better understanding of e.g. ageing processes. We develop a Bayesian optimal transport framework that can be used to detect and statistically validate clusters and differences of brain activities taking into account the spatio-temporal structure of neuroimaging data.

**Background.**

Detecting spatio-temporal differences or changes of brain states is a fundamental question in computational and clinical neuroscience. For instance, one line of research aims to characterize healthy cognitive ageing by studying potential factors that may be of importance to maintain cognitive functionality across long lifespan (e.g. [28]). Brain Computer Interfacing (BCI), the research field aiming to decode brain states (e.g. [25]) in real-time and to translate them into control signals aim to find robust discriminators of various cognitive brain states or transients. To study these scientific goals, various neuroimaging data set have been collected [20,13,18], recently a trend has been towards analyzing multimodal brain signals [16]. Multimodality is important in brain signal analysis, because there is no single non-invasive acquisition method that has both sufficient temporal and spatial resolution. For example, fMRI and MRI have high spatial resolution but low temporal resolution to capture reactions of brains to stimuli, while EEG and MEG have sufficient temporal resolution but suffer from low spatial resolution and low signal-to-noise ratio.

One of the established analytic procedures is source localization, where the EEG/MEG signal generation process is defined by a linear model with the leadfield matrix, and the signal source in the brain is inferred from data samples by solving the inverse problem (e.g.[26]). Novel approaches may benefit from the complementary nature of modalities. For example, one can build a joint model of EEG and MEG signals to infer a common source signal, and use the MRI measurements, which identify the cortical structure of brains, to define subject-specific leadfield matrices such that the source space is aligned to all subjects [3].

Conventionally, differences in brain activities are detected and validated by statistical test in the source space with the Euclidean metric. However, recent research revealed that optimal transport (OT), which takes the anatomical distance between voxels into account, has emerged as a candidate for a more appropriate metric, avoiding undesired blurring in reconstructed source signals [3]. This implies that the statistical test should also be conducted with the OT metric.

On the neuroscientific side, some of the questions on healthy cognitive ageing—maintenance view and compensation mechanism mentioned above have been partially supported by EEG/MEG signal analyses [4,14]. However, more accurate and robust analysis techniques may be required for furthering our understanding of healthy cognitive ageing. On the BCI side, previous machine learning techniques for EEG/MEG have shown broadly that cognitive states and their transients can be detected, including also fatigues, drowsiness, and attention (e.g.[27,11]). However, high intra and inter-subject variability leave still ample challenges (e.g. [24]). Clearly, novel robust statistical tools may help to overcome the discussed limitations and will thus facilitate further progress in multimodal analysis of neuroimaging data and cognitive neuroscience.

**Project Scope.**

Our main goal for this project is to establish tools to detect and statistically validate differences in neuroimaging data with an appropriate metric taken into account. More specifically, we aim to establish a statistical method, called Bayesian optimal transport (BOT), with which one can analyse and judge whether groups of signal samples form separate clusters or a single joint cluster based on OT as the metric. Statistical test by BOT could potentially contribute to answer to neuroscientific questions (see [28], e.g., at what age the brain activity changes and exhibits flexibility for healthy cognitive ageing? Is there gender dependence? How much the change-point depends on the individual?)

Mathematically, we consider activation pattern in the source space as probability distribution on the space of voxels or on the brain surface with probability mass at each point being the relative activation strength at this point. An enhanced leadfield operator will map the signal space to the space of probability measures. Then the space of probability measures will be equipped with an OT distance, in particular, Wasserstein distance. Observed measures will be considered as samples from some unknown probability distribution on the space of these measures. These second-level distributions correspond to subjects, age, stimuli, trials, etc.

Based on this model and extending the framework of [9], different statistical questions are to be answered via Bayesian OT barycenter of probability measures. The key aspect is to use prior in the space of log-density for the barycenter and relax the constraints in the definition of an OT distance by proposing a new relaxation method, which we call calming, and adding priors in the space of dual potentials. This will allow to obtain a variant of Bernstein-von Mises theorem for barycenters. As an output, the framework allows not only to produce the barycenter, but also to equip it with confidence or credible set. The main benefit of the BOT in comparison with standard OT is that the new approach comes with statistical guarantees for the obtained results and that these results are in the finite sample setting, which is crucial as the experiments are costly and the provided data has small sample size.

**References.**

[1] J. Ebert, V. Spokoiny, and A. Suvorikova. Construction of non-asymptotic confidence sets in 2-Wasserstein space. arXiv:1703.03658.

[2] K. Bykov et al. How much can I trust you?—Quantifying uncertainties in explaining neural networks. arXiv:2006.09000, 2020.

[3] H. Janati, T. Bazeille, B. Thirion, M. Cuturi, and A. Gramfort. Multi-subject MEG/EEG source imaging with sparse multi-task regression. Neuroim., 2020.

[4] Rose Bruffaerts, Lorraine K. Tyler, Meredith Shafto, Kamen A. Tsvetanov, and Alex Clarke. Perceptual and conceptual processing of visual objects across the adult lifespan. Scientific Reports, 9(1):13771, 2019.

[5] W. Samek et al. (Eds.). Explainable AI: Interpreting, Explaining, Visualizing Deep Learning Springer, 2019.

[6] Jacob Kauffmann, Malte Esders, Gregoire Montavon, Wojciech Samek, and Klaus-Robert Müller. From clustering to cluster explanations via neural networks. arXiv:1906.07633, 2019.

[7] A. Kroshnin, V. Spokoiny, and A. Suvorikova. Statistical inference for Bures-Wasserstein barycenters. arXiv:1901.00226, 2019.

[8] Alexey Kroshnin, Nazarii Tupitsa, Darina Dvinskikh, Pavel Dvurechensky, Alexander Gasnikov, and Cesar Uribe. On the complexity of approximating Wasserstein barycenters. ICML 2019.

[9] V. Spokoiny and M. Panov. Accuracy of Gaussian approximation in nonparametric Bernstein – von Misestheorem. arXiv:1910.06028, 2019.

[10] Pavel Dvurechensky, Darina Dvinskikh, Alexander Gasnikov, Cesar A. Uribe, and Angelia Nedic. Decentralize and randomize: Faster algorithm for Wasserstein barycenters. In Advances in Neural Information Processing Systems 31, NeurIPS 2018, pages 10783–10793, 2018.

[11] Seunghyeok Hong, Hyunbin Kwon, Sangho Choi, and Kwang Suk Park. Intelligent system for drowsiness recognition based on ear canal electroencephalography with photoplethysmography and electrocardiography. Inf. Sci., 453:302–322, 2018.

[12] Stephan Kaltenstadler, Shinichi Nakajima, Klaus-Robert Müller, and Wojciech Samek. Wasserstein stationary subspace analysis. J. Sel. Topics Signal Proc., 2018.

[13] Jason R. Taylor, Nitin Williams, Rhodri Cusack, Tibor Auer, Meredith A. Shafto, Marie Dixon, Lorraine K.Tyler, Cam-CAN Group, and Richard N. Henson. The Cambridge Centre for Ageing and Neuroscience (Cam-CAN) data repository: Structural and functional MRI, MEG, and cognitive data from a cross-sectional adult lifespan sample. NeuroIm., 2017.

[14] J. R. Gilbert and R. J. Moran. Inputs to prefrontal cortex support visual recognition in the aging brain. Scientific Reports, 6(1):31943, 2016.

[15] Gregoire Montavon, Klaus-Robert Müller, and Marco Cuturi. Wasserstein training of restricted Boltzmann machines. NIPS 2016.

[16] Sven Dähne, Felix Biessmann, Wojciech Samek, Stefan Haufe, Dominique Goltz, Christopher Gundlach, Arno Villringer, Siamac Fazli, and Klaus-Robert Müller. Multivariate machine learning methods for fusing multimodal functional neuroimaging data. Proceedings of the IEEE, 103(9):1507–1530, 2015.

[17] M. Panov and V. Spokoiny. Finite sample Bernstein – von Mises theorem for semiparametric problems. Bayesian Anal., 2015.

[18] D. Wakeman and R. Henson. A multi-subject, multi-modal human neuroimaging dataset. Sci. Data, 2015.

[19] Stefan Haufe, Frank Meinecke, Kai Görgen, Sven Dähne, John-Dylan Haynes, Benjamin Blankertz, and Felix Bießmann. On the interpretation of weight vectors of linear models in multivariate neuroimaging. Neuroimage, 87:96–110, 2014.

[20] Michael Tangermann, Klaus-Robert Müller, Ad Aertsen, Niels Birbaumer, Christoph Braun, Clemens Brun-ner, Robert Leeb, Carsten Mehring, Kai J Miller, Gernot Mueller-Putz, et al. Review of the BCI competition IV. Frontiers in neuroscience, 6:55, 2012.

[21] Benjamin Blankertz, Steven Lemm, Matthias Treder, Stefan Haufe, and Klaus-Robert Müller. Single-trial analysis and classification of ERP components—a tutorial. Neuroim., 2011.

[22] Siamac Fazli, Márton Danóczy, Jürg Schelldorfer, and Klaus-Robert Müller. l1-penalized linear mixed-effects models for high dimensional data with applications to BCI. Neuroim., 2011.

[23] Steven Lemm, Benjamin Blankertz, Thorsten Dickhaus, and Klaus-Robert Müller. Introduction to machine learning for brain imaging. Neuroimage, 56(2):387–399, 2011.

[24] Siamac Fazli, Florin Popescu, Márton Danóczy, Benjamin Blankertz, Klaus-Robert Müller, and Cristian Grozea. Subject-independent mental state classification in single trials. Neural networks, 2009.

[25] Benjamin Blankertz, Ryota Tomioka, Steven Lemm, Motoaki Kawanabe, and Klaus robert Muller. Optimizng spatial filters for robust EEG single-trial analysis. IEEE Sign. Proc., 2008.

[26] Stefan Haufe, Vadim V Nikulin, Andreas Ziehe, Klaus-Robert Müller, and Guido Nolte. Combining sparsity and rotational invariance in EEG/MEG source reconstruction Neuroim., 2008.

[27] Klaus-Robert Müller, Michael Tangermann, Guido Dornhege, Matthias Krauledat, Gabriel Curio, and Benjamin Blankertz. Machine learning for real-time single-trial EEG-analysis: from brain–computer interfacing to mental state monitoring. Journal of neuroscience methods, 167(1):82–90, 2008.

[28] Naftali Raz, Ulman Lindenberger, Karen M Rodrigue, Kristen M Kennedy, Denise Head, Adrienne Williamson, Cheryl Dahle, Denis Gerstorf, and James D Acker. Regional brain changes in aging healthy adults: general trends, individual differences and modifiers. Cerebral cortex, 15(11):1676–1689, 2005.

**Results.**

Motivated by the efficiency of optimal transport in the analysis of geometric data such as images, we propose a Bayesian Optimal Transport framework allowing one to construct statistical tests. Departing from the Wasserstein barycenter optimization problem, we construct a quasi log-likelihood function and study the corresponding posterior distribution. The latter is approximated by a Gaussian distribution based on the Laplace approximation, which allows the construction of credible sets for the barycenter and to use them in two-sample tests.

We also study a number of other optimal-transport-related problems including computational aspects and generalized optimal transport.

Wasserstein barycenter is a non-linear average in the space of measures, friendly to the underlying geometry, which is particularly important in the analysis of image data. We consider entropic regularization which allows for more efficient computational methods and ensures higher regularity of barycenters. In the case of stochastic data (i.e. random measures), there naturally appear questions on the consistency of barycenters, confidence sets, etc. We use the Bayesian approach with the objective in the dual regularized Wasserstein barycenter problem playing the role of a quasi log-likelihood. Its value can be computed much faster, than the value of the objective function in the primal problem, which allows for efficient sampling from posterior distribution. Also, in this case we work with dual potentials in a vector space, and due to the Laplace approximation, the posterior is close to a Gaussian distribution. This is demonstrated by experiments, and under suitable assumptions one can rigorously prove a Gaussian approximation result. The obtained results are used to construct a two-sample test which is empirically efficient when analyzing images from the MNIST dataset. Laplace approximation allows us to construct an iterative Bayesian procedure to iteratively update prior distribution based on the sampling from the posterior of the previous iteration. Such procedure is expected to converge to the barycenter simultaneously allowing to obtain credible sets for different inference tasks.

In [5], the quality of Laplace approximation is studied, providing general theoretical framework that is used in the above construction for Wasserstein barycenters. To that end, we attempt to revisit the classical results on Laplace approximation in a modern non-asymptotic and dimension free form. Such an extension is motivated by applications to high dimensional statistical and optimization problems. The established results provide explicit non-asymptotic bounds on the quality of a Gaussian approximation of the posterior distribution in total variation distance in terms of the so called *effective dimension*. This value is defined as interplay between information contained in the data and in the prior distribution. In the contrary to prominent Bernstein – von Mises results, the impact of the prior is not negligible and it allows to keep the effective dimension small or moderate even if the true parameter dimension is huge or infinite. We also address the issue of using a Gaussian approximation with inexact parameters with the focus on replacing the Maximum a Posteriori (MAP) value by the posterior mean and design the algorithm of Bayesian optimization based on Laplace iterations. The results are specified to the case of nonlinear inverse problem.

The aim of a related note [6] is to state a couple of general results about the properties of the penalized maximum likelihood estimators (pMLE) and of the posterior distribution for parametric models in a non-asymptotic setup and for possibly large or even infinite parameter dimension. We consider a special class of stochastically linear smooth (SLS) models satisfying two major conditions: the stochastic component of the log-likelihood is linear in the model parameter, while the expected log-likelihood is a smooth function. The main results simplify a lot if the expected log-likelihood is concave. For the pMLE, we establish a number of finite sample bounds about its concentration and large deviations as well as the Fisher and Wilks expansion. The later results extend the classical asymptotic Fisher and Wilks Theorems about the MLE to the non-asymptotic setup with large parameter dimension which can depend on the sample size. For the posterior distribution, our main result states a Gaussian approximation of the posterior which can be viewed as a finite sample analog of the prominent Bernstein–von Mises Theorem. In all bounds, the remainder is given explicitly and can be evaluated in terms of the effective sample size and effective parameter dimension. The results are dimension and coordinate free. In spite of generality, all the presented bounds are nearly sharp and the classical asymptotic results can be obtained as simple corollaries. Interesting cases of logit regression and of estimation of a log-density with smooth or truncation priors are used to specify the results and to explain the main notions. Our construction of quasi log-likelihood for the Wasserstein barycenter problem in the dual formulation satisfy the assumptions and these general results apply to our main problem in this project as well.

Approximating Wasserstein barycenter is a computationally intensive optimization problem, which motivates the development of efficient numerical algorithms for its solution. In [4], we build on previous research that proposed an equivalent saddle-point reformulation of the Wasserstein barycenter problem. To scale up the computations, we propose a decentralized distributed version of the Mirror Prox algorithm. Specifically, we focus on smooth convex-concave saddle point problems in a decentralized distributed setting, where a finite-sum objective is distributed among the nodes of a computational network, and local objectives depend on groups of local and global variables. Our proposed algorithm achieves an accuracy of ε in terms of the duality gap and consensus between nodes, with complexities of O(1/ε) in terms of communication and oracle calls. This complexity is better than the previous best-known result of O(1/ε^2), and we prove lower bounds for the communication and oracle call complexities, demonstrating that our algorithm is optimal. Furthermore, unlike existing decentralized algorithms, our algorithm allows for a non-euclidean proximal setup, including entropic setups. We demonstrate the effectiveness of our proposed algorithm by applying it to the problem of computing Wasserstein barycenters, where a non-euclidean proximal setup arises naturally in a bilinear saddle point reformulation of the Wasserstein barycenter problem.

We provide [1] a new algorithm for solving Risk Sensitive Partially Observable Markov Decisions Processes, when the risk is modeled by a utility function, and both the state space and the space of observations is finite. This algorithm is based on an observation that the change of measure and the subsequent introduction of the information space that is used for exponential utility functions, can be actually extended for sums of exponentials if one introduces an extra vector parameter that tracks the “expected accumulated cost” that corresponds to each exponential. Since every increasing function can be approximated by sums of exponentials in finite intervals, the method can be essentially applied for any utility function, with its complexity depending on the number of exponentials.

The Sinkhorn algorithm (arXiv:1306.0895) is the state-of-the-art to compute approximations of optimal transport distances between discrete probability distributions, making use of an entropically regularized formulation of the problem. The algorithm is guaranteed to converge, no matter its initialization. This leads to little attention being paid to initializing it, and simple starting vectors like the *n*-dimensional one-vector are common choices. We [2] train a neural network to compute initializations for the algorithm, which significantly outperform standard initializations. The network predicts a potential of the optimal transport dual problem, where training is conducted in an adversarial fashion using a second, generating network. The network is universal in the sense that it is able to generalize to any pair of distributions of fixed dimension after training, and we prove that the generating network is universal in the sense that it is capable of producing any pair of distributions during training. Furthermore, we show that for certain applications the network can be used independently.

We study [3] the minimizing movement scheme for families of geodesically semiconvex functionals defined on either the Hellinger-Kantorovich or the Spherical Hellinger-Kantorovich space. By exploiting some of the finer geometric properties of those spaces, we prove that the sequence of curves, which are produced by geodesically interpolating the points generated by the minimizing movement scheme, converges to curves that satisfy the Evolutionary Variational Inequality (EVI) when the time step goes to 0.

**External Website**

**Related Publications
**

[2] J. Geuter and V. Laschos. Generative Adversarial Learning of Sinkhorn Algorithm Initializations. Dec. 2022. (submitted). URL: http://arxiv.org/abs/2212.00133.

[3] V. Laschos and A. Mielke. Evolutionary Variational Inequalities on the Hellinger-Kantorovich and Spherical Hellinger-Kantorovich spaces. Nov. 2022. (submitted). URL: http://arxiv.org/abs/2207.09815.

[4] A. Rogozin, A. Beznosikov, D. Dvinskikh, D. Kovalev, P. Dvurechensky, and A. Gasnikov. Decentralized distributed optimization for saddle point problems. Optimization Methods and Software, 2021. (submitted). URL: https://arxiv.org/abs/2102.07758.

[5] V. Spokoiny. Dimension free non-asymptotic bounds on the accuracy of high dimensional laplace approximation. 2022. (submitted). URL: https://arxiv.org/abs/2204.11038.

[6] V. Spokoiny. Finite samples inference and critical dimension for stochastically linear models. 2022. (submitted). URL: https://arxiv.org/abs/2201.06327.

**Related Pictures
**