2nd Workshop

Mathematics of Small Data Analysis

January 17 – January 19, 2024

The semester is organized within the framework of the Berlin Mathematics Research Center MATH+ and supported by the Einstein Foundation Berlin. We are committed to fostering an atmosphere of respect, collegiality, and sensitivity. Please read our MATH+ Collegiality Statement.

A critical workshop on mathematical methods applied to small data sets

The second workshop will reveal different mathematical approaches to the analysis of small data sets in order to make the role of mathematics in this overall process more transparent.

At the first workshop, we discussed the question: When using mathematical methods, how does expert knowledge factor into the results of non-mathematical research? We examined this with regard to small data sets.

With small data sets, it is not the unmanageability of the data that forces us to use “complex” methods to analyze it. The decision to use mathematics has other reasons. But if given data is only transformed with the help of mathematics using a given algorithm, then the actual gain in knowledge does not lie in the mathematical step. The selection of research objects and their coordination, the selection of the method and the interpretation of the mathematical results are the actual steps of the overall process in which gaining knowledge plays a role. But because these steps are carried out by experts, questions arise about the connection between the mathematical method and our knowledge.

The question is not so much “how” but rather “what do you use mathematical methods for?” How do they lead to hypotheses? Are they only used to check the consistency of given hypotheses? How are possible uncertainties negotiated? Is there an open or hidden “physical” connection in the research object that can be modeled mathematically? Is the (perhaps misunderstood) rigor of mathematics used to support one’s own arguments?

Program Committee

Nataša Conrad
Wolfgang König
Vijay Natarajan
Christoph von Tycowicz
Dennis Mischke
Marcus Weber
Sarah Wolf
Fabian Telschow
Vikram Sunkara

Contact & Slides of the Talks

Use our mailing list TES_small_data@zib.de for communication. The participants of the workshop received a link to a cloud folder from which they can download the slides.

Program

Wed. 17.01.2024

09:00 – 10:00 Registration – Coffee & Hanging Posters

10:00 – 10:30 Short Introduction

10:30 – 12:00 Session “Dynamical Models”

10:30 Developing Interacting Particle Models with Small Data -- Benjamin Goddard (University of Edinburgh)

Abstract: Interacting particle models can be used to describe a wide range of systems, in particular in the physical and social sciences. One of the main attractions of such models (at least to mathematicians) is their generality; the governing equations for many systems are essentially the same, up to the choice of interactions between the particles. Therein lies one of the biggest challenges in modelling interacting particle systems: how does one determine these interactions? In this talk, I will introduce some common models, motivated by examples from opinion dynamics and the physics of droplets, as well as the types of small data sets that are available to parameterize them. I will then highlight some significant differences between the approaches available in the social and physical sciences and discuss some recent successes and open challenges.

11:15 Parameter Optimization for a Neurotransmission Recovery Model -- Stefanie Winkelmann (Zuse Institute Berlin)

Abstract: Neurotransmission at chemical synapses relies on calcium-induced fusion of synaptic vesicles with the presynaptic membrane. After a fusion event, both the vesicle and the release site undergo a recovery phase, after which they are recycled and become available for fusion again. We have developed a stochastic model for these cyclical neurotransmission dynamics. Using a small data set obtained from the synapses of five Drosophila fruit flies, the parameter values of the recovery model will be estimated. The overall goal is to understand the process of neurotransmission using mathematical methods.

12:00 – 13:30 LUNCH Break

13:30 – 15:00 Session “Large Deviation Theory”

13:30 Introduction to the theory of probabilities of large deviations -- Wolfgang König (Weierstraß Institute)

Abstract: We explain in what situations the exponential decay rate of probabilities can be described in terms of a variational problem and what additional information can be deduced from that. The benefits are that extreme events (large deviations) can be analyzed using cheap formulas instead of costly simulations, and that the least unlikely behavior in an extremely unlikely situation can be characterized.

14:00 Random medium access strategies in communication systems -- Helia Shafigh (Weierstrass Institute)

Abstract: In a situation where many devices would like to transmit their message(s), the ubiquitous problem of interference may make successful transmission difficult. Instead, one distributes the messages randomly over time in the hope that not too many messages cancel each other. We introduce a Markovian model and examine the probability that an extraordinarily small number of transmissions is successful.

14:30 Large deviations for a simple chemical reaction model -- Robert Patterson (Weierstraß Institute)

Abstract: Large deviation theory can be applied to models with random dynamics. The random objects of study are then the paths (history realisations) of the system. I will introduce this aspect of large deviation theory by sketching its application to a simple stochastic model of chemical reactions in a well-mixed vessel. We will see that this is a powerful way to deduce a simple model for the bulk behaviour of a system with many small constituent parts.

15:00 – 15:30 COFFEE Break

15:30 – 17:00 Session “Topological Data Analysis”

15:30 Opening: An Introduction to TDA -- Vijay Natarajan (Indian Institute of Science Bangalore and Zuse Institute Berlin)

15:40 Shaping up your data - how tools of algebraic topology help to understand complex phenomena? -- Paweł Dłotko (Dioscuri Centre in Topological Data Analysis, Polish Academy of Sciences)

Abstract: Algebraic topology, a classical tool in mathematics, serves as a rich source of inspirational techniques for data analysis. In this talk, I will discuss a range of strategies it offers to the data science community. We will start with classical persistent homology and mapper, explore tools such as Euler characteristic curves and profiles, and conclude with Ball mapper and related concepts. I will present the main mathematical ideas underpinning these methods in a simple and illustrative manner. Additionally, I will provide motivational examples of various applications that necessitate the adaptation of the discussed techniques.

16:15 Measure Theoretic Reeb Graphs and Reeb Spaces -- Bei Wang Phillips (University of Utah, USA)

Abstract: A Reeb graph is a graphical representation of a scalar function defined on a topological space that encodes the topology of the level sets. A Reeb space is a generalization of the Reeb graph to a multivariate function. We propose novel constructions of Reeb graphs and Reeb spaces that incorporate the use of a measure. Specifically, we introduce measure theoretic Reeb graphs and Reeb spaces when the domain or the range is modeled as a metric measure space (i.e., a metric space equipped with a measure). Our main goal is to enhance the robustness of the Reeb graph and Reeb space in representing the topological features of a scalar field while accounting for the distribution of the measure. We first prove the stability of our measure theoretic constructions with respect to the interleaving distance. We then prove their stability with respect to the measure, defined using the distance to a measure or the kernel distance to a measure, respectively. This is joint work with Qingsong Wang, Guanqun Ma, and Raghavendra Sridharamurthy.

16:50 Closing and open discussions

17:00 – 18:30 Workshop Reception

Thu. 18.01.2024

COFFEE available

09:00 – 10:30 Session “Geometric Data Analysis”

9:00 The diffusion mean of geometric data -- Stefan Sommer (University of Copenhagen)

Abstract: Statistics of geometric data such as shapes appearing in medical imaging and morphology is traditionally formulated using geodesic distances and least-squares. In the talk, I will outline an approach to geometric statistics where geodesic distances are replaced with (-log)likelihoods of diffusion processes on the geometric spaces. This leads to new statistics and new estimation algorithms. One example is the diffusion mean, an alternative to the classical Fréchet mean. I will discuss the motivation behind the diffusion mean, its construction and statistical properties, focusing on differences between the diffusion mean and the classical Fréchet mean.

9:30 Learning torus PCA-based classification for multiscale RNA correction with application to SARS-CoV-2 -- Henrik Wiechers (University of Göttingen)

Abstract: Three-dimensional RNA structures frequently contain atomic clashes. Usually, corrections approximate the biophysical chemistry, which is computationally intensive and often does not correct all clashes. We propose fast, data-driven reconstructions from clash-free benchmark data with two-scale shape analysis: microscopic (suites) dihedral backbone angles, mesoscopic sugar ring center landmarks. Our analysis relates concentrated mesoscopic scale neighborhoods to microscopic scale clusters, correcting within-suite-backbone-to-backbone clashes exploiting angular shape and size-and-shape Fréchet means. We illustrate the power of our method using cutting-edge SARS-CoV-2 RNA.

10:00 Barycentric subspace analysis of a set of graphs -- Elodie Maignant (Zuse Institute Berlin)

Abstract: Unlabeled graphs are naturally modeled as equivalence classes under the action of permutations on nodes or, equivalently, under the action by conjugation of permutation matrices on adjacency matrices. Such a model is, however, computationally limited, especially because it relies on the combinatorial problem of graph matching. As a relaxation of such action, we introduce in this talk a new framework where graphs are modeled in the quotient space resulting from the action of rotations on the set of adjacency matrices. Now, beyond the idea of relaxation, our approach takes on a natural interpretation from the point of view of spectral graph theory, which is the study of graphs through their spectrum, a descriptor that has proved to encode numerous properties. Indeed, the action of rotations by conjugation preserves the eigenvalues of a given graph – represented by its adjacency matrix – and the resulting equivalence classes are precisely the sets of cospectral graphs, that is, graphs sharing the same set of eigenvalues. In this framework, we explore non-Euclidean dimensionality reduction and, in particular, a recent method named Barycentric Subspace Analysis (BSA). We demonstrate that our framework carries a very nice geometric structure, greatly simplifying the method. Through several examples, we are then able to illustrate that BSA is a powerful dimensionality reduction tool with great interpretability and is particularly suited to the analysis of sets of graphs.
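The invariance mentioned in this abstract (conjugating an adjacency matrix by a rotation preserves its eigenvalues) can be checked numerically. The following is only a minimal sketch and is not taken from the talk: the helper names, the 3-node path graph and the rotation angle are all hypothetical choices made for illustration. It compares the traces of the first n matrix powers, which determine the spectrum of an n×n matrix, so that no external library is needed.

```python
import math

def matmul(a, b):
    """Plain-Python product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def givens(n, i, j, theta):
    """Rotation of R^n by angle theta in the (i, j) coordinate plane."""
    r = [[1.0 if p == q else 0.0 for q in range(n)] for p in range(n)]
    c, s = math.cos(theta), math.sin(theta)
    r[i][i], r[j][j] = c, c
    r[i][j], r[j][i] = -s, s
    return r

def power_traces(a):
    """Traces of a, a^2, ..., a^n; together they fix the eigenvalues of a."""
    n = len(a)
    traces, p = [], a
    for _ in range(n):
        traces.append(sum(p[i][i] for i in range(n)))
        p = matmul(p, a)
    return traces

# Adjacency matrix of the path graph on 3 nodes (illustrative choice).
A = [[0.0, 1.0, 0.0],
     [1.0, 0.0, 1.0],
     [0.0, 1.0, 0.0]]

# Conjugate by an arbitrary rotation: B = R A R^T.
R = givens(3, 0, 2, 0.7)
B = matmul(matmul(R, A), transpose(R))

# Same power traces (up to rounding), hence the same eigenvalues.
spectrum_preserved = all(
    math.isclose(x, y, abs_tol=1e-9)
    for x, y in zip(power_traces(A), power_traces(B))
)
print(spectrum_preserved)
```

Note that B is in general no longer a 0/1 matrix, which reflects the relaxation described above: the orbit of a graph under rotations is larger than its orbit under node permutations.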

10:30 – 10:45 Short Break

10:45 – 12:15 Session “Data Temporality”

10:45 Topic space trajectories -- Tom Hanika (University of Hildesheim, HU Berlin)

Abstract: The annual number of publications at scientific venues, for example, conferences and journals, is growing quickly. As an attempt to support human data analysts, we present topic space trajectories. This temporal method is based on simple factorization-based topic models and allows for the comprehensible tracking of single or a small number of entities in topic space. We demonstrate how these trajectories can be interpreted and further analyzed using lattice orders and ordinal motifs. This demonstration is based on several text corpora stemming from research publications and patent data. Moreover, we present how the results can be aggregated to identify topic flow between single authors or whole scientific disciplines.

11:30 Why do I search a test for the toeplitz structure of a data matrix -- Georg Roth (Freie Universität Berlin)

Abstract: In archaeology, the occurrence and/or the frequency of certain traits, e.g. certain types of artifact shapes or decoration patterns, in stratigraphic entities (strata, pits, ditches etc.) have long been recognized to convey important information especially – but not only – about the relative chronology of these entities. Clearly, the formalized representation of such information is a matrix of nonnegative integer numbers, either binary data or larger integer numbers, with the entities as rows and the traits as columns. Such a matrix is also called two-way two-mode data in the field of statistical seriation. If a causal phenomenon strongly influences the alternating appearance and disappearance of the traits by its gradually changing values, it is called a gradient in multivariate analysis. Such a gradient-driven data matrix can be expressed in an optimized diagonal form, a form known to mathematicians as a Toeplitz matrix. Alas, archaeologists do NOT know if their data is gradient driven, i.e. a Toeplitz matrix, at all; they just hope for it to be so! So a formal statistical test to rule out noise and confirm that a Toeplitz structure is present in the data would represent major progress for archaeology.

12:15 – 13:30 LUNCH Break

13:30 – 15:00 Session “Complexity”

Discussion with Evelyn Gius (TU Darmstadt), Nan Z. Da (Johns Hopkins University), and Mihai Prunescu (Institute of Mathematics, Academy of Romania)

Speakers:

Evelyn Gius, Digital Philology, TU Darmstadt
Nan Z. Da, Department of English, Johns Hopkins University
Mihai Prunescu, Institute of Mathematics, Academy of Romania

Abstract:

In this session, we will start with “small data analysis” as it is done in Computational Literary Studies. In particular, the possibilities and impossibilities of computational methods applied to literature will be discussed with two speakers from this field. Is it the complexity of thought or the missing contextual knowledge in the data that defeats mathematical approaches? Do we have a scheme that compares the complexity of literature with the complexity of computational methods?

We will see that some tasks performed by experts in comparative studies can be expressed by operations on infinite(?) Boolean rings. Some research questions can possibly be modeled mathematically by satisfiability problems in this way.

This leads to the question of the complexity of these problems. With the last speaker of this session, we will discuss whether, in this mathematical field, P≠NP holds. Thus, is there a concrete mathematical problem in literature studies that is too complex to be solved deterministically in polynomial time?

15:00 – 15:30 COFFEE Break

15:30 – 17:00 Session “Statistical Inference and Modeling”

15:30 Small Data Analysis in Sample Surveys -- Katarzyna Reluga (University of Bristol)

Abstract: Sample surveys are widely recognised as high-quality and cost-effective sources of information to obtain estimates of target parameters at the population and subpopulation levels. If the sample size of the subpopulation is small (or even zero in some areas), then the researcher encounters the small area estimation (SAE) dilemma. SAE techniques have been developed to provide official statistics by leveraging survey samples and other available information, allowing estimators to borrow strength. In the first part of this talk, we present the main rationale behind introducing SAE machinery and discuss the necessary assumptions. The second part delves into the tools addressing simultaneous inference within the SAE framework, showcasing their practical application in measuring poverty rates in the Spanish region of Galicia. We conclude with general remarks on the potential applicability of machine learning within the SAE framework.

16:15 How Small Sample Studies have Impacted fMRI Neuroscience -- Tom Maullin-Sapey (University of Oxford)

Abstract: Functional Magnetic Resonance Imaging (fMRI) Neuroscience is a relatively new discipline, with the first fMRI experiment having been performed as recently as 1991. Due to the expensive and time-consuming nature of the method, small sample studies are historically ubiquitous in the fMRI literature. For instance, in 2018 the median sample size of high-impact neuroimaging papers was only n=24, with an average sample size across all literature of n=60 ([1],[2]). Such sample sizes are problematic, as they result in lower statistical power, increased risk of false positives, and reduced generalizability of findings; problems which are greatly exacerbated in the large data imaging setting.

This talk shall provide a broad overview of how fMRI statisticians handle various problems associated with small-sample studies. Following a brief overview of fMRI as a subject, we shall touch on (1) how fMRI scientists use permutation and bootstrapping techniques to address distributional assumptions violated in the small sample setting, (2) current meta-analysis and data sharing techniques employed to synthesize small study analysis results and (3) the recent move towards big data cohorts and the pros and cons such a shift brings.
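To illustrate point (1), here is a minimal sketch of an exact two-sample permutation test, a generic textbook technique that avoids distributional assumptions in the small sample setting. It is not the speaker's fMRI pipeline: the function name `exact_permutation_pvalue` and the toy numbers are hypothetical and serve only as an illustration.

```python
from itertools import combinations

def exact_permutation_pvalue(x, y):
    """Two-sided exact permutation test for a difference in group means.

    Every way of relabelling the pooled observations into groups of the
    original sizes is enumerated, which is only feasible for small samples.
    """
    pooled = list(x) + list(y)
    n_x, n_y = len(x), len(y)
    total = sum(pooled)
    observed = abs(sum(x) / n_x - sum(y) / n_y)

    n_splits = 0
    n_extreme = 0
    for idx in combinations(range(len(pooled)), n_x):
        sum_x = sum(pooled[i] for i in idx)
        diff = abs(sum_x / n_x - (total - sum_x) / n_y)
        n_splits += 1
        # Small tolerance so that ties with the observed value are counted.
        if diff >= observed - 1e-12:
            n_extreme += 1
    return n_extreme / n_splits

# Toy data (made up): two groups of three measurements each.
p_value = exact_permutation_pvalue([1, 2, 3], [4, 5, 6])
print(p_value)  # 2 of the 20 possible splits are as extreme: 0.1
```

With three observations per group there are only 20 possible relabellings, so the smallest attainable two-sided p-value is 0.1. This shows in miniature why very small samples limit statistical power.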

[1] – Yeung AWK (2018) An updated survey on statistical thresholding and sample size of fMRI studies. Frontiers in Human Neuroscience 12, DOI 10.3389/fnhum.2018.00016

[2] – Szucs D, Ioannidis JP (2020) Sample size evolution in neuroimaging research: An evaluation of highly-cited studies (1990–2012) and of latest practices (2017–2018) in high-impact journals. NeuroImage 221:117164, DOI 10.1016/j.neuroimage.2020.117164

17:00 – 18:30 Refreshments / Poster Session (Post-Its / Discussions)

Fri. 19.01.2024

COFFEE available

09:00 – 10:30 Session “Uncertainty”

Discussion with Hans-Liudger Dienel (TU Berlin), Andrea Heilrath (TU Berlin) and Claude Garcia (Bern University of Applied Sciences)

Speakers:

Hans-Liudger Dienel (TU Berlin)
Andrea Heilrath (TU Berlin) on “Beyond Numbers: The Art of Uncertainty Visualization”
Claude Garcia (Bern University of Applied Sciences) on “Choices we make in times of crisis: better representing agency in our models”

Abstract:

Mathematics has developed various concepts and methods to represent, analyse, and quantify uncertainty. Mathematical models of the systems under study and their (often simulated, stochastic) outcomes can support decision making under uncertainty. This session looks into interesting open challenges, in particular in the (overlapping) contexts of small data and of complex social systems, such as: How can different kinds of uncertainty be accounted for? How can uncertainty be visualised and communicated, in particular beyond a mathematical audience? How can models deal with uncertainties arising from human interaction?

10:30 – 10:45 Short Break

10:45 – 12:00 “Join-In” Discussion

12:00 – 13:00 LUNCH Break / Poster Session (Collecting Post-Its)

13:00 – 14:30 Session “Machine Learning”

13:00 A “Deep” Dive into how Neural Networks see Data -- Vikram Sunkara (Zuse Institute Berlin)

Abstract: Some argue that Deep Neural Nets (DNNs) are only appropriate for huge datasets. However, mathematically, this assertion is missing much context. In this session, we’ll take a ‘Deep’ dive into how neural networks perceive/process data and what mathematical patterns emerge inside DNNs. With our new understanding and intuition, we will try to determine whether DNNs are also appropriate for small data settings.

13:30 Active Learning for Expensive Black-Box Optimization -- Philipp Schneider (JCMWave, Zuse Institute Berlin)

Abstract: In many scientific and industrial fields, it is desired to maximize a figure of merit of a black-box system by varying some tuning parameters. Optimizing the system with heuristic approaches (e.g. evolutionary optimization) can be too costly if each system evaluation is expensive. In this case, active learning schemes based on probabilistic machine learning methods make it possible to steer the optimization process efficiently to the global optimum.

14:00 Exploiting Data Structure and Geometry with a Diffusion-based Graph Neural Network -- Martin Hanik (Zuse Institute Berlin)

Abstract: Exploiting the underlying structure of the problem can help to improve the performance of deep learning models. This is in particular true for applications that deal with graphs, where graph neural networks have often become the state-of-the-art. In this talk, I will introduce a novel graph neural network based on diffusion in manifolds and explain how it can exploit several common characteristics of learning problems on graphs for better results. It can handle node features from general Riemannian manifolds, which opens up a broad spectrum of possible applications. To finish the talk, we will discuss the performance on small data.

14:30 – 15:00 Closing Remarks

Thank you!

Thank you very much to all participants (71 in person at ZIB over the three days and more than 10 online). It was fun to discuss mathematical and interdisciplinary aspects of small data analysis. We have one weekend to digest the highlights of the workshop. See you at our Hackathon!