The semester is organized within the framework of the Berlin Mathematics Research Center MATH+ and supported by the Einstein Foundation Berlin. We are committed to fostering an atmosphere of respect, collegiality, and sensitivity. Please read our MATH+ Collegiality Statement.
Use our mailing list TES_small_data@zib.de for communication. Participants of the workshop received a link to a cloud folder from which they can download the slides.
Wed. 17.01.2024
09:00 – 10:00 Registration – Coffee & Hanging Posters
10:00 – 10:30 Short Introduction
10:30 – 12:00 Session “Dynamical Models”
Abstract: Interacting particle models can be used to describe a wide range of systems, in particular in the physical and social sciences. One of the main attractions of such models (at least to mathematicians) is their generality; the governing equations for many systems are essentially the same, up to the choice of interactions between the particles. Therein lies one of the biggest challenges in modelling interacting particle systems: how does one determine these interactions? In this talk, I will introduce some common models, motivated by examples from opinion dynamics and the physics of droplets, as well as the types of small data sets that are available to parameterize them. I will then highlight some significant differences between the approaches available in the social and physical sciences and discuss some recent successes and open challenges.
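For intuition, a minimal bounded-confidence opinion model (Hegselmann–Krause-style) of the kind common in this area can be simulated in a few lines. The sketch below is purely illustrative, not one of the specific models from the talk; the confidence radius epsilon is the entire "choice of interactions".

```python
# Minimal bounded-confidence opinion dynamics (Hegselmann-Krause-style).
# Illustrative only; epsilon encodes the whole interaction choice.
import numpy as np

rng = np.random.default_rng(42)
opinions = rng.uniform(0, 1, size=50)   # initial opinions of 50 agents
epsilon = 0.15                          # confidence radius (hypothetical value)

for step in range(100):
    updated = np.empty_like(opinions)
    for i, x in enumerate(opinions):
        neighbors = opinions[np.abs(opinions - x) < epsilon]
        updated[i] = neighbors.mean()   # move to the mean of nearby opinions
    opinions = updated

print(np.round(np.sort(opinions), 3))   # opinions settle into consensus clusters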
Abstract: Neurotransmission at chemical synapses relies on calcium-induced fusion of synaptic vesicles with the presynaptic membrane. After a fusion event, both the vesicle and the release site undergo a recovery phase, after which they are recycled and become available for fusion again. We have developed a stochastic model for these cyclical neurotransmission dynamics. Using a small data set obtained from the synapses of five Drosophila fruit flies, we estimate the parameter values of the recovery model. The overall goal is to understand the process of neurotransmission using mathematical methods.
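A toy version of such cyclical dynamics, sketched here as a three-state continuous-time Markov chain with invented rates (not the parameters estimated from the Drosophila data), shows what this kind of model structure looks like:

```python
# Toy continuous-time Markov chain for the vesicle cycle:
# ready -> fused -> recovering -> ready. Rates are placeholders,
# not the parameters estimated from the Drosophila data.
import numpy as np

rng = np.random.default_rng(0)
exit_rate = {"ready": 2.0, "fused": 50.0, "recovering": 5.0}  # 1/s, hypothetical
next_state = {"ready": "fused", "fused": "recovering", "recovering": "ready"}

state, t, fusion_times = "ready", 0.0, []
while t < 10.0:                          # simulate 10 seconds
    t += rng.exponential(1.0 / exit_rate[state])
    state = next_state[state]
    if state == "fused":
        fusion_times.append(t)           # record each fusion event

print(f"{len(fusion_times)} fusion events in 10 s")
```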
12:00 – 13:30 LUNCH Break
13:30 – 15:00 Session “Large Deviation Theory”
Abstract: We explain in what situations the exponential decay rate of probabilities can be described in terms of a variational problem and what additional information can be deduced from that. The benefits include that extreme events (large deviations) can be analyzed using cheap formulas instead of costly simulations, and that the least unlikely behavior in an extremely unlikely situation can be characterized.
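The prototypical instance of such a variational description is Cramér's theorem for the empirical mean of i.i.d. random variables:

```latex
% Cramer's theorem for i.i.d. X_1, ..., X_n: the exponential decay rate
% is given by a variational problem over the rate function I.
\[
  \mathbb{P}\Bigl(\tfrac{1}{n}\textstyle\sum_{i=1}^{n} X_i \in A\Bigr)
  \approx e^{-n\, \inf_{x \in A} I(x)},
  \qquad
  I(x) = \sup_{\theta}\bigl(\theta x - \log \mathbb{E}\, e^{\theta X_1}\bigr).
\]
```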
Abstract: In a situation where many devices would like to transmit their message(s), the ubiquitous problem of interference may make successful transmission difficult. Instead, one distributes all the messages randomly over time in the hope that not too many messages cancel each other out. We introduce a Markovian model and examine the probability that an extraordinarily small number of transmissions is successful.
Abstract: Large deviation theory can be applied to models with random dynamics. The random objects of study are then the paths (history realisations) of the system. I will introduce this aspect of large deviation theory by sketching its application to a simple stochastic model of chemical reactions in a well-mixed vessel. We will see that this is a powerful way to deduce a simple model for the bulk behaviour of a system with many small constituent parts.
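The standard way to generate such paths is Gillespie's stochastic simulation algorithm. The sketch below runs it for a toy birth–death reaction with invented rates (not the specific system from the talk); the bulk prediction k1/k2 emerges for large copy numbers:

```python
# Gillespie stochastic simulation of a well-mixed birth-death reaction:
# production 0 -> A at rate k1, decay A -> 0 at rate k2 * n.
# A standard toy system; rates are illustrative.
import numpy as np

rng = np.random.default_rng(1)
k1, k2 = 10.0, 0.1          # hypothetical rate constants
n, t, t_end = 0, 0.0, 100.0

while t < t_end:
    propensities = np.array([k1, k2 * n])
    total = propensities.sum()
    t += rng.exponential(1.0 / total)            # waiting time to next event
    if rng.random() < propensities[0] / total:
        n += 1                                    # production event
    else:
        n -= 1                                    # decay event

print(f"final copy number: {n} (bulk prediction: {k1 / k2:.0f})")
```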
15:00 – 15:30 COFFEE Break
15:30 – 17:00 Session “Topological Data Analysis”
Abstract: Algebraic topology, a classical tool in mathematics, serves as a rich source of inspirational techniques for data analysis. In this talk, I will discuss a range of strategies it offers to the data science community. We will start with classical persistent homology and mapper, explore tools such as Euler characteristic curves and profiles, and conclude with Ball mapper and related concepts. I will present the main mathematical ideas underpinning these methods in a simple and illustrative manner. Additionally, I will provide motivational examples of various applications that necessitate the adaptation of the discussed techniques.
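As a small taste of persistent homology in practice, the sketch below computes the persistence diagram of a noisy circle, assuming the gudhi package is installed (other TDA libraries expose similar functionality). One long-lived 1-dimensional class should stand out, reflecting the loop:

```python
# Persistence of a noisy circle; one prominent 1-dimensional class
# (the loop) should dominate. Minimal sketch assuming `gudhi` is installed.
import numpy as np
import gudhi

rng = np.random.default_rng(3)
angles = rng.uniform(0, 2 * np.pi, 60)
points = np.c_[np.cos(angles), np.sin(angles)] + rng.normal(0, 0.05, (60, 2))

rips = gudhi.RipsComplex(points=points, max_edge_length=2.0)
simplex_tree = rips.create_simplex_tree(max_dimension=2)
diagram = simplex_tree.persistence()   # list of (dimension, (birth, death))

loops = [(b, d) for dim, (b, d) in diagram if dim == 1]
print(sorted(loops, key=lambda bd: bd[1] - bd[0], reverse=True)[:3])
```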
Abstract: A Reeb graph is a graphical representation of a scalar function defined on a topological space that encodes the topology of the level sets. A Reeb space is a generalization of the Reeb graph to a multivariate function. We propose novel constructions of Reeb graphs and Reeb spaces that incorporate the use of a measure. Specifically, we introduce measure theoretic Reeb graphs and Reeb spaces when the domain or the range is modeled as a metric measure space (i.e., a metric space equipped with a measure). Our main goal is to enhance the robustness of the Reeb graph and Reeb space in representing the topological features of a scalar field while accounting for the distribution of the measure. We first prove the stability of our measure theoretic constructions with respect to the interleaving distance. We then prove their stability with respect to the measure, defined using the distance to a measure or the kernel distance to a measure, respectively. This is a joint work with Qingsong Wang, Guanqun Ma, and Raghavendra Sridharamurthy.
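For reference, the classical construction that these measure-theoretic variants build on is the quotient of the domain by connectivity of level sets:

```latex
% Classical Reeb graph of f : X -> R: identify points lying in the
% same connected component of the same level set.
\[
  \mathrm{Reeb}(f) = X/\!\sim, \qquad
  x \sim y \;\Longleftrightarrow\; f(x) = f(y) \text{ and } x, y
  \text{ lie in the same connected component of } f^{-1}(f(x)).
\]
```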
17:00 – 18:30 Workshop Reception
Thu. 18.01.2024
COFFEE available
09:00 – 10:30 Session “Geometric Data Analysis”
Abstract: Statistics of geometric data, such as shapes appearing in medical imaging and morphology, is traditionally formulated using geodesic distances and least squares. In the talk, I will outline an approach to geometric statistics where geodesic distances are replaced with (negative log-)likelihoods of diffusion processes on the geometric spaces. This leads to new statistics and new estimation algorithms. One example is the diffusion mean, an alternative to the classical Fréchet mean. I will discuss the motivation behind the diffusion mean, its construction and statistical properties, focusing on differences between the diffusion mean and the classical Fréchet mean.
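The contrast between the two means can be stated compactly: the squared geodesic distance in the Fréchet mean is replaced by the negative log-likelihood of a diffusion run for time T, and the Fréchet mean is recovered in the limit T → 0:

```latex
% Frechet mean vs. diffusion mean: least-squares geodesic distance is
% replaced by the negative log-likelihood of a diffusion run for time T.
\[
  \hat{x}_{\mathrm{Fr\acute{e}chet}}
    = \operatorname*{arg\,min}_{x} \sum_{i=1}^{n} d(x, x_i)^2,
  \qquad
  \hat{x}_{T}
    = \operatorname*{arg\,min}_{x} \sum_{i=1}^{n} \bigl(-\log p_T(x, x_i)\bigr),
\]
% where p_T is the heat kernel; as T -> 0, -log p_T(x, y) ~ d(x, y)^2 / (2T),
% recovering the Frechet mean.
```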
Abstract: Three-dimensional RNA structures frequently contain atomic clashes. Usually, corrections approximate the biophysical chemistry, which is computationally intensive and often does not correct all clashes. We propose fast, data-driven reconstructions from clash-free benchmark data with a two-scale shape analysis: microscopic (suite) dihedral backbone angles and mesoscopic sugar-ring center landmarks. Our analysis relates concentrated mesoscopic-scale neighborhoods to microscopic-scale clusters, correcting within-suite backbone-to-backbone clashes by exploiting angular shape and size-and-shape Fréchet means. We illustrate the power of our method using cutting-edge SARS-CoV-2 RNA data.
Abstract: Unlabeled graphs are naturally modeled as equivalence classes under the action of permutations on nodes or, equivalently, under the action by conjugation of permutation matrices on adjacency matrices. Such a model is, however, computationally limited, especially because it relies on the combinatorial problem of graph matching. As a relaxation of such action, we introduce in this talk a new framework where graphs are modeled in the quotient space resulting from the action of rotations on the set of adjacency matrices. Now, beyond the idea of relaxation, our approach takes on a natural interpretation from the point of view of spectral graph theory, which is the study of graphs through their spectrum, a descriptor that has proved to encode numerous properties. Indeed, the action of rotations by conjugation preserves the eigenvalues of a given graph – represented by its adjacency matrix – and the resulting equivalence classes are precisely the sets of cospectral graphs, that is, graphs sharing the same set of eigenvalues. In this framework, we explore non-Euclidean dimensionality reduction and, in particular, a recent method named Barycentric Subspace Analysis (BSA). We demonstrate that our framework carries a very nice geometric structure, greatly simplifying the method. Through several examples, we are then able to illustrate that BSA is a powerful dimensionality reduction tool with great interpretability and is particularly suited to the analysis of sets of graphs.
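The spectral invariance underlying this framework is elementary: conjugation by a rotation is a similarity transform, so it cannot change the eigenvalues of the adjacency matrix, and the orbits are exactly the sets of cospectral graphs:

```latex
% Conjugation by a rotation preserves the characteristic polynomial,
% hence the spectrum, of the adjacency matrix A.
\[
  \det\bigl(R A R^{\top} - \lambda I\bigr)
  = \det\bigl(R\,(A - \lambda I)\,R^{\top}\bigr)
  = \det(A - \lambda I)
  \qquad \text{for all } R \in \mathrm{SO}(n).
\]
```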
10:30 – 10:45 Short Break
10:45 – 12:15 Session “Data Temporality”
Abstract: The annual number of publications at scientific venues, for example conferences and journals, is growing quickly. As an attempt to support human data analysts, we present topic space trajectories. This temporal method is based on simple factorization-based topic models and allows for the comprehensible tracking of single or a small number of entities in topic space. We demonstrate how these trajectories can be interpreted and further analyzed using lattice orders and ordinal motifs. This demonstration is based on several text corpora stemming from research publications and patent data. Moreover, we present how the results can be aggregated to identify topic flow between single authors or whole scientific disciplines.
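A minimal sketch of the factorization step, using scikit-learn's NMF on a tiny invented corpus (titles and years are made up for illustration), shows how documents become points in topic space whose ordering in time yields a trajectory:

```python
# Factorization-based topic model sketch: NMF of a document-term matrix,
# then one topic-space point per year forms a trajectory. The corpus
# below is invented for illustration.
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

docs_by_year = {
    2021: "graph neural networks for molecules",
    2022: "topic models for patent corpora",
    2023: "lattice orders and ordinal motifs in text",
}

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(list(docs_by_year.values()))  # document-term matrix
nmf = NMF(n_components=2, init="nndsvda", random_state=0)
W = nmf.fit_transform(X)                                   # documents in topic space

for year, coords in zip(docs_by_year, W):
    print(year, coords.round(2))       # one point per year: the trajectory
```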
Abstract: In archaeology, the occurrence and/or frequency of certain traits, e.g. certain types of artifact shapes or decoration patterns, in stratigraphic entities (strata, pits, ditches, etc.) have long been recognized to convey important information, especially – but not only – about the relative chronology of these entities. Clearly, the formalized representation of such information is a matrix of nonnegative integer numbers, either binary data or larger integer counts, with the entities as rows and the traits as columns. Such a matrix is also called two-way two-mode data in the field of statistical seriation. If a causal phenomenon strongly influences the alternating appearance and disappearance of the traits through its gradually changing values, it is called a gradient in multivariate analysis. Such a gradient-driven data matrix can be expressed in an optimized diagonal form, a form known to mathematicians as a Toeplitz matrix. Alas, archaeologists do not know whether their data is gradient-driven, i.e. a Toeplitz matrix, at all – they just hope it is! A formal statistical test that rules out noise and confirms that a Toeplitz structure is present in the data would therefore represent major progress for archaeology.
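One way such a test could look, sketched here purely for illustration and not as the speaker's proposal, is a Monte Carlo comparison of a bandedness statistic against column-shuffled null matrices:

```python
# Illustrative Monte Carlo test for band (gradient-like) structure:
# compare the within-row spread of occupied column positions against
# matrices with the same trait frequencies but randomized row assignment.
# Not the speaker's method; a generic permutation-test sketch.
import numpy as np

rng = np.random.default_rng(7)

def spread(M):
    # mean within-row spread of occupied columns; small = band-like
    cols = np.arange(M.shape[1])
    totals = M.sum(axis=1)
    keep = totals > 0                                    # skip empty rows
    centers = (M[keep] * cols).sum(axis=1) / totals[keep]
    var = (M[keep] * (cols - centers[:, None]) ** 2).sum(axis=1) / totals[keep]
    return float(var.mean())

# rows = stratigraphic entities, columns = traits (toy counts)
M = np.array([[3, 1, 0, 0],
              [1, 2, 1, 0],
              [0, 1, 3, 1],
              [0, 0, 1, 2]])
observed = spread(M)

# null model: permute each column independently, preserving trait
# frequencies but destroying any gradient linking entities to traits
null = [spread(np.column_stack([rng.permutation(M[:, j])
                                for j in range(M.shape[1])]))
        for _ in range(999)]
p = (1 + sum(s <= observed for s in null)) / (1 + len(null))
print(f"within-row spread = {observed:.2f}, Monte Carlo p = {p:.3f}")
```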
12:15 – 13:30 LUNCH Break
13:30 – 15:00 Session “Complexity”
Speakers:
Abstract:
In this session, we will start with “small data analysis” as it is done in Computational Literary Studies. In particular, the possibilities and impossibilities of computational methods applied to literature will be discussed with two speakers from this field. Is it the complexity of thoughts, or missing contextual knowledge in the data, that defeats mathematical approaches? Do we have a scheme that compares the complexity of literature with the complexity of computational methods?
We will see that some tasks performed by experts in comparative studies can be expressed as operations on infinite(?) Boolean rings. Some research questions can in this way be modeled mathematically as satisfiability problems.
This leads to the question of the complexity of these problems. With the last speaker of this session, we will discuss whether, in this mathematical field, P≠NP holds. That is, is there a concrete mathematical problem in literature studies that is too complex to be solved deterministically in polynomial time?
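As a toy illustration of the satisfiability modeling, consider invented literary predicates checked by brute force; the exponential cost of this enumeration (2^n assignments for n variables) is exactly why the P vs. NP question matters here:

```python
# Toy satisfiability modeling with invented literary predicates:
# can a text be both "satirical" and "courtly", given a domain rule
# that satire excludes courtliness unless irony is present?
from itertools import product

def constraints(satirical, courtly, ironic):
    return (satirical and courtly                          # hypothesis under test
            and (not satirical or not courtly or ironic))  # domain rule

solutions = [assignment for assignment in product([False, True], repeat=3)
             if constraints(*assignment)]
print(solutions)   # satisfiable, but only with ironic = True
```

Brute force enumerates all 2^n truth assignments; whether anything fundamentally faster exists for such problems in general is precisely the P vs. NP question.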
15:00 – 15:30 COFFEE Break
15:30 – 17:00 Session “Statistical Inference and Modeling”
Abstract: Sample surveys are widely recognised as high-quality and cost-effective sources of information to obtain estimates of target parameters at the population and subpopulation levels. If the sample size of the subpopulation is small (or even zero in some areas), then the researcher encounters the small area estimation (SAE) dilemma. SAE techniques have been developed to provide official statistics by leveraging survey samples and other available information, allowing estimators to borrow strength. In the first part of this talk, we present the main rationale behind introducing SAE machinery and discuss the necessary assumptions. The second part delves into the tools addressing simultaneous inference within the SAE framework, showcasing their practical application in measuring poverty rates in the Spanish region of Galicia. We conclude with general remarks on the potential applicability of machine learning within the SAE framework.
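A standard area-level model (Fay–Herriot, shown here as an illustration since the abstract does not fix a specific model) makes the "borrowing strength" mechanism explicit: the direct survey estimate is shrunk towards a regression-synthetic estimate, more strongly where the survey variance is large:

```latex
% Fay-Herriot area-level model (illustrative): direct estimates are
% combined with a synthetic regression estimate via the shrinkage
% weight gamma_i, which is small where the sampling variance psi_i is large.
\[
  \hat{y}_i = \theta_i + e_i, \qquad
  \theta_i = \mathbf{x}_i^{\top}\boldsymbol{\beta} + u_i, \qquad
  e_i \sim N(0, \psi_i), \quad u_i \sim N(0, \sigma_u^2),
\]
\[
  \hat{\theta}_i
  = \gamma_i\, \hat{y}_i + (1 - \gamma_i)\, \mathbf{x}_i^{\top}\hat{\boldsymbol{\beta}},
  \qquad
  \gamma_i = \frac{\hat{\sigma}_u^2}{\hat{\sigma}_u^2 + \psi_i}.
\]
```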
Abstract: Functional Magnetic Resonance Imaging (fMRI) neuroscience is a relatively new discipline, with the first fMRI experiment having been performed as recently as 1991. Due to the expensive and time-consuming nature of the method, small-sample studies are historically ubiquitous in the fMRI literature. For instance, in 2018 the median sample size of high-impact neuroimaging papers was only n=24, with an average sample size across all literature of n=60 ([1],[2]). Such sample sizes are problematic, as they result in lower statistical power, increased risk of false positives, and reduced generalizability of findings; problems which are greatly exacerbated in the large-data imaging setting.
This talk shall provide a broad overview of how fMRI statisticians handle various problems associated with small-sample studies. Following a brief overview of fMRI as a subject, we shall touch on (1) how fMRI scientists use permutation and bootstrapping techniques to address distributional assumptions violated in the small-sample setting (a minimal sketch follows the references below), (2) current meta-analysis and data sharing techniques employed to synthesize small study analysis results and (3) the recent move towards big data cohorts and the pros and cons such a shift brings.
[1] – Yeung AWK (2018). An updated survey on statistical thresholding and sample size of fMRI studies. Frontiers in Human Neuroscience 12. DOI: 10.3389/fnhum.2018.00016
[2] – Szucs D, Ioannidis JP (2020). Sample size evolution in neuroimaging research: An evaluation of highly-cited studies (1990–2012) and of latest practices (2017–2018) in high-impact journals. NeuroImage 221:117164. DOI: 10.1016/j.neuroimage.2020.117164
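The sign-flipping permutation test mentioned in (1) can be sketched in a few lines; the data below are simulated, and n=24 merely echoes the median sample size cited above:

```python
# One-sample sign-flipping permutation test, a standard small-sample
# tool in fMRI group analysis. Data are simulated for illustration.
import numpy as np

rng = np.random.default_rng(0)
contrasts = rng.normal(0.3, 1.0, size=24)   # per-subject effect estimates (toy)

def t_stat(x):
    return x.mean() / (x.std(ddof=1) / np.sqrt(len(x)))

observed = t_stat(contrasts)
# under a symmetric null, each subject's effect sign is exchangeable
null = np.array([t_stat(contrasts * rng.choice([-1, 1], size=len(contrasts)))
                 for _ in range(9999)])
p = (1 + np.sum(null >= observed)) / (1 + len(null))
print(f"t = {observed:.2f}, permutation p = {p:.4f}")
```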
17:00 – 18:30 Refreshments / Poster Session (Post-Its / Discussions)
Fri. 19.01.2024
COFFEE available
09:00 – 10:30 Session “Uncertainty”
Speakers:
Abstract:
Mathematics has developed various concepts and methods to represent, analyse, and quantify uncertainty. Mathematical models of the systems under study and their (often simulated, stochastic) outcomes can support decision making under uncertainty. This session looks into interesting open challenges, in particular in the (overlapping) contexts of small data and of complex social systems, such as: How can different kinds of uncertainty be accounted for? How can uncertainty be visualised and communicated, in particular beyond a mathematical audience? How can models deal with uncertainties arising from human interaction?
10:30 – 10:45 Short Break
10:45 – 12:00 “Join-In” Discussion
12:00 – 13:00 LUNCH Break / Poster Session (Collecting Post-Its)
13:00 – 14:30 Session “Machine Learning”
Abstract: Some argue that Deep Neural Nets (DNNs) are only appropriate for huge datasets. However, mathematically, this assertion is missing much context. In this session, we’ll take a ‘Deep’ dive into how neural networks perceive/process data and what mathematical patterns emerge inside DNNs. With our new understanding and intuition, we will try to determine whether DNNs are also appropriate for small data settings.
Abstract: In many scientific and industrial fields, it is desirable to maximize a figure of merit of a black-box system by varying some tuning parameters. Optimizing the system with heuristic approaches (e.g. evolutionary optimization) can be too costly if each system evaluation is expensive. In this case, active learning schemes based on probabilistic machine learning methods make it possible to steer the optimization process efficiently towards the global optimum.
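A minimal sketch of such a scheme, here a Gaussian-process surrogate with an expected-improvement acquisition on an invented one-dimensional objective (scikit-learn and SciPy assumed available), illustrates the loop of fitting, acquiring, and evaluating:

```python
# Active learning / Bayesian optimization sketch: Gaussian process
# surrogate + expected improvement. The objective and settings are
# illustrative stand-ins for an expensive black-box system.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def black_box(x):                        # stand-in for the expensive system
    return -(x - 0.6) ** 2

X = np.array([[0.1], [0.5], [0.9]])      # a few initial evaluations
y = black_box(X).ravel()
grid = np.linspace(0, 1, 200).reshape(-1, 1)

for _ in range(10):
    gp = GaussianProcessRegressor(kernel=RBF(0.2), normalize_y=True).fit(X, y)
    mu, sigma = gp.predict(grid, return_std=True)
    best = y.max()
    z = (mu - best) / np.maximum(sigma, 1e-9)
    ei = (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)   # expected improvement
    x_next = grid[np.argmax(ei)]         # evaluate where EI is largest
    X = np.vstack([X, [x_next]])
    y = np.append(y, black_box(x_next))

print(f"best input found: {X[np.argmax(y)][0]:.3f}")       # should approach 0.6
```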
Abstract: Exploiting the underlying structure of the problem can help to improve the performance of deep learning models. This is in particular true for applications that deal with graphs, where graph neural networks have become the state of the art. In this talk, I will introduce a novel graph neural network based on diffusion in manifolds and explain how it can exploit several common characteristics of learning problems on graphs for better results. It can handle node features from general Riemannian manifolds, which opens up a broad spectrum of possible applications. To finish the talk, we will discuss the performance on small data.
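For intuition, the standard Euclidean graph diffusion that the manifold-valued construction generalizes is the heat equation driven by the graph Laplacian:

```latex
% Euclidean graph diffusion: with graph Laplacian L = D - A, node
% features obey the heat equation and spread along edges; the talk's
% construction generalizes this to manifold-valued node features.
\[
  \frac{\mathrm{d}X(t)}{\mathrm{d}t} = -L\, X(t)
  \quad\Longrightarrow\quad
  X(t) = e^{-tL} X(0).
\]
```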
14:30 – 15:00 Closing Remarks
Thank you very much to all participants (71 in person at ZIB over the three days and more than 10 online). It was fun to discuss mathematical and interdisciplinary aspects of small data analysis. We have one weekend to digest the highlights of the workshop. See you at our Hackathon!