Thematic Einstein Semester on

Small Data Analysis

Winter Semester 2023/24

Organizers

Christoph von Tycowicz (ZIB)
Marcus Weber (ZIB)
Sarah Wolf (FUB)
Carlos Enrique Améndola Cerón (TUB)
Nataša Djurdjevac Conrad (ZIB)
Karsten Tabelow (WIAS)
Dennis Mischke (FUB)
Benjamin Ducke (DAI)
Kai Kappert (Charité)
Vikram Sunkara (ZIB)

The semester is organized within the framework of the Berlin Mathematics Research Center MATH+ and supported by the Einstein Foundation Berlin. We are committed to fostering an atmosphere of respect, collegiality, and sensitivity. Please read our MATH+ Collegiality Statement.

About

“Why are there only 40 golden statues (among thousands) in the Cachette of Karnak?” This Thematic Einstein Semester deals with the analysis of small complex data sets, small numbers of research objects, or rare objects, for which standard statistical methods are not applicable and the data itself does not sufficiently represent expert knowledge. Rather than relying on ever larger training sets, we propose a countermovement: reviving classical mathematics (algebra, topology, logic, geometry, …) for data analysis.

Small data sets are common when the events of interest are hard to observe, such as certain diseases, meteorological phenomena, or (pre-)historic findings. The complexity of the examined objects allows for a large variety of analytical answers, whose relevance cannot be justified by the sparse data itself. Even if the studied systems can be simulated, the relevant events might occur rarely, leading to prohibitive sampling costs or uncertainty.

The Thematic Einstein Semester aims to intensify the collaboration between mathematicians and researchers from other disciplines who develop research methods for small data sets.

Accompanying Lecture Series

Depending on capacity, these are also open to guests from all disciplines:

Seminar on “Geometric Deep Learning”, FU Berlin, Wednesdays 14:15-15:45, A6/SR 025/026 (Arnimallee 6)
Lecture on “Logische Datenanalyse” (within Mathematics II), October 24th 2023, 08:15-9:45, Lecture Hall B (B.004), Arnimallee 22

1st Workshop

An interdisciplinary workshop on knowledge-driven analytics

The first workshop of the Thematic Einstein Semester will be held from November 1 to 3, 2023, at the Zuse Institute Berlin. It is an opportunity to find future research topics and collaborations, and it is the prelude to a Thematic Einstein Semester that deals with the interaction of mathematics and non-mathematical disciplines. In this first workshop we deal with research questions and data from different disciplines and see how mathematics comes into play in interdisciplinary projects. Success stories as well as critical views on this collaboration will have their say. The following questions are examined in particular: Is the transition from the research question to the mathematical problem possible? Can expert knowledge (beyond the concrete data) be represented mathematically? What is the impact of complexity reduction? Can mathematical structures be found in interdisciplinary research objects?

More details can be found on the workshop website and on the workshop flyer.

2nd Workshop

Mathematics of Small Data Analysis

The second workshop of the Thematic Einstein Semester will be held from January 17 to 19, 2024, at Zuse Institute Berlin. This workshop mainly addresses mathematical questions concerning the use of data analysis methods for small data sets. It will present a broad range of mathematical fields and, thus, make methods more transparent to the users.

More information and the program can be found here.

Hackathon

This three-day event will bring together an interdisciplinary group of young scientists so that they can learn about knowledge-driven approaches to small data analysis, gain first-hand practical experience, and catch a glimpse of exciting new research questions. It will take place January 22-24, 2024, in connection with the 2nd TES workshop.

The Hackathon webpage can be found here.

Final Conference

The Final Conference, taking place March 11-12, 2024, focuses on the theme “Interpretation of mathematical results”. For example: How does contextual knowledge come into play? The results of the semester will be presented and summarized.

See the conference website for more information.

Podcasts

Throughout the Thematic Einstein Semester, we will create a thematic podcast that casts a spotlight on particular interfaces between real-world applications and mathematics, introducing them to a broader research community as well as the general public.

Our current ZIBcasts can be found here: https://www.zib.de/training-outreach/press/zibcast

Workshop on Ideas

In order to stay connected, we have planned a workshop for collecting and implementing new interdisciplinary project ideas. The workshop will take place on December 12 and 13, 2024, in Villa Engler.

Here is the preliminary program (we intend to hold the workshop in German):

December 12th, 2024: Discuss project ideas

10:00 a.m. – 10:30 a.m.: Welcome + introduction
10:30 a.m. – 11:30 a.m.: Project speed dating
11:30 a.m. – 12:00 p.m.: Collect results
12:00 p.m. – 1:00 p.m.: Lunch together
1:00 p.m. – 2:30 p.m.: Identify state-of-the-art
2:30 p.m. – 4:00 p.m.: Cherry-picking, condensing, refining

December 13th, 2024: Deepening financing topics

9:00 a.m. – 10:30 a.m.: Identify funding programs
10:30 a.m. – 11:00 a.m.: Coffee break
11:00 a.m. – 12:30 p.m.: Resource planning, site planning, lead
12:30 p.m. – 1:30 p.m.: Lunch break
1:30 p.m. – 3:00 p.m.: Plan application and concrete next steps

If you would like to take part, write an email to weber@zib.de

Example of small data

“Data is small if it can only train experts and not machines.” This is not really a definition of small data, but it shows what the central task of small data analysis is. To demonstrate the principles of and needs for small data analysis, we present an interesting contribution to our Thematic Einstein Semester here.

This is the work of Bradley Lewis Scott, a PhD researcher at Queen Mary University of London and the Natural History Museum. It was sent to us in September 2023.

Thinking about the Sloane Herbarium and its data

At first sight a collection of 130,000 dried plants dating from the seventeenth and eighteenth centuries is an extraordinary witness to the worlds of knowledge, practices and understanding that co-existed across many parts of the globe in that period. Physically amassed in their current form by several hundred people of (largely) European heritage, the collection certainly offers the possibility of constructing histories of botany, medicinal knowledge and travel in that cultural context. That alone is reason for studying the collection, but it also hints at other histories: of colonialism, trans-cultural encounters, knowledge exchange (willing or otherwise), trade and privateering, and of enslavement in the Atlantic world, the Cape and the East Indies. Other, unnamed hands certainly did collect some of the plants, and their knowledge worlds and ways of seeing are all but invisible in the collection as we see it today. But not totally. There are occasional indigenous names and uses, along with the spatial information associated with many of the plants. If these can be combined and read with other sources it is sometimes possible to construct an even richer account of the histories of these plants and the people with whom they have interacted.

The challenge is how to do that. The current project intersects with many other Sloane-related activities over several decades, along with the data they have generated, and continue to generate. Figure 1 illustrates the range of different data types with connections to the Sloane Herbarium that now exist. This note explores the opportunities such data affords, illustrates some work in progress, and summarises the likely remaining data work on my project that will be necessary to deliver some of the objectives.

Figure 1 – Inter-connecting data types and sources relating to the Sloane Herbarium.

With a well-constructed data set, it will be possible to analyse many aspects of the historical and geographical contexts of the herbarium, as Figure 2 suggests. Some of these are relatively straightforward, such as a simple modelling of the networks of named people that have an association with the collection, or the places from which specimens derive. Beyond that, to analyse the collection by previous owner, or chronological development, will require some limited data enrichment, so that what is currently effectively unfindable within the history of the herbarium can be represented. Likewise, devising and applying a nuanced categorisation of the known people and places associated with the specimens should make visible some of the connections with slavery, colonialism and (potentially) identify locations in which trans-cultural encounters were likely significant in the generation of the collection. These in turn could lead to insights arising from relationships with other data sets, potentially suggesting the people and knowledge worlds about which the physical herbarium is silent. Here, that which is unrepresentable from the data may become discoverable.

Figure 2 – Potential analytical avenues and objectives using the Sloane Herbarium data.

The data drawn from James Dandy’s book The Sloane Herbarium (1958) gives a good representation of the herbarium as it exists today. However, it does omit a limited number of people and places, which can now be added. With the integration of data from Sloane’s own catalogue of his horti sicci, and a more fine-grained modelling of the constituent parts of the collection, it should be possible to combine that with a chronology of the making of the herbarium, and enable a timeline showing the mergers, additions and deletions to the collection over the course of Sloane’s lifetime. An initial subset of the collection is so represented in Figure 3. It should be possible to generate automatically a similar diagram for the entire collection, once appropriate data support has been added.

Figure 3 – Sankey diagram of a small subset of the Sloane Herbarium showing mergers, additions and deletions. Horizontal axis is (roughly) chronological. The number at the end of each line of text is the count of folios. The values for some horti sicci which no longer exist in their original form have been estimated. The colours have no relevance at present; in the future, it may be useful to colour them according to the main collections (Merret, Plukenet, Courten, Petiver etc).

There are inevitably several ways of visualising processes, and Figure 4 shows an alternative approach to showing the management by Sloane of part of his collection. The modelling of these relationships goes hand in hand with the experimentation with the visualisation tools, and may draw attention to specific, little-noticed features of the collection as part of that process.

Figure 4 – Flow diagram of small subset of the Herbarium showing sub-collections, re-cataloguing and mergers.

Assigning dates to individual horti sicci, or their components, is not straightforward, though there is often some evidence that can be used to infer a chronology. As long as these are adequately documented, it should be possible to construct a series of maps that suggest the geographical sources of plant materials according to when they were accessioned by Sloane; and, more interestingly, it may be plausible to map the collection dates as well. Figure 5 shows a limited subset of some of the early horti sicci, grouped within five-year periods as Sloane acquired them. Such an approach should be relatively easy to extend across the collection, with some caveats.

Figure 5 – Geographical distribution of specimens at the date they were acquired by Sloane. This is a small subset comprising most of HS1-13, 16-17, 29-30.
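
The five-year grouping described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the method used for Figure 5: the record layout (a place name paired with an acquisition year), the function names, and the sample values are all hypothetical placeholders rather than herbarium data.

```python
from collections import Counter


def five_year_bin(year: int) -> str:
    """Return the label of the five-year period that contains `year`."""
    start = year - year % 5
    return f"{start}-{start + 4}"


def count_by_period(records):
    """Count specimens per (period, place).

    `records` is an iterable of (place, acquisition_year) pairs;
    this layout is an assumption made for the sketch.
    """
    counts = Counter()
    for place, year in records:
        counts[(five_year_bin(year), place)] += 1
    return counts


# Invented placeholder records, NOT taken from the Sloane Herbarium.
sample = [
    ("Place A", 1687),
    ("Place A", 1689),
    ("Place B", 1690),
    ("Place B", 1694),
]
counts = count_by_period(sample)
```

Each key of `counts` pairs a period label with a place, so one map per period could be drawn by filtering on the first element of the key. Uncertain or inferred dates would need an explicit flag in the records, which this sketch does not model.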

From these macro views of the collection, we can also turn to an analysis of some of its parts. These could include the collecting networks of particular individuals such as Uvedale, Plukenet or Petiver, or a more close-grained analysis of individual horti sicci to better understand their structure, and thereby generate insights into potential relationships of their parts and their histories which are not apparent directly from their pages. Realistically, this is probably not going to be within the scope of the current project, but Figure 6 is an initial attempt to represent the different structures and thereby suggest differing management processes in different parts of the volume. It charts the folio number sequentially along the horizontal axis, and plots counts of various features, thereby making especially visible the addition of the Courten-derived materials on many pages in the latter third of the volume.

Figure 6 – Visualisation of the internal structure of a single hortus siccus volume (HS8) in folio sequence, showing occurrences of (a) cross reference types; (b) named collectors; (c) named locations; and (d) number of specimens per page.
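
A per-folio profile of the kind plotted in Figure 6 could be tabulated roughly as follows. This is a minimal sketch, assuming that each observation is recorded as a (folio number, feature type) pair; the feature names, the function name and the sample values are hypothetical and do not reproduce the HS8 data.

```python
def folio_profile(observations, last_folio):
    """Tabulate feature counts along the folio sequence.

    `observations` is a list of (folio_number, feature_type) pairs with
    folio numbers running from 1 to `last_folio` (an assumed layout).
    Returns a dict mapping each feature type to a list of counts,
    one entry per folio, in folio order.
    """
    features = sorted({feature for _, feature in observations})
    profile = {feature: [0] * last_folio for feature in features}
    for folio, feature in observations:
        profile[feature][folio - 1] += 1
    return profile


# Invented placeholder observations, NOT taken from HS8.
sample = [
    (1, "collector"),
    (1, "location"),
    (2, "collector"),
    (2, "collector"),
    (4, "cross_reference"),
]
profile = folio_profile(sample, last_folio=4)
```

Each list in `profile` can be plotted against the folio number on the horizontal axis. A run of folios in which one feature suddenly becomes frequent would then stand out in the same way as the later additions described above.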

This volume (HS8), mostly containing the plants from George Handisyd, does indeed have a more complex history than many horti sicci, but it is far from unique. It is very likely that some sort of network analysis of horti sicci generally may reveal different ‘fingerprints’, which may themselves suggest different histories and interventions which are not apparent from looking at a volume in isolation. Future work (not within scope) may indicate a possible typology of horti sicci, and suggest how they evolved over the course of the seventeenth and eighteenth centuries, depending on use and intent of their creators. Figure 7 is a different representation of the same hortus siccus, focussing on the core network of specimens, modern determination, places, cross references etc, and could be understood as just such a possible ‘fingerprint’. Many other such visualisations are conceivable, and could include pagination sequence, watermarks and other physical features, depending on the material and the research questions.

These network tools are also valuable as a means to unpick some of the complex data relationships within horti sicci volumes. The Handisyd collection includes three separate lists in Sloane’s hand that clearly relate to the specimens, though often use different names than those on the specimen labels. These lists include some additional information, such as habitat and location, so it would be valuable to be able to associate the items together, even speculatively. To some extent this has been possible by visualising the otherwise unconnected small network clusters (as in Figure 8), which greatly facilitates the teasing apart of the dispersed and complicated labels, lists and specimens, and thereby enables potential relationships with other unconnected items to be proposed. A visualisation of incomplete networks supports the making of inferences which would otherwise be hard to discern.

Figure 7 – High-level multipartite network diagram of one hortus siccus (HS8), plotting polynomials, modern determinations, places, internal cross references, cited works, and internal lists. This section shows the core network, and omits the unconnected smaller networks.

Figure 8 – Using simple networks to identify associations between specimens, labels, and list items (in yellow).
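
The separation of a core network from small unconnected clusters, as described for Figures 7 and 8, corresponds to finding the connected components of a graph. The following is a minimal sketch using a breadth-first search. The node labels and edges are invented placeholders, and the actual figures may well have been produced with a dedicated network tool.

```python
from collections import defaultdict, deque


def connected_components(edges):
    """Return the connected components of an undirected graph.

    `edges` is an iterable of (node_a, node_b) pairs. The result is a
    list of node sets, largest component first.
    """
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search from an unvisited node collects one component.
        seen.add(start)
        queue = deque([start])
        component = set()
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbour in adjacency[node] - seen:
                seen.add(neighbour)
                queue.append(neighbour)
        components.append(component)
    return sorted(components, key=len, reverse=True)


# Invented placeholder links between specimens, labels and list items.
sample_edges = [
    ("specimen:1", "label:1"),
    ("label:1", "list-item:1"),
    ("specimen:2", "label:2"),
    ("list-item:9", "place:X"),
]
components = connected_components(sample_edges)
```

Here the first component plays the role of a core network, while the remaining small components are the candidates to inspect by hand when proposing associations between otherwise unconnected specimens, labels and list items.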

The currently imagined data objective is to create a single Sloane Herbarium object that adequately models the component parts and the important relationships between them. The intent is that this will be a TEI corpus with sufficient granularity to support the main types of visualisations and analyses described above. From this it will be straightforward to generate a version of the data for the NHM Data Portal. The corpus will combine many components from a range of data sets: the Dandy IDs and folio-based people and place information; the Sloane horti sicci catalogue data, supplemented by Dandy data for any missing items; the new folio-based inventory of the entire collection; NHM people and places; and new temporal data for each hortus siccus, where possible. Some of these data sets will also need to be modified or added to (e.g. places). Figure 9 shows the likely workflow for this work. Some components may well be deferred, if necessary; the priority will be to generate a core model that can deliver a folio-based data set that combines Dandy and the new inventory by the end of the year.

Figure 9 – Likely workflow for remaining data work.

There are, of course, still many unknowns about the data work ahead, and how best to analyse and make use of it. Much of the focus in the data design will necessarily be on what we know is in there and which can be modelled. However, it will be important not to lose sight of the wider questions about how we can find what is unrepresentable by data alone, those gaps and silences in this complex archive that may give us hints of and insights into other people and their knowledge worlds which may suggest those richer histories that are now crystallised within a 300-year-old collection of dried plants.