16 March – Processing Language, Images and Other Data Modalities (Special Talk)

A fundamental problem in artificial intelligence is how to jointly exploit data from different sources — such as audio, images, text, and video — collectively known as multimodal data. In this talk, Andrew Stuart will present a mathematical framework for studying this question, focusing primarily on text and images.
Stuart will begin by describing how large language models (LLMs) operate, addressing the challenging issue of using real-number algorithms to process language. In particular, he will explain next-token prediction — the core of current LLM methodology. He will then focus on the canonical problem of measuring alignment between image and text data (contrastive learning). Finally, he will describe how images can be generated from text prompts (conditional generative modeling).
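As a rough illustration of next-token prediction (a sketch, not drawn from the talk itself): a language model maps a token sequence to real-valued scores (logits) over a vocabulary, a softmax turns those scores into a probability distribution, and the next token is sampled or chosen greedily. The vocabulary and logits below are hypothetical.

```python
import math

def softmax(logits):
    """Map real-valued scores to a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_next_token(logits, vocab):
    """Greedy next-token prediction: pick the highest-probability token."""
    probs = softmax(logits)
    return vocab[probs.index(max(probs))]

# Hypothetical 3-word vocabulary and model scores for "the cat ..."
vocab = ["sat", "ran", "slept"]
logits = [2.1, 0.3, -1.0]
print(greedy_next_token(logits, vocab))  # → sat
```

In practice one usually samples from the softmax distribution (possibly with a temperature) rather than always taking the argmax; the greedy rule above is just the simplest deterministic choice.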
From a mathematical perspective, a unifying theme underlying this work is the minimization of divergences defined on spaces of probability measures. A second key mathematical idea is the attention mechanism — a form of nonlinear correlation between vector-valued sequences. Andrew Stuart aims to explain these concepts — and their relevance to modern machine learning algorithms — in an accessible fashion, suitable for a broad audience from the mathematical and computational sciences.
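A minimal sketch of the attention idea, assuming the standard scaled dot-product variant (the talk's exact formulation may differ): each query vector is correlated with every key vector, the correlations are normalized by a softmax, and the resulting weights form a convex combination of the value vectors.

```python
import math

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors (plain Python)."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Nonlinear correlation: scaled dot products passed through a softmax.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Each output is a convex combination of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

# Toy example: one query that aligns with the first of two keys.
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
out = attention(Q, K, V)
# The output row is a convex combination of V's rows, weighted toward the first.
```

Because the softmax weights are nonnegative and sum to one, each output vector lies in the convex hull of the value vectors, which is one way to see attention as a (nonlinear) correlation-weighted average.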
Andrew Stuart has been the Bren Professor of Computing and Mathematical Sciences at Caltech since 2016. Before that, he spent 17 years as Professor of Mathematics at the University of Warwick (1999–2016). Prior to that, he was on the faculty of the Departments of Computer Science and Mechanical Engineering at Stanford University (1992–1999), and of the Mathematics Department at the University of Bath (1989–1992). He obtained his PhD from the Oxford University Computing Laboratory in 1986 and held postdoctoral positions in mathematics at Oxford University and at MIT from 1986 to 1989.
Download the poster here