PhD Thesis

Below is a summary of my PhD thesis:

Originally formulated by the French mathematician Gaspard Monge to minimize the human effort of moving soil from one place to another, optimal transport (OT) provides a tool to compare and transfer probability distributions. Quantifying the similarity between distributions is at the core of data analysis in applications such as classification, clustering, image analysis and signal processing. Over the past decade, OT has proven valuable in addressing limitations of traditional statistics, yet our understanding of it as a statistical tool is still in its infancy. Two major questions remain open: what kinds of statistical problems can be solved by OT, and can they be solved efficiently for large-scale datasets and high-dimensional probability distributions?

The ultimate goal of my dissertation is to develop a unified theoretical and computational framework that incorporates OT as a versatile component of the data analysis toolbox. A variety of problems in statistics and data analysis are formulated and investigated through OT under different settings. Particular focus is given to scenarios that involve explaining the relationship between variables of interest and external factors. On the practical side, data-driven numerical methods are proposed to solve these problems accurately and efficiently. We also demonstrate a wide range of real-world applications enabled by this new framework.

Given two probability distributions labelled 'source' and 'target', Monge's OT problem seeks a map that transforms the source into the target while minimizing the total transport cost. The cost is a problem-specific function, typically a distance, of the start and end points of each move, and the minimal value of the total cost is a meaningful quantification of the 'distance' between source and target.
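
For concreteness, Monge's problem can be written as follows. The notation is an assumption introduced here for illustration and is not taken from the thesis: \mu is the source distribution, \nu the target, c(x, y) the cost of moving a unit of mass from x to y, and T_{\#}\mu the pushforward of \mu by the map T.

```latex
\min_{T \,:\, T_{\#}\mu = \nu} \; \int c\bigl(x, T(x)\bigr)\, \mathrm{d}\mu(x),
\qquad
T_{\#}\mu(B) := \mu\bigl(T^{-1}(B)\bigr) \ \text{for every measurable set } B .
```

The constraint T_{\#}\mu = \nu says that T transforms the source into the target; the minimal value of the integral is the 'distance' between the two distributions referred to above.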

Many tasks in statistics, image processing and data analysis can be formulated as Monge problems. For example, in color transfer, a source image is re-rendered in the color pattern of a target image by mapping its RGB values pixel by pixel. In Bayesian inference, sampling from a posterior distribution is enabled by a map that pushes the prior forward to the posterior. In my thesis, the applicability of OT in data analysis is investigated through three representative problems, all of which involve analyzing the relationship between variables of interest and external factors, the 'covariates'.
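
The color transfer example above can be illustrated with a deliberately simplified sketch. It is not the method developed in the thesis: it assumes grayscale intensities instead of RGB values, equally many source and target pixels, and a squared-difference cost. Under those assumptions the optimal Monge map in one dimension is known to be the monotone (rank-matching) map, and the brute-force search is included only to check that on a tiny input. All function names are hypothetical.

```python
from itertools import permutations


def monotone_transfer(source, target):
    """Send the k-th smallest source value to the k-th smallest target value."""
    if len(source) != len(target):
        raise ValueError("this sketch assumes equally many source and target pixels")
    # Indices of the source pixels, ordered from darkest to brightest.
    order = sorted(range(len(source)), key=lambda i: source[i])
    sorted_target = sorted(target)
    mapped = [0] * len(source)
    for rank, i in enumerate(order):
        mapped[i] = sorted_target[rank]
    return mapped


def total_cost(source, mapped):
    """Total transport cost with squared-difference cost per pixel."""
    return sum((s - m) ** 2 for s, m in zip(source, mapped))


def brute_force_cost(source, target):
    """Minimal total cost over all one-to-one assignments (tiny inputs only)."""
    return min(total_cost(source, perm) for perm in permutations(target))


source_pixels = [200, 30, 120, 75]   # hypothetical source intensities
target_pixels = [10, 90, 60, 250]    # hypothetical target intensities

recolored = monotone_transfer(source_pixels, target_pixels)
print(recolored)
print(total_cost(source_pixels, recolored))
print(brute_force_cost(source_pixels, target_pixels))
```

The recolored pixels carry exactly the target's intensity values, so the source has been transformed into the target, and the two printed costs agree, which is the minimal total cost playing the role of the 'distance' between the two intensity distributions. Real color transfer works in three-dimensional RGB space, where sorting no longer applies and a genuine OT solver is needed.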