Abstract
This thesis is a review of algorithms and statistical complexity results for the manifold intrinsic dimension (ID) estimation problem. The task is as follows: given an independent and identically distributed sample of points from a low-dimensional submanifold embedded in high-dimensional Euclidean space, determine the dimension of the submanifold. This problem is of key interest in data science, as many algorithms can be made to depend on the intrinsic dimension of data, rather than the dimension of its ambient space. We pay close attention to the linear case of this problem, which reduces to principle component analysis (PCA). In the general manifold case, the kinds of approaches become much more diverse. We distinguish two very different kinds of methods: (1) those which isolate a local statistic (e.g., number of neighbors within a certain radius) and analyze its scaling behavior in varying neighborhood sizes, and (2) those which analyze a global statistic and its scaling behavior independent of local information (e.g., the Wasserstein distance between two independently-formed empirical distributions, and how it scales with the size of their samples). We then compare lower bounds on the sample complexity of ID estimation, in a model with noise and a model without.

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Copyright (c) 2025 Noah Bergman
