Advantages and Limitations of Deep Networks as Local Interpolators, Not Global Approximators
This post addresses a common misconception in the mathematics community: the belief that deep networks can serve as global approximators of a target function across the entire input domain. My aim is to emphasize the limits of deep networks' global approximation capabilities, rather than accepting such claims uncritically, and to highlight how their strengths as local interpolators can be leveraged effectively. To be clear, deep networks are fundamentally limited in their ability to learn most globally defined mathematical transforms, such as the Fourier, Radon, and Laplace transforms, particularly in high-dimensional settings. (I am aware of papers claiming that deep networks can learn the Fourier transform, but these are limited to low-dimensional cases with small pixel counts.)
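To make this concrete, here is a minimal numerical sketch of what goes wrong when a network is asked to emulate even a small discrete Fourier transform outside its training distribution. Everything in it (the signal family, the architecture, and the use of NumPy and scikit-learn) is my own illustrative assumption, not taken from the papers mentioned above.

```python
# Sketch: an MLP trained to emulate the DFT on a narrow signal family
# has no reason to be accurate on signals outside that family.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
N = 32  # signal length

def dft_features(x):
    """Stack real and imaginary parts of the DFT as the regression target."""
    X = np.fft.fft(x, axis=-1)
    return np.concatenate([X.real, X.imag], axis=-1)

def lowfreq_signals(n):
    """Training family: random sums of three low-frequency sinusoids."""
    t = np.arange(N) / N
    k = rng.integers(1, 4, size=(n, 3))       # frequencies 1..3 only
    a = rng.normal(size=(n, 3))
    return np.sum(a[:, :, None] * np.sin(2 * np.pi * k[:, :, None] * t), axis=1)

X_train = lowfreq_signals(10000)
X_test_in = lowfreq_signals(1000)             # in-distribution test set
X_test_out = rng.normal(size=(1000, N))       # out-of-distribution: white noise

net = MLPRegressor(hidden_layer_sizes=(128, 128), max_iter=500, random_state=0)
net.fit(X_train, dft_features(X_train))

def rel_err(X):
    Y, Y_hat = dft_features(X), net.predict(X)
    return np.linalg.norm(Y - Y_hat) / np.linalg.norm(Y)

print("relative error, in-distribution:", rel_err(X_test_in))   # typically small
print("relative error, white noise    :", rel_err(X_test_out))  # typically much larger
```

The training signals span only a low-dimensional subspace of the input space, so the network learns the DFT restricted to that subspace; nothing in the training objective constrains its output elsewhere.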
The misconception often stems from the influence of the Barron space framework, which establishes dimension-independent approximation rates for functions whose Fourier transforms have a finite first moment. While these results show that neural networks are effective approximators for certain classes of functions, they do not imply that deep networks trained on finite datasets can achieve global approximation across high-dimensional input domains. Instead, as I have emphasized in previous posts, deep networks function as local interpolators, approximating the target function only in regions close to the data distribution. This limitation arises because their training process minimizes a loss function on finite training data, inherently restricting their ability to generalize beyond regions where data is available.
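For reference, Barron's classical result can be stated informally as follows (constants and technical conditions are simplified here): if $C_f = \int_{\mathbb{R}^d} |\omega|\,|\hat{f}(\omega)|\,d\omega < \infty$, then for any probability measure $\mu$ supported on a ball of radius $r$ there exists a one-hidden-layer network $f_n$ with $n$ units such that

$$ \int \bigl(f(x) - f_n(x)\bigr)^2 \, \mu(dx) \;\le\; \frac{(2 r C_f)^2}{n}. $$

Note that this is an existence statement in $L^2(\mu)$ for a fixed measure $\mu$: it says nothing about what a network trained on finitely many samples does in regions where $\mu$ (or the empirical data distribution) places little or no mass.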
The local interpolator property of deep networks, driven by their data-dependent nature, enables them to adapt to specific datasets, making them highly effective for domain-specific tasks where the data distribution is well-sampled. While their reliance on local interpolation may result in poor performance on out-of-distribution inputs, it proves particularly advantageous in high-dimensional spaces, where the vast input domain cannot be fully explored. By concentrating on regions where data is densely distributed, deep networks partially mitigate the curse of dimensionality.
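The following toy experiment illustrates this behaviour. It is a minimal sketch under assumptions of my own choosing (a one-dimensional target, a small MLP, scikit-learn as the training library), not a result from the medical-imaging setting discussed below.

```python
# Sketch: a small MLP fits sin(x) well inside the sampled interval,
# but nothing forces it to be accurate outside that interval.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Training data covers only [-pi, pi].
x_train = rng.uniform(-np.pi, np.pi, size=(5000, 1))
y_train = np.sin(x_train).ravel()

net = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0)
net.fit(x_train, y_train)

def mse(lo, hi):
    x = np.linspace(lo, hi, 1000).reshape(-1, 1)
    return np.mean((net.predict(x) - np.sin(x).ravel()) ** 2)

print("MSE inside the data region  [-pi, pi] :", mse(-np.pi, np.pi))      # near zero
print("MSE outside the data region [2pi, 4pi]:", mse(2*np.pi, 4*np.pi))   # typically large
```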
Why deep networks are NOT global approximators: a practical example in cancer classification
To explain this concept more clearly, let us consider an example of a cancer classification problem—specifically, determining whether a chest X-ray image contains cancer. For simplicity, assume each chest X-ray image is a 300×300-pixel image with 256 grayscale levels. The goal is to learn a network function $f$ that maps an X-ray image to either 0 (no cancer) or 1 (cancer). The total number of possible 300×300-pixel images with 256 grayscale levels is astronomically large: $256^{300\times 300}$, a number far exceeding the estimated number of atoms in the observable universe. However, in reality, all possible chest X-ray images occupy only a tiny fraction of this enormous space. A well-trained neural network typically operates effectively only within the immediate vicinity of the data distribution, which constitutes an extremely small fraction of the entire input space—often far less than 0.00001%, especially in high-dimensional settings.
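A quick back-of-the-envelope computation makes the size of this space concrete (the snippet below is purely illustrative):

```python
# Number of decimal digits in 256^(300*300), i.e. the count of all possible
# 300x300 images with 256 grayscale levels.
import math

num_digits = 300 * 300 * math.log10(256)
print(round(num_digits))   # about 216,742 decimal digits
```

So the number of possible images has roughly 216,742 decimal digits, whereas the commonly cited estimate for the number of atoms in the observable universe, about $10^{80}$, has only 81 digits.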
Now, suppose we have access to an exceptional dataset of 100 billion labeled training examples that satisfy the i.i.d. (independent and identically distributed) condition—although such a dataset is practically impossible to collect. Additionally, imagine that 10 million experts around the world independently develop high-performing cancer detection networks using the same fixed CNN architecture. Assume that all these networks achieve perfect classification performance on the training data.
From a practical perspective, we might consider all these networks to be equivalent because they perform identically on the training data and, presumably, generalize well to unseen examples drawn from the same distribution. However, when viewed as functions over the entire input domain of $256^{300\times 300}$ possible images, the situation changes dramatically. The likelihood that any two networks behave identically for all possible inputs is virtually zero. This is because each network is trained on a finite subset of the input space, leaving the rest unconstrained: training involves stochastic processes such as random initialization, batch sampling, and optimizer dynamics, and because deep networks are heavily overparameterized, many different parameter configurations yield the same performance on the training data. Consequently, the networks' global behavior across the input domain differs significantly. This diversity arises because deep networks are local solutions, shaped by the specific data they are exposed to during training, and are not designed to approximate functions globally.
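A scaled-down version of this thought experiment can be run directly. The sketch below is my own illustration (the toy data, the architecture, and the use of scikit-learn are all assumptions): two networks trained with different random seeds agree perfectly on the training data, yet need not agree on inputs far from the data distribution.

```python
# Sketch: same data, same architecture, different seeds -> identical training
# accuracy, but potentially different behavior far from the data.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Toy "images": 100-dimensional inputs clustered near two class centers.
d = 100
centers = rng.normal(size=(2, d))
X_train = np.vstack([centers[c] + 0.1 * rng.normal(size=(500, d)) for c in (0, 1)])
y_train = np.repeat([0, 1], 500)

nets = [MLPClassifier(hidden_layer_sizes=(128,), max_iter=1000, random_state=s)
        .fit(X_train, y_train) for s in (1, 2)]

# Both networks fit the training data perfectly ...
print([net.score(X_train, y_train) for net in nets])          # [1.0, 1.0]

# ... but on inputs far from the training distribution they need not agree.
X_far = 10.0 * rng.normal(size=(10000, d))
agreement = np.mean(nets[0].predict(X_far) == nets[1].predict(X_far))
print("agreement on far-away inputs:", agreement)             # typically < 1.0
```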
Recognizing Limitations: Maximizing the Effectiveness of Deep Learning in High-Dimensional Domains
Understanding these limitations is critical for effectively applying deep learning to real-world problems, particularly in high-dimensional domains like medical imaging. While deep networks excel at fitting data and interpolating within the observed distribution, they cannot guarantee meaningful behavior outside this region.