Advantages and Limitations of Deep Networks as Local Interpolators, Not Global Approximators

This blog post addresses a common misconception in the mathematics community: the belief that deep networks can serve as global approximators of a target function across the entire input domain. My aim is to emphasize the limits of deep networks' global approximation capabilities, rather than accepting such claims uncritically, and to highlight how their strengths as local interpolators can be effectively leveraged. To clarify, deep networks are fundamentally limited in their ability to learn most globally defined mathematical transforms, such as the Fourier, Radon, and Laplace transforms, particularly in high-dimensional settings. (I am aware of papers claiming that deep networks can learn the Fourier transform, but these are limited to low-dimensional cases with small pixel counts.)

The misconception often stems from the influence of the Barron space framework, which provides a theoretical basis for function approximation. While Barron spaces suggest that neural networks are effective approximators for certain types of functions, this does not imply that deep networks trained on finite datasets can achieve global approximation across high-dimensional input domains. Instead, as I have emphasized in previous posts, deep networks function as local interpolators, approximating the target function only in regions close to the data distribution. This limitation arises because their training process minimizes a loss function on finite training data, inherently restricting their ability to generalize beyond regions where data is available.

The local interpolator property of deep networks, driven by their data-dependent nature, enables them to adapt to specific datasets, making them highly effective for domain-specific tasks where the data distribution is well-sampled. While their reliance on local interpolation may result in poor performance on out-of-distribution inputs, it proves particularly advantageous in high-dimensional spaces, where the vast input domain cannot be fully explored. By concentrating on regions where data is densely distributed, deep networks partially mitigate the curse of dimensionality.

Why deep networks are NOT global approximators: a practical example in cancer classification

To explain this concept more clearly, let us consider an example of a cancer classification problem—specifically, determining whether a chest X-ray image contains cancer. For simplicity, assume each chest X-ray image is a 300×300-pixel image with 256 grayscale levels. The goal is to learn a network function $f$ that maps an X-ray image to either 0 (no cancer) or 1 (cancer). The total number of possible 300×300-pixel images with 256 grayscale levels is astronomically large: $256^{300\times 300}$, a number far exceeding the estimated number of atoms in the observable universe. However, in reality, all possible chest X-ray images occupy only a tiny fraction of this enormous space. A well-trained neural network typically operates effectively only within the immediate vicinity of the data distribution, which constitutes an extremely small fraction of the entire input space—often far less than 0.00001%, especially in high-dimensional settings.
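To get a feel for the size of this space, the count can only be handled in log space; a few lines of Python (a back-of-the-envelope sketch, with the image dimensions from the example above) make the scale concrete:

```python
import math

# Number of possible 300x300 images with 256 grayscale levels is 256^90000.
# The number itself is far too large to compute directly, so work in log space.
pixels = 300 * 300
levels = 256

digits = pixels * math.log10(levels)  # log10(256^90000)

print(f"256^90000 has about {round(digits):,} digits")  # about 216,742 digits
```

For comparison, the estimated number of atoms in the observable universe is on the order of $10^{80}$, a number with a mere 81 digits.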

Now, suppose we have access to an exceptional dataset of 100 billion labeled training examples that satisfy the i.i.d. (independent and identically distributed) condition—although such a dataset is practically impossible to collect. Additionally, imagine that 10 million experts around the world independently develop high-performing cancer detection networks using the same fixed CNN architecture. Assume that all these networks achieve perfect classification performance on the training data.

From a practical perspective, we might consider all these networks to be equivalent because they perform identically on the training data and, presumably, generalize well to unseen examples drawn from the same distribution. However, when viewed as functions over the entire input domain of $256^{300\times 300}$ possible images, the situation changes dramatically. The likelihood that any two networks behave identically for all possible inputs is virtually zero. This is because each network is trained on a finite subset of the input space, leaving the rest unconstrained. Training involves stochastic processes such as random initialization, batch sampling, and optimizer dynamics, leading to different parameter configurations. Moreover, deep networks are often overparameterized, so many different parameter configurations yield the same performance on the training data. Consequently, their global behavior across the input domain differs significantly. This diversity arises because deep networks are local solutions, shaped by the specific data they are exposed to during training, and are not designed to approximate functions globally.
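This phenomenon can be sketched without training any CNN. In the toy example below, two overparameterized minimum-norm random-feature models (a crude stand-in for two differently initialized networks; all names and values are illustrative) interpolate the same five training points essentially exactly, yet return unrelated values far from the data:

```python
import numpy as np

def fit_predict(seed, x_train, y_train, x_query, n_features=50):
    """Fit an overparameterized random-feature model and evaluate it."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=n_features)            # random feature frequencies
    b = rng.uniform(0, 2 * np.pi, n_features)  # random feature phases
    phi = lambda x: np.cos(np.outer(x, w) + b)
    # Minimum-norm least squares: 5 equations, 50 unknowns -> exact interpolation.
    coef, *_ = np.linalg.lstsq(phi(x_train), y_train, rcond=None)
    return phi(x_query) @ coef

x_train = np.linspace(0.0, 1.0, 5)
y_train = np.sin(2 * np.pi * x_train)
x_query = np.array([0.0, 0.25, 0.5, 0.75, 1.0, 5.0])  # x = 5 is far off-distribution

p1 = fit_predict(seed=0, x_train=x_train, y_train=y_train, x_query=x_query)
p2 = fit_predict(seed=1, x_train=x_train, y_train=y_train, x_query=x_query)

# Both models reproduce the training data to numerical precision...
print(np.max(np.abs(p1[:5] - y_train)))  # ~0
print(np.max(np.abs(p2[:5] - y_train)))  # ~0
# ...but give unrelated answers at x = 5, far from the data.
print(p1[5], p2[5])
```

The finite training set pins down both functions only near the data; everywhere else, their behavior is an accident of the random initialization, exactly as argued above.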

Recognizing Limitations: Maximizing the Effectiveness of Deep Learning in High-Dimensional Domains

Understanding these limitations is critical for effectively applying deep learning to real-world problems, particularly in high-dimensional domains like medical imaging. While deep networks excel at fitting data and interpolating within the observed distribution, they cannot guarantee meaningful behavior outside this region.

An example of losing the merits of deep networks as local interpolators: Kolmogorov–Arnold Networks (KANs)

KANs are based on the Kolmogorov–Arnold representation theorem. Similar to the Barron space framework, KANs aim to provide globally defined representations across the entire input space. However, this global focus makes KANs less effective at approximating functions locally, particularly in regions where the data is concentrated.
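For reference, the Kolmogorov–Arnold representation theorem states that any continuous function $f$ on the unit cube $[0,1]^n$ admits an exact representation in terms of continuous univariate functions:

$$
f(x_1, \dots, x_n) = \sum_{q=0}^{2n} \Phi_q\!\left( \sum_{p=1}^{n} \phi_{q,p}(x_p) \right)
$$

Note that this representation holds over the entire cube, not merely near a data distribution, which is precisely the global perspective at issue here.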
In the context of medical image analysis, KANs are less efficient than standard deep learning models because they lack hierarchical architectures for learning localized features and use their parameters inefficiently. Consequently, they fail to leverage the key advantage of data-driven adaptability, which is a defining strength of standard deep learning. These limitations significantly reduce their practicality for tasks that rely on local interpolation and adaptability to the data distribution.

Attempting to address the weaknesses of deep learning against adversarial attacks can result in a loss of its inherent advantages. Deep learning's susceptibility to adversarial attacks stems from its fundamental nature as a local interpolator, with a strong focus on fitting the data distribution. While robustness to adversarial attacks can be improved to some degree, achieving complete robustness is ultimately impossible. Consequently, there is an inherent trade-off between the local interpolation capability of deep learning and its robustness to adversarial attacks.
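The mechanism behind such attacks can be sketched in a few lines. Below is a minimal, illustrative gradient-sign perturbation (in the spirit of the fast gradient sign method) against a fixed logistic model; the weights, input, label, and budget are assumed toy values, not a real medical model:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([1.5, -2.0, 0.5])   # assumed trained weights (toy values)
x = np.array([1.0, -1.0, 2.0])   # a correctly classified input with label y = 1
y = 1.0
eps = 0.1                        # per-coordinate perturbation budget

p = sigmoid(w @ x)               # confident, correct prediction
grad_x = (p - y) * w             # gradient of the logistic loss w.r.t. the input
x_adv = x + eps * np.sign(grad_x)  # step in the loss-increasing direction

p_adv = sigmoid(w @ x_adv)
print(p, p_adv)                  # the perturbed input lowers the model's confidence
```

Even this tiny perturbation measurably degrades the model's confidence, and a larger budget flips this model's prediction outright; the attack works precisely because the model is only constrained near its training data.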

Deep learning's memorization bias is inherent to its role as a local interpolator, making it effective for tasks involving uniform datasets with simple correlations and associations. While ChatGPT has achieved notable success in natural language tasks, Watson Health's failure to deliver personalized cancer treatments underscores the shortcomings of current deep learning systems. These models rely heavily on statistical patterns in data, often overlooking subtle, context-dependent cues that human experts, such as doctors, can detect. 

Generative models such as GANs and Diffusion models, widely celebrated for their ability to synthesize and replicate content in entertainment industries, encounter significant challenges in medical applications. In medical imaging, these models frequently misinterpret anomalous features as noise or perturbations, due to their bias toward reproducing common patterns from training data. This tendency risks suppressing rare but critical features, potentially leading to misdiagnoses. Furthermore, when applied to out-of-distribution data, generative models may produce hallucinations, generating unrealistic features that compromise their reliability.

Supervised learning, while not free of limitations, is often more effective at reducing memorization bias compared to unsupervised or generative approaches. (Note: Although IBM's Watson for Oncology utilized supervised learning techniques to assist in delivering personalized cancer treatments, it ultimately failed due to the overwhelming complexity of the task.) In other words, supervised learning is better suited for applications where rare or anomalous patterns must be preserved and accurately interpreted, such as detecting cancers or other critical medical conditions. This advantage stems from the fact that supervised models are guided by explicit labels during training, which provide a strong signal to associate specific inputs with desired outputs. These labels help anchor the model's focus on meaningful patterns, reducing the likelihood of overfitting to noise or irrelevant correlations. However, supervised learning has its drawbacks. Collecting labeled data is often costly, labor-intensive, and, in some cases, not feasible. 

The key message I want to convey in this blog is that we must first acknowledge the limitations of AI technology and ensure that we do not compromise its strengths while addressing its weaknesses. Deep learning's ability to closely fit training data can be seen as both an advantage and a drawback. Therefore, we should aim for a balanced approach that leverages its strengths while thoughtfully addressing its shortcomings.
