Overfitting and Regularization
==============================

In statistics and machine learning, overfitting occurs when a
statistical model describes random errors or noise instead of the
underlying relationships. Overfitting generally occurs when a model is
**excessively complex**, such as having **too many parameters relative
to the number of observations**. A model that has been overfit will
generally have poor predictive performance, as it can exaggerate minor
fluctuations in the data.

A learning algorithm is trained using some set of training samples. If
the learning algorithm has the capacity to overfit the training
samples, the performance on the **training sample set** will improve
while the performance on the unseen **test sample set** will decline.

The overfitting phenomenon has three main explanations:

- excessively complex models,
- multicollinearity, and
- high dimensionality.

Causes of Overfitting
---------------------

Multicollinearity
~~~~~~~~~~~~~~~~~

Multicollinearity arises when predictors are highly correlated, meaning
that one can be linearly predicted from the others. In this situation
the coefficient estimates of the multiple regression may change
erratically in response to small changes in the model or the data.
Multicollinearity does not reduce the predictive power or reliability
of the model as a whole, at least not within the sample data set; it
only affects computations regarding individual predictors. That is, a
multiple regression model with correlated predictors can indicate how
well the entire bundle of predictors predicts the outcome variable, but
it may not give valid results about any individual predictor, or about
which predictors are redundant with respect to others.

In the case of perfect multicollinearity, the matrix
:math:`\mathbf{X}^T \mathbf{X}` is singular and therefore cannot be
inverted. Under these circumstances, for a general linear model
:math:`\mathbf{y} = \mathbf{X} \mathbf{w} + \boldsymbol{\varepsilon}`,
the ordinary least-squares estimator,
:math:`\mathbf{w}_{OLS} = (\mathbf{X}^T \mathbf{X})^{-1}\mathbf{X}^T \mathbf{y}`,
does not exist.
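This failure is easy to reproduce numerically. The following minimal
sketch (NumPy only; the perfectly collinear design below is made up for
illustration) shows that :math:`\mathbf{X}^T \mathbf{X}` is
rank-deficient and that trying to invert it raises an error.

.. code:: ipython3

    import numpy as np

    # Hypothetical design in which the second predictor is an exact
    # multiple of the first (perfect multicollinearity).
    x1 = np.array([10., 20., 30., 40., 50.])
    X_collinear = np.column_stack([x1, .2 * x1])

    XtX = X_collinear.T @ X_collinear
    print("P =", XtX.shape[0], "but rank(X^T X) =", np.linalg.matrix_rank(XtX))

    try:
        np.linalg.inv(XtX)  # fails: the matrix is exactly singular
    except np.linalg.LinAlgError as err:
        print("Inversion fails:", err)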
.. code:: ipython3

    import numpy as np

    # Plot
    import matplotlib.pyplot as plt
    import seaborn as sns

    # Plot parameters
    plt.style.use('seaborn-v0_8-whitegrid')
    fig_w, fig_h = plt.rcParams.get('figure.figsize')
    plt.rcParams['figure.figsize'] = (fig_w, fig_h * .5)

An example where correlated predictors produce an unstable model
follows. We want to predict the business potential (``bp``) of some
companies given their business volume (``bv``) and the taxes (``tax``)
they are paying. Here ``bp`` is roughly 10% of ``bv``. However,
``tax`` = 20% of ``bv`` (``tax`` and ``bv`` are perfectly collinear),
therefore there is an infinite number of linear combinations of
``tax`` and ``bv`` that lead to the same prediction. Solutions with
very large coefficients will produce excessively large predictions.

Multicollinearity between the predictors (business volume and tax)
produces unstable models with arbitrarily large coefficients.

|Multicollinearity between the predictors|

Dealing with multicollinearity:

- Regularization by, e.g., :math:`\ell_2` shrinkage: introduce a bias
  in the solution by making :math:`\mathbf{X}^T \mathbf{X}`
  non-singular. See :math:`\ell_2` shrinkage and the ridge sketch
  after the example below.
- Feature selection: select a small number of features. See: Isabelle
  Guyon and André Elisseeff, *An introduction to variable and feature
  selection*, The Journal of Machine Learning Research, 2003.
- Feature selection: select a small number of features using
  :math:`\ell_1` shrinkage.
- Extract a few independent (uncorrelated) features using, e.g.,
  principal components analysis (PCA), partial least squares
  regression (PLS-R), or regression methods that cut the number of
  predictors to a smaller set of uncorrelated components.

.. |Multicollinearity between the predictors| image:: images/ols_multicollinearity.png
   :width: 10cm

.. code:: ipython3

    bv = np.array([10, 20, 30, 40, 50])              # business volume
    tax = .2 * bv                                    # tax
    bp = .1 * bv + np.array([-.1, .2, .1, -.2, .1])  # business potential

    X = np.column_stack([bv, tax])
    beta_star = np.array([.1, 0])  # true solution

    '''
    Since tax and bv are correlated, there is an infinite number of
    linear combinations leading to the same prediction.
    '''

    # 10 times the bv then subtract it 9 times using the tax variable:
    beta_medium = np.array([.1 * 10, -.1 * 9 * (1 / .2)])
    # 100 times the bv then subtract it 99 times using the tax variable:
    beta_large = np.array([.1 * 100, -.1 * 99 * (1 / .2)])

    print("L2 norm of coefficients: small:%.2f, medium:%.2f, large:%.2f." %
          (np.sum(beta_star ** 2), np.sum(beta_medium ** 2),
           np.sum(beta_large ** 2)))

    print("However all models provide the exact same predictions.")
    assert np.all(np.dot(X, beta_star) == np.dot(X, beta_medium))
    assert np.all(np.dot(X, beta_star) == np.dot(X, beta_large))

.. parsed-literal::

    L2 norm of coefficients: small:0.01, medium:21.25, large:2550.25.
    However all models provide the exact same predictions.
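To illustrate the :math:`\ell_2` shrinkage remedy listed above, here is
a minimal sketch (assuming scikit-learn is available; the penalty
strength ``alpha=10`` is an arbitrary choice) that refits the same
``X``, ``bp`` data with a ridge penalty. Because the penalty on the
squared coefficients biases the fit toward the small-norm solution, the
ridge estimate avoids the arbitrarily large, mutually cancelling
coefficients shown above.

.. code:: ipython3

    # Sketch: l2 (ridge) shrinkage on the collinear bv/tax design.
    # Assumes scikit-learn; reuses X and bp defined in the cell above.
    from sklearn.linear_model import Ridge

    ridge = Ridge(alpha=10., fit_intercept=False).fit(X, bp)

    # The penalty selects one small-norm solution among the infinitely
    # many combinations of bv and tax that fit the data equally well.
    print("Ridge coefficients:", ridge.coef_)
    print("Squared L2 norm: %.4f" % np.sum(ridge.coef_ ** 2))

In practice the penalty strength is tuned by cross-validation, for
instance with ``RidgeCV``.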
Model complexity
~~~~~~~~~~~~~~~~

Model complexity impacts both bias and variance:

- Low-complexity models (underfitting):

  - High bias (priors or assumptions about the data are too strong,
    leading to oversimplified models).
  - Low variance (consistent predictions across different datasets,
    but with poor accuracy).
  - Example: a linear regression model trying to fit a highly
    non-linear dataset, or an over-regularized model.
  - Poor performance on both training and test data.

- High-complexity models (overfitting):

  - Low bias (can fit the training data very well).
  - High variance (small fluctuations in the training data lead to
    large changes in predictions).
  - Example: a deep neural network with too many parameters trained on
    limited data, or an under-regularized model.
  - Good training performance but poor test performance.

The **bias-variance tradeoff** states that as complexity increases,
bias decreases but variance increases. The goal is to find the optimal
model complexity that balances both. Complex learners with too many
parameters relative to the number of observations may overfit the
training dataset.

.. figure:: images/model_complexity_bias_variance.png
   :alt: Model complexity
   :width: 10cm

   Model complexity

The Challenges of High Dimensionality
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

High-dimensional data refers to datasets with a large number of input
features (:math:`P`). In linear models, each feature corresponds to a
parameter, so when the number of features :math:`P` is large compared
to the number of training samples :math:`N` (the “**large P, small N**”
problem), the model tends to overfit the training data. This phenomenon
is part of the **curse of dimensionality**, which describes the
difficulties that arise when working in high-dimensional spaces. One of
the most critical factors in selecting a machine learning algorithm is
the relationship between :math:`P` and :math:`N`, as it significantly
impacts model performance.

Below are three key problems associated with high dimensionality.

**Infinite Solutions and Ill-Conditioned Matrices**

In linear models, the covariance matrix :math:`\mathbf{X}^T \mathbf{X}`
is of size :math:`P \times P` and has rank at most :math:`\min(N, P)`.
When :math:`P > N`, the system of equations becomes
**overparameterized**, meaning there are infinitely many possible
solutions that fit the training data. This leads to poor
generalization, as the learned solutions may be highly specific to the
dataset. In such cases, the covariance matrix is singular or
ill-conditioned, making it unstable for inversion in methods like
ordinary least squares regression.

**Exponential Growth of Sample Requirements**

The density of data points in a high-dimensional space decreases
exponentially with increasing :math:`P`. Specifically, the effective
sampling density of :math:`N` points in a :math:`P`-dimensional space
is proportional to :math:`N^{1/P}`. As a result, the data becomes
increasingly sparse as :math:`P` grows, making it difficult to estimate
distributions or learn meaningful patterns. To maintain a constant
density, the number of required samples grows exponentially. For
example:

- In **1D**, 50 samples provide reasonable coverage.
- In **2D**, approximately **2,500** samples are needed for equivalent
  density.
- In **3D**, around **125,000** samples are required.

This illustrates why high-dimensional problems often suffer from a lack
of sufficient training data.

**Most Data Points Lie on the Edge of the Space**

In high-dimensional spaces, most data points are **closer to the
boundary of the sample space** than to any other data point. Consider
:math:`N` points uniformly distributed in a :math:`P`-dimensional unit
ball. The median distance from the origin to the closest data point is
given by:

.. math::

   d(P, N) = \left(1 - \left(\frac{1}{2}\right)^{1/N}\right)^{1/P}

For example, with :math:`N = 500` and :math:`P = 10`, this distance is
approximately **0.52**, meaning that most data points are more than
halfway to the boundary. This has severe consequences for prediction:

- In lower dimensions, models **interpolate** between data points.
- In high dimensions, models must **extrapolate**, which is
  significantly harder and less reliable.

This explains why many machine learning algorithms perform poorly in
high-dimensional settings and why dimensionality reduction techniques
(e.g., PCA, feature selection) are essential.
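Both effects are easy to check numerically. The short sketch below
(NumPy only) evaluates the median-distance formula above and the
sample counts quoted in the density example.

.. code:: ipython3

    import numpy as np

    def median_dist_to_closest(P, N):
        """Median distance from the origin to the closest of N points
        drawn uniformly in the P-dimensional unit ball."""
        return (1 - 0.5 ** (1 / N)) ** (1 / P)

    print("d(P=10, N=500) = %.2f" % median_dist_to_closest(10, 500))  # ~0.52

    # Samples needed to keep the same density as 50 points in 1D:
    for P in (1, 2, 3):
        print("P=%d -> %d samples" % (P, 50 ** P))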
**Conclusion**

The curse of dimensionality creates fundamental challenges for machine
learning, including overparameterization, data sparsity, and unreliable
predictions. Addressing these issues requires strategies such as
**dimensionality reduction**, **regularization**, and **feature
selection** to ensure that models generalize well and remain
computationally efficient.

(Source: T. Hastie, R. Tibshirani, J. Friedman. *The Elements of
Statistical Learning: Data Mining, Inference, and Prediction*. Second
Edition, 2009.)

Measure of overfitting risk: Vapnik–Chervonenkis (VC) Dimension
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The Vapnik–Chervonenkis (VC) dimension is a fundamental concept in
statistical learning theory that measures the capacity of a hypothesis
class (i.e., the set of functions a model can learn). It provides a way
to quantify a model’s ability to fit data and generalize to unseen
examples.

More precisely, the VC dimension is a measure of the **capacity**
(complexity, expressive power, richness, or flexibility) of a
statistical classification algorithm, defined as the cardinality of the
largest set of points that the algorithm can shatter.

Theorem: linear classifiers in :math:`\mathbb{R}^P` have a VC dimension
of :math:`P+1`. Hence, in dimension two (:math:`P=2`), any labeling of
three non-collinear points can be learned.

.. figure:: images/vc_dimension_linear_2d.png
   :alt: In 2D we can shatter any three non-collinear points
   :width: 15cm

   In 2D we can shatter any three non-collinear points

Regularization Approaches to Mitigate Overfitting
--------------------------------------------------

Regularization techniques help prevent overfitting by constraining the
complexity of machine learning models. Below is a categorized
enumeration of common regularization methods, along with their
summaries.

Norm-Based Regularization (Penalty Methods)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

These techniques add constraints to the model’s parameters to prevent
excessive complexity.

- **L2 Regularization (Ridge Regression / Weight Decay / Shrinkage)**

  - Adds a squared penalty: :math:`\lambda \sum w_i^2`.
  - Shrinks weights but does not eliminate them, reducing model
    sensitivity to noise.
  - Common in linear regression, logistic regression, and deep
    learning.

- **L1 Regularization (Lasso Regression)**

  - Adds an absolute penalty: :math:`\lambda \sum |w_i|`.
  - Promotes sparsity by setting some weights to zero, effectively
    selecting features.
  - Used in high-dimensional datasets to perform feature selection.

- **Elastic Net Regularization**

  - Combines L1 and L2 penalties:
    :math:`\lambda_1 \sum |w_i| + \lambda_2 \sum w_i^2`.
  - Used when dealing with correlated features.

Ensemble Learning Approaches
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- **Bagging & Boosting**

  - **Bagging** (e.g., Random Forests) reduces overfitting by averaging
    multiple models trained on different data subsets.
  - **Boosting** (e.g., XGBoost) adds weak learners sequentially with a
    learning rate to control overfitting.

- **Stacking** reduces overfitting by combining multiple models trained
  on the same data and using a meta-learner for the final prediction.

Data-Filtering / Preprocessing Regularization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- **Feature Selection**

  - Reduces model complexity by removing redundant or irrelevant
    features.
  - Methods include univariate filters (e.g., ``SelectKBest``),
    recursive feature elimination (RFE), and mutual-information
    filtering.

- **Unsupervised Dimension Reduction as a preprocessing step**

  - Reduces model complexity by reducing the dimension of the input
    data.
  - Methods include linear dimension reduction and manifold learning.
  - Unsupervised approaches are generally not efficient, as they tend
    to overfit the data before the supervised stage.

Regularization for Probabilistic Models
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- **Bayesian Regularization**

  - Introduces priors over model parameters; a Gaussian prior
    effectively acts as L2 regularization.
  - Used in Bayesian Neural Networks, Gaussian Processes, and Bayesian
    Ridge Regression.

Regularization in Kernel Methods (SVM, Gaussian Processes)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- **Margin-Based Regularization (SVM)**

  - The **soft margin** parameter :math:`C` controls the trade-off
    between a large margin and misclassification.
  - A smaller :math:`C` encourages more regularization, preventing
    overfitting.

- **Kernel Regularization**

  - Kernel methods (e.g., Gaussian RBF, polynomial kernels) use
    hyperparameters like the kernel bandwidth to control model
    complexity.
  - A wider kernel bandwidth smooths the decision boundary, reducing
    variance.
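To make the soft-margin parameter :math:`C` and the kernel bandwidth
concrete, here is a minimal sketch (assuming scikit-learn; the
synthetic dataset and the hyperparameter values are arbitrary choices
for illustration) comparing cross-validated accuracies of RBF SVMs. A
smaller :math:`C`, or a wider kernel (smaller ``gamma``), yields a
smoother, more regularized decision boundary; whether that helps
depends on the data.

.. code:: ipython3

    # Sketch: effect of the SVM soft-margin parameter C and RBF bandwidth.
    # Assumes scikit-learn; dataset and hyperparameters are illustrative.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    X_clf, y_clf = make_classification(n_samples=200, n_features=20,
                                       n_informative=5, random_state=42)

    # (C, gamma): large C with large gamma is a high-variance model.
    for C, gamma in [(0.1, "scale"), (100.0, "scale"), (100.0, 10.0)]:
        svc = SVC(kernel="rbf", C=C, gamma=gamma)
        scores = cross_val_score(svc, X_clf, y_clf, cv=5)
        print("C=%-6g gamma=%-6s CV accuracy: %.2f"
              % (C, gamma, scores.mean()))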
Regularization in Deep Learning
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- **Dropout**

  - Randomly disables a fraction of neurons during training to reduce
    reliance on specific features.
  - Helps improve generalization in fully connected and convolutional
    networks.

- **Batch Normalization**

  - Normalizes activations across mini-batches, reducing internal
    covariate shift.
  - Acts as an implicit regularizer by smoothing the optimization
    landscape.

- **Early Stopping**

  - Monitors the validation loss and stops training when it stops
    decreasing.
  - Prevents the model from overfitting to the training data.

- **Weight Decay (L2 Regularization in Neural Networks)**

  - Reduces the magnitude of neural network weights to prevent
    overfitting.
  - Equivalent to L2 regularization in traditional machine learning
    models.

Data-Centric Regularization
~~~~~~~~~~~~~~~~~~~~~~~~~~~

- **Data Augmentation**

  - Artificially increases the dataset size using transformations
    (e.g., rotations, scaling, flipping).
  - Particularly useful in image and text processing tasks.

- **Adding Noise to Inputs or Weights**

  - Introduces small random noise to training data or network weights
    to improve robustness.
  - Common in deep learning and reinforcement learning.

Regularization for Tree-Based Models
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- **Pruning (Decision Trees, Random Forests, Gradient Boosting)**

  - Removes branches that have low importance to reduce complexity.
  - Prevents trees from memorizing noise.
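As a small illustration of pruning, the sketch below (assuming
scikit-learn; the synthetic dataset and the ``ccp_alpha`` value are
arbitrary choices) compares a fully grown decision tree with a
cost-complexity-pruned one on a held-out split.

.. code:: ipython3

    # Sketch: cost-complexity pruning of a decision tree.
    # Assumes scikit-learn; dataset and ccp_alpha are illustrative choices.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X_tree, y_tree = make_classification(n_samples=300, n_features=20,
                                         random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X_tree, y_tree, test_size=0.5,
                                              random_state=0)

    full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    pruned = DecisionTreeClassifier(ccp_alpha=0.02,
                                    random_state=0).fit(X_tr, y_tr)

    # The pruned tree has fewer leaves; compare train vs test accuracy.
    for name, tree in [("full", full), ("pruned", pruned)]:
        print("%6s tree: %3d leaves, train acc %.2f, test acc %.2f"
              % (name, tree.get_n_leaves(),
                 tree.score(X_tr, y_tr), tree.score(X_te, y_te)))

In practice, ``ccp_alpha`` (or ``max_depth`` / ``min_samples_leaf``) is
selected by cross-validation, for instance along the path returned by
``DecisionTreeClassifier.cost_complexity_pruning_path``.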
Summary
~~~~~~~

.. list-table::
   :header-rows: 1
   :widths: 22 22 56

   * - **Category**
     - **Method**
     - **Description**
   * - **Norm-Based Regularization**
     - **L2 Regularization (Ridge, Weight Decay, Shrinkage)**
     - Adds a squared penalty on coefficients; shrinks weights but does
       not eliminate them, reducing noise sensitivity.
   * -
     - **L1 Regularization (Lasso)**
     - Adds an absolute penalty on coefficients; promotes sparsity by
       setting some weights to zero (feature selection).
   * -
     - **Elastic Net**
     - Combines L1 and L2 penalties; useful for correlated features.
   * - **Ensemble Learning Regularization**
     - **Bagging**
     - Reduces overfitting by averaging multiple models trained on
       different subsets (e.g., Random Forests).
   * -
     - **Boosting**
     - Sequentially adds weak learners with a learning rate to control
       overfitting (e.g., XGBoost).
   * -
     - **Stacking**
     - Combines multiple models trained on the same data and uses a
       meta-learner for final predictions.
   * - **Data-Filtering / Preprocessing Regularization**
     - **Feature Selection**
     - Reduces model complexity by removing redundant or irrelevant
       features (e.g., SelectKBest, RFE).
   * -
     - **Unsupervised Dimension Reduction**
     - Reduces data dimensionality before modeling (e.g., PCA, manifold
       learning); less efficient for supervised learning.
   * - **Regularization for Probabilistic Models**
     - **Bayesian Regularization**
     - Introduces priors over parameters, acting as L2 regularization
       (e.g., Bayesian Ridge Regression, Gaussian Processes).
   * - **Regularization in Kernel Methods**
     - **Margin-Based Regularization (SVM)**
     - The soft-margin parameter (C) controls the trade-off between
       margin size and misclassification.
   * -
     - **Kernel Regularization**
     - Kernel hyperparameters (e.g., RBF bandwidth) control decision
       boundary complexity.
   * - **Regularization in Deep Learning**
     - **Dropout**
     - Randomly disables neurons during training to reduce reliance on
       specific features.
   * -
     - **Batch Normalization**
     - Normalizes activations across mini-batches, reducing internal
       covariate shift and smoothing optimization.
   * -
     - **Early Stopping**
     - Stops training when validation loss stops improving to prevent
       overfitting.
   * -
     - **Weight Decay (L2 in Deep Learning)**
     - Applies L2 regularization to neural network weights.
   * - **Data-Centric Regularization**
     - **Data Augmentation**
     - Increases dataset size using transformations (e.g., rotations,
       scaling, flipping) to improve generalization.
   * -
     - **Adding Noise**
     - Introduces noise to inputs or weights to improve model
       robustness.
   * - **Regularization for Tree-Based Models**
     - **Pruning**
     - Removes low-importance branches in decision trees and ensemble
       models to reduce complexity.