Gradient descent
================

Gradient descent is an **optimization algorithm** used to **minimize a function** by
**iteratively moving in the direction of steepest descent**, as defined by the
**negative of the gradient**. In machine learning, we use gradient descent to
**update the parameters of our model**. **Parameters** refer to **coefficients** in
**linear regression** and **weights** in **neural networks**.

This section aims to provide you with an explanation of gradient descent and
**intuitions** about the **behaviour of the different algorithms used to optimize it**.
These explanations will help you put them to use. We first introduce gradient
descent, solve it for a regression problem and look at its different variants. Then,
we briefly summarize the challenges encountered during training. Finally, we
introduce the **most common optimization algorithms** by showing how they are
motivated by these challenges, and list some advice to facilitate the choice of
algorithm.

Introduction
------------

Consider the 3-dimensional graph below in **the context of a cost function**. Our
**goal** is to **move from the mountain in the top right corner (high cost) to the
dark blue sea in the bottom left (low cost)**. The **arrows** represent the
**direction** of steepest descent (negative gradient) from any given point: the
direction that **decreases the cost function** as quickly as possible.
.. figure:: ./images/gradient_descent_intuition.png
   :alt: adalta.it

   adalta.it

Gradient descent intuition.
Starting at the top of the mountain, we take our first step **downhill** in the
**direction specified by the negative gradient**. Next we **recalculate the negative
gradient** (passing in the coordinates of our new point) and take another step in
the direction it specifies. We continue this **process iteratively until** we get to
the **bottom of our graph**, or to a **point where we can no longer move downhill: a
local minimum**.
Learning rate
~~~~~~~~~~~~~

The **size of these steps** is called the **learning rate**. With a **high learning
rate** we can cover more ground each step, but we **risk overshooting the lowest
point** since the slope of the hill is constantly changing. With a **very low
learning rate**, we can **confidently move in the direction of the negative
gradient** since we are **recalculating it so frequently**. A **low learning rate is
more precise**, but calculating the gradient is **time-consuming**, so it will take
us a very long time to get to the bottom.
.. figure:: ./images/learning_rate_choice.png
   :alt: jeremyjordan

   jeremyjordan

Impacts of learning rate choice.
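To make this trade-off concrete, consider a minimal sketch that runs gradient
descent on the toy quadratic :math:`f(x) = x^2`; the starting point, number of steps
and learning rates below are illustrative choices only.

.. code:: python

    def grad_f(x):
        """Gradient of f(x) = x ** 2."""
        return 2 * x

    def run(learning_rate, x0=5.0, n_steps=20):
        x = x0
        for _ in range(n_steps):
            x -= learning_rate * grad_f(x)  # step in the negative gradient direction
        return x

    print(run(learning_rate=0.01))  # too small: still far from the minimum at 0
    print(run(learning_rate=0.1))   # converges close to 0
    print(run(learning_rate=1.1))   # too large: overshoots and diverges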
Cost function
~~~~~~~~~~~~~

A **loss function (error function)** tells us **"how good"** our **model** is at
making predictions for a **given set of parameters**. The cost function has its own
curve and its own gradients. The **slope of this curve** tells us how to **update
our parameters** to make the model more **accurate**.

Numerical solution for gradient descent
---------------------------------------

Let's run gradient descent using a **linear regression cost function**. There are
**two parameters** in our cost function we can control:

- :math:`\beta_0` : **(the bias)**
- :math:`\beta_1` : **(weight or coefficient)**

Since we need to consider the **impact each one** has on the final prediction, we
need to use **partial derivatives**. We calculate the **partial derivatives of the
cost function with respect to each parameter and store the results in a gradient**.

Solution
~~~~~~~~

**Given the cost function**

.. math::

   f(\beta_0, \beta_1)
   = \frac{1}{2} MSE
   = \frac{1}{2N} \sum_{i=1}^{N} (y_i - (\beta_1 x_i + \beta_0))^2
   = \frac{1}{2N} \sum_{i=1}^{N} ((\beta_1 x_i + \beta_0) - y_i)^2

**The gradient can be calculated as**

.. math::

   \begin{split}
   f'(\beta_0, \beta_1) =
   \begin{bmatrix}
     \frac{\partial f}{\partial \beta_0}\\
     \frac{\partial f}{\partial \beta_1}\\
   \end{bmatrix}
   =
   \begin{bmatrix}
     \frac{1}{2N} \sum 2((\beta_1 x_i + \beta_0) - y_i) \\
     \frac{1}{2N} \sum 2 x_i ((\beta_1 x_i + \beta_0) - y_i) \\
   \end{bmatrix}
   =
   \begin{bmatrix}
     \frac{1}{N} \sum ((\beta_1 x_i + \beta_0) - y_i) \\
     \frac{1}{N} \sum x_i ((\beta_1 x_i + \beta_0) - y_i) \\
   \end{bmatrix}
   \end{split}

To solve for the gradient, we **iterate** through our **data points** using our
current **:math:`\beta_1` and :math:`\beta_0` values** and compute the **partial
derivatives**. This **new gradient** tells us the **slope of our cost function at
our current position** (current parameter values) and the **direction we should move
to update our parameters**. The **size of our update** is **controlled by the
learning rate**.

**Pseudocode of this algorithm**

.. code:: python

    def gradient_descent(X, Y, learning_rate, n_iterations):
        m, b = 1.0, 1.0                     # initial weight and bias
        n = len(X)
        for _ in range(n_iterations):
            m_deriv, b_deriv = 0.0, 0.0     # reset the gradient accumulators
            for i in range(n):
                error = (m * X[i] + b) - Y[i]
                m_deriv += X[i] * error     # d f / d beta_1
                b_deriv += error            # d f / d beta_0
            # move against the gradient, scaled by the learning rate
            m -= (m_deriv / n) * learning_rate
            b -= (b_deriv / n) * learning_rate
        return m, b

Gradient descent variants
-------------------------

There are **three variants of gradient descent**, which **differ in how much data we
use to compute the gradient** of the objective function. Depending on the **amount
of data**, we make a **trade-off between the accuracy of the parameter update and
the time it takes to perform an update**.

Batch gradient descent
~~~~~~~~~~~~~~~~~~~~~~

Batch gradient descent, also known as vanilla gradient descent, computes the
gradient of the cost function with respect to the parameters :math:`\theta` for the
**entire training dataset**:

.. math::

   \theta = \theta - \eta \cdot \nabla_\theta J( \theta)

As we need to calculate the gradients for the **whole dataset** to **perform just
one update**, **batch** gradient descent can be **very slow** and **is intractable
for datasets that don't fit in memory**. Batch gradient descent also **doesn't allow
us to update our model online**.
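As a minimal illustration of batch gradient descent for the linear regression
problem above, here is a vectorized sketch assuming NumPy and a small synthetic
dataset; the data, learning rate and iteration count are illustrative choices.

.. code:: python

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=100)
    Y = 3.0 * X + 2.0 + rng.normal(scale=0.1, size=100)  # true beta_1 = 3, beta_0 = 2

    beta0, beta1 = 1.0, 1.0
    eta = 0.1                                    # learning rate
    for _ in range(500):
        error = (beta1 * X + beta0) - Y          # predictions minus targets
        grad_beta0 = error.mean()                # (1/N) * sum(error_i)
        grad_beta1 = (X * error).mean()          # (1/N) * sum(x_i * error_i)
        beta0 -= eta * grad_beta0                # one update uses the whole dataset
        beta1 -= eta * grad_beta1

    print(beta0, beta1)                          # should approach 2.0 and 3.0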
Stochastic gradient descent
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Stochastic gradient descent (SGD), in contrast, performs a parameter update for each
training example :math:`x^{(i)}` and label :math:`y^{(i)}`:

- Choose an initial vector of parameters :math:`\theta` and learning rate
  :math:`\eta`.
- Repeat until an approximate minimum is obtained:

  - Randomly shuffle the examples in the training set.
  - For :math:`i \in 1, \dots, n`:

    - :math:`\theta = \theta - \eta \cdot \nabla_\theta J( \theta; x^{(i)}; y^{(i)})`

**Batch gradient descent** performs **redundant computations for large datasets**,
as it **recomputes** gradients for **similar examples** before each **parameter
update**. SGD does away with **this redundancy by performing one update at a time**.
It is therefore usually **much faster and can also be used to learn online**.
**SGD** performs **frequent updates with a high variance that cause the objective
function to fluctuate heavily, as in the image below**.
.. figure:: ./images/SGD_fluctuation.PNG
   :alt: Wikipedia

   Wikipedia

SGD fluctuation.
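A minimal sketch of the SGD loop described above, assuming a hypothetical
per-example gradient function ``grad_J(theta, x_i, y_i)`` and a list ``data`` of
``(x, y)`` pairs:

.. code:: python

    import random

    def sgd(theta, data, grad_J, eta=0.01, n_epochs=10):
        """Stochastic gradient descent: one parameter update per training example."""
        for _ in range(n_epochs):
            random.shuffle(data)  # randomly shuffle the training set
            for x_i, y_i in data:
                # update on a single example, not the whole dataset
                theta = theta - eta * grad_J(theta, x_i, y_i)
        return theta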
While batch gradient descent **converges to the minimum of the basin** the
parameters are placed in, SGD's fluctuation, on the one hand, **enables it to jump
to new and potentially better local minima**. On the other hand, **this ultimately
complicates convergence to the exact minimum**, as **SGD will keep overshooting**.
However, it **has been shown** that when we **slowly decrease the learning rate**,
SGD shows the same **convergence behaviour as batch gradient descent**, almost
certainly **converging to a local or the global minimum for non-convex and convex
optimization respectively**.

Mini-batch gradient descent
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Mini-batch gradient descent finally takes the best of both worlds and performs an
update for every mini-batch of :math:`n` training examples:

.. math::

   \theta = \theta - \eta \cdot \nabla_\theta J( \theta; x^{(i:i+n)}; y^{(i:i+n)})

This way, it:

- **reduces the variance of the parameter updates**, which can lead to **more stable
  convergence**;
- can make use of **highly optimized matrix operations** common to state-of-the-art
  deep learning libraries, which make computing the gradient over a mini-batch very
  efficient.

**Common mini-batch sizes range between 50 and 256**, but can vary for different
applications. **Mini-batch** gradient descent is **typically the algorithm of choice
when training a neural network**.

Gradient descent challenges
---------------------------

Vanilla mini-batch gradient descent, however, does not guarantee good convergence
and poses a few challenges that need to be addressed:

- Choosing a proper **learning rate** can be difficult. A learning rate that is
  **too small** leads to **painfully slow convergence**, while a learning rate that
  is **too large** can hinder convergence and cause the loss function to
  **fluctuate** around the minimum or even to **diverge**.
- **Learning rate schedules** try to **adjust the learning rate during training**,
  e.g. by annealing, i.e. reducing the learning rate according to a pre-defined
  schedule or when the change in objective between epochs falls below a threshold.
  These schedules and thresholds, however, have to be defined in advance and are
  thus unable to adapt to a dataset's characteristics.
- Additionally, the same learning rate applies to all parameter updates. **If our
  data is sparse and our features have very different frequencies**, we might not
  want to update all of them to the same extent, but perform **a larger update for
  rarely occurring features**.
- Another key challenge of **minimizing the highly non-convex error** functions
  common for neural networks is **avoiding** getting **trapped in their numerous
  suboptimal local minima and saddle points**. Saddle points are usually surrounded
  by a plateau of the same error, which makes it **notoriously hard for SGD to
  escape**, as the gradient is close to zero in all dimensions.

Gradient descent optimization algorithms
----------------------------------------

In the following, we will outline some **algorithms** that are **widely** used by
the **deep learning community** to deal with the aforementioned **challenges**.

Momentum
~~~~~~~~

**SGD** has trouble **navigating ravines (areas where the surface curves much more
steeply in one dimension than in another)**, which are common **around local
optima**. In these scenarios, **SGD oscillates across the slopes of the ravine while
only making hesitant progress** along the bottom towards the local optimum, as in
the image below.
.. figure:: ./images/sgd_momentum.png
   :alt: Wikipedia

   Wikipedia

SGD and momentum.
`Source `__

.. figure:: ./images/grad_descent_no_momentum.png
   :alt: No momentum: oscillations toward local largest gradient

   No momentum: oscillations toward local largest gradient

No momentum: moving toward the locally largest gradient creates oscillations.
.. figure:: ./images/grad_descent_momentum.png
   :alt: With momentum: accumulate velocity to avoid oscillations

   With momentum: accumulate velocity to avoid oscillations

With momentum: accumulate velocity to avoid oscillations.
**Momentum** is a method that helps **accelerate SGD** in the **relevant direction
and dampens oscillations**, as can be seen in the images above. It does this by
**adding a fraction :math:`\rho` of the update vector of the past time step to the
current update vector**:

.. math::

   \begin{align}
   \begin{split}
   v_t &= \rho v_{t-1} + \nabla_\theta J( \theta) \\
   \theta &= \theta - \eta v_t
   \end{split}
   \end{align}

.. code:: python

    vx = 0
    while True:
        dx = gradient(J, x)
        vx = rho * vx + dx           # accumulate a running "velocity"
        x -= learning_rate * vx      # step along the velocity, not the raw gradient

**Note**: The momentum term :math:`\rho` is usually set to 0.9 or a similar value.

Essentially, when using momentum, we push **a ball down a hill**. The **ball
accumulates momentum as it rolls downhill**, becoming faster and faster on the way
**(until it reaches its terminal velocity if there is air resistance, i.e.
:math:`\rho < 1`)**. The same thing happens to our parameter updates: **the momentum
term increases for dimensions whose gradients point in the same directions and
reduces updates for dimensions whose gradients change directions**. As a result, we
**gain faster convergence** and **reduced oscillation**.

AdaGrad: adaptive learning rates
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- Adds element-wise scaling of the gradient based on the historical sum of squares
  in each dimension.
- "Per-parameter learning rates" or "adaptive learning rates".

.. code:: python

    grad_squared = 0
    while True:
        dx = gradient(J, x)
        grad_squared += dx * dx      # accumulate the squared gradient per dimension
        x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)

- Progress along "steep" directions is damped.
- Progress along "flat" directions is accelerated.
- Problem: the accumulated sum keeps growing, so the effective step size decays to
  zero over long training runs.

RMSProp: "Leaky AdaGrad"
~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: python

    grad_squared = 0
    while True:
        dx = gradient(J, x)
        # exponentially decaying ("leaky") average of squared gradients
        grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dx * dx
        x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)

- ``decay_rate = 1``: gradient descent
- ``decay_rate = 0``: AdaGrad

Nesterov accelerated gradient
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

However, a ball that rolls down a hill, blindly following the slope, is highly
**unsatisfactory**. We'd like to have a smarter ball, a ball that has **a notion of
where it is going** so that it **knows to slow down before the hill slopes up
again**. Nesterov accelerated gradient (NAG) is a way to give **our momentum term
this kind of prescience**. We know that we will use our momentum term
:math:`\gamma v_{t-1}` to move the parameters :math:`\theta`. Computing
:math:`\theta - \gamma v_{t-1}` thus gives us **an approximation of the next
position of the parameters** (the gradient is missing for the full update), a rough
idea where our parameters are going to be. We can now effectively look ahead by
calculating the gradient not w.r.t. our current parameters :math:`\theta` but w.r.t.
the approximate future position of our parameters:

.. math::

   \begin{align}
   \begin{split}
   v_t &= \gamma v_{t-1} + \eta \nabla_\theta J( \theta - \gamma v_{t-1} ) \\
   \theta &= \theta - v_t
   \end{split}
   \end{align}

Again, we set the momentum term :math:`\gamma` to a value of around 0.9. While
**momentum first computes the current gradient and then takes a big jump in the
direction of the updated accumulated gradient**, **NAG** first **makes a big jump in
the direction of the previous accumulated gradient, measures the gradient and then
makes a correction, which results in the complete NAG update**.
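A minimal sketch of this look-ahead update, in the same style as the momentum
snippet above (``gradient``, ``J``, ``x``, ``gamma`` and ``learning_rate`` are
assumed to be defined as before, with ``gamma`` around 0.9):

.. code:: python

    v = 0
    while True:
        # look ahead: gradient at the approximate future position of the parameters
        dx_ahead = gradient(J, x - gamma * v)
        v = gamma * v + learning_rate * dx_ahead
        x -= v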
This anticipatory update **prevents** us from **going too fast** and results in
**increased responsiveness**, which has significantly **increased the performance of
RNNs** on a number of tasks.

Adam
~~~~

**Adaptive Moment Estimation (Adam)** is a method that computes **adaptive learning
rates** for each parameter. In addition to storing an **exponentially decaying
average of past squared gradients :math:`v_t`**, Adam also keeps an **exponentially
decaying average of past gradients :math:`m_t`, similar to momentum**. Whereas
momentum can be seen as a ball running down a slope, Adam behaves like a **heavy
ball with friction**, which thus prefers **flat minima in the error surface**. We
compute the decaying averages of past and past squared gradients :math:`m_t` and
:math:`v_t` respectively as follows:

.. math::

   \begin{align}
   \begin{split}
   m_t &= \beta_1 m_{t-1} + (1 - \beta_1) \nabla_\theta J( \theta) \\
   v_t &= \beta_2 v_{t-1} + (1 - \beta_2) \nabla_\theta J( \theta)^2
   \end{split}
   \end{align}

:math:`m_{t}` and :math:`v_{t}` are estimates of the first moment (the mean) and the
second moment (the uncentered variance) of the gradients respectively, hence the
name of the method.

Adam (almost):

.. code:: python

    first_moment = 0
    second_moment = 0
    while True:
        dx = gradient(J, x)
        # Momentum: decaying average of the gradients
        first_moment = beta1 * first_moment + (1 - beta1) * dx
        # AdaGrad/RMSProp: decaying average of the squared gradients
        second_moment = beta2 * second_moment + (1 - beta2) * dx * dx
        x -= learning_rate * first_moment / (np.sqrt(second_moment) + 1e-7)

As :math:`m_{t}` and :math:`v_{t}` are initialized as vectors of 0's, the authors of
Adam observe that they are biased towards zero, especially during the initial time
steps, and especially when the decay rates are small (i.e. :math:`\beta_{1}` and
:math:`\beta_{2}` are close to 1). They counteract these biases by computing
bias-corrected first and second moment estimates:

.. math::

   \begin{align}
   \hat{m}_t &= \dfrac{m_t}{1 - \beta^t_1} \\
   \hat{v}_t &= \dfrac{v_t}{1 - \beta^t_2}
   \end{align}

They then use these to update the parameters (Adam update rule):

.. math::

   \theta_{t+1} = \theta_{t} - \dfrac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t

- :math:`\hat{m}_t`: accumulated gradient (velocity).
- :math:`\hat{v}_t`: element-wise scaling of the gradient based on the historical
  sum of squares in each dimension.
- Choose Adam as the default optimizer.
- Default values are 0.9 for :math:`\beta_1`, 0.999 for :math:`\beta_2`, and
  :math:`10^{-7}` for :math:`\epsilon`.
- Use a learning rate in the range between :math:`5e-4` and :math:`1e-3`.
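Putting the bias correction into the loop gives a sketch of the full Adam update,
following the formulas above (``gradient``, ``J``, ``x``, ``beta1``, ``beta2`` and
``learning_rate`` are assumed to be defined as in the previous snippets):

.. code:: python

    first_moment = 0
    second_moment = 0
    t = 0
    while True:
        t += 1
        dx = gradient(J, x)
        first_moment = beta1 * first_moment + (1 - beta1) * dx
        second_moment = beta2 * second_moment + (1 - beta2) * dx * dx
        # bias-corrected estimates matter most during the first iterations
        first_unbias = first_moment / (1 - beta1 ** t)
        second_unbias = second_moment / (1 - beta2 ** t)
        x -= learning_rate * first_unbias / (np.sqrt(second_unbias) + 1e-7)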