Negative Log-Likelihood (NLL) for Binary Classification with Sigmoid Activation
--------------------------------------------------------------------------------

.. _`ref:demonstration-nll`:

Demonstration of Negative Log-Likelihood (NLL)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Setup**

- Inputs: :math:`\{(x_i, y_i)\}_{i=1}^n`, with :math:`y_i \in \{0, 1\}`
- Model:

  .. math::

     \hat{p}_i = \sigma(\mathbf{w}^\top \mathbf{x}_i) = \frac{1}{1 + e^{-\mathbf{w}^\top \mathbf{x}_i}}

- Objective:

  .. math::

     \mathcal{L}_{\text{NLL}} = - \sum_{i=1}^n \log P(y_i \mid \mathbf{x}_i; \mathbf{w})

Since :math:`y_i \in \{0, 1\}`, we model the likelihood as:

.. math::

   P(y_i \mid \mathbf{x}_i; \mathbf{w}) = \hat{p}_i^{y_i} (1 - \hat{p}_i)^{1 - y_i}

**Step-by-step Expansion**

.. math::

   \mathcal{L}_{\text{NLL}} = - \sum_{i=1}^n \log \left( \hat{p}_i^{y_i} (1 - \hat{p}_i)^{1 - y_i} \right)

Apply log properties:

.. math::

   = - \sum_{i=1}^n \left[ y_i \log(\hat{p}_i) + (1 - y_i) \log(1 - \hat{p}_i) \right]

Now substitute :math:`\hat{p}_i = \sigma(z_i) = \frac{1}{1 + e^{-z_i}}`, where :math:`z_i = \mathbf{w}^\top \mathbf{x}_i`, and use:

.. math::

   \log(\sigma(z)) = -\log(1 + e^{-z}), \quad \log(1 - \sigma(z)) = -z - \log(1 + e^{-z})

So the per-example loss becomes:

.. math::

   \ell_i(\mathbf{w}) = - \left[ y_i \log \sigma(z_i) + (1 - y_i) \log (1 - \sigma(z_i)) \right]

.. math::

   = - \left[ y_i (-\log(1 + e^{-z_i})) + (1 - y_i)(-z_i - \log(1 + e^{-z_i})) \right]

Simplify:

.. math::

   \ell_i(\mathbf{w}) = \log(1 + e^{-z_i}) + (1 - y_i) z_i

Summing over all :math:`n` examples gives the total loss.

**Final Simplified Expression**

.. math::

   \mathcal{L}_{\text{NLL}}(\mathbf{w}) = \sum_{i=1}^n \left[ \log(1 + e^{-z_i}) + (1 - y_i) z_i \right] \quad \text{with } y_i \in \{0, 1\},

which, after relabeling the targets as :math:`y_i \in \{-1, +1\}`, simplifies to

.. math::

   \mathcal{L}_{\text{NLL}}(\mathbf{w}) = \sum_{i=1}^n \log\left(1 + e^{-y_i z_i} \right) \quad \text{with } y_i \in \{-1, +1\}.

This final form is particularly compact and is often used in optimization routines.
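
As a quick numerical illustration of the two equivalent forms above, here is a minimal NumPy sketch (the helper names ``nll_01`` and ``nll_pm1`` are ours, not from any library) that evaluates the loss with labels in :math:`\{0, 1\}` and in :math:`\{-1, +1\}` and checks that the values agree. ``np.logaddexp(0, -z)`` is used as a numerically stable way to compute :math:`\log(1 + e^{-z})`.

.. code-block:: python

   import numpy as np

   def nll_01(w, X, y01):
       """NLL with labels in {0, 1}: sum_i [log(1 + e^{-z_i}) + (1 - y_i) z_i]."""
       z = X @ w
       # np.logaddexp(0, -z) computes log(1 + exp(-z)) without overflow
       return np.sum(np.logaddexp(0.0, -z) + (1.0 - y01) * z)

   def nll_pm1(w, X, ypm1):
       """Equivalent NLL with labels in {-1, +1}: sum_i log(1 + e^{-y_i z_i})."""
       z = X @ w
       return np.sum(np.logaddexp(0.0, -ypm1 * z))

   # Toy check that the two formulations coincide
   rng = np.random.default_rng(0)
   X = rng.normal(size=(5, 3))
   w = rng.normal(size=3)
   y01 = rng.integers(0, 2, size=5).astype(float)
   ypm1 = 2.0 * y01 - 1.0          # map {0, 1} -> {-1, +1}
   assert np.isclose(nll_01(w, X, y01), nll_pm1(w, X, ypm1))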
Gradient of Negative Log-Likelihood (NLL)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Recap: The Model and Loss**

We have:

- Input–label pairs: :math:`\{(x_i, y_i)\}_{i=1}^n`, where :math:`y_i \in \{0, 1\}`
- Linear logit: :math:`z_i = \mathbf{w}^\top \mathbf{x}_i`
- Sigmoid output:

  .. math::

     \hat{p}_i = \sigma(z_i) = \frac{1}{1 + e^{-z_i}}

- NLL loss:

  .. math::

     \mathcal{L}(\mathbf{w}) = -\sum_{i=1}^n \left[ y_i \log(\hat{p}_i) + (1 - y_i) \log(1 - \hat{p}_i) \right]

We aim to compute the gradient :math:`\nabla_{\mathbf{w}} \mathcal{L}(\mathbf{w})`.

**Step 1: Loss per Sample**

Define the per-sample loss:

.. math::

   \ell_i(\mathbf{w}) = -\left[ y_i \log(\hat{p}_i) + (1 - y_i) \log(1 - \hat{p}_i) \right]

Take the derivative w.r.t. :math:`\mathbf{w}` using the chain rule:

.. math::

   \nabla_{\mathbf{w}} \ell_i = \frac{d\ell_i}{d\hat{p}_i} \cdot \frac{d\hat{p}_i}{dz_i} \cdot \frac{dz_i}{d\mathbf{w}}

**Step 2: Compute Gradients**

- Derivative of the loss w.r.t. :math:`\hat{p}_i`:

  .. math::

     \frac{d\ell_i}{d\hat{p}_i} = -\left( \frac{y_i}{\hat{p}_i} - \frac{1 - y_i}{1 - \hat{p}_i} \right)

- Derivative of the sigmoid:

  .. math::

     \frac{d\hat{p}_i}{dz_i} = \hat{p}_i(1 - \hat{p}_i)

- Derivative of :math:`z_i = \mathbf{w}^\top \mathbf{x}_i`:

  .. math::

     \frac{dz_i}{d\mathbf{w}} = \mathbf{x}_i

Putting it all together, the first two factors simplify to

.. math::

   -\left( \frac{y_i}{\hat{p}_i} - \frac{1 - y_i}{1 - \hat{p}_i} \right) \hat{p}_i (1 - \hat{p}_i)
   = -\left[ y_i (1 - \hat{p}_i) - (1 - y_i) \hat{p}_i \right]
   = \hat{p}_i - y_i,

so that

.. math::

   \nabla_{\mathbf{w}} \ell_i = \left( \hat{p}_i - y_i \right) \mathbf{x}_i

**Step 3: Final Gradient over Dataset**

Sum over all samples:

.. math::

   \nabla_{\mathbf{w}} \mathcal{L}(\mathbf{w}) = \sum_{i=1}^n (\hat{p}_i - y_i) \mathbf{x}_i

Or in matrix form, if :math:`\mathbf{X} \in \mathbb{R}^{n \times p}`, :math:`\hat{\mathbf{p}} \in \mathbb{R}^n`, and :math:`\mathbf{y} \in \mathbb{R}^n`:

.. math::

   \nabla_{\mathbf{w}} \mathcal{L}(\mathbf{w}) = \mathbf{X}^\top (\hat{\mathbf{p}} - \mathbf{y})

**Summary**

- Gradient of the binary NLL with sigmoid:

  .. math::

     \boxed{\nabla_{\mathbf{w}} \mathcal{L} = \sum_{i=1}^n (\sigma(\mathbf{w}^\top \mathbf{x}_i) - y_i)\mathbf{x}_i}

- In matrix form:

  .. math::

     \boxed{\nabla_{\mathbf{w}} \mathcal{L} = \mathbf{X}^\top (\hat{\mathbf{p}} - \mathbf{y})}

This form is used in logistic regression and in binary classifiers trained via gradient descent.

Hessian matrix (i.e., the matrix of second derivatives) for Negative Log-Likelihood (NLL)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Recap: Setup**

We are given:

- Dataset: :math:`\{(x_i, y_i)\}_{i=1}^n`, with :math:`y_i \in \{0, 1\}`
- Model:

  .. math::

     \hat{p}_i = \sigma(z_i) = \frac{1}{1 + e^{-z_i}}, \quad \text{where } z_i = \mathbf{w}^\top \mathbf{x}_i

- Loss function:

  .. math::

     \mathcal{L}(\mathbf{w}) = -\sum_{i=1}^n \left[ y_i \log(\hat{p}_i) + (1 - y_i) \log(1 - \hat{p}_i) \right]

We already know the gradient is:

.. math::

   \nabla_{\mathbf{w}} \mathcal{L} = \sum_{i=1}^n (\hat{p}_i - y_i) \mathbf{x}_i

**Goal: Hessian** :math:`\nabla^2_{\mathbf{w}} \mathcal{L}`

We now compute the second derivative of :math:`\mathcal{L}`, i.e., the **Hessian matrix** :math:`\mathbf{H} \in \mathbb{R}^{p \times p}`, where each entry is:

.. math::

   \mathbf{H}_{jk} = \frac{\partial^2 \mathcal{L}}{\partial w_j \partial w_k}

**Step-by-Step Derivation**

Recall:

.. math::

   \nabla_{\mathbf{w}} \mathcal{L} = \sum_{i=1}^n (\hat{p}_i - y_i) \mathbf{x}_i

Note that :math:`\hat{p}_i = \sigma(z_i) = \sigma(\mathbf{w}^\top \mathbf{x}_i)`, so :math:`\hat{p}_i` itself depends on :math:`\mathbf{w}`. We differentiate the gradient:

.. math::

   \nabla^2_{\mathbf{w}} \mathcal{L} = \sum_{i=1}^n \nabla_{\mathbf{w}} \left[ (\hat{p}_i - y_i) \mathbf{x}_i \right]

The only term depending on :math:`\mathbf{w}` is :math:`\hat{p}_i`. We apply the chain rule:

.. math::

   \nabla_{\mathbf{w}} \hat{p}_i = \sigma'(z_i) \cdot \nabla_{\mathbf{w}} z_i = \hat{p}_i (1 - \hat{p}_i) \mathbf{x}_i

So the outer derivative becomes:

.. math::

   \nabla_{\mathbf{w}} \left[ (\hat{p}_i - y_i) \mathbf{x}_i \right] = \hat{p}_i (1 - \hat{p}_i) \mathbf{x}_i \mathbf{x}_i^\top

Hence:

.. math::

   \boxed{ \nabla^2_{\mathbf{w}} \mathcal{L} = \sum_{i=1}^n \hat{p}_i (1 - \hat{p}_i) \mathbf{x}_i \mathbf{x}_i^\top }

This is a **weighted sum of outer products** of the input vectors.

**Matrix Form**

Let:

- :math:`\mathbf{X} \in \mathbb{R}^{n \times p}`: input matrix (rows = :math:`x_i^\top`)
- :math:`\hat{\mathbf{p}} \in \mathbb{R}^n`: predicted probabilities
- Define :math:`\mathbf{S} = \text{diag}(\hat{p}_i (1 - \hat{p}_i)) \in \mathbb{R}^{n \times n}`

Then the Hessian is:

.. math::

   \boxed{ \nabla^2_{\mathbf{w}} \mathcal{L} = \mathbf{X}^\top \mathbf{S} \mathbf{X} }

**Summary**

- The Hessian of the NLL loss with sigmoid output is:

  .. math::

     \nabla^2_{\mathbf{w}} \mathcal{L} = \sum_{i=1}^n \hat{p}_i (1 - \hat{p}_i) \mathbf{x}_i \mathbf{x}_i^\top

- In matrix form:

  .. math::

     \nabla^2_{\mathbf{w}} \mathcal{L} = \mathbf{X}^\top \mathbf{S} \mathbf{X}

- The Hessian is **positive semi-definite**, hence the NLL is convex for logistic regression.
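
To tie the gradient and Hessian formulas together, here is a small sketch (assuming only NumPy; the function names are illustrative, not from any library) that evaluates :math:`\mathbf{X}^\top (\hat{\mathbf{p}} - \mathbf{y})` and :math:`\mathbf{X}^\top \mathbf{S} \mathbf{X}`, uses them for a single Newton step on synthetic data, and checks that the Hessian is positive semi-definite.

.. code-block:: python

   import numpy as np

   def sigmoid(z):
       # Fine for a sketch; production code would guard against overflow for large |z|
       return 1.0 / (1.0 + np.exp(-z))

   def nll_gradient(w, X, y):
       """Gradient: X^T (p_hat - y)."""
       p_hat = sigmoid(X @ w)
       return X.T @ (p_hat - y)

   def nll_hessian(w, X):
       """Hessian: X^T S X with S = diag(p_hat * (1 - p_hat))."""
       p_hat = sigmoid(X @ w)
       s = p_hat * (1.0 - p_hat)           # diagonal of S
       return X.T @ (s[:, None] * X)       # equivalent to X.T @ np.diag(s) @ X

   # Toy data and one Newton step: w <- w - H^{-1} g
   rng = np.random.default_rng(0)
   X = rng.normal(size=(100, 3))
   y = (X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=100) > 0).astype(float)

   w = np.zeros(3)
   g = nll_gradient(w, X, y)
   H = nll_hessian(w, X)
   w_new = w - np.linalg.solve(H, g)

   # PSD check: all eigenvalues of the symmetric Hessian are >= 0 (up to round-off)
   assert np.all(np.linalg.eigvalsh(H) >= -1e-10)

Iterating this Newton update is the classical second-order approach to fitting logistic regression, as an alternative to the plain gradient descent mentioned above.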