Derivation of the Binary Cross Entropy Loss Gradient
Binary cross entropy is the standard loss function for binary classification tasks, and its gradient drives the estimation of the model's parameters through gradient descent. To apply gradient descent we must compute the derivative (gradient) of the loss function w.r.t. the model's parameters. Deriving this gradient by hand is often the most tedious part of training a machine learning model.
In this article we will derive, step by step, the gradient of the binary cross entropy loss function w.r.t. the parameter vector W.
The binary cross entropy loss is given by
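For a single training example, the loss takes the standard form:

```latex
L(W) = -\left[\, y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \,\right]
```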
Here y ∈ {0, 1} is the observed class, y_hat the predicted probability, and W the model's parameter vector. Predictions are given by:
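The prediction is the sigmoid of the linear score z:

```latex
\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}
```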
z is equal to:
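Assuming a linear model with feature vector x (with any bias term folded into W), z is the dot product of the parameters and the inputs:

```latex
z = W^{T} x
```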
To calculate the gradient of L(W) w.r.t. W we will use the chain rule:
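The chain rule decomposes the gradient into three factors:

```latex
\frac{\partial L}{\partial W}
= \frac{\partial L}{\partial \hat{y}}
\cdot \frac{\partial \hat{y}}{\partial z}
\cdot \frac{\partial z}{\partial W}
```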
Let's derive the first term:
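Differentiating the loss w.r.t. y_hat, term by term:

```latex
\frac{\partial L}{\partial \hat{y}}
= -\frac{\partial}{\partial \hat{y}} \left[\, y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \,\right]
= -\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}}
```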
The second term is a little more complicated:
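Differentiating the sigmoid and rearranging yields its well-known compact form:

```latex
\frac{\partial \hat{y}}{\partial z}
= \frac{\partial}{\partial z} \left( \frac{1}{1 + e^{-z}} \right)
= \frac{e^{-z}}{\left(1 + e^{-z}\right)^{2}}
= \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}}
= \hat{y}\,(1 - \hat{y})
```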
That completes the second term. The derivative of the third term is straightforward:
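Since z is linear in W, its derivative is just the input vector:

```latex
\frac{\partial z}{\partial W} = x
```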
Now let's put everything together:
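Multiplying the three factors and simplifying, the y_hat and (1 - y_hat) denominators cancel:

```latex
\frac{\partial L}{\partial W}
= \left( -\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}} \right) \hat{y}\,(1 - \hat{y})\, x
= \bigl( -y\,(1 - \hat{y}) + (1 - y)\,\hat{y} \bigr)\, x
= (\hat{y} - y)\, x
```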
And there you have it: the derivative of the binary cross entropy loss function w.r.t. the model's parameters.
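As a sanity check, the closed-form gradient (y_hat - y) x can be compared against a finite-difference approximation of the loss. A minimal sketch in NumPy, using arbitrary example values for W, x, and y:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(W, x, y):
    # Binary cross entropy for a single example
    y_hat = sigmoid(W @ x)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def bce_grad(W, x, y):
    # Closed-form gradient derived above: (y_hat - y) * x
    y_hat = sigmoid(W @ x)
    return (y_hat - y) * x

# Arbitrary example: three features, positive class
W = np.array([0.5, -1.2, 0.3])
x = np.array([1.0, 2.0, -0.5])
y = 1.0

# Central finite-difference approximation, one coordinate at a time
eps = 1e-6
num_grad = np.array([
    (bce_loss(W + eps * e, x, y) - bce_loss(W - eps * e, x, y)) / (2 * eps)
    for e in np.eye(3)
])

print(np.allclose(bce_grad(W, x, y), num_grad, atol=1e-6))
```

If the derivation is correct, the two gradients agree to within numerical precision and the script prints True.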