

Calculus for Machine Learning and Data Science (9)


Optimization in Neural Networks and Newton’s Method

Optimization in Neural Networks

Regression with a perceptron


A perceptron can be seen as linear regression: the inputs are multiplied by weights, we output a prediction using the formula wx + b, and we optimize the weights (w) and the bias (b).

We can think of a perceptron as a single node that performs this computation.
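As a minimal sketch (the helper name `predict` and the NumPy usage are my own, not the course's), the forward pass is just a weighted sum plus a bias:

```python
import numpy as np

# Forward pass of a regression perceptron: a weighted sum of the inputs plus a bias.
def predict(x, w, b):
    return np.dot(w, x) + b

x = np.array([2.0, 3.0])     # two input features
w = np.array([0.5, -1.0])    # one weight per input
b = 0.1                      # bias
y_hat = predict(x, w, b)     # 0.5*2.0 + (-1.0)*3.0 + 0.1 = -1.9
```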

Regression with a perceptron - Loss function


A reason we multiply the squared error by 1/2 is that taking the derivative of the squared error brings down a lingering factor of 2; the 1/2 cancels it out when we compute the derivative.
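Written out (with $y$ the label and $\hat y$ the prediction), the cancellation is:

$$L(y, \hat y) = \frac{1}{2}(y - \hat y)^2 \quad\Rightarrow\quad \frac{\partial L}{\partial \hat y} = \frac{1}{2}\cdot 2\,(y - \hat y)\cdot(-1) = \hat y - y$$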

We minimize the error/loss using the gradient descent method.

Regression with a perceptron - Gradient Descent

Performing gradient descent with a perceptron

To get the partial derivatives, we apply the chain rule and then use them to update the variables, as in the sketch below.
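Here is a minimal NumPy sketch of one such update, assuming the squared loss from the previous section; the name `gradient_step` and the learning rate value are illustrative:

```python
import numpy as np

def gradient_step(x, y, w, b, lr=0.01):
    """One gradient descent update for a regression perceptron with squared loss.

    Chain rule: dL/dw = dL/dy_hat * dy_hat/dw = (y_hat - y) * x
                dL/db = dL/dy_hat * dy_hat/db = (y_hat - y)
    """
    y_hat = np.dot(w, x) + b           # forward pass
    dw = (y_hat - y) * x               # partial derivative w.r.t. the weights
    db = y_hat - y                     # partial derivative w.r.t. the bias
    return w - lr * dw, b - lr * db    # move against the gradient

w, b = np.array([0.5, -1.0]), 0.1
for _ in range(100):
    w, b = gradient_step(np.array([2.0, 3.0]), 1.0, w, b, lr=0.05)
```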

Classification with Perceptron

For classification, we add an activation function on top of the linear regression formula.

Steps of forward pass (input to output)


To transform the continuous number (the output of the linear part), we apply an activation function (here, a sigmoid), which squashes it into the range between 0 and 1 and makes the model non-linear.
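A minimal sketch of that forward pass (the names `sigmoid` and `predict_proba` are illustrative, not from the course):

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into the interval (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(x, w, b):
    z = np.dot(w, x) + b   # same linear formula as in regression
    return sigmoid(z)      # probability between 0 and 1
```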

Classification with Perceptron - The sigmoid function

The horizontal axis is the input and the vertical axis is the output.


On the last slide, we add and subtract 1 in the numerator to split the fraction, cancel the common terms, and factor the result to simplify it.
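Concretely, with $\sigma(z) = \frac{1}{1+e^{-z}}$, the add-and-subtract-1 trick looks like this:

$$\sigma'(z) = \frac{e^{-z}}{(1+e^{-z})^2} = \frac{(1+e^{-z}) - 1}{(1+e^{-z})^2} = \frac{1}{1+e^{-z}} - \frac{1}{(1+e^{-z})^2} = \sigma(z)\bigl(1 - \sigma(z)\bigr)$$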

Classification with Perceptron - Gradient Descent

We could reuse the squared loss from the regression problem, but for classification we use the log loss function instead.


We use the log loss because the math works out nicely and because it has a probabilistic interpretation: the output of a classification problem is a probability.
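For a label $y \in \{0, 1\}$ and a predicted probability $\hat y$, the log loss is:

$$L(y, \hat y) = -y \ln(\hat y) - (1 - y)\ln(1 - \hat y)$$

When $y = 1$ only the first term remains, penalizing predictions close to 0; when $y = 0$ only the second term remains, penalizing predictions close to 1.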

Classification with Perceptron - Calculating the derivatives


Recall that the derivative of the sigmoid is the sigmoid times one minus the sigmoid; since $\hat y$ is the sigmoid of the linear part, that factor becomes $\hat y(1 - \hat y)$, multiplied by the derivative of the inner part.
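Putting the chain rule pieces together (with $z = wx + b$ and $\hat y = \sigma(z)$), the terms cancel and the derivatives simplify:

$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial \hat y}\cdot\frac{\partial \hat y}{\partial z}\cdot\frac{\partial z}{\partial w} = \left(-\frac{y}{\hat y} + \frac{1-y}{1-\hat y}\right)\hat y(1-\hat y)\,x = (\hat y - y)\,x$$

$$\frac{\partial L}{\partial b} = \hat y - y$$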

Classification with a Neural Network

Forward pass of the 2-layer neural network

A neural network consists of multiple perceptrons.
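A minimal NumPy sketch of such a forward pass, assuming sigmoid activations in both layers and illustrative layer sizes (2 inputs, 3 hidden units, 1 output):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Forward pass of a 2-layer network: each layer is a group of perceptrons."""
    a1 = sigmoid(W1 @ x + b1)    # hidden layer activations (layer [1])
    a2 = sigmoid(W2 @ a1 + b2)   # output layer: a single probability (layer [2])
    return a1, a2

# Example: 2 inputs -> 3 hidden units -> 1 output
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)
a1, y_hat = forward(np.array([0.5, -1.2]), W1, b1, W2, b2)
```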

Classification with a Neural Network - Minimizing log-loss

Taking the partial derivatives with respect to the weights and biases

We update the weights and biases to minimize the loss function (the log loss), which is done by performing gradient descent.
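Here is a minimal sketch of one such update for the 2-layer network above, with log loss and sigmoid activations; the names and the learning rate are illustrative, not the course's code:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(x, y, W1, b1, W2, b2, lr=0.1):
    """One backpropagation + gradient descent update with log loss."""
    # Forward pass (same as the sketch above).
    a1 = sigmoid(W1 @ x + b1)
    a2 = sigmoid(W2 @ a1 + b2)

    # Output layer: with sigmoid + log loss, the error term is simply (y_hat - y).
    dz2 = a2 - y
    dW2, db2 = np.outer(dz2, a1), dz2

    # Hidden layer: propagate the error backwards through W2 and the sigmoid.
    dz1 = (W2.T @ dz2) * a1 * (1 - a1)
    dW1, db1 = np.outer(dz1, x), dz1

    # Gradient descent: step each parameter against its partial derivative.
    return W1 - lr * dW1, b1 - lr * db1, W2 - lr * dW2, b2 - lr * db2
```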

Gradient Descent and Backpropagation

Superscript numbers in square brackets denote the layer number.

Backpropagation: the process of updating the weights and biases based on the computed loss, using partial derivatives (the chain rule) and gradient descent.

 

All the information provided is based on the Calculus for Machine Learning and Data Science course on Coursera, from DeepLearning.AI.
