As with other machine learning models, to apply gradient-based learning we must choose a cost function, and we must choose how to represent the output of the model. The largest difference between the simple machine learning models seen so far and neural networks is that the nonlinearity of a neural network causes most interesting loss functions to become non-convex. This means that neural networks are usually trained with iterative, gradient-based optimizers that merely drive the cost function to a very low value, rather than the exact linear equation solvers used to train linear regression models or the convex optimization algorithms used for logistic regression or SVMs.
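To make the contrast concrete, the following is a minimal NumPy sketch on hypothetical synthetic data: a linear regression model is fit exactly with the normal equations, while a tiny one-hidden-layer network, whose loss is non-convex, is trained by an iterative gradient descent loop that only drives the cost to a low value. The data, learning rate, and network size are illustrative choices, not prescriptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 1))                          # hypothetical inputs
y = np.sin(3 * x) + 0.1 * rng.normal(size=(100, 1))    # hypothetical targets

# Linear regression: convex squared-error loss, solved exactly by the normal equations.
X = np.hstack([x, np.ones_like(x)])
w_exact = np.linalg.solve(X.T @ X, X.T @ y)

# Tiny one-hidden-layer network: non-convex loss, so we iterate with gradient descent.
W1 = rng.normal(scale=0.5, size=(1, 16)); b1 = np.zeros(16)
w2 = rng.normal(scale=0.5, size=(16, 1)); b2 = np.zeros(1)
lr = 0.1
for step in range(2000):
    h = np.tanh(x @ W1 + b1)          # hidden layer
    y_hat = h @ w2 + b2               # network output
    err = y_hat - y
    loss = np.mean(err ** 2)          # cost is merely driven to a low value
    # Backpropagate gradients by hand for this tiny model.
    g_y = 2 * err / len(x)
    g_w2 = h.T @ g_y;      g_b2 = g_y.sum(0)
    g_h = (g_y @ w2.T) * (1 - h ** 2)
    g_W1 = x.T @ g_h;      g_b1 = g_h.sum(0)
    W1 -= lr * g_W1; b1 -= lr * g_b1
    w2 -= lr * g_w2; b2 -= lr * g_b2
```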
In most cases, our parametric model defines a distribution p(y | x; θ) and we simply use the principle of maximum likelihood. This means we use the cross-entropy between the training data and the model’s predictions as the cost function.
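As a small illustration of this principle, the sketch below computes the cross-entropy cost for a Bernoulli output model; it is just the average negative log-likelihood of the training labels under the model. The labels and predicted probabilities are hypothetical.

```python
import numpy as np

def bernoulli_nll(y, p_hat, eps=1e-12):
    """Cross-entropy between labels y in {0, 1} and the model's
    predicted probabilities p_hat = p(y = 1 | x; theta)."""
    p_hat = np.clip(p_hat, eps, 1 - eps)
    return -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

# Hypothetical labels and model outputs.
y = np.array([1, 0, 1, 1])
p_hat = np.array([0.9, 0.2, 0.7, 0.6])
print(bernoulli_nll(y, p_hat))   # average negative log-likelihood of the data
```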
Sometimes, rather than predicting a complete probability distribution over y, we merely predict some statistic of y conditioned on x. Specialized loss functions allow us to train a predictor of these estimates.
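For instance, minimizing mean squared error trains a predictor of the conditional mean of y, while minimizing mean absolute error trains a predictor of the conditional median. The toy sketch below, with hypothetical targets sharing a single x, checks that the mean minimizes one loss and the median the other.

```python
import numpy as np

def mse(y, y_hat):
    """Minimizing this drives y_hat toward the conditional mean of y."""
    return np.mean((y - y_hat) ** 2)

def mae(y, y_hat):
    """Minimizing this drives y_hat toward the conditional median of y."""
    return np.mean(np.abs(y - y_hat))

y = np.array([1.0, 2.0, 10.0])                   # hypothetical targets for one x
print(mse(y, y.mean()), mse(y, np.median(y)))    # the mean gives the lower MSE
print(mae(y, np.median(y)), mae(y, y.mean()))    # the median gives the lower MAE
```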
The total cost function used to train a neural network will often combine one of the primary cost functions described here with a regularization term.
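As a schematic example of such a combination (the function and parameter names here are hypothetical), one might add an L2 weight decay penalty to the primary cost:

```python
import numpy as np

def total_cost(primary_cost, params, weight_decay=1e-4):
    """Primary cost (e.g. cross-entropy) plus an L2 weight decay penalty."""
    l2 = sum(np.sum(w ** 2) for w in params)
    return primary_cost + weight_decay * l2
```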
Most modern neural networks are trained using maximum likelihood. This means that the cost function is simply the negative log-likelihood, equivalently described as the cross-entropy between the training data and the model distribution. This cost function is given by

J(θ) = −E_{x,y∼p̂_data} log p_model(y | x).

The specific form of the cost function changes from model to model, depending on the specific form of log p_model.
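For example, if p_model(y | x) = N(y; f(x; θ), I), then minimizing this cost recovers the mean squared error, up to an additive constant that does not depend on θ. The sketch below, with hypothetical targets and network outputs, checks this numerically.

```python
import numpy as np

def gaussian_nll(y, f_x):
    """Average negative log-likelihood of y under N(y; f(x; theta), I)."""
    d = y.shape[-1]
    return np.mean(0.5 * np.sum((y - f_x) ** 2, axis=-1)
                   + 0.5 * d * np.log(2 * np.pi))

y = np.array([[1.0, 2.0], [0.5, -1.0]])     # hypothetical targets
f_x = np.array([[0.8, 2.1], [0.0, -1.2]])   # hypothetical network outputs
half_mse = 0.5 * np.mean(np.sum((y - f_x) ** 2, axis=-1))
print(gaussian_nll(y, f_x) - half_mse)      # constant offset: 0.5 * d * log(2*pi)
```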
An advantage of this approach of deriving the cost function from maximum likelihood is that it removes the burden of designing cost functions for each model. Specifying a model p(y | x) automatically determines a cost function −log p(y | x).