As with other machine learning models, to apply gradient-based learning we must choose a cost function, and we must choose how to represent the output of the model. The largest difference between the simple machine learning models seen so far and neural networks is that the nonlinearity of a neural network causes most interesting loss functions to become non-convex. This means that neural networks are usually trained with iterative, gradient-based optimizers that merely drive the cost function to a very low value, rather than the exact linear equation solvers used to train linear regression models or the convex optimization algorithms used for logistic regression or SVMs.
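To make the contrast concrete, the following is a minimal NumPy sketch on hypothetical synthetic data: a linear regression model is fit exactly with the normal equations, while a tiny one-hidden-layer network, whose loss is non-convex, is trained by an iterative gradient descent loop that only drives the cost to a low value. The data, learning rate, and network size are illustrative choices, not prescriptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 1))                          # hypothetical inputs
y = np.sin(3 * x) + 0.1 * rng.normal(size=(100, 1))    # hypothetical targets

# Linear regression: convex squared-error loss, solved exactly by the normal equations.
X = np.hstack([x, np.ones_like(x)])
w_exact = np.linalg.solve(X.T @ X, X.T @ y)

# Tiny one-hidden-layer network: non-convex loss, so we iterate with gradient descent.
W1 = rng.normal(scale=0.5, size=(1, 16)); b1 = np.zeros(16)
w2 = rng.normal(scale=0.5, size=(16, 1)); b2 = np.zeros(1)
lr = 0.1
for step in range(2000):
    h = np.tanh(x @ W1 + b1)          # hidden layer
    y_hat = h @ w2 + b2               # network output
    err = y_hat - y
    loss = np.mean(err ** 2)          # cost is merely driven to a low value
    # Backpropagate gradients by hand for this tiny model.
    g_y = 2 * err / len(x)
    g_w2 = h.T @ g_y;      g_b2 = g_y.sum(0)
    g_h = (g_y @ w2.T) * (1 - h ** 2)
    g_W1 = x.T @ g_h;      g_b1 = g_h.sum(0)
    W1 -= lr * g_W1; b1 -= lr * g_b1
    w2 -= lr * g_w2; b2 -= lr * g_b2
```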
In most cases, our parametric model defines a distribution p(y | x; θ) and we simply use the principle of maximum likelihood. This means we use the cross-entropy between the training data and the model’s predictions as the cost function.
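As a small illustration of this principle, the sketch below computes the cross-entropy cost for a Bernoulli output model; it is just the average negative log-likelihood of the training labels under the model. The labels and predicted probabilities are hypothetical.

```python
import numpy as np

def bernoulli_nll(y, p_hat, eps=1e-12):
    """Cross-entropy between labels y in {0, 1} and the model's
    predicted probabilities p_hat = p(y = 1 | x; theta)."""
    p_hat = np.clip(p_hat, eps, 1 - eps)
    return -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

# Hypothetical labels and model outputs.
y = np.array([1, 0, 1, 1])
p_hat = np.array([0.9, 0.2, 0.7, 0.6])
print(bernoulli_nll(y, p_hat))   # average negative log-likelihood of the data
```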
Sometimes, rather than predicting a complete probability distribution over y, we merely predict some statistic of y conditioned on x. Specialized loss functions allow us to train a predictor of these estimates.
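For instance, minimizing mean squared error trains a predictor of the conditional mean of y, while minimizing mean absolute error trains a predictor of the conditional median. The toy sketch below, with hypothetical targets sharing a single x, checks that the mean minimizes one loss and the median the other.

```python
import numpy as np

def mse(y, y_hat):
    """Minimizing this drives y_hat toward the conditional mean of y."""
    return np.mean((y - y_hat) ** 2)

def mae(y, y_hat):
    """Minimizing this drives y_hat toward the conditional median of y."""
    return np.mean(np.abs(y - y_hat))

y = np.array([1.0, 2.0, 10.0])                   # hypothetical targets for one x
print(mse(y, y.mean()), mse(y, np.median(y)))    # the mean gives the lower MSE
print(mae(y, np.median(y)), mae(y, y.mean()))    # the median gives the lower MAE
```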
The total cost function used to train a neural network will often combine one of the primary cost functions described here with a regularization term.
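As a schematic example of such a combination (the function and parameter names here are hypothetical), one might add an L2 weight decay penalty to the primary cost:

```python
import numpy as np

def total_cost(primary_cost, params, weight_decay=1e-4):
    """Primary cost (e.g. cross-entropy) plus an L2 weight decay penalty."""
    l2 = sum(np.sum(w ** 2) for w in params)
    return primary_cost + weight_decay * l2
```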
Most modern neural networks are trained using maximum likelihood. This means that the cost function is simply the negative log-likelihood, equivalently described as the cross-entropy between the training data and the model distribution. This cost function is given by

J(θ) = −E_{x,y∼p̂_data} log p_model(y | x).

The specific form of the cost function changes from model to model, depending on the specific form of log p_model.
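For example, if p_model(y | x) = N(y; f(x; θ), I), then minimizing this cost recovers the mean squared error, up to an additive constant that does not depend on θ. The sketch below, with hypothetical targets and network outputs, checks this numerically.

```python
import numpy as np

def gaussian_nll(y, f_x):
    """Average negative log-likelihood of y under N(y; f(x; theta), I)."""
    d = y.shape[-1]
    return np.mean(0.5 * np.sum((y - f_x) ** 2, axis=-1)
                   + 0.5 * d * np.log(2 * np.pi))

y = np.array([[1.0, 2.0], [0.5, -1.0]])     # hypothetical targets
f_x = np.array([[0.8, 2.1], [0.0, -1.2]])   # hypothetical network outputs
half_mse = 0.5 * np.mean(np.sum((y - f_x) ** 2, axis=-1))
print(gaussian_nll(y, f_x) - half_mse)      # constant offset: 0.5 * d * log(2*pi)
```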
An advantage of this approach of deriving the cost function from maximum likelihood is that it removes the burden of designing cost functions for each model. Specifying a model p(y | x) automatically determines a cost function −log p(y | x).