This section discusses how to choose the type of hidden unit to use in the hidden layers of the model. The design of hidden units is an extremely active area of research and does not yet have many definitive guiding theoretical principles. Rectified linear units are an excellent default choice of hidden unit.
We discuss the motivations behind the choice of hidden unit. It is usually impossible to predict in advance which kind will work best. The design process consists of trial and error: intuiting that a kind of hidden unit may work well, then evaluating its performance on a validation set.
Some hidden units are not differentiable at all input points. For example, the rectified linear function is not differentiable at z = 0. This may seem like it invalidates g for use with a gradient-based learning algorithm. In practice, gradient descent still performs well enough for these models to be used for machine learning tasks.
Most hidden units can be described as accepting a vector of inputs x, computing an affine transformation z = Wᵀx + b, and then applying an element-wise nonlinear function g(z). Most hidden units are distinguished from each other only by the choice of the form of the activation function g(z).
Rectified linear units use the activation function g(z) = max{0, z}.
Rectified linear units are easy to optimize because they are so similar to linear units.
Rectified linear units are typically used on top of an affine transformation: h = g(Wᵀx + b).
It is good practice to set all elements of b to a small positive value such as 0.1. This makes it likely that the rectified linear units will be initially active for most training samples, allowing derivatives to pass through.
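As a rough illustration (not from the text), here is a minimal NumPy sketch of a rectified linear hidden layer with the bias elements initialized to 0.1; the layer sizes and the random weight scale are assumptions made only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer sizes; the text does not specify any.
n_in, n_hidden = 4, 3

# Affine parameters: small random weights (an assumed initialization)
# and biases set to 0.1 as suggested, so most units start out active.
W = rng.normal(scale=0.1, size=(n_in, n_hidden))
b = np.full(n_hidden, 0.1)

def relu(z):
    # g(z) = max{0, z}, applied element-wise
    return np.maximum(0.0, z)

x = rng.normal(size=n_in)   # one training example
h = relu(x @ W + b)         # h = g(W^T x + b); x @ W matches W^T x for these shapes
print(h)                    # units with positive pre-activation pass gradients through
```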
One drawback to rectified linear units is that they cannot learn via gradient-based methods on examples for which their activation is zero.
Three generalizations of rectified linear units are based on using a non-zero slope αi when zi < 0: hi = g(z, α)i = max(0, zi) + αi min(0, zi). Absolute value rectification fixes αi = −1, a leaky ReLU fixes αi to a small value such as 0.01, and a parametric ReLU (PReLU) treats αi as a learnable parameter.
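A small sketch of this shared form, assuming NumPy; the specific α values are the usual illustrative choices rather than anything prescribed by the text.

```python
import numpy as np

def generalized_relu(z, alpha):
    # h_i = max(0, z_i) + alpha_i * min(0, z_i)
    return np.maximum(0.0, z) + alpha * np.minimum(0.0, z)

z = np.array([-2.0, -0.5, 0.0, 1.5])
print(generalized_relu(z, alpha=0.01))  # leaky ReLU: small fixed slope for z < 0
print(generalized_relu(z, alpha=-1.0))  # absolute value rectification: |z|
# A parametric ReLU (PReLU) would instead treat alpha as a learnable parameter.
```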
Prior to the introduction of rectified linear units, most neural networks used the logistic sigmoid activation function g(z) = σ(z) or the hyperbolic tangent activation function g(z) = tanh(z). These activation functions are closely related because tanh(z) = 2σ(2z) − 1.
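A quick numerical check of this identity, as a sketch in NumPy:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5.0, 5.0, 11)
# tanh(z) = 2*sigmoid(2z) - 1 holds element-wise
print(np.allclose(np.tanh(z), 2.0 * sigmoid(2.0 * z) - 1.0))  # True
```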
We have already seen sigmoid units as output units, used to predict the probability that a binary variable is 1.
Sigmoidal units saturate across most of their domain, which can make gradient-based learning difficult.
The hyperbolic tangent typically performs better than the logistic sigmoid. It resembles the identity function more closely, in the sense that tanh(0) = 0 while σ(0) = 1/2. Because tanh is similar to the identity function near 0, training a deep neural network resembles training a linear model, so long as the activations of the network can be kept small.
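For a small numerical illustration of this point (the input values are chosen arbitrarily):

```python
import numpy as np

z = np.array([0.01, 0.1, 0.5])
print(np.tanh(z))                # ~[0.0100, 0.0997, 0.4621]: close to z itself near 0
print(1.0 / (1.0 + np.exp(-z)))  # ~[0.5025, 0.5250, 0.6225]: offset from z, near 1/2
```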