This section discusses how to choose the type of hidden unit to use in the hidden layers of the model. The design of hidden units is an extremely active area of research and does not yet have many definitive guiding theoretical principles. Rectified linear units are an excellent default choice of hidden unit.
We discuss the motivations behind the choice of hidden unit. It is usually impossible to predict in advance which kind will work best. The design process consists of trial and error: intuiting that a kind of hidden unit may work well, then evaluating its performance on a validation set.
Some hidden units are not differentiable at all input points. For example, the rectified linear function is not differentiable at z = 0. This may seem like it invalidates g for use with a gradient-based learning algorithm. In practice, gradient descent still performs well enough for these models to be used for machine learning tasks.
Most hidden units can be described as accepting a vector of inputs x, computing an affine transformation z = Wᵀx + b, and then applying an element-wise nonlinear function g(z). Most hidden units are distinguished from each other only by the choice of the form of the activation function g(z).
Rectified linear units use the activation function g(z) = max{0, z}.
Rectified linear units are easy to optimize because they are so similar to linear units.
Rectified linear units are typically used on top of an affine transformation: h = g(Wᵀx + b).
It is good practice to set all elements of b to a small positive value such as 0.1. This makes it likely that the rectified linear units will be initially active for most training samples, allowing derivatives to pass through.
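As a rough illustration (not from the text), here is a minimal NumPy sketch of a rectified linear hidden layer with the bias elements initialized to 0.1; the layer sizes and the random weight scale are assumptions made only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer sizes; the text does not specify any.
n_in, n_hidden = 4, 3

# Affine parameters: small random weights (an assumed initialization)
# and biases set to 0.1 as suggested, so most units start out active.
W = rng.normal(scale=0.1, size=(n_in, n_hidden))
b = np.full(n_hidden, 0.1)

def relu(z):
    # g(z) = max{0, z}, applied element-wise
    return np.maximum(0.0, z)

x = rng.normal(size=n_in)   # one training example
h = relu(x @ W + b)         # h = g(W^T x + b); x @ W matches W^T x for these shapes
print(h)                    # units with positive pre-activation pass gradients through
```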
One drawback to rectified linear units is that they cannot learn via gradient-based methods on examples for which their activation is zero.
Three generalizations of rectified linear units are based on using a non-zero slope αi when zi < 0: hi = g(z, α)i = max(0, zi) + αi min(0, zi). Absolute value rectification fixes αi = −1, a leaky ReLU fixes αi to a small value such as 0.01, and a parametric ReLU (PReLU) treats αi as a learnable parameter.
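A small sketch of this shared form, assuming NumPy; the specific α values are the usual illustrative choices rather than anything prescribed by the text.

```python
import numpy as np

def generalized_relu(z, alpha):
    # h_i = max(0, z_i) + alpha_i * min(0, z_i)
    return np.maximum(0.0, z) + alpha * np.minimum(0.0, z)

z = np.array([-2.0, -0.5, 0.0, 1.5])
print(generalized_relu(z, alpha=0.01))  # leaky ReLU: small fixed slope for z < 0
print(generalized_relu(z, alpha=-1.0))  # absolute value rectification: |z|
# A parametric ReLU (PReLU) would instead treat alpha as a learnable parameter.
```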
Prior to the introduction of rectified linear units, most neural networks used the logistic sigmoid activation function g(z) = σ(z) or the hyperbolic tangent activation function g(z) = tanh(z). These activation functions are closely related because tanh(z) = 2σ(2z) − 1.
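A quick numerical check of this identity, as a sketch in NumPy:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5.0, 5.0, 11)
# tanh(z) = 2*sigmoid(2z) - 1 holds element-wise
print(np.allclose(np.tanh(z), 2.0 * sigmoid(2.0 * z) - 1.0))  # True
```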
We have already seen sigmoid units as output units, used to predict the probability that a binary variable is 1.
Sigmoidal units saturate across most of their domain, which can make gradient-based learning difficult.
The hyperbolic tangent typically performs better than the logistic sigmoid. It resembles the identity function more closely, in the sense that tanh(0) = 0 while σ(0) = 1/2. Because tanh is similar to the identity function near 0, training a deep neural network resembles training a linear model, so long as the activations of the network can be kept small.
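For a small numerical illustration of this point (the input values are chosen arbitrarily):

```python
import numpy as np

z = np.array([0.01, 0.1, 0.5])
print(np.tanh(z))                # ~[0.0100, 0.0997, 0.4621]: close to z itself near 0
print(1.0 / (1.0 + np.exp(-z)))  # ~[0.5025, 0.5250, 0.6225]: offset from z, near 1/2
```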