Output units in neural networks

Output units form the last layer of a neural network, and they determine the form of the cross-entropy (loss) function.

Linear units for Gaussian distribution

An affine transformation with no nonlinearity:

\hat{y} = w^\top h + b
  • h is the output of the previous hidden layer

It can be used to produce the mean of a conditional Gaussian:

p(y \mid x) = \mathcal{N}(y; \hat{y}, I)
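A minimal NumPy sketch of a linear output unit (the shapes, names, and random data are illustrative assumptions, not from the source). It also shows that maximizing the likelihood under a Gaussian with identity covariance reduces to minimizing squared error:

```python
import numpy as np

rng = np.random.default_rng(0)
h = rng.standard_normal((4, 8))        # batch of 4 hidden activations, size 8 (assumed)
W = rng.standard_normal((8, 3)) * 0.1  # output-layer weights, 3 output dims (assumed)
b = np.zeros(3)

y_hat = h @ W + b                      # affine transformation, no nonlinearity

# NLL of y under N(y; y_hat, I) equals 0.5 * squared error (up to constants),
# so maximum likelihood here is ordinary mean squared error regression.
y = rng.standard_normal((4, 3))
nll = 0.5 * np.sum((y - y_hat) ** 2, axis=1).mean()
print(nll)
```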

Sigmoid units for Bernoulli output

Used for binary classification:

p(y \mid x) = \mathrm{Bernoulli}(y; \hat{y}), \quad \hat{y} = \sigma(w^\top h + b)
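A small sketch of the sigmoid unit and its Bernoulli negative log-likelihood (the values are made up). Writing the loss directly in terms of the logit z via the softplus identity -log p(y | z) = softplus((1 - 2y) z) avoids numerical problems from computing log(sigmoid(z)):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bernoulli_nll(z, y):
    # softplus((1 - 2y) * z); np.logaddexp(0, x) computes log(1 + exp(x)) stably
    return np.logaddexp(0.0, (1.0 - 2.0 * y) * z)

z = np.array([2.0, -1.5, 0.3])   # logits w^T h + b (assumed values)
y = np.array([1.0, 0.0, 1.0])    # binary targets

print(sigmoid(z))          # predicted probabilities y_hat
print(bernoulli_nll(z, y)) # per-example cross-entropy loss
```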

Softmax units for multinomial output

If the output distribution consists of n possible values, we can use the softmax function:

\mathrm{softmax}(z)_i = \frac{\exp(z_i)}{\sum_j \exp(z_j)}

\log \mathrm{softmax}(z)_i = z_i - \log \sum_j \exp(z_j)

Thus we pass a linear function z = W^\top h + b through a softmax function. The log-softmax loss directly penalizes the most active incorrect prediction. If the largest entry of z corresponds to the correct answer, then z_i and \log \sum_j \exp(z_j) roughly cancel, so the example contributes only a little to the overall training loss; the cost is dominated by examples that are not correctly classified.
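A short NumPy sketch of the log-softmax and the resulting cross-entropy loss (the logits and labels are illustrative). Subtracting the maximum before the log-sum-exp keeps the computation numerically stable:

```python
import numpy as np

def log_softmax(z):
    # subtract the per-row max for numerical stability, then apply log-sum-exp
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

z = np.array([[4.0, 1.0, 0.5],    # correct class clearly dominates
              [0.2, 0.1, 0.3]])   # uncertain prediction
y = np.array([0, 0])              # true class indices

# cross-entropy: -log softmax(z)_y for each example
loss = -log_softmax(z)[np.arange(len(y)), y]
print(loss)  # the confident, correct example contributes little to the loss
```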

Mixture density networks

The conditional distribution p(y \mid x) is modeled as a mixture of Gaussians whose parameters (mixture weights, component means, component covariances) are produced by the neural network:

p(y \mid x) = \sum_i p(c = i \mid x) \, \mathcal{N}\big(y; \mu^{(i)}(x), \Sigma^{(i)}(x)\big)
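A minimal sketch of a mixture density output head for a scalar target, assuming K Gaussian components with diagonal (here scalar) variances; the splitting scheme and function names are hypothetical, not from the source. The mixture weights come from a softmax and the standard deviations are made positive by exponentiation:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mdn_head(raw, K):
    # split the network's raw outputs (width 3K) into the three parameter groups
    logits, mu, log_sigma = np.split(raw, [K, 2 * K], axis=-1)
    pi = softmax(logits)        # mixture weights, nonnegative and summing to 1
    sigma = np.exp(log_sigma)   # standard deviations, constrained positive
    return pi, mu, sigma

def mdn_nll(y, pi, mu, sigma):
    # negative log-likelihood of a scalar target under the Gaussian mixture
    log_comp = (-0.5 * ((y[:, None] - mu) / sigma) ** 2
                - np.log(sigma) - 0.5 * np.log(2 * np.pi))
    return -np.log((pi * np.exp(log_comp)).sum(axis=-1))

raw = np.random.default_rng(0).standard_normal((5, 9))  # batch of 5, K = 3 (assumed)
pi, mu, sigma = mdn_head(raw, K=3)
y = np.zeros(5)
print(mdn_nll(y, pi, mu, sigma))
```

In practice the inner sum is usually computed with a log-sum-exp over the components for numerical stability; the direct form above is kept for readability.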