Greedy layer-wise unsupervised pre-training¶
It relies on a single-layer representation learning algorithm such as an RBM, a single-layer autoencoder, sparse coding, or some other latent variable model.
We greedily pre-train one layer at a time with unsupervised learning: when we add a new layer, the weights of the previous layers are frozen and no longer adapted. Each new layer should produce a representation that is simpler than the previous one.
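The sketch below illustrates this greedy procedure with stacked single-layer autoencoders; PyTorch, the layer sizes, the random stand-in data and the training settings are assumptions made here for illustration, not part of the original notes.

```python
# Minimal sketch of greedy layer-wise unsupervised pre-training
# with single-layer autoencoders (all settings are illustrative).
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(1000, 784)            # unlabeled data (random stand-in)
layer_sizes = [784, 256, 64]          # input dim followed by hidden dims

encoders, decoders = [], []           # modules trained so far
h = X                                 # representation fed to the next layer
for d_in, d_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    enc = nn.Sequential(nn.Linear(d_in, d_out), nn.ReLU())
    dec = nn.Linear(d_out, d_in)      # decoder used only for this stage's loss
    opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)
    for epoch in range(10):           # train this layer as a one-layer autoencoder
        opt.zero_grad()
        loss = nn.functional.mse_loss(dec(enc(h)), h)
        loss.backward()
        opt.step()
    encoders.append(enc)
    decoders.append(dec)
    with torch.no_grad():             # earlier layers are frozen from now on:
        h = enc(h)                    # only their fixed outputs are passed upward
```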
It is not popular anymore, although it is still used in NLP.
After pre-training we train the network as a whole, i.e. we fine-tune all the layers together (typically with a supervised objective).
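Continuing the hypothetical sketch above (it reuses the `encoders` list and the data `X`; the labels `y` and the 10-class head are made up), the joint fine-tuning stage could look like this:

```python
# Stack the pre-trained encoders, add a new classification head,
# and fine-tune all layers together.
y = torch.randint(0, 10, (1000,))                     # fake labels, 10 classes
model = nn.Sequential(*encoders, nn.Linear(64, 10))   # pre-trained stack + new head
opt = torch.optim.Adam(model.parameters(), lr=1e-4)   # all layers are now trainable
for epoch in range(10):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(model(X), y)
    loss.backward()
    opt.step()
```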
We can also use it as an initialization for other unsupervised learning algorithms, such as deep autoencoders or probabilistic models with many layers of latent variables.
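As a variation on the same hypothetical sketch, the layer-wise encoders and decoders can be unrolled into a deep autoencoder that is then trained end-to-end on the reconstruction loss:

```python
# Unroll the pre-trained layers into a deep autoencoder and train it end-to-end.
deep_ae = nn.Sequential(*encoders, *reversed(decoders))  # encoder stack + mirrored decoders
opt = torch.optim.Adam(deep_ae.parameters(), lr=1e-4)
for epoch in range(10):
    opt.zero_grad()
    loss = nn.functional.mse_loss(deep_ae(X), X)          # reconstruction objective
    loss.backward()
    opt.step()
```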
Pre-training also provided more consistent convergence across runs.
When does this work?¶
On many tasks it is actually harmful. It was used to provide a good initialization that avoids bad local minima, but since we now have a lot of data we do not worry about local minima as much anymore.
It remained popular for word embeddings, but has otherwise been largely abandoned.
It is well suited when we have only a few labeled examples but plenty of unlabeled data.
Downsides¶
The separate training stages make it hard to tune hyperparameters, because choices made during pre-training can only be evaluated after the fine-tuning stage is done.