Batch and mini-batch algorithms
In most machine learning algorithms the objective function can be written as a sum (equivalently, an expectation) over the training examples. In practice we rarely evaluate the full sum; instead we estimate the expected value of the cost function from a subset of its terms.
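Written out in the standard notation (a per-example loss \(L\), model \(f(x;\theta)\), empirical data distribution \(\tilde{p}_{\text{data}}\) over the \(m\) training examples), this takes the form

\[
J(\theta) = E_{x,y \sim \tilde{p}_{\text{data}}} L(f(x;\theta), y) = \frac{1}{m} \sum_{i=1}^{m} L(f(x^{(i)};\theta), y^{(i)}).
\]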
The gradient of the cost function is then

\[
\nabla_{\theta} J(\theta) = E_{x,y \sim \tilde{p}_{\text{data}}} \nabla_{\theta} L(f(x;\theta), y),
\]

where \(E_{x,y \sim \tilde{p}_{\text{data}}}\) denotes the expectation over the empirical data distribution. Evaluating this expectation exactly is expensive when the training set is large.
In practice we evaluate the expectation only on a random subsample of the data (a mini-batch). This is justified by the standard error of the sample mean, \(\sigma / \sqrt{n}\), where \(\sigma\) is the true standard deviation and \(n\) is the number of samples.
Because the error falls only as \(1/\sqrt{n}\), there are less-than-linear returns for adding more samples: using 100 times more examples reduces the standard error of the gradient estimate only by a factor of 10.
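A quick numerical check of this \(1/\sqrt{n}\) behaviour, as a minimal sketch on a synthetic linear-regression problem (the data, parameters, and function name below are illustrative assumptions, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear-regression problem; illustrative only.
m, d = 100_000, 10
X = rng.normal(size=(m, d))
true_w = rng.normal(size=d)
y = X @ true_w + rng.normal(scale=0.5, size=m)
w = np.zeros(d)  # current parameter vector

def batch_gradient(idx):
    """Mean-squared-error gradient averaged over the examples in idx."""
    Xb, yb = X[idx], y[idx]
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(idx)

full_grad = batch_gradient(np.arange(m))  # gradient over the whole training set

for n in (100, 10_000):
    # Average distance between the mini-batch estimate and the full gradient.
    errs = [np.linalg.norm(batch_gradient(rng.choice(m, size=n, replace=False)) - full_grad)
            for _ in range(200)]
    print(f"batch size {n:>6}: mean estimation error {np.mean(errs):.4f}")
# The 100x larger batch reduces the error by only about a factor of 10.
```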
Mini-batch size
- Larger batches give a more accurate gradient estimate, but with less-than-linear returns.
- Smaller batches can have a regularizing effect, likely due to the noise they add to the gradient estimate.
- First-order, gradient-only methods are fairly robust to this noise and can work with small batch sizes (around 100); a minimal mini-batch SGD loop is sketched after this list.
- Second-order methods typically require much larger batches (on the order of 10,000 or more), since they can amplify errors in the estimated gradient.
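A minimal mini-batch SGD sketch, reusing the illustrative data (`X`, `y`, `true_w`) from the snippet above; the batch size, learning rate, and epoch count are arbitrary placeholder values, not recommendations from the notes:

```python
def minibatch_sgd(X, y, batch_size=100, lr=0.01, epochs=5, seed=0):
    """Plain mini-batch SGD on mean squared error; a sketch, not a tuned implementation."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        perm = rng.permutation(m)               # reshuffle the data each epoch
        for start in range(0, m, batch_size):
            idx = perm[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            grad = 2.0 * Xb.T @ (Xb @ w - yb) / len(idx)  # mini-batch gradient estimate
            w -= lr * grad
    return w

w_hat = minibatch_sgd(X, y)
print("distance to the true weights:", np.linalg.norm(w_hat - true_w))
```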