- While we think of gradient descent as acting on a single (very high-dimensional) loss landscape, each mini-batch effectively defines a slightly different landscape, and hence a different gradient direction.
- Is this part of the reason why NNs avoid getting trapped in local minima? Training on enough data gives a constantly shifting landscape that "massages out" local minima.
- In contrast, GAs are typically evaluated on a single fixed fitness landscape.
- If a GA's fitness landscape contains large neutral networks, that could act similarly to batches, giving the population routes to drift out of what would otherwise be traps.
- Is there any advantage in evaluating GAs on a changing landscape, such as computing fitness on a different random subset of the training data each generation? (A rough sketch of this idea follows below.)
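
A minimal sketch of that last idea, assuming a toy regression task and a simple mutation-only GA; the name `minibatch_fitness` and all hyperparameters are illustrative, not taken from any particular library. Because fitness is scored on a fresh random subset each call, the "landscape" the population sees shifts every generation, loosely analogous to mini-batches in SGD:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: y = X @ w_true + noise
n_samples, n_features = 1000, 5
X = rng.normal(size=(n_samples, n_features))
w_true = rng.normal(size=n_features)
y = X @ w_true + 0.1 * rng.normal(size=n_samples)

def minibatch_fitness(w, batch_size=64):
    """Negative MSE on a random subset -- the landscape shifts every call."""
    idx = rng.choice(n_samples, size=batch_size, replace=False)
    err = X[idx] @ w - y[idx]
    return -np.mean(err ** 2)

# Simple mutation-only GA: evaluate on a fresh batch, keep the best, mutate.
pop_size, n_parents, sigma = 50, 10, 0.1
population = rng.normal(size=(pop_size, n_features))

for generation in range(200):
    fitnesses = np.array([minibatch_fitness(w) for w in population])
    parents = population[np.argsort(fitnesses)[-n_parents:]]
    # Offspring are mutated copies of randomly chosen parents.
    choices = rng.integers(n_parents, size=pop_size)
    population = parents[choices] + sigma * rng.normal(size=(pop_size, n_features))

best = population[np.argmax([minibatch_fitness(w) for w in population])]
print("distance to true weights:", np.linalg.norm(best - w_true))
```

Comparing this against the same GA with fitness computed on the full dataset would be one way to test whether the shifting landscape actually helps, or just adds noise to selection.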