Batch Gradient Descent (BGD) and Stochastic Gradient Descent (SGD) are optimization algorithms used in machine learning to minimize a loss function and update the model's parameters during training. Below are the key differences between them:
1. Dataset Size for Updates
BGD:
- Uses the entire training dataset to compute the gradient for each parameter update.
SGD:
- Uses a single randomly selected training example to compute the gradient for each update.
2. Speed and Computational Efficiency
BGD:
- Slower updates as it computes the gradient using the entire dataset.
- Not ideal for large datasets due to the high computational cost.
SGD:
- Faster updates as it computes gradients using individual samples.
- More efficient for large datasets.
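The per-update cost difference can be seen in a minimal NumPy sketch (illustrative only, not from the original text; the function names `batch_gd` and `sgd` and the linear-regression setup are assumptions): BGD touches every row of `X` per step, while SGD touches one row per step.

```python
import numpy as np

# Toy linear-regression data (hypothetical setup for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
true_w = np.array([2.0, -1.0])
y = X @ true_w + 0.01 * rng.normal(size=100)

def batch_gd(X, y, lr=0.1, epochs=200):
    """One update per epoch, gradient computed over the ENTIRE dataset."""
    w = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / n  # averages over all n samples
        w -= lr * grad
    return w

def sgd(X, y, lr=0.05, epochs=20, seed=1):
    """One update per SAMPLE, visited in shuffled order each epoch."""
    w = np.zeros(X.shape[1])
    order_rng = np.random.default_rng(seed)
    for _ in range(epochs):
        for i in order_rng.permutation(len(y)):
            grad = (X[i] @ w - y[i]) * X[i]  # gradient from one sample
            w -= lr * grad
    return w

print(batch_gd(X, y))
print(sgd(X, y))
```

Both recover weights near `true_w` here; the practical difference is that SGD has already made many cheap updates before BGD finishes a single pass over the data.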
3. Convergence and Stability
BGD:
- Smooth and stable convergence towards the minimum but can be slow.
- Can get stuck in local minima or plateaus if the landscape is complex.
SGD:
- Convergence may be noisy due to random fluctuations from single-sample updates.
- The randomness can help escape local minima but may oscillate near the minimum instead of settling.
4. Memory Requirements
BGD:
- Requires memory proportional to the entire dataset for gradient computation.
SGD:
- Requires memory only for a single training example at a time.
5. Use Cases
BGD:
- Preferred for small datasets where computational resources are not a concern.
SGD:
- Better suited for large-scale datasets and online learning scenarios.
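The online-learning case can be sketched as SGD consuming a data stream, so only one example is ever held in memory per update (a hedged sketch; the generator `sample_stream` and function `online_sgd` are hypothetical names introduced here):

```python
import numpy as np

def sample_stream(n, seed=0):
    """Yield (x, y) pairs one at a time, simulating data arriving online."""
    rng = np.random.default_rng(seed)
    true_w = np.array([3.0, -1.5])
    for _ in range(n):
        x = rng.normal(size=2)
        yield x, x @ true_w  # noiseless labels for simplicity

def online_sgd(stream, dim, lr=0.05):
    """Update weights from each sample as it arrives; nothing is stored."""
    w = np.zeros(dim)
    for x, y in stream:
        w -= lr * (x @ w - y) * x
    return w

print(online_sgd(sample_stream(2000), dim=2))
```

Because the stream is consumed lazily, the same loop works whether the "dataset" is 2,000 samples or 2 billion, which is why SGD-style updates dominate large-scale and online settings.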
6. Variants
To balance the trade-offs between BGD and SGD:
- Mini-batch Gradient Descent: Computes gradients using small batches of training data (combines efficiency and stability).
- Momentum and Adam: optimizer variants that improve convergence speed and reduce oscillations.
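Mini-batch gradient descent, the most common compromise in practice, can be sketched as follows (illustrative assumption: a fresh linear-regression setup and the hypothetical function name `minibatch_gd`):

```python
import numpy as np

# Toy noiseless linear-regression data (hypothetical setup).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

def minibatch_gd(X, y, lr=0.1, epochs=50, batch_size=20, seed=1):
    """One update per mini-batch: shuffle, slice, average the gradient."""
    w = np.zeros(X.shape[1])
    order_rng = np.random.default_rng(seed)
    n = len(y)
    for _ in range(epochs):
        idx = order_rng.permutation(n)
        for start in range(0, n, batch_size):
            b = idx[start:start + batch_size]  # indices of this mini-batch
            grad = X[b].T @ (X[b] @ w - y[b]) / len(b)
            w -= lr * grad
    return w

print(minibatch_gd(X, y))
```

Averaging over a small batch smooths out most of the single-sample noise of SGD while keeping each update far cheaper than a full BGD pass; batch size is the knob that trades stability against update frequency.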