# From SGD to Shampoo

In [In Search of the Right Weights](/intro-to-deep-learning/training-neural-networks/in-search-of-the-right-weights.md#optimizing-the-weights) we decided that since our neural networks and losses are differentiable, we will use gradient-based methods to optimize our networks. You've seen some of these optimizers if you have gone through [Streamlining the Training Pipeline](/intro-to-deep-learning/deep-learning-engineering/getting-started-with-pytorch/streamlining-the-training-pipeline.md). We'll go over a few key optimizers and the intuition behind them here.&#x20;

### Stochastic Gradient Descent (SGD)

In the previous set of notes, I was pretty imprecise with my terminology. Gradient descent is but one algorithm in the zoo of first-order methods available. Albeit it is the most famous one. We've seen it before, but let's see it with a compact location and dissect it a bit:

$$
W^{t+1}\_{i} = W^{t}*i - \xi\frac{\partial{\mathcal{L^{t}}}}{\partial{W^{t}*{i}}}
$$

First, let's talk about the superscript $$t$$. That is to denote the time step or iteration at which these calculations are taking place. These methods are iterative methods (think for loop). In numerical optimization lingo $$W^{t+1}\_i$$ and $$W^t\_i$$ would be called *iterates.* We keep updating the weights until some convergence criterion is met.&#x20;

I'm using $$W\_i$$ as the updated quantity to make the connection to the **W**eights of a neural network a bit more obvious. At each iteration $$t$$, we are updating each weights $$W\_i^{t}$$ to get the updated weight $$W\_i^{t+1}$$.   Now you may notice exactly how the weights are changing timestep $$t$$ to the next timestep $$t+1$$. It is, of course, the term, $$-\xi\frac{\partial{\mathcal{L}^{t}}}{\partial{W\_{i}^{t}}}$$. So let's dissect this term a bit more carefully.&#x20;

We have the gradient term, $$\frac{\partial{\mathcal{L}^{t}}}{\partial{W^{t}\_{i}}}$$. $$\mathcal{L}^{t}$$ is the loss at time t, calculated based on the neural networks having weights $$W^{t}$$. We just shortened the presentation for brevity. So the gradient is the partial derivative of the loss with respect to the weights. The direction opposite the direction of the gradient minimizes the first-order approximation, hence the negative sign. We've now got a direction to go towards, next is how big a step to take. Well, the gradient is going to scale with the loss, but we also set a step size with $$\xi$$. That is a parameter we decide on based on intuition, and it dictates how big a step in the negative direction we take. This is often called the *step size* or, more commonly in ML literature, the ***learning rate.***

{% hint style="danger" %}
I am frantically looking behind my back to check if a mathematician friend of mine is chasing me down to tell me that $$\xi$$ is only the step size or step length if the gradient is normalized (divided by the magnitude). But I like to live dangerously and use the term loosely.&#x20;
{% endhint %}

We have seen it before in [/pages/ry28mUeUUetzoDRatWk5#enough-theory.-lets-do-calculus-homework](https://szaman.gitbook.io/intro-to-deep-learning/training-neural-networks/pages/ry28mUeUUetzoDRatWk5#enough-theory.-lets-do-calculus-homework "mention"), but it might be instructive to see this in code form:

```
def SGD(W_cur, delta_W_cur, step_size) -> Tensor:
    W_next = W_cur - step_size * delta_W_cur
    return W_next
```

#### GD vs SGD vs Mini-batch GD

So far, we've talked about doing updates on a single sample. We don't usually do this in practice. While we use SGD as an overloaded term, especially in practice, where SGD is typically used in place of mini-batch SGD. This is because no one really uses SGD, and almost everyone uses mini-batching.&#x20;

In deep learning, we tend to have large datasets. Let's say we have a dataset with N samples. N is sufficiently large, so we don't want to update the model after each sample. Updating the model is costly when you have a large model. You have to calculate a lot of gradients, and you have to wait until all the updates are completed to process the next sample.&#x20;

Also, as we work with large datasets and models, SGD provides noisy gradients. We are operating in a very high-dimensional space and only using one point sample to gather gradient information. That results in bad estimates of gradients for our weights and makes the training really sensitive to outliers.&#x20;

The other end of the extreme is gradient descent. Here we go through the entire dataset, sum or average the loss to get a single loss, and calculate the gradients once per dataset. You can imagine that's not great either. For a dataset with a few trillion tokens, we're talking about taking a couple of days to do a single update. Furthermore, a little stochasticity helps. It allows the model to explore the loss landscape and potentially escape poor local minima.&#x20;

So, the most often used method is mini-batch gradients. The dataset is broken down into M chunks, and the loss and gradients are calculated per chunk (or mini-batch) of data. The size of the mini-batch is a hyperparameter that you determine when you are training. This gives us the best of both worlds as we can trade off performance and accuracy. In general, you want as large a mini-batch as possible without sacrificing accuracy.&#x20;

#### Are we guaranteed to converge?

No, we are not. Usually, convergence guarantees in this line of work start with an assumption of convexity. We unfortunately cannot guarantee that our neural network objective is convex. Even smoothness is a difficult thing to guarantee in large and deep neural networks. Our loss landscape is usually riddled with local minima and saddle points.&#x20;

I have used the term **loss landscape** twice now. Going back to our definition of a function with the weights as the arguments, we can then think about the loss with any given input and set of weights. Let's think about this in linear algebra terms. We have an N-dimensional space, where each point is a realization of a network with those weights. In other words, each point is a state of our model with N weights.&#x20;

{% hint style="warning" %}
For a linear network f(x) = Wx + b, with W and B as weights. The point (0, 0) would correspond to W = 0, B = 0. The point (100, -99) would correspond to W = 100, B=-99, so on and so forth.&#x20;
{% endhint %}

Now, at each point, we can calculate the total loss over a set of samples with our network. Because given a set of weights, we can produce a prediction for each input, and see how close that prediction is to the real values. We sum all the losses for each sample, and we get a single value, a scalar, in other words. Hopefully, you can readily see that each point in the weight space is associated with a single scalar in the loss space. Those of you from a physics background may have even caught on that I am describing a [scalar field](https://en.wikipedia.org/wiki/Scalar_field). Well, this is our Loss Landscape. \
\
Mathematically, we can write the loss at the point $$W$$ as:

$$
\mathcal{L}*W = \sum*{x\_i \in \mathcal{D}}\mathcal{L}(f\_W(x\_i),y\_i)
$$

<figure><img src="/files/iFpbbcjlSeAkINLwo3DA" alt=""><figcaption><p><a href="https://arxiv.org/abs/1712.09913">There's some cool work on visualizing loss landscapes by Li et al.</a></p></figcaption></figure>

We traverse the weight space, iteratively updating the coordinates such that the corresponding point in the loss landscape is a minimum.  Unfortunately, even for convex loss landscapes (so they have a global minimum), SGD doesn't guarantee convergence to the minimum.&#x20;


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://szaman.gitbook.io/intro-to-deep-learning/training-neural-networks/from-sgd-to-shampoo.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.