# In Search of the Right Weights


$$
W^{t+1}\_i = W^{t}\_i - \xi \frac{\partial \mathcal{L}(W^{t}\_1, W^{t}\_2, \ldots, W^{t}\_i, \ldots, W^{t}\_n)}{\partial W^{t}\_i}
$$

This one-liner underpins the training of the predictive AI, LLMs, and other neural network models you see today. It describes standard gradient descent (hopefully you spot the gradient part readily), which practitioners augment to train neural network models. Here $$W^{t}\_i$$ is the $$i$$-th weight at step $$t$$ and $$\xi$$ is the step size.

### A Perspective Shift

One way of thinking about a neural network is as a function that takes some input $$x$$ and outputs $$y\_{pred}$$. We write it as $$f\_W(x)=y\_{pred}$$, where $$f\_W$$ is the function determined by the weights $$W$$. But this buries the lede a bit: it hides the fact that the weights $$W$$ can change, and that the output $$y\_{pred}$$ depends on both $$x$$ and $$W$$. In other words, given two different sets of weights $$W$$ and $$W^{'}$$, it's highly likely that $$f\_{W}(x) \neq f\_{W^{'}}(x)$$.

So let's abuse notation a bit to make the weight dependence explicit, and write the function with both the input $$x$$ and the weights $$W$$ as arguments. We get something like:

$$
f(x, W) = y\_{pred}
$$

Now $$f$$ only determines the structure of the relation between $$x$$ and $$W$$. The predicted output changes when you change either $$x$$ or $$W$$. Now things can get interesting.
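As a minimal sketch of this view, assume a toy "network" with just two weights, a slope and an offset. The linear form of `f` and the specific numbers are illustrative assumptions, not something these notes prescribe.

```python
def f(x, W):
    """Toy model (an assumption for illustration): y_pred = w * x + b."""
    w, b = W
    return w * x + b

x = 2.0
W = (0.5, 1.0)        # one set of weights
W_prime = (1.5, 0.0)  # a different set of weights

y_pred = f(x, W)              # same structure f, same input x ...
y_pred_prime = f(x, W_prime)  # ... different weights, different output

y_pred_other_input = f(3.0, W)  # changing x also changes the output
```

The function `f` never changes here. Only its arguments do, which is exactly what writing $$f(x, W)$$ is meant to emphasize.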

Suppose that for the input $$x$$ we have a true label $$y\_{true}$$. Naturally, we want our predicted output $$y\_{pred}$$ to be as close as possible to $$y\_{true}$$. We can't change $$x$$, since it is the data we are given. So we change $$W$$ instead, looking for the best set of weights $$W\_{best}$$, where the predicted and true outputs are as close as possible. And well, now we're training the weights!

### Optimizing the Weights

The next question is, how do we find the best set of weights? We could randomly select weights, test them, and keep the best ones. But that's not very efficient, considering we can have millions of weights and infinitely many possible combinations. We are in the territory of unconstrained optimization.
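To make the "guess and keep the best" idea concrete, here is a rough sketch of random search on a toy two-weight model. The linear model, the squared-error score, the sampling range, and the helper names (`loss`, `random_search`) are all assumptions made for illustration.

```python
import random

def f(x, W):
    """Toy model (assumed for illustration): y_pred = w * x + b."""
    w, b = W
    return w * x + b

def loss(W, data):
    """Mean squared difference between predictions and true labels."""
    return sum((f(x, W) - y) ** 2 for x, y in data) / len(data)

def random_search(data, n_trials=1000, seed=0):
    """Sample weights uniformly at random and keep the best-scoring pair."""
    rng = random.Random(seed)
    best_W, best_loss = None, float("inf")
    for _ in range(n_trials):
        W = (rng.uniform(-5.0, 5.0), rng.uniform(-5.0, 5.0))
        current = loss(W, data)
        if current < best_loss:
            best_W, best_loss = W, current
    return best_W, best_loss

# Hypothetical data generated from y = 2x + 1
data = [(x, 2.0 * x + 1.0) for x in (-2.0, -1.0, 0.0, 1.0, 2.0)]
best_W, best_loss = random_search(data)
```

Even with only two weights, this needs many trials to get close to the true values, and it never uses what earlier trials revealed about which direction is better. With millions of weights, the search space grows far too quickly for this to be practical.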

So what if we restrict the functions learned to be differentiable? In other words, $$f(x,W)$$ can be differentiated with respect to the weights $$W$$. Differentiability is, of course, not strictly required, and there's a whole line of work on optimization without derivatives. But we can go a long way with differentiable functions. Differentiable functions let us use the gradient descent we've talked about above.

Remember, our goal is to minimize the difference between our predicted output and the real output (for supervised training). We need to be able to actually measure this difference between $$y\_{pred}$$ and $$y\_{true}$$, and do it in a differentiable way (this restriction can be loosened). So let's say we have a function that measures the "loss" of accuracy and denote it $$\mathcal{L}(y\_{pred}, y\_{true})$$. When $$\mathcal{L}=0$$, we have perfect predictions. It may be nice to write out the loss explicitly as:

$$
\mathcal{L} = \mathcal{L}(f(x, W), y\_{true})
$$

Given our loss or objective function, we want to drive the loss to 0, or at least make it as small as possible. In other words, we are tasked with finding the set of arguments $$W\_{best}$$ that minimizes the loss $$\mathcal{L}$$.
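A small sketch of the composition $$\mathcal{L}(f(x, W), y\_{true})$$, assuming a toy linear model and squared error as the loss. Both choices, and the helper names, are illustrative assumptions; these notes do not fix a particular loss.

```python
def f(x, W):
    """Toy model (assumed for illustration): y_pred = w * x + b."""
    w, b = W
    return w * x + b

def squared_error(y_pred, y_true):
    """One possible differentiable loss; zero only for a perfect prediction."""
    return (y_pred - y_true) ** 2

def loss(x, W, y_true):
    """The loss written explicitly as L(f(x, W), y_true)."""
    return squared_error(f(x, W), y_true)

x, y_true = 2.0, 5.0
loss_a = loss(x, (0.5, 1.0), y_true)  # prediction 2.0, so the loss is 9.0
loss_b = loss(x, (2.0, 1.0), y_true)  # prediction 5.0, so the loss is 0.0
```

With $$x$$ and $$y\_{true}$$ held fixed, the loss is purely a function of $$W$$, and that is the function we minimize.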

That is what gradient descent tries to achieve. In the gradient descent equation, we take a step of size $$\xi$$ in the direction of the negative gradient with respect to the weights. Intuitively, the derivative (or gradient) points in the direction in which the function increases fastest, and we want to move in the direction in which it decreases. If the loss is increasing in $$W\_i$$, the derivative is positive and the negative sign makes us decrease $$W\_i$$. If it is decreasing, the derivative is negative, the two negatives cancel, and we increase $$W\_i$$, following the direction of decrease.
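The update rule can be sketched in a few lines. This assumes a hypothetical two-weight loss $$\mathcal{L}(W) = (W\_1 - 3)^2 + (W\_2 + 1)^2$$, picked only because its gradient is easy to write by hand and its minimum sits at $$W = (3, -1)$$. The step size and step count are likewise arbitrary choices.

```python
def grad_loss(W):
    """Gradient of the toy loss L(W) = (W_1 - 3)^2 + (W_2 + 1)^2."""
    return [2.0 * (W[0] - 3.0), 2.0 * (W[1] + 1.0)]

def gradient_descent(W, xi=0.1, steps=100):
    """Repeatedly step against the gradient with step size xi."""
    for _ in range(steps):
        grad = grad_loss(W)
        # W_i^{t+1} = W_i^t - xi * dL/dW_i, applied to every weight
        W = [w_i - xi * g_i for w_i, g_i in zip(W, grad)]
    return W

W_best = gradient_descent([0.0, 0.0])  # approaches [3.0, -1.0]
```

Starting at $$W\_1 = 0$$, the derivative is negative, so the update increases $$W\_1$$ toward 3. Starting at $$W\_2 = 0$$, the derivative is positive, so the update decreases $$W\_2$$ toward -1.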

At a local minimum, the gradient is 0. This is often called the *First-Order Necessary Condition* for local minima. It is necessary rather than sufficient: the gradient also vanishes at local maxima and saddle points. The proof for this is not too difficult and could be instructive. Under some conditions, gradient descent is guaranteed to converge to a local minimum, but we usually operate pretty far from that regime when training networks.
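A zero gradient on its own does not certify a minimum. As a quick sketch, take the hypothetical loss $$\mathcal{L}(W) = W\_1^2 - W\_2^2$$ (chosen purely for illustration), whose gradient vanishes at the origin even though the origin is a saddle point.

```python
def loss(W):
    """Toy saddle-shaped loss: L(W) = W_1^2 - W_2^2."""
    return W[0] ** 2 - W[1] ** 2

def grad_loss(W):
    """Gradient of the toy loss, written out by hand."""
    return [2.0 * W[0], -2.0 * W[1]]

origin = [0.0, 0.0]
gradient_at_origin = grad_loss(origin)  # both components are zero

# Moving a little along W_2 lowers the loss, so the origin is not a minimum
nearby = [0.0, 0.1]
origin_is_minimum = loss(nearby) >= loss(origin)  # False
```

Gradient descent started exactly at such a point would not move, because every update is scaled by a gradient of zero.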

So this was gradient descent, but we've mostly talked theory and intuition. In practice, it looks quite different when we have multiple samples, poorly conditioned weights, and a challenging loss landscape. So, practical methods take baseline gradient descent and iterate on it (no pun intended) to improve it for training neural networks.
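As a first step toward the multi-sample case, one common choice is to average the loss over all samples and descend on that average. The sketch below assumes a toy linear model, squared error, and noise-free synthetic data from $$y = 2x + 1$$. The names `mean_loss`, `mean_gradient`, and `fit` are hypothetical.

```python
def f(x, W):
    """Toy model (assumed for illustration): y_pred = w * x + b."""
    w, b = W
    return w * x + b

def mean_loss(W, data):
    """Squared error averaged over every (x, y_true) sample."""
    return sum((f(x, W) - y) ** 2 for x, y in data) / len(data)

def mean_gradient(W, data):
    """Gradient of mean_loss with respect to (w, b), derived by hand."""
    n = len(data)
    grad_w = sum(2.0 * (f(x, W) - y) * x for x, y in data) / n
    grad_b = sum(2.0 * (f(x, W) - y) for x, y in data) / n
    return grad_w, grad_b

def fit(data, W=(0.0, 0.0), xi=0.1, steps=200):
    """Plain gradient descent on the averaged loss."""
    for _ in range(steps):
        grad_w, grad_b = mean_gradient(W, data)
        W = (W[0] - xi * grad_w, W[1] - xi * grad_b)
    return W

# Hypothetical noise-free samples from y = 2x + 1 on [-2, 2]
data = [(k / 4.0, 2.0 * (k / 4.0) + 1.0) for k in range(-8, 9)]
W_fit = fit(data)  # approaches (2.0, 1.0)
```

Every step here touches every sample, which becomes expensive on large datasets. That cost is one of the practical pressures that motivates the refinements mentioned above.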

{% hint style="info" %}
**Note:** These notes assume supervised training and continuous regression outputs. The same arguments extend to unsupervised (e.g., generative) and semi-supervised learning tasks, and to categorical outputs.
{% endhint %}

### In Summary

The problem of training a network is a problem of optimization. We are either trying to minimize or maximize some quantity, usually the "loss" function. The idea is that if the loss is minimized, our neural network is doing what we want it to do. There are quite a few challenges along the way, unfortunately.

First of all, our network is, for practical purposes, a black box. Written out in full, the function it computes is far too unwieldy to reason about by hand. It is also incredibly high-dimensional: each learnable weight or parameter in the network is a dimension in the optimization, so training a network with 100 million parameters is equivalent to optimizing a function of 100 million variables. And we don't really have access to the true "loss" or objective function. At best, we can approximate it by using finite data[^1].

But if we can make our network and loss function differentiable with respect to the parameters, we are in luck! We can rely on decades of numerical optimization theory and start training our network. So that's what we do! You've read about backpropagation in neural networks, and you've read about differentiable loss functions; they are all in service of letting us train the model with techniques like gradient descent.
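To illustrate what differentiability buys us, the sketch below computes the gradient of a toy loss by the chain rule (the idea backpropagation automates for deep networks) and compares it against a finite-difference estimate. The linear model, squared-error loss, and helper names are assumptions for illustration only.

```python
def f(x, W):
    """Toy model (assumed for illustration): y_pred = w * x + b."""
    w, b = W
    return w * x + b

def loss(W, x, y_true):
    """Squared-error loss L(f(x, W), y_true)."""
    return (f(x, W) - y_true) ** 2

def analytic_gradient(W, x, y_true):
    """Chain rule: dL/dW = dL/dy_pred * dy_pred/dW."""
    dL_dy = 2.0 * (f(x, W) - y_true)
    return (dL_dy * x, dL_dy * 1.0)  # dy_pred/dw = x, dy_pred/db = 1

def numerical_gradient(W, x, y_true, eps=1e-6):
    """Central finite differences, one weight at a time."""
    grads = []
    for i in range(len(W)):
        W_plus, W_minus = list(W), list(W)
        W_plus[i] += eps
        W_minus[i] -= eps
        diff = loss(W_plus, x, y_true) - loss(W_minus, x, y_true)
        grads.append(diff / (2.0 * eps))
    return tuple(grads)

W, x, y_true = (0.5, 1.0), 2.0, 5.0
exact = analytic_gradient(W, x, y_true)
approx = numerical_gradient(W, x, y_true)
```

The finite-difference version needs two loss evaluations per weight, which is hopeless at 100 million weights. The chain-rule version gets every partial derivative from a single pass, which is why differentiability matters so much in practice.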

The theory of numerical optimization goes well beyond gradient descent and is very cool. Worth checking out if you are interested.

[^1]:

