# Math Review and Notation

## Linear Algebra

Familiarity with linear algebra is a key prerequisite for understanding the rest of the course. Linear operations combined with non-linear activations are the backbone of the modern multi-layer networks we use today. A day-0 review of linear algebra is probably needed to get you started, so this section aims to get you up and running quickly.

#### Scalars, Vectors, Matrices, and Tensors

Some of the definitions used here differ slightly from their namesakes in traditional mathematics and physics.

* **Scalars:**
  * A single number
  * Can be either continuous ($$\mathbb{R}$$ or $$\mathbb{C}$$) or discrete ($$\mathbb{N}, \mathbb{Z},$$ or $$\mathbb{Z}\_2$$)
  * A 0-d tensor, usually denoted with a lowercase Roman letter (in my notes)
* **Vectors:**
  * Mechanistically, an ordered array of scalars
  * Mathematically, a $$d$$-dimensional real vector represents a point in $$\mathbb{R}^d$$
  * A 1-d tensor, usually denoted with a **bold** lowercase Roman letter (in my notes)
* **Matrices:**
  * Serve a dual purpose for us: data storage and linear maps
  * What is an $$N \times M$$ matrix?
    * It could be a greyscale image of $$N \times M$$ pixels
    * It could be a set of $$M$$-dimensional data points arranged in $$N$$ rows
    * It could represent a linear map/transformation/function that takes an $$N$$-dimensional vector and outputs an $$M$$-dimensional vector (under the row-vector convention $$y = xA$$)
  * Usually denoted with an uppercase Roman letter (in my notes).
* **Tensors:**
  * An array of scalars with an arbitrary number of dimensions
  * Used for batches of data or high-dimensional data cubes
  * All data (and weights) in deep learning applications are usually called tensors
    * A set of N color images is represented as a 4-d tensor of shape either {N, 3, H, W} or {N, H, W, 3}
    * A batch of B sequences, each with S tokens embedded in E dimensions, is represented as a tensor of shape {B, S, E}
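
The objects listed above can be sketched as NumPy arrays. This is only an illustration: the sizes (4, 3, 8, 32, and so on) and variable names are made up for the example, and NumPy stands in for whatever tensor library the course uses.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
N, M = 4, 3

scalar = np.float64(2.5)            # 0-d tensor: a single number
vector = np.zeros(N)                # 1-d tensor: a point in R^N
matrix = np.ones((N, M))            # 2-d tensor: N rows, M columns
images = np.zeros((8, 3, 32, 32))   # {N, 3, H, W}: 8 color images of 32 x 32 pixels
tokens = np.zeros((2, 5, 16))       # {B, S, E}: 2 sequences, 5 tokens, 16-d embeddings

for name, t in [("scalar", scalar), ("vector", vector), ("matrix", matrix),
                ("images", images), ("tokens", tokens)]:
    print(name, "ndim =", np.ndim(t), "shape =", np.shape(t))

# A matrix as a linear map (row-vector convention): an N-dimensional input
# multiplied by an N x M matrix gives an M-dimensional output.
x = np.arange(N, dtype=float)
y = x @ matrix
print(y.shape)

# Converting between the two common image layouts: {N, 3, H, W} -> {N, H, W, 3}.
images_nhwc = images.transpose(0, 2, 3, 1)
print(images_nhwc.shape)
```

The number of dimensions (`ndim`) is what these notes call 0-d, 1-d, and so on; the `shape` is the tuple written in braces above.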

#### Functions

A function takes mathematical objects and transforms them according to some rule. Synonyms: map, transformation, operator.

We kept this definition broad because functions can be quite general: they can take a scalar, vector, or arbitrary tensor as input and output a scalar, vector, or arbitrary tensor.

**Notation:** A function mapping an $$m$$-dimensional vector to an $$n$$-dimensional vector is usually written as:

$$
f: \mathbb{R}^m \rightarrow \mathbb{R}^n
$$
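
As a rough illustration of this notation, the sketch below defines two made-up functions in Python with NumPy: one mapping $$\mathbb{R}^3 \rightarrow \mathbb{R}$$ and one mapping $$\mathbb{R}^3 \rightarrow \mathbb{R}^2$$. The names `scalar_fn` and `vector_fn` and the formulas inside them are hypothetical, chosen only to show the input and output dimensions.

```python
import numpy as np

def scalar_fn(x: np.ndarray) -> float:
    """Hypothetical f: R^3 -> R, the sum of squared entries."""
    return float(np.sum(x ** 2))

def vector_fn(x: np.ndarray) -> np.ndarray:
    """Hypothetical f: R^3 -> R^2, here m = 3 and n = 2."""
    return np.array([x[0] + x[1], x[1] * x[2]])

x = np.array([1.0, 2.0, 3.0])
print(scalar_fn(x))        # 14.0
print(vector_fn(x).shape)  # (2,)
```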

**Scalar Functions**

**Vector Functions**

**Task:**\
1\) Two matrices $$A$$ and $$B$$ satisfy $$AB = BA$$. What can you say about the properties of $$A$$ and $$B$$?\
2\) What happens if you train your neural network on data that is not a function, that is, where the same input appears with different outputs?

## Calculus

#### Partial Derivatives

As you've probably surmised by now, neural networks almost always involve differentiable functions of multiple variables. So naturally, we are going to take lots and lots of partial derivatives. While you will rarely compute these derivatives by hand, it is good to review the basics. We sometimes have to resort to inspecting these values when debugging particularly nasty scenarios, where we have no idea why our model is not learning.

The partial derivative of a function $$f(x, y)$$ with respect to $$x$$ is denoted as $$\frac{\partial f}{\partial x}$$. It measures how $$f$$ changes as $$x$$ varies while $$y$$ is held fixed.
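
One way to build intuition, and a common debugging trick, is to approximate a partial derivative numerically with a central difference. The sketch below is a minimal example under stated assumptions: the function `f`, the helper names `partial_x` and `partial_y`, and the step size `h` are all made up for illustration.

```python
def f(x: float, y: float) -> float:
    # Hypothetical example function: f(x, y) = x^2 * y + y^3
    return x ** 2 * y + y ** 3

def partial_x(func, x: float, y: float, h: float = 1e-5) -> float:
    """Central-difference estimate of df/dx: nudge x, hold y fixed."""
    return (func(x + h, y) - func(x - h, y)) / (2 * h)

def partial_y(func, x: float, y: float, h: float = 1e-5) -> float:
    """Central-difference estimate of df/dy: nudge y, hold x fixed."""
    return (func(x, y + h) - func(x, y - h)) / (2 * h)

x0, y0 = 2.0, 3.0
approx_dx = partial_x(f, x0, y0)   # analytic value: 2xy = 12
approx_dy = partial_y(f, x0, y0)   # analytic value: x^2 + 3y^2 = 31
print(approx_dx, approx_dy)
```

The estimates should agree with the analytic values up to a small numerical error, which is the kind of check one can run when a hand-derived derivative is in doubt.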

#### Gradient vs Jacobian vs Hessian

* **Gradient:**
  * *Vector* of partial derivatives of a scalar function
* **Jacobian:**
  * *Matrix* of partial derivatives of a vector function
  * A generalization of the gradient to functions whose output dimension is greater than 1
* **Hessian:**
  * *Matrix* of second-order partial derivatives of a scalar function
  * Equivalently, the Jacobian of the gradient of a scalar function
  * The Hessian of a function $$f$$, often denoted $$H\_f$$, measures the curvature of the function
  * Symmetric whenever the second partial derivatives are continuous, so it equals its own transpose
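
The three objects can be compared side by side with finite differences. The following is a sketch, not a reference implementation: the helper `numerical_jacobian`, the example functions `f` and `g`, and the step sizes are assumptions made for illustration. The Hessian is computed as the Jacobian of the (numerical) gradient, mirroring the description above.

```python
import numpy as np

def numerical_jacobian(func, x, h=1e-5):
    """Central-difference Jacobian of func: R^m -> R^n, returned with shape (n, m)."""
    x = np.asarray(x, dtype=float)
    n = np.atleast_1d(func(x)).size
    jac = np.zeros((n, x.size))
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = h
        # Column j holds the derivative of every output with respect to x_j.
        jac[:, j] = (np.atleast_1d(func(x + step)) - np.atleast_1d(func(x - step))) / (2 * h)
    return jac

def f(x):
    # Hypothetical scalar function R^3 -> R
    return x[0] ** 2 * x[1] + np.sin(x[2])

def g(x):
    # Hypothetical vector function R^3 -> R^2
    return np.array([x[0] * x[1], x[1] + x[2] ** 2])

def grad_f(x):
    # The gradient is the single row of the Jacobian of a scalar function.
    return numerical_jacobian(f, x)[0]

x0 = np.array([1.0, 2.0, 0.5])
gradient = grad_f(x0)                              # vector
jacobian = numerical_jacobian(g, x0)               # matrix
hessian = numerical_jacobian(grad_f, x0, h=1e-4)   # Jacobian of the gradient

print(gradient.shape, jacobian.shape, hessian.shape)
print(np.allclose(hessian, hessian.T, atol=1e-4))  # symmetry check
```

Nested finite differences amplify rounding error, which is why the outer step is larger here; automatic differentiation is the usual choice in practice.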

**Task:**

1\) For a scalar function $$f: \mathbb{R}^{16} \rightarrow \mathbb{R}$$, what is the shape of its Hessian matrix $$H\_f$$?\
2\) We have a vector function $$g: \mathbb{R}^4 \rightarrow \mathbb{R}^2$$ such that $$g(x) = x W$$, where $$W$$ is a $$4 \times 2$$ projection matrix given by $$W = \begin{bmatrix} W\_{11} & W\_{12} \\ W\_{21} & W\_{22} \\ W\_{31} & W\_{32} \\ W\_{41} & W\_{42} \end{bmatrix}$$. Can you write the Jacobian, $$J\_g$$?\
3\) For the above function, how can you verify that your Jacobian is correct? Can you write a program that numerically calculates the Jacobian for a random input $$x$$ and projection matrix $$W$$?

## Probability

**Task:**

1\) **True or False:** Any distribution can be approximated by an empirical distribution given a sufficiently large number of independent and identically distributed samples. If true, why? If false, provide a counterexample.\
2\) Can the value of the probability density function be greater than 1 at any point? \
3\) You know your data comes from a multi-modal distribution. Should you report only the central tendencies of your data? What alternatives would you try? (Hint: There's no consensus, so feel free to be creative, but validate your intuition with experiments.)