# Maximizing Your Supercomputer

You can only go so far on a single GPU, even though the current crop of GPUs offers an eye-watering amount of compute and memory. When I started my Ph.D., my advisor got us a workstation with a Titan V, and at the time I thought I was the king of the world. I could train models in minutes (once I figured out how to build TensorFlow 1 with CUDA support). By the end, a system with 4 A100s didn't seem like enough.

What happened? Scaling laws, large language models, attention, internet-scale datasets, and more. High-performance computing has become intrinsically connected with deep learning (not that it wasn't before). Models and datasets are larger, interconnects are faster, so high-performance deep learning (HPDL) not only makes sense, but is inescapable.

### Why distribute?

When we looked into FlashAttention, an interesting (to me) fact was that models were memory-bound. We were bottlenecked by how quickly we could load bits from GPU memory to registers and write back to memory. Unfortunately, we can't really make the memory bus any faster or wider. But we CAN have more of them if we use multiple GPUs.
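To make "memory-bound" concrete, here is a rough roofline-style check. It is a minimal sketch, not a profiler: the peak throughput and bandwidth constants are assumed, approximate figures for an A100-class GPU, and the function names are hypothetical helpers made up for this illustration.

```python
# Rough roofline check: is a kernel limited by memory bandwidth or by compute?
# ASSUMPTION: approximate peak numbers for an A100-class GPU, for illustration only.
PEAK_FLOPS = 312e12        # assumed peak FP16 throughput, FLOP/s
PEAK_BANDWIDTH = 1.555e12  # assumed peak memory bandwidth, bytes/s
BYTES_PER_ELEMENT = 2      # FP16


def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs performed per byte moved to or from GPU memory."""
    return flops / bytes_moved


def is_memory_bound(flops: float, bytes_moved: float) -> bool:
    """Memory-bound means the intensity sits below the hardware ridge point."""
    ridge_point = PEAK_FLOPS / PEAK_BANDWIDTH  # FLOP per byte needed to saturate compute
    return arithmetic_intensity(flops, bytes_moved) < ridge_point


# Element-wise add of two vectors: 1 FLOP per element, 2 reads + 1 write.
n = 1 << 20
add_flops = n
add_bytes = 3 * n * BYTES_PER_ELEMENT

# Square matmul: 2 * m^3 FLOPs, reading two m x m matrices and writing one.
m = 4096
matmul_flops = 2 * m**3
matmul_bytes = 3 * m * m * BYTES_PER_ELEMENT

print("element-wise add memory-bound:", is_memory_bound(add_flops, add_bytes))
print("large matmul memory-bound:", is_memory_bound(matmul_flops, matmul_bytes))
```

Under these assumed numbers, the element-wise add moves far more bytes than it computes on, while the large matmul reuses each loaded value many times. Real kernels rarely hit either peak, so treat this as a back-of-the-envelope estimate.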

The story of distributing neural networks is the story of circumventing the memory wall (in quantity or bandwidth) by introducing communication. Of course, communication between devices will always be slower than data movement on the chip. So we have to be clever about it: either hide the communication behind compute or minimize how much of it we do.
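A toy cost model shows why hiding communication behind compute matters. This is a sketch under simplifying assumptions: `step_time` and its `overlap` parameter are hypothetical names, and the timings are made-up numbers, not measurements.

```python
def step_time(compute_ms: float, comm_ms: float, overlap: float) -> float:
    """Estimated time for one training step.

    `overlap` is the fraction of communication we manage to run concurrently
    with compute: 0.0 means fully serialized, 1.0 means as much as possible
    is hidden. Communication can only hide behind compute that exists, so the
    hidden portion is capped at `compute_ms`.
    """
    if not 0.0 <= overlap <= 1.0:
        raise ValueError("overlap must be between 0 and 1")
    hidden_ms = min(overlap * comm_ms, compute_ms)
    return compute_ms + comm_ms - hidden_ms


# Made-up example: 100 ms of compute and 40 ms of communication per step.
for overlap in (0.0, 0.5, 1.0):
    print(f"overlap={overlap}: {step_time(100.0, 40.0, overlap)} ms")
```

With full overlap the step costs only as much as the longer of the two phases. Once communication takes longer than compute, no amount of overlap helps, and the only remaining option is to communicate less.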

### How to distribute?

Distributing the compute forces us to consider how and where to slice up the data. There are multiple dimensions along which we can distribute our workload. We are going to look into the following:

* Data parallelism
* Model / Tensor parallelism
* Pipeline parallelism
* Domain / Sequence / Context parallelism
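To give a feel for what "slicing along a dimension" means, the sketch below splits a single linear layer's forward pass two ways in plain Python: along the batch dimension (the data-parallel view) and along the weight columns (the tensor-parallel view). Everything here is an illustrative assumption: the helper names are hypothetical, the "workers" are just list slices, and no real communication or gradient synchronization happens.

```python
def matmul(a, b):
    """Naive matrix multiply on lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def split_rows(matrix, parts):
    """Slice a matrix into `parts` equal groups of rows."""
    size = len(matrix) // parts
    return [matrix[i * size:(i + 1) * size] for i in range(parts)]


def split_cols(matrix, parts):
    """Slice a matrix into `parts` equal groups of columns."""
    size = len(matrix[0]) // parts
    return [[row[i * size:(i + 1) * size] for row in matrix] for i in range(parts)]


x = [[1, 2], [3, 4], [5, 6], [7, 8]]  # batch of 4 samples, 2 features each
w = [[1, 0, 2, 1], [0, 1, 1, 3]]      # weights: 2 inputs -> 4 outputs

reference = matmul(x, w)  # what a single device would compute

# Data-parallel view: each worker holds all the weights and a slice of the batch.
dp_out = [row for shard in split_rows(x, 2) for row in matmul(shard, w)]

# Tensor-parallel view: each worker holds the full batch and a slice of the weight columns.
tp_shards = [matmul(x, w_shard) for w_shard in split_cols(w, 2)]
tp_out = [left + right for left, right in zip(*tp_shards)]  # concatenate output columns

print(dp_out == reference, tp_out == reference)
```

Both slicings reproduce the single-device result; what differs is which pieces each worker must hold and what has to be exchanged to reassemble the output. Pipeline parallelism slices along the layer dimension and sequence parallelism along the token dimension, which this single-layer toy does not show.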

{% hint style="warning" %}
Pedagogically, the high-performance distributed training lessons are difficult because students can't get practical experience. Unfortunately, we don't have a large enough education cluster for students to run and benchmark their own distributed training runs and explore different modes of parallelism. So these notes serve as a survey, along with tips and tricks I picked up while working on HPDL.
{% endhint %}

