Ampere architecture deep dive: GA100 without RT cores and details on the new NVSwitch

In a so-called deep dive, Jonah Alben, senior vice president of GPU engineering at NVIDIA, answered further questions and shared more details about the Ampere architecture. We have already covered all the important details of the Ampere architecture. In addition, we have already analyzed how much GeForce there is in the GA100 GPU.

Anyone expecting new insights into a possible next generation of GeForce cards will be disappointed, because NVIDIA currently only talks about the A100 Tensor Core GPU or the GA100 GPU in the version used for the A100.

First of all, it should be noted that the Ampere and Volta architectures are more similar than they might appear at first glance. The GA100 GPU has no RT cores, so it cannot offer hardware acceleration for ray tracing calculations. In addition, Alben confirmed that a total of 48 MB of L2 cache is physically present, but only 40 MB are enabled, analogous to the use of only 108 of the 128 SMs of the full configuration. In the topology, NVIDIA provides eight so-called slices of 512 kB each per memory controller. 8 x 512 kB x 12 memory controllers corresponds to 49,152 kB and therefore 48 MB of L2 cache in total.
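The cache arithmetic above can be checked in a few lines. This is only a sketch that restates the figures quoted in the article; the variable names are chosen for illustration and are not NVIDIA terminology.

```python
# Figures as quoted in the article; names are illustrative only.
SLICES_PER_CONTROLLER = 8
KB_PER_SLICE = 512
MEMORY_CONTROLLERS = 12

# Total L2 cache physically present on the full GA100.
total_l2_kb = SLICES_PER_CONTROLLER * KB_PER_SLICE * MEMORY_CONTROLLERS
total_l2_mb = total_l2_kb // 1024

# Resources enabled on the A100 versus the full chip.
ENABLED_L2_MB = 40
FULL_SMS = 128
ENABLED_SMS = 108

l2_fraction = ENABLED_L2_MB / total_l2_mb
sm_fraction = ENABLED_SMS / FULL_SMS

print(f"Total L2: {total_l2_kb} kB = {total_l2_mb} MB")
print(f"L2 enabled: {l2_fraction:.0%}, SMs enabled: {sm_fraction:.0%}")
```

Both ratios come out at a little over five sixths, which fits the comparison drawn between the cache and the SM count.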

This leads us to memory. NVIDIA uses 40 GB of HBM2 (5 x 8 GB) for the A100 Tensor Core GPU. The memory interface, which is physically 6,144 bits wide, is only 5,120 bits wide in practice because only five of the six memory stacks are used. The sixth memory stack is not a dummy, but a functional memory stack that is simply not used.
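The interface width follows directly from the number of active stacks. A minimal sketch, assuming the full 6,144-bit interface is split evenly across the six stacks (1,024 bits each, the usual HBM2 stack width); the names are illustrative:

```python
# Figures as quoted in the article; names are illustrative only.
FULL_STACKS = 6
ACTIVE_STACKS = 5
FULL_BUS_BITS = 6144
GB_PER_STACK = 8

# Assumption: the interface is divided evenly across all stacks.
bits_per_stack = FULL_BUS_BITS // FULL_STACKS

# Width and capacity with only five stacks in use.
active_bus_bits = bits_per_stack * ACTIVE_STACKS
capacity_gb = GB_PER_STACK * ACTIVE_STACKS

print(f"{bits_per_stack} bit per stack, {active_bus_bits} bit active, {capacity_gb} GB")
```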

Taken together, NVIDIA leaves a considerable share of the available resources unused. There is presumably a reason for this: the production yield is likely not yet good enough to guarantee a configuration with more than 108 SMs.

To keep the thermal design power at 400 W, NVIDIA limits the GPU clock to 1,410 MHz for the given computing power. However, this cannot serve as an indicator for a GeForce offshoot of the Ampere architecture either, since the chip size there will impose completely different conditions.

Until further notice, there will be no PCI Express variant of the A100 Tensor Core GPU. NVIDIA only supplies the HGX A100 with four SXM4 modules that are directly connected via NVLink, as well as the DGX variants with eight or 16 SXM4 modules, which then use the second-generation NVSwitch.

NVSwitch 2.0: 6 billion transistors manufactured in 7FF

NVLink and the new NVSwitches are of great importance in the infrastructure of the A100 Tensor Core GPUs. When connected directly to one another, the A100 accelerators achieve an NVLink data rate of 100 GB/s between each other. Systems with more than four GPUs use the new NVSwitches.

These are manufactured by TSMC in 7FF and have 6 billion transistors. The first NVSwitches were still manufactured in 12 nm and have two billion transistors, so the complexity has increased by a factor of three. Each of the NVSwitches has 36 NVLink ports with a data rate of 25 GB/s per direction per port (50 GB/s bidirectional).

In the A100 systems, the NVSwitches provide a GPU-to-GPU bandwidth of 600 GB/s. This is double the bandwidth of the Tesla V100. The 600 GB/s per GPU are realized via 12 NVLink ports.
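The bandwidth and transistor figures can be cross-checked against each other. This is a rough sketch using only the numbers quoted above. It assumes the 600 GB/s is a bidirectional total spread evenly over the 12 ports, and the V100 value is merely what the stated doubling implies, not a separately sourced figure.

```python
# Figures as quoted in the article; names are illustrative only.
GPU_BANDWIDTH_GBPS = 600
NVLINK_PORTS_PER_GPU = 12

# Assumption: 600 GB/s is a bidirectional total spread evenly over all ports.
per_port_bidirectional = GPU_BANDWIDTH_GBPS / NVLINK_PORTS_PER_GPU
per_port_per_direction = per_port_bidirectional / 2

# Implied by the statement that the A100 doubles the Tesla V100 bandwidth.
implied_v100_gbps = GPU_BANDWIDTH_GBPS / 2

# Transistor budget of the new NVSwitch versus the first generation.
transistor_ratio = 6_000_000_000 // 2_000_000_000

print(f"Per port: {per_port_bidirectional} GB/s bidirectional, "
      f"{per_port_per_direction} GB/s per direction")
print(f"Implied V100 bandwidth: {implied_v100_gbps} GB/s")
print(f"Transistor ratio: {transistor_ratio}x")
```

Under that assumption each port carries 25 GB/s in each direction, which matches the per-port figure given for the switch.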


