
It has taken untold thousands of people to make machine learning, and specifically the deep learning variety, the most viable form of artificial intelligence. And this is so true today that people just say AI for all three because the distinction is academic.

One of the key researchers who has been there from the very beginning of GPU compute and the machine learning revolution is Ian Buck, general manager of the accelerated computing business at Nvidia. Buck got his bachelor's in computer science at Princeton University in 1999 and moved to Stanford University for his PhD, where he was part of a team that figured out how to make GPUs do unnatural acts like massive amounts of math calculations in parallel when they would rather be turning pixels on and off in a video game.

Among other things, Buck is the creator of the Brook stream processing programming language, which was breaking ground on general purpose GPU compute back in 2004 on both ATI and Nvidia GPUs, which is when Buck joined Nvidia to become a systems engineer working on what would become the CUDA environment. He spoke to us about Brook and CUDA back when The Next Platform was founded in 2015, and it is still an interesting read to see how far we have come in GPU computing in more than two decades.

Under normal circumstances – meaning before the coronavirus pandemic – we would have had lunch with Buck somewhere off the beaten path to have a chat, but for this GTC 2022 spring conference we had to settle for a Zoom call to talk about the ever-decreasing size of computation and the ever-increasing throughput coming out of GPU compute engines.

Timothy Prickett Morgan: I understand why increasingly lower precision is sometimes useful, particularly for machine learning inference, as we have seen happen with integer formats down to INT8 and even INT4. But up until now, the low end of floating point has been stuck at FP16 half precision, which is among the mix of floating point precisions used for machine learning training, along with FP32 single precision and a smattering of FP64 double precision. Why FP8 quarter precision, and how significant is it that the same format can be used for both machine learning training and inference? The Hopper GH100 GPU has 2 petaflops of FP8 performance in the new fourth-generation Tensor Core, and 4 petaflops with sparse data.
Ian Buck: Obviously, with the reduced precision, you can build faster and faster GPUs.

TPM: Well, they are not really getting much faster, they are just getting fatter with skinnier datasets – they are getting more capacious, really.

Ian Buck: Well, in the end, they get faster, because what people care about is how much work they get done. To be honest, building an ALU that can do a multiply-add is relatively straightforward – and even though I don't want to offend anybody, I probably will by saying that. The trick, the art, the skill of doing an FP8 operation, to make it work and be successful, is doing so by operating with two or three bits of mantissa. You might have four or five bits of exponent.

But it is a small representation, and we can make it work because AI is fundamentally a statistical problem – you are working on probabilities at the layer level and that kind of stuff. But making it work well, and making it able to train a model like GPT-3 or Megatron 530B, is where the art is. So what Hopper does that is unique is that it actually implements what we call the Transformer Engine, which is a combination of hardware and software. We built a brand new Tensor Core that has the FP8 capability, but it is the Transformer Engine that has special functions for collecting statistics and adjusting the range and bias of the computation on a layer-by-layer basis during the training run.
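To put rough numbers on the bit layouts Buck describes, here is a minimal Python sketch that estimates the dynamic range and relative step size of an IEEE-style binary format from its exponent and mantissa widths. The E4M3 and E5M2 labels follow the published FP8 proposal for deep learning; the helper itself is our own illustration, not Nvidia's definition.

def format_summary(exp_bits, man_bits):
    # IEEE-style estimate: exponent bias, largest normal value, smallest
    # normal value, and the relative spacing between values near 1.0.
    bias = 2 ** (exp_bits - 1) - 1
    max_normal = (2 - 2.0 ** -man_bits) * 2.0 ** bias
    min_normal = 2.0 ** (1 - bias)
    rel_step = 2.0 ** -man_bits
    return max_normal, min_normal, rel_step

for name, e, m in [("FP16", 5, 10), ("FP8 E5M2", 5, 2), ("FP8 E4M3", 4, 3)]:
    hi, lo, step = format_summary(e, m)
    print(f"{name}: max ~{hi:g}, min normal ~{lo:g}, step near 1.0 ~{step:g}")
# Note: the E4M3 variant actually deployed for FP8 reclaims special encodings
# to reach a maximum finite value of 448, not the 240 estimated here.

Running it shows the tradeoff: five exponent bits (E5M2) keep nearly all of FP16's range but with values spaced roughly 25 percent apart, while four exponent bits and three mantissa bits (E4M3) trade range for finer steps of roughly 12.5 percent.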

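That narrowness is why the statistics matter. As a thought experiment, here is a minimal NumPy sketch of the per-layer scaling idea Buck describes, with made-up helper names: track the absolute maximum seen in a layer, derive a scale that maps it near the top of the FP8 range, and fake the FP8 cast by rounding to a few mantissa bits. It illustrates the concept only and is not the Transformer Engine API.

import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value in the E4M3 flavor of FP8

def compute_scale(amax_history):
    # Derive a per-layer scale from recent absolute-max statistics so the
    # largest observed value lands near the top of the representable range.
    amax = max(amax_history)
    return FP8_E4M3_MAX / amax if amax > 0 else 1.0

def fake_quantize_fp8(x, scale):
    # Simulate an FP8 cast: scale into range, keep roughly three fraction
    # bits per value, then unscale for the next layer.
    scaled = np.clip(x * scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    exp = np.floor(np.log2(np.maximum(np.abs(scaled), 2.0 ** -6)))
    step = 2.0 ** (exp - 3)
    return np.round(scaled / step) * step / scale

# Statistics collected during the training run drive the per-layer scale.
activations = np.random.randn(4, 8).astype(np.float32) * 3.0
amax_history = [float(np.abs(activations).max())]
scale = compute_scale(amax_history)
approx = fake_quantize_fp8(activations, scale)
print("max abs error:", float(np.abs(approx - activations).max()))

With only a couple of mantissa bits, a badly chosen scale throws away most of the signal, which is why, as Buck says, the Transformer Engine keeps collecting those layer-level statistics and readjusting as the training run proceeds.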