


 Google's Tensor Processing Unit 

Joined: Sat Aug 22, 2015 6:26 am
Posts: 40
Google's special hardware for neural net calculations:

https://cloud.google.com/blog/big-data/ ... g-unit-tpu


Tue May 16, 2017 5:12 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1202
An interesting device, to be sure. It's for evaluating neural networks, not for training them: it handles requests in milliseconds, presumably from everyone doing OK Google voice queries, as well as from placing ads, recommending videos and filtering Gmail spam.
Quote:
The TPU includes the following computational resources:
    Matrix Multiplier Unit (MXU): 65,536 8-bit multiply-and-add units for matrix operations
    Unified Buffer (UB): 24MB of SRAM that work as registers
    Activation Unit (AU): Hardwired activation functions


The main horsepower is in the low-precision multiplies (92 teraops per second at 700MHz).
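The quoted figure checks out with a quick back-of-envelope calculation (my arithmetic, not Google's), counting each multiply-and-add unit as two operations per cycle:

```python
# Back-of-envelope check of the 92 teraops/s figure: 65,536 MAC units,
# each doing one multiply and one add (2 ops) per 700 MHz clock cycle.
mac_units = 65_536
ops_per_mac_per_cycle = 2          # multiply + add
clock_hz = 700e6

teraops = mac_units * ops_per_mac_per_cycle * clock_hz / 1e12
print(f"{teraops:.1f} teraops/s")  # 91.8 teraops/s, matching the ~92 quoted
```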

More remarkable: Google is presumably only presenting this now because it's yesterday's technology, and they must already have the next generation deployed.


Tue May 16, 2017 7:36 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 904
Location: Canada
Exciting to read about something that's not so von Neumann. Thanks for the post. I'm reminded of the Cray for some reason, where they wanted good scalar performance as well as vector operations. I'll bet the software making use of the TPU is quite a piece of work as well.
Some inexpensive FPGAs have dozens or even hundreds of DSP multiplier blocks. That opens the possibility of making an accelerator (matrix multiplier) for neural networks, albeit on a smaller scale.
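The core operation such an accelerator would parallelise is simple enough to sketch. Here's a toy software version of the 8-bit multiply-accumulate matrix product (names and structure are mine, purely illustrative; real DSP blocks do each inner step in one cycle):

```python
# Toy 8-bit matrix multiply-accumulate: the operation each DSP block
# (or each of the TPU's MAC units) would perform in hardware, in parallel.
def matmul_mac(a, b):
    """a: n x k, b: k x m, 8-bit integer entries; accumulate in wider ints."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = 0                       # a 32-bit accumulator in real hardware
            for t in range(k):
                acc += a[i][t] * b[t][j]  # the multiply-and-add step
            c[i][j] = acc
    return c

print(matmul_mac([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```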

_________________
Robert Finch http://www.finitron.ca


Tue May 16, 2017 9:50 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1202
Looks like Google have released some details of their second generation: they put 4 chips on a board, call that a Cloud TPU, rated at 180 teraflops, and then connect 64 boards in an 8x8 torus, which all fills up a pair of double-width racks. That might be called a TPU pod. Possibly two more adjacent double-width racks are also a necessary part of the pod. Then they make it all available as a service; sign up now if you have an appropriate research project. Most important, perhaps, is that this new TPU is supposedly good both for inference and for training, which probably means it does at least 16-bit arithmetic.
https://blog.google/topics/google-cloud ... -learning/
More photos
https://www.tensorflow.org/tfrc/

All that said, no detail about the internal architecture.

"Using these TPU pods, we've already seen dramatic improvements in training times. One of our new large-scale translation models used to take a full day to train on 32 of the best commercially-available GPUs—now it trains to the same accuracy in an afternoon using just one eighth of a TPU pod."
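Putting the published numbers together (my arithmetic, not Google's), an eighth of a pod is 8 boards, and the figures work out like this:

```python
# Scale of a second-generation TPU pod, from the figures in the post above.
board_tflops = 180        # one Cloud TPU board (4 chips)
boards_per_pod = 64       # connected in an 8x8 torus

pod_pflops = board_tflops * boards_per_pod / 1000
eighth_pflops = pod_pflops / 8    # the fraction quoted for the translation model
print(pod_pflops, eighth_pflops)  # 11.52 petaflops per pod; 1.44 for one eighth
```

So the "one eighth of a TPU pod" in the quote is roughly 1.4 petaflops (at TPU precision) standing in for 32 GPUs.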


Thu May 18, 2017 7:24 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1202
Just a note: Google say "The TensorFlow Research Cloud is a cluster of 1,000 Cloud TPUs that provides the machine learning research community with a total of 180 petaflops of raw compute power — at no charge — to support the next wave of breakthroughs." Compare that with Sunway TaihuLight, the top supercomputer of 2016, which offers 93 petaflops (net, LINPACK) - in other words, apart from a factor of four in precision and presumably a big difference in memory bandwidth, comparable.
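The comparison in raw numbers (again my arithmetic, using the figures quoted above):

```python
# TensorFlow Research Cloud vs. Sunway TaihuLight, headline figures only.
tpus = 1_000
tflops_each = 180                          # one Cloud TPU board
trc_pflops = tpus * tflops_each / 1000     # 180 petaflops, as Google states
taihulight_pflops = 93                     # LINPACK Rmax, 64-bit floats

print(f"{trc_pflops / taihulight_pflops:.1f}x")  # ~1.9x the raw rate,
# but at roughly a quarter of the precision, so "comparable" is fair
```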


Fri May 19, 2017 7:26 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 904
Location: Canada
They are getting a lot of the "petaflops" from the matrix multiply operations. It seems like the petaflops are calculated; have they been measured? Unless one has an app that does a lot of matrix multiplies, performance may not be so fantastic. I wonder how the TPU compares, petaflops-wise, across a broad range of applications. It may give fantastic performance on neural network software, but how is it otherwise?

I want one for my PC.

_________________
Robert Finch http://www.finitron.ca


Sat May 20, 2017 11:33 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1202
As the new TPU is, I think, only 16-bit precision, it would be hard-pressed to earn a LINPACK rating, which demands 64-bit floats. So we can talk about peta-ops, but not quite about petaflops. Today's supercomputers seem to achieve, in LINPACK results, better than half their theoretical peak flop rating. But of course you're right, performance for any given application will vary. The supercomputer can run many applications; the TPU is maybe a different kind of animal, best measured as inferences per second, or something like that. This table is, I think, from the first-generation TPU:
Attachment: Inferences-per-second.png


via https://www.nextplatform.com/2017/04/05 ... hitecture/
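On the "better than half of peak" point, TaihuLight's own published TOP500 figures make a concrete example (numbers from the TOP500 list, not from this thread):

```python
# Sunway TaihuLight, TOP500 figures: measured LINPACK vs. theoretical peak.
rpeak = 125.4   # theoretical peak, petaflops
rmax = 93.0     # measured LINPACK (Rmax), petaflops

print(f"{rmax / rpeak:.0%} of peak")  # about 74%, comfortably better than half
```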


Sat May 20, 2017 11:51 am