Hot Chips 31 Live Blogs: NVIDIA Multi-Chip AI Accelerator at 128 TOPS

Thursday, August 22nd, 2019 - Artificial Intelligence, Technology

02:04PM EDT – NVIDIA announced at a VLSI conference last year that it had designed a test multi-chip solution for DNN computations. The company is explaining the technology today at Hot Chips, with the idea that what they’ve created could be a stepping stone for future monetizable products.

02:04PM EDT – This is a test chip

02:04PM EDT – NVIDIA research does many test chips every year

02:05PM EDT – This work is about multi-chip DL inference

02:05PM EDT – CNN was a target for this test chip

02:06PM EDT – System is configurable for scale

02:09PM EDT – 36 small chips

02:09PM EDT – large scale inference accelerators

02:09PM EDT – three key objectives

02:09PM EDT – high inference scaling and performance

02:09PM EDT – each chip could be a DL edge inference accelerator

02:09PM EDT – many chips enabled data-center scale throughput

02:09PM EDT – network on package architecture

02:10PM EDT – Ground Reference Signalling as an MCM interconnect

02:10PM EDT – Chiplets enable reuse and lower cost

02:10PM EDT – Assemble existing chips together

02:10PM EDT – NoC uses RISC-V

02:10PM EDT – 20ns per hop

02:10PM EDT – Network on chip and network on package

02:11PM EDT – Ground Reference Signalling – low voltage signalling, up to 1.75 pJ/bit, up to 25 Gb/s per pin

02:11PM EDT – Single ended links
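
A quick back-of-the-envelope check on the GRS figures above – a sketch only, assuming both quoted maxima apply at the same operating point, which the talk does not state:

```python
# Per-pin GRS power at the quoted ceilings (assumption: 1.75 pJ/bit and
# 25 Gb/s per pin hold simultaneously; real operating points may be lower).
energy_per_bit_pj = 1.75   # pJ/bit, quoted upper bound
pin_rate_gbps = 25.0       # Gb/s per pin, quoted upper bound

pin_power_mw = pin_rate_gbps * 1e9 * energy_per_bit_pj * 1e-12 * 1e3
print(f"~{pin_power_mw:.0f} mW per pin at full rate")  # ~44 mW
```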

02:12PM EDT – Tiled Architecture with Distributed Memory

02:12PM EDT – RISC-V controller is a chip controller

02:13PM EDT – 8 Vector MACs per PE

02:13PM EDT – Processing Engine

02:13PM EDT – Each chip is 12 PEs, each package is 6×6 chips

02:14PM EDT – PE – 8 MACs, chip is 96 MACs, package is 3456 MACs
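
Tallying those figures (a simple count; per-cycle throughput of each MAC unit is not stated here):

```python
# MAC units per level of the hierarchy, from the quoted numbers.
macs_per_pe = 8
pes_per_chip = 12
chips_per_package = 6 * 6

macs_per_chip = macs_per_pe * pes_per_chip            # 96
macs_per_package = macs_per_chip * chips_per_package  # 3456
print(macs_per_chip, macs_per_package)
```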

02:15PM EDT – Designed for CNNs

02:15PM EDT – Can do different tiling strategies
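
As an illustration of one possible strategy – hypothetical, since the talk does not spell out the per-layer scheme – a convolution layer's output channels could be sharded across the 36 chips:

```python
# Illustrative sketch only: output-channel partitioning of a conv layer
# across chips. Names and shapes below are hypothetical.
import numpy as np

def split_conv_by_output_channel(weights, n_chips):
    """weights: (out_ch, in_ch, kh, kw) -> list of per-chip weight shards."""
    return np.array_split(weights, n_chips, axis=0)

weights = np.random.randn(512, 256, 3, 3).astype(np.float32)  # hypothetical layer
shards = split_conv_by_output_channel(weights, n_chips=36)
print([s.shape[0] for s in shards][:4])  # output channels handled by the first chips
```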

02:17PM EDT – Multicast support

02:17PM EDT – Extracting model parallelism using the NoP and NoC

02:18PM EDT – TSMC 16nm, 2.5mm x 2.4mm each

02:18PM EDT – 100 Gbps per link

02:18PM EDT – 9.5 TOPS/W, 128 TOPS
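
Taking the two headline figures together implies roughly the following power budget – an assumption, since the two numbers may not describe the same operating point:

```python
# Rough implied power from the headline numbers.
tops = 128
tops_per_watt = 9.5
package_watts = tops / tops_per_watt   # ~13.5 W for the 36-chip package
per_chip_watts = package_watts / 36    # ~0.37 W per chiplet
print(f"{package_watts:.1f} W package, {per_chip_watts:.2f} W per chip")
```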

02:18PM EDT – 6 months from spec to tapeout

02:19PM EDT – Designed in high level synthesis

02:19PM EDT – Agile VLSI Design

02:19PM EDT – Continuous integration with automated tool flows

02:20PM EDT – C++ to Gates design in 12 hours

02:20PM EDT – MatchLib is open source

02:24PM EDT – Experimental results

02:24PM EDT – Custom PCB with FPGA and DRAM

02:25PM EDT – 27x improvement with 32 chips

02:25PM EDT – GRS uses the most energy at high chip counts

02:25PM EDT – (oh that energy is per image)

02:26PM EDT – At high batch, GRS links are all active all the time, consuming power

02:26PM EDT – No sleep modes enabled with GRS

02:27PM EDT – Again, going to 32 chips, GRS becomes a big energy consumer

02:27PM EDT – 0.11 pJ/op; 2.5K images/sec with 0.4 ms latency on ResNet-50 at batch = 1
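
Two quick consistency checks on those results – assuming the 27x figure is speedup over a single chip, and that one image is in flight at a time at batch = 1:

```python
# Sanity checks on the reported scaling and latency numbers.
speedup, chips = 27, 32
print(f"scaling efficiency ~ {speedup / chips:.0%}")    # ~84%

latency_s = 0.4e-3  # ResNet-50, batch = 1
print(f"1 / latency ~ {1 / latency_s:.0f} images/sec")  # ~2500, matches 2.5K
```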

02:28PM EDT – Q&A time

02:30PM EDT – Q: Results show scaling from 1 to 32 chips, and batch went up to 32 – is it one image per chip, or one image spread across all chips? A: The tiling strategy depends on the layer in the CNN. As batch size scales, there is more computation to distribute, which helps scalability – but it's not a catch-all solution.

02:32PM EDT – Q: 10 ns at 1 GHz? A: About 1.1 GHz at 0.7 volts. It includes partition interface latencies and the latency of the router itself

02:33PM EDT – Q: Physical package? A: Organic substrate. Can be used in 2.5D

02:34PM EDT – That’s a wrap. Next up is Xilinx
