Why can't we train neural networks in a peer-to-peer manner?

Question

I have recently been exposed to the concept of decentralized applications.
I know that neural networks require a lot of parallel computing infrastructure for training.
What technical difficulties might one face when training neural networks in a P2P manner?

Brian O'Donnell · Answer

Data management and bandwidth are the key issues when interconnecting multiple GPUs. They are big enough that other challenges, such as neural network architecture and metrics, are secondary by comparison. The key to interconnecting multiple GPUs successfully within a single computer is NVIDIA's NVLink:
NVLink is a wire-based communications protocol for near-range semiconductor communications developed by Nvidia that can be used for data and control code transfers in processor systems between CPUs and GPUs and solely between GPUs. NVLink specifies a point-to-point connection with data rates of 20 and 25 Gbit/s (v1.0/v2.0) per differential pair.
Compare 25 Gbit/s with a typical peer-to-peer connection over the internet of about 100 Mbit/s. NVLink has a 250x advantage, assuming everything else is equal, which it is not. Considering bandwidth only, a neural network that takes one day to train on a computer with two GPUs connected by NVLink could take 250 days over the internet using two computers with the same GPUs!
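The arithmetic behind that estimate can be sketched in a few lines of Python. This is only a back-of-the-envelope illustration: the 25 Gbit/s and 100 Mbit/s figures come from the comparison above, while the 100-million-parameter model with 32-bit gradients is a hypothetical example I picked to show what one gradient exchange might cost. It ignores latency, compression, and overlap of communication with compute.

```python
# Back-of-the-envelope bandwidth comparison (bandwidth only, nothing else).
NVLINK_BITS_PER_S = 25e9      # 25 Gbit/s, NVLink 2.0 per differential pair
INTERNET_BITS_PER_S = 100e6   # 100 Mbit/s, assumed typical P2P connection


def bandwidth_ratio(fast_bits_per_s: float, slow_bits_per_s: float) -> float:
    """Return how many times faster the fast link is than the slow one."""
    return fast_bits_per_s / slow_bits_per_s


def transfer_seconds(num_params: int, bits_per_param: int, bits_per_s: float) -> float:
    """Time to send one full set of gradients over a link, ignoring latency."""
    return (num_params * bits_per_param) / bits_per_s


ratio = bandwidth_ratio(NVLINK_BITS_PER_S, INTERNET_BITS_PER_S)
days_over_internet = 1.0 * ratio  # a 1-day job scaled by the bandwidth ratio

# Hypothetical model: 100 million parameters, float32 gradients.
HYPOTHETICAL_PARAMS = 100_000_000
BITS_PER_PARAM = 32

nvlink_sync = transfer_seconds(HYPOTHETICAL_PARAMS, BITS_PER_PARAM, NVLINK_BITS_PER_S)
internet_sync = transfer_seconds(HYPOTHETICAL_PARAMS, BITS_PER_PARAM, INTERNET_BITS_PER_S)

print(f"Bandwidth ratio: {ratio:.0f}x")
print(f"1-day job, bandwidth-bound: {days_over_internet:.0f} days")
print(f"One gradient exchange over NVLink pair: {nvlink_sync:.3f} s")
print(f"One gradient exchange over internet:    {internet_sync:.1f} s")
```

Under these assumptions, every synchronization step that is nearly free over NVLink costs tens of seconds over a home connection, and training repeats that step many thousands of times.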
