Fast training for very large neural networks

Prof. Assaf Schuster | Computer Science


Information and Computer Science

The Technology

Modern deep neural networks are composed of millions of parameters and require massive amounts of data and time to train. Steady growth over the years has brought these networks to the point where training one from scratch on a single GPU takes too long. Distributed training can drastically reduce training time. However, stochastic gradient descent (SGD), which is typically used to train these networks, is an inherently sequential algorithm: each update depends on the parameters produced by the previous one. As a result, training deep neural networks on multiple workers (computational devices) is difficult, especially when trying to maintain high efficiency, scalability, and final accuracy. Existing methods suffer from slow convergence and low final accuracy when scaling to large clusters, and often require substantial re-tuning of learning parameters.

Distributed Accelerated Nesterov ASGD (DANA) is a novel asynchronous SGD (ASGD) approach that scales to large clusters while maintaining state-of-the-art accuracy and convergence speed, without re-tuning the parameters used for training on a single worker.


  • DANA achieves state-of-the-art accuracy on existing architectures without any hyperparameter tuning, while scaling well beyond existing ASGD approaches
  • DANA mitigates gradient staleness by computing each worker’s gradients on parameters that more closely resemble the master’s current parameters
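
The staleness idea in the points above can be illustrated with a toy simulation. The sketch below is not the DANA implementation. It is a minimal example, assuming a quadratic loss, round-robin scheduling as a stand-in for real asynchrony, one momentum vector per worker, and a simple look-ahead formula; the function name `simulate` and all of its parameters are hypothetical. With `lookahead=True` the master hands each worker an estimate of where the parameters are heading, rather than its current parameters.

```python
import numpy as np


def simulate(lookahead, n_workers=4, lr=0.01, momentum=0.9, steps=3000, seed=0):
    """Toy round-robin ASGD on f(theta) = 0.5 * ||theta||^2.

    Returns (final_loss, mean_gap). mean_gap is the average distance between
    the point a gradient was computed on and the master's parameters at the
    moment that gradient is applied.
    """
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=10)                    # master parameters
    velocity = np.zeros((n_workers, theta.size))   # one momentum vector per worker (assumption)
    local = np.tile(theta, (n_workers, 1))         # parameters each worker last received
    gaps = []

    for step in range(steps):
        w = step % n_workers                       # round-robin stands in for asynchrony
        gaps.append(np.linalg.norm(local[w] - theta))

        grad = local[w]                            # gradient of 0.5*||x||^2 at the stale point
        velocity[w] = momentum * velocity[w] + grad
        theta = theta - lr * velocity[w]           # master applies this worker's update

        if lookahead:
            # Hand the worker an estimate of the master's future position,
            # built from the momentum of all workers (assumed formula).
            local[w] = theta - lr * momentum * velocity.sum(axis=0)
        else:
            # Plain ASGD: the worker receives the master's current parameters.
            local[w] = theta

    return 0.5 * float(theta @ theta), float(np.mean(gaps))


if __name__ == "__main__":
    for flag in (False, True):
        loss, gap = simulate(lookahead=flag)
        print(f"lookahead={flag}: final loss={loss:.3e}, mean gap={gap:.3e}")
```

Comparing the second return value for the two settings gives a rough sense of how far workers drift from the master in this toy setting. It says nothing about behaviour on real networks or clusters.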

Applications and Opportunities

  • Neural networks
  • Machine learning models
Business Development Contacts
Ofer Shneyour
Director of Business Development, ICT