Fast training for very large NN

Researcher:

Categories:

Information and Computer Science

The Technology

Modern deep neural networks comprise millions of parameters and require massive amounts of data and time to train. Their steady growth over the years has reached the point where training a network from scratch on a single GPU takes too long. Distributed training can drastically reduce training time; however, stochastic gradient descent (SGD), which is typically used to train these networks, is an inherently sequential algorithm. As a result, training deep neural networks on multiple workers (computational devices) is difficult, especially when trying to maintain high efficiency, scalability, and final accuracy. Existing methods suffer from slow convergence and low final accuracy when scaling to large clusters, and often require substantial re-tuning of learning parameters.
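
To make the sequential dependency concrete, here is a minimal sketch on a toy one-dimensional problem, not taken from the DANA work itself. It assumes a quadratic loss f(x) = 0.5 * x**2 and models asynchronous workers simply by computing each gradient on parameters that are a fixed number of updates old. The function name `run_sgd` and all numeric values are illustrative assumptions.

```python
def run_sgd(lr, staleness, steps=200, x0=1.0):
    """Minimise f(x) = 0.5 * x**2 where each gradient is computed on
    parameters that are `staleness` updates old (0 means sequential SGD)."""
    history = [x0]
    for t in range(steps):
        # Parameters the (hypothetical) worker saw when it started its gradient.
        stale_x = history[max(0, t - staleness)]
        grad = stale_x  # derivative of 0.5 * x**2
        history.append(history[-1] - lr * grad)
    return history


sequential = run_sgd(lr=0.5, staleness=0)  # every gradient uses fresh parameters
stale = run_sgd(lr=0.5, staleness=4)       # same learning rate, stale gradients
retuned = run_sgd(lr=0.1, staleness=4)     # stale gradients, smaller learning rate
```

In this toy setting the sequential run shrinks toward zero, the stale run with the same learning rate oscillates with growing amplitude, and lowering the learning rate restores convergence. This mirrors the re-tuning burden described above, though real networks behave far less predictably than a one-dimensional quadratic.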

Distributed Accelerated Nesterov ASGD (DANA) is a novel approach that scales to large clusters while maintaining state-of-the-art accuracy and convergence speed, without requiring parameters to be re-tuned relative to training on a single worker.

Advantages

DANA achieves state-of-the-art accuracy on existing architectures without any hyperparameter tuning, while scaling well beyond existing ASGD approaches
DANA mitigates gradient staleness by computing each worker’s gradients on parameters that more closely resemble the master’s current parameters
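
The staleness point can be illustrated with a small simulation. The sketch below is an assumption-laden toy, not DANA's published update rule: it uses a one-dimensional quadratic loss, a master that applies momentum SGD, and workers served in round-robin order. When `lookahead` is true, the master hands each worker an extrapolated estimate of where its parameters will be once that worker's gradient returns, obtained by projecting the momentum buffer forward. The function `mean_gap` and every parameter value are hypothetical.

```python
def mean_gap(lookahead, workers=5, steps=500, lr=0.001, momentum=0.9):
    """Average distance between the parameters a worker used and the
    master's parameters at the moment that worker's gradient is applied."""
    delay = workers - 1  # updates applied by other workers in the meantime
    x, v = 1.0, 0.0      # master parameters and momentum buffer
    # momentum + momentum**2 + ... + momentum**delay: how far the momentum
    # buffer alone would carry the master over `delay` updates.
    carry = sum(momentum ** k for k in range(1, delay + 1))
    in_flight = [x] * delay  # parameters currently held by busy workers
    total = 0.0
    for _ in range(steps):
        # Hand the next worker an extrapolated estimate or a plain snapshot.
        sent = x + carry * v if lookahead else x
        in_flight.append(sent)
        used = in_flight.pop(0)  # oldest hand-out; its gradient arrives now
        total += abs(used - x)
        grad = used  # gradient of f(x) = 0.5 * x**2 at the worker's parameters
        v = momentum * v - lr * grad
        x += v
    return total / steps


plain_gap = mean_gap(lookahead=False)
lookahead_gap = mean_gap(lookahead=True)
```

Under these assumptions the look-ahead variant leaves a smaller average gap than plain snapshots, because the only remaining error comes from gradients that arrive while the worker is busy. The sketch says nothing about accuracy on real networks; it only shows why parameters closer to the master's current state reduce staleness.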

Applications and Opportunities

Neural networks
Machine learning models

Business Development Contacts

Dr. Arkadiy Morgenshtein

Director of Business Development, ICT

morgenshtein@technion.ac.il
