Evaluating the Scalability of Distributed Neural Networks in High-Performance Computing
SNVASRK Prasad, Aythepally Lakshmi Narayana, Prasanthi Potnuru
This study investigates the scalability of distributed neural networks (DNNs) in high-performance
computing (HPC) environments, focusing on the comparative analysis of horizontal and vertical scaling
methods. By distributing neural network training across multiple nodes (horizontal scaling) and upgrading
the hardware of individual nodes (vertical scaling), we assess key metrics such as training time, speedup,
efficiency, and resource utilization. Our experimental
results demonstrate that horizontal scaling significantly reduces training time but introduces challenges
in efficiency due to communication overhead and synchronization costs. Conversely, vertical scaling
offers improved resource utilization and maintains high efficiency, though its scalability is constrained
by hardware limitations. A hybrid approach, combining both scaling strategies, is shown to optimize
performance by balancing resource utilization and computational efficiency. These findings provide
valuable insights into optimizing distributed neural network training, highlighting the trade-offs and
potential of different scaling methods in HPC settings.