With the continuous growth of large language model (LLM) sizes, individual datacenters are increasingly insufficient to support LLM training. Cross-datacenter training has become a feasible solution, where optical networks play a crucial role in data transmission. However, current optical networks suffer from severe over-provisioning issues, primarily due to the conflict between the bursty nature of LLM traffic and the fixed bandwidth of optical network connections. Therefore, this paper presents CrossOptic, a novel, to our knowledge, framework for optimizing cross-datacenter LLM training over optical networks. The proposed solution addresses the critical challenges of bandwidth under-utilization in optical networks and excessive communication overhead in distributed LLM training. CrossOptic integrates two synergistic components: adaptive GPU resource orchestration (AGRO) for communication-minimizing task placement and traffic-interleaved connectivity provisioning (TICP) for bandwidth-efficient flow aggregation. Through comprehensive simulations using realistic GPT model workloads, CrossOptic outperforms conventional dedicated connection approaches by 38.1% in terms of optical resource utilization. For a GPT model with 529.6B parameters, compared with the training methods using data parallelism and pipeline parallelism between datacenters, CrossOptic reduces the iteration time by 55% and 14%, respectively. The framework achieves significant cost efficiency in the optical infrastructure while supporting the computational demands of LLM training.
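The over-provisioning problem described above can be illustrated with a toy calculation. The sketch below is not the paper's TICP algorithm; it only shows, under assumed numbers, why aggregating bursty flows whose peaks do not coincide onto one fixed-bandwidth connection raises utilization compared with giving each flow a dedicated connection sized to its peak. The traffic traces, the time-slot model, and the function name `utilization` are all hypothetical.

```python
def utilization(demand, capacity):
    """Fraction of provisioned capacity actually used, averaged over time slots."""
    return sum(demand) / (capacity * len(demand))

# Hypothetical per-slot demand in Gb/s for two training jobs (assumed values).
# Each job alternates between a communication burst and a compute phase with no traffic.
job_a = [100, 0, 100, 0, 100, 0]
job_b = [0, 100, 0, 100, 0, 100]

# Total traffic carried in each slot when both jobs run.
aggregate = [a + b for a, b in zip(job_a, job_b)]

# Dedicated connections: each job gets a fixed connection sized to its own peak.
dedicated_capacity = max(job_a) + max(job_b)
dedicated_util = utilization(aggregate, dedicated_capacity)

# Shared connection: one fixed connection sized to the peak of the interleaved aggregate.
shared_capacity = max(aggregate)
shared_util = utilization(aggregate, shared_capacity)

print(f"dedicated: {dedicated_capacity} Gb/s provisioned, utilization {dedicated_util:.2f}")
print(f"shared:    {shared_capacity} Gb/s provisioned, utilization {shared_util:.2f}")
```

In this idealized case the bursts interleave perfectly, so the shared connection needs half the capacity. Real LLM traffic would overlap only partially, and the gain would be correspondingly smaller.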
Published in: Journal of Optical Communications and Networking
Volume 18, Issue 2, pp. 137-137
DOI: 10.1364/jocn.579324