"Multi-Agent Reinforcement Learning-Based Routing for Dragonfly Networks"

Xin Yuan
Department of Computer Science
Florida State University (FSU)

Wednesday, Mar 25, 2026

Colloquium - 499 DSL Seminar Room
03:30 to 04:30 PM Eastern Time (US and Canada)

In-person attendance is requested.
Zoom access is intended for external (non-departmental) participants only.

Zoom Meeting # 942 7359 5552


Abstract:

Multi-agent reinforcement learning (MARL)-based routing has emerged as a promising approach for high-performance interconnect networks such as Dragonfly, offering a viable alternative to the widely used Universal Globally Adaptive Load-balanced (UGAL) routing. Practical routing on modern interconnects must satisfy various requirements, such as being deadlock-free and having a limited path length. These requirements impose routing constraints, which in turn pose challenges for MARL-based routing. In particular, two important issues must be addressed for a MARL-based scheme to be effective. First, in the presence of routing constraints, the scheme must retain sufficient path diversity to accommodate different traffic conditions. Second, since routing constraints influence how Q-values are propagated in a MARL-based scheme, the value propagation mechanism must account for the routing constraints. Existing MARL-based routing schemes for Dragonfly fall short in addressing both issues. As a result, while they achieve high performance for some traffic conditions, they may exhibit poor performance or even pathological behaviors in other scenarios. In this work, we discuss the limitations of existing MARL-based routing schemes for Dragonfly, present methods to address the two key issues, and develop UGAL-Q, a novel MARL-based scheme that resolves these issues and overcomes the problems in existing approaches. We perform extensive evaluations using both synthetic traffic and HPC application benchmarks. The results demonstrate that our scheme is more effective than existing ones and is a robust routing solution for Dragonfly.

Dept. of Scientific Computing
Florida State University
400 Dirac Science Library
Tallahassee, FL 32306-4120
admin@sc.fsu.edu