Resource allocation in new R&D institutions faces challenges such as diverging interests among multiple agents and dynamic resource supply and demand. Traditional methods struggle to adapt to these complex scenarios. This paper proposes a dynamic role-aware Multi-Agent Reinforcement Learning (MARL) collaborative decision-making algorithm for resource optimization. The algorithm constructs a resource system model encompassing human, material, financial, and information resources, and designs three innovative modules: dynamic role mapping, multi-objective hierarchical rewards, and real-time conflict resolution.

Specifically, the MARL model adopts an improved Proximal Policy Optimization (PPO) framework integrated with an attention mechanism to prioritize key resource-task pairs (e.g., matching high-priority tasks with scarce GPU servers) and leverages a federated learning communication framework to reduce data transmission by 30% while ensuring information security. Dynamic role mapping adjusts agent roles (resource management, task execution, benefit coordination) in real time based on resource supply-demand deviations (e.g., switching resource management agents to auxiliary task execution during GPU surges) and task priorities. Multi-objective hierarchical rewards optimize benefits at the local (single-agent task completion), collaborative (multi-agent coordination), and global (system-wide utilization/cost) levels. Real-time conflict resolution rapidly resolves resource competition through game equilibrium (Nash equilibrium), completing reallocation within 10 seconds to avoid task delays. An experimental platform is built using Python 3.9 and PyTorch 2.0.
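As a rough illustration of how the multi-objective hierarchical reward might combine its three levels, the sketch below forms a weighted sum of local, collaborative, and global terms. The function name, inputs, and weights are assumptions for illustration only, not the paper's actual formulation:

```python
def hierarchical_reward(task_done, coordination_score, utilization, norm_cost,
                        w_local=0.4, w_collab=0.3, w_global=0.3):
    """Combine the three reward levels into one scalar signal per agent.

    Hypothetical sketch: local = single-agent task completion,
    collaborative = multi-agent coordination quality in [0, 1],
    global = system-wide utilization minus normalized cost.
    """
    r_local = 1.0 if task_done else 0.0   # single-agent task completion
    r_collab = coordination_score          # e.g., fraction of conflict-free handoffs
    r_global = utilization - norm_cost     # system-wide utilization/cost term
    return w_local * r_local + w_collab * r_collab + w_global * r_global

# Example: a completed task with strong coordination and high utilization
# yields a higher reward than an incomplete one under the same conditions.
r_done = hierarchical_reward(True, 0.8, 0.9, 0.2)
r_not_done = hierarchical_reward(False, 0.8, 0.9, 0.2)
```

In practice, such weights would likely be tuned per level (or annealed during training) so that no single objective dominates the PPO policy gradient.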
Using operational data (2022–2024) from a provincial R&D institution (50 tasks, 20 resource types, 15 agents) and a synthetic dataset (generated from statistical distributions to test generalizability), the algorithm is compared with Linear Programming (LP), Deep Q-Network (DQN), Multi-Agent Deep Deterministic Policy Gradient (MADDPG), QMIX, and FedMARL. Results show that the proposed algorithm achieves a resource utilization of 94.2% ± 0.5% (95% confidence interval: 93.7–94.7%) in the static single-scenario setting, a 15.7 percentage point improvement over LP. The average task completion time is 28.5 ± 1.2 days, a 36.9% reduction relative to LP. In dynamic scenarios with resource fluctuations exceeding ±10%, the average performance fluctuation is only 3.2%, a 74.4% reduction compared to LP. Ablation experiments show that removing dynamic role mapping reduces resource utilization by 6.2%, validating that module's effectiveness. The algorithm thus provides technical support for improving resource allocation efficiency in new R&D institutions.