Enhancing software reliability mitigates failures and reduces maintenance costs. In the LLVM compiler, different option sequences applied to the intermediate representation (IR) produce binaries with varying reliability levels, making the search for an optimal sequence a key challenge in reliability-oriented compilation. Although reinforcement learning (RL) has been employed to automate this process, existing methods suffer from two major limitations: they rely on structure-based static IR embeddings that overlook the dynamic behaviors induced by compilation transformations, yielding low-fidelity state representations, and they construct action spaces that fail to preserve critical inter-option dependencies while maintaining spatial compactness, thereby impairing training efficiency and constraining optimization gains. This paper presents OSRC-RL (Optimizing Software Reliability at Compile-time using Reinforcement Learning), which integrates two novel components: a Compilation Behavior-Aware State Representation (CBA-SR) and an Action Space Construction via Option Dependency Awareness and space minimization (ASC-ODA). CBA-SR jointly encodes program semantics and compilation dynamics, using activated options as multi-label supervisory signals to guide an expressive Graph Isomorphism Network in learning IR subgraph patterns correlated with option activations; the resulting high-fidelity states enable more reliable policy learning and faster convergence. ASC-ODA reconciles dependency preservation with spatial compactness by constructing an Option Dependency Graph, performing dependency-aware hierarchical clustering and recursive intra-cluster path analysis, and applying preference-aware selection to generate a compact yet dependency-preserving action space that enhances sample efficiency and policy stability.
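The GIN-plus-multi-label-head idea behind CBA-SR can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the layer sizes, the two-layer ReLU MLP, the sum-pooling readout, and the per-option sigmoid head are all assumptions made here for clarity.

```python
# Minimal sketch of a GIN encoder with a multi-label option head
# (assumed architecture; the paper's exact design is not reproduced here).
import numpy as np

def gin_layer(H, A, W1, W2, eps=0.0):
    """One Graph Isomorphism Network update:
    h_v' = MLP((1 + eps) * h_v + sum_{u in N(v)} h_u).
    H: (n, d) node features; A: (n, n) 0/1 adjacency matrix."""
    agg = (1.0 + eps) * H + A @ H          # self term plus neighbor sum
    hidden = np.maximum(agg @ W1, 0.0)     # two-layer ReLU MLP
    return np.maximum(hidden @ W2, 0.0)

def graph_embedding(H, A, layers):
    """Run stacked GIN layers over an IR graph and concatenate a
    sum-pooled readout from each layer into one graph-level state."""
    readouts = []
    for W1, W2 in layers:
        H = gin_layer(H, A, W1, W2)
        readouts.append(H.sum(axis=0))     # sum-pool over nodes
    return np.concatenate(readouts)

def option_logits(z, Wo):
    """Multi-label head: one sigmoid score per compiler option.
    Activated options serve as the supervisory labels during training."""
    return 1.0 / (1.0 + np.exp(-(z @ Wo)))
```

During training, the sigmoid outputs would be fit against the binary activation labels of each option (e.g. with binary cross-entropy), so that subgraph patterns predictive of option activations are pushed into the embedding.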
Evaluations across 20 benchmarks show that OSRC-RL achieves the highest average reliability gain (I_rg = 0.6694) and competitive convergence speed compared with five baselines. Ablations attribute these gains to its core components: CBA-SR (up to 11.34% improvement via IR-graph learning) and ASC-ODA (up to 15.22% improvement via dependency-faithful actions). On industrial instances, it attains a 0.1818 average gain, surpassing the next-best method by 33.77%. The bounded overheads are mitigated by Lightweight Post-Processing (LPP), ensuring practical feasibility.
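The gain figures above can be read as relative improvements over an unoptimized baseline. A minimal sketch, assuming I_rg is a relative-improvement ratio (the abstract does not reproduce its exact definition, so this formula is an illustrative assumption):

```python
def reliability_gain(r_base, r_opt):
    """Hypothetical relative reliability-gain metric (assumed form,
    not necessarily the paper's exact I_rg definition): the fractional
    improvement of the optimized binary over the baseline binary."""
    return (r_opt - r_base) / r_base
```

Under this reading, a gain of 0.6694 would mean the optimized binary's reliability measure is about 67% higher than the baseline's.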