Accurate surgical activity recognition from surgical video can support intraoperative decision-making, post-operative analysis, and surgical education. However, complex and variable workflows across different procedures challenge the generalization of current models, and consistently learning temporal dependencies at the frame, activity, and case levels across procedure types remains an open problem. Here, we apply the existing Frame-Action Cross-Attention for Temporal modeling (FACT) architecture, which was designed for exactly this kind of temporal modeling, to the surgical domain. We evaluate its performance across three datasets: Cholec80 (cholecystectomy, 7-phase), AutoLaparo (hysterectomy, 7-phase), and MultiBypass140 (gastric bypass, 12-phase and 46-step). To disentangle the relative importance of spatial and temporal representation learning, we pair FACT, which jointly reasons over frame-level and action-level dependencies across the entire case via bidirectional cross-attention, with both domain-general and surgically fine-tuned image feature extractors. To better understand workflow variability in our datasets, we quantify the variance in surgical workflows across them. We compare FACT, which processes an entire case as a single sample, against several other architectures, including traditional windowed methods, to evaluate how different approaches perform across these datasets. Across Cholec80, AutoLaparo, and MultiBypass140, FACT delivers competitive performance under matched protocols, including 94.1% accuracy on Cholec80 phases, 77.3% on MultiBypass140 steps, and 89.5% on MultiBypass140 phases. Importantly, FACT performs comparably with either domain-general or surgically fine-tuned spatial representations, and in some cases is even relatively robust to older generations of domain-general image feature extractors such as RotNet.
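To make the core mechanism concrete, the bidirectional cross-attention described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the shapes, token counts, and the residual/readout structure are illustrative assumptions, and a real model would use learned projections, multiple heads, and stacked layers.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    # Scaled dot-product attention: each query row attends to all
    # key/value rows (here keys and values share one matrix for brevity).
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)    # (n_q, n_kv)
    return softmax(scores, axis=-1) @ keys_values    # (n_q, d)

rng = np.random.default_rng(0)
T, K, d = 500, 7, 64   # frames in one case, action tokens (e.g. 7 phases), feature dim
frames = rng.standard_normal((T, d))    # per-frame spatial features for the whole case
actions = rng.standard_normal((K, d))   # action-level tokens (learned in practice)

# Bidirectional cross-attention with residual connections:
# frames query the action tokens, and action tokens query the frames.
frames_updated = frames + cross_attention(frames, actions)
actions_updated = actions + cross_attention(actions, frames)

# Illustrative readout: per-frame phase logits from similarity
# between updated frame features and updated action tokens.
logits = frames_updated @ actions_updated.T          # (T, K)
phase_pred = logits.argmax(axis=1)                   # one phase label per frame
```

Because the full case is attended to at once, every frame can condition on action-level context from anywhere in the video, which is what distinguishes this design from fixed-window temporal models.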
We emphasize method clarity, cross-dataset consistency, and workflow-variability analysis as our key contributions. While our results suggest that FACT is a flexible and robust temporal architecture for surgical activity modeling, challenges remain in generalizing to fine-grained or highly variable workflows. We also discuss how the measured variance in surgical workflows may affect the performance of different architectures, and propose future work that may help close this gap.