Voice command recognition has become increasingly important for natural human–computer interaction in gaming and embedded systems. However, achieving accurate, noise-robust recognition on small-scale datasets remains challenging due to limited data and computational resources. To address this, we present a systematic study of architectural choices for small-scale speech command recognition. We compare three neural architectures (a BiLSTM, a Transformer, and a sequential BiLSTM–Transformer hybrid) in a controlled framework with identical MFCC front-ends and standardized noise augmentation. Experiments vary model configurations, sampling rates, and noise types, and are rigorously evaluated with repeated cross-validation to ensure reliability. The results show that the hybrid architecture achieves the highest accuracy, clearly outperforming both the standalone BiLSTM and the standalone Transformer baselines. The hybrid model also exhibits lower variance across cross-validation folds and random initializations, and trains with significantly higher throughput than the BiLSTM. It is robust to acoustic noise: variants trained with pink- and white-noise augmentation reach comparable accuracy, indicating robust feature learning under diverse augmentation conditions. These findings support the hypothesis that the BiLSTM's local temporal modeling complements the Transformer's global self-attention, enabling more effective capture of multi-scale temporal patterns in short utterances. Real-time deployment feasibility was confirmed through integration with a 3D game engine, achieving an average total response time of 5.23 ms and demonstrating the model's suitability for low-latency interactive applications.
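The standardized noise augmentation mentioned above can be sketched as an SNR-controlled mixing step: noise is scaled so that the signal-to-noise ratio of the mixture matches a target value in dB before training. The sketch below uses white Gaussian noise and pure-Python lists; the function name, interface, and white-noise choice are illustrative assumptions, not details taken from the paper.

```python
import math
import random

def add_noise_at_snr(signal, snr_db, rng=None):
    """Mix white Gaussian noise into `signal` at a target SNR (dB).

    Minimal sketch of SNR-controlled noise augmentation; the noise is
    scaled so that 10*log10(P_signal / P_noise_scaled) == snr_db.
    """
    rng = rng or random.Random(0)
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    # Average power of signal and (unscaled) noise.
    p_sig = sum(s * s for s in signal) / len(signal)
    p_noise = sum(n * n for n in noise) / len(noise)
    # Solve for the noise gain that yields the requested SNR.
    scale = math.sqrt(p_sig / (p_noise * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(signal, noise)]
```

Pink noise would follow the same mixing logic but with a 1/f-shaped noise source in place of the Gaussian samples.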