Voice command recognition has become increasingly important for natural human–computer interaction in gaming and embedded systems. However, achieving accurate, noise-robust recognition on small-scale datasets remains challenging due to limited data and computational resources. To address this, we present a systematic study of architectural choices for small-scale speech command recognition. We compare three neural architectures (a BiLSTM, a Transformer, and a sequential BiLSTM–Transformer hybrid) in a controlled framework with identical MFCC front-ends and standardized noise augmentation. Experiments vary model configurations, sampling rates, and noise types, and are rigorously evaluated with repeated cross-validation to ensure reliability. The results show that the hybrid architecture achieves the highest accuracy, clearly outperforming both the standalone BiLSTM and the standalone Transformer baselines. The hybrid model also exhibits lower variance across cross-validation folds and random initializations, and trains with significantly higher throughput than the BiLSTM. It is robust to acoustic noise: variants trained with pink- and white-noise augmentation reach comparable accuracy, indicating robust feature learning under diverse augmentation conditions. These findings support the hypothesis that the BiLSTM's local temporal modeling complements the Transformer's global self-attention, enabling more effective capture of multi-scale temporal patterns in short utterances. Real-time deployment feasibility was confirmed through integration with a 3D game engine, achieving an average total response time of 5.23 ms and demonstrating the model's suitability for low-latency interactive applications.
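The standardized noise augmentation mentioned above can be sketched as an SNR-controlled mixing step: noise is scaled so that the signal-to-noise ratio of the mixture matches a target value in dB before training. The sketch below uses white Gaussian noise and pure-Python lists; the function name, interface, and white-noise choice are illustrative assumptions, not details taken from the paper.

```python
import math
import random

def add_noise_at_snr(signal, snr_db, rng=None):
    """Mix white Gaussian noise into `signal` at a target SNR (dB).

    Minimal sketch of SNR-controlled noise augmentation; the noise is
    scaled so that 10*log10(P_signal / P_noise_scaled) == snr_db.
    """
    rng = rng or random.Random(0)
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    # Average power of signal and (unscaled) noise.
    p_sig = sum(s * s for s in signal) / len(signal)
    p_noise = sum(n * n for n in noise) / len(noise)
    # Solve for the noise gain that yields the requested SNR.
    scale = math.sqrt(p_sig / (p_noise * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(signal, noise)]
```

Pink noise would follow the same mixing logic but with a 1/f-shaped noise source in place of the Gaussian samples.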