Activity Recognition via Multimodal Large Language Models and Riemannian Optimization

2025 · 0 citations · Preprint · Green Open Access

Authors

Farahnaz Soleimani · Laboratoire des signaux et systèmes

Ghazaleh Khodabandelou · Laboratoire des signaux et systèmes

Abdelghani Chibani · Laboratoire des signaux et systèmes

Yacine Amirat · Laboratoire des signaux et systèmes

Abstract

Human Activity Recognition (HAR) has become a critical task in applications such as healthcare, smart environments, and human-computer interaction. This study investigates the potential of using a publicly available GPT-2 variant for HAR with multimodal data, combining its natural language processing capabilities with advanced temporal modeling for sequential data. Furthermore, the study introduces a novel Riemannian manifold optimization strategy that enhances model generalization by leveraging the geometric structure of the parameter space. The model's performance is evaluated on both unimodal and multimodal datasets to assess its adaptability and effectiveness. Specifically, the GPT-2 model is fine-tuned on the UCI-HAR and Opportunity datasets (unimodal) and a multimodal dataset comprising synchronized RGB video, depth, skeleton, and inertial data. The fine-tuned GPT-2 achieves state-of-the-art performance, with 84% accuracy on the UCI-HAR dataset (a 2% improvement over previous benchmarks) and 84.62% accuracy on the Opportunity dataset. For the multimodal dataset, the model achieves an outstanding accuracy of 99.42%, demonstrating its capability to effectively integrate and process diverse sensor modalities. These results establish the viability of GPT-based architectures for HAR, particularly with multimodal data, and pave the way for further advancements in transformer-based models for sensor-driven activity recognition.

Impact Statement: This paper presents a novel Riemannian manifold optimization strategy designed for transformers, enhancing their generalization and robustness in HAR tasks. The proposed approach is validated on both a standard transformer and a GPT variant across unimodal and multimodal datasets. This extensive evaluation highlights the strategy's effectiveness in managing diverse data modalities while leveraging the complementary strengths of multimodal inputs. By achieving state-of-the-art performance on HAR benchmarks, this work sets a new paradigm for transformer optimization, opening avenues for further advancements in multimodal data processing and activity recognition.

Topics & Keywords

Human Pose and Action Recognition

Publication Details

Published in: SPIRE - Sciences Po Institutional REpository
