Motion transfer is an Artificial Intelligence (AI) technique that synthesizes videos by transferring motion dynamics from a driving video to a source image. In this work we propose a deep learning-based framework for real-time video motion transfer, which is critical for bandwidth-efficient applications such as video conferencing, remote health monitoring, virtual reality interaction, and vision-based anomaly detection. The framework relies on keypoints, which serve as semantically meaningful, compact representations of motion across time and are extracted from every video frame by a self-supervised detector. To save bandwidth during transmission, we forecast keypoints with two generative time-series models, Variational Recurrent Neural Networks (VRNN) and Gated Recurrent Units with Normalizing Flows (GRU-NF), supporting both single-future and diverse-future prediction modes. The predicted keypoints are transformed into realistic video frames by an optical flow-based module paired with a generator network, thereby enabling accurate video forecasting and efficient, low-frame-rate video transmission. Depending on the application, the framework can either generate a deterministic future sequence or sample a diverse set of plausible futures. Experimental results on three benchmark video datasets, using state-of-the-art quality and diversity metrics for video animation and reconstruction tasks, demonstrate that VRNN achieves the best point-forecast fidelity (lowest MAE) in the majority of evaluated settings, making it well suited to applications that require stable, accurate multi-step forecasting (e.g., video conferencing, remote patient monitoring), and it remains competitive in higher-uncertainty, multi-modal settings.
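To make the low-frame-rate transmission idea concrete, the following is a minimal illustrative sketch (not the paper's VRNN or GRU-NF): the sender extracts compact keypoints per frame and transmits only a short history; the receiver forecasts the skipped frames from that history. A constant-velocity extrapolator stands in for the learned forecaster, and all names here are hypothetical.

```python
import numpy as np

def forecast_keypoints(history, n_steps):
    """Predict future keypoint sets by constant-velocity extrapolation.

    history: array of shape (T, K, 2) -- T past frames, K keypoints, (x, y).
    Returns an array of shape (n_steps, K, 2) with the forecast frames.
    """
    velocity = history[-1] - history[-2]   # per-keypoint displacement per frame
    last = history[-1]
    return np.stack([last + (i + 1) * velocity for i in range(n_steps)])

# Toy example: 2 keypoints moving linearly; transmit 3 frames, forecast 2 more.
t = np.arange(3, dtype=float)
history = np.stack([np.stack([t + k, 2 * t], axis=-1) for k in range(2)], axis=1)
future = forecast_keypoints(history, n_steps=2)
print(future.shape)  # (2, 2, 2)
```

Because only keypoint coordinates (a few dozen floats per frame) cross the channel instead of full frames, the receiver can render intermediate frames locally, which is the source of the bandwidth savings described above.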
This is achieved by exploiting the strong reconstruction properties of the Variational Autoencoder and by introducing recurrently conditioned stochastic latent variables that carry past context, capturing uncertainty and temporal variation. The GRU-NF model, by contrast, yields richer diversity in the generated videos while maintaining high visual quality, better supporting tasks such as AI-driven anomaly detection. It does so by learning an invertible, exact-likelihood mapping between the keypoints and their latent representations, which supports rich, controllable sampling of diverse yet coherent keypoint sequences. Our work lays the foundation for next-generation AI systems that require real-time, bandwidth-efficient, and semantically controllable video generation, with broad implications for communication, health, and manufacturing applications. The code is available at: https://github.com/Tasmiah1408028/RealtimeVideoMotionTransfer
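To illustrate the invertible, exact-likelihood mapping underlying the normalizing-flow component, here is a minimal sketch of a single affine coupling step, the standard building block of such flows. The toy conditioners below stand in for the small trained networks a real flow would use; they, and all names here, are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def coupling_forward(x, scale, shift):
    """One affine coupling step: the first half of x passes through unchanged;
    the second half is affinely transformed conditioned on the first half."""
    x1, x2 = np.split(x, 2)
    z2 = x2 * np.exp(scale(x1)) + shift(x1)
    log_det = np.sum(scale(x1))            # exact log |det Jacobian|
    return np.concatenate([x1, z2]), log_det

def coupling_inverse(z, scale, shift):
    """Exact inverse: undo the affine map using the untouched first half."""
    z1, z2 = np.split(z, 2)
    x2 = (z2 - shift(z1)) * np.exp(-scale(z1))
    return np.concatenate([z1, x2])

# Hypothetical conditioners (a trained flow would use small neural nets).
scale = lambda h: 0.5 * np.tanh(h)
shift = lambda h: 0.1 * h

x = np.array([0.3, -1.2, 0.7, 2.0])        # e.g. flattened keypoint coordinates
z, log_det = coupling_forward(x, scale, shift)
x_rec = coupling_inverse(z, scale, shift)
print(np.allclose(x, x_rec))  # True -- the mapping is exactly invertible
```

Because the mapping is invertible with a tractable Jacobian, sampling latents and inverting them yields diverse keypoint sequences with exact likelihoods, which is the property the GRU-NF model relies on for controllable, diverse generation.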