This paper builds upon the foundations established in A Data-Driven Approach: YouTube as a Resource for AI, ML and Robotic Development, extending the earlier framework into a more robust, multimodal, and robotics-focused system. While the previous study demonstrated the feasibility of extracting instructional value from YouTube content for model training, it focused primarily on lightweight processing and software-based demonstrations. The present work advances this concept by introducing an enhanced pipeline that transforms YouTube tutorial videos into structured datasets suitable for code generation, multimodal learning, and robot control. The redesigned system integrates transcript data, video frames, on-screen text, audio cues, and procedural narration into synchronized video–text–action triplets using a modern vision-language alignment model. These triplets support a dual-branch learning architecture: one branch for code generation tasks such as Python code explanation, debugging, and snippet reconstruction, and another for robotic manipulation tasks derived from instructional videos. A key contribution of this work is the incorporation of Webots simulation as the primary environment for robot training. Motion sequences observed in tutorial content, such as grasping, placing, assembling, rotating, or trajectory following, are translated into structured actions and trained through imitation learning inside Webots. The simulation enables precise control, rapid prototyping, and large-scale experimentation before the learned policies are transferred to a physical UR5e robotic arm for real-world evaluation. Experimental results show a 95% transcript extraction rate, 91% multimodal alignment accuracy, a 78% improvement in MBPP pass@1 for code generation, and an 87% success rate in Webots-trained manipulation tasks, with 85% task retention when deployed on the physical robot.
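To make the synchronized video–text–action triplets concrete, the following is a minimal illustrative sketch. The paper does not publish its data schema, so every field name and value here is an assumption about what such a triplet might contain, not the actual format.

```python
from dataclasses import dataclass, field

# Hypothetical schema for a synchronized video-text-action triplet.
# All field names are illustrative assumptions, not the paper's format.
@dataclass
class VideoTextActionTriplet:
    frame_timestamps: list          # seconds into the source video
    transcript_span: str            # narration aligned to these frames
    on_screen_text: str             # OCR output for the same time window
    action_label: str               # e.g. "grasp", "place", "rotate"
    action_parameters: dict = field(default_factory=dict)  # pose targets, etc.

# Example triplet for a grasping demonstration drawn from a tutorial video
triplet = VideoTextActionTriplet(
    frame_timestamps=[12.0, 12.5, 13.0],
    transcript_span="Now close the gripper around the block.",
    on_screen_text="gripper.close()",
    action_label="grasp",
    action_parameters={"target": "block", "grip_force": 0.6},
)
print(triplet.action_label)  # grasp
```

A structure along these lines would let the code-generation branch consume the transcript and on-screen text while the manipulation branch consumes the action label and parameters from the same aligned record.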
Processing remains efficient at approximately 1.8 minutes per 5-minute video on a cluster of Raspberry Pi 5 devices. Overall, this work demonstrates an effective pathway for converting abundant online tutorial content into actionable datasets for AI-driven code generation and simulation-validated robotics learning, enabling scalable development in both research and educational contexts.
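The reported rate of about 1.8 minutes of processing per 5-minute video implies faster-than-real-time throughput. A quick check of the arithmetic (the cluster size is not stated, so only per-node figures are derived):

```python
# Back-of-envelope throughput from the reported figures:
# ~1.8 minutes of processing per 5-minute video.
PROCESS_MIN_PER_VIDEO = 1.8
VIDEO_LENGTH_MIN = 5.0

# How much faster than real time a single node processes footage
speedup = VIDEO_LENGTH_MIN / PROCESS_MIN_PER_VIDEO

# Videos one node can process per hour at that rate
videos_per_hour_per_node = 60.0 / PROCESS_MIN_PER_VIDEO

print(f"{speedup:.2f}x real time")                              # 2.78x real time
print(f"{videos_per_hour_per_node:.1f} videos/hour per node")   # 33.3 videos/hour per node
```

Total cluster throughput would scale roughly linearly with the number of Raspberry Pi 5 nodes, assuming videos are processed independently.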