LIPCOORDNET: A DUAL-STREAM DEEP LEARNING ARCHITECTURE FOR VISUAL SPEECH RECOGNITION USING FACIAL LANDMARKS

Authors

  • Wissem Karous, School of Electronics and Telecommunications, University of Sfax, Tunisia
  • Hanen Lajnef (Corresponding Author), Innov’COM Laboratory, Sup’Com, University of Carthage, Ariana, Tunisia
  • Tehani Dammak, School of Electronics and Telecommunications, University of Sfax, Tunisia

DOI:

https://doi.org/10.22452/

Keywords:

Lip reading, Deep learning, 3D CNN, LSTM, Facial landmarks, Multi-modal fusion, Visual speech recognition

Abstract

Automated lip reading systems have emerged as critical assistive technologies for hearing-impaired individuals and for communication in noisy environments. This research presents a deep learning framework for sentence-level lip reading that integrates 3D Convolutional Neural Networks (3D CNN) with bidirectional Long Short-Term Memory (Bi-LSTM) networks and uses facial landmark coordinates as supplementary input features. On the GRID corpus benchmark, the proposed LipCoordNet architecture obtains a Word Error Rate (WER) of 1.7% and a Character Error Rate (CER) of 0.6%, a significant improvement over existing state-of-the-art methods evaluated on the same dataset. The system owes its robust performance to the integration of spatio-temporal visual features with geometric lip movement patterns. It is validated through comprehensive experiments, including statistical significance testing across five independent runs, and is deployed as an interactive demonstration platform.
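
For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how WER and CER are conventionally computed: the edit (Levenshtein) distance between the reference and the predicted transcript, divided by the reference length in words or characters. This is not the authors' evaluation code. The function names and the example sentence pair, written in the style of GRID commands, are assumptions made for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character Error Rate: character-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Hypothetical sentence pair, not taken from the paper's results.
reference = "bin blue at f two now"
hypothesis = "bin blue at f too now"
print(f"WER = {wer(reference, hypothesis):.3f}")  # 1 wrong word out of 6
print(f"CER = {cer(reference, hypothesis):.3f}")  # 1 wrong character out of 21
```

Corpus-level scores are typically obtained by summing edit distances and reference lengths over all test sentences before dividing, rather than by averaging per-sentence rates; which convention the paper follows is not stated in the abstract.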

 

Published

2026-06-11