Facial Feature Detection for Arabic Audio Visual Speech Recognition Systems

Nour Sami Ghadban, Jafar Alkheir, Mariam Saii


The visual speech modality plays an important role in the perception and production of speech. Although not confined purely to the mouth, it is generally agreed that a large proportion of the speech information conveyed in the visual modality stems from the mouth region of interest (ROI).

To this end, it is imperative that an audio-visual speech processing (AVSP) system be able to accurately detect, track, and normalize the mouth of a subject within a video sequence. This task is referred to as facial feature detection (FFD). The goal of FFD is to detect the presence and location of features such as the eyes, nose, nostrils, eyebrows, mouth, lips, and ears, under the assumption that there is only one face in the image. This differs slightly from the task of facial feature location, which assumes the feature is present and only requires its position. Facial feature tracking extends the location task by incorporating temporal information across a video sequence to follow a facial feature's position as time progresses. Throughout this article, the tasks of facial feature detection, location, and tracking are all considered to fall under the broad banner of FFD.
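The distinction above between per-frame detection and temporal tracking can be illustrated with a minimal sketch (not taken from the paper; the `Box` type, the `track_mouth` helper, and the exponential smoothing are assumptions for illustration): a tracker smooths per-frame mouth detections and carries the last estimate through frames where the detector fails.

```python
# Hypothetical sketch: turning per-frame mouth *detections* into a
# temporally smoothed *track*. Frames where the detector fails are
# bridged by carrying forward the last estimate.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Box:
    """A mouth bounding box: top-left corner plus width and height."""
    x: float
    y: float
    w: float
    h: float


def track_mouth(detections: List[Optional[Box]],
                alpha: float = 0.6) -> List[Optional[Box]]:
    """Exponentially smooth detections into a track.

    `detections[i]` is the detector's output for frame i, or None if
    the detector found nothing. `alpha` weights the new detection
    against the running estimate (assumed value, not from the paper).
    """
    tracked: List[Optional[Box]] = []
    est: Optional[Box] = None
    for det in detections:
        if det is not None:
            if est is None:
                est = det  # first detection initializes the track
            else:
                # Blend new detection with the previous estimate.
                est = Box(
                    alpha * det.x + (1 - alpha) * est.x,
                    alpha * det.y + (1 - alpha) * est.y,
                    alpha * det.w + (1 - alpha) * est.w,
                    alpha * det.h + (1 - alpha) * est.h,
                )
        # On a missed detection, est is simply carried forward.
        tracked.append(est)
    return tracked
```

In a real AVSP front end the per-frame detections would come from a face and mouth detector; the smoothing step is one simple way to exploit the temporal continuity that distinguishes tracking from frame-by-frame location.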



Keywords: audio-visual speech processing, facial feature detection, front-end effect, eye detection, mouth location/tracking





