Author(s):
Neha Dewangan*, Kavita Thakur, Sunandan Mandal, Bikesh Kumar Singh
Email(s):
dewanganneha92@gmail.com
Address:
School of Studies in Electronics and Photonics, Pt. Ravishankar Shukla University, Raipur, 492010, India.
Department of Biomedical Engineering, National Institute of Technology, Raipur, 492010, India.
*Corresponding author: dewanganneha92@gmail.com
Published In:
Volume - 36,
Issue - 2,
Year - 2023
DOI:
10.52228/JRUB.2023-36-2-10
ABSTRACT:
Automatic Speech Emotion Recognition (ASER) is a state-of-the-art application of artificial intelligence. Speech recognition is employed in various applications such as digital assistants, security systems, and other human-machine interactive products. In the present work, three open-source acoustic datasets, namely SAVEE, RAVDESS, and EmoDB, have been utilized (Haq et al., 2008; Livingstone et al., 2005; Burkhardt et al., 2005). From these datasets, six emotions, namely anger, disgust, fear, happiness, neutral, and sadness, are selected for automatic speech emotion recognition. Various algorithms have already been reported for extracting emotional content from acoustic signals. This work proposes a time-frequency (t-f) image-based multiclass speech emotion classification model for the six emotions mentioned above. The proposed model extracts 472 grayscale image features from the t-f images of speech signals. A t-f image is a two-dimensional visual representation of a signal's frequency content as it evolves over time, with color (or grayscale intensity) indicating amplitude. An artificial neural network (ANN)-based multiclass machine learning approach is used to classify the selected emotions. The experimental results show that average classification accuracies (CA) of 88.6%, 85.5%, and 93.56% are achieved for the above-mentioned emotions using the SAVEE, RAVDESS, and EmoDB datasets, respectively. An average CA of 83.44% has also been achieved for the combination of all three datasets. The maximum previously reported average CA using spectrograms for the SAVEE, RAVDESS, and EmoDB datasets is 87.8%, 79.5%, and 83.4%, respectively (Wani et al., 2020; Mustaqeem and Kwon, 2019; Badshah et al., 2017). The proposed t-f image-based classification model thus improves average CA by 0.91%, 7.54%, and 12.18% (relative) for the SAVEE, RAVDESS, and EmoDB datasets, respectively. This study can be helpful in human-computer interface applications for detecting emotions precisely from acoustic signals.
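For readers who want a concrete starting point, the sketch below illustrates the general workflow described in the abstract: generate a grayscale t-f (spectrogram) image from a speech recording, derive image-based features, and train an ANN classifier. It is an assumption-laden illustration, not the authors' pipeline: it uses librosa and scikit-learn's MLPClassifier, a handful of toy image statistics in place of the paper's 472-feature set, and hypothetical file lists and parameters.

```python
# Minimal sketch (assumptions: librosa + scikit-learn; toy features, not the
# paper's 472-feature set; file paths, labels, and network size are hypothetical).
import numpy as np
import librosa
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

def tf_image(wav_path, n_fft=512, hop_length=128):
    """Return a grayscale time-frequency image with values scaled to [0, 1]."""
    y, sr = librosa.load(wav_path, sr=16000)        # mono speech signal
    S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop_length))
    S_db = librosa.amplitude_to_db(S, ref=np.max)   # log-magnitude spectrogram
    return (S_db - S_db.min()) / (S_db.max() - S_db.min() + 1e-9)

def image_features(img):
    """Toy grayscale-image statistics standing in for the richer feature set."""
    return np.array([
        img.mean(), img.std(), img.max(), img.min(), np.median(img),
        np.mean(np.abs(np.diff(img, axis=0))),      # variation along frequency
        np.mean(np.abs(np.diff(img, axis=1))),      # variation along time
    ])

def train_emotion_ann(files, labels):
    """'files'/'labels' are assumed lists of speech paths and emotion classes
    (anger, disgust, fear, happiness, neutral, sadness) from SAVEE/RAVDESS/EmoDB."""
    X = np.vstack([image_features(tf_image(f)) for f in files])
    X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.2,
                                              stratify=labels, random_state=0)
    ann = MLPClassifier(hidden_layer_sizes=(64,), max_iter=1000)
    ann.fit(X_tr, y_tr)
    return ann, ann.score(X_te, y_te)               # held-out classification accuracy
```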
Cite this article:
Dewangan, N., Thakur, K., Mandal, S. and Singh, B.K. (2023). Time-Frequency Image-based Speech Emotion Recognition using Artificial Neural Network. Journal of Ravishankar University (Part-B: Science), 36(2), pp. 144-157. DOI: https://doi.org/10.52228/JRUB.2023-36-2-10