Author(s): Neha Dewangan, Kavita Thakur, Sunandan Mandal, Bikesh Kumar Singh

Email(s): dewanganneha92@gmail.com

Address: School of Studies in Electronics and Photonics, Pt. Ravishankar Shukla University, Raipur, 492010, India.
Department of Biomedical Engineering, National Institute of Technology, Raipur, 492010, India.
*Corresponding author: dewanganneha92@gmail.com

Published In: Volume 36, Issue 2, Year 2023


Cite this article:
Dewangan, Thakur, Mandal and Singh (2023). Time-Frequency Image-based Speech Emotion Recognition using Artificial Neural Network. Journal of Ravishankar University (Part-B: Science), 36(2), pp. 144-157.



Time-Frequency Image-based Speech Emotion Recognition using Artificial Neural Network

Neha Dewangan 1,*, Kavita Thakur 1, Sunandan Mandal 2, Bikesh Kumar Singh 2

1 School of Studies in Electronics and Photonics, Pt. Ravishankar Shukla University, Raipur, 492010, India

2 Department of Biomedical Engineering, National Institute of Technology, Raipur, 492010, India

 *Corresponding author: dewanganneha92@gmail.com

 Abstract

Automatic Speech Emotion Recognition (ASER) is a state-of-the-art application in artificial intelligence. Speech recognition intelligence is employed in various applications such as digital assistance, security, and other human-machine interactive products. In the present work, three open-source acoustic datasets, namely SAVEE, RAVDESS, and EmoDB, have been utilized (Haq et al., 2008, Livingstone et al., 2005, Burkhardt et al., 2005). From these datasets, six emotions, namely anger, disgust, fear, happy, neutral, and sad, are selected for automatic speech emotion recognition. Various algorithms have already been reported for extracting emotional content from acoustic signals. This work proposes a time-frequency (t-f) image-based multiclass speech emotion classification model for these six emotions. The proposed model extracts 472 grayscale image features from the t-f images of speech signals. The t-f image is a two-dimensional visual representation of a signal in which one axis is time, the other is frequency, and color encodes the amplitude at each time-frequency point. An artificial neural network-based multiclass machine learning approach is used to classify the selected emotions. The experimental results show average classification accuracies (CA) of 88.6%, 85.5%, and 93.56% for the SAVEE, RAVDESS, and EmoDB datasets, respectively, and an average CA of 83.44% for the combination of all three datasets. The maximum reported average CA using spectrograms is 87.8%, 79.5%, and 83.4% for the SAVEE, RAVDESS, and EmoDB datasets, respectively (Wani et al., 2020, Mustaqeem and Kwon, 2019, Badshah et al., 2017). Relative to these values, the proposed t-f image-based classification model improves the average CA by 0.91%, 7.54%, and 12.18% for the SAVEE, RAVDESS, and EmoDB datasets, respectively. This study can be helpful in human-computer interface applications that need to detect emotions precisely from acoustic signals.

Keywords: Time-Frequency Image; Neural Network; Automatic Speech Emotion Recognition; Acoustic Signal; Grayscale Image Feature

Introduction

Emotions are an individual's feelings about a situation. They are the body's physical and emotional response to a person's thoughts and feelings. Different people express emotions differently in identical situations, and their facial expressions and voices usually reflect their feelings. Emotion is essential for effective communication. Emotions can be expressed as happiness, sadness, anger, anxiety, cheerfulness, excitement, loneliness, helplessness, annoyance, etc., and can additionally be categorized as positive or negative. The primary emotions are classified into six categories: happy, sad, anger, fear, disgust, and surprise (Ekman et al., 2013). Humans use many types of gestures, as well as speech, as a means of communication. Speech is the most common, easiest, and most natural form of communication. Communication can be done in other ways as well, but these can lack emotional cues: a text message without emotional context, for example, can cause misunderstanding. Emojis were therefore introduced to convey emotions in text messages. When people speak, their emotions are reflected in their voices, which facilitates better communication. Speech generation is a unique physiological process, so the emotional state inherent in speech can also help to detect a person's mental and physical health (Akçay and Oğuz, 2020).

Automatic Speech Emotion Recognition (ASER) is the process of recognizing emotions in speech. ASER combines speech analysis and machine learning to build an automated system that can detect a person's emotions from their voice for various purposes. Acoustic feature extraction plays a prominent role in an ASER system, which analyzes a speaker's voice to determine the speaker's emotional state. A variety of features are used to classify emotions from the acoustic signal, such as prosodic features, Mel-frequency cepstral coefficients, pitch frequencies, vocalization duration, and spectrograms. ASER-based systems can be applied to build supportive tools in areas such as healthcare, digital assistant-based customer service, marketing, and other human-machine interactive services.

The main challenge in ASER is extracting the hidden features embedded in the acoustic signal, and various methods are applied for this purpose. Speech features include prosodic features, voice quality features, spectral features, and Teager energy features (Akçay and Oğuz, 2020). Many researchers have reported time-domain as well as frequency-domain feature extraction techniques for acoustic signals, and some use the spectrogram of the acoustic signal. The present method extracts emotional features from time-frequency images (spectrograms). The t-f image is a two-dimensional visual representation of a signal in which one axis is time, the other is frequency, and color encodes the amplitude at each time-frequency point.

Literature Review

Some recent studies using spectrogram-based emotion recognition with various datasets and classifiers are reported briefly below.

Wang (2014) extracted texture image information from a spectrogram of the speech signals to sense emotions embedded in speech. Two open-source emotional datasets have been used in their work, namely EmoDB and eNTERFACE, along with their self-recorded dataset (KHUSC-EmoDB) for the cross-corpus method. Firstly, the speech signals of the dataset mentioned above are converted into a spectrogram; after that, it is transformed into a normalized grayscale image, and then a cubic curve is used to enhance the image contrast. The features were extracted by Laws' Masks, based on the principle of texture energy measurement and SVM is used as a classifier. Their experimental results show that the correct classification rates range from 65.20% to 77.42%.

Badshah et al. (2017) proposed a model for SER using a spectrogram with the deep convolutional neural network. The EmoDB dataset has been utilized in their work, and the acoustic signals have been converted to spectrogram images. They used three convolutional layers and three fully connected layers to extract suitable features from spectrogram images. The Softmax layer performs the final classification for seven emotions embedded in acoustic signals of the EmoDB dataset. They achieved an overall classification accuracy of 83.4% for all seven emotions.

Özseven (2018) investigated the effects of texture analysis methods and spectrogram images on speech emotion recognition. Four different texture analysis methods were used to obtain features from the spectrogram images of the speech signals, and acoustic features were also studied so that the texture analysis methods could be compared with acoustic analysis methods. Using an SVM classifier, they achieved success rates of 82.4%, 60.9%, and 64.6% with the texture analysis methods and 82.8%, 56.3%, and 74.3% with the acoustic analysis methods for the EMO-DB (Berlin Database of Emotional Speech), eNTERFACE'05, and SAVEE (Surrey Audio-Visual Expressed Emotion) databases, respectively. Comparing SER performance by approach, acoustic analysis outperforms texture analysis by 0.4 percentage points for EMO-DB and 9.7 percentage points for SAVEE, and underperforms it by 4.6 percentage points for eNTERFACE'05.

Hajarolasvadi and Demirel (2019) extracted 88-dimensional feature vectors from the acoustic signals of the SAVEE, RML (Ryerson Multimedia Laboratory), and eNTERFACE'05 databases. They also obtained each signal's spectrogram and then applied k-means clustering to all the extracted features to select keyframes. The spectrograms corresponding to the keyframes are then encapsulated in a 3-D tensor, which serves as the input to a 3-D CNN. The 3-D CNN consists of two convolutional layers and a fully connected layer for classifying six emotions: anger, disgust, fear, happy, sad, and surprise. They achieved overall classification accuracies of 81.05%, 77%, and 72.55% using the SAVEE, RML, and eNTERFACE'05 datasets, respectively.

 

Mohammed and Hasan (2020) used MELBP variants of spectrogram images to recognize emotion from the acoustic signal. They converted the emotional acoustic signals into 2D spectrogram images and then generated four forms of Extended Local Binary Pattern (ELBP) to extract the emotional features from the spectrogram images. ELBP provides information about the direction and variation of amplitude intensities for the given emotions; as a result, more effective feature vectors were captured. In their paper, a Multi-Block ELBP (MELBP) using histograms is proposed to highlight the important features of the spectrogram image, and a Deep Belief Network (DBN) is used to classify the emotions from the extracted features. For the well-known SAVEE dataset, they achieved 72.14% accuracy.

Sönmez and Varol (2020) developed a lightweight, effective speech emotion recognition method called multi-level local binary pattern and local ternary pattern abbreviated as 1BTPDN. This method first applied a one-dimensional local binary pattern (1D-LBP) and a one-dimensional local ternary pattern (1D-LTP) on the raw speech signal. Then 1D discrete wavelet transform (DWT) with nine levels was utilized to extract the features. Out of 7680 features, 1024 features are selected using neighbourhood component analysis (NCA). Using a third-degree polynomial kernel-based support vector machine as a classifier, they obtained success rates of 89.16%, 76.67%, and 74.31% for EMO-DB, SAVEE, and EMOVO (an Italian emotional speech database) databases, respectively.

Padi et al. (2021) proposed an automatic SER model based on transfer learning and spectrogram augmentation. In this work, spectrogram images were used as input to a ResNet layer, followed in sequence by a statistics pooling layer, fully connected layers, and a softmax layer. Three experimental setups were prepared for evaluating this model, and each setup contained data for four emotions only. The emotions, namely angry, happy, neutral, sad, and excited, were selected from the IEMOCAP (Interactive emotional dyadic motion capture) database. A high classification accuracy of 71.92% was achieved using this model.

Yalamanchi et al. (2022) proposed an architecture that utilizes CapsuleNets with time-distributed 2D-convolution layers to accurately predict emotions from speech samples. Their paper highlights the importance of time-distributed layers in handling time series data and capturing crucial cues for emotion recognition. The proposed model was trained and evaluated on two datasets, RAVDESS and IEMOCAP, for Speech Emotion Recognition (SER) and achieved an accuracy of 92.6% on the RAVDESS dataset and 93.2% on the IEMOCAP dataset. The CapsuleNets architecture with time-distributed 2D-convolution layers outperformed the architecture without time-distributed layers. Class-wise accuracies, precision, recall, and F1 score were used to assess the model's performance in predicting each emotion class. The results showed that the proposed architecture effectively classified emotions.

Zhang et al. (2023) proposed a speech emotion recognition (SER) model based on dual global context attention and time-frequency features that achieves competitive performance on three public datasets: IEMOCAP, RAVDESS, and EMO-DB. The model reaches recognition accuracies of 70.08%, 86.67%, and 93.27% on these datasets, respectively. Using time-frequency features improves performance compared to using either time-domain or frequency-domain features alone. Their model outperforms most of the baseline methods on the RAVDESS and EMO-DB datasets. The results show the effectiveness of the model in addressing the misuse of features and improving recognition accuracy in speech emotion recognition tasks.

Contributions of the present paper

1. We have implemented and evaluated a t-f image-based multiclass emotional state classification model using the back propagation artificial neural network (BPANN) classifier.

2. We have also validated the performance of the proposed SER model on various datasets, restricted to the emotions anger, disgust, fear, happy, neutral, and sad.

The rest of the paper is arranged as follows. The Materials and Methods section includes a brief discussion of the datasets, time-frequency images, feature extraction, the BPANN classifier, and the multiclass ASER model. It is followed by the Results and Discussion section and then the Conclusion.

 

Materials and methods

Dataset

For this study, we have selected three benchmark databases covering different native speakers and different languages, with both male and female speakers. The datasets used in this paper are discussed briefly below:

 

SAVEE:

Surrey Audio-Visual Expressed Emotion (SAVEE) is a well-known dataset of emotional speech. It contains audio-visual recordings of seven emotions (anger, disgust, fear, happiness, neutral, sadness, and surprise), for a total of 480 speech signals in .wav format from four male actors. For each subject, 30 utterances were recorded for the neutral emotion and 15 for each of the remaining emotions. The speech signals were recorded in a visual media lab with 16-bit encoding and a 44.1 kHz sampling rate. All subjects were British English speakers (Haq et al., 2008).

 

RAVDESS:

Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) is an open-source dataset that contains speech and song recordings, in audio and video form, from 24 actors (12 male, 12 female). All actors were native North Americans, and their speech signals cover the anger, disgust, fear, calm, happiness, neutral, sad, and surprise emotions. The speech signals were recorded in a studio with 16-bit encoding and a 48 kHz sampling rate. The RAVDESS dataset contains 60 sentences per actor, i.e., 60x24 = 1440 in total, and all files are in .wav format (Livingstone et al., 2005).

 

EmoDB:

The EmoDB dataset (Berlin Database of Emotional Speech) was created by the Institute of Communication Science, Technical University, Berlin, Germany, and can be openly accessed on its website. This dataset contains speech signals from 10 speakers (5 male, 5 female) covering seven emotions: anger, fear, boredom, disgust, happy, sad, and neutral. A total of 535 utterances were recorded with a 48 kHz sampling rate (Burkhardt et al., 2005).

In the present work, only the six emotions common to the above datasets have been selected, namely anger, disgust, fear, happy, neutral, and sad. The number of utterances for each emotion is listed in Table 1. A total of 1870 utterances are utilized, i.e., 360, 1056, and 454 from the SAVEE, RAVDESS, and EmoDB datasets, respectively.

 

Table 1 — Emotion and number of utterances selected from the various dataset

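| S. No. | Emotion | SAVEE | RAVDESS | EmoDB |
|---|---|---|---|---|
| 1. | Anger | 60 | 96 | 127 |
| 2. | Disgust | 60 | 192 | 46 |
| 3. | Fear | 60 | 192 | 69 |
| 4. | Happy | 60 | 192 | 71 |
| 5. | Neutral | 60 | 192 | 79 |
| 6. | Sad | 60 | 192 | 62 |
| | Total | 360 | 1056 | 454 |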

Time-Frequency Image

The time-frequency image is a two-dimensional visual representation of a signal: one axis is time, the other is frequency, and color encodes the amplitude at each time-frequency point. Dark blue indicates low amplitude, and bright colors from yellow to red indicate high amplitude (Fig.1). It is also known as a spectrogram, and when applied to speech signals it is called a voicegram or voiceprint. The speech signal is a 1-D signal that provides information such as speech rate, amplitude, and the spacing between samples, all of which carry information about emotions. Similarly, the t-f image is a 2-D image with color-coded amplitudes that carries information about the emotions embedded in the signal. The t-f image is usually obtained using the Fast Fourier Transform (FFT). The acoustic signal is first decomposed into short time frames. Each frame is then converted from the time domain to the frequency domain by applying the windowed Short-Time Fourier Transform (STFT), as shown in (eq.1). Here, a Hamming window with 50% overlap is used.

 

$$X(k,t) = \sum_{n=0}^{N-1} x(n)\, w(n)\, e^{-j 2\pi k n / N}, \qquad k = 0, 1, 2, \ldots, N-1 \tag{1}$$

 

where X(k,t) is the time-frequency representation of the acoustic signal x(n) at frequency index k and time frame t, x(n) is the preprocessed acoustic signal (the samples of frame t), w(n) is the Hamming window function, N is the length of the window function, and k is the frequency index, corresponding to the frequency f(k) = k·fs/N, where fs is the sampling frequency.

The following equation (eq.2) generates the coefficients of a Hamming window:

 

$$w(n) = 0.54 - 0.46 \cos\left(\frac{2\pi n}{N}\right), \qquad 0 \le n \le N \tag{2}$$

 

The time-frequency image is a very reliable representation from which to extract features for ASER. It holds rich information that cannot be extracted from the time domain or the frequency domain alone (Mustaqeem and Kwon, 2019). For this reason, time-frequency images have been used in many applications, including sound event classification, speech recognition, speech emotion recognition, and speaker recognition (Mao et al., 2014, Yu et al., 2013, Dennis et al., 2010, Lee et al., 2009). In this paper, MATLAB © R2021a has been used to convert each acoustic signal into a t-f image. Fig.1 shows the acoustic signals and their t-f images for the six emotions.
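The framing, Hamming windowing, and per-frame Fourier transform described in this section can be sketched in a few lines. The paper itself uses MATLAB, so the Python below is only an illustrative sketch under stated assumptions: the function names `hamming` and `stft_magnitude` are hypothetical, the window follows the symmetric form of eq. 2 (a window of `length` samples corresponds to N = `length` - 1 in that equation), and a direct DFT stands in for the FFT so that the example needs no external library.

```python
import cmath
import math


def hamming(length):
    """Hamming coefficients w(n) = 0.54 - 0.46*cos(2*pi*n/(length-1))."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
            for n in range(length)]


def stft_magnitude(signal, frame_len, overlap=0.5):
    """Return |X(k, t)| as a list of frames, each holding frame_len bins."""
    hop = max(1, int(frame_len * (1 - overlap)))  # 50% overlap by default
    window = hamming(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        seg = [signal[start + n] * window[n] for n in range(frame_len)]
        # Direct DFT of the windowed frame (an FFT would give the same values).
        spectrum = [
            abs(sum(seg[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                    for n in range(frame_len)))
            for k in range(frame_len)
        ]
        frames.append(spectrum)
    return frames


# Usage: a pure tone with 4 cycles per 16-sample frame should peak in bin k = 4.
fs = 16
tone = [math.sin(2 * math.pi * 4 * n / fs) for n in range(64)]
spec = stft_magnitude(tone, frame_len=16)
peak_bin = max(range(8), key=lambda k: spec[0][k])
```

Plotting the resulting magnitudes with time on one axis and frequency on the other, and mapping magnitude to color, gives the kind of t-f image discussed here. The frame length and test tone are toy values chosen for readability, not the settings used in the paper.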

 

Feature Extraction from Grayscale Image

Once the t-f image is obtained for every acoustic signal, it is converted into a 400x400 grayscale image. Each pixel of the grayscale image takes an integer intensity value between 0 and 255, where 0 is black and 255 is white. From each grayscale image, 472 features are extracted using various statistical and texture analysis methods, such as First Order Statistics (FOS), Haralick Spatial Gray Level Dependence Matrices (SGLDM), Gray Level Difference Statistics (GLDS), Neighborhood Gray Tone Difference Matrix (NGTDM), Statistical Feature Matrix (SFM), Spectral Texture of Images (STI), Gray Level Run Length Matrix (GLRLM), etc. (see Table 2). These features are then given to the back propagation artificial neural network (BPANN) classifier to train and test the recognition of emotions embedded in acoustic signals (Singh et al., 2015).
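As a small illustration of what a grayscale image feature looks like, the sketch below computes the five first-order statistics (mean, variance, median, mode, skewness) from the pixel values of an image. It is a minimal sketch, not the authors' MATLAB implementation: the function name `first_order_statistics` is hypothetical, the image is assumed to be a list of rows of 0-255 integers, and skewness is taken as the third standardized moment, which may differ from the exact definition used in the paper.

```python
import math
import statistics


def first_order_statistics(gray_image):
    """Mean, variance, median, mode, and skewness of 8-bit pixel values."""
    pixels = [p for row in gray_image for p in row]
    mean = statistics.fmean(pixels)
    variance = statistics.pvariance(pixels, mu=mean)
    std = math.sqrt(variance)
    # Skewness: third standardized moment; defined as 0 for a flat image.
    if std == 0:
        skew = 0.0
    else:
        skew = sum(((p - mean) / std) ** 3 for p in pixels) / len(pixels)
    return {
        "mean": mean,
        "variance": variance,
        "median": statistics.median(pixels),
        "mode": statistics.mode(pixels),
        "skewness": skew,
    }


# Usage on a toy 2x4 "image" (the paper works with 400x400 grayscale images).
toy = [[0, 0, 0, 255],
       [0, 0, 255, 255]]
features = first_order_statistics(toy)
```

The remaining feature families (texture, spectral, run-length, fractal) follow the same pattern of mapping one image to a fixed-length vector of numbers, which is what allows all 472 values to be fed to a single classifier.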

 

[Figure 1 panels: Anger, Disgust, Fear, Happy, Neutral, Sad]

Fig.1— Acoustic signal and t-f image of 6 emotions

BPANN

The neural network is one of the most widely used classifiers. Two procedures are central to neural networks: feedforward computation and backpropagation. In the present work, a backpropagation artificial neural network (BPANN) is utilized to classify emotions. This type of neural network requires supervised learning; it contains one input layer, one output layer, and one or more hidden layers. The goal of backpropagation is to minimize the error between the predicted output of the neural network and the actual target output. This is achieved by adjusting the weights of the network through iterative optimization, repeated over multiple iterations (epochs) until the network converges to a set of weights and biases that minimizes the error on the training data (Goh, 1995).

The main terminology and notation used to explain the BPANN method are as follows:

· $L$: The total number of layers in the neural network (including the input and output layers).

· $W^{(l)}$: The matrix of weights connecting layer $l$ to layer $l+1$. The superscript $(l)$ indicates the layer index.

· $b^{(l)}$: The vector of biases for layer $l+1$.

· $a^{(l)}$: The vector of activations in layer $l$ after applying the activation function.

· $z^{(l)}$: The vector of inputs to the activation function in layer $l$.

· $\delta^{(l)}$: The error term for layer $l$.

The backpropagation algorithm consists of two main steps: the forward pass and the backward pass.

1. Forward Pass:

Compute the activations $a^{(l)}$ for each layer using the weighted sum of inputs and the activation function.

$$z^{(l+1)} = W^{(l)} a^{(l)} + b^{(l)}$$

$$a^{(l+1)} = f\left(z^{(l+1)}\right)$$

This process is repeated for each layer until the output layer is reached.

2. Backward Pass:

Compute the error term $\delta^{(L)}$ for the output layer:

$$\delta^{(L)} = \nabla_a J \odot f'\left(z^{(L)}\right)$$

where $J$ is the cost function, $\nabla_a J$ is the gradient of the cost function with respect to the activations, $\odot$ denotes element-wise multiplication, and $f'(\cdot)$ is the derivative of the activation function.

Propagate the error backward through the layers to compute the error terms for each layer:

$$\delta^{(l)} = \left( \left(W^{(l)}\right)^{T} \delta^{(l+1)} \right) \odot f'\left(z^{(l)}\right)$$

Compute the gradients of the cost function with respect to the weights and biases:

$$\frac{\partial J}{\partial W^{(l)}} = \delta^{(l+1)} \left(a^{(l)}\right)^{T}$$

$$\frac{\partial J}{\partial b^{(l)}} = \delta^{(l+1)}$$

Update the weights and biases using a gradient descent optimization algorithm:

$$W^{(l)} \leftarrow W^{(l)} - \alpha \frac{\partial J}{\partial W^{(l)}}$$

$$b^{(l)} \leftarrow b^{(l)} - \alpha \frac{\partial J}{\partial b^{(l)}}$$

where $\alpha$ is the learning rate.
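The forward and backward passes above can be traced in a small runnable sketch. It is written under assumptions that the paper does not state: a sigmoid activation, the squared-error cost J = 0.5·||a − y||², plain per-sample gradient descent, and toy layer sizes. The function names are hypothetical, and the code uses 0-based indices, so `weights[l]` plays the role of the matrix connecting layer l to layer l+1.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(weights, biases, x):
    """Return the activations of every layer, starting with the input."""
    activations = [x]
    for W, b in zip(weights, biases):
        z = [sum(w * a for w, a in zip(row, activations[-1])) + bi
             for row, bi in zip(W, b)]
        activations.append([sigmoid(v) for v in z])
    return activations


def train_step(weights, biases, x, y, lr):
    """One forward/backward pass with an in-place gradient-descent update."""
    acts = forward(weights, biases, x)
    # Output error: (a - y) is the cost gradient, a * (1 - a) is f'(z).
    delta = [(a - t) * a * (1 - a) for a, t in zip(acts[-1], y)]
    for l in reversed(range(len(weights))):
        a_prev = acts[l]
        if l > 0:
            # Propagate the error with the weights as they were before updating.
            back = [sum(weights[l][j][i] * delta[j] for j in range(len(delta)))
                    for i in range(len(a_prev))]
            prev_delta = [bk * a * (1 - a) for bk, a in zip(back, a_prev)]
        # Gradients: dJ/dW = delta * a_prev^T and dJ/db = delta.
        for j, d in enumerate(delta):
            for i, a in enumerate(a_prev):
                weights[l][j][i] -= lr * d * a
            biases[l][j] -= lr * d
        if l > 0:
            delta = prev_delta


def cost(weights, biases, x, y):
    out = forward(weights, biases, x)[-1]
    return 0.5 * sum((o - t) ** 2 for o, t in zip(out, y))


# Usage: toy 4-3-2 network (the paper's classifier takes 472 input features).
random.seed(0)
sizes = [4, 3, 2]
weights = [[[random.uniform(-0.5, 0.5) for _ in range(n_in)]
            for _ in range(n_out)]
           for n_in, n_out in zip(sizes, sizes[1:])]
biases = [[0.0] * n_out for n_out in sizes[1:]]
x, y = [0.2, 0.9, 0.4, 0.7], [1.0, 0.0]
before = cost(weights, biases, x, y)
for _ in range(200):
    train_step(weights, biases, x, y, lr=0.5)
after = cost(weights, biases, x, y)
```

Repeating the step lowers the cost on the training sample, which is the convergence behaviour the section describes. A practical classifier would iterate over many samples per epoch and would typically rely on an optimized numerical library rather than nested Python lists.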

 

 

 

Table 2  — List of grayscale image features(Singh et al., 2015)

| Feature Category | Feature Name | No. of Features |
|---|---|---|
| Statistical Features | Mean, variance, median, mode, skewness | 5 |
| Haralick textural features | Mean and range values are calculated for features, namely angular second moment, contrast, correlation, sum of squares, homogeneity, sum average, sum variance, sum entropy, entropy, difference variance, difference entropy, information measures of correlation-1, information measures of correlation-2 | 26 |
| Gray level difference statistics (GLDS) | Homogeneity, contrast, energy, entropy | 4 |
| Neighbourhood gray-tone difference matrix (NGTDM) | Coarseness, contrast, busyness, complexity, strength | 5 |
| Statistical feature matrix (SFM) | Coarseness, contrast, periodicity, roughness | 4 |
| Texture energy measures (TEM) | LL, EE, SS, LE, ES, and LS kernel-based TEM features | 6 |
| Fractal dimension texture analysis (FDTA) | FDTA-H1, FDTA-H2, FDTA-H3, FDTA-H4 | 4 |
| Shape | Area, perimeter, perimeter square per unit area | 3 |
| Spectral texture of images (STI) | 199 features of spectral energy distribution as a function of radius, 180 features of spectral energy distribution as a function of angle | 379 |
| Invariant moments of image (IMI) | IMI1-IMI7 | 7 |
| Statistical measures of texture (SMT) | Average gray level, average contrast, measure of smoothness, third moment, uniformity, entropy | 6 |
| Gray-level run length matrix-based properties (GLRLP) | SRE, LRE, GLN, RLN, RP, LGRE, HGRE, SRLGE, SRHGE, LRLGE, LRHGE | 11 |
| Texture feature using Segmentation based fractal texture analysis (SFTA) algorithm | SFTA1-SFTA12 | 12 |
| Total | | 472 |

Experimental analysis of the multiclass SER model

In the present work, three benchmark datasets of emotional speech signals, namely SAVEE, RAVDESS, and EmoDB, have been utilized to recognize six emotions, i.e., anger, fear, disgust, happy, neutral, and sad. The speech signals of each dataset are first preprocessed to remove noise and unwanted components, which is a necessary step before feature extraction. Preprocessing consists of two stages. The first stage is noise removal using a bandpass filter with a 20 Hz to 20 kHz frequency range. In the second stage, the silent parts of the speech signal are removed to reduce the signal length and discard unwanted segments. After that, the speech signals are transformed into 2-D time-frequency images using the FFT method. Each t-f image is then converted into a 400x400 grayscale image using MATLAB © R2021a software. The t-f image-based multiclass SER model is shown in Fig.2. 472 features are extracted from each grayscale image. The BPANN classifier is trained using these features as input together with the ground-truth emotion labels. The trained network holds the learned parameters that allow the BPANN to make decisions during testing. The testing procedure mirrors the training procedure: the relevant features are extracted from unknown preprocessed speech signals, and the trained BPANN classifier assigns an emotion to each signal based on what it learned during training. We studied and evaluated the performance of the multiclass SER model using each dataset separately and using the mixed dataset. Each dataset is divided using a 5-fold data division protocol. This method divides the whole data into 5 equal parts, of which 4 are used for training and 1 for testing. The process is repeated 5 times, each time with a different part held out for testing (Browne, 2000).
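The 5-fold data division can be sketched as follows. The paper does not say whether the folds were shuffled or stratified by emotion, so the shuffling with a fixed seed shown here is an assumption, and `k_fold_indices` is a hypothetical helper name.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most 1.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


# Usage with the 360 SAVEE utterances selected in this work.
splits = list(k_fold_indices(360, k=5))
```

Each of the five rounds trains on 288 utterances and tests on the remaining 72, and the accuracies of the five rounds are then averaged to give the reported average CA.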





Fig.2 — A t-f image-based Speech Emotion Recognition (SER) model

Result and discussion

In the present work, we proposed a t-f image-based SER model for classifying six common emotions (i.e., anger, disgust, fear, happy, neutral, and sad) from the SAVEE, RAVDESS, and EmoDB datasets. The average accuracies of the BPANN classifier using grayscale features with 5-fold cross-validation are shown in Fig.3. The highest average classification accuracy (CA) of 93.56% is obtained for the EmoDB dataset, followed by 88.60% for the SAVEE dataset and 85.5% for the RAVDESS dataset. An average CA of 83.44% is obtained for the mixed dataset.


Fig.3 — Classification accuracy of BPANN-based SER model for various datasets

 

The true positive rate (TPR) and false positive rate (FPR) of the six emotions for all three datasets used in the present work are shown in Table 3. The TPR is the proportion of positive samples that are correctly classified as positive during testing, while the FPR is the proportion of negative samples that are incorrectly classified as positive. TPR should be high, and FPR should be low.

Table 3  — TPR & FPR for six emotions of SAVEE, RAVDESS, EmoDB

| Dataset | Emotion | TPR (%) | FPR (%) |
|---|---|---|---|
| SAVEE | Anger | 90.00 | 3.00 |
| SAVEE | Disgust | 85.00 | 4.00 |
| SAVEE | Fear | 88.33 | 1.67 |
| SAVEE | Happy | 88.33 | 2.00 |
| SAVEE | Neutral | 91.67 | 1.67 |
| SAVEE | Sad | 88.33 | 1.39 |
| RAVDESS | Anger | 87.13 | 3.72 |
| RAVDESS | Disgust | 86.11 | 3.02 |
| RAVDESS | Fear | 84.24 | 3.13 |
| RAVDESS | Happy | 82.65 | 3.12 |
| RAVDESS | Neutral | 85.26 | 1.88 |
| RAVDESS | Sad | 87.37 | 3.00 |
| EmoDB | Anger | 95.20 | 2.18 |
| EmoDB | Disgust | 93.33 | 1.50 |
| EmoDB | Fear | 85.71 | 1.87 |
| EmoDB | Happy | 88.57 | 1.60 |
| EmoDB | Neutral | 94.67 | 0.81 |
| EmoDB | Sad | 100.00 | 0.53 |

Here, the highest TPR of 100% is obtained for the sad emotion of the EmoDB dataset. For the RAVDESS dataset, the sad emotion also shows the highest TPR, 87.37%, while for the SAVEE dataset the highest TPR, 91.67%, is obtained for the neutral emotion. The lowest FPR, 0.53%, is obtained for the sad emotion of the EmoDB dataset. For the other datasets, the lowest FPR is 1.39% for the sad emotion of SAVEE and 1.88% for the neutral emotion of RAVDESS. Overall, the EmoDB dataset gives the highest CA and TPR and the lowest FPR of the three datasets utilized in this work.
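For a multiclass problem, per-emotion TPR and FPR are usually obtained by treating each emotion in turn as the positive class and all others as negative. The sketch below assumes that one-vs-rest reading, which the paper does not state explicitly; the function name `tpr_fpr` is hypothetical and the confusion matrix is made up for illustration, not taken from the paper's results.

```python
def tpr_fpr(confusion, labels):
    """One-vs-rest TPR and FPR (in %) per class.

    confusion[i][j] counts samples of true class i predicted as class j.
    """
    total = sum(sum(row) for row in confusion)
    rates = {}
    for c, name in enumerate(labels):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp
        fp = sum(row[c] for row in confusion) - tp
        tn = total - tp - fn - fp
        rates[name] = (100.0 * tp / (tp + fn), 100.0 * fp / (fp + tn))
    return rates


# Usage with a made-up 3-class confusion matrix (not the paper's data).
labels = ["anger", "neutral", "sad"]
confusion = [
    [9, 1, 0],
    [1, 8, 1],
    [0, 0, 10],
]
rates = tpr_fpr(confusion, labels)
```

In this toy matrix every "sad" sample is recognized, giving a TPR of 100%, yet one "neutral" sample is mislabeled as "sad", so the FPR for "sad" is still above zero. The same situation appears for the sad emotion of EmoDB in Table 3.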

 

Table 4 — Performance comparison with other research works

| Author Name & Year | Dataset | Feature & Classifier | Accuracy |
|---|---|---|---|
| Wang, 2014 | EMO-DB, eNTERFACE05 | Texture image information (TII) from spectrogram with SVM classifier | 65.20% to 77.42% |
| Zheng et al., 2015 | IEMOCAP | Deep Convolutional Neural Network using the spectrogram segment as input | 40% |
| Fayek et al., 2015 | eNTERFACE'05, SAVEE | Deep Neural Network, spectrogram as input | 60.53%, 59.7% |
| Badshah et al., 2017 | EmoDB | Convolutional Neural Network, spectrogram as input | 83.4% |
| Sönmez and Varol, 2017 | EMO-DB, SAVEE, EMOVO | Global optimum features with third-degree polynomial kernel-based SVM classifier | 89.16%, 76.67%, 74.31% |
| Özseven, 2018 | EmoDB, SAVEE, eNTERFACE'05 | Spectrogram-based features with SVM classifier | 82.8%, 74.3%, 60.9% |
| Hajarolasvadi and Demirel, 2019 | SAVEE, RML, eNTERFACE'05 | 3D CNN with tensors as input, consisting of Mel Frequency Cepstral Coefficients (MFCC), pitch, intensity, and spectrogram | 81.05%, 77%, 72.55% |
| Mustaqeem and Kwon, 2019 | RAVDESS | Deep stride convolutional neural network (DSCNN) with a spectrogram as input | 79.5% |
| Mohammad and Hasan, 2020 | SAVEE | Spectrogram features with ELBP and Deep Belief Network (DBN) | 72.14% |
| Wani et al., 2020 | SAVEE | Deep Stride Convolutional Neural Networks (DSCNN), spectrogram as input | 87.8% |
| Li et al., 2021 | IEMOCAP, EMO-DB, eNTERFACE05, SAVEE | Spatiotemporal and Frequential Cascaded Attention Network consisting of CNN | 80.47%, 83.30%, 75.80%, 56.50% |
| Proposed Method | SAVEE, RAVDESS, EmoDB, SAVEE+RAVDESS+EmoDB | Grayscale image features of spectrogram with BPANN classifier | 88.60%, 85.50%, 93.56%, 83.44% |

 

Table 4 compares the performance of different works on spectrogram features for speech emotion recognition. The works listed use different datasets of emotional signals, viz. EmoDB, eNTERFACE, IEMOCAP, SAVEE, EMOVO, and RAVDESS. Only the three datasets used in the present work, i.e., SAVEE, RAVDESS, and EmoDB, are considered in the comparison. For the SAVEE dataset, the highest average classification accuracy (CA) is 87.8%, obtained by Wani et al. (2020) using a Deep Stride Convolutional Neural Network (DSCNN). For the RAVDESS dataset, the highest average CA of 79.5%, also using a DSCNN, is reported by Mustaqeem and Kwon (2019). For the EmoDB dataset, the highest average CA is 83.4%, reported by Badshah et al. (2017) using a convolutional neural network. The proposed t-f image-based model with a Back Propagation Artificial Neural Network (BPANN) shows better results, with average CAs of 88.6%, 85.5%, and 93.56% for the SAVEE, RAVDESS, and EmoDB datasets, respectively. Relative to the maximum reported average CA, this is an improvement of 0.91%, 7.54%, and 12.18% for the SAVEE, RAVDESS, and EmoDB datasets, respectively.
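The quoted improvements appear to be relative gains over the best previously reported accuracy rather than absolute differences in percentage points. The short check below rests on that reading, which is an inference from the numbers and not something the paper states; `relative_improvement` is a hypothetical helper name.

```python
def relative_improvement(proposed, baseline):
    """Percentage gain of `proposed` over `baseline`, relative to `baseline`."""
    return 100.0 * (proposed - baseline) / baseline


# (proposed CA, best previously reported CA) as quoted in the text, in %.
reported = {
    "SAVEE": (88.6, 87.8),
    "RAVDESS": (85.5, 79.5),
    "EmoDB": (93.56, 83.4),
}
gains = {name: relative_improvement(p, b) for name, (p, b) in reported.items()}
```

Under this reading the computed gains agree with the quoted 0.91%, 7.54%, and 12.18% to within 0.01. The corresponding absolute differences are 0.8, 6.0, and 10.16 percentage points.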

 

Conclusion

The model proposed in this study recognizes emotions from time-frequency images of acoustic signals. Compared with the maximum reported accuracies for the same datasets, it shows improvements of 0.91%, 7.54%, and 12.18% for the SAVEE, RAVDESS, and EmoDB datasets, respectively. The proposed model is also validated on each dataset separately and on the combination of the datasets.

The proposed method also works efficiently on the combination of datasets, where it achieves a classification accuracy of 83.44%. These results are further evidence of the efficiency of the proposed model.

 

Future Scopes and Challenges

Time-frequency image-based analysis is among the most prominent approaches in the area of ASER. Much research has already been done on ASER applications for human-computer interfaces. Still, further improvement is needed in accuracy and in reducing model complexity so that such systems can be used widely at low cost. Validating and testing the emotion recognition model with an emotion database from the Indian context is another path for further research.

 

References

  1. Akçay, M. B., & Oğuz, K. (2020). Speech emotion recognition: Emotional models, databases, features, preprocessing methods, supporting modalities, and classifiers. Speech Communication, 116, 56-76.
  2. Badshah, A. M., Ahmad, J., Rahim, N., & Baik, S. W. (2017, February). Speech emotion recognition from spectrograms with deep convolutional neural network. In 2017 international conference on platform technology and service (PlatCon) (pp. 1-5). IEEE.
  3. Browne, M. W. (2000). Cross-validation methods. Journal of mathematical psychology, 44(1), 108-132.
  4. Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W. F., & Weiss, B. (2005, September). A database of German emotional speech. In Interspeech (Vol. 5, pp. 1517-1520).
  5. Dennis, J., Tran, H. D., & Li, H. (2010). Spectrogram image feature for sound event classification in mismatched conditions. IEEE signal processing letters, 18(2), 130-133.
  6. Ekman, P., Friesen, W. V., & Ellsworth, P. (2013). Emotion in the human face: Guidelines for research and an integration of findings (Vol. 11). Elsevier.
  7. Fayek, H. M., Lech, M., & Cavedon, L. (2015, December). Towards real-time speech emotion recognition using deep neural networks. In 2015 9th international conference on signal processing and communication systems (ICSPCS) (pp. 1-5). IEEE.
  8. Goh, A. T. (1995). Back-propagation neural networks for modeling complex systems. Artificial intelligence in engineering, 9(3), 143-151.
  9. Hajarolasvadi, N., & Demirel, H. (2019). 3D CNN-based speech emotion recognition using k-means clustering and spectrograms. Entropy, 21(5), 479.
  10. Haq, S., Jackson, P. J., & Edge, J. (2008). Audio-visual feature selection and reduction for emotion classification. In Proc. Int. Conf. on Auditory-Visual Speech Processing (AVSP’08), Tangalooma, Australia.
  11. Lee, D. D., Pham, P., Largman, Y., & Ng, A. (2009). Advances in neural information processing systems 22. In Proceedings of the 26th Annual Conference on Neural Information Processing Systems 2012, NIPS 2012.
  12. Li, S., Xing, X., Fan, W., Cai, B., Fordson, P., & Xu, X. (2021). Spatiotemporal and frequential cascaded attention networks for speech emotion recognition. Neurocomputing, 448, 238-248.
  13. Livingstone, S. R., & Russo, F. A. (2018). The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PloS one, 13(5), e0196391.
  14. Mao, Q., Dong, M., Huang, Z., & Zhan, Y. (2014). Learning salient features for speech emotion recognition using convolutional neural networks. IEEE transactions on multimedia, 16(8), 2203-2213.
  15. Mohammed, S. N., & Abdul Hassan, A. K. (2020). Speech Emotion Recognition Using MELBP Variants of Spectrogram Image. International Journal of Intelligent Engineering & Systems, 13(5).
  16. Mustaqeem, & Kwon, S. (2019). A CNN-assisted enhanced audio signal processing for speech emotion recognition. Sensors, 20(1), 183.
  17. Özseven, T. (2018). Investigation of the effect of spectrogram images and different texture analysis methods on speech emotion recognition. Applied Acoustics, 142, 70-77.
  18. Padi, S., Sadjadi, S. O., Sriram, R. D., & Manocha, D. (2021, October). Improved speech emotion recognition using transfer learning and spectrogram augmentation. In Proceedings of the 2021 international conference on multimodal interaction (pp. 645-652).
  19. Singh, B. K., Verma, K., & Thoke, A. S. (2015). Adaptive gradient descent backpropagation for classification of breast tumors in ultrasound imaging. Procedia Computer Science, 46, 1601-1609.
  20. Sönmez, Y. Ü., & Varol, A. (2020). A speech emotion recognition model based on multi-level local binary and local ternary patterns. IEEE Access, 8, 190784-190796.
  21. Wang, K. C. (2014). The feature extraction based on texture image information for emotion sensing in speech. Sensors, 14(9), 16692-16714.
  22. Wani, T. M., Gunawan, T. S., Qadri, S. A. A., Mansor, H., Kartiwi, M., & Ismail, N. (2020, September). Speech emotion recognition using convolution neural networks and deep stride convolutional neural networks. In 2020 6th International Conference on Wireless and Telematics (ICWT) (pp. 1-6). IEEE.
  23. Yalamanchili, B., Anne, K. R., & Samayamantula, S. K. (2022). Speech emotion recognition using time distributed 2D-Convolution layers for CAPSULENETS. Multimedia Tools and Applications, 81(12), 16945-16966.
  24. Yu, D., Seltzer, M. L., Li, J., Huang, J. T., & Seide, F. (2013). Feature learning in deep neural networks-studies on speech recognition tasks. arXiv preprint arXiv:1301.3605.
  25. Zhang, P., Bai, X., Zhao, J., Liang, Y., Wang, F., & Wu, X. (2023, June). Speech Emotion Recognition Using Dual Global Context Attention and Time-Frequency Features. In 2023 International Joint Conference on Neural Networks (IJCNN) (pp. 1-7). IEEE.
  26. Zheng, W. Q., Yu, J. S., & Zou, Y. X. (2015, September). An experimental study of speech emotion recognition based on deep convolutional neural networks. In 2015 international conference on affective computing and intelligent interaction (ACII) (pp. 827-831). IEEE.

