Current Issue: April-June | Volume: 2023 | Issue Number: 2 | Articles: 5
Speech emotion recognition (SER) has grown into one of the most active research topics in computational linguistics over the last two decades. Since speech is the primary medium of human communication, understanding a speaker's emotional state from speech and responding accordingly has made the speech emotion recognition system an essential part of the human-computer interaction (HCI) field. Although a few review works on SER have been carried out, none of them discusses the development of SER systems for the Indo-Aryan or Dravidian language families. This paper focuses on studies carried out towards the development of automatic SER systems for Indo-Aryan and Dravidian languages. It also presents a brief survey of the prominent databases available for SER experiments and discusses some notable research on identifying emotion from the speech signal over the last two decades....
This paper proposes a new adaptive algorithm for the second-order blind signal separation (BSS) problem with convolutive mixtures, using a combination of an accelerated gradient method and a conjugate gradient method. At each iteration of the adaptive algorithm, the search point and the search direction are obtained from the current and previous iterates. The algorithm efficiently computes the step size for the accelerated conjugate gradient update in each iteration. Simulation results show that the proposed accelerated conjugate gradient algorithm with optimal step size converges faster than the accelerated descent algorithm and the steepest descent algorithm with optimal step size, while having lower computational complexity. In particular, the number of iterations required for convergence of the accelerated conjugate gradient algorithm is significantly lower than for the accelerated descent and steepest descent algorithms. In addition, the proposed system improves the signal-to-interference ratio and signal-to-noise ratio of the dominant speech outputs....
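As a rough, self-contained illustration of the update structure named in this abstract (a look-ahead point formed from the current and previous iterates, a conjugate search direction, and a closed-form optimal step size), the sketch below applies an accelerated conjugate-gradient iteration to a simple quadratic cost. The quadratic stand-in, the fixed momentum value, and the restart rule are assumptions of the sketch; the paper's actual cost is the second-order convolutive-BSS criterion.

```python
import numpy as np

def accelerated_cg_quadratic(A, b, w0, n_iter=50, momentum=0.9):
    """Accelerated conjugate-gradient sketch on f(w) = 0.5*w'Aw - b'w (illustrative only)."""
    f = lambda w: 0.5 * w @ A @ w - b @ w
    w, w_prev, d, g_prev = w0.copy(), w0.copy(), None, None
    for _ in range(n_iter):
        y = w + momentum * (w - w_prev)              # look-ahead (accelerated) point
        if f(y) > f(w):                              # simple restart if momentum overshoots
            y = w
        g = A @ y - b                                # gradient at the look-ahead point
        if d is None:
            d = -g                                   # first step: steepest descent
        else:
            gamma = max(0.0, g @ (g - g_prev) / (g_prev @ g_prev))  # Polak-Ribiere (clamped)
            d = -g + gamma * d                       # conjugate search direction
        alpha = -(g @ d) / (d @ A @ d)               # optimal step size along d for a quadratic
        w_prev, w, g_prev = w, y + alpha * d, g
    return w

# Toy run on a random symmetric positive-definite system.
rng = np.random.default_rng(0)
M = rng.normal(size=(20, 20))
A, b = M @ M.T + 20.0 * np.eye(20), rng.normal(size=20)
w = accelerated_cg_quadratic(A, b, np.zeros(20))
print(np.linalg.norm(A @ w - b))   # residual norm; small once the iteration has converged
```

The closed-form step size exists here only because the toy cost is quadratic; for the BSS criterion the step-size computation is the part the paper addresses.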
Recently, pattern recognition in audio signal processing using electroencephalography (EEG) has attracted significant attention. Changes in eye state (open or closed) are reflected as distinct patterns in EEG data gathered across a range of conditions and actions. The accuracy of extracting other information from these signals therefore depends significantly on predicting the eye state during EEG acquisition. In this paper, we use deep learning vector quantization (DLVQ) and feedforward artificial neural network (F-FANN) techniques to recognize the eye state. DLVQ is superior to traditional VQ in classification tasks because it can learn a code-constrained codebook. When initialized with the k-means VQ approach, DLVQ shows very promising performance on an EEG-audio information retrieval task, while the F-FANN classifies EEG-audio signals of eye state as open or closed. The DLVQ model achieves higher classification accuracy, F-score, precision, and recall, as well as superior classification ability, compared with the F-FANN....
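For readers unfamiliar with vector-quantization classifiers, the sketch below shows a classic LVQ1 learner with k-means-initialised, class-labelled prototypes on synthetic two-class features. It is only a stand-in for the learned-codebook idea in the abstract: the paper's DLVQ learns a code-constrained codebook on deep features, and the hyperparameters and synthetic data here are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def train_lvq(X, y, prototypes_per_class=4, lr=0.05, epochs=20, seed=0):
    """LVQ1 with k-means-initialised prototypes (minimal stand-in for a learned codebook)."""
    rng = np.random.default_rng(seed)
    protos, proto_labels = [], []
    for c in np.unique(y):                         # k-means initialisation, per class
        km = KMeans(n_clusters=prototypes_per_class, n_init=10, random_state=seed).fit(X[y == c])
        protos.append(km.cluster_centers_)
        proto_labels.append(np.full(prototypes_per_class, c))
    protos, proto_labels = np.vstack(protos), np.concatenate(proto_labels)
    for _ in range(epochs):                        # LVQ1 updates
        for i in rng.permutation(len(X)):
            j = np.argmin(np.linalg.norm(protos - X[i], axis=1))    # nearest prototype
            step = lr * (X[i] - protos[j])
            protos[j] += step if proto_labels[j] == y[i] else -step  # attract / repel
    return protos, proto_labels

def predict_lvq(X, protos, proto_labels):
    nearest = np.argmin(np.linalg.norm(X[:, None] - protos[None], axis=2), axis=1)
    return proto_labels[nearest]

# Hypothetical usage on synthetic two-class "eye open / eye closed" feature vectors.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (200, 8)), rng.normal(1.5, 1.0, (200, 8))])
y = np.array([0] * 200 + [1] * 200)
protos, labels = train_lvq(X, y)
print((predict_lvq(X, protos, labels) == y).mean())   # training accuracy of the sketch
```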
In this paper, an automatic speech emotion recognition (SER) task of classifying eight different emotions was carried out using parallel networks trained on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS). A combination of a CNN-based network and attention-based networks, running in parallel, was used to model both spatial and temporal feature representations. Augmentation techniques based on additive white Gaussian noise (AWGN), SpecAugment, room impulse response (RIR) convolution, and tanh distortion were used to augment the training data and further generalize the model. Raw audio was transformed into Mel-spectrograms as the model's input. Exploiting the CNN's proven capability in image classification and spatial feature representation, each spectrogram was treated as an image whose height and width are given by the spectrogram's time and frequency scales. Temporal feature representations were modelled by attention-based modules: a Transformer and a BLSTM-Attention module. The proposed parallel architectures, with the CNN-based network running alongside the Transformer or BLSTM-Attention module, were compared with standalone CNN architectures and attention-based networks, as well as with hybrid architectures in which CNN layers wrapped in time-distributed wrappers are stacked on attention-based networks. In these experiments, the highest accuracies of 89.33% for the parallel CNN-Transformer network and 85.67% for the parallel CNN-BLSTM-Attention network were achieved on a 10% hold-out test set from the dataset. These networks showed promising results while requiring significantly fewer trainable parameters than the non-parallel hybrid models....
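As a rough illustration of the parallel spatial/temporal design described above, the following PyTorch sketch runs a small CNN branch and a Transformer-encoder branch over the same Mel-spectrogram and concatenates their pooled features for eight-way classification. The layer sizes, pooling choices, and the omission of the BLSTM-Attention variant and the augmentation pipeline are simplifications of this sketch, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ParallelCNNTransformer(nn.Module):
    def __init__(self, n_mels=128, n_classes=8, d_model=128, n_heads=4, n_layers=2):
        super().__init__()
        # CNN branch: treats the mel-spectrogram as a 1-channel image (spatial features).
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.BatchNorm2d(16), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        cnn_dim = 32 * 4 * 4
        # Transformer branch: treats each time frame (n_mels values) as a token (temporal features).
        self.frame_proj = nn.Linear(n_mels, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                               dim_feedforward=256, batch_first=True)
        self.transformer = nn.TransformerEncoder(enc_layer, num_layers=n_layers)
        self.classifier = nn.Linear(cnn_dim + d_model, n_classes)

    def forward(self, mel):                      # mel: (batch, 1, n_mels, time)
        spatial = self.cnn(mel).flatten(1)       # (batch, cnn_dim)
        frames = mel.squeeze(1).transpose(1, 2)  # (batch, time, n_mels)
        temporal = self.transformer(self.frame_proj(frames)).mean(dim=1)  # (batch, d_model)
        return self.classifier(torch.cat([spatial, temporal], dim=1))

# Example forward pass: batch of 4 clips, 128 mel bands x 130 frames (illustrative sizes).
logits = ParallelCNNTransformer()(torch.randn(4, 1, 128, 130))
print(logits.shape)  # torch.Size([4, 8])
```

The key design point is that the two branches see the same input but pool it differently, so their parameter counts stay modest compared with stacking a CNN on top of an attention network.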
It has become popular for people to share their opinions about products on TikTok and YouTube. Automatically extracting the sentiment towards a particular product can assist users in making buying decisions. For videos in languages such as Spanish, the tone of voice can be used to determine sentiment when a translation is not available. In this paper, we propose a novel algorithm to classify sentiment in speech in the presence of environmental noise. Traditional models rely on pretrained human-audio feature extractors that do not generalize well across different accents. Here, we leverage a vector space of emotional concepts in which words with similar meanings often share the same prefix; for example, words starting with 'con' or 'ab' signify absence and hence negative sentiment. Augmentation is a popular way to amplify the training data in audio classification, but some augmentations can reduce accuracy. We therefore propose a new metric based on eigenvalues to select the best augmentations. We evaluate the proposed approach on emotions in YouTube videos and outperform baselines by 10–20%. Each neuron learns words with similar pronunciations and emotions. We also use the model to detect the presence of birds in audio recordings from the city....
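The abstract does not spell out the eigenvalue-based selection metric, so the sketch below shows one plausible reading: compare the leading eigenvalues of the feature covariance before and after augmentation, and prefer augmentations that change the spectrum least. The scoring rule, feature dimensions, and candidate augmentations are all assumptions of this illustration, not the paper's definition.

```python
import numpy as np

def spectrum_shift(original_feats, augmented_feats, top_k=10):
    """Distance between the normalised top-k covariance eigenvalue spectra (assumed metric)."""
    def top_eigvals(x):
        cov = np.cov(x, rowvar=False)                 # feature covariance matrix
        vals = np.linalg.eigvalsh(cov)[::-1][:top_k]  # largest eigenvalues first
        return vals / vals.sum()                      # normalise to compare spectral shapes
    return np.abs(top_eigvals(original_feats) - top_eigvals(augmented_feats)).sum()

# Hypothetical usage: rank candidate augmentations by how little they distort the spectrum.
rng = np.random.default_rng(0)
clean = rng.normal(size=(500, 40))                    # e.g. 40-dim MFCC-like features
candidates = {"light_noise": clean + 0.1 * rng.normal(size=clean.shape),
              "heavy_distortion": np.tanh(3.0 * clean)}
scores = {name: spectrum_shift(clean, aug) for name, aug in candidates.items()}
print(sorted(scores, key=scores.get))                 # most distribution-preserving augmentation first
```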