Frequency: Quarterly E-ISSN: 2250-2912 P-ISSN: Awaited Abstracted/Indexed in: Ulrich's International Periodical Directory, Google Scholar, SCIRUS, Genamics JournalSeek, EBSCO Information Services
Published quarterly in print and online, "Inventi Impact: Audio, Speech & Music Processing" publishes high-quality unpublished research as well as high-impact pre-published research and reviews catering to the needs of researchers and professionals. It focuses on sound engineering, recording, electronic production of speech and music, and digitization of sound.
The Cordoba Guitar Festival is one of the most important cultural events in Spain. This article analyses the musical preferences, satisfaction, attitudinal loyalty, and behavioural loyalty of spectators who attended the 36th festival held in July 2016, as well as the festival's economic impact on the city. These characteristics of the public give rise to the four hypotheses of this study. To achieve this aim, a structural equation model (SEM) was used. The results....
This paper presents a low-power, high-gain integrator design that uses a cascode operational transconductance amplifier (OTA) with floating inverter-amplifier (FIA) assistance. Compared to a traditional cascode, the proposed integrator achieves a gain of 80 dB while reducing power consumption by 30%. Upon completing the analysis, the value of the FIA drive capacitor and the clock scheme for the FIA-assisted OTA were obtained. To enhance the dynamic range (DR) and mitigate quantization noise, a tri-level quantizer was employed. The design of the feedback digital-to-analog converter (DAC) was simplified, as it does not use additional mismatch-shaping techniques. A third-order, discrete-time delta-sigma modulator was designed and fabricated in a 0.18 μm complementary metal-oxide semiconductor (CMOS) process. It operated on a 1.8 V supply, consuming 221 μW with a 24 kHz bandwidth. The measured signal-to-noise-and-distortion ratio (SNDR) and DR were 90.9 dB and 95.3 dB, respectively....
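The circuit-level details above (FIA-assisted OTA, 0.18 μm fabrication) cannot be reproduced in a few lines, but the noise-shaping and tri-level quantization ideas can be illustrated behaviorally. The sketch below simulates a simplified first-order discrete-time delta-sigma loop with a tri-level quantizer in NumPy; it is a conceptual illustration only, not the authors' third-order design, and the sampling rate, input level, and metrics are assumptions.

```python
import numpy as np

def trilevel_quantize(u):
    """Quantize to the three levels {-1, 0, +1} (mid-tread)."""
    return np.clip(np.round(u), -1.0, 1.0)

def mod1_trilevel(x):
    """Behavioral first-order delta-sigma loop (illustrative only).

    acc[n] = acc[n-1] + x[n] - v[n-1],  v[n] = Q(acc[n]);
    quantization noise is shaped by (1 - z^-1).
    """
    acc, v_prev = 0.0, 0.0
    v = np.empty_like(x)
    for n, xn in enumerate(x):
        acc += xn - v_prev
        v[n] = trilevel_quantize(acc)
        v_prev = v[n]
    return v

# Example: a -6 dBFS sine, heavily oversampled relative to a 24 kHz band.
fs, f_in, n = 6.144e6, 5e3, 1 << 16
t = np.arange(n) / fs
x = 0.5 * np.sin(2 * np.pi * f_in * t)
v = mod1_trilevel(x)

# In-band (0-24 kHz) error power as a rough quality indicator.
spectrum = np.fft.rfft((v - x) * np.hanning(n))
freqs = np.fft.rfftfreq(n, 1 / fs)
inband = np.abs(spectrum[freqs <= 24e3]) ** 2
print("in-band error power:", inband.sum() / n)
```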
Deep learning is bringing breakthroughs to many computer vision subfields, including Optical Music Recognition (OMR), which has seen a series of improvements to musical symbol detection achieved by using generic deep learning models. However, so far, each such proposal has been based on a specific dataset and different evaluation criteria, which has made it difficult to quantify the new deep learning-based state of the art and assess the relative merits of these detection models on music scores. In this paper, a baseline for general detection of musical symbols with deep learning is presented. We consider three datasets of heterogeneous typology but with the same annotation format and three neural models of different nature, and establish their performance in terms of a common evaluation standard. The experimental results confirm that direct music object detection with deep learning is indeed promising, but at the same time illustrate some of the domain-specific shortcomings of the general detectors. A qualitative comparison then suggests avenues for OMR improvement, based both on properties of the detection model and on how the datasets are defined. To the best of our knowledge, this is the first time that competing music object detection systems from the machine learning paradigm are directly compared to each other. We hope that this work will serve as a reference to measure the progress of future developments of OMR in music object detection....
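As a rough illustration of what such a generic deep-learning detector looks like when repurposed for music symbols, the sketch below re-heads an off-the-shelf Faster R-CNN from torchvision for a hypothetical set of symbol classes and runs one dummy training step. The class count, image size, and labels are assumptions; the paper's actual models, datasets, and evaluation protocol are not reproduced here.

```python
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

NUM_SYMBOL_CLASSES = 72  # hypothetical number of musical-symbol classes

# Generic detector, re-headed for music symbols; backbone/weights choice is an assumption.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, NUM_SYMBOL_CLASSES + 1)

# One dummy training step on a synthetic score image with two annotated boxes.
model.train()
image = torch.rand(3, 800, 1200)
target = {
    "boxes": torch.tensor([[100., 150., 140., 190.], [300., 400., 330., 460.]]),
    "labels": torch.tensor([3, 17]),  # hypothetical ids, e.g. "notehead", "clef"
}
losses = model([image], [target])
total_loss = sum(losses.values())
total_loss.backward()
print({k: float(v) for k, v in losses.items()})
```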
The significance of emotion recognition technology continues to grow, and research in this field enables artificial intelligence to accurately understand and react to human emotions. This study aims to enhance the efficacy of emotion recognition from speech by using dimensionality reduction algorithms for visualization, effectively outlining emotion-specific audio features. As a model for emotion recognition, we propose a new architecture that combines a bidirectional long short-term memory (BiLSTM)-Transformer and a 2D convolutional neural network (CNN). The BiLSTM-Transformer processes audio features to capture the sequence of speech patterns, while the 2D CNN handles Mel-spectrograms to capture the spatial details of the audio. To validate the proficiency of the model, the 10-fold cross-validation method is used. The methodology proposed in this study was applied to Emo-DB and RAVDESS, two major speech emotion recognition databases, and achieved high unweighted accuracy rates of 95.65% and 80.19%, respectively. These results indicate that the use of the proposed Transformer-based deep learning model with appropriate feature selection can enhance performance in emotion recognition from speech....
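A minimal PyTorch sketch of the kind of dual-branch architecture described above is given below: a BiLSTM feeding a Transformer encoder for sequential audio features, a small 2D CNN for Mel-spectrograms, and a fused classifier. Layer sizes, feature dimensions, and the fusion scheme are assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class BiLSTMTransformerCNN(nn.Module):
    """Illustrative dual-branch speech-emotion model (all sizes are assumptions)."""

    def __init__(self, n_feats=40, n_emotions=7, hidden=128):
        super().__init__()
        # Sequence branch: BiLSTM followed by a Transformer encoder.
        self.bilstm = nn.LSTM(n_feats, hidden, batch_first=True, bidirectional=True)
        enc_layer = nn.TransformerEncoderLayer(d_model=2 * hidden, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(enc_layer, num_layers=2)
        # Spectrogram branch: small 2D CNN over Mel-spectrogram "images".
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(2 * hidden + 32, n_emotions)

    def forward(self, feats, mel):
        # feats: (batch, time, n_feats); mel: (batch, 1, n_mels, frames)
        seq, _ = self.bilstm(feats)
        seq = self.transformer(seq).mean(dim=1)   # temporal average pooling
        spec = self.cnn(mel).flatten(1)           # (batch, 32)
        return self.classifier(torch.cat([seq, spec], dim=1))

model = BiLSTMTransformerCNN()
logits = model(torch.randn(8, 200, 40), torch.randn(8, 1, 64, 200))
print(logits.shape)  # torch.Size([8, 7])
```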
Speech is the most significant mode of communication among human beings and a potential method for human-computer interaction (HCI) using a microphone sensor. Quantifiable emotion recognition from speech signals captured by these sensors is an emerging area of research in HCI, with applications such as human-robot interaction, virtual reality, behavior assessment, healthcare, and emergency call centers, where the speaker's emotional state must be determined from an individual's speech. In this paper, we present two major contributions: (i) increasing the accuracy of speech emotion recognition (SER) compared to the state of the art, and (ii) reducing the computational complexity of the presented SER model. We propose an artificial intelligence-assisted deep stride convolutional neural network (DSCNN) architecture using the plain-nets strategy to learn salient and discriminative features from spectrograms of speech signals that are enhanced in prior steps to perform better. Local hidden patterns are learned in convolutional layers with special strides to downsample the feature maps rather than using pooling layers, and global discriminative features are learned in fully connected layers. A softmax classifier is used for the classification of emotions in speech. The proposed technique is evaluated on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) and Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) datasets, improving accuracy by 7.85% and 4.5%, respectively, with the model size reduced by 34.5 MB. This proves the effectiveness and significance of the proposed SER technique and reveals its applicability in real-world applications....
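To make the "strided convolutions instead of pooling" idea concrete, the sketch below shows a small PyTorch CNN that downsamples spectrogram patches purely with stride-2 convolutions and classifies via cross-entropy (which applies the softmax). It is a simplified stand-in; the depth, channel widths, and input shape of the authors' DSCNN are assumptions here.

```python
import torch
import torch.nn as nn

class StrideCNN(nn.Module):
    """Plain strided CNN for spectrogram emotion classification (illustrative sizes)."""

    def __init__(self, n_emotions=4):
        super().__init__()
        # Stride-2 convolutions downsample the feature maps; no pooling layers are used.
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        # Fully connected layers learn the global discriminative features.
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 16 * 16, 128), nn.ReLU(),
            nn.Linear(128, n_emotions),
        )

    def forward(self, spectrogram):            # (batch, 1, 128, 128) fixed-size patches
        return self.classifier(self.features(spectrogram))

model = StrideCNN()
logits = model(torch.randn(2, 1, 128, 128))
# CrossEntropyLoss applies the softmax implicitly during training.
loss = nn.CrossEntropyLoss()(logits, torch.tensor([0, 2]))
print(logits.shape, float(loss))
```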
Building a voice-operated system for learning-disabled users is a difficult task that requires a considerable amount of time and effort. Due to the wide spectrum of disabilities and their different related phonopathies, most available approaches are targeted to a specific pathology. This may improve their accuracy for some users, but makes them unsuitable for others. In this paper, we present a cross-lingual approach to adapt a general-purpose modular speech recognizer for learning-disabled people. The main advantage of this approach is that it allows rapid and cost-effective development by taking the already built speech recognition engine and its modules, and utilizing existing resources for standard speech in different languages for the recognition of the users' atypical voices. Although the recognizers built with the proposed technique obtain lower accuracy rates than those trained for specific pathologies, they can be used by a wide population and developed more rapidly, which makes it possible to design various types of speech-based applications accessible to learning-disabled users....
In recent years, the use of electroencephalography (EEG) has grown as a tool for diagnosis and brain-function monitoring, being a simple and non-invasive method compared with other procedures such as histological sampling. Typically, in order to extract functional brain responses from EEG signals, prolonged and repeated stimuli are needed because of the artifacts generated in recordings, which adversely impact the stimulus-response analysis. To mitigate the artifact effect, correlation analysis (CA) methods are applied in the literature, where the predominant approaches focus on enhancing stimulus-response correlations through linear analysis methods such as canonical correlation analysis (CCA). This paper introduces a novel CA framework based on a neural network with a loss function specifically designed to maximize the correlation between EEG and speech stimuli. Compared with other deep learning CA approaches (DCCAs) in the literature, this framework introduces a single multilayer perceptron (MLP) network instead of two networks, one for each input view. To validate the proposed approach, a comparison with linear CCA (LCCA) and DCCA was performed, using a dataset containing the EEG traces of subjects listening to speech stimuli. The experimental results show that the proposed method improves the overall Pearson correlation by 10.56% compared with the state-of-the-art DCCA method....
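The core of such a framework, a single MLP trained with a correlation-maximizing loss between projected EEG and a speech stimulus feature, can be sketched in a few lines of PyTorch. The layer sizes, channel count, and the exact form of the loss are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

def neg_pearson(a, b, eps=1e-8):
    """Negative Pearson correlation between two 1-D signals (to be minimized)."""
    a = a - a.mean()
    b = b - b.mean()
    return -(a * b).sum() / (a.norm() * b.norm() + eps)

# Single MLP projecting multichannel EEG frames to a 1-D trace that should
# correlate with the speech stimulus feature (e.g. its envelope).
eeg_channels = 64                       # assumption
mlp = nn.Sequential(
    nn.Linear(eeg_channels, 128), nn.Tanh(),
    nn.Linear(128, 1),
)
optimizer = torch.optim.Adam(mlp.parameters(), lr=1e-3)

# Synthetic stand-ins: (time, channels) EEG and a (time,) speech feature.
eeg = torch.randn(5000, eeg_channels)
speech_feature = torch.randn(5000)

for step in range(100):
    optimizer.zero_grad()
    projected = mlp(eeg).squeeze(-1)
    loss = neg_pearson(projected, speech_feature)
    loss.backward()
    optimizer.step()
print("final correlation:", -float(loss))
```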
The automatic identification of emotions from speech holds significance in facilitating interactions between humans and machines. To improve the recognition accuracy of speech emotion, we extract mel-frequency cepstral coefficients (MFCCs) and pitch features from raw signals, and an improved differential evolution (DE) algorithm is utilized for feature selection based on K-nearest neighbor (KNN) and random forest (RF) classifiers. The proposed multivariate DE (MDE) adopts three mutation strategies to overcome the slow convergence of classical DE and maintain population diversity, and employs a jumping method to avoid falling into local traps. The simulations are conducted on four public English speech emotion datasets, eNTERFACE05, the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), the Surrey Audio-Visual Expressed Emotion (SAVEE) database, and the Toronto Emotional Speech Set (TESS), which cover a diverse range of emotions. The MDE algorithm is compared with PSO-assisted biogeography-based optimization (BBO_PSO), DE, and the sine cosine algorithm (SCA) on emotion recognition error, number of selected features, and running time. From the results obtained, MDE achieves errors of 0.5270, 0.5044, 0.4490, and 0.0420 on eNTERFACE05, RAVDESS, SAVEE, and TESS with the KNN classifier, and errors of 0.4721, 0.4264, 0.3283, and 0.0114 with the RF classifier. The proposed algorithm demonstrates excellent performance in emotion recognition accuracy, and it finds meaningful acoustic features among the MFCCs and pitch features....
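A simplified wrapper of this kind, binary feature masks evolved by a basic DE/rand/1/bin scheme and scored by a KNN classifier's cross-validated error, is sketched below. It deliberately omits the paper's three mutation strategies and jumping mechanism; the population size, DE parameters, and synthetic data are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=40, n_informative=10, random_state=0)

def error(mask):
    """Cross-validated KNN error for the features selected by a binary mask."""
    if mask.sum() == 0:
        return 1.0
    acc = cross_val_score(KNeighborsClassifier(5), X[:, mask], y, cv=5).mean()
    return 1.0 - acc

# Basic DE/rand/1/bin over continuous vectors, thresholded at 0.5 to obtain masks.
pop_size, dim, F, CR = 20, X.shape[1], 0.5, 0.9
pop = rng.random((pop_size, dim))
fitness = np.array([error(p > 0.5) for p in pop])

for gen in range(30):
    for i in range(pop_size):
        a, b, c = pop[rng.choice(pop_size, 3, replace=False)]
        mutant = np.clip(a + F * (b - c), 0, 1)
        cross = rng.random(dim) < CR
        trial = np.where(cross, mutant, pop[i])
        f_trial = error(trial > 0.5)
        if f_trial <= fitness[i]:
            pop[i], fitness[i] = trial, f_trial

best = pop[fitness.argmin()] > 0.5
print("best error:", fitness.min(), "selected features:", int(best.sum()))
```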
Substantial amounts of resources are usually required to robustly develop a language model for an open-vocabulary speech recognition system, as out-of-vocabulary (OOV) words can hurt recognition accuracy. In this work, we applied a hybrid lexicon of word and sub-word units to resolve the problem of OOV words in a resource-efficient way. As sub-lexical units can be combined to form new words, a compact hybrid vocabulary can be used while still maintaining a low OOV rate. For Thai, a syllable-based unit called a pseudo-morpheme (PM) was chosen as the sub-word unit. To also benefit from the different levels of linguistic information embedded in different input types, a hybrid recurrent neural network language model (RNNLM) framework is proposed. An RNNLM can not only model information from multiple input unit types through a hybrid input vector of words and PMs, but can also capture long context history through recurrent connections. Several hybrid input representations were also explored to optimize both recognition accuracy and computational time. The hybrid LM has been shown to be both resource-efficient and well-performing on two Thai LVCSR tasks: broadcast news transcription and speech-to-speech translation. The proposed hybrid lexicon can constitute an open vocabulary for Thai LVCSR, as it greatly reduces the OOV rate to less than 1% while using only 42% of the vocabulary size of the word-based lexicon. In terms of recognition performance, the best proposed hybrid RNNLM, which uses a mixed word-PM input, obtained a 1.54% relative WER reduction when compared with a conventional word-based RNNLM. In terms of computational time, the best hybrid RNNLM has the lowest training and decoding time among all RNNLMs, including the word-based RNNLM. The overall relative reduction in WER of the proposed hybrid RNNLM over a traditional n-gram model is 6.91%....
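The hybrid-input idea, feeding the RNNLM a vector that combines a word-level and a pseudo-morpheme-level representation of the same history, can be sketched as follows in PyTorch. Vocabulary sizes, embedding dimensions, and the way the two streams are aligned are assumptions; the actual Thai lexicon and PM segmentation are not reproduced.

```python
import torch
import torch.nn as nn

class HybridRNNLM(nn.Module):
    """RNN language model over a hybrid word + pseudo-morpheme (PM) input (illustrative)."""

    def __init__(self, word_vocab=20000, pm_vocab=4000, emb=128, hidden=256):
        super().__init__()
        self.word_emb = nn.Embedding(word_vocab, emb)
        self.pm_emb = nn.Embedding(pm_vocab, emb)
        # The recurrent layer sees the concatenated word/PM embeddings at each step.
        self.rnn = nn.LSTM(2 * emb, hidden, batch_first=True)
        # The output layer predicts the next unit in the hybrid vocabulary.
        self.out = nn.Linear(hidden, word_vocab + pm_vocab)

    def forward(self, word_ids, pm_ids):
        x = torch.cat([self.word_emb(word_ids), self.pm_emb(pm_ids)], dim=-1)
        h, _ = self.rnn(x)
        return self.out(h)                    # (batch, time, hybrid vocab)

lm = HybridRNNLM()
logits = lm(torch.randint(0, 20000, (4, 30)), torch.randint(0, 4000, (4, 30)))
print(logits.shape)  # torch.Size([4, 30, 24000])
```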
The purpose of this work is to develop a spoken language processing system for smart device troubleshooting using human-machine interaction. This system combines a software Bidirectional Long Short-Term Memory (BLSTM)-based speech recognizer and a hardware LSTM-based language processor for Natural Language Processing (NLP) using the serial RS232 interface. Mel-Frequency Cepstral Coefficient (MFCC)-based feature vectors from the speech signal are input directly into a BLSTM network. A dropout layer is added to the BLSTM layer to reduce over-fitting and improve robustness. The speech recognition component is a combination of an acoustic modeler, a pronunciation dictionary, and a BLSTM network for generating query text, and executes in real time with an 81.5% Word Error Rate (WER) and an average training time of 45 s. The language processor comprises a vectorizer, lookup dictionary, key encoder, Long Short-Term Memory (LSTM)-based training and prediction network, and dialogue manager, and transforms query intent to generate response text with a processing time of 0.59 s, 5% hardware utilization, and an F1 score of 95.2%. The proposed system has a 4.17% decrease in accuracy compared with existing systems. The existing systems use parallel processing and high-speed cache memories to perform additional training, which improves their accuracy. However, the proposed language processor achieves a 36.7% decrease in processing time and a 50% decrease in hardware utilization, making it suitable for troubleshooting smart devices....
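A minimal software-side sketch of the BLSTM acoustic front end, MFCC frames fed to a bidirectional LSTM with dropout and a per-frame output layer, is shown below in PyTorch. The feature dimension, layer sizes, and output targets are assumptions; the hardware LSTM language processor and RS232 interface are outside the scope of a few lines.

```python
import torch
import torch.nn as nn

class BLSTMRecognizer(nn.Module):
    """MFCC-frame BLSTM with dropout and a per-frame output layer (illustrative sizes)."""

    def __init__(self, n_mfcc=13, hidden=128, n_tokens=40):
        super().__init__()
        self.blstm = nn.LSTM(n_mfcc, hidden, num_layers=2, batch_first=True,
                             bidirectional=True, dropout=0.3)
        self.dropout = nn.Dropout(0.3)        # extra dropout to curb over-fitting
        self.out = nn.Linear(2 * hidden, n_tokens)

    def forward(self, mfcc):                  # (batch, frames, n_mfcc)
        h, _ = self.blstm(mfcc)
        return self.out(self.dropout(h))      # per-frame token scores

model = BLSTMRecognizer()
frames = torch.randn(2, 300, 13)              # stand-in for real MFCC features
scores = model(frames)
print(scores.shape)  # torch.Size([2, 300, 40])
```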