Frequency: Quarterly | E-ISSN: 2250-2912 | P-ISSN: Awaited | Abstracted/Indexed in: Ulrich's International Periodical Directory, Google Scholar, SCIRUS, Genamics JournalSeek
Published quarterly in print and online, "Inventi Impact: Audio, Speech & Music Processing" publishes high-quality unpublished as well as high-impact pre-published research and reviews catering to the needs of researchers and professionals. It focuses on sound engineering, recording, electronic production of speech and music, digitization of sound, and related areas.
The Cordoba Guitar Festival is one of the most important cultural events in Spain. This article analyses the musical preferences, satisfaction, attitudinal loyalty, and behavioural loyalty of spectators who attended the 36th festival held in July 2016, as well as the festival's economic impact on the city. These characteristics of the public give rise to the four hypotheses of this study. To achieve this aim, a structural equation model (SEM) was used. The results...
Deep learning is bringing breakthroughs to many computer vision subfields, including Optical Music Recognition (OMR), which has seen a series of improvements to musical symbol detection achieved by using generic deep learning models. However, so far, each such proposal has been based on a specific dataset and different evaluation criteria, which has made it difficult to quantify the new deep-learning-based state of the art and assess the relative merits of these detection models on music scores. In this paper, a baseline for general detection of musical symbols with deep learning is presented. We consider three datasets of heterogeneous typology but with the same annotation format and three neural models of different nature, and establish their performance in terms of a common evaluation standard. The experimental results confirm that direct music object detection with deep learning is indeed promising, but at the same time they illustrate some of the domain-specific shortcomings of the general detectors. A qualitative comparison then suggests avenues for OMR improvement, based both on properties of the detection model and on how the datasets are defined. To the best of our knowledge, this is the first time that competing music object detection systems from the machine learning paradigm are directly compared to each other. We hope that this work will serve as a reference to measure the progress of future developments of OMR in music object detection...
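As an illustrative companion to the abstract above (not the authors' code), the sketch below shows one common way a generic deep detector can be repurposed for musical symbol detection: fine-tuning a torchvision Faster R-CNN with its classification head replaced. The class count, optimizer settings, and data handling are assumptions made for the sketch.

# Minimal sketch: adapting a generic object detector to musical symbols.
# The symbol class count and the training data are hypothetical.
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

NUM_SYMBOL_CLASSES = 32 + 1  # assumed number of symbol categories plus background

# Start from a detector pre-trained on natural images
# (the weights/pretrained argument name depends on the torchvision version).
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")

# Replace the classification head so it predicts musical symbol classes instead.
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, NUM_SYMBOL_CLASSES)

optimizer = torch.optim.SGD(model.parameters(), lr=5e-3, momentum=0.9)
model.train()

def train_step(images, targets):
    # images: list of CHW float tensors; targets: list of dicts with
    # {"boxes": FloatTensor[N, 4], "labels": Int64Tensor[N]} (torchvision convention).
    loss_dict = model(images, targets)   # in train mode, returns the detection losses
    loss = sum(loss_dict.values())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()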
Speech is the most significant mode of communication among human beings and a potential method for human-computer interaction (HCI) by using a microphone sensor. Quantifiable emotion recognition using these sensors from speech signals is an emerging area of research in HCI, which applies to multiple applications such as human-robot interaction, virtual reality, behavior assessment, healthcare, and emergency call centers to determine the speaker's emotional state from an individual's speech. In this paper, we present major contributions for (i) increasing the accuracy of speech emotion recognition (SER) compared to the state of the art and (ii) reducing the computational complexity of the presented SER model. We propose an artificial intelligence-assisted deep stride convolutional neural network (DSCNN) architecture using the plain-nets strategy to learn salient and discriminative features from spectrograms of speech signals that are enhanced in prior steps to perform better. Local hidden patterns are learned in convolutional layers with special strides to down-sample the feature maps rather than pooling layers, and global discriminative features are learned in fully connected layers. A SoftMax classifier is used for the classification of emotions in speech. The proposed technique is evaluated on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) and Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) datasets, improving accuracy by 7.85% and 4.5%, respectively, with the model size reduced by 34.5 MB. This proves the effectiveness and significance of the proposed SER technique and reveals its applicability in real-world applications...
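The following is a minimal sketch of the core idea described above: a plain CNN that down-samples spectrograms with strided convolutions instead of pooling, followed by fully connected layers and a softmax-style classifier. It is not the authors' DSCNN; the input size (128x128 spectrograms), layer widths, and the four emotion classes are assumptions.

import torch
import torch.nn as nn

class StridedEmotionCNN(nn.Module):
    """Sketch of a strided ('plain') CNN for speech emotion recognition.
    Input shape and class count are assumptions, not values from the paper."""
    def __init__(self, num_emotions=4):
        super().__init__()
        self.features = nn.Sequential(
            # Strided convolutions down-sample the feature maps instead of pooling layers.
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 16 * 16, 128), nn.ReLU(),  # assumes 128x128 spectrogram input
            nn.Linear(128, num_emotions),             # softmax is applied inside the loss below
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# Usage on a dummy batch of spectrograms (batch, channel, frequency bins, frames).
model = StridedEmotionCNN()
logits = model(torch.randn(8, 1, 128, 128))
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 4, (8,)))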
Building a voice-operated system for learning-disabled users is a difficult task that requires a considerable amount of time and effort. Due to the wide spectrum of disabilities and their different related phonopathies, most available approaches are targeted at a specific pathology. This may improve their accuracy for some users, but makes them unsuitable for others. In this paper, we present a cross-lingual approach to adapt a general-purpose modular speech recognizer for learning-disabled people. The main advantage of this approach is that it allows rapid and cost-effective development by taking the already built speech recognition engine and its modules and utilizing existing resources for standard speech in different languages for the recognition of the users' atypical voices. Although the recognizers built with the proposed technique obtain lower accuracy rates than those trained for specific pathologies, they can be used by a wide population and developed more rapidly, which makes it possible to design various types of speech-based applications accessible to learning-disabled users...
Substantial amounts of resources are usually required to robustly develop a language model for an open-vocabulary speech recognition system, as out-of-vocabulary (OOV) words can hurt recognition accuracy. In this work, we applied a hybrid lexicon of word and sub-word units to resolve the problem of OOV words in a resource-efficient way. As sub-lexical units can be combined to form new words, a compact hybrid vocabulary can be used while still maintaining a low OOV rate. For Thai, a syllable-based unit called pseudo-morpheme (PM) was chosen as the sub-word unit. To also benefit from the different levels of linguistic information embedded in different input types, a hybrid recurrent neural network language model (RNNLM) framework is proposed. An RNNLM can not only model information from multiple input-unit types through a hybrid input vector of words and PMs, but can also capture long context history through recurrent connections. Several hybrid input representations were also explored to optimize both recognition accuracy and computational time. The hybrid LM has been shown to be both resource-efficient and well-performing on two Thai LVCSR tasks: broadcast news transcription and speech-to-speech translation. The proposed hybrid lexicon can constitute an open vocabulary for Thai LVCSR, as it greatly reduces the OOV rate to less than 1% while using only 42% of the vocabulary size of the word-based lexicon. In terms of recognition performance, the best proposed hybrid RNNLM, which uses a mixed word-PM input, obtained a 1.54% relative WER reduction when compared with a conventional word-based RNNLM. In terms of computational time, the best hybrid RNNLM has the lowest training and decoding time among all RNNLMs, including the word-based RNNLM. The overall relative reduction in WER of the proposed hybrid RNNLM over a traditional n-gram model is 6.91%...
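As a rough illustration of the hybrid idea (not the paper's model), the sketch below implements a recurrent LM over a single joint vocabulary that contains both words and pseudo-morpheme units, so a mixed word-PM token stream can be scored with one network. The vocabulary size and layer dimensions are assumptions.

import torch
import torch.nn as nn

class HybridRNNLM(nn.Module):
    """Sketch of a recurrent LM over a joint word + pseudo-morpheme (PM) vocabulary.
    Vocabulary size and dimensions are assumptions, not values from the paper."""
    def __init__(self, vocab_size=20000, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)  # words and PM units share one index space
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids, state=None):
        emb = self.embed(token_ids)             # (batch, seq, embed_dim)
        hidden, state = self.rnn(emb, state)    # recurrence carries long context history
        return self.out(hidden), state          # next-token logits over words and PMs

# Usage: score a dummy mixed word/PM token sequence with a next-token objective.
lm = HybridRNNLM()
tokens = torch.randint(0, 20000, (2, 12))
logits, _ = lm(tokens[:, :-1])
nll = nn.CrossEntropyLoss()(logits.reshape(-1, 20000), tokens[:, 1:].reshape(-1))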
The purpose of this work is to develop a spoken language processing system for smart device troubleshooting using human-machine interaction. This system combines a software Bidirectional Long Short Term Memory Cell (BLSTM)-based speech recognizer and a hardware LSTM-based language processor for Natural Language Processing (NLP) using the serial RS232 interface. Mel Frequency Cepstral Coefficient (MFCC)-based feature vectors from the speech signal are directly input into a BLSTM network. A dropout layer is added to the BLSTM layer to reduce over-fitting and improve robustness. The speech recognition component is a combination of an acoustic modeler, a pronunciation dictionary, and a BLSTM network for generating query text, and executes in real time with an 81.5% Word Error Rate (WER) and an average training time of 45 s. The language processor comprises a vectorizer, lookup dictionary, key encoder, Long Short Term Memory Cell (LSTM)-based training and prediction network, and dialogue manager, and transforms query intent to generate response text with a processing time of 0.59 s, 5% hardware utilization, and an F1 score of 95.2%. The proposed system has a 4.17% decrease in accuracy compared with existing systems. The existing systems use parallel processing and high-speed cache memories to perform additional training, which improves their accuracy. However, the language processor achieves a 36.7% decrease in processing time and a 50% decrease in hardware utilization, making it suitable for troubleshooting smart devices...
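A minimal sketch of the speech recognition side described above, assuming a PyTorch implementation: MFCC frames feed a bidirectional LSTM with a dropout layer, and the per-frame outputs are trained with a CTC objective (one common alignment-free choice; the paper's exact training setup is not reproduced here). Dimensions and label set are assumptions.

import torch
import torch.nn as nn

class BLSTMRecognizer(nn.Module):
    """Sketch of a bidirectional LSTM acoustic model over MFCC frames.
    The MFCC dimension, label set size, and layer sizes are assumptions."""
    def __init__(self, n_mfcc=13, hidden=128, n_labels=30):
        super().__init__()
        self.blstm = nn.LSTM(n_mfcc, hidden, batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(0.3)           # reduces over-fitting, as in the abstract
        self.proj = nn.Linear(2 * hidden, n_labels)

    def forward(self, mfcc_frames):              # (batch, time, n_mfcc)
        hidden, _ = self.blstm(mfcc_frames)
        return self.proj(self.dropout(hidden))   # per-frame label logits

# Usage with a CTC-style objective on dummy data.
model = BLSTMRecognizer()
x = torch.randn(4, 200, 13)                                  # 4 utterances, 200 frames of 13 MFCCs
log_probs = model(x).log_softmax(-1).transpose(0, 1)         # CTC expects (time, batch, labels)
targets = torch.randint(1, 30, (4, 20))                      # label 0 is reserved for the CTC blank
loss = nn.CTCLoss(blank=0)(log_probs, targets,
                           input_lengths=torch.full((4,), 200),
                           target_lengths=torch.full((4,), 20))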
To achieve a better trade-off between the vector dimension and the memory requirements of a vector quantizer (VQ), an entropy-constrained VQ (ECVQ) scheme with finite memory, called finite-state ECVQ (FS-ECVQ), is presented in this paper. The scheme consists of a finite-state VQ (FSVQ) and multiple component ECVQs. By utilizing the FSVQ, the inter-frame dependencies within the source sequence can be effectively exploited and no side information needs to be transmitted. By employing the ECVQs, the total memory requirements of the FS-ECVQ can be efficiently decreased while the coding performance is improved. An FS-ECVQ designed for coding modified discrete cosine transform (MDCT) coefficients was implemented and evaluated based on the Unified Speech and Audio Coding (USAC) scheme. Results showed that the FS-ECVQ achieved a reduction of the total memory requirements by about 11.3% compared with the encoder in the USAC final version (FINAL), while maintaining similar coding performance...
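To make the finite-state, entropy-constrained idea concrete, here is an illustrative sketch (not the paper's codec): the previously emitted codeword determines the current state, the state selects a component codebook, and the codeword is chosen by a Lagrangian cost that trades distortion against code length. The codebooks, code lengths, and state-transition rule below are toy assumptions.

import numpy as np

# Toy finite-state, entropy-constrained VQ encoder. Because the decoder can apply the
# same state rule to its own reconstructed output, no side information is transmitted.
rng = np.random.default_rng(0)
NUM_STATES, CODEBOOK_SIZE, DIM = 4, 16, 8
codebooks = rng.normal(size=(NUM_STATES, CODEBOOK_SIZE, DIM))           # one component ECVQ per state
code_lengths = rng.uniform(2.0, 8.0, size=(NUM_STATES, CODEBOOK_SIZE))  # -log2 p(codeword), in bits
LAMBDA = 0.1                                                            # rate-distortion trade-off

def next_state(prev_index):
    # Toy transition rule; a real FSVQ learns this from inter-frame statistics.
    return prev_index % NUM_STATES

def encode(frames):
    state, indices = 0, []
    for x in frames:
        cb, lengths = codebooks[state], code_lengths[state]
        cost = np.sum((cb - x) ** 2, axis=1) + LAMBDA * lengths  # distortion + lambda * rate
        idx = int(np.argmin(cost))
        indices.append((state, idx))
        state = next_state(idx)
    return indices

indices = encode(rng.normal(size=(5, DIM)))   # encode five dummy coefficient frames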
This paper proposes a new speech enhancement (SE) algorithm that applies constraints to the Wiener gain function and is capable of working at 10 dB and lower signal-to-noise ratios (SNRs). The wavelet-thresholded multitaper spectrum was taken as the clean spectrum for the constraints. The proposed algorithm was evaluated under eight types of noise and seven SNR levels in the NOIZEUS database and was predicted by the composite measures and the SNRLOSS measure to improve subjective quality and speech intelligibility in various noisy environments. Comparisons with two other algorithms (KLT and wavelet thresholding (WT)) demonstrate that, in terms of signal distortion, overall quality, and the SNRLOSS measure, our proposed constrained SE algorithm outperforms the KLT and WT schemes for most conditions considered...
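The sketch below illustrates the general shape of a gain-constrained Wiener filter, assuming a simple noise estimate from the first frames and a fixed gain floor; the paper's constraints are instead derived from a wavelet-thresholded multitaper spectrum, which is not reproduced here.

import numpy as np
from scipy.signal import stft, istft

def enhance(noisy, fs, gain_floor=0.1, noise_frames=10):
    """Toy gain-constrained Wiener enhancement; noise estimate and floor are assumptions."""
    f, t, Y = stft(noisy, fs, nperseg=512)
    noise_psd = np.mean(np.abs(Y[:, :noise_frames]) ** 2, axis=1, keepdims=True)
    snr_post = np.abs(Y) ** 2 / np.maximum(noise_psd, 1e-12)
    snr_prio = np.maximum(snr_post - 1.0, 0.0)      # simple a priori SNR estimate
    gain = snr_prio / (1.0 + snr_prio)              # Wiener gain function
    gain = np.clip(gain, gain_floor, 1.0)           # constraint: keep the gain above a floor
    _, enhanced = istft(gain * Y, fs, nperseg=512)
    return enhanced

# Usage on one second of synthetic noisy signal at 16 kHz.
fs = 16000
noisy = np.random.randn(fs)
clean_estimate = enhance(noisy, fs)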
Aiming at the shortcomings of single-network classification models, this paper applies a CNN-LSTM (convolutional neural network-long short-term memory) combined network to the field of music emotion classification and proposes a multi-feature combined network classifier based on CNN-LSTM, which combines 2D (two-dimensional) feature input through CNN-LSTM and 1D (one-dimensional) feature input through a DNN (deep neural network) to make up for the deficiencies of the original single-feature models. The model uses multiple convolution kernels in the CNN for 2D feature extraction and a BiLSTM (bidirectional LSTM) for sequence processing, and it is applied, respectively, to single-modal emotion classification of audio and lyrics. In the audio feature extraction, the music audio is finely segmented and the human voice is separated to obtain pure background sound clips, from which the spectrogram and LLDs (Low Level Descriptors) are extracted. In the lyrics feature extraction, the chi-squared test vector and the word embedding extracted by Word2vec are, respectively, used as the feature representation of the lyrics. Combining the two types of heterogeneous features selected from audio and lyrics through the classification model can improve classification performance. In order to fuse the emotional information of the two modalities, music audio and lyrics, this paper proposes a multimodal ensemble learning method based on stacking. Unlike existing feature-level and decision-level fusion methods, this method avoids the information loss caused by direct dimensionality reduction: the original features are converted into label results for fusion, effectively solving the problem of feature heterogeneity. Experiments on the Million Song Dataset show that the audio classification accuracy of the multi-feature combined network classifier reaches 68% and the lyrics classification accuracy reaches 74%. The average multimodal classification accuracy reaches 78%, a significant improvement over the single-modal results...
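As an illustration of the stacking-based fusion step (not the paper's implementation), the sketch below trains one model per modality, converts each modality's features into label probabilities, and fuses those probabilities with a meta-learner. The simple classifiers, feature dimensions, and random data stand in for the CNN-LSTM/DNN branches and real audio/lyrics features.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
n, n_classes = 500, 4
audio_feats = rng.normal(size=(n, 64))    # stand-in for spectrogram/LLD-derived features
lyric_feats = rng.normal(size=(n, 128))   # stand-in for chi-squared / Word2vec features
labels = rng.integers(0, n_classes, size=n)

# Base learners, one per modality (proxies for the CNN-LSTM and DNN branches).
audio_model = MLPClassifier(hidden_layer_sizes=(32,), max_iter=300).fit(audio_feats, labels)
lyric_model = MLPClassifier(hidden_layer_sizes=(32,), max_iter=300).fit(lyric_feats, labels)

# Stacking: fuse per-modality label probabilities rather than the heterogeneous raw features.
# (A real stacking setup would train the meta-learner on cross-validated base predictions.)
meta_input = np.hstack([audio_model.predict_proba(audio_feats),
                        lyric_model.predict_proba(lyric_feats)])
meta_model = LogisticRegression(max_iter=1000).fit(meta_input, labels)
fused_prediction = meta_model.predict(meta_input)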
The aim of this paper is to improve beat-tracking for live guitar performances. Beat-tracking estimates musical measurements such as tempo and phase. This capability is critical for achieving a synchronized ensemble performance, such as musical robot accompaniment. Beat-tracking of a live guitar performance has to deal with three challenges: tempo fluctuation, beat pattern complexity, and environmental noise. To cope with these problems, we devise an audiovisual integration method for beat-tracking. The auditory beat features are estimated in terms of tactus (phase) and tempo (period) by Spectro-Temporal Pattern Matching (STPM), which is robust against stationary noise. The visual beat features are estimated by tracking the position of the hand relative to the guitar using optical flow, mean shift, and the Hough transform. Both estimated features are integrated using a particle filter that aggregates the multimodal information based on a beat location model and a hand trajectory model. Experimental results confirm that our beat-tracking improves the F-measure by 8.9 points on average over the Murata beat-tracking method, which uses STPM and rule-based beat detection. The results also show that the system is capable of real-time processing with a reduced number of particles while preserving estimation accuracy. We demonstrate an ensemble with the humanoid HRP-2 that plays the theremin together with a human guitarist...
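A minimal sketch of the particle-filter fusion step, under toy assumptions: each particle carries a beat phase and period, the predict step advances the phase, and the update step weights particles by agreement with an audio-derived and a vision-derived beat phase. The Gaussian observation models and dummy observations are illustrative, not the paper's beat-location and hand-trajectory models.

import numpy as np

rng = np.random.default_rng(0)
N, DT = 500, 0.05                          # particles, frame hop in seconds
phase = rng.uniform(0.0, 1.0, N)           # position within the current beat, in [0, 1)
period = rng.uniform(0.4, 0.7, N)          # seconds per beat (roughly 86-150 BPM)
weights = np.full(N, 1.0 / N)

def circular_error(a, b):                  # distance between phases on the unit circle
    d = np.abs(a - b) % 1.0
    return np.minimum(d, 1.0 - d)

# One fused cue per frame: (audio beat phase, visual beat phase). Dummy observations.
for audio_obs, visual_obs in [(0.20, 0.25), (0.32, 0.30), (0.45, 0.48)]:
    # Predict: advance each particle's phase, allow slight tempo drift.
    period = np.clip(period + rng.normal(0.0, 0.005, N), 0.3, 1.0)
    phase = (phase + DT / period) % 1.0
    # Update: weight particles by agreement with both modalities.
    weights *= np.exp(-circular_error(phase, audio_obs) ** 2 / 0.01)
    weights *= np.exp(-circular_error(phase, visual_obs) ** 2 / 0.02)
    weights = weights / weights.sum()
    # Resample when the effective sample size collapses.
    if 1.0 / np.sum(weights ** 2) < N / 2:
        idx = rng.choice(N, size=N, p=weights)
        phase, period, weights = phase[idx], period[idx], np.full(N, 1.0 / N)

estimated_phase = np.sum(weights * phase)  # naive weighted mean; a circular mean is more exact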