Current Issue: April - June | Volume: 2015 | Issue: 2 | Articles: 4
Building a voice-operated system for learning disabled users is a difficult task that requires a considerable amount of time and effort. Due to the wide spectrum of disabilities and their different related phonopathies, most available approaches are targeted to a specific pathology. This may improve their accuracy for some users, but makes them unsuitable for others. In this paper, we present a cross-lingual approach to adapting a general-purpose modular speech recognizer for learning disabled people. The main advantage of this approach is that it allows rapid and cost-effective development by taking the already built speech recognition engine and its modules, and utilizing existing resources for standard speech in different languages for the recognition of the users' atypical voices. Although the recognizers built with the proposed technique obtain lower accuracy rates than those trained for specific pathologies, they can be used by a wide population and developed more rapidly, which makes it possible to design various types of speech-based applications accessible to learning disabled users.
Although the field of automatic speaker and speech recognition has been studied extensively over the past decades, the lack of robustness has remained a major challenge. The missing data technique (MDT) is a promising approach; however, its performance depends on the correlation across frequency bands. This paper presents a new reconstruction method for feature enhancement based on this property. The degree of concentration across frequency bands is measured with principal component analysis (PCA). Through theoretical analysis and experimental results, it is found that the correlation of the feature vector extracted from a sub-band (SB) is much stronger than that of the vector extracted from the full band (FB). Thus, rather than dealing with the spectral features as a whole, this paper splits the full band into sub-bands and then individually reconstructs the spectral features extracted from each SB based on MDT. The reconstructed features from all sub-bands are then recombined to yield the conventional mel-frequency cepstral coefficients (MFCC) for recognition experiments. The 2-sub-band reconstruction approach is evaluated in a speaker recognition system. The results show that the proposed approach outperforms full-band reconstruction in terms of recognition performance under all noise conditions. Finally, we discuss the optimal choice of frequency division for the recognition task. When the FB is divided into many more sub-bands, some of the correlations across frequency channels are lost. Consequently, efficient division schemes need to be investigated to further improve recognition performance.
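The split/reconstruct/recombine pipeline described in this abstract can be sketched as follows. This is a minimal illustration only: the per-sub-band MDT imputation itself is left as an identity placeholder, and the function name, the choice of 26 log-mel channels, and 13 cepstra are illustrative assumptions, not details from the paper.

```python
import numpy as np
from scipy.fftpack import dct

def subband_mfcc(log_mel_features, n_subbands=2, n_ceps=13):
    """Split log-mel spectral features into sub-bands, reconstruct each
    sub-band independently, recombine, and apply the DCT to obtain
    conventional MFCCs. The reconstruction step here is a placeholder
    (identity); in the paper's method it would be MDT-based imputation
    of unreliable spectral components within each sub-band."""
    n_frames, n_bands = log_mel_features.shape
    band_groups = np.array_split(np.arange(n_bands), n_subbands)
    reconstructed = np.empty_like(log_mel_features)
    for band_idx in band_groups:
        sub = log_mel_features[:, band_idx]
        # Placeholder for per-sub-band missing-data reconstruction:
        reconstructed[:, band_idx] = sub
    # Recombine the sub-bands and take the DCT over the full band -> MFCC
    return dct(reconstructed, type=2, axis=1, norm='ortho')[:, :n_ceps]

feats = np.random.randn(100, 26)  # 100 frames, 26 log-mel channels
mfcc = subband_mfcc(feats)
print(mfcc.shape)  # (100, 13)
```

The point of the structure is that reconstruction operates within each sub-band, where (per the abstract) correlations are strongest, while the final DCT still spans the full band so that standard MFCCs come out.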
Feature-based vocoders, e.g., STRAIGHT, offer a way to manipulate the perceived characteristics of the speech signal in speech transformation and synthesis. For the harmonic model, which provides excellent perceived quality, features for the amplitude parameters already exist (e.g., Line Spectral Frequencies (LSF) and Mel-Frequency Cepstral Coefficients (MFCC)). However, because of the wrapping of the phase parameters, phase features are more difficult to design. To randomize the phase of the harmonic model during synthesis, a voicing feature is commonly used, which distinguishes voiced from unvoiced segments. However, voice production allows smooth transitions between voiced and unvoiced states, which sometimes makes voicing segmentation tricky to estimate. In this article, two phase features are suggested to represent the phase of the harmonic model in a uniform way, without a voicing decision. The synthesis quality of the resulting vocoder has been evaluated, using subjective listening tests, in the context of resynthesis, pitch scaling, and Hidden Markov Model (HMM)-based synthesis. The experiments show that the suggested signal model is comparable to STRAIGHT, or even better in some scenarios. They also reveal some limitations of the harmonic framework itself in the case of high fundamental frequencies.
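The wrapping problem the abstract mentions is easy to demonstrate: ordinary (linear) statistics break down on wrapped phase values, which is why phase features need special design. The sketch below is a generic illustration of that difficulty, not the paper's proposed features.

```python
import numpy as np

# Two phase values just on either side of the +/- pi wrapping boundary.
# Geometrically they point in almost the same direction.
phases = np.array([np.pi - 0.1, -np.pi + 0.1])

# A naive linear average lands near 0 -- the opposite direction --
# which is why wrapped phase is hard to use directly as a feature.
naive_mean = phases.mean()

# A circular mean, computed on the unit circle, behaves correctly
# and recovers an angle near +/- pi.
circ_mean = np.angle(np.mean(np.exp(1j * phases)))

print(naive_mean)  # 0.0
print(abs(circ_mean))  # ~3.14159 (i.e. ~pi)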
The full modulation spectrum is a high-dimensional representation of one-dimensional audio signals. Most previous research in automatic speech recognition converted this very rich representation into the equivalent of a sequence of short-time power spectra, mainly to simplify the computation of the posterior probability that a frame of an unknown speech signal is related to a specific state. In this paper, we use the raw output of a modulation spectrum analyser in combination with sparse coding as a means for obtaining state posterior probabilities. The modulation spectrum analyser uses 15 gammatone filters. The Hilbert envelope of the output of these filters is then processed by nine modulation frequency filters, with bandwidths up to 16 Hz. Experiments using the AURORA-2 task show that the novel approach is promising. We found that the representation of medium-term dynamics in the modulation spectrum analyser must be improved. We also found that we should move towards sparse classification, by modifying the cost function in sparse coding such that the class(es) represented by the exemplars weigh in, in addition to the accuracy with which unknown observations are reconstructed. This creates two challenges: (1) developing a method for dictionary learning that takes the class occupancy of exemplars into account and (2) developing a method for learning a mapping from exemplar activations to state posterior probabilities that preserves the generalization to unseen conditions that is one of the strongest advantages of sparse coding.
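The two-stage analyser described here (acoustic filterbank, then Hilbert envelope, then modulation-frequency filters) can be sketched for a single channel. This is an assumption-laden toy: a Butterworth band-pass stands in for one of the 15 gammatone filters, and a single 2-8 Hz band-pass stands in for one of the nine modulation filters; all centre frequencies and orders are illustrative.

```python
import numpy as np
from scipy.signal import butter, sosfilt, hilbert

fs = 8000  # sample rate in Hz
t = np.arange(fs) / fs
# Test signal: a 1 kHz carrier, amplitude-modulated at 4 Hz.
x = np.sin(2 * np.pi * 1000 * t) * (1 + 0.5 * np.sin(2 * np.pi * 4 * t))

# Stage 1: acoustic band-pass channel centred near 1 kHz
# (a Butterworth stand-in for one gammatone filter).
sos_acoustic = butter(4, [800, 1200], btype='bandpass', fs=fs, output='sos')
channel = sosfilt(sos_acoustic, x)

# Stage 2: Hilbert envelope of the channel output...
envelope = np.abs(hilbert(channel))

# ...analysed by a modulation filter (here 2-8 Hz, within the
# "bandwidths up to 16 Hz" range the abstract mentions).
sos_modulation = butter(2, [2, 8], btype='bandpass', fs=fs, output='sos')
mod_component = sosfilt(sos_modulation, envelope)

print(mod_component.shape)  # one modulation trajectory per channel pair
```

Stacking such trajectories over all acoustic-channel/modulation-filter pairs yields the high-dimensional modulation-spectrum representation that the paper feeds to sparse coding instead of collapsing it into short-time power spectra.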