Current Issue: October-December 2014, Issue 4, 5 Articles
To achieve a better trade-off between the vector dimension and the memory requirements of a vector quantizer (VQ), an entropy-constrained VQ (ECVQ) scheme with finite memory, called finite-state ECVQ (FS-ECVQ), is presented in this paper. The scheme consists of a finite-state VQ (FSVQ) and multiple component ECVQs. By utilizing the FSVQ, the inter-frame dependencies within the source sequence can be effectively exploited and no side information needs to be transmitted. By employing the ECVQs, the total memory requirements of the FS-ECVQ can be efficiently decreased while the coding performance is improved. An FS-ECVQ, designed for coding modified discrete cosine transform (MDCT) coefficients, was implemented and evaluated based on the Unified Speech and Audio Coding (USAC) scheme. Results showed that the FS-ECVQ reduced the total memory requirements by about 11.3% compared with the encoder in the USAC final version (FINAL), while maintaining similar coding performance.
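The key property of the finite-state stage, that the codebook choice for each frame is driven only by previously emitted indices so the decoder can track the state without side information, can be sketched as follows. This is an illustrative scalar sketch, not the paper's implementation; the codebooks, next-state function, and frame values below are invented for the example:

```python
def fsvq_encode(frames, codebooks, next_state):
    """Finite-state VQ encoder sketch: the component codebook for each
    frame is selected by a state computed from earlier indices only, so
    a decoder can reproduce the same state sequence from the index
    stream alone, with no side information transmitted."""
    state, indices = 0, []
    for x in frames:
        cb = codebooks[state]
        # pick the nearest codeword in the current state's codebook
        i = min(range(len(cb)), key=lambda j: (cb[j] - x) ** 2)
        indices.append(i)
        state = next_state(state, i)  # the decoder performs the same update
    return indices
```

A decoder repeats the same `next_state` updates while reading the index stream, so encoder and decoder stay in lockstep; in the FS-ECVQ each state would select a component ECVQ rather than a plain codebook.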
This paper describes an auditory processing-based feature extraction strategy for robust speech recognition in environments where conventional automatic speech recognition (ASR) approaches are not successful. It incorporates a combination of gammatone filtering, modulation spectrum, and non-linearity for feature extraction in the recognition chain to improve robustness, specifically for ASR in adverse acoustic conditions. Experimental results on the standard Aurora-4 large-vocabulary evaluation task revealed that the proposed features provide reliable and considerable improvements in robustness across different noise conditions and are comparable to those of standard feature extraction techniques.
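The general shape of such a chain (a filterbank, a compressive non-linearity, then a modulation spectrum taken across frames) can be sketched as below. This is a minimal stand-in, not the paper's front end: FFT band energies replace true gammatone filters, and the frame size, band count, and power-law exponent are invented for the example:

```python
import numpy as np

def auditory_features(signal, n_bands=8, frame=256, hop=128, compress=1 / 15):
    """Sketch of an auditory feature chain: per-frame band energies,
    a power-law non-linearity, then the modulation spectrum (an FFT of
    each band's energy trajectory over time)."""
    frames = [signal[i:i + frame] for i in range(0, len(signal) - frame + 1, hop)]
    spec = np.abs(np.fft.rfft(np.array(frames) * np.hanning(frame), axis=1))
    bands = np.array_split(spec, n_bands, axis=1)          # crude filterbank
    energies = np.stack([b.sum(axis=1) for b in bands], axis=1)
    compressed = energies ** compress                      # compressive non-linearity
    return np.abs(np.fft.rfft(compressed, axis=0))         # per-band modulation spectrum
```

The returned matrix has one column per band and one row per modulation-frequency bin; a real system would use a gammatone filterbank and a tuned modulation-frequency range.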
Neural network language models (NNLMs), including feed-forward NNLMs (FNNLMs) and recurrent NNLMs (RNNLMs), have proved to be quite powerful for sequence modeling. One main issue with NNLMs is the heavy computational burden of the output layer, where the output needs to be probabilistically normalized and the normalizing factors require substantial computation. How to rescore the N-best list or lattice quickly with an NNLM attracts much attention for large-scale applications. In this paper, the statistical characteristics of the normalizing factors are investigated on the N-best list. Based on these observations, we propose to approximate the normalizing factor for each hypothesis as a constant proportional to the number of words in the hypothesis. The unnormalized NNLM is then investigated and combined with a back-off N-gram for fast rescoring, which can be computed very quickly without the normalization in the output layer, reducing the complexity significantly. We apply the proposed method to a well-tuned context-dependent deep neural network hidden Markov model (CD-DNN-HMM) speech recognition system on the English-Switchboard phone-call speech-to-text task, where both an FNNLM and an RNNLM are trained to demonstrate the method. Experimental results show that the unnormalized probability of the NNLM is quite complementary to that of the back-off N-gram, and combining them can further reduce the word error rate with little additional computation.
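The length-proportional normalizer approximation and the combination with a back-off N-gram score can be sketched as a per-hypothesis scoring rule. This is an illustrative sketch, not the paper's exact formulation; the function name, the interpolation weight, and the example numbers are invented:

```python
def rescore_hypothesis(word_logits, ngram_logprob, log_norm_const, lam=0.5):
    """Fast NNLM rescoring sketch without softmax normalization.

    word_logits: unnormalized NNLM output-layer scores for the chosen
        words of one hypothesis (log domain).
    ngram_logprob: total back-off N-gram log-probability of the hypothesis.
    log_norm_const: assumed constant per-word log-normalizer, so the
        total normalizer is simply proportional to hypothesis length.
    lam: interpolation weight between the NNLM and N-gram scores.
    """
    n_words = len(word_logits)
    # A length-proportional constant stands in for the expensive
    # per-position softmax normalizers.
    nnlm_score = sum(word_logits) - n_words * log_norm_const
    return lam * nnlm_score + (1.0 - lam) * ngram_logprob
```

Because no softmax over the full vocabulary is computed, the cost per hypothesis drops from O(length x vocabulary) to O(length) in the output layer.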
This paper investigates the estimation of underlying articulatory targets of Thai vowels as an invariant representation of vocal tract shapes by means of analysis-by-synthesis based on acoustic data. The basic idea is to simulate the process of learning speech production as a distal learning task, with acoustic signals of natural utterances in the form of Mel-frequency cepstral coefficients (MFCCs) as input, VocalTractLab, a 3D articulatory synthesizer controlled by target approximation models, as the learner, and stochastic gradient descent as the target training method. To test the effectiveness of this approach, a speech corpus was designed to contain contextual variations of Thai vowels by juxtaposing nine Thai long vowels in two-syllable sequences. A speech corpus consisting of 81 disyllabic utterances was recorded from a native Thai speaker. Nine vocal tract shapes, each corresponding to a vowel, were estimated by optimizing the vocal tract shape parameters of each vowel to minimize the sum of squared errors of MFCCs between the original and synthesized speech. The stochastic gradient descent algorithm was used to iteratively optimize the shape parameters. The optimized vocal tract shapes were then used to synthesize Thai vowels both in monosyllables and in disyllabic sequences. The results, both numerical and perceptual, indicate that this model-based analysis strategy allows us to effectively and economically estimate the vocal tract shapes needed to synthesize accurate Thai vowels as well as smooth formant transitions between adjacent vowels.
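The analysis-by-synthesis loop, stochastic gradient descent on shape parameters to minimize the squared MFCC error between natural and synthesized speech, can be sketched as follows. This is a generic sketch, not the paper's training code: the synthesizer is a black-box callable, gradients are taken by finite differences (one assumed way to handle a non-differentiable synthesizer), and all learning-rate and step settings are invented:

```python
import numpy as np

def estimate_targets(synthesize, target_mfcc, params, lr=0.1, steps=200, eps=1e-3):
    """Analysis-by-synthesis sketch: stochastic gradient descent on
    vocal-tract shape parameters to minimize the squared MFCC error.

    synthesize: black box mapping shape parameters to MFCCs.
    target_mfcc: MFCCs of the natural utterance.
    """
    rng = np.random.default_rng(0)
    params = params.copy()
    for _ in range(steps):
        i = rng.integers(len(params))  # stochastic: one parameter per step
        base = np.sum((synthesize(params) - target_mfcc) ** 2)
        probe = params.copy()
        probe[i] += eps
        # finite-difference estimate of the gradient along parameter i
        grad = (np.sum((synthesize(probe) - target_mfcc) ** 2) - base) / eps
        params[i] -= lr * grad
    return params
```

In the paper the black box would be VocalTractLab driven by target approximation models; here any callable with the same signature works.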
In this paper, a two-stage scheme is proposed to deal with the difficult problem of acoustic echo cancellation (AEC) in a single-channel scenario in the presence of noise. To overcome the major challenge of obtaining a separate reference signal in the adaptive filter-based AEC problem, a delayed version of the echo- and noise-suppressed signal is proposed for use as the reference. A modified objective function is thereby derived for a gradient-based adaptive filter algorithm, and proof of its convergence to the optimum Wiener-Hopf solution is established. The output of the AEC block is fed to an acoustic noise cancellation (ANC) block, where a spectral subtraction-based algorithm with adaptive spectral floor estimation is employed. To obtain fast but smooth convergence with maximum possible echo and noise suppression, a set of updating constraints is proposed based on various speech characteristics (e.g., energy and correlation) of the reference and current frames, considering whether they are voiced, unvoiced, or pauses. Extensive experimentation was carried out on several echo- and noise-corrupted natural utterances taken from the TIMIT database, and it was found that the proposed scheme can significantly reduce the effect of both echo and noise in terms of objective and subjective quality measures.
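The spectral subtraction step in the ANC block can be sketched in the magnitude domain as below. This is a textbook sketch, not the paper's algorithm: the adaptive spectral floor estimation is summarized by a fixed `floor` parameter, and the over-subtraction factor and example spectra are invented:

```python
import numpy as np

def spectral_subtract(frame_mag, noise_mag, over_sub=2.0, floor=0.05):
    """Magnitude-domain spectral subtraction sketch with a spectral floor.

    frame_mag: magnitude spectrum of the current noisy frame.
    noise_mag: estimated noise magnitude spectrum.
    over_sub: over-subtraction factor for stronger noise removal.
    floor: fraction of the noisy magnitude kept wherever subtraction
        would go negative; an adaptive scheme would tune this per frame.
    """
    cleaned = frame_mag - over_sub * noise_mag
    # clamp to the floor to avoid negative magnitudes and musical noise
    return np.maximum(cleaned, floor * frame_mag)
```

An enhanced frame is then rebuilt by combining the cleaned magnitudes with the noisy phase before the inverse transform.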