Current Issue: July - September, Volume: 2013, Issue Number: 3, Articles: 5
This article describes a modified technique for enhancing noisy speech to improve automatic speech recognition (ASR) performance. The proposed approach improves on the widely used spectral subtraction method, which inherently suffers from musical noise artifacts. Through psychoacoustic masking and critical-band variance normalization, the artifacts produced by spectral subtraction are minimized to improve ASR accuracy. The popular advanced ETSI-2 front end is tested for comparison. Speech recognition evaluations on the standard noisy AURORA-2 tasks show enhanced performance for all noise conditions....
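As a point of reference, the plain spectral subtraction baseline that the abstract sets out to improve can be sketched as follows. This is not the proposed psychoacoustic method: the function name `spectral_subtract`, the over-subtraction factor `alpha`, and the spectral floor `beta` are illustrative assumptions, and the inputs are taken to be per-bin magnitude spectra of one frame.

```python
def spectral_subtract(noisy_mag, noise_mag, alpha=1.0, beta=0.01):
    """Basic magnitude spectral subtraction for one frame (illustrative only).

    noisy_mag: magnitude spectrum of the noisy frame, one value per bin.
    noise_mag: estimated noise magnitude spectrum, same length.
    alpha:     over-subtraction factor (assumed parameter).
    beta:      spectral floor as a fraction of the noisy magnitude (assumed).
    """
    enhanced = []
    for noisy, noise in zip(noisy_mag, noise_mag):
        # Subtract the noise estimate, but never drop below the floor.
        # Bins clamped to the floor are where musical noise tends to arise.
        enhanced.append(max(noisy - alpha * noise, beta * noisy))
    return enhanced
```

Bins whose magnitude falls below the noise estimate are clamped to the floor; isolated residual peaks left by this clamping are the kind of artifact the abstract refers to as musical noise.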
In this study, a consistency analysis of the energy parameter for Mandarin speech is presented. Identified by inspecting the human pronunciation process, the consistency can be interpreted as a high correlation of a warping curve between the spectrum and the prosody within a syllable. The consistency analysis proceeds in three steps. First, the hidden Markov model (HMM) algorithm is used to decode HMM-state sequences within a syllable and, at the same time, to divide them into three segments. Second, based on a designated syllable, vector quantization (VQ) with the Linde-Buzo-Gray algorithm is used to train the VQ codebooks of each segment. Third, the energy vector of each segment is encoded as an index by the VQ codebooks, and the probability of each possible path is then evaluated as a prerequisite for analyzing the consistency. It is demonstrated experimentally that consistency is clearly obtained when the syllable is located in exactly the same word. These results suggest that the energy warping process within a syllable must be considered in a text-to-speech system to improve synthesized speech quality....
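The second and third steps (codebook training and index encoding) can be illustrated with a minimal Linde-Buzo-Gray sketch. It assumes the energy vectors are plain lists of floats, that the codebook size is a power of two, and that codewords are split by a small additive perturbation `eps`; the names `lbg` and `nearest` are hypothetical and not taken from the paper.

```python
def nearest(codebook, vector):
    """Index of the codeword closest to `vector` (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vector)),
    )

def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def lbg(vectors, size, eps=0.01, iters=20):
    """Train a codebook of `size` codewords (power of two) by LBG splitting."""
    codebook = [centroid(vectors)]
    while len(codebook) < size:
        # Split every codeword into two slightly perturbed copies.
        codebook = [[c + s for c in cw] for cw in codebook for s in (eps, -eps)]
        # Refine the doubled codebook with nearest-neighbour / centroid updates.
        for _ in range(iters):
            cells = [[] for _ in codebook]
            for v in vectors:
                cells[nearest(codebook, v)].append(v)
            # An empty cell keeps its previous codeword.
            codebook = [centroid(cell) if cell else cw
                        for cell, cw in zip(cells, codebook)]
    return codebook
```

Once a codebook is trained for a segment, `nearest(codebook, energy_vector)` gives the VQ index that the abstract's third step uses to evaluate path probabilities.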
We propose an efficient solution to the problem of sparse linear prediction analysis of the speech signal. Our method is based on minimization of a weighted l2-norm of the prediction error. The weighting function is constructed such that less emphasis is given to the error around the points where we expect the largest prediction errors to occur (the glottal closure instants), and hence the resulting cost function approaches the ideal l0-norm cost function for sparse residual recovery. We show that the efficient minimization of this objective function (by solving the normal equations of a linear least squares problem) provides an enhanced sparsity level of residuals compared to the l1-norm minimization approach, which uses computationally demanding convex optimization methods. Indeed, the computational complexity of the proposed method is roughly the same as that of the classic minimum variance linear prediction analysis approach. Moreover, to show a potential application of such a sparse representation, we use the resulting linear prediction coefficients inside a multi-pulse synthesizer and show that the corresponding multi-pulse estimate of the excitation source results in slightly better synthesis quality than the classical technique, which uses the traditional non-sparse minimum variance synthesizer....
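The core computation, minimizing a weighted l2-norm of the prediction error through normal equations, can be sketched as below. The weights are simply passed in as a list; how the paper actually constructs them around glottal closure instants is not specified here, and the names `weighted_lp` and `solve` are hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def weighted_lp(x, w, order):
    """Coefficients a[0..order-1] minimizing
    sum_n w[n] * (x[n] - sum_k a[k] * x[n-1-k])**2."""
    R = [[0.0] * order for _ in range(order)]
    r = [0.0] * order
    for n in range(order, len(x)):
        past = [x[n - k] for k in range(1, order + 1)]
        for i in range(order):
            r[i] += w[n] * past[i] * x[n]
            for j in range(order):
                R[i][j] += w[n] * past[i] * past[j]
    # Weighted normal equations: R a = r.
    return solve(R, r)
```

With all weights equal to one this reduces to ordinary least-squares (covariance-method) linear prediction, which is consistent with the abstract's remark that the cost is roughly that of the classic analysis.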
Conventional parametric stereo (PS) audio coding employs inter-channel phase difference and overall phase difference as phase parameters. In this article, it is shown that those parameters cannot correctly represent the phase relationship between the stereo channels when inter-channel correlation (ICC) is less than one, which is common in practical situations. To solve this problem, we introduce new phase parameters, channel phase differences (CPDs), defined as the phase differences between the mono downmix and the stereo channels. Since CPDs have a descriptive relationship with ICC as well as inter-channel intensity difference, they are more relevant for representing the phase difference between the channels in practical situations. We also propose methods of synthesizing CPDs at the decoder. Through computer simulations and subjective listening tests, it is confirmed that the proposed methods produce significantly lower phase errors than conventional PS, and that they can noticeably improve sound quality for stereo inputs with low ICCs....
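To make the parameter definitions concrete, the sketch below computes an inter-channel phase difference and two channel phase differences for one band of complex subband samples. It assumes a simple downmix `(left + right) / 2` and takes each phase as the angle of a cross-correlation sum; both choices, and the function names, are assumptions for illustration rather than the paper's exact formulation.

```python
import cmath

def phase_of_correlation(a, b):
    """Phase of sum_k a[k] * conj(b[k]) for two complex sequences."""
    return cmath.phase(sum(x * y.conjugate() for x, y in zip(a, b)))

def phase_parameters(left, right):
    """Return IPD and per-channel CPDs for one band (assumed definitions)."""
    # Assumed mono downmix: plain average of the two channels.
    mono = [(l + r) / 2 for l, r in zip(left, right)]
    return {
        "ipd": phase_of_correlation(left, right),
        "cpd_left": phase_of_correlation(left, mono),
        "cpd_right": phase_of_correlation(right, mono),
    }
```

For example, `phase_parameters([1 + 0j], [1j])` describes a band in which the left channel lags the right by a quarter cycle, and the two CPDs split that difference symmetrically around the downmix.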
Considerable effort has been made in Computational Auditory Scene Analysis (CASA) to segregate target speech from monaural mixtures. Based on the principle of CASA, this article proposes an improved algorithm for monaural speech segregation. To extract the energy feature more accurately, the proposed algorithm improves the threshold selection for response energy in the initial segmentation stage. Since the resulting mask map often contains broken auditory element groups after the grouping stage, a smoothing stage based on morphological image processing is proposed. Through the combination of erosion and dilation operations, we suppress intrusions by removing unwanted particles and enhance the segregated speech by complementing the broken auditory elements. Systematic evaluation shows that the proposed segregation algorithm improves the output signal-to-noise ratio by an average of 8.55 dB and cuts the percentage of noise residue by an average of 25.36% compared with the mixture, yielding a significant improvement for speech segregation....
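The smoothing idea can be illustrated with erosion and dilation on a binary time-frequency mask. This is a minimal sketch assuming a 3x3 structuring element and a mask stored as a list of rows of 0/1 values; the element shape, the border handling (only in-bounds neighbours are considered), and the order in which the paper combines the operations are assumptions.

```python
def _neighbors(mask, r, c):
    """Values in the 3x3 neighbourhood of (r, c), clipped at the borders."""
    rows, cols = len(mask), len(mask[0])
    return [mask[i][j]
            for i in range(max(0, r - 1), min(rows, r + 2))
            for j in range(max(0, c - 1), min(cols, c + 2))]

def erode(mask):
    """Keep a cell only if every in-bounds neighbour is set."""
    return [[int(all(_neighbors(mask, r, c))) for c in range(len(mask[0]))]
            for r in range(len(mask))]

def dilate(mask):
    """Set a cell if any in-bounds neighbour is set."""
    return [[int(any(_neighbors(mask, r, c))) for c in range(len(mask[0]))]
            for r in range(len(mask))]

def opening(mask):
    """Erosion then dilation: removes small isolated particles."""
    return dilate(erode(mask))

def closing(mask):
    """Dilation then erosion: fills small holes in otherwise solid regions."""
    return erode(dilate(mask))
```

Opening corresponds to removing unwanted particles (intrusions), while closing corresponds to complementing broken auditory elements; a larger or differently shaped structuring element would change which features count as "small".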