Current Issue: January - March, Volume: 2015, Issue Number: 1, Articles: 5
This paper proposes a new speech enhancement (SE) algorithm that applies constraints to the Wiener gain function and is capable of working at signal-to-noise ratios (SNRs) of 10 dB and lower. The wavelet-thresholded multitaper spectrum was taken as the clean spectrum for the constraints. The proposed algorithm was evaluated under eight types of noise and seven SNR levels in the NOIZEUS database and was predicted by the composite measures and the SNRLOSS measure to improve subjective quality and speech intelligibility in various noisy environments. Comparisons with two other algorithms (KLT and wavelet thresholding (WT)) demonstrate that in terms of signal distortion, overall quality, and the SNRLOSS measure, our proposed constrained SE algorithm outperforms the KLT and WT schemes for most conditions considered....
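For orientation, the unconstrained Wiener gain is conventionally written in terms of the a priori SNR ξ as G = ξ / (1 + ξ). The sketch below shows that textbook form together with a simple clamp as a stand-in for a constraint. The clamp bounds `g_min` and `g_max` and all function names are assumptions for illustration; the abstract does not specify the constraints actually derived from the wavelet-thresholded multitaper spectrum.

```python
def snr_db_to_ratio(snr_db):
    """Convert an SNR in decibels to a linear power ratio."""
    return 10.0 ** (snr_db / 10.0)


def wiener_gain(clean_power, noise_power):
    """Textbook Wiener gain G = xi / (1 + xi), with xi = clean_power / noise_power."""
    if noise_power <= 0:
        return 1.0  # no noise estimate: pass the bin through unchanged
    xi = max(clean_power, 0.0) / noise_power
    return xi / (1.0 + xi)


def constrained_gain(clean_power, noise_power, g_min=0.1, g_max=1.0):
    """Wiener gain clamped to [g_min, g_max].

    The clamp is a hypothetical placeholder for the paper's constraints,
    which are not given in the abstract.
    """
    g = wiener_gain(clean_power, noise_power)
    return min(max(g, g_min), g_max)


# At 0 dB the clean and noise powers are equal, so the gain is one half.
print(wiener_gain(snr_db_to_ratio(0), 1.0))
```

In a real enhancer the gain would be computed per frequency bin and per frame, with `clean_power` taken from whatever clean-spectrum estimate is in use.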
In a bid to enhance the search performance, this paper presents an improved version of the reduced candidate mechanism (RCM), an algebraic codebook search conducted on an algebraic code-excited linear prediction (ACELP) speech coder. The improvement is based on two findings from our prior work. The first is that a pulse with a high contribution in the associated track is more likely to serve as an optimal pulse in the optimal code vector, and the second is that speech quality can be well maintained at a search accuracy above approximately 50%. A new finding in this study concerning the structured algebraic codebook in G.729 indicates that there is a 0.8321 probability that the number 1 ranked pulse in a global sorting by pulse contribution is indeed one of the optimal pulses. Hence, the number 1 pulse in the global sorting is labeled as one of the optimal pulses, after which a sequence of search tasks is carried out through RCM. This proposed complexity reduction algorithm, implemented on a G.729A speech codec, takes as few as eight searches, a search load tantamount to 2.5% of G.729A, 12.5% of global pulse replacement method (iterati 16.7% of iteration-free pulse replacement method, and 50% of RCM (N = 2). The proposal is thus found to reduce the required computational complexity to a great extent, as intended....
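The relative search loads quoted above imply an absolute search count for each baseline. The arithmetic can be sketched as follows, using only the eight-search figure and the percentages stated in the abstract. The dictionary keys and the function name are illustrative, and the 12.5% entry is labeled generically because the source text is truncated at that point.

```python
PROPOSED_SEARCHES = 8  # "as few as eight searches" in the abstract

# Fraction of each baseline's search load that eight searches represent.
RELATIVE_LOAD = {
    "G.729A": 0.025,
    "global pulse replacement": 0.125,
    "iteration-free pulse replacement": 0.167,
    "RCM (N = 2)": 0.50,
}


def implied_searches(proposed, fraction):
    """Number of baseline searches implied by proposed / fraction, rounded."""
    return round(proposed / fraction)


for method, fraction in RELATIVE_LOAD.items():
    print(f"{method}: about {implied_searches(PROPOSED_SEARCHES, fraction)} searches")
```

The 16.7% figure is itself rounded, which is why the implied count is rounded to the nearest integer rather than reported exactly.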
This paper studies a novel audio segmentation-by-classification approach based on factor analysis. The proposed technique compensates for within-class variability by using class-dependent factor loading matrices and obtains the scores by computing the log-likelihood ratio of the class model to a non-class model over fixed-length windows. Afterwards, these scores are smoothed by different back-end systems to yield longer contiguous segments of the same class. Unlike previous solutions, our proposal does not make use of specific acoustic features and does not need a hierarchical structure. The proposed method is applied to segment and classify audio from TV shows into five different acoustic classes: speech, music, speech with music, speech with noise, and others. The technique is compared to a hierarchical system with specific acoustic features, achieving a significant error reduction....
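The score-then-smooth pipeline described above can be illustrated with a small sketch. It assumes the per-window log-likelihood ratios are already available for each class and uses a plain moving average as a stand-in for the paper's back-end systems, which the abstract does not detail. All names and the example scores are hypothetical.

```python
def smooth(scores, half_width=1):
    """Moving average over a window of up to 2 * half_width + 1 scores."""
    out = []
    for i in range(len(scores)):
        lo = max(0, i - half_width)
        hi = min(len(scores), i + half_width + 1)
        out.append(sum(scores[lo:hi]) / (hi - lo))
    return out


def segment(llr_by_class, half_width=1):
    """Label each window with the best smoothed class, then merge equal neighbors.

    llr_by_class maps a class name to its list of per-window log-likelihood
    ratios. Returns (label, start_window, end_window) tuples, end exclusive.
    """
    smoothed = {c: smooth(s, half_width) for c, s in llr_by_class.items()}
    n_windows = len(next(iter(smoothed.values())))
    labels = [max(smoothed, key=lambda c: smoothed[c][i]) for i in range(n_windows)]

    segments = []
    for i, label in enumerate(labels):
        if segments and segments[-1][0] == label:
            segments[-1] = (label, segments[-1][1], i + 1)  # extend current segment
        else:
            segments.append((label, i, i + 1))
    return segments


# Made-up scores: window 2 is a brief outlier inside a speech stretch.
example = {
    "speech": [2, 2, -1, 2, 2, -2, -2, -2],
    "music": [-2, -2, 1, -2, -2, 2, 2, 2],
}
print(segment(example, half_width=0))  # raw decisions, fragmented
print(segment(example, half_width=1))  # smoothed decisions, longer segments
```

With no smoothing the outlier window splits the speech stretch into pieces; with a three-window average it is absorbed, which is the effect the smoothing stage is meant to achieve.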
In many speech communication applications, robust localization and tracking of multiple speakers in noisy and reverberant environments are of major importance. Several algorithms to tackle this problem have been proposed in the last decades. In this paper, we propose several extensions to a recently presented joint direction of arrival (DOA) and pitch estimation method, increasing its robustness in multi-speaker scenarios, noise, and reverberation. First, a spectral comb filter is added to the original algorithm to better cope with concurrent speakers. Second, the well-known generalized cross-correlation with phase transform (GCC-PHAT) is used as an additional weighting function to improve the DOA estimation accuracy in terms of correct hits. Third, using multiple microphone pairs, the multi-channel cross-correlation approach is incorporated to improve the robustness against noise and reverberation. In order to improve tracking for moving and even intersecting speakers, a particle filter is used. Experiments with real-world recordings in realistic acoustic conditions show that the proposed extensions increase the DOA hit rate by about 33% compared to the original algorithm for two step-wise moving sources at a signal-to-noise ratio (SNR) of 15 dB and a reverberation time RT60 of 560 ms....
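Of the components listed above, GCC-PHAT is the most self-contained: the cross-spectrum of two microphone signals is normalized to unit magnitude so that only phase remains, and the peak of its inverse transform gives the inter-microphone delay. The following is a minimal standard-library sketch of that textbook procedure for a single microphone pair, not the paper's combined DOA and pitch estimator. The naive DFT, the function names, and the synthetic test signal are all assumptions for illustration.

```python
import cmath
import random


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (adequate for short examples)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out


def gcc_phat_delay(sig, ref, max_lag):
    """Estimate by how many samples `sig` lags `ref` (negative if it leads)."""
    n = len(sig) + len(ref)  # zero-pad so circular correlation acts like linear
    S = dft(list(sig) + [0.0] * (n - len(sig)))
    R = dft(list(ref) + [0.0] * (n - len(ref)))
    cross = [s * r.conjugate() for s, r in zip(S, R)]
    eps = 1e-12  # guards against division by zero in empty bins
    weighted = [c / (abs(c) + eps) for c in cross]  # PHAT: keep phase only
    cc = [v.real for v in dft(weighted, inverse=True)]

    best_lag, best_val = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        val = cc[lag % n]  # negative lags wrap to the end of the array
        if val > best_val:
            best_lag, best_val = lag, val
    return best_lag


# Synthetic check: a noise burst and a copy delayed by three samples.
rng = random.Random(0)
reference = [rng.gauss(0.0, 1.0) for _ in range(32)]
delayed = [0.0] * 3 + reference
print(gcc_phat_delay(delayed, reference, max_lag=8))  # expected: 3
```

Converting the estimated delay to a direction of arrival additionally requires the sampling rate, the microphone spacing, and the speed of sound, none of which are given in the abstract.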
The current paper examines influences of speech rate on Fujisaki model parameters based on read speech from the Bonn Tempo-Corpus containing productions by 12 native speakers of German at five different intended tempo levels (very slow, slow, normal, fast, fastest possible). The normal condition was produced at an average rate of 6.34 syllables/s or 100%, the very slow version at 67%, and the fastest version at 161% of the normal rate. We extracted F0 contours and subjected them to decomposition using the Fujisaki model. We ordered all the data with respect to their actual speech rates. First, we assessed how prosodic realizations vary with speech rate and examined phrase command magnitudes, the number of phrase commands as well as the base frequency, accent command amplitudes, and the timing of accent commands with respect to the underlying syllables and their nuclear vowels. Second, we analyzed between-sentence variability within and between speakers and investigated whether and how the prosodic structure is preserved at different speech rates. For very slow speech, we found for some of the speakers that the original phrase structure had disintegrated into something like a list of isolated words separated by pauses. Very fast speech became chains of uniform syllables at very high pitch and with almost flat intonation. With respect to the F0 range reflected by the amplitude of accent commands, we found strong interspeaker differences. While four of the subjects exhibited a significant reduction at higher speech rates, the others did not. As speed increases, it appears that F0 gestures commence earlier in the syllable, that is, the onset time of accent commands is located closer to the syllable/vowel onset than at lower speed....
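The parameters named above (base frequency, phrase command magnitudes, accent command amplitudes and timing) are the inputs of the Fujisaki model, which builds the logarithm of F0 as a baseline plus phrase and accent components. A minimal sketch of the standard forward model follows. The constants `alpha`, `beta`, and `gamma` are set to commonly used default values, and the command timings in the example are invented; none of these values come from the paper.

```python
import math


def phrase_response(t, alpha=2.0):
    """Impulse response of the phrase control: alpha^2 * t * exp(-alpha * t) for t >= 0."""
    return alpha * alpha * t * math.exp(-alpha * t) if t >= 0 else 0.0


def accent_response(t, beta=20.0, gamma=0.9):
    """Step response of the accent control, capped at the ceiling gamma."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)


def fujisaki_f0(t, fb, phrase_cmds, accent_cmds, alpha=2.0, beta=20.0, gamma=0.9):
    """F0 in Hz at time t (seconds).

    fb          base frequency in Hz
    phrase_cmds list of (onset_time, magnitude)
    accent_cmds list of (onset_time, offset_time, amplitude)
    """
    log_f0 = math.log(fb)
    for onset, magnitude in phrase_cmds:
        log_f0 += magnitude * phrase_response(t - onset, alpha)
    for onset, offset, amplitude in accent_cmds:
        log_f0 += amplitude * (
            accent_response(t - onset, beta, gamma)
            - accent_response(t - offset, beta, gamma)
        )
    return math.exp(log_f0)


# Invented example: one phrase command at 0 s and one accent from 0.2 s to 0.4 s.
phrases = [(0.0, 0.5)]
accents = [(0.2, 0.4, 0.3)]
for time in (0.0, 0.1, 0.3, 0.6, 1.5):
    print(f"t = {time:.1f} s  F0 = {fujisaki_f0(time, 100.0, phrases, accents):.1f} Hz")
```

Decomposition, as performed in the study, is the inverse problem: finding the commands that reproduce an observed F0 contour. In this formulation, an earlier accent onset relative to the syllable corresponds to a smaller onset time in `accent_cmds`.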