Current Issue: July-September 2016, Issue Number: 3, Articles: 4
The goal of voice conversion is to modify a source speaker's speech to sound as if spoken by a target speaker. Common conversion methods are based on Gaussian mixture modeling (GMM). They aim to statistically model the spectral structure of the source and target signals and require relatively large training sets (typically dozens of sentences) to avoid over-fitting. Moreover, they often lead to muffled synthesized output signals, due to excessive smoothing of the spectral envelopes.

Mobile applications are characterized by low resources in terms of training data, memory footprint, and computational complexity. As technology advances, computational and memory requirements become less limiting; however, the amount of available training data still presents a great challenge, as a typical mobile user is willing to record himself saying just a few sentences. In this paper, we propose the grid-based (GB) conversion method for such low-resource environments, which is successfully trained using very few sentences (5-10). The GB approach is based on sequential Bayesian tracking, by which the conversion process is expressed as a sequential estimation problem of tracking the target spectrum based on the observed source spectrum. The converted Mel-frequency cepstrum coefficient (MFCC) vectors are sequentially evaluated using a weighted sum of the target training vectors used as grid points. The training process involves simple computations of Euclidean distances between the training vectors and is easily performed even in cases of very small training sets.

We use global variance (GV) enhancement to improve the perceived quality of the synthesized signals obtained by the proposed and the GMM-based methods. Using just 10 training sentences, our enhanced GB method leads to converted sentences that have GV values closer to those of the target and, at the same time, lower spectral distances, compared to an enhanced version of the GMM-based conversion method. Furthermore, subjective evaluations show that signals produced by the enhanced GB method are perceived as more similar to the target speaker than the enhanced GMM signals, at the expense of a small degradation in the perceived quality.
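The core per-frame estimate, a weighted sum over the target training vectors with weights derived from Euclidean distances to the paired source training vectors, can be sketched as follows. This is a simplified single-frame illustration, not the paper's full sequential Bayesian tracker; the Gaussian weighting kernel and the `sigma` parameter are assumptions made for the sketch.

```python
import numpy as np

def gb_convert_frame(x_src, src_grid, tgt_grid, sigma=1.0):
    """Convert one source MFCC frame to a target-like frame.

    The converted vector is a weighted sum of the target training
    vectors (grid points); each weight decays with the Euclidean
    distance between the source frame and the paired source
    training vector.  sigma controls how sharply weights decay
    (an assumed kernel width, not a value from the paper).
    """
    d2 = np.sum((src_grid - x_src) ** 2, axis=1)  # squared Euclidean distances
    w = np.exp(-d2 / (2.0 * sigma ** 2))          # Gaussian-like weights
    w /= w.sum()                                  # normalize to sum to 1
    return w @ tgt_grid                           # weighted sum of target grid points
```

With a small `sigma`, a source frame that coincides with a source grid point is mapped almost exactly to the paired target grid point; larger values of `sigma` interpolate more smoothly across the grid.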
Unit selection based text-to-speech synthesis (TTS) has been the dominant TTS approach of the last decade. Despite its success, the unit selection approach has its disadvantages. One of the most significant is the sudden discontinuities in speech that distract listeners (Speech Commun 51:1039-1064, 2009). The second is that significant expertise and large amounts of data are needed to build a high-quality synthesis system, which is costly and time-consuming. The statistical speech synthesis (SSS) approach is a promising alternative synthesis technique. Not only are the spurious errors observed in unit selection systems mostly absent in SSS, but building voice models is also far less expensive and faster than with unit selection. However, the resulting speech is typically not as natural-sounding as speech synthesized with a high-quality unit selection system. There are hybrid methods that attempt to take advantage of both SSS and unit selection systems. However, existing hybrid methods still require the development of a high-quality unit selection system. Here, we propose a novel hybrid statistical/unit selection system for Turkish that aims at improving the quality of the baseline SSS system by improving prosodic parameters such as intonation and stress. Commonly occurring suffixes in Turkish are stored in the unit selection database and used in the proposed system. As opposed to existing hybrid systems, the proposed system was developed without building a complete unit selection synthesis system. Therefore, the proposed method can be used without collecting large amounts of data or requiring the substantial expertise or time-consuming tuning typically needed to build unit selection systems. Listeners preferred the hybrid system over the baseline system in AB preference tests.
The last decade has witnessed considerable development in the area of speech processing, and speech codecs in particular. In telephony networks, a codec is a system of coders and decoders used to reduce bandwidth requirements over limited-capacity channels in real-time communications. The codec takes an analog speech signal from an input source (e.g., a microphone) and converts it into a digital format that can be transmitted across a packet network. In this review article, we provide details of various methodologies for speech coding, with emphasis on the methods and algorithms that are part of recent developments in communications, specifically for speech processing. We focus on the G.7xx series of ITU-T standards for speech codecs. The G.7xx recommendations are generally used in digital communication, specifically for the coding of analog signals into digital signals. The G.7xx family of standards comprises speech and audio codecs that are primarily used in cellular telephony, Internet telephony, VoIP communications, and audio-visual teleconferencing. This review not only provides references for beginners but also gives detailed information about the coders.
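As a concrete illustration of the analog-to-digital coding step, G.711 (the oldest member of the G.7xx family) compands samples with the mu-law curve (mu = 255) before quantization. The sketch below models only the continuous companding characteristic on normalized samples, not the full 8-bit segmented quantizer defined in the recommendation.

```python
import numpy as np

MU = 255.0  # mu-law constant used by G.711 (North American/Japanese variant)

def mu_law_compress(x):
    """Map samples in [-1, 1] through the mu-law companding curve,
    boosting low-amplitude samples before uniform quantization."""
    return np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)

def mu_law_expand(y):
    """Exact analytic inverse of mu_law_compress."""
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(MU)) / MU
```

Companding allocates quantizer resolution where speech energy concentrates: a quiet sample at amplitude 0.01 is mapped to roughly 0.23, so after uniform 8-bit quantization, small signals keep far more effective precision than with linear coding.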
Nowadays, the widely available facilities of telephones, mobile phones, tape recorders, tablets, laptops, and PCs with installed apps, TVs, etc., can be misused, making these devices efficient tools in the service of criminal offences such as bank account hacking, kidnapping, blackmail threats, and terrorist calls. Criminals have seen the opportunity to misuse the various modes of voice communication. Speaker verification has therefore become a widely known speech processing technology. Speaker verification uses a speaker's voice samples to verify a claimed identity for authentication: if a speaker claims a certain identity and the speaker's voice is used to verify this claim, the process is called speaker verification or authentication. It can be either text-dependent or text-independent, and it consists of two modules: a training module and a testing module. This article discusses the MFCC technique for extracting features from the input voice samples. Simulation results are shown for a single frame in MATLAB-13.0.
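The single-frame MFCC extraction described above can be sketched as follows. This is a minimal Python/NumPy illustration (the article itself uses MATLAB): window the frame, take the power spectrum, apply a triangular mel filterbank, take logs, and decorrelate with a DCT-II. The frame length, sampling rate, filter count, and cepstrum count are assumed parameter choices, and details such as pre-emphasis and liftering are omitted.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters with centers spaced evenly on the mel scale."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):            # rising slope
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):           # falling slope
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def mfcc_frame(frame, sr, n_filters=26, n_ceps=13):
    """MFCCs of a single speech frame."""
    frame = frame * np.hamming(len(frame))             # taper frame edges
    power = np.abs(np.fft.rfft(frame)) ** 2 / len(frame)
    fb = mel_filterbank(n_filters, len(frame), sr)
    log_energies = np.log(fb @ power + 1e-10)          # log mel energies
    # DCT-II of the log filterbank energies -> cepstral coefficients
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1)
                   / (2 * n_filters))
    return basis @ log_energies
```

A typical call on an 8 kHz telephone-band signal would pass a 256-sample frame (32 ms) and keep the first 13 coefficients, which is the conventional feature size in speaker verification front ends.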