Current Issue: January - March; Volume: 2013; Issue Number: 1; Articles: 5
In this article, a novel voice activity detection (VAD) approach based on phoneme recognition using a Gaussian mixture model based hidden Markov model (HMM/GMM) is proposed. Sophisticated speech features such as high order statistics (HOS), harmonic structure information and Mel-frequency cepstral coefficients (MFCCs) are employed to represent each speech/non-speech segment. The main idea of the method is to treat non-speech as an additional phoneme alongside the conventional phonemes of Mandarin; all of them are then trained under the maximum likelihood principle with the Baum-Welch algorithm using the HMM/GMM model. The Viterbi decoding algorithm is finally used to find the maximum-likelihood labelling of the observed signals. The proposed method shows higher speech/non-speech detection accuracy over a wide range of SNR regimes than some existing VAD methods. We also propose a different method to demonstrate that conventional speech enhancement relying only on accurate VAD is not effective enough for automatic speech recognition (ASR) in low SNR regimes....
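As a rough illustration of the Viterbi decoding step mentioned in the abstract above, the following sketch decodes a two-state (non-speech/speech) chain from per-frame likelihoods. It is a simplified stand-in, not the authors' phoneme-level HMM/GMM system: the two-state topology, the function name `viterbi`, the transition probabilities and the toy frame likelihoods are all assumptions made for illustration.

```python
import math

def viterbi(log_emissions, log_trans, log_init):
    """Return the most likely state sequence.

    log_emissions[t][s]: log-likelihood of frame t under state s
    log_trans[i][j]:     log probability of moving from state i to state j
    log_init[s]:         log probability of starting in state s
    """
    n_states = len(log_init)
    # best[s] = best log score of any path that ends in state s at the current frame
    best = [log_init[s] + log_emissions[0][s] for s in range(n_states)]
    backpointers = []
    for frame in log_emissions[1:]:
        new_best, pointers = [], []
        for j in range(n_states):
            # Pick the predecessor that maximises the score of reaching state j
            i_best = max(range(n_states), key=lambda i: best[i] + log_trans[i][j])
            pointers.append(i_best)
            new_best.append(best[i_best] + log_trans[i_best][j] + frame[j])
        best = new_best
        backpointers.append(pointers)
    # Trace back from the best final state
    state = max(range(n_states), key=lambda s: best[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path

NON_SPEECH, SPEECH = 0, 1

# Hypothetical per-frame likelihoods (non-speech, speech); frame 3 is ambiguous
frame_probs = [
    (0.90, 0.10), (0.90, 0.10), (0.05, 0.95), (0.60, 0.40),
    (0.05, 0.95), (0.90, 0.10), (0.90, 0.10),
]
log_emissions = [[math.log(p) for p in frame] for frame in frame_probs]

# Assumed "sticky" transitions: states tend to persist from frame to frame
log_trans = [[math.log(0.9), math.log(0.1)],
             [math.log(0.1), math.log(0.9)]]
log_init = [math.log(0.5), math.log(0.5)]

labels = viterbi(log_emissions, log_trans, log_init)
framewise = [max((NON_SPEECH, SPEECH), key=lambda s: frame[s]) for frame in frame_probs]
print("frame-wise:", framewise)
print("viterbi:   ", labels)
```

With these toy numbers, a frame-by-frame decision flips to non-speech at the ambiguous frame, while the decoded path keeps that frame inside the surrounding speech segment because switching states twice costs more than one weak emission.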
As fundamental research for human-robot interaction, this paper addresses the rhythmic reference of a human while turning a rope with another human. We hypothesized that, when interpreting rhythm cues to form a rhythm reference, humans use auditory and force rhythms more than visual ones. The test subjects were 21 to 23 years old. We masked the perception of each test subject using three kinds of masks: an eye mask, headphones, and a force mask. The force mask is composed of a robot arm and a remote controller; these instruments allow a test subject to turn a rope without feeling force from the rope. In the first experiment, each test subject interacted with an operator who turned a rope with a constant rhythm. Eight experiments were conducted for each test subject, with the subject wearing different combinations of masks. We measured the angular velocity of the force between a test subject/the operator and a rope. We calculated the error between the angular velocities of the force directions, and validated the error. In the second experiment, two test subjects interacted with each other. An auditory rhythm of 1.6 - 2.4 Hz was presented through headphones to indicate the target turning frequency. In addition to the auditory rhythm, the test subjects wore eye masks. The first experiment showed that visual rhythm has little influence on rope-turning cooperation between humans. The second experiment provided firmer evidence for the same hypothesis because humans neglected their visual rhythms....
A novel approach for robust dialogue act detection in a spoken dialogue system is proposed. A shallow representation, the partial sentence tree, is employed to represent automatic speech recognition outputs. Parsing results of partial sentences can be decomposed into derivation rules, which turn out to be salient features for dialogue act detection. Data-driven dialogue acts are learned via an unsupervised learning algorithm called spectral clustering, in a vector space whose axes correspond to derivation rules. The proposed method is evaluated in a Mandarin spoken dialogue system for tourist-information services. Combined with information obtained from the automatic speech recognition module and from a Markov model of dialogue act sequences, the proposed method achieves a detection accuracy of 85.1%, which is significantly better than the baseline performance of 62.3% using a naïve Bayes classifier. Furthermore, the average number of turns per dialogue session also decreases significantly with the improved detection accuracy....
This article discusses our research on polyphonic music transcription using non-negative matrix factorisation (NMF). The application of NMF to polyphonic transcription offers an alternative approach in which observed frequency spectra from polyphonic audio are seen as an aggregation of spectra from monophonic components. However, it is not easy to find accurate aggregations using a standard NMF procedure, since there are many ways to satisfy the factoring V ≈ WH. Three limitations associated with the application of standard NMF to factor frequency spectra are (i) the permutation of the transcription output; (ii) the unknown factoring r; and (iii) the tendency of the factors W and H to be trapped in a sub-optimal solution. This work explores heuristics that exploit the harmonic information of each pitch to tackle these limitations. In our implementation, this harmonic information is learned from training data consisting of the pitches of a desired instrument, while the unknown effective r is approximated from the correlation between the input signal and the training data. This approach offers an effective exploitation of domain knowledge. The empirical results show that the proposed approach could significantly improve the accuracy of the transcription output compared with the standard NMF approach....
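To make the factoring V ≈ WH in the abstract above more concrete, here is a minimal sketch that holds W fixed to known per-pitch spectral templates and estimates only the activations H with the standard multiplicative update for the Euclidean cost. This is not the authors' heuristic procedure: the function name `estimate_activations`, the four-bin templates and the toy spectrogram are hypothetical values chosen for illustration.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    # Plain-Python matrix product of two lists of rows
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def estimate_activations(V, W, n_iter=200, eps=1e-12):
    """Estimate non-negative H with V ~ W H, keeping the templates W fixed.

    V: spectrogram, rows = frequency bins, columns = time frames
    W: templates,   rows = frequency bins, columns = pitches
    """
    n_pitches, n_frames = len(W[0]), len(V[0])
    H = [[1.0] * n_frames for _ in range(n_pitches)]  # positive initial guess
    Wt = transpose(W)
    WtV = matmul(Wt, V)
    WtW = matmul(Wt, W)
    for _ in range(n_iter):
        WtWH = matmul(WtW, H)
        # Multiplicative update: H <- H * (W^T V) / (W^T W H), element-wise
        H = [[H[k][t] * WtV[k][t] / (WtWH[k][t] + eps) for t in range(n_frames)]
             for k in range(n_pitches)]
    return H

# Hypothetical templates for two pitches over four frequency bins
W = [[1.0, 0.0],
     [0.0, 1.0],
     [0.5, 0.3],
     [0.0, 0.4]]

# Toy score: pitch 0 alone, pitch 1 alone, then both together
H_true = [[1.0, 0.0, 1.0],
          [0.0, 1.0, 1.0]]
V = matmul(W, H_true)

H = estimate_activations(V, W)
for row in H:
    print([round(x, 2) for x in row])
```

Because the columns of W are tied to specific pitches in advance, the rows of H come out in a known order, which sidesteps the output permutation issue in this toy setting. Entries whose true value is zero shrink towards zero only gradually under multiplicative updates, so in practice a threshold would be applied before reading off notes.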
The problem of blind source separation (BSS) of convolved acoustic signals is of great interest for many classes of applications. Due to the convolutive mixing process, the source separation is performed in the frequency domain, using independent component analysis (ICA). However, frequency domain BSS involves several major problems that must be solved. One of these is the permutation problem. The permutation ambiguity of ICA needs to be resolved so that each separated signal contains the frequency components of only one source signal. This article presents a class of methods for solving the permutation problem based on information theoretic distance measures. The proposed algorithms have been tested on different real-room speech mixtures with different reverberation times in conjunction with different ICA algorithms....
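The permutation problem described in the abstract above can be sketched as follows: for each frequency bin, choose the ordering of the separated outputs whose normalised magnitude envelopes are closest to those of the previous bin. The abstract does not say which distance measures are used; this sketch assumes the Kullback-Leibler divergence as one example of an information theoretic distance, and the function name `align_bins`, the adjacent-bin reference and the toy envelopes are likewise assumptions.

```python
import math
from itertools import permutations

def normalise(envelope, eps=1e-12):
    # Turn a non-negative envelope into a strictly positive distribution over time frames
    total = sum(envelope) + eps * len(envelope)
    return [(x + eps) / total for x in envelope]

def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def align_bins(envelopes):
    """envelopes[f][k] is the magnitude envelope over time of output k in bin f.

    Returns one tuple per bin; perm[k] is the output index assigned to source slot k.
    """
    n_sources = len(envelopes[0])
    reference = [normalise(env) for env in envelopes[0]]
    result = [tuple(range(n_sources))]  # the first bin defines the source order
    for bin_envelopes in envelopes[1:]:
        candidates = [normalise(env) for env in bin_envelopes]
        # Exhaustive search over orderings; fine for a small number of sources
        best = min(
            permutations(range(n_sources)),
            key=lambda perm: sum(
                kl_divergence(reference[k], candidates[perm[k]])
                for k in range(n_sources)
            ),
        )
        result.append(best)
        # The aligned bin becomes the reference for the next bin
        reference = [candidates[best[k]] for k in range(n_sources)]
    return result

# Toy data: source A is active early, source B late; bin 1 has its outputs swapped
envelopes = [
    [[0.9, 0.8, 0.7, 0.1, 0.1, 0.1], [0.1, 0.1, 0.2, 0.8, 0.9, 0.7]],
    [[0.1, 0.2, 0.1, 0.7, 0.8, 0.9], [0.8, 0.9, 0.6, 0.1, 0.2, 0.1]],
    [[0.7, 0.9, 0.8, 0.2, 0.1, 0.1], [0.2, 0.1, 0.1, 0.9, 0.7, 0.8]],
]
alignment = align_bins(envelopes)
print(alignment)
```

Chaining the reference from one bin to the next keeps the sketch short, but an error in one bin can propagate to all higher bins, so a real system would likely need a more robust reference.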