Current Issue: January - March | Volume: 2016 | Issue Number: 1 | Articles: 4
Over recent years, the i-vector-based framework has been proven to provide state-of-the-art performance in speaker verification. Each utterance is projected onto a total factor space and is represented by a low-dimensional feature vector, and channel compensation techniques are carried out in this low-dimensional feature space. Most of the compensation techniques take the sets of extracted i-vectors as input. By constructing the between-class and within-class covariance matrices, we attempt to minimize the within-class variance, which is mainly caused by channel effects, and to maximize the variance between speakers. In real-world applications, enrollment and test data from each user (or speaker) are always scarce. Although it is widely thought that session variability is mostly caused by channel effects, phonetic variability is another factor contributing to session variability that must be considered. In this paper we propose a new i-vector extraction algorithm that operates on the total factor matrix, which we term component reduction analysis (CRA). This new algorithm contributes to better modelling of session variability in the total factor space. We report results on the male English trials of the core condition of the NIST 2008 Speaker Recognition Evaluation (SRE) dataset. As measured both by the equal error rate and the minimum value of the NIST detection cost function, a 10-15% relative improvement is achieved over the baseline of a traditional i-vector-based system...
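The abstract does not spell out the CRA algorithm itself, but the between-class/within-class covariance construction it refers to is the standard LDA-style channel compensation step. The following is a minimal NumPy sketch of that step; the array names (ivectors, speaker_ids) and the regularization constant are illustrative assumptions, not the paper's implementation.

import numpy as np

def lda_channel_compensation(ivectors, speaker_ids, n_dims):
    # ivectors    : (N, D) array, one i-vector per utterance
    # speaker_ids : length-N array of speaker labels
    # n_dims      : dimensionality of the compensated space
    mu = ivectors.mean(axis=0)
    D = ivectors.shape[1]
    Sb = np.zeros((D, D))  # between-class (speaker) covariance
    Sw = np.zeros((D, D))  # within-class (session/channel) covariance
    for spk in np.unique(speaker_ids):
        X = ivectors[speaker_ids == spk]
        mu_s = X.mean(axis=0)
        d = (mu_s - mu)[:, None]
        Sb += len(X) * (d @ d.T)
        Xc = X - mu_s
        Sw += Xc.T @ Xc
    Sw += 1e-6 * np.eye(D)  # regularize in case Sw is singular
    # Maximize speaker variance relative to session variance:
    # generalized eigenproblem Sb v = lambda * Sw v.
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(eigvals.real)[::-1]
    A = eigvecs[:, order[:n_dims]].real
    return A  # compensate an i-vector w with A.T @ w

Enrollment and test i-vectors are then projected with A.T before scoring, so that the retained directions emphasize speaker identity over session effects.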
Automatic summarization of sports video content has been an object of great interest for many years. Although semantic description techniques have been proposed, many approaches still rely on low-level video descriptors that yield quite limited results owing to the complexity of the problem and the limited capability of such descriptors to represent semantic content. In this paper, a new approach for the automatic generation of highlights summaries of soccer videos using audio-visual descriptors is presented. The approach is based on the segmentation of the video sequence into shots, which are then analyzed to determine their relevance and interest. Of special interest in the approach is the use of audio information, which provides additional robustness to the overall performance of the summarization system. For every video shot, a set of low- and mid-level audio-visual descriptors is computed and subsequently combined to obtain different relevance measures based on empirical knowledge rules. The final summary is generated by selecting the shots with the highest interest according to the specifications of the user and the results of the relevance measures. A variety of results obtained on real soccer video sequences demonstrate the validity of the approach...
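As a rough illustration of the shot-selection step described above, the sketch below combines a few per-shot descriptors into a relevance score and keeps the top-scoring shots up to a requested summary length. The descriptor set, the weights, and all names are placeholders chosen for illustration; the paper's actual descriptors and empirical rules are not reproduced here.

from dataclasses import dataclass

@dataclass
class Shot:
    start: float          # shot boundaries in seconds
    end: float
    audio_energy: float   # normalized crowd/commentator loudness, 0..1
    motion: float         # normalized camera/player motion, 0..1
    close_up: float       # fraction of frames detected as close-ups, 0..1

def relevance(shot, w_audio=0.5, w_motion=0.3, w_closeup=0.2):
    # Empirical rule: weighted combination of descriptors.
    # The weights here are placeholders, not the paper's values.
    return (w_audio * shot.audio_energy
            + w_motion * shot.motion
            + w_closeup * shot.close_up)

def summarize(shots, max_duration):
    # Pick the highest-relevance shots until the requested summary
    # length is reached, then restore temporal order for playback.
    ranked = sorted(shots, key=relevance, reverse=True)
    selected, total = [], 0.0
    for s in ranked:
        dur = s.end - s.start
        if total + dur <= max_duration:
            selected.append(s)
            total += dur
    return sorted(selected, key=lambda s: s.start)

Weighting the audio descriptor most heavily reflects the abstract's point that audio information adds robustness: exciting events in soccer reliably coincide with crowd and commentator peaks.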
This paper presents extended techniques aimed at improving automatic speech recognition (ASR) in single-channel scenarios in the context of the REVERB (REverberant Voice Enhancement and Recognition Benchmark) challenge. The focus is on the development and analysis of ASR front-end technologies covering speech enhancement and feature extraction. Speech enhancement is performed using a joint noise reduction and dereverberation system in the spectral domain, based on estimates of the noise and late-reverberation power spectral densities (PSDs). To obtain reliable estimates of the PSDs, even in acoustic conditions with positive direct-to-reverberation energy ratios (DRRs), we adopt a statistical model of the room impulse response that explicitly incorporates the DRR, also in combination with a newly proposed joint estimator for the reverberation time T60 and the DRR. The feature extraction approach is inspired by processing strategies of the auditory system: an amplitude modulation filterbank is applied to extract temporal modulation information. These techniques were shown to improve on the REVERB baseline in our previous work. Here, we investigate whether similar improvements are obtained when using a state-of-the-art ASR framework, and to what extent the results depend on the specific architecture of the back-end. Apart from conventional Gaussian mixture model (GMM)-hidden Markov model (HMM) back-ends, we consider subspace GMM (SGMM)-HMMs as well as deep neural networks in a hybrid system. The speech enhancement algorithm is found to be helpful in almost all conditions, with the exception of deep learning systems in matched training-test conditions. The auditory feature type improves on the baseline for all system architectures. The relative word error rate reduction achieved by combining our front-end techniques with current back-ends is 52.7% on average on the REVERB evaluation test set compared to our original REVERB result...
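As a sketch of the kind of spectral-domain suppression described above, the snippet below estimates the late-reverberation PSD with a simple exponential-decay (Lebart-style) model of the room impulse response, scales it by the reverberant energy fraction implied by the DRR, and applies a floored spectral-subtraction gain. The function names, the default parameters, and the crude DRR scaling are assumptions, deliberately simpler than the challenge system's actual estimator.

import numpy as np

def late_reverb_psd(spec_power, t60, drr_db, fs=16000, hop=256, t_late=0.05):
    # spec_power : (frames, bins) STFT power of the reverberant signal
    # t60        : reverberation time in seconds
    # drr_db     : direct-to-reverberation ratio in dB
    # t_late     : delay (s) after which reverberation counts as "late"
    alpha = 3.0 * np.log(10.0) / t60            # RIR decay constant (60 dB at t60)
    n_late = max(1, int(round(t_late * fs / hop)))  # late-reverb delay in frames
    decay = np.exp(-2.0 * alpha * t_late)       # energy decay over t_late
    kappa = 1.0 / (1.0 + 10.0 ** (drr_db / 10.0))   # reverberant energy fraction
    psd_late = np.zeros_like(spec_power)
    psd_late[n_late:] = kappa * decay * spec_power[:-n_late]
    return psd_late

def suppression_gain(spec_power, psd_noise, psd_late, g_min=0.1):
    # Spectral-subtraction-style gain jointly attenuating noise and
    # late reverberation, floored at g_min to limit musical artifacts.
    gain = 1.0 - (psd_noise + psd_late) / np.maximum(spec_power, 1e-12)
    return np.clip(gain, g_min, 1.0)

The gain is applied per time-frequency bin to the reverberant STFT before resynthesis or feature extraction; accurate T60 and DRR estimates are what make the late-reverberation PSD prediction usable, which motivates the joint estimator mentioned in the abstract.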
In this paper we present the Latin Music Mood Database, an extension of the Latin Music Database for the task of music mood/emotion classification. The method for assigning mood labels to the musical recordings is based on the knowledge of a professionally trained Brazilian musician and on the identification of the predominant emotion perceived in each song. We also present an analysis of the mood distribution across the different genres of the database...