For speaker tracking, integrating multimodal information from audio and video provides an effective and promising solution. The main current challenge lies in constructing a stable observation model. To this end, we propose a 3D audio-visual speaker tracker assisted by deep metric learning within a two-layer particle filter framework. Firstly, an audio-guided motion model generates candidate samples in a hierarchical structure consisting of an audio layer and a visual layer. Then, a stable observation model is built around a purpose-designed Siamese network, which provides a similarity-based likelihood for calculating particle weights. The speaker position is estimated from an optimal particle set that integrates the decisions of audio particles and visual particles. Finally, a template update strategy based on a long short-term mechanism is adopted to prevent drift during tracking. Experimental results demonstrate that the proposed method outperforms single-modal trackers and comparison methods, achieving efficient and robust tracking both in 3D space and on the image plane.
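As a rough illustration of the similarity-based likelihood described above, the sketch below weights candidate particles by the embedding similarity between each particle's image patch and the speaker template, computed with a shared (Siamese) encoder. All concrete names and values here (SiameseEncoder, embed_dim, kappa, the patch sizes) are illustrative assumptions, not the paper's actual architecture or settings.

```python
# Minimal sketch of similarity-based particle weighting with a shared
# (Siamese) encoder. Names and hyperparameters are hypothetical, not
# taken from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseEncoder(nn.Module):
    """Shared CNN branch mapping an image patch to a unit-norm embedding."""
    def __init__(self, embed_dim: int = 128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.fc(self.features(x).flatten(1))
        return F.normalize(z, dim=1)  # embeddings on the unit sphere

def particle_weights(encoder, template, patches, kappa=10.0):
    """Weight each particle by its embedding similarity to the template.

    template: (1, 3, H, W) reference patch of the tracked speaker
    patches:  (N, 3, H, W) patches cropped at the N particle positions
    Returns normalized weights of shape (N,).
    """
    with torch.no_grad():
        t = encoder(template)              # (1, D) template embedding
        p = encoder(patches)               # (N, D) particle embeddings
    sim = (p @ t.T).squeeze(1)             # cosine similarity in [-1, 1]
    likelihood = torch.exp(kappa * sim)    # sharpen similarity into a likelihood
    return likelihood / likelihood.sum()   # particle weights sum to 1

# Usage: the weights then drive resampling and the weighted-mean
# position estimate of the particle filter.
encoder = SiameseEncoder()
template = torch.rand(1, 3, 64, 64)
patches = torch.rand(50, 3, 64, 64)        # 50 candidate particles
w = particle_weights(encoder, template, patches)
print(w.shape, float(w.sum()))             # torch.Size([50]) 1.0
```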