We present an extensive experimental study of Phrase-based Statistical Machine Translation, from the point of view of its learning capabilities. Highly accurate learning curves are obtained using high-performance computing, and extrapolations of the system's projected performance under different conditions are provided. Our experiments confirm existing and mostly unpublished beliefs about the learning capabilities of statistical machine translation systems. We also provide insight into the way statistical machine translation learns from data, including the respective influence of the translation and language models, the impact of phrase length on performance, and various unlearning and perturbation analyses. Our results support and illustrate the fact that performance improves by a constant amount for each doubling of the data, across different language pairs and different systems. This fundamental limitation seems to be a direct consequence of Zipf's law governing textual data. Although the rate of improvement may depend on both the data and the estimation method, it is unlikely that the general shape of the learning curve will change without major changes in the modeling and inference phases. Possible research directions that address this issue include the integration of linguistic rules and the development of active learning procedures.
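To make the "constant improvement per doubling of the data" claim concrete: it corresponds to a log-linear learning curve of the form BLEU(n) ≈ a + b·log2(n), where b is the gain per doubling of the training set. The following is a minimal sketch (not the authors' code) of how such a curve can be fitted and extrapolated; the corpus sizes and BLEU scores are hypothetical placeholders, not results from the paper.

```python
# Illustrative sketch of the log-linear learning-curve model
#     BLEU(n) ~ a + b * log2(n)
# implied by "constant improvement per doubling of the data".
# All numbers below are hypothetical, for demonstration only.

import numpy as np

sizes = np.array([10_000, 20_000, 40_000, 80_000, 160_000])  # parallel sentences (hypothetical)
bleu = np.array([18.1, 19.6, 21.2, 22.7, 24.3])              # measured BLEU (hypothetical)

# Fit BLEU = a + b * log2(n); the slope b is the gain per doubling.
b, a = np.polyfit(np.log2(sizes), bleu, deg=1)
print(f"gain per doubling of the data: {b:.2f} BLEU")

# Extrapolate the projected performance at a larger corpus size.
n_target = 1_000_000
print(f"projected BLEU at {n_target:,} sentences: {a + b * np.log2(n_target):.1f}")
```

Because the curve grows only logarithmically in n, each additional BLEU point requires exponentially more data, which is why the abstract frames this as a fundamental limitation rather than an engineering one.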