Background: The number of applications of deep learning algorithms in bioinformatics is increasing, as they usually achieve superior performance over classical approaches, especially when larger training datasets are available. In deep learning applications, discrete data, e.g. words or n-grams in language, or amino acids or nucleotides in bioinformatics, are generally represented as continuous vectors through an embedding matrix. Recently, learning this embedding matrix directly from the data as part of the continuous iteration of the model to optimize the target prediction, a process called "end-to-end learning", has led to state-of-the-art results in many fields. Although the usage of embeddings is well described in the bioinformatics literature, the potential of end-to-end learning for single amino acids, as compared to more classical manually curated encoding strategies, has not been systematically addressed. To this end, we compared classical encoding matrices, namely one-hot, VHSE8 and BLOSUM62, to end-to-end learning of amino acid embeddings for two different prediction tasks using three widely used architectures, namely recurrent neural networks (RNN), convolutional neural networks (CNN), and the hybrid CNN-RNN.

Results: Using different deep learning architectures, we show that end-to-end learning is on par with classical encodings for embeddings of the same dimension, even when limited training data are available, and might allow for a reduction of the embedding dimension without performance loss, which is critical when deploying models to devices with limited computational capacities. We found that the embedding dimension is a major factor in controlling model performance. Surprisingly, we observed that deep learning models are capable of learning from random vectors of appropriate dimension.

Conclusion: Our study shows that end-to-end learning is a flexible and powerful method for amino acid encoding. Further, due to the flexibility of deep learning systems, amino acid encoding schemes should be benchmarked against random vectors of the same dimension to disentangle the information content provided by the encoding scheme from the distinguishability effect provided by the scheme.
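To make the contrast between a fixed encoding scheme and a learnable embedding table concrete, the following is a minimal sketch (not the authors' implementation): each of the 20 canonical amino acids is mapped to a continuous vector via a lookup table. In an end-to-end setting these vectors would be model parameters updated by backpropagation; a fixed scheme such as one-hot or BLOSUM62 would use precomputed rows instead. The dimension of 8 is illustrative, matching the size of a VHSE8 descriptor.

```python
import random

# 20 canonical amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
EMBED_DIM = 8  # illustrative embedding dimension (cf. VHSE8)

# Learnable embedding table: random initialization stands in for weights
# that end-to-end training would update via backpropagation.
random.seed(0)
embedding = {
    aa: [random.gauss(0.0, 0.1) for _ in range(EMBED_DIM)]
    for aa in AMINO_ACIDS
}

# Fixed alternative for comparison: a one-hot encoding, where each
# residue is a sparse 20-dimensional indicator vector.
one_hot = {
    aa: [1.0 if i == j else 0.0 for j in range(len(AMINO_ACIDS))]
    for i, aa in enumerate(AMINO_ACIDS)
}

def encode(sequence, table):
    """Map a peptide string to a list of per-residue vectors."""
    return [table[aa] for aa in sequence]

learned = encode("ACDE", embedding)   # 4 vectors of length 8
fixed = encode("ACDE", one_hot)       # 4 vectors of length 20
```

The benchmarking point made above then amounts to swapping `embedding` for a table of random vectors of the same dimension and checking whether downstream performance actually drops.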