Current Issue : October-December Volume : 2024 Issue Number : 4 Articles : 5 Articles
Background: Commonly used next generation sequencing machines typically produce large amounts of short reads of a few hundred base-pairs in length. However, many downstream applications would generally benefit from longer reads. Results: We present CAREx, an algorithm for the generation of pseudo-long reads from paired-end short-read Illumina data based on the concept of repeatedly computing multiple sequence alignments to extend a read until its partner is found. Our performance evaluation on both simulated data and real data shows that CAREx is able to connect significantly more read pairs (up to 99% for simulated data) and to produce more error-free pseudo-long reads than previous approaches. When used prior to assembly it can achieve superior de novo assembly results. Furthermore, the GPU-accelerated version of CAREx exhibits the fastest execution times among all tested tools. Conclusion: CAREx is a new MSA-based algorithm and software for producing pseudo-long reads from paired-end short-read data. It outperforms other state-of-the-art programs in terms of (i) percentage of connected read pairs, (ii) reduction of error rates of filled gaps, (iii) runtime, and (iv) downstream analysis using de novo assembly. CAREx is open-source software written in C++ (CPU version) and in CUDA/C++ (GPU version). It is licensed under GPLv3 and can be downloaded at https://github.com/fkallen/CAREx....
Surveillance for genetic variation of microbial pathogens, both within and among species, plays an important role in informing research, diagnostic, prevention, and treatment activities for disease control. However, large-scale systematic screening for novel genotypes remains challenging in part due to technological limitations. Towards addressing this challenge, we present an advancement in universal microbial high resolution melting (HRM) analysis that is capable of accomplishing both known genotype identification and novel genotype detection. Specifically, this novel surveillance functionality is achieved through time-series modeling of sequence-defined HRM curves, which is uniquely enabled by the large-scale melt curve datasets generated using our high-throughput digital HRM platform. Taking the detection of bacterial genotypes as a model application, we demonstrate that our algorithms accomplish an overall classification accuracy over 99.7% and perform novelty detection with a sensitivity of 0.96, specificity of 0.96 and Youden index of 0.92. Since HRM-based DNA profiling is an inexpensive and rapid technique, our results add support for the feasibility of its use in surveillance applications....
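The reported Youden index follows directly from the reported sensitivity and specificity. As a minimal sketch of that arithmetic (the function names and the confusion-matrix counts are illustrative assumptions, not taken from the article):

```python
def rates_from_counts(tp: int, fn: int, tn: int, fp: int) -> tuple[float, float]:
    """Return (sensitivity, specificity) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # fraction of novel genotypes flagged as novel
    specificity = tn / (tn + fp)  # fraction of known genotypes kept as known
    return sensitivity, specificity


def youden_index(sensitivity: float, specificity: float) -> float:
    """Youden's J statistic: J = sensitivity + specificity - 1."""
    return sensitivity + specificity - 1.0


# Values quoted in the abstract.
j_reported = youden_index(0.96, 0.96)
print(round(j_reported, 2))

# Hypothetical counts chosen only to reproduce the same rates.
sens, spec = rates_from_counts(tp=96, fn=4, tn=96, fp=4)
j_from_counts = youden_index(sens, spec)
```

A Youden index of 0 corresponds to a detector no better than chance and 1 to a perfect detector, so 0.92 summarizes the two rates in a single number.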
Background: Microbiome dysbiosis has recently been associated with different diseases and disorders. In this context, machine learning (ML) approaches can be useful either to identify new patterns or learn predictive models. However, data to be fed to ML methods can be subject to different sampling, sequencing and preprocessing techniques. Each different choice in the pipeline can lead to a different view (i.e., feature set) of the same individuals, that classical (single-view) ML approaches may fail to simultaneously consider. Moreover, some views may be incomplete, i.e., some individuals may be missing in some views, possibly due to the absence of some measurements or to the fact that some features are not available/applicable for all the individuals. Multi-view learning methods can represent a possible solution to consider multiple feature sets for the same individuals, but most existing multi-view learning methods are limited to binary classification tasks or cannot work with incomplete views. Results: We propose irBoost.SH, an extension of the multi-view boosting algorithm rBoost.SH, based on multi-armed bandits. irBoost.SH solves multi-class classification tasks and can analyze incomplete views. At each iteration, it identifies one winning view using adversarial multi-armed bandits and uses its predictions to update a shared instance weight distribution in a learning process based on boosting. In our experiments, performed on 5 multi-view microbiome datasets, the model learned by irBoost.SH always outperforms the best model learned from a single view, its closest competitor rBoost.SH, and the model learned by a multi-view approach based on feature concatenation, reaching an improvement of 11.8% of the F1-score in the prediction of the Autism Spectrum disorder and of 114% in the prediction of the Colorectal Cancer disease. Conclusions: The proposed method irBoost.SH exhibited outstanding performance in our experiments, also compared to competitor approaches. The obtained results confirm that irBoost.SH can fruitfully be adopted for the analysis of microbiome data, due to its capability to simultaneously exploit multiple feature sets obtained through different sequencing and preprocessing pipelines....
Background: The selection of primer pairs in sequencing-based research can greatly influence the results, highlighting the need for a tool capable of analysing their performance in silico prior to the sequencing process. We therefore propose PrimerEvalPy, a Python-based package designed to test the performance of any primer or primer pair against any sequencing database. The package calculates a coverage metric and returns the amplicon sequences found, along with information such as their average start and end positions. It also allows the analysis of coverage for different taxonomic levels. Results: As a case study, PrimerEvalPy was used to test the most commonly used primers in the literature against two oral 16S rRNA gene databases containing bacteria and archaea. The results showed that the most commonly used primer pairs in the oral cavity did not match those with the highest coverage. The best performing primer pairs were found for the detection of oral bacteria and archaea. Conclusions: This demonstrates the importance of a coverage analysis tool such as PrimerEvalPy to find the best primer pairs for specific niches. The software is available under the MIT licence at https://gitlab.citius.usc.es/lara.vazquez/PrimerEvalPy....
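To make the idea of a primer-pair coverage metric concrete, the following is a minimal sketch that does not use PrimerEvalPy or its API. It assumes coverage means the fraction of database sequences in which both primers match exactly (with IUPAC degenerate bases expanded, no mismatches allowed); the function names and toy sequences are hypothetical.

```python
import re

# IUPAC nucleotide codes expanded to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a (possibly degenerate) DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def primer_regex(primer: str) -> str:
    """Turn a degenerate primer into a regular expression."""
    return "".join(IUPAC[base] for base in primer.upper())


def find_amplicon(seq: str, fwd: str, rev: str):
    """Return the first fwd...rev amplicon (primers included) or None.

    The reverse primer anneals to the opposite strand, so its reverse
    complement is searched on the given strand.
    """
    pattern = re.compile(
        primer_regex(fwd) + "[ACGT]*?" + primer_regex(reverse_complement(rev))
    )
    match = pattern.search(seq.upper())
    return match.group(0) if match else None


def coverage(sequences: dict, fwd: str, rev: str):
    """Return (fraction of sequences amplified, {name: amplicon})."""
    if not sequences:
        raise ValueError("empty sequence database")
    amplicons = {}
    for name, seq in sequences.items():
        amplicon = find_amplicon(seq, fwd, rev)
        if amplicon is not None:
            amplicons[name] = amplicon
    return len(amplicons) / len(sequences), amplicons


# Hypothetical toy database; real inputs would be 16S rRNA gene sequences.
toy_db = {
    "seq1": "TTACGTGGGGGCCATTT",  # both primer sites present
    "seq2": "TTACGAGGGGG",        # forward site only
    "seq3": "GGGGCCCC",           # neither site
    "seq4": "ACGACCAT",           # sites adjacent, empty insert
}
cov, amps = coverage(toy_db, fwd="ACGW", rev="ATGG")
print(cov, amps)
```

Grouping the per-sequence results by taxonomic label before dividing would give coverage per taxonomic level; tolerating mismatches would require an approximate matcher rather than a regular expression.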
Principal component analysis (PCA) is an important and widely used unsupervised learning method that determines population structure based on genetic variation. Genome sequencing of thousands of individuals usually generates tens of millions of SNPs, making PCA and its interpretation challenging. Here we present VCF2PCACluster, a simple, fast and memory-efficient tool for kinship estimation, PCA and clustering analysis, and visualization based on VCF-formatted SNPs. We implemented five kinship estimation methods and three clustering methods for users to choose from. Moreover, unlike other PCA tools, VCF2PCACluster provides a clustering function based on the PCA result, which enables users to identify population structure automatically and clearly. We demonstrated the same accuracy but higher performance of this tool when performing PCA on tens of millions of SNPs compared with the popular PLINK2 software, especially in peak memory usage, which in VCF2PCACluster is independent of the number of SNPs....
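One plausible way for peak memory to stay independent of the SNP count, assumed here for illustration rather than taken from the tool's documentation, is to accumulate an individuals-by-individuals kinship matrix one SNP at a time and run PCA on that matrix. The sketch below uses a generic standardized-genotype kinship estimate (not necessarily one of the tool's five methods) and plain power iteration for the first principal component; all names and the toy genotypes are hypothetical.

```python
import math
import random


def streaming_kinship(snp_rows, n_individuals):
    """Accumulate an n x n kinship matrix from an iterable of SNP rows.

    Each row holds alternate-allele counts (0, 1 or 2) for one SNP across
    all individuals. Only the n x n matrix and one row are held in memory.
    """
    k = [[0.0] * n_individuals for _ in range(n_individuals)]
    used = 0
    for genos in snp_rows:
        p = sum(genos) / (2.0 * n_individuals)  # alternate-allele frequency
        if p <= 0.0 or p >= 1.0:
            continue  # monomorphic SNPs carry no structure information
        scale = math.sqrt(2.0 * p * (1.0 - p))
        z = [(g - 2.0 * p) / scale for g in genos]  # standardized genotypes
        for i in range(n_individuals):
            zi, row = z[i], k[i]
            for j in range(n_individuals):
                row[j] += zi * z[j]
        used += 1
    if used == 0:
        raise ValueError("no polymorphic SNPs")
    return [[value / used for value in row] for row in k]


def top_component(matrix, iterations=200, seed=0):
    """Leading eigenvector of a symmetric matrix by power iteration."""
    rng = random.Random(seed)
    # Random start: a constant vector lies in the null space of a kinship
    # matrix built from mean-centered genotypes.
    v = [rng.uniform(-1.0, 1.0) for _ in matrix]
    for _ in range(iterations):
        w = [sum(a * b for a, b in zip(row, v)) for row in matrix]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("start vector is orthogonal to the data")
        v = [x / norm for x in w]
    return v


# Hypothetical toy data: individuals 0-3 and 4-7 come from two populations.
toy_snps = [
    [0, 0, 1, 0, 2, 2, 1, 2],
    [0, 1, 0, 0, 2, 2, 2, 1],
    [1, 0, 0, 0, 1, 2, 2, 2],
    [0, 0, 0, 1, 2, 1, 2, 2],
]
kinship = streaming_kinship(iter(toy_snps), n_individuals=8)
pc1 = top_component(kinship)
print([round(x, 3) for x in pc1])
```

In this toy case the sign of the first component separates the two groups, which is the kind of structure a clustering step on PCA output could then label automatically.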