Tinted de Bruijn Graphs for efficient read extraction from sequencing datasets

Vandamme, Lea; Cazaux, Bastien; Limasset, Antoine

Document type:

Preprint or working paper

DOI:

10.1101/2024.02.15.580442

Permanent URL:

http://hdl.handle.net/20.500.12210/118896

Title:

Tinted de Bruijn Graphs for efficient read extraction from sequencing datasets

Author(s):

Vandamme, Lea [Author]

Centre de Recherche en Informatique, Signal et Automatique de Lille - UMR 9189 [CRIStAL]

Cazaux, Bastien [Author]

Centre de Recherche en Informatique, Signal et Automatique de Lille - UMR 9189 [CRIStAL]

Limasset, Antoine [Author]

Centre de Recherche en Informatique, Signal et Automatique de Lille - UMR 9189 [CRIStAL]

Publication date:

2024-02-18

Keyword(s) in English:

Indexing
Sequencing datasets
Compression

HAL discipline(s):

Computer Science [cs]

Abstract in English: [en]

The study of biological sequences often relies on reference genomes, yet achieving accurate assemblies remains challenging. Consequently, de novo analysis directly from raw reads, without pre-processing, is frequently more practical. We identify a need shared across many applications: identifying the reads that contain a specific k-mer in a dataset. This k-mer-to-reads association would be pivotal in multiple contexts, including genotyping, bacterial strain resolution, profiling, data compression, error correction, and assembly. While this challenge appears similar to the extensively researched colored de Bruijn graph problem, resolving it at the read level would be prohibitively resource-intensive in practical applications. In this work, we demonstrate that it can be resolved tractably by leveraging certain assumptions for sequencing dataset indexing. To tackle this challenge, we introduce the Tinted de Bruijn Graph concept, an altered version of the colored de Bruijn graph in which each read within a sequencing dataset represents a unique source. We developed K2R, a highly scalable index that implements such searches efficiently within this framework. K2R's performance, in terms of index size, memory footprint, throughput, and construction time, is benchmarked against leading methods, including hashing techniques (e.g., Short Read Connector) and full-text indexing (e.g., Spumoni and Movi), across various datasets. K2R consistently outperforms contemporary solutions in most metrics and is the only tool capable of scaling to larger datasets. To demonstrate K2R's scalability, we indexed two human datasets from the T2T consortium: the 126X coverage ONT dataset was indexed in 18 hours using 19 GB of RAM for a final index of 9.5 GB, and the 56X coverage HiFi dataset was indexed in 90 minutes using 5 GB of RAM for a final index of 207 MB. The K2R index, developed in C++, is open source and available on GitHub at github.com/LeaVandamme/K2R.
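To make the k-mer-to-reads query described in the abstract concrete, here is a minimal Python sketch of a naive version of the association: a hash map from each k-mer to the set of read identifiers that contain it. This is not K2R's data structure or API. The function names (`canonical`, `build_kmer_to_reads`, `query`), the use of canonical k-mers, and the toy reads are all assumptions made for illustration, and a structure this simple is exactly what becomes too costly at the read level on real datasets.

```python
from collections import defaultdict

# Translation table for complementing DNA bases (illustrative helper).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the smaller of a k-mer and its reverse complement.

    Assumption: strands are not distinguished, so both orientations
    of a k-mer map to the same key.
    """
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def build_kmer_to_reads(reads, k):
    """Map every canonical k-mer to the set of read ids containing it.

    Each read acts as its own source ("tint"); read ids are simply
    the positions of the reads in the input list.
    """
    index = defaultdict(set)
    for read_id, sequence in enumerate(reads):
        for start in range(len(sequence) - k + 1):
            index[canonical(sequence[start:start + k])].add(read_id)
    return index


def query(index, kmer):
    """Return the sorted ids of reads containing the k-mer (either strand)."""
    return sorted(index.get(canonical(kmer), ()))


if __name__ == "__main__":
    toy_reads = ["ACGTAC", "TTACGT", "GGGGGG"]  # hypothetical data
    toy_index = build_kmer_to_reads(toy_reads, k=4)
    print(query(toy_index, "ACGT"))
    print(query(toy_index, "CCCC"))
```

Memory in this sketch grows with the number of distinct k-mers times the number of reads sharing them, which is the cost the abstract says must be avoided at scale.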

Language:

English

ANR project:

Structures de graphe adaptées pour l'exploration de données de séquençage de troisième génération (Graph structures adapted for the exploration of third-generation sequencing data)

Collections: