Document type:
Thesis
Title:
Algorithmes bio-informatiques pour l'analyse de données de séquençage à haut débit
English title:
New algorithmic and bioinformatic approaches for the analysis of data from high throughput sequencing
Author(s):
Kopylova, Evguenia [Author]
Laboratoire d'Informatique Fondamentale de Lille [LIFL]
Bioinformatics and Sequence Analysis [BONSAI]
Thesis advisor(s):
Hélène Touzet
Defense date:
2013-12-11
Degree-granting institution:
Université des Sciences et Technologie de Lille - Lille I
Keyword(s):
Metagenomics
HAL discipline(s):
Computer Science [cs]/Bioinformatics [q-bio.QM]
Life Sciences [q-bio]/Bioinformatics, Systems Biology [q-bio.QM]
Computer Science [cs]/Document and Text Processing
Computer Science [cs]/Data Structures and Algorithms [cs.DS]
English abstract: [en]
Nucleotide sequence alignment is a method used to identify regions of similarity between organisms at the genomic level. In this thesis we focus on the alignment of millions of short sequences produced by Next-Generation Sequencing (NGS) technologies against a reference database. In particular, we direct our attention toward the analysis of metagenomic and metatranscriptomic data, that is, the DNA and RNA extracted directly from an environment. Our algorithms address two major challenges. First, all NGS technologies today are susceptible to sequencing errors in the form of nucleotide substitutions, insertions and deletions, and error rates vary between 1% and 15%. Second, metagenomic samples can contain thousands of unknown organisms, and the only means of identifying them is to align against known, closely related species. To overcome these challenges we designed a new approximate matching technique based on the universal Levenshtein automaton, which quickly locates short regions of similarity (seeds) between two sequences while allowing 1 error of any type. Using seeds to detect possible high-scoring alignments is a widely used heuristic for rapid sequence alignment, although most existing software is optimized for high-similarity searches and applies exact seeds. Furthermore, we describe a new indexing data structure based on the Burst trie, which optimizes the search for approximate seeds. We demonstrate the efficacy of our method in two software tools, SortMeRNA and SortMeDNA. The former can quickly filter ribosomal RNA fragments from metatranscriptomic data, and the latter performs full alignment for genomic and metagenomic data.
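To make the idea of a seed "allowing 1 error of any type" concrete, the following is a minimal Python sketch. It is not the universal Levenshtein automaton or the Burst trie index described in the thesis: it is a brute-force check that two strings are within one substitution, insertion, or deletion of each other, applied to every window of a reference. The function names `within_one_edit` and `approximate_seed_hits` are hypothetical and chosen only for this illustration.

```python
def within_one_edit(a: str, b: str) -> bool:
    """Return True if a and b differ by at most one substitution,
    insertion, or deletion (Levenshtein distance <= 1)."""
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # ensure a is the shorter (or equal-length) string
    # Skip the common prefix.
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if i == len(a):
        # a is a prefix of b; b has at most one extra trailing character.
        return True
    if len(a) == len(b):
        # Same length: the single mismatch must be a substitution.
        return a[i + 1:] == b[i + 1:]
    # b is one longer: the mismatch must be one extra character in b.
    return a[i:] == b[i + 1:]


def approximate_seed_hits(seed: str, reference: str) -> list[tuple[int, int]]:
    """Return (start, window_length) pairs where a reference window
    matches the seed with at most one error. Brute force, for
    illustration only; an index would avoid scanning every window."""
    k = len(seed)
    hits = []
    for start in range(len(reference)):
        # A single insertion or deletion changes the window length by one.
        for length in (k - 1, k, k + 1):
            end = start + length
            if length > 0 and end <= len(reference):
                if within_one_edit(seed, reference[start:end]):
                    hits.append((start, length))
    return hits


if __name__ == "__main__":
    # The window "ACCT" at position 2 differs from the seed by one substitution.
    print(approximate_seed_hits("ACGT", "TTACCTTT"))
```

This scan costs time proportional to the reference length times the seed length for every seed, which is why a real aligner would rely on an index and an automaton-style matcher instead of testing each window directly.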
Language:
English
Files
- https://tel.archives-ouvertes.fr/tel-00919185v2/document
- Open access
- these.pdf
- Open access