Document type:
Other scientific communication (conference without proceedings - poster - seminar...): Conference paper with proceedings
Title:
DiNAMO: Exact method for degenerate IUPAC motifs discovery, characterization of sequence-specific errors
Author(s):
Saad, Chadi [Author]
Bioinformatics and Sequence Analysis [BONSAI]
Centre de Recherche en Informatique, Signal et Automatique de Lille - UMR 9189 [CRIStAL]
Noé, Laurent [Author]
Bioinformatics and Sequence Analysis [BONSAI]
Centre de Recherche en Informatique, Signal et Automatique de Lille - UMR 9189 [CRIStAL]
Richard, Hugues [Author]
Biologie Computationnelle et Quantitative = Laboratory of Computational and Quantitative Biology [LCQB]
Leclerc, Julie [Author]
Centre de Recherche Jean-Pierre AUBERT Neurosciences et Cancer - U837 [JPArc]
Pôle de Biologie Pathologie Génétique [CHU Lille]
Buisine, Marie-Pierre [Author]
Pôle de Biologie Pathologie Génétique [CHU Lille]
Touzet, Helene [Author]
Bioinformatics and Sequence Analysis [BONSAI]
Centre de Recherche en Informatique, Signal et Automatique de Lille - UMR 9189 [CRIStAL]
Figeac, Martin [Author]
Plateforme de génomique fonctionnelle et structurelle [Lille]
Conference title:
JOBIM 2017 - Journées Ouvertes en Biologie, Informatique et Mathématiques
City:
Lille
Country:
France
Conference start date:
2017-07-03
Publication date:
2017
HAL discipline(s):
Computer Science [cs]/Bioinformatics [q-bio.QM]
Abstract in English: [en]
Next-generation sequencing technologies are still associated with relatively high error rates, about 1%, which corresponds to thousands of errors at the scale of a complete genome. Each region therefore needs to be sequenced several times, and variants are usually filtered based on depth criteria. The significant number of artifacts that remain despite those filters shows the limits of conventional approaches and indicates that some sequencing artifacts are recurrent. This recurrence highlights that sequencing errors can depend on the upstream nucleotide sequence context. Our goal is to search for overrepresented motifs that tend to induce sequencing errors. Previous studies showed that some motifs, such as GGT [1,2], induce sequencing errors in Illumina technologies. However, these studies were restricted to exact motifs and did not take approximate motifs into account, which limits their statistical power. On the other hand, tools such as FIRE [3], DREME [4] and Discrover [5] were developed to search for degenerate motifs over the 15-letter IUPAC alphabet in the context of ChIP-seq studies. However, these tools use greedy algorithms, which implies a lack of sensitivity. We therefore developed an exact algorithm that searches for degenerate motifs by enumerating all possible IUPAC motifs. The algorithm is based on mutual information and stores the motifs in hash tables combined with a graph data structure. It is independent of the sequencing technology. Experimental results on real data show that many motifs are overrepresented upstream of sequencing artifacts. The artifacts are identified through the strand bias between forward and reverse reads. The homopolymer CCC, of length 3, seems to be sufficient to induce errors on IonTorrent. On Illumina, motifs are mainly composed of GGC followed by GGT (for example TGGCNGGT) or of homopolymers. We also observed a drop in base quality after the detected motifs.
Our exact algorithm requires less than one minute (Intel Core i5-4570 CPU, 3.20 GHz) and less than 2 GB of RAM to search for fully degenerate motifs of length 6 on a dataset of approximately 24,000 sequences extracted from 11 exomes sequenced on IonTorrent Proton.
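To make the idea of exhaustive IUPAC enumeration scored by mutual information concrete, here is a minimal Python sketch. It is not the DiNAMO implementation: it uses plain brute force instead of the hash tables and graph structure mentioned in the abstract, and it assumes the input is split into "positive" sequences (upstream of artifacts) and "negative" control sequences. The function names (`contains`, `mutual_information`, `enumerate_motifs`, `rank_motifs`) are hypothetical and chosen for illustration only.

```python
import math
from itertools import product

# The 15 IUPAC nucleotide codes and the bases each one covers.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def contains(motif, seq):
    """True if the degenerate motif matches seq at any position."""
    k = len(motif)
    return any(
        all(seq[p + i] in IUPAC[c] for i, c in enumerate(motif))
        for p in range(len(seq) - k + 1)
    )

def mutual_information(motif, positives, negatives):
    """Mutual information (bits) between motif presence and class label."""
    n_pos, n_neg = len(positives), len(negatives)
    total = n_pos + n_neg
    hit_pos = sum(contains(motif, s) for s in positives)
    hit_neg = sum(contains(motif, s) for s in negatives)
    # 2x2 contingency table: (motif present?, positive class?) -> count
    cells = {
        (True, True): hit_pos,
        (True, False): hit_neg,
        (False, True): n_pos - hit_pos,
        (False, False): n_neg - hit_neg,
    }
    p_present = (hit_pos + hit_neg) / total
    p_label = {True: n_pos / total, False: n_neg / total}
    mi = 0.0
    for (present, label), count in cells.items():
        if count == 0:
            continue  # empty cells contribute nothing
        p_joint = count / total
        p_motif = p_present if present else 1.0 - p_present
        mi += p_joint * math.log2(p_joint / (p_motif * p_label[label]))
    return mi

def enumerate_motifs(k):
    """Yield all 15**k IUPAC motifs of length k."""
    for letters in product(IUPAC, repeat=k):
        yield "".join(letters)

def rank_motifs(k, positives, negatives, top=5):
    """Score every motif of length k and return the best ones."""
    scored = ((mutual_information(m, positives, negatives), m)
              for m in enumerate_motifs(k))
    return sorted(scored, reverse=True)[:top]

# Toy data, invented for illustration only.
positives = ["ACCCGT", "TTCCCA"]
negatives = ["ACGTAC", "TTGACA"]
best = rank_motifs(3, positives, negatives)
```

Because every motif is scored, no candidate is discarded early, which is the contrast with greedy search that the abstract draws. The cost grows as 15^k, so this naive version is only practical for short motifs.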
Language:
English
Peer reviewed:
Yes
Audience:
National
Popular science:
No
Files
- https://hal.inria.fr/hal-01574630/document
- Open access
- Access the document
- JOBIM2017_paper_80.pdf
- Open access
- Access the document