Document type:
Article in a scientific journal: Original article
Title:
Minimally-overlapping words for sequence similarity search
Author(s):
Frith, Martin [Author]
Artificial Intelligence Research Center [Tokyo] [AIST]
Noé, Laurent [Author]
Centre de Recherche en Informatique, Signal et Automatique de Lille - UMR 9189 [CRIStAL]
Kucherov, Gregory [Author]
Laboratoire d'Informatique Gaspard-Monge [LIGM]
Journal title:
Bioinformatics
Publisher:
Oxford University Press (OUP)
Publication date:
2020-12-21
ISSN:
1367-4803
HAL discipline(s):
Computer Science [cs]/Data Structures and Algorithms [cs.DS]
Computer Science [cs]/Bioinformatics [q-bio.QM]
Abstract in English:
Motivation: Analysis of genetic sequences is usually based on finding similar parts of sequences, e.g. DNA reads and/or genomes. For big data, this is typically done via "seeds": simple similarities (e.g. exact matches) that can be found quickly. For huge data, sparse seeding is useful, where we only consider seeds at a subset of positions in a sequence.
Results: Here we study a simple sparse-seeding method: using seeds at positions of certain "words" (e.g. ac, at, gc, or gt). Sensitivity is maximized by using words with minimal overlaps. That is because, in a random sequence, minimally-overlapping words are anti-clumped. We provide evidence that this is often superior to acclaimed "minimizer" sparse-seeding methods. Our approach can be unified with design of inexact (spaced and subset) seeds, further boosting sensitivity. Thus, we present a promising approach to sequence similarity search, with open questions on how to optimize it.
Supplementary information: Supplementary data are available at Bioinformatics online.
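The word-based sparse seeding described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `word_positions` and `words_can_overlap` and the example sequence are hypothetical, all words are assumed to have the same length, and the overlap check covers only the extreme case of words that cannot overlap at all.

```python
def word_positions(seq, words):
    """Return the start positions in seq where any word in `words` begins.

    Assumption for this sketch: all words have the same length k.
    """
    k = len(next(iter(words)))
    return [i for i in range(len(seq) - k + 1) if seq[i:i + k] in words]


def words_can_overlap(words):
    """Return True if a proper suffix of one word equals a prefix of another
    (or of itself), i.e. two occurrences could overlap in a sequence."""
    for u in words:
        for v in words:
            for n in range(1, min(len(u), len(v))):
                if u[-n:] == v[:n]:
                    return True
    return False


# Example word set from the abstract: first letter a or g, second letter c or t.
sparse_words = {"ac", "at", "gc", "gt"}

# Hypothetical toy sequence, used only for illustration.
seq = "gattacagtcacgt"
positions = word_positions(seq, sparse_words)
print(positions)                        # [1, 4, 7, 10, 12]
print(words_can_overlap(sparse_words))  # False
print(words_can_overlap({"aa"}))        # True
```

Because no word in the example set ends with a letter that another word starts with, two selected positions are never adjacent, which is one simple way to see the anti-clumping mentioned in the abstract. A word such as aa, by contrast, can occur at consecutive positions in a run like aaaa.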
Language:
English
Peer reviewed:
Yes
Audience:
International
Popular science:
No
Collections:
Source:
Files
- https://hal-upec-upem.archives-ouvertes.fr/hal-03087470/document
- Open access
- Access the document
- https://hal-upec-upem.archives-ouvertes.fr/hal-03087470/file/resubmission-bioinfo.pdf
- Open access
- Access the document
- https://doi.org/10.1101/2020.07.24.220616
- Open access
- Access the document
- https://hal.archives-ouvertes.fr/hal-03087470/document
- Open access
- Access the document
- https://academic.oup.com/bioinformatics/article-pdf/36/22-23/5344/36855836/btaa1054.pdf
- Open access
- Access the document