Document type :
Journal article: Original article
Title :
Minimally-overlapping words for sequence similarity search
Author(s) :
Frith, Martin [Author]
Artificial Intelligence Research Center [Tokyo] [AIST]
Noé, Laurent [Author]
Centre de Recherche en Informatique, Signal et Automatique de Lille - UMR 9189 [CRIStAL]
Kucherov, Gregory [Author]
Laboratoire d'Informatique Gaspard-Monge [LIGM]
Journal title :
Bioinformatics
Publisher :
Oxford University Press (OUP)
Publication date :
2020-12-21
ISSN :
1367-4803
HAL domain(s) :
Computer Science [cs]/Data Structures and Algorithms [cs.DS]
Computer Science [cs]/Bioinformatics [q-bio.QM]
English abstract : [en]
Motivation: Analysis of genetic sequences is usually based on finding similar parts of sequences, e.g. DNA reads and/or genomes. For big data, this is typically done via "seeds": simple similarities (e.g. exact matches) that can be found quickly. For huge data, sparse seeding is useful, where we only consider seeds at a subset of positions in a sequence.
Results: Here we study a simple sparse-seeding method: using seeds at positions of certain "words" (e.g. ac, at, gc, or gt). Sensitivity is maximized by using words with minimal overlaps. That is because, in a random sequence, minimally-overlapping words are anti-clumped. We provide evidence that this is often superior to acclaimed "minimizer" sparse-seeding methods. Our approach can be unified with design of inexact (spaced and subset) seeds, further boosting sensitivity. Thus, we present a promising approach to sequence similarity search, with open questions on how to optimize it.
Supplementary information: Supplementary data are available at Bioinformatics online.
Language :
English
Peer reviewed article :
Yes
Audience :
International
Popular science :
No
Files
- https://hal-upec-upem.archives-ouvertes.fr/hal-03087470/document (open access)
- https://hal-upec-upem.archives-ouvertes.fr/hal-03087470/file/resubmission-bioinfo.pdf (open access)
- https://doi.org/10.1101/2020.07.24.220616 (open access)
- https://hal.archives-ouvertes.fr/hal-03087470/document (open access)
- https://academic.oup.com/bioinformatics/article-pdf/36/22-23/5344/36855836/btaa1054.pdf (open access)