Minimally-overlapping words for sequence similarity search

Frith, Martin; Noé, Laurent; Kucherov, Gregory

Document type :

Article in a scientific journal: Original article

DOI :

10.1093/bioinformatics/btaa1054

Title :

Minimally-overlapping words for sequence similarity search

Author(s) :

Frith, Martin [Author]
Artificial Intelligence Research Center [Tokyo] [AIST]
Noé, Laurent [Author]
Centre de Recherche en Informatique, Signal et Automatique de Lille - UMR 9189 [CRIStAL]
Kucherov, Gregory [Author]
Laboratoire d'Informatique Gaspard-Monge [LIGM]

Journal title :

Bioinformatics

Publisher :

Oxford University Press (OUP)

Publication date :

2020-12-21

ISSN :

1367-4803

HAL domain(s) :

Computer Science [cs]/Data Structures and Algorithms [cs.DS]
Computer Science [cs]/Bioinformatics [q-bio.QM]

English abstract : [en]

Motivation: Analysis of genetic sequences is usually based on finding similar parts of sequences, e.g. DNA reads and/or genomes. For big data, this is typically done via "seeds": simple similarities (e.g. exact matches) that can be found quickly. For huge data, sparse seeding is useful, where we only consider seeds at a subset of positions in a sequence.

Results: Here we study a simple sparse-seeding method: using seeds at positions of certain "words" (e.g. ac, at, gc, or gt). Sensitivity is maximized by using words with minimal overlaps. That is because, in a random sequence, minimally-overlapping words are anti-clumped. We provide evidence that this is often superior to acclaimed "minimizer" sparse-seeding methods. Our approach can be unified with design of inexact (spaced and subset) seeds, further boosting sensitivity. Thus, we present a promising approach to sequence similarity search, with open questions on how to optimize it.

Supplementary information: Supplementary data are available at Bioinformatics online.
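
The word-based sparse seeding described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name word_positions and the example sequences are hypothetical, chosen only for illustration. The sketch collects the positions where one of the example words (ac, at, gc, gt) starts. No word in this set ends with a letter that another word starts with, so two selected positions can never be adjacent, which is one simple way to see the "anti-clumped" behaviour.

```python
def word_positions(seq, words):
    """Return the start positions in seq where any word in `words` occurs.

    Hypothetical helper for illustration; not taken from the paper.
    """
    seq = seq.lower()
    hits = []
    for i in range(len(seq)):
        # str.startswith accepts a start offset, so no slicing is needed.
        if any(seq.startswith(w, i) for w in words):
            hits.append(i)
    return hits


# Example word set from the abstract: first letters {a, g}, second letters {c, t}.
minimal_overlap_words = {"ac", "at", "gc", "gt"}
positions = word_positions("ttacgtgcatta", minimal_overlap_words)
print(positions)  # [2, 4, 6, 8]

# A contrasting set of four words that can overlap themselves ("aa" with "aa"),
# so selected positions may clump together.
overlapping_words = {"aa", "ac", "ag", "at"}
print(word_positions("aaaa", overlapping_words))  # [0, 1, 2]
```

Both word sets contain 4 of the 16 possible two-letter DNA words, so in a uniformly random sequence they would select positions at the same average density. The difference lies only in how evenly those positions are spread.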

Language :

English

Peer reviewed article :

Yes

Audience :

International

Popular science :

No

ANR Project :

Algorithms and software tools for third-generation RNA sequencing

Collections :

Centre de Recherche en Informatique, Signal et Automatique de Lille (CRIStAL) - UMR 9189

Source :

Harvested from HAL

Files

https://hal-upec-upem.archives-ouvertes.fr/hal-03087470/document
Open access

https://hal-upec-upem.archives-ouvertes.fr/hal-03087470/file/resubmission-bioinfo.pdf
Open access

https://doi.org/10.1101/2020.07.24.220616
Open access

https://hal.archives-ouvertes.fr/hal-03087470/document
Open access

https://academic.oup.com/bioinformatics/article-pdf/36/22-23/5344/36855836/btaa1054.pdf
Open access

