Minimally-overlapping words for sequence similarity search

Frith, Martin; Noé, Laurent; Kucherov, Gregory

Document type:

Journal article: Original article

DOI:

10.1093/bioinformatics/btaa1054

Title:

Minimally-overlapping words for sequence similarity search

Author(s):

Frith, Martin [Author]
Artificial Intelligence Research Center [Tokyo] [AIST]
Noé, Laurent [Author]
Centre de Recherche en Informatique, Signal et Automatique de Lille - UMR 9189 [CRIStAL]
Kucherov, Gregory [Author]
Laboratoire d'Informatique Gaspard-Monge [LIGM]

Journal title:

Bioinformatics

Publisher:

Oxford University Press (OUP)

Publication date:

2020-12-21

ISSN:

1367-4803

HAL discipline(s):

Computer Science [cs]/Data Structures and Algorithms [cs.DS]
Computer Science [cs]/Bioinformatics [q-bio.QM]

Abstract in English: [en]

Motivation: Analysis of genetic sequences is usually based on finding similar parts of sequences, e.g. DNA reads and/or genomes. For big data, this is typically done via "seeds": simple similarities (e.g. exact matches) that can be found quickly. For huge data, sparse seeding is useful, where we only consider seeds at a subset of positions in a sequence.

Results: Here we study a simple sparse-seeding method: using seeds at positions of certain "words" (e.g. ac, at, gc, or gt). Sensitivity is maximized by using words with minimal overlaps. That is because, in a random sequence, minimally-overlapping words are anti-clumped. We provide evidence that this is often superior to acclaimed "minimizer" sparse-seeding methods. Our approach can be unified with design of inexact (spaced and subset) seeds, further boosting sensitivity. Thus, we present a promising approach to sequence similarity search, with open questions on how to optimize it.

Supplementary information: Supplementary data are available at Bioinformatics online.
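
As an illustration of the sparse-seeding idea in the abstract, the following minimal Python sketch selects the positions in a sequence where one of the example words (ac, at, gc, gt) begins, and checks whether a word set can overlap itself. It is not the authors' implementation; the function names `word_positions` and `words_can_overlap` are hypothetical, and the overlap test (a proper suffix of one word equals a prefix of another) is an assumed simplification of "minimal overlaps".

```python
def word_positions(seq, words):
    """Return start positions in seq where any of the given words occurs.

    Hypothetical helper for illustration: seeds would only be
    considered at these positions (sparse seeding).
    """
    seq = seq.lower()
    lengths = {len(w) for w in words}
    return [i for i in range(len(seq))
            if any(seq[i:i + k] in words for k in lengths)]


def words_can_overlap(words):
    """Return True if a proper suffix of one word equals a prefix of
    another (or the same) word, so that two occurrences could overlap."""
    for u in words:
        for v in words:
            for k in range(1, min(len(u), len(v))):
                if u[-k:] == v[:k]:
                    return True
    return False


# Example words taken from the abstract.
WORDS = {"ac", "at", "gc", "gt"}

positions = word_positions("gattacagtc", WORDS)
print(positions)
print(words_can_overlap(WORDS))   # words start with a/g and end with c/t
print(words_can_overlap({"aa"}))  # "aa" can overlap itself, so it can clump
```

Because none of the four example words can overlap another, two selected positions are never adjacent, which is one way to picture the "anti-clumped" behaviour mentioned in the abstract.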

Language:

English

Peer reviewed:

Yes

Audience:

International

Popular science:

No

ANR project:

Algorithms and software tools for third-generation RNA sequencing

Collections:

Centre de Recherche en Informatique, Signal et Automatique de Lille (CRIStAL) - UMR 9189

Source:

Harvested from HAL

Files

https://hal-upec-upem.archives-ouvertes.fr/hal-03087470/document
Open access

https://hal-upec-upem.archives-ouvertes.fr/hal-03087470/file/resubmission-bioinfo.pdf
Open access

https://doi.org/10.1101/2020.07.24.220616
Open access

https://hal.archives-ouvertes.fr/hal-03087470/document
Open access

https://academic.oup.com/bioinformatics/article-pdf/36/22-23/5344/36855836/btaa1054.pdf
Open access
