Constant-Delay Enumeration for Nondeterministic Document Spanners

Amarilli, Antoine; Bourhis, Pierre; Mengel, Stefan; Niewerth, Matthias

Document type :

Communication dans un congrès avec actes

DOI :

10.4230/LIPIcs.ICDT.2019.22

Title :

Constant-Delay Enumeration for Nondeterministic Document Spanners

Author(s) :

Amarilli, Antoine [Auteur]
Département Informatique et Réseaux [INFRES]
Data, Intelligence and Graphs [DIG]
Bourhis, Pierre [Auteur]

Self-adaptation for distributed services and large software systems [SPIRALS]
Centre de Recherche en Informatique, Signal et Automatique de Lille - UMR 9189 [CRIStAL]
Mengel, Stefan [Auteur]
Centre de Recherche en Informatique de Lens [CRIL]
Niewerth, Matthias [Auteur]
Universität Bayreuth [Deutschland] = University of Bayreuth [Germany] = Université de Bayreuth [Allemagne]

Conference title :

ICDT

City :

Lisbon

Country :

Portugal

Start date of the conference :

2019-03-26

Journal title :

22nd International Conference on Database Theory (ICDT 2019)

HAL domain(s) :

Informatique [cs]/Base de données [cs.DB]
Informatique [cs]/Logique en informatique [cs.LO]
Informatique [cs]/Théorie et langage formel [cs.FL]
Informatique [cs]/Algorithme et structure de données [cs.DS]

English abstract : [en]

We consider the information extraction framework known as document spanners, and study the problem of efficiently computing the results of the extraction from an input document, where the extraction task is described as a ...
Show more >We consider the information extraction framework known as document spanners, and study the problem of efficiently computing the results of the extraction from an input document, where the extraction task is described as a sequential variable-set automaton (VA). We pose this problem in the setting of enumeration algorithms, where we can first run a preprocessing phase and must then produce the results with a small delay between any two consecutive results. Our goal is to have an algorithm which is tractable in combined complexity, i.e., in the sizes of the input document and the VA; while ensuring the best possible data complexity bounds in the input document size, i.e., constant delay in the document size. Several recent works at PODS'18 proposed such algorithms but with linear delay in the document size or with an exponential dependency in size of the (generally nondeterministic) input VA. In particular, Florenzano et al. suggest that our desired runtime guarantees cannot be met for general sequential VAs. We refute this and show that, given a nondeterministic sequential VA and an input document, we can enumerate the mappings of the VA on the document with the following bounds: the preprocessing is linear in the document size and polynomial in the size of the VA, and the delay is independent of the document and polynomial in the size of the VA. The resulting algorithm thus achieves tractability in combined complexity and the best possible data complexity bounds. Moreover, it is rather easy to describe, in particular for the restricted case of so-called extended VAs.Show less >

Language :

Anglais

Peer reviewed article :

Oui

Audience :

Internationale

Popular science :

Non

Comment :

25 pages including 17 pages of main material. Integrates all reviewer feedback. Outside of possible minor formatting differences, this paper is exactly the same as the ICDT'19 paper except that it contains 6 pages of technical appendix

Collections :

Centre de Recherche en Informatique, Signal et Automatique de Lille (CRIStAL) - UMR 9189

Source :

Harvested from HAL

Files

http://arxiv.org/pdf/1807.09320
Open access
Access the document

1807.09320
Open access
Access the document

Constant-Delay Enumeration for Nondeterministic ... BibTeX CSV Excel RIS

Files

Constant-Delay Enumeration for Nondeterministic ...

BibTeX

CSV

Excel

RIS