Unwritten Languages Demand Attention Too! Word Discovery with Encoder-Decoder Models

Zanon Boito, Marcely; Bérard, Alexandre; Villavicencio, Aline; Besacier, Laurent

Document type:

Conference paper with proceedings

Title:

Unwritten Languages Demand Attention Too! Word Discovery with Encoder-Decoder Models

Author(s):

Zanon Boito, Marcely [Author]
Laboratoire d'Informatique de Grenoble [LIG]
Groupe d’Étude en Traduction Automatique/Traitement Automatisé des Langues et de la Parole [GETALP]
Instituto de Informática [Porto Alegre]
Bérard, Alexandre [Author]
Laboratoire d'Informatique de Grenoble [LIG]
Groupe d’Étude en Traduction Automatique/Traitement Automatisé des Langues et de la Parole [GETALP]
Centre de Recherche en Informatique, Signal et Automatique de Lille - UMR 9189 [CRIStAL]
Sequential Learning [SEQUEL]
Villavicencio, Aline [Author]
Instituto de Informática [Porto Alegre]
Besacier, Laurent [Author]
Laboratoire d'Informatique de Grenoble [LIG]
Groupe d’Étude en Traduction Automatique/Traitement Automatisé des Langues et de la Parole [GETALP]
Institut universitaire de France [IUF]
Université Grenoble Alpes [2016-2019] [UGA [2016-2019]]

Conference title:

IEEE Automatic Speech Recognition and Understanding (ASRU)

City:

Okinawa

Country:

Japan

Conference start date:

2017-12-16

English keyword(s):

Word Discovery
Computational Language Documentation
Neural Machine Translation
Attention models

HAL discipline(s):

Computer Science [cs]/Computation and Language [cs.CL]

English abstract: [en]

Word discovery is the task of extracting words from un-segmented text. In this paper we examine to what extent neural networks can be applied to this task in a realistic unwritten language scenario, where only small corpora and limited annotations are available. We investigate two scenarios: one with no supervision and another with limited supervision with access to the most frequent words. Obtained results show that it is possible to retrieve at least 27% of the gold standard vocabulary by training an encoder-decoder neural machine translation system with only 5,157 sentences. This result is close to those obtained with a task-specific Bayesian nonparametric model. Moreover, our approach has the advantage of generating translation alignments, which could be used to create a bilingual lexicon. As a future perspective, this approach is also well suited to work directly from speech.

Language:

English

Peer reviewed:

Yes

Audience:

International

Popular science:

No

ANR project:

Breaking the Unwritten Language Barrier
Systemes et Algorithmes Pervasifs au confluent des mondes physique et numérique

Collections: