Find-2-Find: Multitask Learning for Anaphora Resolution and Object Localization
Document type:
Conference paper with published proceedings
Title:
Find-2-Find: Multitask Learning for Anaphora Resolution and Object Localization
Author(s):
Oguz, Cennet [Author]
Deutsches Forschungszentrum für Künstliche Intelligenz GmbH = German Research Center for Artificial Intelligence [DFKI]
Denis, Pascal [Author]
Machine Learning in Information Networks [MAGNET]
Vincent, Emmanuel [Author]
Speech Modeling for Facilitating Oral-Based Communication [MULTISPEECH]
Ostermann, Simon [Author]
Deutsches Forschungszentrum für Künstliche Intelligenz GmbH = German Research Center for Artificial Intelligence [DFKI]
van Genabith, Josef [Author]
Deutsches Forschungszentrum für Künstliche Intelligenz GmbH = German Research Center for Artificial Intelligence [DFKI]
Conference title:
2023 Conference on Empirical Methods in Natural Language Processing
City:
Singapore
Country:
Singapore
Conference start date:
2023-12-06
Publication date:
2023
HAL discipline(s):
Computer science [cs]
English abstract: [en]
In multimodal understanding tasks, visual and linguistic ambiguities can arise. Visual ambiguity can occur when visual objects require a model to ground a referring expression in a video without strong supervision, while linguistic ambiguity can occur from changes in entities in action flows. As an example from the cooking domain, "oil" mixed with "salt" and "pepper" could later be referred to as a "mixture". Without a clear visual-linguistic alignment, we cannot know which among several objects shown is referred to by the language expression "mixture", and without resolved antecedents, we cannot pinpoint what the mixture is. We define this chicken-and-egg problem as visual-linguistic ambiguity. In this paper, we present Find2Find, a joint anaphora resolution and object localization dataset targeting the problem of visual-linguistic ambiguity, consisting of 500 anaphora-annotated recipes with corresponding videos. We present experimental results of a novel end-to-end joint multitask learning framework for Find2Find that fuses visual and textual information and shows improvements both for anaphora resolution and object localization as compared to a strong single-task baseline.
Language:
English
Peer-reviewed:
Yes
Audience:
International
Popular science:
No
Collections:
Source:
Files
- document
- Open access
- Access the document
- emnlp23impress.pdf
- Open access
- Access the document