Dissertations as Data
Document type:
Conference paper with proceedings
Title:
Dissertations as Data
Author(s):
Schopfel, Joachim [Author]
Groupe d'Études et de Recherche Interdisciplinaire en Information et COmmunication - ULR 4073 [GERIICO]
Kergosien, Eric [Author]
Groupe d'Études et de Recherche Interdisciplinaire en Information et COmmunication - ULR 4073 [GERIICO]
Chaudiron, Stéphane [Author]
Groupe d'Études et de Recherche Interdisciplinaire en Information et COmmunication - ULR 4073 [GERIICO]
Jacquemin, Bernard [Author]
Groupe d'Études et de Recherche Interdisciplinaire en Information et COmmunication - ULR 4073 [GERIICO]
Title of the scientific event:
19th International Symposium on Electronic Theses and Dissertations (ETD 2016): "Data and Dissertations"
Organizer(s) of the scientific event:
Université de Lille Sciences humaines et sociales
City:
Villeneuve d'Ascq
Country:
France
Start date of the scientific event:
2016-07-11
Keyword(s) in English:
text and data mining
content mining
electronic theses and dissertations
retro-digitisation
research data
HAL discipline(s):
Humanities and Social Sciences/Information and Communication Sciences
Abstract in English: [en]
Problem/goal: The paper provides an overview of, and empirical evidence on, the usability of electronic theses and dissertations (ETDs) and related research data for text and data mining (TDM) techniques.

Research method/procedure: The first part of the paper is a review of recent publications and projects on the potential and usefulness of ETDs for TDM, followed by a description of our own research projects in the field.

Anticipated results: Research on dissertations and data usually addresses the handling and potential exploitation of dissertations either as a “data vehicle”, where data are published together with the dissertation (e.g. as a kind of data appendix), or as a “gateway to data”, where the data are not published with the text but are available on a remote server. Yet the data are often not available; or data, methodology, tools and primary sources are mingled, not indexed, poorly described, unrelated to the text and unconnected with other files.

Our paper will describe a different approach that may help to cope with this problem, in particular (but not only) when it is impossible to distinguish between data and dissertation and thus to process the data appropriately (e.g. in a data repository). Our approach is to consider the dissertation as a whole (text, metadata, data, numbers, facts, figures etc.) as “material” potentially exploitable by TDM tools (including natural language processing) designed for unstructured information, i.e. information that lacks a pre-defined data model or is not organized in a pre-defined manner.

These tools and techniques may help to find patterns or other useful information, but they usually require some structuring of the documents, e.g. through manual tagging with metadata. A quite different condition is legal feasibility. While in some countries TDM for scientific purposes does not require copyright clearance, because copyright exceptions recognize that it is legal to extract content for data analytics, in other countries such as France copyright-based legal barriers to TDM have yet to be removed.

Our paper will address these issues in a general way, but also with regard to recent research on content mining of UK dissertations in law and chemistry, to automatic processing of PhD metadata for innovation search and identification of scientific skills, and to our own research projects on TDM of unstructured information in the fields of cultural and industrial heritage, geographical data and academic publishing. In particular, we will draw on preliminary results of our interdisciplinary research project TERRE-ISTEX (2016-2018), which will retrieve, organize and make accessible knowledge related to geographical territories from heterogeneous digital academic resources available on the ISTEX platform and in dissertations.

We will also address the retro-digitisation of older print dissertations and related material, in order to make them usable for automatic content mining and to valorise these often hidden treasures of academic heritage.

Practical implications/originality: The paper will provide an update on an emerging and promising field of research and development. Our results will be useful for academic libraries and repositories in the conception and creation of added-value services for their ETDs.
Language:
English
Peer-reviewed:
Yes
Audience:
International
Popularization:
No
Collections:
Source:
Files
- https://hal.univ-lille.fr/hal-01400071/document (open access)
- https://hal.univ-lille.fr/hal-01400071/file/ETD2016_Dissertations_as_data.pdf (open access)
- https://hal.univ-lille.fr/hal-01400071/file/ETD2016_Dissertations_as_data.pptx (open access)