2021 (English) In: Selected Papers from the CLARIN Annual Conference 2020, 2021. Conference paper, Published paper (Refereed)
Abstract [en]
We demonstrate how a set of tools that are being maintained and further developed within the Språkbanken Sam and SWE-CLARIN infrastructures can be employed for creating manually labelled training data in a low-resource setting. As example text, we used the “COVID-19 Open Research Dataset” and created manually annotated training data for its associated Kaggle task, “What do we know about COVID-19 risk factors?”. We first used our topic modelling tool to i) select a text set for manual annotation, ii) classify the texts into preliminary classification categories, and iii) analyse the texts in search of potential refinements of the annotation categories. We then annotated the text set on a more granular level by labelling the token sequences that indicated the presence of the refined categories in the text. For these token-sequence annotations, we used our text annotation tool, which includes support for incorporating automatic pre-annotations. Finally, we used the granularly annotated text set as a seed set and applied our active learning tool to select additional texts for annotation.
National Category
Natural Language Processing
Identifiers
urn:nbn:se:sprakochfolkminnen:diva-2075 (URN)
Conference
CLARIN Annual Conference 2020
Funder
Swedish Research Council, 2017-00626
2021-10-21 2021-10-21 2025-09-05 Bibliographically approved