Learning with Limited Labelled Data

Learning with limited labelled data is useful for specialised domains and low-resource languages, where large annotated datasets are hard to obtain. Methods we research to mitigate the problems arising in these settings include multi-task learning, weakly supervised learning, and zero-shot learning.
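As a concrete illustration of one of these methods, the sketch below shows multi-task learning with hard parameter sharing: a shared encoder is trained jointly with several task-specific heads, so a task with few labels can benefit from gradients of a related, better-resourced task. This is a minimal sketch assuming PyTorch; the model sizes, task names and random data are illustrative placeholders, not a description of any specific system from our publications.

```python
# Minimal multi-task learning sketch with hard parameter sharing (PyTorch).
# A shared encoder feeds one small classification head per task.
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids):
        _, (hidden, _) = self.lstm(self.embed(token_ids))
        return hidden[-1]                       # sentence representation

class MultiTaskModel(nn.Module):
    def __init__(self, task_label_counts):
        super().__init__()
        self.encoder = SharedEncoder()          # shared across all tasks
        self.heads = nn.ModuleDict({            # one head per task
            task: nn.Linear(256, n_labels)
            for task, n_labels in task_label_counts.items()
        })

    def forward(self, token_ids, task):
        return self.heads[task](self.encoder(token_ids))

model = MultiTaskModel({"stance": 3, "sentiment": 2})  # hypothetical tasks
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Training alternates between tasks; the low-resource task ("stance")
# reuses the encoder parameters updated by the auxiliary task.
for task, token_ids, labels in [
    ("sentiment", torch.randint(0, 10000, (8, 20)), torch.randint(0, 2, (8,))),
    ("stance", torch.randint(0, 10000, (8, 20)), torch.randint(0, 3, (8,))),
]:
    optimizer.zero_grad()
    loss = loss_fn(model(token_ids, task), labels)
    loss.backward()
    optimizer.step()
```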

This is a cross-cutting theme in most of our research. Two funded projects specifically addressing this are Multi3Generation and Andreas Nugaard Holm’s industrial PhD project with BASE Life Science, supported by Innovation Fund Denmark.

Multi3Generation is a COST Action that funds collaboration among researchers in Europe and beyond. The project is coordinated by Isabelle Augenstein, and its goal is to study language generation using multi-task, multilingual and multi-modal signals.

Andreas Nugaard Holm’s industrial PhD project focuses on transfer learning and domain adaptation for scientific text.
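To illustrate what transfer learning for a small labelled target domain can look like in practice, the sketch below fine-tunes a generically pretrained encoder on a handful of in-domain examples using the Hugging Face transformers API. This is only a hedged illustration of the general technique; the model name, labels and example sentences are assumptions for the sketch, not the project's actual setup.

```python
# Minimal transfer-learning sketch: fine-tune a pretrained encoder on a
# few labelled target-domain (here: scientific-sounding) sentences.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2          # placeholder label set
)

# A handful of labelled in-domain examples (illustrative only).
texts = [
    "The compound inhibited tumour growth in the treated cohort.",
    "No significant difference was observed between the two groups.",
]
labels = torch.tensor([1, 0])

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
optimizer.zero_grad()
outputs = model(**batch, labels=labels)        # loss computed internally
outputs.loss.backward()
optimizer.step()
```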

Publications

Explainable AI methods facilitate the understanding of model behaviour, yet small, imperceptible perturbations to inputs can vastly …

The increased use of large language models (LLMs) across a variety of real-world applications calls for mechanisms to verify the …

The digitisation of historical documents has provided historians with unprecedented research opportunities. Yet, the conventional …

Uncertainty approximation in text classification is an important area with applications in domain adaptation and interpretability. One …

Data-driven analyses of biases in historical texts can help illuminate the origin and development of biases prevailing in modern …

NLP methods can aid historians in analyzing textual materials in greater volumes than manually feasible. Developing such methods poses …

The task of Stance Detection is concerned with identifying the attitudes expressed by an author towards a target of interest. This task …

Selecting an effective training signal for tasks in natural language processing is difficult: expert annotations are expensive, and …

Fact-checking systems have become important tools to verify fake and misleading news. These systems become more trustworthy when …

Modern natural language processing (NLP) methods employ self-supervised pretraining objectives such as masked language modeling to …

Automated scientific fact checking is difficult due to the complexity of scientific language and a lack of significant amounts of …

This paper presents the Multitask, Multilingual, Multimodal Language Generation COST Action – Multi3Generation (CA18231), an …

We propose a novel framework for cross-lingual content flagging with limited target-language data, which significantly outperforms …

Explanations shed light on a machine learning model’s rationales and can aid in identifying deficiencies in its reasoning …

The goal of stance detection is to determine the viewpoint expressed in a piece of text towards a target. These viewpoints or contexts …

For natural language processing (NLP) tasks such as sentiment or topic classification, currently prevailing approaches heavily rely on …

Citation count prediction is the task of predicting the number of citations a paper has gained after a period of time. Prior work …

Stance detection concerns the classification of a writer’s viewpoint towards a target. There are different task variants, e.g., …

As NLP models are increasingly deployed in socially situated settings such as online abusive content detection, ensuring these models …

Public trust in science depends on honest and factual communication of scientific papers. However, recent studies have demonstrated a …

Emotion lexica are commonly used resources to combat data poverty in automatic emotion detection. However, methodological issues emerge …

Cross-lingual representations have the potential to make NLP techniques available to the vast majority of languages in the world. …

Scientific document understanding is challenging as the data is highly domain specific and diverse. However, datasets for tasks with …

Bridging the performance gap between high- and low-resource languages has been the focus of much previous work. Typological features …

In this paper, we describe our participation in the TREC Health Misinformation Track 2020. We submitted 11 runs to the Total Recall …

Subjectivity is the expression of internal opinions or beliefs which cannot be objectively observed or verified, and has been shown to …

In practical machine learning settings, the data on which a model must make predictions often come from a different distribution than …

Learning what to share between tasks has been a topic of high importance recently, as strategic sharing of knowledge has been shown to …

A critical component of automatically combating misinformation is the detection of fact check-worthiness, i.e. determining if a piece …

Typological knowledge bases (KBs) such as WALS contain information about linguistic properties of the world’s languages. They …

While state-of-the-art NLP explainability (XAI) methods focus on supervised, per-instance end or diagnostic probing task evaluation [4, …

In this paper, we extend the task of semantic textual similarity to include sentences which contain emojis. Emojis are ubiquitous on …

Language evolves over time in many ways relevant to natural language processing tasks. For example, recent occurrences of tokens …

Task-oriented dialogue systems rely heavily on specialized dialogue state tracking (DST) modules for dynamically predicting user intent …

Although the vast majority of knowledge bases (KBs) are heavily biased towards English, Wikipedias do cover very different topics in …

Multi-task learning and self-training are two common ways to improve a machine learning model’s performance in settings with …

Studying to what degree the language we use is gender-specific has long been an area of interest in socio-linguistics. Studies have …

When assigning quantitative labels to a dataset, different methodologies may rely on different scales. In particular, when assigning …

In online discussion fora, speakers often make arguments for or against something, say birth control, by highlighting certain aspects …

Multi-task learning (MTL) allows deep neural networks to learn from related tasks by sharing parameters with other networks. In …

Previous work has suggested that parameter sharing between transition-based neural dependency parsers for related languages can lead to …

This paper documents the Team Copenhagen system which placed first in the CoNLL–SIGMORPHON 2018 shared task on universal …

Neural part-of-speech (POS) taggers are known to not perform well with little training data. As a step towards overcoming this problem, …

We combine multi-task learning and semi-supervised learning by inducing a joint embedding space between disparate label spaces and …

We take a multi-task learning approach to the shared Task 1 at SemEval-2018. The general idea concerning the model structure is to use …

Keyphrase boundary classification (KBC) is the task of detecting keyphrases in scientific articles and labelling them with respect to …