Semantic Textual Similarity of Sentences with Emojis

Alok Debnath, Nikhil Pinnaparaju, Manish Shrivastava, Vasudeva Varma, Isabelle Augenstein

1 Feb 2020

Abstract

In this paper, we extend the task of semantic textual similarity to include sentences that contain emojis. Emojis are ubiquitous on social media today, but are often removed in the pre-processing stage of curating datasets for NLP tasks. We qualitatively assess the semantic information lost by discarding emojis, and show a mechanism for accounting for emojis in a semantic task. We create a sentence similarity dataset of 4000 pairs of tweets with emojis, annotated for relatedness. The corpus contains tweets curated by common topic as well as by replacement of emojis; the latter was done to analyze the differences in semantics associated with different emojis. Through a qualitative analysis of the dataset, we aim to provide an understanding of the information lost by removing emojis. We also aim to present a method for using both emojis and words in downstream NLP tasks beyond sentiment analysis.

Type

Conference paper

Publication

In Proceedings of the 8th International Workshop on Natural Language Processing for Social Media (SocialNLP) at TheWebConf 2020

Date

February 2020