Anatomy of Hate Speech Datasets: Composition Analysis and Cross-dataset Classification

Jan 1, 2023
Samuel Guimarães, Gabriel Kakizaki, Philipe Melo, Márcio Silva, Fabricio Murai, Julio C S Reis, Fabrício Benevenuto
Abstract
Manifestations of hate speech in different scenarios are increasingly frequent on social platforms. In this context, there is a large number of works that propose solutions for identifying this type of content in these environments. Most efforts to automatically detect hate speech follow the same process of supervised learning, using annotators to label a predefined set of messages, which are, in turn, used to train classifiers. However, annotators can create labels for different classification tasks, with divergent definitions of hate speech, binary or multi-label schemes, and various methodologies for collecting data. In this context, we examine the principal publicly available datasets for hate speech research. We investigate the types of hate speech (e.g., ethnicity, religion, sexual orientation) present in their composition, explore their content beyond the labels, and use cross-dataset classification to examine the use of the labeled data beyond its original work. Our results reveal interesting insights toward a better understanding of the hate speech phenomenon and improving its detection on social platforms.
Warning. This paper contains offensive words and tweet examples.
Type
Publication
Proceedings of the 34th ACM Conference on Hypertext and Social Media (HT'23)