An Empirical Investigation of Learning from Biased Toxicity Labels

Abstract

When collecting data from human annotators to train machine learning models, there is a trade-off between quantity and quality. We explore the setting of collecting a large amount of low-quality, noisy, and biased data alongside a small amount of high-quality clean data. We study a toxicity classification task with a small dataset of human-annotated labels and a large but biased dataset of synthetically generated labels. We investigate how best to train accurate and unbiased models in this setting, and the trade-off that results. We find that training on the noisy data and then fine-tuning on the clean data produces the most accurate models, while the best approach to training unbiased models depends on how fairness is measured.
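
The two-stage recipe the abstract highlights (pre-train on the large noisy label set, then fine-tune on the small clean set) can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual pipeline: the model, synthetic features, label rule, dataset sizes, and hyperparameters are all placeholder assumptions.

    # Sketch: pre-train a toxicity classifier on large noisy/biased labels,
    # then fine-tune on a small human-annotated clean set.
    import torch
    from torch import nn
    from torch.utils.data import DataLoader, TensorDataset

    def make_loader(n, dim=64, batch_size=32, seed=0):
        # Synthetic stand-in for featurized comments and binary toxicity labels.
        g = torch.Generator().manual_seed(seed)
        x = torch.randn(n, dim, generator=g)
        y = (x[:, 0] > 0).float()  # placeholder labelling rule, not real toxicity labels
        return DataLoader(TensorDataset(x, y), batch_size=batch_size, shuffle=True)

    noisy_loader = make_loader(10_000, seed=0)  # large, synthetically labelled (biased)
    clean_loader = make_loader(500, seed=1)     # small, human-annotated (clean)

    model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 1))
    loss_fn = nn.BCEWithLogitsLoss()

    def train(loader, epochs, lr):
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        for _ in range(epochs):
            for x, y in loader:
                opt.zero_grad()
                loss = loss_fn(model(x).squeeze(-1), y)
                loss.backward()
                opt.step()

    train(noisy_loader, epochs=3, lr=1e-3)   # stage 1: pre-train on noisy labels
    train(clean_loader, epochs=10, lr=1e-4)  # stage 2: fine-tune on clean labels at a lower learning rate

The lower learning rate in the second stage is a common choice when fine-tuning, so that the small clean set adjusts rather than overwrites what was learned from the large noisy set.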

Publications