Multilingual Offensive Language Identification with Cross-lingual Embeddings

Tharindu Ranasinghe, Marcos Zampieri

Computational Social Science and Social Media Short Paper

Gather-4F: Nov 18, Gather-4F: Nov 18 (02:00-04:00 UTC) [Join Gather Meeting]

You can open the pre-recorded video in a separate window.

Abstract: Offensive content is pervasive in social media and a reason for concern to companies and government organizations. Several studies have been recently published investigating methods to detect the various forms of such content (e.g. hate speech, cyberbulling, and cyberaggression). The clear majority of these studies deal with English partially because most annotated datasets available contain English data. In this paper, we take advantage of English data available by applying cross-lingual contextual word embeddings and transfer learning to make predictions in languages with less resources. We project predictions on comparable data in Bengali, Hindi, and Spanish and we report results of 0.8415 F1 macro for Bengali, 0.8568 F1 macro for Hindi, and 0.7513 F1 macro for Spanish. Finally, we show that our approach compares favorably to the best systems submitted to recent shared tasks on these three languages, confirming the robustness of cross-lingual contextual embeddings and transfer learning for this task.
NOTE: Video may display a random order of authors. Correct author list is at the top of this page.

Connected Papers in EMNLP2020

Similar Papers

XL-WiC: A Multilingual Benchmark for Evaluating Semantic Contextualization
Alessandro Raganato, Tommaso Pasini, Jose Camacho-Collados, Mohammad Taher Pilehvar,
Design Challenges in Low-resource Cross-lingual Entity Linking
Xingyu Fu, Weijia Shi, Xiaodong Yu, Zian Zhao, Dan Roth,
Hate-Speech and Offensive Language Detection in Roman Urdu
Hammad Rizwan, Muhammad Haroon Shakeel, Asim Karim,