CLIRMatrix: A massively large collection of bilingual and multilingual datasets for Cross-Lingual Information Retrieval

Shuo Sun, Kevin Duh

Information Retrieval and Text Mining Long Paper

Zoom-8A: Nov 17, Zoom-8A: Nov 17 (17:00-18:00 UTC) [Join Zoom Meeting]

You can open the pre-recorded video in a separate window.

Abstract: We present CLIRMatrix, a massively large collection of bilingual and multilingual datasets for Cross-Lingual Information Retrieval extracted automatically from Wikipedia. CLIRMatrix comprises (1) BI-139, a bilingual dataset of queries in one language matched with relevant documents in another language for 139x138=19,182 language pairs, and (2) MULTI-8, a multilingual dataset of queries and documents jointly aligned in 8 different languages. In total, we mined 49 million unique queries and 34 billion (query, document, label) triplets, making it the largest and most comprehensive CLIR dataset to date. This collection is intended to support research in end-to-end neural information retrieval and is publicly available at [url]. We provide baseline neural model results on BI-139, and evaluate MULTI-8 in both single-language retrieval and mix-language retrieval settings.
NOTE: Video may display a random order of authors. Correct author list is at the top of this page.

Connected Papers in EMNLP2020

Similar Papers

LAReQA: Language-Agnostic Answer Retrieval from a Multilingual Pool
Uma Roy, Noah Constant, Rami Al-Rfou, Aditya Barua, Aaron Phillips, Yinfei Yang,
XGLUE: A New Benchmark Datasetfor Cross-lingual Pre-training, Understanding and Generation
Yaobo Liang, Nan Duan, Yeyun Gong, Ning Wu, Fenfei Guo, Weizhen Qi, Ming Gong, Linjun Shou, Daxin Jiang, Guihong Cao, Xiaodong Fan, Ruofei Zhang, Rahul Agrawal, Edward Cui, Sining Wei, Taroon Bharti, Ying Qiao, Jiun-Hung Chen, Winnie Wu, Shuguang Liu, Fan Yang, Daniel Campos, Rangan Majumder, Ming Zhou,
Improving Multilingual Models with Language-Clustered Vocabularies
Hyung Won Chung, Dan Garrette, Kiat Chuan Tan, Jason Riesa,