Exploiting Sentence Order in Document Alignment

Brian Thompson, Philipp Koehn

Machine Translation and Multilinguality Short Paper

Gather-4A: Nov 18 (02:00-04:00 UTC)

Abstract: We present a simple document alignment method that incorporates sentence order information in both candidate generation and candidate re-scoring. Our method yields a 61% relative reduction in error compared to the best previously published result on the WMT16 document alignment shared task. It also improves downstream MT performance on web-scraped Sinhala–English documents from ParaCrawl, outperforming the document alignment method used in the most recent ParaCrawl release. Finally, it outperforms a comparable-corpora method that uses the same multilingual embeddings, demonstrating that exploiting sentence order is beneficial even if the end goal is sentence-level bitext.

Connected Papers in EMNLP2020

Similar Papers

Compositional Phrase Alignment and Beyond
Yuki Arase, Jun'ichi Tsujii
Multilevel Text Alignment with Cross-Document Attention
Xuhui Zhou, Nikolaos Pappas, Noah A. Smith
Improving Word Sense Disambiguation with Translations
Yixing Luan, Bradley Hauer, Lili Mou, Grzegorz Kondrak