Exploiting Sentence Order in Document Alignment

Brian Thompson; Philipp Koehn


Machine Translation and Multilinguality (Short Paper)

Gather-4A: Nov 18 (02:00-04:00 UTC)


Abstract: We present a simple document alignment method that incorporates sentence order information in both candidate generation and candidate re-scoring. Our method results in a 61% relative reduction in error compared to the best previously published result on the WMT16 document alignment shared task. Our method also improves downstream MT performance on web-scraped Sinhala–English documents from ParaCrawl, outperforming the document alignment method used in the most recent ParaCrawl release. In addition, it outperforms a comparable corpora method that uses the same multilingual embeddings, demonstrating that exploiting sentence order is beneficial even if the end goal is sentence-level bitext.
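
The abstract does not spell out how sentence order enters candidate generation. The sketch below is only an illustration of the general idea, not the authors' implementation: it assumes each sentence already has an embedding, and it builds a document vector by concatenating position-weighted averages of those embeddings, so that two documents with the same sentences in a different order no longer look identical. The function names `order_aware_doc_vector` and `cosine`, the triangular weighting window, and the choice of two position slots are all hypothetical.

```python
import math


def order_aware_doc_vector(sent_vecs, num_slots=2):
    """Concatenate position-weighted sums of sentence vectors.

    Illustrative assumption: each slot covers one region of the document
    and weights sentences with a triangular window centred on that region.
    """
    if not sent_vecs:
        raise ValueError("document must contain at least one sentence vector")
    n = len(sent_vecs)
    dim = len(sent_vecs[0])
    doc = []
    for slot in range(num_slots):
        center = (slot + 0.5) / num_slots      # slot centre in [0, 1]
        acc = [0.0] * dim
        for i, vec in enumerate(sent_vecs):
            pos = (i + 0.5) / n                # relative sentence position
            # Weight falls linearly to zero one slot-width from the centre.
            weight = max(0.0, 1.0 - abs(pos - center) * num_slots)
            for d in range(dim):
                acc[d] += weight * vec[d]
        doc.extend(acc)
    # L2-normalise so that a dot product equals cosine similarity.
    norm = math.sqrt(sum(x * x for x in doc)) or 1.0
    return [x / norm for x in doc]


def cosine(a, b):
    """Dot product of two already-normalised vectors."""
    return sum(x * y for x, y in zip(a, b))


# Toy 2-dimensional "sentence embeddings" (made up for illustration).
first, second = [1.0, 0.0], [0.0, 1.0]
in_order = order_aware_doc_vector([first, second])
reversed_order = order_aware_doc_vector([second, first])
print(cosine(in_order, in_order), cosine(in_order, reversed_order))
```

A plain average of the sentence vectors would score the two toy documents as identical; the order-aware vectors separate them. Candidate pairs could then be proposed by nearest-neighbour search over such vectors and re-scored with a finer-grained, order-sensitive comparison, which is again an assumption about one possible design rather than a description of the paper's pipeline.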


Connected Papers in EMNLP2020