If You Build Your Own NER Scorer, Non-replicable Results Will Come

Constantine Lignos, Marjan Kamyab

Workshop on Insights from Negative Results in NLP (Workshop Paper)


Abstract: We attempt to replicate a named entity recognition (NER) model implemented in a popular toolkit and discover that a critical barrier to doing so is the inconsistent evaluation of improper label sequences. We define these sequences and examine how two scorers differ in their handling of them, finding that one approach produces F1 scores approximately 0.5 points higher on the CoNLL 2003 English development and test sets. We propose best practices to increase the replicability of NER evaluations by increasing transparency regarding the handling of improper label sequences.
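To make the issue concrete, the following minimal sketch (not the authors' scorer or either of the compared tools) shows how two common conventions for an improper BIO sequence, such as an I-PER tag appearing immediately after O, can yield different predicted entities and therefore different F1 scores. The function name `spans`, the `repair_improper` flag, and the example label sequence are illustrative assumptions, not part of the paper.

    # A minimal sketch illustrating two conventions for improper BIO sequences
    # (an I- tag that does not continue a same-type chunk).

    def spans(labels, repair_improper):
        """Extract (type, start, end) entity spans from a BIO label sequence.

        If repair_improper is True, an out-of-place I- tag starts a new entity
        (conlleval-style leniency); if False, it is discarded (a strict reading).
        """
        entities, current = [], None  # current = [type, start, end]
        for i, label in enumerate(labels):
            if label.startswith("B-"):
                if current:
                    entities.append(tuple(current))
                current = [label[2:], i, i]
            elif label.startswith("I-"):
                if current and current[0] == label[2:]:
                    current[2] = i  # proper continuation of the open chunk
                elif repair_improper:
                    if current:
                        entities.append(tuple(current))
                    current = [label[2:], i, i]  # treat the stray I- as a new entity
                else:
                    if current:
                        entities.append(tuple(current))
                    current = None  # strict: drop the improper tag
            else:  # "O"
                if current:
                    entities.append(tuple(current))
                current = None
        if current:
            entities.append(tuple(current))
        return entities

    predicted = ["O", "I-PER", "I-PER", "O", "B-LOC"]
    print(spans(predicted, repair_improper=True))   # [('PER', 1, 2), ('LOC', 4, 4)]
    print(spans(predicted, repair_improper=False))  # [('LOC', 4, 4)]

Under the lenient convention the stray I-PER tags are credited as a PER entity, while under the strict convention they are ignored; whether such credited entities match the gold annotation is what shifts span-level precision, recall, and F1 between scorers.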