Ensemble Distillation for Structured Prediction: Calibrated, Accurate, Fast—Choose Three
Steven Reich, David Mueller, Nicholas Andrews
Machine Learning for NLP Long Paper
Abstract:
Modern neural networks do not always produce well-calibrated predictions, even when trained with a proper scoring function such as cross-entropy. In classification settings, simple methods such as isotonic regression or temperature scaling may be used in conjunction with a held-out dataset to calibrate model outputs. However, extending these methods to structured prediction is not always straightforward or effective; furthermore, a held-out calibration set may not always be available. In this paper, we study \emph{ensemble distillation} as a general framework for producing well-calibrated structured prediction models while avoiding the prohibitive inference-time cost of ensembles. We validate this framework on two tasks: named-entity recognition and machine translation. We find that, across both tasks, ensemble distillation produces models which retain much of, and occasionally improve upon, the performance and calibration benefits of ensembles, while requiring only a single model at test time.
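To make the core idea concrete, below is a minimal PyTorch sketch of token-level ensemble distillation. It is an illustration, not the authors' implementation: the function name, tensor shapes, and temperature parameter are assumptions for this example. The student is trained against the averaged predictive distribution of the ensemble members at each output position, which is what lets a single model inherit much of the ensemble's calibration.

import torch
import torch.nn.functional as F

def ensemble_distillation_loss(student_logits, teacher_logits_list, temperature=1.0):
    # Teacher target: the average of the ensemble members' softmax distributions
    # at each output position. Shapes assumed to be (batch, seq_len, vocab).
    with torch.no_grad():
        teacher_probs = torch.stack(
            [F.softmax(logits / temperature, dim=-1) for logits in teacher_logits_list]
        ).mean(dim=0)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    # Cross-entropy of the student against the soft ensemble distribution,
    # averaged over batch and sequence positions.
    return -(teacher_probs * log_student).sum(dim=-1).mean()

# Toy usage with random logits standing in for model outputs.
batch, seq_len, vocab = 8, 20, 1000
student_logits = torch.randn(batch, seq_len, vocab, requires_grad=True)
teacher_logits = [torch.randn(batch, seq_len, vocab) for _ in range(5)]
loss = ensemble_distillation_loss(student_logits, teacher_logits)
loss.backward()

At inference time only the distilled student is run, so decoding cost matches that of a single model rather than scaling with the ensemble size.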