Textual Data Augmentation for Efficient Active Learning on Tiny Datasets

Husam Quteineh, Spyridon Samothrakis, Richard Sutcliffe

NLP Applications Long Paper

Zoom-12D: Nov 18 (09:00-10:00 UTC)


Abstract: In this paper, we propose a novel data augmentation approach in which guided outputs of a language generation model, e.g. GPT-2, when labeled, can improve the performance of text classifiers through an active learning process. We transform the data generation task into an optimization problem which maximizes the usefulness of the generated output, using Monte Carlo Tree Search (MCTS) as the optimization strategy and incorporating entropy as one of the optimization criteria. We test our approach against a Non-Guided Data Generation (NGDG) process that does not optimize for a reward function. Starting with a small set of data, our results show performance increases with MCTS of 26% on the TREC-6 Questions dataset and 10% on the Stanford Sentiment Treebank (SST-2) dataset. Compared with NGDG, we achieve increases of 3% and 5% on TREC-6 and SST-2, respectively.
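
The abstract frames data generation as a search problem: MCTS explores continuations proposed by a language model and rewards sequences on which the current classifier is uncertain, so that labeling them is maximally useful for active learning. The following is a minimal, self-contained sketch of that idea, not the authors' implementation: lm_next_tokens and classifier_probs are hypothetical toy stand-ins for GPT-2 and the task classifier, and the reward uses prediction entropy alone rather than the paper's full set of optimization criteria.

import math
import random

# Minimal sketch of entropy-rewarded MCTS text generation.
# lm_next_tokens and classifier_probs are toy stand-ins (assumptions)
# for GPT-2 and the task classifier described in the paper.

VOCAB = ["good", "bad", "movie", "plot", "boring", "great", "fun", "."]


def lm_next_tokens(prefix):
    """Toy language model: proposes every vocabulary token with equal probability."""
    return [(tok, 1.0 / len(VOCAB)) for tok in VOCAB]


def classifier_probs(text):
    """Toy sentiment classifier: crude keyword counts turned into class probabilities."""
    pos = sum(text.count(w) for w in ("good", "great", "fun"))
    neg = sum(text.count(w) for w in ("bad", "boring"))
    p_pos = (1 + pos) / (2 + pos + neg)
    return [p_pos, 1.0 - p_pos]


def entropy_reward(text):
    """Prediction entropy of the classifier: high entropy means the classifier
    is uncertain about this text, so obtaining its label should be informative."""
    return -sum(p * math.log(p) for p in classifier_probs(text) if p > 0)


class Node:
    def __init__(self, tokens, parent=None):
        self.tokens = tokens
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0


def ucb1(node, c=1.4):
    """Upper Confidence Bound score used in the selection step."""
    if node.visits == 0:
        return float("inf")
    return node.value / node.visits + c * math.sqrt(
        math.log(node.parent.visits) / node.visits
    )


def rollout(tokens, depth=4):
    """Simulation: randomly extend the sequence, then score it with the entropy reward."""
    toks = list(tokens) + [random.choice(VOCAB) for _ in range(depth)]
    return entropy_reward(" ".join(toks))


def mcts_generate(prompt_tokens, iterations=300, max_len=8):
    root = Node(list(prompt_tokens))
    for _ in range(iterations):
        # Selection: descend by UCB1 until reaching a leaf.
        node = root
        while node.children:
            node = max(node.children, key=ucb1)
        # Expansion: add one child per language-model proposal.
        if node.visits > 0 and len(node.tokens) < max_len:
            for tok, _prob in lm_next_tokens(node.tokens):
                node.children.append(Node(node.tokens + [tok], parent=node))
            node = random.choice(node.children)
        # Simulation and backpropagation of the entropy reward.
        reward = rollout(node.tokens)
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    # The most-visited child of the root is returned as the generated candidate.
    best = max(root.children, key=lambda n: n.visits)
    return " ".join(best.tokens)


if __name__ == "__main__":
    print(mcts_generate(["the", "movie", "was"]))

In the paper's setting, candidates produced this way would then be labeled and added to the training data within the active learning loop; here the toy classifier is fixed, so the sketch only illustrates how an entropy reward can steer the tree search toward uncertain examples.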


Similar Papers

New Protocols and Negative Results for Textual Entailment Data Collection
Samuel R. Bowman, Jennimaria Palomaki, Livio Baldini Soares, Emily Pitler
Active Learning for BERT: An Empirical Study
Liat Ein-Dor, Alon Halfon, Ariel Gera, Eyal Shnarch, Lena Dankin, Leshem Choshen, Marina Danilevsky, Ranit Aharonov, Yoav Katz, Noam Slonim
Q-learning with Language Model for Edit-based Unsupervised Summarization
Ryosuke Kohita, Akifumi Wachi, Yang Zhao, Ryuki Tachibana
On the importance of pre-training data volume for compact language models
Vincent Micheli, Martin d'Hoffschmidt, François Fleuret