When BERT Plays the Lottery, All Tickets Are Winning
Sai Prasanna, Anna Rogers, Anna Rumshisky
Interpretability and Analysis of Models for NLP Long Paper
Large Transformer-based models were shown to be reducible to a smaller number of self-attention heads and layers. We consider this phenomenon from the perspective of the lottery ticket hypothesis, using both structured and magnitude pruning. For fine-tuned BERT, we show that (a) it is possible to find subnetworks achieving performance that is comparable with that of the full model, and (b) similarly-sized subnetworks sampled from the rest of the model perform worse. Strikingly, with structured pruning even the worst possible subnetworks remain highly trainable, indicating that most pre-trained BERT weights are potentially useful. We also study the "good" subnetworks to see if their success can be attributed to superior linguistic knowledge, but find them unstable, and not explained by meaningful self-attention patterns.
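As a rough illustration of the comparison described in the abstract, the following toy sketch builds a "good" subnetwork by magnitude pruning and an equally sized subnetwork sampled from the remaining weights. It is not the authors' implementation: the function names (`magnitude_mask`, `rest_mask`, `apply_mask`), the flat list of weights, and the 25% keep fraction are all assumptions made for illustration, and real experiments would operate on BERT's weight tensors and fine-tune each subnetwork.

```python
import random


def magnitude_mask(weights, keep_fraction):
    """Return a 0/1 mask that keeps the largest-magnitude weights (hypothetical helper)."""
    n_keep = max(1, round(len(weights) * keep_fraction))
    # Rank weight indices from largest to smallest absolute value.
    ranked = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    kept = set(ranked[:n_keep])
    return [1 if i in kept else 0 for i in range(len(weights))]


def rest_mask(good_mask, seed=0):
    """Sample an equally sized subnetwork from the weights outside good_mask (hypothetical helper)."""
    rest = [i for i, m in enumerate(good_mask) if m == 0]
    size = min(sum(good_mask), len(rest))
    rng = random.Random(seed)
    chosen = set(rng.sample(rest, size))
    return [1 if i in chosen else 0 for i in range(len(good_mask))]


def apply_mask(weights, mask):
    """Zero out every weight whose mask entry is 0."""
    return [w * m for w, m in zip(weights, mask)]


# Toy weights standing in for one flattened weight matrix (assumed values).
weights = [0.9, -0.05, 0.4, -0.7, 0.01, 0.2, -0.3, 0.08]
good = magnitude_mask(weights, keep_fraction=0.25)
bad = rest_mask(good)
good_subnetwork = apply_mask(weights, good)
bad_subnetwork = apply_mask(weights, bad)
```

The two masks are disjoint and have the same number of surviving weights, so any difference in performance after fine-tuning would reflect which weights were kept rather than how many.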