Is Multihop QA in DiRe Condition? Measuring and Reducing Disconnected Reasoning

Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, Ashish Sabharwal

Question Answering Long Paper

Gather-5E: Nov 18 (18:00-20:00 UTC)

Abstract: Has there been real progress in multi-hop question-answering? Models often exploit dataset artifacts to produce correct answers, without connecting information across multiple supporting facts. This limits our ability to measure true progress and defeats the purpose of building multi-hop QA datasets. We make three contributions towards addressing this. First, we formalize such undesirable behavior as disconnected reasoning across subsets of supporting facts. This allows developing a model-agnostic probe for measuring how much any model can cheat via disconnected reasoning. Second, using a notion of contrastive support sufficiency, we introduce an automatic transformation of existing datasets that reduces the amount of disconnected reasoning. Third, our experiments suggest that there hasn't been much progress in multi-hop QA in the reading comprehension setting. For a recent large-scale model (XLNet), we show that only 18 points out of its answer F1 score of 72 on HotpotQA are obtained through multifact reasoning, roughly the same as that of a simpler RNN baseline. Our transformation substantially reduces disconnected reasoning (19 points in answer F1). It is complementary to adversarial approaches, yielding further reductions in conjunction.

Similar Papers

Scalable Multi-Hop Relational Reasoning for Knowledge-Aware Question Answering
Yanlin Feng, Xinyue Chen, Bill Yuchen Lin, Peifeng Wang, Jun Yan, Xiang Ren
ProtoQA: A Question Answering Dataset for Prototypical Common-Sense Reasoning
Michael Boratko, Xiang Li, Tim O'Gorman, Rajarshi Das, Dan Le, Andrew McCallum
A Simple Yet Strong Pipeline for HotpotQA
Dirk Groeneveld, Tushar Khot, Mausam, Ashish Sabharwal