Why do small language models underperform? | George Mason Department of Computer Science

When: Thursday, May 02, 2024 from 11:00 AM to 12:00 PM
Speakers: Benoît Sagot
Location: Nguyen Engineering Bldg, Conference Room 4201
Export to iCal

Abstract:

Language models, and in particular generative and conversational language models, are at the heart of recent advances in natural language processing (NLP). Understanding how these models represent textual content and how they learn these representations still raises multiple research questions. In this talk, I will start from an observation that small models are less efficient than expected. I will show that language models relying on the Transformer architecture tend to produce vector representations that are not isotropically distributed in space. This anisotropy is linked to the way in which these models are learned, which leads to the frequency of the tokens taking a preponderant place in their representation. I will show that this effect has negative consequences on the ability of small models to train satisfactorily (“performance saturation”) but does not seem to affect larger models. I will then describe a new approach for training language models intended to avoid the undesirable effects of this prevalence of frequency information. The resulting “headless” models display a number of advantages over standard models, including on downstream performance.

Bio:

Benoît Sagot is a computer scientist specialized in natural language processing (NLP). He is a Senior Researcher (Directeur de Recherches) at INRIA, where is heads the INRIA research project ALMAnaCH in Paris, France. He also holds a chair in the PRAIRIE institute dedicated to artificial intelligence, and currently holds the annual chair for computer science in the Collège de France. His research focuses on language modelling, machine translation, language resource development and computational linguistics, with a focus on French in all its form and on less-resourced languages.

Posted 1 week, 3 days ago

Why do small language models underperform? Events / Distinguished Lecture Series

Categories

Why do small language models underperform?
Events / Distinguished Lecture Series