Can Transformers Learn n-gram Language Models?

Abstract

Much theoretical work has described the ability of transformer language models (LMs) to represent formal languages. However, linking theoretical results to empirical performance is not straightforward. We empirically evaluate recent work linking transformers to n-gram LMs by studying their ability to learn random n-gram LMs of two kinds: ones with arbitrary next-symbol probabilities and ones where next-symbol probabilities are defined with shared parameters. We find that classic n-gram estimation techniques such as add-lambda smoothing outperform transformers on the former, while transformers perform well on the latter, outperforming methods specifically designed to learn n-gram LMs.
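For readers unfamiliar with the baseline mentioned above, the sketch below illustrates add-lambda (additive) smoothing for estimating an n-gram LM from symbol sequences. It is a minimal, generic illustration of the technique, not the paper's implementation; the function and variable names (estimate_add_lambda, corpus, vocab, lam) are assumptions chosen for clarity.

```python
# Minimal sketch of add-lambda smoothed n-gram LM estimation.
# Names and padding conventions are illustrative assumptions, not the paper's code.
from collections import Counter, defaultdict


def estimate_add_lambda(corpus, n, vocab, lam=1.0):
    """Return a function p(context, symbol) giving smoothed n-gram probabilities.

    corpus: list of sequences, each a list of symbols drawn from `vocab`
    n:      n-gram order (context length is n - 1)
    vocab:  iterable of all possible symbols
    lam:    smoothing constant lambda > 0
    """
    vocab = list(vocab)
    counts = defaultdict(Counter)  # context (tuple of n-1 symbols) -> Counter of next symbols
    for seq in corpus:
        padded = ["<bos>"] * (n - 1) + list(seq) + ["<eos>"]
        for i in range(n - 1, len(padded)):
            context = tuple(padded[i - n + 1 : i])
            counts[context][padded[i]] += 1

    def prob(context, symbol):
        # Add-lambda estimate: (count + lambda) / (total + lambda * |support|),
        # where the support is the vocabulary plus the end-of-sequence symbol.
        c = counts[tuple(context)]
        return (c[symbol] + lam) / (sum(c.values()) + lam * (len(vocab) + 1))

    return prob


# Usage (toy data): a bigram model over the symbols "a" and "b".
p = estimate_add_lambda([["a", "b", "a"], ["b", "a"]], n=2, vocab=["a", "b"], lam=0.5)
print(p(("a",), "b"))  # smoothed probability of "b" following "a"
```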

Publication
In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP 2024)