Abstract
We investigate the predictive capability of mathematical models of the type-token relationship applied to the vocabulary growth profiles of a selection of English-language documents. We compare the existing Good-Toulmin and Heaps formulae with an alternative approach based on Bernoulli trial word selection from a fixed finite vocabulary using the Zipf and Zipf-Mandelbrot probability distributions. We make two major observations. First, while the Zipf-Mandelbrot model makes better predictions of vocabulary growth than the Zipf model, the optimized parameters of the latter correlate better than those of the former with statistics gleaned independently from the data. Second, the mean of the Zipf-Mandelbrot, Good-Toulmin and Heaps models provides a more consistent and unbiased prediction of vocabulary than any individual model alone.
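As a rough illustration of the type-token relationship discussed in the abstract, the following Python sketch computes the expected vocabulary size when each token is drawn independently from a fixed finite vocabulary with Zipf or Zipf-Mandelbrot probabilities, alongside the Heaps power law. The abstract does not give the article's exact formulation, so the independence assumption, the function names, and all parameter values here are hypothetical and chosen only for illustration. The Good-Toulmin estimator is not sketched, because it requires the frequency spectrum of an observed sample.

```python
def zipf_mandelbrot_probs(vocab_size, exponent=1.0, shift=0.0):
    """Normalised probabilities p_r proportional to 1 / (r + shift)**exponent
    for ranks r = 1..vocab_size. With shift = 0 this is the plain Zipf case."""
    weights = [1.0 / (r + shift) ** exponent for r in range(1, vocab_size + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def expected_vocabulary(probs, n_tokens):
    """Expected number of distinct types after n_tokens independent draws.

    Assumption: tokens are drawn independently with replacement, so a word
    with probability p is still unseen after n draws with probability (1 - p)**n.
    """
    return sum(1.0 - (1.0 - p) ** n_tokens for p in probs)


def heaps(n_tokens, k, beta):
    """Heaps' law: vocabulary grows as k * n**beta."""
    return k * n_tokens ** beta


if __name__ == "__main__":
    # Illustrative parameter values only; none are taken from the article.
    zipf = zipf_mandelbrot_probs(50_000, exponent=1.0)
    zm = zipf_mandelbrot_probs(50_000, exponent=1.1, shift=2.7)
    for n in (1_000, 10_000, 100_000):
        print(n,
              round(expected_vocabulary(zipf, n)),
              round(expected_vocabulary(zm, n)),
              round(heaps(n, k=10.0, beta=0.6)))
```

Unlike the Heaps curve, which grows without bound, the finite-vocabulary curves saturate at `vocab_size` as the number of tokens increases.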
| Original language | English |
|---|---|
| Article number | 101227 |
| Journal | Computer Speech & Language |
| Volume | 70 |
| Early online date | 20 Apr 2021 |
| Publication status | Published - 30 Nov 2021 |
Keywords
- Computer science and informatics