The predictive capabilities of mathematical models for the type-token relationship in English language corpora

    Research output: Contribution to journal › Article › peer-review

    Abstract

    We investigate the predictive capability of mathematical models of the type-token relationship applied to the vocabulary growth profiles of selected English language documents. We compare the existing Good-Toulmin and Heaps formulae with an alternative approach based on Bernoulli trial word selection from a fixed finite vocabulary using the Zipf and Zipf-Mandelbrot probability distributions. We make two major observations. Firstly, while the Zipf-Mandelbrot model makes better predictions of vocabulary growth than the Zipf model, the optimized parameters of the latter correlate better than those of the former with statistics gleaned independently from the data. Secondly, the mean of the Zipf-Mandelbrot, Good-Toulmin and Heaps models provides a more consistent and unbiased prediction of vocabulary than any individual model alone.
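As a rough illustration of the kind of models the abstract compares, the sketch below computes an expected type count under independent word draws from a fixed vocabulary with Zipf-Mandelbrot probabilities, alongside the standard Heaps power law. It is not taken from the article: the function names, the parameter names (`alpha`, `beta`, `k`, `exponent`) and the closed form sum over (1 - (1 - p_i)^N) are assumptions chosen for illustration, and the paper's own formulation and fitting procedure may differ.

```python
# Illustrative sketch only; names and parameterization are assumptions,
# not the article's notation.

def zipf_mandelbrot_probs(vocab_size, alpha=1.0, beta=0.0):
    """Normalized probabilities p_r proportional to 1 / (r + beta)**alpha.

    With beta = 0 this reduces to the plain Zipf distribution.
    """
    weights = [1.0 / (rank + beta) ** alpha for rank in range(1, vocab_size + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def expected_types(probs, n_tokens):
    """Expected number of distinct words seen after n_tokens independent draws.

    A word with probability p is missed by all n_tokens draws with
    probability (1 - p)**n_tokens, so it contributes 1 - (1 - p)**n_tokens.
    """
    return sum(1.0 - (1.0 - p) ** n_tokens for p in probs)


def heaps_types(n_tokens, k, exponent):
    """Heaps' law: vocabulary size grows as k * n_tokens**exponent."""
    return k * n_tokens ** exponent


if __name__ == "__main__":
    # Hypothetical parameter values, picked only to show the call pattern.
    probs = zipf_mandelbrot_probs(vocab_size=5000, alpha=1.1, beta=2.7)
    for n in (100, 1000, 10000):
        print(n, round(expected_types(probs, n), 1), round(heaps_types(n, 5.0, 0.6), 1))
```

In this sketch the fixed-vocabulary curve saturates at `vocab_size`, whereas the Heaps curve grows without bound, which is one reason the two families can extrapolate differently.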
    Original language: English
    Article number: 101227
    Journal: Computer Speech & Language
    Volume: 70
    Early online date: 20 Apr 2021
    Publication status: Published - 30 Nov 2021

    Keywords

    • Computer science and informatics
