The number of unique words in children’s speech is one of the most basic statistics indicating their language development. It is difficult, however, to accurately evaluate the number of unique words in a child’s growing corpus over time when the sample size is limited. This study proposes a novel technique to estimate the latent number of words from a series of words uttered by children. The technique exploits the statistical properties of the number of types as a function of the number of sampled tokens. We tested the practical effectiveness of the proposed method in empirical analyses of cross-sectional and longitudinal samples. The converging empirical evidence suggests that the proposed estimator improves the accuracy of vocabulary size estimation over a naïve type-counting estimator. Using this efficient estimator, we propose a new sampling scheme for vocabulary assessment that has lower cost and higher accuracy than existing methods.
Vocabulary growth; Small sample size; Number of latent types; Type–token ratio
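We proved that, when words are sampled according to an arbitrary probability distribution, the probability distribution of the number of word types as a function of the number of sampled words asymptotically follows a Poisson binomial distribution (Hidaka, accepted). Using this result, one can statistically estimate, from data on the number of word types observed against the number of sampled words, how many unseen word types latently exist. Applying this result therefore makes it possible to estimate more accurately the number of unknown item types in a variety of fields, such as the number of words acquired by young children during language acquisition, the number of words in corpus data, and the number of species in an ecosystem.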
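
The relation between sampled tokens and observed types can be illustrated with a small simulation. The following is only a minimal sketch under stated assumptions, not the estimator proposed in the paper: it assumes a hypothetical Zipf-like vocabulary of 500 words and independent sampling, and all function names are introduced here for illustration. Each word type i with probability p_i appears at least once among n tokens with probability q_i = 1 - (1 - p_i)^n, so the expected number of types is the sum of the q_i, and the Poisson binomial approximation treats these appearance indicators as independent.

```python
import random


def zipf_probs(num_types, exponent=1.0):
    """Hypothetical Zipf-like word probabilities (an assumption for this demo)."""
    weights = [1.0 / (rank ** exponent) for rank in range(1, num_types + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def appearance_probs(probs, n_tokens):
    """q_i = 1 - (1 - p_i)^n: chance that type i occurs at least once in n tokens."""
    return [1.0 - (1.0 - p) ** n_tokens for p in probs]


def expected_types(probs, n_tokens):
    """Expected number of distinct types among n tokens (sum of q_i)."""
    return sum(appearance_probs(probs, n_tokens))


def poisson_binomial_variance(probs, n_tokens):
    """Variance under the Poisson binomial approximation: sum of q_i * (1 - q_i)."""
    return sum(q * (1.0 - q) for q in appearance_probs(probs, n_tokens))


def simulate_types(probs, n_tokens, rng):
    """Draw n tokens independently and count the distinct types observed."""
    tokens = rng.choices(range(len(probs)), weights=probs, k=n_tokens)
    return len(set(tokens))


rng = random.Random(0)          # fixed seed so the run is repeatable
probs = zipf_probs(500)         # latent vocabulary of 500 types (assumed)
n_tokens = 200                  # sample size, smaller than the vocabulary

runs = [simulate_types(probs, n_tokens, rng) for _ in range(2000)]
sim_mean = sum(runs) / len(runs)

print("simulated mean number of types:", round(sim_mean, 2))
print("expected number of types:      ", round(expected_types(probs, n_tokens), 2))
print("Poisson binomial variance:     ", round(poisson_binomial_variance(probs, n_tokens), 2))
print("latent number of types:        ", len(probs))
```

In this setting the naïve type count from 200 tokens falls well below the latent vocabulary of 500, which is the gap that an estimator based on the type-versus-token curve is meant to close.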