### Competitive Distribution Estimation: Why is Good-Turing Good

**Abstract:**
Estimating distributions over large alphabets is a fundamental
machine-learning task. Yet no method is known to estimate all
distributions well. For example, add-constant estimators are nearly
min-max optimal but often perform poorly in practice, and practical
estimators such as absolute discounting, Jelinek-Mercer, and
Good-Turing are not known to be near optimal for essentially any
distribution.
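
To make the add-constant family concrete, here is a minimal Python sketch (ours, not the paper's); the alphabet size k, the constant beta, and the toy sample are illustrative assumptions. On a large alphabet with few samples, such an estimator spreads nearly all of its probability mass over unseen symbols.

```python
# Minimal sketch of an add-constant (add-beta) estimator; the alphabet size k,
# beta, and the sample below are illustrative assumptions, not the paper's setup.
from collections import Counter

def add_constant_estimate(samples, k, beta=1.0):
    """Estimate a distribution over symbols 0..k-1; beta=1 is Laplace smoothing."""
    n = len(samples)
    counts = Counter(samples)
    return [(counts.get(x, 0) + beta) / (n + beta * k) for x in range(k)]

# With k = 1000 and only 3 samples, the 998 unseen symbols receive ~99.5% of the mass.
print(sum(add_constant_estimate([0, 0, 1], k=1000)[2:]))
```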

We describe the first universally near-optimal probability estimators.
For every discrete distribution, they are provably nearly the best in
the following two competitive ways. First, they estimate every
distribution nearly as well as the best estimator designed with prior
knowledge of the distribution up to a permutation. Second, they
estimate every distribution nearly as well as the best estimator
designed with prior knowledge of the exact distribution but, like all
natural estimators, restricted to assigning the same probability to all
symbols appearing the same number of times.
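
As a concrete illustration of the second comparison, the sketch below (a toy example under our own assumptions, not the paper's construction) writes the empirical estimator in count-based form, making the defining property of natural estimators explicit: one probability per count, shared by every symbol seen that many times.

```python
# Toy illustration of a "natural" estimator: the estimate of a symbol depends
# only on how many times it appeared. Here the empirical estimator t/n is
# written in that count-based form; the sample is an illustrative assumption.
from collections import Counter

def natural_empirical_estimate(samples):
    """Map each observed symbol to the probability shared by all symbols with its count."""
    n = len(samples)
    counts = Counter(samples)                             # symbol -> count t
    prob_of_count = {t: t / n for t in counts.values()}   # count t -> probability t/n
    return {symbol: prob_of_count[t] for symbol, t in counts.items()}

est = natural_empirical_estimate(["a", "a", "b", "c"])
assert est["b"] == est["c"]  # both appeared once, so a natural estimator treats them identically
```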

Specifically, for distributions over k symbols and n samples, we
show that for both comparisons, a simple variant of the Good-Turing
estimator is always within a KL divergence of (3+o(1))/n^{1/3} from
the best estimator, and that a more involved estimator is within
Õ(min(k/n, 1/√n)). Conversely, we show that
any estimator must have a KL divergence of at least
Ω̃(min(k/n, 1/n^{2/3})) over the best estimator for
the first comparison, and at least
Ω̃(min(k/n, 1/√n)) for the second.
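
For readers unfamiliar with the estimator itself, the sketch below implements the classical Good-Turing rule: a symbol seen t times is assigned probability (t+1)·Φ_{t+1}/(n·Φ_t), where Φ_t is the number of distinct symbols seen exactly t times. This is only a hedged illustration of the basic rule; the simple variant analyzed above differs in its details (for instance, in how it treats frequent symbols and empty count levels), and the fallback below is our own assumption.

```python
# Hedged sketch of the classical Good-Turing rule, not the paper's exact variant.
# A symbol with count t gets probability (t+1) * Phi_{t+1} / (n * Phi_t), where
# Phi_t is the number of distinct symbols appearing exactly t times.
from collections import Counter

def good_turing_estimate(samples):
    n = len(samples)
    counts = Counter(samples)           # symbol -> count t
    phi = Counter(counts.values())      # count t -> number of symbols with that count
    estimate = {}
    for symbol, t in counts.items():
        if phi.get(t + 1, 0) > 0:
            estimate[symbol] = (t + 1) * phi[t + 1] / (n * phi[t])
        else:
            estimate[symbol] = t / n    # assumed fallback when no symbol has count t+1
    return estimate

print(good_turing_estimate(list("abracadabra")))
```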