Part of Advances in Neural Information Processing Systems 11 (NIPS 1998)
Matthew Brand
We introduce a novel framework for simultaneous structure and parameter learning in hidden-variable conditional probability models, based on an en tropic prior and a solution for its maximum a posteriori (MAP) estimator. The MAP estimate minimizes uncertainty in all respects: cross-entropy between model and data; entropy of the model ; entropy of the data's descriptive statistics. Iterative estimation extinguishes weakly supported parameters, compressing and sparsifying the model. Trimming operators accelerate this process by removing excess parameters and, unlike most pruning schemes, guarantee an increase in posterior probability. Entropic estimation takes a overcomplete random model and simplifies it, inducing the structure of relations between hidden and observed variables. Applied to hidden Markov models (HMMs), it finds a concise finite-state machine representing the hidden structure of a signal. We entropically model music, handwriting, and video time-series, and show that the resulting models are highly concise, structured, predictive, and interpretable: Surviving states tend to be highly correlated with meaningful partitions of the data, while surviving transitions provide a low-perplexity model of the signal dynamics.
1 . An entropic prior In entropic estimation we seek to maximize the information content of parameters. For conditional probabilities, parameters values near chance add virtually no information to the model, and are therefore wasted degrees of freedom. In contrast, parameters near the extrema {O, I} are informative because they impose strong constrĀ·aints on the class of signals accepted by the model. In Bayesian terms, our prior should assert that parameters that do not reduce uncertainty are improbable. We can capture this intuition in a surprisingly simple form: For a model of N conditional probabilities 9 = {(h , . . . , () N } we write
(1)
whence we can see that the prior measures a model's freedom from ambiguity (H(9) is an entropy measure). Applying Pe (.) to a multinomial yields the posterior
p (LlI) P(wI9)Pe (9) e U W