This distribution contains some of the data used in:

@inproceedings{wallach-murray-rsalakhu-mimno-2009,
    title  = {Evaluation Methods for Topic Models},
    author = {Hanna M. Wallach and Iain Murray and Ruslan Salakhutdinov and David Mimno},
    booktitle = {Proceedings of the 26th International Conference on Machine Learning (ICML)},
    editor = {L\'{e}on Bottou and Michael Littman},
    pages = {1105--1112},
    year = {2009},
    address = {Montreal},
    month = {June},
    publisher = {Omnipress}
}


Each directory contains at least three files:

documents.txt.gz
    Each line contains the words of one document.

document_topic_prior.txt.gz
    The parameters of a Dirichlet prior on topic distributions.

topic_word_distributions.txt.gz
    Values *proportional* to the topic-word distributions.
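
As a rough sketch of how these files might be read in Python: the helper
names below are ours, not part of this distribution, and we assume UTF-8 text
with whitespace-separated words. Because the topic-word values are only
proportional to probabilities, each topic's values need rescaling to sum to
one; which values belong to one topic depends on the file layout, so check
the file before relying on this.

```python
import gzip


def read_documents(path):
    """Return a list of documents, each a list of word tokens.

    Assumption: one document per line, words separated by whitespace,
    UTF-8 encoded.
    """
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [line.split() for line in f]


def normalize(values):
    """Rescale non-negative values so that they sum to 1."""
    total = sum(values)
    if total <= 0:
        raise ValueError("values must have a positive sum")
    return [v / total for v in values]


# Example use (paths are relative to one of the data directories):
#   docs = read_documents("documents.txt.gz")
#   probs = normalize(values_for_one_topic)
```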


The 3-topic and 50-topic directories contain synthetic data. The documents were
simulated from the given LDA model, which was estimated from a collection
of about 1200 ICML/NIPS abstracts. These directories also contain
distributions.txt.gz, which gives the actual topic probabilities that were
used when simulating each document.
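
For readers unfamiliar with how documents are simulated from an LDA model,
here is a minimal sketch of the standard generative process. It is not the
code used to produce these datasets: the function names are ours, words are
represented by integer ids, and the document length is simply passed in as
a parameter.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def simulate_document(alpha, topic_word, length, rng):
    """Simulate one document from an LDA model.

    alpha:      Dirichlet parameters, one per topic.
    topic_word: one list of non-negative weights per topic, indexed by
                word id (weights only need to be proportional to
                probabilities).
    length:     number of words to generate (an assumption of this sketch).
    Returns the document's topic probabilities and its list of word ids.
    """
    theta = sample_dirichlet(alpha, rng)
    topics = range(len(alpha))
    vocab = range(len(topic_word[0]))
    words = []
    for _ in range(length):
        z = rng.choices(topics, weights=theta)[0]          # pick a topic
        w = rng.choices(vocab, weights=topic_word[z])[0]   # pick a word
        words.append(w)
    return theta, words
```

The returned theta plays the role of the per-document topic probabilities
described above.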

For convenience we've provided .mat files, in which the words have been mapped
to unique integers for ease of handling in Matlab/Octave (or anything else that
can read .mat files, such as scipy). We're afraid we can't provide support for
the .mat files if you have issues with them: the (gzipped) text files are the
canonical version of the datasets.
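
If you would rather work from the canonical text files, a word-to-integer
mapping can be rebuilt along these lines. This is only a sketch: the function
name is ours, ids are assigned from 0 in order of first appearance, and they
will not necessarily match the ids used in the provided .mat files.

```python
import gzip


def build_vocabulary(path):
    """Map each distinct word to an integer id and encode the documents.

    Assumption: one document per line, words separated by whitespace,
    UTF-8 encoded. Ids start at 0 and follow order of first appearance.
    """
    vocab = {}
    encoded = []
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            # setdefault assigns the next free id to a word on first sight.
            encoded.append([vocab.setdefault(w, len(vocab)) for w in line.split()])
    return vocab, encoded


# Example use:
#   vocab, encoded = build_vocabulary("documents.txt.gz")
```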


We do not have permission to distribute the NYT data. Sorry.