This distribution contains some of the data used in:

@inproceedings{wallach-murray-rsalakhu-mimno-2009,
  title = {Evaluation Methods for Topic Models},
  author = {Hanna M. Wallach and Iain Murray and Ruslan Salakhutdinov and David Mimno},
  booktitle = {Proceedings of the 26th International Conference on Machine Learning (ICML)},
  editor = {L\'{e}on Bottou and Michael Littman},
  pages = {1105--1112},
  year = {2009},
  address = {Montreal},
  month = {June},
  publisher = {Omnipress}
}

Each directory contains at least three files:

documents.txt.gz
    Each line contains the words of one document.

document_topic_prior.txt.gz
    The parameters of a Dirichlet prior on topic distributions.

topic_word_distributions.txt.gz
    Values *proportional* to the topic-word distributions.

The 3-topic and 50-topic directories contain synthetic data. The documents
were simulated from the given LDA model, which was estimated from a
collection of about 1200 ICML/NIPS abstracts. These directories also contain
distributions.txt.gz, which gives the actual topic probabilities that were
used when simulating each document.

For convenience we've provided .mat files, in which the words have been
mapped to unique integers for ease of handling in Matlab/Octave (or anything
else that can read .mat files, like scipy). If you have any issues with the
.mat files, I'm afraid we can't provide support for them: the (gzipped) text
files are the canonical version of the datasets.

We do not have permission to distribute the NYT data. Sorry.
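
Since the gzipped text files are the canonical version, here is a minimal
Python sketch for reading them. It is not part of the original distribution.
The function names are hypothetical, and the sketch assumes that words and
numbers are separated by whitespace and that each line of
topic_word_distributions.txt.gz holds the unnormalized weights for one topic.
Neither assumption is stated above, so check the files before relying on it.
Because the stored values are only proportional to the topic-word
distributions, each row is rescaled to sum to 1.

```python
import gzip


def read_documents(path):
    """Read a gzipped text file with one document per line.

    Assumption: words are separated by whitespace.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return [line.split() for line in handle]


def read_numeric_rows(path):
    """Read a gzipped text file of whitespace-separated numbers.

    Assumption: one row of values per line; blank lines are skipped.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return [[float(x) for x in line.split()]
                for line in handle if line.strip()]


def normalize_rows(rows):
    """Rescale each row of non-negative values so that it sums to 1."""
    normalized = []
    for row in rows:
        total = sum(row)
        if total <= 0:
            raise ValueError("each row must have a positive sum")
        normalized.append([value / total for value in row])
    return normalized
```

For example, normalize_rows(read_numeric_rows("topic_word_distributions.txt.gz"))
would give one probability distribution per line of the file. If the file
turns out to be laid out with one word per line instead, the columns would
need to be normalized rather than the rows.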