This README briefly describes each file in the /wiki_data directory. The data was collected from Wikipedia on September 23, 2009, using a web crawler.

sb_train_wid.txt
A vector whose entries are the word i.d.s for the word tokens in the train set. 

sb_train_did.txt
A vector whose entries are the document i.d.s for the word tokens in the train set. There are 416 training documents numbered from 1 to 416.

sb_test_wid.txt
A vector whose entries are the word i.d.s for the word tokens in the test set.

sb_test_did.txt
A vector whose entries are the document i.d.s for the word tokens in the test set. There are 104 test documents numbered from 1 to 104.
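
The word i.d. and document i.d. vectors appear to be parallel, with one entry per word token, so pairing them position by position recovers the token sequence of each document. The following is a minimal Python sketch of that pairing, not part of the original data release. It assumes each file holds whitespace-separated integers; the helper names parse_ids and group_tokens_by_document are hypothetical.

```python
from collections import defaultdict


def parse_ids(text):
    """Parse whitespace-separated integer i.d.s (assumed file layout)."""
    # int(float(...)) also accepts values written as "3.0" or "3.0e+00".
    return [int(float(token)) for token in text.split()]


def group_tokens_by_document(word_ids, doc_ids):
    """Return {document i.d.: [word i.d., ...]}, preserving token order."""
    if len(word_ids) != len(doc_ids):
        raise ValueError("word and document i.d. vectors differ in length")
    documents = defaultdict(list)
    for word_id, doc_id in zip(word_ids, doc_ids):
        documents[doc_id].append(word_id)
    return dict(documents)


# Hypothetical usage against the real files (paths assumed relative to /wiki_data):
# with open("sb_train_wid.txt") as f_wid, open("sb_train_did.txt") as f_did:
#     train_docs = group_tokens_by_document(parse_ids(f_wid.read()),
#                                           parse_ids(f_did.read()))
```

If these assumptions hold, grouping the train files should give 416 documents and grouping the test files should give 104.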

test_document_indices.txt
A vector whose entries are the global i.d.s for the documents assigned to the test set. The nth entry is the global document i.d. for the nth test document. These global document i.d.s align with the titles in the file titles.Machine_learning.txt.

train_document_indices.txt
A vector whose entries are the global i.d.s for the documents assigned to the training set. The nth entry is the global document i.d. for the nth training document. These global document i.d.s align with the titles in the file titles.Machine_learning.txt.

titles.Machine_learning.txt
The titles of the Wikipedia articles.
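
To find the title of the nth train or test document, the global i.d. from the matching *_document_indices.txt file can be used to look up a title. The Python sketch below illustrates this under two assumptions that this README does not confirm: the titles file has one title per line, and global i.d.s count lines starting from 1. The function name titles_for_split and the first_id parameter are hypothetical.

```python
def titles_for_split(global_ids, titles, first_id=1):
    """Return the title of each document in a split, in split order.

    global_ids: entries of train_document_indices.txt or test_document_indices.txt
    titles:     lines of titles.Machine_learning.txt (assumed one title per line)
    first_id:   the i.d. of the first title (assumed to be 1)
    """
    result = []
    for global_id in global_ids:
        position = global_id - first_id
        # Guard explicitly: a negative list index would silently wrap around.
        if not 0 <= position < len(titles):
            raise IndexError(f"global i.d. {global_id} has no matching title")
        result.append(titles[position])
    return result
```

With these assumptions, titles_for_split(test_ids, titles)[n - 1] would be the title of test document n. If the i.d.s turn out to start from 0, pass first_id=0.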

vocab.txt
The words in the vocabulary. 
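
Word i.d.s can be turned back into words by looking them up in the vocabulary. The Python sketch below counts word occurrences for a list of word i.d.s. It assumes vocab.txt has one word per line and that word i.d.s start from 1, neither of which is stated in this README; the name word_counts and the first_id parameter are hypothetical.

```python
from collections import Counter


def word_counts(word_ids, vocab, first_id=1):
    """Count how often each vocabulary word occurs among the given word i.d.s.

    vocab:    lines of vocab.txt (assumed one word per line)
    first_id: the i.d. of the first vocabulary word (assumed to be 1)
    """
    counts = Counter()
    for word_id in word_ids:
        position = word_id - first_id
        # Guard explicitly: a negative list index would silently wrap around.
        if not 0 <= position < len(vocab):
            raise IndexError(f"word i.d. {word_id} is outside the vocabulary")
        counts[vocab[position]] += 1
    return counts
```

Calling counts.most_common(10) on the result lists the ten most frequent words, which is a quick way to check that the i.d. offset is right.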

Wikipedia-20090923220021.xml
The full text of the Wikipedia articles.