I implemented the EM algorithm [2] for Probabilistic Latent Semantic Indexing (pLSI) [1] in Python.
Usage Example:
python pLSI.py input-file-name number-of-topics maximum-number-of-iterations log-likelihood-difference-threshold
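To illustrate how the last three arguments could drive the fitting loop, here is a minimal pure-Python sketch of EM for pLSI under the asymmetric parameterization P(w|d) = sum_z P(z|d) P(w|z). It is not the code in pLSI.py: the function name plsi_em, its parameters, and the random initialization are assumptions made for this example, and the weights are assumed to be positive.

```python
import math
import random

def plsi_em(counts, num_topics, max_iters, tol, seed=0):
    """Illustrative EM for pLSI (hypothetical helper, not the pLSI.py code).

    counts: {doc_id: {word_id: positive weight}}
    Returns (p_z_d, p_w_z, history), where history holds the log-likelihood
    of the parameters used in each E-step.
    """
    rng = random.Random(seed)
    docs = sorted(counts)
    words = sorted({w for row in counts.values() for w in row})
    topics = range(num_topics)

    def random_dist(keys):
        # Strictly positive random values, normalized to sum to 1.
        raw = {k: rng.random() + 1e-3 for k in keys}
        total = sum(raw.values())
        return {k: v / total for k, v in raw.items()}

    p_z_d = {d: random_dist(topics) for d in docs}   # P(z | d)
    p_w_z = {z: random_dist(words) for z in topics}  # P(w | z)

    history = []
    for _ in range(max_iters):
        acc_z_d = {d: dict.fromkeys(topics, 0.0) for d in docs}
        acc_w_z = {z: dict.fromkeys(words, 0.0) for z in topics}
        log_lik = 0.0
        # E-step: posterior over topics for each observed (d, w) pair,
        # accumulated into weighted expected counts.
        for d, row in counts.items():
            for w, n in row.items():
                joint = [p_z_d[d][z] * p_w_z[z][w] for z in topics]
                total = sum(joint)
                if total <= 0.0:
                    continue  # guard against numerical underflow
                log_lik += n * math.log(total)
                for z in topics:
                    r = n * joint[z] / total
                    acc_z_d[d][z] += r
                    acc_w_z[z][w] += r
        # M-step: renormalize the expected counts.
        for d in docs:
            s = sum(acc_z_d[d].values())
            if s > 0.0:
                p_z_d[d] = {z: v / s for z, v in acc_z_d[d].items()}
        for z in topics:
            s = sum(acc_w_z[z].values())
            if s > 0.0:
                p_w_z[z] = {w: v / s for w, v in acc_w_z[z].items()}
        history.append(log_lik)
        # Stop early once the log-likelihood change falls below the threshold.
        if len(history) > 1 and abs(history[-1] - history[-2]) < tol:
            break
    return p_z_d, p_w_z, history

# Toy data: three documents over three word IDs (made-up weights).
toy = {0: {0: 2.0, 1: 1.0}, 1: {1: 1.0, 2: 3.0}, 2: {0: 1.0, 2: 1.0}}
p_z_d, p_w_z, history = plsi_em(toy, num_topics=2, max_iters=50, tol=1e-6)
```

In this sketch the loop ends either after max_iters iterations or as soon as two consecutive log-likelihood values differ by less than tol, whichever comes first.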
Input File Format:
The input file format is simple. Each line should have the following format:
document-ID word-ID TFIDF-value
where document-ID and word-ID are integers and TFIDF-value is a float.
The third field does not have to be a TF-IDF weight; any numeric value, e.g., term frequency, works.
These three fields in each line can be separated by spaces or tabs.
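A reader for this format could be sketched as follows. The function name parse_triples and the nested-dictionary layout are assumptions made for illustration, as is the choice to skip blank lines; pLSI.py may store the data differently.

```python
from collections import defaultdict

def parse_triples(lines):
    """Parse 'document-ID word-ID value' lines into {doc_id: {word_id: value}}.

    Hypothetical helper for illustration only.
    """
    weights = defaultdict(dict)
    for line_number, line in enumerate(lines, start=1):
        fields = line.split()  # splits on any run of spaces or tabs
        if not fields:
            continue  # assumption: blank lines are ignored
        if len(fields) != 3:
            raise ValueError(
                f"line {line_number}: expected 3 fields, got {len(fields)}"
            )
        doc_id, word_id = int(fields[0]), int(fields[1])
        weights[doc_id][word_id] = float(fields[2])
    return dict(weights)

# Made-up sample lines mixing space and tab separators.
sample = ["1 10 0.5", "1\t11\t2", "2 10 1.25"]
parsed = parse_triples(sample)
print(parsed)  # {1: {10: 0.5, 11: 2.0}, 2: {10: 1.25}}
```

Because str.split() with no argument treats any run of whitespace as a separator, the same code handles both space-separated and tab-separated files. To read from disk, pass an open file object instead of the list.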
Python Code:
pLSI.py
References:
[1] Thomas Hofmann. Probabilistic Latent Semantic Indexing. SIGIR. 1999.
[2] Das et al. Google News Personalization: Scalable Online Collaborative Filtering. WWW. 2007.