On Github chdoig / pytexas2015-topic-modeling
Free Python distribution: Anaconda
Open source: conda, blaze, dask, bokeh, numba...
Proud sponsor of PyTexas, PyData, SciPy, PyCon, Europython...
We are hiring!
Building the NYT Recommendation Engine: from keywords, via collaborative filtering, to Collaborative Topic Modeling
Iterative algorithm
Initialize parameters
Initialize topic assignments randomly
Iterate

Hard: unsupervised learning. No labels.
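The loop above can be sketched as a toy collapsed Gibbs sampler for LDA. This is an illustrative implementation, not the talk's code; the corpus, hyperparameters, and function name are all made up for the example:

```python
import random
from collections import defaultdict

def gibbs_lda(docs, n_topics, alpha=0.1, beta=0.1, n_iters=50, seed=1):
    """Toy collapsed Gibbs sampler. docs is a list of lists of word ids."""
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})            # vocabulary size
    doc_topic = [defaultdict(int) for _ in docs]      # n(d, k)
    topic_word = [defaultdict(int) for _ in range(n_topics)]  # n(k, w)
    topic_total = [0] * n_topics                      # n(k)

    # Initialize topic assignments randomly
    z = []
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        z.append(zs)

    # Iterate: resample each token's topic from its conditional distribution
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # remove the token's current assignment from the counts
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # p(topic t | everything else), up to a constant
                weights = [
                    (doc_topic[d][t] + alpha) *
                    (topic_word[t][w] + beta) / (topic_total[t] + V * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return z, doc_topic, topic_word

docs = [[0, 0, 1, 1], [2, 2, 3, 3], [0, 1, 2, 3]]
z, dt, tw = gibbs_lda(docs, n_topics=2)
```

After enough iterations the per-document and per-topic counts approximate the document-topic and topic-word distributions; production libraries (gensim, graphlab, lda, scikit-learn) use far more efficient samplers or variational inference.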
Human-in-the-loop
Metrics
More Metrics [1]:
Size (# of tokens assigned)
Within-doc rank
Similarity to corpus-wide distribution
Locally-frequent words
Co-doc coherence

Warning: in current scikit-learn, LDA refers to Linear Discriminant Analysis!
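The first and fourth metrics fall straight out of the token-topic assignments. A minimal sketch, assuming we already have (word, topic) pairs from a fitted model; the assignments here are made up for illustration:

```python
from collections import Counter

# hypothetical token->topic assignments from a fitted model
assignments = [
    ("python", 0), ("conda", 0), ("python", 0),
    ("topic", 1), ("model", 1), ("lda", 1), ("topic", 1),
]

# Size: number of tokens assigned to each topic
size = Counter(topic for _, topic in assignments)

# Locally-frequent words: most common words within each topic
topic_words = {}
for word, topic in assignments:
    topic_words.setdefault(topic, Counter())[word] += 1

print(size)
print({t: c.most_common(2) for t, c in topic_words.items()})
```

A very small topic size, or locally-frequent words that mirror the corpus-wide distribution, are both signs of a junk topic.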
import gensim

# load id->word mapping (the dictionary)
id2word = gensim.corpora.Dictionary.load_from_text('wiki_en_wordids.txt')
# load corpus iterator
mm = gensim.corpora.MmCorpus('wiki_en_tfidf.mm')
# extract 100 LDA topics, using 20 full passes (batch mode, no online updates)
lda = gensim.models.ldamodel.LdaModel(corpus=mm, id2word=id2word,
                                      num_topics=100, update_every=0, passes=20)
import graphlab as gl

docs = gl.SArray('http://s3.amazonaws.com/dato-datasets/nytimes')
m = gl.topic_model.create(docs,
                          num_topics=20,       # number of topics
                          num_iterations=10,   # algorithm parameters
                          alpha=.01, beta=.1)  # hyperparameters
import lda

X = lda.datasets.load_reuters()
model = lda.LDA(n_topics=20, n_iter=1500, random_state=1)
model.fit(X)  # model.fit_transform(X) is also available
from sklearn.decomposition import NMF, LatentDirichletAllocation

X = ...
lda = LatentDirichletAllocation(n_topics=n_topics, max_iter=5, random_state=0)
lda.fit(X)
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

# Initialize variables
n_samples = 2000
n_features = 1000
n_topics = 10

dataset = fetch_20newsgroups(shuffle=True, random_state=1,
                             remove=('headers', 'footers', 'quotes'))
data_samples = dataset.data[:n_samples]

# use tf (raw term count) features for the LDA model
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                max_features=n_features,
                                stop_words='english')
tf = tf_vectorizer.fit_transform(data_samples)

lda = LatentDirichletAllocation(n_topics=n_topics, max_iter=5,
                                learning_method='online',
                                learning_offset=50., random_state=0)
lda.fit(tf)
Slides: http://chdoig.github.com/pytexas2015-topic-modeling
Email: christine.doig@continuum.io
Twitter: ch_doig