On GitHub: chdoig/pytexas2015-topic-modeling
Free Python distribution: Anaconda
Open source: conda, blaze, dask, bokeh, numba...
Proud sponsor of PyTexas, PyData, SciPy, PyCon, Europython...
We are hiring!
Building the NYT Recommendation Engine: From keywords over collaborative filtering to Collaborative Topic Modeling
Iterative algorithm
Initialize parameters
Initialize topic assignments randomly
Iterate
Hard: Unsupervised learning. No labels.
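The iterate step above can be sketched as a collapsed Gibbs sampler on a toy corpus. Everything here (the two-document corpus, the hyperparameter values) is illustrative, not the exact algorithm or data from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
docs = [[0, 1, 2, 1], [2, 3, 3, 0]]   # toy corpus: word ids per document
K, V = 2, 4                            # number of topics, vocabulary size
alpha, beta = 0.1, 0.01                # Dirichlet hyperparameters

# Initialize topic assignments randomly, then build count tables
z = [[rng.integers(K) for _ in doc] for doc in docs]
ndk = np.zeros((len(docs), K))         # doc-topic counts
nkw = np.zeros((K, V))                 # topic-word counts
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        ndk[d, z[d][i]] += 1
        nkw[z[d][i], w] += 1

# Iterate: resample each token's topic from its conditional distribution
for _ in range(50):
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d, k] -= 1
            nkw[k, w] -= 1
            p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nkw.sum(axis=1) + V * beta)
            k = rng.choice(K, p=p / p.sum())
            z[d][i] = k
            ndk[d, k] += 1
            nkw[k, w] += 1
```

Because there are no labels, the only signal is co-occurrence: the sampler repeatedly reassigns each token to a topic in proportion to how popular that topic is in the document and how strongly it is associated with that word.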
Human-in-the-loop
Metrics
More Metrics [1]:
Size (# of tokens assigned)
Within-doc rank
Similarity to corpus-wide distribution
Locally-frequent words
Co-doc coherence
Warning: Current LDA in scikit-learn refers to Linear Discriminant Analysis!
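Several of these diagnostics fall out directly from the topic-word count matrix. A minimal sketch, assuming a toy 2-topic, 4-word counts array (the matrix, vocabulary, and variable names are all illustrative):

```python
import numpy as np

# Toy topic-word count matrix: rows = topics, columns = vocabulary words
nkw = np.array([[10, 2, 0, 1],
                [1, 0, 8, 5]], dtype=float)
vocab = ["cat", "dog", "stock", "market"]

# Size: number of tokens assigned to each topic
size = nkw.sum(axis=1)

# Similarity to the corpus-wide word distribution (cosine similarity);
# a topic that just mirrors the whole corpus is rarely informative
corpus_dist = nkw.sum(axis=0) / nkw.sum()
topic_dist = nkw / size[:, None]
cos = (topic_dist @ corpus_dist) / (
    np.linalg.norm(topic_dist, axis=1) * np.linalg.norm(corpus_dist))

# Locally-frequent words: top-ranked words within each topic
top_words = [[vocab[i] for i in row.argsort()[::-1][:2]] for row in nkw]
print(size, top_words)  # -> [13. 14.] [['cat', 'dog'], ['stock', 'market']]
```

Co-doc coherence needs the document-level assignments as well (it checks how often a topic's top words actually co-occur in documents), so it is omitted here.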
import gensim
# load id->word mapping (the dictionary)
id2word = gensim.corpora.Dictionary.load_from_text('wiki_en_wordids.txt')
# load corpus iterator
mm = gensim.corpora.MmCorpus('wiki_en_tfidf.mm')
# extract 100 LDA topics, using 20 full passes, (batch mode) no online updates
lda = gensim.models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=100, update_every=0, passes=20)
import graphlab as gl
docs = gl.SArray('http://s3.amazonaws.com/dato-datasets/nytimes')
m = gl.topic_model.create(docs,
num_topics=20, # number of topics
num_iterations=10, # algorithm parameters
alpha=.01, beta=.1) # hyperparameters
import lda
X = lda.datasets.load_reuters()
model = lda.LDA(n_topics=20, n_iter=1500, random_state=1)
model.fit(X)  # model.fit_transform(X) is also available
from sklearn.decomposition import NMF, LatentDirichletAllocation
X = ...
lda = LatentDirichletAllocation(n_topics=n_topics, max_iter=5,
                                random_state=0)
lda.fit(X)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.datasets import fetch_20newsgroups
# Initialize variables
n_samples = 2000
n_features = 1000
n_topics = 10
dataset = fetch_20newsgroups(shuffle=True, random_state=1,
remove=('headers', 'footers', 'quotes'))
data_samples = dataset.data[:n_samples]
# use tf feature for LDA model
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=n_features,
stop_words='english')
tf = tf_vectorizer.fit_transform(data_samples)
lda = LatentDirichletAllocation(n_topics=n_topics, max_iter=5,
learning_method='online', learning_offset=50.,
random_state=0)
lda.fit(tf)
Slides: http://chdoig.github.com/pytexas2015-topic-modeling
Email: christine.doig@continuum.io
Twitter: ch_doig