Introduction to Topic Modeling in Python

PyTexas 2015

by Christine Doig

Introduction

About me

Data Scientist at Continuum Analytics

Barcelona & Austin

http://chdoig.github.com

@ch_doig

About Continuum Analytics

Free Python distribution: Anaconda

Open source: conda, blaze, dask, bokeh, numba...

Proud sponsor of PyTexas, PyData, SciPy, PyCon, Europython...

We are hiring!

http://continuum.io

About this talk

Introduction
Topic Modeling
LDA
Algorithm
Python libraries
Pipelines
Other algorithms
Additional resources

http://chdoig.github.com/pytexas2015-topic-modeling

Topic Modeling

Topic Modeling Applications

Building the NYT Recommendation Engine: From keywords over collaborative filtering to Collaborative Topic Modeling

Definitions

A topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents [1]

Topic models are a suite of algorithms that uncover the hidden thematic structure in document collections. These algorithms help us develop new ways to search, browse and summarize large archives of texts [2]

Topic models provide a simple way to analyze large volumes of unlabeled text. A "topic" consists of a cluster of words that frequently occur together [3]

[1] - http://en.wikipedia.org/wiki/Topic_model
[2] - http://www.cs.princeton.edu/~blei/topicmodeling.html
[3] - http://mallet.cs.umass.edu/topics.php

Characteristics

Diagram

LDA

LDA vs LDA

LDA Plate notation

http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation

Parameters and variables

Understanding LDA

LDA algorithm

Iterative algorithm

Initialize parameters
Initialize topic assignments randomly
Iterate:
    For each word in each document:
        Resample topic for word, given all other words and their current topic assignments
Get results
Evaluate model
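The loop above reads like collapsed Gibbs sampling, so here is a minimal pure-Python sketch under that assumption. The function name `gibbs_lda`, the symmetric priors `alpha` and `beta`, and the toy corpus of integer word ids are all illustrative choices, not code from the talk or from any of the libraries mentioned later.

```python
import random

def gibbs_lda(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, n_iter=50, seed=0):
    """Toy collapsed Gibbs sampler for LDA (illustrative sketch only)."""
    rng = random.Random(seed)
    # Initialize parameters: count tables for topics per document,
    # words per topic, and total tokens per topic
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(n_topics)]
    topic_total = [0] * n_topics

    # Initialize topic assignments randomly
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(n_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    # Iterate
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove this word's current assignment from the counts
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Weight of each topic given all other words and assignments
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                # Resample the topic and put the word back into the counts
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    # Get results: raw counts, which can be normalized into distributions
    return doc_topic, topic_word

# Hypothetical toy corpus: four documents over a six-word vocabulary
docs = [[0, 1, 0, 2], [3, 4, 3, 5], [0, 2, 1, 1], [4, 5, 3, 3]]
doc_topic, topic_word = gibbs_lda(docs, n_topics=2, vocab_size=6)
```

The returned tables are counts; adding the priors and normalizing each row would give document-topic and topic-word distributions. Real implementations add convergence checks and far more efficient sampling.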

Initialize parameters

Initialize topic assignments randomly

Iterate

Resample topic for word, given all other words and their current topic assignments

Which topics occur in this document? Which topics like the word X?
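The two questions above map onto the two factors of the resampling weight in collapsed Gibbs sampling. The formula below assumes that sampler and symmetric priors; the notation is introduced here and is not taken from the slides. n_{d,k} counts words in document d assigned to topic k, n_{k,w} counts assignments of word w to topic k, n_k is the total number of words assigned to topic k, V is the vocabulary size, and the superscript means the current word is excluded from the counts.

```latex
p(z_{d,i} = k \mid \mathbf{z}_{-(d,i)}, \mathbf{w})
\;\propto\;
\underbrace{\left(n_{d,k}^{-(d,i)} + \alpha\right)}_{\text{which topics occur in this document?}}
\times
\underbrace{\frac{n_{k,w}^{-(d,i)} + \beta}{n_{k}^{-(d,i)} + V\beta}}_{\text{which topics like the word } w\text{?}}
```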

Get results

Evaluate model

Hard: Unsupervised learning. No labels.

Human-in-the-loop

Word intrusion [1]: For each trained topic, take first ten words, substitute one of them with another, randomly chosen word (intruder!) and see whether a human can reliably tell which one it was. If so, the trained topic is topically coherent (good); if not, the topic has no discernible theme (bad) [2]

Topic intrusion: Subjects are shown the title and a snippet from a document. Along with the document they are presented with four topics. Three of those topics are the highest probability topics assigned to that document. The remaining intruder topic is chosen randomly from the other low-probability topics in the model [1]

[1] - http://www.umiacs.umd.edu/~jbg/docs/nips2009-rtl.pdf
[2] - http://radimrehurek.com/topic_modeling_tutorial/2%20-%20Topic%20Modeling.html
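As a rough illustration of the word intrusion setup, the sketch below builds one question for a human rater. The function name, the example topic, and the pool of candidate intruders are hypothetical; the cited paper selects intruders more carefully than a uniform random draw.

```python
import random

def word_intrusion_task(topic_words, other_words, n_top=10, seed=0):
    """Build one word-intrusion question: the top words with one replaced."""
    rng = random.Random(seed)
    top = list(topic_words[:n_top])
    # Candidate intruders must not already be among the topic's top words
    candidates = [w for w in other_words if w not in top]
    intruder = rng.choice(candidates)
    # Substitute one of the top words with the intruder
    top[rng.randrange(len(top))] = intruder
    # Shuffle so the position gives nothing away
    rng.shuffle(top)
    return top, intruder

# Hypothetical topic (most probable words first) and intruder pool
topic = ["game", "team", "season", "player", "coach",
         "league", "score", "win", "play", "ball"]
others = ["market", "stock", "bank", "price"]
shown, intruder = word_intrusion_task(topic, others)
```

If raters pick out `intruder` from `shown` much more often than chance, the topic is probably coherent.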

Evaluate model

Metrics

Cosine similarity: split each document into two parts, and check that topics of the first half are similar to topics of the second half, while halves of different documents are mostly dissimilar [1]

[1] - http://radimrehurek.com/topic_modeling_tutorial/2%20-%20Topic%20Modeling.html
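A minimal sketch of that split-half check, assuming the topic distribution of each document half has already been inferred by some model. The helper names and the toy distributions are made up for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def split_half_scores(first_halves, second_halves):
    """Average similarity of halves from the same document vs. different documents."""
    n = len(first_halves)
    same = [cosine(first_halves[i], second_halves[i]) for i in range(n)]
    cross = [cosine(first_halves[i], second_halves[j])
             for i in range(n) for j in range(n) if i != j]
    return sum(same) / len(same), sum(cross) / len(cross)

# Hypothetical topic distributions for the two halves of two documents
first = [[0.9, 0.1], [0.2, 0.8]]
second = [[0.8, 0.2], [0.1, 0.9]]
same_avg, cross_avg = split_half_scores(first, second)
```

A model that captures the themes well should give a `same_avg` clearly above `cross_avg`.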

Evaluate model

More Metrics [1]:

Size (# of tokens assigned)
Within-doc rank
Similarity to corpus-wide distribution
Locally-frequent words
Co-doc Coherence

[1] - http://mimno.infosci.cornell.edu/slides/details.pdf
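To make one of these metrics concrete, here is a sketch of a co-document coherence score: for each pair of a topic's top words, it takes the log of how often they appear in the same document (plus one) relative to how often the higher-ranked word appears at all. That this formulation matches the "Co-doc Coherence" entry in the cited slides is an assumption, and the function name and toy documents are illustrative.

```python
import math

def codoc_coherence(top_words, documents):
    """Sum of log((co-document frequency + 1) / document frequency) over word pairs."""
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents containing all of the given words
        return sum(1 for s in doc_sets if all(w in s for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            denom = doc_freq(top_words[l])
            if denom == 0:
                continue  # skip words that never occur, avoiding division by zero
            score += math.log((doc_freq(top_words[m], top_words[l]) + 1) / denom)
    return score

# Hypothetical tokenized documents
documents = [["cat", "dog", "pet"], ["cat", "dog"],
             ["stock", "market"], ["stock", "bank", "market"]]
coherent = codoc_coherence(["cat", "dog"], documents)
mixed = codoc_coherence(["cat", "stock"], documents)
```

Words that share documents score higher (closer to zero or above) than words that never co-occur.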

Python libraries

Gensim: https://radimrehurek.com/gensim/
Graphlab: https://dato.com/products/create/docs/generated/graphlab.topic_model.create.html
lda: http://pythonhosted.org//lda/
sklearn: LatentDirichletAllocation: COMING SOON in 0.17!

Warning: Current LDA in scikit-learn refers to Linear Discriminant Analysis!

[1] - https://de.dariah.eu/tatom/topic_model_python.html

[2] - http://scikit-learn.org/stable/auto_examples/applications/topics_extraction_with_nmf.html

Gensim

import gensim
# load id->word mapping (the dictionary)
id2word = gensim.corpora.Dictionary.load_from_text('wiki_en_wordids.txt')
# load corpus iterator
mm = gensim.corpora.MmCorpus('wiki_en_tfidf.mm')
# extract 100 LDA topics, using 20 full passes, (batch mode) no online updates
lda = gensim.models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=100, update_every=0, passes=20)

http://radimrehurek.com/gensim/wiki.html#latent-dirichlet-allocation

Graphlab

import graphlab as gl
docs = gl.SArray('http://s3.amazonaws.com/dato-datasets/nytimes')
m = gl.topic_model.create(docs,
                          num_topics=20,       # number of topics
                          num_iterations=10,   # algorithm parameters
                          alpha=.01, beta=.1)  # hyperparameters

https://dato.com/products/create/docs/generated/graphlab.topic_model.create.html

lda

import lda
X = lda.datasets.load_reuters()
model = lda.LDA(n_topics=20, n_iter=1500, random_state=1)
model.fit(X)  # model.fit_transform(X) is also available

http://pythonhosted.org//lda/

sklearn.decomposition.LatentDirichletAllocation

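from sklearn.decomposition import LatentDirichletAllocation

n_topics = 10  # number of topics to extract
X = ...  # document-term count matrix, e.g. the output of CountVectorizer
# n_components was named n_topics in the 0.17 release this talk previews
lda = LatentDirichletAllocation(n_components=n_topics, max_iter=5, random_state=0)
lda.fit(X)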

scikit-learn LDA example

Pipeline

Preprocessing

Vector Space

Model
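The first two pipeline stages can be sketched in a few lines of standard-library Python. The stop-word list, the regular expression, and the function names are placeholder assumptions; real pipelines typically use a library tokenizer, a fuller stop-word list, and sparse matrices.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "in", "is", "to"}  # tiny illustrative list

def preprocess(text):
    """Preprocessing: lowercase, tokenize on letters, drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def to_bag_of_words(token_lists):
    """Vector space: map each document to a vector of word counts."""
    vocab = sorted({t for tokens in token_lists for t in tokens})
    index = {word: i for i, word in enumerate(vocab)}
    matrix = []
    for tokens in token_lists:
        row = [0] * len(vocab)
        for word, count in Counter(tokens).items():
            row[index[word]] = count
        matrix.append(row)
    return vocab, matrix

texts = ["The cat and the dog.", "Stock markets in the news: stock prices."]
vocab, matrix = to_bag_of_words([preprocess(t) for t in texts])
```

The resulting count matrix is the kind of input a topic model is then fitted on in the final stage.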

Gensim Models

Scikit-learn example

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

# Initialize variables
n_samples = 2000
n_features = 1000
n_topics = 10

dataset = fetch_20newsgroups(shuffle=True, random_state=1,
                             remove=('headers', 'footers', 'quotes'))
data_samples = dataset.data[:n_samples]

# use tf feature for LDA model
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=n_features,
                                stop_words='english')
tf = tf_vectorizer.fit_transform(data_samples)

# n_components was named n_topics in scikit-learn 0.17
lda = LatentDirichletAllocation(n_components=n_topics, max_iter=5,
                                learning_method='online', learning_offset=50.,
                                random_state=0)
lda.fit(tf)

http://scikit-learn.org/dev/auto_examples/applications/topics_extraction_with_nmf_lda.html
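Once a model is fitted, the usual next step is to list the highest-weighted words per topic. The helper below is a library-free sketch run on made-up weights. With scikit-learn, the topic-word weights would come from `lda.components_` and the feature names from the vectorizer (`get_feature_names_out()` in recent releases, `get_feature_names()` in older ones).

```python
def top_words(topic_word_weights, feature_names, n_top=3):
    """Return the n_top highest-weighted words for each topic."""
    result = []
    for weights in topic_word_weights:
        # Indices of the words sorted by descending weight
        ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
        result.append([feature_names[i] for i in ranked[:n_top]])
    return result

# Hypothetical vocabulary and topic-word weights for two topics
features = ["ball", "bank", "game", "price", "stock", "team"]
weights = [
    [5.0, 0.1, 4.0, 0.2, 0.1, 3.0],
    [0.1, 4.0, 0.2, 3.0, 5.0, 0.1],
]
topics = top_words(weights, features)
```

Here `topics[0]` collects the sports-like words and `topics[1]` the finance-like ones, which is the kind of listing used to eyeball topic quality.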

Evaluation - Visualization

LDAvis

https://github.com/cpsievert/LDAvis, https://github.com/bmabey/pyLDAvis

Resources

http://www.cs.princeton.edu/~blei/papers/Blei2012.pdf
http://miriamposner.com/blog/very-basic-strategies-for-interpreting-results-from-the-topic-modeling-tool/
http://journalofdigitalhumanities.org/2-1/topic-modeling-a-basic-introduction-by-megan-r-brett/
https://beta.oreilly.com/ideas/topic-models-past-present-and-future

Resources

IPython notebooks explaining Dirichlet Processes, HDPs, and Latent Dirichlet Allocation, Timothy Hopper: https://github.com/tdhopper/notes-on-dirichlet-processes
Visualizing Topic Models, Data Science Summit & Dato Conference 2015 (video), Ben Mabey

Questions?

Slides: http://chdoig.github.com/pytexas2015-topic-modeling

Email: christine.doig@continuum.io

Twitter: ch_doig
