Topic Modeling for Amnesty International Data




Next ML 2015 Slides

On Github bugra / next-ml-2015

Topic Modeling for Amnesty International Data

@DataKind

Hi!

Bugra Akyildiz

Data Scientist at Axial

@bugraa

Machine Learning Newsletter | mln.io

bugra@nyu.edu

Axial

A network that brings private companies and investors together

Gives business owners access to private capital markets

We are hiring! | axial.net

Datakind

Data science in the service of humanity

Nonprofit: NGOs, governments and such

http://www.datakind.org/

Amnesty International

Global movement of people fighting injustice and promoting human rights

http://www.amnestyusa.org/

Data science for humanity and profit

Data

Amnesty International Data

Complaints

  • Fear of Safety
  • Getting threats
  • ...

News

  • Conviction
  • Execution
  • Disappearance
  • ...

Example News

Another One

Machine Learning Model

Topic Modeling

The unsupervised learning method you apply to a bunch of text when you have no idea what to do with it

"Topic model" is an umbrella name for a suite of graphical models that discover topics, or themes, in a collection of documents.

Why Topic Model?

  • Useful for exploring a document collection
  • Easy to apply to unstructured documents
  • Unsupervised, no need for labels
  • Applicable to any medium- to long-form text

Latent Dirichlet Allocation

LDA

  • M: number of documents
  • N: number of words in a document
  • α: Dirichlet parameter on the document-topic distributions
  • β: Dirichlet parameter on the topic-word distributions
  • θ_i: topic distribution of document i
  • ϕ_k: word distribution of topic k
  • w_ij: jth word of the ith document
  • z_ij: topic assignment of word w_ij

Observed

  • w_ij

How?

  • Generate θ_i ∼ Dir(α) for each document i
  • Generate ϕ_k ∼ Dir(β) for each topic k
  • For each document i and word position j:
    • Choose a topic z_ij ∼ Multinomial(θ_i)
    • Choose a word w_ij ∼ Multinomial(ϕ_{z_ij})
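
The generative steps above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the toy vocabulary, the function names (`sample_dirichlet`, `generate_corpus`) and the parameter values are made up for the example and do not come from the Amnesty data or the talk's code. A single draw from a Multinomial is treated as a categorical draw.

```python
import random


def sample_dirichlet(alpha, k, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) over k components."""
    # Normalizing independent Gamma(alpha, 1) draws yields a Dirichlet sample.
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(n_docs, n_words, n_topics, vocab, alpha, beta, seed=0):
    """Sample a toy corpus by following the LDA generative process."""
    rng = random.Random(seed)
    # phi[k]: word distribution of topic k, drawn from Dir(beta).
    phi = [sample_dirichlet(beta, len(vocab), rng) for _ in range(n_topics)]
    thetas, assignments, docs = [], [], []
    for _ in range(n_docs):
        # theta: topic distribution of this document, drawn from Dir(alpha).
        theta = sample_dirichlet(alpha, n_topics, rng)
        # z[j]: topic assignment of word j, one categorical draw from theta.
        z = rng.choices(range(n_topics), weights=theta, k=n_words)
        # w[j]: word j, one categorical draw from the assigned topic's phi.
        w = [rng.choices(vocab, weights=phi[topic], k=1)[0] for topic in z]
        thetas.append(theta)
        assignments.append(z)
        docs.append(w)
    return docs, assignments, thetas, phi


# Hypothetical toy vocabulary, chosen only for illustration.
VOCAB = ["arrest", "trial", "threat", "safety", "missing", "appeal"]

docs, assignments, thetas, phi = generate_corpus(
    n_docs=4, n_words=8, n_topics=2, vocab=VOCAB, alpha=0.5, beta=0.1
)
for words in docs:
    print(" ".join(words))
```

In a real application only the words are observed, so fitting LDA means running this process in reverse: inferring θ, ϕ and the topic assignments from the documents.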

Dirichlet Distribution

Alpha=0.1

Alpha=1

Alpha=10

Alpha=100
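
To get a feel for what the four settings of alpha do, here is a minimal sketch that samples from a symmetric Dirichlet and reports how large the biggest component is on average. The three-component setup, the sample count and the helper names are assumptions made for this illustration, not the settings behind the original plots.

```python
import random


def sample_dirichlet(alpha, k, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) over k components."""
    # Normalizing independent Gamma(alpha, 1) draws yields a Dirichlet sample.
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def mean_largest_component(alpha, k=3, n_samples=2000, seed=0):
    """Average size of the largest component over many Dirichlet samples."""
    rng = random.Random(seed)
    largest = (max(sample_dirichlet(alpha, k, rng)) for _ in range(n_samples))
    return sum(largest) / n_samples


for alpha in (0.1, 1, 10, 100):
    value = mean_largest_component(alpha)
    print(f"alpha={alpha}: mean largest component = {value:.2f}")
```

Small alpha pushes the samples toward the corners of the simplex, so one component dominates; large alpha pulls them toward the uniform distribution, where every component is close to 1/k. For LDA this means a small α favors documents that concentrate on a few topics.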

Topics

https://datadive.herokuapp.com/

Topics Represented as Graph

https://datadive.herokuapp.com/graph_clusters

Every color represents a topic
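
The slides do not say how the graph was colored. One plausible reading, sketched below purely as an assumption, is that each document node takes the color of its dominant topic, that is, the topic with the largest weight in its θ. The document names, topic weights and palette are hypothetical.

```python
def dominant_topic(theta):
    """Return the index of the topic with the largest weight in theta."""
    return max(range(len(theta)), key=lambda k: theta[k])


# Hypothetical palette: one color per topic.
PALETTE = ["red", "blue", "green"]

# Hypothetical document-topic distributions (each row sums to 1).
doc_topics = {
    "doc_a": [0.7, 0.2, 0.1],
    "doc_b": [0.1, 0.8, 0.1],
    "doc_c": [0.2, 0.3, 0.5],
}

# Color each document node by its dominant topic.
node_colors = {
    doc: PALETTE[dominant_topic(theta)] for doc, theta in doc_topics.items()
}
print(node_colors)
```

A hard assignment like this hides the fact that LDA documents are mixtures of topics, so nodes with nearly tied weights can land in either color.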

Fear of Safety

Execution

Disappearance

If all you have is a hammer, everything looks like a nail

Not everything has to be a clustering problem.

Know its limitations

Variants

Correlated Topic Models

Dynamic Topic Models

Supervised Topic Models

Turbo Topics

...

Tools

  • Factorie (Scala)
  • Mallet (Java)
  • Gensim (Python)
  • Stanford Topic Modeling Toolbox (Scala)
Blei's Topic Modeling Page

Questions?
