Topic Modeling for Amnesty International Data
@DataKind
Hi!
Bugra Akyildiz
Data Scientist at Axial
@bugraa
Machine Learning Newsletter | mln.io
bugra@nyu.edu
Axial
A network that brings private companies and investors together
Gives business owners access to private capital markets
We are hiring! | axial.net
DataKind
Data science in the service of humanity
Nonprofit: NGOs, governments and such
http://www.datakind.org/
Amnesty International
Global movement of people fighting injustice and promoting human rights
http://www.amnestyusa.org/
Data science for humanity and profit
Amnesty International Data
Complaints
- Fear of Safety
- Getting threats
- ...
News
- Conviction
- Execution
- Disappearance
- ...
Topic Modeling
The unsupervised learning method you apply to a bunch of text when you have no idea what to do with it
"Topic model" is an umbrella name for a suite of graphical models for discovering topics or themes in a collection of documents.
Why Topic Model?
- Useful to explore the documents
- Easy to apply to unstructured documents
- Unsupervised, no need for labels
- Applicable to any medium-to-long-form text
Latent Dirichlet Allocation
LDA
- M: Number of documents
- N: Number of words in a document
- α: Dirichlet parameter on the document-topic distributions
- β: Dirichlet parameter on the topic-word distributions
- θi: Topic distribution of document i
- ϕk: Word distribution for topic k
- wij: jth word of the ith document
- zij: Topic assignment of word wij
How?
- Generate θi∼Dir(α) for each document i
- Generate ϕk∼Dir(β) for each topic k
- For each document i and word position j:
  - Choose a topic zij∼Multinomial(θi)
  - Choose a word wij∼Multinomial(ϕzij)
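The generative steps above can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the Amnesty data: the function names, the toy vocabulary, the number of topics K, the symmetric values of α and β, and the fixed document length N are all assumptions made for the example. It uses only the standard library and samples from a Dirichlet by normalizing Gamma draws.

```python
import random


def sample_dirichlet(params, rng):
    """Draw one sample from Dir(params) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(p, 1.0) for p in params]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(M, N, K, vocab, alpha=0.5, beta=0.5, seed=0):
    """Generate M documents of N words each by following the LDA generative story.

    K (number of topics) and the symmetric alpha/beta values are
    illustrative assumptions, not values from the talk.
    """
    rng = random.Random(seed)
    V = len(vocab)
    # phi[k]: word distribution for topic k, drawn from Dir(beta)
    phi = [sample_dirichlet([beta] * V, rng) for _ in range(K)]
    # theta[i]: topic distribution of document i, drawn from Dir(alpha)
    theta = [sample_dirichlet([alpha] * K, rng) for _ in range(M)]

    docs, assignments = [], []
    for i in range(M):
        words, topics = [], []
        for _ in range(N):
            # z_ij ~ Multinomial(theta_i): pick a topic for this word position
            z = rng.choices(range(K), weights=theta[i])[0]
            # w_ij ~ Multinomial(phi_z): pick a word from that topic
            w = rng.choices(vocab, weights=phi[z])[0]
            topics.append(z)
            words.append(w)
        docs.append(words)
        assignments.append(topics)
    return docs, assignments, theta, phi


if __name__ == "__main__":
    # Hypothetical toy vocabulary, loosely inspired by the news categories above.
    toy_vocab = ["threat", "safety", "fear", "conviction", "execution", "trial"]
    docs, assignments, theta, phi = generate_corpus(M=3, N=8, K=2, vocab=toy_vocab)
    for words, topics in zip(docs, assignments):
        print(list(zip(words, topics)))
```

Fitting LDA to real documents runs this story in reverse: given only the observed words wij, inference estimates θi, ϕk and zij.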
Topics
https://datadive.herokuapp.com/
Topics Represented as Graph
https://datadive.herokuapp.com/graph_clusters
Every color represents a topic
If all you have is a hammer, everything looks like a nail
Not everything has to be a clustering problem.
Variants
Correlated Topic Models
Dynamic Topic Models
Supervised Topic Models
Turbo Topics
...