SingaporeRoadnameOrigins

A map of Singapore, with roads colour-coded for language of origin

On GitHub: michelleful/SingaporeRoadnameOrigins

Grids, Streets & Pipelines

Building a linguistic street map with scikit-learn

Michelle Fullwood / @michelleful

Who I am

I'm a grad student in linguistics.

I love languages and maps.

I'm from Singapore.

Singapore street names

Clusters of street names

© OpenStreetMap contributors

Goal

A map of Singapore with streets coloured by linguistic origin

Ingredients

  • Geographic location of roads - OpenStreetMap
  • Linguistic classification - scikit-learn


Goals for this talk

  • Classifying with scikit-learn (70%)
    • Organising features with pipelines
    • Improving performance by tuning hyperparameters
  • Wrangling geodata with GeoPandas (30%)
    • Data preparation
    • Plotting a map

Wrangling geodata

OpenStreetMap Metro Extracts

GeoJSON

{ "type": "Feature", 
  "properties": 
      { "id": 5436.0, "osm_id": 48673274.0, 
        "type": "residential", 
        "name": "Montreal Drive", ...
        "class": "highway" },
  "geometry": 
      { "type": "LineString", 
        "coordinates": [ [ 103.827628075898062, 1.45001447378366  ], 
                         [ 103.827546855256259, 1.450088485988644 ], 
                         [ 103.82724167016174 , 1.450461983594056 ], 
                         ... ] } }

GeoPandas

>>> import geopandas as gpd

>>> df = gpd.read_file('singapore-roads.geojson')

>>> df.shape
(59218, 13)

Plotting with GeoPandas

>>> df.plot()

Geographic operations made easy

>>> # `within` function returns true if one feature 
>>> # sits within the boundary of another
>>> df = df[df.geometry.within(singapore.geometry)]

>>> df.plot()
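The `singapore` geometry used above isn't defined on the slides; a minimal sketch of how it could be obtained (assuming a hypothetical national-boundary GeoJSON file) is:

>>> # hypothetical: load an administrative-boundary file and dissolve it into one shape
>>> boundary = gpd.read_file('singapore-boundary.geojson')
>>> singapore_shape = boundary.geometry.unary_union   # a single (multi)polygon
>>> df = df[df.geometry.within(singapore_shape)]      # keep only roads inside it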

Pandas operations are still available

>>> # filter out empty road names
>>> df = df[df['name'].notnull()]

>>> # only accept roads whose 'highway' variable is
>>> # in an accepted list (not a footpath, etc)
>>> df = df[df['highway'].isin(accepted_road_types)]

Building the baseline classifier

Ingredients we need:

  • Classification schema
  • Labelled train/test set
  • Numerical features
  • A classifier
  • An evaluation metric

Classification schema

  • Malay
  • Chinese
  • English
  • Indian
  • Generic
  • Other

Labelled train/test set

  • Hand label an initial subset of road names
  • Train a classifier on that subset and use it to classify more names
  • Hand correct the classifier's predictions
  • Train again on the enlarged labelled set (a sketch of this loop follows below)
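A minimal sketch of that bootstrapping loop (the sample size and column names are illustrative; `ppl` is the classification pipeline built later in the talk):

>>> # 1. hand-label a small random sample of road names
>>> seed = df.sample(500, random_state=0)   # 'classification' column filled in by hand
>>> # 2. train an initial model on the hand-labelled sample
>>> model = ppl.fit(seed['road_name'], seed['classification'])
>>> # 3. classify the rest, hand-correct the output, then retrain on everything
>>> rest = df.drop(seed.index)
>>> rest['classification'] = model.predict(rest['road_name'])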

Train/Test split

# split into train and test data

from sklearn.model_selection import train_test_split  # sklearn.cross_validation in older scikit-learn versions

data_train, data_test, y_train, y_true = \
    train_test_split(df['road_name'], df['classification'], test_size=0.2)

Choosing features: n-grams

(Jalan) Malu-Malu

unigrams:  m(2) a(2) l(2) u(2) -(1)
bigrams:   #m(1) ma(2) al(2) lu(2) u-(1) ...
trigrams:  ##m(1) #ma(1) mal(2) alu(2) ...
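These counts can be reproduced with scikit-learn's character analyzer (a sketch using the char_wb analyzer, which pads word edges with spaces rather than the '#' shown above):

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> cv = CountVectorizer(analyzer='char_wb', ngram_range=(1, 3))
>>> X = cv.fit_transform(['Malu-Malu'])
>>> cv.vocabulary_   # maps each character n-gram to a column index
>>> X.toarray()[0]   # count of each n-gram, e.g. 'al' appears twice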

Choosing features: n-grams

Names containing bigram 'ck'
  • British (23): Alnwick, Berwick, Brickson, ...
  • Chinese (17): Boon Teck, Hock Chye, Kheam Hock, ...
  • Malay (0), Indian (0)

Choosing features: n-grams

>>> from sklearn.feature_extraction.text import CountVectorizer

>>> ngram_counter = CountVectorizer(ngram_range=(1, 4), analyzer='char')

>>> X_train = ngram_counter.fit_transform(data_train)
>>> X_test  = ngram_counter.transform(data_test)

Selecting a classifier

Linear Support Vector Classification (SVC)


Building the classification model

>>> from sklearn.svm import LinearSVC

>>> classifier = LinearSVC()

>>> model = classifier.fit(X_train, y_train)

Testing the classifier

>>> y_test = model.predict(X_test)

>>> from sklearn.metrics import accuracy_score
>>> accuracy_score(y_true, y_test)
0.551818181818

Improving the classifier

  • More data
  • Trying other classifiers
  • Adding more features
  • Hyperparameter tuning

Adding features

"At the end of the day, some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used... This is typically where most of the effort in a machine learning project goes." - Pedro Domingos, "A Few Useful Things to Know about Machine Learning"

Features to add

  • Number of words
  • Average length of word
  • Are all the words in the dictionary?
  • Is the road tag Malay? (Street, Road vs Jalan, Lorong)

Before: Feature code

>>> from sklearn.feature_extraction.text import CountVectorizer

>>> ngram_counter = CountVectorizer(ngram_range=(1, 4), analyzer='char')

>>> X_train = ngram_counter.fit_transform(data_train)
>>> X_test  = ngram_counter.transform(data_test)

Pipelines

After: rewrite with pipelines

>>> from sklearn.pipeline import Pipeline

>>> ppl = Pipeline([
              ('ngram', CountVectorizer(ngram_range=(1, 4), analyzer='char')),
              ('clf',   LinearSVC())
          ])

>>> model = ppl.fit(data_train, y_train)
>>> y_test = model.predict(data_test)

Adding a new feature

Average word length

  • Longer: likely to be of British or Indian origin
  • Shorter: likely to be of Chinese origin
  • Need a new data transformer that takes in road names and outputs this number

Writing your own transformer class

from sklearn.base import BaseEstimator, TransformerMixin

class SampleExtractor(BaseEstimator, TransformerMixin):
    
    def __init__(self, vars):
        self.vars = vars
        
    def transform(self, X, y=None):
        return do_something_to(X, self.vars)

    def fit(self, X, y=None):
        return self        

Writing your own transformer class

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class AverageWordLengthExtractor(BaseEstimator, TransformerMixin):
    """Takes in df, extracts road name column, outputs average word length"""
    
    def __init__(self):
        pass
        
    def average_word_length(self, name):
        return np.mean([len(word) for word in name.split()])
        
    def transform(self, X, y=None):
        # return a 2-D column so FeatureUnion can hstack it with the n-gram matrix
        return X['road_name'].apply(self.average_word_length).values.reshape(-1, 1)

    def fit(self, X, y=None):
        return self        

Putting transformers in parallel

Feature Union

from sklearn.pipeline import Pipeline, FeatureUnion

pipeline = Pipeline([
    ('feats', FeatureUnion([
        ('ngram', ngram_count_pipeline), # can pass in either a pipeline
        ('ave', AverageWordLengthExtractor()) # or a transformer
    ])),
    ('clf', LinearSVC())  # classifier
])
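The `ngram_count_pipeline` above isn't shown on the slide; a minimal sketch, assuming a small hypothetical column-selecting transformer, might look like this (naming the CountVectorizer step 'vect' so the parameter names match the get_params() output shown later):

class ColumnExtractor(BaseEstimator, TransformerMixin):
    """Pulls a single text column out of the DataFrame."""
    def __init__(self, column):
        self.column = column

    def transform(self, X, y=None):
        return X[self.column]

    def fit(self, X, y=None):
        return self

ngram_count_pipeline = Pipeline([
    ('extract', ColumnExtractor('road_name')),
    ('vect', CountVectorizer(ngram_range=(1, 4), analyzer='char')),
])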

Additional features I tried

  • Number of words
  • Average length of word
  • Are all the words in the dictionary?
  • Is the road tag Malay? (Street, Road vs Jalan, Lorong)
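For example, the Malay road-tag feature can be written as another small transformer (a sketch; the tag set here is an illustrative subset, not the list used in the talk):

class MalayRoadTagExtractor(BaseEstimator, TransformerMixin):
    """Outputs 1 if the name contains a Malay road tag such as 'Jalan' or 'Lorong', else 0."""
    MALAY_TAGS = {'jalan', 'lorong'}  # illustrative subset

    def transform(self, X, y=None):
        has_tag = X['road_name'].apply(
            lambda name: any(w.lower() in self.MALAY_TAGS for w in name.split()))
        # 2-D integer column, ready for FeatureUnion
        return has_tag.values.reshape(-1, 1).astype(int)

    def fit(self, X, y=None):
        return self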

Summary

  • Using Pipelines and FeatureUnions does not improve performance in and of itself
  • They're a great tool for organising code readably, allowing for easier experimentation

Hyperparameter tuning

Hyperparameters

>>> # When you do this:
>>> clf = LinearSVC()

>>> # You're really doing this:
>>> clf = LinearSVC(C=1.0, loss='squared_hinge', ...)

>>> # changing the values of these hyperparameters can alter performance,
>>> # sometimes quite significantly

How GridSearchCV works

The grid: candidate values of C (0.10, 1.00, 10.0, 100, 1000) crossed with candidate values of gamma (2⁻², 2⁰, 2²); every combination is evaluated and the best cell (✔) wins.

How GridSearchCV works

Within each cell of the grid: cross-validation. The training data is split into five 20% folds; each fold in turn is held out for testing while the model is trained on the other four.

GridSearchCV

>>> from sklearn.model_selection import GridSearchCV  # sklearn.grid_search in older versions

>>> pg = {'clf__C': [0.1, 1, 10, 100]}

>>> grid = GridSearchCV(pipeline, param_grid=pg, cv=5)
>>> grid.fit(X_train, y_train)

>>> grid.best_params_
{'clf__C': 0.1}

>>> grid.best_score_
0.702290076336

GridSearchCV

>>> model = grid.best_estimator_.fit(X_train, y_train)
>>> y_test = model.predict(X_test)
>>> accuracy_score(y_true, y_test)

0.686590909091

Which hyperparameters?

>>> pipeline.get_params()  # only works if all transformers
                           # inherit from BaseEstimator!

{'clf__C': 1.0,
 'clf__class_weight': None,
 'clf__dual': True,
 ...
 'feats__ngram__vect__ngram_range': (1, 4),
 'feats__ngram__vect__preprocessor': None,
 'feats__ngram__vect__stop_words': None,
}
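These dotted names are exactly what goes into the GridSearchCV parameter grid; a sketch that searches over two of them (the candidate values are illustrative):

>>> pg = {'clf__C': [0.1, 1, 10],
...       'feats__ngram__vect__ngram_range': [(1, 3), (1, 4), (1, 5)]}
>>> grid = GridSearchCV(pipeline, param_grid=pg, cv=5)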

Picking hyperparameters to test

  • Understanding the classification algorithm

What parameter grid?

  • The academic literature
  • Searching on GitHub

Summary

  • Hyperparameter search can be time-consuming
    • CPU time scales with the product of the number of values tested in each dimension
  • Can be parallelized easily
  • Can be performed iteratively: explore a coarse grid first, then a finer grid around the most promising region
  • Use logarithmic values roughly centred around the default (unless it makes sense not to); see the sketch below
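A sketch of the coarse-then-fine, logarithmic approach (the values are illustrative):

>>> import numpy as np

>>> # coarse logarithmic grid centred on the default C=1.0
>>> coarse = {'clf__C': np.logspace(-2, 2, 5)}   # 0.01, 0.1, 1, 10, 100
>>> # if the coarse search favours C ≈ 0.1, refine around it
>>> fine = {'clf__C': np.logspace(-2, 0, 5)}     # 0.01, ~0.03, 0.1, ~0.3, 1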

Final accuracy

Making the map

(In two lines of Python)

Step 1

Create a matplotlib map with GeoPandas:
>>> ax = df.plot(column='classification', cmap='Accent')

Step 2

Convert it into an interactive Leaflet web map
>>> import mplleaflet
>>> mplleaflet.display(fig=ax.figure, crs=df.crs, tiles='cartodb_positron')

Acknowledgments & Useful Resources