SingaporeRoadnameOrigins

A map of Singapore, with roads colour-coded for language of origin

On GitHub: michelleful/SingaporeRoadnameOrigins

Grids, Streets & Pipelines

Building a linguistic street map with scikit-learn

Michelle Fullwood / @michelleful

Who I am

I'm a grad student in linguistics.

I love languages and maps.

I'm from Singapore.

Singapore street names

Clusters of street names

© OpenStreetMap contributors

Goal

A map of Singapore with streets coloured by linguistic origin

Ingredients

  • Geographic location of roads - OpenStreetMap
  • Linguistic classification - scikit-learn


Goals for this talk

  • Classifying with scikit-learn (70%)
    • Organising features with pipelines
    • Improving performance by tuning hyperparameters
  • Wrangling geodata with GeoPandas (30%)
    • Data preparation
    • Plotting a map

Wrangling geodata

OpenStreetMap Metro Extracts

GeoJSON

{ "type": "Feature", 
  "properties": 
      { "id": 5436.0, "osm_id": 48673274.0, 
        "type": "residential", 
        "name": "Montreal Drive", ...
        "class": "highway" },
  "geometry": 
      { "type": "LineString", 
        "coordinates": [ [ 103.827628075898062, 1.45001447378366  ], 
                         [ 103.827546855256259, 1.450088485988644 ], 
                         [ 103.82724167016174 , 1.450461983594056 ], 
                         ... ] } }

GeoPandas

>>> import geopandas as gpd

>>> df = gpd.read_file('singapore-roads.geojson')

>>> df.shape
(59218, 13)

Plotting with GeoPandas

>>> df.plot()

Geographic operations made easy

>>> # `within` function returns true if one feature 
>>> # sits within the boundary of another
>>> df = df[df.geometry.within(singapore.geometry)]

>>> df.plot()
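The `singapore` geometry used above isn't defined on the slides; a minimal sketch of how it could be obtained (assuming a hypothetical national-boundary GeoJSON file) is:

>>> # hypothetical: load an administrative-boundary file and dissolve it into one shape
>>> boundary = gpd.read_file('singapore-boundary.geojson')
>>> singapore_shape = boundary.geometry.unary_union   # a single (multi)polygon
>>> df = df[df.geometry.within(singapore_shape)]      # keep only roads inside it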

Pandas operations are still available

>>> # filter out empty road names
>>> df = df[df['name'].notnull()]

>>> # only accept roads whose 'highway' variable is
>>> # in an accepted list (not a footpath, etc)
>>> df = df[df['highway'].isin(accepted_road_types)]

Building the baseline classifier

Ingredients we need:

  • Classification schema
  • Labelled train/test set
  • Numerical features
  • A classifier
  • An evaluation metric

Classification schema

  • Malay
  • Chinese
  • English
  • Indian
  • Generic
  • Other

Labelled train/test set

  • Hand label an initial subset of road names
  • Train a classifier on that subset and use it to classify more names
  • Hand correct the classifier's predictions
  • Train again on the enlarged labelled set (a sketch of this loop follows below)
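A minimal sketch of that bootstrapping loop (the sample size and column names are illustrative; `ppl` is the classification pipeline built later in the talk):

>>> # 1. hand-label a small random sample of road names
>>> seed = df.sample(500, random_state=0)   # 'classification' column filled in by hand
>>> # 2. train an initial model on the hand-labelled sample
>>> model = ppl.fit(seed['road_name'], seed['classification'])
>>> # 3. classify the rest, hand-correct the output, then retrain on everything
>>> rest = df.drop(seed.index)
>>> rest['classification'] = model.predict(rest['road_name'])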

Train/Test split

# split into train and test data

from sklearn.model_selection import train_test_split  # sklearn.cross_validation in older scikit-learn versions

data_train, data_test, y_train, y_true = \
    train_test_split(df['road_name'], df['classification'], test_size=0.2)

Choosing features: n-grams

(Jalan) Malu-Malu

unigrams:  m(2) a(2) l(2) u(2) -(1)
bigrams:   #m(1) ma(2) al(2) lu(2) u-(1) ...
trigrams:  ##m(1) #ma(1) mal(2) alu(2) ...
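These counts can be reproduced with scikit-learn's character analyzer (a sketch using the char_wb analyzer, which pads word edges with spaces rather than the '#' shown above):

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> cv = CountVectorizer(analyzer='char_wb', ngram_range=(1, 3))
>>> X = cv.fit_transform(['Malu-Malu'])
>>> cv.vocabulary_   # maps each character n-gram to a column index
>>> X.toarray()[0]   # count of each n-gram, e.g. 'al' appears twice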

Choosing features: n-grams

Names containing bigram 'ck'
  • British (23): Alnwick, Berwick, Brickson, ...
  • Chinese (17): Boon Teck, Hock Chye, Kheam Hock, ...
  • Malay (0), Indian (0)

Choosing features: n-grams

>>> from sklearn.feature_extraction.text import CountVectorizer

>>> ngram_counter = CountVectorizer(ngram_range=(1, 4), analyzer='char')

>>> X_train = ngram_counter.fit_transform(data_train)
>>> X_test  = ngram_counter.transform(data_test)

Selecting a classifier

Linear Support Vector Classification (SVC)


Building the classification model

>>> from sklearn.svm import LinearSVC

>>> classifier = LinearSVC()

>>> model = classifier.fit(X_train, y_train)

Testing the classifier

>>> y_test = model.predict(X_test)

>>> from sklearn.metrics import accuracy_score
>>> accuracy_score(y_true, y_test)
0.551818181818

Improving the classifier

  • More data
  • Trying other classifiers
  • Adding more features
  • Hyperparameter tuning

Adding features

"At the end of the day, some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used... This is typically where most of the effort in a machine learning project goes." - Pedro Domingos, "A Few Useful Things to Know about Machine Learning"

Features to add

  • Number of words
  • Average length of word
  • Are all the words in the dictionary?
  • Is the road tag Malay? (Street, Road vs Jalan, Lorong)

Before: Feature code

>>> from sklearn.feature_extraction.text import CountVectorizer

>>> ngram_counter = CountVectorizer(ngram_range=(1, 4), analyzer='char')

>>> X_train = ngram_counter.fit_transform(data_train)
>>> X_test  = ngram_counter.transform(data_test)

Pipelines

After: rewrite with pipelines

>>> from sklearn.pipeline import Pipeline

>>> ppl = Pipeline([
              ('ngram', CountVectorizer(ngram_range=(1, 4), analyzer='char')),
              ('clf',   LinearSVC())
          ])

>>> model = ppl.fit(data_train, y_train)
>>> y_test = model.predict(data_test)

Adding a new feature

Average word length

  • Longer: likely to be of British or Indian origin
  • Shorter: likely to be of Chinese origin
  • Need a new data transformer that takes in road names and outputs this number

Writing your own transformer class

from sklearn.base import BaseEstimator, TransformerMixin

class SampleExtractor(BaseEstimator, TransformerMixin):
    
    def __init__(self, vars):
        self.vars = vars
        
    def transform(self, X, y=None):
        return do_something_to(X, self.vars)

    def fit(self, X, y=None):
        return self        

Writing your own transformer class

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class AverageWordLengthExtractor(BaseEstimator, TransformerMixin):
    """Takes in df, extracts road name column, outputs average word length"""
    
    def __init__(self):
        pass
        
    def average_word_length(self, name):
        return np.mean([len(word) for word in name.split()])
        
    def transform(self, X, y=None):
        # return a 2-D column so FeatureUnion can hstack it with the n-gram matrix
        return X['road_name'].apply(self.average_word_length).values.reshape(-1, 1)

    def fit(self, X, y=None):
        return self        

Putting transformers in parallel

Feature Union

from sklearn.pipeline import Pipeline, FeatureUnion

pipeline = Pipeline([
    ('feats', FeatureUnion([
        ('ngram', ngram_count_pipeline), # can pass in either a pipeline
        ('ave', AverageWordLengthExtractor()) # or a transformer
    ])),
    ('clf', LinearSVC())  # classifier
])
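The `ngram_count_pipeline` above isn't shown on the slide; a minimal sketch, assuming a small hypothetical column-selecting transformer, might look like this (naming the CountVectorizer step 'vect' so the parameter names match the get_params() output shown later):

class ColumnExtractor(BaseEstimator, TransformerMixin):
    """Pulls a single text column out of the DataFrame."""
    def __init__(self, column):
        self.column = column

    def transform(self, X, y=None):
        return X[self.column]

    def fit(self, X, y=None):
        return self

ngram_count_pipeline = Pipeline([
    ('extract', ColumnExtractor('road_name')),
    ('vect', CountVectorizer(ngram_range=(1, 4), analyzer='char')),
])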

Additional features I tried

  • Number of words
  • Average length of word
  • Are all the words in the dictionary?
  • Is the road tag Malay? (Street, Road vs Jalan, Lorong)
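For example, the Malay road-tag feature can be written as another small transformer (a sketch; the tag set here is an illustrative subset, not the list used in the talk):

class MalayRoadTagExtractor(BaseEstimator, TransformerMixin):
    """Outputs 1 if the name contains a Malay road tag such as 'Jalan' or 'Lorong', else 0."""
    MALAY_TAGS = {'jalan', 'lorong'}  # illustrative subset

    def transform(self, X, y=None):
        has_tag = X['road_name'].apply(
            lambda name: any(w.lower() in self.MALAY_TAGS for w in name.split()))
        # 2-D integer column, ready for FeatureUnion
        return has_tag.values.reshape(-1, 1).astype(int)

    def fit(self, X, y=None):
        return self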

Summary

  • Using Pipelines and FeatureUnions does not improve performance in and of itself
  • They're a great tool for organising code readably, allowing for easier experimentation

Hyperparameter tuning

Hyperparameters

>>> # When you do this:
>>> clf = LinearSVC()

>>> # You're really doing this:
>>> clf = LinearSVC(C=1.0, loss='squared_hinge', ...)

>>> # changing the values of these hyperparameters can alter performance,
>>> # sometimes quite significantly

How GridSearchCV works

The grid: candidate values of C (0.10, 1.00, 10.0, 100, 1000) crossed with candidate values of gamma (2⁻², 2⁰, 2²); every combination is evaluated and the best cell (✔) wins.

How GridSearchCV works

Within each cell of the grid: cross-validation. The training data is split into five 20% folds; each fold in turn is held out for testing while the model is trained on the other four.

GridSearchCV

>>> from sklearn.model_selection import GridSearchCV  # sklearn.grid_search in older versions

>>> pg = {'clf__C': [0.1, 1, 10, 100]}

>>> grid = GridSearchCV(pipeline, param_grid=pg, cv=5)
>>> grid.fit(X_train, y_train)

>>> grid.best_params_
{'clf__C': 0.1}

>>> grid.best_score_
0.702290076336

GridSearchCV

>>> model = grid.best_estimator_.fit(X_train, y_train)
>>> y_test = model.predict(X_test)
>>> accuracy_score(y_true, y_test)

0.686590909091

Which hyperparameters?

>>> pipeline.get_params()  # only works if all transformers
                           # inherit from BaseEstimator!

{'clf__C': 1.0,
 'clf__class_weight': None,
 'clf__dual': True,
 ...
 'feats__ngram__vect__ngram_range': (1, 4),
 'feats__ngram__vect__preprocessor': None,
 'feats__ngram__vect__stop_words': None,
}
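These dotted names are exactly what goes into the GridSearchCV parameter grid; a sketch that searches over two of them (the candidate values are illustrative):

>>> pg = {'clf__C': [0.1, 1, 10],
...       'feats__ngram__vect__ngram_range': [(1, 3), (1, 4), (1, 5)]}
>>> grid = GridSearchCV(pipeline, param_grid=pg, cv=5)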

Picking hyperparameters to test

  • Understanding the classification algorithm

What parameter grid?

  • The academic literature
  • Searching on GitHub

Summary

  • Hyperparameter search can be time-consuming
    • CPU time scales with the product of the number of values tested in each dimension
  • Can be parallelized easily
  • Can be performed iteratively: explore a coarse grid first, then a finer grid around the most promising region
  • Use logarithmic values roughly centred around the default (unless it makes sense not to); see the sketch below
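A sketch of the coarse-then-fine, logarithmic approach (the values are illustrative):

>>> import numpy as np

>>> # coarse logarithmic grid centred on the default C=1.0
>>> coarse = {'clf__C': np.logspace(-2, 2, 5)}   # 0.01, 0.1, 1, 10, 100
>>> # if the coarse search favours C ≈ 0.1, refine around it
>>> fine = {'clf__C': np.logspace(-2, 0, 5)}     # 0.01, ~0.03, 0.1, ~0.3, 1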

Final accuracy

Making the map

(In two lines of Python)

Step 1

Create a matplotlib map with GeoPandas:
>>> ax = df.plot(column='classification', cmap='Accent')

Step 2

Convert it into an interactive Leaflet web map
>>> import mplleaflet
>>> mplleaflet.display(fig=ax.figure, crs=df.crs, tiles='cartodb_positron')

Acknowledgments & Useful Resources