Photograph by Jorge Royan, licensed under the Creative Commons Attribution-Share Alike 3.0 Unported
Created by Joris Gillis using Maple 10, distributed as public domain.
In some cases, hyperparameters determine the complexity or flexibility of the model (e.g. regularization parameters), which we use to prevent overfitting (high variance).
In other cases, hyperparameters affect how the algorithm learns:
[A hyperparameter optimizer is] a functional from data to classifier (taking classification problems as an example)...
-- James Bergstra, Algorithms for Hyper-Parameter Optimization [0]
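To make the distinction concrete, here is a minimal sketch, using scikit-learn's LogisticRegression purely as an illustrative example (it is not part of the benchmark below): the regularization strength C is a hyperparameter we choose before training, while the coefficients are parameters the algorithm learns from the data.

from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression

digits = load_digits()
X, y = digits.data, digits.target

# C (the inverse regularization strength) is a hyperparameter:
# we pick it *before* fitting, and it controls model flexibility.
clf = LogisticRegression(C=0.1, max_iter=1000)

# The coefficients are ordinary parameters: the algorithm
# learns them *from* the data during fit().
clf.fit(X, y)
print(clf.coef_.shape)  # one row of learned weights per class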
Illustration from Random Search for Hyper-Parameter Optimization by Bergstra and Bengio.[1]
Bergstra and Bengio's paper, Random Search for Hyper-Parameter Optimization[1], tells us that if we randomly sample our search space, we approach the global optimum in fewer steps than grid search. Furthermore, while grid search degrades exponentially as dimensions are added (see Curse of Dimensionality), random search actually performs quite well in higher-dimensional spaces because many problems have low effective dimensionality: even though a model may expose many hyperparameters, only a few of them account for most of the variation in performance. Hence randomized sampling of the search space is surprisingly effective.
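Here is a toy sketch of that point (ours, not from the paper): with a fixed budget of 16 evaluations in two dimensions, a 4 x 4 grid only ever tries 4 distinct values of the one hyperparameter that actually matters, while 16 random draws try 16 distinct values of it.

import numpy as np

rng = np.random.RandomState(0)


def score(x, y):
    # Toy objective: only the first "hyperparameter" x matters
    # (low effective dimensionality); y is ignored on purpose.
    return np.exp(-(x - 0.73) ** 2 / 0.01)


budget = 16

# Grid search with that budget: a 4 x 4 grid tries only
# 4 distinct values along the axis that matters.
grid_axis = np.linspace(0, 1, 4)
grid_best = max(score(x, y) for x in grid_axis for y in grid_axis)

# Random search: 16 draws give 16 distinct values along that axis.
random_points = rng.uniform(0, 1, size=(budget, 2))
random_best = max(score(x, y) for x, y in random_points)

print("best score found by grid search:  ", round(grid_best, 3))
print("best score found by random search:", round(random_best, 3))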
Random Search vs Grid Search
How many times would we need to sample randomly to get reasonably close to an optimal configuration with high (>95%) probability?
To rephrase a bit more quantitatively: let's figure out how many points we need to sample with random search so that, with at least 95% probability, at least one sample lands within the best 5% of our hyperparameter search space (i.e. within 5% of the global optimum).
Each independent sample has a 5% chance of landing in that region, so the probability of missing (not getting within 5% of) our global optimum on $n$ successive samples will be
$$ P_{miss} = (1 - 0.05)^n. $$

Hence the probability of a hit will be

$$ P_{hit} = 1 - P_{miss} = 1 - (1 - 0.05)^n. $$

Setting the probability of a hit to 95% and solving for $n$ gives us

$$ 1 - (1 - 0.05)^n = 0.95 \;\Longrightarrow\; (0.95)^n = 0.05 \;\Longrightarrow\; n = \frac{\ln(0.05)}{\ln(0.95)} \approx 58.4, $$

so we need at least 59 samples.
Hence, using only 60 samples, we have a 95% probability of getting a sample within 5% of the global optimum. Very nice!
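A quick numeric sanity check of that calculation:

import numpy as np

# smallest n such that 1 - 0.95**n >= 0.95
n = int(np.ceil(np.log(0.05) / np.log(0.95)))
print(n, 1 - 0.95 ** n)  # 59, ~0.951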
Core illustration from Random Search for Hyper-Parameter Optimization by Bergstra and Bengio. [1]
import numpy as np
from time import time
from operator import itemgetter
from scipy.stats import randint as sp_randint

from sklearn.grid_search import GridSearchCV, RandomizedSearchCV
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
# get some data
digits = load_digits()
X, y = digits.data, digits.target
# build a classifier
clf = RandomForestClassifier(n_estimators=20)


def report(grid_scores, n_top=3):
    # Utility function to report best scores
    top_scores = sorted(grid_scores, key=itemgetter(1), reverse=True)[:n_top]
    for i, score in enumerate(top_scores):
        print("Model with rank: {0}".format(i + 1))
        print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
            score.mean_validation_score,
            np.std(score.cv_validation_scores)))
        print("Parameters: {0}".format(score.parameters))
        print("")
# specify parameters and distributions to sample from
param_dist = {"max_depth": [3, None],
              "max_features": sp_randint(1, 11),
              "min_samples_split": sp_randint(1, 11),
              "min_samples_leaf": sp_randint(1, 11),
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}
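In case sp_randint is unfamiliar: it is just scipy's discrete uniform distribution, and RandomizedSearchCV draws one value from each distribution-like entry for every candidate setting. A quick illustration:

# sp_randint(1, 11) is a discrete uniform over {1, ..., 10};
# RandomizedSearchCV calls .rvs() on it once per candidate.
print(sp_randint(1, 11).rvs(size=5, random_state=0))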
# run randomized search
n_iter_search = 20
random_search = RandomizedSearchCV(clf, param_distributions=param_dist,
                                   n_iter=n_iter_search)
start = time()
random_search.fit(X, y)
finish = time() - start
print("RandomizedSearchCV took {time} seconds for {candidate} candidates" " parameter settings.".format(time=finish, candidate=n_iter_search)) report(random_search.grid_scores_)
RandomizedSearchCV took 4.0490076541900635 seconds for 20 candidates parameter settings.

Model with rank: 1
Mean validation score: 0.921 (std: 0.013)
Parameters: {'max_features': 8, 'criterion': 'entropy', 'max_depth': None, 'min_samples_split': 5, 'bootstrap': True, 'min_samples_leaf': 3}

Model with rank: 2
Mean validation score: 0.920 (std: 0.007)
Parameters: {'max_features': 4, 'criterion': 'gini', 'max_depth': None, 'min_samples_split': 3, 'bootstrap': True, 'min_samples_leaf': 3}

Model with rank: 3
Mean validation score: 0.919 (std: 0.017)
Parameters: {'max_features': 8, 'criterion': 'entropy', 'max_depth': None, 'min_samples_split': 10, 'bootstrap': False, 'min_samples_leaf': 5}
# use a full grid over all parameters
param_grid = {"max_depth": [3, None],
              "max_features": [1, 3, 10],
              "min_samples_split": [1, 3, 10],
              "min_samples_leaf": [1, 3, 10],
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}
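Unlike the randomized search, the grid enumerates every combination, so the number of candidates is the product of the list lengths: 2 x 3 x 3 x 3 x 2 x 2 = 216, which is where the candidate count in the output below comes from.

import numpy as np

# every combination of the lists above gets evaluated
print(np.prod([len(v) for v in param_grid.values()]))  # 216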
# run grid search
grid_search = GridSearchCV(clf, param_grid=param_grid)

start = time()
grid_search.fit(X, y)
finish = time() - start
print("GridSearchCV took {time} seconds for {candidates} candidate parameter settings.".format( time=finish, candidates=len(grid_search.grid_scores_))) report(grid_search.grid_scores_)
GridSearchCV took 37.768978118896484 seconds for 216 candidate parameter settings.

Model with rank: 1
Mean validation score: 0.936 (std: 0.013)
Parameters: {'max_features': 10, 'criterion': 'entropy', 'max_depth': None, 'min_samples_split': 3, 'bootstrap': True, 'min_samples_leaf': 1}

Model with rank: 2
Mean validation score: 0.932 (std: 0.009)
Parameters: {'max_features': 3, 'criterion': 'gini', 'max_depth': None, 'min_samples_split': 1, 'bootstrap': False, 'min_samples_leaf': 1}

Model with rank: 3
Mean validation score: 0.931 (std: 0.012)
Parameters: {'max_features': 10, 'criterion': 'gini', 'max_depth': None, 'min_samples_split': 3, 'bootstrap': False, 'min_samples_leaf': 3}
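A side note on the API: sklearn.grid_search and grid_scores_ have since been deprecated in favor of sklearn.model_selection and cv_results_. Assuming a recent scikit-learn (0.18+) and reusing the clf, X, y, and n_iter_search defined above, roughly the same randomized search looks like the sketch below (this is not the code that produced the timings above; note that newer versions also require min_samples_split >= 2).

import numpy as np
from scipy.stats import randint as sp_randint
from sklearn.model_selection import RandomizedSearchCV

# Same search as above, but via sklearn.model_selection; recent
# scikit-learn requires min_samples_split >= 2, so that range shifts.
param_dist = {"max_depth": [3, None],
              "max_features": sp_randint(1, 11),
              "min_samples_split": sp_randint(2, 11),
              "min_samples_leaf": sp_randint(1, 11),
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

random_search = RandomizedSearchCV(clf, param_distributions=param_dist,
                                   n_iter=n_iter_search)
random_search.fit(X, y)

# Results now live in the cv_results_ dict rather than grid_scores_.
results = random_search.cv_results_
for rank, i in enumerate(np.argsort(results["mean_test_score"])[::-1][:3], 1):
    print("Model with rank: {0}".format(rank))
    print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
        results["mean_test_score"][i], results["std_test_score"][i]))
    print("Parameters: {0}".format(results["params"][i]))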
Feel free to check out the TotalGood Hyperparameter Optimization repo (HyperParameterOptimization/benchmarks) on GitHub to learn more about our current research (we love pull requests, FYI).
[0] Bergstra, James, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. "Algorithms for Hyper-Parameter Optimization." Advances in Neural Information Processing Systems (NIPS), 2011. Web. 21 Aug. 2015. http://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization.pdf
[1] Bergstra, James, and Yoshua Bengio. "Random Search for Hyper-Parameter Optimization." Journal of Machine Learning Research 13 (2012): 281-305. Web. 20 Aug. 2015. http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf
Zheng, Alice. "How to Evaluate Machine Learning Models: Hyperparameter Tuning." 27 May 2015. Web. 21 Aug. 2015.
"Automatic Hyperparameter Tuning Methods." John Myles White. Web. 21 Aug. 2015.
Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.
Presentation made using Jupyter, IPython, and React.js