
Presentation for Zimanyi Winter School 2015

On GitHub: andrewjohnlowe/zimanyi15

Quark/gluon jet tagging for ALICE

Machine learning for particle physics using R

Andrew John Lowe

Wigner Research Centre for Physics,

Hungarian Academy of Sciences

Introduction: about this talk

  • This is a walk-through of an ongoing analysis that I am performing entirely in R
  • Computing power was limited to my own laptop
  • Limited RAM and CPU processing power imposed constraints that drove the analysis decisions
  • Similar constraints apply when performing machine learning in TMVA (ROOT) or Scikit-learn (Python), both of which are used in HEP, so the lessons learned are applicable when using those tools too
  • This talk focuses on the problem of distinguishing between (light) quark- and gluon-initiated jets

The problem in a nutshell

  • Beams of energetic protons and/or heavy ions collide inside our detector
  • Quarks and gluons emerge and fragment into collimated sprays of particles
  • Algorithms cluster these fragmentation products into jets
  • For each jet, we'd like to know what initiated it
  • Was it a Quark? Or a Gluon?
  • Being able to accurately discriminate between quark- and gluon-initiated jets would be an extremely powerful tool in the search for new particles and new physics
  • This is an archetypal classification problem that might be amenable to machine learning

Machine learning & particle physics

  • Machine learning is more or less what is commonly known in particle physics as multivariate analysis (MVA)
  • Used for many years but faced widespread scepticism
  • Use of multivariate pattern recognition algorithms was basically taboo in new particle searches until recently
  • Much prejudice against using what were considered "black box" selection algorithms
  • Neural nets and Fisher discriminants saw some use in the 1990s
  • Boosted decision trees (AdaBoost, 1996) are the favourite algorithm used for many analyses (1st use: 2004)

Successes, Challenges and Future Outlook of Multivariate Analysis In HEP, Helge Voss, 2015 J. Phys.: Conf. Ser. 608 (2015) 012058; Higgs Machine Learning Challenge visits CERN, 19 May 2015, CERN; Boosted Decision Trees as an Alternative to Artificial Neural Networks for Particle Identification, Hai-Jun Yang et al., Nucl.Instrum.Meth. A543 (2005) 577-584

Machine Learning definition

  • Arthur Samuel (1959): Field of study that gives computers the ability to learn without being explicitly programmed
  • Tom Mitchell (1998): A computer program is said to learn from experience \(E\) with respect to some task \(T\) and some performance measure \(P\), if its performance on \(T\), as measured by \(P\), improves with experience \(E\)

Data production pipeline

  • Use experiment's software (AliROOT) to process Monte Carlo simulated data that contains lots of jets
    • Insert my own C++ code with handcrafted features
  • Attach ground-truth class labels ("quark"/"gluon") to data for each jet
    • There is significant class noise (mislabelled jets)
    • Different labelling schemes exist, but none are perfect because Monte Carlo simulation cannot perfectly simulate real data
  • Write out the data and convert it to CSV format for use in R (see the sketch below)
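
For illustration, a minimal sketch of the last step, using data.table for fast CSV reading; the file name "jets.csv" and the Label column are hypothetical placeholders, not part of the AliROOT pipeline:

    # Load the CSV produced by the AliROOT stage into R
    library(data.table)
    jets <- as.data.frame(fread("jets.csv"))   # fast, multi-threaded CSV reader
    jets$Label <- factor(jets$Label)           # hypothetical class-label column
    str(jets)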

Class labelling

  • Jets labelled using the partons in the generator event record
  • Parton with highest \(p_{\mathsf{T}}\) within a \({\Delta}R\) equal to the radius parameter of the jet algorithm determines the jet label
    • This is identical to the scheme used by ATLAS\(^{*\dagger}\)
    • This labelling procedure is not unambiguous and is not strictly identical for different MC generators
    • Definitions are not theoretically robust, but studies (with MADGRAPH) have shown that for most generators, truth labelling is identical to matrix-element-based labelling for 90-95% of (isolated) jets

* Light-quark and gluon jet discrimination in \(pp\) collisions at \(\sqrt{s}\) = 7 TeV with the ATLAS detector, Eur.Phys.J. C74 (2014) 3023

\(\dagger\) Light-quark and Gluon Jets in ATLAS: Calorimeter Response, Jet Energy Scale Systematics, and Sample Characterization, ATLAS-CONF-2011-053, Mar. 2011, also ATLAS-CONF-2012-138, Sept. 2012

What about mislabelled jets?

  • Impose a requirement that the jets are isolated to restrict contamination from wide-angle QCD radiation
  • For each jet, require no other jet within \({\Delta}R\) = 1 (2.5 \(\times\) jet radial parameter) of the jet axis
  • Methods exist for dealing with mislabelled training data
    • Majority vote filtering and consensus filtering can significantly improve classification accuracy for noise levels up to 30%\(^*\)

* Identifying Mislabeled Training Data, C. E. Brodley and M. A. Friedl, Journal of Artificial Intelligence Research, Vol. 11, pages 131-167, 1999

Data production

The data that I am using for the results shown are:

  • \(pp\) at 7 TeV, high-\(p_\mathrm{T}\) data, Pythia 6, Perugia 2011 tune, full detector Monte Carlo simulation
  • Charged-track jets reconstructed with Anti-\(k_\mathrm{T}\) with a radial parameter of 0.4
  • Constituents: "Picotracks" with \(p_\mathrm{T}\) > 0.15 GeV
  • Code exercised using the ALICE "EMCAL jet framework"
  • No other cuts at this stage
  • What about heavy-ions?
    • One step at a time...

Why did I choose R?

  • Chief among my reasons was the need for fast prototyping and high-level abstractions that let me concentrate on what I wanted to achieve, rather than on the mechanics and the highly-granular details of how I might do it
  • Incredibly easy to express what I want to achieve
  • Exponentially-growing number of add-on packages
  • Latest machine learning algorithms are available
  • About 2 million R users worldwide\(^*\); technical questions are answered extremely quickly (if not already)
  • Not as fast as C++, but my goal is to quickly test new ideas rather than implement a final model
    • A prototype can be distilled into C++ code later to produce a final model

* http://www.inside-r.org/what-is-r

  • For experimental particle physics, ROOT is the ubiquitous data analysis tool, and is 20 years old
  • Command language: CINT ("interpreted C++") or Python
    • Small data: work interactively or run macros
    • Big data: compile with ROOT libraries, run on Grid
  • Data format optimised for large data sets
  • Complex algorithms are difficult to do interactively
  • End up writing huge C++ programs
  • Lots of tweaking, endless edit-compile-run loops
  • The best choice for prototyping new methods?

See Highlights and Analysis of the Answers to the ROOT Users' Survey, "ROOT Turns 20" Users' Workshop, 15-18 September 2015, Saas-Fee, Switzerland

On C++ and data analysis

  • Is C++ a good choice for data analysis?
    • Spend days coding something that runs in minutes or
    • Write something in a couple of hours that will run during your lunch break?
    • Which will get you your answer faster? What strategy will help you define where you should be focusing your efforts and which paths lead to dead-ends?
  • Larry Wall, creator of Perl (speaking about differences in the number of lines of code needed to accomplish the same task using different languages):

You can eat a one-pound steak, or 100 pounds of shoe leather, and you feel a greater sense of accomplishment after the shoe leather, but maybe there's some downsides...

Data preselection

  • There were some jets for which a partonic flavour could not be assigned; these are discarded
  • Heavy-flavour jets are discarded, as \(b\)- and \(c\)-tagging is not the topic of this analysis
  • Single-track jets are removed; the justification for doing this is that in many cases it is not possible to construct a meaningful observable for these jets
    • Probably not much we can do for these jets anyway, so they are removed
    • Try to recover these later maybe
  • All remaining jets with NaN values were dropped; this amounts to less than 0.1% of jets (this might bias the data, but the impact will be extremely small)

Data munging

  • Started with more than 300 features
  • Removing duplicate and highly correlated features was critical for enabling my data to fit in RAM
    • I could reduce the size of my data by more than 50% with a conservative cut on correlations between features
    • Two-step process: cut on the absolute value of the Pearson correlation to identify linearly correlated features, then cut on the absolute value of the Spearman correlation to identify monotonically-related features (see the sketch after this list)
  • Needed to reduce data size considerably (to fit into RAM) while retaining useful features
    • Succeeded in reducing number of features to ~50
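
A minimal sketch of the two-step cut with caret::findCorrelation; the 0.95 cutoffs and the jets data frame are illustrative assumptions, not the exact values used in the analysis:

    library(caret)
    num <- jets[, sapply(jets, is.numeric)]          # numeric features only
    # Step 1: drop linearly correlated features (Pearson)
    drop1 <- findCorrelation(cor(num, method = "pearson"), cutoff = 0.95)
    if (length(drop1) > 0) num <- num[, -drop1]
    # Step 2: drop monotonically-related features (Spearman)
    drop2 <- findCorrelation(cor(num, method = "spearman"), cutoff = 0.95)
    if (length(drop2) > 0) num <- num[, -drop2]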

Feature ranking & selection

  • The question addressed by this work can be formally stated as follows: can we use quantitative characteristics of the jets to classify them as quark-jets or gluon-jets?
  • This invites the question: how should we find the variables that provide the best discrimination between quark-jets and gluon-jets?
  • We can use domain knowledge to drill down to what are believed to be the best discriminants; observables that:
    • Can explain most of the variance in the data
    • Are minimally correlated with each other
    • Provide the best predictive power
  • How to optimally search the feature space? (Manual inspection may be impractical for a large feature set)

Problems of too many features

Or: why don't we just throw everything into a BDT or neural net and hope for the best?

  • Correlated features can skew prediction
  • Irrelevant features (not correlated to class variable) cause unnecessary blowup of the model space
  • Irrelevant features can drown the information provided by informative features in noise
  • Irrelevant features in a model reduce its explanatory value
  • Training may be slower and more computationally expensive
    • Training time was an issue for me, so it was necessary to perform feature selection

Redundant & irrelevant features

What should we do when it is likely that the data contains many redundant or irrelevant features?

  • Redundant features are those which provide no more information than the currently selected features
  • Irrelevant features provide no useful information in any context

Feature ranking & selection methods

Several methods in R for feature ranking and selection:

  • Iteratively remove features shown by a statistical test to be less relevant than random probes: the Boruta algorithm
  • Correlation Feature Selection (CFS)
  • Recursive Feature Elimination (RFE, Backwards Selection)
  • Simulated annealing
  • Genetic algorithms

Tried all of these with varying levels of success

Some of these are very slow for large data like mine, which limits their utility somewhat, so I didn't investigate those further

Boruta

The basic principle, in a nutshell:

  • Boruta algorithm is a wrapper built around the random forest classification algorithm
    • Random forests are an ensemble learning method for classification (and regression) that operate by stochastically growing a forest of decision trees; each tree is grown in such a way that at each split only a random subset of all features is considered
  • The importance measure of an attribute is obtained as the loss of classification accuracy caused by the random permutation of feature values between objects
  • It is computed separately for all trees in the forest which use a given feature for classification
  • Then the average and standard deviation of the accuracy loss are computed
  • Claims to be robust against "selection bias"\(^*\)

* Selection bias in gene extraction on the basis of microarray gene-expression data, Christophe Ambroise and Geoffrey J. McLachlan, PNAS vol. 99, No. 10 (2002)
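
In R this is only a few lines; a minimal sketch with the Boruta package, reusing the hypothetical jets data frame and Label column from earlier:

    library(Boruta)
    set.seed(42)
    # Compares each feature's importance against randomly permuted
    # "shadow" copies of the features
    bor <- Boruta(Label ~ ., data = jets, maxRuns = 100)
    print(bor)
    getSelectedAttributes(bor, withTentative = FALSE)   # confirmed features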

Correlation Feature Selection

  • The Correlation Feature Selection (CFS) measure evaluates subsets of features on the basis of the following hypothesis: "Good feature subsets contain features highly correlated with the classification, yet uncorrelated to each other"

  • The following equation gives the merit of a feature subset \(S\) consisting of \(k\) features:

\[ Merit_{S_{k}} = \frac{k\overline{r_{cf}}}{\sqrt{k+k(k-1)\overline{r_{ff}}}} \]

  • where \(\overline{r_{cf}}\) is the average value of all feature-classification correlations, and \(\overline{r_{ff}}\) is the average value of all feature-feature correlations. These variables are referred to as correlations, but are not necessarily Pearson's correlation coefficient or Spearman's \(\rho\).
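
The merit formula translates directly into R; a small sketch with a hypothetical helper function (the FSelector package also offers a ready-made CFS search):

    # Merit of a k-feature subset: r_cf = mean feature-class correlation,
    # r_ff = mean feature-feature correlation
    cfs_merit <- function(r_cf, r_ff, k) {
      (k * r_cf) / sqrt(k + k * (k - 1) * r_ff)
    }
    cfs_merit(r_cf = 0.4, r_ff = 0.2, k = 10)   # ~0.76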

Recursive Feature Elimination

  • First, the algorithm fits the model to all predictors
    • I used a random forest for the model
  • Each predictor is ranked using its importance to the model
  • Let \(S\) be a sequence of ordered numbers which are candidate values for the number of predictors to retain (\(S_1 > S_2 > \dots\))
  • At each iteration of feature selection, the \(S_i\) top ranked predictors are retained, the model is refit and performance is assessed
  • The value of \(S_i\) with the best performance is determined and the top \(S_i\) predictors are used to fit the final model
  • To minimise the possibility of selection bias, I performed k-fold cross-validation during training with ten folds
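
A minimal sketch of this procedure using caret's rfe(); the candidate subset sizes are illustrative assumptions:

    library(caret)
    set.seed(42)
    ctrl <- rfeControl(functions = rfFuncs,          # random-forest model and ranking
                       method = "cv", number = 10)   # 10-fold cross-validation
    prof <- rfe(x = jets[, setdiff(names(jets), "Label")],
                y = jets$Label,
                sizes = c(10, 20, 30, 50),           # candidate subset sizes S_i
                rfeControl = ctrl)
    predictors(prof)                                 # top-ranked subset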

Fast feature selection

  • The usual methods provided for feature selection work well with small-sized data, but become impracticably slow for data of the size that is typical in HEP
  • It was necessary to devise a faster method
  • I adapted a filter-based method that involves ranking by information gain (Kullback–Leibler divergence) or Gini impurity
  • Basic idea: rank by one or both of these metrics, and do this repeatedly for a large number of bootstrap resamplings from the data and average
    • For each feature, we now have a nonparametric estimate of the mean and can calculate a confidence interval for that estimate!
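
A minimal sketch of the bootstrap ranking, using information.gain() from the FSelector package and the hypothetical jets data frame:

    library(FSelector)
    B <- 100                                    # number of bootstrap resamples
    gains <- replicate(B, {
      boot <- jets[sample(nrow(jets), replace = TRUE), ]
      information.gain(Label ~ ., data = boot)$attr_importance
    })
    rownames(gains) <- setdiff(names(jets), "Label")
    ig_mean <- rowMeans(gains)                  # nonparametric estimate of the mean
    ig_ci <- apply(gains, 1, quantile, probs = c(0.025, 0.975))   # 95% CI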

Information gain & Gini impurity

  • Information gain is based on the concept of entropy from information theory and is commonly used to decide which features to use when growing a decision tree

\[ Entropy = - \sum_{i}{p_i}{\log_{2}}{p_i} \]

  • At each level, the feature with the highest information gain is chosen
  • An alternative measure of "node impurity" commonly used in decision tree learning is the Gini impurity:

\[1 - \sum_{i}{p_i}^2\]
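
Both measures are one-liners in R; for example, for a node that is 80% gluon-jets and 20% quark-jets:

    entropy <- function(p) -sum(p[p > 0] * log2(p[p > 0]))
    gini    <- function(p) 1 - sum(p^2)
    entropy(c(0.8, 0.2))   # 0.722 bits
    gini(c(0.8, 0.2))      # 0.32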

Information gain distributions

Note "BOGUS" random probes injected into the data

Feature ranking

Median information gain

Remove features with median information gain less than that for random probes, or within one standard deviation

Correlation analysis

  • Filter-based methods for feature selection are fast, but they only look at each feature individually and do not take into account interactions between features
  • Typically we would plot a correlation matrix to examine the Pearson (linear) correlations between features
  • This type of visualisation is commonly used in particle physics, albeit with smaller matrices (\(<\) 20 features)
  • "OK" for small matrices, but the information value of visualisation diminishes for larger matrices
    • Also: slightly more than half the plot is redundant!

Visualising a correlation matrix as a biplot

  • Consider instead visualising correlation structures by performing principal component analysis (PCA) and plotting a biplot that shows pairwise correlations as angles:

\[ r\left(x_1,x_2\right) = \cos\angle\left(x_1,x_2\right) \]

  • In the following visualisation (next slide), PCA was performed on the data and the first two components extracted and plotted on a biplot (which also hints that existing features don't appear to be efficiently leveraging the apparent structure seen in the data)
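
A minimal sketch with base R, assuming a numeric feature matrix X; prcomp() performs the PCA, and biplot() and screeplot() produce the plots shown on the next slides:

    pca <- prcomp(X, center = TRUE, scale. = TRUE)   # PCA on standardised features
    biplot(pca, choices = 1:2)                       # first two principal components
    screeplot(pca, type = "lines")                   # variance per component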

Correlation matrix biplot

Screeplot

This screeplot shows that most of the variance in the data, and hence most of the structure, is contained in perhaps no more than half a dozen principal components

Visualising a correlation matrix as a force-directed network

Connection strength between features (nodes) is proportional to their marginal correlation; the node colour provides a measure of feature importance

Visualising a correlation matrix as a force-directed network

One can also plot a partial correlation network that shows the conditional pairwise correlations between features

Visualising a correlation matrix as a force-directed network

We can remove edges between weakly-correlated predictors on a partial correlation network to isolate them
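
Such networks can be drawn with, for example, the qgraph package; a minimal sketch, where the feature matrix X and the edge threshold are illustrative:

    library(qgraph)
    cmat <- cor(X)
    # Marginal correlation network, force-directed ("spring") layout
    qgraph(cmat, graph = "cor", layout = "spring")
    # Partial correlation network; suppress weak edges to isolate
    # weakly-correlated predictors
    qgraph(cmat, graph = "pcor", layout = "spring", threshold = 0.1)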

Prototype jet classifier

  • I decided to use a boosted decision tree as my classifier
  • Specifically, I used Stochastic Gradient Boosting or Gradient Boosting Machine (GBM)\(^\dagger\) in xgboost
  • There are several justifications for this choice:
    • Speed: whereas R is single-threaded, XGBoost runs using multiple threads and is very fast\(^*\)
    • Outstanding performance in Kaggle competitions
    • Can do early stopping: quit training if performance degrades after N rounds; great for tuning\(^*\)
    • Can be tweaked to do Random Forests also\(^*\)
    • Robust with respect to class noise

\(\dagger\) Greedy Function Approximation: A Gradient Boosting Machine, Jerome Friedman, The Annals of Statistics, Vol. 29, No. 5, Oct., 2001; Additive logistic regression: a statistical view of boosting, Jerome Friedman, Trevor Hastie, Robert Tibshirani, The Annals of Statistics, 2000, Vol. 28, No. 2, 337-407

* I'm indebted to Szilárd Pafka for this info. He's done great work benchmarking these tools.
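
A minimal sketch of the GBM training in xgboost, with AUC as the evaluation metric and early stopping; the train/validation objects and hyperparameter values are illustrative assumptions:

    library(xgboost)
    dtrain <- xgb.DMatrix(as.matrix(train_x), label = train_y)   # 1 = quark, 0 = gluon
    dvalid <- xgb.DMatrix(as.matrix(valid_x), label = valid_y)
    params <- list(objective = "binary:logistic",
                   eval_metric = "auc",       # optimise AUC, not accuracy
                   eta = 0.1, max_depth = 6,
                   subsample = 0.8,           # stochastic gradient boosting
                   nthread = 4)
    model <- xgb.train(params, dtrain, nrounds = 1000,
                       watchlist = list(valid = dvalid),
                       early_stopping_rounds = 50)   # stop when AUC stops improving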

Results

Confusion Matrix and Statistics

          Reference
Prediction Gluon Quark
     Gluon 44991  8224
     Quark 34660 12125

               Accuracy : 0.5712          
                 95% CI : (0.5681, 0.5742)
  • We correctly tagged 60% of true quark jets
  • We mis-tagged 44% of true gluon jets ☹
  • The main problem appears to be one of overfitting: performance on the training data is reasonable
  • The task ahead is to deal with the overfitting problem to hammer down the mis-tag rate

Future work

  • I have unbalanced classes: \(>\)80% of jets are gluon-jets
    • I got bad results until I realised that most of the classifiers I tried were optimising for accuracy; results improved when I optimised for AUC instead
  • Down-sampling the majority class could hurt performance, as we remove points that could be useful for defining the optimal decision boundary
  • Bagging should help ameliorate this
  • I plan to CV bag\(^*\) to build an ensemble learner
  • Add Random Forests and Deep Learning into the mix?

* Dissecting the Winning Solution of the HiggsML Challenge, Gábor Melis, Journal of Machine Learning Research, Workshops and Proceedings Vol 42, pp. 57–67, 2015

Summary

  • It's often said that 80% of data analysis is spent on data munging\(^*\); this was certainly true in my case
  • However, I've found a good set of tools for streamlining this process
  • I've shown how it is possible to use R for fast prototyping of a new method for binary classification
    • To the best of my knowledge, nobody has tried to do a particle physics analysis like this in R before
    • Will be invaluable for helping me decide where to direct my efforts later when building a final model

* Exploratory Data Mining and Data Cleaning, Dasu T, Johnson T (2003), John Wiley & Sons

Thanks!

alice-machine-learning@cern.ch