On GitHub: DonaldWhyte/intro-to-deeplearning
Originally made for:
A bit about myself...
Infrastructure Engineer
I currently work for Bloomberg as an infrastructure engineer, helping design and build the low-level systems that keep financial data moving to the right places. My role is essentially a hybrid between software engineer, architect and data scientist. I dabble in a bit of everything, basically!
Hackathons
Organiser / Mentor / Hacker
I have participated in, mentored at and organised 17 hackathons across the world, in countries such as Egypt, the UAE, Italy, Germany, the US and, of course, the UK.
Applied machine learning in:
Machine learning is an approach to achieving AI
Machines learn behaviour with little human intervention
Programs that can adapt when exposed to new data
Based on pattern recognition
Supervised learning will be covered here
Use labelled historical data to predict future outcomes
Given some input data, predict the correct output
What features of the input tell us about the output?
Use trained model to classify new, unseen inputs
Be careful of overfitting!
For n features, the perceptron is defined as:
Simulates the 'firing' of a physical neuron
1 = neuron fires, 0 = neuron does not fire
$$ f(x) = \begin{cases}1 & \text{if }w \cdot x + b > 0\\0 & \text{otherwise}\end{cases} $$
How do we learn w and b?
Algorithm which learns correct weights and bias
Use training dataset to incrementally train perceptron
Guaranteed to create line that divides output classes
(if data is linearly separable)
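A minimal sketch of the perceptron and its learning rule, assuming NumPy; the toy AND dataset and function names are illustrative, not from the talk:

```python
import numpy as np

def perceptron(x, w, b):
    """Fires (returns 1) if the weighted sum of inputs exceeds 0."""
    return 1 if np.dot(w, x) + b > 0 else 0

def train(inputs, targets, learning_rate=0.1, epochs=100):
    """Perceptron learning rule: nudge w and b towards each misclassified
    example. Guaranteed to converge if the data is linearly separable."""
    w = np.zeros(inputs.shape[1])
    b = 0.0
    for _ in range(epochs):
        for x, t in zip(inputs, targets):
            error = t - perceptron(x, w, b)
            w += learning_rate * error * x
            b += learning_rate * error
    return w, b

# Toy example: learn logical AND, which is linearly separable
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
w, b = train(X, y)
```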
Details of the algorithm are not covered here for brevity. Training uses a training dataset: a collection of known input/output pairs (typically produced by humans manually labelling inputs).
Most data is not linearly separable
Need a network of neurons to discriminate non-linear data
Most common neural network architecture
Provides classification or regression
Uses multiple perceptrons in a layered fashion
where n is the feature count and m is the class count.
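To make the layered structure concrete, here is a minimal sketch of a forward pass, where each layer applies its own weight matrix followed by an activation (illustrative NumPy, assuming sigmoid activations; not the talk's code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weight_matrices, biases):
    """Feed input x (n features) through each layer in turn: multiply by
    the layer's weight matrix, add its bias vector, apply the activation."""
    activation = x
    for W, b in zip(weight_matrices, biases):
        activation = sigmoid(W @ activation + b)
    return activation  # m values, one per output class
```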
The hidden layers are where all the smarts come in. I could spend days discussing how to choose the number of hidden layers and the number of nodes in each layer; it depends on many factors, such as the number of input features and the distribution of inputs across the feature space.
Produces multiple weight matrices
One for each layer
$$ W = \begin{bmatrix} w_{00} & w_{01} & \cdots & w_{0n} \\ w_{10} & w_{11} & \cdots & w_{1n} \\ \vdots & \vdots & \ddots & \vdots \\ w_{m0} & w_{m1} & \cdots & w_{mn} \end{bmatrix} $$
Learn the weight matrix!
Keep adjusting neuron weights
Such that loss/error function is minimised
Uses derivatives of activation functions to adjust weights
So we need continuous activation functions like sigmoid!
Mean squared error loss function:
$$ J(w) = \frac{1}{N} \sum_{i=1}^{N} (y_i - t_i)^2 $$
where $N$ is the number of training samples, $y_i$ is the network's output for sample $i$, and $t_i$ is the target output.
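Putting the pieces together, here is a minimal sketch of one backpropagation update for a single-hidden-layer network with sigmoid activations, using the per-example squared error (illustrative NumPy, not the talk's implementation). Note how the sigmoid's derivative, $\sigma'(z) = \sigma(z)(1 - \sigma(z))$, appears in each layer's error term:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_step(x, t, W1, b1, W2, b2, lr=0.01):
    """One forward pass and one gradient descent update
    for a single training example (x, t)."""
    # Forward pass
    h = sigmoid(W1 @ x + b1)   # hidden layer activations
    y = sigmoid(W2 @ h + b2)   # network output

    # Backward pass: derivative of 0.5 * (y - t)^2 through each layer.
    # The y*(1-y) and h*(1-h) factors are the sigmoid derivatives.
    delta_out = (y - t) * y * (1 - y)
    delta_hidden = (W2.T @ delta_out) * h * (1 - h)

    # Step each weight matrix and bias against its gradient
    W2 -= lr * np.outer(delta_out, h)
    b2 -= lr * delta_out
    W1 -= lr * np.outer(delta_hidden, x)
    b1 -= lr * delta_hidden
    return W1, b1, W2, b2
```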
After training the network, we obtain weights which minimise the loss/error
Classify new, unseen inputs by running them through the forward pass step
Many types of gradient descent / backpropagation
Batch, mini-batch, stochastic
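These variants differ only in how many examples contribute to each weight update. A sketch, where the data arrays and `update_weights` are hypothetical placeholders:

```python
import numpy as np

# Hypothetical training data and update function, for illustration only
X = np.random.randn(1000, 28)    # 1000 examples, 28 features
t = np.random.randint(0, 2, 1000)

def update_weights(x_slice, t_slice):
    pass  # average gradients over the slice, then take one descent step

# batch:      batch_size = len(X)  -> one update per epoch
# mini-batch: batch_size = 32      -> one update per small slice
# stochastic: batch_size = 1       -> one update per example
batch_size = 32
for start in range(0, len(X), batch_size):
    update_weights(X[start:start + batch_size], t[start:start + batch_size])
```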
There are many types of gradient descent for optimising weight matrices and minimising loss; we don't have time to go through them all here, but there's an amazing blog post that does an excellent job of summarising the different types of gradient descent optimisers.
A feed-forward neural network with a single hidden layer that has a finite number of nodes can approximate any continuous function
Why not turn to deep learning?
A machine learning technique
Neural networks with many hidden layers
Learns the input features for you
Thanks to Google, Amazon, Microsoft, etc.
Everyone is on the web
Everyone's data is captured
Backpropagation can be reduced to matrix multiplication
Easily parallelised and distributed across multiple cores
GPGPU and distributed computation
Higher weight magnitudes and variance in earlier layers!
Widely used for image and sound processing
Both recognition and generation
Image recognition: Google ImageNet
Speech synthesis: Google DeepMind WaveNet
An important thing to note is that deep learning enables more than just recognition: it enables the generation of content as well, a capability beyond most machine learning techniques.
Annual revenue for deep learning will surpass $10 billion by 2024
The market intelligence firm forecasts that annual software revenue for enterprise applications of deep learning will increase from $109 million in 2015 to $10.4 billion in 2024. I suspect this large growth is because of deep learning's application in driverless cars, which is going to become a huge industry.
Higgs Boson Detection
Vast amounts of energy required to create them
Large Hadron Collider used to produce this energy
It is difficult to detect Higgs bosons because of their massive size compared with other particles. We need vast amounts of energy to create them, and the Large Hadron Collider at CERN gave us enough energy to do so.
Large Hadron Collider
Built by CERN in Switzerland, the Large Hadron Collider is capable of producing the vast amounts of energy required to run complex particle physics experiments.
Use measurements of effects to infer Higgs boson production!
2: Each collision produces a flurry of new particles, which are picked up by detectors around the point where the particles collide. There is still only a very small chance, one in 10 billion, of a Higgs boson appearing and being detected, so the LHC needs to smash together trillions of particles. Supercomputers then need to sift through a massive amount of data to find the few collisions where evidence of the Higgs boson lies.
3-4: However, even when a Higgs boson, or any other interesting particle, is produced, detecting it poses considerable challenges. These particles are too small to be directly observed and decay almost immediately into other particles. Although the new particles cannot be directly observed, the lighter stable particles to which they decay, called decay products, can be. Multiple layers of detectors surround the point of collision for this purpose, measuring the direction and momentum of each decay product. So we use the measurements of the stable particles to infer the creation of a Higgs boson!
[Sample of the HIGGS dataset: each row is a class label (1 = signal, 0 = background) followed by 28 real-valued features.]
Class, Feature 1, Feature 2, Feature 3, ..., Feature 28
Source of data: https://archive.ics.uci.edu/ml/datasets/HIGGS
Source: [1]
High-level features perform worse than all features
Suggests high-level features do not capture all info in low-level features
Methods trained with only the high-level features perform worse than those trained with all of the features. This suggests that, despite the insight represented by the high-level features, they do not capture all of the information contained in the low-level features.
Let's use deep learning!
Automatically discover insight contained in high-level features
Using only low-level features!
Weights initialised randomly from a normal distribution
Less variance in weights for deeper layers
Layer(s)        Standard Deviation
First layer     0.1
Hidden layers   0.05
Output layer    0.001
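A sketch of this initialisation scheme in NumPy. Only the standard deviations come from the table above; the layer sizes are assumptions for illustration:

```python
import numpy as np

# Illustrative layer sizes: 28 input features (as in the HIGGS dataset),
# two hidden layers and one output unit
layer_sizes = [28, 300, 300, 1]
std_devs = [0.1, 0.05, 0.001]  # first layer, hidden layer(s), output layer

# One weight matrix per layer, drawn from a zero-mean normal distribution
# whose standard deviation shrinks for deeper layers
weights = [
    np.random.normal(loc=0.0, scale=std, size=(n_out, n_in))
    for (n_in, n_out), std in zip(zip(layer_sizes[:-1], layer_sizes[1:]), std_devs)
]
```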
Data scientists had to reluctantly accept the limitations of shallow networks
Must construct helpful, high-level features to guide shallow networks and other ML techniques
Until now, physicists have reluctantly accepted the limitations of the shallow networks employed to date. In an attempt to circumvent these limitations, physicists manually construct helpful non-linear feature combinations to guide the shallow networks.
Provides better discrimination than traditional classifiers
Even when traditional classifiers are aided by manually constructed features!
Recent advances in deep learning techniques lift these limitations by automatically discovering powerful non-linear feature combinations and providing better discrimination power than current classifiers, even when those classifiers are aided by manually constructed features.
Traditional ML requires complex, hand-crafted features
Correctly classifying complex feature spaces requires lots of fine-tuning of features and parameters
...and it still might not be accurate enough.
Theoretically, neural networks are extremely powerful
Can model any mathematical function and fit any space
Deep networks can learn complex features for us
No more time-consuming high-level feature engineering
But we just didn't know how to effectively train them
Vanishing gradients, poor weight initialisation and lack of computational power held us back
Meant we could realistically only use shallow networks
Until 2006, when we saw an explosion of new techniques
ReLU/softmax activations, smarter weight initialisations, regularisation techniques
And much better hardware for training nets (GPGPUs)
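For reference, the two activation functions named above are simple to define; a NumPy sketch:

```python
import numpy as np

def relu(z):
    """Rectified linear unit: passes positive values, zeroes out the rest.
    Its gradient does not shrink towards zero, which eases the vanishing
    gradient problem in deep networks."""
    return np.maximum(0, z)

def softmax(z):
    """Squashes a vector into a probability distribution over classes.
    Subtracting max(z) first keeps the exponentials numerically stable."""
    e = np.exp(z - np.max(z))
    return e / e.sum()
```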
We've only scratched the surface in this talk
These new techniques have enabled us to build very complex architectures for a variety of tasks
Classification / regression problems with highly complex feature spaces
Learns 'compressed' version of input data
Can provide better abstraction and generalisation
Layers can be trained individually
Allows layers to be reused in other networks
Great for image recognition
Image recognition and video analysis
Great for time-series data
Processing/generating audio or natural language
We don't HAVE to use deep learning.
It's not always better.
Traditional ML techniques are often powerful enough.
And are much easier to use:
Nevertheless, deep learning has proven able to achieve things that traditional techniques can't.
It's here to stay!