Spice up your website
with Machine Learning!
Evelina Gabasova
@evelgab
F# Snippets
F# snippets are like Gist on steroids
Started 5 years ago by Tomas Petricek
F# Snippets
fssnip.net
Demo of F# snippets website
It all works nicely until you start searching for something
Searching through F# snippets
over 1600 snippets
over 1100 different tags
Searching through F# snippets
Do we need a custom system?
let! tells us that the code uses asynchronous workflows: it gets the return value
of an asynchronous function
Google ignores punctuation, so searching for let and let! gives
the same results although the meaning is very different
F# is a statically typed language!
I'm not interested in what the function is actually called,
but in the fact that it takes an asynchronous workflow
and returns an asynchronous workflow.
This type information is not available as text, and Google
doesn't index it for search.
Great opportunity to create a custom machine learning system!
We're programmers, we like to automate - the alternative would be to
go through the tags manually and curate them!
what can possibly go wrong?
There's this company that also does a lot of machine learning...
this year has been great for machine learning
but as machine learning is getting more and more
ubiquitous, so are the problems!
Sergey Lavrov (Russian foreign minister) = sad little horse
Nguyen A et al.: Deep Neural Networks are Easily Fooled:
High Confidence Predictions for Unrecognizable Images. 2015.
Using machine learning in production
- dependence on training data
- inputs
incoming data is not controlled, users do many different things
Black boxes can be unpredictable
PART I
Finding related snippets
If you liked this F# code, you'll also like ...
Simple information retrieval
common terms
Looking at common terms is not enough
Two snippets may have a large overlap if they use common words
Bag of words
- ignore order of words
separate text and code
This allows us to look at code and text differently
Text - free-form structure
Code - fixed structure with keywords
Keywords have a different role than variable names
Term frequency
Snippet 1
Term     Frequency
async    3
x        15
The      2
code     1
...

Snippet 2
Term     Frequency
async    0
x        15
The      2
code     1
...
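Counts like those listed above can be produced with a bag-of-words pass over each snippet. The following is a minimal sketch, not the code behind fssnip.net: the `term_frequencies` name and the regular-expression tokenizer are assumptions for illustration, and it does not separate text from code the way the slides suggest.

```python
import re
from collections import Counter

def term_frequencies(snippet: str) -> Counter:
    """Count how often each term occurs, ignoring word order (bag of words)."""
    # Hypothetical tokenizer: identifiers made of letters, digits and underscores.
    terms = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", snippet)
    return Counter(terms)

snippet = "let x = async { return x + x }"
counts = term_frequencies(snippet)
print(counts["x"], counts["async"], counts["List"])  # prints: 3 1 0
```

A `Counter` returns 0 for terms that never occur, which matches the zero rows in the tables.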
Why comparing snippets based on term frequency alone is not sufficient:
async is informative, x is not
term frequency is important but cannot be used alone
Inverse document frequency
Relative importance of terms
$$\mathrm{idf}(\mathrm{term}) = \log \frac{\text{number of snippets}}{\text{number of snippets with term}}$$
Weight - how informative a term is
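To make the weight concrete, here is a small sketch of the inverse document frequency on a toy collection. The function name and the four example snippets are made up for illustration; each snippet is represented only by the set of terms it contains.

```python
import math

def inverse_document_frequency(term, snippets):
    """idf(term) = log(number of snippets / number of snippets with term)."""
    containing = sum(1 for terms in snippets if term in terms)
    if containing == 0:
        raise ValueError(f"term {term!r} does not occur in any snippet")
    return math.log(len(snippets) / containing)

# Toy data (assumption): every snippet uses x, only one uses async.
snippets = [{"x", "async", "let"}, {"x", "List"}, {"x", "let"}, {"x", "Array"}]
idf_x = inverse_document_frequency("x", snippets)
idf_async = inverse_document_frequency("async", snippets)
print(idf_x, idf_async)
```

A term that appears in every snippet gets weight log(1) = 0, while a rare term such as async gets a larger weight.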
Look at what features are the least informative
Vector representation: TF-IDF
Term frequency - inverse document frequency
$$\mathrm{tfidf}(\mathrm{term}, \mathrm{snippet}) = \mathrm{tf}(\mathrm{term}, \mathrm{snippet}) \times \mathrm{idf}(\mathrm{term})$$
This makes it comparable between different documents
By computing this for every snippet, we get a vector representation
Vector representation of snippets
Snippet     x       List     Array    ...
snippet1    0       0.17     0        ...
snippet2    0       0.04     0.001    ...
snippet3    0.23    0.005    0.31     ...
snippet4    0       0        0        ...
...
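Rows like these can be computed by multiplying each term count by the term's idf. This is a sketch under assumptions: it uses raw counts as tf (the values in the table suggest some normalization that the slides do not spell out), and the `tfidf_vectors` name and the toy snippets are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(snippets, vocabulary):
    """Return one tf-idf vector per snippet, with one entry per vocabulary term."""
    n = len(snippets)
    counts = [Counter(terms) for terms in snippets]
    # Number of snippets that contain each term at least once.
    doc_freq = {t: sum(1 for c in counts if c[t] > 0) for t in vocabulary}
    return [
        [c[t] * math.log(n / doc_freq[t]) if doc_freq[t] else 0.0 for t in vocabulary]
        for c in counts
    ]

# Toy data (assumption): each snippet is a list of its terms.
snippets = [["x", "List", "List"], ["x", "List", "Array"], ["x", "x", "x"]]
vectors = tfidf_vectors(snippets, ["x", "List", "Array"])
for vector in vectors:
    print(vector)
```

Because x occurs in every toy snippet, its column is zero everywhere, and the snippet that contains nothing but x ends up as an all-zero vector.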
Vector representation of snippets
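One common way to compare two such vectors is the cosine of the angle between them. The sketch below uses only the three visible columns (x, List, Array) of the rows shown earlier, so the numbers are illustrative rather than the real similarities; the function name is an assumption.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Truncated rows from the table (columns x, List, Array only).
snippet1 = [0, 0.17, 0]
snippet2 = [0, 0.04, 0.001]
snippet3 = [0.23, 0.005, 0.31]
snippet4 = [0, 0, 0]

sim12 = cosine_similarity(snippet1, snippet2)
sim13 = cosine_similarity(snippet1, snippet3)
print(sim12, sim13)
```

On these truncated rows, snippet1 and snippet2 point in almost the same direction, while snippet1 and snippet3 are nearly orthogonal. The all-zero snippet4 has no direction, so the sketch returns 0.0 for it by convention.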
The angle between snippet vectors shows how related they are
Suggesting tags
Actually one of the most popular tags is F#
Making sense of user-generated tags
async, #async, async mailprocessor, async paraller, Async sequences, asyncseq, asynchronous, Asynchronous Processing, Asynchronous Programming, asynchronous sequence, asynchronous workflows
Capitalization, plural vs singular, typos, different transcriptions
Converting everything to singular helps,
but then you get Window Form and other things like this
Edit distance
regex vs. regexp
sports vs. ports
pi vs. API
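The pairs above are all one edit apart. As a sketch, here is the standard Levenshtein distance, which counts insertions, deletions and substitutions; whether the site uses exactly this variant is an assumption, and the comparison is made case-insensitive here by lowercasing first.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character edits from a to b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, free if characters match
            ))
        previous = current
    return previous[-1]

print(edit_distance("regex", "regexp"))            # 1
print(edit_distance("sports", "ports"))            # 1
print(edit_distance("pi".lower(), "API".lower()))  # 1
```

The last two pairs show the catch: a small distance does not mean that two tags are the same tag.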
insertion, deletion
Machine learning
From snippets to tags
Mapping
For each snippet, I have a set of tags
I can use the data to train a predictor
Associations
string and parser
async and MailboxProcessor
sequence and exception
Naive Bayes
Why do you call me naive?
Why naive?
string and parser
async and MailboxProcessor
sequence and exception
naive means we're making our lives easier because we can
ignore how individual terms depend on each other
Building a predictor
We can actually estimate this from data
Building a predictor
Maybe this should have the "async" tag!
Tag probabilities
Bayes theorem
$$p(A \mid B) = \frac{p(B \mid A)\, p(A)}{p(B)}$$
Tag probabilities
Bayes theorem
$$p(\mathrm{tag} \mid \mathrm{snippet}) \propto p(\mathrm{tag})\, p(\mathrm{snippet} \mid \mathrm{tag})$$
Tag probabilities
Bayes theorem
$$p(\mathrm{tag} \mid \mathrm{snippet}) \propto p(\mathrm{tag}) \prod_{\mathrm{term}} p(\mathrm{term} \mid \mathrm{tag})$$
Tag probabilities
Bayes theorem
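$$p(\mathrm{tag} \mid \mathrm{snippet}) \propto p(\mathrm{tag}) \times p(\mathrm{term}_1 \mid \mathrm{tag})\, p(\mathrm{term}_2 \mid \mathrm{tag})\, p(\mathrm{term}_3 \mid \mathrm{tag}) \cdots$$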
1. Prior probabilities
$$p(\mathrm{tag}) \approx \frac{\text{number of snippets with the tag}}{\text{number of snippets}}$$
p(async) = 5%
2. Tag likelihood
How frequent is the term among snippets that have the tag?
$$p(\mathrm{term} \mid \mathrm{tag}) = \frac{\text{number of snippets with the term and tag}}{\text{number of snippets with the tag}}$$
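Both quantities are plain counting. The sketch below estimates the prior and the term likelihoods for one tag from a tiny hand-made data set; the function name, the data and its tags are assumptions for illustration, and the tag is assumed to occur at least once.

```python
def estimate_probabilities(tagged_snippets, tag):
    """Estimate p(tag) and p(term | tag) by counting snippets.

    tagged_snippets is a list of (set_of_terms, set_of_tags) pairs.
    """
    with_tag = [terms for terms, tags in tagged_snippets if tag in tags]
    prior = len(with_tag) / len(tagged_snippets)
    vocabulary = set().union(*(terms for terms, _ in tagged_snippets))
    likelihood = {
        term: sum(1 for terms in with_tag if term in terms) / len(with_tag)
        for term in vocabulary
    }
    return prior, likelihood

# Toy data (assumption): four snippets, two of them tagged async.
data = [
    ({"async", "let", "MailboxProcessor"}, {"async"}),
    ({"async", "Seq"}, {"async"}),
    ({"List", "map"}, {"list"}),
    ({"string", "parser"}, {"parsing"}),
]
prior, likelihood = estimate_probabilities(data, "async")
print(prior, likelihood["MailboxProcessor"], likelihood["List"])  # 0.5 0.5 0.0
```

Note the zero for List: no snippet tagged async contains it in this toy data, so the raw estimate of p(List | async) is exactly 0.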
Naive Bayes prediction
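$$p(\mathrm{tag} \mid \mathrm{snippet}) \propto p(\mathrm{tag}) \prod_{\mathrm{term}} p(\mathrm{term} \mid \mathrm{tag})$$
$$p(\mathrm{tag} \mid \mathrm{snippet}) \overset{?}{>} p(\neg\mathrm{tag} \mid \mathrm{snippet})$$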
Working in logarithm space
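Taking logarithms of the Naive Bayes proportionality turns the product into a sum. The additive constant is the same for tag and ¬tag, so it drops out of the comparison:

```latex
\log p(\mathrm{tag} \mid \mathrm{snippet})
  = \log p(\mathrm{tag})
  + \sum_{\mathrm{term}} \log p(\mathrm{term} \mid \mathrm{tag})
  + \mathrm{const}
```

Summing logarithms also avoids the numerical underflow that comes from multiplying many small probabilities.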
The whole prediction reduces to summations
The theory is always nicer
What if there is no snippet tagged async that contains List?
In practice we introduce numerical fixes, which makes it a bit messier
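The slides do not say which numerical fixes are used. One common choice is add-one (Laplace) smoothing, so that a term never seen with a tag gets a small non-zero probability. The sketch below combines that assumption with the log-space comparison of tag against ¬tag; all names and the toy data are hypothetical.

```python
import math

def log_score(terms, tagged_snippets, tag, has_tag=True, alpha=1.0):
    """Smoothed log p(group) + sum of log p(term | group) over the snippet's terms.

    The group is either the snippets that carry the tag or those that do not.
    """
    group = [t for t, tags in tagged_snippets if (tag in tags) == has_tag]
    # Smoothed prior of the group (assumption: add-alpha smoothing).
    score = math.log((len(group) + alpha) / (len(tagged_snippets) + 2 * alpha))
    for term in terms:
        count = sum(1 for t in group if term in t)
        # Smoothed likelihood: never exactly zero, so the logarithm is defined.
        score += math.log((count + alpha) / (len(group) + 2 * alpha))
    return score

def suggest_tag(terms, tagged_snippets, tag):
    """True if the tag scores higher than its absence for this set of terms."""
    with_tag = log_score(terms, tagged_snippets, tag, has_tag=True)
    without_tag = log_score(terms, tagged_snippets, tag, has_tag=False)
    return with_tag > without_tag

# Toy data (assumption): no snippet tagged async contains List.
data = [
    ({"async", "let", "MailboxProcessor"}, {"async"}),
    ({"async", "Seq"}, {"async"}),
    ({"List", "map"}, {"list"}),
    ({"string", "parser"}, {"parsing"}),
]
print(suggest_tag({"async", "List"}, data, "async"))     # True
print(suggest_tag({"string", "parser"}, data, "async"))  # False
```

Without smoothing, the first query would have a probability of zero for async simply because List never appears with that tag in the training data.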
Naive Bayes is a very simple algorithm - in principle you can use any classification algorithm
Machine learning to improve user experience
Machine learning
Why do you need machine learning?
Collect your data!
Feature engineering.
Actual machine learning.
Profit!
Machine learning
Why do you need machine learning?
Collect your data!
Feature engineering.
Actual machine learning.
...
Put it into production.
Profit!
- Do you really need a custom system?
- Domain representation
- What are important features
- Machine learning is fun!
Tf-idf as a reasonable representation
Choosing which features are important