talk-fssnip-recommender




Spice up your website with machine learning

On Github evelinag / talk-fssnip-recommender

Spice up your website

with Machine Learning!

Evelina Gabasova

@evelgab

F# Snippets

F# Snippets is like Gist on steroids. Started 5 years ago by Tomas Petricek.

F# Snippets

fssnip.net

Demo of the F# Snippets website. It all works nicely until you start searching for something.
You can search by tag.

Searching through F# snippets

over 1600 snippets

over 1100 different tags

Searching through F# snippets

Do we need a custom system?

let! tells us that the code uses asynchronous workflows; it is used to get the return value of an asynchronous function. Google ignores punctuation, so searching for let and let! gives the same results, although the meaning is very different.

F# is a statically typed language! I'm not interested in what the function is actually called, but I am interested in the fact that it takes an asynchronous workflow and returns an asynchronous workflow. This type information is not available as text, and Google doesn't index it for search.

Great opportunity to create a custom machine learning system!

We're programmers, we like to automate. The alternative would be to go through the tags manually and curate them! What can possibly go wrong? There's this company that also does a lot of machine learning...
This year has been great for machine learning, but as machine learning is getting more and more ubiquitous, so are the problems!

Sergey Lavrov (Russian foreign minister) = sad little horse

Nguyen A et al.: Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images. 2015.

Using machine learning in production

  • dependence on training data
  • inputs
Incoming data is not controlled; users do many different things. Black boxes can be unpredictable.

User-generated inputs


PART I

Finding related snippets

If you liked this F# code, you'll also like ...

Simple information retrieval

common terms

Looking at common terms is not enough. Two snippets may have a large overlap simply because they both use common words.

Bag of words

  • ignore order of words
  • separate text and code

This allows us to look at code and text differently. Text is free-form; code has a fixed structure with keywords. Keywords play a different role than variable names.
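A minimal sketch of the bag-of-words idea, written in Python for illustration. The tokenizer, the `bag_of_words` helper and the sample snippet are assumptions made for this example, not the code behind fssnip.net. The tokenizer keeps a trailing `!` so that `let` and `let!` stay distinct terms.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Count tokens and ignore their order (hypothetical tokenizer)."""
    # An identifier, optionally followed by '!' so that let! differs from let.
    tokens = re.findall(r"[A-Za-z_][A-Za-z0-9_']*!?", text)
    return Counter(tokens)

# Made-up snippet: the description and the code are kept separate.
description = "The code downloads a page asynchronously"
code = "let download url = async { let! html = fetch url\n return html }"

text_bag = bag_of_words(description.lower())  # free text: case folded
code_bag = bag_of_words(code)                 # code: case preserved
```

Keeping two separate bags means the text and the code can later be weighted or cleaned up independently.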

Term frequency

Snippet 1

Term     Frequency
async    3
x        15
The      2
code     1
...

Snippet 2

Term     Frequency
async    0
x        15
The      2
code     1
...

Comparing snippets based on term frequency alone is not sufficient: async is informative, x is not. Term frequency is important, but it cannot be used alone.

Inverse document frequency

Relative importance of terms

$$\mathrm{idf}(\text{term}) = \log \frac{\text{number of snippets}}{\text{number of snippets with term}}$$

The weight says how informative a term is. Look at which features are the least informative.
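The idf weight can be sketched in a few lines of Python. The toy snippets below are made up for illustration, and the `idf` helper assumes that the term occurs in at least one snippet.

```python
import math

def idf(term, snippets):
    """Inverse document frequency: log(number of snippets / snippets with term)."""
    containing = sum(1 for terms in snippets if term in terms)
    return math.log(len(snippets) / containing)

# Each toy snippet is represented by the set of terms it contains.
snippets = [
    {"async", "x", "the", "code"},
    {"x", "the", "code"},
    {"x", "list"},
    {"x", "array", "the"},
]

idf_x = idf("x", snippets)          # x is in every snippet: weight 0
idf_async = idf("async", snippets)  # async is rare: highest weight here
```

A term that appears in every snippet, like `x` here, gets weight zero, which matches the intuition that it says nothing about what a snippet is about.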

Vector representation: TF-IDF

Term frequency - inverse document frequency

$$\mathrm{tfidf}(\text{term}, \text{snippet}) = \mathrm{tf}(\text{term}, \text{snippet}) \times \mathrm{idf}(\text{term})$$

This makes the weights comparable between different documents. By computing this for every snippet, we get a vector representation.
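A sketch of turning token lists into TF-IDF vectors, in Python. Using the relative frequency within a snippet as tf is an assumption of this example (the slides do not fix a particular normalization), and the documents are made up.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns (vocabulary, list of TF-IDF vectors)."""
    vocab = sorted({t for doc in docs for t in doc})
    n = len(docs)
    # Document frequency: in how many snippets does each term occur?
    df = Counter(t for doc in docs for t in set(doc))
    idf = {t: math.log(n / df[t]) for t in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid dividing by zero for an empty snippet
        vectors.append([counts[t] / total * idf[t] for t in vocab])
    return vocab, vectors

docs = [
    ["async", "x", "x", "list"],
    ["x", "array", "x"],
    ["x", "list", "map"],
]
vocab, vectors = tfidf_vectors(docs)
```

Every snippet ends up as a vector over the same vocabulary, so any two snippets can be compared position by position.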

Demo

Vector representation of snippets

Snippet     x       List     Array    ...
snippet1    0       0.17     0        ...
snippet2    0       0.04     0.001    ...
snippet3    0.23    0.005    0.31     ...
snippet4    0       0        0        ...
...

Vector representation of snippets

The angle between snippet vectors shows how related they are: the smaller the angle, the more related the snippets.
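The angle is usually measured through its cosine. The Python sketch below uses only the three columns visible in the table above (x, List, Array); the real vectors have many more dimensions, so the numbers are purely illustrative. Returning 0 for an all-zero vector is a choice made for this example.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Columns: x, List, Array (truncated vectors from the table).
snippet1 = [0.0, 0.17, 0.0]
snippet2 = [0.0, 0.04, 0.001]
snippet3 = [0.23, 0.005, 0.31]
snippet4 = [0.0, 0.0, 0.0]

sim_12 = cosine_similarity(snippet1, snippet2)  # both dominated by List
sim_13 = cosine_similarity(snippet1, snippet3)  # little in common
```

On these truncated vectors, snippet1 and snippet2 come out close to 1 (nearly the same direction), while snippet1 and snippet3 come out close to 0.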

PART II

Suggesting tags

Suggesting tags

Actually one of the most popular tags is F#

Making sense of user-generated tags

async, #async, async mailprocessor, async paraller, Async sequences, asyncseq, asynchronous, Asynchronous Processing, Asynchronous Programming, asynchronous sequence, asynchronous workflows

Capitalization, plural vs. singular, typos, different transcriptions. Converting everything to singular helps, but then you get Window Form and other things like this.

Edit distance

regex vs. regexp

sports vs. ports

pi vs. API

insertion, deletion
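A sketch of edit distance in Python. It uses the standard Levenshtein definition, which counts substitutions in addition to the insertions and deletions named above; that choice is an assumption of this example.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

d_regex = edit_distance("regex", "regexp")
d_sports = edit_distance("sports", "ports")
d_pi = edit_distance("pi", "api")  # compared after lowercasing
d_async = edit_distance("async", "asynchronous")
```

All of the first three pairs are one edit apart, although only regex and regexp mean the same thing, while async and asynchronous are far apart despite being the same concept. Edit distance alone cannot clean up the tags.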

Machine learning

From snippets to tags

Mapping: for each snippet, I have a set of tags. I can use this data to train a predictor.

Associations

string and parser

async and MailboxProcessor

sequence and exception

Naive Bayes

Why do you call me naive?

Why naive?

string and parser

async and MailboxProcessor

sequence and exception

Naive means we're making our lives easier: we ignore how individual terms depend on each other.

Building a predictor

We can actually estimate this from data

Building a predictor

Maybe this should have the "async" tag!

Building a predictor

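Tag probabilities

Bayes theorem

$$p(A \mid B) = \frac{p(B \mid A)\, p(A)}{p(B)}$$

Applied to tags and snippets:

$$p(\text{tag} \mid \text{snippet}) \propto p(\text{tag})\, p(\text{snippet} \mid \text{tag})$$

With the naive assumption that terms are independent given the tag:

$$p(\text{tag} \mid \text{snippet}) \propto p(\text{tag}) \prod_{\text{term}} p(\text{term} \mid \text{tag})$$

$$p(\text{tag} \mid \text{snippet}) \propto p(\text{tag}) \times p(\text{term}_1 \mid \text{tag})\, p(\text{term}_2 \mid \text{tag})\, p(\text{term}_3 \mid \text{tag}) \dots$$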

1. Prior probabilities

$$p(\text{tag}) \approx \frac{\text{Number of snippets with the tag}}{\text{Number of snippets}}$$

p(async) = 5%

2. Tag likelihood

How frequent is the term among snippets that have the tag?

$$p(\text{term} \mid \text{tag}) = \frac{\text{Number of snippets with the term and tag}}{\text{Number of snippets with the tag}}$$
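Both quantities are plain counts over the tagged snippets. A Python sketch on made-up data follows; the `(terms, tags)` pair representation and the helper names are assumptions of this example, and `likelihood` assumes that at least one snippet carries the tag.

```python
def prior(tag, snippets):
    """p(tag): fraction of snippets that carry the tag."""
    return sum(1 for _terms, tags in snippets if tag in tags) / len(snippets)

def likelihood(term, tag, snippets):
    """p(term | tag): fraction of snippets with the tag that contain the term."""
    tagged = [terms for terms, tags in snippets if tag in tags]
    return sum(1 for terms in tagged if term in terms) / len(tagged)

# Toy data: (set of terms in the snippet, set of tags on the snippet).
snippets = [
    ({"async", "let!", "MailboxProcessor"}, {"async", "agent"}),
    ({"async", "let!"}, {"async"}),
    ({"List", "map"}, {"list"}),
    ({"string", "parser"}, {"parsing"}),
]

p_async = prior("async", snippets)
p_mailbox = likelihood("MailboxProcessor", "async", snippets)
p_list = likelihood("List", "async", snippets)
```

In this toy data no snippet tagged async contains List, so the estimated likelihood for that pair is exactly zero.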

Naive Bayes prediction

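$$p(\text{tag} \mid \text{snippet}) \propto p(\text{tag}) \prod_{\text{term}} p(\text{term} \mid \text{tag})$$

$$p(\text{tag} \mid \text{snippet}) \overset{?}{>} p(\neg\text{tag} \mid \text{snippet})$$

When working in logarithm space, the whole prediction reduces to summations.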
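A Python sketch of the comparison in log space. The prior p(async) = 5% comes from the slide above; all the likelihood values are hypothetical numbers chosen for illustration.

```python
import math

def log_score(prior, likelihoods):
    """log p(tag) + sum of log p(term | tag); all probabilities must be > 0."""
    return math.log(prior) + sum(math.log(p) for p in likelihoods)

# Hypothetical likelihoods for two terms found in a snippet.
with_tag = log_score(0.05, [0.9, 0.6])       # tag = async
without_tag = log_score(0.95, [0.02, 0.01])  # tag = not async
suggest_tag = with_tag > without_tag

# Why logarithms: a long product of small probabilities underflows to zero,
# while the sum of their logarithms stays a perfectly ordinary number.
underflowed = math.prod([0.01] * 1000)
log_sum = log_score(1.0, [0.01] * 1000)
```

Because the logarithm is monotonic, comparing the two sums gives the same decision as comparing the two products.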

The theory is always nicer

What if there is no snippet tagged async that contains List?

We introduce numerical fixes, which make it a bit messier. Naive Bayes is a very simple algorithm; in principle, you can use any classification algorithm.
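One common fix for the zero-count problem is add-one (Laplace) smoothing. The slides do not say which fix is used, so the Python sketch below is an assumption, shown on made-up data.

```python
def smoothed_likelihood(term, tag, snippets, alpha=1.0):
    """p(term | tag) with add-alpha smoothing, so the estimate is never exactly 0 or 1."""
    tagged = [terms for terms, tags in snippets if tag in tags]
    with_term = sum(1 for terms in tagged if term in terms)
    # Two outcomes per term (present or absent), hence 2 * alpha in the denominator.
    return (with_term + alpha) / (len(tagged) + 2 * alpha)

# Toy data: (set of terms in the snippet, set of tags on the snippet).
snippets = [
    ({"async", "let!", "MailboxProcessor"}, {"async"}),
    ({"async", "let!"}, {"async"}),
    ({"List", "map"}, {"list"}),
]

p_list_given_async = smoothed_likelihood("List", "async", snippets)
p_async_given_async = smoothed_likelihood("async", "async", snippets)
```

With smoothing, a term that was never seen together with a tag lowers the score instead of forcing it to zero, and the logarithm is always defined.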

Demo

Machine learning to improve user experience

Machine learning

  • Why do you need machine learning?
  • Collect your data!
  • Feature engineering.
  • Actual machine learning.
  • ...
  • Put it into production.
  • Profit!

  • Do you really need a custom system?
  • Domain representation
  • What are important features
  • Machine learning is fun!

Tf-idf as a reasonable representation. Choosing which features are important.

Learning more

Thank you!

@evelgab github.com/evelinag evelinag.com
