On GitHub: evelinag/LanguageRecognizer-slides
Evelina Gabasova (@evelgab)
"F# empowers users to tackle complex computing problems with simple, maintainable and robust code."
What is the language of this text?
A Csillagok haboruja egy uropera filmsorozatnak, irodalmi muveknek es szamitogepes jatekoknak a neve.
This is Hungarian, of course!
[ NEAREST NEIGHBOUR CLASSIFIER ]
Get sample text from Wikipedia pages (done)
Calculate features: frequencies of letter pairs
Compare languages using their features
Example using the sample English text "the three" (_ marks a word boundary):

Pair    _t    th    he    e_    hr    re    ee
Count   2     2     1     2     1     1     1
Now calculate the probabilities of the pairs (each count divided by the total of 10 pairs):

Pair          _t    th    he    e_    hr    re    ee
Probability   0.2   0.2   0.1   0.2   0.1   0.1   0.1
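The feature calculation above can be sketched in a few lines. This is a Python illustration rather than the F# code from the talk; the function name `pair_probabilities` and the choice to lower-case the text and mark every word boundary with `_` are assumptions made to reproduce the numbers on the slide.

```python
from collections import Counter

def pair_probabilities(text):
    """Return the relative frequency of each adjacent letter pair in text."""
    # Assumption: word boundaries (including start and end) are marked with "_".
    padded = "_" + "_".join(text.lower().split()) + "_"
    # Count every pair of neighbouring characters.
    counts = Counter(padded[i:i + 2] for i in range(len(padded) - 1))
    total = sum(counts.values())
    # Normalise the counts so that they sum to 1.
    return {pair: n / total for pair, n in counts.items()}

probs = pair_probabilities("the three")
for pair, p in sorted(probs.items()):
    print(pair, p)
```

For "the three" this yields seven distinct pairs out of ten in total, so a pair seen twice gets probability 0.2 and a pair seen once gets 0.1.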
Language     th    e_    ee    el
English      0.3   0.2   0.2   0.1
Portuguese   0.0   0.2   0.1   0.3
Distance is the sum of squares of differences.
Language     th    e_    ee    el
English      0.3   0.2   0.2   0.1
Portuguese   0.0   0.2   0.1   0.3
Difference   0.3   0.0   0.1   -0.2
Sum of squares: \(0.09+0.0+0.01+0.04 = 0.14\)
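The same arithmetic as a minimal Python sketch (again an illustration, not the talk's F# code). The function name `distance` and the dictionary representation are assumptions; a pair missing from one profile is treated as having probability 0.

```python
def distance(p, q):
    """Sum of squared differences over every pair seen in either profile."""
    pairs = set(p) | set(q)
    # Assumption: a pair absent from a profile has probability 0.
    return sum((p.get(k, 0.0) - q.get(k, 0.0)) ** 2 for k in pairs)

# Feature values taken from the English/Portuguese example above.
english = {"th": 0.3, "e_": 0.2, "ee": 0.2, "el": 0.1}
portuguese = {"th": 0.0, "e_": 0.2, "ee": 0.1, "el": 0.3}

d = distance(english, portuguese)
print(round(d, 2))
```

Up to floating-point rounding, `d` matches the 0.14 computed by hand.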
               English   Spanish   Portuguese   Czech
Unknown text   0.10      0.14      0.25         0.27
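With the distances in hand, the nearest neighbour classifier simply picks the language with the smallest one. A small Python sketch, where `nearest_language` is a hypothetical helper name and the distances are copied from the table above:

```python
def nearest_language(distances):
    """Return the language whose profile is closest to the unknown text."""
    return min(distances, key=distances.get)

# Distances from the unknown text to each language, as listed above.
distances = {"English": 0.10, "Spanish": 0.14, "Portuguese": 0.25, "Czech": 0.27}

best = nearest_language(distances)
print(best)  # English
```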
[ PERCEPTRON ]
[ LOGISTIC REGRESSION ]
\(f(x) = \frac{1}{1 + e^{-x}}\)
Initial weights can be generated randomly
Improve weights using gradient descent
Repeat recursively until the error is small enough or a maximum number of steps is reached
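The three steps above could look roughly like the following Python sketch. It is not the implementation from the talk: it uses a plain loop instead of recursion, and the learning rate, the stopping tolerance, the mean squared error used as the stopping test, the cross-entropy gradient used for the update, and the two-feature toy data are all assumptions chosen for illustration.

```python
import math
import random

def sigmoid(x):
    """Logistic function f(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def predict(weights, x):
    # The last weight is the bias term.
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + weights[-1])

def train(samples, labels, rate=0.5, max_steps=5000, tolerance=1e-3, seed=0):
    rng = random.Random(seed)
    n = len(samples[0])
    # Step 1: initial weights are generated randomly.
    weights = [rng.uniform(-0.5, 0.5) for _ in range(n + 1)]
    # Step 3: repeat until the error is small or the step limit is hit.
    for _ in range(max_steps):
        grad = [0.0] * (n + 1)
        error = 0.0
        for x, y in zip(samples, labels):
            p = predict(weights, x)
            error += (p - y) ** 2
            for i, xi in enumerate(x):
                grad[i] += (p - y) * xi
            grad[-1] += p - y
        if error / len(samples) < tolerance:
            break
        # Step 2: improve the weights by moving against the gradient.
        weights = [w - rate * g / len(samples) for w, g in zip(weights, grad)]
    return weights

# Toy data (assumed): probabilities of "th" and "el"; 1 = English, 0 = Portuguese.
samples = [[0.3, 0.1], [0.0, 0.3]]
labels = [1, 0]
weights = train(samples, labels)
print(predict(weights, [0.3, 0.1]), predict(weights, [0.0, 0.3]))
```

After training, the first prediction should be above 0.5 (English) and the second below 0.5 (Portuguese). A real classifier would use the full set of letter-pair features and one model per language.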
FsLab Package www.fslab.org