
gamboge-retrospective

Presentation on my time developing an Atom plugin

On GitHub: eddieantonio / gamboge-retrospective

Hacking Atom

My experience writing an NLP-powered code suggestion engine

By Eddie Antonio Santos / @_eddieantonio

Hey all. I'm Eddie. Last year, I wrote a code suggestion engine for GitHub's Atom editor. For those unaware, Atom is like Sublime Text, but in a web view. That's oversimplifying it a little bit, but that'll give you a good idea of what it's like to use.

But before I get into my experience with Atom per se...

Let me tell you about my research!

Naturalness of Software

In 2012, Abram Hindle and his collaborators studied the relationship between natural language (that is, the languages humans use to communicate with each other) and programming languages.

Natural Language Processing

$n$-gram language model

Computing scientists have been using statistical techniques to analyze, digest, and understand natural language. We call these techniques natural language processing. Statistical methods work because, despite the theoretical complexity of languages, most utterances in a language tend to exhibit some degree of regularity and repeatability.

How many people use a predictive keyboard on their smartphone? It sometimes makes messages and utterances easier to type, because it predicts your next word based on your previous input to the system. This is an application of statistical natural language processing: in particular, the n-gram language model.
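To make the idea concrete, here is a toy sketch of the simplest possible version: count which token follows which in some training text, then suggest the most frequent follower. This is not how any particular keyboard (or Gamboge) is implemented; it only illustrates the n-gram intuition.

```js
// A toy bigram (2-gram) "language model": count followers, predict the most common one.
function trainBigrams(tokens) {
  const counts = new Map();
  for (let i = 0; i < tokens.length - 1; i++) {
    const prev = tokens[i];
    const next = tokens[i + 1];
    if (!counts.has(prev)) counts.set(prev, new Map());
    const followers = counts.get(prev);
    followers.set(next, (followers.get(next) || 0) + 1);
  }
  return counts;
}

function predictNext(counts, prev) {
  const followers = counts.get(prev);
  if (!followers) return null;
  // Return the follower with the highest count.
  return [...followers.entries()].sort((a, b) => b[1] - a[1])[0][0];
}

const tokens = 'for ( var i = 0 ; i < n ; i ++ )'.split(' ');
const model = trainBigrams(tokens);
console.log(predictNext(model, 'i'));  // '=' (all followers of 'i' tie; first seen wins)
```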

Can natural language processing be successfully applied to software?

Yes, probably.

Well, yes.

The end result is that code tends to be even more repeatable and predictable than English under n-gram language models. Which is a great result, but what does it do for us?

n-gram language models are nice, because all you need to know are the tokens of the language. Tokens being words and punctuation, and in whitespace-sensitive languages, indentation. So you don't need to know anything about the language's actual syntax, or its type system, or anything.

A veritable buttload of applications!

Well, a lot of things!

Just a few of the applications: pinpointing syntax errors in amateur code. Code that looks a bit odd to our language model is likely the root cause of a syntax error, even if the compiler reports it somewhere else entirely.

We can find common idioms in code: little snippets or templates of code that are used a lot. We can infer what they mean, and what they say about the code they're found in. And we can make it easier to input these snippets as well.

And there is code completion and code suggestion. You can predict the next word or punctuation in a programming language by simply asking the language model what it thinks. Abram and pals developed one such application that predicts exactly one word or punctuation (henceforth: a token) into the future.

Enter: Gamboge

Multitoken code suggestion in Atom

And that's where I come in. I wrote Gamboge in an attempt to predict MORE than one word or punctuation. In fact, as many as seems reasonable.

Here is an animated /ʒaɪf/ of Gamboge predicting Python code.

[ad lib about multi-token prediction, whitespace prediction].

Its secret for producing multiple tokens is sorting suggestions by the average surprisal of the tokens in each suggestion: that is, the mean surprise.

Mean Surprise

$$S_{i} = \frac{I_{i}}{|t_{i}|}$$

(Steal this!)

In an n-gram language model, we care about the surprisal of a sequence of tokens. I could bore you about negative log probability, but it's actually pretty intuitive: given some tokens, how surprising is it, according to the language model, to find them?

This gives us an intuitive way of looking at the next token's probability. The most probable token following some sequence is the least surprising token after the sequence. Indeed, we call this measure the surprise of the token sequence. An event that is certain to happen would have a probability of 100%. It is “not surprising”, so its surprisal value is 0. Similarly, an impossible event (according to our model) would leave us “infinitely surprised”.

Adding another token to a suggestion will almost always increase the overall surprisal of the suggestion (unless that next token follows the last 100% of the time). So, how do we prioritize longer, but more useful suggestions? Instead of sorting suggestions based on the surprisal of the entire suggestion, we sort according to the average, or mean, surprise of the suggestion. For single-token suggestions, nothing changes. But for multi-token suggestions, where some tokens contribute very little to the total surprisal, the longer suggestion is ranked higher.

$S_{i}$ is the rank of suggestion $i$. $I_{i}$ is the total surprisal of all of the tokens in the suggestion, and $|t_{i}|$ is the number of tokens in it. Divide the one by the other and we get an arithmetic mean. (A thought occurs: do we want mean surprise or mean probability? Mean probability would require transforming out of log space...).
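Here is a small sketch of that ranking in JavaScript. Only the formula above comes from the slides; the `probOfNext` function is a hypothetical stand-in for whatever the n-gram model reports, not Gamboge's actual API.

```js
// Rank candidate suggestions by mean surprisal: S_i = I_i / |t_i|.
// `probOfNext(context, token)` is hypothetical: it should return the model's
// estimate of P(token | context).
function meanSurprisal(tokens, probOfNext) {
  let total = 0;  // I_i: summed surprisal, in bits
  for (const [i, token] of tokens.entries()) {
    const p = probOfNext(tokens.slice(0, i), token);
    total += -Math.log2(p);  // surprisal of one token
  }
  return total / tokens.length;  // divide by |t_i|
}

// Sort suggestions so the least surprising (on average) comes first.
function rankSuggestions(suggestions, probOfNext) {
  return suggestions
    .map((tokens) => ({ tokens, score: meanSurprisal(tokens, probOfNext) }))
    .sort((a, b) => a.score - b.score);
}
```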

Its effectiveness is left as an exercise for the reader...

And that's all I have about the theoretical component.

Lessons Learned

Lessons learned in implementing an Atom plugin

It's just JavaScript (& npm!)

Developing for Atom, on the one hand, was really nice, because it's just JavaScript. It's built on top of Electron.

Electron (formerly known as Atom Shell) is basically Chromium smushed together with Node.js. It's pretty rad! And you can use it independently of Atom to make desktop apps in HTML and JavaScript. No bundler (Browserify, webpack) required.

Does anybody use Slack? Slack is made using Electron!

You can just use npm and `require` like normal, which grants you one of the best parts of JavaScript. Plus it has most front-end libraries, including React and jQuery. It's beautiful.
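For instance, a package's main module can pull in npm dependencies directly. This is only a minimal sketch of what such a module might look like; the package name and command are hypothetical, and this is not Gamboge's actual source.

```js
// lib/my-plugin.js -- a hypothetical, minimal Atom package main module.
// Any npm dependency listed in package.json can be require()'d as usual.
const esprima = require('esprima');

module.exports = {
  activate() {
    // Register a command that tokenizes the active editor's contents.
    this.subscription = atom.commands.add('atom-workspace', {
      'my-plugin:count-tokens': () => {
        const editor = atom.workspace.getActiveTextEditor();
        if (!editor) return;
        const tokens = esprima.tokenize(editor.getText());
        atom.notifications.addInfo(`${tokens.length} tokens`);
      }
    });
  },

  deactivate() {
    if (this.subscription) this.subscription.dispose();
  }
};
```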

That said, it's React. So don't touch the DOM!

The Atom Community is... uh...

Disclaimer: this is my personal opinion. Your mileage may vary! The Atom community is kind of... interesting.

So, I'm one of those obnoxious people that overuses emoji. But even I had to draw the line when I saw this in their official CONTRIBUTING.md.

Interesting priorities

It's a commit message emoji guide. And they're buttfucking serious about it.

CoffeeScript

Anyone down for a flamewar?

Remember that part where I said it was just JavaScript? I lied. It's just CoffeeScript. It'd be pointless for me to comment on CoffeeScript itself; but the community's steadfast insistence on the language is... a bit baffling. Anything that compiles to JavaScript will work... so, basically anything ever.

And yet they start flamewars on this issue.

The fact that this happens in a post-ES6 world seems problematic to me.

The Atom Community is... uh...

💩

So the Atom Community is... yeah.

Atom's Syntax Highlighting Tokenizer

Insufficient for prediction!

require('esprima').tokenize('const hello = `world`');

Earlier, I mentioned that all you need for an n-gram model are those precious, precious tokens; basically, a syntax highlighter should have a good idea of what the tokens of a language are if it's going to highlight it. That said, using the tokens from the syntax highlighter is insufficient: its regular expressions often group tokens together, or add whitespace to the definition of a token that would otherwise lack it. Additionally, it knows nothing of the INDENT and DEDENT tokens in whitespace-sensitive languages (Python, Haskell, YAML, CoffeeScript), so it's completely useless for those languages.

Luckily, you can use an external tool, such as Esprima, to do the tokenization for you.
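As a rough illustration of what that buys you, here is the call from the slide again with its result sketched out. The output in the comments is paraphrased from memory rather than copied from a run, and token shapes may differ slightly between Esprima versions.

```js
const esprima = require('esprima');

// Each token comes back with a type and its exact text: no grouping,
// no stray whitespace.
const tokens = esprima.tokenize('const hello = `world`');
// [ { type: 'Keyword',    value: 'const'   },
//   { type: 'Identifier', value: 'hello'   },
//   { type: 'Punctuator', value: '='       },
//   { type: 'Template',   value: '`world`' } ]

// For an n-gram model, all we really want is the flat sequence of values:
const sequence = tokens.map((t) => t.value);  // ['const', 'hello', '=', '`world`']
```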

Relying on Pre 1.0 APIs

BAD IDEA

I'd often get the rug pulled out from underneath me as APIs were constantly changing. Not to mention the API docs were... lacking, at best. I actually started a pull request to improve their docs on writing tests, documenting the monkey-patching they did to make tests happen; but I ultimately did not complete it.

[Probably skip this] Also, the Vim-Mode was shit, and I actually learned plugin development from having to patch Vim-Mode and make a PR.

At this point, I didn't know what React was, so I went ahead and modified the DOM by hand. It's just JavaScript! But then they updated their React version, which added aggressive checking of DOM invariants and inevitably broke Gamboge. Gamboge has stayed in this state FOREVER.

Speaking of tests...

Asynchronous testing in Jasmine 1.x

BAD IDEA

In Atom, there's no good way to test your plugin using any other testing framework than Jasmine 1.3. That's because it's bundled into Atom.

The testing facilities themselves are pretty cool: you can fully script the editor. Really neat! But a lot of my tests involved asynchronous code. Who has tested asynchronous code in Jasmine prior to version 2? It's a bad idea. It's clunky and problematic. This was the source of much frustration.
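For those who haven't had the pleasure: in Jasmine 1.x you sequence asynchronous steps by interleaving runs() blocks with waitsFor() latches, rather than using the done callback that Jasmine 2 introduced. A rough sketch of the shape, where fetchSuggestions() is a hypothetical async API and not Gamboge's actual code:

```js
describe('suggestions', function () {
  it('eventually returns some suggestions', function () {
    var results = null;

    runs(function () {
      // Kick off the asynchronous work.
      fetchSuggestions('import sys', function (suggestions) {
        results = suggestions;
      });
    });

    // Poll until the callback fires, or fail after 5000 ms.
    waitsFor(function () {
      return results !== null;
    }, 'suggestions to arrive', 5000);

    runs(function () {
      expect(results.length).toBeGreaterThan(0);
    });
  });
});
```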

Conclusion

While there are a few problems with Atom, if you're willing to put up with the mild absurdity of the community, its use of modern web technologies makes it an incredible platform for making attractive text editor plugins.

The state of Atom has improved, and the fact that it's at 1.0 means the API is moderately stable; you shouldn't have the same problems as I did.

And if you don't like Atom itself, you can make your own text editor using Electron.

Resources

Natural Language Processing

Resources

Language Tools

Special Thanks
