On GitHub: eddieantonio/gamboge-retrospective
By Eddie Antonio Santos / @_eddieantonio
Hey all. I'm Eddie. Last year, I wrote a code suggestion engine for GitHub's Atom editor. For those unaware, Atom is like Sublime Text, but in a web view. That's oversimplifying it a little bit, but that'll give you a good idea of what it's like to use.
But before I get into my experience with Atom per se...
In 2012, Abram Hindle and his collaborators observed the relationship between natural language --- that is, the languages humans use to communicate with each other --- and programming languages.
Computing scientists have been using statistical techniques to analyze, digest, and understand natural language. We call these techniques natural language processing. Statistical methods work because, despite the theoretical complexity of languages, most utterances in a language tend to exhibit some degree of regularity and repeatability.
How many people use a predictive keyboard on their smartphone? It makes certain messages and utterances easier to type, because it predicts your next word based on your previous input to the system. This is an application of statistical natural language processing: in particular, the n-gram language model.
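To make that concrete, here's a toy bigram sketch in Python (my illustration for these notes, not any real keyboard's code): count which words follow which, then suggest the most frequent followers.

```python
from collections import Counter, defaultdict

# Count how often each word follows another: a bigram model.
counts = defaultdict(Counter)
corpus = "the cat sat on the mat the cat ran".split()
for prev, word in zip(corpus, corpus[1:]):
    counts[prev][word] += 1

def suggest(prev_word, k=3):
    """Suggest up to k words, most frequent followers of prev_word first."""
    return [word for word, _ in counts[prev_word].most_common(k)]

print(suggest("the"))  # ['cat', 'mat']: "cat" followed "the" twice
```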
Well, yes.
The end result is that code tends to be even more repeatable and predictable than English when using n-gram language models. Which is a great result, but what does it do for us?
n-gram language models are nice, because all you need to know are the tokens of the language. Tokens being words and punctuation, and in whitespace-sensitive languages, indentation. So you don't need to know anything about the language's actual syntax, or its type system, or anything.
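You can see that token-agnosticism in code. This little helper (a sketch of mine, not Gamboge's actual implementation) extracts n-grams from any token sequence; it treats tokens as opaque strings, whether they're English words or Python tokens.

```python
def ngrams(tokens, n):
    """Yield every window of n consecutive tokens; tokens are opaque strings."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])

# Works identically on English words...
print(list(ngrams("to be or not to be".split(), 2)))
# ...and on Python tokens, indentation included:
print(list(ngrams(["if", "x", ":", "NEWLINE", "INDENT", "pass"], 3)))
```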
Well, a lot of things!
Just a few of the applications are: pinpointing syntax errors in amateur code---code that looks a bit odd to our language model is likely the root cause of a syntax error, even if the compiler reports it somewhere else entirely.
We can find common idioms in code---little snippets or templates of code that are used a lot---and infer what they mean, and what they say about the code they're found in. And we can make it easier to input these snippets as well.
And, there is code completion and code suggestion. You can predict the next word or punctuation in a programming language simply by asking the language model what it thinks. Abram and pals developed one such application that predicts exactly one word or punctuation mark (henceforth: token) into the future.
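In sketch form (mine, for illustration; not their implementation), single-token prediction is just a lookup: condition on the last couple of tokens and take the model's most probable continuation.

```python
from collections import Counter, defaultdict

# A toy trigram model: two tokens of context map to a counter of next tokens.
model = defaultdict(Counter)
tokens = ["for", "i", "in", "range", "(", "n", ")", ":"]
for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
    model[a, b][c] += 1

def predict_next(context):
    """Return the most probable next token given the last two, if any."""
    followers = model[tuple(context[-2:])]
    return followers.most_common(1)[0][0] if followers else None

print(predict_next(["i", "in"]))  # 'range'
```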
And that's where I come in. I wrote Gamboge in an attempt to predict MORE than one word or punctuation. In fact, as many as seem reasonable.
Here is an animated /ʒaɪf/ of Gamboge predicting Python code.
[ad lib about multi-token prediction, whitespace prediction].
Its secret for producing multiple tokens is sorting suggestions by the average surprisal of each token in the suggestion: that is, the mean surprisal.
In an n-gram language model, we care about the surprisal of a sequence of tokens. I could bore you with negative log probability, but it's actually pretty intuitive: given some tokens, how surprised is the language model to see them?
This gives us an intuitive way of looking at the next token's probability. The most probable token following some sequence is the least surprising token after that sequence. Indeed, we call this measure the surprisal of the token sequence. An event that is certain to happen would have a probability of 100%. It is “not surprising”, so its surprisal value is 0. Similarly, an impossible event (according to our model) would leave us “infinitely surprised”.
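In code, the definition is a one-liner (the probabilities below are made up, purely to show the scale):

```python
import math

def surprisal(p):
    """Surprisal in bits: log2(1/p), a.k.a. negative log probability."""
    return math.log2(1 / p) if p > 0 else math.inf

print(surprisal(1.0))  # 0.0: a certain event is not surprising
print(surprisal(0.5))  # 1.0: a coin flip is one bit of surprise
print(surprisal(0.0))  # inf: an impossible event, infinitely surprising
```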
Adding another token to a suggestion will almost always increase the overall surprisal of the suggestion (unless that next token follows the last one 100% of the time). So, how do we prioritize longer, but more useful, suggestions? Instead of sorting suggestions by the surprisal of the entire suggestion, we sort by the average, or mean, surprisal of the suggestion. For single-token suggestions, nothing changes. But for multi-token suggestions, where some tokens contribute very little to the total surprisal, the longer suggestion gets ranked higher.
$S_{i}$ is the rank of suggestion $i$. $I_{i}$ here is the total surprisal of all of the tokens in the suggestion. Divide by the number of tokens in the suggestion, and we get an arithmetic mean: $S_i = I_i / n_i$, where $n_i$ is the suggestion's length in tokens. (A thought occurs: do we want mean surprisal or mean probability? Because mean probability would require log-transforming...).
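Here's the ranking as a sketch (my own illustration; the surprisal numbers are invented): the five-token suggestion out-ranks the single token because its mean surprisal is lower.

```python
# Each suggestion pairs its tokens with each token's surprisal, in bits.
# These numbers are invented for illustration.
suggestions = [
    (["range"], [0.8]),
    (["range", "(", "n", ")", ":"], [0.8, 0.1, 2.0, 0.1, 0.1]),
]

def mean_surprisal(suggestion):
    """S_i = I_i / n_i: total surprisal divided by number of tokens."""
    tokens, surprisals = suggestion
    return sum(surprisals) / len(tokens)

# Lower mean surprisal ranks first: 3.1/5 = 0.62 bits beats 0.8 bits.
for tokens, _ in sorted(suggestions, key=mean_surprisal):
    print(" ".join(tokens))
```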
Its effectiveness is left as an exercise for the reader...
And that's all I have about the theoretical component.
Developing for Atom, on the one hand, was really nice, because it's just JavaScript. It's built on top of Atom Electron.
Electron (formerly known as Atom Shell) is basically Chromium smushed together with Node.js. It's pretty rad! And you can use it independently of Atom to make desktop apps in HTML and JavaScript. No bundler (Browserify, webpack) required.
Does anybody use Slack? Slack is made using Electron!
You can just use npm and `require` like normal, which grants you one of the best parts of JavaScript. Plus, it supports most front-end libraries, including React and jQuery. It's beautiful.
That said, Atom's editor is built with React. So don't touch the DOM!
Disclaimer: this is my personal opinion. Your mileage may vary! The Atom community is kind of... interesting.
So, I'm one of those obnoxious people that overuses emoji. But even I had to draw the line when I saw this in their official CONTRIBUTING.md.
Remember that part where I said it was just JavaScript? I lied. It's just CoffeeScript. It'd be pointless for me to comment on CoffeeScript itself; but the community's steadfast insistence on the language is... a bit baffling. Anything that compiles to JavaScript will work... so, anything ever.
And yet they start flamewars on this issue.
The fact that this happens in a post-ES6 world seems problematic to me.
Earlier, I mentioned that all you need for an n-gram model are those precious, precious tokens. Basically, a syntax highlighter should have a good idea of what the tokens in a language are if it's going to highlight it. That said, using the tokens from the syntax highlighter is insufficient: its regular expressions often group tokens together, or add whitespace to the definition of a token that would otherwise lack it. Additionally, it knows nothing of the INDENT and DEDENT tokens in whitespace-sensitive languages (Python, Haskell, YAML, CoffeeScript), so it's completely useless for those languages.
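For Python, at least, the standard library's tokenize module gets this right. A quick demo (mine, not Gamboge's actual pipeline) shows the INDENT and DEDENT tokens that a syntax highlighter never emits:

```python
import io
import tokenize

src = "if x:\n    pass\n"
for tok in tokenize.generate_tokens(io.StringIO(src).readline):
    print(tokenize.tok_name[tok.type], repr(tok.string))
# Output includes: NAME 'if', OP ':', NEWLINE '\n',
# INDENT '    ', NAME 'pass', DEDENT '', ENDMARKER ''
```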
Luckily, one can use an external tool, Esprima, to do the tokenization for you.
I'd often get the rug pulled out from under me as APIs were constantly changing. Not to mention the API docs were... lacking, at best. I actually started a pull request to improve their docs for writing tests, documenting the monkey-patching they did to make tests happen; but I ultimately did not complete it.
[Probably skip this] Also, the Vim-mode was shit, and I actually learned plugin development from having to patch the Vim-mode and make a PR.
At this point, I didn't know what React was, so I went ahead and modified the DOM by hand. It's just JavaScript! But then they updated their React version, which added aggressive checking of DOM invariants and inevitably broke Gamboge. Gamboge has stayed in this state FOREVER.
Speaking of tests...
In Atom, there's no good way to test your plugin using any testing framework other than Jasmine 1.3, because that's what's bundled into Atom.
The testing facilities themselves are pretty cool; you can fully script the editor. Really neat! But a lot of my tests involved asynchronous code. Who has tested asynchronous code in Jasmine prior to version 2? It's a bad idea. It's clunky and problematic. This was the source of much frustration.
While there are a few problems with Atom, if you're willing to put up with the mild absurdity of the community, its use of modern web technologies makes it an incredible platform for making attractive text editor plugins.
The state of Atom has improved, and the fact that it's at 1.0 means the API is moderately stable; you shouldn't have the same problems as I did.
And if you don't like Atom itself, you can make your own text editor using Atom Electron.