On GitHub eddieantonio / unicode-es6
(Thanks, Kim!)
Let's consult the standard!
So what does String.length actually measure, if it's not characters? Let's consult the ECMAScript 6 standard!
The String type is the set of all ordered sequences of zero or more 16-bit unsigned integer values (“elements”)
The String type is generally used to represent textual data in a running ECMAScript program, in which case each element in the String is treated as a UTF-16 code unit value.
The length of a String is the number of elements (i.e., 16-bit values) within it.
ECMAScript 6.0 Standard §6.1.4
Unique number given to each character
They look like this:
U+hhhh or U+hhhhhh
Range from U+0000 to U+10FFFF
1,114,112 code points available in total
(See Unicode Chapter 3)
Code points are divided into 17 planes
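The arithmetic behind those numbers can be sketched in a few lines. This is only an illustration: `planeOf` and `formatCodePoint` are hypothetical helper names, not anything built into the language, and the sketch assumes every plane holds exactly 0x10000 (65,536) code points.

```javascript
// Hypothetical helpers for illustration; these names are not a standard API.
const PLANE_SIZE = 0x10000;      // 65,536 code points per plane
const MAX_CODE_POINT = 0x10FFFF; // highest code point, U+10FFFF

// Which of the 17 planes (numbered 0 through 16) holds this code point?
function planeOf(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0 || codePoint > MAX_CODE_POINT) {
    throw new RangeError('Not a Unicode code point: ' + codePoint);
  }
  return Math.floor(codePoint / PLANE_SIZE);
}

// Format a code point as U+hhhh, padding to at least four hex digits.
function formatCodePoint(codePoint) {
  let hex = codePoint.toString(16).toUpperCase();
  while (hex.length < 4) {
    hex = '0' + hex;
  }
  return 'U+' + hex;
}

const totalCodePoints = MAX_CODE_POINT + 1;      // 1114112
const planeCount = totalCodePoints / PLANE_SIZE; // 17
```

So `planeOf(0x41)` is 0, while `planeOf(0x1F4A9)` is 1, and `formatCodePoint(0x41)` gives `'U+0041'`.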
- Let's take a quick tour of the Unicode code space!
- The range is divided into 17 planes
Smallest unit of storage required to store or transmit a single character in an encoding scheme
- That brings us to the term "code unit"
- Remember back when we were reading the ES6 spec, it talked about UTF-16 code unit values?
- Code units are the smallest unit of storage used to store or transmit Unicode code points in any given encoding format. A single code point may take more than one code unit; in UTF-16, it takes one or two.
We want to count code points and not code units
(String.prototype.codePointAt() exists too)
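Here is a small sketch of the difference between code units and code points, using U+1F4A9 as an arbitrary example of a character outside the Basic Multilingual Plane. The variable names are only for illustration.

```javascript
// U+1F4A9 lies outside the Basic Multilingual Plane, so UTF-16 needs two
// code units (a surrogate pair) to encode this one code point.
const poo = '\u{1F4A9}';

const lengthInCodeUnits = poo.length;   // 2, not 1
const highSurrogate = poo.charCodeAt(0); // 0xD83D
const lowSurrogate = poo.charCodeAt(1);  // 0xDCA9

// codePointAt() reads the whole surrogate pair starting at index 0.
const codePoint = poo.codePointAt(0);    // 0x1F4A9

// String.fromCodePoint() goes the other way.
const roundTrip = String.fromCodePoint(codePoint);
```

Note that `codePointAt()` still takes an index in code units, so `poo.codePointAt(1)` lands in the middle of the pair and returns the lone low surrogate, 0xDCA9.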
- Among the slew of features ES6 introduced are iterators
- The one for String iterates through code points!
let a = []; for (let c of s) { a.push(c); }
var i, a = []; for (i = 0; i < s.length; i++) { a.push(s[i]); }
- It's not just neater; it iterates over completely different things! The iterator gives you code points (characters), guaranteed! The C-style for-loop gives you 16-bit values that just usually happen to be code points.
Trick: Use Array.from
(it just does this:)
function (s) { let a = []; for (let c of s) { a.push(c); } return a; }
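To see the two kinds of iteration side by side, here is a minimal sketch run on a made-up test string containing one character outside the Basic Multilingual Plane. `countCodePoints` is a hypothetical helper name, not a built-in.

```javascript
// 'a', then U+1F4A9 (two UTF-16 code units), then 'b'.
const s = 'a\u{1F4A9}b';

// The string iterator, via Array.from, yields one entry per code point.
const byCodePoint = Array.from(s); // ['a', '\u{1F4A9}', 'b']

// Indexing yields one entry per 16-bit code unit, splitting the pair.
const byCodeUnit = [];
for (let i = 0; i < s.length; i++) {
  byCodeUnit.push(s[i]);           // ['a', '\uD83D', '\uDCA9', 'b']
}

// Hypothetical helper: count code points rather than code units.
function countCodePoints(str) {
  return Array.from(str).length;
}
```

Here `s.length` is 4, but `countCodePoints(s)` is 3.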
(Thanks, Kim!)
There are multiple ways of representing the same abstract character sequence
- The lesson is, there are sometimes multiple ways of representing a character sequence in Unicode
- Then how can we account for variation when searching text?
Useful for comparing different representations of the same abstract character sequence
- This is where Unicode NORMALIZATION saves the day!
- It allows you to compare different variations of the same character sequence, transparently!
(See UAX #15)
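As a concrete case, here is a minimal sketch comparing two representations of "é" with ES6's String.prototype.normalize(), using the NFC and NFD forms from UAX #15. `sameText` is a hypothetical helper name, and normalizing both sides to NFC is just one reasonable choice for the comparison.

```javascript
const precomposed = '\u00E9'; // é as a single code point
const decomposed = 'e\u0301'; // e followed by U+0301 COMBINING ACUTE ACCENT

// The two strings render the same but are not equal code unit by code unit.
const rawEqual = precomposed === decomposed; // false

// Hypothetical helper: compare two strings after normalizing both to NFC.
function sameText(a, b) {
  return a.normalize('NFC') === b.normalize('NFC');
}

const normalizedEqual = sameText(precomposed, decomposed); // true

// NFD splits the precomposed form; NFC puts the decomposed form back together.
const nfdLength = precomposed.normalize('NFD').length; // 2
const nfcLength = decomposed.normalize('NFC').length;  // 1
```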
- There are four normalization forms, of which we'll discuss two
- There's NFD, canonical decomposition: essentially, this breaks precomposed characters apart into a base character plus combining characters, and sorts the combining characters (usually accents) in a "canonical" order.
- Then there's NFC, which decomposes just like NFD, except it then pieces those combining characters back together again.
- So many words! Which one is useful?
When in doubt, use
Check the compatibility table!
- Unfortunately, not everything is immediately supported, so go to the ES6 compatibility page to see if the engines that you're targeting support the features mentioned in this presentation.
- And with that, I'm done.