Unicode!

(And how ES6 can help)

By Eddie Antonio Santos / @_eddieantonio

(Hypothetical Scenario)

- Tomorrow you get hired at a start-up.
- Our visionary CEO has decided that the internet needs a place to share its feelings, because the internet has a lot of feelings.
- But we don't want people to share *too many* feelings all at once; they may only share feelings 140 characters at a time.

- But that's okay; we have a lot of feelings too, so the internet has a lot of feelings.

(Thanks, Kim!)

- So this app is called Bitter.
- It allows you to post short messages, up to 140 characters long.
- Pretty simple, right?
- I mean, we have feelings too, but we hope to find... alternate... coping mechanisms.

- With that in mind, let's code it!

###### INSTRUCTIONS FOR EDDIE
- Ensure Bitter is loaded in a browser window
- Ensure branch is `master`
- Ensure src/index.jsx is loaded in Vim, at the length code (have it let message length = 0)
- Demo the app -- type some English in it.
- Add string.length to message length.
- Demo the app -- type some stuff in it. The character count should increase by 1.
- `node -p '"😭".repeat(140)' | pbcopy`
- Copy it, and try `pbpaste | wc -m`
- Delete some things to demonstrate counting by two.

(demo)

###### INSTRUCTIONS FOR EDDIE
- Demo the app.
- Write the naïve implementation.

What does String.length actually measure?

Let's consult the standard!

So what does String.length actually measure, if it's not characters? Let's consult the ECMAScript 6 standard!

The String type is the set of all ordered sequences of zero or more 16-bit unsigned integer values (“elements”)

The String type is generally used to represent textual data in a running ECMAScript program, in which case each element in the String is treated as a UTF-16 code unit value.

The length of a String is the number of elements (i.e., 16-bit values) within it.

ECMAScript 6.0 Standard §6.1.4

A string is an ordered sequence of 16-bit unsigned integer values... Strings are generally used to represent text, in which case each element is treated as a UTF-16 code unit value. The length of a string is the number of 16-bit values within it.
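To see what the spec text means in practice, here is a minimal sketch. The variable names are illustrative only, and the emoji is written as an ES6 code point escape so the snippet does not depend on the file's encoding.

```javascript
// Both strings render as one visible character each.
const latin = "a";            // U+0061
const crying = "\u{1F62D}";   // 😭, U+1F62D

// length counts 16-bit elements, not visible characters.
console.log(latin.length);    // 1
console.log(crying.length);   // 2

// A message made of 140 of these emoji reports a length of 280.
const message = crying.repeat(140);
console.log(message.length);  // 280
```

This is why a naïve counter based on `length` moves by two for every emoji typed or deleted.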

What the 👿 is a UTF-16 code unit value‽

WHAT THE DINK IS A UTF-16 CODE UNIT VALUE???!?!??!??!?

Unicode!

What is Unicode?

A mapping of numbers (code points) to every character. Ever.
Database of properties for each character (e.g., name, general category).

- Unicode is a lot of things.
- It's a big database assigning a number to every character used in human language. Ever.
- Each character is also assigned a number of properties, including a unique name.
- Characters are abstract representations of the smallest units of semantic value in written language [Unicode:3](http://www.unicode.org/versions/Unicode8.0.0/ch02.pdf).
- If you're dealing with text of any sort, you're probably dealing with Unicode.

Code Points

Unique number given to each character

They look like this:

U+hhhh, U+hhhhh, or U+hhhhhh

Range from U+0000 to U+10FFFF

1,114,112 code points available in total

(See Unicode Chapter 3)

- Code points are unique numbers assigned to each character.
- They're written as four to six hexadecimal digits, with that funky `U+` prefix.
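The notation can be produced from any string with ES6's `String.prototype.codePointAt()`. The sketch below follows the usual Unicode convention of at least four hexadecimal digits; `formatCodePoint` is a hypothetical helper name, not a built-in.

```javascript
// Hypothetical helper: format a code point number in U+ notation.
function formatCodePoint(codePoint) {
  const hex = codePoint.toString(16).toUpperCase();
  // Pad short values with leading zeros so there are at least four digits.
  const padded = hex.length < 4 ? ("0000" + hex).slice(-4) : hex;
  return "U+" + padded;
}

console.log(formatCodePoint("A".codePointAt(0)));          // U+0041
console.log(formatCodePoint("\u{1F62D}".codePointAt(0)));  // U+1F62D
console.log(formatCodePoint(0x10FFFF));                    // U+10FFFF
```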

A tour of the Unicode character space!

Code points are divided into 17 planes

- Let's take a quick tour of the Unicode code space!
- The range is divided into 17 planes.
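The plane arithmetic is small enough to check by hand. In this sketch, `PLANE_SIZE`, `PLANE_COUNT`, and `planeOf` are names made up for illustration.

```javascript
const PLANE_SIZE = 0x10000;  // 65,536 code points per plane
const PLANE_COUNT = 17;      // planes 0 through 16

// 17 planes of 65,536 code points each cover U+0000 through U+10FFFF.
const totalCodePoints = PLANE_SIZE * PLANE_COUNT;
console.log(totalCodePoints);    // 1114112

// Hypothetical helper: which plane does a code point fall in?
function planeOf(codePoint) {
  return Math.floor(codePoint / PLANE_SIZE);
}

console.log(planeOf(0x0041));    // 0
console.log(planeOf(0x1F62D));   // 1
console.log(planeOf(0x10FFFF));  // 16
```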

The Basic Multilingual Plane

- Most of the characters used in today's languages are here, in the Basic Multilingual Plane.
- English, Standard Chinese, Cherokee, Hindi, Russian, Hebrew, Ethiopian, everything! It's in here.

The Basic Multilingual Plane

(Plane 0)

Characters from practically all widely-used modern-day scripts
Code points are notated as U+hhhh
Code points range from U+0000 to U+FFFF

- These code points are notated using four hexadecimal digits.
- But this is just 1/17th of the possible Unicode code points...

All Unicode Code Points

- Here's the Basic Multilingual Plane...
- And here's where it lies in the entire Unicode code space!
- The non-BMP planes are informally called *astral planes* (officially, supplementary planes).

The Astral Planes

Everything else (Planes 1-16)
Characters from ancient scripts, alternative scripts, pictograms, and rare and archaic CJK(V) ideograms (Chinese-style characters). Also, (most) Emoji.
Two entire planes devoted to private use characters
Code points are notated as U+hhhhh or U+hhhhhh
Code points range from U+10000 to U+10FFFF

- So if the Basic Multilingual Plane contains all the characters for modern-day scripts, what's in the astral planes?
- Characters from ancient scripts, alternative scripts for writing modern-day languages (there are at least two alternatives for writing English alone!), and rare and archaic CJK ideograms, or Chinese-style characters.
- And oh yeah, most Emoji live here too.
- There are also two planes that are assigned to private-use characters: characters you, yes *you*, can use for anything!
- These code points are notated with five or six hexadecimal digits!

What is not Unicode?

A character encoding
(It's several character encodings!)
Code points ≠ Bytes

- Now, it's important to talk about what Unicode is *not*.
- It's not a character encoding.
- It's several character encodings.
- We need character encodings to convert characters into bytes.
- It's important to note that one character does not mean one byte.

Code Unit

Smallest unit of storage an encoding scheme uses to store or transmit text; a single character may take one or more code units

- That brings us to the term "code unit".
- Remember back when we were reading the ES6 spec, it talked about UTF-16 code unit values?
- A code unit is the smallest unit of storage an encoding format works with; storing or transmitting a single Unicode code point takes one or more code units, depending on the encoding.

Ways of transmitting code points

UTF-8
UTF-16
UTF-32/UCS-4

- You've probably heard of UTF-8, whose code units are 8 bits. There's also UTF-16, whose code units are 16 bits. And there's UTF-32, in which every code unit is 32 bits long.
- At this point, things should seem odd.
- Why... does UTF-32 even have to exist?
- The astute among you have already noticed that astral code points... need more than 16 bits to be represented.
- Thus...
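As a rough illustration, the sketch below counts how many code units one string needs in each encoding form. It assumes a runtime that provides the `TextEncoder` global (modern browsers and recent Node.js) and well-formed input text; `encodedSizes` is a hypothetical helper name.

```javascript
// Hypothetical helper: count the code units needed in each encoding form.
function encodedSizes(text) {
  return {
    // TextEncoder always produces UTF-8, one byte per 8-bit code unit.
    utf8Units: new TextEncoder().encode(text).length,
    // JavaScript strings expose their 16-bit code units directly.
    utf16Units: text.length,
    // UTF-32 uses exactly one 32-bit code unit per code point.
    utf32Units: Array.from(text).length,
  };
}

console.log(encodedSizes("a"));          // { utf8Units: 1, utf16Units: 1, utf32Units: 1 }
console.log(encodedSizes("\u1EDF"));     // ở: { utf8Units: 3, utf16Units: 1, utf32Units: 1 }
console.log(encodedSizes("\u{1F4A9}"));  // 💩: { utf8Units: 4, utf16Units: 2, utf32Units: 1 }
```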

UTF-16 needs two code units to represent one astral code point

- UTF-16 (that thing that JavaScript uses) needs two code units to represent a single astral code point. Like a poop emoji.
- UTF-16 does this through *surrogate pairs*, but let's just leave that discussion for another time...
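Without going into how surrogate pairs are computed, the two code units can simply be inspected. A small sketch, with illustrative variable names:

```javascript
const poop = "\u{1F4A9}";  // 💩, an astral code point

// charCodeAt() reads individual 16-bit code units.
const units = [poop.charCodeAt(0), poop.charCodeAt(1)].map(function (unit) {
  return unit.toString(16);
});
console.log(units);        // [ 'd83d', 'dca9' ]

// codePointAt() (new in ES6) reads the whole code point.
const codePoint = poop.codePointAt(0).toString(16);
console.log(codePoint);    // 1f4a9
```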

Back to our problem...

We want to count code points and not code units

Enter:

String.prototype[@@iterator]

(String.prototype.codePointAt() exists too)

- Among the slew of features ES6 introduced are iterators.
- The one for String iterates through code points!

When the @@iterator method is called it returns an Iterator object (25.1.1.2) that iterates over the code points of a String value, returning each code point as a String value.

Compare

let a = [];
for (let c of s) {
  a.push(c);
}

vs.

var i, a = [];
for (i = 0; i < s.length; i++) {
  a.push(s[i]);
}

- It's not just neater; it iterates over completely different things! The iterator gives you code points -- characters -- guaranteed! The C-style for-loop gives you 16-bit values that just usually happen to be whole code points.

Let's fix our code!

Trick: Use Array.from

(for a string, it does roughly this:)

function codePoints(s) { // hypothetical name, standing in for Array.from(s)
  let a = [];
  for (let c of s) {
    a.push(c);
  }
  return a;
}
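Putting the trick to work, a code-point-aware counter might look like the sketch below. The function names and the default limit of 140 are assumptions for illustration; this is not Bitter's actual source.

```javascript
// Hypothetical helper: count code points instead of UTF-16 code units.
function countCodePoints(text) {
  // Array.from uses the string iterator, which yields whole code points.
  return Array.from(text).length;
}

// Hypothetical helper: how many characters may still be typed.
function remainingCharacters(text, limit = 140) {
  return limit - countCodePoints(text);
}

const emojiMessage = "\u{1F62D}".repeat(140);        // 140 crying faces
console.log(emojiMessage.length);                    // 280
console.log(countCodePoints(emojiMessage));          // 140
console.log(remainingCharacters(emojiMessage));      // 0
```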

Change of plans

- The CEO had a revelation, and it turns out that Bitter is too similar to Twitter. Who would have thought? So, we're shifting gears; we're gonna be a restaurant review site instead! (there aren't any of those around, are there?)

(Thanks, Kim!)

(demo)

###### INSTRUCTIONS FOR EDDIE
- Stash changes, switch to branch "food"
- Demo the app; talk about phở
- Search for "pho" and be surprised.
- This is surprising because there are actually three ways to write phở!

Three different ways of writing 🍲

phở = o + ◌̛ + ◌̉
phở = ơ + ◌̉
phở = ở

- There's one way where the character is the o, the horn thingy, and the question mark thing (the hook) on top.
- Then there's a way in which the o with the horn is ONE code point, plus a combining hook.
- Then there's just one code point that's an o with a horn and a hook...

There are multiple ways of representing the same abstract character sequence

- The lesson is, there are sometimes multiple ways of representing a character sequence in Unicode.
- Then how can we account for variation when searching text?
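The three spellings can be written out with escapes to show that JavaScript treats them as different strings. A short sketch; the variable names are illustrative.

```javascript
// o + combining horn (U+031B) + combining hook above (U+0309)
const decomposed = "o\u031B\u0309";
// o with horn (U+01A1) + combining hook above (U+0309)
const partial = "\u01A1\u0309";
// o with horn and hook above as one code point (U+1EDF)
const precomposed = "\u1EDF";

// All three render as ở, but they have different lengths...
console.log(decomposed.length, partial.length, precomposed.length);  // 3 2 1

// ...and plain comparison considers them unequal.
console.log(decomposed === partial);      // false
console.log(partial === precomposed);     // false
console.log(decomposed === precomposed);  // false
```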

Normalization forms!

Useful for comparing different representations of the same abstract character sequence

- This is where Unicode NORMALIZATION saves the day!
- It allows you to compare different variations of the same character sequence, transparently!

NFD: Canonical Decomposition
NFC: Canonical Decomposition, followed by Canonical Composition
NFKD: Compatibility Decomposition
NFKC: Compatibility Decomposition, followed by Canonical Composition

(See UAX #15)

- There are four normalization forms, of which we'll discuss two.
- There's NFD, canonical decomposition. Essentially, this breaks precomposed characters apart into a base character plus combining characters, and sorts the combining characters (usually accents) into a "canonical" order.
- Then there's NFC, which starts out just like NFD, except it then pieces the base and combining characters back together again wherever it can.
- So many words! Which one is useful?

Canonical (De)composition

NFD: ở ⇒ o + ◌̛ + ◌̉
NFC: ở ⇒ ở

- Canonical decomposition is all about characters that look (that is, render) the same.
- In this case, NFD takes its input and breaks it down into the base character, plus all of its accents in a standard order.
- NFC takes the canonical decomposition from the previous step and tries to smoosh all the code points together into one code point, if possible.
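Both forms can be checked against the three spellings of ở with `String.prototype.normalize()`. A minimal sketch, assuming an engine with full normalization support; the variable names are illustrative.

```javascript
// Three encodings of ở: fully decomposed, partly composed, precomposed.
const spellings = ["o\u031B\u0309", "\u01A1\u0309", "\u1EDF"];

// NFD: every spelling becomes o + horn + hook above (three code points).
const nfd = spellings.map(function (s) { return s.normalize("NFD"); });
console.log(nfd.every(function (s) { return s === "o\u031B\u0309"; }));  // true

// NFC: every spelling becomes the single code point U+1EDF.
const nfc = spellings.map(function (s) { return s.normalize("NFC"); });
console.log(nfc.every(function (s) { return s === "\u1EDF"; }));         // true
```

After normalizing to either form, the three spellings compare equal with `===`.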

Enter:

String.prototype.normalize()

- That's why ES6 introduced String#normalize().
- It can convert into any of the four normalization forms.

When in doubt, use

NFC

- When in doubt, use NFC.
- In fact, if you provide no argument to String#normalize(), it does NFC by default.
- It gives you a single code point for any accented character that has one.
- Twitter uses it internally, even if its character counter doesn't.

Let's fix our app!

- With our new knowledge about string normalization, let's fix our app!
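One possible shape for the fix is to normalize both the stored text and the query before comparing. This is a sketch under that assumption, not the demo's actual code; `normalizedIncludes` and the sample review text are made up for illustration.

```javascript
// Hypothetical helper: substring search that ignores encoding differences.
function normalizedIncludes(text, query) {
  // Bring both sides to NFC so equivalent sequences compare equal.
  return text.normalize("NFC").includes(query.normalize("NFC"));
}

// The review stores ở as one code point; the query uses ơ + combining hook.
const review = "The ph\u1EDF here is great";
const query = "ph\u01A1\u0309";

console.log(review.includes(query));             // false
console.log(normalizedIncludes(review, query));  // true
```

This only reconciles different encodings of the same accented text; it does not make an unaccented query such as "pho" match.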

Compatibility

Check the compatibility table!

- Unfortunately, not everything is immediately supported, so go to the ES6 compatibility page to see if the engines that you're targeting support the features mentioned in this presentation.
- And with that, I'm done.
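Besides consulting the table, the features used in this presentation can be detected at runtime. The sketch below deliberately sticks to pre-ES6 syntax so that it can run on older engines; the `support` object is a made-up name.

```javascript
// Runtime feature detection for the ES6 features used in this talk.
var support = {
  // The string iterator lives under the well-known Symbol.iterator key.
  stringIterator: typeof Symbol === "function" &&
    typeof String.prototype[Symbol.iterator] === "function",
  codePointAt: typeof String.prototype.codePointAt === "function",
  normalize: typeof String.prototype.normalize === "function",
  arrayFrom: typeof Array.from === "function"
};

console.log(support);
```

An engine that reports `false` for any entry needs a polyfill or a transpilation step before that feature can be relied on.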

ASK ME QUESTIONS

Resources

Standards

Unicode

Other
