Unicode! – (And how ES6 can help) – (Hypothetical Scenario)






On Github eddieantonio / unicode-es6

Unicode!

(And how ES6 can help)

By Eddie Antonio Santos / @_eddieantonio

(Hypothetical Scenario)

- Tomorrow you get hired at a start-up.
- Our visionary CEO has decided that the internet needs a place to share its feelings, because the internet has a lot of feelings.
- But we don't want people to share *too many* feelings all at once; they may only share feelings 140 characters at a time.
- But that's okay; we have a lot of feelings too, so the internet has a lot of feelings.

(Thanks, Kim!)

- So this app is called Bitter.
- It allows you to post short, 140 character long messages.
- Pretty simple, right?
- I mean, we have feelings too, but we hope to find... alternate... coping mechanisms.
- With that in mind, let's code it!

###### INSTRUCTIONS FOR EDDIE
- Ensure Bitter is loaded in a browser window
- Ensure branch is `master`
- Ensure src/index.jsx is loaded in Vim length code (have it let message length = 0)
- Demo the app -- type some English in it.
- Add string.length to message length.
- Demo the app -- type some stuff in it. The character count should increase by 1.
- `node -p '"😭".repeat(140)' | pbcopy`
- Copy it, and try `pbpaste | wc -m`
- Delete some things to demonstrate counting by two.

(demo)

###### INSTRUCTIONS FOR EDDIE
- Demo the app. Write the naïve implementation.

What does String.length actually measure?

Let's consult the standard!

So what does String.length actually measure, if it's not characters? Let's consult the ECMAScript 6 standard!

The String type is the set of all ordered sequences of zero or more 16-bit unsigned integer values (“elements”)

The String type is generally used to represent textual data in a running ECMAScript program, in which case each element in the String is treated as a UTF-16 code unit value.

The length of a String is the number of elements (i.e., 16-bit values) within it.

ECMAScript 6.0 Standard §6.1.4
A string is an ordered sequence of 16-bit unsigned integer values... Strings are generally used to represent text, in which case each element is treated as a UTF-16 code unit value. The length of a string is the number of 16-bit values within it.
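A quick sketch of what that definition means in practice, assuming any engine with ES6 support. The variable names are only for illustration; they are not taken from the Bitter source.

```javascript
// Each string below is what a reader would call ONE character.
const latin = "a";    // fits in a single 16-bit element
const crying = "😭";  // needs two 16-bit elements

// .length counts 16-bit elements, not characters.
const latinLength = latin.length;   // 1
const cryingLength = crying.length; // 2

// This is why 140 crying faces report a length of 280.
const bitterMessageLength = crying.repeat(140).length; // 280
```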

What the 👿 is a UTF-16 code unit value‽

WHAT THE DINK IS A UTF-16 CODE UNIT VALUE???!?!??!??!?

Unicode!

What is Unicode?

  • A mapping of numbers (code points) to every character. Ever.
  • Database of properties for each character (e.g., name, general category).
- Unicode is a lot of things.
- It's a big database assigning a number to every character used in human language. Ever.
- Each character is also assigned a number of properties, including a unique name.
- Characters are abstract representations of the smallest units of semantic value in written language [Unicode:3](http://www.unicode.org/versions/Unicode8.0.0/ch02.pdf).
- If you're dealing with text of any sort, you're probably dealing with Unicode.

Code Points

Unique number given to each character

They look like this:

U+hhhh or U+hhhhhh

Range from U+0000 to U+10FFFF

1,114,112 code points available in total

(See Unicode Chapter 3)

- Code points are unique numbers assigned to each character.
- They're written as four hexadecimal digits, or up to six for larger values, with that funky `U+` prefix.
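The notation is easy to reproduce in code. Here is a minimal sketch, assuming an engine that supports `String.prototype.padStart` (ES2017); `formatCodePoint` is a hypothetical helper name.

```javascript
// Hypothetical helper: render a code point number in U+ notation.
function formatCodePoint(codePoint) {
  const hex = codePoint.toString(16).toUpperCase();
  // Pad short values so there are always at least four hexadecimal digits.
  return "U+" + hex.padStart(4, "0");
}

const examples = [0x41, 0x1edf, 0x1f62d, 0x10ffff].map((cp) => formatCodePoint(cp));
// → ["U+0041", "U+1EDF", "U+1F62D", "U+10FFFF"]
```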

A tour of the Unicode character space!

Code points are divided into 17 planes

- Let's take a quick tour of the Unicode code space!
- The range is divided into 17 planes.

The Basic Multilingual Plane

- Most of the characters used in today's languages are here, in the Basic Multilingual Plane. - English, Standard Chinese, Cherokee, Hindi, Russian, Hebrew, Ethiopian—everything! It's in here.

The Basic Multilingual Plane

(Plane 0)

  • Characters from practically all widely-used modern-day scripts
  • Code points are notated as U+hhhh
  • Code points range from U+0000 to U+FFFF
- The characters are notated using four hexadecimal digits.
- But this is just 1/17th of the possible Unicode code points...

All Unicode Code Points

- Here's the Basic Multilingual Plane...
- And here's where it lies in the entire Unicode code space!
- The non-BMP planes are called *astral planes*.

The Astral Planes

  • Everything else (Planes 1-16)
  • Characters from ancient scripts, alternative scripts, pictograms, and rare and archaic CJK(V) ideograms (Chinese-style characters). Also, (most) Emoji.
  • Two entire planes devoted to private use characters
  • Code points are notated as U+hhhhh or U+hhhhhh
  • Code points range from U+010000 to U+10FFFF
- So if the Basic Multilingual Plane contains all the characters for modern-day scripts, what's in the astral planes?
- Characters from ancient scripts, alternative scripts for writing modern-day languages (there are at least two alternatives for writing English alone!), and rare and archaic CJK ideograms, or Chinese-style characters.
- And oh yeah, most Emoji live here too.
- There are also two planes that are assigned to private-use characters: characters you, yes *you*, can use for anything!
- These code points are notated with five or six hexadecimal digits, as in U+1F62D.
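Since every plane holds 0x10000 code points, finding the plane of a code point is a single division. A small sketch; `planeOf` and `isAstral` are hypothetical helper names.

```javascript
// Hypothetical helper: which of the 17 planes does a code point belong to?
function planeOf(codePoint) {
  // Each plane holds 0x10000 (65,536) code points.
  return Math.floor(codePoint / 0x10000);
}

// Hypothetical helper: anything outside plane 0 is astral.
function isAstral(codePoint) {
  return planeOf(codePoint) > 0;
}

const planeOfA = planeOf("a".codePointAt(0));       // 0 (Basic Multilingual Plane)
const planeOfCrying = planeOf("😭".codePointAt(0)); // 1
const lastPlane = planeOf(0x10ffff);                // 16
```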

What is not Unicode?

  • a character encoding (it's several character encodings!)
  • Code points ≠ Bytes
- Now, it's important to talk about what Unicode is *not*.
- It's not a character encoding.
- It's several character encodings.
- We need character encodings to convert characters into bytes.
- It's important to note that one character does not mean one byte.

Code Unit

Smallest unit of storage an encoding scheme uses to store or transmit text; a single character takes one or more code units

- That brings us to the term "code unit".
- Remember back when we were reading the ES6 spec, it talked about UTF-16 code unit values?
- Code units are the smallest units of storage that an encoding format uses; a single Unicode code point is stored or transmitted as one or more of them.

Ways of transmitting code points

  • UTF-8
  • UTF-16
  • UTF-32/UCS-4
- You've probably heard of UTF-8, whose smallest unit is 8 bits. There's also UTF-16, whose smallest unit is 16 bits. And there's UTF-32, in which every unit is 32 bits long.
- At this point, things should seem odd.
- Why... does UTF-32 have to even exist?
- The astute among you have already noticed that astral code points... need more than 16 bits to be represented.
- Thus...
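To make the difference concrete, here is a rough sketch that counts the code units one string needs in each encoding. It assumes `TextEncoder` is available as a global (true in current browsers and recent Node.js versions); `codeUnitCounts` is a hypothetical helper name.

```javascript
// Hypothetical helper: how many code units a string needs in each encoding.
function codeUnitCounts(text) {
  return {
    utf8: new TextEncoder().encode(text).length, // 8-bit code units (bytes)
    utf16: text.length,                          // 16-bit code units
    utf32: Array.from(text).length,              // 32-bit code units, one per code point
  };
}

const forA = codeUnitCounts("a");           // { utf8: 1, utf16: 1, utf32: 1 }
const forHornO = codeUnitCounts("\u1EDF");  // { utf8: 3, utf16: 1, utf32: 1 }
const forCrying = codeUnitCounts("😭");     // { utf8: 4, utf16: 2, utf32: 1 }
```

Only UTF-32 gives one code unit per code point every time, which is the reason it exists at all.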

UTF-16 needs two code units to represent one astral code point

- UTF-16 (that thing that JavaScript uses) needs two code units to represent a single astral code point. Like a poop emoji. - UTF-16 does this through *surrogate pairs*, but let's just leave that discussion for another time...
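For the curious, the surrogate pair arithmetic is short enough to sketch here. The helper name `toSurrogatePair` is made up for illustration; the constants are the ones the UTF-16 encoding form defines.

```javascript
// Hypothetical helper: split an astral code point into two UTF-16 code units.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;       // leaves a 20-bit number
  const high = 0xd800 + (offset >> 10);     // top 10 bits go in the high surrogate
  const low = 0xdc00 + (offset & 0x3ff);    // bottom 10 bits go in the low surrogate
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1f4a9); // U+1F4A9, the poop emoji
// high is 0xD83D, low is 0xDCA9

// Putting the two code units back together yields the original character.
const rebuilt = String.fromCharCode(high, low); // "💩"
```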

Back to our problem...

We want to count code points and not code units

Enter:

String.prototype[@@iterator]

(String.prototype.codePointAt() exists too)

- Among the slew of features ES6 introduced are iterators.
- The one for String iterates through code points!
When the @@iterator method is called it returns an Iterator object (25.1.1.2) that iterates over the code points of a String value, returning each code point as a String value.

Compare

let a = [];
for (let c of s) {
  a.push(c);
}
            

vs.

var i, a = [];
for (i = 0; i < s.length; i++) {
  a.push(s[i]);
}
            
- It's not just neater; it iterates over completely different things! The iterator gives you code points -- characters -- guaranteed! The C-style for-loop gives you 16-bit values, which just usually happen to be code points.
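Running both loops over the same input shows the difference. This is a sketch with made-up variable names, using a string that contains one astral character.

```javascript
const s = "I 😭";

// ES6 iterator: one entry per code point.
const byCodePoint = [];
for (const c of s) {
  byCodePoint.push(c);
}
// byCodePoint → ["I", " ", "😭"]

// C-style loop: one entry per 16-bit code unit.
const byCodeUnit = [];
for (let i = 0; i < s.length; i++) {
  byCodeUnit.push(s[i]);
}
// byCodeUnit → ["I", " ", "\uD83D", "\uDE2D"]
```

The last two entries of `byCodeUnit` are the two halves of the emoji, and neither is printable on its own.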

Let's fix our code!

Trick: Use Array.from

(it just does this:)
function (s) {
  let a = [];
  for (let c of s) {
    a.push(c);
  }
  return a;
}
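With that trick, the counter from the demo might look something like this. `MAX_CHARACTERS`, `countCodePoints`, and `fitsInOnePost` are hypothetical names, not code from the Bitter repository.

```javascript
const MAX_CHARACTERS = 140; // the limit from the scenario

// Hypothetical helper: count code points instead of UTF-16 code units.
function countCodePoints(message) {
  return Array.from(message).length;
}

// Hypothetical helper: does the message fit in a single post?
function fitsInOnePost(message) {
  return countCodePoints(message) <= MAX_CHARACTERS;
}

const fullOfFeelings = "😭".repeat(140);
const counted = countCodePoints(fullOfFeelings); // 140, where .length says 280
const allowed = fitsInOnePost(fullOfFeelings);   // true
```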
            

Change of plans

- The CEO had a revelation, and it turns out that Bitter is too similar to Twitter. Who would have thought? So, we're shifting gears; we're gonna be a restaurant review site instead! (there aren't any of those around, are there?)

(Thanks, Kim!)

(demo)

###### INSTRUCTIONS FOR EDDIE
- Stash changes, switch to branch "food"
- Demo the app; talk about phở
- search for "pho" and be surprised.
- This is surprising because there are actually three ways to write phở!

Three different ways of writing 🍲

  • phở = o + ◌̛ + ◌̉
  • phở = ơ + ◌̉
  • phở = ở
- There's one way where the character is the o, the horn thingy, and the question mark thing on top.
- Then there's a way in which the o with the horn is ONE code point, plus a combining hook.
- Then there's just one code point that's an o with a horn and a hook...
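Written out with escapes, the three spellings look like this. The variable names are only for illustration.

```javascript
const threeCodePoints = "ph\u006F\u031B\u0309"; // o + combining horn + combining hook above
const twoCodePoints = "ph\u01A1\u0309";         // ơ + combining hook above
const oneCodePoint = "ph\u1EDF";                // ở as a single code point

// They render the same, but JavaScript compares code units, so none are equal.
const allEqual =
  threeCodePoints === twoCodePoints && twoCodePoints === oneCodePoint; // false

const lengths = [threeCodePoints, twoCodePoints, oneCodePoint].map((w) => w.length);
// → [5, 4, 3]

// A naive substring search misses the other spellings too.
const naiveHit = twoCodePoints.includes(oneCodePoint); // false
```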

There are multiple ways of representing the same abstract character sequence

- The lesson is, there are sometimes multiple ways of representing a character sequence in Unicode - Then how can we account for variation when searching text?

Normalization forms!

Useful for comparing different representations of the same abstract character sequence

- This is where Unicode NORMALIZATION saves the day!
- It allows you to compare different variations of the same character sequence, transparently!
  • NFD Canonical decomposition
  • NFC Canonical decomposition, followed by Canonical Composition
  • NFKD Compatibility Decomposition
  • NFKC Compatibility Decomposition, followed by Canonical Composition

(See UAX #15)

- There are four normalization forms, of which we'll discuss two.
- There's NFD, canonical decomposition -- essentially, this breaks apart characters that can be represented as a base character plus combining characters, and sorts the combining characters (usually accents) in a "canonical" order.
- Then there's NFC, which is just like NFD, except it pieces those combining characters back together again.
- So many words! Which one is useful?

Canonical (De)composition

  • NFD: ở ⇒ o + ◌̛ + ◌̉
  • NFC: ở ⇒ ở
- Canonical decomposition is all about characters that look (that is, render) the same.
- In this case, NFD takes its input and breaks it down into the base character, plus all of its accents in a standard order.
- NFC takes the canonical decomposition from the previous step and tries to smoosh all the code points together into one code point, if possible.
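Here is a small sketch of both forms applied to the three spellings of ở, using `String.prototype.normalize()` from ES6. The variable names are illustrative only.

```javascript
// The three spellings of ở: fully decomposed, partly composed, fully composed.
const spellings = ["o\u031B\u0309", "\u01A1\u0309", "\u1EDF"];

// NFD: every spelling becomes o + horn + hook above.
const nfd = spellings.map((s) => s.normalize("NFD"));
const allDecomposed = nfd.every((s) => s === "o\u031B\u0309"); // true

// NFC: every spelling becomes the single code point U+1EDF.
const nfc = spellings.map((s) => s.normalize("NFC"));
const allComposed = nfc.every((s) => s === "\u1EDF"); // true

// The "standard order" part: accents typed hook-first get re-sorted.
const hookFirst = "o\u0309\u031B".normalize("NFD"); // "o\u031B\u0309"
```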

Enter:

String.prototype.normalize()

- That's why ES6 introduced String#normalize() - Can convert into any of the four normalization forms.

When in doubt, use

NFC

- When in doubt, use NFC.
- In fact, if you provide no argument to String#normalize(), it does NFC by default.
- It gives you one character for any accented nonsense for which that makes sense.
- Twitter uses it internally, even if its character counter doesn't.

Let's fix our app!

- With our new knowledge about string normalization, let's fix our app!
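One way the fix could look, as a sketch: normalize both the text and the query before comparing. `containsText` is a hypothetical helper, not the code from the demo app.

```javascript
// Hypothetical search helper: compare both sides in the same normalization form.
function containsText(haystack, needle) {
  return haystack.normalize("NFC").includes(needle.normalize("NFC"));
}

const review = "The ph\u01A1\u0309 here is amazing"; // ơ + combining hook above
const query = "ph\u1EDF";                            // ở as a single code point

const naive = review.includes(query);       // false
const fixed = containsText(review, query);  // true
```

Note that this only unifies different spellings of phở. Making an unaccented "pho" match as well is a separate problem that canonical normalization alone does not solve.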

Compatibility

Check the compatibility table!

- Unfortunately, not everything is supported everywhere yet, so go to the ES6 compatibility page to see if the engines that you're targeting support the features mentioned in this presentation.
- And with that, I'm done.

ASK ME QUESTIONS

- ASK ME QUESTIONS!
