GOTO Night

Regular Expressions

Kim Hesselholt Reenberg2016-05-17

Wifi

SSID Win7/Linux easv-portal Everyone else EASV Username Guest461 Password Easvzgap

Win7/Linux go to cp.easv.dk to login.

Full presentation is available online at http://khreenberg.github.io/goto-regex

About me

Software Development PBA student
Lazy programmer
Self-taught regular expressions disciple

This talk is split into three main parts:

A brief introduction to formal languages Regex features and syntax Examples and workshop

Regular expressions

have many names
- Regex
- Regexp
- /reg(?:(?:exp?(?:e?s)?)|(?:ular expressions?))/ig
- Black magic
have many usages
- Validation
- Search & Replace
- CSS selectors
- String manipulation
- Party-trick
are not regular expressions

Regex ≠ regular expressions

"Real" regular expressions can only recognize regular languages
A regular language is a type of formal language

From wikipedia: …a formal language is a set of strings of symbols that may be constrained by rules that are specific to it.

formal languages are sets
defined by formal grammar

Formal languages 101

The alphabet, $\Sigma$, is the set of
Tokens, $\sigma$, used to form
The words, $w$, that make up
The language

(Relevant) set theory 101

Symbol Meaning Example $\emptyset$ The empty set $\emptyset=\{ \}$ $\in$ Element of $a \in \{a, b, c\}$ $\notin$ Not element of $x \notin \{a, b, c\}$ $\ni$ Contains $\{a, b, c\} \ni a$

Two sets (A and B) are equal iff $$x \in A \iff x \in B$$
$\therefore \{1, 2, 3\}=\{1, 1, 3, 1, 2, 1, 1, 2, 3\}$
Because the number of elements is not a factor in set equality, we will often ignore or remove duplicates

Set operations

Name Symbol Example Effect Union $\cup$ $\{1,2\} \cup \{2,3\}$ $\{1,2,3\}$ Intersection $\cap$ $\{1,2\} \cap \{2,3\}$ $\{2\}$

Operations on sets of strings

Description Symbol Example Effect Concatenation $\cdot$ $\{a,b\} \cdot \{c, d\}$ $\{ac, ad, bc, bd\}$ Kleene Star $^*$ $\{a,b\}^*$ $\{\lambda,a,b,aa,ab,ba,bb,aaa,\ldots\}$

The symbol for logical disjunction ($\vee$) is visually similar to that of union
The symbol for logical conjunction ($\wedge$) is visually similar to that of intersection
Which makes sense as

$$x \in A \cup B \iff (x \in A) \vee (x \in B)$$ $$x \in A \cap B \iff (x \in A) \wedge (x \in B)$$

logical disjunction = "or"
logical conjunction = "and"

So, with the basics in place...

... let's get back to languages

Formal definition of regular languages

The collection of regular languages over $\Sigma$ is defined recursively as follows:

$\emptyset$ and $\{\lambda\}$ are regular languages
For each $\sigma \in \Sigma$, the language $\{\sigma\}$ regular
Let $L_1$ and $L_2$ be regular languages over $\Sigma$; the languages $L_1\cup L_2$, $\ L_1\cdot L_2$ and ${L_1}^*$ are also regular
No other languages are regular

Explicitly mentioning ${\lambda}$ is redundant, as $\emptyset^*={\lambda}$

Examples

With the alphabet $\Sigma=\{a,b,c\}$, we could for example define the following languages

$L_1=\{w \mid w\ \text{is a word that ends in}\ a\}$
$L_2=\{w \mid w\ \text{is a palindrome}\}$
$L_3=\{w \mid w\ \text{is a word where all $a$s are immediately followed by a $b$, unless they are preceded by a $c$, and}\ldots\}$

…if only we had a simple and precise way of defining languages…

Describing languages with regular expressions

Set Regular expression ${L_1}^*$ ${L_1}^*$ $L_1 \cdot L_2$ $L_1L_2$ $L_1 \cup L_2$ $L_1+L_2$

Previous examples

$$ \begin{align} L_1 & =\{w \mid w\ \text{is a word that ends in}\ a\} \\ L_2 & =\{w \mid w\ \text{is a palindrome}\} \\ L_3 & =\{w \mid w\ \text{is a word where all $a$s are immediately followed by a $b$, unless they are preceded by a $c$}\} \\ \end{align} $$

With regular expressions

\begin{align} L_1 & = (a+b+c)^*a \\ L_2 & = \text{???} \\ L_3 & = (ab+b)^∗(\lambda+(c(a+b+c)^∗)) \end{align}

Regex ≠ regular expressions (cont.)

"Real" regular expressions can only recognize regular languages
Regexes can recognize a much wider range of languages, including palindromes

Type Language Recognized by Type-0 Recursively enumerable Turing machine Recursive Total Turing machine Type-1 Context-sensitive Linear-bounded automaton Type-2 Context-free Pushdown automaton Type-3 Regular Finite automaton

Kleene's theorem states that regular expressions and finite automata are equal — They are simply different ways to convey the same meaning.

Regex can be simulated by a linear-bounded automaton.

I will talk about Kleene's theorem later, if there's time and interest.

Formal theory done,

let's see some regex!

Regex examples

Regular expression Regex $L_1 = (a+b+c)^*a$ $L_1 = $[abc]*a $L_2 = \text{N/A}$ $L_2 = $((.)(?1)\2|.?) $L_3 = (ab+b)^*(\lambda+(c(a+b+c)^*))$ $L_3 = $(ab|b)*(c[abc]*)?

Flavou?rs

There are many different regex engines
The engines typically use an algorithm based on either backtracking (BT) or a finite-state automaton (FSA)
BT is often slightly faster than FSA, but an "unlucky" a careless regex processed by BT can take a very long time
Which algorithm is used can affect the "default" behaviour of the regexFor example: FSA-based engines are typically biased towards the longest possible match, whereas those based on BT are likely to be biased towards the first —or leftmost— possible match.
The term flavor (or flavour) refers to the slight —and sometimes not so slight— differences in semantics between different engines

This might not make a lot of sense right now, but it is something to keep in mind.

For more information on the differences between regex engines, seehttps://en.wikipedia.org/wiki/Comparison_of_regular_expression_engines

On the careless regex:

very long time: Anything from "a few seconds" to "completely unresponsive"
I will talk about how not to be careless later

A quick Google search did not reveal whether or not FSA-based algorithms can typically recognize the same languages as those based on BT

Regex anatomy

/reg(?:(?:exp?(?:e?s)?)|(?:ular expressions?))/ig

Delimiter
Literal symbols
Non-capturing group
Optional (0 or 1)
End non-capturing group
Alternation (or)
Modifier: Case insensitive
Modifier: Global

regex
regexs
regexes
regexp
regexps
regexpes
regular expression
regular expressions

Syntax

Metacharacters

Symbol Meaning Symbol Meaning \ Escape character . Any single character2 ^ Beginning of string1 […] Character class $ End of string1 [^…] Negated character class ? Zero or one | Alternation * Zero or more - Character class range3 + One or more {…} Explicit quantification Behavior can change with certain modifiers. Often the default behaviour is finding the start and end of individual lines in the input. Typically does not match newline characters without using modifiers.

- is only a meta character if it is used between characters in a character class. E.g. [-az], [a\-z] and [az-] all match a, z, and –. [a-z] however matches any character between a and z.

By metacharacters I mean characters that needs to be escaped if you want to search for them.

There is no {,n}, use {0,n} instead.

Character classes

Let's say we want to find all lowercase letters in the english alphabet.

This will work…

a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z

…but it's quite verbose.

Instead, we can use character classes, like so:

[abcdefghijklmnopqrstuvwxyz]

Or even better—we can use '–' to specify a range:

[a-z]

Metacharacter escaping is only neccessary for those that have a special meaning within character classes.

Shorthand character classes

Shorthand Matches \w Word character \d Digit \s Whitespace \v Vertical whitespace \h Horizontal whitespace

Each shorthand can be negated by using the uppercase letter instead:

Shorthand Matches \W Anything but a word character \D Anything but a digit \S Anything but a whitespace \V Anything but a vertical whitespace \H Anything but a horizontal whitespace

Note that the shorthand character classes are particularly dependent on flavor.

Negated character classes

Explicit character classes like [abc] can also be negated.

[^abc] matches anything that isn't an a, b or c.

Character classes—negated or not—can be used with the shorthand classes as well.

[\w\v] matches any character that is either a word character or a vertical whitespace.

By "exploiting" double negation, we can write some pretty neat shortcuts.

For example, [^abc\W] matches the same characters as (?![abc])\w.

Hopefully I will get to show some more cool usages of double negation later in the talk.

Unicode character classes

Some flavors feature character classes for each of the Unicode categories, scripts and blocks.

These character classes are typically accessed with \p{Category_Name}.

Examples

Type Usage Description Category \p{Lowercase_Letter} Any lowercase letter from any script that has an uppercase variant Script \p{Hiragana} Any character from the hiragana syllabary Block \p{InHiragana} Any Unicode code point in the range U+3040–U+309F

As with the other shorthand character classes, \p{…} can be negated with \P{…}.

A quick word on POSIX

You may come across the word 'POSIX' when reading or referencing regex materials.
POSIX is an abbreviation for "Portable Operating System Interface for uniX", which is a set of standards that a UNIX OS should support.
These standards define two flavors of regex, the POSIX Basic Regular Expressions (POSIX BRE) and POSIX Extended Regular Expressions (POSIX ERE).
The POSIX flavors are in many ways quite different from other (more modern) flavors, and many regex references will mention the POSIX character classes.
For example, the POSIX character class [:digit:] will match the same characters as [0-9], \d and \p{Decimal_Digit_Number}.

Quantifiers

Sometimes Very often we want to repeat (a part of) our regex.

Consider checking a piece of text to see if it contains a numbers-only substring of length 10.

\d\d\d\d\d\d\d\d\d\d works, but it is neither very readable or scalable.

By using quantifiers, we can tell the regex engine to repeat the \d part of our regex 10 times.

\d{10} matches the same substrings as the regex above.

Quantifier reference

Syntax Meaning ? Zero or one * Zero or more + One or more {n} Exactly n {n,} n or more {n,m} n to m (inclusive)

Quantifier behaviour

Regex quantifiers are greedy by default.

For example, the regex A\w*C applied to

bbbCAbbbCAbbbCAbbb

won't just match

bbbCAbbbCAbbbCAbbb

but instead

bbbCAbbbCAbbbCAbbb

Modifying the behaviour of quantifiers

Syntax Behaviour Description \w* Greedy Matches as many characters as possible \w*? Lazy Matches as few characters as possible \w*+ Possessive Will not backtrack

The * quantifier can of course be substituted by any of the other quantifiers.

Behaviour examples

Regex Behaviour Match [ab]+b Greedy aabaab [ab]+?b Lazy aabaab [ab]++b Possessive aabaab

Assertions

Sometimes we need to inspect our surroundings without actually matching them.

Consider the input string <h1>Replace me!</h1>

If we want to replace the content of the <h1>-tag with I was replaced., we could

search for <h1>.*?</h1>, and
replace it with <h1>I was replaced.</h1>

This example is not too bad, but one can imagine how some searches would need quite a bit of repetition.

Assertions make it possible to inspect the string without consuming any characters.

For example, we can achieve the same result as before by using regex look-arounds to

search for (?<=<h1>).*?(?=</h1>), and
replacing with I was replaced.

Assertion types

Usage Description Example Boundaries \b Word boundary \bword\b Anchors ^ Start of input (or line) ^.*$ $ End of input (or line) ^.*$ \A Absolute start of input \A.*$ \Z End of input (ignoring blank lines) \Z\s \z Absolute end of input \s\z Look-arounds (?=…) Positive look-ahead a(?=b) (?!…) Negative look-ahead b(?!a) (?<=…) Positive look-behind (?<=a)b (?<!…) Negative look-behind (?<!b)a

Groups

We've already seen the non-capturing group (?:…) syntax a few times.

As the name suggests, regex also has non-non-capturing groups—or just capturing groups.

The syntax for capturing groups is simply (…).

These are particularly useful for search & replace, and can even be used for backreferencing.

Search and replace example

By searching for ^(.*)$ and replacing with Hello, $1! we can quickly change the following input

World
GOTO
regex

to the following output

Hello, World!
Hello, GOTO!
Hello, regex!

Backreferencing

Recall that we can recognize a larger set of languages with regex compared to regular expressions.

This is because we are able to reference a capture group within the regex itself.

For example, (.)\1 will match aa and >>, but not xy

In addition to referencing the contents of a group, it is also possible to reference the regex for the group. This is called a subroutine.

Let's revisit our regex that matches palindromes.

Subroutines

^((.)(?1)\2|.?)$

Input: abba

Level Group 1 Group 2 Trying to match Finds 0 a (?1) 1 b (?1) 2 b (?1) 3 a (?1) 4 $ 3 a aa a$ 2 b bb ba 1 b bb bb 0 bb a abba abba

Heavily simplified for readability; I am ignoring the fact that .? is greedy.

Named groups

Capture groups and subroutines can be named for easier replacement and backreferencing.

The syntax for named groups is very flavor-dependent, and the examples here might not work with your engine.

Syntax Meaning (?<name>…) Creates a capture with the given name \k<name> Matches what is currently in the name group (?P>name) Call the name subroutine ${name} Insert the contents of name when replacing

Here is our palindrome regex with and without named groups:

^(?<palindrome>(?<first>.)(?P>palindrome)\k<first>|.?)$

^((.)(?1)\2|.?)$

Information

Libraries & Frameworks

https://github.com/hakimel/reveal.js/

Tools

GOTO Night Regular Expressions Kim Hesselholt Reenberg2016-05-17 Wifi SSID Win7/Linux easv-portal Everyone else EASV Username Guest461 Password Easvzgap Win7/Linux go to cp.easv.dk to login. Full presentation is available online at http://khreenberg.github.io/goto-regex

goto-regex

khreenberg

goto-regex

0 0 (function() { var po = document.createElement('script'); po.type = 'text/javascript'; po.async = true; po.src = 'https://apis.google.com/js/platform.js'; var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(po, s); })();

goto-regex

GOTO Night

Regular Expressions

About me

This talk is split into three main parts:

Regular expressions

Regex ≠ regular expressions

Formal languages 101

(Relevant) set theory 101

Set operations

Operations on sets of strings

So, with the basics in place...

... let's get back to languages

Formal definition of regular languages

Examples

Describing languages with regular expressions

Previous examples

With regular expressions

Regex ≠ regular expressions (cont.)

Formal theory done,

let's see some regex!

Regex examples

Flavou?rs

Regex anatomy

/reg(?:(?:exp?(?:e?s)?)|(?:ular expressions?))/ig

Syntax

Metacharacters

Character classes

Shorthand character classes

Negated character classes

Unicode character classes

Examples

A quick word on POSIX

Quantifiers

Quantifier reference

Quantifier behaviour

Modifying the behaviour of quantifiers

Behaviour examples

Assertions

Assertion types

Groups

Search and replace example

Backreferencing

Subroutines

^((.)(?1)\2|.?)$

Input: abba

Named groups

Information

Libraries & Frameworks

Tools

0 0