GOTO Night
Regular Expressions
Kim Hesselholt Reenberg2016-05-17
Wifi
SSID
Win7/Linux
easv-portal
Everyone else
EASV
Username
Guest461
Password
Easvzgap
Win7/Linux go to cp.easv.dk to login.
Full presentation is available online at http://khreenberg.github.io/goto-regex
About me
- Software Development PBA student
- Lazy programmer
- Self-taught regular expressions disciple
This talk is split into three main parts:
A brief introduction to formal languages
Regex features and syntax
Examples and workshop
Regular expressions
-
have many names
- Regex
- Regexp
- /reg(?:(?:exp?(?:e?s)?)|(?:ular expressions?))/ig
- Black magic
-
have many usages
- Validation
- Search & Replace
- CSS selectors
- String manipulation
- Party-trick
- are not regular expressions
Regex ≠ regular expressions
- "Real" regular expressions can only recognize regular languages
- A regular language is a type of formal language
From wikipedia:
…a formal language is a
set of strings of symbols that may be
constrained by rules that are specific to it.
- formal languages are sets
- defined by formal grammar
Formal languages 101
-
The alphabet, $\Sigma$, is the set of
-
Tokens, $\sigma$, used to form
-
The words, $w$, that make up
- The language
(Relevant) set theory 101
Symbol
Meaning
Example
$\emptyset$
The empty set
$\emptyset=\{ \}$
$\in$
Element of
$a \in \{a, b, c\}$
$\notin$
Not element of
$x \notin \{a, b, c\}$
$\ni$
Contains
$\{a, b, c\} \ni a$
- Two sets (A and B) are equal iff $$x \in A \iff x \in B$$
- $\therefore \{1, 2, 3\}=\{1, 1, 3, 1, 2, 1, 1, 2, 3\}$
- Because the number of elements is not a factor in set equality, we will often ignore or remove duplicates
Set operations
Name
Symbol
Example
Effect
Union
$\cup$
$\{1,2\} \cup \{2,3\}$
$\{1,2,3\}$
Intersection
$\cap$
$\{1,2\} \cap \{2,3\}$
$\{2\}$
Operations on sets of strings
Description
Symbol
Example
Effect
Concatenation
$\cdot$
$\{a,b\} \cdot \{c, d\}$
$\{ac, ad, bc, bd\}$
Kleene Star
$^*$
$\{a,b\}^*$
$\{\lambda,a,b,aa,ab,ba,bb,aaa,\ldots\}$
- The symbol for logical disjunction ($\vee$) is visually similar to that of union
- The symbol for logical conjunction ($\wedge$) is visually similar to that of intersection
- Which makes sense as
$$x \in A \cup B \iff (x \in A) \vee (x \in B)$$
$$x \in A \cap B \iff (x \in A) \wedge (x \in B)$$
- logical disjunction = "or"
- logical conjunction = "and"
So, with the basics in place...
... let's get back to languages
Formal definition of regular languages
The collection of regular languages over $\Sigma$ is defined recursively as follows:
-
$\emptyset$ and $\{\lambda\}$ are regular languages
- For each $\sigma \in \Sigma$, the language $\{\sigma\}$ regular
- Let $L_1$ and $L_2$ be regular languages over $\Sigma$; the languages $L_1\cup L_2$, $\ L_1\cdot L_2$ and ${L_1}^*$ are also regular
- No other languages are regular
Explicitly mentioning ${\lambda}$ is redundant, as $\emptyset^*={\lambda}$
Examples
With the alphabet $\Sigma=\{a,b,c\}$, we could for example define the following languages
-
$L_1=\{w \mid w\ \text{is a word that ends in}\ a\}$
- $L_2=\{w \mid w\ \text{is a palindrome}\}$
- $L_3=\{w \mid w\ \text{is a word where all $a$s are immediately followed by a $b$, unless they are preceded by a $c$, and}\ldots\}$
…if only we had a simple and precise way of defining languages…
Describing languages with regular expressions
Set
Regular expression
${L_1}^*$
${L_1}^*$
$L_1 \cdot L_2$
$L_1L_2$
$L_1 \cup L_2$
$L_1+L_2$
Previous examples
$$
\begin{align}
L_1 & =\{w \mid w\ \text{is a word that ends in}\ a\} \\
L_2 & =\{w \mid w\ \text{is a palindrome}\} \\
L_3 & =\{w \mid w\ \text{is a word where all $a$s are immediately followed by a $b$, unless they are preceded by a $c$}\} \\
\end{align}
$$
With regular expressions
\begin{align}
L_1 & = (a+b+c)^*a \\
L_2 & = \text{???} \\
L_3 & = (ab+b)^∗(\lambda+(c(a+b+c)^∗))
\end{align}
Regex ≠ regular expressions (cont.)
- "Real" regular expressions can only recognize regular languages
- Regexes can recognize a much wider range of languages, including palindromes
Type
Language
Recognized by
Type-0
Recursively enumerable
Turing machine
Recursive
Total Turing machine
Type-1
Context-sensitive
Linear-bounded automaton
Type-2
Context-free
Pushdown automaton
Type-3
Regular
Finite automaton
Type
Language
Recognized by
Type-0
Recursively enumerable
Turing machine
Recursive
Total Turing machine
Type-1
Context-sensitive
Linear-bounded automaton
Type-2
Context-free
Pushdown automaton
Type-3
Regular
Finite automaton
Kleene's theorem states that regular expressions and finite automata are equal — They are simply different ways to convey the same meaning.
Regex can be simulated by a linear-bounded automaton.
I will talk about Kleene's theorem later, if there's time and interest.
Formal theory done,
let's see some regex!
Regex examples
Regular expression
Regex
$L_1 = (a+b+c)^*a$
$L_1 = $[abc]*a
$L_2 = \text{N/A}$
$L_2 = $((.)(?1)\2|.?)
$L_3 = (ab+b)^*(\lambda+(c(a+b+c)^*))$
$L_3 = $(ab|b)*(c[abc]*)?
Flavou?rs
- There are many different regex engines
- The engines typically use an algorithm based on either backtracking (BT) or a finite-state automaton (FSA)
- BT is often slightly faster than FSA, but an "unlucky" a careless regex processed by BT can take a very long time
- Which algorithm is used can affect the "default" behaviour of the regexFor example: FSA-based engines are typically biased towards the longest possible match,
whereas those based on BT are likely to be biased towards the first —or leftmost— possible match.
- The term flavor (or flavour) refers to the slight —and sometimes not so slight— differences in semantics between different engines
This might not make a lot of sense right now, but it is something to keep in mind.
For more information on the differences between regex engines, seehttps://en.wikipedia.org/wiki/Comparison_of_regular_expression_engines
On the careless regex:
-
very long time: Anything from "a few seconds" to "completely unresponsive"
- I will talk about how not to be careless later
A quick Google search did not reveal whether or not FSA-based algorithms can typically recognize the same languages as those based on BT
Regex anatomy
/reg(?:(?:exp?(?:e?s)?)|(?:ular expressions?))/ig
-
Delimiter
-
Literal symbols
-
Non-capturing group
-
Optional (0 or 1)
-
End non-capturing group
-
Alternation (or)
-
Modifier: Case insensitive
-
Modifier: Global
- regex
- regexs
- regexes
- regexp
- regexps
- regexpes
- regular expression
- regular expressions
Metacharacters
Symbol
Meaning
Symbol
Meaning
\
Escape character
.
Any single character2
^
Beginning of string1
[…]
Character class
$
End of string1
[^…]
Negated character class
?
Zero or one
|
Alternation
*
Zero or more
-
Character class range3
+
One or more
{…}
Explicit quantification
Behavior can change with certain modifiers.
Often the default behaviour is finding the start and end of individual lines in the input.
Typically does not match newline characters without using modifiers.
- is only a meta character if it is used between characters in a character class.
E.g. [-az], [a\-z] and [az-] all match a, z, and –. [a-z] however matches any character between a and z.
By metacharacters I mean characters that needs to be escaped if you want to search for them.
There is no {,n}, use {0,n} instead.
Character classes
Let's say we want to find all lowercase letters in the english alphabet.
This will work…
a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z
…but it's quite verbose.
Instead, we can use character classes, like so:
[abcdefghijklmnopqrstuvwxyz]
Or even better—we can use '–' to specify a range:
[a-z]
Metacharacter escaping is only neccessary for those that have a special meaning within character classes.
Shorthand character classes
Shorthand
Matches
\w
Word character
\d
Digit
\s
Whitespace
\v
Vertical whitespace
\h
Horizontal whitespace
Each shorthand can be negated by using the uppercase letter instead:
Shorthand
Matches
\W
Anything but a word character
\D
Anything but a digit
\S
Anything but a whitespace
\V
Anything but a vertical whitespace
\H
Anything but a horizontal whitespace
Note that the shorthand character classes are particularly dependent on flavor.
Negated character classes
Explicit character classes like [abc] can also be negated.
[^abc] matches anything that isn't an a, b or c.
Character classes—negated or not—can be used with the shorthand classes as well.
[\w\v] matches any character that is either a word character or a vertical whitespace.
By "exploiting" double negation, we can write some pretty neat shortcuts.
For example, [^abc\W] matches the same characters as (?![abc])\w.
Hopefully I will get to show some more cool usages of double negation later in the talk.
Unicode character classes
Some flavors feature character classes for each of the Unicode categories, scripts and blocks.
These character classes are typically accessed with \p{Category_Name}.
Examples
Type
Usage
Description
Category
\p{Lowercase_Letter}
Any lowercase letter from any script that has an uppercase variant
Script
\p{Hiragana}
Any character from the hiragana syllabary
Block
\p{InHiragana}
Any Unicode code point in the range U+3040–U+309F
As with the other shorthand character classes, \p{…} can be negated with \P{…}.
A quick word on POSIX
You may come across the word 'POSIX' when reading or referencing regex materials.
POSIX is an abbreviation for "Portable Operating System Interface for uniX", which is a set of standards that a UNIX OS should support.
These standards define two flavors of regex, the POSIX Basic Regular Expressions (POSIX BRE) and POSIX Extended Regular Expressions (POSIX ERE).
The POSIX flavors are in many ways quite different from other (more modern) flavors, and many regex references will mention the POSIX character classes.
- For example, the POSIX character class [:digit:] will match the same characters as [0-9], \d and \p{Decimal_Digit_Number}.
Quantifiers
Sometimes Very often we want to repeat (a part of) our regex.
Consider checking a piece of text to see if it contains a numbers-only substring of length 10.
\d\d\d\d\d\d\d\d\d\d works, but it is neither very readable or scalable.
By using quantifiers, we can tell the regex engine to repeat the \d part of our regex 10 times.
\d{10} matches the same substrings as the regex above.
Quantifier reference
Syntax
Meaning
?
Zero or one
*
Zero or more
+
One or more
{n}
Exactly n
{n,}
n or more
{n,m}
n to m (inclusive)
Quantifier behaviour
Regex quantifiers are greedy by default.
For example, the regex A\w*C applied to
bbbCAbbbCAbbbCAbbb
won't just match
bbbCAbbbCAbbbCAbbb
but instead
bbbCAbbbCAbbbCAbbb
Modifying the behaviour of quantifiers
Syntax
Behaviour
Description
\w*
Greedy
Matches as many characters as possible
\w*?
Lazy
Matches as few characters as possible
\w*+
Possessive
Will not backtrack
The * quantifier can of course be substituted by any of the other quantifiers.
Behaviour examples
Regex
Behaviour
Match
[ab]+b
Greedy
aabaab
[ab]+?b
Lazy
aabaab
[ab]++b
Possessive
aabaab
Assertions
Sometimes we need to inspect our surroundings without actually matching them.
Consider the input string <h1>Replace me!</h1>
If we want to replace the content of the <h1>-tag with I was replaced., we could
- search for <h1>.*?</h1>, and
- replace it with <h1>I was replaced.</h1>
This example is not too bad, but one can imagine how some searches would need quite a bit of repetition.
Assertions make it possible to inspect the string without consuming any characters.
For example, we can achieve the same result as before by using regex look-arounds to
- search for (?<=<h1>).*?(?=</h1>), and
- replacing with I was replaced.
Assertion types
Usage
Description
Example
Boundaries
\b
Word boundary
\bword\b
Anchors
^
Start of input (or line)
^.*$
$
End of input (or line)
^.*$
\A
Absolute start of input
\A.*$
\Z
End of input (ignoring blank lines)
\Z\s
\z
Absolute end of input
\s\z
Look-arounds
(?=…)
Positive look-ahead
a(?=b)
(?!…)
Negative look-ahead
b(?!a)
(?<=…)
Positive look-behind
(?<=a)b
(?<!…)
Negative look-behind
(?<!b)a
Groups
We've already seen the non-capturing group (?:…) syntax a few times.
As the name suggests, regex also has non-non-capturing groups—or just capturing groups.
The syntax for capturing groups is simply (…).
These are particularly useful for search & replace, and can even be used for backreferencing.
Search and replace example
By searching for ^(.*)$ and replacing with Hello, $1! we can quickly change the following input
World
GOTO
regex
to the following output
Hello, World!
Hello, GOTO!
Hello, regex!
Backreferencing
Recall that we can recognize a larger set of languages with regex compared to regular expressions.
This is because we are able to reference a capture group within the regex itself.
For example, (.)\1 will match aa and >>, but not xy
In addition to referencing the contents of a group, it is also possible to reference the regex for the group. This is called a subroutine.
Let's revisit our regex that matches palindromes.
Subroutines
^((.)(?1)\2|.?)$
Input: abba
Level
Group 1
Group 2
Trying to match
Finds
0
a
(?1)
1
b
(?1)
2
b
(?1)
3
a
(?1)
4
$
3
a
aa
a$
2
b
bb
ba
1
b
bb
bb
0
bb
a
abba
abba
Heavily simplified for readability; I am ignoring the fact that .? is greedy.
Named groups
Capture groups and subroutines can be named for easier replacement and backreferencing.
The syntax for named groups is very flavor-dependent, and the examples here might not work with your engine.
Syntax
Meaning
(?<name>…)
Creates a capture with the given name
\k<name>
Matches what is currently in the name group
(?P>name)
Call the name subroutine
${name}
Insert the contents of name when replacing
Here is our palindrome regex with and without named groups:
^(?<palindrome>(?<first>.)(?P>palindrome)\k<first>|.?)$
^((.)(?1)\2|.?)$
Information
Libraries & Frameworks
Tools
GOTO Night
Regular Expressions
Kim Hesselholt Reenberg2016-05-17
Wifi
SSID
Win7/Linux
easv-portal
Everyone else
EASV
Username
Guest461
Password
Easvzgap
Win7/Linux go to cp.easv.dk to login.
Full presentation is available online at http://khreenberg.github.io/goto-regex