goto-regex



goto-regex

0 0


goto-regex


On Github khreenberg / goto-regex

GOTO Night

Regular Expressions

Kim Hesselholt Reenberg2016-05-17

Wifi

SSID Win7/Linux easv-portal Everyone else EASV Username Guest461 Password Easvzgap

Win7/Linux go to cp.easv.dk to login.

Full presentation is available online at http://khreenberg.github.io/goto-regex

About me

  • Software Development PBA student
  • Lazy programmer
  • Self-taught regular expressions disciple

This talk is split into three main parts:

A brief introduction to formal languages Regex features and syntax Examples and workshop

Regular expressions

  • have many names
    • Regex
    • Regexp
    • /reg(?:(?:exp?(?:e?s)?)|(?:ular expressions?))/ig
    • Black magic
  • have many usages
    • Validation
    • Search & Replace
    • CSS selectors
    • String manipulation
    • Party-trick
  • are not regular expressions

Regex ≠ regular expressions

  • "Real" regular expressions can only recognize regular languages
  • A regular language is a type of formal language

From wikipedia: …a formal language is a set of strings of symbols that may be constrained by rules that are specific to it.

  • formal languages are sets
  • defined by formal grammar

Formal languages 101

  • The alphabet, $\Sigma$, is the set of
  • Tokens, $\sigma$, used to form
  • The words, $w$, that make up
  • The language

(Relevant) set theory 101

Symbol Meaning Example $\emptyset$ The empty set $\emptyset=\{ \}$ $\in$ Element of $a \in \{a, b, c\}$ $\notin$ Not element of $x \notin \{a, b, c\}$ $\ni$ Contains $\{a, b, c\} \ni a$

  • Two sets (A and B) are equal iff $$x \in A \iff x \in B$$
  • $\therefore \{1, 2, 3\}=\{1, 1, 3, 1, 2, 1, 1, 2, 3\}$
  • Because the number of elements is not a factor in set equality, we will often ignore or remove duplicates

Set operations

Name Symbol Example Effect Union $\cup$ $\{1,2\} \cup \{2,3\}$ $\{1,2,3\}$ Intersection $\cap$ $\{1,2\} \cap \{2,3\}$ $\{2\}$

Operations on sets of strings

Description Symbol Example Effect Concatenation $\cdot$ $\{a,b\} \cdot \{c, d\}$ $\{ac, ad, bc, bd\}$ Kleene Star $^*$ $\{a,b\}^*$ $\{\lambda,a,b,aa,ab,ba,bb,aaa,\ldots\}$

  • The symbol for logical disjunction ($\vee$) is visually similar to that of union
  • The symbol for logical conjunction ($\wedge$) is visually similar to that of intersection
  • Which makes sense as

$$x \in A \cup B \iff (x \in A) \vee (x \in B)$$ $$x \in A \cap B \iff (x \in A) \wedge (x \in B)$$

  • logical disjunction = "or"
  • logical conjunction = "and"

So, with the basics in place...

... let's get back to languages

Formal definition of regular languages

The collection of regular languages over $\Sigma$ is defined recursively as follows:

  • $\emptyset$ and $\{\lambda\}$ are regular languages
  • For each $\sigma \in \Sigma$, the language $\{\sigma\}$ regular
  • Let $L_1$ and $L_2$ be regular languages over $\Sigma$; the languages $L_1\cup L_2$, $\ L_1\cdot L_2$ and ${L_1}^*$ are also regular
  • No other languages are regular

Explicitly mentioning ${\lambda}$ is redundant, as $\emptyset^*={\lambda}$

Examples

With the alphabet $\Sigma=\{a,b,c\}$, we could for example define the following languages

  • $L_1=\{w \mid w\ \text{is a word that ends in}\ a\}$
  • $L_2=\{w \mid w\ \text{is a palindrome}\}$
  • $L_3=\{w \mid w\ \text{is a word where all $a$s are immediately followed by a $b$, unless they are preceded by a $c$, and}\ldots\}$

…if only we had a simple and precise way of defining languages…

Describing languages with regular expressions

Set Regular expression ${L_1}^*$ ${L_1}^*$ $L_1 \cdot L_2$ $L_1L_2$ $L_1 \cup L_2$ $L_1+L_2$

Previous examples

$$ \begin{align} L_1 & =\{w \mid w\ \text{is a word that ends in}\ a\} \\ L_2 & =\{w \mid w\ \text{is a palindrome}\} \\ L_3 & =\{w \mid w\ \text{is a word where all $a$s are immediately followed by a $b$, unless they are preceded by a $c$}\} \\ \end{align} $$

With regular expressions

\begin{align} L_1 & = (a+b+c)^*a \\ L_2 & = \text{???} \\ L_3 & = (ab+b)^∗(\lambda+(c(a+b+c)^∗)) \end{align}

Regex ≠ regular expressions (cont.)

  • "Real" regular expressions can only recognize regular languages
  • Regexes can recognize a much wider range of languages, including palindromes

Type Language Recognized by Type-0 Recursively enumerable Turing machine Recursive Total Turing machine Type-1 Context-sensitive Linear-bounded automaton Type-2 Context-free Pushdown automaton Type-3 Regular Finite automaton
Type Language Recognized by Type-0 Recursively enumerable Turing machine Recursive Total Turing machine Type-1 Context-sensitive Linear-bounded automaton Type-2 Context-free Pushdown automaton Type-3 Regular Finite automaton

Kleene's theorem states that regular expressions and finite automata are equal — They are simply different ways to convey the same meaning.

Regex can be simulated by a linear-bounded automaton.

I will talk about Kleene's theorem later, if there's time and interest.

Formal theory done,

let's see some regex!

Regex examples

Regular expression Regex $L_1 = (a+b+c)^*a$ $L_1 = $[abc]*a $L_2 = \text{N/A}$ $L_2 = $((.)(?1)\2|.?) $L_3 = (ab+b)^*(\lambda+(c(a+b+c)^*))$ $L_3 = $(ab|b)*(c[abc]*)?

Flavou?rs

  • There are many different regex engines
  • The engines typically use an algorithm based on either backtracking (BT) or a finite-state automaton (FSA)
  • BT is often slightly faster than FSA, but an "unlucky" a careless regex processed by BT can take a very long time
  • Which algorithm is used can affect the "default" behaviour of the regexFor example: FSA-based engines are typically biased towards the longest possible match, whereas those based on BT are likely to be biased towards the first —or leftmost— possible match.
  • The term flavor (or flavour) refers to the slight —and sometimes not so slight— differences in semantics between different engines

This might not make a lot of sense right now, but it is something to keep in mind.

For more information on the differences between regex engines, seehttps://en.wikipedia.org/wiki/Comparison_of_regular_expression_engines

On the careless regex:

  • very long time: Anything from "a few seconds" to "completely unresponsive"
  • I will talk about how not to be careless later

A quick Google search did not reveal whether or not FSA-based algorithms can typically recognize the same languages as those based on BT

Regex anatomy

/reg(?:(?:exp?(?:e?s)?)|(?:ular expressions?))/ig

  • Delimiter
  • Literal symbols
  • Non-capturing group
  • Optional (0 or 1)
  • End non-capturing group
  • Alternation (or)
  • Modifier: Case insensitive
  • Modifier: Global
  • regex
  • regexs
  • regexes
  • regexp
  • regexps
  • regexpes
  • regular expression
  • regular expressions

Syntax

Metacharacters

Symbol Meaning Symbol Meaning \ Escape character . Any single character2 ^ Beginning of string1 […] Character class $ End of string1 [^…] Negated character class ? Zero or one | Alternation * Zero or more - Character class range3 + One or more {…} Explicit quantification Behavior can change with certain modifiers. Often the default behaviour is finding the start and end of individual lines in the input. Typically does not match newline characters without using modifiers.

- is only a meta character if it is used between characters in a character class. E.g. [-az], [a\-z] and [az-] all match a, z, and –. [a-z] however matches any character between a and z.

By metacharacters I mean characters that needs to be escaped if you want to search for them.

There is no {,n}, use {0,n} instead.

Character classes

Let's say we want to find all lowercase letters in the english alphabet.

This will work…

a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z

…but it's quite verbose.

Instead, we can use character classes, like so:

[abcdefghijklmnopqrstuvwxyz]

Or even better—we can use '–' to specify a range:

[a-z]

Metacharacter escaping is only neccessary for those that have a special meaning within character classes.

Shorthand character classes

Shorthand Matches \w Word character \d Digit \s Whitespace \v Vertical whitespace \h Horizontal whitespace

Each shorthand can be negated by using the uppercase letter instead:

Shorthand Matches \W Anything but a word character \D Anything but a digit \S Anything but a whitespace \V Anything but a vertical whitespace \H Anything but a horizontal whitespace

Note that the shorthand character classes are particularly dependent on flavor.

Negated character classes

Explicit character classes like [abc] can also be negated.

[^abc] matches anything that isn't an a, b or c.

Character classes—negated or not—can be used with the shorthand classes as well.

[\w\v] matches any character that is either a word character or a vertical whitespace.

By "exploiting" double negation, we can write some pretty neat shortcuts.

For example, [^abc\W] matches the same characters as (?![abc])\w.

Hopefully I will get to show some more cool usages of double negation later in the talk.

Unicode character classes

Some flavors feature character classes for each of the Unicode categories, scripts and blocks.

These character classes are typically accessed with \p{Category_Name}.

Examples

Type Usage Description Category \p{Lowercase_Letter} Any lowercase letter from any script that has an uppercase variant Script \p{Hiragana} Any character from the hiragana syllabary Block \p{InHiragana} Any Unicode code point in the range U+3040–U+309F

As with the other shorthand character classes, \p{…} can be negated with \P{…}.

A quick word on POSIX

  • You may come across the word 'POSIX' when reading or referencing regex materials.

  • POSIX is an abbreviation for "Portable Operating System Interface for uniX", which is a set of standards that a UNIX OS should support.

  • These standards define two flavors of regex, the POSIX Basic Regular Expressions (POSIX BRE) and POSIX Extended Regular Expressions (POSIX ERE).

  • The POSIX flavors are in many ways quite different from other (more modern) flavors, and many regex references will mention the POSIX character classes.

  • For example, the POSIX character class [:digit:] will match the same characters as [0-9], \d and \p{Decimal_Digit_Number}.

Quantifiers

Sometimes Very often we want to repeat (a part of) our regex.

Consider checking a piece of text to see if it contains a numbers-only substring of length 10.

\d\d\d\d\d\d\d\d\d\d works, but it is neither very readable or scalable.

By using quantifiers, we can tell the regex engine to repeat the \d part of our regex 10 times.

\d{10} matches the same substrings as the regex above.

Quantifier reference

Syntax Meaning  ? Zero or one  * Zero or more  + One or more {n} Exactly n {n,} n or more {n,m} n to m (inclusive)

Quantifier behaviour

Regex quantifiers are greedy by default.

For example, the regex A\w*C applied to

bbbCAbbbCAbbbCAbbb

won't just match

bbbCAbbbCAbbbCAbbb

but instead

bbbCAbbbCAbbbCAbbb

Modifying the behaviour of quantifiers

Syntax Behaviour Description \w* Greedy Matches as many characters as possible \w*? Lazy Matches as few characters as possible \w*+ Possessive Will not backtrack

The * quantifier can of course be substituted by any of the other quantifiers.

Behaviour examples

Regex Behaviour Match [ab]+b Greedy aabaab [ab]+?b Lazy aabaab [ab]++b Possessive aabaab

Assertions

Sometimes we need to inspect our surroundings without actually matching them.

Consider the input string <h1>Replace me!</h1>

If we want to replace the content of the <h1>-tag with I was replaced., we could

  • search for <h1>.*?</h1>, and
  • replace it with <h1>I was replaced.</h1>

This example is not too bad, but one can imagine how some searches would need quite a bit of repetition.

Assertions make it possible to inspect the string without consuming any characters.

For example, we can achieve the same result as before by using regex look-arounds to

  • search for (?<=<h1>).*?(?=</h1>), and
  • replacing with I was replaced.

Assertion types

Usage Description Example Boundaries \b Word boundary \bword\b Anchors ^ Start of input (or line) ^.*$ $ End of input (or line) ^.*$ \A Absolute start of input \A.*$ \Z End of input (ignoring blank lines) \Z\s \z Absolute end of input \s\z Look-arounds (?=…) Positive look-ahead a(?=b) (?!…) Negative look-ahead b(?!a) (?<=…) Positive look-behind (?<=a)b (?<!…) Negative look-behind (?<!b)a

Groups

We've already seen the non-capturing group (?:…) syntax a few times.

As the name suggests, regex also has non-non-capturing groups—or just capturing groups.

The syntax for capturing groups is simply (…).

These are particularly useful for search & replace, and can even be used for backreferencing.

Search and replace example

By searching for ^(.*)$ and replacing with Hello, $1! we can quickly change the following input

World
GOTO
regex

to the following output

Hello, World!
Hello, GOTO!
Hello, regex!

Backreferencing

Recall that we can recognize a larger set of languages with regex compared to regular expressions.

This is because we are able to reference a capture group within the regex itself.

For example, (.)\1 will match aa and >>, but not xy

In addition to referencing the contents of a group, it is also possible to reference the regex for the group. This is called a subroutine.

Let's revisit our regex that matches palindromes.

Subroutines

^((.)(?1)\2|.?)$

Input: abba

Level Group 1 Group 2 Trying to match Finds 0 a (?1) 1 b (?1) 2 b (?1) 3 a (?1) 4 $ 3 a aa a$ 2 b bb ba 1 b bb bb 0 bb a abba abba

Heavily simplified for readability; I am ignoring the fact that .? is greedy.

Named groups

Capture groups and subroutines can be named for easier replacement and backreferencing.

The syntax for named groups is very flavor-dependent, and the examples here might not work with your engine.

Syntax Meaning (?<name>…) Creates a capture with the given name \k<name> Matches what is currently in the name group (?P>name) Call the name subroutine ${name} Insert the contents of name when replacing

Here is our palindrome regex with and without named groups:

^(?<palindrome>(?<first>.)(?P>palindrome)\k<first>|.?)$
^((.)(?1)\2|.?)$

Information

Libraries & Frameworks

Tools

GOTO Night Regular Expressions Kim Hesselholt Reenberg2016-05-17 Wifi SSID Win7/Linux easv-portal Everyone else EASV Username Guest461 Password Easvzgap Win7/Linux go to cp.easv.dk to login. Full presentation is available online at http://khreenberg.github.io/goto-regex