parboiled2

a macro-based PEG-parsersgenerator for Scala 2.10+

Alexander Myltsev alexander-myltsev@github linkedin.com/in/alexandermyltsev alex_myltsev@twitter This talk: http://myltsev.name/ScalaDays2014

my credentials

Google SoC'13 developed initial structure
porting to scala-js

parsing task

input: stream of characters or bytes
output: is it structured? what is the structure?

example:

scala?!> val input = "1 + 2 * 3"

scala?!> new Parser(input).run.arithmetical_expression?
true

scala?!> new Parser(input).run.AST_Structure
     ☂   +     ✂
     ♎  / \    ☎
Some ☭ 1   *   ☢
     ♎    / \  ❄
     ❤   2   3 ❤

what motivated us

parboiled 1.x used by numerous projects
Scala macros introduced
save API and improve performance

existing projects

tools and libs dealing with a language atop of JVM
use Scala combinator parsers or parboiled 1.x? — consider using parboiled2

define requirements

work inside of Scala ecosystem

tools: IDE, debugger, profiler, etc.
Java — not necessary

work fast easy to maintain — typed embedded DSL few dependencies

existing alternatives

barehanded parser
ANTLR — ANother Tool for Language Recognition
parser combinators (Haskell Parsec, Scala parser combinators, etc.)

either different IDE, or slower, or harder to maintain

be based on RegEx?

can't parse recursive data: arithmetical expr, json, etc.

scala?!> new RegExParser({
                           "location": "Kosmos Belin",
                           "presentation": {
                             "name" : "parboiled2",
                             "geeky": true
                           }
                         }).run

compare to parboiled 1.x

~10 times slower
less powerful DSL
more dependencies

parboiled2

gentle introduction

core features

Parsing Expression Grammar
DSL: flexible, type-safe
compile-time optimization
parsing errors reporting
one parsing phase, no lexer required
small API

install it

just a Scala lib
build against 2.10.4 and 2.11.1

libraryDependencies+= "org.parboiled" %% "parboiled" % "2.0.0"

import org.parboiled2._

toy parser

abstract class Parser {
  def input: ParserInput
}

class ToyParser(val input: ParserInput) extends Parser {
  def InputLine = rule { "a" ~ ("b" | "c") }
}

toy parser

def InputLine = rule { "a" ~ ("b" | "c") }

> new ToyParser("ab").InputLine.run()
Try[Unit] = Success(())

> new ToyParser("ac").InputLine.run()
Try[Unit] = Success(())

> new ToyParser("d").InputLine.run()
Try[Unit] = Failure(org.parboiled2.ParseError)

> val p = new ToyParser("d")
p: ToyParser = ToyParser@12345

> val Failure(err: ParseError) = p.InputLine.run()
err: ParseError = org.parboiled2.ParseError

> p.formatError(err)
String =
Invalid input 'd', expected 'a' (line 1, column 1):
d
^

toy parser: step by step

def InputLine = rule { "a" ~ ("b" | "c") }

input: "ac"

InputLine starts — calls "~" subrule — cursor at 0 pos "~" matches left "a", and if succeeds then right "|" rule "a" is matched against input(0) — cursor is at 1 "|" matches left "b", and if fails then matches "c" "b" is failed to match — cursor stays at 1 "c" is succeeded to match — cursor is moved to 2 match is successful

how to produce AST?

value stack & typed rules

exploring value stack

rules do not return any value push and pop result from ValueStack

implicit temporary storage
e.g. used for constructing AST, "in-phase" computations

rule is typed according to its interaction with ValueStack

rule typing

class Rule[-I <: HList, +O <: HList]

I <: HList number and types the rule pops off
O <: HList number and types the rule pushes on
types are inferred (in most cases)
aliases

type RuleN[L <: HList]   = Rule[HNil, L]

type Rule0               = RuleN[HNil]

type Rule1[T]            = RuleN[T :: HNil]

// B is pushed after A
type Rule2[A, B]         = RuleN[A :: B :: HNil]

type PopRule[L <: HList] = Rule[L, HNil]

rules groups

basic chars matchers
rules combinators and modifiers
parser actions

basic chars matchers

match input and cause progress
do nothing with ValueStack

implicit def ch(c: Char): Rule0
rule { 'a' } // rule { ch('a') }

implicit def str(s: String): Rule0
rule { "ab" } // rule { str("ab") }

rule { 42 } // compile-time error

implicit def predicate(p: CharPredicate): Rule0
CharPredicate.Digit, CharPredicate.LowerHexLetter, etc.

def EOI: Char

rules combinators & modifiers

e1 ~ e2

matches if e1 matches and then if e2 matches
type

def a: Rule1[Int]         = // ...
def b: Rule1[String]      = // ...
def c: Rule2[Int, String] = rule { a ~ b }

Rule[     , A    ] ~ Rule[     , B  ] = Rule[     , A:B    ]

Rule[A:B:C, D:E:F] ~ Rule[F    , G:H] = Rule[A:B:C, D:E:G:H]

Rule[A    , B:C  ] ~ Rule[D:B:C, E:F] = Rule[D:A  , E:F    ]

Rule[A    , B:C  ] ~ Rule[D:C  , E:F] - is Illegal if D != B

e1 | e2

tries to match e1 if matches then succeeds otherwise — result of e2 matching

optional(e)

tries to match e
always succeeds, even if e fails to match

optional(e: Rule0)    : Rule0
optional(e: Rule1[T]) : Rule1[Option[T]]

zeroOrMore(e) and oneOrMore(e)

zeroOrMore runs e until it fails — always succeeds
oneOrMore runs e until it fails — succeeds if inner rule is succeeded at least once

zeroOrMore(e: Rule0)    : Rule0
zeroOrMore(e: Rule1[T]) : Rule1[Seq[T]]

oneOrMore(e: Rule0)     : Rule0
oneOrMore(e: Rule1[T])  : Rule1[Seq[T]]

parser actions

can build recognizers so far: "does it have structure?"
need "actions" to produce result

push

pushes value onto ValueStack

rule { "true" ~ push(true) }

push(e: Unit)       : Rule0    - pushes nothing

push(e: L <: HList) : RuleN[L] - pushes all values of L

push(e: T)          : Rule1[T] - pushes a value of T

capture

pushes additional matched string onto ValueStack

rule { capture("42") } - if input matches "42" then pushes "42"

capture(zeroOrMore(CharPredicate.Digit)) : Rule1[String]

capture("true" ~ push(true))             : Rule2[Boolean, String]

a ~> (func)

transforms top elements of the Value Stack into some other object(s)

(foo: Rule1[String])      ~> ((s: String) ⇒ s.toInt)  : Rule1[Int]

(foo: Rule1[String])      ~> (s ⇒ s.toInt)            : Rule1[Int]

(foo: Rule1[String])      ~> (_.toInt)                : Rule1[Int]

(foo: Rule2[Int, String]) ~> ((i, s) ⇒ i + s.toInt)   : Rule1[Int]

(foo: Rule1[String])      ~> (println(_))             : Rule0

(foo: Rule1[String])      ~> (() ⇒ 17)                : Rule2[String, Int]
(foo: Rule1[String])      ~  push(17)                 : Rule2[String, Int]

(foo: Rule1[String]) ~> (s ⇒ s.toInt :: 3.14 :: HNil) : Rule2[Int, Double]

case class Person(name: String, age: Int)
(foo: Rule2[String, Int]) ~> Person                   : Rule1[Person]

reduction rules

parser makes progress leaving value stack unchanged

reduce "1+2+3+4+5" to (15:Int)

(number: Rule1[Int]) ~ zeroOrMore("+" ~ number) : Rule2[Int, Seq[Int]]

def reductor = "+" ~ number ~>
((a:Int, b: Int) => a + b) : Rule[Int::HNil, Int::HNil]

(number: Rule1[Int]) ~ zeroOrMore(reductor) : Rule1[Int]

capture(CharPredicate.Digit) ~ optional("h" ~> ((s: String) ⇒ s + "hex"))

(factor: Rule1[Int]) ~ oneOrMore("*" ~ factor ~> ((a: Int, b) ⇒ a * b))

calculator

> new Calculator("1+(2+4)*3").InputLine.run() 
scala.util.Success(25)

class Calculator(val input: ParserInput) extends Parser {

  def Digits = rule { oneOrMore(CharPredicate.Digit) } // Rule1[Stirng]

  def Number = rule { capture(Digits) ~> (_.toInt) }   // Rule1[Int]

  def Parens = rule { '(' ~ Expression ~ ')' }

  def Factor = rule { Number | Parens }

  def Term = rule { Factor ~ zeroOrMore('*' ~ Factor ~> ((_: Int) * _)
                                      | '/' ~ Factor ~> ((_: Int) / _)) }

  def Expression: Rule1[Int] = rule {
      Term ~ zeroOrMore('+' ~ Term ~> ((_: Int) + _)
                      | '-' ~ Term ~> ((_: Int) - _)) }

  def InputLine = rule { Expression ~ EOI }

}

under the hood

rule macro

case class Rule(matched: Boolean)

abstract class Parser {
  def rule(r: Rule): Rule = macro ParserMacros.ruleImpl

  def input: ParserInput
  def __nextChar(): Char = // ...
  def __cursor: Int = // ...
}

object ParserMacros {	
  def ruleImpl(ctx: ParserContext)(r: ctx.Expr[Rule]): ctx.Expr[Rule] = {
    // ...
  }
}

two step render()

def ruleImpl(ctx: ParserContext)(r: ctx.Expr[Rule]): ctx.Expr[Rule] = 
  reify {
    OpTree(r.tree).render().splice
  }

input: Scala compiler AST IR: parboiled2 OpTree output: effective Scala code AST

quasiquotes match

def r = rule { "a" }
def r = rule { str("a") }
def r = rule { CurrentParser.this.str("a") }

r.tree = Apply(Select(This(TypeName("CurrentParser")), TermName("str")),
               List(Literal(Constant("a"))))

r.tree match {
  case Apply(Select(_, TermName("str")), List(Literal(Constant(s)))) ⇒
    StringMatch(s) // s is "a"
}

r.tree match {
  case q"$a.this.str($s)" ⇒ StringMatch(s) // a is CurrentParser
                                           // s is "a"
}

OpTree

object OpTree {
  def apply(tree: Tree): OpTree =
    r.tree match {
      case q"$a.this.str($s)" ⇒ StringMatch(s)
      case q"$a.this.ch($ch)" ⇒ CharMatch(ch)
      case q"$lhs.|($rhs)"    ⇒ FirstOf(OpTree(lhs), OpTree(rhs))
      case q"$lhs.~($rhs)"    ⇒ Sequence(OpTree(lhs), OpTree(rhs))
    }
}

abstract class OpTree {
  def render(): Expr[Rule]
}

case class StringMatch(strTree: Tree)         extends OpTree
case class CharMatch(charTree: Tree)          extends OpTree
case class FirstOf(lhs: OpTree, rhs: OpTree)  extends OpTree
case class Sequence(lhs: OpTree, rhs: OpTree) extends OpTree

CharMatch

// (rule { ch('a') }).tree match { case q"$a.this.ch($charTree)" ⇒ ... }

case class CharMatch(charTree: Tree) extends OpTree {
  def render(): Expr[Rule] = reify {
    val p    = c.prefix.splice               // Current parser
    val char = c.Expr[Char](charTree).splice // : Char
    Rule(p.__nextChar() == char)
  }
}

StringMatch: naive

// (rule { str('a') }).tree match { case q"$a.this.str($strTree)" ⇒ ... }

case class StringMatch(strTree: Tree) extends OpTree {
  def render(): Expr[Rule] = c.Expr[Rule](q''
    val p   = ${c.prefix} // Current parser
    val str = $strTree    // : String

    val res = str.forall(c ⇒ c == p.__nextChar())
    Rule(res)
'')

StringMatch: effective

// (rule { str('a') }).tree match { case q"$a.this.str($strTree)" ⇒ ... }

case class StringMatch(strTree: Tree) extends OpTree {
  def render(): Expr[Rule] = c.Expr[Rule](q''
    val p   = ${c.prefix} // Current parser
    val str = $strTree    // : String

    var ix = 0
    while (ix < str.length && str.charAt(ix) == p.__nextChar()) {
      ix += 1
    }
    Rule(ix == str.length)
'')

Sequence: a ~ b

// case q"$lhs.~($rhs)" ⇒ Sequence(OpTree(lhs), OpTree(rhs))

  case class Sequence(lhs: OpTree, rhs: OpTree) extends OpTree {
    def render(): Expr[Rule] = c.Expr[Rule](q''
    val res = lhs.render().matched && rhs.render().matched
    Rule(res)
  '')
}

FirstOf: a | b

// case q"$lhs.|($rhs)" ⇒ FirstOf(OpTree(lhs), OpTree(rhs))

case class FirstOf(lhs: OpTree, rhs: OpTree) extends OpTree {
  def render(): Expr[Rule] = c.Expr[Rule](q''
    val mark = p.__cursor
    if (lhs.render().matched) {
      Rule(true)
    } else {
      p.__cursor = mark
      val res = rhs.render().matched
      Rule(res)
    }
  '')
}

benchmark

> cappi::benchmark

            benchmark    ms linear runtime
 Parboiled1JsonParser 85.64 ==============================
  arboiled2JsonParser 13.17 ====
             Argonaut  7.01 ==
         Json4SNative  8.06 ==
        Json4SJackson  4.09 =

parboiled2

alexander-myltsev

parboiled2

1 0 (function() { var po = document.createElement('script'); po.type = 'text/javascript'; po.async = true; po.src = 'https://apis.google.com/js/platform.js'; var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(po, s); })();

pb2-talk

parboiled2

a macro-based PEG-parsersgenerator for Scala 2.10+

my credentials

parsing task

what motivated us

existing projects

define requirements

existing alternatives

be based on RegEx?

compare to parboiled 1.x

parboiled2

gentle introduction

core features

install it

toy parser

toy parser

toy parser: step by step

how to produce AST?

value stack & typed rules

exploring value stack

rule typing

rules groups

basic chars matchers

rules combinators & modifiers

e1 ~ e2

e1 | e2

optional(e)

zeroOrMore(e) and oneOrMore(e)

parser actions

push

capture

a ~> (func)

reduction rules

calculator

under the hood

rule macro

two step render()

quasiquotes match

OpTree

CharMatch

StringMatch: naive

StringMatch: effective

Sequence: a ~ b

FirstOf: a | b

benchmark

Thanks

Q & A

@Alex_Myltsev

parboiled2.org

1 0