parboiled2



parboiled2

1 0


pb2-talk

ScalaDays 2014 parboiled2 task

On Github alexander-myltsev / pb2-talk

parboiled2

a macro-based PEG-parsersgenerator for Scala 2.10+

Alexander Myltsev alexander-myltsev@githublinkedin.com/in/alexandermyltsevalex_myltsev@twitter This talk: http://myltsev.name/ScalaDays2014

my credentials

  • Google SoC'13 developed initial structure
  • porting to scala-js

parsing task

  • input: stream of characters or bytes
  • output: is it structured? what is the structure?
example:
scala?!> val input = "1 + 2 * 3"

scala?!> new Parser(input).run.arithmetical_expression?
true

scala?!> new Parser(input).run.AST_Structure
     ☂   +     ✂
     ♎  / \    ☎
Some ☭ 1   *   ☢
     ♎    / \  ❄
     ❤   2   3 ❤

what motivated us

  • parboiled 1.x used by numerous projects
  • Scala macros introduced
  • save API and improve performance

existing projects

  • tools and libs dealing with a language atop of JVM
  • use Scala combinator parsers or parboiled 1.x? — consider using parboiled2

define requirements

work inside of Scala ecosystem
  • tools: IDE, debugger, profiler, etc.
  • Java — not necessary
work fast easy to maintain — typed embedded DSL few dependencies

existing alternatives

  • barehanded parser
  • ANTLR — ANother Tool for Language Recognition
  • parser combinators (Haskell Parsec, Scala parser combinators, etc.)
either different IDE, or slower, or harder to maintain

be based on RegEx?

can't parse recursive data: arithmetical expr, json, etc.

scala?!> new RegExParser({
                           "location": "Kosmos Belin",
                           "presentation": {
                             "name" : "parboiled2",
                             "geeky": true
                           }
                         }).run

compare to parboiled 1.x

  • ~10 times slower
  • less powerful DSL
  • more dependencies

parboiled2

gentle introduction

core features

  • Parsing Expression Grammar
  • DSL: flexible, type-safe
  • compile-time optimization
  • parsing errors reporting
  • one parsing phase, no lexer required
  • small API

install it

  • just a Scala lib
  • build against 2.10.4 and 2.11.1
libraryDependencies+= "org.parboiled" %% "parboiled" % "2.0.0"

import org.parboiled2._

toy parser

abstract class Parser {
  def input: ParserInput
}

class ToyParser(val input: ParserInput) extends Parser {
  def InputLine = rule { "a" ~ ("b" | "c") }
}

toy parser

def InputLine = rule { "a" ~ ("b" | "c") }
> new ToyParser("ab").InputLine.run()
Try[Unit] = Success(())
> new ToyParser("ac").InputLine.run()
Try[Unit] = Success(())
> new ToyParser("d").InputLine.run()
Try[Unit] = Failure(org.parboiled2.ParseError)
> val p = new ToyParser("d")
p: ToyParser = ToyParser@12345

> val Failure(err: ParseError) = p.InputLine.run()
err: ParseError = org.parboiled2.ParseError

> p.formatError(err)
String =
Invalid input 'd', expected 'a' (line 1, column 1):
d
^

toy parser: step by step

def InputLine = rule { "a" ~ ("b" | "c") }

input: "ac"
InputLine starts — calls "~" subrule — cursor at 0 pos "~" matches left "a", and if succeeds then right "|" rule "a" is matched against input(0) — cursor is at 1 "|" matches left "b", and if fails then matches "c" "b" is failed to match — cursor stays at 1 "c" is succeeded to match — cursor is moved to 2 match is successful

how to produce AST?

value stack & typed rules

exploring value stack

rules do not return any value push and pop result from ValueStack
  • implicit temporary storage
  • e.g. used for constructing AST, "in-phase" computations
rule is typed according to its interaction with ValueStack

rule typing

class Rule[-I <: HList, +O <: HList]
  • I   <: HList   number and types the rule pops off
  • O <: HList   number and types the rule pushes on
  • types are inferred (in most cases)
  • aliases
type RuleN[L <: HList]   = Rule[HNil, L]
type Rule0               = RuleN[HNil]
type Rule1[T]            = RuleN[T :: HNil]
// B is pushed after A
type Rule2[A, B]         = RuleN[A :: B :: HNil]
type PopRule[L <: HList] = Rule[L, HNil]

rules groups

  • basic chars matchers
  • rules combinators and modifiers
  • parser actions

basic chars matchers

  • match input and cause progress
  • do nothing with ValueStack
implicit def ch(c: Char): Rule0
rule { 'a' } // rule { ch('a') }
implicit def str(s: String): Rule0
rule { "ab" } // rule { str("ab") }

rule { 42 } // compile-time error
implicit def predicate(p: CharPredicate): Rule0
CharPredicate.Digit, CharPredicate.LowerHexLetter, etc.
def EOI: Char

rules combinators & modifiers

e1 ~ e2

  • matches if e1 matches and then if e2 matches
  • type
def a: Rule1[Int]         = // ...
def b: Rule1[String]      = // ...
def c: Rule2[Int, String] = rule { a ~ b }
Rule[     , A    ] ~ Rule[     , B  ] = Rule[     , A:B    ]

Rule[A:B:C, D:E:F] ~ Rule[F    , G:H] = Rule[A:B:C, D:E:G:H]

Rule[A    , B:C  ] ~ Rule[D:B:C, E:F] = Rule[D:A  , E:F    ]

Rule[A    , B:C  ] ~ Rule[D:C  , E:F] - is Illegal if D != B

e1 | e2

tries to match e1 if matches then succeeds otherwise — result of e2 matching

optional(e)

  • tries to match e
  • always succeeds, even if e fails to match
optional(e: Rule0)    : Rule0
optional(e: Rule1[T]) : Rule1[Option[T]]

zeroOrMore(e) and oneOrMore(e)

  • zeroOrMore runs e until it fails — always succeeds
  • oneOrMore runs e until it fails — succeeds if inner rule is succeeded at least once
zeroOrMore(e: Rule0)    : Rule0
zeroOrMore(e: Rule1[T]) : Rule1[Seq[T]]

oneOrMore(e: Rule0)     : Rule0
oneOrMore(e: Rule1[T])  : Rule1[Seq[T]]

parser actions

  • can build recognizers so far: "does it have structure?"
  • need "actions" to produce result

push

pushes value onto ValueStack

rule { "true" ~ push(true) }
push(e: Unit)       : Rule0    - pushes nothing
push(e: L <: HList) : RuleN[L] - pushes all values of L
push(e: T)          : Rule1[T] - pushes a value of T

capture

pushes additional matched string onto ValueStack

rule { capture("42") } - if input matches "42" then pushes "42"
capture(zeroOrMore(CharPredicate.Digit)) : Rule1[String]
capture("true" ~ push(true))             : Rule2[Boolean, String]

a ~> (func)

transforms top elements of the Value Stack into some other object(s)

(foo: Rule1[String])      ~> ((s: String) ⇒ s.toInt)  : Rule1[Int]
(foo: Rule1[String])      ~> (s ⇒ s.toInt)            : Rule1[Int]
(foo: Rule1[String])      ~> (_.toInt)                : Rule1[Int]
(foo: Rule2[Int, String]) ~> ((i, s) ⇒ i + s.toInt)   : Rule1[Int]
(foo: Rule1[String])      ~> (println(_))             : Rule0
(foo: Rule1[String])      ~> (() ⇒ 17)                : Rule2[String, Int]
(foo: Rule1[String])      ~  push(17)                 : Rule2[String, Int]
(foo: Rule1[String]) ~> (s ⇒ s.toInt :: 3.14 :: HNil) : Rule2[Int, Double]
case class Person(name: String, age: Int)
(foo: Rule2[String, Int]) ~> Person                   : Rule1[Person]

reduction rules

parser makes progress leaving value stack unchanged

reduce "1+2+3+4+5" to (15:Int)
(number: Rule1[Int]) ~ zeroOrMore("+" ~ number) : Rule2[Int, Seq[Int]]
def reductor = "+" ~ number ~>
((a:Int, b: Int) => a + b) : Rule[Int::HNil, Int::HNil]

(number: Rule1[Int]) ~ zeroOrMore(reductor) : Rule1[Int]
capture(CharPredicate.Digit) ~ optional("h" ~> ((s: String) ⇒ s + "hex"))

(factor: Rule1[Int]) ~ oneOrMore("*" ~ factor ~> ((a: Int, b) ⇒ a * b))

calculator

> new Calculator("1+(2+4)*3").InputLine.run() 
scala.util.Success(25)
class Calculator(val input: ParserInput) extends Parser {
  def Digits = rule { oneOrMore(CharPredicate.Digit) } // Rule1[Stirng]

  def Number = rule { capture(Digits) ~> (_.toInt) }   // Rule1[Int]
  def Parens = rule { '(' ~ Expression ~ ')' }
  def Factor = rule { Number | Parens }
  def Term = rule { Factor ~ zeroOrMore('*' ~ Factor ~> ((_: Int) * _)
                                      | '/' ~ Factor ~> ((_: Int) / _)) }
  def Expression: Rule1[Int] = rule {
      Term ~ zeroOrMore('+' ~ Term ~> ((_: Int) + _)
                      | '-' ~ Term ~> ((_: Int) - _)) }
  def InputLine = rule { Expression ~ EOI }

}

under the hood

rule macro

case class Rule(matched: Boolean)
abstract class Parser {
  def rule(r: Rule): Rule = macro ParserMacros.ruleImpl

  def input: ParserInput
  def __nextChar(): Char = // ...
  def __cursor: Int = // ...
}
object ParserMacros {	
  def ruleImpl(ctx: ParserContext)(r: ctx.Expr[Rule]): ctx.Expr[Rule] = {
    // ...
  }
}

two step render()

def ruleImpl(ctx: ParserContext)(r: ctx.Expr[Rule]): ctx.Expr[Rule] = 
  reify {
    OpTree(r.tree).render().splice
  }
input: Scala compiler AST IR: parboiled2 OpTree output: effective Scala code AST

quasiquotes match

def r = rule { "a" }
def r = rule { str("a") }
def r = rule { CurrentParser.this.str("a") }
r.tree = Apply(Select(This(TypeName("CurrentParser")), TermName("str")),
               List(Literal(Constant("a"))))
r.tree match {
  case Apply(Select(_, TermName("str")), List(Literal(Constant(s)))) ⇒
    StringMatch(s) // s is "a"
}
r.tree match {
  case q"$a.this.str($s)" ⇒ StringMatch(s) // a is CurrentParser
                                           // s is "a"
}

OpTree

object OpTree {
  def apply(tree: Tree): OpTree =
    r.tree match {
      case q"$a.this.str($s)" ⇒ StringMatch(s)
      case q"$a.this.ch($ch)" ⇒ CharMatch(ch)
      case q"$lhs.|($rhs)"    ⇒ FirstOf(OpTree(lhs), OpTree(rhs))
      case q"$lhs.~($rhs)"    ⇒ Sequence(OpTree(lhs), OpTree(rhs))
    }
}
abstract class OpTree {
  def render(): Expr[Rule]
}

case class StringMatch(strTree: Tree)         extends OpTree
case class CharMatch(charTree: Tree)          extends OpTree
case class FirstOf(lhs: OpTree, rhs: OpTree)  extends OpTree
case class Sequence(lhs: OpTree, rhs: OpTree) extends OpTree

CharMatch

// (rule { ch('a') }).tree match { case q"$a.this.ch($charTree)" ⇒ ... }

case class CharMatch(charTree: Tree) extends OpTree {
  def render(): Expr[Rule] = reify {
    val p    = c.prefix.splice               // Current parser
    val char = c.Expr[Char](charTree).splice // : Char
    Rule(p.__nextChar() == char)
  }
}

StringMatch: naive

// (rule { str('a') }).tree match { case q"$a.this.str($strTree)" ⇒ ... }

case class StringMatch(strTree: Tree) extends OpTree {
  def render(): Expr[Rule] = c.Expr[Rule](q''
    val p   = ${c.prefix} // Current parser
    val str = $strTree    // : String

    val res = str.forall(c ⇒ c == p.__nextChar())
    Rule(res)
'')

StringMatch: effective

// (rule { str('a') }).tree match { case q"$a.this.str($strTree)" ⇒ ... }

case class StringMatch(strTree: Tree) extends OpTree {
  def render(): Expr[Rule] = c.Expr[Rule](q''
    val p   = ${c.prefix} // Current parser
    val str = $strTree    // : String

    var ix = 0
    while (ix < str.length && str.charAt(ix) == p.__nextChar()) {
      ix += 1
    }
    Rule(ix == str.length)
'')

Sequence: a ~ b

// case q"$lhs.~($rhs)" ⇒ Sequence(OpTree(lhs), OpTree(rhs))

  case class Sequence(lhs: OpTree, rhs: OpTree) extends OpTree {
    def render(): Expr[Rule] = c.Expr[Rule](q''
    val res = lhs.render().matched && rhs.render().matched
    Rule(res)
  '')
}

FirstOf: a | b

// case q"$lhs.|($rhs)" ⇒ FirstOf(OpTree(lhs), OpTree(rhs))

case class FirstOf(lhs: OpTree, rhs: OpTree) extends OpTree {
  def render(): Expr[Rule] = c.Expr[Rule](q''
    val mark = p.__cursor
    if (lhs.render().matched) {
      Rule(true)
    } else {
      p.__cursor = mark
      val res = rhs.render().matched
      Rule(res)
    }
  '')
}

benchmark

> cappi::benchmark

            benchmark    ms linear runtime
 Parboiled1JsonParser 85.64 ==============================
  arboiled2JsonParser 13.17 ====
             Argonaut  7.01 ==
         Json4SNative  8.06 ==
        Json4SJackson  4.09 =

Thanks

Q & A

@Alex_Myltsev

parboiled2.org