Creating Grammar Parsers in Java and Scala with Parboiled
January 26th, 2014 by Micha KopsParboiled is a modern. lightweight and easy to use library to parse expression grammars in Java or Scala and in my humble opinion it is perfect for use cases where you need something between regular expressions and a complex parser generator like ANTLR.
In the following tutorial we’re going to create a simple grammar to specify a task list and write an implementation of a parser also as unit tests for each grammar rule in Java.
Additionally, we’re using the Scala variant of Parboiled to build up an Abstract Syntax Tree parser and analyze a given task list with it.
Specifying a DSL Grammar
In the following tutorial we want to parse a simple task list. I’ve recycled the a part of the syntax from my Quick Subtasks Plugin for JIRA here that is:
A task list contains of one ore more tasks, separated by a newline.
Each task definition may contain ..
- A summary (mandatory)
- An assignee (optional)
- One or more labels (optional)
Each line of a given input may specify a task, the summary is prefixed by a ‘- ‘ character and when an assignee or labels are given, these are separated from the summary by a ‘/’ character.
So this might be a sample input for the following parsers:
- First Task / assignee:"joe" labels:"foo,bar,baz" - Second Task / assignee:"fred" labels:"beer,wine"
Parsing this input should produce two task nodes:
- One task with summary “First Task“, an assignee named “joe” and three labels named “foo“, “bar“, “baz“
- A second task with summary “Second Task” , an assignee named “fred” and two labels “beer“, “wine“
Using Java
Project Setup
Just one dependency needed here: parboiled-java – please feel free to choose your favourite build system here.
If you’re interested in grammar building with the scala version of parboiled please skip to the section “Using Scala“.
Maven
When using Maven, simply add the following dependency to your project’s pom.xml:
<dependency> <groupId>org.parboiled</groupId> <artifactId>parboiled-java</artifactId> <version>1.1.6</version> </dependency>
Gradle
Adding the following entry to your build.gradle is all you need.
'org.parboiled:parboiled-java:1.1.6'
SBT
Adding the following entry to your build.sbt adds parboiled to your SBT enabled project
libraryDependencies += "org.parboiled" % "parboiled-java" % "1.1.6"
Processing a Single Task
First of all we’re starting to create a parser for a single line, composed out of different rules.
The Task Class
This class encapsules information for a single task and provides getters and setters (returning the task object).
public class Task { private String summary; private String assignee; private final Set<String> labels = new HashSet<>(); public final String summary() { return summary; } public final Task summary(final String summary) { this.summary = summary; return this; } public final String assignee() { return assignee; } public final Task assignee(final String assignee) { this.assignee = assignee; return this; } public Set<String> labels() { return labels; } public Task label(final String label) { labels.add(label); return this; } }
The Grammar Parser
This is our first approach for a parser – extending org.parboiled.BaseParser adds everything we need to construct our rules here.
A short note on the elements used here to create new rules:
- Sequence: Creates a new rule that only succeeds if all of its subrule succeed, one after the other.
- Optional: Creates a new rule that tries a match on its subrule and always succeeds, independently of the matching success of its sub rule.
- EOI: Matches the Chars.EOI (end of input) character.
- ZeroOrMore: Creates a new rule that tries repeated matches of its subrule.
- OneOrMore: Creates a new rule that tries repeated matches of its subrule and succeeds if the subrule matches at least once.
- TestNot: Creates a new rule that acts as an inverse syntactic predicate, i.e.
- ANY: Matches any character except Chars.EOI.
- Ch: Explicitly creates a rule matching the given character.
There is a plenty of other methods for your convenience, contained in org.parboiled.BaseParser.
package com.hascode.parser; import org.parboiled.BaseParser; import org.parboiled.Rule; import org.parboiled.annotations.BuildParseTree; import com.hascode.Task; @BuildParseTree public class SingleLineDslParser extends BaseParser<Task> { Task dto = new Task(); public Rule Task() { return Sequence(Summary(), Optional(Options()), EOI, push(dto)); } public Rule Summary() { return Sequence('-', ZeroOrMore(' '), Chars(), push(dto.summary(match()))); } public Rule Chars() { return OneOrMore(TestNot(OptLim()), TestNot(FieldLim()), ANY); } public Rule NoCommaChars() { return OneOrMore(TestNot(ValSep()), TestNot(OptLim()), TestNot(FieldLim()), ANY); } public Rule Options() { return Sequence(OptSep(), Optional(Assignee()), Optional(Labels())); } public Rule OptSep() { return Sequence(OptSp(), OptLim(), OptSp()); } public Rule Assignee() { return Sequence(OptSp(), String("assignee"), FieldSep(), FieldLim(), Chars(), push(dto.assignee(match())), FieldLim()); } public Rule FieldLim() { return Ch('"'); } public Rule OptLim() { return Ch('|'); } public Rule FieldSep() { return Ch(':'); } public Rule OptSp() { return ZeroOrMore(' '); } public Rule ValSep() { return Ch(','); } public Rule Labels() { return Sequence(OptSp(), String("labels"), FieldSep(), FieldLim(), OneOrMore(Label()), FieldLim()); } public Rule Label() { return Sequence(OptSp(), NoCommaChars(), push(dto.label(match())), Optional(ValSep()), OptSp()); } }
You may even use recursion here because Parboiled is able to cache method invocations to avoid infinite loops here.
Testing Rules
The more complex a grammar or domain-specific language tends to grow the greater the need to write detailed tests for each fragment.
In the following example we’re writing JUnit tests for each important rule:
package it.parser; import static org.hamcrest.MatcherAssert.assertThat; import static org.hamcrest.Matchers.containsInAnyOrder; import static org.hamcrest.Matchers.equalTo; import java.util.Set; import org.junit.Test; import org.parboiled.Parboiled; import org.parboiled.parserunners.RecoveringParseRunner; import org.parboiled.support.ParsingResult; import com.hascode.Task; import com.hascode.parser.SingleLineDslParser; public class SingleLineParserTest { SingleLineDslParser parser = Parboiled .createParser(SingleLineDslParser.class); @Test public void shouldParseMultipleLabels() throws Exception { String input = "labels:\"foo,bar,baz\""; ParsingResult<?> result = new RecoveringParseRunner<Task>( parser.Labels()).run(input); Set<String> labels = ((Task) result.resultValue).labels(); assertThat(labels.size(), equalTo(3)); assertThat(labels, containsInAnyOrder("foo", "bar", "baz")); } @Test public void shouldParseSummary() throws Exception { String input = "- This is a first task"; ParsingResult<?> result = new RecoveringParseRunner<Task>( parser.Summary()).run(input); Task task = (Task) result.resultValue; assertThat(task.summary(), equalTo("This is a first task")); } @Test public void shouldParseAssignee() throws Exception { String input = "assignee:\"tim\""; ParsingResult<?> result = new RecoveringParseRunner<Task>( parser.Assignee()).run(input); Task task = (Task) result.resultValue; assertThat(task.assignee(), equalTo("tim")); } @Test public void shouldParseCompleteLine() throws Exception { String input = "- This is a first task| assignee:\"tim\" labels:\"foo,bar,baz\""; ParsingResult<?> result = new RecoveringParseRunner<Task>(parser.Task()) .run(input); Task task = (Task) result.resultValue; assertThat(task.summary(), equalTo("This is a first task")); assertThat(task.assignee(), equalTo("tim")); Set<String> labels = task.labels(); assertThat(labels.size(), equalTo(3)); assertThat(labels, containsInAnyOrder("foo", "bar", "baz")); } }
Running the tests using the JUnit runner or mvn -Dtest=SingleLineParserTest test should produce no errors:
------------------------------------------------------- T E S T S ------------------------------------------------------- Running it.parser.SingleLineParserTest Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.148 sec Results : Tests run: 4, Failures: 0, Errors: 0, Skipped: 0
Processing multiple Tasks
Now we’re going write a grammar to parse multiple tasks that should finally return a TaskList containing multiple Task objects.
In opposite to the first grammar, we’re using another approach here and make use of the generic org.parboiled.support.Var to pass the task list and the specific task objects around to be filled.
We’re using the same Task class as in the first example, new is the following task list:
The TaskList
Out tasklist is a simple container for a list of tasks:
public class TaskList { private final List<Task> tasks = new ArrayList<>(); public TaskList add(final Task task) { tasks.add(task); return this; } public List<Task> tasks() { return tasks; } }
The Parser
This is our final parser with Var-parameter passing and rule for tasks and newlines added:
package com.hascode.parser; import org.parboiled.BaseParser; import org.parboiled.Rule; import org.parboiled.annotations.BuildParseTree; import org.parboiled.support.Var; import com.hascode.Task; import com.hascode.TaskList; @BuildParseTree public class CompleteDslParser extends BaseParser<Object> { public Rule Tasks() { Var<TaskList> tasks = new Var<>(new TaskList()); return Sequence(OneOrMore(Task(tasks)), EOI, push(tasks.getAndClear())); } public Rule Task(final Var<TaskList> tasks) { Var<Task> dto = new Var<>(new Task()); return Sequence(Summary(dto), Optional(Options(dto)), Optional(Newline()), push(tasks.get().add(dto.get()))).label( "task"); } public Rule Summary(final Var<Task> dto) { return Sequence('-', ZeroOrMore(' '), Chars(), push(dto.get().summary(match()))).label("summary"); } public Rule Chars() { return OneOrMore(TestNot(OptLim()), TestNot(FieldLim()), ANY); } public Rule NoCommaChars() { return OneOrMore(TestNot(ValSep()), TestNot(OptLim()), TestNot(FieldLim()), ANY); } public Rule Options(final Var<Task> dto) { return Sequence(OptSep(), Optional(Assignee(dto)), Optional(Labels(dto))); } public Rule OptSep() { return Sequence(OptSp(), OptLim(), OptSp()); } public Rule Assignee(final Var<Task> dto) { return Sequence(OptSp(), String("assignee"), FieldSep(), FieldLim(), Chars(), push(dto.get().assignee(match())), FieldLim()).label( "assignee"); } public Rule FieldLim() { return Ch('"'); } public Rule OptLim() { return Ch('|'); } public Rule FieldSep() { return Ch(':'); } public Rule OptSp() { return ZeroOrMore(' '); } public Rule ValSep() { return Ch(','); } public Rule Labels(final Var<Task> dto) { return Sequence(OptSp(), String("labels"), FieldSep(), FieldLim(), OneOrMore(Label(dto)), FieldLim()); } public Rule Label(final Var<Task> dto) { return Sequence(OptSp(), NoCommaChars(), push(dto.get().label(match())), Optional(ValSep()), OptSp()); } public Rule Newline() { return FirstOf('\n', Sequence('\r', Optional('\n'))); } }
Testing the Parser
Now we’re ready to test the parser for a given input. The following JUnit tests should be green:
package it.parser; import static org.hamcrest.MatcherAssert.assertThat; import static org.hamcrest.Matchers.containsInAnyOrder; import static org.hamcrest.Matchers.equalTo; import static org.hamcrest.Matchers.nullValue; import java.util.List; import java.util.Set; import org.junit.Test; import org.parboiled.Parboiled; import org.parboiled.parserunners.RecoveringParseRunner; import org.parboiled.support.ParsingResult; import com.hascode.Task; import com.hascode.TaskList; import com.hascode.parser.CompleteDslParser; public class CompleteDslParserTest { CompleteDslParser parser = Parboiled.createParser(CompleteDslParser.class); @Test public void shouldParseMultipleLines() throws Exception { String line1 = String .format("- A first task | assignee:\"joe\" labels:\"foo,bar,baz\""); String line2 = "- This is the second entry | assignee:\"fred\" labels:\"beer,wine\""; String line3 = "- And a third entry"; String dslString = line1 + "\n" + line2 + "\n" + line3; ParsingResult<TaskList> result = new RecoveringParseRunner<TaskList>( parser.Tasks()).run(dslString); TaskList taskList = result.resultValue; List<Task> tasks = taskList.tasks(); assertThat(tasks.size(), equalTo(3)); Task task1 = tasks.get(0); assertThat(task1.summary(), equalTo("A first task ")); assertThat(task1.assignee(), equalTo("joe")); Set<String> task1Labels = task1.labels(); assertThat(task1Labels, containsInAnyOrder("foo", "bar", "baz")); Task task2 = tasks.get(1); assertThat(task2.summary(), equalTo("This is the second entry ")); assertThat(task2.assignee(), equalTo("fred")); Set<String> task2Labels = task2.labels(); assertThat(task2Labels, containsInAnyOrder("beer", "wine")); Task task3 = tasks.get(2); assertThat(task3.summary(), equalTo("And a third entry")); assertThat(task3.assignee(), nullValue()); Set<String> task3Labels = task3.labels(); assertThat(task3Labels.isEmpty(), equalTo(true)); } }
Test output could be similar to this one:
mvn -Dtest=CompleteDslParserTest test ------------------------------------------------------- T E S T S ------------------------------------------------------- Running it.parser.CompleteDslParserTest Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.146 sec Results : Tests run: 1, Failures: 0, Errors: 0, Skipped: 0
Using Scala
Now for a short example in Scala we’re going to construct an abstract syntax tree (AST) using Parboiled’s Scala API.
Project Setup using SBT
This is my build.sbt with the dependency for parboiled-scala. If you’ve cloned the scala project from my Bitbucket repository and you’re using Eclipse IDE, you might want to need necessary project files by running sbt eclipse.
name := "parboiled-grammar-tutorial-scala" organization := "com.hascode.tutorial" version := "1.0.0" scalaVersion := "2.10.1" libraryDependencies ++= Seq( "org.parboiled" % "parboiled-scala_2.10" % "1.1.6" )
Writing an AST Parser
The following code constructs a simple AST tree covering a task with its summary (I was too lazy to cover all task properties here).
The Parser trait adds the different rule operations we need here and the import of org.parboiled.scala._ adds the rest we need here.
Please note the special syntax available in Scala in comparison to the Java version .. e.g. to specify a Sequence you may use the ~ operator here, additional syntactic sugar is hidden in the Rule objects e.g. Rule1.
package com.hascode.parser import org.parboiled.scala._ import org.parboiled.errors.ErrorUtils import org.parboiled.errors.ParsingException sealed abstract class AstNode case class TasksNode(tasks: List[TaskNode]) extends AstNode case class TaskNode(summary: SummaryNode) extends AstNode case class SummaryNode(text: StringNode) extends AstNode case class StringNode(text: String) extends AstNode class SingleLineDslParser extends Parser { def Tasks: Rule1[AstNode] = rule { oneOrMore(Task) ~~> TasksNode } def Task: Rule1[TaskNode] = rule { str("- ") ~ Summary ~~> TaskNode ~ optional(Newline) } def Summary: Rule1[SummaryNode] = rule { Chars ~~> SummaryNode } def Chars: Rule1[StringNode] = rule { oneOrMore(Char) ~> StringNode } def Char: Rule0 = rule { !anyOf("-") ~ ANY } def Newline: Rule0 = rule { str("\n") } def parse(input: String): AstNode = { val parsingResult = ReportingParseRunner(Tasks).run(input) parsingResult.result match { case Some(task) => task case None => throw new ParsingException("Invalid input:\n" + ErrorUtils.printParseErrors(parsingResult)) } } }
Running the Parser
Now we’d like to test the results so we’re adding third tasks in a string to be parsed and afterwards we’re printing the summary from the results.
object ParserExample extends App { val dslLine = """- A first task - And a second task - Finally a third task """ val parsingResult = new SingleLineDslParser().parse(dslLine); parsingResult match { case tasks: TasksNode => { println("Got " + tasks.tasks.length + " tasks:") tasks.tasks.foreach(task => println(s"Task: ${task.summary.text.text}")) } case _ => println("error") } }
To run the example above we’re using our IDE or using SBT by running:
sbt run
This should produce the following output:
Got 3 tasks:
Task: A first task
Task: And a second task
Task: Finally a third task
Tutorial Sources
Please feel free to download the tutorial sources from my Bitbucket repository, fork it there or clone it using Git:
git clone https://bitbucket.org/hascode/parboiled-grammar-tutorial.git
Conclusion
Parboiled offers a variety of possibilities here and I’m afraid that I haven’t grasped every detail yet. But if you’re interested in some more complex or real-life examples, I highly recommend to take a look at the Markdown Processor Example (Java API) or the JSON Parser Example (Scala API) or enjoy the detailed documentation on the project’s website here.
Thanks @resah for pointing me to this framework! I’m sure that it will make a good match for my toolbox of frameworks and libraries for future usage/projects.