Creating Grammar Parsers in Java and Scala with Parboiled

January 26th, 2014 by

Parboiled is a modern. lightweight and easy to use library to parse expression grammars in Java or Scala and in my humble opinion it is perfect for use cases where you need something between regular expressions and a complex parser generator like ANTLR.

In the following tutorial we’re going to create a simple grammar to specify a task list and write an implementation of a parser also as unit tests for each grammar rule in Java.

Additionally, we’re using the Scala variant of Parboiled to build up an Abstract Syntax Tree parser and analyze a given task list with it.

Testing the grammar parser

Testing the grammar parser

 

Specifying a DSL Grammar

In the following tutorial we want to parse a simple task list. I’ve recycled the a part of the syntax from my Quick Subtasks Plugin for JIRA here that is:

A task list contains of one ore more tasks, separated by a newline.

Each task definition may contain ..

  • A summary (mandatory)
  • An assignee (optional)
  • One or more labels (optional)

Each line of a given input may specify a task, the summary is prefixed by a ‘- ‘ character and when an assignee or labels are given, these are separated from the summary by a ‘/’ character.

So this might be a sample input for the following parsers:

- First Task / assignee:"joe" labels:"foo,bar,baz"
- Second Task / assignee:"fred" labels:"beer,wine"

Parsing this input should produce two task nodes:

  • One task with summary “First Task“, an assignee named “joe” and three labels named “foo“, “bar“, “baz
  • A second task with summary “Second Task” , an assignee named “fred” and two labels “beer“, “wine

Using Java

Project Setup

Just one dependency needed here: parboiled-java – please feel free to choose your favourite build system here.

If you’re interested in grammar building with the scala version of parboiled please skip to the section “Using Scala“.

Maven

When using Maven, simply add the following dependency to your project’s pom.xml:

<dependency>
	<groupId>org.parboiled</groupId>
	<artifactId>parboiled-java</artifactId>
	<version>1.1.6</version>
</dependency>

Gradle

Adding the following entry to your build.gradle is all you need.

'org.parboiled:parboiled-java:1.1.6'

SBT

Adding the following entry to your build.sbt adds parboiled to your SBT enabled project

libraryDependencies += "org.parboiled" % "parboiled-java" % "1.1.6"

Processing a Single Task

First of all we’re starting to create a parser for a single line, composed out of different rules.

The Task Class

This class encapsules information for a single task and provides getters and setters (returning the task object).

public class Task {
	private String summary;
	private String assignee;
	private final Set<String> labels = new HashSet<>();
 
	public final String summary() {
		return summary;
	}
 
	public final Task summary(final String summary) {
		this.summary = summary;
		return this;
	}
 
	public final String assignee() {
		return assignee;
	}
 
	public final Task assignee(final String assignee) {
		this.assignee = assignee;
		return this;
	}
 
	public Set<String> labels() {
		return labels;
	}
 
	public Task label(final String label) {
		labels.add(label);
		return this;
	}
}
The Grammar Parser

This is our first approach for a parser – extending org.parboiled.BaseParser adds everything we need to construct our rules here.

A short note on the elements used here to create new rules:

  • Sequence: Creates a new rule that only succeeds if all of its subrule succeed, one after the other.
  • Optional: Creates a new rule that tries a match on its subrule and always succeeds, independently of the matching success of its sub rule.
  • EOI: Matches the Chars.EOI (end of input) character.
  • ZeroOrMore: Creates a new rule that tries repeated matches of its subrule.
  • OneOrMore: Creates a new rule that tries repeated matches of its subrule and succeeds if the subrule matches at least once.
  • TestNot: Creates a new rule that acts as an inverse syntactic predicate, i.e.
  • ANY: Matches any character except Chars.EOI.
  • Ch: Explicitly creates a rule matching the given character.

There is a plenty of other methods for your convenience, contained in org.parboiled.BaseParser.

package com.hascode.parser;
 
import org.parboiled.BaseParser;
import org.parboiled.Rule;
import org.parboiled.annotations.BuildParseTree;
 
import com.hascode.Task;
 
@BuildParseTree
public class SingleLineDslParser extends BaseParser<Task> {
	Task dto = new Task();
 
	public Rule Task() {
		return Sequence(Summary(), Optional(Options()), EOI, push(dto));
	}
 
	public Rule Summary() {
		return Sequence('-', ZeroOrMore(' '), Chars(),
				push(dto.summary(match())));
	}
 
	public Rule Chars() {
		return OneOrMore(TestNot(OptLim()), TestNot(FieldLim()), ANY);
	}
 
	public Rule NoCommaChars() {
		return OneOrMore(TestNot(ValSep()), TestNot(OptLim()),
				TestNot(FieldLim()), ANY);
	}
 
	public Rule Options() {
		return Sequence(OptSep(), Optional(Assignee()), Optional(Labels()));
	}
 
	public Rule OptSep() {
		return Sequence(OptSp(), OptLim(), OptSp());
	}
 
	public Rule Assignee() {
		return Sequence(OptSp(), String("assignee"), FieldSep(), FieldLim(),
				Chars(), push(dto.assignee(match())), FieldLim());
	}
 
	public Rule FieldLim() {
		return Ch('"');
	}
 
	public Rule OptLim() {
		return Ch('|');
	}
 
	public Rule FieldSep() {
		return Ch(':');
	}
 
	public Rule OptSp() {
		return ZeroOrMore(' ');
	}
 
	public Rule ValSep() {
		return Ch(',');
	}
 
	public Rule Labels() {
		return Sequence(OptSp(), String("labels"), FieldSep(), FieldLim(),
				OneOrMore(Label()), FieldLim());
	}
 
	public Rule Label() {
		return Sequence(OptSp(), NoCommaChars(), push(dto.label(match())),
				Optional(ValSep()), OptSp());
	}
}

You may even use recursion here because Parboiled is able to cache method invocations to avoid infinite loops here.

Testing Rules

The more complex a grammar or domain-specific language tends to grow the greater the need to write detailed tests for each fragment.

In the following example we’re writing JUnit tests for each important rule:

package it.parser;
 
import static org.hamcrest.MatcherAssert.assertThat;
import static org.hamcrest.Matchers.containsInAnyOrder;
import static org.hamcrest.Matchers.equalTo;
 
import java.util.Set;
 
import org.junit.Test;
import org.parboiled.Parboiled;
import org.parboiled.parserunners.RecoveringParseRunner;
import org.parboiled.support.ParsingResult;
 
import com.hascode.Task;
import com.hascode.parser.SingleLineDslParser;
 
public class SingleLineParserTest {
	SingleLineDslParser parser = Parboiled
			.createParser(SingleLineDslParser.class);
 
	@Test
	public void shouldParseMultipleLabels() throws Exception {
		String input = "labels:\"foo,bar,baz\"";
		ParsingResult<?> result = new RecoveringParseRunner<Task>(
				parser.Labels()).run(input);
		Set<String> labels = ((Task) result.resultValue).labels();
		assertThat(labels.size(), equalTo(3));
		assertThat(labels, containsInAnyOrder("foo", "bar", "baz"));
	}
 
	@Test
	public void shouldParseSummary() throws Exception {
		String input = "- This is a first task";
		ParsingResult<?> result = new RecoveringParseRunner<Task>(
				parser.Summary()).run(input);
		Task task = (Task) result.resultValue;
		assertThat(task.summary(), equalTo("This is a first task"));
	}
 
	@Test
	public void shouldParseAssignee() throws Exception {
		String input = "assignee:\"tim\"";
		ParsingResult<?> result = new RecoveringParseRunner<Task>(
				parser.Assignee()).run(input);
		Task task = (Task) result.resultValue;
		assertThat(task.assignee(), equalTo("tim"));
	}
 
	@Test
	public void shouldParseCompleteLine() throws Exception {
		String input = "- This is a first task| assignee:\"tim\" labels:\"foo,bar,baz\"";
		ParsingResult<?> result = new RecoveringParseRunner<Task>(parser.Task())
				.run(input);
		Task task = (Task) result.resultValue;
		assertThat(task.summary(), equalTo("This is a first task"));
		assertThat(task.assignee(), equalTo("tim"));
		Set<String> labels = task.labels();
		assertThat(labels.size(), equalTo(3));
		assertThat(labels, containsInAnyOrder("foo", "bar", "baz"));
	}
}

Running the tests using the JUnit runner or mvn -Dtest=SingleLineParserTest test should produce no errors:

-------------------------------------------------------
T E S T S
-------------------------------------------------------
Running it.parser.SingleLineParserTest
Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.148 sec
 
Results :
 
Tests run: 4, Failures: 0, Errors: 0, Skipped: 0

Processing multiple Tasks

Now we’re going write a grammar to parse multiple tasks that should finally return a TaskList containing multiple Task objects.

In opposite to the first grammar, we’re using another approach here and make use of the generic org.parboiled.support.Var to pass the task list and the specific task objects around to be filled.

We’re using the same Task class as in the first example, new is the following task list:

The TaskList

Out tasklist is a simple container for a list of tasks:

public class TaskList {
	private final List<Task> tasks = new ArrayList<>();
 
	public TaskList add(final Task task) {
		tasks.add(task);
		return this;
	}
 
	public List<Task> tasks() {
		return tasks;
	}
}
The Parser

This is our final parser with Var-parameter passing and rule for tasks and newlines added:

package com.hascode.parser;
 
import org.parboiled.BaseParser;
import org.parboiled.Rule;
import org.parboiled.annotations.BuildParseTree;
import org.parboiled.support.Var;
 
import com.hascode.Task;
import com.hascode.TaskList;
 
@BuildParseTree
public class CompleteDslParser extends BaseParser<Object> {
 
	public Rule Tasks() {
		Var<TaskList> tasks = new Var<>(new TaskList());
		return Sequence(OneOrMore(Task(tasks)), EOI, push(tasks.getAndClear()));
	}
 
	public Rule Task(final Var<TaskList> tasks) {
		Var<Task> dto = new Var<>(new Task());
		return Sequence(Summary(dto), Optional(Options(dto)),
				Optional(Newline()), push(tasks.get().add(dto.get()))).label(
				"task");
	}
 
	public Rule Summary(final Var<Task> dto) {
		return Sequence('-', ZeroOrMore(' '), Chars(),
				push(dto.get().summary(match()))).label("summary");
	}
 
	public Rule Chars() {
		return OneOrMore(TestNot(OptLim()), TestNot(FieldLim()), ANY);
	}
 
	public Rule NoCommaChars() {
		return OneOrMore(TestNot(ValSep()), TestNot(OptLim()),
				TestNot(FieldLim()), ANY);
	}
 
	public Rule Options(final Var<Task> dto) {
		return Sequence(OptSep(), Optional(Assignee(dto)),
				Optional(Labels(dto)));
	}
 
	public Rule OptSep() {
		return Sequence(OptSp(), OptLim(), OptSp());
	}
 
	public Rule Assignee(final Var<Task> dto) {
		return Sequence(OptSp(), String("assignee"), FieldSep(), FieldLim(),
				Chars(), push(dto.get().assignee(match())), FieldLim()).label(
				"assignee");
	}
 
	public Rule FieldLim() {
		return Ch('"');
	}
 
	public Rule OptLim() {
		return Ch('|');
	}
 
	public Rule FieldSep() {
		return Ch(':');
	}
 
	public Rule OptSp() {
		return ZeroOrMore(' ');
	}
 
	public Rule ValSep() {
		return Ch(',');
	}
 
	public Rule Labels(final Var<Task> dto) {
		return Sequence(OptSp(), String("labels"), FieldSep(), FieldLim(),
				OneOrMore(Label(dto)), FieldLim());
	}
 
	public Rule Label(final Var<Task> dto) {
		return Sequence(OptSp(), NoCommaChars(),
				push(dto.get().label(match())), Optional(ValSep()), OptSp());
	}
 
	public Rule Newline() {
		return FirstOf('\n', Sequence('\r', Optional('\n')));
	}
}
Testing the Parser

Now we’re ready to test the parser for a given input. The following JUnit tests should be green:

package it.parser;
 
import static org.hamcrest.MatcherAssert.assertThat;
import static org.hamcrest.Matchers.containsInAnyOrder;
import static org.hamcrest.Matchers.equalTo;
import static org.hamcrest.Matchers.nullValue;
 
import java.util.List;
import java.util.Set;
 
import org.junit.Test;
import org.parboiled.Parboiled;
import org.parboiled.parserunners.RecoveringParseRunner;
import org.parboiled.support.ParsingResult;
 
import com.hascode.Task;
import com.hascode.TaskList;
import com.hascode.parser.CompleteDslParser;
 
public class CompleteDslParserTest {
	CompleteDslParser parser = Parboiled.createParser(CompleteDslParser.class);
 
	@Test
	public void shouldParseMultipleLines() throws Exception {
		String line1 = String
				.format("- A first task | assignee:\"joe\" labels:\"foo,bar,baz\"");
		String line2 = "- This is the second entry | assignee:\"fred\" labels:\"beer,wine\"";
		String line3 = "- And a third entry";
		String dslString = line1 + "\n" + line2 + "\n" + line3;
		ParsingResult<TaskList> result = new RecoveringParseRunner<TaskList>(
				parser.Tasks()).run(dslString);
		TaskList taskList = result.resultValue;
		List<Task> tasks = taskList.tasks();
		assertThat(tasks.size(), equalTo(3));
 
		Task task1 = tasks.get(0);
		assertThat(task1.summary(), equalTo("A first task "));
		assertThat(task1.assignee(), equalTo("joe"));
		Set<String> task1Labels = task1.labels();
		assertThat(task1Labels, containsInAnyOrder("foo", "bar", "baz"));
 
		Task task2 = tasks.get(1);
		assertThat(task2.summary(), equalTo("This is the second entry "));
		assertThat(task2.assignee(), equalTo("fred"));
		Set<String> task2Labels = task2.labels();
		assertThat(task2Labels, containsInAnyOrder("beer", "wine"));
 
		Task task3 = tasks.get(2);
		assertThat(task3.summary(), equalTo("And a third entry"));
		assertThat(task3.assignee(), nullValue());
		Set<String> task3Labels = task3.labels();
		assertThat(task3Labels.isEmpty(), equalTo(true));
	}
}

Test output could be similar to this one:

mvn -Dtest=CompleteDslParserTest test
 
-------------------------------------------------------
T E S T S
-------------------------------------------------------
Running it.parser.CompleteDslParserTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.146 sec
 
Results :
 
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0

Using Scala

Now for a short example in Scala we’re going to construct an abstract syntax tree (AST) using Parboiled’s Scala API.

Project Setup using SBT

This is my build.sbt with the dependency for parboiled-scala. If you’ve cloned the scala project from my Bitbucket repository and you’re using Eclipse IDE, you might want to need necessary project files by running sbt eclipse.

name := "parboiled-grammar-tutorial-scala"
 
organization := "com.hascode.tutorial"
 
version := "1.0.0"
 
scalaVersion := "2.10.1"
 
libraryDependencies ++= Seq(
  "org.parboiled" % "parboiled-scala_2.10" % "1.1.6"
)

Writing an AST Parser

The following code constructs a simple AST tree covering a task with its summary (I was too lazy to cover all task properties here).

The Parser trait adds the different rule operations we need here and the import of org.parboiled.scala._ adds the rest we need here.

Please note the special syntax available in Scala in comparison to the Java version .. e.g. to specify a Sequence you may use the ~ operator here, additional syntactic sugar is hidden in the Rule objects e.g. Rule1.

package com.hascode.parser
 
import org.parboiled.scala._
import org.parboiled.errors.ErrorUtils
import org.parboiled.errors.ParsingException
 
sealed abstract class AstNode
case class TasksNode(tasks: List[TaskNode]) extends AstNode
case class TaskNode(summary: SummaryNode) extends AstNode
case class SummaryNode(text: StringNode) extends AstNode
case class StringNode(text: String) extends AstNode
 
class SingleLineDslParser extends Parser {
  def Tasks: Rule1[AstNode] = rule { oneOrMore(Task) ~~> TasksNode }
 
  def Task: Rule1[TaskNode] = rule { str("- ") ~ Summary ~~> TaskNode ~ optional(Newline) }
 
  def Summary: Rule1[SummaryNode] = rule { Chars ~~> SummaryNode }
 
  def Chars: Rule1[StringNode] = rule { oneOrMore(Char) ~> StringNode }
 
  def Char: Rule0 = rule { !anyOf("-") ~ ANY }
 
  def Newline: Rule0 = rule { str("\n") }
 
  def parse(input: String): AstNode = {
    val parsingResult = ReportingParseRunner(Tasks).run(input)
    parsingResult.result match {
      case Some(task) => task
      case None => throw new ParsingException("Invalid input:\n" +
        ErrorUtils.printParseErrors(parsingResult))
    }
  }
}

Running the Parser

Now we’d like to test the results so we’re adding third tasks in a string to be parsed and afterwards we’re printing the summary from the results.

object ParserExample extends App {
  val dslLine = """- A first task
- And a second task
- Finally a third task
"""
 
  val parsingResult = new SingleLineDslParser().parse(dslLine);
  parsingResult match {
    case tasks: TasksNode => {
      println("Got " + tasks.tasks.length + " tasks:")
      tasks.tasks.foreach(task => println(s"Task: ${task.summary.text.text}"))
    }
    case _ => println("error")
  }
}

To run the example above we’re using our IDE or using SBT by running:

sbt run

This should produce the following output:

Got 3 tasks:
Task: A first task
 
Task: And a second task
 
Task: Finally a third task

Tutorial Sources

Please feel free to download the tutorial sources from my Bitbucket repository, fork it there or clone it using Git:

git clone https://bitbucket.org/hascode/parboiled-grammar-tutorial.git

Conclusion

Parboiled offers a variety of possibilities here and I’m afraid that I haven’t grasped every detail yet. But if you’re interested in some more complex or real-life examples, I highly recommend to take a look at the Markdown Processor Example (Java API) or the JSON Parser Example (Scala API) or enjoy the detailed documentation on the project’s website here.

Thanks @resah for pointing me to this framework! I’m sure that it will make a good match for my toolbox of frameworks and libraries for future usage/projects.

Resources

Tags: , , , , , , ,

Search
Categories