Wednesday, June 05, 2013

Diesel - DSL experiments on the JVM (Part 1)

I've been meaning to write about my series of little experiments building a mini-DSL on the JVM.

I've worked with and written about expression evaluators before. However, there've been many instances where I've felt the need to quickly build a part pseudo-language and part configuration script.

Also, I did not have the time, resources nor the justification to build a full fledged grammar/parser. There were times where I did morph an existing ANTLR grammar for something else, but in the end I realized that a simple hand built tokenizer and AST would've done the trick. (Note to self: try Parboiled)

So, I was curious to see what options I had to build an "almost language", quickly. "Quickly" being the operative word. "Dirty" being the unsaid word.

I'm not going to go into the details of what a DSL is, or spend time debating over internal or external DSLs etc. Enough material is available on the internet and some books too.

If you want to learn more about Java API based DSLs - more commonly known as a Fluent DSL, there are several good places to start learning by example - Jooq, Google Guava ComparisonChain etc. I've built Fluent DSLs several times and it is a cleaner and better way to implement the Builder design pattern - like Google Protocol Buffers' Builder.

This time though, I wanted to evaluate options to build something that could read/parse/load files that look like structured, readable English. Some common cases where you'd need this:

  • Configuration scripts are prime candidates for this. In the early-mid 2000's, XML would've been the way to go; with XPath and XSDs/DTDs; built in support in the JDK and support for hierarchical structures
  • Glue to stitch together different modules in a program - something that usually involves some configuration code and basic expressions
  • Actual mini-languages that allow business analysts or IT/DevOps people to plug in some logic without writing complex Java code. Also without having to get developers and a full build cycle involved
So, let's cut to the chase and see what I came up with.

For my tests, I wanted to accomplish something very simple. I wanted a way to describe stock buying or selling instruction to my stock broker. It's a completely contrived example of course but it seemed valid for this test. I wanted a way to specify which stock to buy or sell, at what price, for how long the instruction is valid and some other little things.

Since I brought up XML, I'll talk about the simplest approach first - XML's slightly less ugly cousin JSON:

Using JSON and calling it a DSL is not only dumb but also cheating. But there are obviously a lot of places where this would suffice. Unlike XML, this is less verbose, but it still needs the user to know where to put double quotes, braces, square brackets and all this without a schema to validate the file.

It does have its advantages. All I had to do was create a JavaBean with all the possible combinations my "stock specification" could have and then use Google Gson to do the serialization/deserialization to/from JSON.

This is what the JavaBeans look like:

Assuming that this was enough, all I had to do was read the JSON into the Stocks bean and related inner classes and start using it.

The keen reader will notice that the Order class has some properties - limit, stopLimit, market which are really mutually exclusive. JSON does not prevent me from providing values to all 3 which would be wrong. I could've spent some more time fleshing those properties into an enum or a complex string but I'll leave that as an exercise for later (or the reader).

The full source along with scripts can be found on my GitHub Diesel repo for your reference.

So, I decided that JSON wouldn't cut it. A while ago I had played with YAML briefly, which is JSON's distant cousin. Actually, the latest YAML spec makes it JSON's parent (how convenient).

YAML is like JSON but without the frivolous double quotes and braces. Compare this YAML file with the previous JSON file, it speaks for itself:

It is without doubt, cleaner and more usable than JSON. You use SnakeYaml to do automatic ser/deser into the Stocks JavaBean like Gson.

Also, like Gson, if you don't have a bean or your configuration makes it difficult to map directly to a bean, you can just read it free form as a map of maps. This would be a poor man's AST. Gson's free form structure is actually better that way, in that it almost looks like XML Nodes.

If YAML is good enough for you, you can stop reading right here. In fact, YAML is also my favorite for simple configuration files that involve lists and hierarchies. This is miles ahead of and better than the flat format used in Java Properties.

But, defining YAML still has the same issues that JSON had with regards to semantic validations like limit, market etc. However this is really an issue with the way I've created the beans. Think of the YAML file as a free form AST. You'd have to write your semantic and syntactic validations in your Java code by walking this AST. I prefer to do this in Java because it's easier to have all the validations and exception messages in one file than split it across multiple ANTLR and Java files.

In part 2, we will explore other framework and language choices. Until next time, take care!


Ashwin Jayaprakash said...

Sweet! I was right about ANTLR3's complex tree building syntax.

Looks like it has been addressed in V4 - just walk the tree in Java.

From the ANTLR V4 page:

"Q: What are the main design decisions in ANTLR4?

A: Ease-of-use over performance. I will worry about performance later. Simplicity over complexity. For example, I have taken out explicit/manual AST construction facilities and the tree grammar facilities. For 20 years I've been trying to get people to go that direction, but I've since decided that it was a mistake. It's much better to give people a parser generator that can automatically build trees and then let them use pure code to do whatever tree walking they want. People are extremely familiar and comfortable with visitors, for example."