OCaml 4.01 with its new feature to disambiguate constructors allows to do a nice trick: a simple and generic syntax extension that allows to define your own syntax without having to write complicated parsetree transformers. We propose an implementation in the form of a ppx rewriter.

it does only a simple transformation: replace strings prefixed by an operator starting with ! by a series of constructor applications

for instance:

[code language=”bash” gutter=”false”]

!! "hello 3"

[/code]

is rewriten to

[code language=”bash” gutter=”false”]

!! (Start (H (E (L (L (O (Space (N3 (End))))))))

[/code]

How is that generic ? We will present you a few examples.

### Base 3 Numbers

For instance, if you want to declare base 3 arbitrary big numbers, let’s define a syntax for it. We first start by declaring some types.

[code language=”fsharp”]

type start = Start of p

and p =

| N0 of stop

| N1 of q

| N2 of q

and q =

| N0 of q

| N1 of q

| N2 of q

| Underscore of q

| End

and stop = End

[/code]

This type will only allow to write strings matching the regexp 0 | (1|2)(0|1|2|_)*. Notice that some constructors appear in multiple types like N0. This is not a problem since constructor desambiguation will choose for us the right one at the right place. Let’s now define a few functions to use it:

[code language=”fsharp”]

open Num

let rec convert_p = function

| N0 (End) -> Int 0

| N1 t -> convert_q (Int 1) t

| N2 t -> convert_q (Int 2) t

and convert_q acc = function

| N0 t -> convert_q (acc */ Int 3) t

| N1 t -> convert_q (Int 1 +/ acc */ Int 3) t

| N2 t -> convert_q (Int 2 +/ acc */ Int 3) t

| Underscore t -> convert_q acc t

| End -> acc

let convert (Start p) = convert_p p

[/code]

[code language=”fsharp” gutter=”false”]

# val convert : start -> Num.num = <fun>

[/code]

And we can now try it:

[code language=”fsharp”]

let n1 = convert (Start (N0 End))

# val n1 : Num.num = <num 0>

let n2 = convert (Start (N1 (Underscore (N0 End))))

# val n2 : Num.num = <num 3>

let n3 = convert (Start (N1 (N2 (N0 End))))

# val n3 : Num.num = <num 15>

[/code]

And the generic syntax extension allows us to write:

[code language=”fsharp”]

let ( !! ) = convert

let n4 = !! "120_121_000"

val n4 : Num.num = <num 11367>

[/code]

### Specialised Format Strings

We can implement specialised format strings for a particular usage. Here, for concision we will restrict to a very small subset of the classical format: the characters %, i, c and space

Let’s define the constructors.

[code language=”fsharp”]

type ‘a start = Start of ‘a a

and ‘a a =

| Percent : ‘a f -> ‘a a

| I : ‘a a -> ‘a a

| C : ‘a a -> ‘a a

| Space : ‘a a -> ‘a a

| End : unit a

and ‘a f =

| I : ‘a a -> (int -> ‘a) f

| C : ‘a a -> (char -> ‘a) f

| Percent : ‘a a -> ‘a f

[/code]

Let’s look at the inferred type for some examples:

[code language=”fsharp”]

let (!*) x = x

let v = !* "%i %c";;

# val v : (int -> char -> unit) start = Start (Percent (I (Space (Percent (C End)))))

let v = !* "ici";;

# val v : unit start = Start (I (C (I End)))

[/code]

This is effectively the types we would like for a format string looking like that. To use it we can define a simple printer:

[code language=”fsharp”]

let rec print (Start cons) =

main cons

and main : type t. t a -> t = function

| I r ->

print_string "i";

main r

| C r ->

print_string "c";

main r

| Space r ->

print_string " ";

main r

| End -> ()

| Percent f ->

format f

and format : type t. t f -> t = function

| I r ->

fun i ->

print_int i;

main r

| C r ->

fun c ->

print_char c;

main r

| Percent r ->

print_string "%";

main r

let (!!) cons = print cons

[/code]

And voila!

[code language=”fsharp”]

let s = !! "%i %c" 1 ‘c’;;

# 1 c

[/code]

## How generic is it really ?

It may not look like it, but we can do almost any syntax we might want this way. For instance we can do any regular language. To explain how we transform a regular language to a type definition, we will use as an example the language a(a|)b

[code language=”fsharp”]

type start = Start of a

and a =

| A of a’

and a’ =

| A of b

| B of stop

and b = B of stop

and stop = End

[/code]

We can try a few things on it:

[code language=”fsharp”]

let v = Start (A (A (B End)))

# val v : start = Start (A (A (B End)))

let v = Start (A (B End))

# val v : start = Start (A (B End))

let v = Start (B End)

# Characters 15-16:

# let v = Start (B End);;

# ^

# Error: The variant type a has no constructor B

let v = Start (A (A (A (B End))))

# Characters 21-22:

# let v = Start (A (A (A (B End))));;

# ^

# Error: The variant type b has no constructor A

[/code]

Assumes the language is given as an automaton that:

- has 4 states, a, a’, b and stop
- with initial state a
- with final state stop
- with transitions: a — A –> a’ a’ — A –> b a’ — B –> stop b — B –> stop

let’s write {c} for the constructor corresponding to the character c and

[c][/c]

for the type corresponding to a state of the automaton.

- For each state q we have a type declaration [q]
- For each letter a of the alphabet we have a constructor {a}
- For each transition p — l –> q we have a constructor {l} with parameter [q] in type [p]:

[code language=”fsharp” gutter=”false”]

type [p] = {l} of [q]

[/code]

- The End constructor without any parameter must be present in any final state
- The initial state e is declared by

[code language=”fsharp” gutter=”false”]

type start = Start of [e]

[/code]

## Yet more generic

In fact we can encode deterministic context free languages (DCFL) also. To do that we encode pushdown automatons. Here we will only give a small example: the language of well parenthesized words

[code language=”fsharp”]

type empty

type ‘a r = Dummy

type _ q =

| End : empty q

| Rparen : ‘a q -> ‘a r q

| Lparen : ‘a r q -> ‘a q

type start = Start of empty q

let !! x = x

let m = ! ""

let m = ! "()"

let m = ! "((())())()"

[/code]

To encode the stack, we use the type parameters: Lparen pushes an r to the stack, Rparen consumes it and End checks that the stack is effectively empty.

There are a few more tricks needed to encode tests on the top value in the stack, and a conversion of a grammar to Greibach normal form to allow this encoding.

## We can go even further

### a^n b^n c^n

In fact we don’t need to restrict to DCFL, we can for instance encode the a^n.b^n.c^n language which is not context free:

[code language=”fsharp”]

type zero

type ‘a s = Succ

type (_,_) p =

| End : (zero,zero) p

| A : (‘b s, ‘c s) p -> (‘b, ‘c) p

| B : (‘b, ‘c s) q -> (‘b s, ‘c s) p

and (_,_) q =

| B : (‘b, ‘c) q -> (‘b s, ‘c) q

| C : ‘c r -> (zero,’c s) q

and _ r =

| End : zero r

| C : ‘c r -> ‘c s r

type start = Start of (zero,zero) p

let v = Start (A (B (C End)))

let v = Start (A (A (B (B (C (C End))))))

[/code]

### Non recursive languages

We can also encode solutions of Post Correspondance Problems (PCP), which are not recursive languages:

Suppose we have two alphabets A = { X, Y, Z } et O = { a, b } and two morphisms m1 and m2 from A to O* defined as

- m1(X) = a, m1(Y) = ab, m1(Z) = bba
- m2(X) = baa, m2(Y) = aa, m2(Z) = bb

Solutions of this instance of PCP are words such that their images by m1 and m2 are equal. for instance ZYZX is a solution: both images are bbaabbbaa. The language of solution can be represented by this type declaration:

[code language=”fsharp”]

type empty

type ‘a a = Dummy

type ‘a b = Dummy

type (_,_) z =

| X : (‘t1, ‘t2) s -> (‘t1 a, ‘t2 b a a) z

| Y : (‘t1, ‘t2) s -> (‘t1 a b, ‘t2 a a) z

| Z : (‘t1, ‘t2) s -> (‘t1 b b a, ‘t2 b b) z

and (_,_) s =

| End : (empty,empty) s

| X : (‘t1, ‘t2) s -> (‘t1 a, ‘t2 b a a) s

| Y : (‘t1, ‘t2) s -> (‘t1 a b, ‘t2 a a) s

| Z : (‘t1, ‘t2) s -> (‘t1 b b a, ‘t2 b b) s

type start = Start : (‘a,’a) z -> start

let v = X (Z (Y (Z End)))

let r = Start (X (Z (Y (Z End))))

[/code]

## Open question

Can every context free language (not deterministic) be represented like that ? Notice that the classical example of the palindrome can be represented (proof let to the reader).

## Conclusion

So we have a nice extension available that allows you to define a new syntax by merely declaring a type. The code is available on github. We are waiting for the nice syntax you will invent !

PS: Their may remain a small problem… If inadvertently you mistype something you may find some quite complicated type errors attacking you like a pyranha instead of a syntax error.