ASM.OCaml

Authors: chambart
Date: 2016-04-01
Category: OCaml
Tags: OCaml



As you may know, there is a subset of Javascript that compiles efficiently to assembly used as backend of various compilers including a C compiler like emscripten. We'd like to present you in the same spirit how never to allocate in OCaml.

Before starting to write anything, we must know how to find if a code is allocating. The best way currently is to look at the Cmm intermediate representation. We can see it by calling ocamlopt with the -dcmm option:

ocamlopt -c -dcmm test.ml

let f x = (x,x)

Some excerpt from the output:

(function camlTest__f_4 (x_6/1204: val) (alloc 2048 x_6/1204 x_6/1204))

To improve readability, in this post we will clean a bit the variable names:

(function f (x: val) (alloc 2048 x x))

We see that the function f (named camlTest__f_4) is calling the alloc primitive, which obviously is an allocation. Here, this creates a size 2 block with tag 0 (2048 = 2 << 10 + 0) and containing two times the value x_6/1204 which was x is the source. So we can detect if some code is allocating by doing ocamlopt -c -dcmm test.ml 2&>1 | grep alloc (obviously any function or variable named alloc will also appear).

It is possible to write some code that don't allocate (in the heap) at all, but what are the limitations ? For instance the omnipresent fibonacci function does not allocate:

let rec fib = function
  | 0 -> 0
  | 1 -> 1
  | n -> fib (n-1) + fib (n-2)
(function fib (n: val)
  (if (!= n 1)
  (if (!= n 3)
    (let Paddint_arg (app "fib" (+ n -4) val)
    (+ (+ (app "fib" (+ n -2) val) Paddint_arg) -1))
    3)
  1))

But quite a lot of common patterns do:

  • Building structured values will allocate (tuple, records, sum types containing an element, ...)
  • Using floats, int64, ... will allocate
  • Declaring non-toplevel functions will allocate

Considering that, it can appear that it is almost impossible to write any non-trivial code without using those. But that's not completely true.

There are some exceptions to those rules, where some normally allocating constructions can be optimised away. We will explain how to exploit them to be able to write some more code.

Local references

Maybe the most important one is the case of local references.

let fact n =
  let result = ref 1 in
  for i = 1 to n do
    result := n * !result
  done;
  !result

To improve readability, this has been cleaned and demangled

(function fact (n: val)
  (let (result 3)
  (seq
    (let (i 3 bound n)
    (catch
      (if (> i bound) (exit 3)
      (loop
        (assign result (+ (* (+ n -1) (>>s result 1)) 1))
        (assign i (+ i 2))
      (if (== i bound) (exit 3) [])
    with(3) []))))
  result)))

You can notice that allocation of the reference disappeared. The modifications were replaced by assignments (the assign operator) to the result variable. This transformation can happen when a reference is never used anywhere else than as an argument of the ! and := operator and does not appear in the closure of any local function like:

let counter () =
  let count = ref 0 in
  let next () = count := !count + 1; !count in
  next

This won't happen in this case since count is in the closure of next.

Unboxing

The float, int32, int64 and nativeint types do not fit in the generic representation of values that can be stored in the OCaml heap, so they are boxed. This means that they are allocated and there is an annotation to tell the garbage collector to skip their content. So using them in general will allocate. But an important optimization is that local uses (some cases that obviously won't go in the heap) are 'unboxed', i.e. not allocated.

If/match couple

Some 4.03 change also improve some cases of branching returning tuples

let positive_difference x y =
  let (min, max) =
    if x < y then
      (x, y)
    else
      (y, x)
  in
  max - min
(function positive_difference (x: val y: val)
  (catch
    (if (< x y) (exit 7 y x)
    (exit 7 x y))
  with(7 max min) (+ (- max min) 1)))

Control flow

You can do almost any control flow like that, but this is quite unpractical and is still limited in many ways.

If you don't want to write everything as for and while loops, you can write functions for your control flow, but to prevent allocation you will have to refrain from doing a few things. For instance, you should not pass record or tupple as argument to functions of course, you should pass each field separately as a different argument.

But what happens when you want to return multiple values ? There is some ongoing project to try to optimise the allocations of some of those cases away, but currently you can't. Really ? NO !

Returning multiple values

If you bend a bit your mind, you may see that returning from a function is almost the same thing as calling one... Or you can make it that way. So let's transform our code in 'Continuation Passing Style'

For instance, let's write a function that finds the minimum and the maximum of a list. That could be written like that:

let rec fold_left f init l =
  match l with
  | [] -> init
  | h :: t ->
    let acc = f init h in
    fold_left f acc t

let keep_min_max (min, max) v =
  let min = if v < min then v else min in
  let max = if v > max then v else max in
  min, max

let find_min_max l =
  match l with
  | [] -> invalid_arg "find_min_max"
  | h :: t ->
    fold_left keep_min_max (h, h) t

Continuation Passing Style

Transforming it to continuation passing style (CPS) replace every function return by a tail-call to a function representing 'what happens after'. This function is usually called a continuation and a convention is to use the variable name 'k' for it.

Let's start simply by turning only the keep_min_max function into continuation passing style.

let keep_min_max (min, max) v k =
  let min = if v < min then v else min in
  let max = if v > max then v else max in
  k (min, max)

val keep_min_max : 'a * 'a -> 'a -> ('a * 'a -> 'b) -> 'b

That's all here. But of course we need to modify a bit the function calling it.

let rec fold_left f init l =
  match l with
  | [] -> init
  | h :: t ->
    let k acc =
      fold_left f acc t
    in
    f init h k

val fold_left : ('a -> 'b -> ('a -> 'a) -> 'a) -> 'a -> 'b list -> 'a
val find_min_max : 'a list -> 'a * 'a

Here instead of calling f then recursively calling fold_left, we prepare what we will do after calling f (that is calling fold_left) and then we call f with that continuation. find_min_max is unchanged and still has the same type.

But we can continue turning things in CPS, and a full conversion would result in:

let rec fold_left_k f init l k =
  match l with
  | [] -> k init
  | h :: t ->
    let k acc =
      fold_left_k f acc t k
    in
    f init h k
val fold_left_k : ('a -> 'b -> ('a -> 'c) -> 'c) -> 'a -> 'b list -> ('a -> 'c) -> 'c

let keep_min_max_k (min, max) v k =
  let min = if v < min then v else min in
  let max = if v > max then v else max in
  k (min, max)
val keep_min_max_k : 'a * 'a -> 'a -> ('a * 'a -> 'b) -> 'b

let find_min_max_k l k =
  match l with
  | [] -> invalid_arg "find_min_max"
  | h :: t ->
    fold_left_k keep_min_max (h, h) t k
val find_min_max_k : 'a list -> ('a * 'a -> 'b) -> 'b

let find_min_max l =
  find_min_max_k l (fun x -> x)
val find_min_max : 'a list -> 'a * 'a

Where rectypes matter for performance reasons

That's nice, we only have tail calls now, but we are not done removing allocation yet of course. We now need to get rid of the allocation of the closure in fold_left_k and of the couples in keep_min_max_k. For that, we need to pass everything that should be allocated as argument:

let rec fold_left_k2 f init1 init2 l k =
  match l with
  | [] -> k init1 init2
  | h :: t ->
    f init1 init2 h t fold_left_k2 k

val fold_left_k2 :
  ('b -> 'c -> 'd -> 'd list -> 'a -> ('b -> 'c -> 'e) -> 'e) ->
  'b -> 'c -> 'd list -> ('b -> 'c -> 'e) -> 'e as 'a

let rec keep_min_max_k2 = fun min max v k_arg k k2 ->
  let min = if v < min then v else min in
  let max = if v > max then v else max in
  k keep_min_max_k2 min max k_arg k2

val keep_min_max_k2 :
  'b -> 'b -> 'b -> 'c -> ('a -> 'b -> 'b -> 'c -> 'd -> 'e) -> 'd -> 'e as 'a

let find_min_max_k2 l k =
  match l with
  | [] -> invalid_arg "find_min_max"
  | h :: t ->
    fold_left_k2 keep_min_max_k2 h h t k

val find_min_max_k2 : 'a list -> ('a -> 'a -> 'b) -> 'b

For some reason, we now need to activate 'rectypes' to allow functions to have a recursive type (the 'as 'a') but we managed to completely get rid of allocations.

(function fold_left_k2 (f: val init1: val init2: val l: val k: val)
  (if (!= l 1)
  (app "caml_apply6" init1 init2 (load val l) l "fold_left_k2" k f val))
  (app "caml_apply2" init1 init2 k val)))

(function keep_min_max_k2 (min: val max: val v: val k: val k: val k2: val)
  (let
  (min
  (if (!= (extcall "caml_lessthan" v min val) 1)
  v min)
  max
  (if (!= (extcall "caml_greaterthan" v max val) 1)
  v max))
  (app "caml_apply5" "keep_min_max_k2" min max k k2 k val)))

(function find_min_max_k2 (l: val k: val)
  (if (!= l 1)
  (let h (load val l)
  (app "fold_left_k2" "keep_min_max_k2" h h t k val))
  (raise "exception")))

So we can turn return points into call points and get rid of a lot of potential allocations like that. But of course there is no way to handle functions passing or returning sum types like that ! Well, I'm not so sure.

Sum types

Let's try with the option type for instance:

type 'a option =
  | None
  | Some of 'a

let add_an_option_value opt v =
  match opt with
  | None -> v
  | Some n -> n + v

let n1 = add_an_option_value (Some 3) 4
let n2 = add_an_option_value None 4

The case of the sum type tells us if there is some more values that we can get and their type. But there is another way to associate some type information with an actual value: GADTs

type ('a, 'b) option_case =
  | None' : ('a, unit) option_case
  | Some' : ('a, 'a) option_case

let add_an_option_value (type t) (opt: (int, t) option_case) (n:t) v =
  match opt with
  | None' -> v
  | Some' -> n + v

let n1 = add_an_option_value Some' 3 4
let n2 = add_an_option_value None' () 4

And voilà, no allocation anymore !

Combining that with the CPS transformation can get you quite far without allocating !

Manipulating Memory

Now that we can manage almost any control flow without allocating, we need also to manipulate some values. That's the point where we simply suggest to use the same approach as ASM.js: allocate a single large bigarray (this is some kind of malloc), consider integers as pointers and you can do anything. We won't go into too much details here as this would require another post for that topic.

For some low level packed bitfield manipulation you can have a look at some more tricks

Conclusion

So if you want to write non allocating code in OCaml, turn everything in CPS, add additional arguments everywhere, turn your sum types in unboxed GADTs, manipulate a single large bigarrays. And enjoy !

Comments

Gaetan Dubreil (3 April 2016 at 11 h 16 min):

Thank you for this attractive and informative post. Just to be sure, is it not ‘t’ rather than ‘l’ that must be past to the fold_left function? You said “we only have tail calls now” but I don’t see any none tail calls in the first place, am I wrong?

Pierre Chambart (4 April 2016 at 14 h 48 min):

There where effectively some typos. Thanks for noticing.

There is one non-tail call in fold_left: the call to f. But effectively the recursion is tail.

kantien (25 May 2016 at 13 h 57 min):

Interesting article, but i have one question. Can we say, from the proof theory point of view, that turning the code in CPS style not to allocate is just an application of the Gentzen’s cut-elimination theorem ? I explain in more details this interpretation : if we have a proof P1 of the proposition A and a proof P2 of the proposition A ⇒ B, we can produce a proof P3 of proposition B by applying the cut rule or modus ponens, but the theorem says that we can eliminate the use of cut rule and produce a direct proof P4 of the proposition B. But modus ponens (or cut rule) is just the rule for typing function application : if f has type ‘a -> ‘b and x has type ‘a then f x has type ‘b. And so the cut-elimination theorem says that we can produce an object of type ‘b without allocate an object of type ‘a (this is not necessary to produce the P1 proof, or more exactly this is not necessary to put the P1’s conclusion in the environment in order to use it as a premise of the P2 proof ). Am I right ?

jdxu (4 January 2021 at 11 h 36 min):

Very useful article. BTW, is there any document/tutorial/article about cmm syntax?



About OCamlPro:

OCamlPro is a R&D lab founded in 2011, with the mission to help industrial users benefit from experts with a state-of-the-art knowledge of programming languages theory and practice.

  • We provide audit, support, custom developer tools and training for both the most modern languages, such as Rust, Wasm and OCaml, and for legacy languages, such as COBOL or even home-made domain-specific languages;
  • We design, create and implement software with great added-value for our clients. High complexity is not a problem for our PhD-level experts. For example, we developed the prototype of the Tezos proof-of-stake blockchain.
  • We have a long history of creating open-source projects, such as the Opam package manager, the LearnOCaml web platform, and contributing to other ones, such as the Flambda optimizing compiler, or the GnuCOBOL compiler.
  • We are also experts of Formal Methods, developing tools such as our SMT Solver Alt-Ergo (check our Alt-Ergo Users' Club) and using them to prove safety or security properties of programs.

Please reach out, we'll be delighted to discuss your challenges: contact@ocamlpro.com or book a quick discussion.