Aug
8
2012

We have recently worked on modifying the OCaml system to be able to profile OCaml code on Linux amd64 systems, using the processor performance counters now supported by stable kernels. This page presents this work, funded by Jane Street.

The patch is provided for OCaml version 4.00.0. If you need it for 3.12.1, some more work is required, as we would need to backport some improvements that were already in the 4.00.0 code generator.

An example: profiling ocamlopt.opt

Here is an example of a session of profiling done using both Linux performance tools and a modified OCaml 4.00.0 system (the patch is available at the end of this article).

Linux performance tools are available as part of the Linux kernel (in the linux-tools package on Debian/Ubuntu). Most of the tools are invoked through the perf command, à la git. For example, we are going to check where the time is spent when calling the ocamlopt.opt command:

perf record -g ./ocamlopt.opt -c -I utils -I parsing -I typing typing/*.ml

This command generates a file perf.data in the current directory, containing all the events that were received during the execution of the command. These events contain the values of the performance counters in the amd64 processor, and the call-chain (backtrace) at the event.

We can inspect this file using the command:

perf report -g

The command displays:

Events: 3K cycles
+   9.81%  ocamlopt.opt  ocamlopt.opt           [.] compare_val
+   8.85%  ocamlopt.opt  ocamlopt.opt           [.] mark_slice
+   7.75%  ocamlopt.opt  ocamlopt.opt           [.] caml_page_table_lookup
+   7.40%            as  as                     [.] 0x5812
+   5.60%  ocamlopt.opt  [kernel.kallsyms]      [k] 0xffffffff8103d0ca
+   3.91%  ocamlopt.opt  ocamlopt.opt           [.] sweep_slice
+   3.18%  ocamlopt.opt  ocamlopt.opt           [.] caml_oldify_one
+   3.14%  ocamlopt.opt  ocamlopt.opt           [.] caml_fl_allocate
+   2.84%            as  [kernel.kallsyms]      [k] 0xffffffff81317467
+   1.99%  ocamlopt.opt  ocamlopt.opt           [.] caml_c_call
+   1.99%  ocamlopt.opt  ocamlopt.opt           [.] caml_compare
+   1.75%  ocamlopt.opt  ocamlopt.opt           [.] camlSet__mem_1148
+   1.62%  ocamlopt.opt  ocamlopt.opt           [.] caml_oldify_mopup
+   1.58%  ocamlopt.opt  ocamlopt.opt           [.] camlSet__bal_1053
+   1.46%  ocamlopt.opt  ocamlopt.opt           [.] camlSet__add_1073
+   1.37%  ocamlopt.opt  libc-2.15.so           [.] 0x15cbd0
+   1.37%  ocamlopt.opt  ocamlopt.opt           [.] camlInterf__compare_1009
+   1.33%  ocamlopt.opt  ocamlopt.opt           [.] caml_apply2
+   1.09%  ocamlopt.opt  ocamlopt.opt           [.] caml_modify
+   1.07%            sh  [kernel.kallsyms]      [k] 0xffffffffa07e16fd
+   1.07%            as  libc-2.15.so           [.] 0x97a61
+   0.94%  ocamlopt.opt  ocamlopt.opt           [.] caml_alloc_shr

Using the arrow keys and the Enter key to expand an item, we can get a better idea of where most of the time is spent:

Events: 3K cycles
-   9.81%  ocamlopt.opt  ocamlopt.opt           [.] compare_val                ◆
   - compare_val                                                               ▒
      - 71.68% camlSet__mem_1148                                               ▒
         + 98.01% camlInterf__add_interf_1121                                  ▒
         + 1.99% camlInterf__add_pref_1158                                     ▒
      - 21.48% camlSet__add_1073                                               ▒
         - camlSet__add_1073                                                   ▒
            + 93.41% camlSet__add_1073                                         ▒
            + 6.59% camlInterf__add_interf_1121                                ▒
      + 1.44% camlReloadgen__fun_1386                                          ▒
      + 1.43% camlClosure__close_approx_var_1373                               ▒
      + 1.43% camlSwitch__opt_count_1239                                       ▒
      + 1.34% camlTbl__add_1050                                                ▒
      + 1.20% camlEnv__find_1408                                               ▒
+   8.85%  ocamlopt.opt  ocamlopt.opt           [.] mark_slice                 ▒
-   7.75%  ocamlopt.opt  ocamlopt.opt           [.] caml_page_table_lookup     ▒
   - caml_page_table_lookup                                                    ▒
      + 50.03% camlBtype__set_commu_1704                                       ▒
      + 49.97% camlCtype__expand_head_1923                                     ▒
+   7.40%            as  as                     [.] 0x5812                     ▒
+   5.60%  ocamlopt.opt  [kernel.kallsyms]      [k] 0xffffffff8103d0ca         ▒
+   3.91%  ocamlopt.opt  ocamlopt.opt           [.] sweep_slice                ▒
Press '?' for help on key bindings

We notice that a lot of time is spent in the compare_val primitive, called from the Pervasives.compare function, itself called from the Set module in asmcomp/interp.ml. We can locate the corresponding code at the beginning of the file:

module IntPairSet =
  Set.Make(struct type t = int * int let compare = compare end)

Let's replace the polymorphic function compare by a monomorphic function, optimized for pairs of small ints:

module IntPairSet =
  Set.Make(struct type t = int * int
                  let compare (a1,b1) (a2,b2) =
                    if a1 = a2 then b1 - b2 else a1 - a2
           end)

We can now compare the speed of the two versions:

peerocaml:~/ocaml-4.00.0%  time ./ocamlopt.old -c -I utils -I parsing -I typing typing/*.ml
./ocamlopt.old  7.38s user 0.56s system 97% cpu 8.106 total
peerocaml:~/ocaml-4.00.0%  time ./ocamlopt.new -c -I utils -I parsing -I typing typing/*.ml
./ocamlopt.new  6.16s user 0.50s system 97% cpu 6.827 total

And we get an interesting speedup ! Now, we can iterate the process, check where most of the time is spent in the new version, optimize the critical path and so on.

Installation of the modified OCaml system

A modified OCaml system is required because, for each event, the Linux kernel must attach a backtrace of the stack (call-chain). However, the kernel is not able to use standard DWARF debugging information, and current OCaml stack frames are too complex to be unwinded without this DWARF information. Instead, we had to modify OCaml code generator to follow the same conventions as C for frame pointers, i.e. using saving the frame pointer on function entry and restoring it on function exit. This required to decrease the number of available registers from 13 to 12, using %rbp as the frame pointer, leading to an average 3-5% slowdown in execution time.

The patch for OCaml 4.00.0 is available here:

omit-frame-pointer-4.00.0.patch (20 kB, v2, updated 2012/08/13)

To use it, you can use the following recipe, that will compile and install the patched version in ~/ocaml-4.00-with-fp.

~%  wget http://caml.inria.fr/pub/distrib/ocaml-4.00.0/ocaml-4.00.0.tar.gz
~%  tar zxf ~/ocaml-4.00.0.tar.gz
~%  cd ocaml-4.00.0
~/ocaml-4.00.0%  wget www.ocamlpro.com/files/omit-frame-pointer-4.00.0.patch
~/ocaml-4.00.0%  patch -p1 < omit-frame-pointer-4.00.0.patch
~/ocaml-4.00.0%  ./configure --prefix ~/ocaml-4.00-with-fp
~/ocaml-4.00.0%  make world opt opt.opt install
~/ocaml-4.00.0%  cd ~
~% export PATH=$HOME/ocaml-4.00.0-with-fp/bin:$PATH

It is important to know that the patch modifies OCaml calling convention, meaning that ALL THE MODULES AND LIBRARIES in your application must be recompiled with this version.

On our benchmarks, the slowdown induced by the patch is between 3 and 5%. You can still compile your application without frame pointers, for production, using a new option -fomit-frame-pointer that was added by the patch.

This patch has been submitted for inclusion in OCaml. You can follow its status and contribute to the discussion here:

http://caml.inria.fr/mantis/view.php?id=5721