A micro-benchmark

joskoot · September 29, 2023, 12:48am

Sorry, discovering Racket was un fact discovering mzscheme.

soegaard · September 29, 2023, 1:13pm

For those interested in reading about set! and why it often leads to boxed variables,
I can't recommend Three Implementation Models for Scheme by Kent Dybvig enough.

sschwarzer · September 30, 2023, 10:56am

A few random thoughts ...

I think you should output the result from a benchmark. Sometimes a compiler might optimize out the whole calculation if the result isn't used. Looking at the runtimes, this doesn't seem to happen here, though.

Although it's often recommended to compile Nim benchmarks with -d:danger, I find it's more representative of real code to use -d:release to include some runtime checks, e.g. bounds checks. In sum_vec_f64, you could use the special result variable, but I think it's unlikely that it'll change the runtime.

Concerning the performance of a given language implementation, I find two aspects interesting:

How fast is idiomatic code, without any optimizations? If unoptimized code is already fast, I have to do less manual optimization work.
How fast can I make the code if I have to and what effort is needed for that?

You need the indirection of polymorphism if you know the type of a value only at runtime. Inheritance may be a reason, but there are also others.

I can't.

In a benchmark (Sudoku solver solving the same puzzle with all the benchmarked programs) I did a few months ago, Racket (compiled with raco exe) was about 2.5 times as fast as the similarly optimized Chicken version. On the other hand, Rust was 25 times as fast as Racket.

Technically, the Racket Reference is book-like, but it's still a reference. I often find it hard to get the gist of some feature or concept, unless I'm looking for a very specific functionality.

hendrikboom3 · September 30, 2023, 2:33pm

Of course, the Rust version likely will not test or use one of Racket's obligatory features: the garbage collector.

-- hendrik

sschwarzer · September 30, 2023, 2:52pm

Correct, although I assume that the largest part of the speedup is because of static typing and therefore more aggressive optimizations.

It might be interesting to try a Go implementation, which would combine static typing and compilation to machine code with a garbage collector. I expect that this would still be much faster than Racket, although probably slower than Rust.

gus-massa · September 30, 2023, 6:47pm

It's very interesting the use of (fl+ sum) to make the code faster. I think the compiler could be improved to make it unnecessary but I don't know how to do that and how hard is it. So I made a feature request in [Feature Request] Sometimes code with unboxed flonum get faster wrapped with fl+ · Issue #4774 · racket/racket · GitHub

lewisl · October 2, 2023, 7:52pm

What’s the alternative?

The for/fold approach avoids it. In a loop, what can be done? This attitude against mutation makes much code very difficult to write. Memory location contents either need to be changed, or the updated value is stored in a new location and the old must be garbage collected—essentially what happens with non-tail optimized recursion.

Generally, performance focused matrix arithmetic code with large arrays must rely on mutation in place. This is true for every ML library written in any language. I’ll grant that 10000 elements in a vector is not even remotely large by present day standards. In fact, a tail call optimized recursive version of this summation can easily be written for lists and it might outperform the vector implementation. But, get to 10,000,000 elements and we are at the teeny end of large and we must update arrays in place. What works for a 100 element list always going forward (with O(N) or O(1) performance won’t work for 10,000,000 elements any more.

soegaard · October 2, 2023, 8:18pm

You need to distinguish mutation of variables and of vectors.

If I undstand you correctly, the alternative you are looking for is an flvector.
Here all elements of the vector are guaranteed to be doubles.

lewisl · October 3, 2023, 12:43am

the variable sum is being updated. not clear there is any way to sum a vector without visiting every element.

Lewis

soegaard · October 3, 2023, 9:58am

the variable sum is being updated. not clear there is any way to sum a vector without visiting every element.

Sorry, I think I misunderstood what you were saying.

This attitude against mutation makes much code very difficult to write.

There is a tradeoff. Boxing mutable variable allows us to have call/cc and friends.
Maybe in theory it is possible to make a special form of module, that disallows
capturing a continuation [and thus avoid boxing mutable variables]?

In the mean time, a rebinding loop like for/fold or a a named let might be faster than
mutating the same variable in a loop.

However, if I understand correctly, your concern was very large vectors of floating point numbers. In flomat I simply used bindings for BLAS and LINPACK to do the grunt work.

github.com

soegaard/sci/blob/master/flomat/flomat.rkt

#lang racket
(provide (all-defined-out))

;;; TODO
;;;        * Implement block. Handle mix between numbers and matrices.
;;;        * Improve matrix-expt! (avoid allocation)
;;;        * schur decomposition
;;;        * sqrtm See Higham paper.
;;;        * logm

;;; NOTES
;;;        * Contracts will be added before release
;;;        * See tests at bottom for examples.

;;; FEEDBACK
;;;        * Where is CBLAS and LAPACK on your platform
;;;          (Windows and Linux)
;;;        * What are the libraries named?
;;;        * Do all tests evaluate to #t on your platform?

This file has been truncated. show original

It's fair to consider this cheating.

lewisl · October 3, 2023, 7:04pm

(standard-bindings) and (extended-bindings) causes a runtime error! but not compile error:

*** ERROR IN benchmark-vector-gambit# -- Operator is not a PROCEDURE
(#!unbound)

This was reported in 2017! with a suggested work-around to compile syntax-case.scm to c and include it when compiling the target as:

gsc -:s -exe syntax-case.c benchmark-vector-gambit.scm

But, work-around no longer seems to work...

lewisl · October 3, 2023, 9:07pm

I thought that your repeat loop might be faster, but the real gain was the suggestion above to use for/fold, which was huge. No meaningful improvement in the repeat loop. I'll look for other ways to make the loop quick, especially as I don't use any values that are generated during the loop.

lewisl · October 3, 2023, 9:47pm

Here is an update on the benchmarks. I have received lots of good input, particularly for improving the performance of the Racket code using faster and more idiomatic Racket--and I'd also say more convenient to write and understand!

I made some changes to other languages as well. Adding type declarations to sbcl improved its performance nearly 5X. I also compiled Nim as d:release, which increased execution time to match Julia (I could add some "unsafe" hints to the Julia code, but that seems a bit unfair).

I received good suggestions from the Gambit master himself, Marc Feeley, which corrected some problems I caused and coaxed out a 16% performance improvement while preserving entirely understandable code.

I tried using typed hints and # lang typed/racket for Racket, but this made no improvement. I'll post it in a separate reply and I'd love any suggestions from the experts. My hunch is that because fl+ and in-flvector are already functions that are specialized for flonums and flvectors and require these types as arguments, there is really nothing to gain. The benefit of optimizations for typing might come when using functions like + and in-vector.

Many people commented that while one should use the preferred approach for each language, one should not bend into contortions for performance. The good news is that relative straight forward code delivers good results across the board.

See the updated version below. Great improvements for both Racket and sbcl; otherwise, not a lot of change. Thanks for everyone's interest and helpful suggestions. Not surprisingly, statically typed, compiled languages do best and all the languages are reasonable. Now, this is a truly trivial benchmark and much more complex linear algebra would spread out the results or require use of appropriately optimized libraries--which is what anyone would do for more involved work.

Micro-Benchmark of Summing a Floating Point Vector

Take this with a grain of salt. All micro-benchmarks are fraught; it's easier to do wrong than to do correctly. Please correct anything I have done incorrectly or suggest a better approach, as long as it's not particularly intricate code: it does matter that good performance can be achieved with straight forward code. For this trivial example, it is in all the languages.

This benchmark compares summing a floating point array in the following languages:

Julia: Design around matrix arithmetic; relatively easy to implement strong typing (though not required)

Nim: a nice c/c++ substitute; requires typing every variable; static compilation

SBCL: a performant lisp which makes typing possible; REPL statements compiled before execution. Setting explicit types resulted in a nearly 5x speedup.

Gambit: a scheme that compiles via c to object code; typing possible but optional

Racket: a scheme with extensive libraries; typing possible but optional; REPL statements compiled before execution; functions in the flonum library already require floating point arguments so explicit type hints didn't further improve performance.

The code for each language appears after the benchmarks. Benchmarks were run on my 2021 MacBook Pro M1. Benchmark timing varies quite a bit depending on other process running on the machine, even if the benchmarks are run in quick succession--yet another reason for healthy dose of salt.

Language	Timing in seconds for 1000 executions
Julia	.07 (using BenchmarkTools package)
Nim	.04 (second run, compile as unsafe)
Nim	.07 (compile as release)
Gambit compiled	.51 (second run)
SBCL in REPL	1.2 (no type declarations)
SBCL in REPL	.255 (do loop with type declarations)
Racket in REPL	.44 (for/fold comprehension)
Racket executable	Same as in REPL

Comments

Racket performs very respectably compared to sbcl even in the REPL. The AOT in REPL compilation of Racket seems faster than compiled Gambit-Scheme. Of course, it's not even fair to compare to Julia and Nim, both of which are strongly typed languges. Julia can also be dynamically typed, but it is usually easy to get "type stability" with Julia's excellent type inference.

I got typed/racket to work, but it did not reduce execution time. My theory--awaiting some comments from experts is that the flonum library functions, fl+ and in-flvector, are already specialized on specific types, so explicit type hinting won't enable the compiler to optimize the time critical functions.

Some suggestions were made to further optimize the Gambit-Scheme code. I could not get those to work; awaiting more input.

Look at the code below. Which language would you choose for numerical work?

Code:

Julia

function bench_vec_sum(len, fval)
    v = fill(fval, len)
    return sum(v)
end

Nim

# compile with nim c -d:danger --app:console --opt:speed --cc:clang --mm:orc  benchmark_vector.nim

import std/algorithm  # for fill
import os             # for cmdline parameters
import std/strutils   # to parse cmdline parameters

proc sum_vec_f64(vec: seq[float64]): float64 =    # returns float64
  var sum: float64 = 0.0
  for x in vec:   # iterator is faster than calculating index and referencing each slot
    sum += x
  return sum


proc vec_bench(vlen: int, fval: float64) =
  var vec: seq[float64]
  newseq(vec, vlen) # create and allocate memory for it
  vec.fill(fval)
  discard sum_vec_f64(vec) # ignore the return value

    
proc dobench(n: int, vlen: int, fval: float64) =
  for i in 1 .. n:
    vec_bench(vlen, fval)
    

# run it and ignore the return value
var fval: float64 # value for element of vector
var n: int        # number of runs
var vlen: int     # length of vector
if paramCount() == 3:
  n = parseInt(paramStr(1))
  vlen = parseInt(paramStr(2))
  fval = parseFloat(paramStr(3))
  dobench(n, vlen, fval)
else:
  quit("Requires exactly 3 integer parameters. Exiting.")

Gambit-Scheme

;; #!/usr/bin/env gsi-script -:d0
;; compile as gsc -exe <source filename>

; this version intended to be used with Gambit Scheme
;   not guaranteed to work with other versions of Scheme


(declare (block)    ; mysterious incantation to enable inlining of procedures
  (standard-bindings)   ; per reco. for improved perf.
  (extended-bindings)   ; ditto
  (not safe))            ; ditto

; sum the vector
(define (f64vector-sum vec)
  (let ((len (f64vector-length vec)))
    (do ((i (fx- len 1) (fx- i 1))
         (sum 0.0 (fl+ sum (f64vector-ref vec i))))
        ((fx< i 0) sum))))

(define (f64vec-bench len fval)
  (let ((vec (make-f64vector len fval)))
        (f64vector-sum vec)))

(define (repeat n len fval)
  (do ((i 0 (fx+ i 1)))
    ((fx= i n))
    (f64vec-bench len fval)))


(time (repeat 1000 100000 0.5)) ; do a sum 1000 times

sbcl

; the following make no difference on execution time
; (declaim (optimize (speed 3) (debug 0) (safety 0)))
; (declaim (inline vec-bench)) ; 30 to 40% performance improvement
; (declaim (inline vector-sum))

; sum the vector
(defun f64vector-sum-do-typed (vec)
    (declare (type (simple-array double-float (*)) vec))
  (let ((len (length vec))
        (sum 0.0d0))
        (declare (type double-float sum))
    (do ((i 0 (+ i 1)))
	((= i len) sum)
      (setf sum (+ sum (aref vec i))))))

; same performance as preceding do loop style
(defun f64vector-sum-dotimes-typed (vec)
  (declare (type (simple-array double-float (*)) vec))
  (let ((sum 0.0d0))
    (declare (type double-float sum))
    (dotimes (i (length vec) sum)
      (incf sum (aref vec i)))))


(defun f64vector-sum-reduce (vec) ; type declaration won't improve performance
    (reduce #'+ vec))

; same performance as do loop, but nicer
(defun f64vector-sum-for (vec) ; loop macro makes type declaration ineffective
    (loop for i across vec
          sum i))

(defun f64vector-sum-loop-typed (vec)  ; slower than do-style loops above
  (declare (type (simple-array double-float (*)) vec))
  (let ((sum 0.0d0))
    (declare (type double-float sum))
    (loop for i across vec
          do (setq sum (+ sum i)))
    sum))

(defun f64vec-bench (len fval)
    (let ((vec (make-array len :element-type 'double-float :initial-element fval)))
        (f64vector-sum-do-typed vec)))

(defun repeat (n len fval)
  (do ((i 0 (+ i 1)))
    ((= i n))
    (f64vec-bench len fval)))

(repeat 1000 100000 0.5d0) ; create vector and sum 1000 times

racket

#lang racket/base

#lang racket/base

(require racket/flonum)

; 4 different ways to sum a vector
; pass the function name as input to repeat

; 1. sum the vector in a do loop
(define (vector-sum-do vec)
 (let ((len (flvector-length vec))
       (sum 0.0)) 
   (do ((i 0 (add1 i)))
	((= i len) sum)
     (set! sum (fl+ sum (flvector-ref vec i))))))

; 2. sum the vector with a naive for/sum iterator: very slow
(define (vector-sum-for/sum vec) 
  (for/sum ((sum vec)) sum))

; 3. sum the vector in a for/sum iterator iterating the vec with in-flvector
(define (vector-sum-for/sum-infl vec)
  (for/sum ([sum (in-flvector vec)])
    sum))

; 4. sum the vector using for/fold: concise, idiomatic, fast
(define (vector-sum-for/fold vec)
  (for/fold ([sum 0.0])
            ([v (in-flvector vec)])
    (fl+ sum v)))

(define (vector-bench func len fval)
  (let ((vec (make-flvector len fval)))
    (func vec)))

(define (repeat func n len fval)
  (do ((i 0 (add1 i)))
    ((= i n))
    (vector-bench func len fval)))


; use as follows:
(time (repeat vector-sum-for/fold 1000 100000 0.5)) ; create vector 1000 times and sum
(#%declare #:unsafe)

shawnw · October 4, 2023, 2:18am

Some timings on my laptop fwiw.

Lisp versions use (time (repeat 1000 100000 0.5d0)) to get timings.

SBCL 2.3.8:

Evaluation took:
  0.260 seconds of real time
  0.258016 seconds of total run time (0.258016 user, 0.000000 system)
  [ Real times consist of 0.040 seconds GC time, and 0.220 seconds non-GC time. ]
  [ Run times consist of 0.032 seconds GC time, and 0.227 seconds non-GC time. ]
  99.23% CPU
  643,265,498 processor cycles
  800,016,000 bytes consed

Clozure CL 1.12.2:

(REPEAT 1000 100000 0.5D0)
took 655,294 microseconds (0.655294 seconds) to run.
     142,075 microseconds (0.142075 seconds, 21.68%) of which was spent in GC.
During that period, and with 12 available CPU cores,
     615,698 microseconds (0.615698 seconds) were spent in user mode
      40,146 microseconds (0.040146 seconds) were spent in system mode
 800,032,000 bytes of memory allocated.
 7 minor page faults, 0 major page faults, 0 swaps.

Gambit 4.9.5 compiled:

(time (repeat 1000 100000 .5))
    0.517657 secs real time
    0.517642 secs cpu time (0.504448 user, 0.013194 system)
    666 collections accounting for 0.065633 secs real time (0.064437 user, 0.001581 system)
    3200799704 bytes allocated
    2270 minor faults
    no major faults
    1292054371 cpu cycles

Racket 8.11:

cpu time: 616 real time: 616 gc time: 55

I don't have julia or nim installed, but here's an ocaml version:

(* ocamlfind ocamlopt -o vec -package benchmark -linkpkg vec.ml *)

let f64vector_sum vec =
  Float.Array.fold_left (+.) 0.0 vec

let f64vec_bench len fval =
  let vec = Float.Array.make len fval in
  f64vector_sum vec

let _ = Benchmark.latency1 1000L (fun () -> f64vec_bench 100000 0.5) ()

and timing for ocaml 5.1.0:

Latencies for 1000 iterations of "":
[run 1000 times]:  0.53 WALL ( 0.52 usr +  0.01 sys =  0.53 CPU) @ 1898.03/s (n=1000)```

bogdan · October 4, 2023, 5:06am

Your Racket code still isn't as fast as it can be. This version of vector-sum-for/fold takes 150ms on my machine compared to 400ms for the one in your post:

(define (vector-sum-for/fold vec)
  (for/fold ([sum 0.0] #:result (fl+ sum))
            ([v (in-flvector vec)])
    (fl+ sum v)))

Or, as @gus-massa pointed out, you can also write:

(define (vector-sum-for/fold vec)
  (fl+
   (for/fold ([sum 0.0])
             ([v (in-flvector vec)])
     (fl+ sum v))))

to get the same improvement.

See Fixnum and Flonum Optimizations.

lewisl · October 4, 2023, 5:29pm

I didn't see Gus-massa's response anywhere except in your post. It certainly does work, almost a 3X speedup!!

But, why? And where did that syntax come from? I could not see it in the manual covering for/fold or for/foldr (the syntax descriptions in the manual are tough to follow, though).

(for/fold ([accum-id init-expr] ... maybe-result) (for-clause ...)
  body-or-break ... body)
 
maybe-result	 	=	 	
 	 	|	 	#:result result-expr

What does putting fl+ in a form by itself do: are we creating a local alias for the fl+ function and then calling it? Seems like that is a "let" for a function: is eliding the "let" a shortcut?

I could not understand the explanation of the #:result argument:

When iteration terminates, if a result-expr is provided then the result of the 
for/fold is the result of evaluating result-expr (with accum-ids in scope and bound 
to their final values), otherwise the results of the for/fold expression are the 
accumulator values.

Sorry, but what does this mean? All we want here is an accumulator.

I have to say that both syntax forms are not at all obvious, though absolutely effective. They seem to violate the idea of using obvious approaches without tricks. But, tricks can be good... Seems like either could be a compiler optimization. Or the language needs an efficient accumulator function that just works with maximum efficiency (as the c++ STL provides for accumulators).

Racket has a massive API, but seemingly avoids providing functions for obvious use cases like sum, count, etc. Countif, sumif, etc. could be provided with a #:filter argument rather than being their own functions. But, you figured it out so I guess I am dumb because I could not despite reading (not closely enough) the reference section.

Racket is just not an easy language at all. I'll say it again: function overloading with compile-time dispatch is perhaps the single most important improvement that could be made in the language. Then we don't need in-flvector and in-vector, we just need in. There are dozens of cases like this. The API surface area then is significantly reduced.

Sure, the specialized functions must be written, but the language developers will do that for dozens of cases (as they already have, but sadly with distinct names for each function or special case arguments). But, dumb users of the language would get the benefit of learning certain canonically named functions (like "sum") that could work for dozens of cases by dispatching to the appropriate specialized function based on arity and argument types and return types. Sounds like I am describing Julia--but also c++, which has done this since its beginning. This approach seems really beneficial if it is a goal to have an approachable language. If that's not a goal, well... that seems strange.

lewisl · October 4, 2023, 5:38pm

Here is the latest with dramatic improvements for Racket: it's the undisputed champion among the Lisp/Scheme languages. I missed the subtlety of earlier suggestions. I'd say the syntax might not be the most obvious, but it delivers performance.