A micro-benchmark

Micro-Benchmark of Summing a Floating Point Vector

Take this with a grain of salt. All micro-benchmarks are fraught; they are easier to do wrong than to do correctly. Please correct anything I have done incorrectly or suggest a better approach, as long as it's not particularly intricate code: it matters that good performance can be achieved with straightforward code. For this trivial example, the code is straightforward in all the languages.

This benchmark compares summing a floating point array in the following languages:

Julia: designed around matrix arithmetic; relatively easy to achieve strong typing (though not required)

Nim: a nice C/C++ substitute; requires typing every variable; static compilation

SBCL: a performant Lisp which makes typing possible; REPL statements are compiled before execution

Gambit: a Scheme that compiles via C to object code; typing possible but optional

Racket: a Scheme with extensive libraries; typing possible but optional; REPL statements are compiled before execution; for reasons unclear to me, the compiled executable was actually faster than REPL execution, though the manual claims that shouldn't be expected

The code for each language appears after the benchmarks. Benchmarks were run on my 2021 MacBook Pro M1. Benchmark timing varies quite a bit depending on other processes running on the machine, even if the benchmarks are run in quick succession--yet another reason for a healthy dose of salt.

Language            Time in seconds for 1000 executions
Julia               0.07  (measured with internal BenchmarkTools)
Nim                 0.04  (second run)
Gambit compiled     0.63  (second run)
SBCL in REPL        1.2
Racket in REPL      1.9
Racket executable   0.71  (second run)

Comments

Racket performs very respectably compared to SBCL even in the REPL. The executable, despite being > 60 MB, performs very well compared to compiled Gambit--essentially a tie. Of course, it's not even fair to compare them to Julia and Nim, both of which are strongly typed languages. Julia can also be dynamically typed, but it is usually easy to get "type stability" with Julia's excellent type inference.

I tried typed/racket but it did not like the do loop, and I could not figure out how to introduce the requested type hints. I am fairly perplexed by the loop. I don't think I could put the do loop in a let block, as the do loop introduces its own scope--so it's not clear that would accomplish anything. I'm eager to see how to use typed/racket correctly.
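For what it's worth, here is an untested sketch of how I believe the loop could be annotated in typed/racket. Per my reading of the Typed Racket reference, its do form takes a result-type annotation and per-variable annotations, which also lets sum be a loop variable instead of a set! target:

```racket
#lang typed/racket

(require racket/flonum)

;; Sketch only, based on my reading of the Typed Racket reference:
;; the typed do form is (do : ResultType ([id : Type init step] ...) (stop finish) body ...)
(: vector-sum (-> FlVector Flonum))
(define (vector-sum vec)
  (define len (flvector-length vec))
  (do : Flonum ([i : Integer 0 (add1 i)]
                [sum : Flonum 0.0 (fl+ sum (flvector-ref vec i))])
      ((= i len) sum)))
```

By making sum a do variable that is stepped each iteration, no set! (and hence no typed assignment) is needed at all.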

Look at the code below. Which language would you choose for numerical work?

Code:

Julia
function bench_vec_sum(len, fval)
    v = fill(fval, len)
    return sum(v)
end
Nim
# compile with nim c -d:danger --app:console --opt:speed --cc:clang --mm:orc  benchmark_vector.nim

import std/algorithm  # for fill
import os             # for cmdline parameters
import std/strutils   # to parse cmdline parameters

proc sum_vec_f64(vec: seq[float64]): float64 =    # returns float64
  var sum: float64 = 0.0
  for x in vec:   # iterator is faster than calculating index and referencing each slot
    sum += x
  return sum


proc vec_bench(vlen: int, fval: float64) =
  var vec: seq[float64]
  newSeq(vec, vlen) # create and allocate memory for it
  vec.fill(fval)
  discard sum_vec_f64(vec) # ignore the return value

    
proc dobench(n: int, vlen: int, fval: float64) =
  for i in 1 .. n:
    vec_bench(vlen, fval)
    

# run it and ignore the return value
var fval: float64 # value for element of vector
var n: int        # number of runs
var vlen: int     # length of vector
if paramCount() == 3:
  n = parseInt(paramStr(1))
  vlen = parseInt(paramStr(2))
  fval = parseFloat(paramStr(3))
  dobench(n, vlen, fval)
else:
  quit("Requires exactly 3 parameters: n (int) vlen (int) fval (float). Exiting.")
Gambit
;; #!/usr/bin/env gsi-script -:d0

(declare (block))     ; mysterious incantation to enable inlining of procedures

; sum the vector
(define (f64vector-sum vec)
  (let ((len (f64vector-length vec))
        (sum 0.0))
    (do ((i 0 (+ i 1)))
        ((= i len) sum)
      (set! sum (+ sum (f64vector-ref vec i))))))

(define (f64vec-bench len fval)
  (let ((vec (make-f64vector len fval)))
    (f64vector-sum vec)))

(define (repeat n len fval)
  (do ((i 0 (+ i 1)))
    ((= i n))
    (f64vec-bench len fval)))


(repeat 1000 100000 0.5) ; do a sum 1000 times
SBCL
; the following make no difference on execution time
; (declaim (optimize (speed 3) (debug 0) (safety 0)))
; (declaim (inline vec-bench)) ; 30 to 40% performance improvement
; (declaim (inline vector-sum))

; sum the vector
(defun f64vector-sum (vec)
  (let ((len (length vec))
        (sum 0.0d0))
    (do ((i 0 (+ i 1)))
        ((= i len) sum)
      (setf sum (+ sum (aref vec i))))))

(defun f64vec-bench (len fval)
  (let ((vec (make-array len :element-type 'double-float :initial-element fval)))
    (f64vector-sum vec)))

(defun repeat (n len fval)
  (do ((i 0 (+ i 1)))
    ((= i n))
    (f64vec-bench len fval)))

(repeat 1000 100000 0.5d0) ; create vector and sum 1000 times

Racket
#lang racket

; (require ffi/vector) not being used
(require racket/flonum)

; sum the vector
(define (vector-sum vec)
  (let ((len (flvector-length vec))
        (sum 0.0)) 
    (do ((i 0 (add1 i)))
        ((= i len) sum)
      (set! sum (fl+ sum (flvector-ref vec i))))))

;(define (vector-sum-new vec) ; much slower
;  (for/sum ((sum vec)) sum))

(define (vector-bench len fval)
  (let ((vec (make-flvector len fval)))
    (vector-sum vec)))

(define (repeat n len fval)
  (do ((i 0 (add1 i)))
    ((= i n))
    (vector-bench len fval)))


; use as follows:
(time (repeat 1000 100000 0.5)) ; create vector 1000 times and sum




Two things:

(1) If you write the Racket program idiomatically, you can speed the program up significantly.

;; 711 ms
(define (vector-sum vec)
  (for/sum ([sum (in-flvector vec)])
    sum))

;; 2145 ms
#;
(define (vector-sum vec)
  (for/sum ([sum vec])
    sum))

;; 1840 ms
#;
(define (vector-sum vec)
  (let ([len (flvector-length vec)]
        [sum 0.0]) 
    (do ((i 0 (add1 i)))
      ((= i len) sum)
      (set! sum (fl+ sum (flvector-ref vec i))))))

In particular, in-flvector provides the performance improvement. See the doc for details.

On my computer, the version with for and in-flvector is the fastest, taking only 711 ms. On the other hand, the original program takes 1840 ms.

(2) Modern Racket would prefer for (or a for-like form) over do.


Thanks for sharing!

This is not the best Racket style; below is how I'd write this code (it is about 10% faster for me than your version).

I also notice that vector is faster than flvector, which suggests some room for improvement somewhere.

I see sorawee beat me to the punch, and his improvements seem to produce better speedups than mine did; maybe we just have different machines. I'm on a laptop, which isn't the best for timing things, or so I understand.

#lang racket

; (require ffi/vector) not being used
(require racket/flonum)

; sum the vector
(define (vector-sum vec)
  (for/sum ([e (in-flvector vec)])
    e))

(define (vector-bench len fval)
  (let ((vec (make-flvector len fval)))
    (vector-sum vec)))

(define (repeat n len fval)
  (for ([i (in-range n)])
    (vector-bench len fval)))

; use as follows
(collect-garbage) (collect-garbage) ;; try to make timing more reliable; maybe not necessary here
(time (repeat 1000 100000 0.5)) ; create vector 1000 times and sum

I also did Chicken Scheme, though I no longer have it installed.

It came in at 1.3 seconds for a compiled executable, so slower than Gambit, Racket and sbcl.

Chicken does create the smallest executable of any of the compiled schemes. I can't claim I got the code as optimized as was reasonable so don't read anything into this.

I also once made a saved image of the sbcl code. I think it was smaller than Racket, but larger than Gambit or Chicken. You do get a maybe redistributable executable.

The fact is that neither Scheme nor Lisp is really designed to create statically linked executables. A worthy experiment, but none of the various maintainers seriously target such usage. That's what we have compiled languages for.

Here's a collection of versions (including some improvements to the benchmarking harness):

The results are, for me, that for/fold with type specialization is by far the fastest, about twice as fast as the original code. Also, just using (#%declare #:unsafe) speeds up the imperative version significantly, but the functional versions less so (because the existing type analysis does better on functional code).

#lang racket/base

; (require ffi/vector) not being used
(require racket/flonum)


; sum the vector
(define (vector-sum vec)
  (let ((len (flvector-length vec))
        (sum 0.0)) 
    (do ((i 0 (add1 i)))
        ((= i len) sum)
      (set! sum (fl+ sum (flvector-ref vec i))))))


(define (vector-sum2 vec)
  (for/fold ([sum 0.0])
            ([v (in-flvector vec)])
    (fl+ sum v)))

(define (vector-sum-new vec) ; much slower
  (for/sum ((sum vec)) sum))
(define (vector-sum-new2 vec) 
  (for/sum ((sum (in-flvector vec))) sum))



(define-syntax-rule (bench vector-sum)
  (begin
    (define (vector-bench len fval)
      (let ((vec (make-flvector len fval)))
        (vector-sum vec)))

    (define (repeat n len fval)
      (for ([_ (in-range n)])
        (vector-bench len fval)))

    (collect-garbage)
    ; use as follows:
    'vector-sum
    (time (repeat 1000 100000 0.5)) ; create vector 1000 times and sum
    ))
(#%declare #:unsafe)


(bench vector-sum)
(bench vector-sum2)
(bench vector-sum-new)
(bench vector-sum-new2)

Here is a version that uses unboxed flonums:

#lang racket/base

(require racket/flonum)

(define (vector-sum vec)
  (for/fold ([sum (fl+ 0.0)] #:result (fl+ sum))
            ([v (in-flvector vec)])
    (fl+ sum v)))

(define (vector-bench len fval)
  (vector-sum (make-flvector len fval)))

(define (repeat n len fval)
  (for ([_ (in-range 0 n)])
    (vector-bench len fval)))

(time (repeat 1000 100000 0.5)) ; create vector 1000 times and sum

It takes around 150ms to execute on my machine (2023 Apple M2 Max).


Very nice. Thank you. I'll put that in the code. (I did!)

I had tried for/sum but without in-flvector. That was pretty bad, as your comparison shows, because I didn't get the benefit of the fl+ that I used in the naive do loop. I looked up in-flvector.

OK: on my machine I got 0.850 for REPL execution and 0.65 for the executable. That rocks; really impressive.

I will say it is hard to discover these specialized functions without reading lots and lots of documentation. The Racket doc is really easily searched, but not meant to be read as a book (though that is the best way to gain a broad perspective on the language's capabilities...).

The unfortunate thing about Lisps and Schemes is that they don't do compile-time dispatch to appropriately specialized, compiled functions (see Julia, C++, and Nim for that, as well as other languages), so there are all these special-purpose, specially named functions. That makes discovery difficult. To be clear, the languages that do support function (or operator) overloading (that's really all compile-time dispatch is) still require that the specialized functions be written--it's just that users can use the same function name for semantically equivalent purposes, which makes code easier to write and read. Performance will be great, as the compiler binds the name to the right function. Some people call this polymorphism--as does Racket. Of course, OO "bigots" mean something different by polymorphism--typically the horrible dynamic dispatch that comes from object inheritance, which should be avoided like the plague. I am certainly not endorsing that, though it has some limited very good use cases (GUIs in particular).

Excellent work has gone into the continuing improvement of Racket (I couldn't care less about the religious disputes with other Schemes). It has become a rather gargantuan language, but the same could be said for sbcl--so, it's mostly cured by ramping up one's learning. Re-hosting on Chez has certainly paid off, along with lots of other work on the language.

If 8 out of 10 of all the Schemes would just disappear, the maintaining teams combine, egos go away--then we'd have 3 or fewer Schemes, all well-maintained with good doc, good tooling, and robust sustained teams. Scheme then might actually be a competitive language in the real world (no, sorry--it's not now... YMMV). This has been proven with the Racket/Chez collaboration. Ending the fragmentation has been proposed since the 1970's to no avail. Ego eggs break hard. So counter-productive: why not be part of a productive team like Racket's and see one's work used by more people to more good effect? That should be more gratifying than being alone in one's own sandbox/snow-globe. Alas...

Do you think Cisco will throw in the towel on Chez? If the Racket team can hire a few of the folks, then the benefits of Chez would live on (Cisco could continue to rely on it) even if the independent distributable did not.

I think you should see using in-flvector and fl+ as compile-time dispatch. Julia or C++, in their own ways, are doing compile-time dispatch by declaring the type at the binding site for the variable, and that can be more convenient than declaring it at the operation site but ultimately they are similar mechanisms.

To see that in action, you can write your program in Typed Racket, remove the use of fl+ and get the same compile-time dispatch. Like this:

#lang typed/racket

(require racket/flonum)

(: vector-sum2 : FlVector -> Flonum)
(define (vector-sum2 vec)
  (for/fold ([sum 0.0])
            ([v (in-flvector vec)])
    (+ sum v)))

Typed Racket doesn't currently optimize away the need for in-flvector, but that's very doable as well. (Using for/sum in Typed Racket is trickier because it produces exact 0 on an empty sequence.)

Fantastic.

I agree with your comments about compile-time dispatch. Certainly using a type-specialized function pays off. My point was about operator and function overloading: that can eliminate the special name (in some cases).

Thanks for the tip on typed/racket. The challenge for using type hints (are these contracts--so the type is imposed?) in dynamic languages is coming up with a convenient syntax for something grafted on. You could say that neither Nim's nor C++'s syntax for typing is convenient, but one does get used to it. Julia goes a step further with possibly the most thorough (but still not perfect) type inference: in fact, many people mess up type stability and write overly specialized code by using too many type hints in all the wrong places (is that a song?).

That type is entirely static. If you used it with an untyped module (say, the one providing the flvector), then there would be a contract at the boundary. Unlike in Julia, types don't change the runtime behavior of the program.

The types in Julia are static: that is, the name bound to the memory object always has the same type if the object itself does not change. You could say that the runtime behavior changes, but only because of JIT compiling. Once Julia has encountered an instance of an object of a given type and the corresponding code has been compiled, that's that. The problem people encounter is with dynamic dispatch, when Julia's compiler can't infer the types it needs to compile for and does so on the fly: that kills performance.

You can end up with a name pointing to an object that "changes" type--but that's because you are getting a different object, say when a called function returns objects of different types from different if branches (this is considered bad practice and is called type instability) or when objects are being read from disk and deserialized into Julia objects. This seems more of a code-design problem that can occur in any language. In a truly statically typed language it might show up as a compile error, but it will certainly end up causing a runtime error. As a dynamic language, Julia lets you get away with it, but the much-vaunted (and generally true) claimed performance goes south on you. Sometimes Python gets faster than Julia, because the underlying C code can handle this but the Julia code JIT-compiled on the fly cannot.
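As a small, purely illustrative Julia sketch of the branch-dependent return type described above (the function names are mine):

```julia
# Hypothetical illustration of type (in)stability, not code from the benchmark.
# unstable's return type depends on the branch taken: Int for x > 0, Float64
# otherwise, so inference yields a Union and callers pay for dynamic dispatch.
unstable(x) = x > 0 ? 1 : 0.0

# stable always returns Float64, so callers compile to specialized code.
stable(x) = x > 0 ? 1.0 : 0.0

# In the REPL, @code_warntype unstable(2.0) highlights the Union return type.
```
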

....if I understand what you are saying...

BTW, this is one of the very best communities I've encountered. Julia's is pretty good, too. Nim's is just too small but very helpful if you can ignore Araq's tone :-).

What I mean is that if you add a type annotation to a Julia function, for example, that can change what function gets called at runtime. Whereas in Typed Racket, the runtime is entirely unaware of the types and doesn't use them to make any decisions.

Interesting: that is sort of like type hints in Python. Julia is trying to be closer to statically typed without going all the way.

Yes, there are definitely similarities, but in Typed Racket the type system is sound and thus can be used for optimization (so it computes the same answer as before, but faster).

I am going to sign off for now. I learned a lot from this chain.


See the Guide and Reference, which are much more book-like. Of course they don’t cover libraries provided by other packages, including main-distribution stuff like the web server, etc.

People are often surprised that set! typically carries a performance cost in Racket. Among other things, the compiler implements set! by boxing the assigned variable on the heap, and it doesn't do analysis to try to avoid that cost. Of course, other techniques for implementing assignment are well known, but, since set! is avoided in idiomatic Racket (and Chez Scheme) code, they have been judged not to be a good use of resources. Aside from development effort, the current model means that the cost of set! only comes if set! is actually used: alternatives generally would require more work from the implementation of closures and continuations, which are higher priorities than set!.
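To make the boxing concrete, here is a sketch of the standard assignment-conversion idea described above. This is my own illustration of the technique, not actual Racket compiler output:

```racket
#lang racket

;; Source-level code that mutates a local variable with set!:
(define (count-up n)
  (let ([total 0])
    (for ([i (in-range n)])
      (set! total (+ total i)))
    total))

;; Roughly what assignment conversion arranges behind the scenes:
;; the mutated variable lives in a heap-allocated box, set! becomes
;; set-box!, and every read becomes unbox -- hence the extra cost.
(define (count-up/boxed n)
  (let ([total (box 0)])
    (for ([i (in-range n)])
      (set-box! total (+ (unbox total) i)))
    (unbox total)))

(count-up 5)        ; 10
(count-up/boxed 5)  ; 10
```

Both versions compute the same sum; the boxed one just makes the indirection explicit that the compiler introduces for any set! variable captured by a closure.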

On my machine, the compiled Gambit code runs as follows:
As written: 0.834 seconds.
With (standard-bindings) (extended-bindings) declarations added (so the compiler can assume that +, =, fl+, etc., have their usual meanings): 0.619 seconds.
With a (not safe) declaration additionally added: 0.375 seconds.
Perhaps the middle setting has the same semantics as Racket.

I think set! needs boxes because values can have different sizes. I assume that in the compiled code such a box just contains a pointer. Pointers have a fixed size at the C level, hence replacing a pointer can be done at the machine-code level. I assume that in fact all Scheme-like evaluators need indirect references through pointers, interned data possibly excluded. I remember well my first implementation of a Lisp-like language, written in Pascal, full of pointers and an environment containing these pointers so as to make garbage collection possible. A pity I no longer have the source code of that implementation. It was a very poor implementation anyway. I made it because I was not satisfied with the Lisp available on the IBM 6000 some 25 years ago, and I never used it again after discovering Racket.
