_fixnum slow? (despite docs)

I was trying to figure out why my uses of ptr-ref were 10x slower than vector-ref. The main reason appears to be that _fixnum is much slower than other ctypes such as _ulong.

#lang racket/base
(require ffi/unsafe)


(define n-times 100000000)
(define len 10)

;; Racket vectors
(collect-garbage)(collect-garbage)(collect-garbage)
(let ()
  (printf "~a: " 'vector)
  (define v (make-vector len))
  (time (for ([i n-times]) (vector-ref v 0))))

;; C types
;; `ptr-ref` can be far slower if the ctype is not given explicitly, 
;; hence a macro rather than a function.
(define-syntax-rule (stress-test ctype ZERO)
  (begin
    (collect-garbage)(collect-garbage)(collect-garbage)
    (printf "~a: " 'ctype)
    (let ()
      (define ptr (malloc len ctype 'raw))
      (ptr-set! ptr ctype 0 ZERO) ; write a valid value
      (time (for ([i n-times]) (ptr-ref ptr ctype 0))) ; stress test
      (free ptr))))

(stress-test _int 0)
(stress-test _uint 0)
(stress-test _ulong 0)
(stress-test _double 0.)
(stress-test _fixnum 0)
(stress-test _racket 0)

Results:

$ racket ptr-ref-stress-test.rkt 
vector: cpu time: 715 real time: 715 gc time: 0
_int: cpu time: 1189 real time: 1189 gc time: 0
_uint: cpu time: 1119 real time: 1119 gc time: 0
_ulong: cpu time: 1143 real time: 1144 gc time: 0
_double: cpu time: 1328 real time: 1329 gc time: 7
_fixnum: cpu time: 8931 real time: 8937 gc time: 0
_racket: cpu time: 6339 real time: 6343 gc time: 0

_fixnum appears particularly slow here. However, the docs say that _fixnum is "for cases where speed matters". (Perhaps a relic of Racket BC?)

Besides this, ptr-ref is still 1.6x slower than vector-ref. Is there any way to get ptr-ref on par with vector-ref?
I'm particularly interested in _double and _ulong (or _fixnum).

You're right that _fixnum is a relic of BC.

I think the gap between vector-ref and ptr-ref is larger than you report here. Using in-range in the for loop cuts 2/3 of the time for the vector-ref loop on my machine, and then it's about a 3x difference for _ulong and 3.5x for _double.
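For reference, a minimal sketch of the in-range variant of the vector loop (the ctype loops change the same way):

```racket
#lang racket/base

;; Sketch: the same vector-ref loop, but iterating with `in-range` so
;; that `for` compiles to a specialized fixnum loop instead of
;; dispatching on a generic sequence each iteration.
(define n-times 100000000)
(define v (make-vector 10))
(time (for ([i (in-range n-times)]) (vector-ref v 0)))
```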

There are two sources of slowdown for ptr-ref: the limited amount of inlining that happens for ptr-ref's implementation, and the way that implementation accommodates multiple pointer representations (e.g., pointer objects versus byte strings). Those are related, but it should be possible to better inline a fast path. It would also be nice to have _double access cooperate with local unboxing, but that may be trickier.
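To illustrate the multiple-representations point, here is a small sketch; both ptr-ref calls below go through the same generic entry point, which must first discover which kind of pointer it was given:

```racket
#lang racket/base
(require ffi/unsafe)

;; Two different run-time representations reach the same `ptr-ref`:
(define p  (malloc 1 _ulong 'raw))  ; an FFI pointer object
(define bs (make-bytes 8 0))        ; a byte string is a cpointer too
(cpointer? p)  ; => #t
(cpointer? bs) ; => #t

;; Each call must dispatch on the representation before it can read,
;; which is part of the overhead described above:
(ptr-ref p  _ulong 0)
(ptr-ref bs _ulong 0) ; => 0 (make-bytes zero-fills)
(free p)
```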

I'll hope to look into this more in the near future.


Thanks for the explanation.

I'll hope to look into this more in the near future.

That would be marvellous, thank you!

If I can get a fast path even just for int64, I could offload the heavy flonum computation to C. It seems I can gain a factor of 10 in speed this way for my code, except that the conversion from an fxvector to a C array of int64 (via malloc and ptr-set!) is reversing the gains compared to pure Racket code, even though the number of computation steps for this conversion should be an order of magnitude smaller than the heavy stuff.
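For concreteness, a sketch of the kind of conversion described above (the function name is mine, not from the post):

```racket
#lang racket/base
(require ffi/unsafe racket/fixnum)

;; Element-by-element copy of an fxvector into a malloc'ed C array of
;; int64 -- the step whose per-element `ptr-set!` cost can eat the
;; gains from offloading the heavy computation to C.
(define (fxvector->int64-array fxv)
  (define len (fxvector-length fxv))
  (define p (malloc len _int64 'raw))
  (for ([i (in-range len)])
    (ptr-set! p _int64 i (fxvector-ref fxv i)))
  p) ; caller is responsible for calling `free`

(define p (fxvector->int64-array (fxvector 1 2 3)))
(ptr-ref p _int64 2) ; => 3
(free p)
```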

It turns out that ptr-set! with simple types like _ulong or _int64 is about 10x slower than it should be due to a broken optimization attempt, while ptr-ref is only about 2x as slow as it should be.

I'm getting close to finishing an overhaul of the FFI that makes (ptr-ref p _ulong 0) twice as fast and makes (ptr-set! p _ulong 0 v) similarly fast. I'm hopeful that it will also speed up other FFI tasks.


In the interest of more broadly testing the FFI change, I've pushed the revised FFI implementation for Racket, letting Racket's copy of Chez Scheme temporarily diverge from the main Chez Scheme branch while a PR is considered there.

Here are some rough performance results. The "fix" column shows performance after repairing a broken check in v8.15 that disabled an intended fast path. The "new" column shows the latest Racket with the revised FFI implementation. The "ref-stress-test" group of results corresponds to the original post, but using in-range. The last four examples are about foreign calls as much as (or instead of) ptr-ref and ptr-set!. For example, "math.rkt" uses bf+, which is relevant because it uses MPFR and GMP bindings. Programs here.

CPU time in milliseconds
                  v8.15     fix     new
ref-stress-test
  vector   ref      11      11       11
  _ulong   ref      37      38   >>  17
  _ulong   set!    631  >>  72   >>  18
  _int     ref      37      37   >>  16
  _uint    ref      37      37   >>  16
  _double  ref      42      41   >>  18
  _fixnum  ref     509     510   >>  88
  _racket  ref     373     373   >>  96
from/to bytes
  _ulong   ref      29      29   >>  11
  _ulong   set!    629  >>  62   >>  11
struct ref         171     170    > 164
struct set!        660  >> 180    > 172
math               105   >  84    >  64
plus                29      28   >>  13
strlen              19      19    >  14
draw               228     227      213

There's still room for improvement in foreign calls. A Chez Scheme variant of plus runs in 3ms instead of 13ms, which reflects overhead still added by Racket's more dynamic FFI. This round of improvements seems like a step toward closing that gap, though.


Thank you so much, this is awesome!

Also, TIL that make-bytes returns a pointer :exploding_head:
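Indeed, a byte string satisfies cpointer?, so make-bytes gives a GC-managed buffer the FFI can read and write without manual malloc/free. A quick check:

```racket
#lang racket/base
(require ffi/unsafe)

;; A Racket byte string really is usable as a C pointer:
(define buf (make-bytes 8 0))
(cpointer? buf)            ; => #t

;; So it can back `ptr-set!`/`ptr-ref` with no explicit `free`:
(ptr-set! buf _int64 0 42)
(ptr-ref buf _int64 0)     ; => 42
```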


I've built your new changes, Matthew, and tested them on an M1 MacBook Pro with Herbie on the hamming benchmark suite. This is a small suite, but one that does more FFI-related computation than the others. It runs fine, no problems. Results:

  • Overall: 2.2 min down to 2.0 min
  • "Sampling" phase (FFI-heavy): 52.4s down to 46.1s
  • "Rival" component (FFI-heavy): 39.0s down to 32.9s

So that's a big win! Other phases are largely unaffected (localize, which is similar to sampling, also speeds up), suggesting that it really was the FFI.

If @Laurent.O needed the extra performance in the meantime, would using vm-eval to access the Chez Scheme FFI work? Or maybe it would be better to write a Chez Scheme library and access it as you explained in Calling into a Chez Scheme library from Racket -- imported symbol rewritten in vm-eval? - #9 by mflatt? Or do those e.g. inhibit inlining such that it wouldn't pay off?
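For reference, a heavily hedged sketch of the vm-eval route (Racket CS only; whether a Chez foreign-procedure created this way actually avoids the overhead is exactly the open question, and the example assumes "strlen" is resolvable in already-loaded libraries on the current platform):

```racket
#lang racket/base
(require ffi/unsafe/vm)

;; `vm-eval` evaluates a form in the hosting Chez Scheme, so a Chez
;; `foreign-procedure` can be created directly, bypassing Racket's
;; ctype layer. A `(load-shared-object ...)` via `vm-eval` may be
;; needed first if the symbol is not already visible to the process.
(define chez-strlen
  (vm-eval '(foreign-procedure "strlen" (string) size_t)))

(chez-strlen "hello")
```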

An update with the full Herbie benchmark suite:

  • Full Herbie: 32.4min -> 29.2min
  • "Sampling" + "Localize" components: 9.7min -> 7.8min
  • "Rival" library: 5.9min -> 4.3min

Herbie is both quite large and deterministic, and the two runs I'm comparing used the same seed, so I can say reasonably confidently that the impact was almost entirely the FFI speedup. There also seems to be slightly more memory allocated, but less time spent collecting it. No idea where that's from.

A 10% speedup for Herbie, end-to-end, is pretty amazing. Thank you @mflatt!