I was trying to figure out why my uses of ptr-ref were 10x slower than vector-ref. The main reason appears to be that _fixnum is much slower than the other ctypes, such as _ulong.
#lang racket/base
(require ffi/unsafe)

(define n-times 100000000)
(define len 10)

;; Racket vectors
(collect-garbage) (collect-garbage) (collect-garbage)
(let ()
  (printf "~a: " 'vector)
  (define v (make-vector len))
  (time (for ([i n-times]) (vector-ref v 0))))

;; C types
;; `ptr-ref` can be far slower if the ctype is not given explicitly,
;; hence a macro rather than a function.
(define-syntax-rule (stress-test ctype ZERO)
  (begin
    (collect-garbage) (collect-garbage) (collect-garbage)
    (printf "~a: " 'ctype)
    (let ()
      (define ptr (malloc len ctype 'raw))
      (ptr-set! ptr ctype 0 ZERO)                      ; write a valid value
      (time (for ([i n-times]) (ptr-ref ptr ctype 0))) ; stress test
      (free ptr))))

(stress-test _int 0)
(stress-test _uint 0)
(stress-test _ulong 0)
(stress-test _double 0.)
(stress-test _fixnum 0)
(stress-test _racket 0)
Results:
$ racket ptr-ref-stress-test.rkt
vector: cpu time: 715 real time: 715 gc time: 0
_int: cpu time: 1189 real time: 1189 gc time: 0
_uint: cpu time: 1119 real time: 1119 gc time: 0
_ulong: cpu time: 1143 real time: 1144 gc time: 0
_double: cpu time: 1328 real time: 1329 gc time: 7
_fixnum: cpu time: 8931 real time: 8937 gc time: 0
_racket: cpu time: 6339 real time: 6343 gc time: 0
_fixnum appears particularly slow here. However, the docs say that _fixnum is "for cases where speed matters". (Perhaps a relic of Racket BC?)
Besides this, ptr-ref is still 1.6x slower than vector-ref. Is there any way to get ptr-ref on par with vector-ref?
I'm particularly interested in _double and _ulong (or _fixnum).
I think the gap between vector-ref and ptr-ref is larger than you report here. Using in-range in the for loop cuts about two thirds of the time off the vector-ref loop on my machine, which makes the gap about 3x for _ulong and 3.5x for _double.
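For reference, the in-range variant only changes the iteration clause (here reusing v and n-times from the original program). Iterating over a bare integer goes through generic sequence dispatch, while in-range lets for compile to a simple counting loop:

;; Same loop as the original, but with `in-range`, so `for`
;; specializes to a fixnum-counting loop:
(time (for ([i (in-range n-times)]) (vector-ref v 0)))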
There are two sources of slowdown for ptr-ref: the limited amount of inlining that happens for ptr-ref's implementation, and the way that implementation accommodates multiple pointer representations (e.g., pointer objects versus byte strings). Those are related, but it should be possible to better inline a fast path. It would also be nice to have _double access cooperate with local unboxing, but that may be trickier.
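To illustrate the representation point: ptr-ref accepts several kinds of cpointer? values, so the generic path must dispatch on the representation at run time. A small example (not from the original post):

;; Both a byte string and a `malloc`ed pointer object count as
;; cpointers, so `ptr-ref` has to check which one it received:
(define bs (make-bytes 8))
(ptr-set! bs _int 0 42)
(ptr-ref bs _int 0)  ; => 42 (byte string as a pointer)
(define p (malloc 8 'raw))
(ptr-set! p _int 0 42)
(ptr-ref p _int 0)   ; => 42 (pointer object)
(free p)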
I hope to look into this more in the near future.
If I can get a fast path even just for int64, I could offload the heavy flonum computation to C. It seems I can gain a factor-of-10 speedup this way for my code, except that the conversion from an fxvector to a C array of int64 (via malloc and ptr-set!) reverses the gains compared to pure Racket code, even though the number of computation steps for this conversion should be an order of magnitude smaller than the heavy part.
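For concreteness, the conversion in question is essentially this (a minimal sketch; the function name is illustrative, not from the original post):

;; Copy an fxvector into a freshly `malloc`ed C array of int64.
;; Each `ptr-set!` here hits the slow path discussed below.
(require racket/fixnum)
(define (fxvector->int64-array fxv)
  (define n (fxvector-length fxv))
  (define p (malloc n _int64 'raw))
  (for ([i (in-range n)])
    (ptr-set! p _int64 i (fxvector-ref fxv i)))
  p) ; caller is responsible for calling `free`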
It turns out that ptr-set! with simple types like _ulong or _int64 is about 10 times slower than it should be due to a broken optimization attempt, while ptr-ref is only about 2 times slower than it should be.
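A ptr-set! loop analogous to the ptr-ref stress test above shows the effect (a sketch reusing n-times and len from the original program):

;; `ptr-set!` counterpart to the `ptr-ref` stress test:
(define p (malloc len _ulong 'raw))
(time (for ([i (in-range n-times)]) (ptr-set! p _ulong 0 0)))
(free p)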
I'm getting close to finishing an overhaul of the FFI that makes (ptr-ref p _ulong 0) twice as fast and makes (ptr-set! p _ulong 0 v) similarly fast. I'm hopeful that it will also speed up other FFI tasks.
In the interest of more broadly testing the FFI change, I've pushed the revised FFI implementation for Racket, letting Racket's copy of Chez Scheme temporarily diverge from the main Chez Scheme branch while a PR is considered there.
Here are some rough performance results. The "fix" column shows performance after repairing a broken check in v8.15 that disabled an intended fast path. The "new" column shows the latest Racket with the revised FFI implementation. The "ref-stress-test" group of results corresponds to the original post, but using in-range. The last four examples are about foreign calls as much as (or instead of) ptr-ref and ptr-set!. For example, "math.rkt" uses bf+, which is relevant because it uses MPFR and GMP bindings. Programs here.
There's still room for improvement in foreign calls. A Chez Scheme variant of "plus" runs in 3ms instead of 13ms, which reflects overhead still added by Racket's more dynamic FFI. The round of improvements here seems like a step toward closing that gap, though.
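For context, the kind of foreign call being measured is roughly this (a hedged sketch; the library name and the C function's signature are illustrative, not from the original post):

;; Bind a trivial C function `int plus(int, int)` and call it in a
;; loop, so per-call FFI overhead dominates the measurement:
(define plus
  (get-ffi-obj "plus" (ffi-lib "libplus") (_fun _int _int -> _int)))
(time (for ([i (in-range 1000000)]) (plus 1 2)))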
I've built your new changes, Matthew, and tested them on an M1 MacBook Pro with Herbie on the hamming benchmark suite. This is a small suite, but one that also does more FFI-related computation than the others. It runs fine, no problems. Results:
Overall: 2.2 min down to 2.0 min
"Sampling" phase (FFI-heavy): 52.4s down to 46.1s
"Rival" component (FFI-heavy): 39.0s down to 32.9s
So that's a big win! Other phases are largely unaffected (localize, which is similar to sampling, also speeds up), suggesting that it really was the FFI.
Herbie is both quite large and deterministic, and the two runs I'm comparing used the same seed, so I can say reasonably confidently that the impact was almost entirely the FFI speedup. There also seems to be slightly more memory allocated, but less time spent collecting it. No idea where that's from.
A 10% speedup for Herbie, end-to-end, is pretty amazing. Thank you @mflatt!