Why is this `unsafe-set-car!` so much slower than `unsafe-set-immutable-car!`?

In Racket v8.10 CS, I have implemented `unsafe-set-car!` in a way that is (I think) equivalent to `unsafe-set-immutable-car!`, which is provided by `racket/unsafe/ops`:

#lang racket/base

(module prim racket/base

  (require ffi/unsafe/vm)

  (provide unsafe-set-car!)

  (vm-eval '(define-syntax (unsafe-primitive stx)
              (syntax-case stx ()
                [(_ id)
                 #'($primitive 3 id)])))

  (define unsafe-set-car!
    (vm-eval '(parameterize ([optimize-level 3])
                (lambda (p a) ((unsafe-primitive set-car!) p a))))))

(require racket/unsafe/ops
         'prim)

(let ([p (cons 1 2)])
  (time
   (for ([i (in-range 100000000)])
     (unsafe-set-car! p i)
     ;;(unsafe-set-immutable-car! p i)
     )))

I assumed that unsafe-set-car! and unsafe-set-immutable-car! would be of almost equal speed in that loop, but they weren't.

That method of using virtual machine primitives is the most performant that I know of. Am I missing something in it?

I'm more interested in procedures based on the hashtable primitives but there are no corresponding Racket-level unsafe versions to compare them to, so I used the set-car! primitive as an example instead. The slowness occurs with other primitives too.

IIUC:

I'm using cs: as a prefix for Chez Scheme functions when they differ from the Racket ones.

unsafe-set-immutable-car! is translated as cs:unsafe-set-car!. I think it has a special case in schemify, like most built-in Racket functions.

unsafe-set-car! is translated as a function defined by a cs:lambda in another module, that is not inlined.

Your code is expanded to a linklet that (after a lot of simplifications and lies) is:

(lambda (globals my-unsafe-set-car!)
  (let ([p (cons 1 2)])
    (cs:unsafe-set-car! p 7))
  (let ([p (cons 1 2)])
    ((if (cs:procedure? my-unsafe-set-car!)
         my-unsafe-set-car!
         (slow-extract-function my-unsafe-set-car!))
     p 7)))

I'm surprised it's not even slower.

I thought that perhaps I was just missing some code or something.

Your explanation reminds me of the performance overhead of some FFI calls, so I tried the typical strategy of not crossing the "boundary" as often (to avoid slow-extract-function). That appears to be working well.
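For anyone who finds this later, here is a minimal sketch of what I mean by crossing the boundary less often. It moves the whole loop into the Chez-side lambda, so the Racket code makes one call instead of one call per iteration. The name `unsafe-set-car-loop!` is made up for this example, and I am assuming the Chez primitives `fx<` and `fx+` are visible to `vm-eval` in the same way `set-car!` is. Like the original code, it mutates a pair made with `cons`, which Racket does not normally promise to support, so treat it as an experiment rather than a recommended pattern.

```racket
#lang racket/base

(require ffi/unsafe/vm)

;; Hypothetical helper: runs the entire loop inside Chez Scheme, so the
;; Racket side crosses into the vm-eval'd procedure only once.
(define unsafe-set-car-loop!
  (vm-eval '(parameterize ([optimize-level 3])
              (lambda (p n)
                (let loop ([i 0])
                  (when (fx< i n)
                    ;; Same unsafe primitive as in the original question.
                    (($primitive 3 set-car!) p i)
                    (loop (fx+ i 1))))))))

;; Same iteration count as the loop in the question.
(define n 100000000)
(define p (cons 1 2))

(time (unsafe-set-car-loop! p n))

;; Print the car afterwards to check that the loop really ran.
(displayln (car p))
```

The trade-off is that the loop body now has to be written in Chez Scheme rather than Racket, so this only helps when the hot loop can live entirely on that side.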

Thank you very much!