Requesting help to translate small array program into typed racket

When I added array broadcasting to SRFI 231 (GitHub - gambiteer/srfi-231 at 231-bis), I made some design decisions that differ from those in Racket's math/array, and I would like to compare the two implementations.

I'd like someone to help by adding type annotations to the following small program, which tests the interaction of array broadcasting with folding along an axis, so that it will run in Typed Racket.

Thanks.

Brad

#lang racket

(require math/array)
(require math/matrix)

(require racket/flonum)

(define (matrix-multiply A B)
  (array-axis-fold
   (array-map
    fl*
    (array-axis-insert A 2)
    (array-axis-insert B 0))
   1
   fl+))

(define A (build-array #(100 100) (lambda (multi-index)
                                    (exact->inexact (- (vector-ref multi-index 0)
                                                       (vector-ref multi-index 1))))))

(define B (build-array #(100 100) (lambda (multi-index)
                                    (exact->inexact (+ (vector-ref multi-index 0)
                                                       (vector-ref multi-index 1))))))

(define C (time (matrix-multiply A B)))

(define A* (build-matrix 100 100 (lambda (i j)
                                   (exact->inexact (- i
                                                      j)))))

(define B* (build-matrix 100 100 (lambda (i j)
                                   (exact->inexact (+ i
                                                      j)))))

(define C* (time (matrix* A* B*)))
#lang typed/racket

(require math/array)
(require math/matrix)

(require/typed
 racket/flonum
 [fl+ (-> Float Float Float)]
 [fl* (-> Float Float Float)])

(: matrix-multiply (-> (Matrix Float) (Matrix Float) (Matrix Float)))
(define (matrix-multiply A B)
  (array-axis-fold
   (array-map
    fl*
    (array-axis-insert A 2)
    (array-axis-insert B 0))
   1
   fl+))

(: indexes In-Indexes)
(define indexes #(100 100))

(define A
  (build-array
   indexes
   (λ ([mi : Indexes]) (exact->inexact (- (vector-ref mi 0) (vector-ref mi 1))))))

(define B
  (build-array
   indexes
   (λ ([mi : Indexes]) (exact->inexact (+ (vector-ref mi 0) (vector-ref mi 1))))))

(define C (time (matrix-multiply A B)))

(define A* (build-matrix 100 100 (lambda ({i : Index} {j : Index}) (exact->inexact (- i j)))))

(define B* (build-matrix 100 100 (lambda ({i : Index} {j : Index}) (exact->inexact (+ i j)))))

(define C* (time (matrix* A* B*)))

Thanks! The results on my 11-year-old Linux box, after upping the matrix size to 200x200, were

cpu time: 3378 real time: 3383 gc time: 873
cpu time: 36 real time: 37 gc time: 8

So the built-in matrix* is about 100 times as fast as the routine matrix-multiply (which was written to see how implicit array broadcasting interacted with array-axis-fold).
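To see the broadcasting at work, here is a minimal untyped sketch (my own, with a 2×2 example for illustration; all functions are from math/array) of the shapes that arise inside matrix-multiply:

```racket
#lang racket
(require math/array racket/flonum)

;; For an m×n A and n×p B:
;;   (array-axis-insert A 2) has shape m×n×1
;;   (array-axis-insert B 0) has shape 1×n×p
;; array-map broadcasts both to m×n×p, and array-axis-fold
;; then sums away axis 1 (the shared n axis), leaving m×p.
(define A (array #[#[1.0 2.0] #[3.0 4.0]]))
(define B (array #[#[5.0 6.0] #[7.0 8.0]]))

(array-shape (array-axis-insert A 2))  ; #(2 2 1)
(array-shape (array-axis-insert B 0))  ; #(1 2 2)

(array-axis-fold (array-map fl*
                            (array-axis-insert A 2)
                            (array-axis-insert B 0))
                 1
                 fl+)
; = (array #[#[19.0 22.0] #[43.0 50.0]]), i.e. the matrix product A·B
```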

So that's about 8,000,000 floating-point multiply-adds in 36 ms, or roughly 220 million multiply-add operations per second. Pretty good.

Similar code using the SRFI 231 followup library, with safe code in Gambit:

(define (matrix-multiply-2 A B)
  ;; We rely on implicit broadcasting of the array arguments A and B:
  ;; (array-insert-axis A 2) has shape m×n×1, (array-insert-axis B 0)
  ;; has shape 1×n×p, and array-map broadcasts both to m×n×p.
  (array-copy!
   (array-map (lambda (products)
                ;; sum each 1-D subarray of products along the shared axis
                (array-fold-left fl+ 0. products))
              (array-curry
               ;; permute the shared axis to the end so that array-curry
               ;; peels it off as the 1-D subarrays to be summed
               (array-permute (array-map
                               fl*
                               (array-insert-axis A 2)
                               (array-insert-axis B 0))
                              (index-last 3 1))
               1))))


(define A (array-copy! (make-array (make-interval '#(200 200))
                                   (lambda (i j) (inexact (- i j))))))

(define B (array-copy! (make-array (make-interval '#(200 200))
                                   (lambda (i j) (inexact (+ i j))))))

(define D (time (matrix-multiply-2 A B)))

(define G (time (array-copy! (array-inner-product A fl+ fl* B))))

the timings are

(time (matrix-multiply-2 A B))
    0.380979 secs real time
    0.380508 secs cpu time (0.379516 user, 0.000992 system)
    7 collections accounting for 0.003366 secs real time (0.003349 user, 0.000012 system)
    20821184 bytes allocated
    78 minor faults
    no major faults
    1368323814 cpu cycles
(time (array-copy! (array-inner-product A fl+ fl* B)))
    0.349888 secs real time
    0.349500 secs cpu time (0.349500 user, 0.000000 system)
    9 collections accounting for 0.005723 secs real time (0.005717 user, 0.000000 system)
    23837600 bytes allocated
    78 minor faults
    no major faults
    1256656605 cpu cycles

The Typed Racket results were so good that I looked at ./math-lib/math/private/matrix/matrix-arithmetic.rkt, which leads me to ask whether matrix* is an open-coded macro when used in this way.

The matrix multiplication is defined in

So I think the purpose of .../private/matrix/matrix-arithmetic.rkt is solely to make matrix* work from both untyped and typed Racket at the same time.

But the code is involved. The actual sum-and-multiply loop is here:

The macro stepper yields the following for the line in question, which shows an application of the function make-matrix-multiply:

(define-values:365 (C*)
     (let-values:366 (((v:367 cpu:367 user:367 gc:367)
                       (#%app:367
                        time-apply:367
                        (lambda:367 ()
                          (let-values:368 (((arr:369)
                                            (with-continuation-mark:370
                                             parameterization-key:370
                                             (#%app:370
                                              extend-parameterization:370
                                              (#%app:370 continuation-mark-set-first:370 (quote #f) parameterization-key:370)
                                              array-strictness:371
                                              (quote #f))
                                             (let-values:373 ()
                                               (let-values:374 (((m:375 p:375 n:375 arr-data:375 brr-data:375 bx:375)
                                                                 (#%app:376 matrix-multiply-data:377 A* B*)))
                                                 (#%app:378
                                                  make-matrix-multiply:379
                                                  m:375
                                                  p:375
                                                  n:375
                                                  (lambda:380 (i:375 j:375)
                                                    (let-values:382 (((bx:375) (#%app:383 bx:375)))
                                                      (let-values:382 (((v:375)
                                                                        (#%app:384
                                                                         unsafe-fl*
                                                                         (#%app:385 unsafe-vector-ref:375 arr-data:375 i:375)
                                                                         (#%app:386 unsafe-vector-ref:375 brr-data:375 j:375))))
                                                        (#%app:387
                                                         (letrec-values:388 (((loop:375)
                                                                              (lambda:389 (k:375 v:375)
                                                                                (if (#%app:391 unsafe-fx< k:375 p:375)
                                                                                  (let-values:392 ()
                                                                                    (#%app:393
                                                                                     loop:375
                                                                                     (#%app:394 unsafe-fx+ k:375 (quote 1))
                                                                                     (#%app:395
                                                                                      unsafe-fl+
                                                                                      v:375
                                                                                      (#%app:396
                                                                                       unsafe-fl*
                                                                                       (#%app:397
                                                                                        unsafe-vector-ref:375
                                                                                        arr-data:375
                                                                                        (#%app:398 unsafe-fx+ i:375 k:375))
                                                                                       (#%app:399
                                                                                        unsafe-vector-ref:375
                                                                                        brr-data:375
                                                                                        (#%app:400 unsafe-fx+ j:375 k:375))))))
                                                                                  (let-values:401 ()
                                                                                    (#%app:402 set-box!:375 bx:375 v:375))))))
                                                           loop:375)
                                                         (quote 1)
                                                         v:375)
                                                        (#%app:403 unsafe-unbox bx:375))))))))))
                            (#%app:404 array-default-strict!:405 arr:369)
                            arr:369))
                        null:367)))
       (#%app:406 printf:367 (quote "cpu time: ~s real time: ~s gc time: ~s\n") cpu:367 user:367 gc:367)
       (#%app:367 apply:367 values:367 v:367)))

Thanks, I think I understand Racket’s array/matrix code better.

I want to ask: the definition of matrix*/ns in ./math-lib/math/private/matrix/typed-matrix-arithmetic.rkt is

(: matrix*/ns
   (case-> ((Matrix Flonum) (Listof (Matrix Flonum)) -> (Matrix Flonum))
           ((Matrix Real) (Listof (Matrix Real)) -> (Matrix Real))
           ((Matrix Float-Complex) (Listof (Matrix Float-Complex)) -> (Matrix Float-Complex))
           ((Matrix Number) (Listof (Matrix Number)) -> (Matrix Number))))
(define (matrix*/ns a as)
  (cond [(empty? as) (matrix-shape a) ;; does argument checking
                      a]
        [else  (matrix*/ns (inline-matrix-multiply a (first as)) (rest as))]))

and inline-matrix-multiply is defined as a macro in ./math-lib/math/private/matrix/untyped-matrix-arithmetic.rkt with the comment

  ;; This is a macro so the result can have as precise a type as possible

My question: Does this mean that Typed Racket compiles a version of matrix*/ns specialized to each of the type cases in the signature?

Studying this code leads me to reconsider the specifications of array-outer-product and array-inner-product in SRFI 231, which now seem a bit sloppy. Some things perhaps to copy from how matrix* is coded:

  1. If necessary, copy argument arrays to specialized (strict) arrays, after permuting the second array argument to place its first axis last.
  2. Return a specialized (strict) array.
  3. Right now, the calling sequence is
(array-inner-product A f g B)

and the inner loop is computed by (a and b are one-dimensional subarrays of A and B):

(lambda (a b)
  (array-reduce f (array-map g a b)))

where, for the matrix* loop, g is *, and f is +.

Perhaps the calling sequence should be

(array-inner-product A f-of-g g B)

where for matrix*, for example, g would be * and f-of-g would be

(lambda (sum a b) (+ sum (* a b)))

so each iteration would make one function call instead of two (not counting the + and *, which can be inlined by the compiler, as in Gambit).
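To make the difference concrete, here is a sketch of the two inner-loop shapes (my own illustration, using plain Scheme vectors as stand-ins for the 1-D subarrays a and b):

```scheme
;; Calling sequence (array-inner-product A f g B): in effect
;;   (array-reduce f (array-map g a b))
;; so two calls (one to g, one to f) per element:
(define (dot/f-g f g a b)
  (let loop ((k 1) (acc (g (vector-ref a 0) (vector-ref b 0))))
    (if (< k (vector-length a))
        (loop (+ k 1) (f acc (g (vector-ref a k) (vector-ref b k))))
        acc)))

;; Calling sequence (array-inner-product A f-of-g g B): the fold and the
;; elementwise operation are fused, so one call per element remains
;; (g is still needed to compute the initial accumulator):
(define (dot/f-of-g f-of-g g a b)
  (let loop ((k 1) (acc (g (vector-ref a 0) (vector-ref b 0))))
    (if (< k (vector-length a))
        (loop (+ k 1) (f-of-g acc (vector-ref a k) (vector-ref b k)))
        acc)))

;; (dot/f-g fl+ fl* '#(1. 2.) '#(3. 4.))  => 11.
;; (dot/f-of-g (lambda (sum x y) (fl+ sum (fl* x y))) fl*
;;             '#(1. 2.) '#(3. 4.))       => 11.
```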

I'll have to think about things more.

On Jan 15, 2026, at 1:59 PM, Gambiteer notifications@racket.discoursemail.com asked: "Does this mean that Typed Racket compiles a version of matrix*/ns specialized to each of the type cases in the signature?"

Yes. TR does have type abstraction (polymorphism), but by using a macro, the primitive operations can be specialized to the type that's provided.
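A schematic illustration of why the macro helps (mine, not the actual math/matrix code): because the loop body is spliced in by a macro, Typed Racket type-checks a fresh copy of it at each use site, so the + and * below are resolved at Flonum, Real, Float-Complex, or Number separately, one copy per arm of the case-> type:

```racket
#lang racket
;; Hypothetical stand-in for inline-matrix-multiply: each expansion of
;; inline-dot is typechecked (and optimized) at its use site's element type.
(define-syntax-rule (inline-dot a b n)
  ;; a, b : vectors of the same numeric type; n : their common length
  (let loop ([k 1] [acc (* (vector-ref a 0) (vector-ref b 0))])
    (if (< k n)
        (loop (+ k 1) (+ acc (* (vector-ref a k) (vector-ref b k))))
        acc)))

;; A plain function with a case-> type would instead dispatch on one body;
;; the macro gives a specialized body per type case at no runtime cost.
(inline-dot (vector 1.0 2.0) (vector 3.0 4.0) 2)  ; => 11.0
```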

Summary: Changing the calling sequence of array-inner-product from

(array-inner-product A f g B)

to

(array-inner-product A f-of-g g B)

speeds up, e.g., matrix multiplication by a factor of about 1.5.

Question: Is this speedup enough to justify having a "nonstandard" calling sequence?

I tested these changes. Keeping the original calling sequence A f g B, I see with "safe" code:

(time (array-inner-product A (lambda (x y) (fl+ x y)) (lambda (x y) (fl* x y)) B))
    0.190770 secs real time
    0.190363 secs cpu time (0.190362 user, 0.000001 system)
    no collections
    3019936 bytes allocated
    451 minor faults
    no major faults
685150626 cpu cycles

and with "unsafe" code we have

(time (array-inner-product A (lambda (x y) (fl+ x y)) (lambda (x y) (fl* x y)) B))
    0.184475 secs real time
    0.183906 secs cpu time (0.181892 user, 0.002014 system)
    no collections
    3019936 bytes allocated
    449 minor faults
    no major faults
    662542767 cpu cycles

We get more of an improvement with the calling sequence A f-of-g g B; so with safe code

(time (array-inner-product A (lambda (sum x y) (fl+ sum (fl* x y))) (lambda (x y) (fl* x y)) B))
    0.137320 secs real time
    0.137020 secs cpu time (0.136496 user, 0.000524 system)
    no collections
    3019936 bytes allocated
    451 minor faults
    no major faults
    493176873 cpu cycles

and with unsafe code

(time (array-inner-product A (lambda (sum x y) (fl+ sum (fl* x y))) (lambda (x y) (fl* x y)) B))
    0.124083 secs real time
    0.123638 secs cpu time (0.123636 user, 0.000002 system)
    no collections
    3019936 bytes allocated
    449 minor faults
    no major faults
    445633485 cpu cycles

So after the internal changes, changing the arguments of array-inner-product to f-of-g and g instead of f and g, we get a speedup of about 1.5 times.

So, is it worth changing the arguments to array-inner-product for a 1.5 times speedup?

OK, I think I understand which things cause the performance differences.

  1. Using (array-inner-product A fl+ fl* B):
    0.190363 secs cpu time (0.190362 user, 0.000001 system)
  2. Using (array-inner-product A (lambda (sum x y) (fl+ sum (fl* x y))) (lambda (x y) (fl* x y)) B):
    0.137020 secs cpu time (0.136496 user, 0.000524 system)
  3. Using generic + and * hard-coded into the inner loop (which in Gambit is analogous to what's done in math/matrix's matrix*):
    0.075155 secs cpu time (0.075154 user, 0.000001 system)
  4. Taking advantage of the fact that + and * don't use call/cc (doesn't really improve CPU time, but cuts memory allocation by 2/3):
    0.075163 secs cpu time (0.075133 user, 0.000030 system)
  5. Hard-coding fl+ and fl* rather than using generic + and * (this code is unsafe):
    0.061321 secs cpu time (0.061217 user, 0.000104 system)
  6. Hard-coding vector-ref as the accessor into the array body instead of allowing different underlying uniform vectors to store the data (f64vector, f32vector, ...), which is still safe since we can control the storage internally to the library:
    0.033769 secs cpu time (0.033768 user, 0.000001 system)

The final result is more or less what math/matrix's matrix* achieves, but Typed Racket allows one to do it safely and with less hand coding.

Thanks for answering my questions; the exercise was worthwhile for understanding the trade-offs among safety, internal data structures, and performance.

Brad


And again broadcasting raises its head.

Copying array arguments to array-inner-product and array-outer-product to "generic" specialized (strict) arrays with entries stored in the "correct" order allows one to (a) use simplified indexing, and (b) hard-code vector-ref as the accessor into array bodies, cutting multiple non-tail calls from computing each entry of the result.

But if one of the arguments is a broadcast specialized (strict) array, where each entry is possibly reused multiple times for different multi-indices, then the copied array argument could be arbitrarily bigger than the array argument itself, possibly defeating the intent of the programmer.
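A small illustration of the blow-up (mine, written with math/array's array-broadcast since it makes the sharing explicit; the same issue arises with SRFI 231's broadcast arrays):

```racket
#lang racket
(require math/array)

;; A broadcast array reuses its 3 stored elements for every row:
(define row (array #[#[1.0 2.0 3.0]]))       ; shape 1×3, 3 stored elements
(define big (array-broadcast row #(1000 3))) ; shape 1000×3, same 3 values

;; Copying big to a strict array materializes all 1000 × 3 entries,
;; 1000 times the storage of the original argument:
(array-size big)  ; => 3000
```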

In SRFI 231 we already allow the bodies of arrays to be uniform vectors with entries of size 1, 8, 16, 32, 64, or 128 bits, presumably to optimize memory usage and access patterns. This decision by itself can add run-time overhead to array accesses (but the sample implementation inlines the vector accessors for known storage classes).

So the sample implementation will likely copy to specialized arrays only generalized (nonstrict) array arguments to array-outer-product and array-inner-product. I don't know yet what to put into the SRFI followup specification.

I've now updated the specification and implementation of array-inner-product and array-outer-product in the followup to SRFI 231: GitHub - gambiteer/srfi-231 at 231-bis

Both of these routines almost always re-use each element of the array arguments multiple times, so generalized (nonstrict) array arguments are copied to specialized (strict) arrays: we have no way to know how expensive it is to compute an element of a generalized array, so we want to compute and store each element once before proceeding. (Generally speaking, I prefer each "bulk" array operation to compute each element of its argument arrays at most once.)

The routine array-outer-product returns a generalized (nonstrict) array because the result is often "consumed" by another bulk operation before being copied or assigned to a specialized (strict) array (see LU matrix decomposition, for example).

The routine array-inner-product returns a specialized array because, generally, computing each element of the result involves a computationally intensive loop, so we do it only once for each element.

The inner loop of array-inner-product performs six non-tail calls per iteration, so it is not particularly fast when the two procedural arguments are cheap (e.g., floating-point addition and multiplication), but I think the performance is good enough for a general routine. Users who need significantly higher performance can write custom loops in Scheme or use an FFI.

The documentation has also been (finally) updated.

Comments welcome.

Brad