Yes, thanks. On my 10-year-old Linux box, I see this difference between untyped Racket and typed Racket:
heine:~/lang/scheme/srfi-231/srfi-231-bis/broadcasting-examples> /usr/local/racket/bin/racket racket-insert-axis-test.rkt
cpu time: 30928 real time: 31006 gc time: 522
heine:~/lang/scheme/srfi-231/srfi-231-bis/broadcasting-examples> /usr/local/racket/bin/racket typed-racket-insert-axis-test.rkt
cpu time: 800 real time: 801 gc time: 147
I understand that this difference in performance between untyped and typed Racket is to be expected.
I haven't yet implemented implicit array broadcasting, so the following Gambit + "SRFI 231 plus" code is a bit clumsy, but I think it makes assumptions similar to those in the Racket code:
(declare (standard-bindings)
         (extended-bindings)
         (block)
         ;; (not safe)
         )
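;; (standard-bindings) and (extended-bindings) tell the compiler that the
;; standard and Gambit-specific procedures (fl+, fl-, flsqrt, ...) have their
;; usual definitions, so it can inline them; (block) says the top-level
;; definitions in this file are not redefined elsewhere; (not safe) is left
;; commented out, so run-time checks stay on.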
(define (make-random-3d-points number-of-points)
  (let ((domain (make-interval (vector number-of-points 3))))
    (specialized-array-reshape
     (make-specialized-array-from-data
      (random-f64vector (interval-volume domain))
      f64-storage-class)
     domain)))
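;; Each call returns a number-of-points x 3 specialized array of
;; f64-storage-class, so x-points and y-points below are 1000 x 3 arrays
;; of random coordinates.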
(define x-points (make-random-3d-points 1000))
(define y-points (make-random-3d-points 1000))
(define (Frobenius A B)
  ;; each array-map creates a new 1D array
  (flsqrt (array-reduce fl+
                        (array-map (lambda (a b)
                                     (flsquare (fl- a b)))
                                   A B))))
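;; An illustrative call (not part of the benchmark), using SRFI 231's
;; make-array to build two small generalized arrays of flonums:
;;   (Frobenius
;;    (make-array (make-interval '#(3)) (lambda (i) (vector-ref '#(0. 0. 0.) i)))
;;    (make-array (make-interval '#(3)) (lambda (i) (vector-ref '#(1. 2. 2.) i))))
;;   => 3.   ;; sqrt((0-1)^2 + (0-2)^2 + (0-2)^2), the Euclidean distance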
(pp "naive-mutual-distances")
(define naive-mutual-distances
  (time
   (array-copy!
    (array-outer-product
     Frobenius
     ;; accessing each element of the following curried
     ;; arrays creates a 1D array of length 3.
     (array-curry x-points 1)
     (array-curry y-points 1)))))
(pp "mutual-distances")
(define mutual-distances
  (time
   (array-copy!
    (array-outer-product
     Frobenius
     ;; here we cut down on array creation by
     ;; copying each element of the curried
     ;; arrays at creation time.
     (array-copy! (array-curry x-points 1))
     (array-copy! (array-curry y-points 1))))))
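;; A sketch of why this helps, following the comment above: the outer
;; product's getter calls the curried arrays' getters on every access, so
;; the naive version creates two fresh length-3 views for each of the
;; 1,000,000 output elements; copying the curried arrays up front creates
;; each of the 2,000 views just once and reuses them.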
(pp "broadcast-mutual-distances")
(define broadcast-mutual-distances
  (time
   (let* ((X* (array-insert-axis x-points 0))
          (Y* (array-insert-axis y-points 1))
          (new-domain (compute-broadcast-interval
                       (list (array-domain X*)
                             (array-domain Y*)))))
     (array-copy!
      ;; Copying each element requires creating
      ;; a 1D array with three elements (from the
      ;; array-curry).  There's a lot of overhead.
      (array-map
       ;; take square root of sums of squares of
       ;; differences
       (lambda (x-y)
         (flsqrt (array-reduce fl+ x-y)))
       (array-curry
        ;; convert to a 2D array with 1,000,000
        ;; elements of 1D arrays with 3 elements
        (array-map
         ;; square the differences of the components
         ;; of all combinations of pairs of x-points
         ;; and y-points
         (lambda (x y)
           (flsquare (fl- x y)))
         (array-broadcast X* new-domain)
         (array-broadcast Y* new-domain))
        1))))))
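;; To spell out the index bookkeeping above (a sketch of the shapes,
;; given the 1000-point arrays defined earlier):
;;   (array-domain X*)  is  [0,1) x [0,1000) x [0,3)      ;; new axis at position 0
;;   (array-domain Y*)  is  [0,1000) x [0,1) x [0,3)      ;; new axis at position 1
;;   new-domain         is  [0,1000) x [0,1000) x [0,3)   ;; width-1 axes stretched
;; so the inner array-map ranges over 3,000,000 squared coordinate
;; differences, and the array-curry groups them into 1,000,000 length-3
;; arrays.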
(pp "broadcast-mutual-distances-2")
(define broadcast-mutual-distances-2
  (time
   (let* ((X* ;; axis 1 is the number of x-points
           (array-insert-axis x-points 0))
          (Y* ;; axis 0 is the number of y-points
           (array-insert-axis y-points 1))
          (new-domain
           (compute-broadcast-interval
            (list (array-domain X*) (array-domain Y*)))))
     (array-copy!
      ;; Until now, the only computations were trivial:
      ;; to broadcast X* and Y* to new-domain, and to
      ;; create a list of three generalized arrays; the
      ;; calls to array-map just set up operations to be done
      ;; later, the calls to array-permute, array-curry,
      ;; and array-broadcast just set up some index
      ;; manipulations.
      ;; Now we copy the result to a specialized array
      ;; of generic-storage-class.
      (apply
       array-map
       ;; sum the squares of differences and take square root
       (lambda (x y z)
         (flsqrt (fl+ x y z)))
       (array->list
        ;; convert to a length 3 list of two-dimensional arrays
        (array-curry
         ;; convert to a length 3, 1D array of 2D arrays with
         ;; 1,000,000 elements
         (array-permute
          ;; put the last axis, with point coordinates, first
          (array-map
           ;; subtract and square the difference between all
           ;; combinations of pairs of x-points and y-points
           (lambda (x y) (flsquare (fl- x y)))
           (array-broadcast X* new-domain)
           (array-broadcast Y* new-domain))
          '#(2 0 1))
         2)))))))
(load "gambit-insert-axis-test")
"naive-mutual-distances"
(time (array-copy! (array-outer-product Frobenius (array-curry x-points 1) (array-curry y-points 1))))
0.976458 secs real time
0.971872 secs cpu time (0.965893 user, 0.005979 system)
4 collections accounting for 0.043651 secs real time (0.041417 user, 0.002050 system)
3319839144 bytes allocated
1953 minor faults
no major faults
3507053602 cpu cycles
"mutual-distances"
(time (array-copy! (array-outer-product Frobenius (array-copy! (array-curry x-points 1)) (array-copy! (array-curry y-points 1)))))
0.480126 secs real time
0.477921 secs cpu time (0.474925 user, 0.002996 system)
1 collection accounting for 0.011302 secs real time (0.011184 user, 0.000015 system)
1138227584 bytes allocated
1953 minor faults
no major faults
1724408337 cpu cycles
"broadcast-mutual-distances"
(time (let* ((X* (array-insert-axis x-points 0)) (Y* (array-insert-axis y-points 1)) (new-domain (compute-broadcast-interval (list (array-domain X*) (array-domain Y*))))) (array-copy! (array-map (lambda (x-y) (flsqrt (array-reduce fl+ x-y))) (array-curry (array-map (lambda (x y) (flsquare (fl- x y))) (array-broadcast X* new-domain) (array-broadcast Y* new-domain)) 1)))))
0.201752 secs real time
0.200669 secs cpu time (0.196728 user, 0.003941 system)
no collections
328027568 bytes allocated
1954 minor faults
no major faults
724550841 cpu cycles
"broadcast-mutual-distances-2"
(time (let* ((X* (array-insert-axis x-points 0)) (Y* (array-insert-axis y-points 1)) (new-domain (compute-broadcast-interval (list (array-domain X*) (array-domain Y*))))) (array-copy! (apply array-map (lambda (x y z) (flsqrt (fl+ x y z))) (array->list (array-curry (array-permute (array-map (lambda (x y) (flsquare (fl- x y))) (array-broadcast X* new-domain) (array-broadcast Y* new-domain)) '#(2 0 1)) 2))))))
0.167176 secs real time
0.166388 secs cpu time (0.164387 user, 0.002001 system)
no collections
8029312 bytes allocated
1953 minor faults
no major faults
600407961 cpu cycles
"/home/lucier/lang/scheme/srfi-231/srfi-231-bis/broadcasting-examples/gambit-insert-axis-test.o7"
So the straightforward broadcasting algorithm is faster than the non-naive outer-product algorithm (0.20 seconds versus 0.48 seconds). Exploiting the knowledge that these are three-vectors, and decomposing the 3D data grid into planes instead of lines, cuts the garbage generated to almost nothing (27,568 bytes of overhead against 8,000,000 bytes of answer) and shaves off a bit more time. If I declare (not safe), run times decrease further, but I don't think that would correspond well with the Racket code.
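To make the "planes instead of lines" difference concrete (a rough sketch; squared-diffs here is just an illustrative name for the inner array-map result, a 1000 x 1000 x 3 generalized array):

(array-curry squared-diffs 1)                                  ;; 1,000,000 length-3 arrays, one per output element
(array-curry (array-permute squared-diffs '#(2 0 1)) 2)        ;; 3 two-dimensional plane views, created once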
I'll continue with the broadcasting implementation.