Arrays proposal: Help to write idiomatic typed math/array example

Racket's math/array is a very large library. I'm trying to refine and expand SRFI 231 into a library that is smaller than math/array but still has sufficient performance and expressiveness.

Examples using array libraries in many languages rely on array broadcasting, so I'm in the process of adding it to SRFI 231's follow-up. One standard example computes the mutual distances between pairs of 3-vectors.

In the existing SRFI 231, the natural way to approach this problem is through an outer product:

(define (make-random-3d-points number-of-points)
  (let ((domain (make-interval (vector number-of-points 3))))
    (specialized-array-reshape
     (make-specialized-array-from-data
      (random-f64vector (interval-volume domain))
      f64-storage-class)
     domain)))

(define x-points (make-random-3d-points 1000))
(define y-points (make-random-3d-points 1000))

(define (Frobenius A B)
  ;; each array-map creates a new 1D array of length 3
  (flsqrt (array-reduce fl+
                        (array-map (lambda (a b)
                                     (flsquare (fl- a b)))
                                   A B))))
(define naive-mutual-distances
  (time
   (array-copy!
    (array-outer-product
     Frobenius
     ;; accessing each element of the following curried
     ;; arrays creates a 1D array of length 3.
     (array-curry x-points 1)
     (array-curry y-points 1)))))
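Since the question is about idioms across array libraries, here is the underlying computation in plain Python with lists of lists (the representation and helper names are mine, just to make the shape of the outer product explicit):

```python
from math import sqrt

def frobenius(a, b):
    # Euclidean distance between two 3-vectors
    return sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def mutual_distances(xs, ys):
    # the outer product: one distance for every pair (x, y)
    return [[frobenius(x, y) for y in ys] for x in xs]
```

For two sets of 1000 points this builds the same 1000 x 1000 table of distances that array-outer-product produces above.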

There are various tricks one can use to essentially eliminate the construction of the many small three-element arrays and cut the CPU time by a factor of 10 or more, but what I'm interested in is how to express this idiomatically in Racket, and especially in Typed Racket. So far I have a couple of tests, and the following code appears to work:

#lang racket

(require srfi/27)   ; for random-real
(require math/array)

(define (make-points n)
  (parameterize ((array-strictness #t))
    (build-array (vector n 3)
                 (lambda (ds) (random-real)))))

(define x-points (make-points 1000))
(define y-points (make-points 1000))

(define distances
  (time
   (array-strict
    (parameterize ((array-strictness #f))
      (array-sqrt
       (array-axis-sum
        (array-sqr
         (array-
          (array-axis-insert x-points 0)
          (array-axis-insert y-points 1)))
        2))))))

(define distances-1
  (time
   (array-strict
    (parameterize ((array-strictness #t))
      (array-sqrt
       (array-axis-sum
        (array-sqr
         (array-
          (array-axis-insert x-points 0)
          (array-axis-insert y-points 1)))
        2))))))

To really be a fair comparison, this code should be written in Typed Racket, but I don't know how to do that. I also don't see how to use flarrays to achieve the same thing, but if someone can show me how to do so, I'd be interested in that, too.

Thanks in advance.

Brad

Something like this?

#lang typed/racket
(require math/array)

(: make-points : Index -> FlArray)
(define (make-points n)
  (array->flarray
   (build-array (vector n 3)
                (λ: ([js : Indexes])
                  (random)))))

(define x-points (make-points 1000))
(define y-points (make-points 1000))

(define distances
  (time
   (array-strict
    (parameterize ((array-strictness #f))
      (array-sqrt
       (array-axis-sum
        (array-sqr
         (array-
          (array-axis-insert x-points 0)
          (array-axis-insert y-points 1)))
        2))))))

Yes, thanks. On my 10-year-old Linux box, I see this difference between untyped Racket and typed Racket:

heine:~/lang/scheme/srfi-231/srfi-231-bis/broadcasting-examples> /usr/local/racket/bin/racket racket-insert-axis-test.rkt
cpu time: 30928 real time: 31006 gc time: 522
heine:~/lang/scheme/srfi-231/srfi-231-bis/broadcasting-examples> /usr/local/racket/bin/racket typed-racket-insert-axis-test.rkt
cpu time: 800 real time: 801 gc time: 147

I understand that this difference in performance between untyped and typed Racket is to be expected.

I haven't yet implemented implicit array broadcasting, so the following Gambit+"SRFI 231 plus" code is a bit clumsy, but I think the assumptions in the code are similar to the Racket code:

(declare (standard-bindings)
  (extended-bindings)
  (block)
  ;; (not safe)
  )

(define (make-random-3d-points number-of-points)
  (let ((domain (make-interval (vector number-of-points 3))))
    (specialized-array-reshape
     (make-specialized-array-from-data
      (random-f64vector (interval-volume domain))
      f64-storage-class)
     domain)))

(define x-points (make-random-3d-points 1000))
(define y-points (make-random-3d-points 1000))

(define (Frobenius A B)
  ;; each array-map creates a new 1D array
  (flsqrt (array-reduce fl+
                        (array-map (lambda (a b)
                                     (flsquare (fl- a b)))
                                   A B))))

(pp "naive-mutual-distances")

(define naive-mutual-distances
  (time
   (array-copy!
    (array-outer-product
     Frobenius
     ;; accessing each element of the following curried
     ;; arrays creates a 1D array of length 3.
     (array-curry x-points 1)
     (array-curry y-points 1)))))

(pp "mutual-distances")

(define mutual-distances
  (time
   (array-copy!
    (array-outer-product
     Frobenius
     ;; here we cut down on array creation by
     ;; copying each element of the curried
     ;; arrays at creation time.
     (array-copy! (array-curry x-points 1))
     (array-copy! (array-curry y-points 1))))))

(pp "broadcast-mutual-distances")
(define broadcast-mutual-distances
  (time
   (let* ((X* (array-insert-axis x-points 0))
          (Y* (array-insert-axis y-points 1))
          (new-domain (compute-broadcast-interval
                       (list (array-domain X*)
                             (array-domain Y*)))))
     (array-copy!
      ;; Copying each element requires creating
      ;; a 1D array with three elements (from the
      ;; array-curry).  There's a lot of overhead.
      (array-map
       ;; take square root of sums of squares of
       ;; differences
       (lambda (x-y)
         (flsqrt (array-reduce fl+ x-y)))
       (array-curry
        ;; convert to a 2D array with 1,000,000
        ;; elements of 1D arrays with 3 elements
        (array-map
         ;; square the differences of the components
         ;; of all combinations of pairs of x-points
         ;; and y-points
         (lambda (x y)
           (flsquare (fl- x y)))
         (array-broadcast X* new-domain)
         (array-broadcast Y* new-domain))
        1))))))

(pp "broadcast-mutual-distances-2")
(define broadcast-mutual-distances-2
  (time
   (let* ((X*   ;; axis 1 is the number of x-points
           (array-insert-axis x-points 0))
          (Y*   ;; axis 0 is the number of y-points
           (array-insert-axis y-points 1))
          (new-domain
           (compute-broadcast-interval
            (list (array-domain X*) (array-domain Y*)))))
     (array-copy!
      
     ;; Until now, the only computations were trivial:
     ;; to broadcast X* and Y* to new-domain, and to
     ;; create a list of three generalized arrays; the
     ;; calls to array-map just set up operations to be done
     ;; later, the calls to array-permute, array-curry,
     ;; and array-broadcast just set up some index
     ;; manipulations.
     
     ;; Now we copy the result to a specialized array
     ;; of generic-storage-class.
     
      (apply
       array-map
       ;; sum the squares of differences and take square root
       (lambda (x y z)
         (flsqrt (fl+ x y z)))
       (array->list
        ;; convert to a length 3 list of two-dimensional arrays
        (array-curry
         ;; convert to a length 3, 1D array of 2D arrays with
         ;; 1,000,000 elements
         (array-permute
          ;; put the last axis, with point coordinates, first
          (array-map
           ;; subtract and square the difference between all
           ;; combinations of pairs of x-points and y-points
           (lambda (x y) (flsquare (fl- x y)))
           (array-broadcast X* new-domain) 
           (array-broadcast Y* new-domain))
          '#(2 0 1))
         2)))))))
(load "gambit-insert-axis-test")
"naive-mutual-distances"
(time (array-copy! (array-outer-product Frobenius (array-curry x-points 1) (array-curry y-points 1))))
    0.976458 secs real time
    0.971872 secs cpu time (0.965893 user, 0.005979 system)
    4 collections accounting for 0.043651 secs real time (0.041417 user, 0.002050 system)
    3319839144 bytes allocated
    1953 minor faults
    no major faults
    3507053602 cpu cycles
"mutual-distances"
(time (array-copy! (array-outer-product Frobenius (array-copy! (array-curry x-points 1)) (array-copy! (array-curry y-points 1)))))
    0.480126 secs real time
    0.477921 secs cpu time (0.474925 user, 0.002996 system)
    1 collection accounting for 0.011302 secs real time (0.011184 user, 0.000015 system)
    1138227584 bytes allocated
    1953 minor faults
    no major faults
    1724408337 cpu cycles
"broadcast-mutual-distances"
(time (let* ((X* (array-insert-axis x-points 0)) (Y* (array-insert-axis y-points 1)) (new-domain (compute-broadcast-interval (list (array-domain X*) (array-domain Y*))))) (array-copy! (array-map (lambda (x-y) (flsqrt (array-reduce fl+ x-y))) (array-curry (array-map (lambda (x y) (flsquare (fl- x y))) (array-broadcast X* new-domain) (array-broadcast Y* new-domain)) 1)))))
    0.201752 secs real time
    0.200669 secs cpu time (0.196728 user, 0.003941 system)
    no collections
    328027568 bytes allocated
    1954 minor faults
    no major faults
    724550841 cpu cycles
"broadcast-mutual-distances-2"
(time (let* ((X* (array-insert-axis x-points 0)) (Y* (array-insert-axis y-points 1)) (new-domain (compute-broadcast-interval (list (array-domain X*) (array-domain Y*))))) (array-copy! (apply array-map (lambda (x y z) (flsqrt (fl+ x y z))) (array->list (array-curry (array-permute (array-map (lambda (x y) (flsquare (fl- x y))) (array-broadcast X* new-domain) (array-broadcast Y* new-domain)) '#(2 0 1)) 2))))))
    0.167176 secs real time
    0.166388 secs cpu time (0.164387 user, 0.002001 system)
    no collections
    8029312 bytes allocated
    1953 minor faults
    no major faults
    600407961 cpu cycles
"/home/lucier/lang/scheme/srfi-231/srfi-231-bis/broadcasting-examples/gambit-insert-axis-test.o7"

So the straightforward broadcasting algorithm is faster than the non-naive outer-product algorithm (0.20 seconds versus 0.48 seconds), and exploiting the knowledge that these are 3-vectors and decomposing the 3D data grid into planes instead of lines cuts the garbage generated to almost nothing (29,312 bytes of overhead on 8,000,000 bytes of answer) and shaves a bit more off the time. If I declare `(not safe)`, the run times decrease further, but I don't think that corresponds well with the Racket code.
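For reference, the broadcast-domain computation (what compute-broadcast-interval does above) can be sketched in plain Python, assuming it follows the usual NumPy-style rule that axes are aligned on the right and an extent of 1 stretches to match the other arrays; the function name and that assumption are mine:

```python
def broadcast_shape(*shapes):
    # align axes on the right by padding with extent-1 axes
    rank = max(len(s) for s in shapes)
    padded = [(1,) * (rank - len(s)) + tuple(s) for s in shapes]
    result = []
    for dims in zip(*padded):
        extent = max(dims)
        # along each axis, extents must agree or be 1
        if any(d not in (1, extent) for d in dims):
            raise ValueError("shapes are not broadcast-compatible")
        result.append(extent)
    return tuple(result)
```

With the axis insertions above, X* has shape (1, 1000, 3) and Y* has shape (1000, 1, 3), which broadcast to (1000, 1000, 3).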

I'll continue with the broadcasting implementation.

In this context yes.

It's not a general difference in performance, though. If the untyped Racket version used an entirely untyped implementation of arrays, the difference would be much smaller. The observed slowdown comes from repeated contract checking at the boundary where the untyped program calls into the typed math/array library.

I was grepping GitHub for instances of "require math/array" in Racket code and came across this code for Kronecker product:

Then I found that a version of that Kronecker product code is already in ./math-lib/math/private/matrix/matrix-kronecker.rkt. Using SRFI 231's array-block!, I came up with:

(import srfi/1)    ;; fold, fold-right
(import srfi/231)

(define (kronecker-product/2 A B op)
  (let ((A (if (specialized-array? A) A (array-copy A)))
        (B (if (specialized-array? B) B (array-copy B))))
    (array-block! (array-map (lambda (a) (array-map (lambda (b) (op a b)) B)) A))))

(define (kronecker-product-left-to-right A . As)
  (fold (lambda (A B)
          (kronecker-product/2 B A *))
        A
        As))

(define (kronecker-product-right-to-left A . As)
  (fold-right (lambda (A B)
                (kronecker-product/2 A B *))
              (list*->array 2 '((1)))
              (cons A As)))

(define A
  (list*->array
   2
   '((1 -1 1 -1)
     (-1 1 -1 1)
     (1 -1 1 -1)
     (-1 1 -1 1))))

(define B
  (time (kronecker-product-left-to-right A A A A A A)))

(define C
  (time (kronecker-product-right-to-left A A A A A A)))
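To cross-check the fold directions and the claim below that (list*->array 2 '((1))) is a right identity, the same two-argument block construction can be sketched in plain Python with lists of lists (the names and representation are mine):

```python
from functools import reduce

def kron2(A, B):
    # each entry a of A is replaced by the block a * B:
    # row (i, k) of the result pairs row i of A with row k of B
    return [[a * b for a in row_a for b in row_b]
            for row_a in A for row_b in B]

def kron_right_to_left(*As):
    # fold from the right, starting from the 1x1 identity [[1]]
    return reduce(lambda acc, A: kron2(A, acc), reversed(As), [[1]])
```

Here [[1]] is in fact a two-sided identity for kron2, and associating from the right gives the same result as any other association order.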

Applying the products from the right reduces memory overhead, and the right identity for the Kronecker product is (list*->array 2 '((1))). On my Linux box the time macro returns

(time (kronecker-product-left-to-right A A A A A A))
    3.410823 secs real time
    3.405164 secs cpu time (3.139629 user, 0.265535 system)
    35 collections accounting for 1.192725 secs real time (1.153827 user, 0.035899 system)
    5872915072 bytes allocated
    286835 minor faults
    no major faults
    12250415484 cpu cycles
(time (kronecker-product-right-to-left A A A A A A))
    0.742896 secs real time
    0.741886 secs cpu time (0.714960 user, 0.026926 system)
    no collections
    143861856 bytes allocated
    34945 minor faults
    no major faults
    2668186698 cpu cycles

There are (expt 16 6) elements in the result, each taking 8 bytes. Because the products are applied from the right, the intermediate bodies have 16, 16^2, ..., 16^6 elements, a geometric series, so the memory in the bodies of the result and all intermediate arrays is about

> (inexact (* (expt 16 6) 8 16/15))
143165576.53333333

bytes, so the memory overhead for the right-to-left version is minimal while the memory overhead for the left-to-right version is a factor of

> (/ 5872915072 143165576.53333333)
41.02183789014816
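The 16/15 factor is just the closed form of that geometric series; checking the exact sum (bytes for the array bodies only, ignoring per-array headers):

```python
# right-to-left, the intermediate bodies hold 16**1, 16**2, ..., 16**6
# elements of 8 bytes each, a geometric series
body_bytes = sum(8 * 16 ** k for k in range(1, 7))
print(body_bytes)  # 143165568
```

which is essentially all of the 143,861,856 bytes the right-to-left run allocated.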

I'm still thinking about how to automatically broadcast generalized (non-strict) array arguments. At the moment I think that copying a generalized array argument to a specialized (strict) array before broadcasting in a nontrivial way (meaning that the broadcast domain is not the same as the argument domain) is the "right thing" to do.

I am wondering whether this approach would be faster:

((A * B) * (C * D)) * ((E * F) * (G * H))

The idea being to keep the intermediate results as small as possible.

Apropos - I can't remember in which order racket/math multiplies matrices.
Maybe there is room for improvement.

Algorithmically, the final result has as entries all products of one entry from each of the arrays, so re-associating the products cannot reduce the number of operations as it can with matrix multiplication.

The way the SRFI 231 code is written, the overhead consists of building a generalized array for each entry in the left-hand side of each product, so to minimize that overhead the size of the left-hand matrices should be minimized, i.e., one should associate from the right.
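That accounting can be put in a back-of-the-envelope cost model (my own hypothetical bookkeeping, not SRFI 231 internals): charge one small generalized array per entry of the left-hand operand of each two-argument product.

```python
def overhead_arrays(sizes, associate_right):
    # sizes: entry counts of the factors, e.g. [16] * 6 for six 4x4 matrices
    if associate_right:
        # fold-right: every product has an original small factor on the left
        return sum(sizes)
    # fold-left: the growing accumulator sits on the left of each product
    total = 0
    acc = sizes[0]
    for n in sizes[1:]:
        total += acc
        acc *= n
    return total
```

With [16] * 6 this charges 96 small arrays to the right-to-left order against 1,118,480 for left-to-right, which makes the measured allocation gap unsurprising.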

Interestingly, the same code also computes the tensor Kronecker product of any order, which is defined in Definition 2.13 of "Kronecker Product of Tensors and Hypergraphs: Structure and Dynamics", found here:

https://arxiv.org/pdf/2305.03875

https://epubs.siam.org/doi/full/10.1137/23M1592547

I think it's a good example use of array-block!.