Why are struct constructors 10-20% faster when accessed through a parameter than when used directly?

Hi!

I was looking at the performance impact of parameters for a language project and ran into some odd behavior: calling struct constructors through a parameterize-introduced binding appears to be 10-20% faster than calling them directly, while struct field accesses are about 10-20% slower when the accessor procedure is fetched through a parameter.

Take the following file:

#lang racket

(struct tree (l r v))

(define (rand-tree depth)
  (if (<= depth 1)
      (tree #f #f (random 1000))
      (tree (rand-tree (sub1 depth))
            (rand-tree (sub1 depth))
            (random 1000))))

(define (sum todo acc)
  (match todo
    ['() acc]
    [(cons h t)
     (if h
         (sum (cons (tree-l h) (cons (tree-r h) t))
              (+ acc (tree-v h)))
         (sum t acc))]))

(define (print-time) (println (current-inexact-monotonic-milliseconds)))


(print-time)
(define t1 (rand-tree 25))
(print-time)
(println (sum `(,t1) 0))
(print-time)

This spits out:

364.5869
4900.0113
16759036712
5286.0855

So, about 4.5s to construct a complete binary tree of depth 25, and about 0.4s to compute its sum. If, instead, we access the struct constructors and accessors through parameter bindings:

#lang racket

(define m-p (make-parameter #f))
(define l-p (make-parameter #f))
(define r-p (make-parameter #f))
(define v-p (make-parameter #f))

;; Memoize a zero-argument procedure: call fun once and cache the
;; result, so the parameter is dereferenced only on the first call.
(define (memo fun)
  (let ([cache #f])
    (lambda ()
      (if cache
          cache
          (begin (set! cache (fun))
                 cache)))))

(define m (memo m-p))
(define l (memo l-p))
(define r (memo r-p))
(define v (memo v-p))

(define (rand-tree depth)
  (if (<= depth 1)
      ((m) #f #f (random 1000))
      ((m) (rand-tree (sub1 depth))
           (rand-tree (sub1 depth))
           (random 1000))))

(define (sum todo acc)
  (match todo
    ['() acc]
    [(cons h t)
     (if h
         (sum (cons ((l) h) (cons ((r) h) t))
              (+ acc ((v) h)))
         (sum t acc))]))

(define (print-time) (println (current-inexact-monotonic-milliseconds)))

(struct tree (l r v))

(parameterize ([m-p (lambda (l r v) (tree l r v))]
               [l-p (lambda (node) (tree-l node))]
               [r-p (lambda (node) (tree-r node))]
               [v-p (lambda (node) (tree-v node))])
  (print-time)
  (define t1 (rand-tree 25))
  (print-time)
  (println (sum `(,t1) 0))
  (print-time))

We get these timings:

403.0528
4562.1506
16759097702
5176.3364

4.1s to construct the tree, 0.6s to traverse and compute the sum.

A longer sum time makes sense, since there's an additional layer of indirection in calling through the parameter procedure, but... why is construction faster?
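
For a rough sense of that extra indirection, here's a hypothetical micro-benchmark sketch (acc-p is made up, and the numbers will vary by machine) comparing a direct accessor call with the memoized-thunk path from the sum loop above:

#lang racket

;; Hypothetical sketch: compare a direct accessor call against the
;; memoized-thunk path used in the sum loop above.
(struct tree (l r v))
(define acc-p (make-parameter tree-l))  ; made-up parameter, for illustration

(define (memo fun)
  (let ([cache #f])
    (lambda () (or cache (begin (set! cache (fun)) cache)))))

(define l (memo acc-p))
(define t (tree 1 2 3))

(time (for ([i (in-range 10000000)]) (tree-l t)))  ; direct accessor
(time (for ([i (in-range 10000000)]) ((l) t)))     ; via the memo thunk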

Any insight would be appreciated.

Cheers,

Sean

To thicken the plot, it seems that just wrapping the code in parameterize with some dummy parameter is enough to speed up construction:

#lang racket

(define p (make-parameter #f))
(struct tree (l r v))

(define (rand-tree depth)
  (if (<= depth 1)
      (tree #f #f (random 1000))
      (tree (rand-tree (sub1 depth))
            (rand-tree (sub1 depth))
            (random 1000))))

(define (sum todo acc)
  (match todo
    ['() acc]
    [(cons h t)
     (if h
         (sum (cons (tree-l h) (cons (tree-r h) t))
              (+ acc (tree-v h)))
         (sum t acc))]))

(define (print-time) (println (current-inexact-monotonic-milliseconds)))

(parameterize ([p #f])
  (print-time)
  (define t1 (rand-tree 25))
  (print-time)
  (println (sum `(,t1) 0))
  (print-time))

This has timings of:

354.8463
4192.6203
16761484464
4550.8924

So, about 3.8s to construct. This consistently outperforms the parameter-less version.
Maybe this is some threading- or cache-related behavior?

I see some memoization code in the fast example that's not in the slow example. Is that intentional? It's probably the reason for the performance difference.

Use this one weird trick to speed up your program by 25%!

This is a really great puzzle. Here’s a simplification of your first version:

#lang racket

(struct tree (l r v))

(define (rand-tree depth)
  (if (<= depth 1)
      (tree #f #f (random 1000))
      (tree (rand-tree (sub1 depth))
            (rand-tree (sub1 depth))
            (random 1000))))

(time (rand-tree 25))
;; cpu time: 3468 real time: 3518 gc time: 1795

(Tip: use time. No need to roll your own time measurement!)

And your second version is:

#lang racket

(define dummy (make-parameter #f))

(struct tree (l r v))

(define (rand-tree depth)
  (if (<= depth 1)
      (tree #f #f (random 1000))
      (tree (rand-tree (sub1 depth))
            (rand-tree (sub1 depth))
            (random 1000))))

(parameterize ([dummy 0])
  (time (rand-tree 25)))
;; cpu time: 2545 real time: 2582 gc time: 1752

Here’s my third version, which has no parameterize at all, yet is even faster!

#lang racket

(struct tree (l r v))

(define gen (current-pseudo-random-generator))

(define (rand-tree depth)
  (if (<= depth 1)
      (tree #f #f (random 1000 gen))
      (tree (rand-tree (sub1 depth))
            (rand-tree (sub1 depth))
            (random 1000 gen))))

(time (rand-tree 25))
;; cpu time: 2230 real time: 2242 gc time: 1781

The issue is that every time you call random without the second argument, Racket has to look up the generator parameter. My understanding is that in the first version, Racket has to traverse multiple frames to find the parameterization, while in the second version the parameterize copies the parameters into a frame that can be reached faster. But if you read the parameter once up front and pass the generator explicitly, as in the third version, then there's no lookup at all inside the loop.
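
In other words (a sketch of the idea, not Racket's actual implementation), one-argument random behaves roughly as if it dereferenced the parameter on every call:

#lang racket

;; Sketch only, not the real implementation: one-argument random acts
;; roughly as if it read the generator parameter on each call.
(define (random* k)
  (random k (current-pseudo-random-generator)))

(random* 1000)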

Based on a quick look at the implementation of parameters in racket/src/cs/rumble/parameter.ss, I don't think the speed difference is due to the nearest parameterization containing an entry for the current-pseudo-random-generator parameter. The parameterization appears to start out as an empty hasheq and only contains the parameters added by parameterize.

My best guess is that the difference is due to the continuation-mark lookup step that fetches the parameterization.
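
For intuition, a parameterization is itself a first-class value that parameter lookup has to locate before it can read the parameter. Here's a small illustrative sketch (not the internal representation) that captures one and reinstates it, making the two-step lookup explicit:

#lang racket

;; Sketch: parameter lookup first locates the current parameterization
;; (stored in a continuation mark), then reads the parameter from it.
(define p (make-parameter 'default))

(define saved
  (parameterize ([p 'captured])
    (current-parameterization)))

(call-with-parameterization saved (lambda () (p)))  ; => 'captured
(p)                                                 ; => 'default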

Update: Here's a version that lets you insert extra frames to push the parameterize continuation-mark frame as far away as you like:

#lang racket

(define dummy (make-parameter #f))

(struct tree (l r v))

(define (rand-tree depth)
  (if (<= depth 1)
      (tree #f #f (random 1000))
      (tree (rand-tree (sub1 depth))
            (rand-tree (sub1 depth))
            (random 1000))))

;; Insert `depth` extra continuation frames (each carrying a dummy
;; mark) between the parameterize and proc; begin0 keeps the recursive
;; call out of tail position so the frames actually accumulate.
(define (grow-stack-and-call depth proc)
  (cond [(zero? depth)
         (proc)]
        [else
         (with-continuation-mark depth
           'just-taking-up-space
           (begin0 (grow-stack-and-call (sub1 depth) proc)))]))

(define (go)
  (time (rand-tree 25)))

(parameterize ((dummy 0))
  ;;(grow-stack-and-call 5 go)
  (grow-stack-and-call 50 go)
  (void))

Fascinating. I've just confirmed that after eliding random, things behave about as expected:

#lang racket

(struct tree (l r v))

(define (make-tree depth)
  (if (<= depth 1)
      (tree #f #f 500)
      (tree (make-tree (sub1 depth))
            (make-tree (sub1 depth))
            500)))

(time (make-tree 28))

gives

cpu time: 19875 real time: 40510 gc time: 19531

while

#lang racket

(define p (make-parameter #f))

(struct tree (l r v))

(define (make-tree depth)
  (if (<= depth 1)
      (tree #f #f 500)
      (tree (make-tree (sub1 depth))
            (make-tree (sub1 depth))
            500)))

(parameterize ([p #f])
  (time (make-tree 28)))

gives

cpu time: 19718 real time: 38212 gc time: 19343

which is practically the same. (Also: holy GC workload!) Thanks for the heads-up about time.

The memoization in the second example was intentional; without it, dereferencing the parameter procedure on every call seems to eat quite a few cycles. This is probably why @sorawee's third example is so much faster, assuming random does the dereference on each iteration. I'm still not clear on why @sorawee's second example is faster, though @sorawee's and @ryanc's hypotheses are very much appreciated.
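
As a quick sketch of that parameter-procedure cost (p and f are made up for illustration; numbers will vary):

#lang racket

;; Made-up sketch: dereference the parameter on every call vs. once.
(define p (make-parameter add1))
(define f (p))  ; "memoized": dereferenced once, up front

(time (for ([i (in-range 10000000)]) ((p) i)))  ; lookup each iteration
(time (for ([i (in-range 10000000)]) (f i)))    ; plain call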

Ah, awesome. Yep: with random, growing the stack from 50 to 500 frames before calling makes it about 5x slower:

50:  cpu time: 1312 real time: 5711 gc time: 640
500: cpu time: 8234 real time: 24295 gc time: 812

Without random, there seems to be no degradation:

50:  cpu time: 625 real time: 2350 gc time: 562
500: cpu time: 578 real time: 2324 gc time: 562
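
For reference, the without-random rows presumably come from swapping a constant-leaf builder into the same stack-growing harness, roughly like this (assembled from the snippets above):

#lang racket

;; Presumed harness for the "without random" rows: constant leaves, so
;; no parameter lookup happens inside the loop.
(define dummy (make-parameter #f))
(struct tree (l r v))

(define (make-tree depth)
  (if (<= depth 1)
      (tree #f #f 500)
      (tree (make-tree (sub1 depth))
            (make-tree (sub1 depth))
            500)))

(define (grow-stack-and-call depth proc)
  (cond [(zero? depth) (proc)]
        [else (with-continuation-mark depth 'just-taking-up-space
                (begin0 (grow-stack-and-call (sub1 depth) proc)))]))

(define (go) (time (make-tree 25)))

(parameterize ([dummy 0])
  (grow-stack-and-call 500 go)  ; vs. (grow-stack-and-call 50 go)
  (void))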

Thanks everyone for taking a look.