Where to start with optimizing Racket programs?

Given a plain racket program like

#lang racket/base

(require racket/generator
         racket/unsafe/ops)

(define (gen-primes-upto n)
  (generator ()
             (if (= n 2)
                 (void)
                 (let ([table (make-vector n #t)]
                       [sqrtn (inexact->exact (ceiling (sqrt n)))])

                   (for ([v (in-vector table 2 sqrtn)]
                         [i (in-range 2 sqrtn)]
                         #:when v)
                     (for ([j (in-range (* i i) n i)])
                       (unsafe-vector-set! table j #f)))
        
                   (yield 2)
                   (for ([v (in-vector table 3 n 2)]
                         [i (in-range 3 n 2)]
                         #:when v)
                     (yield i))))))

(module+ main
  (for ([i (in-range 3)])
    (time (for ([_ (in-range 100)])
            (for ([_ (in-producer (gen-primes-upto 100000) (void))]
                  [i (in-range 9000)])
              void)))))

(This is based on a translation of the Python code in My favorite prime number generator - Eli Bendersky's website)

How do I approach optimizing it?

I know Optimization Coach exists for Typed Racket, but not for plain Racket.

I ran this using raco profile --use-errortrace sieve.rkt and got output like

Profiling results                                                                   
-----------------                                                                   
  Total cpu time observed: 1913ms (out of 1954ms)                                   
  Number of samples taken: 38 (once every 50ms)                                     
                                                                                    
===================================================                                 
                              Caller                                                
Idx   Total       Self      Name+src         Local%                                 
      ms(pct)     ms(pct)     Callee                                                
===================================================                                 
[1] 1863(97.4%)    0(0.0%)  (for ...) ...-blog/2023/prime-sieve/sieve.rkt:25:2      
                              (time ...) [2] 100.0%                                 
---------------------------------------------------                                 
                              (for ...) [1]  100.0%                                 
[2] 1863(97.4%)    0(0.0%)  (time ...) ...blog/2023/prime-sieve/sieve.rkt:26:4      
                              (for ...) [3]  100.0%                                 
---------------------------------------------------                                 
                              (time ...) [2] 100.0%                                 
[3] 1863(97.4%)    0(0.0%)  (for ...) ...blog/2023/prime-sieve/sieve.rkt:26:10      
                              (for ...) [4]  100.0%                                 
---------------------------------------------------                                 
                              (for ...) [3]  100.0%                                 
[4] 1863(97.4%) 1158(60.5%) (for ...) ...blog/2023/prime-sieve/sieve.rkt:27:12      
                              (let ...) [5]   37.9%                                 
---------------------------------------------------                                 
                              (for ...) [4]  100.0%                                 
[5]  706(36.9%)    0(0.0%)  (let ...) ...-blog/2023/prime-sieve/sieve.rkt:9:17      
                              (for ...) [6]   64.3%                                 
                              (for ...) [7]   35.7%                                 
---------------------------------------------------                                 
                              (let ...) [5]  100.0%                                 
[6]  454(23.7%)    0(0.0%)  (for ...) ...blog/2023/prime-sieve/sieve.rkt:19:19      
                              (yield ...) [8]100.0%                                 
---------------------------------------------------                                 
                              (let ...) [5]  100.0%                                 
[7]  252(13.2%)    0(0.0%)  (for ...) ...blog/2023/prime-sieve/sieve.rkt:12:19      
                              (for ...) [9]  100.0%                                 
---------------------------------------------------                                 
                              (for ...) [6]  100.0%                                 
[8]  454(23.7%)  454(23.7%) (yield ...) ...og/2023/prime-sieve/sieve.rkt:22:21      
---------------------------------------------------                                 
                              (for ...) [7]  100.0%                                 
[9]  252(13.2%)  252(13.2%) (for ...) ...blog/2023/prime-sieve/sieve.rkt:15:21      
---------------------------------------------------

which I don't find useful at all, given it always refers to loops or similar. What I'd like to know is if this is hitting some case preventing optimizations (like boxing integers, or performing too much range checking on the vector, or other things). Is looking at the assembly (Compiler Explorer) in Godbolt the only way? (re: Racket now available on Compiler Explorer - #8 by jryans)

I'm used to languages like C++ or Python, where there are rich line-level or instruction level profilers. Thanks!

1 Like

This does not answer your question, but for this particular program that you are writing, the sieve already takes time and space more than linear, so you won’t really gain anything by using generator. In fact, generator is relatively expensive in Racket. If you switch from generator to returning the simple full list, you can cut the running time by about 6x.

What about using streams? Is that expensive too?

If we had a profiler that could tell us how much time is being spent in contract checking, that would be amazing. In my somewhat limited experience, optimizing Racket code mostly involves reducing contract checking, like by using optimized for loops and sequence constructors.

For numerical code especially, boxing does also come into play of course. So detecting values that could be unboxed would also be extremely useful.

https://docs.racket-lang.org/feature-profile/index.html

1 Like

The feature profiler looks very interesting! Exactly what I was looking for.

This sieve doesn't require a generator, but the next form in the original blog post that creates a generator capable of yielding infinite primes does need it. Are there continuation forms that perform better than generator?

Streams are the right way to hand over an infinite series of numbers (primes, odds, evens).

2 Likes

Conceptually. Turns out the overhead is too large. Sorry.

Managing the generator manually seems to yield a 5-10% improvement:

(define gen-primes-upto
  (let ()
    (define old-n #false)
    (define table #false)
    (define enum  #false)
    (λ (n)
      (cond
        [(= n 2) (void)]
        [(and table (= n old-n))
         (for/first ([i (in-range (+ enum 1) n)] #:when (unsafe-vector-ref table i))
           (set! enum i)
           i)]
        [else
         (set! table (make-vector (+ n 1) #t))
         (set! enum 1)
         (set! old-n n)
         (let ([sqrtn (inexact->exact (ceiling (sqrt n)))])
           (for ([v (in-vector table 2 sqrtn)]
                 [i (in-range 2 sqrtn)]
                 #:when v)
             (for ([j (in-range (* i i) n i)])
               (unsafe-vector-set! table j #f)))
           (gen-primes-upto n))]))))

Buyer beware.

1 Like

I prefer my port of Scheme SRFI-158 generators (In my extra-srfi-libs package) to the ones that come with Racket; among other things, the implementation is noticeably faster. With this minimally rewritten version of your code (Using a SRFI-190 macro which is built on top of SRFI-158):

#lang racket/base

(require srfi/190 racket/unsafe/ops)

(define-coroutine-generator (gen-primes-upto n)
  (when (> n 2)
    (let ([table (make-vector n #t)]
          [sqrtn (inexact->exact (ceiling (sqrt n)))])
      
      (for ([v (in-vector table 2 sqrtn)]
            [i (in-range 2 sqrtn)]
            #:when v)
        (for ([j (in-range (* i i) n i)])
          (unsafe-vector-set! table j #f)))
        
      (yield 2)
      (for ([v (in-vector table 3 n 2)]
            [i (in-range 3 n 2)]
            #:when v)
        (yield i)))))

(module+ main
  (for ([i (in-range 3)])
    (time (for ([_ (in-range 100)])
            (for ([_ (in-producer (gen-primes-upto 100000) eof)]
                  [i (in-range 9000)])
              void)))))

I get the following timings:

$ racket gen2.rkt
cpu time: 484 real time: 492 gc time: 62
cpu time: 468 real time: 456 gc time: 15
cpu time: 500 real time: 500 gc time: 62

compared to yours:

$ racket gen.rkt
cpu time: 734 real time: 728 gc time: 93
cpu time: 718 real time: 719 gc time: 31
cpu time: 703 real time: 703 gc time: 15
1 Like

You should try benchmarking on snapshot, which included some cleanups to racket/generator. I think those made it slightly faster.

2 Likes

Here is the variant using streams (there is probably a better way to do this)

(struct prime-stream (table current)
  #:methods gen:stream
  [(define (stream-empty? stream)
     (>= (add1 (prime-stream-current stream))
         (vector-length (prime-stream-table stream))))
   (define (stream-first stream)
     (prime-stream-current stream))
   (define (stream-rest stream)
     (prime-stream
      (prime-stream-table stream)
      (for/first ([i (in-naturals (add1 (prime-stream-current stream)))]
                  #:when (vector-ref (prime-stream-table stream) i))
        i)))])

(define (gen-primes-upto-stream n)
  (if (<= n 2)
      (empty-stream)
      (let ([table (make-vector n #t)]
            [sqrtn (inexact->exact (ceiling (sqrt n)))])

        (for ([v (in-vector table 2 sqrtn)]
              [i (in-range 2 sqrtn)]
              #:when v)
          (for ([j (in-range (* i i) n i)])
            (unsafe-vector-set! table j #f)))

        (prime-stream table 2))))

On my machine, the generator one vs the stream one

cpu time: 558 real time: 558 gc time: 61
cpu time: 521 real time: 521 gc time: 20
cpu time: 526 real time: 526 gc time: 22

cpu time: 349 real time: 349 gc time: 6
cpu time: 349 real time: 349 gc time: 7
cpu time: 349 real time: 349 gc time: 6

which is pretty nice!

However this still doesn't answer my question of how I would have gone identifying these bottlenecks via tooling.
I tried both the feature profiler and contract profiler on this program. The former reported some samples, but did not attribute anything to features it knew about.
The contract profiler said 0% of time was spent in contracts (for the original generator based impl).

— You’re making your own streams; the built-in streams are sadly slower than these.
— The FSP is programmable via continuation-marks. But no, I can’t think of a way to uncover the bottle-necks here.
My best guess is that continuations are slower than indexing into a stream (assuming main uses the same loops.)
— Can you explain why you suspect contracts here?

I don't suspect contracts. I was just running these various tools to try and understand the program and reported what they said.

Neat!

I was also curious about how thread would compare, since it is a coroutine + things. It was much worse than even generator.
Here is the full code with all implementations, running on Racket v8.9.

#lang racket/base

(require racket/generator
         racket/unsafe/ops
         racket/stream
         (prefix-in srfi: srfi/190))

(define (gen-primes-upto n)
  (generator ()
             (when (> n 2)
               (let ([table (make-vector n #t)]
                     [sqrtn (inexact->exact (ceiling (sqrt n)))])

                 (for ([v (in-vector table 2 sqrtn)]
                       [i (in-range 2 sqrtn)]
                       #:when v)
                   (for ([j (in-range (* i i) n i)])
                     (unsafe-vector-set! table j #f)))
        
                 (yield 2)
                 (for ([v (in-vector table 3 n 2)]
                       [i (in-range 3 n 2)]
                       #:when v)
                   (yield i))))))

(struct prime-stream (table current)
  #:methods gen:stream
  [(define (stream-empty? stream)
     (>= (add1 (prime-stream-current stream))
         (vector-length (prime-stream-table stream))))
   (define (stream-first stream)
     (prime-stream-current stream))
   (define (stream-rest stream)
     (prime-stream
      (prime-stream-table stream)
      (for/first ([i (in-naturals (add1 (prime-stream-current stream)))]
                  #:when (vector-ref (prime-stream-table stream) i))
        i)))])

(define (gen-primes-upto-stream n)
  (if (<= n 2)
      (empty-stream)
      (let ([table (make-vector n #t)]
            [sqrtn (inexact->exact (ceiling (sqrt n)))])

        (for ([v (in-vector table 2 sqrtn)]
              [i (in-range 2 sqrtn)]
              #:when v)
          (for ([j (in-range (* i i) n i)])
            (unsafe-vector-set! table j #f)))

        (prime-stream table 2))))

; matches def gen_primes() in the original Python code.
(define (gen-primes)
  (generator ()
             (let loop ([D (make-hasheq)]
                        [q 2])
               (let ([existing (hash-ref D q #f)])
                 (if (not existing)
                     ; invert the order of operations compared to Python
                     ; so we can re-run the loop with the new set.
                     (begin
                       (yield q)
                       (hash-set! D (* q q) (list q))
                       (loop D (add1 q)))
                     (begin
                       (for ([p (in-list existing)])
                         (hash-update! D
                                       (+ p q)
                                       (lambda (v) (cons p v))
                                       list))
                       (hash-remove! D q)
                       (loop D (add1 q))))))))

; note: Python optimizations may not make sense for Racket.
(define (gen-primes-opt)
  (generator ()
             (yield 2)
             (define D (make-hasheq))

             (let loop ([q 3])
               (define p (hash-ref D q #f))
               (hash-remove! D q)
               (if (not p)
                   (begin
                     (hash-set! D (* q q) q)
                     (yield q))
                   (hash-set! D (let whl ([x (+ q p p)])
                                  (if (hash-has-key? D x)
                                      (whl (+ x p p))
                                      x)) p))
               (loop (+ 2 q)))))

(srfi:define-coroutine-generator
 (gen-primes-opt-srfi)
 (srfi:yield 2)
 (define D (make-hasheq))

 (let loop ([q 3])
   (define p (hash-ref D q #f))
   (hash-remove! D q)
   (if (not p)
       (begin
         (hash-set! D (* q q) q)
         (srfi:yield q))
       (hash-set! D (let whl ([x (+ q p p)])
                      (if (hash-has-key? D x)
                          (whl (+ x p p))
                          x)) p))
   (loop (+ 2 q))))

(define (gen-primes-opt-thread)
  (define ch (make-channel))
  (thread (lambda ()
            (channel-put ch 2)
            (define D (make-hasheq))

            (let loop ([q 3])
              (define p (hash-ref D q #f))
              (hash-remove! D q)
              (if (not p)
                  (begin
                    (hash-set! D (* q q) q)
                    (channel-put ch q))
                  (hash-set! D (let whl ([x (+ q p p)])
                                 (if (hash-has-key? D x)
                                     (whl (+ x p p))
                                     x)) p))
              (loop (+ 2 q)))))
  ch)

(module+ main
  (printf "Sieve of Erastothenes (Generator)~n")
  (for ([i (in-range 3)])
    (time (for ([_ (in-range 100)])
            (for ([_ (in-producer (gen-primes-upto 100000) (void))]
                  [i (in-range 9000)])
              void))))
  
  (printf "Sieve of Erastothenes (Stream)~n")
  (for ([i (in-range 3)])
    (time (for ([_ (in-range 100)])
            (for ([_ (in-stream (stream-take (gen-primes-upto-stream 100000) 9000))])
              void))))

  (printf "Infinite Sieve (Generator)~n")
  (for ([i (in-range 3)])
    (time (for ([_ (in-range 100)])
            (for ([_ (in-producer (gen-primes) (void))]
                  [i (in-range 9000)])
              void))))
  
  (printf "Infinite Sieve (Generator/Opt)~n")
  (for ([i (in-range 3)])
    (time (for ([_ (in-range 100)])
            (for ([_ (in-producer (gen-primes-opt) (void))]
                  [i (in-range 9000)])
              void))))
  
  (printf "Infinite Sieve (Generator/Opt/SRFI)~n")
  (for ([i (in-range 3)])
    (time (for ([_ (in-range 100)])
            (for ([_ (in-producer (gen-primes-opt-srfi) (void))]
                  [i (in-range 9000)])
              void))))

  (printf "Infinite Sieve (Generator/Opt/Thread)~n")
  (for ([i (in-range 3)])
    (time (for ([_ (in-range 100)])
            (define gen-ch (gen-primes-opt-thread))
            (for ([i (in-range 9000)])
              (channel-get gen-ch))))))
Sieve of Erastothenes (Generator)
cpu time: 619 real time: 621 gc time: 69
cpu time: 521 real time: 522 gc time: 20
cpu time: 516 real time: 516 gc time: 18
Sieve of Erastothenes (Stream)
cpu time: 343 real time: 343 gc time: 7
cpu time: 340 real time: 341 gc time: 6
cpu time: 341 real time: 341 gc time: 7
Infinite Sieve (Generator)
cpu time: 2505 real time: 2505 gc time: 116
cpu time: 2490 real time: 2490 gc time: 114
cpu time: 2505 real time: 2506 gc time: 116
Infinite Sieve (Generator/Opt)
cpu time: 1144 real time: 1144 gc time: 87
cpu time: 1130 real time: 1130 gc time: 86
cpu time: 1132 real time: 1132 gc time: 88
Infinite Sieve (Generator/Opt/SRFI)
cpu time: 916 real time: 916 gc time: 78
cpu time: 917 real time: 917 gc time: 78
cpu time: 909 real time: 909 gc time: 75
Infinite Sieve (Generator/Opt/Thread)
cpu time: 1387 real time: 1388 gc time: 82
cpu time: 1398 real time: 1399 gc time: 81
cpu time: 1405 real time: 1407 gc time: 82

All versions beat the Python 3.10 equivalents.

gen_primes_upto [0.6598196209815796, 0.6476345029950608, 0.6442868190060835]
gen_primes_upto_segmented [1.612839948007604, 1.610719264979707, 1.6245120970124844]
gen_primes [3.548573671025224, 3.5614052320015617, 3.619050811015768]
gen_primes_opt [1.1775665110035334, 1.185819651989732, 1.180778084002668]

Note that I was previously using immutable hasheqs, and was getting ~2x worse running time. Switching to mutable made a huge difference.

It's not so directly relevant to the question of how to optimize, but you may be interested in the prime number functions from math/number-theory or their implementation here.

1 Like

I wrote prime sieve as an optimization challenge some time ago. Code is here:

My actual code isn't directly useful to you, since you are interested in using generators and/or streams, but it does give you something to shoot for performance-wise. My version calculates primes up to 1,000,000 as many times as it can in 5 seconds. On my system it completes about 2500 passes:

Completed 2579 passes
cpu time: 4999 real time: 5000 gc time: 20

When using the statistical profiler, is there a way to customize what nodes are shown? raco profile (when using errortrace) only shows time consumed by my code and one level deep calls (stuff like hash-ref).
If I was curious how time was being spent in the various forms hash-ref itself may call, is there a way to obtain that information?

The profiler works by annotating your program -- effectively rewriting it to be a "profile-able version of your program".

IIUC it will only do so for sieve.rkt -- not for required modules like racket/generator, much less transitively for its requires. Doing so could be very slow.

I could imagine raco profile could do what the debugger does: Use a current-load/use-compiled handler also to annotate required modules -- but with some limit. Maybe you tell it which modules. Maybe you tell it the require transitive depth. And/or some combo.


p.s. My impression is Racketeers don't use tools like the profiler very much. Maybe they think more about algorithms when optimizing (reaching for tools like profilers later, if ever).

Of course if people don't use a tool much, it's unlikely to be improved. Which makes it even less likely to be used. And so on.

Another factor is the Racket community of tool users and tool improving contributors is some orders of magnitude smaller than say Python.

2 Likes