which I don't find useful at all, given it always refers to loops or similar. What I'd like to know is if this is hitting some case preventing optimizations (like boxing integers, or performing too much range checking on the vector, or other things). Is looking at the assembly (Compiler Explorer) in Godbolt the only way? (re: Racket now available on Compiler Explorer - #8 by jryans)
I'm used to languages like C++ or Python, where there are rich line-level or instruction level profilers. Thanks!
This does not answer your question, but for this particular program that you are writing, the sieve already takes time and space more than linear, so you won’t really gain anything by using generator. In fact, generator is relatively expensive in Racket. If you switch from generator to returning the simple full list, you can cut the running time by about 6x.
If we had a profiler that could tell us how much time is being spent in contract checking, that would be amazing. In my somewhat limited experience, optimizing Racket code mostly involves reducing contract checking, like by using optimized for loops and sequence constructors.
For numerical code especially, boxing does also come into play of course. So detecting values that could be unboxed would also be extremely useful.
This sieve doesn't require a generator, but the next form in the original blog post that creates a generator capable of yielding infinite primes does need it. Are there continuation forms that perform better than generator?
I prefer my port of Scheme SRFI-158 generators (In my extra-srfi-libs package) to the ones that come with Racket; among other things, the implementation is noticeably faster. With this minimally rewritten version of your code (Using a SRFI-190 macro which is built on top of SRFI-158):
On my machine, the generator one vs the stream one
cpu time: 558 real time: 558 gc time: 61
cpu time: 521 real time: 521 gc time: 20
cpu time: 526 real time: 526 gc time: 22
cpu time: 349 real time: 349 gc time: 6
cpu time: 349 real time: 349 gc time: 7
cpu time: 349 real time: 349 gc time: 6
which is pretty nice!
However this still doesn't answer my question of how I would have gone identifying these bottlenecks via tooling.
I tried both the feature profiler and contract profiler on this program. The former reported some samples, but did not attribute anything to features it knew about.
The contract profiler said 0% of time was spent in contracts (for the original generator based impl).
— You’re making your own streams; the built-in streams are sadly slower than these.
— The FSP is programmable via continuation-marks. But no, I can’t think of a way to uncover the bottle-necks here.
My best guess is that continuations are slower than indexing into a stream (assuming main uses the same loops.)
— Can you explain why you suspect contracts here?
I was also curious about how thread would compare, since it is a coroutine + things. It was much worse than even generator.
Here is the full code with all implementations, running on Racket v8.9.
#lang racket/base
(require racket/generator
racket/unsafe/ops
racket/stream
(prefix-in srfi: srfi/190))
(define (gen-primes-upto n)
(generator ()
(when (> n 2)
(let ([table (make-vector n #t)]
[sqrtn (inexact->exact (ceiling (sqrt n)))])
(for ([v (in-vector table 2 sqrtn)]
[i (in-range 2 sqrtn)]
#:when v)
(for ([j (in-range (* i i) n i)])
(unsafe-vector-set! table j #f)))
(yield 2)
(for ([v (in-vector table 3 n 2)]
[i (in-range 3 n 2)]
#:when v)
(yield i))))))
(struct prime-stream (table current)
#:methods gen:stream
[(define (stream-empty? stream)
(>= (add1 (prime-stream-current stream))
(vector-length (prime-stream-table stream))))
(define (stream-first stream)
(prime-stream-current stream))
(define (stream-rest stream)
(prime-stream
(prime-stream-table stream)
(for/first ([i (in-naturals (add1 (prime-stream-current stream)))]
#:when (vector-ref (prime-stream-table stream) i))
i)))])
(define (gen-primes-upto-stream n)
(if (<= n 2)
(empty-stream)
(let ([table (make-vector n #t)]
[sqrtn (inexact->exact (ceiling (sqrt n)))])
(for ([v (in-vector table 2 sqrtn)]
[i (in-range 2 sqrtn)]
#:when v)
(for ([j (in-range (* i i) n i)])
(unsafe-vector-set! table j #f)))
(prime-stream table 2))))
; matches def gen_primes() in the original Python code.
(define (gen-primes)
(generator ()
(let loop ([D (make-hasheq)]
[q 2])
(let ([existing (hash-ref D q #f)])
(if (not existing)
; invert the order of operations compared to Python
; so we can re-run the loop with the new set.
(begin
(yield q)
(hash-set! D (* q q) (list q))
(loop D (add1 q)))
(begin
(for ([p (in-list existing)])
(hash-update! D
(+ p q)
(lambda (v) (cons p v))
list))
(hash-remove! D q)
(loop D (add1 q))))))))
; note: Python optimizations may not make sense for Racket.
(define (gen-primes-opt)
(generator ()
(yield 2)
(define D (make-hasheq))
(let loop ([q 3])
(define p (hash-ref D q #f))
(hash-remove! D q)
(if (not p)
(begin
(hash-set! D (* q q) q)
(yield q))
(hash-set! D (let whl ([x (+ q p p)])
(if (hash-has-key? D x)
(whl (+ x p p))
x)) p))
(loop (+ 2 q)))))
(srfi:define-coroutine-generator
(gen-primes-opt-srfi)
(srfi:yield 2)
(define D (make-hasheq))
(let loop ([q 3])
(define p (hash-ref D q #f))
(hash-remove! D q)
(if (not p)
(begin
(hash-set! D (* q q) q)
(srfi:yield q))
(hash-set! D (let whl ([x (+ q p p)])
(if (hash-has-key? D x)
(whl (+ x p p))
x)) p))
(loop (+ 2 q))))
(define (gen-primes-opt-thread)
(define ch (make-channel))
(thread (lambda ()
(channel-put ch 2)
(define D (make-hasheq))
(let loop ([q 3])
(define p (hash-ref D q #f))
(hash-remove! D q)
(if (not p)
(begin
(hash-set! D (* q q) q)
(channel-put ch q))
(hash-set! D (let whl ([x (+ q p p)])
(if (hash-has-key? D x)
(whl (+ x p p))
x)) p))
(loop (+ 2 q)))))
ch)
(module+ main
(printf "Sieve of Erastothenes (Generator)~n")
(for ([i (in-range 3)])
(time (for ([_ (in-range 100)])
(for ([_ (in-producer (gen-primes-upto 100000) (void))]
[i (in-range 9000)])
void))))
(printf "Sieve of Erastothenes (Stream)~n")
(for ([i (in-range 3)])
(time (for ([_ (in-range 100)])
(for ([_ (in-stream (stream-take (gen-primes-upto-stream 100000) 9000))])
void))))
(printf "Infinite Sieve (Generator)~n")
(for ([i (in-range 3)])
(time (for ([_ (in-range 100)])
(for ([_ (in-producer (gen-primes) (void))]
[i (in-range 9000)])
void))))
(printf "Infinite Sieve (Generator/Opt)~n")
(for ([i (in-range 3)])
(time (for ([_ (in-range 100)])
(for ([_ (in-producer (gen-primes-opt) (void))]
[i (in-range 9000)])
void))))
(printf "Infinite Sieve (Generator/Opt/SRFI)~n")
(for ([i (in-range 3)])
(time (for ([_ (in-range 100)])
(for ([_ (in-producer (gen-primes-opt-srfi) (void))]
[i (in-range 9000)])
void))))
(printf "Infinite Sieve (Generator/Opt/Thread)~n")
(for ([i (in-range 3)])
(time (for ([_ (in-range 100)])
(define gen-ch (gen-primes-opt-thread))
(for ([i (in-range 9000)])
(channel-get gen-ch))))))
Sieve of Erastothenes (Generator)
cpu time: 619 real time: 621 gc time: 69
cpu time: 521 real time: 522 gc time: 20
cpu time: 516 real time: 516 gc time: 18
Sieve of Erastothenes (Stream)
cpu time: 343 real time: 343 gc time: 7
cpu time: 340 real time: 341 gc time: 6
cpu time: 341 real time: 341 gc time: 7
Infinite Sieve (Generator)
cpu time: 2505 real time: 2505 gc time: 116
cpu time: 2490 real time: 2490 gc time: 114
cpu time: 2505 real time: 2506 gc time: 116
Infinite Sieve (Generator/Opt)
cpu time: 1144 real time: 1144 gc time: 87
cpu time: 1130 real time: 1130 gc time: 86
cpu time: 1132 real time: 1132 gc time: 88
Infinite Sieve (Generator/Opt/SRFI)
cpu time: 916 real time: 916 gc time: 78
cpu time: 917 real time: 917 gc time: 78
cpu time: 909 real time: 909 gc time: 75
Infinite Sieve (Generator/Opt/Thread)
cpu time: 1387 real time: 1388 gc time: 82
cpu time: 1398 real time: 1399 gc time: 81
cpu time: 1405 real time: 1407 gc time: 82
It's not so directly relevant to the question of how to optimize, but you may be interested in the prime number functions from math/number-theory or their implementation here.
I wrote prime sieve as an optimization challenge some time ago. Code is here:
My actual code isn't directly useful to you, since you are interested in using generators and/or streams, but it does give you something to shoot for performance-wise. My version calculates primes up to 1,000,000 as many times as it can in 5 seconds. On my system it completes about 2500 passes:
Completed 2579 passes
cpu time: 4999 real time: 5000 gc time: 20
When using the statistical profiler, is there a way to customize what nodes are shown? raco profile (when using errortrace) only shows time consumed by my code and one level deep calls (stuff like hash-ref).
If I was curious how time was being spent in the various forms hash-ref itself may call, is there a way to obtain that information?
The profiler works by annotating your program -- effectively rewriting it to be a "profile-able version of your program".
IIUC it will only do so for sieve.rkt -- not for required modules like racket/generator, much less transitively for its requires. Doing so could be very slow.
I could imagine raco profile could do what the debugger does: Use a current-load/use-compiled handler also to annotate required modules -- but with some limit. Maybe you tell it which modules. Maybe you tell it the require transitive depth. And/or some combo.
p.s. My impression is Racketeers don't use tools like the profiler very much. Maybe they think more about algorithms when optimizing (reaching for tools like profilers later, if ever).
Of course if people don't use a tool much, it's unlikely to be improved. Which makes it even less likely to be used. And so on.
Another factor is the Racket community of tool users and tool improving contributors is some orders of magnitude smaller than say Python.