We expect that the next Racket release (tentatively v9.0) will have additional support for parallelism. We need your help to get it right.
Racket has always provided lightweight threads for concurrency, but Racket threads have never taken advantage of multiple processor cores to run in parallel, unlike threads provided by an operating system or in many other languages. Racket provides places and futures for parallelism, but those constructs have restrictions that make them less easy to use than threads.
Current Racket snapshots (https://snapshot.racket-lang.org/) integrate parallelism into the thread layer. Use (thread thunk #:pool p) to put the new thread in the pool p as created with make-parallel-thread-pool, or use (thread thunk #:pool 'own) to have the Racket-level thread backed by its own OS-level thread.
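For example, a minimal sketch (assuming a current snapshot build; the default pool size is chosen by the runtime):

(define pool (make-parallel-thread-pool))   ; OS-backed pool of parallel threads
(define t1 (thread (λ () (for/sum ([i (in-range 10000000)]) i))
                   #:pool pool))            ; runs in the shared pool
(define t2 (thread (λ () (for/product ([i (in-range 1 21)]) i))
                   #:pool 'own))            ; backed by its own OS-level thread
(thread-wait t1)                            ; wait for both to finish;
(thread-wait t2)                            ; results are discarded in this sketch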
It's one thing to support the creation of parallel threads, and it's another to have those threads provide any performance benefit. Of course, locking and other synchronization requirements can reduce performance, instead of improving it. Worse performance can be inherent to a task, or it can be due to limitations in the implementation of Racket. The latter is a problem that can be addressed, but it will take a while to work through all the obstacles.
So, this is a request for Racketeers to try out the initial support for parallel threads and report back on what you find, at least for tasks that you're pretty sure are amenable to parallelism. You can send feedback as a reply here, as an issue on GitHub, or as a note to me.

Some things that should parallelize well in the current implementation:
computations that fit the constraint of futures, such as numeric work or programs that involve a lot of immutable or unshared data structures;
things that almost work in futures, but that need occasional I/O, parameters, or exceptions;
basic file I/O, such as reading and writing regular (non-blocking) files; and
compiling/expanding programs in an isolated namespace, but only if the fragments to compile are large or plentiful enough.
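As a small illustration of the file-I/O case above, here is a hedged sketch (the file names are hypothetical) that reads several files in parallel, each thread writing into its own slot of a shared vector:

(define paths '("a.txt" "b.txt" "c.txt"))   ; hypothetical input files
(define pool (make-parallel-thread-pool))
(define results (make-vector (length paths) #f))
(define readers
  (for/list ([path (in-list paths)]
             [i (in-naturals)])
    (thread (λ () (vector-set! results i (file->string path)))
            #:pool pool)))                  ; file->string is from racket/file
(for-each thread-wait readers)              ; wait for all reads to complete

Each thread writes to a distinct vector slot, so the data stays effectively unshared.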
Some things that will not parallelize well currently:
synchronization via sync, contested semaphores (in some cases, make-fsemaphore is a useful alternative), or creating and waiting on threads, since those must coordinate with the Racket thread scheduler; and
intermittent blocking I/O with little data streamed in between, such as network communication with short messages.
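For the contested-semaphore case, the make-fsemaphore alternative mentioned above follows the usual lock pattern; a minimal sketch:

(define lock (make-fsemaphore 1))  ; binary fsemaphore used as a lock
(fsemaphore-wait lock)             ; acquire without coordinating with the Racket thread scheduler
;; ... short critical section over shared state ...
(fsemaphore-post lock)             ; release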
Support for parallel threads takes advantage of opportunities created by rebuilding Racket on Chez Scheme. (Racket BC still does not support parallel threads, and it never will.) Internally, a parallel thread is backed by both a future and a coroutine thread (i.e., regular Racket thread). The main extra work here was making Racket's coroutine thread scheduler cooperate more with the future scheduler and making the I/O layer safe for Chez Scheme threads — while making locks fine-grained enough to enable parallelism, and also keeping the cost of needed synchronization as low as possible, including for non-parallel Racket programs. There's still plenty of room for improvement, and it's a question of which directions would be most useful to explore.
I've created a set of tests rendering some triangles using a software
triangle rasterizer written in pure Racket. I tried futures, threads
with the 'own pool option, a thread pool, and a persistent thread
pool. In all cases I partitioned the rendering area into as many
stripes as there are available HW threads on my CPU (8).
In all cases the work is distributed between the threads using a
lock-free queue and an fsemaphore.
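(Not the actual test code, but the stripe partitioning described above might look roughly like this, with height and render-stripe! as hypothetical placeholders:)

(define n-stripes (processor-count))   ; one stripe per available HW thread
(define height 1080)                   ; hypothetical framebuffer height
(for ([i (in-range n-stripes)])
  (define y0 (quotient (* i height) n-stripes))         ; first row of stripe i
  (define y1 (quotient (* (add1 i) height) n-stripes))  ; one past the last row
  (render-stripe! y0 y1))              ; placeholder: hand rows [y0, y1) to a worker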
Racket 8.18.0.13 - futures (create futures for the work, RTT sends
work to the queues, then touches all futures):
So it seems that when the workload was already prepared for
future-like parallelization, it may benefit slightly from being run
in a persistent thread pool. However, the numbers are so close that
it might just be noise. On the other hand, there were no crashes and
no performance drops - and possibly cleaner code.
Thank you for all this work!
Dominik
P.S.: I can share the code if anyone is interested but it is a pile of various experiments hacked together just for testing the new parallelism ...
Like Dominik, I think the code with parallel threads is definitely cleaner. I also see about a 10% improvement (39s vs 44s) compared to places, when running with a pool of size (processor-count)/2. When running with a pool of size (processor-count), both solutions take longer and take about the same amount of time (about 49s).
(I haven't done any profiling after switching to parallel threads, so there might be more that can be improved here.)
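(For reference, assuming the pool constructor accepts a size argument, a half-size pool would presumably be created as (make-parallel-thread-pool (quotient (processor-count) 2)).)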
I re-ran some threaded Racket code I had (which in the past, both in Racket and in Guile, showed no speedup), and now, after installing Racket 8.18, I get roughly a 2x speedup: the running time drops from 1'25" to 42" in the Racket GUI (from the command line it is even better: 25").
I will expand on the explanation and give more details in another reply.
First, note that it is hard to compare the performance of sequential code with a parallelized version: the transformation can add lines of code and even require changing the data structures, which can lead to false hopes of better performance, and we may no longer be measuring the same thing. When parallelizing Scheme code that uses lists, I sometimes had to switch to vectors.
Note also that // code needs to work on a large amount of data: on small sets, the time spent splitting the job across multiple threads can distort the measurement, and you gain nothing. For this reason I sometimes add a test on the size of the data: below a threshold the code works on a single CPU, and when there is enough data it splits the job between multiple CPUs. (A sketch of this test follows.)
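A sketch of that threshold test (THRESHOLD and do-segment! are hypothetical placeholders; do-segment! is assumed to process the half-open range [start, end) of a vector):

(define THRESHOLD 500000)  ; hypothetical cutoff
(define (process! data)
  (define len (vector-length data))
  (define n (processor-count))             ; from racket/future
  (if (< len THRESHOLD)
      (do-segment! data 0 len)             ; small input: stay on one CPU
      (let ([workers
             (for/list ([i (in-range n)])
               (thread (λ ()
                         (do-segment! data
                                      (quotient (* i len) n)
                                      (quotient (* (add1 i) len) n)))
                       #:pool 'own))])     ; enough data: one thread per CPU
        (for-each thread-wait workers))))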
The most reliable way to measure, I think, is to use the same // code while progressively changing the number of processors in use, and of course to compare all the results against the various non-// versions.
So in my example I had pure Scheme code that deals with lists. To // it, at some point it was much easier to recode the critical parts with vectors, since it is easier to give each thread a different index range in the same vector than to split a list.
For the test I use a logic program of my own where the critical part is unifying the minterm sets (see "Canonical normal form" on Wikipedia). In general the problem is exponential in time, and often, I believe, in space (memory).
The result working with lists without // (original code): (funct-unify-minterms-set-1-unit mt-set1 mt-set2)
cpu time: 44536 real time: 44543 gc time: 22180
45" Chrono (means measured with a real chronometer)
other units are milliseconds
with vectors without any //: (funct-unify-minterms-set-1-unit-vector-1cpu mt-set1 mt-set2)
cpu time: 42497 real time: 42504 gc time: 11899
43" Chrono
note: in Racket, vectors make little difference in speed; if I remember well, when I tested this with Guile there was a big difference: lists were much slower than vectors, and slower than in Racket.
with futures: (funct-unify-minterms-set-1-unit-future set1 set2)
cpu time: 47454 real time: 46895 gc time: 11676
47" Chrono
note: futures bring no speedup (they are even a little slower), but the code was a port from Guile and I never understood why; in Guile too there was no improvement.
with threads: (funct-unify-minterms-set-1-unit-threads set1 set2)
cpu time: 90135 real time: 24133 gc time: 19252
25" Chrono - almost twice as fast as without //
note: the cpu time shows 90" of CPU usage, but the real time and the chronometer say 25". I suppose cpu time adds up the time spent on all the CPUs in use.
Same time with or without the 'own option, from the command line.
Doing additional tests modifying the number of allocated processors:
10 cpus, which on the 12th Gen Intel(R) Core(TM) i7-12650H is the number of hardware cores (not threads):
cpu time: 65509 real time: 25603 gc time: 16779
26" Chrono
8 cpus:
cpu time: 58822 real time: 25553 gc time: 15113
26" Chrono
4 cpus:
cpu time: 50010 real time: 27161 gc time: 13569
28" Chrono
2 cpus:
cpu time: 47268 real time: 32625 gc time: 12003
33 " Chrono
1 cpu:
cpu time: 44524 real time: 44530 gc time: 11267
45" Chrono
A few tests in the Racket GUI:
no // with vectors: (funct-unify-minterms-set-1-unit-vector-1cpu mt-set1 mt-set2)
1'19" Chrono
no // with lists:
cpu time: 68391 real time: 68402 gc time: 30053
1'08" Chrono
// 1 cpu with (make-parallel-thread-pool):
cpu time: 77519 real time: 77529 gc time: 15391
1'18" Chrono
// 16 cpus with (make-parallel-thread-pool):
cpu time: 213546 real time: 41916 gc time: 36846
Chrono 42"
// 16 cpus with 'own
cpu time: 212210 real time: 42922 gc time: 38442
Chrono 43"
I also made intermediate tests, progressively increasing the number of CPUs; they show the same progression as on the command line, so I will not display them.
The threaded routines of the code are:
#reader SRFI-105
(require Scheme+)
...
{p <- (make-parallel-thread-pool)} ; C12 : 43" in Racket GUI, 25" in CLI (command line)
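;; Overall flow: build the Cartesian product of set1 and set2 as a vector,
;; split that vector into one segment per CPU, unify each segment in its own
;; parallel thread, then collect the results, drop the #f entries, and
;; remove duplicates.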
(define (funct-unify-minterms-set-1-unit-threads set1 set2)
(nodebug
(display-nl "funct-unify-minterms-set-1-unit-thread : begin"))
(nodebug
{set1-length <- (length set1)}
{set2-length <- (length set2)}
(dv set1-length)
(dv set2-length)
(display-nl "before Cartesian product set"))
(nodebug
(dvs set1)
(dvs set2))
;; note : sorting is useless
;; modified code
{minterms-set <- (product-set-with-set-imperative set1 set2)} ;; set multiplication: create a list of pairs of minterms (a pair here is a 2-element list); alternatives tried: product-set-with-set-imperative-sorted, product-set-with-set, associate-set-with-set
(nodebug
(dvs minterms-set))
(nodebug
(display-nl "after Cartesian product set")
{minterms-set-length <- (length minterms-set)}
{minterms-set-first <- (first minterms-set)}
(dv minterms-set-length)
(dv minterms-set-first))
{minterms-vector <- (list->vector minterms-set)} ;; vector of (mathematical) pairs of minterms; a pair here is a 2-element list, not a Lisp pair
(nodebug
(dv minterms-vector))
{minterms-vector-length <- (vector-length minterms-vector)}
(nodebug
(dv minterms-vector-length))
;; on Acer Aspire 5 17" the result below display 16 processors
;;(display "logiki+ : funct-unify-minterms-set-1-unit-threads : (processor-count) = ") (display (processor-count)) (newline)
;; warning : // gives almost no better result
;; the // procedures use vectors instead of lists; with Guile this is faster than the sequential procedures initially written with lists
{nb-procs <- (processor-count)} ;; also tried 16, 10, 8, 4, 2, 1 ; C12 : 1'25" with processor-count ; 10 seems to be the number of hardware cores of the 12th Gen Intel(R) Core(TM) i7-12650H and one of the best performances for computing C12: 44", 42" on 16 threads, 25" from the command line!
;; (when {minterms-vector-length < 500000} ;; 1'21" C12 with 16 threads on Mac OS M1 , 1'57" on Linux intel, 1' 52" with 8 threads
;; (display "logiki+ : funct-unify-minterms-set-1-unit-threads : WARNING : falling back on one cpu (number of elements below threshold)") (newline)
;; {nb-procs <- 1})
;; (when (not {minterms-vector-length < 500000})
;; (display "logiki+ : funct-unify-minterms-set-1-unit-threads : run on multiple processors") (newline))
{segmts <- (segment 0
{minterms-vector-length - 1}
nb-procs)} ;; compute the segments
(nodebug
(dv segmts))
{unified-minterms-vector-1 <- (make-vector minterms-vector-length #f)}
(if {nb-procs = 1} ; when only one statement in then block, 'then' is useless in scheme+
(proc-unify-minterms-seg-and-tag (first segmts)) ;;(proc-unify-minterms-seg (first segmts))
else
(nodebug
(display-nl "before //"))
;; run the parallel code
{threads <- (map (λ (seg)
;;(display "initialising thread ")
;;(dv seg)
(thread (λ ()
;;(display "starting thread ")
;;(dv seg)
(proc-unify-minterms-seg-and-tag seg))
#:pool ; added for Racket 8.18
;;'own
p
)) ;; (proc-unify-minterms-seg-inner-definitions seg))));; (proc-unify-minterms-seg seg))))
segmts)}
(nodebug
(display-nl "waiting for threads to finish..."))
;; wait for threads to finish
(for-each (λ (thread)
            ;;(display "waiting thread ")
            ;;(dv thread)
            (thread-wait thread)) ;;(+ start-time max-sleep)))
          threads)
(nodebug
(display-nl "after //"))) ; end if ... then ... else
(nodebug
{unified-minterms-vector-1-length <- (vector-length unified-minterms-vector-1)}
(dv unified-minterms-vector-1-length)
(newline))
;; (unless {nb-procs = 1}
;; (vector-for-each tag-minterms unified-minterms-vector-1))
;; tag the minterms in the hash table
{unified-minterms-set-1 <- (vector->list unified-minterms-vector-1)}
(nodebug
(dvs unified-minterms-set-1))
{unified-minterms-set-2 <- (filter (λ (x) x) unified-minterms-set-1)} ;; remove #f results
(nodebug
{unified-minterms-set-2-length <- (length unified-minterms-set-2)}
(dv unified-minterms-set-2-length))
{unified-minterms-set <- (remove-duplicates unified-minterms-set-2)} ;;(remove-duplicates-sorted unified-minterms-set-2)} ;; uniq MODIF
(nodebug
{unified-minterms-set-uniq-length <- (length unified-minterms-set)}
(dv unified-minterms-set-uniq-length))
(nodebug
(dvs unified-minterms-set))
(nodebug
(display-nl "funct-unify-minterms-set-1-unit-thread : end"))
unified-minterms-set)