Help test via snapshots: parallel threads

We expect that the next Racket release (tentatively v9.0) will have additional support for parallelism. We need your help to get it right.

Racket has always provided lightweight threads for concurrency, but Racket threads have never taken advantage of multiple processor cores to run in parallel, unlike threads provided by an operating system or in many other languages. Racket provides places and futures for parallelism, but those constructs have restrictions that make them less easy to use than threads.

Current Racket snapshots (https://snapshot.racket-lang.org/) integrate parallelism into the thread layer. Use (thread thunk #:pool p) to put the new thread in the pool p as created with make-parallel-thread-pool, or use (thread thunk #:pool 'own) to have the Racket-level thread backed by its own OS-level thread.
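For concreteness, here is a minimal sketch of that API (assuming a current snapshot build; the workload is just a stand-in):

#lang racket

(define pool (make-parallel-thread-pool 4)) ; pool backed by 4 OS-level threads

(define workers
  (for/list ([i (in-range 4)])
    (thread (lambda ()
              ;; some CPU-bound stand-in work
              (for/fold ([acc 0]) ([j (in-range 1000000)])
                (+ acc (* i j))))
            #:pool pool)))

(for-each thread-wait workers)
(parallel-thread-pool-close pool)

;; Alternatively, give a single thread its own OS-level thread:
(define t (thread (lambda () (expt 2 100000)) #:pool 'own))
(thread-wait t)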

It's one thing to support the creation of parallel threads, and it's another to have those threads provide any performance benefit. Of course, locking and other synchronization requirements can reduce performance, instead of improving it. Worse performance can be inherent to a task, or it can be due to limitations in the implementation of Racket. The latter is a problem that can be addressed, but it will take a while to work through all the obstacles.

So, this is a request for Racketeers to try out the initial support for parallel threads and report back on what you find, at least for tasks that you're pretty sure are amenable to parallelism. You can send feedback as a reply here, as an issue on GitHub, or as a note to me.

Some things that should parallelize well in the current implementation:

  • computations that fit the constraint of futures, such as numeric work or programs that involve a lot of immutable or unshared data structures;
  • things that almost work in futures, but that need occasional I/O, parameters, or exceptions;
  • basic file I/O, such as reading and writing regular (non-blocking) files; and
  • compiling/expanding programs in an isolated namespace, but only if the fragments to compile are large or plentiful enough.

Some things that will not parallelize well currently:

  • synchronization via sync, contested semaphores (in some cases, make-fsemaphore is a useful alternative; see the sketch after this list), or creating and waiting on threads, since those must coordinate with the Racket thread scheduler; and
  • intermittent blocking I/O with little data streamed in between, such as network communication with short messages.
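For the contested-semaphore item above, here is a minimal sketch of the fsemaphore alternative (using the snapshot API shown earlier; the work itself is a placeholder):

#lang racket

(define pool (make-parallel-thread-pool 2))
(define done (make-fsemaphore 0)) ; counts completed work items

(for ([i (in-range 2)])
  (thread (lambda ()
            ;; ... do some work here ...
            (fsemaphore-post done)) ; signal completion without going through
          #:pool pool))             ; the coroutine-thread scheduler

;; the creating thread waits for both signals
(fsemaphore-wait done)
(fsemaphore-wait done)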

Support for parallel threads takes advantage of opportunities created by rebuilding Racket on Chez Scheme. (Racket BC still does not support parallel threads, and it never will.) Internally, a parallel thread is backed by both a future and a coroutine thread (i.e., regular Racket thread). The main extra work here was making Racket's coroutine thread scheduler cooperate more with the future scheduler and making the I/O layer safe for Chez Scheme threads — while making locks fine-grained enough to enable parallelism, and also keeping the cost of needed synchronization as low as possible, including for non-parallel Racket programs. There's still plenty of room for improvement, and it's a question of which directions would be most useful to explore.

13 Likes

Hi Matthew,

I've created a set of tests rendering some triangles using a software
triangle rasterizer written in pure Racket. I tried futures, threads
with 'own pool, a threadpool and a persistent threadpool. In all
cases I partitioned the rendering area into the same number of stripes
as there are available HW threads on my CPU (8).

In all cases the work is distributed between the different threads using
a lock-free queue and an fsemaphore.

Racket 8.18.0.13 - futures (create futures for the work, RTT sends
work to the queues, then touches all futures):

duration = 9480.26 ms
frames = 500
fps = 52.74
triangles = 5644

Racket 8.18.0.13 - threads with 'own pool (the same process
otherwise):

duration = 9788.13 ms
frames = 500
fps = 51.08
triangles = 5644

Racket 8.18.0.13 - threadpool allocated and closed for each frame:

duration = 9937.89 ms
frames = 500
fps = 50.31
triangles = 5644

Racket 8.18.0.13 - a persistent threadpool with 8 threads for all 500
frames (another fsemaphore used to signal to the RTT the work is
done):

duration = 9432.78 ms
frames = 500
fps = 53.01
triangles = 5644

For comparison, Racket 8.16 - futures:

duration = 10207.28 ms
frames = 500
fps = 48.98
triangles = 5644

So it seems that when the workload was already structured for
future-style parallelization, it may benefit slightly from running in
a persistent threadpool. However, the numbers are so close that it
might just be noise. On the other hand: no crashes and no performance
drops - and arguably cleaner code.

Thank you for all this work!

Dominik
P.S.: I can share the code if anyone is interested but it is a pile of various experiments hacked together just for testing the new parallelism ...

4 Likes

I updated my Racket solution to the one billion row challenge to use parallel threads:

Like Dominik, I think the code with parallel threads is definitely cleaner. I also see about a 10% improvement (39s vs 44s) compared to places, when running with a pool of size (processor-count)/2. When running with a pool of size (processor-count), both solutions take longer and take about the same amount of time (about 49s).

(I haven't done any profiling after switching to parallel threads, so there might be more that can be improved here.)
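(Not the actual challenge code, which isn't included here; just a sketch of the chunking pattern described, with the pool sized at half the processor count. The helper name is made up.)

#lang racket

(define (map-chunks-in-parallel work-on-chunk chunks)
  (define pool (make-parallel-thread-pool (quotient (processor-count) 2)))
  (define results (make-vector (length chunks)))
  (define threads
    (for/list ([c (in-list chunks)]
               [i (in-naturals)])
      (thread (lambda () (vector-set! results i (work-on-chunk c)))
              #:pool pool)))
  (for-each thread-wait threads)
  (parallel-thread-pool-close pool)
  (vector->list results))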

5 Likes

Hello Matthew,

I re-ran some threaded code for Racket that I had (which in the past showed no speedup, in either Racket or Guile), and now, after installing Racket 8.18, I see a speedup of approximately 90%: the running time of my code drops from 1'25" to 42" in the Racket GUI (and on the command line it is even better: 25").
I will expand on the explanations and give more information in another reply.

1 Like

First, note that it is hard to compare the performance of sequential code with a parallelized version, because the transformation can add lines of code and even require changing the data structures, which can lead to false hopes of better performance, and we are not always measuring the same thing. When parallelizing Scheme code that uses lists, I sometimes had to switch to vectors.

Note also that parallel (//) code needs to work on a large amount of data: on small data sets, the time spent splitting the job across multiple threads can skew the measurement, and you gain nothing. For this reason I sometimes add a test on the size of the data: below a threshold the code works on only one CPU, and when there is enough data it splits the job between multiple CPUs (see the sketch below).
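For what it's worth, that threshold idea might look roughly like this (the names and the threshold value are only illustrative):

#lang racket

(define SIZE-THRESHOLD 500000) ; illustrative cut-off

(define (choose-worker-count data-size)
  (if (< data-size SIZE-THRESHOLD)
      1                    ; too little data: splitting costs more than it saves
      (processor-count)))  ; enough data: use all hardware threads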

The most reliable way to measure, I think, is to use the same // code while progressively changing the number of processors in use, and of course to compare all the results with the different non-// versions.

So in my example I had pure Scheme code that dealt with lists.
To // it, at some point it was much easier to recode the critical parts with vectors, because it is easier to give each thread a different index range in the same vector than to split a list.
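The segment helper used in the code further below is not shown; something along these lines would split an inclusive index range into nb-procs roughly equal pieces (a sketch, not the author's actual implementation):

#lang racket

(define (segment start end nb-procs)
  (define span (add1 (- end start)))
  (define base (quotient span nb-procs))
  (define extra (remainder span nb-procs))
  (for/list ([i (in-range nb-procs)])
    (let* ([lo (+ start (* i base) (min i extra))]
           [hi (+ lo base (if (< i extra) 1 0) -1)])
      (list lo hi)))) ; each segment as an inclusive (lo hi) pair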

For the test I use a logic program of my own where the critical part is unifying the minterm sets (Canonical normal form - Wikipedia). In general the problem is exponential in time and often, perhaps, in space (memory).

The result working with lists without // (original code):
(funct-unify-minterms-set-1-unit mt-set1 mt-set2)
cpu time: 44536 real time: 44543 gc time: 22180
45" Chrono (means measured with a real chronometer)
other units are milliseconds

with vectors without any //:
(funct-unify-minterms-set-1-unit-vector-1cpu mt-set1 mt-set2)
cpu time: 42497 real time: 42504 gc time: 11899
43" Chrono

note: in Racket, lists and vectors show little difference in speed; if I remember correctly, when I tested this with Guile there was a big difference: lists were much slower than vectors, and slower than in Racket.

With // code using threads on a 12th Gen Intel(R) Core(TM) i7-12650H, with (processor-count) threads, which is 16 here.

with future :
(funct-unify-minterms-set-1-unit-future set1 set2)
cpu time: 47454 real time: 46895 gc time: 11676
47" Chrono

note: futures bring no speedup, and are even a little slower, but the code was a port from Guile and I never understood why; in Guile too there was no improvement.

with threads :
(funct-unify-minterms-set-1-unit-threads set1 set2)
cpu time: 90135 real time: 24133 gc time: 19252
25" Chrono
almost twice as fast as without //
note: the cpu time shows 90" of CPU usage, but the real time and the chronometer say 25"; I suppose cpu time is summed over the number of CPUs in use.
Same time with or without the 'own option on the command line.

Doing additional tests modifying the number of allocated processors:
10 cpus, which on the 12th Gen Intel(R) Core(TM) i7-12650H is the number of hardware cores (not threads):
cpu time: 65509 real time: 25603 gc time: 16779
26" Chrono

8 cpus:
cpu time: 58822 real time: 25553 gc time: 15113
26" Chrono

4 cpus:
cpu time: 50010 real time: 27161 gc time: 13569
28" Chrono

2 cpus:
cpu time: 47268 real time: 32625 gc time: 12003
33 " Chrono

1 cpu:
cpu time: 44524 real time: 44530 gc time: 11267
45" Chrono

A few tests in the Racket GUI:
no // with vectors:
(funct-unify-minterms-set-1-unit-vector-1cpu mt-set1 mt-set2)
1'19" Chrono

no // with lists:
cpu time: 68391 real time: 68402 gc time: 30053
1'08" Chrono

// 1 cpu with (make-parallel-thread-pool):
cpu time: 77519 real time: 77529 gc time: 15391
1'18" Chrono

// 16 cpus with (make-parallel-thread-pool):
cpu time: 213546 real time: 41916 gc time: 36846
Chrono 42"

// 16 cpus with 'own
cpu time: 212210 real time: 42922 gc time: 38442
Chrono 43"

I also made intermediate tests with a progression of CPU counts; they show the same kind of progression as on the command line, so I will not display them.

The threaded routines of the code are:

#reader SRFI-105
(require Scheme+)

...

{p <- (make-parallel-thread-pool)} ; C12 : 43" in Racket GUI ,25" in CLI (command line)

(define (funct-unify-minterms-set-1-unit-threads set1 set2)

  (nodebug
   (display-nl "funct-unify-minterms-set-1-unit-thread : begin"))
  
  (nodebug
   {set1-length <- (length set1)}
   {set2-length <- (length set2)}
   (dv set1-length)
   (dv set2-length)
   (display-nl "before Cartesian product set"))
  
  (nodebug
   (dvs set1)
   (dvs set2))

  ;; note : sorting is useless
  ;; modified code
  {minterms-set <- (product-set-with-set-imperative set1 set2)} ;;(product-set-with-set-imperative-sorted set1 set2)} ;;(product-set-with-set set1 set2)} ;;(associate-set-with-set set1 set2)} ;; set multiplication : create list of pair of minterms - pair is a 2 element list 

  (nodebug
   (dvs minterms-set))

  (nodebug
   (display-nl "after Cartesian product set")
   {minterms-set-length <- (length minterms-set)}
   {minterms-set-first <- (first minterms-set)}
   (dv minterms-set-length)
   (dv minterms-set-first))

  {minterms-vector <- (list->vector minterms-set)} ;; vector of pair (mathematic) of minterms - pair (mathematic) is a 2 element list, not a pair (Lisp)

  (nodebug
   (dv minterms-vector))

  {minterms-vector-length <- (vector-length minterms-vector)}

  (nodebug
   (dv minterms-vector-length))

  ;; on Acer Aspire 5 17" the result below display 16 processors
  ;;(display "logiki+ : funct-unify-minterms-set-1-unit-threads : (processor-count) = ") (display (processor-count)) (newline)
  
  ;; warning : // gives almost no better result
  ;; it (// procedures) uses Vectors instead of Lists, with Guile it is faster than the sequential procedures written initially in Lists 
  {nb-procs <- (processor-count)} ;16}; 10} ;1} ;2} ; 4};; 8} ;; 16} ;;(processor-count)} ;; 4};; C12 :1'25" with processor-count ; 10 seems to be the number of hardware core of 12th Gen Intel(R) Core(TM) i7-12650H and on of the best performance for computing C12: 44", 42" on 16 threads, 25" in command line !


  ;; (when {minterms-vector-length < 500000} ;; 1'21" C12 with 16 threads on Mac OS M1 , 1'57" on Linux intel, 1' 52" with 8 threads
  ;;   (display "logiki+ : funct-unify-minterms-set-1-unit-threads : WARNING : falling back on one cpu (number of elements below threshold)") (newline)
  ;;   {nb-procs <- 1})

  ;; (when (not {minterms-vector-length < 500000})
  ;;   (display "logiki+ : funct-unify-minterms-set-1-unit-threads : run on multiple processors") (newline))
  
  {segmts <- (segment 0
		      {minterms-vector-length - 1}
		      nb-procs)} ;; compute the segments

  (nodebug
   (dv segmts))

  {unified-minterms-vector-1 <- (make-vector minterms-vector-length #f)}

  (if {nb-procs = 1} ; when only one statement in then block, 'then' is useless in scheme+
      
      (proc-unify-minterms-seg-and-tag (first segmts)) ;;(proc-unify-minterms-seg (first segmts))

    else

       (nodebug
	(display-nl "before //"))
       
       ;; run the parallel code
       {threads <- (map (λ (seg)
			  ;;(display "initialising thread ")
			  ;;(dv seg)
			  (thread (λ ()
				    ;;(display "starting thread ")
				    ;;(dv seg)
				    (proc-unify-minterms-seg-and-tag seg))
				  #:pool ; added for Racket 8.18
				  ;;'own
				  p
				  )) ;; (proc-unify-minterms-seg-inner-definitions seg))));; (proc-unify-minterms-seg seg))))
			segmts)}

       (nodebug
	(display-nl "waiting for threads to finish..."))
  
       ;; wait for threads to finish
       (map (λ (thread)
	      ;;(display "waiting thread ")
	      ;;(dv thread)
	      (thread-wait thread)) ;;(+ start-time max-sleep)))
	    threads)
       
       (nodebug
	(display-nl "after //"))) ; end if ... then ... else

  (nodebug
   {unified-minterms-vector-1-length <- (vector-length unified-minterms-vector-1)}
   (dv unified-minterms-vector-1-length)
   (newline))

  ;; (unless {nb-procs = 1}
  ;; 	  (vector-for-each tag-minterms unified-minterms-vector-1))
  ;; tag the minterms in the hash table
  
  {unified-minterms-set-1 <- (vector->list unified-minterms-vector-1)}
  
  (nodebug
   (dvs unified-minterms-set-1))
  
  {unified-minterms-set-2 <- (filter (λ (x) x) unified-minterms-set-1)} ;; remove #f results
  (nodebug
   {unified-minterms-set-2-length <- (length unified-minterms-set-2)}
   (dv unified-minterms-set-2-length))

  {unified-minterms-set <- (remove-duplicates unified-minterms-set-2)} ;;(remove-duplicates-sorted unified-minterms-set-2)} ;; uniq MODIF
  (nodebug
   {unified-minterms-set-uniq-length <- (length unified-minterms-set)}
   (dv unified-minterms-set-uniq-length))
  
  (nodebug
   (dvs unified-minterms-set))

  (nodebug
   (display-nl "funct-unify-minterms-set-1-unit-thread : end"))
      
  unified-minterms-set)

more code is here:

note: if you see C12 or Cn, it is like X12 or Xn in the 'a math logic problem' section of library-FunctProg/README.md at master · damien-mattei/library-FunctProg · GitHub;
it is the 12th value computed, but the full code of this problem is not public.

In conclusion, it is very good to see that the threaded versions now improve the speed, making the code almost 2 times faster.

2 Likes

I haven't experimented with the snapshots yet, but I have some design questions.

This struck me as requiring the programmer to be much more explicit about resource use than with futures. It seems clearly useful to be able to be explicit about how OS threads are created, but would it be reasonable to also have a mode more like futures? My understanding (though I'm not 100% sure of this) is that futures are multiplexed over a pool of OS threads managed by the Racket implementation.

Is it correct that current-thread-group has no effect on parallel threads?

This made me wonder about the relationship between make-fsemaphore and make-os-semaphore. (The fact that for os-semaphore-wait, “if the current thread is a Racket thread, then waiting also blocks all Racket threads” seems like one notable difference.)

Similarly, how do os-async-channel?s interact with parallel threads? In particular, are they still “Racket threads” that should use sync instead of os-async-channel-get?

It is clear that call-in-os-thread has significant restrictions not applicable to the new parallel threads, particularly atomic mode and the lack of an associated coroutine thread, but how should programmers think about the way the two fit together overall?

These questions and the recent addition of uninterruptible mode also drew my attention back to doc: improve explanation of unsafety for atomic mode by mflatt · Pull Request #4876 · racket/racket · GitHub about clarifying the restricted modes that exist, particularly which are subsets of others.

I originally expected to add a default thread pool in some way, but I'm not sure whether it's a good idea. It might be that a programmer using parallel threads should be expected to think about this more explicitly. I was hoping to hear more about use cases.

The effect is smaller, but it's still relevant to the degree that the thread has to synchronize with other threads (which, in the implementation, corresponds to running in a backing coroutine thread). I will update the documentation to clarify.

I'll update the documentation to clarify that os-semaphore-wait (1) blocks all coroutine threads when called from a coroutine thread, and (2) effectively makes a parallel thread's pool have one less processor while blocking when called from a parallel thread.

Yes, they are still Racket threads that should use sync.

Well, I will personally have many more use cases - it requires a LOT of time, however. Given the current pace of my projects it's gonna be months not weeks.

Just to confirm my understanding here. If I am using parallel threads (the new ones), make-fsemaphore is the way to go for the synchronization primitive with the smallest overhead. Correct?

However, what if I wanted to use something that can be sync'd? Would it be possible to make fsemaphore a synchronizable event with these latest changes? The use case is a "runtime" (in a very general sense of the word) thread waiting for multiple events, some of them coming from parallel threads and/or futures via fsemaphores. Currently I can only think of a busy-wait loop using fsemaphore-try-wait (sketched below), which is not a great option.
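The busy-wait workaround mentioned above might look roughly like this (a sketch; wait-either and other-evt are made-up names, and the polling interval is arbitrary):

#lang racket

(define (wait-either fsema other-evt)
  (let loop ()
    (cond
      [(fsemaphore-try-wait? fsema) 'fsema]       ; fsemaphore side became ready
      [(sync/timeout 0.001 other-evt) => values]  ; some other event became ready
      [else (loop)])))                            ; poll again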

Another thought on this one - I would strongly advocate for NOT creating a default parallel thread pool. At least for the time being. I agree that a programmer should make a conscious decision here and think about the true parallelism explicitly. If you are working with shared data structures, you quickly run into weird issues that are hard to debug.

For example: process a big pile of values in an fxvector while a second thread occasionally reads/writes random values in the same fxvector (there are valid use cases for this). Now partition the fxvector into as many partitions as you have cores and use futures/parallel threads/OS threads to perform the big processing in parallel (and therefore, in theory, much faster). It can be shown that the truly parallel version is much slower than the original one due to CPU cache misses. Not a joke: I ran into this some time around the CS switch, and with 8 cores it didn't give an 8x speedup, it was more like a 10x slowdown.
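A sketch of that scenario (sizes and names are illustrative, and the point is only to show the sharing pattern, not to reproduce the exact slowdown):

#lang racket

(define data (make-fxvector 10000000))
(define n (processor-count))
(define chunk (quotient (fxvector-length data) n))
(define pool (make-parallel-thread-pool n))

;; one parallel thread per partition, each updating only its own slice
(define workers
  (for/list ([i (in-range n)])
    (thread (lambda ()
              (for ([j (in-range (* i chunk) (* (add1 i) chunk))])
                (fxvector-set! data j (fx+ 1 (fxvector-ref data j)))))
            #:pool pool)))

;; meanwhile, another thread reads/writes random slots in the same fxvector;
;; this is the kind of sharing that can defeat the CPU caches
(define poker
  (thread (lambda ()
            (for ([_ (in-range 100000)])
              (fxvector-set! data (random (fxvector-length data)) 0)))
          #:pool pool))

(for-each thread-wait (cons poker workers))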

So although true parallelism is REALLY useful for big data crunching, it almost always gets very tricky very quickly, and making the parallelism implicit rather than requiring an explicit opt-in seems like a recipe for trouble.

I updated my array programming tests using futures to use parallel threads and did some tests. This code does the bulk of the work via FFI, but splits the work up on the Racket side. This is the same code that prompted the creation of the uninterruptible keyword. Here's the test code:

(define (time-sum-columns size)
  (define (cleanup)
    (collect-garbage)
    (collect-garbage)
    (collect-garbage))
  (define a (make-tensor (vector size size) 'index))
  (define num-threads (processor-count))
  (define batch-size (/ size num-threads))
  (define results (make-vector size))
  (cleanup)
  ;; serial version
  (time ((thunk
         (for ([i (in-range 0 size)])
           (vector-set! results
                        i
                        (tsum (tslice a `(() (,i ,i 1)))))))))
  ;; future version
  (cleanup)
  (time ((thunk
         (define fs
           (for/list ([i (in-range num-threads)])
             (future (lambda ()
                       (for ([j (in-range (* i batch-size) (* (add1 i) batch-size))])
                         (vector-set! results
                                      j
                                      (tsum (tslice a `(() (,j ,j 1))))))))
         (for ([f (in-list fs)])
           (touch f)))))
  ;; parallel thread version
  (cleanup)
  (time ((thunk
         (define thds
           (for/list ([i (in-range num-threads)])
             (thread (lambda ()
                       (for ([j (in-range (* i batch-size) (* (add1 i) batch-size))])
                         (vector-set! results
                                      j
                                      (tsum (tslice a `(() (,j ,j 1)))))))
                     #:pool 'own)))
         (for ([thd (in-list thds)])
           (thread-wait thd))))))

And here are the results from a series of runs; each run reports, in order:
time for serial,
time for futures,
time for parallel threads:

tensor-test.rkt> (time-sum-columns 10000)
cpu time: 1693 real time: 1693 gc time: 7
cpu time: 3097 real time: 245 gc time: 178
cpu time: 2998 real time: 229 gc time: 113
tensor-test.rkt> (time-sum-columns 10000)
cpu time: 1671 real time: 1671 gc time: 6
cpu time: 2965 real time: 226 gc time: 180
cpu time: 2921 real time: 226 gc time: 112
tensor-test.rkt> (time-sum-columns 10000)
cpu time: 1674 real time: 1674 gc time: 6
cpu time: 2991 real time: 234 gc time: 100
cpu time: 4687 real time: 420 gc time: 79
tensor-test.rkt> (time-sum-columns 10000)
cpu time: 1672 real time: 1672 gc time: 5
cpu time: 2953 real time: 198 gc time: 123
cpu time: 3320 real time: 415 gc time: 133
tensor-test.rkt> (time-sum-columns 10000)
cpu time: 1666 real time: 1666 gc time: 6
cpu time: 3102 real time: 206 gc time: 119
cpu time: 3933 real time: 480 gc time: 102
tensor-test.rkt> (time-sum-columns 10000)
cpu time: 1708 real time: 1708 gc time: 5
cpu time: 2955 real time: 221 gc time: 130
cpu time: 4915 real time: 487 gc time: 74
tensor-test.rkt> (time-sum-columns 10000)
cpu time: 1716 real time: 1717 gc time: 5
cpu time: 3074 real time: 236 gc time: 184
cpu time: 3189 real time: 315 gc time: 107
tensor-test.rkt> (time-sum-columns 10000)
cpu time: 1664 real time: 1664 gc time: 6
cpu time: 2963 real time: 235 gc time: 167
cpu time: 4333 real time: 429 gc time: 81

Parallel thread performance is less consistent than futures, usually slower but sometimes just as fast.

Note that I ran these tests in racket-mode, in case that makes a difference.

2 Likes

I can definitely see the logic for deferring this feature, and the design of the #:pool argument leaves plenty of room for future extension.


I have some questions about the interactions between parallel threads and custodians. Again, the status quo may be fine, either forever or until future extension: I'm just trying to understand the design.

I noticed that thread/suspend-to-kill doesn't accept either of the new arguments: it seems like #:keep would be a straightforward addition, but, as I understand the design, I don't think a parallel thread can be kill safe, because custodian-shutdown-all could permanently shut down its thread pool.

I'm a little uneasy in general about the fact that:

If the custodian is shut down, then the pool is closed in the same way as with parallel-thread-pool-close, but in addition, no parallel threads in the pool will be able to continue. The threads will not count as terminated (since each thread has its own custodians for that purpose), but they will cease to make progress.

Of course, it is already possible to have threads that fail to make progress (e.g. (thread (λ () (let loop () (loop))))), but it seems to have a high potential for confusion, especially because there doesn't seem to be any way to detect this state, either from the pool or from the thread.

For comparison with futures:

A future never runs in parallel if all of the custodians that allow its creating thread to run are shut down. Such futures can execute through a call to touch, however.

I'm realizing as I'm writing this that I'm not sure if futures can resume running in parallel if their creating thread/suspend-to-kill thread gets new custodians.

Some other possible (not necessarily good) designs I can imagine:

  1. The custodians managing the thread that called make-parallel-thread-pool also control whether the pool's threads are allowed to run in parallel, as with futures.
  2. Shutting down a pool's custodian just calls parallel-thread-pool-close, allowing the pool's threads to continue as long as their respective custodians permit.
  3. There could be a thread-resume-like operation for thread pools. That would raise many more questions.

I never wrote it up, but I had thought a while ago that the pattern for kill safety of using thread-resume, especially with (current-thread) as the benefactor, is hard to replicate if the underlying custodian-managed value is not a thread. This came up when I was experimenting with (still incomplete) FFI bindings for libVLC. I wanted pressing Run in DrRacket to immediately stop any audio I might have had playing in the previous session, which worked nicely with ffi/unsafe/custodian, but I couldn't see a way to make a player instance kill-safe, other than perhaps (I didn't try this) by doing something like (thread/suspend-to-kill (λ () (sync never-evt))) to create a proxy kill-safe thread that could just be observed to see if the player was supposed to be allowed to run. (One of the reasons I didn't explore this further is that I wasn't sure if a kill-safe media player value would actually be useful.)

Yes, most likely. This is a place where there's definitely room for improvement.

One reason the two kinds of semaphores still exist: they have different behaviors with respect to futures. Regular semaphores continue to block futures (disabling parallelism) on the grounds that the semaphore may implement a lock, and futures are not guaranteed to keep running concurrently, so a lock that's taken by a future might not be released. In a parallel thread, both kinds of semaphores behave the same; also, the implementation of fsemaphores has improved with respect to coroutine threads. Still, fsemaphores are faster at swapping parallel threads, and regular semaphores are faster at swapping coroutine threads. There may be a path to unifying semaphore and fsemaphore performance and having them differ only in how they work with futures.

For now, though, using fsemaphores is probably the right choice in parallel threads if you don't want sync, and using a regular semaphore in parallel threads should work reasonably well if you do want sync.
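For example, a regular semaphore in a parallel thread, waited on with sync (a sketch using only the API mentioned above):

#lang racket

(define pool (make-parallel-thread-pool 2))
(define sema (make-semaphore 0)) ; a regular semaphore, so it is a synchronizable event

(define t
  (thread (lambda ()
            ;; ... produce something ...
            (semaphore-post sema))
          #:pool pool))

(sync sema)     ; the waiting side can use sync, handle-evt wrappers, etc.
(thread-wait t)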

Probably the right API change would be to add #:suspend-to-kill? to thread, instead of adding #:keep to thread/suspend-to-kill. I didn't do that, yet, because it didn't seem useful enough, but you raise a good point that it's helpful to think more about the possible interactions and check whether the API makes sense in that direction.

I agree with this idea. Although I was worried about the possibility that a thread pool could escape a custodian shutdown if its threads all became unreachable, the implementation was also trying to treat an 'own thread pool specially, and it wasn't correct. Some internal finalizer adjustments can make this work, so I'll try changing the rule for a custodian shutdown on a thread pool as you suggest.

If we one day allow a combination of non-#f #:suspend-to-kill? and #:pool, probably the rule would need to be that a parallel thread gets demoted to a coroutine thread if it ever loses all of its custodians. It would also make sense to have a way to add it to a new pool, if that turns out to be worthwhile.

1 Like

Code at the bottom. Running on the snapshot on Windows 11 (24H2) on a Core Ultra 9 285K (8/16 cores).

Note, this code is meant to be bad. I expect chaos in the output file as threads dive in and out. And this is the case. The results:

cpu time: 5937 real time: 1721 gc time: 15

Now, if I don't use the pool at all, this is the result:

cpu time: 281 real time: 276 gc time: 46

I did this multiple times with different numbers. Same overall behavior.

What I was expecting is the thread pool would make things a bit better, but not by much.

This seems like a real issue. Yes, I know the code is bad, but adding a thread pool should not make it that much worse. Especially with the sizes we are talking about here.

#lang racket

(define output-port (open-output-file "C:\\temp\\randomness.txt" #:mode 'text))

(define pool (make-parallel-thread-pool 4))

(define (make-threads count)
  (for/list ([i (in-range count)])
    (thread
     (lambda ()
       (for/list ([t (range 100000)])
         (writeln (format "Thread ~a: ~a" i (random 100)) output-port)))
     #:pool pool)))

(time (map thread-wait (make-threads 4)))

Thanks for trying out parallel threads and reporting back!

I would not expect this program to benefit from parallel threads, since it spends most of its time working with a shared, contested output port. (There's also a contested pseudo-random generator, but that seems not to matter here, because random-number generation is so much faster than output.)

The example does show that the penalty for parallelism blockers in Racket can be higher than in other settings.

An example C program shows a similar effect. When THREADED is set to 0, then on my machine, I get

laptop$ time ./randomness
./randomness  0.06s user 0.01s system 96% cpu 0.071 total

When I set THREADED to 1, then it takes about 1.5x as long

laptop$ time ./randomness
./randomness  0.08s user 0.23s system 264% cpu 0.117 total

A 1.5x factor is much less than the 6-7x factor that you're seeing for Racket. The C program is also locking differently. If I turn on PRINT_IN_PIECES (which is closer to what println is going to do), the difference with threading is a bit more than 2x, so that's a small piece of the difference.

In Racket, when different threads have different output ports, then there's less contention and more opportunity for parallelism:

#lang racket

(define path "/tmp/randomness.txt")

(when (file-exists? path)
  (delete-file path))
(define (get-output-port) (open-output-file path
                                            #:mode 'text
                                            #:exists 'append))
(define pool (make-parallel-thread-pool 4))

(define (make-threads count)
  (for/list ([i (in-range count)])
      (thread
       (lambda ()
         (define output-port (get-output-port))
         (for/list ([t (range 100000)])
           (println (format "Thread ~a: ~a\n" i t) output-port)))
       #:pool pool)))

(time (map thread-wait (make-threads 4)))

In this last example, somewhere in the OS, there's still a shared file to be managed with data arriving through different file descriptors. But by pushing sharing down into the OS, the locks get more fine-grained, and that's the name of the game with contention. The penalty of a shared output port in Racket is much bigger than the penalty for a shared FILE* in C due to things around a lock that make them heavier and coarser in Racket.

The relative penalty for contention is not something we can just ignore, of course, because there's always sharing at some level. (The current snapshot implementation reflects effort to reduce the cost; the penalty in my first cut would have been 10x greater, even, because the lock for ports was not yet specialized.) This is a direction for further improvement. But the difference in this program is at least explainable and an expected limitation, not something going terribly wrong.

Thanks for the kind response. It's nice to be interacting with you again. I made an error in stating that I thought the thread pool would be slightly better; I meant worse. The magnitude of the difference is what caught me off guard, and I thought it might be a bug.

Of course, there is an argument that my Racket code shouldn't run at all. Here's a C# example (I know, don't panic) I was looking at to compare and contrast:

var outputStream = new StreamWriter("C:\\temp\\randomness2.txt");
List<Task> tasks = new List<Task>();
Random rnd = new Random();

for (int i = 0; i < 8; i++) 
{
    tasks.Add(new Task (() =>
        {
            for (int j = 0; j < 10000; j++)
            {
                var outString = $"Thread {i}, {rnd.Next(1, 1000)}";
                outputStream.WriteLine(outString);
            }
        }));
}

foreach (var t in tasks) 
{
    t.Start();
}

Task.WaitAll(tasks);
outputStream.Close();

This dies with a fairly unhelpful exception. The StreamWriter isn't thread safe. Nor are instances of the Random class (and the bugs there are much more subtle).

So, this really comes down to API documentation and that's out of scope here. I have some other thoughts, but I need some time to bake those.

Nathan

Firstly, I apologize if I am off track again. I got confused about threads in Racket (they aren't tied to OS threads at all).

I changed the code to spray output to multiple files. I am not expecting any big change in actual run times as there is still a lot of (blocking?) I/O being done here. This is about throughput. I wrote the code this way to experiment with multiple pools.

(define output-ports
  (for/list ([i (range 8)])
    (open-output-file (format "C:\\temp\\randomness-~a.txt" i)
                      #:mode 'text #:exists 'truncate)))

(define pool (make-parallel-thread-pool 8))

(define (make-threads count)
  (for/list ([i (in-range count)])
    (list
     (thread
      (lambda ()
        (for/list ([t (range 100000)])
          (writeln (format "Thread/p1 ~a: ~a" i (random 1000))
                   (list-ref output-ports i))))
      #:pool pool)
     (thread
      (lambda ()
        (for/list ([t (range 100000)])
          (writeln (format "Thread/p2 ~a: ~a" i (random 1000))
                   (list-ref output-ports (+ i 4)))))
      #:pool pool))))

(time (map thread-wait (flatten (make-threads 4))))
(map close-output-port output-ports)

Code running without the pool:

cpu time: 578 real time: 622 gc time: 62

With the pool:

cpu time: 1578 real time: 261 gc time: 203

Pool size from 8 to 12:

cpu time: 1687 real time: 306 gc time: 250

Bumping up the work done (10x the output per file).

Without the pool:

cpu time: 5968 real time: 6057 gc time: 296

With the pool (size 8):

cpu time: 16125 real time: 2244 gc time: 2140

Pool size from 8 to 12:

cpu time: 17531 real time: 2599 gc time: 2218

I don't understand why there is a noticeable slowdown here. We are no longer contending over a single file in terms of I/O. It seems that as the size of the pool grows, there is more overhead as well.

In general, this was confusing to me. Thread pools are a different beast in other languages. They are the underlying mechanism on which futures, tasks, non-blocking I/O, sockets and so on are based, in terms of parallelism. You interact with them by queuing workers or work items, and the goal is to never use threads directly.

It makes much more sense to me if Racket threads stay tied to a single execution unit. For low-level interaction with OS threads, I'd create a separate worker or job abstraction that can be queued into a thread pool or tied to a separate OS thread; something like the sketch below.
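For what it's worth, a sketch of that kind of worker/job abstraction on top of the new pools (the names are made up; jobs go through an async channel, so coordination still involves the Racket scheduler, and this is not meant as an optimized queue):

#lang racket
(require racket/async-channel)

(define (make-job-queue worker-count)
  (define jobs (make-async-channel))
  (define pool (make-parallel-thread-pool worker-count))
  (for ([_ (in-range worker-count)])
    (thread (lambda ()
              (let loop ()
                (define job (async-channel-get jobs)) ; block until a job arrives
                (unless (eq? job 'stop)
                  (job)
                  (loop))))
            #:pool pool))
  jobs)

(define (submit-job! jobs thunk)
  (async-channel-put jobs thunk))

;; usage:
;; (define q (make-job-queue 4))
;; (submit-job! q (lambda () (displayln "hello from a worker")))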

Also, in other languages a lot of what custodians do is what a thread pool does. Since threads are so tied to the resources they use, managing those resources in the thread pool makes sense. At a low level, a thread is just another handle with a lifecycle that needs to be managed much like files, sockets and so on.

I also must stress how useful it is to study what C# does in this area (and what they stole from F#). They were the first to bring the whole async/await abstraction, in 2012, along with all the API changes to support non-blocking I/O with it. C# has had patterns for handling non-blocking I/O since the first version.

As an aside, the reason C# did callbacks, async, and non-blocking I/O was Windows NT. The kernel had completion ports, making non-blocking I/O quite easy to support. Linus himself called it "the right way to do it [non-blocking I/O]", and the model was eventually ported to Linux much later.

Nathan Dykman

Yes, this is definitely another limitation. My rule of thumb is that parallelism becomes counterproductive for most Racket programs beyond 6 or 8 parallel tasks, whether via futures, places, or parallel threads.

I'm pretty sure the relevant shared resource is the memory manager. Allocation is thread-local, and the garbage collector itself runs in parallel for major collections, but all collections (including minor collections) synchronize all threads. Getting to the next level probably requires support for thread-local minor collections.

So, having more than 8 OS threads starts to be counterproductive, regardless of the concurrency mechanism used. Did I understand that correctly?

It's a good thing to document. People coming from other runtimes might expect to have two times the core count in threads (or more) before overhead becomes an issue, so noting that Racket is different would help a lot.

Makes sense. The GC in C# does that (more info). There is a dedicated GC thread for cleaning up, which checks for and yields to a foreground (major) collection going on in any thread. So, some significant changes might indeed be in order.

It will be interesting to see where this goes. I do believe a good goal would be for "reasonable" programs to have no noticeable slowdown when a pool is used. By reasonable, I mean programs that don't have massive contention issues that were hidden by running in a single-threaded context.

Thanks for your consideration. I do miss teaching and (some) research (I can't work due to health reasons), and it was fun to look at some of this stuff again.

Nathan Dykman