Are these 100 ms work units really all 100 ms (or is that some kind of average), and are they algorithmically of the same kind, or are they just similar and treated in a generalized way?
If these tasks are heterogeneous but treated in a generic way, then I still think there might be possibilities to treat each kind in a specific way and get speedups.
(I mostly mention this because sometimes people write code that is way too general and then wonder why it is slow; I am not saying that you in particular do this.)
No matter whether this is the case or not, here is another project that might be of interest:
Sadly, sham does not currently seem to have a package that can be easily installed to start using it.
But I haven't actually spent time trying to use sham, so I don't know whether it is easy or difficult to get started with.
Also, I can mostly only give hints about what you could try. I haven't had a practical need to do high-performance/high-throughput computation myself, so my practical experience is limited, although I find the topic interesting (someday I will find the time or have a practical need).
I am not sure these abstract numbers are that useful or can be connected to a real implementation. Why? I can't see which part is potentially cacheable and which part is actually doing the computation.
I am also not sure how caching is actually that helpful if all tasks can be run completely independently in 100 ms in parallel; adding caching to that would likely increase latency.
Is your goal lower latency?
Also, what can be shared between the tasks, and what is the distribution of the shareable parts? Can it be predicted or grouped? Is there some heuristic that could group tasks that are likely to share some sub-computation?
If these numbers were real, I could say: D implies that every task can be run completely independently, and E wants 25,000 ms with caching.
Then my question would be: why not just use 4 workers to get roughly 25,000 ms without caching (the implied total serial work divided across 4 workers)?
Or, if the tasks are that independent, maybe even use some kind of compute shader to implement D (half serious / half joke).
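To make the 4-worker idea concrete, here is a minimal sketch using futures (one of the options mentioned below). `run-task` is a made-up stand-in for whatever one of your tasks actually does, and this only parallelizes well if the task body stays future-safe; otherwise places would be the more robust choice:

```racket
#lang racket/base
(require racket/future)

;; Hypothetical stand-in for one ~100 ms unit of work; replace with the real task.
(define (run-task i)
  (for/fold ([acc 0.0]) ([k (in-range 1000000)])
    (+ acc (* 1e-6 (exact->inexact (modulo (* i k) 97))))))

;; Launch n independent tasks as futures; the runtime can run them on up to
;; (processor-count) OS threads, so with 4 cores the wall-clock time should
;; approach total-work / 4 -- but only if the task body avoids operations
;; that block futures (I/O, certain primitives, etc.).
(define (run-all n)
  (define fs
    (for/list ([i (in-range n)])
      (future (lambda () (run-task i)))))
  (for/list ([f (in-list fs)])
    (touch f)))

(time (void (run-all 1000)))
```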
I think numbers like these may not be that helpful as a basis for actually giving you advice, because we all have different ideas about what is being calculated, etc.
I like that you provided them, because they gave me some sense of what the problem looks like, but it could still be a wide variety of different problems.
If neither futures, places, sham, shared vectors, nor memory-mapped files/pages solve your issue, then my next approach would probably be to look at whether some low-level language (like C, Rust, Zig, etc.) can be used to build a shared library that does the actual work, and to use that with Racket's FFI.
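As a minimal sketch of that last option (the library name "libtask" and the exported function are assumptions; adjust them to whatever you actually build from the low-level side):

```racket
#lang racket/base
(require ffi/unsafe
         ffi/unsafe/define)

;; Hypothetical shared library "libtask" built from C/Rust/Zig;
;; adjust the name/path to whatever you actually compile.
(define-ffi-definer define-task (ffi-lib "libtask"))

;; Assumed C signature: double run_task(double seed, int iterations);
(define-task run_task
  (_fun _double _int -> _double))

;; Call the native routine like any Racket function:
(run_task 42.0 1000000)
```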
With really low-level code you want to ask yourself what the memory access patterns are. If cache misses can be avoided by using tightly packed sequential arrays, it may turn out that simply recomputing a bunch of things is faster than trying to cache them via non-local/scattered memory access patterns.
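As a rough illustration of that point (not a rigorous benchmark), summing the same tightly packed flvector sequentially versus through a randomly scattered index vector usually makes the cache effect visible:

```racket
#lang racket/base
(require racket/flonum)

;; A tightly packed array of flonums and a random index permutation.
(define n 10000000)
(define data (make-flvector n 1.0))
(define idx  (build-vector n (lambda (_) (random n))))

;; Sequential access: cache lines and the prefetcher work in our favor.
(define (sum-sequential)
  (for/fold ([s 0.0]) ([i (in-range n)])
    (fl+ s (flvector-ref data i))))

;; Scattered access: same arithmetic, far more cache misses.
(define (sum-scattered)
  (for/fold ([s 0.0]) ([i (in-range n)])
    (fl+ s (flvector-ref data (vector-ref idx i)))))

(time (sum-sequential))
(time (sum-scattered))
```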
Also "Data Oriented Programming/Design" describes some ways how to write lowlevel code so that it fits the hardware, this page seems to have a bunch of links to further resources about its ideas: Mike Acton | Data Oriented Programming
I hope some of this is useful for you. I asked a bunch of questions, maybe some of those apply to your problem.