First naive suggestion: call (collect-garbage) three times before measuring memory use. (The 3x may be a relic from the deep past, no idea if this is still the right heuristic.)
Ooh... except that if the value doesn't survive, that call could actually collect it. So I would return it after checking the memory use, to make sure it doesn't die. Even then, you run the risk of optimizers cleverly discovering that it's not actually needed.
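To make that concrete, here is a rough sketch of the idea, not a definitive recipe. The helper name `measure-allocation` is made up for illustration, and the triple `(collect-garbage)` is just the heuristic mentioned above. It returns the value together with the byte count so the value is still reachable when the second measurement is taken.

```racket
#lang racket/base

;; Hypothetical helper (not from this thread): run `thunk` and report the
;; change in (current-memory-use) while the result is still reachable.
(define (measure-allocation thunk)
  ;; Triple collection before measuring; an old heuristic, possibly unneeded.
  (collect-garbage) (collect-garbage) (collect-garbage)
  (define before (current-memory-use))
  (define v (thunk))
  (collect-garbage) (collect-garbage) (collect-garbage)
  (define after (current-memory-use))
  ;; Returning v alongside the byte count keeps it live past the
  ;; second measurement, so the collector cannot reclaim it early.
  (values v (- after before)))

(define-values (lst bytes)
  (measure-allocation (lambda () (build-list 100000 values))))
(printf "~a elements, roughly ~a bytes\n" (length lst) bytes)
```

The number is only an estimate: anything else that allocates or gets collected between the two readings shows up in the difference too.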
My guess is that 32 is the size of the temporary variables. Fixnums like 1 or 2 don't allocate memory in either CS or BC.
I made another version with void/reference-sink, because otherwise the compiler may notice that the value is unused and just avoid allocating it. (And there are a few more optimizations that may remove the reference before you expect it.)
For example try
(size (list 0 1 2 3 4 5 6 7 8 9))
Also, I added (sleep 1). I've never heard that it's necessary, but after running the program a few times I get more consistent results with a small pause (???).
With my version I get a size that is about 10x the number I get with your version.
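For anyone following along, a version along these lines might look roughly like the sketch below. This is a guess at the shape, not the exact code: the macro body is an assumption, only the name `size`, the use of `void/reference-sink` (from `ffi/unsafe`), and the `(sleep 1)` pauses come from the description above.

```racket
#lang racket/base
(require (only-in ffi/unsafe void/reference-sink))

;; Assumed reconstruction of a `size` macro; the body is illustrative only.
(define-syntax-rule (size expr)
  (let ()
    (collect-garbage) (collect-garbage) (collect-garbage)
    (sleep 1) ; small pause; seemed to give steadier numbers, reason unclear
    (define before (current-memory-use))
    (define v expr)
    (collect-garbage) (collect-garbage) (collect-garbage)
    (sleep 1)
    (define after (current-memory-use))
    ;; Mark v as still referenced here, so the compiler cannot drop the
    ;; allocation or release the reference before `after` is read.
    (void/reference-sink v)
    (- after before)))

(size (list 0 1 2 3 4 5 6 7 8 9))
```

For a value this small the result is dominated by noise, so it is probably worth measuring something much larger (or many copies) and dividing.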
My original scenario is that I have many long-lived threads. They exchange messages to complete different jobs, and each has its own local storage. I want to measure memory usage accurately so that I know the limits of the design, and can decide whether to add something to a thread or put it somewhere else.