Why does my code run significantly faster on BC than CS?

I'm working on a NES emulator in Typed Racket and I've just gotten it complete enough to do some realistic performance testing. Racket 8.12 BC can run the emulator at 105 FPS, but the same code using 8.12 CS peaks at 45 FPS. (This is "headless" emulation speed, unaffected by racket/gui.)

The profiler shows that CPU emulation takes the majority of the time. That code really only does a few things:

  1. Read and write RAM, using unsafe-bytes-set! and unsafe-bytes-ref
  2. Fixnum arithmetic (e.g. unsafe-fx+ and unsafe-fxior)
  3. Conditionals based on fixnum comparisons

Is it possible that CS is actually more than 2x slower than BC for this kind of workload? Are there any CS-specific performance pitfalls to be aware of? If BC is compiled and CS is interpreted it would seem to explain things, but I thought CS was compiled also.

I'll be happy to share the code once I do a little cleanup if anyone is interested. Thanks in advance!

I'm no expert, but what platform did you run the performance tests on? CS compiles to native machine code (generally speaking, at least) while BC compiles to bytecode (again, generally speaking).

From what I understand, CS's native code is not as compact as BC bytecode. Perhaps the BC JIT compiler is more efficient in this case. Someone who knows more than me will have to comment on that possibility though.

I ran it on Windows 10, x64 (Intel Core i7). Thanks for confirming that CS does compile to native machine code.

I'm wondering if there's an unfortunate interaction between Typed Racket, CS, and unsafe operations. How hard would it be to strip the types out, just to see what difference it makes?

Another random check: are you running this code using DrRacket, or by running it at the command-line?

I tested it from the command line with: racket my-file.rkt. When you say "strip the types out", do you mean change the #lang to racket? I don't think that will be too difficult. I'll try it out soon.

Yes, that's what I mean.

Are you using the FFI at all? There are some ways of writing FFI code that would end up making copies of byte strings on CS, but not copying on BC.

Updated times:

  • Typed, CS: 45 FPS (within 1 FPS every run)
  • Untyped, CS: 50 FPS (within 1 FPS every run)
  • Typed, BC: 124-130 FPS
  • Untyped, BC: 122-128 FPS

So it would seem that Typed Racket + CS is causing a bit of slowdown, but nothing major.

No, I'm not using FFI yet. But that's good to know as I expect I will need FFI if/when I try to implement the audio output.

John meant something different: whether Typed Racket (TR) and untyped Racket (R) modules interact in your program. If a program mixes R and TR, there are bad cases where the type-protection scheme imposes serious penalties (an order of magnitude). This is not the case with your program.

;; - - -

Is it possible that your installation of Racket/CS did not compile the libraries?

How could I check whether the libraries were compiled or not?

I found the Inspecting Compiler Passes documentation and was able to view the linklet and the machine code that CS generates. Nothing jumps out at me, but that's mostly because my emulate-one-instruction procedure is very large and hard to read. Maybe BC is better than CS at optimizing large procedures? In any case, with this tool in hand I think I should be able to refactor the code starting with smaller, simpler functions and verifying the machine code at each step.
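
For anyone who wants to repeat that inspection, here is a minimal shell sketch. It assumes a POSIX shell and that the PLT_LINKLET_SHOW_ASSEMBLY variable from the "Inspecting Compiler Passes" documentation is available in your Racket CS version (check the names against your docs). The output file name linklet-dump.txt is hypothetical, and my-file.rkt is the file name used earlier in this thread.

```shell
# Assumption: setting this variable makes Racket CS print the machine code
# for each form it compiles in this process.
export PLT_LINKLET_SHOW_ASSEMBLY=1

# Only run if racket and the source file are actually present.
if command -v racket >/dev/null 2>&1 && [ -f my-file.rkt ]; then
  # The dump can be very large, so send it to a file for searching.
  racket my-file.rkt > linklet-dump.txt
fi
```

As far as I understand, only forms compiled during that run are shown, so modules loaded from an existing compiled/ directory would presumably not appear in the dump.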

Please share your results if you are able to improve the CS performance (or even if not).

For example: do the packages you have installed have compiled directories? You could try running raco setup with your CS installation to make sure everything is compiled.
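
A rough way to do that check from a POSIX shell is sketched below. The count_zo helper is hypothetical (not part of Racket); it simply counts .zo files that sit inside compiled directories under a given path.

```shell
# Hypothetical helper: count .zo files inside compiled/ directories under $1.
count_zo() {
  find "$1" -type f -path '*/compiled/*' -name '*.zo' 2>/dev/null | wc -l | tr -d ' '
}

# Report how many compiled files exist under the current directory.
echo "compiled .zo files under $(pwd): $(count_zo .)"

# If the count is 0 for an installed package, rebuilding everything with the
# CS installation's raco is the suggestion above (this can take a while):
#   raco setup
```

Point count_zo at a package directory instead of "." to check a specific package.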

Whether or not the files are compiled only affects startup time.
The issue here is that the FPS dropped.

It depends on how the FPS figure is measured, for example if it includes start-up time and the runs are short.

I do not know the details, but seeing "Maybe BC is better than CS at optimizing large procedures?" made this jump out of the depths of my memory:

If you write the same program in Racket and in Chez, it will run at almost exactly the same speed, unless it has very very large functions that are nonetheless important to compile efficiently, in which case there is interpretation overhead

(Discord post by samth)

But I guess, from what I read above, the long function was indeed compiled? Maybe you could try setting PLT_CS_COMPILE_LIMIT to something larger and see if it makes any difference.

See section 18.7, "Controlling and Inspecting Compilation", in the Racket Reference.

Setting PLT_CS_COMPILE_LIMIT did it! I bumped it up to 20000 and now it runs very fast. Thanks everyone! I will make the code public pretty soon.
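
For reference, a minimal sketch of that setup, assuming a POSIX shell (on Windows cmd.exe the equivalent would be: set PLT_CS_COMPILE_LIMIT=20000). The value 20000 is the one reported above and my-file.rkt is the file name used earlier in the thread. The right limit for another program is something to find by experiment.

```shell
# Raise the size limit above which Racket CS falls back to interpreting a
# form instead of compiling it to machine code.
export PLT_CS_COMPILE_LIMIT=20000

# Only run if racket and the source file are actually present.
if command -v racket >/dev/null 2>&1 && [ -f my-file.rkt ]; then
  racket my-file.rkt
fi

echo "PLT_CS_COMPILE_LIMIT=$PLT_CS_COMPILE_LIMIT"
```

As far as I understand, the limit is applied when a form is compiled, so if the program was already compiled to a compiled/ directory (for example with raco make), it probably needs to be recompiled with the variable set before the change shows up.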
