Fast I/O and byte strings

I have some experimental code for fast byte string I/O. An example cat program runs roughly 4.7x faster than one using in-bytes-lines and byte string I/O. Using this approach instead of strings also let me cut the runtime of a Racket program from 1,117 seconds to 50 seconds for a single file (22x faster), or from 149 minutes to 6.7 minutes for the full data set. That is not quite as fast as C++ (using string_view and a similar block I/O method), which took 1.2 minutes, but not bad.

There are two ideas:

  1. For the I/O, I use the traditional manual buffering approach, i.e., read big chunks of data into a buffer and manually "parse" lines by finding the indices of newline bytes.

  2. Additionally, instead of using byte strings, I've created byte string views, analogous to C++ string_view, which are simply structs with a bytes buffer, an index to the beginning of the view, and the index to the exclusive end of the view. This minimizes allocations and copying significantly.
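To make the two ideas concrete, here is a minimal sketch of how they can fit together. It is not the code from the gist: the names `bview`, `bview->bytes`, `find-newline`, and `for-each-line-view` are made up for illustration, and it only treats a bare newline byte (10) as a line terminator.

```racket
#lang racket/base

;; Hypothetical view type: a window [start, end) into a shared bytes buffer.
(struct bview (buf start end) #:transparent)

;; Copy the viewed bytes out, only for when a real byte string is needed.
(define (bview->bytes v)
  (subbytes (bview-buf v) (bview-start v) (bview-end v)))

;; Index of the first newline byte in buf within [start, end), or #f.
(define (find-newline buf start end)
  (let loop ([i start])
    (cond [(>= i end) #f]
          [(= (bytes-ref buf i) 10) i]
          [else (loop (add1 i))])))

;; Call (proc view) for each line of `in`, reading in large chunks.
;; A view is only valid during the call, because the buffer is reused.
(define (for-each-line-view proc in #:buffer-size [size 65536])
  (let loop ([buf (make-bytes size)] [fill 0])
    ;; Bytes [0, fill) hold an unfinished line from the previous chunk.
    (define n (read-bytes-avail! buf in fill))
    (cond
      [(eof-object? n)
       ;; Flush a final line that has no trailing newline.
       (when (> fill 0) (proc (bview buf 0 fill)))]
      [else
       (define total (+ fill n))
       ;; Emit one view per complete line; the result is where the
       ;; unfinished line (if any) begins.
       (define tail-start
         (let scan ([start 0])
           (define nl (find-newline buf start total))
           (cond [nl (proc (bview buf start nl)) (scan (add1 nl))]
                 [else start])))
       (cond
         [(and (= tail-start 0) (= total (bytes-length buf)))
          ;; One line fills the whole buffer: grow it and keep reading.
          (define bigger (make-bytes (* 2 (bytes-length buf))))
          (bytes-copy! bigger 0 buf 0 total)
          (loop bigger total)]
         [else
          ;; Slide the unfinished line to the front of the buffer.
          (bytes-copy! buf 0 buf tail-start total)
          (loop buf (- total tail-start))])])))
```

The point of the design is that the only per-line allocation is the small view struct; the line's bytes are never copied unless the caller asks for them with something like `bview->bytes`. The tradeoff is that a view must not be kept around after the callback returns.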

As part of the work, I had to recreate some functions to operate on byte string views instead of byte strings (split, trim, etc.). One of these is a soundex algorithm that is also provided.
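As a rough illustration of what such view-based functions can look like, here is a sketch of trim and split. Again, this is not the gist's code: `bview`, `bview-trim`, and `bview-split` are hypothetical names, whitespace is assumed to mean ASCII space, tab, newline, and carriage return, and the split takes a single separator byte and keeps empty fields.

```racket
#lang racket/base

;; Hypothetical view type: a window [start, end) into a shared bytes buffer.
(struct bview (buf start end) #:transparent)

;; Copy the viewed bytes out, only for when a real byte string is needed.
(define (bview->bytes v)
  (subbytes (bview-buf v) (bview-start v) (bview-end v)))

;; ASCII space, tab, newline, carriage return.
(define (space-byte? b)
  (or (= b 32) (= b 9) (= b 10) (= b 13)))

;; Trim whitespace from both ends by moving the indices inward.
;; No bytes are copied; the result shares the original buffer.
(define (bview-trim v)
  (define buf (bview-buf v))
  (define start
    (let loop ([i (bview-start v)])
      (if (and (< i (bview-end v)) (space-byte? (bytes-ref buf i)))
          (loop (add1 i))
          i)))
  (define end
    (let loop ([j (bview-end v)])
      (if (and (> j start) (space-byte? (bytes-ref buf (sub1 j))))
          (loop (sub1 j))
          j)))
  (bview buf start end))

;; Split on a single separator byte, returning a list of views
;; into the same buffer. Empty fields are kept.
(define (bview-split v sep)
  (define buf (bview-buf v))
  (define end (bview-end v))
  (let loop ([i (bview-start v)] [field-start (bview-start v)] [acc '()])
    (cond
      [(= i end) (reverse (cons (bview buf field-start end) acc))]
      [(= (bytes-ref buf i) sep)
       (loop (add1 i) (add1 i) (cons (bview buf field-start i) acc))]
      [else (loop (add1 i) field-start acc)])))
```

For example, splitting a view of `#" a , b,,c "` on byte 44 (a comma) and trimming each piece gives four views over the same buffer, and nothing is copied until `bview->bytes` is called on one of them.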

The code was somewhat "stream of consciousness" to prove the concept, but I hope to refine the code and create some packages later.

Here are the files in a gist:
