Recently, there has been widespread discussion in the community about the performance of the Racket language. It applies to a benchmarks game and I find it interesting to solve those problems about DNA because I'm a medical student.
Here I try refactoring this benchmark code using Typed Racket in order to make it more readable and accelerate its execution. As a result, however, the new program seems inefficient compared with the original one.
My conjecture is that your re-implementation of sequence-* and generator may slow down things.
I have therefore ported Matthew’s code as faithfully as possible using require/typed and a bit of a
macro trick. The types are a bit more approximate than yours at first glance but I doubt this will make
a difference.
I don’t know where the input file is and I am out of time.
Feedback much appreciated — Matthias
p.s. I moved the “tests” into a (module+ test .. you may wish to undo this move again.
#lang typed/racket/base
(require/typed racket/sequence
[sequence-fold (-> (-> Natural Natural Any) Natural [Sequenceof Natural] Any)])
(require/typed
racket/generator
[yield (-> Integer Any)])
;; ---------------------------------------------------------------------------------------------------
(module generator racket/base
(provide create-generator)
(require racket/generator)
(define (create-generator name f loop)
(generator () (f))))
(require/typed
'generator
[create-generator (-> Symbol (-> Void) Boolean (-> Integer))])
;; ---------------------------------------------------------------------------------------------------
;;;
;;; The Computer Language Benchmarks Game
;;; https://salsa.debian.org/benchmarksgame-team/benchmarksgame/
;;; contributed by Matthew Flatt, modified by
;;; modified by James Bergstra
;;; Notes on the implementation: the strategy is to map the DNA letters to the
;;; bytes 0, 1, 2, 3, and then create a hash function that is simply the
;;; concatenation of these two-byte codes. This is handy because the slow part
;;; of this test is building the hash table, and this hash function means that
;;; we can take advantage of overlapping DNA sub-sequences to get a
;;; constant-time hash function (that does not depend on the sequence length).
;;;
;;; The bottleneck in this code seems to be Racket's hash table. The time to
;;; create the last hash table (for the len-18 string) seems to be about half
;;; the runtime of the whole program.
;; Map A->0, C->1, G->2 T->3 (and lowercase too)
(: dna->num (Byte -> Natural))
(define dna->num
(let ([tbl (make-bytes 256 255)])
(for ([ch : Byte (in-list (bytes->list #"ACGTacgt"))]
[ii (in-list '(0 1 2 3 0 1 2 3))])
(bytes-set! tbl ch ii))
(lambda ({ch : Byte}) (bytes-ref tbl ch))))
;;; map a hash key back to a string (needed for printing)
(: unhash (-> Natural Natural String))
(define (unhash key len)
(let ([rval (make-string len)])
(sequence-fold
(lambda ({key : Integer} {pos : Natural})
(string-set! rval pos (string-ref "ACGT" (bitwise-and key 3)))
(arithmetic-shift key -2))
key
(in-range len))
rval))
;;; Ideally this would serve all-counts, but my attempt to do that
;;; was horribly slow.
(: hashes (-> Natural Bytes Any (-> Integer)))
(define (hashes keylen dna as-codes)
(create-generator
'generator
(λ ()
(let ([key 0] [ishift (* 2 keylen)] [thresh (sub1 keylen)])
(for
([bb : Byte (in-bytes dna)]
[ii : Natural (in-range (bytes-length dna))])
(set! key (arithmetic-shift
(+ key (arithmetic-shift (if as-codes bb (dna->num bb)) ishift)) -2))
(if (>= ii thresh) (yield key) #f))
))
#false))
(: all-counts (-> Natural Bytes [HashTable Natural Natural]))
(define (all-counts keylen dna)
(let ([table : [HashTable Natural Natural] (make-hasheq)]
[key : Natural 0]
[ishift : Natural (* 2 keylen)]
)
(for
([bb : Byte (in-bytes dna)]
[ii : Natural (in-range (bytes-length dna))])
(set! key (arithmetic-shift (+ key (arithmetic-shift bb ishift)) -2))
(define r : (U False Natural) (hash-ref table key))
(if (>= ii (- keylen 1)) (hash-set! table key (add1 (if r r 0))) #f)
)
table))
(: readbuf (-> Input-Port (-> Byte Integer) Bytes))
(define (readbuf in foo)
(let ([s (open-output-bytes)])
;; Skip to ">THREE ..."
(regexp-match #rx#"(?m:^>THREE.*$)" in)
;; Copy everything but newlines
(for ([l (in-bytes-lines in)])
(write-bytes l s))
;; Replace letters with numbers 0, 1, 2, 3
(let ([actg : Bytes (get-output-bytes s)])
(for ([ii (in-range (bytes-length actg))])
(bytes-set! actg ii (foo (bytes-ref actg ii))))
actg)))
(define-type PosCart [Pairof Natural Natural])
(: write-freqs (-> [HashTable Natural Natural] Natural Void))
(define (write-freqs table len)
(let* ([content : [Listof PosCart] (hash->list table)]
[total : Real (exact->inexact (apply + ((inst map Real PosCart) cdr content)))])
(for ([a : PosCart
(sort content (λ ({x : PosCart} {y : PosCart}) (> (cdr x) (cdr y))))])
(printf "~a ~a\n"
(unhash (car a) len)
(real->decimal-string (* 100 (/ (cdr a) total)) 3)))))
(: write-one-freq (-> {HashTable Natural Natural} Bytes Void))
(define (write-one-freq table key)
(let* ([pre : Any (hash-ref table ((hashes (bytes-length key) key #f)) #false)]
[cnt : Integer (if (exact-integer? pre) pre 0)])
(printf "~a\t~a\n" cnt key)))
(module+ test
(define dna (readbuf (current-input-port) dna->num))
(write-freqs (all-counts 1 dna) 1)
(newline)
(write-freqs (all-counts 2 dna) 2)
(newline)
;; Specific sequences:
(for ([seq '(#"GGT" #"GGTA" #"GGTATT" #"GGTATTTTAATT" #"GGTATTTTAATTTATAGT")])
(write-one-freq (all-counts (bytes-length seq) dna) seq)))
I've tested this program and indeed it is more efficient than my reimplementation. Though it is still slower than the original one, I think your modification points the way.
Here is the result.
$ cat input25000000.txt | time racket emef-typed-kn.rkt
A 30.295
T 30.151
C 19.800
G 19.754
AA 9.177
TA 9.132
AT 9.131
TT 9.091
CA 6.002
AC 6.001
AG 5.987
GA 5.984
CT 5.971
TC 5.971
GT 5.957
TG 5.956
CC 3.917
GC 3.911
CG 3.909
GG 3.902
1471758 GGT
446535 GGTA
47336 GGTATT
893 GGTATTTTAATT
893 GGTATTTTAATTTATAGT
86.84user 2.57system 1:29.44elapsed 99%CPU (0avgtext+0avgdata 1085928maxresident)k
64inputs+0outputs (0major+393115minor)pagefaults 0swaps
diff program output for this 10KB input file (generated with the fasta program N = 1000) with this output file to check your program output has the correct format, before you contribute your program.
Generate a larger input file (using one of the fasta programs with command line arguments: 25000000 > input25000000.txt) to check program performance.
OK, I created two main submodules (appended below) to time the three variants (untyped, emef-typed, and hz-typed):
[matthias@Texas DNA-discourse]$ for x (1 2 3 4 5); do ./untyped.rkt < input.txt | diff - output.txt ; done
28d27
< cpu time: 1055 real time: 1067 gc time: 17
28d27
< cpu time: 1074 real time: 1086 gc time: 18
28d27
< cpu time: 1074 real time: 1087 gc time: 18
28d27
< cpu time: 1074 real time: 1087 gc time: 18
28d27
< cpu time: 1076 real time: 1089 gc time: 18
[matthias@Texas DNA-discourse]$ for x (1 2 3 4 5); do ./emef-typed.rkt < input.txt | diff - output.txt ; done
28d27
< cpu time: 1107 real time: 1120 gc time: 21
28d27
< cpu time: 1108 real time: 1120 gc time: 21
28d27
< cpu time: 1107 real time: 1121 gc time: 21
28d27
< cpu time: 1108 real time: 1120 gc time: 21
28d27
< cpu time: 1107 real time: 1120 gc time: 22
[matthias@Texas DNA-discourse]$ for x (1 2 3 4 5); do ./hz-typed.rkt < input.txt | diff - output.txt ; done
28d27
< cpu time: 1733 real time: 1793 gc time: 17
28d27
< cpu time: 1739 real time: 1798 gc time: 18
28d27
< cpu time: 1735 real time: 1794 gc time: 17
28d27
< cpu time: 1739 real time: 1799 gc time: 18
28d27
< cpu time: 1737 real time: 1797 gc time: 18
The emef-variant is roughly 3.5% slower than the untyped one. This is in all likelihood due to the implied contracts on yield and the contract on the fold-function passed to sequence-fold.
The hz-variant is roughly 65% slower than the untyped one, and as I wrote before, probably due to re-implementing some Racket functionality. That’s good exercise but not good for performance.
I just realized that this make-hash-sequence function encodes every frame multiple times and got it fixed. It's currently still slower than your version but not that inefficient. Thanks for the effort you put into this problem!