Scramble/regexp question

jbclements · November 14, 2024, 7:04pm

I've just started using Ryan Culpepper's scramble/regexp library for constructing regular expressions in a structured way (a la "SRE"s), and I'm liking it a lot. One thing I haven't figured out yet is a nice way to write the regular expression ".", that matches any character. It looks like the right way is with ... well, no, I'm not sure. I think (inject ".") works, but there must be a nicer way?

I should say, having abstraction means I can at least (define-RE dot (inject ".")), or even (define-RE d (inject ".")), so this is not a problem without nice workarounds, but I feel like I must be missing something obvious.

jbclements · November 14, 2024, 7:05pm

p.s.: why not use Alex Shinn's irregex? Sadly, in my experience the irregex library performs quite a bit worse than the built-in regexp library.

shawnw · November 14, 2024, 9:25pm

I think the port of irregex to Racket is badly out of date, too.

jbclements · November 14, 2024, 10:08pm

Well, I totally believe that... I'm the maintainer! But I stopped maintaining it when I discovered how much slower it was. Specifically, I think that implementing fast regexps is hard, and Chez and Racket have done a lot of work on making theirs fairly fast. My only issue is with the surface syntax, so I think the approach that Ryan takes, of compiling SREs into native regexps, is almost certainly the right one.

Ultimately, this is kind of the "cross-platform tools are rarely faster" concept: if irregex came up with a clever way to make regexp matching faster, I'm relatively confident that lower-level implementations, such as the ones attached to Chez/Racket, would use those techniques too, and they probably have access to lower-level knobs and dials that allow them/us to tune things better.

Gambiteer · November 14, 2024, 11:33pm

That's interesting, Gambit just added irregex to the list of distributed libraries.

Does anyone have a set of RE benchmarks that people think are relevant?

jbclements · November 15, 2024, 6:51am

I don't have a benchmark. My personal experience comes from parsing GEDCOM files, an utter abomination of a file format that helps you understand just how terrible things were in the old days, and how hard it is to actually define and stick to a sensible set of conventions. Parsing multi-megabyte gedcom files with irregex was very slow, and parsing them with racket regexps was much faster. It's possible that I was "doing it wrong" somehow in irregex, but I couldn't tell you how.

If you want to see some code that runs (but not a multi-megabyte test file, alas) you can take a look at line-parser.rkt in my jbclements/racket-gedcom repository on GitHub, which is clearly labeled as essentially a sandbox.

jbclements · November 15, 2024, 6:55am

Also, after spending 15 minutes looking for regexp benchmarks, I'm coming to realize what should have been obvious to me from the start, which is that it might well be the case that regexp engines can be tuned for different structures, lengths of pattern, et cetera. So in fact it might be the case that certain libraries, because of the choices that they make, are much better for certain matching tasks and much worse for others.
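
For anyone who wants a starting point, here is a rough timing sketch using only Racket's built-in regexps. Everything in it is an assumption made for illustration: the pattern `line-rx` is a simplified guess at a GEDCOM-like line (level, optional @xref@, tag, optional value), `make-lines` fabricates synthetic input, and `time-matches` is a hypothetical helper. It is not taken from racket-gedcom, and a single pattern like this says nothing about how engines compare on other pattern shapes.

```racket
#lang racket/base

;; Assumed, simplified line shape: LEVEL [@XREF@] TAG [VALUE]
(define line-rx
  #px"^([0-9]+) (?:(@[^@]+@) )?([A-Za-z0-9_]+)(?: (.*))?$")

;; Build n synthetic lines, alternating record headers and NAME lines.
(define (make-lines n)
  (for/list ([i (in-range n)])
    (if (even? i)
        (format "0 @I~a@ INDI" i)
        (format "1 NAME Person ~a /Surname/" i))))

;; Return the number of matching lines and the elapsed wall-clock milliseconds.
(define (time-matches rx lines)
  (define start (current-inexact-milliseconds))
  (define matched
    (for/sum ([line (in-list lines)])
      (if (regexp-match rx line) 1 0)))
  (values matched (- (current-inexact-milliseconds) start)))

(define-values (matched ms) (time-matches line-rx (make-lines 100000)))
(printf "matched ~a lines in ~a ms\n" matched ms)
```

To compare engines, the same lines would need to be run through the other library's matcher with an equivalent pattern, ideally on a real multi-megabyte file rather than synthetic data.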

ryanc · November 15, 2024, 2:02pm

I don't remember why I didn't include a notation for ".", but it was probably some combination of overlooking it, not needing it, and not knowing what to call it.

Gambiteer · November 15, 2024, 8:10pm

This example is very interesting to me personally. My daughter did a lot of work at Ancestry.com during Covid and I, too, now have a multi-megabyte GEDCOM file. (I remarked to my daughter that the data Ancestry.com holds would be a lot more useful if they hired a few graph theorists, but it might not be a good business plan to give people tools to "finish" the job in a short time.)

I'll look at your code. I don't have much experience in either regexps or graph algorithms, but if others want to look at this problem together I'd be happy to contribute where I can.

Gambiteer · May 21, 2025, 4:28pm

In looking at Gambit's irregex library, I noticed that the bit-field routines, used in manipulating flag bit-fields, are generic and probably very slow:

(define (bit-shr n i)
  (quotient n (expt 2 i)))

(define (bit-shl n i)
  (* n (expt 2 i)))

(define (bit-not n) (- #xFFFF n))

(define (bit-ior a b)
  (cond
   ((zero? a) b)
   ((zero? b) a)
   (else
    (+ (if (or (odd? a) (odd? b)) 1 0)
       (* 2 (bit-ior (quotient a 2) (quotient b 2)))))))

(define (bit-and a b)
  (cond
   ((zero? a) 0)
   ((zero? b) 0)
   (else
    (+ (if (and (odd? a) (odd? b)) 1 0)
       (* 2 (bit-and (quotient a 2) (quotient b 2)))))))

(define (integer-log n)
  (define (b8 n r)
    (if (>= n (bit-shl 1 8)) (b4 (bit-shr n 8) (+ r 8)) (b4 n r)))
  (define (b4 n r)
    (if (>= n (bit-shl 1 4)) (b2 (bit-shr n 4) (+ r 4)) (b2 n r)))
  (define (b2 n r)
    (if (>= n (bit-shl 1 2)) (b1 (bit-shr n 2) (+ r 2)) (b1 n r)))
  (define (b1 n r) (if (>= n (bit-shl 1 1)) (+ r 1) r))
  (if (>= n (bit-shl 1 16)) (b8 (bit-shr n 16) 16) (b8 n 0)))

(define (flag-set? flags i)
  (= i (bit-and flags i)))
(define (flag-join a b)
  (if b (bit-ior a b) a))
(define (flag-clear a b)
  (bit-and a (bit-not b)))

(define ~none 0)
(define ~searcher? 1)
(define ~consumer? 2)

I don't know how crucial these routines are to the performance of the irregex library, but if I had a reasonably sized benchmark I'd replace all but bit-not (which assumes size-16 bit fields) with native versions of these routines to see what might improve.
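
As a sketch of what that replacement might look like, here are the same helpers written on top of native integer operations. It is written for Racket, using `arithmetic-shift`, `bitwise-ior`, `bitwise-and` and `integer-length`; I believe Gambit provides procedures with the same names, but that is an assumption to check. It also assumes non-negative arguments, since `arithmetic-shift` and `quotient` round differently for negative numbers, and `integer-log` is only meant to agree with the original for inputs below 2^32.

```racket
#lang racket/base

;; Shifts via the native arithmetic shift (non-negative n assumed).
(define (bit-shr n i) (arithmetic-shift n (- i)))
(define (bit-shl n i) (arithmetic-shift n i))

;; Left unchanged: deliberately assumes 16-bit flag fields.
(define (bit-not n) (- #xFFFF n))

(define (bit-ior a b) (bitwise-ior a b))
(define (bit-and a b) (bitwise-and a b))

;; Index of the highest set bit, and 0 for n = 0, as in the original
;; (for 0 <= n < 2^32).
(define (integer-log n)
  (if (zero? n) 0 (- (integer-length n) 1)))

;; The flag helpers are the same as before; only what they call changed.
(define (flag-set? flags i)
  (= i (bit-and flags i)))
(define (flag-join a b)
  (if b (bit-ior a b) a))
(define (flag-clear a b)
  (bit-and a (bit-not b)))
```

Whether this makes any measurable difference depends on how often the flag helpers are called during compilation and matching, which is exactly what a benchmark would need to show.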
