Locale for regexp-quote

@gus-massa asked in Fix regexps docs by gus-massa · Pull Request #5027 · racket/racket · GitHub

Which locale does regexp-quote use when it compares bytes case-insensitively?

(regexp-match (regexp-quote #"Á" #f) #"á") ; ==> (#"\341")
(regexp-match (regexp-quote #"á" #f) #"Á") ; ==> (#"\301")
(regexp-match (regexp-quote #"A" #f) #"a") ; ==> (#"a")
(regexp-match (regexp-quote #"a" #f) #"A") ; ==> (#"A")

Not sure if this is relevant, but which locale does string-upcase use?


Telling regexp-quote to make a case-insensitive pattern just wraps the escaped pattern in a (?i:...):

> (regexp-quote #"Á" #f)
#"(?i:\301)"

I did some digging in the source, and if I'm following it correctly (a big if, since I didn't spend much time on it), in case-insensitive mode each literal character or byte is added to the range of codepoints/bytes that can match, using this function:

(define (range-add* range c config)
  (cond
   [(not c) range]
   [else
    (define range2 (range-add range c))
    (cond
     [(parse-config-case-sensitive? config) range2]
     [else
      (define range3 (range-add range2 (char->integer (char-upcase (integer->char c)))))
      (define range4 (range-add range3 (char->integer (char-foldcase (integer->char c)))))
      (range-add range4 (char->integer (char-downcase (integer->char c))))])]))

Here c is the codepoint for string regexps and the byte value for byte regexps. So byte regexps effectively use Latin-1 case mappings (since Latin-1 corresponds to the first 256 codepoints in Unicode).
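To make that concrete, here is a minimal sketch of what the routine ends up collecting for a single byte. Note that byte-case-variants is a hypothetical helper written for this post, not something in the regexp implementation; it only mirrors the three char-upcase / char-foldcase / char-downcase calls above.

```racket
#lang racket/base

;; Hypothetical helper (not part of Racket's regexp code): treat the byte b
;; as a codepoint and list it together with its upcase, foldcase and
;; downcase variants, in the order range-add* adds them.
(define (byte-case-variants b)
  (define c (integer->char b))
  (list b
        (char->integer (char-upcase c))
        (char->integer (char-foldcase c))
        (char->integer (char-downcase c))))

(byte-case-variants #xC1) ; byte \301, "Á" in Latin-1 => '(193 193 225 225)
(byte-case-variants #xE1) ; byte \341, "á" in Latin-1 => '(225 193 225 225)
(byte-case-variants 97)   ; ASCII "a"                 => '(97 65 97 97)
```

Bytes 193 and 225 (\301 and \341) land in each other's ranges, which matches the results in the original question.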


Aside: the Reference says of mixing string and bytestring regexps and text:

If a character regexp is used with a byte string or input port, it matches UTF-8 encodings (see Encodings and Locales) of matching character streams; if a byte regexp is used with a character string, it matches bytes in the UTF-8 encoding of the string.

but the routine above means that case-insensitive matching of non-ASCII characters fails when a byte regexp is matched against a string:

> (regexp-match? (string->bytes/utf-8 "(?i:Á)") "Á")
#t
> (regexp-match? (string->bytes/utf-8 "(?i:Á)") "á") ; fails!
#f
> (regexp-match? (string->bytes/utf-8 "(?i:A)") "A")
#t
> (regexp-match? (string->bytes/utf-8 "(?i:A)") "a")
#t
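My reading of why the second case fails, as a small sketch: the pattern and the text are both compared as UTF-8 bytes, and the per-byte case mapping described above never connects the two encodings. The names upper and lower are just for illustration.

```racket
#lang racket/base

(define upper (string->bytes/utf-8 "Á")) ; UTF-8 bytes C3 81
(define lower (string->bytes/utf-8 "á")) ; UTF-8 bytes C3 A1

;; Both encodings share the lead byte and differ only in the second byte.
(bytes->list upper) ; => '(195 129)
(bytes->list lower) ; => '(195 161)

;; Read as Latin-1 codepoints, neither second byte has a case variant:
;; 129 is a control character and 161 is "¡". Mapping each byte on its
;; own therefore never turns 129 into 161.
(char->integer (char-downcase (integer->char 129))) ; => 129
(char->integer (char-upcase (integer->char 161)))   ; => 161
```

So the case-insensitive byte pattern built from "Á" still requires byte 129 in second position, which the encoding of "á" doesn't have.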

The other way around works though:

> (regexp-match? #rx"(?i:Á)" (string->bytes/utf-8 "á"))
#t