@gus-massa asked in the pull request "Fix regexps docs" by gus-massa (Pull Request #5027, racket/racket, GitHub):
Which locale does regexp-quote use when it compares bytes case-insensitively?
(regexp-match (regexp-quote #"Á" #f) #"á") ; ==> (#"\341")
(regexp-match (regexp-quote #"á" #f) #"Á") ; ==> (#"\301")
(regexp-match (regexp-quote #"A" #f) #"a") ; ==> (#"a")
(regexp-match (regexp-quote #"a" #f) #"A") ; ==> (#"A")
Not sure if this is relevant, but which locale does string-upcase use?
Telling regexp-quote to make a case-insensitive pattern just wraps the escaped pattern in (?i:...):
> (regexp-quote #"Á" #f)
#"(?i:\301)"
I did some digging in the source, and if I'm following it correctly (a big if, since I didn't spend much time on it), in case-insensitive mode each literal character or byte is added to the range of codepoints or bytes that can match by this function:
(define (range-add* range c config)
  (cond
    [(not c) range]
    [else
     (define range2 (range-add range c))
     (cond
       [(parse-config-case-sensitive? config) range2]
       [else
        (define range3 (range-add range2 (char->integer (char-upcase (integer->char c)))))
        (define range4 (range-add range3 (char->integer (char-foldcase (integer->char c)))))
        (range-add range4 (char->integer (char-downcase (integer->char c))))])]))
For string regexps, c is the codepoint; for byte regexps, it is the byte in question. So byte regexps are effectively using Latin-1 case mappings (since Latin-1 corresponds to the first 256 codepoints in Unicode), and no locale is consulted.
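To make the Latin-1 reading concrete, here is a minimal sketch that lists the values the case-insensitive branch of range-add* would add for a given c. The helper name case-variants is hypothetical and not part of Racket; it only mirrors the three char-upcase, char-foldcase and char-downcase calls shown above.

```racket
#lang racket/base
(require racket/list)

;; Hypothetical helper (not part of Racket's regexp implementation):
;; collects c plus its upcase, foldcase and downcase values, treating
;; c as a Unicode code point exactly as range-add* does.
(define (case-variants c)
  (define ch (integer->char c))
  (remove-duplicates
   (list c
         (char->integer (char-upcase ch))
         (char->integer (char-foldcase ch))
         (char->integer (char-downcase ch)))))

(case-variants #xC1) ; byte \301, Latin-1 Á  → '(193 225)
(case-variants #x41) ; byte for A            → '(65 97)
```

Byte 193 (\301) picks up 225 (\341), which is why the byte patterns in the question match both Á and á when the input is Latin-1 encoded.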
Aside: the Reference says this about mixing string and byte-string regexps with string and byte-string input:
If a character regexp is used with a byte string or input port, it matches UTF-8 encodings (see Encodings and Locales) of matching character streams; if a byte regexp is used with a character string, it matches bytes in the UTF-8 encoding of the string.
But the routine above means you can't do case-insensitive matching of non-ASCII characters when using a byte regexp with a string as the match text:
> (regexp-match? (string->bytes/utf-8 "(?i:Á)") "Á")
#t
> (regexp-match? (string->bytes/utf-8 "(?i:Á)") "á") ; fails!
#f
> (regexp-match? (string->bytes/utf-8 "(?i:A)") "A")
#t
> (regexp-match? (string->bytes/utf-8 "(?i:A)") "a")
#t
The other way around works though:
> (regexp-match? #rx"(?i:Á)" (string->bytes/utf-8 "á"))
#t
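A sketch of why the byte-regexp direction fails, assuming the per-byte reading of range-add* above is right: the UTF-8 encodings of Á and á differ only in their second byte, and that byte is not a cased Latin-1 character, so adding per-byte case variants never bridges the two. The names upper and lower are just for illustration.

```racket
#lang racket/base

;; UTF-8 encodings of the two characters.
(define upper (string->bytes/utf-8 "\u00C1")) ; "Á"
(define lower (string->bytes/utf-8 "\u00E1")) ; "á"

(bytes->list upper) ; → '(195 129)
(bytes->list lower) ; → '(195 161)

;; Per-byte case mapping, reading each byte as a Latin-1 code point:
;; 195 is Ã, whose lowercase is ã (227); 129 is a control character
;; with no case mapping, so it can never turn into 161.
(char->integer (char-downcase (integer->char 195))) ; → 227
(char->integer (char-downcase (integer->char 129))) ; → 129

;; If that reading holds, the byte pattern should accept the unrelated
;; sequence 227 129 (not valid UTF-8 for á) rather than 195 161.
(regexp-match? (byte-regexp (bytes-append #"(?i:" upper #")"))
               (bytes 227 129))
```

With a string regexp the case variants are computed on the whole codepoint before any UTF-8 encoding is involved, which fits the observation that the other direction works.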