Oh gee whiz... I have a ten-line program trying to check whether certain documentation URLs are good, and I keep getting "403 Forbidden" responses from Cloudflare... even when curl -I works fine with what appears to be the same URL. I've been flip-flopping between percent-encoded and non-percent-encoded URLs; both seem to work fine with curl -I, and both seem to moderately reliably (ugh) return 403 Forbidden from Racket. My guess is that there's a header I need to be providing.
Hmm... looking at the output of curl -v -I, it looks to me like curl is issuing an HTTP/2 request rather than an HTTP/1.1 request. I hope that's not the problem? Oog... Maybe this is more of a Discord question, but I'm going to post my source code anyhow. Apologies in advance for the inline code.
#lang racket
(require net/url)
(provide url-str-response)
;; ensure links are live
;; responses:
;; - 'okay
;; - 'not-found
;; - 'forbidden
;; - 'bad-request
;; returns 'okay or 'not-found or signals an error?
(define (url-str-response url-str)
  (define get-port
    (get-impure-port (string->url url-str)))
  (define first-line (regexp-match #px"^([^\r]+)\r\n" get-port))
  (define response-line
    (match first-line
      [(list _ first-line)
       first-line]
      [other
       (error 'response-line "no response line... more info here")]))
  (match response-line
    [(regexp #px#"^HTTP/1.[[:digit:]] ([[:digit:]]{3}) " (list _ code))
     (match code
       ;; I bet this list appears somewhere else, sigh...
       [#"200" 'okay]
       [#"400" 'bad-request]
       [#"403" 'forbidden]
       [#"404" 'not-found]
       [other (error 'uhoh "unexpected response: ~e" response-line)])]))

(url-str-response
 #;"https://docs.racket-lang.org/reference/treelist.html#%28def._%28%28lib._racket%2Ftreelist..rkt%29._treelist-filter%29%29"
 "https://docs.racket-lang.org/reference/treelist.html#(def._((lib._racket/treelist..rkt)._treelist-filter))")
Oog... well, it's not HTTP/2 that's the problem.
Yikes! Okay, I ... learned something? But I'm still a bit confused? It appears that the "fragment" portion of the URL "is not sent to the server when the URI is requested"[1]. When I use url->string I can see that the fragment is definitely getting fenced off into its own section, but then... the request fails. It looks like our get-impure-port and friends are somehow trying to stuff the fragment into the server request, but it seems that ... you're not supposed to do this?
For now, then, it looks like the right way to go here is to manually remove the fragment portion from the URL before sending it (sketch below), which I really hope isn't the "right thing" here.
[1] (cf. URI fragment - URIs | MDN)
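Here's the sort of thing I mean, in case it's not clear. This is only a sketch (url-without-fragment is just a name I made up), and I'm not at all sure it's the blessed way to do this:
#lang racket
(require net/url)

;; Workaround sketch: copy the url struct with the fragment field cleared,
;; so the fragment never makes it into the request line.
(define (url-without-fragment u)
  (struct-copy url u [fragment #f]))

(get-impure-port
 (url-without-fragment
  (string->url
   "https://docs.racket-lang.org/reference/treelist.html#(def._((lib._racket/treelist..rkt)._treelist-filter))")))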
Quick follow-up: using netcat on localhost, I can see that the fragment is getting put into the request to the server. For instance, if I start listening on localhost with nc -l 9303 and then run this program:
#lang racket
(require net/url)
(get-impure-port (string->url "http://localhost:9303/index.html#zzzz"))
I see this text in the terminal:
GET /index.html#zzzz HTTP/1.1
Host: localhost:9303
User-Agent: Racket/8.15.900 (net/http-client)
Content-Length: 0
... so the fragment is definitely getting embedded in the file request.
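(For comparison, clearing the fragment before the request goes out, using the same struct-copy trick as in the earlier sketch, should make the GET line come out as plain /index.html. That's the behaviour I'd expect, anyway:)
#lang racket
(require net/url)

;; Same nc -l 9303 experiment, but with the fragment cleared before sending;
;; the request line should then read "GET /index.html HTTP/1.1".
(define u (string->url "http://localhost:9303/index.html#zzzz"))
(get-impure-port (struct-copy url u [fragment #f]))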
Is this not a bug in get-impure-port? My understanding (very limited!) is that the fragment should never wind up in the request line like this.
Safari's dev tools confirm for me that when I visit a URL with a fragment, Safari sends the request without the fragment and then handles the fragment client-side. The netcat trick shows the same thing when I send the request from my browser.
To my surprise, RFC 3986 § 3.5 indeed says:
Fragment identifiers have a special role in information retrieval systems as the primary form of client-side indirect referencing, allowing an author to specifically identify aspects of an existing resource that are only indirectly provided by the resource owner. As such, the fragment identifier is not used in the scheme-specific processing of a URI; instead, the fragment identifier is separated from the rest of the URI prior to a dereference, and thus the identifying information within the fragment itself is dereferenced solely by the user agent, regardless of the URI scheme. Although this separate handling is often perceived to be a loss of information, particularly for accurate redirection of references as resources move over time, it also serves to prevent information providers from denying reference authors the right to refer to information within a resource selectively. Indirect referencing also provides additional flexibility and extensibility to systems that use URIs, as new media types are easier to define and deploy than new schemes of identification.
I'd want to do more digging to be sure, especially checking whether older documents differed and what other implementations do (especially non-browser implementations, since browsers follow the WHATWG URL spec), but it tentatively does sound like a bug in the Racket HTTP client.