Reading a web directory (slow) vs. reading a file (quick)

I'm reading a couple of NOAA directories on the web, just to find out what the latest data is. But that is taking forever, like 500 seconds (!). Whereas once I find the file I need, I can read its 28756136 bytes in 6 seconds. The directory listings are really short. I'm trying to figure out why reading them is taking so long.

#lang racket

(require net/url html-parsing)

(define nbm-base "https://nomads.ncep.noaa.gov/pub/data/nccf/com/blend/prod/blend.")

; (nbm-webdir "20230731")
; => "https://nomads.ncep.noaa.gov/pub/data/nccf/com/blend/prod/blend.20230731/"

(define (nbm-webdir ymd8)
  (string-append nbm-base ymd8 "/"))

; (nbm-textdir "20230731" "18")
; => "https://nomads.ncep.noaa.gov/pub/data/nccf/com/blend/prod/blend.20230731/18/text/"

(define (nbm-textdir ymd8 hour)
  (string-append (nbm-webdir ymd8) hour "/text/"))

; (nbm-product "20230731" "18" 'hourly)
; => "https://nomads.ncep.noaa.gov/pub/data/nccf/com/blend/prod/blend.20230731/18/text/blend_nbhtx.t18z"

(define (nbm-product ymd8 hour product-code)
  (string-append (nbm-textdir ymd8 hour) "blend_nb" (product-letter product-code) "tx.t" hour "z"))

(define (product-letter product-code)
  (match product-code
    ['hourly "h"]
    ['short "s"]
    ['extended "e"]
    ['super-extended "x"]
    ['probabilistic-extended "p"]
    [_ (error 'product-letter "unknown product-code: ~a" product-code)]))

(define (get-webdir-xexp ymd8)
  (html->xexp (get-pure-port (string->url (nbm-webdir ymd8)))))

(define (get-textdir-xexp ymd8 hour)
  (html->xexp (get-pure-port (string->url (nbm-textdir ymd8 hour)))))

(define (get-product-bytes ymd8 hour product-code)
  (port->bytes (get-pure-port (string->url (nbm-product ymd8 hour product-code)))))

(define x-webdir-xexp   (time (get-webdir-xexp   "20230731")))
(define x-textdir-xexp  (time (get-textdir-xexp  "20230731" "18")))
(define x-product-bytes (time (get-product-bytes "20230731" "18" 'hourly)))

The timings were:

cpu time: 67555 real time: 500331 gc time: 93
cpu time: 64171 real time: 500488 gc time: 81
cpu time: 2867 real time: 6364 gc time: 77

Since the cpu time is much lower than the real time, your problem seems to be limited by the download. But the cpu time also seems quite large.

Could you, for comparison, try downloading the file in a web browser and measure how long it takes?

In a web browser (Firefox), the 27MB file loads pretty quickly, and the directory listings are instantaneous.

In DrRacket, the 27MB file loads quickly enough for me (6 secs is fine), but I'm puzzled why the directory listings take 8+ minutes each. The directory listings are very short.

I'd suggest using net/http-client instead of net/url to interact with today's web servers. It will make HTTP 1.1 requests, it handles compression like gzip and deflate, it follows redirects, and it supplies other request headers that might impact this somehow. In general, it's closer to what Firefox would be doing.

For example, rewriting your program to this:

#lang racket

(require net/http-client
         ;html-parsing
         )

(define nbm-host "nomads.ncep.noaa.gov")
(define nbm-base "/pub/data/nccf/com/blend/prod/blend.")

(define (get-from-noaa uri-path)
  (define-values (_status _headers input-port)
    (http-sendrecv nbm-host
                   uri-path
                   #:ssl? #t))
  input-port)

;; Dummy because I don't have html-parsing installed.
(define (html->xexp in)
  (port->bytes in))

; (nbm-webdir "20230731")
; => "/pub/data/nccf/com/blend/prod/blend.20230731/"

(define (nbm-webdir ymd8)
  (string-append nbm-base ymd8 "/"))

; (nbm-textdir "20230731" "18")
; => "/pub/data/nccf/com/blend/prod/blend.20230731/18/text/"

(define (nbm-textdir ymd8 hour)
  (string-append (nbm-webdir ymd8) hour "/text/"))

; (nbm-product "20230731" "18" 'hourly)
; => "/pub/data/nccf/com/blend/prod/blend.20230731/18/text/blend_nbhtx.t18z"

(define (nbm-product ymd8 hour product-code)
  (string-append (nbm-textdir ymd8 hour) "blend_nb" (product-letter product-code) "tx.t" hour "z"))

(define (product-letter product-code)
  (match product-code
    ['hourly "h"]
    ['short "s"]
    ['extended "e"]
    ['super-extended "x"]
    ['probabilistic-extended "p"]
    [_ (error 'product-letter "unknown product-code: ~a" product-code)]))

(define (get-webdir-xexp ymd8)
  (html->xexp (get-from-noaa (nbm-webdir ymd8))))

(define (get-textdir-xexp ymd8 hour)
  (html->xexp (get-from-noaa (nbm-textdir ymd8 hour))))

(define (get-product-bytes ymd8 hour product-code)
  (port->bytes (get-from-noaa (nbm-product ymd8 hour product-code))))

(module+ example
  (define x-webdir-xexp   (time (get-webdir-xexp   "20230731")))
  (define x-textdir-xexp  (time (get-textdir-xexp  "20230731" "18")))
  (define x-product-bytes (time (get-product-bytes "20230731" "18" 'hourly))))

I get times like these:

cpu time: 13 real time: 553 gc time: 1
cpu time: 15 real time: 126 gc time: 0
cpu time: 555 real time: 838 gc time: 65

Note this is cheating because I'm not actually parsing the HTML using html-parsing (I don't have it installed on this computer). However, if you add that part back in and the times explode (which seems unlikely), at least we'll have narrowed the problem down to the parsing.


Note: I'm using the simplest, least efficient function, http-sendrecv, which uses a fresh connection for each request, as does get-pure-port. You probably don't care, but if you do, there's an API where you make a connection and reuse it for multiple requests, as sketched below.
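
A minimal sketch of that reusable-connection API, using http-conn-open, http-conn-sendrecv!, and http-conn-close! from net/http-client (untested here; the host and paths are the ones from this thread):

#lang racket

(require net/http-client)

;; Open one TLS connection to the host and reuse it for several requests.
;; #:auto-reconnect? reopens the connection if the server drops it.
(define conn
  (http-conn-open "nomads.ncep.noaa.gov" #:ssl? #t #:auto-reconnect? #t))

(define (fetch uri-path)
  (define-values (_status _headers in)
    (http-conn-sendrecv! conn uri-path))
  ;; Read the body fully so the connection is ready for the next request.
  (port->bytes in))

;; Both of these travel over the same connection.
(fetch "/pub/data/nccf/com/blend/prod/blend.20230731/")
(fetch "/pub/data/nccf/com/blend/prod/blend.20230731/18/text/")

(http-conn-close! conn)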


Note: Because net/http-client is closer to reality, there's no URL abstraction over host, path, scheme, port, etc. In most cases I find that's fine for web-service-y purposes; e.g., you can just split your nbm-base URL into nbm-host and nbm-base-path, assume SSL, and so on.


Below I inserted a port->string call in order to fetch the entire file at once and then start parsing.

(define (get-webdir-xexp2 ymd8)
  (html->xexp (port->string (get-pure-port (string->url (nbm-webdir ymd8))))))

(define (get-webdir-xexp ymd8)
  (html->xexp (get-pure-port (string->url (nbm-webdir ymd8)))))


(define x-webdir-xexp2  (time (get-webdir-xexp2   "20230731")))
(define x-webdir-xexp   (time (get-webdir-xexp    "20230731")))

The timings were:

cpu time: 27 real time: 500713 gc time: 0
cpu time: 34 real time: 500584 gc time: 0

I therefore come to the same conclusion as sschwarzer: the bottleneck must be the download.

Are there redirects involved in resolving the address?
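
One way to check is to look at the raw status line and response headers: a 3xx status plus a Location header would mean a redirect. A hedged sketch using http-sendrecv from net/http-client, which just reports whatever the server sends back:

#lang racket

(require net/http-client)

;; Print the status line and headers for the directory listing.
(define-values (status headers in)
  (http-sendrecv "nomads.ncep.noaa.gov"
                 "/pub/data/nccf/com/blend/prod/blend.20230731/"
                 #:ssl? #t))
(displayln status)
(for-each displayln headers)
(close-input-port in)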

Hmm. The command-line tool curl is not too happy about that directory URL:

% curl --verbose "https://nomads.ncep.noaa.gov/pub/data/nccf/com/blend/prod/blend.20230731/"                             

*   Trying 95.166.120.178:443...
* Connected to nomads.ncep.noaa.gov (95.166.120.178) port 443 (#0)
* ALPN: offers h2
* ALPN: offers http/1.1
*  CAfile: /etc/ssl/cert.pem
*  CApath: none
* [CONN-0-0][CF-SSL] (304) (OUT), TLS handshake, Client hello (1):
* [CONN-0-0][CF-SSL] (304) (IN), TLS handshake, Server hello (2):
* [CONN-0-0][CF-SSL] (304) (IN), TLS handshake, Unknown (8):
* [CONN-0-0][CF-SSL] (304) (IN), TLS handshake, Certificate (11):
* [CONN-0-0][CF-SSL] (304) (IN), TLS handshake, CERT verify (15):
* [CONN-0-0][CF-SSL] (304) (IN), TLS handshake, Finished (20):
* [CONN-0-0][CF-SSL] (304) (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / AEAD-AES256-GCM-SHA384
* ALPN: server accepted h2
* Server certificate:
*  subject: CN=nomads.ncep.noaa.gov
*  start date: Jul 17 12:10:12 2023 GMT
*  expire date: Oct 15 12:10:11 2023 GMT
*  subjectAltName: host "nomads.ncep.noaa.gov" matched cert's "nomads.ncep.noaa.gov"
*  issuer: C=US; O=Let's Encrypt; CN=R3
*  SSL certificate verify ok.
* Using HTTP2, server supports multiplexing
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* h2h3 [:method: GET]
* h2h3 [:path: /pub/data/nccf/com/blend/prod/blend.20230731/]
* h2h3 [:scheme: https]
* h2h3 [:authority: nomads.ncep.noaa.gov]
* h2h3 [user-agent: curl/7.87.0]
* h2h3 [accept: */*]
* Using Stream ID: 1 (easy handle 0x7f921c010a00)
> GET /pub/data/nccf/com/blend/prod/blend.20230731/ HTTP/2
> Host: nomads.ncep.noaa.gov
> user-agent: curl/7.87.0
> accept: */*
> 
* HTTP/2 stream 0 was not closed cleanly: PROTOCOL_ERROR (err 1)
* Connection #0 to host nomads.ncep.noaa.gov left intact
curl: (92) HTTP/2 stream 0 was not closed cleanly: PROTOCOL_ERROR (err 1)

Maybe it's a bad interaction between the net/url library and the web server?
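
One way to probe that from the net/url side is to time each phase separately. This is a hedged sketch: get-impure-port and purify-port (both from net/url) expose the status line and headers that get-pure-port strips, so it should show whether the stall happens before the headers arrive or while reading the body:

#lang racket

(require net/url)

(define u
  (string->url "https://nomads.ncep.noaa.gov/pub/data/nccf/com/blend/prod/blend.20230731/"))

;; get-impure-port leaves the status line and headers in the stream;
;; purify-port consumes them and returns them as a string.
(define in      (time (get-impure-port u)))
(define headers (time (purify-port in)))
(display headers)
(define body    (time (port->string in)))
(printf "body length: ~a\n" (string-length body))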


This is really great, thank you so much! I'll use this for now and maybe tweak it as you suggest later. Your numbers are way better than mine, and since I'm not doing anything fancy parsing-wise, your solution is perfect for now.

What I'm doing this for: I have CAP [Civil Air Patrol] gliders and tow planes to move back and forth between two locations, one in Pennsylvania and one in Vermont. For aerotow I need good visibility along the 250+ mile route over a 4-hour period, and generally favorable weather. I need to look ahead up to a week, but certainly several days ahead. There are multiple stations along the route, and the NBM (National Blend of Models) is very good for all the surface weather stuff plus ceilings, cloud bases, and cloud cover. I'm using this to put together a visualization of when the best windows of opportunity are (if they exist). Then I also have to check other things, like winds aloft at 10,000' (a good height for aerotow in case emergency glides to an airfield are needed by either tow plane or glider). It's a data synthesis / visualization effort to help decision making.

Thank you, I should have used curl as you did to diagnose things. You found what I suspected: that the client side was waiting for something that wasn't happening right on the server side. I'm glad net/http-client is an easy way out of that mess.

IIRC, the http123 package offers a good higher-level layer over net/http-client that includes some URL abstraction and connection pooling.


Thanks, I'll check that out too!

From the curl messages alone, it's not clear to me whether the client was waiting for the server or the other way around (i.e., the server waiting for more data in the request).
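
One way to tell them apart (an untested sketch using ssl-connect from Racket's openssl library) is to send a complete request by hand and then time the first response byte separately from the EOF. If the first byte arrives quickly but the EOF takes ~500 seconds, the server answered promptly and then held the connection open; if the first byte itself takes ~500 seconds, the server stalled on the request as sent:

#lang racket

(require openssl)

(define-values (in out)
  (ssl-connect "nomads.ncep.noaa.gov" 443))

;; Minimal HTTP/1.1 request; Connection: close asks the server to
;; close the socket after the response, so EOF marks the end.
(write-string
 "GET /pub/data/nccf/com/blend/prod/blend.20230731/ HTTP/1.1\r\nHost: nomads.ncep.noaa.gov\r\nConnection: close\r\n\r\n"
 out)
(flush-output out)

(time (peek-byte in))            ; time to first response byte
(time (void (port->bytes in)))   ; time until the server closes
(close-input-port in)
(close-output-port out)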
