Walking a file tree as a study in recursive generators

Hi,

I want to learn a Racket way of writing recursive generators. Speaking of the programming languages in general, it's either yield or flatMap approach, depending on the language.

Walking the file tree, MWE, Python:

#!/usr/bin/env python

import os
from pathlib import Path
from typing import Iterator, List, Tuple

SRC_DIRS = (".",)
EXTENSIONS = (".RKT",)


def children(parent: Path) -> Tuple[List[Path], List[Path]]:
    lst = os.listdir(parent)
    dirs = [Path(x) for x in lst if (parent / x).is_dir()]
    files = [Path(x) for x in lst if (parent / x).is_file()]
    return dirs, files


def traverse(parent: Path) -> Iterator[Path]:
    dirs, files = children(parent)
    for directory in dirs:
        yield from traverse(parent / directory)
    for file in files:
        yield parent / file


for dir in SRC_DIRS:
    for file in traverse(Path(dir)):
        if file.suffix.upper() in EXTENSIONS:
            print(f">> {file}")

The online sources say that yield is known in the Racket context. Unfortunately, it's only bits and pieces, no clean picture. Today one could ask the AI, which I did. The translation looks uncannily clean and plausible:

#!/usr/bin/env racket
#lang racket

(require racket/path)

(define SRC_DIRS '("."))

(define EXTENSIONS '(".RKT"))

(define (children parent)
  (define lst (directory-list parent))
  (define dirs (filter path-directory? lst))
  (define files (filter path-file? lst))
  (values dirs files))

(define (traverse parent)
  (define-values (dirs files) (children parent))
  (for*/flatten ([directory dirs])
    (traverse (path->string (build-path parent directory))))
  (for ([file files])
    (path->string (build-path parent file))))

(for ([dir SRC_DIRS])
  (for ([file (traverse (path->string (build-path dir)))])
    (when (member (path-extension file) EXTENSIONS)
      (displayln (format ">> ~a" file)))))

It won't run, though: the AI invented path-file? and path-directory? functions. Which is understandable, because I couldn't find them, or their counterparts, either.

Of course, I already have the key bit of information: flatMap approach is supported by Racket, see and learn for*/flatten.

How can I make it work? How can I make it even more beautiful, if possible :blush: ?

I'll point you to 15.2 Filesystem
for functions related to the file system.

As an example directory-exists? checks whether a given path designates a directory.
See also file-exists? .

Now that you mention it! I'm grappling with the children function. And it shows weird behavior. Here's a test case:

#!/usr/bin/env racket
#lang racket

(require racket/path
         racket/stream
         racket/file)

(define (children parent)
 (define-values (all-items) (directory-list parent))
 (let-values ([(dirs files) (partition directory-exists? all-items)])
   (values dirs files)))

(define (traverse-and-print)
    (define-values (dirs files) (children "."))
    (displayln dirs)
    (displayln files))

; Run the main function
(traverse-and-print)

It works - well, almost. Sometimes, on different parent directories, it takes a child directory for a file. I can't see any pattern; besides, the function is really simple, nowhere to dig into...

UPD

This one

(define (children parent)
  (define lst (directory-list parent))
  (define dirs (filter directory-exists? lst))
  (define files (filter file-exists? lst))
  (values dirs files))

also works, with the same lethal bug. Some ordinary items are invisible. For instance, output:

~/spaces/lisp> ./children.rkt
(ff)
(fmt.rkt)
~/spaces/lisp> ls -lh ff
total 16K
drwxr-xr-x 2 alexey alexey 4.0K Mar 13 16:01 bobo
-rwxr-xr-x 1 alexey alexey  157 Mar 13 15:21 boo.rkt
drwxr-xr-x 3 alexey alexey 4.0K Mar 12 20:49 ff
-rwxr-xr-x 1 alexey alexey  143 Mar 12 21:07 fmt.rkt

One directory and one file is missing.

UPD

The trouble shows only when I've been passing to children (and so to directory-list) a parent which is different from the current directory. Specifically, it behaves only if I pass "." for a path. I don't believe in such core library bugs.

~/spaces/lisp/ff> ../children.rkt
(bobo boo.rkt ff fmt.rkt)
(bobo ff)
(boo.rkt fmt.rkt)
~/spaces/lisp/ff> ls -lh
total 16K
drwxr-xr-x 2 alexey alexey 4.0K Mar 13 16:01 bobo
-rwxr-xr-x 1 alexey alexey  157 Mar 13 15:21 boo.rkt
drwxr-xr-x 3 alexey alexey 4.0K Mar 12 20:49 ff
-rwxr-xr-x 1 alexey alexey  143 Mar 12 21:07 fmt.rkt

#:build? #t is the answer:

(define (children parent)
  (define lst (directory-list parent #:build? #t))
  (define dirs (filter directory-exists? lst))
  (define files (filter file-exists? lst))
  (values lst dirs files))

directory-list returns a list of single names. Meanwhile, directory-exists? tests if the input is a directory relative to the current directory. So what you are doing is to test if these names are immediate directories under the current directory. Obviously, that’s not what you want to do. You should follow what you did in Python more closely. In your Python code, (parent / x).is_dir() checks if a single name x is a directory under parent, by using / to concatenate parent and x together. You can do the same in Racket with build-path.

Here’s a rough translation of your Python code to Racket, using Racket’s generator. Unlike the generator in Python, the Racket generator is not built into the language, but it’s instead provided as a library.

#lang racket

(require racket/generator)

(define (children parent)
  (define dirlist
    (with-handlers ([exn:fail? (λ (e) '())])
      (directory-list parent)))
  (define lst
    (for/list ([path (in-list dirlist)])
      (build-path parent path)))
  (partition directory-exists? lst))

(define (traverse parent)
  (define-values (dirs files) (children parent))
  (for ([directory (in-list dirs)])
    (traverse directory))
  (for ([file (in-list files)])
    (yield file)))

(time
 (for/list ([path (in-generator (traverse "/usr"))]
            #:when (path-has-extension? path ".rkt"))
   path))
;; cpu time: 1416 real time: 1477 gc time: 19

I think it's obligatory to note that generators usually are overkill. There are many other possible approaches. One possibility is to use make-do-sequence. In this approach, you have a state called position (which is essentially an iterator), along with operations like how to advance a state, how to extract a value from a state, and how to tell if a state has no more elements.

#lang racket

(define (children parent)
  (define dirlist
    (with-handlers ([exn:fail? (λ (e) '())])
      (directory-list parent)))
  (define lst
    (for/list ([path (in-list dirlist)])
      (build-path parent path)))
  (partition directory-exists? lst))

;; a position is either:
;; - '()
;; - (listof (or/c file/c dir/c))
;;
;; file/c = path?
;; dir/c = (list/c path?)
;;
;; a proper position is a position that, if it's not empty, 
;; then the first element must be a file/c

(define (init-pos path)
  (proper-ize (list (list path))))

(define (proper-ize improper-pos)
  (match improper-pos
    ['() '()]
    [(cons (list dir) ps)
     (define-values (dirs files) (children dir))
     (proper-ize (append (map list dirs) files ps))]
    [(cons p ps) improper-pos]))

(define (next-pos pos)
  (proper-ize (cdr pos)))

(define (traverse* path)
  (make-do-sequence
   (λ ()
     (initiate-sequence
      #:pos->element car
      #:next-pos next-pos
      #:init-pos (init-pos path)
      #:continue-with-pos? pair?))))

(time
 (for/list ([path (traverse* "/usr")]
            #:when (path-has-extension? path ".rkt"))
   path))
;; cpu time: 1394 real time: 1444 gc time: 18

2 Likes

Thank you sooo very much!

One can even use streams, which is one of the nicest and cleanest solutions. Here's what I have right now. Of course, the traverse part is all wrong, because I haven't yet figured out how to flatten the recursive part of the stream (the flatMap approach, yes). Now I'm here:

#!/usr/bin/env racket
#lang racket

(require racket/path
         racket/stream
         racket/file)

; Function to list all files and directories in a directory
(define (children parent)
 (define-values (all-items) (directory-list parent #:build? #t))
 (let-values ([(dirs files) (partition directory-exists? all-items)])
   (values dirs files)))

; Function to traverse directories and produce a flat stream of files
(define (traverse parent)
 (define-values (dirs files) (children parent))
 (stream-append
    (for/stream ([dir dirs])
      (traverse dir))
    (for/stream ([file files])
      (stream file))))

(define reds (stream-cons "red" reds))

; Main function to traverse directories and print matching files
(define (traverse-and-print)
    (define-values (dirs files) (children "ff"))
    (displayln dirs)
    (displayln files)

    (stream-for-each displayln (traverse "ff")))
    ;(stream-for-each displayln reds))

; Run the main function
(traverse-and-print)

UPD

; Function to traverse directories and produce a flat list of files
(define (traverse parent)
 (define-values (dirs files) (children parent))
 (stream-append
    (for/stream ([dir dirs])
      (traverse dir)) ; Recursively traverse subdirectories
    (stream files))) ; Append files in the current directory

this one shows something immediately under parent, but no deeper. It isn't getting flattened:

#<stream>
#<stream>
(ff/boo.rkt ff/fmt.rkt)

This is the solution I've been after, based on streams. The honor is due mostly to sorawee, and to him goes the green checkbox. This snippet is good to copy, paste, and run.

#!/usr/bin/env racket
#lang racket

(require racket/path
         racket/stream
         racket/file)

; Function to list all files and directories in a directory
(define (children parent)
 (define-values (all-items) (directory-list parent #:build? #t))
 (let-values ([(dirs files) (partition directory-exists? all-items)])
   (values dirs files)))

; Function to traverse directories and produce a flat stream of files
(define (traverse parent)
  (define-values (dirs files) (children parent))
  (stream-append
   (apply stream-append
          (for/list ([dir (in-list dirs)])
            (stream-lazy (traverse dir))))
   files))

(define reds (stream-cons "red" reds))

; Main function to traverse directories and print matching files
(define (traverse-and-print)
    (define-values (dirs files) (children "."))
    (displayln (format "dirs: ~a" dirs))
    (displayln (format "files: ~a" files))

    ;(stream-for-each displayln (traverse ".")))
    ;(stream-for-each displayln reds))
    (for ([item (in-stream (traverse "."))]
          [idx (in-naturals 1)])
      (printf "~a ~a\n" idx item)))


; Run the main function
(traverse-and-print)

Your stream solution is not correct though, in a sense that it is not “lazy”. The initial call of (traverse ".") will already traverse the whole filesystem (supposing you run the program from the root). This is unlike the generator solution or the make-do-sequence approach, where the filesystem is traversed only as necessary to give you elements as you request.

1 Like

Thanks!

While I got a solution I sought, this topic becomes an Encyclopedia. Just as I hoped :blush: .

Oh, wait...

Oh, that defeats the purpose all right. There's nothing to be done within the stream concept?

Here’s one possible implementation of traverse that doesn’t suffer from the aforementioned issue.

(define (traverse parent)
  (define-values (dirs files) (children parent))
  (stream-append
   (apply stream-append
          (for/list ([dir (in-list dirs)])
            (stream-lazy (traverse dir))))
   files))

(There are other possible places to put stream-lazy. It could be that you want a different behavior, so adjust it as needed)

The stream solution is slower than the other solutions though.

Generator: cpu time: 1561 real time: 1617 gc time: 9
make-do-sequence: cpu time: 1549 real time: 1594 gc time: 24
stream: cpu time: 2117 real time: 2209 gc time: 176

2 Likes

Curioser and curioser! Does gc time mean garbage collection time? Did you use this? Or is there something more quick and dirty?

Use time. The explanation is under time-apply.

https://docs.racket-lang.org/reference/time.html#(def._((quote._~23~25kernel)._time-apply))

I tried the script on my home directory, 1259148 files. No visible issues. What kind of different behavior do you mean?

If you want (traverse ...) to not do anything at all until we start iterating on the sequence, you might want to wrap the body under stream-lazy, rather than putting stream-lazy in the recursive call.

(define (traverse parent)
  (stream-lazy
   (let-values ([(dirs files) (children parent)])
     (stream-append
      (apply stream-append (map traverse dirs))
      files))))

I was too lazy (see what I did) to figure out an appropriate place to put stream-lazy.

1 Like

You mean, dirs and files are now lazy, too? Looks like an unnecessary purism, unless I miss something important.