Color lexer backup distances and `dont-stop`

I'm trying to implement a color lexer for the syntax of .gitignore files. For reasons I'll explain, I think I need non-zero backup distances and dont-stop, which have always confused me, and I want to check my understanding.

  1. The docs say:

    A dont-stop result is useful, for example, when a lexer has to read ahead in input-port to decide on the tokens at this point; that read-ahead will be inconsistent if an edit happens, so a dont-stop structure ensures that no changes to the buffer happen between calls.

    Do I understand correctly that, if the token at some point depends on peeking beyond the end of the token that's ultimately returned, a dont-stop is not only "useful", but necessary?

  2. Does the rule that:

    A change to the stream must not change the tokenization of the stream prior to the token immediately preceding the change plus the backup distance.

    mean that, if more than one token of look-ahead is required, both dont-stop and a non-zero backup distance are required?

  3. Is it reasonable to simplify the implementation by using a conservative approximation of backup distances and dont-stop, rather than the precise minimum?

Some specifics might be helpful. The .gitignore syntax is defined in terms of the POSIX function fnmatch and globbing. I want to implement the syntax and semantics of standard .gitignore, so I need to handle ranges in square brackets: the relevant parts are almost the same as for regexp, but they also support ‹posix› character classes, as pregexp does. (They're not a subset of pregexp, either, though.) The syntax has some confusing subtleties. For example:

  • [[:alpha:]]b matches ab, but
    [[:alpha]]b matches :]b
  • [a-y]z matches xz, but
    [a-]z matches -z

Because I find this deeply confusing, I want to color these cases differently. I need at least one character of look-ahead to distinguish - as range syntax vs. a literal character, and a bunch of look-ahead for the POSIX character classes. But all of these cases can only occur in the mode for lexing the tail of a bracketed form, so I'm hoping I can just track the backup distance to the beginning of that mode and use dont-stop until the end of it, rather than tracking specific backup distances within the mode.
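
To make that plan concrete, here is the bookkeeping I have in mind; the struct name and helper below are mine, not from any library:

;; Hypothetical mode value for "lexing the tail of a bracketed form",
;; remembering the position where the form's "[" began.
(struct bracket-tail-mode (open-pos) #:transparent)

;; Conservative backup distance for a token starting at token-start:
;; inside the bracketed form, back up all the way to its "[", otherwise 0.
(define (conservative-backup mode token-start)
  (if (bracket-tail-mode? mode)
      (- token-start (bracket-tail-mode-open-pos mode))
      0))

The plan, then, would be that the token for the opening [ switches to this mode (recording its own start position), every token produced while in the mode reports this possibly-too-large backup distance and returns its mode wrapped in dont-stop, and both are dropped once the closing ] is consumed.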


ETA, for posterity:

Despite what gitignore(5) currently says, Git has not used fnmatch at all since v2.0.0 (May 2014), and not by default since v1.8.4 (August 2013): see upstream commit 70a8fc9. Instead, it uses an implementation called wildmatch, inherited from rsync, which is under a variant of the BSD license (I haven't figured out the specific SPDX identifier, but see Replace fnmatch with wildmatch by pks-t · Pull Request #5110 · libgit2/libgit2 · GitHub for details). Its semantics are similar to, but different from, fnmatch's, and generally simplified (e.g. no LC_CTYPE sensitivity, and no POSIX collating symbols (i.e. [.name.]) or equivalence class expressions (i.e. [=name=])). On the one hand, the semantics are basically implementation-defined; on the other hand, at least they are defined by one implementation.

While waiting for someone like @robby or @mflatt to give a more-informed answer, in case it helps I can share an example of code using this interface:

https://github.com/greghendershott/racket-mode/blob/hash-lang/racket/hash-lang.rkt#L280-L404

On the downside:

  • This is from the hash-lang branch not yet merged to main, so it's beta quality.
  • Although "it works" I may have misunderstood the intent.
  • The token-tree% and paren-tree% APIs somewhat obscure things.

On the upside, compared to the equivalent (ish?) code in DrRacket, I think the logic is maybe easier to follow, because the code:

  • is more centralized
  • is less concerned with editor change coordination
  • has more comments explaining the intent (although, again, my intent might be wrong)

p.s. In your specific case, because a .gitignore file is line-oriented, some back-up/look-ahead strategy involving whitespace tokens for newlines might suffice? (That might re-tokenize more than necessary, but probably won't be so inefficient as to matter in practice?)

It sounds to me like you understand, @LiberalArtist.

  1. I would say that dont-stop is useful when you want to color some region in two (or more) different colors but you cannot parse them independently. So, you parse them both then return the first token with a dont-stop and then return the second one. It is shoehorning the ability to return lists of tokens into a flat-style API. (Hope this makes sense.)

  2. I think that the backup distance is a separate thing from the dont-stop result. It may be that you can use one to get the same effect as the other, tho, I'm not sure. Maybe if you have a specific example I can answer this one better. (@mflatt added the backup distance to support the scribble lexer, tho, so he may have a better answer here.)

  3. Sure. The main thing is to keep tokens relatively small; they don't have to be tiny (for example, strings are single tokens in #lang racket and then can, in principle, span a giant file) but the size of the token controls the interactivity when editing the files in DrRacket. Smaller and more tokens means more opportunities for DrRacket to respond to the user's editing events, as DrRacket won't interrupt the colorer in the middle of computing a single token.

I think @greghendershott's advice is good, tho. Processing an entire line at once, putting the processing result into the mode, and then returning a sequence of tokens via dont-stop until the end of the line seems like a (relatively?) easy way to get something going.
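
In case it helps to see the shape of that, here is a rough, untested sketch. The per-line tokenization is deliberately trivial (whitespace vs. everything else) and the names are made up; only the plumbing around the mode and dont-stop is the point:

(require racket/list                    ; for splitf-at
         syntax-color/lexer-contract)   ; for the dont-stop structure

;; Mode: the tokens of the current line that were parsed but not yet
;; returned, plus the position where the line started (for backup distances).
(struct line-mode (line-start pending) #:transparent)

;; PEEK (don't read) the rest of the current line, including its newline,
;; and split it into (cons lexeme type) pairs.
(define (peek-line-tokens in)
  (define chars
    (let loop ([skip 0])
      (define c (peek-char in skip))
      (cond [(eof-object? c) '()]
            [(char=? c #\newline) (list c)]
            [else (cons c (loop (add1 skip)))])))
  (let loop ([cs chars])
    (cond
      [(null? cs) '()]
      [else
       (define ws? (char-whitespace? (car cs)))
       (define-values (tok rest)
         (splitf-at cs (lambda (ch) (eq? ws? (char-whitespace? ch)))))
       (cons (cons (list->string tok) (if ws? 'white-space 'symbol))
             (loop rest))])))

(define (line-lexer in offset mode)
  (define-values (_l _c start) (port-next-location in))
  (define fresh? (not (and (line-mode? mode) (pair? (line-mode-pending mode)))))
  (define line-start (if fresh? start (line-mode-line-start mode)))
  (define pending (if fresh? (peek-line-tokens in) (line-mode-pending mode)))
  (cond
    [(null? pending)
     (values (read-char in) 'eof #f #f #f 0 #f)]
    [else
     (define lexeme (car (car pending)))
     (define type   (cdr (car pending)))
     (define rest   (cdr pending))
     (read-string (string-length lexeme) in)  ; consume exactly this token
     (define-values (_l2 _c2 end) (port-next-location in))
     (values lexeme type #f start end
             ;; conservative backup: an edit anywhere in the line may change
             ;; earlier tokens on the same line, so point back to its start
             (- start line-start)
             ;; keep dont-stop until this line's pending tokens run out
             (if (null? rest)
                 #f
                 (dont-stop (line-mode line-start rest))))]))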

hth!

Thanks @greghendershott and @robby.

This makes a lot of sense! (I think.)

Trying to write an example left me feeling even more shaky about my understanding here, but here's an attempt based on the `" 1 2 3` example from the start-colorer docs.

Imagine that, if a buffer contains:

[ 1 2 3

we want to color [ as an error token, but if it is edited to contain:

[ 1 2 3]

then we instead want to color [ as a parenthesis token. As the docs explain, since that would "change the tokenization of the stream prior to the token immediately preceding the change", we would need a non-zero backup distance.

But do we also need a dont-stop? It seems like we "cannot parse … independently" the opening [, since we need to peek ahead to see if there is a matching ]. Per the docs, "that read-ahead will be inconsistent if an edit happens, so a dont-stop structure ensures that no changes to the buffer happen between calls."
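
To make this concrete (and to check my understanding), here is a toy sketch of a lexer for exactly this buffer. It isn't anyone's real implementation; the after-open mode struct, the token types, and the backup policy are all just my guesses at the intended usage:

(require syntax-color/lexer-contract) ; for the dont-stop structure

;; Mode: we've seen a "[" at position open-pos and haven't consumed its "]",
;; so a later edit (adding or removing a "]") can change the "["'s color.
(struct after-open (open-pos) #:transparent)

;; Peek forward, without consuming anything, to see whether a "]" occurs
;; anywhere later in the stream.
(define (closer-ahead? in)
  (let loop ([skip 0])
    (define c (peek-char in skip))
    (cond [(eof-object? c) #f]
          [(char=? c #\]) #t]
          [else (loop (add1 skip))])))

;; Read characters while keep? holds, returning them as a string.
(define (read-while keep? in)
  (list->string
   (let loop ()
     (define c (peek-char in))
     (if (or (eof-object? c) (not (keep? c)))
         '()
         (cons (read-char in) (loop))))))

(define (toy-lexer in offset mode)
  (define-values (_line _col start) (port-next-location in))
  ;; While in after-open mode, every token reports a backup distance
  ;; reaching back to the "[".
  (define backup
    (if (after-open? mode) (- start (after-open-open-pos mode)) 0))
  (define (ret lexeme type paren new-mode)
    (define-values (_l _c end) (port-next-location in))
    (values lexeme type paren start end backup
            (if (after-open? new-mode) (dont-stop new-mode) new-mode)))
  (define c (peek-char in))
  (cond
    [(eof-object? c)
     (values (read-char in) 'eof #f #f #f 0 #f)]
    [(char=? c #\[)
     (define matched? (closer-ahead? in)) ; read-ahead decides the color ...
     (read-char in)
     (ret "[" (if matched? 'parenthesis 'error) (and matched? '|[|)
          (after-open start))]            ; ... so later tokens carry backup
    [(char=? c #\])
     (read-char in)
     (ret "]" 'parenthesis '|]| #f)]      ; region closed: back to plain mode
    [(char-whitespace? c)
     (ret (read-while char-whitespace? in) 'white-space #f mode)]
    [else
     (ret (read-while (lambda (ch)
                        (not (or (char-whitespace? ch)
                                 (memv ch '(#\[ #\])))))
                      in)
          'constant #f mode)]))

The [ branch is the interesting one: its color is decided by read-ahead, so every later token, until the ], reports a backup distance reaching back to the [ and returns its mode wrapped in dont-stop. (My understanding is that the colorer unwraps the dont-stop before the next call, so toy-lexer itself only ever sees a bare after-open value.)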

If you only need to peek into the next token, as with [a-] vs. [a-z], IIUC you need dont-stop but don't need a backup distance. I know that syntax-color/scribble-lexer tracks backup distances but doesn't use dont-stop, so I gather that's a possible set of circumstances, but I don't understand how or why.

(When I wrote that these "have always confused me", I had scribble-lexer especially in mind. I think it would help to have an example of "a change later [that] could turn the non-match into a match".)

;; If we don't match rx:opener in a place where we might otherwise
;;  match, and there is a "|" at that point, then a change later
;;  could turn the non-match into a match, AND there could be
;;  arbitrarily many Scheme tokens in between. So we carry the backup
;;  position, use it as necessary (from places that might be between a "|"
;;  and a potential match creator), and cancel it when it's clearly
;;  not needed anymore (which includes after any token that isn't a 
;;  Scheme token).

This is a very appealing idea. As it happens, I've got the skeleton of an incremental lexer for the actual patterns (by closely imitating racket/src/regexp/parse/range.rkt), but the line-at-a-time approach would work very well for the first line: since the whole #lang line counts as a comment for Git, it could contain some optional Racket tokens, e.g. to have:

#lang foo/gitignore #:skip-bad-patterns
[[:ctrl:]]

implement the strictly Git-compatible behavior of a pattern that never matches, instead of telling you at read or compile time that you misspelled [[:cntrl:]].

Would I need to peek the line rather than read it, though? I'm not sure if anything explicitly says this—maybe check-colorer-results-match-port-before-and-afters?—but I was under the impression that you need to keep the position of the input port consistent with the token positions.

Using backup without dont-stop in scribble-lexer is probably a bug. I think the history is that the need for backup was noticed before the need for something like dont-stop. When dont-stop was added later, it wasn't immediately obvious that the two always need to go together. As far as I can tell now, they need to be used in tandem.

A problem with dont-stop as implemented currently is that it makes the editor less interactive, because dont-stop forces the colorer to keep going before editing is allowed again. The problem with backup is that it doesn't tell the colorer to expect backup information in the future, so the colorer doesn't know that editing after the current coloring frontier may need to move the coloring frontier backward. The combination of backup information with dont-stop gives the colorer the information that it needs, although not in a great form. Even without improving the interface, though, maybe the colorer could be improved by treating dont-stop as meaning that editing later needs to rewind coloring to the point where the dont-stops started.

I'm persuaded that needing a backup distance always implies needing dont-stop, but I thought the following was an example of needing dont-stop not necessarily implying the need for a non-zero backup distance (I could be wrong, though!):

(I'm assuming that in [a-] you want to color - as a literal character, but in [a-z] you want to color it as an infix operator.)
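
Concretely, that distinction needs only a single-character peek past the - that is about to be returned; here is a sketch (the helper name and token types are just illustrative):

;; Called when the next character to read is the "-" inside a bracket
;; expression; peek one character further to classify it.
(define (classify-dash in)
  (define next (peek-char in 1)) ; skip the "-" itself, look one char further
  (if (or (eof-object? next) (char=? next #\]))
      'constant   ; literal "-", as in [a-]
      'symbol))   ; range syntax, as in [a-z]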

In the course of looking into this, I had also come across @greghendershott's call for an example of dont-stop in use:

Since no one came up with an example, maybe there's more room to make changes without causing problems in practice than I would otherwise assume.

I'm not sure why you'd need dont-stop in that case. If I understand correctly, no backup is needed because any change to the next character will restart lexing at the token that ends at the edited position. But it seems like that would also avoid the need for dont-stop. Maybe my conclusion is based more on what dont-stop does than on how it's documented.

Not sure whether I missed that message or just forgot to answer, but the shrubbery lexer does produce dont-stop any time it's going to produce a non-zero backup next:

(require shrubbery/syntax-color)
(define in (open-input-string "@/{}")) ;; extra "/" would change "@" treatment
(shrubbery-lexer in 0 #f)
;; last result is `(dont-stop '#s(pending-backup-mode 1 initial))`

That makes sense. The documentation does say, albeit indirectly, that "the token immediately preceding" a change to the input stream may change even if the backup distance is 0. It does seem that, to implement that behavior, the caller of a lexer/c would need to, as you say, "restart lexing at the token that ends at the edited position" (or, IIUC, if the edited position were in the middle of a token spanning more than one position, the token before the one containing the edited position). So I understand why that scenario would work out without dont-stop. But I don't think the docs explicitly say that it will.

Is there ever a case when the fourth result of a lexer, i.e.:

start — The starting position of the token (or #f for an end-of-file). This number is relative to the third result of (port-next-location input-port).

may licitly be something other than #f or exactly the third result of (port-next-location input-port)?

The documentation for check-colorer-results-match-port-before-and-afters says the following must hold:

(or (equal? type 'eof)
    (and (<= pos-before new-token-start pos-after)
         (<= pos-before new-token-end pos-after)))

Setting aside 'eof, we could rewrite this as a system of four inequalities:

(and (<= pos-before new-token-start) ;[1]
     (<= new-token-start pos-after)  ;[2]
     (<= pos-before new-token-end)   ;[3]
     (<= new-token-end pos-after))   ;[4]

The start-colorer documentation states the invariant that:

Every position in the buffer must be accounted for in exactly one token, and every token must have a non-zero width.

This seems to add a fifth requirement:

(< new-token-start new-token-end) ;[5]

Suppose, regarding [1], that (< pos-before new-token-start) were true: the positions between pos-before and new-token-start would not be part of the current token. They must be part of some token ("every position in the buffer must be accounted for in exactly one token"), so they would have to be part of the token preceding the current token. But if those positions were part of the preceding token, the preceding call would have violated [4]. So, by contradiction, [1] really requires (= pos-before new-token-start). The same logic applies to [4], strengthening it to (= new-token-end pos-after). Given those equalities, [5] implies [2] and [3] by substitution. So I think the system reduces to:

(and (= pos-before new-token-start)    ;[1]
     (< new-token-start new-token-end) ;[5]
     (= new-token-end pos-after))      ;[4]
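
Folding that back into the shape of the original check (keeping the 'eof escape hatch as before), the whole condition would presumably become:

(or (equal? type 'eof)
    (and (= pos-before new-token-start)
         (< new-token-start new-token-end)
         (= new-token-end pos-after)))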

This logic is compelling to me. I also tried changing the contract and running the tests (and opened up a few big files in DrRacket to see them color properly) and didn't find any counterexamples, so I've pushed the change to master in the syntax color package. Since a release is imminent, we'll have a good amount of time to see if something pops up before the next release.