I'm trying to implement a color lexer for the syntax of .gitignore files. For reasons I'll explain, I think I need non-zero backup distances and dont-stop, which have always confused me, and I want to check my understanding.
- The docs say:

  "A dont-stop result is useful, for example, when a lexer has to read ahead in input-port to decide on the tokens at this point; that read-ahead will be inconsistent if an edit happens, so a dont-stop structure ensures that no changes to the buffer happen between calls."

  Do I understand correctly that, if the token at some point depends on peeking beyond the end of the token that's ultimately returned, a dont-stop is not only "useful", but necessary?
- Does the rule that:

  "A change to the stream must not change the tokenization of the stream prior to the token immediately preceding the change plus the backup distance."

  mean that, if more than one token of look-ahead is required, both dont-stop and a non-zero backup distance are required?
- Is it reasonable to simplify the implementation by using a conservative approximation of backup distances and dont-stop, rather than the precise minimum?

Some specifics might be helpful. The .gitignore syntax is defined in terms of the POSIX function fnmatch and globbing. I want to implement the syntax and semantics of standard .gitignore, so I need to handle ranges in square brackets: the relevant parts are almost the same as for regexp, but also support the ‹posix› character classes like pregexp. (They're not a subset of pregexp, either, though.) The syntax has some confusing subtleties. For example:
- [[:alpha:]]b matches ab, but [[:alpha]]b matches :]b
- [a-y]z matches xz, but [a-]z matches -z

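To make the look-ahead in those examples concrete, here is a rough Python sketch (not Racket, and not the color-lexer API) that classifies the pieces of a single bracket expression. The function name `lex_bracket`, the token kinds, and the simplifications (no backslash escapes, no collating symbols or equivalence classes) are all assumptions for illustration, not taken from Git or fnmatch.

```python
# Hypothetical helper: names and token kinds are illustrative only.
POSIX_CLASSES = {"alnum", "alpha", "blank", "cntrl", "digit", "graph",
                 "lower", "print", "punct", "space", "upper", "xdigit"}


def lex_bracket(pattern, start=0):
    """Split the bracket expression at pattern[start] into (text, kind) pairs.

    Returns None if the bracket expression is never closed.
    Simplified: backslash escapes are not handled.
    """
    assert pattern[start] == "["
    tokens = [("[", "open")]
    i = start + 1
    if i < len(pattern) and pattern[i] in "!^":
        tokens.append((pattern[i], "negate"))
        i += 1
    first = True             # a ']' in first position is a literal
    can_start_range = False  # True right after a literal that may begin a range
    while i < len(pattern):
        ch = pattern[i]
        if ch == "]" and not first:
            tokens.append(("]", "close"))
            return tokens
        first = False
        if pattern.startswith("[:", i):
            # Unbounded look-ahead: only a class if a matching ":]" follows.
            end = pattern.find(":]", i + 2)
            if end != -1 and pattern[i + 2:end] in POSIX_CLASSES:
                tokens.append((pattern[i:end + 2], "posix-class"))
                i = end + 2
                can_start_range = False
                continue
        # One character of look-ahead: '-' is a range operator only when
        # it follows a range start and is not directly before the closer.
        if (ch == "-" and can_start_range
                and i + 1 < len(pattern) and pattern[i + 1] != "]"):
            tokens.append(("-", "range-op"))
            can_start_range = False
        else:
            tokens.append((ch, "literal"))
            can_start_range = tokens[-2][1] != "range-op"
        i += 1
    return None
```

For instance, `lex_bracket("[a-]z")` reports the `-` as a literal, while `lex_bracket("[a-y]z")` reports it as a range operator; deciding between the two requires looking at the character after the `-`, which is past the end of the token being returned.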
Because I find this deeply confusing, I want to color these cases differently. I need at least one character of look-ahead to distinguish - as range syntax vs. a literal character, and a bunch of look-ahead for the POSIX character classes. But all of these cases can only occur in the mode for lexing the tail of a bracketed form, so I'm hoping I can just track the backup distance to the beginning of that mode and use dont-stop until the end of it, rather than tracking specific backup distances within the mode.
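As a way of pinning down what I mean by the conservative approximation, here is a small Python model (again, not the Racket API). The `Tok` record, the `annotate_conservatively` name, and the 0-based offsets are assumptions for illustration; whether the framework actually accepts this scheme is exactly what I'm asking.

```python
from typing import NamedTuple


class Tok(NamedTuple):
    """Hypothetical model of one lexer result inside bracket mode."""
    text: str
    start: int       # 0-based offset where the token begins
    end: int         # offset just past the token
    backup: int      # characters to back up from `start` before re-lexing
    dont_stop: bool  # True: the colorer should not pause after this token


def annotate_conservatively(texts, bracket_start):
    """Give every token a backup distance reaching the opening '['.

    `texts` are the token strings of one bracket expression, in order,
    and `bracket_start` is the offset of its opening '['.
    """
    toks = []
    pos = bracket_start
    for n, text in enumerate(texts):
        is_last = n == len(texts) - 1
        toks.append(Tok(text=text,
                        start=pos,
                        end=pos + len(text),
                        backup=pos - bracket_start,  # back to the '['
                        dont_stop=not is_last))      # pause only after the closer
        pos += len(text)
    return toks
```

With `annotate_conservatively(["[", "a", "-", "y", "]"], 10)`, the backup distances are 0, 1, 2, 3, 4 and only the closing `]` has `dont_stop` false, so any edit inside the brackets would force re-lexing from the `[`.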
ETA, for posterity:
Despite what gitignore(5) currently says, Git has not used fnmatch at all since v2.0.0 (May 2014) and not by default since v1.8.4 (August 2013): see upstream commit 70a8fc9. Instead, they use an implementation called wildmatch inherited from rsync, which is under a variant of the BSD license (I haven't figured out the specific SPDX identifier, but see libgit2/libgit2 pull request #5110, "Replace fnmatch with wildmatch" by pks-t, for details). Its semantics are similar to, but different from, those of fnmatch, and generally simplified (e.g. no LC_CTYPE sensitivity, no POSIX collating symbols (i.e. [.name.]) or equivalence class expressions (i.e. [=name=])). On the one hand, the semantics are basically implementation-defined; on the other hand, at least they are defined by one implementation.