The docs for syncheck:find-source-object imply you can return any non-false value:
(send a-syncheck-annotations ^syncheck:find-source-object stx)
â (or/c #f (not/c #f))
stx : syntax?
This should return #f if the source of this syntax object is uninteresting for
annotations (if, for example, the only interesting annotations are those in the
original file and this is a syntax object introduced by a macro and thus has a
source location from some other file).
Otherwise, it should return some (non-#f) value that will then be passed to one of
the other methods below as a source-obj argument.
However check-syntax seems to use this value with the undocumented assumption that it will be the same value per source file.
Background:
In Racket Mode's use of check-syntax, where src is the path of the file being analyzed, I do this:
In other words, I have it return the syntax object.
I do this because in one other method, syncheck:add-arrow/name-dup/pxpy, I want syntax objects passed as the from and to arguments, to do some extra analysis.
Unfortunately when I do that, I get very weird mouse-overs for the a-struct on line 2 in this little example program:
In Dr Racket and classic Racket Mode, it says "3 bound occurrences" for that span. Good.
However with pdb's find-source-object returning a syntax object, there are two mouse-overs for that span, "6 bound occurrences" and "2 bound occurrences". (Not only don't they agree, neither is the right answer.)
What to do?
I started to review the use of find-source-editor (which ultimately calls syncheck:find-source-object) within traversals.rkt. I think what's going on is the value is used as a key in one or more hash-tables, for "connections"? Perhaps the fix is instead to record the syntax-source as the key, independent of what find-source-editor says?? I'm not sure; do you know, @robby?
[I guess another approach would be: 1. Change the documentation of `find-source-object` to clarify that you can't return just _any_ non-false value -- it needs to be the _same value per analyzed source file_, or various things might break. And also, 2. I would need some new `syncheck:add-arrow` flavor that also passes the full syntax objects. Although this is probably less tricky, e.g. this is something I could probably do as a pull request, it seems ad hoc and not "the right way"?]
Yes, you're absolutely right that it needs to be the same value (in the sense of equal?) for the same source file and that's something that's relied on. So the approach in your followup message is totally fair game but I'm not sure it is guaranteed to work as you might get the "wrong" equal object back sometimes. That is, these source objects are being used as keys in a hash and for two that are equal you might get back the first one today but the second one tomorrow.
My knee-jerk reaction to the larger issue that leads you to want to do this is: can we push that analysis into check syntax itself and then send back the results so all clients can benefit from this analysis? Does that make sense for what you're doing?
So first, I should say: The find-source-editor and find-source-object names remind me that once upon a time you extracted all this functionality from Dr Racket itself, to make available in a library. Of course some assumptions rode along; some hard to document because they felt so obvious. Been there myself, many times. The bigger point is, you went to all that trouble, and it's turned out so successfully. As I can never say enough times: Thanks!!
My knee-jerk reaction to the larger issue that leads you to want to do this is: can we push that analysis into check syntax itself and then send back the results so all clients can benefit from this analysis? Does that make sense for what you're doing?
My immediate need is that in syncheck:add-arrow, for the to-object (the "use" end of the arrow) I want the syntax object so I can give it to identifier-binding. When it's an import, I want to record both the "from" and "nominal-from" fields. The former is I guess redundant with what syncheck:add-jump-to-definition already supplies. But the latter, the nominal import info, I also need in order to later be able to discover a graph of "rename sites" for multi-file rename. And so in pdb, today, I just record both the from and nominal-from fields at once, for later use (and ignore syncheck:add-jump-to-definition, just for consistency in how I record things).
But in the spirit of your reaction:
I guess maybe a new syncheck:add-jump-to-nominal-import method would probably address my immediate need??
Or a new syncheck:add-arrow/name-dup/pxpy/identifier-binding (how many hyphens can we add ) flavor with two new parameters, for the result of calling identifier-binding on each end of the arrow?? Tho this feels like why not just pass the syntax objects...
Even if we do that, I guess I still feel it would be a useful "escape hatch" for every syncheck method to get (if the user desires) the full syntax objects. That way people can explore new ideas, today, some of which might end up as official new syncheck methods later. (Or not; in this specific case it's debatable whether I'm doing more analysis per se, or just querying more information from the results of the analysis check-syntax is already doing. The part where I store some info in a "database", for later queries, I think is useful, but probably orthogonal from drracket/check-syntax?)
So I still think it would definitely be cool if find-source-object could live up to what the docs seem to promise. But I'd understand if you think that's impractical, or undesirable, or both.
Overall, I'm happy to make any of these changes and would be in favor of the one that you think works the best for the purpose of this database.
I have a question about the database, tho: is it recording the raw information that comes of check syntax apis current, or is it computing something from that and then recording only the information needed to do the rename-across-files refactoring? If the latter, do you think it makes sense to refactor somewhat so that we can cache all check syntax-related information and the rename-refactoring can consult the data base, but also Emacs mode and DrRacket could consult the database when opening a file (potentially recomputing its content if necessary)?
Anyway, for the immediate use: do you want it only for imported identifiers? If so, do you think it makes sense to add a new method called something syncheck:imported-identifier and that it would include the information from identifier-binding? The the rename refactoring can just look at that one only (well, I guess I should be asking! Is that all it needs?). Maybe this new piece of information could subsume add-jump-to-definition; it could just be "I found a reference to an imported identifier -- here's what I know about it" and the jump-to-definition functionality could just use a portion of that information.
My thought, earlier, when designing this part of the API was to remove syntax objects from it so that all the values are marshallable. And DrRacket runs the check syntax analysis in a separate place and sends the values back over a place channel and I am not sure that syntax objects can make that transition (but maybe they can?). Probably, if we were to store all this stuff in a database, we'd to not include syntax objects in there. More generally, I just think of syntax objects as having a lot of information and complexity in them that we need to compute the information that the IDE wants to see but once it is computed, we'd rather discard the syntax objects. That is, I see the layer in check syntax as the place to put those computations of "interesting information" and then hand out something that's somewhat durable (ie values that can be read and written and, as long as the source doesn't change, will always make some sense). Not sure if this is totally well-founded thinking, tho! Happy to consider alternatives.
For most things like mouse-overs it is somewhat "raw".
For arrows, there are various structs for various kinds of arrows. These are all serializable, no syntax objects. Just info e.g. that identifier-binding can return, about a syntax object.
The data-types.rkt has the whole overview. Any kind of arrow uses this base struct:
;; An arrow always has both ends in the same file. (Arrows for
;; imported definitions point to e.g. the `require` or module language
;; syntax in the importing file. For such arrows, the derived struct
;; `import-arrow`, below, also says where to look outside the file.)
(struct arrow
(phase
use-beg
use-end
def-beg
def-end)
#:prefab)
And the specific example we were discussing ends up as an import-arrow:
;; identifier-binding uniquely refers to a non-lexical binding via a
;; tuple of <path mod phase symbol>. Often we need all but the path
;; since we're working on things per-file. A struct for that:
(struct ibk (mods phase sym) #:prefab)
...
;; For syncheck:add-arrow require or module-lang arrows. `from` and
;; `nom` correspond to identifier-binding-resolved fields.
(struct import-arrow arrow
(sym
from ;(cons path? ibk?) used to look up in file's `defs` hash-table
nom) ;(cons path? ibk?) used to look up in file's `exports hash-table
#:prefab)
So here you see the business using identifier-binding on the "use" syntax. (I just have a helper function that resolves module path indexes to actual paths.)
So it's "not raw" in the sense that in some cases I take a different view than syncheck:add-arrow about what the arrows mean, and where they should point. I also do an extra analysis pass, similar to the flow of traversals.rkt, where I dig for extra things to create import-rename-arrow and export-rename-arrow (which I need for smart renaming, as those terminate name chains).
Having said all that, by volume, probably 90% of the arrows are just straight up following what syncheck:add-arrow says. I'm just handling some edge cases differently.
So far I've just been slogging away, trying to discover what is necessary to support multi-file rename, as well as "just databae-izing the check-syntax results".
I also have a test branch for Racket Mode, which uses this pdb package to do queries. I've just recently gotten that to be extremely query-driven. So e.g. in a huge file like class-internal.rkt, it's querying only for the visible window in the editor -- what's the mouse-over at point, is there a use or def at point and if so what's related, etc. So although check-syntax takes 10 seconds to analyze that file, at no point is Racket Mode "stuck" trying to stuff the results into text properties for the entire big buffer. That's a win vs. the status quo, as well as the multi-file rename.
But it's all still pretty new, and a bit janky. As I feel more confident from dog-fooding the prototype, I definitely want to take stock:
Could this be useful in Dr Racket, in vim, in VSCode? For command-line tools? And probably the answer is, yes, it could be, but I'll have made my own assumptions that aren't quite right, for those. And we can sort that out.
Some of the stuff I do in analyze-more.rkt: It's a quick extra pass, but it's extra. And maybe some of that should live in traversals.rkt. Again, it's only recently I have much confidence to advocate that, but we're getting there.
This all looks really great to me! I would love to fold the improvements to the data back into traversals.rkt (or at least just into the side of the world that computes the information to show that's shared by all the IDEs). It would be wonderful if we arranged things so that all the parts that conceptually can be used in common for all tools that edit Racket code are in one part of the code and then VS Code, Racket Mode, and DrRacket all have the same information (but perhaps shown differently based on the customs of the users of the tools).
As a (recently added) maintainer of Magic Racket, I'd like to build on the work happening here to eventually add multi-file queries and refactorings to VS Code workflows as well via Magic Racket and/or the langserver where appropriate.
I haven't yet investigated the details here enough to offer precise suggestions on API shape yet... Mostly I just wanted to make sure it's known that several editor workflows are interested in this work, and it would be great to keep information and operations a bit more "general" so they can hopefully be adapted to many different tools and workflows.
One issue here is that some tools might prefer or need line:column
coordinates instead of positions. [Effectively drracket/check-syntax
and our own analysis use syntax-position and syntax-span, ignoring syntax-line and syntax-column.] Either we could try to store
line:col-denominated spans, also, in the db when we analyze (at some
cost in space). Or we could just synthesize these as/when needed by
such an API, by running through the file using port-count-lines! (at
some cost in time).
I think there might be some tension between what syncheck:add-arrow says today, and what I need it to say to build a "rename site graph". I don't mean just getting identifier-binding info. I mean sometimes drawing more or different arrows. Especially for things like rename-in, prefix-in, and so on.
One example is for (prefix-in pre: mod), syncheck:add-arrow draws an arrow between a use of pre: and mod, but I want it to connect both pre: sites. So when I make my own arrow structs, I fix this up.
I also do an extra analysis pass to scrounge for some more such things, to add arrow structs that syncheck:add-arrow wouldn't.
If we change what syncheck:add-arrow does, then there are backward compatibility concerns.
Maybe it would work to have a couple "levels" to this. Like there could be a database that really is hardly more than a cache/capture of "raw" check-syntax, plus errors and online-check-syntax logger stuff generated during expand. No adjustments or additions. And if that's useful to some tools, they can use that. One such tool being the one that builds the following.
And then another level with a different view of things, to support e.g. multi-file renames. And some tools can use that, instead of, or in addition to, the lower-level db.
VS Code APIs use line and column coordinates, but I think the mostly likely way of integrating your work into a VS Code workflow would be via the langserver. The langserver is already regularly transforming back and forth between Racket GUI framework positions (these are item positions, which can each be multiple bytes long, whereas syntax positions can be either bytes or characters).
With that in mind, I think it's fine to do whatever seems natural (e.g. perhaps only store syntax-position and syntax-span) at the layer you're working on. The langserver could then do whatever position conversion is needed.
I don't feel particularly strongly about these arrows, although they aren't the ones I would have drawn just based on my own understanding of what's useful. So I guess that my position is that if you wanted to change the set of arrows because they're useful to you, then I guess they will be useful to others and so let's include them in DrRacket too. I mean, it wouldn't be the first time I was wrong about what was useful, not by a long shot!
As for backwards compatibility, I think that it is probably okay to be flexible about exactly which arrows show up (and I can update the test suites based on what we decide). The main backwards compatibility concern I would have is where some method changes arity or changes like that which would make old clients just crash.
As for adding multiple layers that are targeted to different kinds of tools, I think that could work great too! But I wouldn't want to accumulate that technical debt for arrows that we expect to show up in Emacs mode but not in DrRacket. Instead, I'd prefer to just say "If you feel strongly it should be there, let's put them into DrRacket too.".
My use of syncheck:add-arrow is not fully as-intended because I'm not using it to draw arrows for end users. I can't draw graphical arrows; not even in GUI Emacs. Seriously, it's not that I want the end user to see different arrows in Emacs, DrR, vscode, vim, etc. Instead it's that I need an analysis building block.
Speaking of backward compatibility, if syncheck:add-arrow changed someday, wrt what it connects... that could actually break some of my code. Hmm.
So ...
What if we add a new syncheck method named (the hardest part) add-binder (add-connection?). Probably the default just calls add-arrow. But a derived class can define/override it, to use it as a non-GUI building block for other analysis, which is my actual use.
add-arrow is free to change, no backward compatibility concerns wrt how many arrows and what they connect; it's purely an end user presentation concept. Free to show them more/fewer/other arrows someday, as desired, if people think that's a better GUI.
But the new add-binder method ought to remain backward compatible; it shouldn't suddenly start getting called someday with different values for the same program under analysis (except maybe to fix a bug or omission, where not changing it is obviously worse).
TL;DR: It's not a whole other "layer" or "level". Instead it's one new method, and a clarification about which method is supposed to remain backward compatible, and which it doesn't really matter.
I can see one use-case of the database for DrRacket, namely looking in it to find the information it needs to do the things it currently does (instead of just recomputing things all the time). For that to work, however, basically everything that goes into the annotations object needs to be recorded. (Indeed, I could imagine DrRacket both looking in there and also populating the database as it decides it needs info about some specific file). Is that something that fits in with your planning? (If not, no worries.)
Faithfully capturing all the syncheck method calls is not something I've tried to do so far. I'm open to doing that.
I know at least one thing that's not obviously serializable[1]. But if you know how to fix such cases, or can live without them, then it should be straightforward.
That seems doable. I think probably simpler and more reliable if DrRacket just asks pdb (if installed) for info about the file -- relying on pdb populating the db after any new analysis. (As opposed to DrRacket trying to give pdb all the exact info from running its own analysis.)
When pdb is available you could also make use of the multi-file rename-sites feature. (It returns a hash-table of paths and sites within each path. You need to do the actual changes -- although I guess a convenience wrapper to do that, too, could be added; in Emacs it's easy to do it there.)
When pdb is available DrRacket's "Open Defining File" could use pdb's find-definition, which returns a path plus a span within. If it's another file, the analysis was done or retrieved to find the answer, so that's available quickly when you open a DrRacket window or tab to show that file.
I'm not sure how to serialize add-arrow's name-dup -- it's a procedure. Sometimes a work-around is to serialize the argument values, to call the procedure later. But name-dup seems to wrap identifier-binding, which IIUC implies serializing a namespace, which AFAIK isn't possible? âŠī¸
re drracket using pdb: agree completely. Ideal for me would be if DrRacket always goes through the pdb interface (which then uses traversals.rkt and friends as needed), just as you say. I guess we have some specifics to iron out. The one that comes to min is that DrRacket will analyze a file that's not saved from time to time, as it reanalyzes files while they are being edited -- so it would be nice to be able to say "here's some text to analyze--get the requires from pdb even if this isnt' a file saved anywhere" but I guess the same thing comes up for Emacs so we can probably handle that the same way(?).
re pdb being installed: I'm imagining adding pdb as a dependency of DrRacket; this would probably require having some minimal amount of dependencies of pdb itself but if that ends up being problematic, I don't mind doing something more complex inside DrRacket (like dynamically installing it or something, as needed).
re name-dup: totally agree. I'm okay with just dropping support for that. Currently it gets used as a guard on a rename to say "watch out! that name was already used in this scope and you might be conflicting with it" but it has a lot of problems precisely because of the lack of serialization already.
re: multi-file rename in drracket: woot! In drracket, doing the rename in other files would work best using racket's filesystem functionality -- perhaps with some kind of a notification in case one of the files is open in drracket. There are notification facilities in the filesystem layer in racket, tho, so the notifications wouldn't necc. have to come directly from pdb.
Maybe I'm being overly cautious but at early this stage I'd be more comfortable if it were optional -- where DrRacket would keep as a fallback the code where it does expand and calls the two make-traversal functions itself.
So if there's some sqlite problem, or some users don't want to spend disk space, or whatever unforeseen bug or scenario, then DrRacket doesn't have to regress vs. today.
[For Racket Mode I'm planning to keep the status quo as a fallback. Well, partly for caution. Also because because pdb will require bleeding edge versions of drracket-tool-text-lib and Racket itself (whereas Racket Mode generally supports older verisons).]
I could at least arrange it such that pdb can take an instance of the exact same annotations-mixin class object, and its methods don't need to know or care if they're being called from traversals.rkt or pdb. So that there's no divergence there?
Yes, that was just pseudo-code. The actual function currently takes both a complete-path? and an optional string? for the code, for exactly this scenario.
Great, that's easy then.
Yes. I think for tools generally it's best just to give them hash-table of the files and sites within each, and let each tool actually make the change however it thinks best. So that's why I went with that approach.
[My only question is whether it might need a for-each variant. One thing I've tried is to analyze all ~8,000 files in main-distribution (which takes a few hours, and uses about 100 MB disk space). Someday -- I haven't tried this torture test yet -- I want to ask rename-sites about define in racket/private/pre-base.rkt; that is likely to be a big hash-table of files and sites that would need to be renamed. ]
That sounds very wise. At the drracket level, we can have a preference that default to off for some number of years until we're comfortable that things are rock solid and then change the default. Maybe that could also work for Emacs if we share the same preference and just have different ways to set it.
100 MB sounds totally plausible (looks like 8.8 unpacked on my machine is 800 MB) but I guess that we can improve the few hours number. I don't think anyone has spent much time with the profiler and traversals.rkt and there might be some low-hanging fruit there.
Also, if we did want to improve the 100 MB, I guess there is a lot of redundant information coming out of traversals. It may make sense to actually design one of those intermediate layers where we have just one line for a reference to an imported identifier that many different pieces of information could be calculated from. (But maybe the 100MB isn't with all of the traversals information?)
Re pdb being able to call a DrRacket syncheck object's methods as if status quo check-syntax were doing so:
The simplest and most reliable thing would be just to capture and serialize all those calls -- similar to the output of the simplified show-content -- and later "play them back" for DrRacket.
However, speaking of (say) 100 MB, this would add significantly more on top of what I'm already capturing in a slightly different style. So...
The more space-efficient way would be to keep my current representation -- just adding any few bits missing info I don't happen to store, yet -- and use that single rep to support both my desired API as well as the "play back for DrRacket" style.
This is slightly trickier, and unlikely to be 100.0% exactly like how DrRacket's syncheck object would get called.
For one thing, the order of calls is certain to be different. For example, I'll probably do (say) all the add-arrow method calls in position order, then (say) all the open-require-menu in position order, and so on. I'm guessing that's OK, and if it mattered to DrRacket, you'd say that's a bug in DrRacket you'd want to fix... but maybe not? Any thoughts here?
There is the business of me storing slightly different arrows than what syncheck:add-arrow says. I can easily filter out my extra arrows (e.g. for import/export rename clauses). There might be one example where I change an arrow to point to the new name in a rename clause, instead of to the modname. But either you'll say that's OK, or, I can store enough extra info to do a status quo add-arrow for you.
I think any other difference would be a problem on my end, that I ought to fix, and could easily fix by storing some info I'm not bothering to, today. So just slog through one by one, find, fix. I can make a unit test to compare say show-content-style with my own, work until it passes.
p.s. By "my desired API" I mean: Whether freshly analyzed or from cache, support an access pattern where an app can ask for all or some analysis results for a sub-span ("paged"). So for a large-ish file like class-internal.rkt, there's no need to marshal results for the entire file. Instead a client could ask for just what's visible to the user; do it in "pages".
If you think that's worthwhile for DrRacket, we could orient it that way? In that case maybe the "playback the syncheck methods API" isn't worth doing, after all? I'm happy to do it just wanted to check.
p.p.s. A message or two above you mentioned optimizing traversal.rkt to be faster. So e.g. something like class-internal.rkt today takes roughly 10 seconds for check-syntax to analyze.
When those new results are finally available, being able to access them in "page" subsets is still a benefit, which is why I'm doing that. But it's not ideal: it's 10 seconds after an edit until anything new is available; meanwhile you can only get the old results. And @samth has a huge linklet file example that's closer to 60 seconds. For these, it would be awesome if check-syntax could work in more of a "streaming" style, delivering results as they become available.
I don't know if that's even really possible; maybe you need to build some hash-tables for the entire file, before any results of a certain kind can be returned? But I wanted to mention it.