Syncheck:find-source-object

I like the idea of you keeping your own representation in the database (perhaps augmenting it if more information is needed) and then replaying the calls based on that. It is clear that there are more uses than I originally envisioned when I was making Check Syntax and now that we have more experience, I see that a different set of data to extract from a fully expanded program makes more sense. There's no doubt that I was very focused on "what am I going to show to the user" as information coming directly from a fully expanded syntax object. But you make a strong case for an intermediate layer of "what is the information in a fully expanded syntax object" being what we store in the database and then, separately, calculating multiple different things from it based on what an application wants to do.

I'm also happy if the results (by the time the information makes it to DrRacket) are different in the details, as long as they feel like improvements to someone who is editing code based on these annotations.

Here's some specific comments based on your message:

Agreed: that should be a bug in drracket if it mattered (and from what I know of the code, it is mostly just sticking stuff in tables so it might not even be an issue).

Do you think that those arrows would be useful to someone who is editing the program? If so, then how about we experiment with just keeping them? Since we're going to be guarding this all behind an opt-in preference setting for some time anyway, we can experiment and I'm up for it.

That's the easy case for me! I'd be happy to help by trying to narrow down things for test cases, too, when a difference comes up.

Chiming in on the side; as I've gotten a little involved with J. Ryan Stinnett's efforts to bring editor experiences closer together, I'm hopeful that improvements to DrRacket's Check Syntax also percolate to the LSP? I think there's already a strong connection there, but I could be wrong. If so, I also hope to find out about the opt-in and enable it so I can participate on the LSP/Vim side of the conversation :slight_smile:

Yes, they should make it through to the langserver / LSP. The langserver already makes use of DrRacket's Check Syntax data, so I would imagine either (a) DrRacket adds pdb pathways as discussed in this thread and the langserver then exposes those abilities or (b) the langserver pulls in pdb itself to expose them.

One way or another, we'll get them to happen via LSP. :slightly_smiling_face:

1 Like

To share some experience and get some feedback...

I started to work on a new function, send-to-syncheck-annotations-object that should pass a test like this:

  (define o (new build-trace% [src file.rkt]))
  (send-to-syncheck-annotations-object file.rkt o)
  (define (massage xs)
    (define ignored '(syncheck:add-id-set))
    (for/set ([x (in-list xs)]
              #:when (not (memq (vector-ref x 0) ignored)))
      (case (vector-ref x 0)
        [(syncheck:add-arrow/name-dup/pxpy) ;delete name-dup proc
         (apply vector (reverse (cdr (reverse (vector->list x)))))]
        [else
         x])))
  (define actual (massage (send o get-trace)))
  (define expected (massage (show-content file.rkt)))
  (check-equal? actual
                expected
                "send-to-syncheck-object is equivalent to show-content, modulo order")

Commentary:

  • I think syncheck:add-id-set can't be serialized and can't be supported, just like name-dup?. Does that seem correct and OK?

  • I haven't stored some things like pxpy offsets for arrows; adding those.

  • Other miscellaneous paper cuts to slog through; no big deal.

  • Zero-span items... oh $EXPLETIVE.

In my use of check-syntax all these years, I've intentionally ignored items where the beginning and end positions are equal (originating from zero syntax-span).

I've ignored them because, up in Emacs, I can't do anything with them. I can't create an overlay or put a text property using such coordinates. Effectively, in Emacs, they are invisible and can't exist. So I filter them out.

Also down in the Racket Mode back end (written in Racket), I often use data/interval-map for these things. It can't store an empty interval. So far in pdb I've still been using interval-map directly or as a building block.

So: To make the tests pass, I would need to create some other data structure that can handle the zero-span items. And I'm OK with doing that. However I want to double-check with you, first.

  1. There appear to be a boatload of these for things like #%app and #%datum. Not just arrows, but doc-menu items. So for many programs there will be data like

    '#(syncheck:add-docs-menu
       49
       49
       #%app
       "View documentation for “#%app” from racket/base, racket"
       #<path:/home/greg/src/racket-lang/racket/doc/reference/application.html>
       '(form ((lib "racket/private/base.rkt") #%app))
       "(form._((lib._racket/private/base..rkt)._~23~25app))")
    

    for... every single paren that's a function call. Which seems... very "heavy". Plus the annotations for #%datum. And maybe some others I haven't noticed yet. Disk space usage is likely to multiply. This seems very low ROI, but maybe that's just because I don't need them in Emacs.

  2. Using the DrRacket app, I can't manage to see how it actually uses these. For a non-zero-span item, I can get an arrow, and I can get a right-click menu on which I might find a docs item. But for these zero-span items, I can't. Am I doing something wrong?

So that's where I want your feedback. Do you even actually need/use these? I wanted to check before I go to the work to change most of my data structures, as well as multiply the disk space used.

p.s. Actually now I do see the usage in DrRacket -- both arrows and docs (in the upper right corner) for #%app, as well as for #%datum.

This is definitely used.

So I'll try supporting it.

If the disk space usage seems too big, I'll think about ways to represent it more compactly, as well as the gzip-ing that I'm already doing.

p.p.s. We had discussed this previously when check-syntax started doing this a couple years ago; I should have remembered sooner, and searched around for that earlier.

Thanks, @greghendershott. I do think that there are a lot of these zero-span thingies and it would be nice if things could be stored more compactly. (And I do remember a lot of pain when making DrRacket support them for precisely the reasons you're mentioning.)

re add-id-set I can't actually remember what that method does but the docs claim it isn't called anymore (and the information moved over to syncheck:add-arrow/name-dup/pxpy) and I don't see any calls to it in the code either, so I think it can be skipped.

I manage to get #%datum annotations from the LSP when on the open-quote of strings or the beginning of numbers, things like that.

I thought I got #%app from the open parens, but I can't manage to get this to work now.

What's stored in the db is a gzip of the read-able representation. It might turn out that's sufficient wrt disk space.

That still leaves the cost of marshaling all these as sexps or json through Racket Mode or LSP for clients like Emacs or vscode and so on.

For clients who want to use a pdb function to retrieve these (as opposed to via status quo syncheck:annotation object method calls) I might offer a couple options to reduce the marshaling overhead.

  1. Already there's the option to fetch annotations in slices, e.g. as needed for visible "page" sub-spans of the buffer, as well as hoovering all at once. Doesn't reduce the cost, but amortizes it.

  2. I could add a flag to filter the zero-span ones; in case the client won't even use them, no point marshaling them at all.

  3. I could add some sort of "interning". ("Here's the doc-menu data for a few items like #%app and #%data; please remember it because I'm only going to tell you the position for each one thereafter".).

Anyway need to fry some bigger fish first.

Good to know, thanks.

I realize this has turned into a fire hose thread; I'll try to keep this a brief update.

  • I'm handling the zero-width items (for now, simply by "sharding" zero- and non-zero-width items into separate interval maps, and retrieving from one or both as appropriate).

  • I'm able to synthesize nearly all syncheck method calls identical to show-content. The only divergence is add-arrow, as described up-thread (and add-jump-to-definition, because I synthesize it from arrows). This only arises with arrows related to stuff like rename-in or prefix-in, which I adjust for renaming purposes. For some programs it's 100% the same, for others less.

  • So I think I'm back on track, after assimilating these two new requirements.

  • Having said that... I have some doubts about using my own representation to synthesize the syncheck method calls. I hate having tests that can't pass for 100.0% of programs. And I don't want to be in the business of fixing bugs with where I don't mimic check-syntax exactly right. Part of me wants to try another approach:

    • Capture things in show-content style; almost literally that list of vectors. Store that data (plus some small extra side info I gather, but see below) on disk. As a result, it can be played back to clients who want the "classic" syncheck API (vs. using my new "give me just a slice of annotations; I'll ask as/when they would become visible" style API). TL;DR something only trying to be a "dumb" cache, will be a more reliable cache, than some other "clever" representation.

    • Anything I do making my own arrow structs, and adjusting some of them, for purposes of renaming? I can build that in memory, after loading from the above storage. It's fast enough. So that chunk of my stuff is just another user of the same cache. And I don't even need to use any extra disk space for renaming arrows hoohah.

I'm going to marinade myself in that for awhile, but likely to tackle that next.

For the syncheck:add-arrow -- it sounds like it is time to add a new method, like you suggest upthread, where the extra arrows that you want are just there but there is an argument somewhere saying "these are new arrows" (perhaps with a better name :slight_smile:) and then the default version of that method forwards on the arrows that aren't the new ones to the current add-arrow method? Then we can put the logic that calculates those arrows in the traversals code.

I think you told me before but I'm not able to dig up quite what the definition is. If you want to describe the general rules, I'll try to add it or if you know already what to do and want to add it, that's also fine with me (but not expected!). I do see syncheck:add-arrow/name-dup/pxpy upthread -- is that the whole deal? That looks not too difficult to pull into traversals.

Does it make sense for me to get started trying to implement a pdb mode in DrRacket, just so that I can understand what you're thinking about it bit better?

I'll need a little time to review and double-check. But quick answer not to leave you waiting: I think the spirit of the new add-arrow would be, minimally:

  • Drop name-dup? arg (not serializable).

  • Add a new arg, which supplies the result of calling identifier-binding on the syntax object for the "use" a.k.a. "to" end of the arrow.

    (That's my need today. Consider passing syntax objects for both arrow ends, so in the future people might be able to do other things?)

  • Add a backward compatibility pledge, i.e. unlike add-arrow/xxx this isn't merely user presentation, it's an analysis building block.

1 Like

I neglected to answer this. If you don't mind alpha quality software, I think you could try.

  • Clone the pdb repo and do make install.

  • The two functions you'd want to dynamic-require from pdb (falling back to status quo if that fails) would be analyze-path and send-to-syncheck-annotations-object. No docs yet but they have comments and contracts in the source.

  • If that goes smooth-ish, a third you could try would be rename-sites.

I have become lost in this discussion.

What is pdb?

When I look it up I find
protein data base
protocol debugger ( https://www.blackhat.com/presentations/bh-usa-06/BH-US-06-Rauch.pdf )
and a friend guessed python debugger.

All of these might involvee links and arrows....

-- hendrik

It's a project I've been working on for awhile, which seems like it might be getting to the point of being actually usable: pdb.

This is WIP exploring the idea of storing, for multiple source files, the result of running drracket/check-syntax, plus some more analysis.

The main motivation is to support multi-file flavors of things like "find references" and "rename".

The intent is this could enhance Racket Mode, as well as Dr Racket and other tools.

...

It's a project I've been working on for awhile, which seems like it might be getting to the point of being actually usable: pdb.

It looks useful.

@robby @jryans I've started to peek a little at things like drracket/online-comp.rkt and racket-langserver/check-syntax.rkt.

One question I have is about expansion.

Currently pdb assumes that you give it a path (and optionally string for the source, for an un-saved buffer). It handles all reading and expansion, before giving to check-syntax proper. Why? There is useful information to be captured during expansion:

  1. Things like typed-racket can call error-display-handler multiple times to provide multiple error messages with srclocs (as well as that from the final exn:fail?).

  2. Things like (again) typed-racket use a logger-based protocol to add more mouse-overs (saying e.g. that this span is of this type).

So pdb captures both during expansion, and saves them as a list of zero or more error messages, as well as additional mouse-overs, and you get those back for newly-analyzed file. And if you open a previously analyzed file (from "cache"), they are also available.

So... for you to use this we'd need to figure this out. Do you really need to fully expand before calling pdb?

  • Maybe you don't; some reasons why you use fully expanded syntax, might already be covered by pdb (example: digging through it looking for completion candidates), so you could just get that from pdb, too. No need to fully expand anymore.

  • Or if you really do, then probably analyze-path needs some optional arguments, where you can supply the fully-expanded syntax, as well as the list of errors messages and additional mouse-overs that you presumably gathered during expansion, so pdb can store those in the db for later use, too.

Although there might be more integration speed bumps, this is the main one I'm aware of, so far.

Welcome your thoughts (not urgent, as/when you have time).

My take on this is that we want to centralize the code that takes a program and calculates the stuff that goes into the DB. Just as you say, Typed Racket (and maybe other languages) do stuff during expansion and so we want all the tools to receive that information in the same way. I believe that what DrRacket's doing is the right thing to do but it is great if that code is just called by all the tools. Probably best if we make an interface that supplies a port (so that I can offer an unsaved DrRacket buffer to the expansion process and maybe other things will boil down to it) but that seems like it is already in your thinking.

It may also make sense for pdb to offer the fully expanded object back (after it does the expansion and captures all the things) if someone needs that.

One possible complication: DrRacket wants the expansion process to be kill safe (it just kills the expansion process and expects the resources to be relaimed when it does that). Probably we can make that work, however, as DrRacket will be happy to trust pdb (it won't trust the code that gets called during expansion in the macros which is why it needs to be killable).

[edit: I thought of a reason why DrRacket uses the fully expanded object -- it reuses the work of expansion (sometimes) and doesn't reexpand ]

I'm open to improving or replacing the code I have for this, with what you've done for DrRacket, especially if you think mine is incomplete or wrong.

Currently it takes a string (what I can marshal from Emacs to Racket). I could have it also take a port if that fits better for DrRacket.

One detail: Currently it does need some pathname (even for an unsaved buffer) as a key for the in-memory cache and on-disk sqlite db.

I need to re-read the kill-safe paper to be sure I know exactly what that means.

I have written it to be "break-safe". If you call analyze-path from one thread, but before it completes, call it again for the same path from another thread, it will break the first one; details.

The gotcha is that pdb can give this back for a fresh analysis, or for one not yet flushed from the in-memory cache, but it can't for one that had to be loaded from the sqlite db.

Hmm OK. So, maybe the result of whatever it does could be saved to the db, instead, somehow?

Or I could serialize the expansion. [But I've been meaning to ask you about GitHub - rfindler/fully-expanded-store. From what I can understand (likely wrong), this is really a store of compiled code. I don't immediately see how to get a syntax object back, which would be usable in the same way as one directly from expand. Even if the syntax datum structure is recoverable, don't you often need the lexical context and the original namespace in which expansion took place? AFAICT it's not possible to serialize a namespace.]

But in any case we can find a good approach, somehow.

re kill-safety: it looks like you have the same motivation in that break-handling code. The difference is that we'd need to be able to call custodian-shutdown-all on some custodian that's has control of the thread that's doing the expansion (and it should be set up with memory limits too). The idea is both, as you write in those comments, to give up on obsolete ones, but also to prevent the program that's being expanded from using up all memory or doing other things. In DrRacket, the network is turned off and you can't write to the filesystem (except that cm has a carve out for that, to be able to save .zo files). Stuff like that.

I believe this usecase isn't as complicated as some of the stuff in the kill safety paper but the basic idea is that you use the CML-style message-passing APIs to communicate between the thread that's doing the expansion and the thread that's storing stuff in the table. The thread that's storing stuff in the table can wait on either getting a new piece of information or wait on the thread being killed (in which case it just cleans up and discards the partial information). That's not all the deatils :slight_smile: but hopefully enough to give a sense of what's needed?


re ports vs strings: in DrRacket I can turn an editor into a port that reads directly out of the editor, but now, remembering a bit more, I realize that I don't actually do that; instead I build a string and hand off the string to another place, as the programmer might edit the program while that first "read" step before expansion might happen, which would be bad. So a string is perfect.

Although, there is also the wrinkle that the file might have embedded images (more accurately it might be in wxme format) but since there is an eventual port that has to be handed to read-sytnax somewhere, probably this will just amount to building the port using the wxme library instead of directly from the string and be a straightforward change.


re holding onto fully expanded objects: The goal was to avoid having the programmer who is editing their program avoid having to wait through two expansions of the same code. So, if we're reading the data from the database and we didn't actually do a first expansion so no problem if we don't have the object saved. I guess one way to do this would be to have an API call to pdb that supplied as input the program to expand (via a port or string or filename etc) and got back the expansion results and, if the program actually was expanded, to get back the fully expanded object. So, if it isn't there to be gotten, then, np, just don't give it back. Does something like that sound workable?

Well this is the problem with setting aside something for 2 years. It nagged at me overnight and finally I remembered I'd already asked about this.

You read and also eval the compiled bytes, to get the syntax object. But... it looks like syntax properties aren't serialized; that would be a to-do.