RFC: hash table pattern matching

#:final seems like a fine keyword. But what should be the counterpart keyword for partial matching? The current design follows @ryanc’s advice that there should be explicit keywords for full and partial matching. #:full and #:partial are nice in that they pair up nicely.

Re backward compatibility, to me this is the issue about consistency. If we don’t care about consistency, one possible design would be:

(hash [expr pat def] ... tail)

where:

def ::= 
      > #:default expr

tail ::=
       > #:rest pat
       > #:full
       > #:partial

By default, the hash pattern would do partial matching unless #:full is given. This one pattern suffices for all functionalities that we are discussing and has many attractive features. It has partial matching by default, so you can do JSON matching easily. It has grouping, so you can easily write a match expander as @shhyou and @jbclements prefer.

But a lot of people here do care about consistency. They want no grouping because that’s consistent with the syntax of the hash constructor. They want full matching by default because that’s what almost every other pattern does.

The compromise that I think makes sense is that there are two patterns: hash and hash*. hash would be consistent with the hash constructor: no grouping and requiring all key-values. hash* would do a different thing that people find convenient: grouping and allowing key-values to be omitted. hash and hash* are also consistent with the patterns struct and struct*, which exactly have the same characteristics.

That compromise doesn't make sense. Consistent behavior matters more than "making all possible behaviors convenient." I'd rather set the consistent behavior as partial matching, or allowing extra keys, for both patterns by default. Behavior different from that, such as full or "final" matching, should be expressed with something that makes that difference clear, such as #:final or #:rest (empty-hash), not simply the presence or absence of a * character in the name.

Well, I could similarly argue that perhaps struct* shouldn’t have allowed you to leave out fields, in order to have consistent behavior with respect to struct. But I’m not sure if most people will like that. Names can change behavior, in a good way, and there is a precedent for that. Again, hash to hash* is like struct to struct*. It’s as consistent as it can get.

And I think it would be weird if the hash{,*} family is the only pattern family that does not fully match the matching value. Why does it get a special treatment? The motivation of having both hash and hash* is that one reflects the constructor and one doesn’t need to. In the constructor, you fully list key-values. So why shouldn’t the hash pattern fully match by default?

The difference between struct and struct* is not primarily between "full" vs "partial" matching: it's between by-position and by-key matching. The by-position matching is full and the by-key matching is partial, but that's because by-position needs "full" to catch errors more quickly, while by-key can afford partial because keys make that kind of error-catching mostly unnecessary.

But hash and hash* would both be by key, so they can both be "partial".

1 Like

I acknowledge that your interpretation of struct and struct* is valid. But I believe my interpretation is similarly valid, too.

From what I gather, there are many people who want full matching by default: @joeld, @shhyou, @benknoble, and @samth. There are also many people who want partial matching by default: @notjack, @AlexKnauth, and @rocketnia. I don’t know if anybody’s position has changed during the course of discussion. I would love it if they can weigh in.

One way to resolve the conflict is to say that both options are equally valid, and it’s just a matter of personal preference. In that case, we can also just make a poll and let people vote.

Full matching by default for hash is still more attractive to me, because hash* already exists for people who want partial matching by default. So the different default is convenient for people who need different behavior.

Hm. At some point during the Rhombus discussions on its Map patterns, I remember my position changing due to convincing arguments about forward-compatibility and datatype extensibility (probably from @notjack, though I'm unsure because apparently they said they couldn't make it to that meeting?), with JSON being one example of that.

If arguments like those convinced me during Rhombus discussions, other people can be convinced as well here in Racket discussions.

FWIW I think the (hash [expr pat def] ... tail) design given in reply 41 above is perfectly fine (even elegant, in isolation).

I agree with @AlexKnauth’s statement that ‘Consistent behavior matters more than "making all possible behaviors convenient."’ Amusingly though, I draw exactly the opposite conclusion (though much more weakly than he draws his); partial matching by default in hash would be inconsistent with most other match patterns. But if the rationale were clearly documented, I wouldn’t actually find this inconsistency irritating (as long as it’s easy and clear how to do full matching when you do need it).

And I’m completely ambivalent about whether the match pattern needs to match the hash constructor. I don’t think about match patterns in those terms, but I would never assert that I’m representative of most users in that respect.

I think the idea that struct does "full matching" is a misconception. All the patterns for structs allow partial matching:

(struct foo (a b))
(struct bar foo (c d))

(match (bar 1 2 3 4)
  [(foo a b) (list a b)])

(match (bar 1 2 3 4)
  [(struct foo (a b)) (list a b)])

(match (bar 1 2 3 4)
  [(struct* foo ([a a])) (list a)])

All of those patterns succeed. The struct* one is a bit more partial than the others for ignoring the foo-b field, but the others still ignore the bar-c and bar-d fields.

@AlexKnauth: I don’t want => arrows in the middle of key-value pairs. It’s unnecessary, it takes up space, and more space the more entries there are.

That's how I feel about grouping braces. I want grouping when I have a hash where certain entries take up multiple lines, but if they're mostly one-liners, it's unnecessary and distracting. (At this moment I find it more distracting than =>, even though it does take up one fewer character per entry.)

I have "keywords" => and ~? in there because otherwise there's not much way to prevent ambiguity when the body is a sequence of key, key value, key value default, and key default value entries. (The last two are awkward to keep unambiguous even in braces.)

@AlexKnauth: I also don’t want ellipses involved in the syntax of the rest patterns.

I sympathize with that. I think ellipses have a Rackety feel and might be popular, which is why I went for them there, but I'd love #:rest rest because Parendown tends to be most helpful when long expressions are precisely at the end of their parents. :slight_smile:

I think the same design works with either of these rest syntaxes, neither of them, or both.

@sorawee:

On keys as expressions: I don't think (== pat) to make "room to have keys that are patterns" would work nicely. If I have:

(define == add1)
(match (hash 1 2 2 3)
  [(hash* [(== 1) v]) v])

Is that matching against a pattern 1 or matching against a key value 2?

A key value 2. Since the == there isn't free-identifier=? to the == that hash* would looking for, it would consider this a miscellaneous expression like any other.

I know match isn't entirely in the habit of parsing things according to free-identifier=?, but it does in this case.

(Actually, free-identifier=? doesn't have perfect hygiene itself. If a macro expands into (hash* [k v]) using user input for the k, I think an occurrence of == in that input should not have the meaning hash* is looking for; the binding site hash* and the usage site == are in separate parts of the code and are engaging in a variable capture that's likely unintended. But every macro that uses free-identifier=? for keywords has this problem, and trying to solve it in a way that can be easily ported to all those uses of free-identifier=? across Racket would probably be a digression from this thread.)

@sorawee: a key doesn’t need to be a string/symbol in Racket hash tables

That's why I proposed supporting strings, symbols, keywords, and syntax templates. Just the things that are written in ways that have the least surprising translations to identifiers.

@sorawee: I don’t know what the behavior of (hash* [timestamp]) in your proposal is.

It would be an error because timestamp isn't quoted or a string.

Not only would allowing it be confusing in all the ways you mentioned, it would also make the parsing of bracketless entries ambiguous because (=>hash 'a => 'artichoke) could mean the three-entry hash pattern (=>hash ['a] [=>] ['artichoke]). So I didn't allow it, and I don't want to.

@sorawee: I experimented with the idea of "JavaScript-style identifier-pruning"

Oh yeah, I remember that now, nice!

@sorawee: But what should be the counterpart keyword for partial matching?

Maybe #:extensible? #:open?

I want to make sure it's the default though. If people are commonly rejecting hash tables that have entries they don't care about, then there's pretty much no possible change to the JSON that won't break somebody. (Somebody puts their finger on their temple. Oh, it's past me. Yes, past me, everything breaks when you don't prepare for backwards compatibility in advance, but it's nice for it sometimes not to. :slight_smile: )

Yesterday I was staunchly in favor of match patterns matching only things that were equal to things the corresponding constructor could return. I was writing that up in my message to this thread. But then I realized I also believe that the better notions of equality are really asymmetrical (partial orders) so as to allow for some backwards-compatible extensions. And since hash entries are one of those extension points that could be backwards compatible, I eventually found myself in favor of partial matching by the time I posted.

I don't have a strong opinion about full vs partial matching; I find @notjack's points fairly reasonable. The only thing I have an actually strong opinion about is agreeing with @ryanc about the importance of symmetry between construction and matching.

Re syntax: the current design follows @ryanc’s proposal that hash mirrors the hash constructor, and hash* does a different thing. @shhyou gave a convincing argument that match expander would be difficult to write without grouping, so hash* probably should have grouping. There’s no room left for (<pattern-name> <key-expr> => <val-pat>), unless we want to make a third pattern (hash**? @notjack would kill me!). (<pattern-name> [<key-expr> => <val-pat>]) is still possible, but as @AlexKnauth mentioned, it seems unnecessarily taking up spaces. (Of course, it could be that you disagree with @ryanc or @shhyou’s argument, so the premise is moot).

Let's do a poll then whether we should do partial or full matching for hash. Note that this is only about hash. hash* will definitely use partial matching.

1 Like

Should the hash pattern do a full matching or partial matching?

  • Full matching
  • Partial matching

0 voters

@sorawee: Re syntax: the current design follows @ryanc’s proposal that hash mirrors the hash constructor, and hash* does a different thing. @shhyou gave a convincing argument that match expander would be difficult to write without grouping, so hash* probably should have grouping. There’s no room left for (<pattern-name> <key-expr> => <val-pat>), unless we want to make a third pattern (hash**? @notjack would kill me!).

My suggestion came with the name =>hash. I think it's a little better than hash** in that it at least has a clue as to its distinguishing trait.

And the idea of =>hash was to have grouping, defaults, and punning, all in one syntax that also had non-grouping. A "have it all" kind of syntax, so that multiple diverging patterns like hash and hash* don't have to exist.

That's my primary reason for caring about grouping at all -- just because other people do. (Second place would be indentation style consistency, but that goal can take me all over the place, so I don't count it for much. Third place would be what @shhyou was saying about avoiding ~@ in syntax templates, but I think I'd just embrace the ~@.)

If people really want the name hash for their "have it all" syntax, there are options for that too. Certain calls to hash are errors, and any of them could open the design space for a "have it all" solution.

(hash #:an-unrecognized-keyword)
(hash =>)  ; or ..., or ~?, or similar, or a even new one
(hash . an-improper-list-of-arguments)
(hash #####a-new-reader-syntax)
(hash an-odd-number-of-arguments)
(hash an-unbound-variable)
(hash a-newfangled-value-that-hashes-complain-about)

For instance, perhaps (hash #:@ ['a 1] ['b 2]) would be a good grouping syntax. Not quite as terse as hash*, but ever so slightly more self-explanatory by allusion to ~@ and ,@.

I'm worried that those generic names aren't as good good for a poll without specific examples. Like here's a specific example I'm thinking about:

What should this program produce:

(define (f json)
  (match json
    [(hash 'type type 'brand brand 'model model)
     (string-append "My " type " is a "  brand " " model ".")]))

(f (string->jsexpr "{ \"type\": \"car\", \"brand\": \"Ford\", \"model\": \"Mustang\", \"year\": 2021,  \"color\": \"red\" }"))
  • An error: match: no matching clause for '#hasheq((brand . "Ford") (color . "red") (model . "Mustang") (type . "car") (year . 2021))
  • A string: "My car is a Ford Mustang."

0 voters

Oh well, now we have two polls. How should we interpret the results?

Some people may already vote on both polls, while some other people may vote on one but not on the other.

Does it make sense to require that the results from two polls agree? And ask people to vote on both polls consistently? Or do we just sum votes up across two polls?

I think it makes sense to take the maximum number of votes for an option between the 2 polls. So currently "full" has 5 votes on one poll and 2 votes on another, so it has the max of that, 5 votes. And "partial" has 3 votes on one poll and 2 votes on another, so it has the max of that, 3 votes.

But... I still think there's probably room for convincing people who would have a gut-reaction to vote one way, but would vote differently after encountering examples or good arguments.

One way to go would be to treat the polls as advisory to @sorawee who is doing the hard work of wrangling the best possible consensus and implementing it.

4 Likes

I wrote above that:

I still want both, but, while I had leaned toward wanting full matching by default, this example:

prompted me to think that partial matching might have the opportunity to be more efficient than full matching. It likewise reminded me of this point:

I likewise have never wanted an updated hash with the matched keys removed. Moreover, if I did want such a thing, I think it could easily be implemented as a library match expander using and and app.

Maybe it would be useful to articulate why we're discussing adding a pattern with special support from match, rather than a pure library extension? (That is what we're discussing, right?) I think it's a combination of (a) bugs in hash-table and (b) potential efficiency gains from cooperation with match's optimizer, but maybe there are other reasons.

If that's so, maybe it would help to figure out what the new primitive functionality we want from match is before pinning down a preferred surface syntax. Ideally, the right primitives would allow many possibilities to be implemented efficiently as library extensions. That seems desirable even if we provide the extensions from racket/match, as we do with == and struct*.

I likewise have never wanted an updated hash with the matched keys removed. Moreover, if I did want such a thing, I think it could easily be implemented as a library match expander using and and app.

In my latest proposal, there are three modes: #:full, #:partial, and #:rest. If you don’t want the rest hash table, simply don’t specify #:rest.

this example: …. prompted me to think

Well, that example, in my view, is biased. I think everyone agrees that if you are to match a JSON object, partial matching is the way to go. But JSON object is just one use case for the hash table.

I don’t hate partial matching. In fact, what prompted me to discover the issue in https://github.com/racket/racket/pull/4532 and subsequently made this RFC, was that I was working on https://github.com/racket/scribble/pull/328 and needed a way to do partial matching.

My argument for the full matching by default for hash, again, is that when you look at all other patterns, especially those that mirror their corresponding constructors, they fully match. That’s the default expectation. So why shouldn’t hash that is specifically designed to mirror the constructor fully match too?

As I expressed before, if I am restricted to adding only one pattern, I would just add hash* (with the name of hash), which has partial matching by default.

But a lot of people prefer consistency. They want a pattern that mirrors the constructor (which I personally don’t really care about), so now we have two patterns: hash and hash*. And if mirroring the constructor is the motivation for having hash like that, I don’t see why it shouldn’t be a full matching too.

Maybe it would be useful to articulate why we’re discussing adding a pattern with special support from match, rather than a pure library extension? (That is what we’re discussing, right?) I think it’s a combination of (a) bugs in hash-table and (b) potential efficiency gains from cooperation with match’s optimizer, but maybe there are other reasons.

It’s mainly because of the bugs in the existing hash-table. I need to add something so that users can switch over and stop using the buggy pattern. hash-table is already expressible as a match expander. Same for the proposed hash and hash*. There’s no efficiency gain.

If that’s so, maybe it would help to figure out what the new primitive functionality we want from match is before pinning down a preferred surface syntax. Ideally, the right primitives would allow many possibilities to be implemented efficiently as library extensions.

In my view, we have already done that in the first phase of this RFC. But I am now trying to wrap it up and start implementing it.

3 Likes

when you look at all other patterns, especially those that mirror their corresponding constructors, they fully match

Not necessarily. If you look at struct constructors, those can partially match on sub-structs, as @rocketnia pointed out in

I think the idea that struct does "full matching" is a misconception. All the patterns for structs allow partial matching:

(struct foo (a b))
(struct bar foo (c d))

(match (bar 1 2 3 4)
  [(foo a b) (list a b)])

Sure, it fully matches the fields of foo, but it partially matches the value. In the same way a "partial" version of (hash 'a a 'b b) fully matches the keys 'a and 'b while partially matching the value.

This is also why I don't really like the words "partial" and "full". In the same message where @rocketnia pointed this out about structs, there were other naming suggestions floated: #:extensible or #:open instead of #:partial, and #:final or #:closed instead of #:full.

I personally like the pair #:open and #:closed instead of #:partial and #:full. Either the datatype is open to extensions, or it's closed to them.

I also like #:final for "closed / full / final", but I only like it if "open / partial / extensible" is the default for everything without any keyword necessary for any form to specify non-final.

1 Like