RFC: hash table pattern matching

The current hash-table pattern for match is problematic. It is buggy and inefficient. We would like to deprecate the hash-table pattern and create a new pattern that does a better job.

A preliminary discussion is in match: fix hash-table pattern and its doc, add more optimizations by sorawee · Pull Request #4532 · racket/racket · GitHub. I will summarize viewpoints from the discussion in each section to show possible design spaces.

Naming

Everyone seems to agree on the name hash. One downside of hash is that the hash constructor constructs only immutable hash tables, but the hash pattern should be able to destruct mutable hash tables. However, there doesn't seem to be a good alternative name.

General syntax

Should key and value be grouped together?

One reason why we should not group key and value together is to make the pattern resemble the constructor. The hash function is used like this:

(hash key-1 val-1 key-2 val-2)

So perhaps the hash pattern should do the same.

On the other hand, one reason why we should group key and value together is to make it clearer to readers. I personally really don't like reading.

(hash 'abc 'def' 'ghi 'jkl 'mno 'pqr 'stu 'vwu)

Is 'mno a key or a value? To answer that I need to start parsing from 'abc. On the other hand, with grouping, I can immediately tell that 'mno is a key.

(hash ['abc 'def'] ['ghi 'jkl] ['mno 'pqr] ['stu 'vwu])

A counterargument is that one can use newline as a way to group already, which IIUC is the "Clojure style".

(hash 'abc 'def' 
      'ghi 'jkl 
      'mno 'pqr 
      'stu 'vwu)

A counter-counterargument is that it is unnecessarily verbose, and the "Racket style" tends to group things explicitly.

One more datapoint is that the current hash-table pattern does group key and value together.

Currently, @samth advocates for no grouping, while I (@sorawee) advocates for grouping.

Features

Full and partial matching

A full matching requires that there is no keys besides those listed in the pattern. A partial matching allows extra keys.

For example:

(match (hash 1 2 3 4)
  [(hash [1 a]) a]
  [_ 'none])

would return 'none under the full matching semantics, and 2 under the partial matching semantics.

The question that I want to ask here is about what could be supported. There are three choices.

  1. Only support full matching.
  2. Only support partial matching.
  3. Support both.

IIUC, currently @samth advocates for (1) for the initial implementation, though he is open to further extension in the future. @notjack advocates for (2). @AlexKnauth and I advocate for (3).

Default matching behavior

If you choose "support both" in the previous section, congrats, you have one more thing to think about: what should be the default matching behavior?

Three possibilities:

  • Partial matching by default. That is to say the above program would return 2. However, the pattern could support something like (hash [1 a] #:full) to fully match.
  • Full matching by default.
  • Explicit mode is required. Or perhaps there should be two separate patterns full-hash and partial-hash.

Currently:

  • @AlexKnauth advocates for partial matching by default, but users can use the rest feature (see below) to do full matching.

    (match (hash 1 2 3 4)
      [(hash [1 a] #:rest (empty-hash))]) ; full matching
    

    I in fact considered this option before, but the fact that we need to support another pattern to recognize an empty hash doesn't seem ideal to me (technically (empty-hash) = (? empty-hash?), but an explicit empty-hash pattern could make matching much more efficient)

  • I prefer full matching by default, but users can use the rest feature to do partial matching

    (match (hash 1 2 3 4)
      [(hash [1 a] #:rest _)]) ; partial matching
    

    Note that this very same program would be a partial matching in @AlexKnauth's preference, too.

  • My another preference is partial matching by default, but if #:full is given, it would do the full matching.

Rest feature

The rest feature allows users to extract the residue from matching. For example:

(match (hash 1 2 3 4 5 6)
  [(hash [1 a] #:rest h) h])

would produce (hash 3 4 5 6).

There are many possibilities for this feature:

Syntax: keyword vs ellipsis

Should the rest feature be expressed by #:rest, as shown above, or ellipsis, as shown below?

(match (hash 1 2 3 4 5 6)
  [(hash [1 a] h ...) h])

One reason to not prefer ellipsis is that ellipsis-bound variable has always been a list in match. But we might not necessarily want a list as a result (see next section).

Hash table vs list

What should the residue be? A hash table or a list?

The current hash-table pattern extracts the residue as a list (via ellipsis).

(match (hash 1 2 3 4 5 6)
  [(hash-table [1 _] h ...) h])

produces '((3 4) (5 6)).

A hash table extraction is much more efficient when the matching value is an immutable hash table, taking time proportional to the number of key-value pairs if keys are not arbitrary patterns (see next section). In contrast, list extraction takes linear time to the size of the hash table.

I also believe that a hash table result is the preferred format that can be used in future computation. In contrast, for list extraction to be useful, the list probably needs to be converted back to a hash table.


Currently, @samth seems to be against the rest feature in general for the initial implementation, but seems to prefer ellipsis/list extraction if the rest feature is to be implemented. @notjack seems to prefer ellipsis/hash extraction, though not against keyword/hash extraction. I prefer the keyword/hash extraction. @AlexKnauth seems to prefer a hash extraction.

Key position

What should the key position be? A couple of possibilities:

  • The key position can be an arbitrary pattern. The hash-table pattern currently does this. The disadvantage is that it is extremely inefficient in general and doesn't seem to be useful. We can of course specialize cases where we know how to make matching efficient. But is it worth supporting the feature in the first place?
  • The key position must be a literal. This is the specialized clause in the current hash-table pattern that can be matched efficiently.
  • The key position can be an arbitrary expression. This is like literals, but it would also allow variable reference, etc, without making matching inefficient.
  • Or maybe we could allow id, literals, and (== expr) but nothing else. The intention is that it could potentially be useful to extract one arbitrary key without any constraint out (via hash-iterate-first). However, for the id case, the corresponding value pattern must also be an id or _, so that there is no constraint from the value pattern.

I'm not aware of anyone's position on this except mine, which is to support arbitrary expressions, but not arbitrary patterns. Ideally, I want to support id too, but that seems too complicated.

Duplicate keys

Should duplicate keys be allowed to successfully match? That is to say, should:

(hash [1 a] [1 b])

be equivalent to:

(hash [1 (and a b)])

?

Or should duplicate keys always fail matching? Or should it error?

If literals are given, should match warn/error at compile-time?

One data point is that the hash-table pattern always fail matching on duplicate keys, but doesn't cause an error (though there's a bug that could make matching successful).

I'm not aware of anyone's position on this, and I personally don't have a preference either. Supporting duplicate keys is probably the cleanest semantics, I think?

5 Likes

Disclaimer, these are the thoughts of a non-CS user.

Regarding matching behavior, rest, and residue: If I knew only enough about match to be familiar with how it treated lists, and you forced me to guess what I would find in the Racket docs about matching hash tables, here’s what I would guess we would find:

  • (hash 'abc 'def) for a full match
  • (hash* 'abc 'def h) for a partial match — with h being an actual hash table.

Key position: in the guessing-about-the-docs thought experiment above, I would have supposed that just about anything allowed in a list pattern would also be allowed in a hash pattern. But after thinking about it for three minutes, I think I agree that arbitrary expressions, but not arbitrary patterns, might be the most reasonable.

Duplicate keys: if you had duplicate literal keys, that seems like it should raise an exception? But supporting arbitrary expressions for keys might mean it is best to just allow duplicates. I take it utterly for granted that a hash or hash* would never match on duplicate keys.

3 Likes

There are two things I would like in the hash matching feature that I think are missing from this proposal:

  • Providing default values for the match
  • a discussion on the efficiency of a "rest" pattern when matching a subset of the keys

Providing default values for the match: I noticed that it is a lot easier to provide a default value to hash-ref in cases where I want to handle a missing key immediately. That is, I think it is simpler to write code as shown below, than the alternative of using with-handlers to catch the exception. I also found that I almost always want to actually handle errors, since I can usually either handle the missing value, or provide a more appropriate error message to the user.

For me, to be useful, a hash matching functionality needs to allow a default to the value if the key is not in the hash.

(let ([val (hash-ref key a-hash #f)])
  (if val
       (begin ...)
       (error "Some meaningful error to the user"))

Efficiency of a "rest" pattern when matching a subset of the keys A lot of time I need to get a few keys out of a hash table that can contain many other keys. In such cases, I don't want the hash table matcher to allocate a new hash table with the remaining elements (which I would not use).


I would also like to see some concrete use cases of hash tables in the proposal.

For example, it is not clear to me that the code below is more readable than just using hash-remove (and would it be as efficient as the hash-remove ?):

(match (hash 1 2 3 4 5 6)
  [(hash-table [1 _] h ...) h])

It is also not clear that the code:

(match (hash 1 2 3 4)
  [(hash [1 a]) a]
  [_ 'none])

is more readable than using hash-ref with a 'none default value; and it is not clear how it can be extended to look up two values:

(define h (hash 1 2 3 4))
(define a (hash-ref 1 h 'none-a))
(define b (hash-ref 2 h 'none-b))

Also, would the match code allow users to concisely express an inefficient match? For example, if the user tries to find the key that is bound to the value 10 in a large hash table, would they get some warning?

(define k
   (match (make-some-hash-table-with-lots-of-keys)
      ((hash k 10) k)
      (_ 'none))

I can imagine many ways in which users would be able to easily create inefficient code using the hash matching behaviour.

Alex.

2 Likes

Thanks, Joel and Alex!

guess what I would find in the Racket docs about matching hash tables, here’s what I would guess we would find:

I like the idea of hash* being ergonomic for partial matching, and hash being ergonomic for full matching, though it may still make sense to allow hash for partial matching, similar to how (list* x xs) would be equivalent to (list x xs ...). That is, (hash* [x y] h) would be equivalent to (hash [x y] #:rest h).

Providing default values for the match

I can see how this feature could be useful. One possible way to support the default value is with #:default <default-val>:

(match (hash 1 2)
  [(hash [1 x #:default #f])
   x]) ;=> 2

(match (hash 1 2)
  [(hash [3 x #:default #f])
   x]) ;=> #f

Here could be more advanced usage:

(match (hash 1 (list 2))
  [(hash [1 (or (list x) x) #:default #f])
   x]) ;=> 2

(match (hash 1 (list 100))
  [(hash [3 (or (list x) x) #:default #f])
   x]) ;=> #f

Note that #:default <default-val> would make most sense when there’s a key-value grouping, so that’s another reason to favor grouping.

Efficiency of a “rest” pattern

The intention of my proposal is that the rest pattern is efficient, provided that the matching value is an immutable hash table and key positions are not arbitrary patterns. It would use hash-remove under the hood, yes.

I would also like to see some concrete use cases of hash tables in the proposal.

That’s a good suggestion. Some examples of hash table pattern matching in the wild can be seen below (thanks to @shhyou for collecting these examples). Most of them are extracting JSON dictionary content.

it is not clear to me that the code below is more readable

The programs in the proposal are toy examples meant to illustrate the semantics of the pattern. I wasn’t concerned about readability at all. See the list of links above for actual usage, which would need to be concerned with readability.

would the match code allow users to concisely express an inefficient match

That depends on whether the key position is a pattern OR an expression. In the current (or rather, previous) hash-table pattern, it allows any pattern, so yes, it is horribly inefficient with no warning. However, in the proposal, I proposed that the key position would be an expression. In that case, hash table pattern matching is guaranteed to be efficient (unless the rest pattern is used on a mutable hash table). Your example that tries to match "k 10" would be an error, since k is used as an expression, not as a binding pattern.

1 Like

I don't use match too much, so I was surprised that the hash-table can match partial cases.

I like (hash a b c d h ...), where h is a hash like the initial hash. I don't expect too long list of key-values, and in that case I expect to use the vertical alignment. I used it to open http connections that is a function that has like 5 arguments and 5 keywords (see for example http-sendrecv). Moreover I used comments as fake keywords for the first arguments.

Does the macro detect (hash a b c d _ ...). and avoid the creation of the rest part? It looks very difficult to remove the creation of the unused rest part later in the optimization.

What about removing the _ and allowing (hash a b c d ...)? Is it too weird to have a ... that just match the rest and are not connected to any binding?

A weird question: What happens with (hash x 10 y 10)? Can it bind (hash 'a 10)?

I agree that the general case is very inefficient. Perhaps start the initial version very restrictive conditions in the keys (like only literals) and later add more case in case they are popular in the wild. The reverse option is harder due to backward compatibility.

Another weird question: What happens with weak hashes? If the keys in the match clause can only be literals, I don't see any problem. But in a general case where the weak hash must be iterated, it can be a problem if the garbage collector removes something and the internal structure change and the iterator gets confused.

2 Likes

IMO (key value) pairs should be represented as just that, "pairs"; if some folks would prefer to omit clarifying punctuation, why not simply introduce some syntactic sugar:

(pair <k.0> <v.0> <k.1> <v.1> ...)
=> (<k.0> <v.0>) (<k.1> <v.1>) ...

So either may be used as desired:

(hash (<k.0> <v.0>) (<k.1> <v.1>) ...)
-or-
(hash (pair k.0> <v.0> <k.1> <v.1> ...))

As just a thought.

2 Likes

I second the proposal that

  1. Assumes the keys in the hash pattern are restricted to expressions, not other patterns
  2. Defaults to full matching but supports the #:rest option, assuming that #:rest _ is compiled to no-op and full match is just some additional size check.
  3. Groups key-value pairs and uses keywords for the other options (like #:rest).

In addition, I like @joeld's suggestion to have hash* for partial matches. The pattern (hash* [k-expr v-pat] ...) can just be a match expander that turns into (hash #:rest _ [k-expr v-pat] ...). The name hash* is also consistent with the match expander struct*.

For the syntax of hash, the 3rd point is very important for match expanders and for macros that generate match expanders. If the hash pattern splices the key-value sequences or uses ellipsis, the escape syntax would be challenging for everyone -- macro authors would need to escape ~@ and ... one or two times in the syntax template. For example, consider this macro for defining JSON match expanders:

Had hash used ellipsis to express partial matching, in order to generate _ ... macro authors would have to write

#'(hash-table [k v] ... _ (... (... ...)))

because define-json-expander is a macro that generates another macro (a match expander) and thus requires two levels of escape. Similarly, macro authors would need to escape ~@ as (... ~@) in order to express the splicing in the generated macro.


I don't any opinion about what #:rest should do. Actually, what would be a practical use case of (hash [k v] ... #:rest non-trivial-pattern)? I assume that hash* would be the most popular case, so hash that supports #:rest _ is a good solution without sacrificing the ability of expressing full-match semantics. However, I am curious what are the applications of #:rest given that it would just be removing some fixed & finite set of entries.

I can understand samth's opposition to the #:rest feature. It is not clear what the semantics should be if we can't find applications other than #:rest _ at the moment. Not adding #:rest right now makes it possible to choose the right behavior later.

On a side note, I wonder whether there is a need to differentiate between mutable/immutable hashes, hasheq, hasheqv, hash (equal) and hashalw. For example, should the pattern (hash-table [(== (expt 2 100)) val]) fail for the hash table (hasheq (expt 2 100) "no-match")? What if the keys contain some mutable boxes?

1 Like

This behavior was unintentionally introduced for the specific case where the keys are all literals and there is at least one key. The GitHub PR details the relevent commits.

It's simply that partial matches are common for extracting fields in JSON hash tables: RFC: hash table pattern matching - #4 by sorawee

I think it is better to restrict the keys to expressions and ask the users to write (app hash->list (list-no-order XYZ ...)) if they want to search for the keys.

Just to note that list-no-order already can(must?) produce very slow code to check all the cases, but the name is more scary. So it may be not so strange that hash produce slow code. Perhaps it would be better to allow any pastern for keys and values like in the other forms, but add a big warning in the docs.

I feel strongly that when a constructor name is also used as a pattern form, they should use the same syntax. That is, hash should have ungrouped key-value pairs. On the other hand, I think it's reasonable for the pattern form to support a superset of the constructor's syntax, for example through keywords.

So my preferences would be:

  • hash uses ungrouped key-value pairs.
  • hash* uses grouped key-value pairs (and maybe allows extensions like #:default clauses for missing keys).

Both forms should allow keyword options to indicate complete vs partial matches and maybe other things like hash kind and key comparison (or maybe that's better left to and and predicates). I don't have a preference about complete vs partial matching for the default, but there should be explicit syntaxes for both behaviors.

I think we should avoid automatically conflating partial matches with an optional "rest" pattern. I almost always want partial matches, but I have never wanted an updated hash with the matched keys removed. Avoiding that feature would avoid questions like whether mutable hashes (or weak/ephemeron hashes) are converted to immutable tables or copied, etc.

While hash-table in general is slow, the special case where the key patterns are all literals is very fast (just hash-has-key? checks and hash-refs). This is also a very common use case.

1 Like

One reason to not prefer ellipsis is that ellipsis-bound variable has always been a list in match .
But we might not necessarily want a list as a result (see next section).

It makes sense to reserve the ellipsis for lists.

In analogy with list-rest I suggest introducing a hash-rest where the last pattern
is bound to a hash table with the rest of the pairings.

If an ellipsis-like notation is needed for hash, how about ht... instead of ...?

1 Like

Some datapoints, with a grain of salt. (Disclaimer, I started responding from the top, so if something is already covered, please ignore it.)

Naming

Probably not the immediate concern, but what about dict that could match any gen:dict? Then sticking with the constructor syntax isn't an issue, since there is no unified constructor syntax. OTOH, this is probably vastly more complex.

General Syntax

I prefer to think of hashes and dicts like binding pairs. Racket has many examples of grouping related items with square brackets when it doesn't indicate #%app. So I like [key value] syntax. A dict matcher being a future possibility, with the likelihood of using [key value] syntax, leads me to believe that some consistency with hash and dict would be nice.

Let's also not forget that hash is not the only constructor, and both hash-literals and make-hash-style constructors take assoc-lists. I suggest we not write dotted-pairs [key . value], though.

Full/Partial Matching

I like (3) support both, and I think they deserve something to distinguish them. Hypothetically, hash is full only, and a later hash* can be partial. This gives preference to @samth's (1) for the initial implementation. Looks like @joeld and I agree on this.

Rest pattern

This reminds me assoc lists and functional maps; I like the general idea. Seems to allow for partial matching as you wrote, and my hash* from above can be sugar for this. I suggest we pin down the semantics to avoid the original issues.

Related to the later headings, I strongly prefer a hash residue in the spirit of functional immutable programming.

Key Position

I would have to think more about typical uses of match with hashes, and what uses would be supported by different options. As it is I rarely use hash-table, biased by my use of SML/NJs opaque mapping structures. I think @alexh raises good points here, where usage drives my opinions on some of these things (and existing APIs might be more suitable).

Duplicate Keys

The duplicate keys problem seems moot until "Key Position" is resolved.


Related work:

  • The ORD_MAP signature from SML/NJ, where implementation's map types are opaque, so no matching can be done. But the API makes operations convenient in terms of an added or removed "binding" (key-value pair) and the residual hash. There are also plenty of functors for creating ORD_MAPs over specific key types.

Thanks everyone! I will reply to points later this week. For now, here’s an initial implementation that is implemented as a match expander, so that you can run without modifying match. There’s also a test module attached at the end.

https://gist.github.com/sorawee/dcd9d365b30fff8ef8f65c5a53e92ed6

Some design decisions:

  • hash with no grouping but without #:default support. Full matching by default.
  • hash* with grouping and with #:default support. Partial matching by default.
  • hash* to hash is like struct* to struct rather than list* to list. In particular, hash* doesn’t mandate a “rest pattern” like list*.
  • #:default support.
  • Possible keywords at the end: #:full for explicit full matching, #:partial for explicit partial matching, #:rest rest-pat for full matching with residue. #:rest _ specially recognized to be equivalent to #:partial.
  • Non-trivial #:rest would construct a residue hash table. This is only efficient for immutable hash tables. The residue hash table preserves key comparison and weak/strong/ephemeron-ness.
  • Key positions are expressions.
  • Duplicates are not allowed. This makes full matching easier to implement efficiently (esp. in presence of #:default).
2 Likes

Generally speaking, I respect y'all's opinions enormously. I think the only area of this for which I'd like to put my coin on the scale is in the area of key/value grouping, and (like @shhyou and @sorawee ) I support it. Indeed, every time I construct a hash without grouping it feels slightly wrong to me. In addition, I think that @shhyou makes a compelling argument in the syntactic arena.

1 Like

On second thought, I think I would prefer allowing duplicates, which makes partial matching significantly faster, while full matching is still as slow as it currently is.

In the current implementation of hash*, which disallows duplicates:

"partial match"
cpu time: 2124 real time: 2218 gc time: 27
"full match"
cpu time: 2152 real time: 2255 gc time: 8
"hash-table, which is buggy, but it allows partial match and duplications"
cpu time: 1126 real time: 1220 gc time: 1

But if hash* allows duplicates:

"partial match"
cpu time: 589 real time: 624 gc time: 22
"full match"
cpu time: 2052 real time: 2142 gc time: 7
"hash-table, which is buggy, but it allows partial match and duplications"
cpu time: 1096 real time: 1173 gc time: 0

Here’s the benchmark:

#lang racket

(require "hash-pattern.rkt")

(define N 10000000)

(define (test h)
  (match h
    [(hash* [1 v] [2 v2] [3 v3])
     (list v v2 v3)]))

(define (test2 h)
  (match h
    [(hash* [1 v] [2 v2] [3 v3] #:full)
     (list v v2 v3)]))

(define (test3 h)
  (match h
    [(hash-table [1 v] [2 v2] [3 v3])
     (list v v2 v3)]))

(define h (hash 1 2 2 3 3 4))

"partial match"
(time (for ([i (in-range N)])
        (test h)))

"full match"
(time (for ([i (in-range N)])
        (test2 h)))

"hash-table, which is buggy, but it allows partial match and duplications"
(time (for ([i (in-range N)])
        (test3 h)))

1 Like

Super trivial bike-shedding about surface syntax "ergonomics":

In a variant like the proposed hash* which groups using delimiters,

[key-pat val-pat]

it feels verbose needing to write #:default:

[key-pat val-pat #:default default-val]

Instead could it simply be (somewhat like Scribble defproc style):

[key-pat val-pat default-val]

After all, one reason for using delimiters (as well as for readability, and historically for want of syntax-parse splicing :wink:) is it allows riffing on "arity" like this.


Maybe the #:default is allowed but optional for those who think it aids readability. (Although even here I'd suggest something shorter like #:else or even #:or?)

So I guess:

mapping-pattern = [key-pat val-pat default]

default = 
        | val
        | #:or val
1 Like

I forgot to mention this in my previous message, but I think it would be better to have key patterns rather than key expressions. The patterns should be restricted to atomic datum patterns, quoted datum patterns, and explicit == patterns. The mixture of expressions and patterns is confusing/suprising. As evidence, see Greg's message above this one, where he talks about key-pat. Using restricted key patterns also makes it easier to correctly recognize the common case where the keys are constants: you can use the rules for patterns rather than trying to analyze Racket expressions. (No local-expand or #%datum checks needed.)

4 Likes

I strongly believe the default hash pattern should use partial matching, for the reasons I outlined in my comment on the pull request. To summarize, people are going to do this a lot:

(match (get-json-from-some-web-api)
  [(hash 'user user-id 'posts post-list)
   (do-stuff user-id post-list)])

This will break whenever the web service being used decides to put a little more data in their response. Given the prevalence of JSON web APIs in the modern development landscape, this will catch a lot of people off guard.

2 Likes

I feel like it is sufficient to have two different patterns with examples in the docs, like using hash* for default partial matching and hash for default full matching.

1 Like