The current hash-table
pattern for match
is problematic. It is buggy and inefficient. We would like to deprecate the hash-table
pattern and create a new pattern that does a better job.
A preliminary discussion is in match: fix hash-table pattern and its doc, add more optimizations by sorawee · Pull Request #4532 · racket/racket · GitHub. I will summarize viewpoints from the discussion in each section to show possible design spaces.
Naming
Everyone seems to agree on the name hash
. One downside of hash
is that the hash
constructor constructs only immutable hash tables, but the hash
pattern should be able to destruct mutable hash tables. However, there doesn't seem to be a good alternative name.
General syntax
Should key and value be grouped together?
One reason why we should not group key and value together is to make the pattern resemble the constructor. The hash
function is used like this:
(hash key-1 val-1 key-2 val-2)
So perhaps the hash
pattern should do the same.
On the other hand, one reason why we should group key and value together is to make it clearer to readers. I personally really don't like reading.
(hash 'abc 'def' 'ghi 'jkl 'mno 'pqr 'stu 'vwu)
Is 'mno
a key or a value? To answer that I need to start parsing from 'abc
. On the other hand, with grouping, I can immediately tell that 'mno
is a key.
(hash ['abc 'def'] ['ghi 'jkl] ['mno 'pqr] ['stu 'vwu])
A counterargument is that one can use newline as a way to group already, which IIUC is the "Clojure style".
(hash 'abc 'def'
'ghi 'jkl
'mno 'pqr
'stu 'vwu)
A counter-counterargument is that it is unnecessarily verbose, and the "Racket style" tends to group things explicitly.
One more datapoint is that the current hash-table
pattern does group key and value together.
Currently, @samth advocates for no grouping, while I (@sorawee) advocates for grouping.
Features
Full and partial matching
A full matching requires that there is no keys besides those listed in the pattern. A partial matching allows extra keys.
For example:
(match (hash 1 2 3 4)
[(hash [1 a]) a]
[_ 'none])
would return 'none
under the full matching semantics, and 2
under the partial matching semantics.
The question that I want to ask here is about what could be supported. There are three choices.
- Only support full matching.
- Only support partial matching.
- Support both.
IIUC, currently @samth advocates for (1) for the initial implementation, though he is open to further extension in the future. @notjack advocates for (2). @AlexKnauth and I advocate for (3).
Default matching behavior
If you choose "support both" in the previous section, congrats, you have one more thing to think about: what should be the default matching behavior?
Three possibilities:
- Partial matching by default. That is to say the above program would return
2
. However, the pattern could support something like(hash [1 a] #:full)
to fully match. - Full matching by default.
- Explicit mode is required. Or perhaps there should be two separate patterns
full-hash
andpartial-hash
.
Currently:
-
@AlexKnauth advocates for partial matching by default, but users can use the rest feature (see below) to do full matching.
(match (hash 1 2 3 4) [(hash [1 a] #:rest (empty-hash))]) ; full matching
I in fact considered this option before, but the fact that we need to support another pattern to recognize an empty hash doesn't seem ideal to me (technically
(empty-hash)
=(? empty-hash?)
, but an explicitempty-hash
pattern could make matching much more efficient) -
I prefer full matching by default, but users can use the rest feature to do partial matching
(match (hash 1 2 3 4) [(hash [1 a] #:rest _)]) ; partial matching
Note that this very same program would be a partial matching in @AlexKnauth's preference, too.
-
My another preference is partial matching by default, but if
#:full
is given, it would do the full matching.
Rest feature
The rest feature allows users to extract the residue from matching. For example:
(match (hash 1 2 3 4 5 6)
[(hash [1 a] #:rest h) h])
would produce (hash 3 4 5 6)
.
There are many possibilities for this feature:
Syntax: keyword vs ellipsis
Should the rest feature be expressed by #:rest
, as shown above, or ellipsis, as shown below?
(match (hash 1 2 3 4 5 6)
[(hash [1 a] h ...) h])
One reason to not prefer ellipsis is that ellipsis-bound variable has always been a list in match
. But we might not necessarily want a list as a result (see next section).
Hash table vs list
What should the residue be? A hash table or a list?
The current hash-table
pattern extracts the residue as a list (via ellipsis).
(match (hash 1 2 3 4 5 6)
[(hash-table [1 _] h ...) h])
produces '((3 4) (5 6))
.
A hash table extraction is much more efficient when the matching value is an immutable hash table, taking time proportional to the number of key-value pairs if keys are not arbitrary patterns (see next section). In contrast, list extraction takes linear time to the size of the hash table.
I also believe that a hash table result is the preferred format that can be used in future computation. In contrast, for list extraction to be useful, the list probably needs to be converted back to a hash table.
Currently, @samth seems to be against the rest feature in general for the initial implementation, but seems to prefer ellipsis/list extraction if the rest feature is to be implemented. @notjack seems to prefer ellipsis/hash extraction, though not against keyword/hash extraction. I prefer the keyword/hash extraction. @AlexKnauth seems to prefer a hash extraction.
Key position
What should the key position be? A couple of possibilities:
- The key position can be an arbitrary pattern. The
hash-table
pattern currently does this. The disadvantage is that it is extremely inefficient in general and doesn't seem to be useful. We can of course specialize cases where we know how to make matching efficient. But is it worth supporting the feature in the first place? - The key position must be a literal. This is the specialized clause in the current
hash-table
pattern that can be matched efficiently. - The key position can be an arbitrary expression. This is like literals, but it would also allow variable reference, etc, without making matching inefficient.
- Or maybe we could allow
id
, literals, and(== expr)
but nothing else. The intention is that it could potentially be useful to extract one arbitrary key without any constraint out (viahash-iterate-first
). However, for theid
case, the corresponding value pattern must also be anid
or_
, so that there is no constraint from the value pattern.
I'm not aware of anyone's position on this except mine, which is to support arbitrary expressions, but not arbitrary patterns. Ideally, I want to support id
too, but that seems too complicated.
Duplicate keys
Should duplicate keys be allowed to successfully match? That is to say, should:
(hash [1 a] [1 b])
be equivalent to:
(hash [1 (and a b)])
?
Or should duplicate keys always fail matching? Or should it error?
If literals are given, should match
warn/error at compile-time?
One data point is that the hash-table
pattern always fail matching on duplicate keys, but doesn't cause an error (though there's a bug that could make matching successful).
I'm not aware of anyone's position on this, and I personally don't have a preference either. Supporting duplicate keys is probably the cleanest semantics, I think?