[TIL] hash-ref is very fast on hasheq

Not a question per se, just a positive remark: I was wondering whether hash-ref was fast for fixed large sets of fixnums, or if a custom binary search could do better. Answer: It's plenty fast: it can answer 1 000 000 queries on a set of 10 000 000 fixnums in less than 200ms in total.

For reference:

The custom binary search is actually slower, by a factor 2 or so.

At least I don't have to waste time on this, so thank you, implementors :slight_smile:

2 Likes

If you use a hasheq, I'd expect keys to be compared by identity (eq?), not by hash function results or a numerical comparison.

If key equality is checked with eq?, I wonder whether it's a good idea to use fixnums as keys. Probably fixnums aren't interned/reused. (I'm only guessing that they aren't.)

From the Racket docs, " As a special case among numbers, two fixnums that are = are also the same according to eq?." 4.1 Equality

2 Likes

There are a few extra things you might want to consider in an actual application:

  • Performance of creating a 10 000 000 fixnum hash vs creating and sorting a 10 000 000 fxvector
  • The number of lookups you use in your test has to match what you expect from your application. In an extreme case, if your application needs to do just a few lookups and you already have all the data in a list, it might be faster to scan the list in linear time than to construct a hash and use the much faster hash-ref
  • The number of elements you search for also has to match what you expect from your application. You might find that, if you have only a few thousand elements the performance profile changes.

I suspect (without checking) that, for sets of 100 elements or less, vector-memq might be faster than both the binary search in a fxvector and a hash-ref...

Alex.

4 Likes

Additionally if you have very many elements (and depending on how dense those are distributed) it may make sense to use a bloom filter. It can tell you that something is not in the set with certainty, but it only can tell you whether something could be in the set.

So usually it is used to quickly decide that something isn't in the set and if it might be in the set you need to check some other data structure.
I mostly mention this, because I like the idea of using a bloom filter, so far I haven't had a case where it worked out to improve performance.

1 Like

I agree.

[Ignoring some minor details, like negative fixnums and the 32bits version of CS.]

In BC, the "even pointers" are real pointers to objects, and the odd "pointers" are fake pointers that are an encoded fixnums, where the N is internally represented as 2N+1.

In CS, the fixnum N is internally represented as 8N, and the other 6 remainders modulo 8 are pointers to real objects+something, where something indicate the kind of the the object, like a cons, of a box, a struct,... And if you were counting, there is 1 additional remainder that indicate that it's an special value like #t, #f, ... or a char.

So, fixnum are not interned, but imagining them as interned is a good approximation.

2 Likes