A global hash table where values are sets

I have a text file of words, and I have found out how to read it, line by line. I extract the words from each line into a list, and those lists look like this: (w0 (w10 w11 w12 ...) (w20 w21 w22 ...))
As I read new lines, I would like to add words to a global hash table, let's call it wh. For each w0, I want to make sure that w0 is in wh; if it's not, I must insert it with its value being an empty hash set. Now that I know w0 is in wh, I shall add w20, w21, ... to the hash set belonging to w0 in wh.
If you can help me find out how to do this, then I shall be able to take care of w10, w11, ...
I'm looking forward to your suggestions. Thanks.

Hi @jkleiser ,

As a first approach, I would use hash-has-key? to check whether wh already contains w0 or not. If not, I would just use hash-set! to add w0 to wh, mapped to the the empty set. If wh already contains w0, I would first use hash-ref to first retrieve the set w0 is mapped to in wh, and then use hash-set! to update it to the union of the new set with the old set.

Depending on the size of the files you process, you might also use group-by to first group together all the lists to which w0 should be mapped, and then append then together and convert the result to a set.

There are probably faster way to achieve what you need by taking into consideration the performance aspects of the used functions, but I hope my suggestions will help you get started.

Something like this?

#lang racket

(define ht (make-hash))

(define (insert w0 ws)
  (define w0-words (hash-ref! ht w0 (λ () '())))
  (hash-set! ht w0 (append ws w0-words)))

(insert "foo" (list "foo1" "foo2"))
(insert "bar" (list "bar1" "bar2"))
(insert "foo" (list "foo3" "foo4"))

(hash-ref ht "foo")
1 Like

Thanks to both of you for getting me on track! I used soegaard's 'insert' definition, and it worked great.
In my little job, speed is not important at all, but thanks for telling me of the possibilities for speed-up.

2 Likes

After some testing, I found that soegaard's 'insert' for a given w0 could result in a list where some words could be repeated. That would happen if the last insert-line in his example was
(insert "foo" (list "foo3" "foo4" "foo1")

Here is my new version:

(define (insert w0 ws)
  (define w0-set (hash-ref! all-words w0 (λ () (mutable-set))))
  (for-each (λ (w)
              (set-add! w0-set w))
            ws)
  (hash-set! all-words w0 w0-set))

Maybe not as elegant, but it eleminates duplicates. (My 'all-words' is his 'ht'.)

2 Likes

You can simplify things by using hash-update! to combine looking up the key and updating its value into one step:

(define (insert w0 ws)
  (hash-update! all-words w0
                (lambda (w-set) (for ([w (in-list ws)]) (set-add! w-set w)) w-set)
                mutable-set))

or using immutable sets instead:

(define (insert w0 ws)
  (hash-update! all-words w0
                (lambda (w-set) (foldl (lambda (w s) (set-add s w)) w-set ws))
                set))
2 Likes

Thanks a lot, shawnw! hash-update! was a nice solution.

Nice! I did not know about hash-update. I mean, I could have implemented it, I suppose, but now that I know about it I'll use it all the time! Also, should your second example have used hash-update instead of hash-update! ?

The second one is still assuming a mutable hash table, so hash-update! it is.

Oh! Sorry, I missed that.

One thing to note: hash-update! is essentially a hash-ref followed by a hash-set!, meaning that you need to worry about thread interactions.

The hash-update! and hash-ref! functions use a table’s semaphore independently for the hash-ref and hash-set! parts of their functionality, which means that the update as a whole is not “atomic.”

Quoted from 4.15 Hash Tables, which has more caveats to offer regarding concurrent modification.

1 Like