I have a text file of words, and I have found out how to read it, line by line. I extract the words from each line into a list, and those lists look like this: (w0 (w10 w11 w12 ...) (w20 w21 w22 ...))
As I read new lines, I would like to add words to a global hash table, let's call it wh. For each w0, I want to make sure that w0 is in wh; if it's not, I must insert it with its value being an empty hash set. Now that I know w0 is in wh, I shall add w20, w21, ... to the hash set belonging to w0 in wh.
If you can help me find out how to do this, then I shall be able to take care of w10, w11, ...
I'm looking forward to your suggestions. Thanks.
Hi @jkleiser ,
As a first approach, I would use hash-has-key? to check whether wh
already contains w0
or not. If not, I would just use hash-set! to add w0
to wh
, mapped to the the empty set. If wh
already contains w0
, I would first use hash-ref to first retrieve the set w0
is mapped to in wh
, and then use hash-set!
to update it to the union of the new set with the old set.
Depending on the size of the files you process, you might also use group-by to first group together all the lists to which w0
should be mapped, and then append then together and convert the result to a set.
There are probably faster way to achieve what you need by taking into consideration the performance aspects of the used functions, but I hope my suggestions will help you get started.
Something like this?
#lang racket
(define ht (make-hash))
(define (insert w0 ws)
(define w0-words (hash-ref! ht w0 (λ () '())))
(hash-set! ht w0 (append ws w0-words)))
(insert "foo" (list "foo1" "foo2"))
(insert "bar" (list "bar1" "bar2"))
(insert "foo" (list "foo3" "foo4"))
(hash-ref ht "foo")
Thanks to both of you for getting me on track! I used soegaard's 'insert' definition, and it worked great.
In my little job, speed is not important at all, but thanks for telling me of the possibilities for speed-up.
After some testing, I found that soegaard's 'insert' for a given w0 could result in a list where some words could be repeated. That would happen if the last insert-line in his example was
(insert "foo" (list "foo3" "foo4" "foo1")
Here is my new version:
(define (insert w0 ws)
(define w0-set (hash-ref! all-words w0 (λ () (mutable-set))))
(for-each (λ (w)
(set-add! w0-set w))
ws)
(hash-set! all-words w0 w0-set))
Maybe not as elegant, but it eleminates duplicates. (My 'all-words' is his 'ht'.)
You can simplify things by using hash-update!
to combine looking up the key and updating its value into one step:
(define (insert w0 ws)
(hash-update! all-words w0
(lambda (w-set) (for ([w (in-list ws)]) (set-add! w-set w)) w-set)
mutable-set))
or using immutable sets instead:
(define (insert w0 ws)
(hash-update! all-words w0
(lambda (w-set) (foldl (lambda (w s) (set-add s w)) w-set ws))
set))
Thanks a lot, shawnw! hash-update! was a nice solution.
Nice! I did not know about hash-update. I mean, I could have implemented it, I suppose, but now that I know about it I'll use it all the time! Also, should your second example have used hash-update
instead of hash-update!
?
The second one is still assuming a mutable hash table, so hash-update!
it is.
Oh! Sorry, I missed that.
One thing to note: hash-update!
is essentially a hash-ref
followed by a hash-set!
, meaning that you need to worry about thread interactions.
The hash-update! and hash-ref! functions use a table’s semaphore independently for the hash-ref and hash-set! parts of their functionality, which means that the update as a whole is not “atomic.”
Quoted from 4.15 Hash Tables, which has more caveats to offer regarding concurrent modification.