Postgres and Custom Types

Hi, Racket Discourse.

I am messing around with embeddings, and I need some help getting a custom type to work.

Using the pg_vector extension to Postgres (running version 17), I can send an embedding vector to the database just fine, like so:

(db:query-exec
 pgc "INSERT INTO items (embedding) VALUES ($1::float4[])"
 embedding) ;; embedding is a list of real?

But when I try and do something like this:

(db:query-list
 pgc "SELECT 1 - (embedding <=> $1::vector) AS cosine_similarity FROM items"
 embedding)

I run into problems.

query-list: cannot convert given value to SQL type
  given: '(-0.0043678335 -0.026342869 -0.031660147 ...)
  type: bytea
  expected: bytes?
  dialect: PostgreSQL

If you try and swap the ::vector for ::float4[], it complains that the types are wrong:

query-list: operator does not exist: vector <=> real[]
  SQLSTATE: 42883

I can do this to make it work, but it seems like a "hack":

(db:query-list
 pgc "SELECT 1 - (embedding <=> CAST($1::float4[] AS vector)) AS cosine_similarity FROM items"
 embedding)

;=> '(0.5611949909970255 0.5611949909970255 0.7075591289715957)

Now, I've defined the custom type like so:

(define vector-typeid
  (vector->values
   (db:query-row pgc "SELECT oid FROM pg_type WHERE typname = $1" "vector")))

(send pgc add-custom-types
      (list (pg:pg-custom-type vector-typeid 'vector)))

but I am assuming I should use the #:send and #:recv keywords correctly. I tried naively serializing the embedding to bytes, but it seems like the pg_vector extension expects a particular format for that.

Is there an "easy" way to get something like this to work, or do I just have to figure out the encoding/decoding and then it'll make sense?


Eventually I just looked at another implementation of the binary serialization to get the idea:

(define (encode-embedding e)
  (apply
   bytes-append
   (list*
    ;; header (from the python `>HH`, meaning two big-endian, unsigned shorts)
    (integer->integer-bytes (length e) 2 #false #true)
    (integer->integer-bytes 0          2 #false #true)
    ;; content (I just guessed they would be big-endian, too)
    (for/list ([real (in-list e)])
      (real->floating-point-bytes real 4 #true)))))

(define read-four
  (lambda (in) (read-bytes 4 in)))

(define (decode-embedding d)
  (define in (open-input-bytes d))
  (define dimms (read-bytes 4 in))
  ;; adds some noise, need to tune this
  (for/list ([bytes (in-producer read-four eof in)])
    (floating-point-bytes->real bytes #true)))

...

(send pgc add-custom-types
      (list (pg:pg-custom-type vector-typeid 'vector
                               #:send encode-embedding
                               #:recv decode-embedding)))

Et voilĂ :

(db:query-list
 pgc "SELECT 1 - (embedding <=> $1::vector) AS cosine_similarity FROM items"
 embedding)

;=> '(0.5611949909970255 0.5611949909970255 0.7075591289715957)

Do I have to make a reply to mark it as solved?

Edit: indeed, I do.

FWIW, “email list” users don’t see the updated edit with the solution, requiring me to clickthrough. I’d encourage thinking about whether new content is truly an edit or a new post (which would also have helped the “solution” button here) :slight_smile:

2 Likes

Indeed, I guess it was because I treated the question a bit cavalierly. I had an answer so that was that, but I forget that this is a forum, while yet appealing to its members. Thanks for the prompt.