Using pdf-read package to extract text from pdf file

Using the following code in DrRacket, on a PC with Windows 10:

#lang racket
(require pdf-read)

Running it generates the following message:

 ..\..\..\Program Files\Racket\collects\ffi\unsafe.rkt:131:0: ffi-lib: could not load foreign library
  path: libpoppler-glib.dll
  system error: No se puede encontrar el módulo especificado.; win_err=126

Searching my computer, I only found libpoppler-glib-8.dll.
On the other hand, searching the Web, I found libpoppler-glib-8.dll.a.

Question: What is the recommended way to solve this issue?
I have the racket-poppler package installed, and I also understand that it is only known for sure to work on Linux or Mac OS X, and that it would be of great value to be able to make it work on Windows too.
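
As a hedged diagnostic sketch (not a fix), one way to see which DLL names Racket can actually load is to call ffi-lib with a #:fail thunk, so that a missing library yields #f instead of an exception. The helper name try-load and the candidate list are assumptions for illustration; the second candidate is simply the file name found on disk above.

```racket
#lang racket/base
(require ffi/unsafe)

;; try-load : string -> (or/c ffi-lib? #f)
;; Hypothetical helper: returns the library handle, or #f when the
;; library (or one of its dependencies) cannot be loaded.
(define (try-load name)
  (ffi-lib name #:fail (lambda () #f)))

;; Candidate names are guesses based on the file names mentioned above.
(define candidates '("libpoppler-glib" "libpoppler-glib-8"))

(for ([name (in-list candidates)])
  (printf "~a: ~a\n" name (if (try-load name) "loaded" "not found")))
```

If a candidate still reports "not found" even though the file exists, the DLL may not be on the library search path, or one of its own dependencies may be missing.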

Context: I hope to use this package to extract text from Numbas exams PDF reports.

I look forward to using this very promising package.
Thank you very much for your support.
ECB

I have the racket-poppler package installed, ...

Does racket-poppler work for you?

If so, it contains the same function reading from pdfs as pdf-read.

For some reason the documentation doesn't appear on pkgs.racket-lang.org
(I'll need to look into why), but there is documentation here:

[clone the repo and open index.html]

There is a small example here:


Thank you @soegaard for the information:

The following code worked without errors:

#lang racket
(require racket-poppler pict)

; document-info : d -> assoc-list
(define (document-info d)
  (list (list 'title    (pdf-title d))
        (list 'author   (pdf-author d))
        (list 'subject  (pdf-subject d))
        (list 'keywords (pdf-keywords d))
        (list 'creator  (pdf-creator d))
        (list 'producer (pdf-producer d))
        (list 'page-count (pdf-count-pages d))))

; (define f "x.pdf")

;; Open the "numbas-firstpage.pdf" file.
(define f "numbas-firstpage.pdf")
(pdf-file? f)
(define d (open-pdf f))
d
;; Title and other document info

;;(pdf-title d)

Resulting in the following answers in the interaction window:

#t
#<pdf>

But when I uncommented the last expression, (pdf-title d), the title was written in the interactions window, and an exception was reported in a separate DrRacket 8.11 console window, with this content:

cpointer-accessor: contract violation
  expected: cpointer?
  given: "numbas-sample.pdf"
exception raised by exception handler: abort-current-continuation: contract violation
  expected: continuation-prompt-tag?
  given: [?error-value->string-handler not ready?]; original exception raised: cpointer-accessor: contract violation
  expected: cpointer?
  given: "numbas-sample.pdf"
  context...:
   C:\Program Files\Racket\collects\ffi\unsafe\alloc.rkt:27:0: deallocate
   C:\Program Files\Racket\collects\ffi\unsafe.rkt:2245:25
   C:\Program Files\Racket\collects\racket\private\more-scheme.rkt:266:2: call-with-exception-handler
   C:\Program Files\Racket\collects\ffi\unsafe.rkt:2232:17
internal error: attempt to deschedule the current thread in atomic mode
exception raised by exception handler: abort-current-continuation: contract violation
  expected: continuation-prompt-tag?
  given: [?error-value->string-handler not ready?]; original exception raised: internal error: attempt to deschedule the current thread in atomic mode
  context...:
   C:\Program Files\Racket\collects\ffi\unsafe.rkt:2245:25
   C:\Program Files\Racket\collects\racket\private\more-scheme.rkt:266:2: call-with-exception-handler
   C:\Program Files\Racket\collects\ffi\unsafe.rkt:2232:17

When I press Ctrl-C (twice) in this console window and close it, the DrRacket editor is closed as well.

The very strange thing is that the file name in the DrRacket 8.11 console window keeps appearing as in the first run, that is, "numbas-sample.pdf" instead of "numbas-firstpage.pdf". I changed the file name to go from a multi-page PDF to a one-page PDF. Even in the interactions window of the DrRacket editor, the name keeps appearing as "numbas-sample.pdf" (the first result). I closed the DrRacket editor and ran it again, but the result was the same as before.

As pure speculation, without knowing exactly what is going on, my intuition is that a function may be evaluated just once, with all further calls returning the result of that first evaluation. That is what happens when we define and use a function without parentheses. Such was the case with the dollar.rkt file in the scribble-math package, which is (I hope) in the process of being updated. But this is only a first speculation.

Thank you @soegaard for any further exploration concerning this issue.

@encomer

Let's first figure whether it is document specific.

If you open the file

racket-poppler/examples/test-pdf-functions.rkt

in DrRacket and click Run, what happens?

Here is a screen shot from DrRacket 8.11.1 [cs] on macOS:

If it fails, we know the issue is OS-related.

If it works, then send the numbas pdf file, so we can test on the same file.

Thank you @soegaard. I ran the code in the original test-pdf-functions.rkt, now using the file "guide.pdf", with the following results:

#t
#<pdf>
""
'((title "")
  (author "")
  (aubject "")
  (keywords "")
  (creator
   "LaTeX with hyperref package")
  (producer
   "pdfTeX-1.40.3")
  (page-count
   364))
#<page>
'(612.0 792.0)
'(0.0
  0.0
  612.0
  792.0)
"The Racket Guide\nVersion 5.3.4.11\nMatthew Flatt,\nRobert Bruce Findler,\nand PLT\nJune 9, 2013\nThis guide is intended for programmers who are new to Racket or new to some part of\nRacket. It assumes programming experience, so if you are new to programming, consider\ninstead reading How to Design Programs. If you want an especially quick introduction to\nRacket, start with Quick: An Introduction to Racket with Pictures.\nChapter 2 provides a brief introduction to Racket. From Chapter 3 on, this guide dives into\ndetails—covering much of the Racket toolbox, but leaving precise details to The Racket\nReference and other reference manuals."
"The Racket Guide\nVersion 5.3.4.11\nMatthew Flatt,\nRobert Bruce Findler,\nand PLT\nJune 9, 2013\nThis guide is intended for programmers who are new to Racket or new to some part of\nRacket. It assumes programming experience, so if you are new to programming, consider\ninstead reading How to Design Programs. If you want an especially quick introduction to\nRacket, start with Quick: An Introduction to Racket with Pictures.\nChapter 2 provides a brief introduction to Racket. From Chapter 3 on, this guide dives into\ndetails—covering much of the Racket toolbox, but leaving precise details to The Racket\nReference and other reference manuals.\n1"
'()
'((238.76
   164.72937400000006
   250.24267179999998
   180.20601860000005)
  (250.24267179999998
   164.72937400000006
   259.8144342
   180.20601860000005)
  (259.8144342
   164.72937400000006
   267.45807179999997
   180.20601860000005)
  (267.45807179999997
   164.72937400000006
   271.7619218
   180.20601860000005)
  (271.7619218
   164.72937400000006
   284.19144059999996
   180.20601860000005))
.
.
.
.
.
> 

(Note: the dots in the results above stand for images of the PDF content, at normal and reduced scales.)

But the program also caused a DrRacket 8.11 console window to open, with the following content:

cpointer-accessor: contract violation
  expected: cpointer?
  given: "The Racket Guide\nVersion 5.3.4.11\nMatthew Flatt,\nRobert Bruce Findler,\nand PLT\nJune 9, 2013\nThis guide is intended for programmers who are new to Racket or new to some part of\nRacket. It assumes programming experience, so if you are new to programming, consider\ninstead reading How to Design Programs. If you want an especially quick introduction to\nRacket, start with Quick: An Introduction to Racket with Pictures.\nChapter 2 provides a brief introduction to Racket. From Chapter 3 on, this guide dives into\ndetails—covering much of the Racket toolbox, but leaving precise details to The Racket\nReference and other reference manuals.\n1"
exception raised by exception handler: abort-current-continuation: contract violation
  expected: continuation-prompt-tag?
  given: [?error-value->string-handler not ready?]; original exception raised: cpointer-accessor: contract violation
  expected: cpointer?
  given: "The Racket Guide\nVersion 5.3.4.11\nMatthew Flatt,\nRobert Bruce Findler,\nand PLT\nJune 9, 2013\nThis guide is intended for programmers who are new to Racket or new to some part of\nRacket. It assumes programming experience, so if you are new to programming, consider\ninstead reading How to Design Programs. If you want an especially quick introduction to\nRacket, start with Quick: An Introduction to Racket with Pictures.\nChapter 2 provides a brief introduction to Racket. From Chapter 3 on, this guide dives into\ndetails—covering much of the Racket toolbox, but leaving precise details to The Racket\nReference and other reference manuals.\n1"
  context...:
   C:\Program Files\Racket\collects\ffi\unsafe\alloc.rkt:27:0: deallocate
   C:\Program Files\Racket\collects\ffi\unsafe.rkt:2245:25
   C:\Program Files\Racket\collects\racket\private\more-scheme.rkt:266:2: call-with-exception-handler
   C:\Program Files\Racket\collects\ffi\unsafe.rkt:2232:17
cpointer-accessor: contract violation
  expected: cpointer?
  given: "The Racket Guide\nVersion 5.3.4.11\nMatthew Flatt,\nRobert Bruce Findler,\nand PLT\nJune 9, 2013\nThis guide is intended for programmers who are new to Racket or new to some part of\nRacket. It assumes programming experience, so if you are new to programming, consider\ninstead reading How to Design Programs. If you want an especially quick introduction to\nRacket, start with Quick: An Introduction to Racket with Pictures.\nChapter 2 provides a brief introduction to Racket. From Chapter 3 on, this guide dives into\ndetails—covering much of the Racket toolbox, but leaving precise details to The Racket\nReference and other reference manuals."
exception raised by exception handler: abort-current-continuation: contract violation
  expected: continuation-prompt-tag?
  given: [?error-value->string-handler not ready?]; original exception raised: cpointer-accessor: contract violation
  expected: cpointer?
  given: "The Racket Guide\nVersion 5.3.4.11\nMatthew Flatt,\nRobert Bruce Findler,\nand PLT\nJune 9, 2013\nThis guide is intended for programmers who are new to Racket or new to some part of\nRacket. It assumes programming experience, so if you are new to programming, consider\ninstead reading How to Design Programs. If you want an especially quick introduction to\nRacket, start with Quick: An Introduction to Racket with Pictures.\nChapter 2 provides a brief introduction to Racket. From Chapter 3 on, this guide dives into\ndetails—covering much of the Racket toolbox, but leaving precise details to The Racket\nReference and other reference manuals."
  context...:
   C:\Program Files\Racket\collects\ffi\unsafe\alloc.rkt:27:0: deallocate
   C:\Program Files\Racket\collects\ffi\unsafe.rkt:2245:25
   C:\Program Files\Racket\collects\racket\private\more-scheme.rkt:266:2: call-with-exception-handler
   C:\Program Files\Racket\collects\ffi\unsafe.rkt:2232:17
cpointer-accessor: contract violation
  expected: cpointer?
  given: "pdfTeX-1.40.3"
exception raised by exception handler: abort-current-continuation: contract violation
  expected: continuation-prompt-tag?
  given: [?error-value->string-handler not ready?]; original exception raised: cpointer-accessor: contract violation
  expected: cpointer?
  given: "pdfTeX-1.40.3"
  context...:
   C:\Program Files\Racket\collects\ffi\unsafe\alloc.rkt:27:0: deallocate
   C:\Program Files\Racket\collects\ffi\unsafe.rkt:2245:25
   C:\Program Files\Racket\collects\racket\private\more-scheme.rkt:266:2: call-with-exception-handler
   C:\Program Files\Racket\collects\ffi\unsafe.rkt:2232:17
cpointer-accessor: contract violation
  expected: cpointer?
  given: "LaTeX with hyperref package"
exception raised by exception handler: abort-current-continuation: contract violation
  expected: continuation-prompt-tag?
  given: [?error-value->string-handler not ready?]; original exception raised: cpointer-accessor: contract violation
  expected: cpointer?
  given: "LaTeX with hyperref package"
  context...:
   C:\Program Files\Racket\collects\ffi\unsafe\alloc.rkt:27:0: deallocate
   C:\Program Files\Racket\collects\ffi\unsafe.rkt:2245:25
   C:\Program Files\Racket\collects\racket\private\more-scheme.rkt:266:2: call-with-exception-handler
   C:\Program Files\Racket\collects\ffi\unsafe.rkt:2232:17
internal error: attempt to deschedule the current thread in atomic mode
exception raised by exception handler: abort-current-continuation: contract violation
  expected: continuation-prompt-tag?
  given: [?error-value->string-handler not ready?]; original exception raised: internal error: attempt to deschedule the current thread in atomic mode
  context...:
   C:\Program Files\Racket\collects\ffi\unsafe.rkt:2245:25
   C:\Program Files\Racket\collects\racket\private\more-scheme.rkt:266:2: call-with-exception-handler
   C:\Program Files\Racket\collects\ffi\unsafe.rkt:2232:17
GList->C: argument is not non-null `GList' pointer
  argument: [?error-value->string-handler not ready?]
exception raised by exception handler: abort-current-continuation: contract violation
  expected: continuation-prompt-tag?
  given: [?error-value->string-handler not ready?]; original exception raised: GList->C: argument is not non-null `GList' pointer
  argument: [?error-value->string-handler not ready?]
  context...:
   C:\Program Files\Racket\collects\ffi\unsafe\alloc.rkt:27:0: deallocate
   C:\Program Files\Racket\collects\ffi\unsafe.rkt:2245:25
   C:\Program Files\Racket\collects\racket\private\more-scheme.rkt:266:2: call-with-exception-handler
   C:\Program Files\Racket\collects\ffi\unsafe.rkt:2232:17
internal error: attempt to deschedule the current thread in atomic mode
exception raised by exception handler: abort-current-continuation: contract violation
  expected: continuation-prompt-tag?
  given: [?error-value->string-handler not ready?]; original exception raised: internal error: attempt to deschedule the current thread in atomic mode
  context...:
   C:\Program Files\Racket\collects\ffi\unsafe.rkt:2245:25
   C:\Program Files\Racket\collects\racket\private\more-scheme.rkt:266:2: call-with-exception-handler
   C:\Program Files\Racket\collects\ffi\unsafe.rkt:2232:17

As before, pressing Ctrl-C twice in this console window and closing it also closed the DrRacket editor.

So this means, as you said (given the above results), that the issue is OS-related.

Thank you again @soegaard for helping to pinpoint the source of this issue.

It dawned on me that the extra console window contains messages written to standard error. So I tried running the test program in a terminal, and I see similar errors. See below.

The missing title, author, etc. are not due to an error, though. If I open the document in PDF Expert, I see the same thing:

I use racket-poppler daily to generate SVGs from PDFs (via pdftex).

I'll investigate the cause of the g_free error. Maybe something has changed in GLib?

An error such as:

g_free: given value does not fit primitive C type
  C type: _pointer
  given value: "pdfTeX-1.40.3"

probably means there is a simple type error in one of the FFI function definitions.

/Jens Axel

racket test-pdf-functions.rkt
#t
#<pdf>
""
'((title "") (author "") (aubject "") (keywords "") (creator "LaTeX with hyperref package") (producer "pdfTeX-1.40.3") (page-count 364))
#<page>
'(612.0 792.0)
'(0.0 0.0 612.0 792.0)
"The Racket Guide\nVersion 5.3.4.11\nMatthew Flatt,\nRobert Bruce Findler,\nand PLT\nJune 9, 2013\nThis guide is intended for programmers who are new to Racket or new to some part of\nRacket. It assumes programming experience, so if you are new to programming, consider\ninstead reading How to Design Programs. If you want an especially quick introduction to\nRacket, start with Quick: An Introduction to Racket with Pictures.\nChapter 2 provides a brief introduction to Racket. From Chapter 3 on, this guide dives into\ndetails—covering much of the Racket toolbox, but leaving precise details to The Racket\nReference and other reference manuals."
"The Racket Guide\nVersion 5.3.4.11\nMatthew Flatt,\nRobert Bruce Findler,\nand PLT\nJune 9, 2013\nThis guide is intended for programmers who are new to Racket or new to some part of\nRacket. It assumes programming experience, so if you are new to programming, consider\ninstead reading How to Design Programs. If you want an especially quick introduction to\nRacket, start with Quick: An Introduction to Racket with Pictures.\nChapter 2 provides a brief introduction to Racket. From Chapter 3 on, this guide dives into\ndetails—covering much of the Racket toolbox, but leaving precise details to The Racket\nReference and other reference manuals.\n1"
'()
'((238.76 164.72937400000006 250.24267179999998 180.20601860000005) (250.24267179999998 164.72937400000006 259.8144342 180.20601860000005) (259.8144342 164.72937400000006 267.45807179999997 180.20601860000005) (267.45807179999997 164.72937400000006 271.7619218 180.20601860000005) (271.7619218 164.72937400000006 284.19144059999996 180.20601860000005))
g_free: given value does not fit primitive C type
  C type: _pointer
  given value: "The Racket Guide\nVersion 5.3.4.11\nMatthew Flatt,\nRobert Bruce Findler,\nand PLT\nJune 9, 2013\nThis guide is intended for programmers who are new to Racket or new to some part of\nRacket. It assumes programming experience, so if you are new to progr...
g_free: given value does not fit primitive C type
  C type: _pointer
  given value: "The Racket Guide\nVersion 5.3.4.11\nMatthew Flatt,\nRobert Bruce Findler,\nand PLT\nJune 9, 2013\nThis guide is intended for programmers who are new to Racket or new to some part of\nRacket. It assumes programming experience, so if you are new to progr...
g_free: given value does not fit primitive C type
  C type: _pointer
  given value: "pdfTeX-1.40.3"
g_free: given value does not fit primitive C type
  C type: _pointer
  given value: "LaTeX with hyperref package"
g_free: given value does not fit primitive C type
  C type: _pointer
  given value: ""
g_free: given value does not fit primitive C type
  C type: _pointer
  given value: ""
g_free: given value does not fit primitive C type
  C type: _pointer
  given value: ""
g_free: given value does not fit primitive C type
  C type: _pointer
  given value: ""
g_free: given value does not fit primitive C type
  C type: _pointer
  given value: ""

@encomer

Good news. I believe I have fixed the problem.
The GitHub repo contains the latest version.

The problem was related to wrappers used in a handful of functions.

An example:

; page-text : page -> string
;   return all text on page
(define-poppler page-text
  (_fun (p) :: [page-ptr : _PopplerPagePointer = (page-pointer p)]
        -> _string)
  #:wrap (allocator g_free)
  #:c-id poppler_page_get_text)

Removing the line

#:wrap (allocator g_free)

fixes the problem for now.
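
One possible reading of the g_free messages above, offered as an assumption rather than a confirmed diagnosis: with a `-> _string` result type, the value the wrapper sees is already a Racket string, and a string is not a C pointer, so it cannot be handed to a deallocator later. The sketch below illustrates that point and a "copy, then free" alternative without calling Poppler or GLib at all. Racket's own malloc/free in 'raw mode stand in for the C allocation and for g_free, and the helper names fake-c-string and pointer->string/free are hypothetical, not part of racket-poppler.

```racket
#lang racket/base
(require ffi/unsafe)

;; A Racket string is not a C pointer, so a deallocator such as g_free
;; cannot accept it.
(cpointer? "pdfTeX-1.40.3") ; => #f

;; fake-c-string : bytes -> cpointer
;; Stand-in for a C function that returns a freshly allocated,
;; NUL-terminated string (hypothetical helper for this sketch).
(define (fake-c-string bs)
  (define n (bytes-length bs))
  (define p (malloc (add1 n) 'raw))
  (memcpy p bs n)
  (ptr-set! p _byte n 0) ; NUL terminator
  p)

;; pointer->string/free : cpointer -> string
;; Copy the C string into a Racket string first, then free the pointer.
(define (pointer->string/free p)
  (define s (cast p _pointer _string/utf-8))
  (free p)
  s)

(pointer->string/free (fake-c-string #"The Racket Guide"))
```

In a real binding the same idea would mean declaring the result as a pointer, converting it to a string, and then freeing it explicitly; whether that is how the package was eventually fixed is not stated in this thread.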

Let me know if the new version works on your computer.

/Jens Axel

Thanks again @soegaard for working toward a solution to the problem.

I updated your racket-poppler package on my computer and then ran the original test-pdf-functions.rkt (with the original PDF files). The interactions window in DrRacket 8.11 gave the previous results, but again a DrRacket 8.11 console window opened, now with different content:

GList->C: argument is not non-null `GList' pointer
  argument: [?error-value->string-handler not ready?]
exception raised by exception handler: abort-current-continuation: contract violation
  expected: continuation-prompt-tag?
  given: [?error-value->string-handler not ready?]; original exception raised: GList->C: argument is not non-null `GList' pointer
  argument: [?error-value->string-handler not ready?]
  context...:
   C:\Program Files\Racket\collects\ffi\unsafe\alloc.rkt:27:0: deallocate
   C:\Program Files\Racket\collects\ffi\unsafe.rkt:2245:25
   C:\Program Files\Racket\collects\racket\private\more-scheme.rkt:266:2: call-with-exception-handler
   C:\Program Files\Racket\collects\ffi\unsafe.rkt:2232:17
internal error: attempt to deschedule the current thread in atomic mode
exception raised by exception handler: abort-current-continuation: contract violation
  expected: continuation-prompt-tag?
  given: [?error-value->string-handler not ready?]; original exception raised: internal error: attempt to deschedule the current thread in atomic mode
  context...:
   C:\Program Files\Racket\collects\ffi\unsafe.rkt:2245:25
   C:\Program Files\Racket\collects\racket\private\more-scheme.rkt:266:2: call-with-exception-handler
   C:\Program Files\Racket\collects\ffi\unsafe.rkt:2232:17

But if I comment out the last expression in test-pdf-functions.rkt, that is, the following code:

(pict->bitmap
 (for/fold ([pageview (scale (page->pict p) 0.5)])
   ([bounding-box (in-list (page-find-text p "the"))])
   (match-define (list x1 y1 x2 y2) bounding-box)
   ;; Each match's bounding box ^
   (pin-under pageview x1 y1
              (cellophane (colorize (filled-rectangle (- x2 x1) (- y2 y1)) "yellow") 0.5))))

Then the modified test-pdf-functions.rkt runs perfectly fine, with no problems and no DrRacket 8.11 console window opening.

As an extra exploration, if a call is made to the following function

(page-find-text p "the")

instead of the whole last expression (in the original file), then the last entry in the DrRacket 8.11 interactions window is:

'((238.76
   164.729374
   267.45807179999997
   180.2060186000001)
  (272.8258272
   229.27437440000017
   287.1362016
   239.96232320000013)
  (234.90877360000002
   406.2027312
   247.07310820000004
   415.2089216)
  (440.74999999999983
   406.2027312
   456.7997485999998
   415.2089216)
  (188.71194439999988
   418.1577312
   200.8862415999999
   427.1639216))

But the DrRacket 8.11 console window pops up again, with the same content as before.
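
For readers following along, each entry in the page-find-text result appears to be a list of the form (x1 y1 x2 y2), which is how the test program destructures it with match-define. A minimal sketch, using only plain Racket and two boxes copied from the output above, computes the width and height of each match; the helper name box-size is an assumption for illustration.

```racket
#lang racket/base
(require racket/match)

;; box-size : (list x1 y1 x2 y2) -> (list width height)
;; Hypothetical helper: size of one bounding box.
(define (box-size box)
  (match-define (list x1 y1 x2 y2) box)
  (list (- x2 x1) (- y2 y1)))

;; First two boxes copied from the output above.
(define boxes
  '((238.76 164.729374 267.45807179999997 180.2060186000001)
    (272.8258272 229.27437440000017 287.1362016 239.96232320000013)))

(for ([b (in-list boxes)])
  (printf "~a\n" (box-size b)))
```

These are the same differences that the test program passes to filled-rectangle when it highlights each match.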

Thank you @soegaard for your great support on this issue.

That abort-current-continuation error looks very curious. There seems to be a place where that function is called incorrectly in the FFI internals.

@encomer Thanks for testing the new version. It'll take a couple of days before I get to a Windows computer to test on.

@usao

That abort-current-continuation error looks very curious.
You have a good eye. This looks very odd:

internal error: attempt to deschedule the current thread in atomic mode
exception raised by exception handler: abort-current-continuation: contract violation
  expected: continuation-prompt-tag?
  given: [?error-value->string-handler not ready?]; original exception raised: internal error: attempt to deschedule the current thread in atomic mode
  context...:
   C:\Program Files\Racket\collects\ffi\unsafe.rkt:2245:25

@mflatt Should I file a bug report about the internal error?

@encomer

Yesterday I spent some time testing on both macOS and Windows 11.
I believe I have found and fixed the error that caused the problems.
Will you test the update?

Thank you once again @soegaard for your new update to the racket-poppler package.

I updated racket-poppler on my Windows 10 computer and ran the original file, test-pdf-functions.rkt, again. Now everything works fine (no more DrRacket 8.11 console windows with exception reports).

@soegaard I very much appreciate your persistence and technical expertise in eliminating the previously detected errors under Windows 10.

In the future, it would be very good to have some documentation for your very useful racket-poppler package.

Concerning text extraction, a special case is when the PDF consists mainly of text, but this text isn't selectable with the mouse. I understand that this case will require at least character recognition from bitmaps. Maybe this can be a future added functionality in racket-poppler or pdf-read.

Thank you again for your racket-poppler package and your support for Windows 10.

In the future, it would be very good to have some documentation for your very useful racket-poppler package.

Note: The racket-poppler package only provides a fraction of the functionality that Poppler has. What I use day to day is rendering picts as pdfs. I haven't touched the other parts of Poppler in years. I'll be happy to incorporate pull requests if others want to extend the bindings.

I understand that this case will require at least character recognition from bitmaps. Maybe this can be a future added functionality in racket-poppler or pdf-read.

I think it would be more straightforward to find a command-line utility that does OCR and then use system to run it from within Racket.

I've had good experience with ocrmypdf.
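
A minimal sketch of that idea, assuming ocrmypdf is installed and on the PATH and that its basic `ocrmypdf input.pdf output.pdf` invocation is enough. It uses system* (the argument-list variant of system) so that file names with spaces need no shell quoting. The function name run-ocr and the file names in the usage comment are hypothetical.

```racket
#lang racket/base
(require racket/system)

;; run-ocr : path-string path-string [string] -> boolean
;; Runs `<program> input output` and returns #t when it exits with code 0.
;; Returns #f when the program cannot be found on the PATH.
(define (run-ocr input output [program "ocrmypdf"])
  (define exe (find-executable-path program))
  (and exe
       (system* exe input output)))

;; Usage (hypothetical file names):
;;   (run-ocr "scan.pdf" "scan-ocr.pdf")
```

On Windows the executable name may need the extension, for example passing "ocrmypdf.exe" as the third argument. The output PDF should then have a text layer that page-text can read, though that combination is not something tested in this thread.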
