Pratt parsing is so easy, it should be criminal!

Hi, Racket Discourse.

I recently had to implement a parser for converting infix to prefix notation, with the purpose of converting user-provided queries into an expression that could be evaled to produce the desired result.

A perfect opportunity to implement a Pratt parser, since I can't remember having done one before. I followed this evergreen walkthrough by Eli Bendersky.

After reading through, I decided I wanted to test out struct generics, as properties seemed like a good fit for the different properties of the tokens used in the parser.

Although the result is a toy, I am quite pleased with how simple it was to implement, and glad that I have some of the generics stashed in a corner of my brain, now.

I almost want to kick myself for wasting so much time in school, implementing shunting-yard :sweat_smile:.


The gist of Pratt parsing comes down to three properties of the tokens in the expressions one is trying to parse. Namely:

  • The left-binding power (lbp) of the token, if any. This defines how tightly a token binds to its neighbor on the left, and determines where the parser decides to cut off a group of tokens.
  • The null-denotation (nud) of the token, if any. This defines what happens when a token is in the "unary" mode, such as a literal value, or - x (unary minus), for example. It does not take any arguments apart from the token itself.
  • The left-denotation (led) of the token, if any. This defines what happens when a token is in the "binary" mode, such as x + y for the addition token. It, on the other hand, does take another argument, namely, the left-hand side of the token, which in this case is x.

Additionally, the parser itself takes as an argument the right-binding power (rbp) of the current expression, which is used in tandem with the lbp to group together tokens.

The main driver of the parser, remarkably, is simply this:

;; the token sequence
(define tokens (make-parameter (list)))
;; the current head of the sequence
(define token  (make-parameter #false))
;; return the head and advance the sequence by one element
(define (next)
  (match {tokens}
    [(list) (end-token)]
    [(cons head rest)
     {tokens rest} head]))

;; the expression parser
(define (expression #;right-binding-power [rbp 0])
  (define head {token})
  {token (next)}
  (let loop ([left (nud head)])
    (cond
      [(<= (lbp {token}) rbp) left]
      [else
       (define head′ {token})
       {token (next)}
       (loop (led head′ left))])))

I wanted to mimic the flow of the walkthrough by Eli, and I didn't feel like managing a generator "globally": so, parameters were an easy fit for dealing with the token sequence. It also, as a happy accident, made it easy to parse sub-expressions.

Can't beat not having to worry about parsing parentheses, either, although this can be achieved easily with Pratt parsing.

As we can see from the loop above, we start by getting the first token from the sequence, which we call head, and compute its nud, calling it left. The rbp is also implicitly zero, on startup

Entering the loop, if the next token in the sequence has an lbp less than or equal to the current rbp, we return our left value. Otherwise, we re-enter the loop after setting left to be the result of applying the led of the next token in the sequence with left as an argument.

That's it. Crazy, right?

Of course, this is not the whole story, because expression is accessed by the nud and led of the tokens, when applied.

Below, we can see some examples which demonstrate this.

;; literal token
(struct lit-token [value] #:transparent
  #:property prop:nud
  (lambda (self) (lit-token-value self)))

;; end-of-sequence token
(struct end-token [] #:transparent
  #:property prop:lbp 0)

;; addition
(struct add-token [] #:transparent
  #:property prop:lbp 10
  #:property prop:nud
  (lambda (self) (expression +inf.0))
  #:property prop:led
  (lambda (self left)
    (define right (expression 10))
    `(+ ,left ,right)))

;; multiplication
(struct mul-token [] #:transparent
  #:property prop:lbp 20
  #:property prop:led
  (lambda (self left)
    (define right (expression 20))
    `(* ,left ,right)))

;; exponent
(struct pow-token [] #:transparent
  #:property prop:lbp 30
  #:property prop:led
  (lambda (self left)
    (define right (expression 29))
    `(^ ,left ,right)))

The lbps of the tokens are arbitrary, as long as their ordering is as desired. Note, however, how simple it is to modify the behavior of the parser by passing in an rbp into the expression.

If, for example, one wanted exponentiation to bind to the left, instead of the right, one might change the rbp from 29 to some value equal to or greater than 30 (its lbp).

'(^ x (^ y z)) versus '(^ (^ x y) z)

I didn't spend any time on lexing, since quoted lists of atoms were already sufficient for my use-case.

Wrapping it up, we have the parse procedure, which converts our atoms into tokens and starts the parser:

;; the entry-point
(define (parse atoms)
  (match-define (cons head rest)
    (for/list ([atom (in-list atoms)])
      (case atom
        [(+) (add-token)]
        [(-) (sub-token)]
        [(*) (mul-token)]
        [(/) (div-token)]
        [(^) (pow-token)]
        [else
         (if (list? atom)
             (lit-token (parse atom))
             (lit-token atom))])))
  
  (parameterize ([token  head]
                 [tokens rest])
    (expression)))

I also implemented flattening, because why not, although this is not shown here for brevity's sake.

You can find the full implementation in this gist.

I had fun with this, and I hope the simplicity encourages you to try it out yourself, too, if you haven't already.

4 Likes