Structs vs symbols for building an AST

I’m experimenting with having a Pollen project compile to an AST rather than directly to HTML-equivalent X-expressions.

I notice that in other libraries, such as the xml library and the new commonmark library, transparent structs are used to compose the AST rather than symbols and X-expressions (1, 2).

Can someone explain or point me to some resource that explains precisely what the benefit is of using structs this way? Is it simply to have added control over what elements are allowed where? Is there any performance benefit or penalty? What would be the downside to building or processing an AST as a simple X-expression relative to using structs?

5 Likes

I have used a similar approach for generating html for a while and like a lot.
The idea came from Eli Barzilay's scribble/html.

The idea is to define a constructor function for each html element.
The function a constructs links, p constructs paragraphs etc.

Having a constructor for each element has a couple of advantages.

First, construction of html elements can be done without juggling quotes, quasiquotes and unquotes.
As an example

(p "This is" (b "bold") ".")

will construct a paragraph.

Second, it's impossible to misspell html elements. An attempt to write

(p "This is" (bold "bold") ".")

will result in an error:

bold: undefined;
  cannot reference an identifier before its definition

Catching the error here as opposed to later in the browser is great.

Third, an constructor can validate the arguments. For example the line break element br can't contain text, so (br "foo") could generate an error.

An example

(define (color-table names rgb-strings)
  (table
   (for/list ([name names] [rgb  rgb-strings])
     (tr (td name) (td rgb)))))

(~x (color-table '("black" "red" "green") '("000000" "ff0000" "00ff00")))

The output is (linebreaks inserted by be to make it readable) on this forum.

"<table><tr><td>black</td><td>000000</td></tr>
 <tr><td>red</td><td>ff0000</td></tr> 
 <tr><td>green</td><td>00ff00</td></tr></table>"

The above list of advantages apply when each html element has its own constructor, which includes your situation. The representation used internally by scribble/html doesn't represent each element by its own structure kind though. Instead it has a structure element that is used to represent all element types. For programs that read html pages and analyze them, the actual representation matter. For web apps that only need to produce html, the actual representation is not a matter of concern (as long as it fast to produce a string representation from it).

A great idea also stolen from scribble/html is autoquoting of symbols ending in colon.
This is convenient when element constructors allow attributes:

(span style: "color: red;"   
   "This text is red.")

This constructor style works great with at-expressions.

The example:

    (p "This is" (b "bold") ".")

becomes

 @p{This is @b{bold}.}

The previous example becomes:

@span[style: "color: red;"]{This text is red.}

Due to a technicality instead of using the structs from scribble/html directly, I use urlang/html.
(I should probably have put it elsewhere since it is unrelated to the rest of Urlang).

A nice little utility from urlang/html is html->at-exp which does what it says on the tin.

> (require urlang/html)
> (html->at-exp "<p><b>foo</b></p>")
"@p{@b{foo}}"

Similarly, output from Neil Van Dyke's html-parsing library can also be converted to at-expressions.

> (require html-parsing)
> (xexp->at-exp (html->xexp "<p><b>foo</b></p>"))
"@p{@b{foo}}"

Finally ~x and ~xs are provided. They convert a single element and a list of elements respectively to a string.

With regards to performance: I have never measured. Construction of lists are fast, but so is structure construction.

2 Likes

"How should I represent this data" is surely one of the two or three most fundamental questions in program design, and in sexp-based languages that also have structs, this question comes up every single darn time you want to represent a simple piece of data.

The obvious first advantage of lists: constructors are built-in, there are many many many utility functions that work on lists, they print nicely by default, uniform access to fields is convenient (e.g.: do any of the fields of this structure contain the number 7?)

The obvious first advantage of structures: they can't be confused with each other, you can't create them with the wrong number of fields, they express intent to the reader of the code more clearly, field access is constant-time.[*]

No matter which one you choose, there are tools and techniques you can use to obtain most of the benefits of the other.

Speaking only for myself, I would say that I generally prefer the clarity of intent that comes with structures, but working with structures has some real problems in Racket; destructuring is painful in the absence of dot notation, and there's a note in my head that says that transparent structures don't work well with Typed Racket (this may have been a problem with serialization, and it may also be a problem that was resolved years ago, I should revisit that). Maybe Rhombus solves these problems?

John

[*] This is especially important when your structs have more than 10,000 fields. Sorry, yes, I'm joking.

2 Likes

I think a preference for x-expressions can arise from wanting the surface syntax to be nice for "authoring".

So if you look at xml, for example, having (struct element (name attributes elements)) you may understandably think, "Well! There is no way I want to be writing (element 'p '() (list "Some " (element 'b '() "bold") " text.")). :frowning_face: I prefer (p () "Some " (b () "bold") " text."), or even just (p "Some " (b "bold") " text")."

But as @soegaard shows, you can define a constructor procedure for each kind of element, and use s-expression or at-expression syntax, which is also very nice and less "quasiquotey". [Whether that evaluates to a struct or some other representation (even an x-expression? :slight_smile: ) behind the scenes is distinct.]

Whether the element names should be open or closed, depends. Closed means you can catch compile time errors, e.g. when someone uses a name not defined in HTML5. Open means you can handle HTML6 without needing to define new constructors. It depends what you want to emphasize. Anyway, I think the "open" preference here also nudges some people toward x-expressions (but again that's not the only option)?

3 Likes

cough, cough

#lang racket
(require struct-plus-plus)
(struct++ person ([name string?]) #:transparent)
(define bob (person++ #:name "bob"))
bob
(person.name bob)
(person.name (person 'alice)) ; also works if you don't like contract checking or keyword ctors

https://docs.racket-lang.org/struct-plus-plus/index.html

3 Likes