Finite state automaton for parsing

Hi,

on this end of week i decided to update a bit my infix parser to allow some expressions as in other language, i take in example Python:

Python 3.12.3 (main, Nov  6 2024, 18:32:19) [GCC 13.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> a=2
>>> b=3
>>> a*-b
-6

>>> 2*-++-+-++---+3
6

i admit the last example is a bit silly but it is a valid syntax, so why not allow those syntax in infix scheme?

The problem was in scheme you can write -3 but the parser do not allow -a for example. There was also other problems such as 2 * - 3 does not even look as an infix expression of the form operand1 operator operand2.

I thought it was something easy to do in a few lines, but quickly it becomes a nightmare, here the code i dropped and never finished:


;; (define sign '())
;; (define op-flag #f)

;; (define (parse-operators+- lst)
;;   ;; TODO add an arrow in schema-scheme+.jpg

;;   ;; advance in parsing
;;   (define (advance lyst)
;;     (cons (car lyst)
;; 	  (parse-operators+- (cdr lyst))))


;;   (define elem (car lst))
  
;;   ;; rewrite it with taking account only -, forget +
;;   ;; at the beginning we advance until we find an operator (this should not be too long... in 2nd position)
;;   (cond ((not op-flag)
;; 	 (when (operator-symbol-or-syntax? elem) ; (expr ...
;; 	   ;; here we search for an operator
;; 	   ;; (op ...
;; 	   ;; test for operator
;; 	   ;; expr  op + - - - + - + - expr
;; 	   ;;       ^ possibly here we are
;; 	   (set! op-flag #t))
;; 	 ;; but in all case we advance
;; 	 ;; expr  op + - - - + - + - expr
;; 	 ;; ^ or possibly here we are
;; 	 ;; advance
;; 	 (advance lst))

;; 	;; now we have op-flag set

;; 	;; but next depends of sign
	
;; 	;;   op expr
;; 	;;      ^ possibly here we are
;; 	;; not operator
;; 	((not (operator-symbol-or-syntax? elem))
;; 	 ;; advance
;; 	 (advance lst))

;; 	;; now we have an operator

;; 	;; erreur peut etre une sexpr
;; 	((and (not (ADD-op? elem))
;; 	      (not (MINUS-op? elem)))
;; 	 (error "ERROR : parse-operators+- : awaiting + or - operator,found something else."))

;; 	((ADD-op? elem)
;; 	 (advance lst))

;; 	((MINUS-op? elem)
;; 	       (set! sign (cadr lst))
;; 	       (cons (car lst)
;; 		     (parse-operators+- (cddr lst))))
;; 	     ;; (op expr ...
;; 	     (cons (car lst)
;; 		   (cons (cadr lst)
;; 			 (parse-operators+- (cddr lst))))))

;; 	;; here sign is not null
;; 	((or (operator? (car lst)) ; (op ...
;; 	     (operator-syntax? (car lst)))


and then i remembered of my university course on Finite state automaton , i knew they are used for regular expression, but for parsing too and i share this example as it could be usefull for someone else. The theory is very simple, you can search the internet for documents.

On this example i then restart my code from zero but before doing it, i decided to create an Automaton that do the parsing for me, a sort of map that i can follow to code the problem and solve it. It was then really easy, in a functional style with procedures, no more flag as each state is a sort of memory of the tracking of the parsing.

Here is the schema of the automaton of my parsing problem:

and the code was then so simple to implement, just follow the schema of the Automaton, each state being one procedure of the same name, the different arrows are the different case of the conditionals in code:


;; main entry of parsing +- (see parser+-.odg or jpeg image for the schema of automaton)
;; we want to parse infix expression like: 3 + - 4 and transform it 3 + (- 4)
;; like other languages, like Python do it for 3+-4 or others, bad but valid syntax as 3+---4 ,etc ....
(def (start-parse-operators+- lst)
  ;;(display "start-parse-operators+- : lst=") (display lst) (newline)
  (when (null? lst)
    (return lst))
  (define elem (car lst))
  (if (operator-symbol-or-syntax? elem)
      (cons elem ; we keep it in the resulting list
	    (first-operator (cdr lst))) ; and change to another automaton state
      (cons elem ; we keep it in the resulting list
	    (start-parse-operators+- (cdr lst))))) ; and stay in the same automaton state

(define (NO-OP? elem)
  (not (operator-symbol-or-syntax? elem)))

(define (error+- lst)
  (error "Error parsing +- : the sequence of operators has no mathematic signification, in the provided list : " lst))

;; we have find an operator and we check possibly another one following
(def (first-operator lst)
  (when (null? lst)
    (return lst))
  (define elem (car lst))
  
  (cond ((ADD-op? elem) ; we drop it from the resulting list
	 (first-operator (cdr lst))) ; and stay in the same automaton state
	((NO-OP? elem) ; should be a general sexpr
	 (cons elem ; we keep it in the resulting list
	       (start-parse-operators+- (cdr lst)))) ; but go back to the state start of the automaton
	((MINUS-op? elem) ; we drop it from the resulting infix list (but will be possibly integrated in a prefixed sub sexpr)
	 (loop-over+- (cdr lst))) ; go to another automaton state
	(else
	 (display "first-operator : elem=") (display elem) (newline)
	 (display "Error in first-operator : ")
	 (error+- lst))))

;; loop over the + - operators to find the final sub sexpr to create in the infix sequence
(def (loop-over+- lst)
  (when (null? lst)
    (return lst))
  (define elem (car lst))
  (cond ((MINUS-op? elem) ; we drop it because we have - - resulting in + which also can be dropped
	 (first-operator (cdr lst))) ; and we go back to the previous state of the automaton
	((ADD-op? elem) ; we drop it from the resulting list
	 (loop-over+- (cdr lst))) ; and stay in the same automaton state
	((NO-OP? elem) ; should be a general sexpr , so here we change his sign
	 ;;(display -) (newline)
	 (cons
	  (list '- ;(syntax -) ; '- : possible bug if we manipulate syntax object this should be (syntax -) and in R6RS could be also different
		elem)
	  (start-parse-operators+- (cdr lst)))) ; continue parsing with the inital automaton state
	(else
	 (display "Error in loop-over+- : ")
	 (error+- lst))))


note: def is just a macro that allow here to return at any point of a defined procedure.

and here is a few example in some infix curly expressions:

Welcome to DrRacket, version 8.14 [cs].
Language: reader SRFI-105, with debugging; memory limit: 8192 MB.

(define a 7)
(define b 3)
{a * - b}
-21

{5 -  - 2}
7


{5 - - - - + - - 2} ; the silly one :-)  
7

{- - 3}
3

Finite state automaton are very important, in the future i will use them to allow some more advanced syntax such as this one:

a ² - 2 a b + b ²

this could be implemented with a different automaton.

Perheaps the only thing that change it's that instead to allow 2 following operators it allows 2 following operands.

1 Like