`IR`

IR is an intermediate target language for Maru compilers with semantics similar to those of C. Some advantages compared to C are:

tree-structure
Program elements are tree structures that are trivial to generate, e.g., during parsing.

simpler type rules
Scalar and structured types correspond closely to hardware capabilities, but operations on them never perfom the implict promotions or coercions that C operators perform.

easily retargettable
Back-ends are easy to write and include generators for machine code (e.g., Intel x86) and for an equivalent C program (for portability and/or bootstrapping).

potentially interpretable
The tree structure is simple and equally conducive to interpretation, dynamic compilation, and static compilation.

compact representation
Programs could be linearised into a very compact external representation.

The remainder of this document presents the high-level (client) programming interface of IR and a small example of a compiled language built using it.

Typographic conventions

Code is set in a typewriter face. Symbols that should appear explicitly in code are set in bold. Symbols that are descriptive, standing in place of another symbol or expression, are set medium. Descriptive symbols representing optional items (that can be omitted) are given the subscript `opt'. Descriptive symbols representing several items (typically appearing at the end of a list) are followed by an ellipsis `...'.

High-level programming interface

Clients construct an object of type <ir> and then manipulate it to add program statements and then generate target code for them.

(ir-new)

creates a new, empty ir object. Where `ir' appears below, an instance of this type of object is indicated.

(ir-dump ir)

prints a symbolic representation of the program stored in the ir, which can be useful for debugging. The format of the information printed will be explained below.

(ir-gen ir target options_opt)

generates code for the program stored in the ir. The target controls what kind of output is generated. Possible values currently include:

<ir-backend-c> generates a C program.

<ir-backend-x86> generates an x86 binary program.

The details are currently beyond the scope of this document but options can control the target's data model (LP32, LP64), whether a symbolic listing of the generated code will be produced, whether binary code is to be generated in-memory or written to a file, whether the file is a linkable object or an executable, whether the file format should be ELF, Mach-O, etc...

(ir-put ir statements...)

adds one or more statements to the program stored in the ir.

The remainder of the programming interface consists of constants and constructors that are used to create the types and statements passed to ir-put. The constants representing types are as follows:

IR-VOID the type of an object of unknown type, or of an absence of objects (in a parameter list, for example)

IR-INT8 a byte, equivalent to C's char

IR-INT16 a halfword, equivalent to C's short on most architectures

IR-INT32 a word, equivalent to C's int and long on LP32 architectures

IR-INT64 equivalent to C's long long on LP32, and both long and long long on LP64

IR-FLOAT32 single-precision, equivalent to C's float on most architectures

IR-FLOAT64 double-precision, equivalent to C's double on most architectures

The constructors used to create derived types are as follows:

(ir-pointer-type type)
creates the type describing a pointer to (address of) an object of the given type.

(ir-function-type ir return-type argument-types..._opt)
creates the type describing a function that accepts arguments of the given argument-types and that returns an object of the given return-type. Note that
(ir-function-type IR-INT)
and
(ir-function-type IR-INT IR-VOID)
are not the same: the former can be called with any number and type of arguments, while the latter must be called with no arguments.

Several aliases and predefined derived types are provided for convenience:

IR-BOOL the type of a logical value, curently always IR-INT

IR-INT the natural width of an integer, curently always IR-INT32

IR-LONG the integer type having the same width as a pointer

IR-FLOAT the natural width of a float, currently always IR-FLOAT32

IR-STRING the type of a string, equivalent to (ir-pointer-type IR-INT8)

Note that the objects representing structurally-equal types are always identical. This applies to scalars, structures, pointers, and fuction types.

The constructors used to create statements are as follows:

(ir-lit value type_opt)

Creates a literal value. If type is omitted then it will be inferred from value, as follows:

literal value inferred type

any <long> IR-INT

any <double> IR-FLOAT

any <string> IR-STRING

(ir-def name type_opt value_opt)

defines a new global variable of the given type, initialised to the given value. If the value is omitted then the variable will be uninitialised. If the type is omitted then it will be set implicitly from the type of the value first stored (in program textual order) into the variable via ir-set, which must occur before any attempt to read the variable via ir-get.

(ir-get name)

reads the value stored in the variable with the given name.

(ir-get address)

reads the value stored in memory at the given address which must be of pointer type. The result is the referent type of address, unless that is an integer narrower than IR-INT in which case the value is zero-extended to make a result of type IR-INT. (Dereferencing a pointer to IR-VOID is not supported.)

(ir-set name value)

stores value in the variable with the given name. If the variable previously had no type, its type will be made that of value.

(ir-set address value)

stores value in memory at the given address which must be of pointer type. The width of the store operation is determined by the referent type of address which must be equal to the type of value, unless it is an integral type narrower than IR-INT in which case value is effectively (but not actually) truncated to the width of the store operation. (Storing through a pointer to IR-VOID is not supported.)

(ir-cvt type value)

converts value to the given type. Conversions can be performed between the following source and destination types:

from value of type to type

IR-INT32 IR-INT64

IR-INT64 IR-INT32

any pointer type IR-LONG

IR-LONG any pointer type

IR-FLOAT32 IR-FLOAT64

IR-FLOAT64 IR-FLOAT32

IR-FLOAT32 IR-INT32

IR-FLOAT64 IR-INT64

IR-INT32 IR-FLOAT32

IR-INT64 IR-FLOAT64

(ir-neg value)

the arithmetic negation of value, which must be of numeric type. The result has the same type as value.

(ir-com a)

the one's-complement bitwise inversion of value, which must be of integral type.

(ir-not value)

the logical negation of value, which much be of type IR-BOOL. The result is of type IR-BOOL.

(ir-add a b)

the sum of a and b. If a is a numeric type then b must be of the same type. If a is a pointer type then b must be of type IR-LONG and the result is the sum of a and b (scaled up by the width of the referent type of a) and of the same type as a. (If a is a pointer to IR-VOID then no scaling is performed.)

(ir-sub a b)

the difference between a and b. If a is a numeric type then b must be of the same type. If a is a pointer type and b of type IR-LONG then the result is the sum of a and b negated (and scaled up by the width of the referent type of a) and of the same type as a. If b is a pointer type then it must be identical to the type of a and the result is the difference bettwen the addresses a and b scaled down by the width of the referent type of a. (If a is a pointer to IR-VOID then no scaling is performed.)

(ir-mul a b)

the product of a and b which must be of the same numeric type.

(ir-div a b)

the quotient after division of a by b, which must be of the same numeric type.

(ir-mod a b)

the remainder after division of a by b, which must be of the same numeric type.

(ir-shl a b)

The result of shifting a left by b bits. The result has the same type as a. a must be of integral type and b must be of type IR-INT.

(ir-shr a b)

The result of shifting a right by b bits. The result has the same type as a. a must be of integral type and b must be of type IR-INT.

(ir-bitand a b)

bitwise `and' of a and b, which must be of the same integral type.

(ir-bitor a b)

bitwise `inclusive or' of a and b, which must be of the same integral type.

(ir-bitxor a b)

bitwise `exclusive or' of a and b, which must be of the same integral type.

(ir-lt a b)

compares a and b, which must be of the same numeric or pointer type, and produces a IR-BOOL that is non-zero iff a<b.

(ir-le a b)

compares a and b, which must be of the same numeric or pointer type, and produces a IR-BOOL that is non-zero iff a<=b.

(ir-eq a b)

compares a and b, which must be of the same numeric or pointer type, and produces a IR-BOOL that is non-zero iff a==b.

(ir-ne a b)

compares a and b, which must be of the same numeric or pointer type, and produces a IR-BOOL that is non-zero iff a!=b.

(ir-ge a b)

compares a and b, which must be of the same numeric or pointer type, and produces a IR-BOOL that is non-zero iff a>=b.

(ir-gt a b)

compares a and b, which must be of the same numeric or pointer type, and produces a IR-BOOL that is non-zero iff a>b.

(ir-logand a b)

evaluates the a and yields its result if zero, otherwise yields the result of evaluating b. Both a and b must be of type IR-BOOL.

(ir-logor a b)

evaluates the a and yields its result if non-zero, otherwise yields the result of evaluating b. Both a and b must be of type IR-BOOL.

(ir-if test consequent alternate)

evaluates test, which must be of type IR-BOOL. If the result is non-zero then consequent is evaluated to produce the overall result, otherwise alternate is evaluated to produce the result. consequent and alternate must be of the same type.

(ir-while test statements..._opt)

evaluates test, which must be of type IR-BOOL. If the result is non-zero then each of the statements is evaluated. The process repeats until the evaluation of test yields zero. There is no result and the type of the expression is IR-VOID.

(ir-seq statements...)

groups statements into a sequence that behaves as a single compound expression to the enclosing program structure. The result and type of the overall expression are those of the last statement. The group also delimits the scope of any local variables declared within statements via ir-var, whose lifetimes extend only to the end of the statements.

(ir-fun name type statements..._opt)

declares and defines a function. name must be a <symbol>, or nil (in which case the function is given a unique temporary name). type must be a function type, or nil (in which case it is inferred from the types of any function arguments declared within statements and the type of any ir-rets that occur within statements).

If name is non-nil then an entry will be created for it in the global symbol table, bound to an object of the given type, before any statements are considered. (This allows the function to be called recursively by name from within statements, provided type is non-nil.) The type of the function is also inferred from any ir-args and ir-rets that occur within statements. If type was non-nil then it must be congruent with this infferred type; if type was nil then the global binding for name is updated with the inferred type.

(ir-arg name type)

declares a function parameter with the given name and type. This statement must occur within an enclosing ir-fun, with each successive ir-arg corresponding to each successive function parameter from first to last.

(ir-var name type value_opt)

declares, and optionally initialises, a local variable with the given name and type whose scope begins with the declaration and extends to the end of the nearest enclosing ir-seq or ir-fun.

(ir-ret value_opt)

returns the given value from the nearest enclosing function. The type of value must correspond to the return type of the enclosing function. If value is omitted then the return value is undefined and of type IR-VOID. All occurences of it-ret within a function must have the same value type, and this must be the same as the fuction's return type (whether it was declared or is to be inferred).

(ir-call function arguments..._opt)

performs a call to function with zero or more arguments. function must be of function type (i.e., via an ir-get on the name of the function) or an expression whose type is a pointer to a function. The overall type of the expression is the return type of function.

(ir-ext name type options..._opt)

declares an external object of the given type. (The details of how name is resolved, and the interpretation of any options, are dependent on the kind of code that will be generated and currently beyond the scope of this document.)

(ir-addr name)

produces the address of name, which must be a symbol naming a variable that is currently in-scope. The type of the resulting address is a pointer to the declared type of the named variable.

Structures

Structure values and structure types can be used anywhere a scalar value of arbitrary type is allowed: variable declarations, initialisations and assignments, function parameters and arguments, etc. Structure types are tagged with a name (in a namespace distinct from normal variable definitions) and two functions are provided to declare and retrieve them:

(ir-def-struct ir name members..._opt): defines a structure type called name containing the fiven members. Each member must be a list whose elements are a type followed by one or more symbols naming structure members of that type.
(ir-struct-type ir name): yields the previously-defined structure type called name.

Two constructors are provided to create IR expressions that produce a new structure-valued object and deconstruct an existing structure value:

(ir-struct name values...): creates a structure value. The name must identify a previously-declared structure type and the values must produce objects whose types correspond (in order) to its declared member types.
(ir-member name value): produces the contents of a named field within a structure value. The structure must be a pointer to a object of structure type having a member identified by name. The overall type of the result is the declared type of the named member. Note that, because the structure value is a pointer, extracting a member from a structure-valued variable must be accomplished by first taking the address of that variable (with ir-addr) and then providing that address as the structure operand of ir-member.

Example `IR` program

Below is a complete example showing the construction and code generation of the popular benchmark function `nfibs'.


(require "ir.k")

(let ((ir (ir-new)))

  ;; the function prototype is required here because nfibs will call itself recursively
  (ir-put ir
    (ir-fun 'nfibs (ir-function-type IR-INT IR-INT)
      (ir-arg 'n IR-INT)
      (ir-ret
        (ir-if (ir-lt (ir-get 'n) (ir-lit 2))
          (ir-lit 1)
          (ir-add (ir-add (ir-call (ir-get 'nfibs) (ir-sub (ir-get 'n) (ir-lit 1)))
                          (ir-call (ir-get 'nfibs) (ir-sub (ir-get 'n) (ir-lit 2))))
                  (ir-lit 1))))))

  (ir-ext 'printf (ir-function-type IR-INT))

  ;; here the function prototype can be nil, and is inferred from the returned value
  (ir-put ir
    (ir-fun 'main ()
      (ir-call
        (ir-get 'printf)
        (ir-lit "%d\n")
        (ir-call (ir-get 'nfibs) 5))  ;; => 15
      (ir-ret (ir-lit 0)))

  (ir-gen ir <ir-backend-c>))

Example `IR`-based compiler

Below is a complete example showing a compiler for a simple Algol-60-like language that constructs an IR program during parsing and then generates code from it.


(require "ir.k")

(set peg-invoke-rule peg-invoke-rule-with-recursion)

(define-function param-list-types (pl)  (map car pl))
(define-function param-list-decls (pl)  (map (lambda (p) `(ir-arg ',(cadr p) ,(car p))) pl))

(println "#include ")

{
  expected      = .:what -> (error what " expected near: "(parser-stream-context self.source)) ;

  blank         = [\t ] ;
  eol           = "\n""\r"* | "\r""\n"* ;
  comment1      = "//" (&. !eol .)* ;
  commentN      = "/*" (&. !"*/" (commentN | .))* "*/" ;
  comment       = comment1 | commentN ;
  _             = (blank | eol | comment)* ;

  digit         = [0-9] ;
  higit         = [0-9A-Za-z] ;
  letter        = [A-Z_a-z] ;

  uinteger      = digit+ $#:x _                                     -> x ;
  integer       = "-"uinteger:x                                     -> (- x)
                |    uinteger
                ;

  ufloat        = (digit+ "."digit+ ("e"digit+)?)@$:s _             -> (string->double s) ;
  float         = "-"ufloat:x                                       -> (- x)
                |    ufloat
                ;

  number        = float | integer ;

  char          = "\\"  ( "t"                       ->  9
                        | "n"                       -> 10
                        | "r"                       -> 13
                        | "x" (higit higit) @$#16
                        | "u" (higit higit higit higit) @$#16
                        | .
                        )
                | . ;

  string        = "\"" ("\"\""->34 | !"\"" char)* $:x "\"" _        -> x ;

  idpart        = (letter (letter | digit)*) @ $$ ;
  identifier    = idpart:x !":" _                                   -> x ;

  prefix        = identifier:e                                      -> `(ir-get ',e)
                | number:e                                          -> `(ir-lit ,e)
                | string:e                                          -> `(ir-lit ,e)
                ;

  arglist       = expression?:a (","_ expression)*:b                -> `(,@a ,@b) ;

  primary       = prefix:a ( "("_ arglist:b ")"_                    -> `(ir-call ,a ,@b) :a
                           )*                                       -> a ;

  factor        = primary:a "*"_ factor:b                           -> `(ir-mul ,a ,b)
                | primary:a "/"_ factor:b                           -> `(ir-div ,a ,b)
                | primary:a "                | primary
                ;

  term          = factor:a "+"_ term:b                              -> `(ir-add ,a ,b)
                | factor:a "-"_ term:b                              -> `(ir-sub ,a ,b)
                | factor
                ;

  relation      = term:a "<" _ term:b                               -> `(ir-lt ,a ,b)
                | term:a "<="_ term:b                               -> `(ir-le ,a ,b)
                | term:a "=="_ term:b                               -> `(ir-eq ,a ,b)
                | term:a "!="_ term:b                               -> `(ir-ne ,a ,b)
                | term:a ">="_ term:b                               -> `(ir-ge ,a ,b)
                | term:a ">" _ term:b                               -> `(ir-gt ,a ,b)
                | term
                ;

  expression    = "if"_ "("_ expression:a ")"_
                  "then"_ expression:b
                  "else"_ expression:c                              -> `(ir-if ,a ,b ,c)
                | relation:a "="_ expression:b                      -> `(ir-set ,a ,b)
                | relation
                ;

  sequence      = "{"_ statement*:s "}"_                            -> `(ir-seq ,@s) ;

  statement     = sequence
                | "return"_ expression:e ";"_                       -> `(ir-ret ,e)
                | expression:e ";"_                                 -> e
                ;

  type          = "int"_                                            -> IR-INT ;

  param         = type:t identifier:i                               -> `(,t ,i) ;

  paramlist     = param?:p (","_ param)*:q                          -> `(,@p ,@q) ;

  fndecl        = type:t identifier:i "("_ paramlist:p ")"_
                   statement:e                                      -> `(ir-fun
                                                                          ',i
                                                                          (ir-function-type ,t ,@(param-list-types p))
                                                                          ,@(param-list-decls p)
                                                                          ,e) ;

  definition    = "extern"_ identifier:i ";"_                       -> `(ir-ext ',i (ir-function-type IR-INT))
                | identifier:i "="_ fndecl:e                        -> `(ir-def ',i () ,e)
                |                   fndecl
                | identifier:i "="_ statement:e                     -> `(ir-def ',i () ,e)
                |                   statement:e
                ;

  program       = _ (definition:d -> `(ir-put ir ,d))*:p
                    (!. | { expected "definition or expression" })  -> p ;

  # read a 'program' from the rest of the file

  program:p                     -> (eval `(let ((ir (ir-new ())))
                                            ,@p
                                            (ir-gen ir <ir-backend-c> 'main)))
}

int nfibs(int n)
  return
    if (n < 2)
    then 1
    else nfibs(n - 1) + nfibs(n - 2) + 1;

extern printf;

printf("%i\n", nfibs(5));

The example generates a C program which can be compiled and run. Assuming the above program is in a file called test-ir.k:

$ maru repl.k test-ir.k > c.c

$ cat c.c
#include <stdint.h>
int32_t nfibs(int32_t n) {
return ((n<2)?1:(nfibs((n-1))+(nfibs((n-2))+1)));
}
int main() {
printf("%i\012", nfibs(5));
return 0;
}

$ cc -o c c.c

$ ./c
15

BUGS

•	structure types are declared outside the program and are global (they should be local to the nearest enclosing scope, whether global or local)
•	structure tags are in a different namespace to regular variables (fake namespaces are easy to simulate in the client by prepending something unattainable as a program variable name)

TO DO

•	varargs
•	option to support closures, nested functions with free variables
•	option to support oop type and GC; cooperation between code generator and GC make memory management transparent
•	low-level programming interface, i.e: how to extend with new instructions, data types, code generators, etc.
•	explanation of output from \ttir-dump
•	example of code generation from within a parser (complete implementation of a small language)
•	find a real name to use instead of `\IR', and... do we \breally need all those `ir-` prefixes?

`<ir-backend-c>`	generates a C program.
`<ir-backend-x86>`	generates an x86 binary program.

`IR-VOID`	the type of an object of unknown type, or of an absence of objects (in a parameter list, for example)
`IR-INT8`	a byte, equivalent to `C`'s `char`
`IR-INT16`	a halfword, equivalent to `C`'s `short` on most architectures
`IR-INT32`	a word, equivalent to `C`'s `int` and `long` on LP32 architectures
`IR-INT64`	equivalent to `C`'s `long long` on LP32, and both `long` and `long long` on LP64
`IR-FLOAT32`	single-precision, equivalent to `C`'s `float` on most architectures
`IR-FLOAT64`	double-precision, equivalent to `C`'s `double` on most architectures

`IR-BOOL`	the type of a logical value, curently always `IR-INT`
`IR-INT`	the natural width of an integer, curently always `IR-INT32`
`IR-LONG`	the integer type having the same width as a pointer
`IR-FLOAT`	the natural width of a float, currently always `IR-FLOAT32`
`IR-STRING`	the type of a string, equivalent to `(ir-pointer-type IR-INT8)`

literal value	inferred type
any `<long>`	`IR-INT`
any `<double>`	`IR-FLOAT`
any `<string>`	`IR-STRING`

from value of type	to type
`IR-INT32`	`IR-INT64`
`IR-INT64`	`IR-INT32`
any pointer type	`IR-LONG`
`IR-LONG`	any pointer type
`IR-FLOAT32`	`IR-FLOAT64`
`IR-FLOAT64`	`IR-FLOAT32`
`IR-FLOAT32`	`IR-INT32`
`IR-FLOAT64`	`IR-INT64`
`IR-INT32`	`IR-FLOAT32`
`IR-INT64`	`IR-FLOAT64`

IR