Coke programming guide

$Id: coke.html.in 545 2006-11-06 18:04:41Z piumarta $
corresponds to examples/jolt in the idst-5.4 release


Contents:

1   Introduction

Jolt is a cardboard cut-out of a Coke: a higher-order, functional abstract syntax processor and assembler. It is included in the idst5.4 distribution in the 'examples/jolt' directory. The main difference between Jolt and Coke is in the back-end: Coke's back-end is written in Pepsi/Coke whereas Jolt avoids the problem by using the VPU as a portable back-end.

2   Compiling and installing

Type
make
in the examples/jolt directory of the idst release. If you want to exercise it, type
make run
or run
./main
for an interactive prompt.

3   Syntax

Coke programs are constructed from a small number of syntactic elements: constants (numbers, strings), identifiers (for variables and user-defined parse node rules), strings (arrays of characters), and lists (representing parse tree nodes and their children).

3.1   Numbers

Integers consist of an optional sign followed by one or more digits. Only decimal integers are supported.

Floating-point numbers consist of an optional sign, one or more digits, a decimal point, zero or more digits, and an optional exponent consisting of a single letter 'e' and an integer. Only decimal floating-point numbers are supported.

Note: Jolt does not currently support floating-point numbers.

3.2   Strings

Strings begin with a double-quote character '"' followed by zero or more characters and end with a double-quote character. The initial and final double-quote characters delimit the string but are not included in its contents. Backslash characters introduce 'escape sequences' within a string, as follows:
\a'bel' (ASCII code 7)
\bbackspace (ASCII code 8)
\eescape (ASCII code 27)
\fform feed (ASCII code 12)
\nnewline (ASCII code 10)
\rcarriage return (ASCII code 13)
\thorizontal tab (ASCII code 9)
\vvertical tab (ASCII code 11)
\nnnoctal character nnn (ASCII code 0nnn)

3.3   Identifiers

Identifiers consist of one or more letters, digits or characters from the set !#$%&*./:<=>?@\^_|~+- (where the initial character cannot be a digit). Ambiguity between identifiers and numbers are resolved in favour of the number.

3.4   Lists

Lists consist of an initial opening parenthesis '(', zero or more syntactic elements (as described in this section), and a terminating closing parenthesis ')'.

3.5   Comments

A semicolon (appearing outside a string literal) introduces a comment. The semicolon and everything following it on the same line is discarded by the scanner and ignored. Comments count as blank space.

3.6   Artificial sweeteners

Certain operators are so frequently used that they merit dedicated abbreviated syntax in the scanner. In the following quotation-related 'syntactic sugar', element can be any syntactic element described in this section:
'element
(single quote) is equivalent to (quote element)
 
`element
(backquote) is equivalent to (quasiquote element)
 
,element
(comma) is equivalent to (unquote element)
 
,@element
(comma followed by commercial at) is equivalent to (unquote-splicing element)
Interaction with (sending messages to) the objects that describe program structures and the compiler environment is supported by the following syntactic sugaring:
[ receiver selector [arguments...] ]
is equivalent to
( send (quote selector ) receiver [arguments...] )
(where no attempt is made to impose the correct number of arguments for the selector provided). Note that the syntactic sugar quotes the selector for you (if you need to compute the selector, use the send form directly). Note also that the send form and the [...] syntax have the receiver and selector in a different order.

For keyword selectors one more feature is provided:

[ receiver keyword1: argument1 keyword2: argument2 ... ]
is equivalent to
( send (quote keyword1:keyword2:... ) receiver argument1 argument2 ... )
with the individual keyword parts collected into a single selector. This feature makes unary, binary and keyword message syntax entirely compatible with traditional Smalltalk syntax.

4   Semantics

Semantics are apparent in the evaluation rules of syntactic elements (their run-time meaning or effect) and in the treatment of compound forms (lists with particular structure or contents) that have special (or user-defined) syntactic or semantic meaning during the compilation process.

4.1   Evaluation rules

Numbers evaluate to the primitive (non-object) representation of their value.

Strings evaluate to a primitive (non-object) pointer to an array of characters terminated with a nul character.

Identifiers always represent an r-value (some physical location associated with the identifier during the execution of the code in which the identifier appears). When they appear outside special syntactic forms they evaluate to the associated l-value -- the value stored in the physical location associated with the identifier.

Lists (that are not special syntactic or semantic forms, as described below) represent the application of a function value to zero or more argument values. The form

( function [arguments...] )
evaluates the function and arguments (if present) in an unspecified order. The result of evaluating function must be a function (native code) address that is called with the evaluated arguments.

4.2   Intrinsic operators

Certain built-in operators behave like functions but are not evaluated as identifiers. (Their arguments are evaluated exactly as descried above.)

The intrinsic arithmetic and relational operators are as follows:

( + args... )addition of two or more arguments
( - arg )negation of a single argument
( - args... )subtraction of two or more arguments
( * args... )multiplication of two or more arguments
( / args... )division of two or more arguments
( % args... )remainder after division of two or more arguments
( < args... )non-zero iff first argument less than second argument
  <= >= >   ... idem for other (in)equalities
( == args... )non-zero iff arguments identical
( != args... )non-zero iff arguments not identical

Pointers behave like (long) integers. Pointer dereferencing is provided for the traditional machine types:

( char@ address [index] )
( short@ address [index] )
( int@ address [index] )
( long@ address [index] )
evaluates to the integer (char, short, int or long, sized according to local platform conventions) stored at the given address in memory. If the optional index is present, it is first scaled by the size of the addressed value (typically by 1, 2, 4 or 8) and then added to the base address to yield the effective address for the operation. The related forms
( set-char@ address [index] value )
( set-short@ address [index] value )
( set-int@ address [index] value )
( set-long@ address [index] value )
write value into a location whose effective address is calculated as just described. (These 'setter' forms need not be written explicitly; see the set form described below.) The form as a whole yields value as its result.

4.3   Special forms

Special forms are list structures that disobey the usual evaluation rules for lists, typically by not evaluating (or by evaluating unusually) one or more of their elements.
( and args... )
evaluates each of the args from left to right until one of them yields zero; args to the right of the first zero value are not evaluated. The result of the overall form is non-zero iff all args yield non-zero.
 
( or args... )
evaluates each of the args from left to right until one of them yields non-zero; args to the right of the first non-zero value are not evaluated. The result of the overall form is zero iff all args yield zero.
 
( if condition consequent [alternate] )
evaluates the condition and then either the consequent (if the condition was non-zero) or the alternate (if present). The result of the overall form is the result of whichever of consquent and alternate is evaluated. If alternate is not present and condition is zero then the overall result is zero.
 
( while condition [args...] )
while the condition evaluates to non-zero, all the args are evaluated in order, left to right. This process continues until condition evaluates to zero, causing the overall form to yield zero.
 
( let ( [bindings...] ) [args...] )
evaluates each of the args in order, left to right, in an environment augmented by the result of evaluating the bindings. Each binding has the form
( identifier value )
where value is evaluated according to the usual rules before being bound to identifier during the evaluation of the args. The identifiers are added to the environment before evaluating and binding their values (which is a bug and will be fixed sometime in the near future). The overall result is the value of the last of the args; if no args are present then the overall value is zero.
 
( set identifier value )
evaluates value and then assigns it to the location associated with the r-value identifier.
 
( set ( identifier args... ) value )
does not cause any evaluation. It simply rewrites itself in the form
( set-identifier args... value )
and re-evaluates the resulting form. This permits any field or slot 'getter' to be converted into a corresponding 'setter' by enclosing the getter expression in a set form. For example, the expression
( set ( long@ address ) 42 )
is rewritten as
( set-long@ address 42 )
by the compiler and then immediately re-evaluated to effect the assignment.
 
( lambda ( [formal...] ) [expr...] )
evaluates to the address of a function accepting zero or more formal arguments, where each formal must be an identifier.

When applied, the function value of a lambda expression evaluates each of its exprs in order, from left to right, in an environment in which each formal names a variable bound to a corresponding actual argument supplied in the function application. The value of the applied function is the value of the final expr; if no expr is present then the function returns zero.

Note: The evaluation of a lambda yields a function that can be applied; the exprs are not evaluated until the function result of lambda is applied to zero or more actual arguments as explained for the evaluation of lists above. A given function value can be applied any number of times and obeys the local platform's C ABI conventions; it can therefore be passed as a callback 'function pointer' to statically-compiled library and OS routines.

( return value )
appearing in the body lambda function evaluates value and then immediately returns control to the function's caller yielding value as the result of the function application.
 
( define identifier value )
evaluates value, creates a variable binding for identifier in the current scope, and assigns the value to the variable. If the form appears as a top-level expression then the binding is created in the current global environment. If the form appears within a let or lambda expression then the binding is created as a local (automatic) variable within that expression.
 
( syntax identifier function )
is similar to define except that the function value must evaluate to an applicable function address. The identifier names a new abstract syntax tree node type with the function providing the compilation rules for any node of the newly-created type. (The scoping rules for the binding are identical to those of define). See the section below on abstract syntax definitions for further explanation of the function argument.
 
( quote element )
evaluates to an object corresponding to the compiler's internal representation for element. (This object can be incorporated into abstract syntax trees passed to the compiler for evaluation and compilation.)
 
( send message receiver [args...] )
evaluates message, receiver and args (if present). The message must evaluate to an identifier and receiver to a compiler object. The message identifier is then sent as a message to the receiver object along with any args supplied. The overall value of the form is the (object) answered by the message send.

5   Primitives

Primitives appear as applicable function addresses bound in the top-level (global) environment. Only one primitive is provided:
( _dlsym handle symbol )
evaluates symbol to yield a string and handle to yield an integer representing a library or process namespace. The primitive returns the address of the code or data location associated with the symbol in the namespace.

Which namespaces are searched depends on the handle parameter. If handle is zero then every namespace associated with the process and its dynamic libraries is searched in the order they were loaded. If handle is associated with a library opened with the library function dlopen then only that library (and any libraries it depends on) are searched for the symbol.

The _dlsym primitive returns zero if symbol cannot be found, and sets an error condition which may be queried with the dlerror library function.

Note: the library functions dlopen and dlerror must be imported explicitly. For example:
(define _dlopen  (_dlsym 0 "dlopen"))
(define _dlerror (_dlsym 0 "dlerror"))
(define printf   (_dlsym 0 "printf"))
(define exit     (_dlsym 0 "exit"))

(syntax begin
  (lambda (form)
    `(let () ,@[form copyFrom: '1])))

(define dlopen
  (lambda (name)
    (let ((handle (_dlopen name 0)))
      (if handle
          handle
          (begin
            (printf "Error: dlopen: %s\n" (dlerror))
            (exit 1))))))

6   Abstract syntax definitions

Abstract syntax tree node types defined with the syntax form cause their associated function to be invoked whenever the compiler encounters the corresponding node type during compilation. The function is passed two arguments: the abstract syntax tree node that is currently being compiled (necessarily a list beginning with a symbol corresponding to the node type name) and the compiler (an object) that is compiling it. A syntax definition therefore typically has the following form:
(syntax type (lambda (node compiler) action...))
where type is the name of the defined node type and action... are statements/expressions controlling the compilation of the node.

The actions are responsible for compiling the node. Three mechanisms are available to the actions:

Any number of the above mechanisms can be combined to achieve the desired implementation. If the function completely generates the code required for the ndoe it can return zero to inform the compiler that no further compilation is needed for the node. In this case the function must generate (directly or indirectly) precisely one value for the node.

6.1   Compiler back-end interfaces

Several back-end interfaces are available and/or planned. The Jolt prototype uses a VPU-like interface that generates code for PowerPC and Intel processors. The VPU instance generating code for the compiler passed to a syntax function can be retrieved by sending the compiler argument the message vpu:
(syntax foo
  (lambda (node compiler)
    (let ((vpu [compiler vpu]))
      ...)))
The vpu implements an abstract machine supporting a superset of C semantics. Three stacks are used. The vpu presents the following interface:
Management
[ vpu new ]
creates a new instance of a vpu. (This is normally not of interest, since the vpu passed to the syntax function indirectly through the compiler argument will be the correct instance to use for code generation.)

[ vpu iTmp: numTemps ]
[ vpu iTmp ]
pushes numTemps integer-valued temporary variables onto the variable stack. In the second form, the missing numTemps is assumed to be one.

[ vpu dropTmp: numTemps ]
removes the topmost numTemps variables from the variable stack.

[ vpu begin: numLabels ]
pushes numLabels local labels onto the label stack.

[ vpu end: numLabels ]
removes the topmost numLabels local labels from the label stack.

[ vpu define: n ]
defines the nth local label (relative to the topmost) in the label stack to refer to the current 'program counter' location. A local label must not be defined more than once. A local label can be left undefined provided it is never used as the target of a branch instruction.

Function prologue and epilogue
[ vpu iEnter ]
declares the prologue for an integer-valued function. This must be the first (non label defining) instruction in a function.

[ vpu iArg: numArgs ]
[ vpu iArg ]
declares numArgs integer-valued arguments. Arguments must be declared immediately after the iEnter prologue and before any other instructions in the function body. In the second form, the missing numArgs is assumed to be one.

[ vpu ret ]
returns the topmost item on the stack as the result of the function being compiled. A ret always pops one item from the stack (the value returned), can occur any number of times in the body of a function, but must also be the last instruction in the function.

Data movement
[ vpu ldInt: integer ]
pushes integer (which must be an integer object value) onto the arithmetic stack.

[ vpu ldPtr: object ]
pushes the address of the given object onto the arithmetic stack.

[ vpu ldArg: n ]
pushes the nth argument of the function being compiled onto the arithmetic stack. Arguments are numbered from zero.

[ vpu stArg: n ]
stores the topmost item in the arithmetic stack into the nth argument of the function being compiled. The item is not removed from the stack.

[ vpu ldTmp: n ]
pushes the contents of the nth local variable (relative to the top of the variable stack) onto the arithmetic stack.

[ vpu stTmp: n ]
stores the topmost item in the arithmetic stack into the nth local variable. The item is not removed from the stack.

[ vpu dup: n ]
[ vpu dup ]
pushes a copy of the nth item in the arithmetic stack (relative to the top of the stack before the value is pushed) onto the top of the arithmetic stack. In the second form, n is implicitly zero (i.e., the topmost item in the stack is duplicated).

[ vpu drop: n ]
[ vpu drop ]
pops the topmost n items off the arithmetic stack. In the second form, n is implicitly one.

Arithmetic and relational operators
[ vpu add ]
[ vpu sub ]
[ vpu mul ]
[ vpu div ]
[ vpu mod ]
replaces the topmost two items on the arithmetic stack with their sum, difference, product, quotient or remainder (respectively).

[ vpu lt ]
[ vpu le ]
[ vpu eq ]
[ vpu ne ]
[ vpu ge ]
[ vpu gt ]
replaces the topmost two items on the arithmetic stack with zero or one according to whether the topmost item is less than, less or equal to, equal to, not equal to, greater or equal to, or greater than (respectively) the second item.

Memory access
[ vpu rdb ]
[ vpu rdh ]
[ vpu rdw ]
replaces the topmost item on the arithmetic stack with the byte, half word or word (respectively) at that address in memory.

[ vpu wrb ]
[ vpu wrh ]
[ vpu wrw ]
writes the second item on the arithmetic stack into the memory byte, half word or word (respectively) addressed by the topmost item in the stack. The topmost item is then popped from the arithmetic stack.

Control flow
[ vpu br: n ]
transfers program control to the instruction immediately following the point of definition of the nth local label (relative to the top of the local label stack).

[ vpu bf: n ]
[ vpu bt: n ]
pops the topmost item off the arithmetic stack and conditionally transfers program control to the instruction immediately following the point of definition of the nth local label (relative to the top of the local label stack). The bf instruction transfers control to the label's address only if the topmost item popped from the stack was zero. The bt instruction transfers control to the label's address only if the topmost item popped from the stack was non-zero.

[ vpu iCall: n label: label]
[ vpu iCalli: n ]
pops n actual arguments from the arithmetic stack and calls an integer-valued function, leaving the result of the function call on the top of the arithmetic stack. In the first form the address to call is retrieved from a global label (see the section on global labels below). In the second form the address to call is popped off the arithmetic stack before the arguments (i.e., the arguments are pushed followed by the address to call in preparation for an iCalli instruction). Arguments are pushed in 'reverse' order, starting with the last and working towards the first.

Compilation and miscellaneous information
[ vpu compile ]
compiles an entire function, as described by the sequence of message sends previously sent to the receiver, into native code. This message must never be sent more than once to a given instance of the vpu.

[ vpu codeSize ]
answers an integer object representing the total number of native bytes compiled by any vpu in the current process since the process began executing.

[ vpu versionString ]
answers a string object containing a human-readable version string for the vpu library linked into the running process.

6.1.1   Global labels

When sent a compile message the vpu answers with the address of the first instruction in the native code implementation of its function. However it is much more convenient to define a global label associated with the entry point for the function: The compiler provides the object type VPULabel for this purpose.

[ VPULabel new ]
answers a new, uninitialised global label.

[ vpuLabel lookup: symbol]
defines the receiver as the address associated with the code or data object named by symbol (a string object) looked up in the process global namespace via _dlsym (see the section on primitives).

[ vpu defineLabel: label]
defines the given global label to refer to the address of the next instruction generated in the receiver.

(See also the iCall:label: instruction described earlier.) Once defined, the function address associated with a global label can be retrieved or invoked as follows:
[ vpuLabel _labelAddress ]
answers a primitive pointer equal to the defined address of the receiver.

[ vpuLabel value ]
answers the result of calling the defined address of the receiver as a function with no arguments.

[ vpuLabel value: argument ]
answers the result of calling the defined address of the receiver as a function with a single argument.

7   Resources

The COLA mailing list: http://vpri.org/mailman/listinfo/fonc.