1 Introduction
2 Compiling and running programs
2.1 Object files and shared libraries
2.2 Compiling the compiler
3 Syntax
3.1 Imperatives
3.2 Prototype definitions
3.3 Translation unit variables
3.4 Method definitions
4 Semantics
4.1 Blocks
4.2 Prototypes and objects
4.3 Everything is first-class
5 Pragmatics
6 The runtime system: introspection and intercession
6.1 Object layout and object pointers
6.2 Essential protocol of runtime objects
6.3 Runtime examples
7 Caveats and gotchas for Smalltalk programmers
8 Appendices
8.1 Compiler directives
8.2 Compiler types
9 Resources
It would be nice if it ran fast too. The last time I benchmarked
it I got about nine times Squeak speed, but this is likely to go down
(with increased generality and dynamism in the lowest levels of the
implementation) and up (with the sophistication of the
implementation), and over the long term things could go either way.
(The GC will probably have measurable impact on the performance of
'real'/long-running systems and applications too. Without profiling
it I'm not sure how well/badly the current conservative GC is holding
up.)
The search path for imported and included files can be extended
with the -I option (which can appear any number of times).
The -s option tells the compiler to generate a shared
library from the source file. (The resulting library can be loaded
into an already-running program.)
If all that sounds too complicated, ask me to make you a binary
distribution.
The compiler has been tested (and is known to work) on:
Programs are translated one source file (with zero or more
additional source files being imported) at a time. To the compiler,
this body of code is called a translation unit. The compiler
always processes one complete translation unit at a time, and
currently (this is a temporary limitation) a translation unit must
contain an entire program (all object and method definitions required,
with no external or unresolved references).
A translation unit consists of a sequence of definitions
and imperatives. Definitions either create a new
prototype or add a method to an existing prototype. Imperatives are
sequences of code that are executed in-order when the program is run.
Top-level imperatives can also take the form
2 Compiling and running programs
The fundamental object and messaging model is called 'Id'. The Id
compiler is called 'idc'. The hard-wired language of Pepsi
looks quite like Smalltalk and so the suffix '.st' was
hijacked for source files. The compiler compiles the files named on
the command line to create an executable whose name is derived from
the input files (by removing '.st' suffixes). The command
idc foo.st
compiles the file foo.st to create an executable file
called foo. The -o option overrides the default
name of the output file, if required.
The command
idc -I../st80 -I.../MyClassLibrary -o bar foo.st
builds the program bar from the source foo.st,
searching ../st80 and .../MyClassLibrary for
included files.
2.1 Object files and shared libraries
The default behaviour is to compile a single source file into an
object file and then link it into an executable program. The
-c option tells the compiler not to link the executable
program. The command
idc -c bar.st
compiles bar.st into the object file bar.o. Any
number of .o files can be linked when compiling an
executable program. The command
idc foo.st bar.o baz.o
compiles foo.st into the executable foo, combining it
with the previously compiled object files bar.o
and baz.o. (The examples/static directory
contains an example of linking multiple object files into a monolithic
program.)
The command
idc -s bar.st
compiles bar.st into the shared library bar.so
that can be loaded into a running program with the import:
directive. (The directory examples/dynamic contains an
example of importing shared libraries into a running program.)
2.2 Compiling the compiler
The compiler source directory contains several directories, as follows:
boot      A version of the idc compiler precompiled to C source files, used for bootstrapping the idc compiler.
doc       Documentation (including the file you are reading).
examples  A collection of small and large example programs.
gcX.Y     The conservative garbage collector used by the Id runtime.
idc       The source for the idc compiler itself (written entirely in idst).
lib       The source for the Id runtime library.
st80      A Smalltalk-like 'class' library.
To build the compiler, type
make
in the top-level directory (the one containing the directories listed
above). It should build the GC, runtime library, and then the
compiler itself. To install the compiler and runtime libraries,
become the superuser ('root') and type
make install
3 Syntax
Most of the syntax is the same as Smalltalk-80. Comments are contained
within double quotes:
"this is ignored"
The few minor additions to Smalltalk-80 syntax are to accommodate
compilation from plain text files, variadic blocks (and methods), an
expanded range of literal types, and direct access to non-printing
characters in Character and String literals.
3.1 Imperatives
A literal block can appear at the top-level (outside any other kind of
definition):
[ statements ]
The code within the block is executed at the moment 'control'
nominally reaches the block within the source file at runtime. This
is handy for initialising complex data structures (think of it as a
means to obtain behaviour similar to class initialisation methods) and
also for starting the whole program in motion at the end of the source
(something akin to a 'main' method, if you like).
Top-level imperatives can also take the form
{ directive optionalArguments... }
to direct the compiler to perform some unusual action. The most
commonly used directive is import:. The imperative
{ import: name }
asks the compiler to search for a file called 'name.st' and
make the global declarations within it available to the importing
program. The complete list of supported directives is given in
the appendix Compiler directives.
3.2 Prototype definitions
Prototype definitions take one of two forms. The first form:
name ( listOfSlots )
creates a new 'root' prototype (it has no parent, or 'delegate') and binds it to name. The prototype contains zero or more named slots, similar to instance variables. The definition could be read as: "name is listOfSlots".
Such a prototype has no useful behaviour (it can't even clone itself to create useful application objects). Adding a minimum of primitive behaviour (e.g., cloning) is the first thing you'll want to do to such an object.
The second form:
name : parent ( listOfSlots )
is similar, except the new prototype delegates to the named parent object and inherits the parent object's slots before adding its own. Such definitions could be read as: "name extends parent with listOfSlots".
(This is every bit as bogus as a single inheritance mechanism being used to share state and behaviour, but I'm still trying to figure out how to separate delegation from the sharing of state without sacrificing performance. Only allowing slots to be accessed by name in their defining prototype, forcing inherited slots to be accessed by message send, is probably the way to go. Better still, making all state accesses into message sends -- especially assignments.)
3.3 Translation unit variables
A definition of the form
name := [ expressions ]
creates a new variable with the given name and binds it to the value of the last expression. (The expressions are separated by periods, causing all but the last to become statements.)
3.4 Method definitions
Method definitions have the form
name pattern [ statements ]
where name identifies a prototype object (defined as described above), pattern looks (more or less) like a Smalltalk-80 message pattern, and statements is a block (notice the brackets) providing the behaviour for the method. The pattern component can be a unary, binary or keyword message pattern.
Extensions to Smalltalk's fixed-arity messages include additional and variadic formal arguments. Additional formal arguments for unary and keyword selectors are written like block arguments and can appear before or after the initial opening bracket. For example, two additional formal arguments could be written
name selector :arg1 :arg2 [ statements ]
name selector [ :arg1 :arg2 | statements ]
(where selector is a unary or keyword selector). Unary or keyword message sends can pass additional actual arguments by prefixing each additional argument with a colon. To ask the receiver to add two numbers:
Object add :x :y [ ^x + y ]
[ | sum | sum := self add :3 :4. ]
Variadic arguments can be attached to unary or keyword methods. This is indicated by an ellipsis in the message pattern immediately following the last named argument. The pattern for unary and keyword syntax therefore also includes:
name unarySelector ... [ statements ]
name keywords: arguments ... [ statements ]
(Simply for lack of time, there is currently no friendly syntax to recover the 'rest' arguments within the body of a message. Wizards, however, can easily recover these arguments by writing some low-level magic inside a method body.)
Block literals take one of the following forms:
[ statements ]
[ :arguments | statements ]
[ | temporaries | statements ]
[ :arguments | | temporaries | statements ]
Both arguments and temporaries are strictly local to the block and will not conflict (other than in name) with similarly-named arguments or temporaries in lexically disjoint blocks. The compiler currently disallows the shadowing of names.
(This means that you cannot set a method-level temporary by naming it as a block argument. It also means two blocks in the same method that share an argument or temporary name will each refer to a completely different value, regardless of the common name.)
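Assignment takes the form
identifier := expression
with the ':=' operator having the lowest precedence of any operator (including keyword message sends) and associating from right to left (so that a chained assignment such as 'x := y := 0' assigns 0 to y and then to x).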
Message sends take the usual Smalltalk-80 forms (unary, binary, keyword and cascade):
primary unarySelector
unaryMessage binarySelector unaryMessage
binaryMessage keywords: binaryMessages
receiver messageSend ; messageSend
(Treating the binary selectors differently, introducing several levels of implicit precedence based on the operator name to provide the traditional arithmetic order of evaluation, would also be a possibility.)
An extension to Smalltalk-80 syntax allows unary and keyword message sends to provide additional actual arguments. (See the discussion above on additional and variadic formal arguments.) The simplest possible change that would allow this is to drop the name part of a 'keyword' (but keep the colon):
receiver unarySelector : anonymousArgument
receiver keywords: arguments : anonymousArgument
with as many ': argument' pairs as required. (Anonymous arguments can only appear after a unary message or the arguments associated with a proper keyword; no further 'keyword: argument' pairs are allowed after the first ': anonymousArgument' that occurs in a keyword send.)
In addition to literal Arrays
#( elements... )
we also have literal WordArrays
#{ integers... }
and ByteArrays
#[ integers... ]
(where each integer must be between 0 and 255). In Array literals, nested Array, ByteArray and WordArray literals can appear without the initial '#' (although one can be supplied if you like).
Integer literals themselves are in decimal by default, with the usual
radixInteger r valueInteger
syntax supported. For the hackers out there, I saw no reason to avoid supporting
0xvalueInteger
for hexadecimal integers too. Digits greater than '9' in hexadecimal literals (in either of the above syntaxes) or in literals of any base greater than ten (in the 'r' syntax) can be specified using upper- or lower-case letters.
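The three integer literal forms above can be modelled with a short Python sketch. This is only an illustration of the syntax as described, not the idc scanner: the function name parse_integer_literal is hypothetical, signs are deliberately not handled, and the assumption that the radix itself is written in decimal is mine.

```python
import re

# Hypothetical model of the literal forms described above (not idc's scanner).
_HEX = re.compile(r'^0x([0-9A-Za-z]+)$')          # 0xvalueInteger
_RADIX = re.compile(r'^([0-9]+)r([0-9A-Za-z]+)$')  # radixInteger r valueInteger
_DECIMAL = re.compile(r'^[0-9]+$')                 # plain decimal


def parse_integer_literal(text):
    """Return the value of an unsigned integer literal, or raise ValueError."""
    match = _HEX.match(text)
    if match:
        return int(match.group(1), 16)
    match = _RADIX.match(text)
    if match:
        # Assumption: the radix is written in decimal. int() accepts
        # upper- and lower-case letters for digits greater than 9.
        return int(match.group(2), int(match.group(1)))
    if _DECIMAL.match(text):
        return int(text, 10)
    raise ValueError('not an integer literal: %r' % text)


values = [parse_integer_literal(s) for s in ('42', '16r1F', '16r1f', '0xff', '2r1010')]
```

Under these assumptions '16r1F', '16r1f' and '0x1f' all denote the same value, and a digit that is out of range for its radix (for example '2r102') is rejected with a ValueError.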
Smalltalk-80 Character literals are supported:
$character
as are non-printing Characters either by mnemonic or by explicit value (following the ANSI 'escape sequence' conventions):
syntax  asciiValue  ASCII designation
$\a     7           bel (alert)
$\b     8           bs (backspace)
$\t     9           ht (horizontal tab)
$\n     10          nl (newline)
$\v     11          vt (vertical tab)
$\f     12          np (new page, or form feed)
$\r     13          cr (carriage return)
$\e     27          esc (escape)
$\\     92          \ (a single backslash character)
(Extended mnemonic names such as '$\newline' for '$\n' could easily be supported too.) In the event that a non-printing character literal not in the above list is required, a generic octal escape is provided:
$\octalNumber
where octalNumber is precisely three (no more, no less) octal digits in the range '000' to '377' specifying the value of the Character. In other words, '$\n' and '$\012' are the same Character, and '$\000' is the 'nul' Character (ascii value zero).
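As a rough illustration of the Character literal rules above, the following Python sketch maps a literal to its character value. The function name char_literal_value and the decision to raise ValueError for anything else are assumptions made for the example; only the mnemonic table and the three-digit octal rule come from the text.

```python
# Mnemonic escapes and their values, taken from the table above.
MNEMONICS = {'a': 7, 'b': 8, 't': 9, 'n': 10, 'v': 11,
             'f': 12, 'r': 13, 'e': 27, '\\': 92}


def char_literal_value(literal):
    """Return the character value of a literal such as $a, $\\n or $\\012.

    Hypothetical helper: models the rules in the text, not the idc scanner.
    """
    if len(literal) < 2 or literal[0] != '$':
        raise ValueError('not a Character literal: %r' % literal)
    body = literal[1:]
    if body[0] != '\\':
        if len(body) != 1:
            raise ValueError('too many characters: %r' % literal)
        return ord(body)                      # plain $character
    escape = body[1:]
    if escape in MNEMONICS:
        return MNEMONICS[escape]              # mnemonic escape
    if len(escape) == 3 and all(c in '01234567' for c in escape):
        value = int(escape, 8)                # exactly three octal digits
        if value <= 0o377:
            return value
    raise ValueError('unknown escape: %r' % literal)


newline_by_name = char_literal_value('$\\n')
newline_by_octal = char_literal_value('$\\012')
```

Here newline_by_name and newline_by_octal are equal, matching the statement that '$\n' and '$\012' are the same Character; a two-digit escape such as '$\12' is rejected because the rule demands precisely three octal digits.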
String literals obey much the same rules as Smalltalk-80. Adjacent String literals:
'like''this'
are concatenated with an intervening single quote:
like'this
However, the conventions that apply to '\' in escaping single Character literals also apply to characters within a String. You could write a String literal that contains two lines, each terminated by a newline with the whole String terminated by a nul Character:
'like\nthis\n\000'
(I was very, very tempted to make consecutive String literals simply concatenate without the implicit intervening single quote, as in other languages that support juxtaposed String literals. I may yet change this so that single quotes inside Strings must be escaped
'like\'this'
to bring them into line with other languages. (Escaping the embedded single quote does already work just fine, but it isn't currently the unique means to introduce a single quote into a String -- which is a bug.) If you think that's bad, just consider that it took all my self control to avoid making Character literals look like 'a' 'b' and 'c', and Strings look like "abc" -- with some necessary change to comments too.)
Note: The 'character escape' rules above apply to Symbols too. If you want to write the literal symbol for the 'remainder on division' binary message, you have to say '#\\\\' (since the first and third backslash characters escape the second and fourth). I think this is a bug (character escapes should only be recognised if the Symbol is created from a String [so '#'\\\\' == #\\' would hold]) and intend to fix it sometime. In the meantime: beware!
4 Semantics
4.1 Blocks
Block contexts (activated BlockClosures) have strictly local arguments and temporaries. The value of an argument or temporary can never come into contact with, nor be affected in any way by, an enclosing lexical context. They are quite literally inaccessible. You cannot, for example, implicitly assign to a method temporary by naming it as a block argument.
BlockClosures can 'close-over' local state defined in a lexically-enclosing scope. In such cases, the closed-over state will be preserved on exit from the enclosing scope, leaving it accessible to future activations of blocks defined within that scope. Each time the defining scope is entered, fresh copies of closed-over state are created. (In other words, block closures 'see' the state associated with the activation in which they were created, rather than that associated with the closure in which they were created. Things like 'fixTemps' are completely unnecessary.)
All BlockClosures are first-class (they can be stored or passed upward for activation at a later time) although block activations are strictly LIFO, with no exceptions. (Your hardware really, really wants things to be this way.)
(For the terminally-curious: closed-over state, corresponding to any variables that appear 'free' within a lexically-nested scope, is stored in a heap-allocated 'state vector' independent of the defining method or block activation context. These state vectors persist for as long as there are reachable block closures that reference them -- either explicitly, as their defining context, or implicitly, by holding a reference to a free variable stored within the vector.)
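The closure behaviour described above (closed-over state survives the enclosing scope, and each entry to the defining scope gets fresh state) is shared by many languages. The following is a Python analogy, not Pepsi code; Python's closure cells play roughly the role the text assigns to the heap-allocated 'state vector', and the names make_counter and increment are made up for the example.

```python
def make_counter():
    # 'count' is free in 'increment', so it is closed over and
    # survives after make_counter has returned.
    count = 0

    def increment():
        nonlocal count
        count += 1
        return count

    return increment


first = make_counter()
second = make_counter()    # a new activation, hence fresh closed-over state

results = [first(), first(), second()]

# Python keeps the closed-over variable in a heap-allocated cell that
# lives as long as some closure references it.
first_state = first.__closure__[0].cell_contents
second_state = second.__closure__[0].cell_contents
```

After this runs, results holds 1, 2 and 1: the two closures created by separate activations do not share 'count', which is the property that makes tricks like 'fixTemps' unnecessary.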
There is currently one limitation: blocks containing non-local returns make no attempt to detect whether their defining method context has already returned. Attempting to return from a block whose method activation has already exited, rather than resulting in a friendly runtime error along the lines 'this block cannot return', will most likely provoke a segmentation fault and core dump. (This is really easy to fix; I'm just too lazy to deal with it right now.)
4.2 Prototypes and objects
Objects are created by being cloned, which creates an uninitialised shallow copy of the original object. By convention the 'reusable' object that you clone, to make a new object to be modified and otherwise abused, is the 'prototype' for its 'clone family'. All members of a clone family share the same behaviour (response to messages), including the 'prototype' at the head of the clone family. If you modify the behaviour of the prototype (or any other member of its clone family) then the behaviour of all members of the clone family (including that of the prototype) is modified, identically. This is something of a compromise between Lieberman-style prototypes (simple conventions, since there is no 'meta' class organisation to manage, but harder to implement efficiently) and class-instance systems (easier to implement efficiently, but imposing more complex organisational conventions on their surrounding systems).
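A small Python model may help make the 'clone family' idea concrete. It is a sketch under simplifying assumptions: the class names Family and Obj are hypothetical, and clone copies the slot values, whereas the text says a real clone is uninitialised.

```python
class Family:
    """Behaviour shared by every member of one clone family (toy model)."""

    def __init__(self):
        self.methods = {}


class Obj:
    def __init__(self, family, slots):
        self.family = family        # shared between clones, never copied
        self.slots = dict(slots)    # state private to this object

    def clone(self):
        # Shallow copy of the state; the behaviour (family) is shared.
        return Obj(self.family, self.slots)

    def add_method(self, selector, function):
        # Modifies the behaviour of the whole clone family.
        self.family.methods[selector] = function

    def send(self, selector, *args):
        return self.family.methods[selector](self, *args)


point = Obj(Family(), {'x': 3, 'y': 4})    # plays the role of the 'prototype'
other = point.clone()
other.slots['x'] = 6                        # does not affect 'point'

# Adding behaviour through any member changes it for every member,
# including the prototype at the head of the family.
other.add_method('sum', lambda self: self.slots['x'] + self.slots['y'])
sums = (point.send('sum'), other.send('sum'))
```

The method was added via 'other', yet 'point' responds to it as well, while the two objects keep separate slot values.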
In other words, a prototype (in the sense of the present discussion) is nothing more than an object that has been:
The definition
Foo : Point ()
is equivalent to:
" add 'Foo' to the set of visible named prototypes, then... "
Foo := ObjectMemory allocate: Point byteSize + N "size of Foo slots in bytes".
Foo methodDictionary: (MethodDictionary new parent: Point methodDictionary).
This results in a useful idiom for creating shared structures:
BadVisibilityZone : Dictionary ()
[
  (BadVisibilityZone := BadVisibilityZone new)
    at: 'Archer' put: #below;
    at: 'Warrior' put: #below;
    at: 'Sparrowhawk' put: #above;
    at: 'Cardinal' put: #above.
]
(although I'm not suggesting that this is either the best idiom or, by a long way, a secure and desirable one.)
Note 1: the explicit reinitialisation (by sending 'new') of the prototype is required since the implicit cloning in the prototype specification creates an uninitialised object (in all respects other than having a valid method dictionary installed in it).
Note 2: this kind of idiom rapidly grows too verbose and was the motivation for translation-unit variables. The above example can also be written:
BadVisibilityZone := [
  Dictionary new
    at: 'Archer' put: #below;
    at: 'Warrior' put: #below;
    at: 'Sparrowhawk' put: #above;
    at: 'Cardinal' put: #above;
    yourself
]
Point : Object ( x y )
Point new [ self := super new. x := 0. y := 0. ]
Point magnitude [ ^((x * x) + (y * y)) sqrt ]
The only 'bizarre' (or not, according to your perspective) thing about this is that any 'instance' of 'Point' will be able to create new 'Points' in response to 'new'.
Another possibility would be to create parallel hierarchies, with class behaviour defined in one and instance behaviour in the other.
Point : Object () "the 'class' side"
aPoint : anObject ( x y ) "the 'instance' side"
Point new [ self := aPoint clone. x := 0. y := 0. ]
aPoint magnitude [ ^((x * x) + (y * y)) sqrt ]
Compare two ways of writing 'new':
Point new [ self := self clone. x := y := 0 ]
Point new [ ^super new setX: 0 setY: 0 ]
(assuming the existence of 'setX:setY:'), although the former is: (a) cleaner, (b) more in keeping with 'prototype and clone' style (as opposed to 'class and instance' style), and (c) faster. The disadvantage is that 'super new' might not return a Point, after which assigning to 'x' and 'y' directly might not be a good idea. (Yet another reason to abolish direct manipulation of 'inherited' state within methods...)
The only 'special name' to which you cannot assign is 'super'.
(Actually, I never tried to assign to super. I don't think the Parser
will let you, but you might just be able to assign to 'self' by
calling it 'super'. Of course, the correct response to assigning to
'super' should be to dynamically re-parent 'self', but that's fraught
with semantic complications -- not to mention problems with
maintaining consistency in methods that access state directly. Again,
a great reason to get rid of it.)
In the meantime, primitive behaviour has to be hand-coded (by a
wizard) and inserted explicitly into the compiled code at the
appropriate point. Code appearing between braces '{...}' is copied
verbatim to the output. Such external blocks are legal
Here's a trivial example, showing how to send a 'Character' to the
'console', answering 'true' or 'false' depending on whether the
operation succeeded:
The header word is a pointer to the object's virtual table.
Message sends to the object are resolved (when not present in the
method cache) by sending 'lookup:' to the header object. This is the
only explicit relationship between an object and the value stored in
its header word.
Object pointers correspond to the address in memory of the first
slot of an object, one word beyond the object's header (_vtbl
pointer). In other words, the object header (containing the _vtbl
pointer) is in the word before the one referenced by the
object's oop. This is done to allow 'toll-free bridging' of idst
objects to C/C++ structs/classes, Objective-C instances, or to native
objects in any other language that does not use the same convention of
putting a header in the word before an object's address. Allocating
the idst _vtbl pointer before (e.g.) a C/C++/ObjC object effectively
'wraps' the foreign object in an 'invisible' idst object, whose layout
is identical to (and whose state is stored at the same address as)
that expected by the native implementation of the foreign object.
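The layout convention above (the object pointer addresses the first slot, and the header sits one word before it) can be modelled with a word-addressed toy heap. This Python sketch is illustrative only; the names memory, allocate, vtable_of and slot are hypothetical and do not correspond to anything in the Id runtime.

```python
memory = []    # toy heap: one list element per machine word


def allocate(vtable, slots):
    """Store the header word followed by the slots; answer the oop."""
    memory.append(vtable)      # header word (the _vtbl pointer)
    oop = len(memory)          # address of the first slot, one word past the header
    memory.extend(slots)
    return oop


def vtable_of(oop):
    # The header lives in the word *before* the one the oop refers to.
    return memory[oop - 1]


def slot(oop, index):
    # Slots start exactly at the oop, as a foreign struct would expect.
    return memory[oop + index]


point = allocate('PointVtbl', [3, 4])
empty = allocate('EmptyVtbl', [])
```

Code that knows nothing about the header can treat 'point' as the address of a plain two-word record, which is the 'toll-free bridging' property the text describes; only vtable_of looks at the word before the object.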
The Id runtime support is manifest in three mechanisms:
The initial underscore '_' implies that these objects are
primitive and not necessarily intended to be included in an end-user
object system. Many of their slot names have the same prefix,
implying that they store 'primitive' values useful for their state
only -- you cannot send messages to the values stored in these
slots. All other slots (without underscore prefix) contain references
to real objects to which messages can be sent.
_object is a singleton prototype that defines behaviour common to all
objects. This behaviour includes message lookup (dynamic binding),
which is achieved by sending (real) messages to the objects involved.
In other words, every single object created must eventually delegate
to _object, otherwise it would be impossible to interact with
(send messages to) that object. In yet other words, _object
is necessarily the parent of every other object in the
system. If you write
_vtable is a 'virtual table', similar to a MethodDictionary in
Smalltalk-80. Virtual tables map selectors to method implementations
for a particular clone family (one _vtbl is shared between all clones
in a given family). The bindings slot points to a
_vector of pointers to _assoc objects describing the
associations between _selectors and _closures. The
'_tally' slot contains the number of entries
in bindings. Finally, 'delegate' points to another
_vtable to which all unrecognised messages are delegated.
_vector is a one-dimensional array of _size object pointers.
(Note that the storage for the pointers is allocated in-line, in
the body of the _vector object itself.)
_assoc associates a key (often a _selector) with
a value (often a _closure describing the
implementation of the method associated with the key in
a vtable).
_selector is an interned (unique) string, much like a Smalltalk Symbol.
The selector itself is stored as a size (in bytes) and a
_name (a primitive array of bytes; i.e., of C type 'char
*').
_closure describes the implementation of a method.
The _method slot contains the address of native code
implementing the method's body. This
address is called when invoking a method, passing the
associated _closure as a 'hidden' argument.
The data slot is available implicitly (via the hidden
closure argument) for passing arbitrary persistent information to
the method being invoked.
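The relationship between virtual tables, associations and closures described above can be sketched in Python as follows. This is a toy model, not the runtime's implementation: the class and method names are hypothetical, selectors are plain strings rather than interned objects, and the closure is passed explicitly as the first argument to stand in for the 'hidden' argument.

```python
class Assoc:
    def __init__(self, key, value):
        self.key = key              # a selector
        self.value = value          # a closure


class Closure:
    def __init__(self, method, data=None):
        self.method = method        # callable standing in for native code
        self.data = data            # arbitrary persistent information


class VTable:
    def __init__(self, delegate=None):
        self.bindings = []          # vector of Assoc objects
        self.tally = 0              # number of entries in bindings
        self.delegate = delegate    # vtable that receives unrecognised messages

    def method_at_put(self, selector, closure):
        for assoc in self.bindings:
            if assoc.key == selector:
                assoc.value = closure
                return
        self.bindings.append(Assoc(selector, closure))
        self.tally += 1

    def lookup(self, selector):
        vtable = self
        while vtable is not None:
            for assoc in vtable.bindings:
                if assoc.key == selector:
                    return assoc.value
            vtable = vtable.delegate    # not found here: try the delegate
        return None


def send(vtable, receiver, selector, *args):
    closure = vtable.lookup(selector)
    # The closure itself is passed as the (normally hidden) first argument.
    return closure.method(closure, receiver, *args)


base = VTable()
base.method_at_put('describe', Closure(lambda c, self: c.data + str(self), 'base: '))
child = VTable(delegate=base)
inherited = send(child, 42, 'describe')
child.method_at_put('describe', Closure(lambda c, self: c.data + str(self), 'child: '))
overridden = send(child, 42, 'describe')
```

The first send is resolved through the delegate; after the child family gains its own binding, the same send resolves locally while the base table is unchanged.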
creates a new prototype (with an empty protocol) whose clones
delegate unimplemented messages to the receiver's family. This is the
only mechanism for creating a prototype hierarchy, including during
object/prototype initialisation at program startup. The source form
creates a new unique selector whose name is the given C string (a
primitive string, of type 'char *' ). This is the only mechanism for
creating new selectors. (The initial underscore in the selector is a
convention indicating that the argument '_cString' is a primitive,
non-object type.)
extends (or modifies) the protocol of the receiver's clone family.
Subsequent lookups of aSelector in the receiver's family will be
resolved to _aMethod. This is the only mechanism for adding protocol
to a prototype. The source form
answers (a raw pointer to the native code of) the method
implementing the response to aSelector within the receiver. This is
the only mechanism for performing message lookup (dynamic binding)
within the system. (Note that for performance reasons the results of
'lookup:' may be memoized by the runtime system. There is currently
no way to prevent this, meaning that a given _vtbl might only have one
chance to influence the meaning of a given message send. This is a
limitation [read: bug] and will be fixed soon.)
The implementation of the above methods (along with several
potentially useful auxiliary methods in the runtime classes) can be
found in the file 'Smalltalk/runtime.st'.
5 Pragmatics
The ABI (executable code conventions) is entirely C-compatible. The
intention is to integrate seamlessly with other
languages/applications, platform libraries and data types, without
having (in the vast majority of cases) to leave the object-message
paradigm.
provided that the code cannot be confused with a directive
('{ import ...') or WordArray literal ('#{...}').
A few things to note:
Character : Object
(
value "character's value as a Smalltalk integer"
)
Character putchar
{
return ((long)self->v_value & 1)
&& putchar((long)self->v_value >> 1) >= 0 ? v_self : 0;
}
The above example could be written to raise a 'primitive failed' error
on failure (more in keeping with traditional Smalltalk-80 primitive
methods):
Character putchar
[
| _code |
_code := value _integerValue.
{
if (putchar((long)v__code) >= 0) return v_self;
}.
" fall through to failure code... "
^self primitiveFailed
]
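The first putchar example tests '(long)self->v_value & 1' and then shifts the value right by one bit, which suggests that small integers are stored with a tag in the low-order bit. The following Python sketch shows that style of encoding; the representation is inferred from the C fragment rather than stated in the text, and the function names are hypothetical.

```python
def tag_integer(n):
    """Encode n as a tagged word: value in the upper bits, low bit set."""
    return (n << 1) | 1


def is_tagged_integer(word):
    # Mirrors the '& 1' test in the C code above.
    return (word & 1) == 1


def untag_integer(word):
    """Recover the integer, mirroring the '>> 1' in the C code above."""
    if not is_tagged_integer(word):
        raise ValueError('word is not a tagged integer')
    return word >> 1    # arithmetic shift, so negative values survive


code_for_A = tag_integer(65)
round_trip = [untag_integer(tag_integer(n)) for n in (0, 65, -5)]
```

Under this assumed encoding, a word whose low bit is clear is not an integer at all, which is why the primitive answers failure (0) instead of calling putchar on it.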
6 The runtime system: introspection and intercession
The only intrinsic runtime operation (in the sense that it is
inaccessible to user-level programs) is the 'memoized' dynamic binding
(of selectors to method implementations) that takes place entirely
within the method cache. Every other runtime operation (prototype
creation, cloning objects, method dictionary creation, message lookup,
etc.) is achieved by sending messages to objects, is expressed
entirely in idst, and is therefore accessible, exposed and available
for arbitrary modification by any user-level program.
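To illustrate what 'memoized' dynamic binding means in practice, here is a minimal Python sketch of a method cache sitting in front of a user-visible lookup. It is an assumption-laden toy (the names VTable, method_cache and bind are hypothetical, and the real cache's keying and replacement policy are not described in the text); it only shows why a cached binding hides later changes made behind the cache's back.

```python
class VTable:
    def __init__(self):
        self.methods = {}
        self.lookups = 0            # counts how often lookup is really invoked

    def lookup(self, selector):
        self.lookups += 1
        return self.methods.get(selector)


method_cache = {}                   # (vtable, selector) -> method


def bind(vtable, selector):
    """Answer the method for selector, consulting the cache first."""
    key = (vtable, selector)
    if key not in method_cache:
        method_cache[key] = vtable.lookup(selector)   # slow path, sent once
    return method_cache[key]


vt = VTable()
vt.methods['greet'] = lambda: 'hello'
first = bind(vt, 'greet')()

# Changing the table directly does not invalidate the cached binding.
vt.methods['greet'] = lambda: 'goodbye'
second = bind(vt, 'greet')()
```

Both calls answer 'hello' and lookup runs only once, so in this toy model the table gets a single chance to influence the meaning of the send.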
6.1 Object layout and object pointers
Objects have a single header word followed by zero or more bytes
corresponding to the named slots containing the state of the object.
6.1.1 Intrinsic objects
The intrinsic objects (all of them prototypes, accessible by name with
global visibility) form a small delegation hierarchy as shown below.
_object ()
_selector : _object ( _size _elements )
_assoc : _object ( key value )
_closure : _object ( _method data )
_vector : _object ( _size "indexable..." )
_vtable : _object ( _tally bindings delegate )
All objects delegate to _object. All _objects have
a virtual table (either implicit or explicitly stored one word before
the address of the object). Virtual tables contain _vectors
(one-dimensional fixed-size arrays) containing _associations
between _selectors and _closures.
A _closure stores a pointer to a method implementation
(executable native code) and a pointer to arbitrary data. The method
implementation receives the _closure in which it appears as
an 'invisible' first argument.
If you write
RootPrototype ()
to create a 'parentless prototype' then the runtime system will
tacitly convert this into
RootPrototype : _object ()
to ensure that you (and, more importantly, the system itself) can send
messages to RootPrototype (and its clones).
6.2 Essential protocol of runtime objects
Compiled code assumes the existence of responses to the following
messages:
Ambitious applications can therefore (amongst other tricks) redefine
'_object _methodAt:put:' and/or '_vtbl lookup:' to implement unusual
dynamic binding behaviour.
The prototype definition
Foo : Bar ()
is equivalent to creating a new name Foo and initialising it with an
object delegating to Bar:
Foo := [ Bar _delegated ]
(Note that this equivalence holds only when the new
object adds no slots to its delegate.)
(The '_delegate' message has an initial underscore to avoid
over-polluting the protocol of user prototypes derived from
_object.)
The method definition
Foo bar: baz [ ... ]
is equivalent to evaluating:
Foo _methodAt: (_selector _intern: "bar:") put: barMethod
where "bar:" is a primitive string ('char *')
and barMethod is the address of the native code implementing
'Foo _bar:'. (The initial underscore in the selector is to avoid
polluting _object's protocol.)
6.2.1 Intrinsic methods
6.2.2 Intrinsic functions
6.2.3 Process arguments
Three global variables are defined during initialisation giving access
to the command-line arguments and environment of the process:
For some examples of the above in use, search for '{' within the
library source code.
6.3 Runtime examples
The directory 'examples/reflect' contains code demonstrating
how to reimplement much of the runtime support described above with
equivalent userland implementations.
7 Caveats and gotchas for Smalltalk programmers
Numbers are signed (positive or negative) and the scanner is not
context-sensitive (the syntactic type of each token is uniquely
determined by its spelling, irrespective of its position).
8 Appendices
8.1 Compiler directives
In the imperative form
{ directive optionalArguments... }
the following directives are recognised:
8.2 Compiler types
The following compilerTypes must be associated with a concrete
type using a { pragma: type compilerType programType }
directive before the first use of the corresponding type of literal:
9 Resources
The COLA mailing list: http://vpri.org/mailman/listinfo/fonc.