Pepsi -- not quite The Real Thing

$Id: pepsi.html 256 2005-03-12 00:23:39Z piumarta $


Contents:
  1. Introduction
  2. Compiling and running programs
  3. Syntax
  4. Semantics
  5. Pragmatics
  6. The runtime system: introspection and intercession
  7. Caveats and gotchas for Smalltalk programmers

Introduction

This is a cardboard cut-out of a prototype-based language similar to Smalltalk. It is code-named 'Pepsi' (in deference to The Real Thing, code-named 'Coke'). It is intended:

Hopefully it will also serve (at some time or another) to demonstrate that:

But mostly I am fed up of battling with C++ (and its ridiculous over-educated type system) and want a platform in which 'Coke' development can continue unhindered by type 'safety'. (As of the instant the 'Pepsi' compiler successfully compiled itself, I hope never to write another line of C++ in my life.)

It would be nice if it ran fast too. The last time I benchmarked it I got about five times Squeak speed, but this is likely to go down (with increased generality and dynamism in the lowest levels of the implementation) and up (with the sophistication of the implementation), and over the long term things could go either way. (The GC will probably have measurable impact on the performance of 'real'/long-running systems and applications too. Without profiling it I'm not sure how well/badly the current conservative GC is holding up.)

Compiling and running programs

The compiler compiles the files named on the command line to the standard output (it prints the compiled code on the terminal). The '-o' option redirects the output to a file:
idst -o foo.c foo.st
The search path for 'import'ed files can be augmented with the '-I' option (which can appear any number of times):
idst -I../Smalltalk -I../MyClassLibrary -o foo.c foo.st

Compiling the compiler

The compiler source directory contains several directories, as follows:
Smalltalk  The '.st' files containing the Smalltalk kernel classes.
etc        An 'idst-mode' for Emacs that mostly works.
example    A cheezoid example program, intended mainly as a template for your own programs.
gcX.Y      The conservative garbage collector (courtesy of Hans Boehm).
idst       The source for the idst compiler itself (written entirely in idst).
xidst      The source for the bootstrap idst compiler (written in C++).
To build the compiler you can just type 'make' in the top-level directory (the one containing the directories listed above). It should build the GC, runtime library, and then the compiler itself, followed by the example program (just to make sure the compiler is working).

For the record, I have tested the compiler on:

The really adventurous might want to try building the compiler without using the precompiled 'idst.c' file supplied with the above. Typing 'make bootstrap' in the top-level directory will perform a 'classic' three-stage bootstrap. First a bootstrap compiler called 'xidst' (written in C++) is compiled. Then xidst is run to create a 'stage1' compiler from the real compiler sources ('idst/idst.st'). The 'stage1' compiler is used to create a stage2 compiler, and that in turn is used to create a stage3 compiler. Stage2 and stage3 are compared (they should be identical), leaving behind a new 'idst.c' file containing the compiler. Finally, the new compiler is used to compile a short test file (that tries hard to break various parts of the compiler and Smalltalk library). If the whole process finishes without error then you have a brand-new, perfectly healthy compiler.

If all that sounds too complicated, ask me to make you a binary distribution.

Syntax

Most of the syntax is the same as Smalltalk-80. For example, comments are contained within double quotes:
"this is ignored"
The few minor additions to Smalltalk-80 syntax are to accommodate compilation from plain text files, variadic blocks (and methods), an expanded range of literal types, and direct access to non-printing characters in Character and String literals.

Programs are translated one source file (with zero or more additional source files being imported) at a time. To the compiler, this body of code is called a translation unit. The compiler always processes one complete translation unit at a time, and currently (this is a temporary limitation) a translation unit must contain an entire program (all object and method definitions required, with no external or unresolved references).

A translation unit consists of a sequence of definitions and imperatives. Definitions either create a new prototype or add a method to an existing prototype. Imperatives are sequences of code that are executed in-order when the program is run.

Imperatives

A literal block can appear at the top-level (outside any other kind of definition):
[ statements ]
The code within the block is executed at the moment 'control' nominally reaches the block within the source file at runtime. This is handy for initialising complex data structures (think of it as a means to obtain behaviour similar to class initialisation methods) and also for starting the whole program in motion at the end of the source (something akin to a 'main' method, if you like).

Top-level imperatives can also take the form:

{ directive optionalArguments... }
Currently only one directive is recognised:
{ import name }
will search for a file called 'name.st' and substitute its contents in place of the directive. (The Scanner is currently pretty stupid and wants to see precisely '{ import ', with one space after the '{' and one space after 'import', in order to recognise this directive.)

Prototype definitions

Two top-level forms provide for the creation of new prototypes:
name ( listOfSlots )
creates a new 'root' prototype (it has no parent, or 'delegate') and binds it to name. The prototype contains zero or more named slots, similar to instance variables. The definition could be read as: "name is listOfSlots".

Such a prototype has no useful behaviour (it can't even clone itself to create useful application objects). Adding a minimum of primitive behaviour (e.g., cloning) is the first thing you'll want to do to such an object.

The second form:

name : parent ( listOfSlots )
is similar, except the new prototype delegates to the named parent object and inherits the parent object's slots before adding its own. Such definitions could be read as: "name extends parent with listOfSlots".

(This is every bit as bogus as a single inheritance mechanism being used to share state and behaviour, but I'm still trying to figure out how to separate delegation from the sharing of state without sacrificing performance. Only allowing slots to be accessed by name in their defining prototype, forcing inherited slots to be accessed by message send, is probably the way to go. Better still, making all state accesses into message sends -- especially assignments.)

Method definitions

Methods are just 'named blocks', tied to a particular prototype only by permitting direct access to the state within that prototype. (Therein lies yet another reason to abolish direct access to state.) This is reflected in the syntax of the top-level form for adding methods (named blocks) to a prototype:
name pattern [ statements ]
where name identifies a prototype object (defined as described above), pattern looks (more or less) like a Smalltalk-80 message pattern, and statements (notice the brackets) is the body of a block. (The block can take arguments, but these are contained in the pattern and so the syntax prohibits explicit block arguments from appearing at the start of the statements sequence. This restriction applies only to blocks used as method implementations.)

The pattern component can be a unary, binary or keyword message pattern. Extending Smalltalk's fixed-arity messages, blocks associated with keyword patterns can be variadic (accommodating zero or more additional arguments, beyond those associated with explicit keywords). This is indicated by an ellipsis in the message pattern. Expanding the pattern part of the above syntax, the four valid forms of message pattern are therefore:

name unarySelector [ statements ]
name binarySelector argumentName [ statements ]
name keywords: arguments [ statements ]
name keywords: arguments ... [ statements ]
where 'keywords: arguments' are 'keyword: argument' pairs, repeated as many times as necessary, and '...' means an explicit ellipsis (and does not mean 'more keywords/arguments, as required'). (See the discussion on message sends below for the syntax of sending a message with optional 'rest' arguments.)

(Simply for lack of time, there is currently no friendly syntax to recover the 'rest' arguments within the body of a message. Wizards, however, can easily recover these arguments by writing some low-level 'magic' inside an external block. There will be an example of this later.)

Blocks

Blocks are similar to Smalltalk-80 blocks, but allow for local (block-level) temporaries:
[ statements ]
[ :arguments | statements ]
[ | temporaries | statements ]
[ :arguments | | temporaries | statements ]
Both arguments and temporaries are strictly local to the block and will not conflict (other than in name) with similarly-named arguments or temporaries in lexically disjoint blocks. The compiler currently disallows the shadowing of names.

(This means that you cannot set a method-level temporary by naming it as a block argument. It also means two blocks in the same method that share an argument or temporary name will each refer to a completely different value, regardless of the common name.)

Assignment

The Smalltalk-80 'left arrow' assignment operator is gone. The corresponding form is:
identifier := expression
with the ':=' operator having the lowest precedence of any operator (including keyword message sends) and associating from right to left (so that 'x := y := 0' assigns 0 to 'y' and then to 'x').

Message sends

Message sends are similar to Smalltalk-80: unary, binary and keyword messages have the same precedence as in Smalltalk-80 and cascaded messages (with the ';' operator) work in exactly the same manner.
primary unarySelector
unaryMessage binarySelector unaryMessage
binaryMessage keywords: binaryMessages
receiver messageSend ; messageSend
(Treating the binary selectors differently, introducing several levels of implicit precedence based on the operator name to provide the traditional arithmetic order of evaluation, would also be a possibility.)

Extending the Smalltalk-80 syntax is the ability to send a keyword message with 'anonymous' arguments. (See the discussion above on variadic message patterns.) The simplest possible change that would allow this is to drop the name part of the keyword (but keep the colon):

receiver keywords: arguments : anonymousArgument
with as many ': argument' pairs as required. (Anonymous arguments can only appear after arguments associated with a proper keyword; no more 'keyword: argument' pairs are allowed after the first ': anonymousArgument' occurring in a keyword message send.)

Parentheses

If you don't like the precedence defined by unary, binary, and keyword sends, put parentheses around expressions to force evaluation order.

Literals

Literals are immutable. In other words: literals created by the compiler cannot be modified by the program. This was done for two reasons:
  1. It's cleaner, making the semantics simpler to explain (no more confusing behaviour when a program inadvertently modifies a literal causing some method someplace to have behaviour different to that implied by its source code).
  2. My C compiler puts literals in a read-only data section, at one point causing me a certain amount of stress while debugging what was ultimately correct code but containing an attempt to write into a read-only location. If all compiler-generated literals are immutable then this particular platform idiosyncrasy ceases to be of any concern whatsoever.
A handful of new classes (ImmutableArray, ImmutableByteArray, ImmutableWordArray) are present in the library to accommodate the above.

In addition to literal Arrays

#( elements )
we also have literal WordArrays
#{ integers }
and ByteArrays
#[ integers ]
(where each integer must be between 0 and 255). In Array literals, nested Array, ByteArray and WordArray literals can appear without the initial '#' (although one can be supplied if you like).

Integer literals themselves are in decimal by default, with the usual

radixInteger r valueInteger
syntax supported. For the hackers out there, I saw no reason to avoid supporting
0xvalueInteger
for hexadecimal integers too. Digits greater than '9' in hexadecimal literals (in either of the above syntaxes) or in literals of any base greater than ten (in the 'r' syntax) can be specified using upper- or lower-case letters.
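The three integer syntaxes above can be sketched in C. This is only an illustration of the rules as described, not the idst Scanner: the function names are hypothetical, a radix range of 2 to 36 is an assumption, and neither a leading sign nor overflow is handled.

```c
#include <ctype.h>
#include <stddef.h>

/* Value of one digit character (upper- or lower-case letters allowed), or -1. */
static int digit_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (isalpha((unsigned char)c)) return tolower((unsigned char)c) - 'a' + 10;
    return -1;
}

/* Parse a run of digits in 'base'; answers 0 on success, -1 if empty or a digit is out of range.
   (No overflow check: this is a sketch.) */
static int parse_digits(const char *s, int base, long *out)
{
    long value = 0;
    if (*s == '\0') return -1;
    for (; *s != '\0'; ++s) {
        int d = digit_value(*s);
        if (d < 0 || d >= base) return -1;
        value = value * base + d;
    }
    *out = value;
    return 0;
}

/* Accepts "123" (decimal), "16r1F" (radix 'r' value) and "0x1f" (hexadecimal). */
int parse_integer_literal(const char *text, long *out)
{
    const char *r;
    long radix = 0;

    if (text[0] == '0' && text[1] == 'x')
        return parse_digits(text + 2, 16, out);

    /* Read a decimal prefix and see whether it is followed by 'r'. */
    for (r = text; *r >= '0' && *r <= '9'; ++r)
        if (radix <= 36) radix = radix * 10 + (*r - '0');   /* stop growing once too big */

    if (*r == 'r' && r != text) {
        if (radix < 2 || radix > 36) return -1;             /* ASSUMPTION: radix limits */
        return parse_digits(r + 1, (int)radix, out);
    }
    return parse_digits(text, 10, out);
}
```

For instance, parse_integer_literal("16r1F", &v) and parse_integer_literal("0x1f", &v) both store 31 in v, while "8r9" is rejected because 9 is not an octal digit.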

Smalltalk-80 Character literals are supported:

$character
as are non-printing Characters either by mnemonic or by explicit value (following the ANSI 'escape sequence' conventions):
syntax  asciiValue  ASCII designation
$\a     7           bel (alert)
$\b     8           bs (backspace)
$\t     9           ht (horizontal tab)
$\n     10          nl (newline)
$\v     11          vt (vertical tab)
$\f     12          np (new page, or form feed)
$\r     13          cr (carriage return)
$\e     27          esc (escape)
$\\     92          \ (a single backslash character)
(Extended mnemonic names such as '$\newline' for '$\n' could easily be supported too.) In the event that a non-printing character literal not in the above list is required, a generic octal escape is provided:
$\octalNumber
where octalNumber is precisely three (no more, no less) octal digits in the range '000' to '377' specifying the value of the Character. In other words, '$\n' and '$\012' are the same Character, and '$\000' is the 'nul' Character (ascii value zero).
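As a rough sketch of the Character literal rules (mnemonic escapes plus the three-digit octal form), here is a small C decoder. The function name is hypothetical and the input is assumed to be the text that follows the '$'; it is not taken from the idst sources.

```c
#include <stddef.h>
#include <string.h>

/* Decode the text following '$' in a Character literal.
   Answers the character value (0..255), or -1 if the literal is malformed. */
int decode_character_literal(const char *s)
{
    static const char mnemonics[] = "abtnvfre\\";
    static const int  values[]    = { 7, 8, 9, 10, 11, 12, 13, 27, 92 };
    const char *m;

    if (s[0] == '\0') return -1;
    if (s[0] != '\\')                                /* plain $c */
        return s[1] == '\0' ? (unsigned char)s[0] : -1;
    if (s[1] == '\0') return -1;                     /* lone backslash */

    if (s[1] >= '0' && s[1] <= '7') {                /* $\ooo: exactly three octal digits */
        if (strlen(s) != 4 || s[1] > '3') return -1; /* '000' to '377' only */
        if (s[2] < '0' || s[2] > '7' || s[3] < '0' || s[3] > '7') return -1;
        return (s[1] - '0') * 64 + (s[2] - '0') * 8 + (s[3] - '0');
    }

    m = strchr(mnemonics, s[1]);                     /* $\n, $\t, ... */
    if (m == NULL || s[2] != '\0') return -1;
    return values[m - mnemonics];
}
```

With this sketch the texts \n and \012 both decode to 10, \000 decodes to 0, and \12 (only two octal digits) is rejected.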

String literals obey much the same rules as Smalltalk-80. Adjacent String literals:

'like''this'
are concatenated with an intervening single quote:
like'this
However, the conventions that apply to '\' in escaping single Character literals also apply to characters within a String. You could write a String literal that contains two lines, each terminated by a newline with the whole String terminated by a nul Character:
'like\nthis\n\000'
(I was very, very tempted to make consecutive String literals simply concatenate without the implicit intervening single quote, as in other languages that support juxtaposed String literals. I may yet change this so that single quotes inside Strings must be escaped
'like\'this'
to bring them into line with other languages. [Escaping the embedded single quote does already work just fine, but it isn't currently the unique means to introduce a single quote into a String -- which is a bug.] If you think that's bad, just consider that it took all my self control to avoid making Character literals look like 'a' 'b' and 'c', and Strings look like "abc" -- with an obviously necessary change to comments too.)
Note: The 'character escape' rules above apply to Symbols too. If you want to write the literal symbol for the 'remainder on division' binary message, you have to say '#\\\\' (since the first and third backslash characters escape the second and fourth). I think this is a bug (character escapes should only be recognised if the Symbol is created from a String [so '#'\\\\' == #\\' would hold]) and intend to fix it sometime. In the meantime: beware!
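The String rules (a doubled quote stands for one quote, and backslash escapes work as in Character literals) can be illustrated with a short C sketch. The names are hypothetical, the input is assumed to start at the opening quote, and this is not the Scanner's actual code.

```c
#include <stddef.h>
#include <string.h>

/* Decode one escape starting just after the backslash.
   Stores the byte in *out and answers the number of characters consumed (0 if malformed). */
static size_t decode_escape(const char *p, unsigned char *out)
{
    static const char          mnemonics[] = "abtnvfre\\'";
    static const unsigned char values[]    = { 7, 8, 9, 10, 11, 12, 13, 27, 92, 39 };
    const char *m;

    if (p[0] >= '0' && p[0] <= '3' && p[1] >= '0' && p[1] <= '7' && p[2] >= '0' && p[2] <= '7') {
        *out = (unsigned char)((p[0] - '0') * 64 + (p[1] - '0') * 8 + (p[2] - '0'));
        return 3;                                   /* three octal digits */
    }
    if (p[0] == '\0' || (m = strchr(mnemonics, p[0])) == NULL) return 0;
    *out = values[m - mnemonics];
    return 1;
}

/* Decode a String literal that begins at its opening quote.
   Answers the decoded length (which may include nul bytes), or (size_t)-1 on error. */
size_t decode_string_literal(const char *src, char *dst, size_t cap)
{
    size_t n = 0;

    if (*src++ != '\'') return (size_t)-1;
    for (;;) {
        unsigned char c;
        if (*src == '\0') return (size_t)-1;        /* unterminated literal */
        if (*src == '\'') {
            if (src[1] != '\'') return n;           /* closing quote */
            c = '\'';                               /* 'like''this' -> like'this */
            src += 2;
        } else if (*src == '\\') {
            size_t used = decode_escape(src + 1, &c);
            if (used == 0) return (size_t)-1;
            src += 1 + used;
        } else {
            c = (unsigned char)*src++;
        }
        if (n >= cap) return (size_t)-1;            /* destination too small */
        dst[n++] = (char)c;
    }
}
```

Both 'like''this' and 'like\'this' decode to the nine characters like'this, matching the behaviour described above.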

Anything else...?

If you find something that does not seem to be explained here (either some feature in the sources that I wrote, or something you think should work but doesn't) then please let me know so I can fix this document.

Semantics

The semantics are similar to Smalltalk-80, with three main differences:

Blocks

The restrictions placed on Blocks by Smalltalk-80 have been eliminated, and the (end-user) notion of BlockContext has been replaced by BlockClosure (in several variations according to optimisability). When you write a block '[...]' in a program, what you create is a BlockClosure (and not a partially-crippled, half-initialised activation context, as would be the case in Smalltalk-80).

Block contexts (activated BlockClosures) have strictly local arguments and temporaries. The value of an argument or temporary can never come into contact with, nor be affected in any way by, an enclosing lexical context. They are quite literally inaccessible. You cannot, for example, implicitly assign to a method temporary by naming it as a block argument.

BlockClosures can 'close-over' local state defined in a lexically-enclosing scope. In such cases, the closed-over state will be preserved on exit from the enclosing scope, leaving it accessible to future activations of blocks defined within that scope. Each time the defining scope is entered, fresh copies of closed-over state are created. (In other words, block closures 'see' the state associated with the activation in which they were created, rather than that associated with the closure in which they were created. Things like 'fixTemps' are completely unnecessary.)

All BlockClosures are first-class (they can be stored or passed upward for activation at a later time) although block activations are strictly LIFO, with no exceptions. (Your hardware really, really wants things to be this way.)

(For the terminally-curious: closed-over state, corresponding to any variables that appear 'free' within a lexically-nested scope, are stored in a heap-allocated 'state vector' independent of the defining method or block activation context. These state vectors persist for as long as there are reachable block closures that reference them -- either explicitly, as their defining context, or implicitly, by holding a reference to a free variable stored within the vector.)
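To make the 'state vector' idea concrete, here is a minimal C sketch of a closure whose free variable lives on the heap rather than in the creating activation's stack frame. All names are hypothetical and the layout is an assumption for illustration; it says nothing about the code idst actually generates. Pepsi would leave reclamation to its garbage collector, whereas here the caller frees the vector.

```c
#include <stdlib.h>

/* Closed-over variables live in a heap-allocated state vector. */
struct state_vector { int count; };

/* A closure pairs code with the state vector it was created over. */
struct closure {
    int (*code)(struct closure *self, int arg);
    struct state_vector *state;
};

/* Body of the block: 'count' is free here, so it is reached through the state vector. */
static int add_to_count(struct closure *self, int arg)
{
    self->state->count += arg;
    return self->state->count;
}

/* Each activation of make_counter allocates a fresh state vector,
   so closures created by different activations do not share 'count'. */
struct closure make_counter(int start)
{
    struct closure c;
    c.state = malloc(sizeof *c.state);
    if (c.state == NULL) abort();
    c.state->count = start;
    c.code = add_to_count;
    return c;                 /* the state vector outlives this activation */
}

/* Activate the closure with one argument. */
int call_closure(struct closure *c, int arg)
{
    return c->code(c, arg);
}
```

Two closures made by make_counter(10) and make_counter(100) keep separate counts: calling the first with 1 twice answers 11 and then 12, and the second is unaffected.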

Non-local returns

An explicit return statement inside a block behaves just like in Smalltalk-80: the method activation in which the block closure was originally created will return the indicated value.

There is currently one limitation: blocks containing non-local returns make no attempt to detect whether their defining method context has already returned. Attempting to return from a block whose method activation has already exited, rather than resulting in a friendly runtime error along the lines 'this block cannot return', will most likely provoke a segmentation fault and core dump. (This is really easy to fix; I'm just too lazy to deal with it right now.)

Prototypes and objects

Well, it's all just objects really.

Objects are created by being cloned, which creates an uninitialised shallow copy of the original object. By convention the 'reusable' object that you clone, to make a new object to be modified and otherwise abused, is the 'prototype' for its 'clone family'. All members of a clone family share the same behaviour (response to messages), including the 'prototype' at the head of the clone family. If you modify the behaviour of the prototype (or any other member of its clone family) then the behaviour of all members of the clone family (including that of the prototype) is modified, identically. This is something of a compromise between Lieberman-style prototypes (much simpler and more general organisational conventions, but very difficult to implement efficiently) and class-instance systems (easier to implement efficiently, but imposing much more complex organisational conventions on their surrounding systems).

In other words, a prototype (in the sense of the present discussion) is nothing more than an object that has been:

In yet other words, writing:
Foo : Point ()
is equivalent to:
" add 'Foo' to the set of visible named prototypes, then... "
  Foo := Smalltalk allocate: Point byteSize + N "size of Foo slots in bytes".
  Foo methodDictionary: (MethodDictionary new parent: Point methodDictionary).
This results in a useful idiom for creating shared structures:
BadVisibilityZone : Dictionary ()
[
    (BadVisibilityZone := BadVisibilityZone new)
        at: 'Archer'      put: #below;
        at: 'Warrior'     put: #below;
        at: 'Sparrowhawk' put: #above;
        at: 'Cardinal'    put: #above.
]
(although I'm not suggesting that this is either the best idiom or, by a long way, a secure and desirable one.)

Note that the explicit reinitialisation (by sending 'new') of the prototype is required since the implicit cloning in the prototype specification creates an uninitialised object (in all respects other than having a valid method dictionary installed in it).
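The clone-family behaviour described in this section can be modelled in a few lines of C. This is a sketch under assumptions: the struct and function names are hypothetical, a single function pointer stands in for a whole method dictionary, and (unlike the uninitialised clone described above) the copy here duplicates the original's state for simplicity.

```c
#include <stdlib.h>
#include <string.h>

/* Behaviour shared by every member of one clone family. */
struct behaviour { int (*area)(const void *self); };

struct rect {
    struct behaviour *behaviour;   /* same pointer in the prototype and in every clone */
    int w, h;                      /* per-object state */
};

int area_exact(const void *self)  { const struct rect *r = self; return r->w * r->h; }
int area_double(const void *self) { const struct rect *r = self; return 2 * r->w * r->h; }

/* Shallow copy: state is copied, the behaviour pointer is shared rather than duplicated. */
struct rect *rect_clone(const struct rect *original)
{
    struct rect *copy = malloc(sizeof *copy);
    if (copy == NULL) abort();
    memcpy(copy, original, sizeof *copy);
    return copy;
}

/* 'Send' the area message: dispatch through the shared behaviour. */
int rect_area(const struct rect *r)
{
    return r->behaviour->area(r);
}

/* Changing behaviour through any member changes it for the whole family. */
void rect_set_area_method(struct rect *member, int (*area)(const void *self))
{
    member->behaviour->area = area;
}
```

After cloning a 2-by-3 prototype and installing area_double through the clone, the prototype's area changes from 6 to 12 as well, because both objects dispatch through the same behaviour.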

Random thoughts on class-like behaviour

The easiest thing is just to mix 'meta' and 'application' behaviour:
Point : Object ( x y )

Point new
[
    self := super new.
    x := 0.
    y := 0.
]

Point magnitude
[
    ^((x * x) + (y * y)) sqrt
]
The only 'bizarre' (or not, according to your perspective) thing about this is that any 'instance' of 'Point' will be able to create new 'Points' in response to 'new'.

Another possibility would be to create parallel hierarchies, with class behaviour defined in one and instance behaviour in the other.

Point : Object ()          "the 'class' side"
aPoint : anObject ( x y )  "the 'instance' side"

Point new
[
    self := aPoint clone.
    x := 0.
    y := 0.
]

aPoint magnitude
[
    ^((x * x) + (y * y)) sqrt
]

Everything is first-class

In case you hadn't already noticed, 'self' is a variable. (As are 'nil', 'true', and 'false'.) If you assign to 'self' inside a method, the receiver instantly changes identity and retains the new identity through to the end of the method (or the next assignment to 'self'), including any implicit return of 'self' at the end of the method. The following have exactly the same behaviour:
Point new
[
    self := self clone.
    x := y := 0
]

Point new
[
    ^super new setX: 0 setY: 0
]
(assuming the existence of 'setX:setY:'), although the former is: (a) cleaner, (b) more in keeping with 'prototype and clone' style (as opposed to 'class and instance' style), and (c) faster. The disadvantage is that 'super new' might not return a Point, after which assigning to 'x' and 'y' directly might not be a good idea. (Yet another reason to abolish direct manipulation of 'inherited' state within methods...)

The only 'special name' to which you cannot assign is 'super'. (Actually, I never tried to assign to super. I don't think the Parser will let you, but you might just be able to assign to 'self' by calling it 'super'. Of course, the correct response to assigning to 'super' should be to dynamically re-parent 'self', but that's fraught with semantic complications -- not to mention problems with maintaining consistency in methods that access state directly. Again, a great reason to get rid of it.)

Pragmatics

The ABI (executable code conventions) is entirely C-compatible. The intention is to integrate seamlessly with other languages/applications, platform libraries and data types, without having (in the vast majority of cases) to leave the object-message paradigm.

In the meantime, primitive behaviour has to be hand-coded (by a wizard) and inserted explicitly into the compiled code at the appropriate point. Code appearing between braces '{...}' is copied verbatim to the output. Such external blocks are legal

provided that the code cannot be confused with a directive ('{ import ...') or WordArray literal ('#{...}').

Here's a trivial example, showing how to send a 'Character' to the 'console', answering 'true' or 'false' depending on whether the operation succeeded:

Character : Object
(
  value    "character's value as a Smalltalk integer"
)

Character putchar
{
  struct t_Character *this= (struct t_Character *)self;
  int value= _integerValue(this->value);
  return putchar(value) >= 0 ? v_true : v_false;
}
A few things to note:

Alternatively, the above example could be written to raise a 'primitive failed' error on failure (more in keeping with traditional Smalltalk-80 primitive methods):
Character putchar
[
    {
      struct t_Character *this= (struct t_Character *)self;
      int value= _integerValue(this->value);
      if (putchar(value) >= 0) return self;
    }.
    " fall through to failure code... "
    ^self primitiveFailed
]
The function '_integerValue(anObject)' used in the above examples is one of several 'helper functions' defined for external code to use. The full set is as follows:
sel_t _selector(char *name)
invokes the message send '_selector intern: name'.
oop _proto(oop parent)
invokes the message send 'parent _delegated'.
void _method(oop prototype, sel_t selector, imp_t method)
invokes the message send 'prototype _methodAt: selector put: method'.
(The effects of the message sends performed by the above functions are explained in detail in the section describing the runtime system.)
imp_t _bind(oop object, sel_t selector)
performs a memoized lookup of 'selector' in 'object', yielding a method implementation for the corresponding message response.
imp_t _rebind(oop object, sel_t selector)
similar to _bind, except that the result is not placed in a point-of-send inline cache (if such are enabled). (This is critical for the correct implementation of 'Object perform:' and similar.)
void *_newPointers(int size)
answers the address of a (collectible) block of uninitialised memory at least size bytes in length. (Pointers to objects within this memory will be considered by the garbage collector during marking.)
void *_newBytes(int size)
answers the address of an atomic (uncollectible) block of uninitialised memory at least size bytes in length. (The contents of this memory are ignored by the garbage collector.)
oop _integerObject(int value)
answers an object corresponding to the given integer value.
int _integerValue(oop object)
answers the integer value corresponding to the given object. (No check that the object is in fact an integer is performed.)
int _isIntegerObject(oop object)
answers nonzero if the given object is an integer.
int _areIntegerObjects(oop a, oop b)
answers nonzero if both a and b are integer objects.
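The text does not say how integer objects are represented. One common scheme, shown here purely as an assumption, tags small integers by setting the low bit of the word (real object addresses are word-aligned, so their low bit is zero). The 'sketch_' names are hypothetical stand-ins for the four integer helpers and are not the runtime's own definitions.

```c
#include <stdint.h>

typedef void *oop;   /* stand-in for the runtime's object pointer type */

/* ASSUMPTION: an integer n is encoded as the word 2n+1; real objects have low bit 0. */
oop sketch_integerObject(intptr_t value)
{
    return (oop)(((uintptr_t)value << 1) | 1u);
}

/* No check that the object is in fact an integer, as with _integerValue. */
intptr_t sketch_integerValue(oop object)
{
    return ((intptr_t)(uintptr_t)object - 1) / 2;
}

int sketch_isIntegerObject(oop object)
{
    return (int)((uintptr_t)object & 1u);
}

/* Both low bits must be set, so one AND tests the pair. */
int sketch_areIntegerObjects(oop a, oop b)
{
    return (int)((uintptr_t)a & (uintptr_t)b & 1u);
}
```

Under this encoding the value survives a round trip (including negative values), and the address of any ordinary word-aligned object is never mistaken for an integer. The encoding costs one bit of range.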
Two global variables are predefined to make writing command-line applications a little easier:
int _argc
contains a copy of the original value of argc passed to the program at startup.
char **_argv
contains a copy of the original value of argv passed to the program at startup.
For some examples of the above in use, search for '{' within the Smalltalk library source code.
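The 'memoized lookup' performed by _bind can be pictured as a small cache in front of a slow lookup. The following C sketch is an illustration only: the cache size, the hash, the names and the imp_t signature are all assumptions, and it does not invalidate entries when methods are redefined.

```c
#include <stddef.h>
#include <stdint.h>

typedef int (*imp_t)(void);                       /* stand-in for a method implementation */
typedef imp_t (*lookup_fn)(const void *vtbl, const void *selector);

struct cache_entry { const void *vtbl; const void *selector; imp_t method; };

#define CACHE_SIZE 64
static struct cache_entry method_cache[CACHE_SIZE];

/* Answer the method for (vtbl, selector), running the slow lookup only on a cache miss. */
imp_t sketch_bind(const void *vtbl, const void *selector, lookup_fn lookup)
{
    size_t i = (size_t)((((uintptr_t)vtbl >> 3) ^ ((uintptr_t)selector >> 3)) % CACHE_SIZE);
    struct cache_entry *e = &method_cache[i];

    if (e->vtbl != vtbl || e->selector != selector || e->method == NULL) {
        e->vtbl     = vtbl;
        e->selector = selector;
        e->method   = lookup(vtbl, selector);     /* slow path: full lookup */
    }
    return e->method;                             /* fast path: answered from the cache */
}

/* A stand-in slow lookup that counts how often it runs. */
unsigned long slow_lookups = 0;

static int answer(void) { return 42; }

imp_t demo_lookup(const void *vtbl, const void *selector)
{
    (void)vtbl;
    (void)selector;
    ++slow_lookups;
    return answer;
}
```

Binding the same (vtbl, selector) pair twice with demo_lookup runs the slow lookup once; the second bind is served from the cache.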

Finally, here is the 'variadic method' example promised earlier in this document:

Foo sum: firstArgument ...
[
    " Add all arguments until one of them is nil, then stop.  Answer the sum. "
    | sum next |
    sum := firstArgument.
    { va_list ap; va_start(ap, v_firstArgument) }.    " start scanning additional arguments "
    [{ v_next= va_arg(ap, oop) }.		      " read next argument "
     next notNil]
        whileTrue:
            [sum := sum + next].
    { va_end(ap) }.				      " stop scanning arguments "
    ^sum
]

[
    | total |
    total := Foo sum: 1 : 2 : 3 : nil.	              " leaves 6 in total "
]
If that didn't make much sense, type 'man stdarg' on any Unix-based machine.
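For comparison, here is the same idea written directly in C with the standard stdarg macros. The function name is hypothetical, and pointers to long (terminated by a null pointer) stand in for the object arguments terminated by nil.

```c
#include <stdarg.h>
#include <stddef.h>

/* Add the longs pointed to by each argument until a null pointer is seen. Answer the sum. */
long sum_until_null(long *first, ...)
{
    va_list ap;
    long sum = 0;
    long *next = first;

    va_start(ap, first);              /* start scanning additional arguments */
    while (next != NULL) {
        sum += *next;
        next = va_arg(ap, long *);    /* read next argument */
    }
    va_end(ap);                       /* stop scanning arguments */
    return sum;
}
```

A call such as sum_until_null(&a, &b, &c, (long *)NULL) with a, b and c holding 1, 2 and 3 answers 6. The terminator is cast explicitly because a bare NULL is not guaranteed to be passed as a pointer through '...'.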

The runtime system: introspection and intercession

The only intrinsic runtime operation (in the sense that it is inaccessible to user-level programs) is memoized 'secondary' dynamic binding, taking place entirely within the method cache. Every other runtime operation (prototype creation, cloning objects, method dictionary creation, message lookup ['primary' dynamic binding, outside the method cache], etc.) is achieved by sending messages to objects, is expressed entirely in idst, and is therefore accessible, exposed and available for arbitrary modification by any user-level program.

Runtime structures

Four types of object are used within the runtime system, and are the basis for all computation. The hierarchy looks like this:
_object ()
    _selector ( size _name next )
    _binding ( selector _method )
    _vtbl ( size capacity _bindings delegate )
(Slots starting with an underscore '_' are primitive types useful for their state only -- you cannot send messages to these objects. All other slots contain pointers to real objects that respond to messages.)

_object is a singleton prototype that defines behaviour common to all objects. This behaviour includes message lookup (dynamic binding), which is achieved by sending (real) messages to the objects involved. In other words, every single object created must eventually delegate to _object, otherwise it would be impossible to interact with (send messages to) that object. In yet other words, _object is necessarily the parent of every other object in the system. If you write

RootPrototype ()
then the runtime system will tacitly convert this into
RootPrototype : _object ()
to ensure that you (and, more importantly, the system itself) can send messages to RootPrototype (and its clones).

_selector is an interned (unique) string, much like a Smalltalk Symbol. The selector itself is stored as a 'size' (in bytes) and a '_name' (a primitive array of bytes; i.e., of type 'char *'). The 'next' field links all the selectors into a list for purposes of interning.

_binding associates a _selector (in the 'selector' slot) with the address of native code implementing a method (in the '_method' slot).

_vtbl is a 'virtual table', similar to a MethodDictionary in Smalltalk-80. Virtual tables map selectors to method implementations for a particular clone family (one _vtbl is shared between all clones in a given family). The '_bindings' slot points to a primitive vector of pointers to _binding objects describing the virtual table's mapping. The 'size' slot contains the number of entries in _bindings, and 'capacity' is the maximum number of entries that _bindings can contain (without being grown). Finally, 'delegate' points to another _vtbl to which all unrecognised messages are delegated.
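The three structures just described map naturally onto C structs. The sketch below is an approximation for illustration: field names follow the slot names above, but the function names, the growth policy and the imp_t signature are assumptions, and the real versions are written in idst (in 'Smalltalk/runtime.st'), not C.

```c
#include <stdlib.h>
#include <string.h>

typedef int (*imp_t)(void);    /* stand-in for the address of native method code */

struct selector { size_t size; char *name; struct selector *next; };
struct binding  { struct selector *selector; imp_t method; };
struct vtbl     { size_t size, capacity; struct binding **bindings; struct vtbl *delegate; };

static struct selector *all_selectors = NULL;   /* head of the interning list */

static void *xmalloc(size_t n) { void *p = malloc(n); if (p == NULL) abort(); return p; }

/* Answer the unique selector for 'name', creating it on first use. */
struct selector *intern(const char *name)
{
    size_t size = strlen(name);
    struct selector *s;
    for (s = all_selectors; s != NULL; s = s->next)
        if (s->size == size && memcmp(s->name, name, size) == 0) return s;
    s = xmalloc(sizeof *s);
    s->size = size;
    s->name = xmalloc(size + 1);
    memcpy(s->name, name, size + 1);
    s->next = all_selectors;
    all_selectors = s;
    return s;
}

/* Create an empty table whose unrecognised messages go to 'delegate' (may be NULL). */
struct vtbl *vtbl_new(struct vtbl *delegate)
{
    struct vtbl *v = xmalloc(sizeof *v);
    v->size = v->capacity = 0;
    v->bindings = NULL;
    v->delegate = delegate;
    return v;
}

/* Add or replace the binding for 'selector', growing the vector when it is full. */
void vtbl_method_at_put(struct vtbl *v, struct selector *selector, imp_t method)
{
    size_t i;
    struct binding *b;
    for (i = 0; i < v->size; ++i)
        if (v->bindings[i]->selector == selector) { v->bindings[i]->method = method; return; }
    if (v->size == v->capacity) {
        size_t capacity = v->capacity ? 2 * v->capacity : 4;
        struct binding **grown = realloc(v->bindings, capacity * sizeof *grown);
        if (grown == NULL) abort();
        v->bindings = grown;
        v->capacity = capacity;
    }
    b = xmalloc(sizeof *b);
    b->selector = selector;
    b->method = method;
    v->bindings[v->size++] = b;
}

/* Search this table, then each delegate in turn; NULL means the message is not understood. */
imp_t vtbl_lookup(const struct vtbl *v, struct selector *selector)
{
    for (; v != NULL; v = v->delegate) {
        size_t i;
        for (i = 0; i < v->size; ++i)
            if (v->bindings[i]->selector == selector) return v->bindings[i]->method;
    }
    return NULL;
}

/* Two trivial methods used to exercise the tables. */
int answer_one(void) { return 1; }
int answer_two(void) { return 2; }
```

Because selectors are interned, bindings compare them by address rather than by string contents. A child table that defines its own 'y' overrides the parent's 'y', while 'x' is still found in the parent through the delegate link.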

Essential protocol of runtime objects

Compiled code assumes the existence of responses to the following messages:

Ambitious applications can therefore (amongst other tricks) redefine '_object _methodAt:put:' and/or '_vtbl lookup:' to implement unusual dynamic binding behaviour.

The implementation of the above methods (along with several potentially useful auxiliary methods in the runtime classes) can be found in the file 'Smalltalk/runtime.st'.

Additional protocol of runtime objects

Several additional methods are defined in runtime objects for convenience. Amongst these are:

Runtime examples

The file 'Smalltalk/runtime.st' contains a disabled (commented) section of code at the end. Remove the comments to see the above methods bringing the entire system up, in excruciating detail.

A (somewhat contrived) example showing how to control the modification of object protocol and the behaviour of method lookup can be found in 'example/intercede.st'. To build and run it, type

make PROGRAM=intercede
./intercede
from within the 'example' directory.

Object layout and object pointers

Objects have a single header word followed by zero or more bytes corresponding to the named slots containing the state of the object.

The header word is a pointer to the object's virtual table. Message sends to the object are resolved (when not present in the method cache) by sending 'lookup:' to the header object. This is the only explicit relationship between an object and the value stored in its header word.

Object pointers correspond to the address in memory of the first slot of an object, one word beyond the object's header (_vtbl pointer). In other words, the object header (containing the _vtbl pointer) is in the word before the one referenced by the object's oop. This is done to allow 'toll-free bridging' of idst objects to C/C++ structs/classes, Objective-C instances, or to native objects in any other language that does not use the same convention of putting a header in the word before an object's address. Allocating the idst _vtbl pointer before (e.g.) a C/C++/ObjC object effectively 'wraps' the foreign object in an 'invisible' idst object, whose layout is identical to (and whose state is stored at the same address as) that expected by the native implementation of the foreign object.
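The 'header in the word before the oop' layout can be sketched in C as follows. The function names are hypothetical, plain malloc stands in for the collector's allocator, and the state is only guaranteed pointer alignment; it is an illustration of the layout, not the runtime's allocator.

```c
#include <stdlib.h>
#include <string.h>

/* A 'foreign' C struct that knows nothing about the header. */
struct point { int x, y; };

/* Allocate one header word followed by 'size' bytes of state.
   Answer the address of the state: the oop points one word past the header. */
void *object_alloc(void *vtbl, size_t size)
{
    void **base = malloc(sizeof(void *) + size);
    if (base == NULL) abort();
    base[0] = vtbl;                  /* header word: the _vtbl pointer */
    memset(base + 1, 0, size);
    return base + 1;
}

/* The header is found in the word before the one the oop refers to. */
void *object_vtbl(void *oop)
{
    return ((void **)oop)[-1];
}

/* Release the whole allocation, header included. */
void object_free(void *oop)
{
    free((void **)oop - 1);
}
```

A pointer returned by object_alloc(vtbl, sizeof (struct point)) can be handed to any C code expecting a struct point *, since the state begins exactly at that address; only code that knows about the convention looks one word back to find the vtbl.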

Caveats and gotchas for Smalltalk programmers

Pepsi includes a pervasive experiment in zero-relative indexing. Numbers are signed (positive or negative) and the scanner is not context-sensitive (the syntactic type of each token is uniquely determined by its spelling, irrespective of its position).