An ag(e)ing hacker, Luca Saiu's blog
2013-08-23 12:54 (last update: 2023-09-29 17:57) Luca Saiu
Several people wrote me in private about epsilon, requesting a tutorial or at least some practical information about how to run the thing. I’m very happy about the feedback I received, from people I already knew and from others who had just discovered my work on the net. I hadn’t anticipated it: thanks. The insterest I saw motivated me to write this informal, hands-on introduction.
Of course I’ll also prepare reference documentation, but not at this time: I won’t actually start working on a proper manual until the language reaches a reasonably stable state. However I plan to adapt this post into a chapter to be kept up to date and added to the “official” manual, in a first part named Tutorial Introduction.
I’m very interested in your feedback about my presentation style:
please write to epsilon-devel
or to me personally if you have
constructive suggestions.
In this post color is significant: unless you’re already an epsilon expert you will follow much more easily using a graphic web browser supporting CSS. You should see this text in red, this other text in green and the latest one in yellow. This posts also uses Unicode subscripts and, since I’m already requiring all of this anyway, a couple Greek letters as well. I apologize to the users of w3m, links and lynx: it’s for this one time only.
This post doesn’t belong to the series of short articles I plan to write “for myself” about current developments in epsilon; the idea of this introduction is just to show interested people how to start, so that they can explore the system themselves.
The current implementation can be played with, but is far from friendly. In its current state I can only recommend it in good conscience to experienced programmers who already know some Lisp dialect (Scheme, Common Lisp or Emacs Lisp), and can competently use a GNU/Linux system.
The name “ε” comes from the Mathematical Analysis naming
convention for small variables. When I’m limited to
traditional character sets or if I feel less smug I write it in Latin
characters as “epsilon”; in any case, to convey the idea of the
language smallness, I always write “epsilon” lowercase, even at the
beginning of a sentence. The names of the core language ε₀
and the high-level personality ε₁ can be written in Latin
characters as epsilon0
and epsilon1
(again, always
lowercase) or abbreviated into e0
and e1
when part of
program identifiers.
A sentence by Chuck Moore resonates with me: current software is shameful1. It feels strangely right for being such an extreme idea.
My ideal environment is different from Mr Moore’s, but I share his opinion about most software, which is indeed bloated and overcomplicated. More importantly, we’re developing things the wrong way: our tools are at the same time too complex, not powerful enough, and not suitable to formal automatic reasoning.
In my vision a programming language should be extensible, so that the user may bind it to her needs and to the problem. The language should be built upon a very small and simple core using powerful syntactic extension capabilities, such as macros. This way, a program written using extensions is automatically rewritten into the core language, to be then executed or compiled. A program should be able to inspect its own state, including procedures and globals, and self-modify.
With some important philosophical differences explained in my PhD thesis introduction, the idea is accomplishing the vision expressed by Guy Steele in his marvelous 1998 talk Growing a Language, which hasn’t materialized yet out of the Lisp world. You can read a transcription at the address https://www.cs.virginia.edu/~evans/cs655/readings/steele.pdf, but I highly recommend the video recording of the original talk. Download it, and watch it as soon as you have an hour to spare:
youtube-dl www.youtube.com/watch?v=_ahvzDzKdB0
The ε core language, an imperative first-order language, is called ε₀. It’s inconvenient for humans to use for serious problems, yet simple and potentially very efficient. Using ε₀ I’ve built ε₁, a higher-level language with a Lispy feel, but untyped — not dynamically typed like Lisp: untyped, like Forth and machine language.
ε₁ is defined with macros and expression-to-expression transformations, in a way such that any piece of ε₁ code is automatically rewritten into ε₀ before execution. This strategy simplifies the system a lot, since it only needs an interpreter or compiler for ε₀: everything else, including the ε₁ extension machinery, can be run as ε₀ code — it doesn’t matter if machine-generated.
ε₁ is usable as a programming language, if a little unforgiving to beginners. Within the architecture of epsilon, ε₁ can be thought of as one personality2 built on top of the core language, useful to build more high-level personalities: the user is free to radically change the language and build something potentially very different, for example including static or dynamic typing, object orientation, or continuations. If some user dislikes ε₁, she’s free to change it into something else which rewrites into ε₀: Lispy features such as s-expressions are not hardwired in ε₀, and can be replaced with something else as well if a user so prefers.
ε₀ and ε₁ programs can self-modify, creating or removing procedures and global variables at run time. This adds power, but makes the language difficult to reason about and compile in the general case; however in the particular case of a program eventually reaching a “static” (which is to say, not self-modifying) state, we can use ordinary compilation techniques to translate the final static ε₀ program into native code, and recover efficiency — I’ll write about my compilation hacks in some forthcoming post.
In a style similar to compilation, a program can unexec to a file, to be later loaded and interpreted, possibly with a different runtime library or on another machine. The unexec operation entails freezing the current dynamic state of the program (including procedures), to be resumed later.
Thanks to reflection, unexecing and compiling don’t need to be special language features, but are instead ordinary procedures accessing the currently active global definitions as data structures. This is part of a general pattern: what in other languages is hardwired, in ε becomes defined by the user and changeable: in a sense I chose to move complexity from the language definition, which is to say Mathematics, to much more flexible code.
I call ε the whole system or set of languages/layers, starting from ε₀ and ε₁, up to the higher-level personalities (not yet existing at this point) on top of them.
ε is radical and somewhat subversive: since mainstream languages aren’t up to the task and aren’t getting better, I propose an open-ended system which the user can grow in any direction. Don’t worry about compatibility: dialect proliferation is good. When the right ideas emerge we can think of standardizing a personality, or some subset of it3 — but not now. First we need language experimentation, again: the software crisis isn’t anywhere near solved. Let the people play.
I’ve been pondering about programming languages for a long time, and epsilon has actually had several implementations already4, each less naïve and more ambitious than the previous. The current rewrite has also been the topic of my recently completed PhD thesis. My thesis describes the core language in close detail, and gives a pretty good overview of extension mechanisms. Of course a thesis is not software documentation: it contains a formal mathematical description of the language and its properties, and as such the bulk of it isn’t conceived for the end user.
At the beginning I was thinking to make my thesis very accessible: in
my original intentions a formally-minded programmer would’ve been able
to follow the treatment, without any specific prerequisites. That
attempt ultimately failed: describing epsilon requires some mathematical
sophistication, and since the design contains non-standard choices I’ve
developed a non-standard mathematical notation, which many people find
heavyweight.
I like another phrase which I’ve heard said by some Forther, I
think Greg Bailey or Jeff Fox if not Chuck Moore himself:
simple but not trivial. And that can certainly be said of ε:
it’s simple — very simple — but still not trivial.
Despite the inherent non-triviality of the matter, some chapters of my
thesis remain pretty accessible to motivated programmers, particularly
Chapter 1, the commentary part of Chapter 2, the whole Chapter 3 and
most
of Chapter 5.
The text is available in color for on-screen reading at the address http://ageinghacker.net/publications/luca-saiu--phd-thesis-color.pdf, or black-and-white, for printing, at http://ageinghacker.net/publications/luca-saiu--phd-thesis-bw.pdf.
I spent a long time working on this document, and I’d be happy if others could get something out of it. Have a look.
I implemented ε in itself; in particular, I used ε₀ to implement an ε₀ interpreter with its reflective data structures (what these structures are will become clear below), and syntactic extension mechanisms: macros and transforms. This first part of the implementation was painful to write since ε₀ is by design a weak language, without much abstraction power.
Then of course I built ε₁ using ε₀ with macros and
transforms: and each new syntactic form I added became immediately
available to define another one. In a somewhat arbitrary fashion, I
called “ε₁” the final state of this extension process,
blessing it as a personality.
ε₁ contains all of ε₀ plus syntactic extension
mechanisms, and a library of syntactic extensions including
variadic syntax, blocks, closures, imperative loops, sum types,
pattern matching, futures, unexec, and so on — none of these
features is present in ε₀, and the distance
between ε₀ and ε₁ feels very wide.
The idea of building ε₁ in ε₀ is not only aesthetically pleasing; it is a way of associating a formal specification to ε₁. Since I gave a formal mathematical specification of ε₀ (in my thesis), everything defined in it inherits its rigorously formal nature; but, again, ε₁ is defined with code rather than logic rules and equations.
Just to complete the picture: the runtime is of course written in C. In particular, I have two garbage collectors written by me, one of which parallel, not yet integrated. The code generated by the native compilers I’m working on (compilers are written in ε₁, of course) also require a little assembly code. Right now I support MIPS and x86_64; more backends will come including bytecode for a virtual machine, as a fallback case.
How did I implement ε₀, to run it the first time? I used
GNU Guile (http://www.gnu.org/software/guile) for bootstrapping,
extended with a little C to implement ε data as a
SMOB5, yielding
what I call guile+whatever
. The implementation still
depends on Guile at the present time, but only for relatively minor
functionality such as s-expression parsing and printing. I’ll have to
re-implement this functionality in ε itself, remove Guile, and
then the system will be completely self-hosting.
Don’t get me wrong: I love Guile; I really do. It’s an awesome system, well documented, with a good library, and getting faster. I also like the people working on it. But I need to remove Guile as a dependency: ε is not Scheme (ε₀ is much more minimal than Scheme), and it should stand on its own. In particular, ε can be used as a very low level language if the user prefers so, and I plan to support very small systems6 as compilation targets.
I assume you’re running a GNU/Linux system. I develop and regularly test on little-endian MIPS and x86_64; however any other GNU system should work as well, including GNU/Hurd. I suppose the software also works on BSD systems, with few or no changes.
You need (including development packages, if you don’t compile dependencies yourself):
Get a copy of the epsilon trunk from the bzr repository on Savannah (http://savannah.gnu.org/bzr/?group=epsilon):
bzr branch bzr://bzr.savannah.gnu.org/epsilon/trunk epsilon-trunk cd epsilon-trunk
Generate the configuration machinery, configure and compile:
./autogen.sh && ./configure && make && echo SUCCESS
If you have all the dependencies listed above everything should work,
and after a little while you should see the SUCCESS
message.
You don’t need to install.
As this is the first time you use the system, you have to bootstrap it from Guile. Enter the bootstrap source directory (which, at the current time, actually contains much more than what’s needed for bootstrap; yes, I should rename it).
cd bootstrap/scheme
Enter guile+whatever
. If you’re using Guile 1.8.x, type:
../../bin/guile+whatever
If instead you’re using Guile 2.0.x, type:
../../bin/guile+whatever --no-auto-compile
You need the option --no-auto-compile
because of a couple dirty
kludges I did with Guile macros, which prevent bytecode compilation.
This is not Guile’s fault: Guile 2 is great, and also a tremendous
improvement over Guile 1, but unfortunately I’ve abused the language.
It’s not worth fixing that part: you’ll never need to touch it, and
I’ll drop the Guile dependency anyway.
At guile+whatever
’s prompt, type:
(load "bootstrap.scm")
This will take a while: the system builds itself from a temporary ε₀ implementation written with Scheme macros, before unexecing. You don’t need to understand the details.
If there are no error messages, you can test the bootstrapped system.
Exit guile+whatever
by pressing Ctrl+D, re-enter guile+whatever
like above (don’t forget the --no-auto-compile
option on Guile
2.0.x), and at the prompt type:
(load "quick-start.scm")
This should be much faster.
You won’t need to bootstrap again unless you modify some file under
bootstrap/scheme/. So from the next time, if the sources haven’t
changed, you can simply come back to bootstrap/scheme/, run
guile+whatever
(with --no-auto-compile
if needed) and
execute (load "quick-start.scm")
from the prompt.
guile+whatever
and EmacsIt’s nearly always a good idea to enable Guile’s readline support.
Execute these lines at guile+whatever
’s prompt:
(use-modules (ice-9 readline)) (activate-readline)
If you add the two lines to your ~/.guile, they will be
executed automatically when you enter Guile, and
guile+whatever
as well.
I strongly recommend the Emacs major mode for epsilon (actually ε₁) which I derived
from Scheme mode. I find it already useful for indentation and font
locking, even if at this stage it can’t interact with external
processes yet: you’ll have to explicitly kill&yank from the editor to
the guile+whatever
REPL.
Visit epsilon-trunk/emacs/epsilon.el and do M-x
eval-buffer
. You can toggle the major mode when you’re visiting an
epsilon file with M-x epsilon-mode
. If you like
ParEdit (http://www.emacswiki.org/emacs/ParEdit) for
editing s-expressions you can enable that as well.
Visit epsilon-trunk/bootstrap/scheme/core.e and epsilon-trunk/bootstrap/scheme/epsilon1.scm: they will be useful to keep around for reference while playing with predefined procedure.
We’re now goning to play with ε₁. As part of this first
introduction we’ll only show already existing forms, without
discussing how the user can define her own macros and transforms.
Notice, however, that the symbol names starting with the conventional
prefix ‘e1:’ are all defined as macros (in some cases also
relying on transforms): essentially all forms occurring in user
expressions don’t come predefined in ε₀.
You’ll need ε₁ also in the future to do more complicated things such as defining new extensions, and possibly even to write another personality to replace ε₁ itself: using ε₀ only is way too cumbersome in practice, and I won’t cover ε₀ at all here.
guile+whatever
is an extended Guile, so Scheme is still available:
(+ 1 2) ⇒ 3
Scheme has been very useful for me during the initial implementation, where I had to carefully mix languages; but it’s more of a source of confusion for users at this point. You’ll have to live with the nuisance of having two different interpreters on the same REPL, for the time being.
We want to distinguish our data and operations from Scheme’s
predefined versions, which are incompatible with ours. By convention,
we name our global identifiers with prefixes associated to informal
“name spaces”: for example our sum operation over fixnums
(integers small enough in modulo to fit within a machine register,
possibly minus a few reserved bits) is fixnum:+
, different from
Guile’s +
which operates on Guile’s s-expressions. At the
current time we also need7 to wrap toplevel expressions within the e1:toplevel
Guile macro, to tell the system that what’s contained should be
evaluated in ε₁, not Guile. Again, remember that this is
just a convention I adopt: I’ve simply decided to use the character
‘:’ to delimit prefixes, even if ‘:’ is not special in any
way and can be used anywhere within symbol names.
I won’t waste time describing ε₁’s syntax: all ε₁ forms are encoded as s-expressions, like in Lisp, and in fact ε₁’s syntax is very similar to Lisp’s: if you can read some Lisp dialect you can read ε₁, even if it’s not compatible with any particular Lisp dialect. As for semantics, ε₁ evaluation is call-by-value, left-to-right; tail calls don’t consume stack space.
3 ⇒ 3 ;; no e1:toplevel: again this was Scheme, not ε₁ (e1:toplevel 3) ⇒ 3 (e1:toplevel (fixnum:+ 1 2)) ⇒ 3
The REPL prints the result in color, to make it clear that it’s an ε value. Unboxed objects are green, boxed objects red (or yellow when they have already been printed as part of the same data structure). Notice that all unboxed objects are shown as fixnums.
[To do: I think this should be moved]
(e1:toplevel (e1:begin ;; like begin in Scheme and progn in other Lisps
1 ;; first run this, ignoring result(s)…
(fixnum:+ 2 3) ;; …then this, again ignoring result(s)…
8)) ;; …till the last form: return its result(s)
⇒ 8
Let’s try cons:make
, a procedure allocating two-element buffers:
(e1:toplevel (cons:make 7 9)) ⇒ 0x2af7350[7 9] ;; a pair, also called a cons (e1:toplevel (cons:make 7 9)) ⇒ 0x3636fe0[7 9] ;; another pair (different pointer) (e1:toplevel (cons:make 1 (cons:make 10 20))) ⇒ 0x306b240[1 0x306b200[10 20]] ;; a pair with another pair inside
The red or yellow hexadecimal numbers represent pointers, and of course their specific values may vary from machine to machine and from execution to execution. The elements between brackets right after each “red” pointer represent the epsilon objects contained in the pointed buffer; such elements in their turn may be boxed or unboxed. In order to avoid potentially infinite printings, yellow pointers are printed without their elements, which are already known anyway. We’ll encounter a yellow pointer soon.
These objects printed in colors are the whatever part of
guile+whatever
, whatever meaning untyped: the
implementation needs to represent “whatevers” differently from
Guile’s predefined s-expressions, which carry tags at runtime; in
guile+whatever
, an epsilon datum is one of a wealth of
s-expressions cases: in “pure” ε implementations such as the
one I’ll get after removing the Guile dependency, there are
ε data only. Later I’ll show two already existing
examples of pure ε implementations, not depending on Guile.
The idea is that the same program can be run with different runtime
libraries, representing objects in a different way: some will be more
efficient, others more forgiving to the programmer.
The form e1:define
adds or replaces a toplevel definition: like
in Scheme8, the same definition form works for both procedures and
non-procedures, according to the shape of the first parameter:
(e1:define x expression1 … expressionm)
defines x as a global, the result of expressionm, obtained after first evaluating
the previous expressions for side effects.
(e1:define (x y1 … yn) expression1
… expressionm)
defines x as a procedure taking the
formals y1 … yn, and having the sequence
expression1 … expressionm as body. The procedure has
the same result(s)9
as the last expression in its body.
Definition themselves return zero results. Notice that global
definitions are expressions: they can occur anywhere expressions can.
Differently from Scheme’s define
, ε₁’s
e1:define
always affects global bindings, even if it’s used
within a deeply-nested expression.
As a special case for convenience you are allowed not to wrap
a toplevel e1:define
in guile+whatever
within
e1:toplevel
; so for example
(e1:toplevel (e1:define x 10))
can be also written more simply as
(e1:define x 10)
.
So, let’s define a global variable:
(e1:define c (cons:make 100 200)) ⇒ ;; zero results
c
is now a pair holding two fixnums:
(e1:toplevel c) ⇒ 0x2b71c80[100 200] (e1:toplevel c) ⇒ 0x2b71c80[100 200] ;; the same object, of course: same pointer
All boxed objects are mutable10. The procedure buffer:set!
takes three parameters:
the object to update, a 0-based field index, and the new value for the field.
ε follows the Scheme convention of using a ‘!’ suffix when naming procedures used for their side effects.
(e1:toplevel (buffer:set! c 0 75))
⇒ ;; zero results
(e1:toplevel c)
⇒ 0x2b71c80[75 200] ;; the pointer didn't change
With mutation it’s easy to build an object referring itself, which yields our first “yellow” pointer:
(e1:toplevel (buffer:set! c 1 c))
⇒ ;; zero results
(e1:toplevel c)
⇒ 0x2b71c80[75 0x2b71c80] ;; c is now cyclic
Of course buffer:set!
isn’t limited to pairs. You can create a buffer of any size
with the procedure buffer:make
, which takes the number of elements as parameter,
and update it:
(e1:define b (buffer:make 4)) ;; make a buffer of four cells
⇒ ;; zero results
(e1:toplevel b)
⇒ 0x149ae10[127 127 127 127]
guile+whatever
currently fills every cell which haven’t been explicitly
initialized with the 127
fixnum.
We can update the buffer now named b
, and look up its elements with buffer:get
:
(e1:toplevel (buffer:set! b 2 99)) ;; make the third element (indexed 2) be 99 ⇒ ;; zero results (e1:toplevel b) ⇒ 0x149ae10[127 127 99 127] ;; same pointer, updated element (e1:toplevel (buffer:get b 0)) ⇒ 127 (e1:toplevel (buffer:get b 2)) ⇒ 99
Fixnums and buffers are the stuff ε values are made of: every datum in ε, (including even reflective objects such as expressions: see below) is in fact encoded using only fixnums and buffers.
You can write booleans as #t
and #f
, like in Scheme; but in fact
any non-#f
value is taken as true, and #f
is just another notation
for 0
:
(e1:toplevel #f) ⇒ 0 ;; the false value #f is just the 0 fixnum (e1:toplevel #t) ⇒ 1 ;; any non-0 value is "true", but using #t is cleaner
In ε₀ characters are just fixnums as well:
(e1:toplevel #\a) ;; lowercase 'a' character (Scheme notation)
⇒ 97 ;; Unicode code point, same as in ASCII
What happens if we go out of bounds in a buffer access?
(e1:toplevel (buffer:get b 50)) error→ out of bounds memory access
Nothing good happens, of course. Errors are not recoverable in ε₁: there is no exception or condition mechanism at this level. What you should do is to prevent error situations from ever happening.
(e1:toplevel (fixnum:/ 10 0)) ;; divide ten by zero error→ division by zero
The runtime of guile+whatever
is relatively friendly: it
prints out an error message, then goes back to the REPL; but that is
not necessarily the case for efficent runtimes: in the same situation
a runtime conceived for speed may crash with a Segmentation Fault,
give a wrong result, or silently affect the system state: for example
if you’re writing out of the bounds of a buffer, an efficient
runtime may silently corrupt some unrelated data structure.
Personalities at a level higher than ε₁ will provide error handing facilities, but here I made the choice to sacrifice any feature which may impact performance or minimality. The details become more apparent when looking at ε₀.
Manipulating buffers using only buffer:make
, buffer:get
and buffer:set!
can become tedious. That’s why I also provide
convenient syntax to make initialized buffers of any fixed size,
called tuples:
(e1:toplevel (tuple:make 10 b #t)) ⇒ 0x21f0850[10 0x149ae10[127 127 99 127] 1] ;; a three-element buffer (e1:toplevel (tuple:make 10 b #t 20 #f)) ⇒ 0x21f82c0[10 0x149ae10[127 127 99 127] 1 20 0] ;; a five-element buffer (e1:toplevel (tuple:make (fixnum:+ 2 2) (fixnum:1+ 2))) ;; as 1+ in Lisp: the successor of 2 ⇒ 0x167f5f0[4 3] ;; arguments are evaluated
The result of evaluating a tuple is just an ordinary buffer: the
result of tuple:make
can’t be distinguished by a buffer made by
buffer:make
and filled by buffer:set!
calls.
As another convenience feature for working on buffers, records provide a way of accessing fields by name in fixed-size buffers:
(e1:toplevel (record:define point x y)) ⊣ Defining the procedure point... ⊣ Defining the procedure point-make-uninitialized... ⊣ Defining the procedure point-explode... ⊣ Defining the procedure point-explode-from-second-element... ⊣ Defining the procedure point-get-x... ⊣ Defining the procedure point-with-x... ⊣ Defining the procedure point-set-x!... ⊣ Defining the procedure point-get-y... ⊣ Defining the procedure point-with-y... ⊣ Defining the procedure point-set-y!... ⇒ ;; zero results
Records are a good example of this general way of using the system:
when we called record:define
to define the record type
point
, the effect was to automatically generate useful
procedures to work on points. Such procedures aren’t special: they
work on buffers and fixnums, and you could write them by hand as well;
but having them automatically generated when defining a record is a
good use of syntactic abstraction: I defined
record:define
(as a macro) once and for all, and from now you can use it as
if it were primitive, thinking at a higher level:
(e1:define p (point 10 20)) ⇒ ;; zero results (e1:toplevel p) ⇒ 0x24c3830[10 20] (e1:toplevel (point-explode p)) ⇒ 10 ;; point-explode yields two results... ⇒ 20 ;; ...you don't know how to use them yet (e1:toplevel (point-get-x p)) ⇒ 10 (e1:toplevel (point-set-x! p (fixnum:+ (point-get-x p) 100))) ⇒ ;; zero results (e1:toplevel p) ⇒ 0x24c3830[110 20] (e1:toplevel (point-with-y p 57)) ;; make a copy of p with a different y ⇒ 0x268fa30[110 57] ;; different address (e1:toplevel p) ⇒ 0x24c3830[110 20] ;; p didn't change
If you only use the procedures automatically generated by
record:define
to work on point data structures, you can ignore
their internal representation
— in this case just the relative order of x
and y
.
However ε₁ never hides information, by design: the
underlying representation is always visible and accessible by
buffer:get
and buffer:set!
. Using these generic buffer
accessors indiscriminately doesn’t sound like a good idea in general,
but the system won’t stop you: you’re supposed to
know what you’re doing.
Vectors are more or less what you would expect: collections of objects (any object, possibly with different shapes), which can be addressed by index:
(e1:define v (vector:make 10))
(e1:toplevel v)
⇒ 0x72ec30[10 127 127 127 127 127 127 127 127 127 127]
The underlying implementation is obvious: vectors are just buffers where the first element holds the number of “payload” elements — which is to say, the number of cells excluding the first cell itself.
vector:get
and vector:set!
work on payload indices,
“skipping” the first element. Again, you could use
buffer:get
and buffer:set!
with incremented indices
instead, but that will only be worth the trouble if you’re
desperate for optimization.
(e1:toplevel (vector:get v 0)) ⇒ 127 ;; not 10: this is the first payload element (e1:toplevel (vector:get v 2)) ⇒ 127 (e1:toplevel (vector:set! v 2 324)) (e1:toplevel (vector:get v 2)) ⇒ 324 (e1:toplevel v) ⇒ 0x72ec30[10 127 127 324 127 127 127 127 127 127 127]
As for generic buffers, there’s no guarantee that bounds are
checked, in general. Because it’s a development tool
guile+whatever
actually checks and in case of failure prints
a reasonable error message; but you shouldn’t expect even that from
the efficient runtimes.
Since the length is stored within the data structure, you can read it back:
(e1:toplevel (vector:length v))
⇒ 10
You can concatenate vectors:
(e1:toplevel (vector:append v (vector:make 3))) ⇒ 0x72efb0[13 127 127 324 127 127 127 127 127 127 127 127 127 127] (e1:define w (vector:make 2)) (e1:toplevel (vector:set! w 0 4)) (e1:toplevel w) ⇒ 0x70c240[2 4 127] (e1:toplevel (vector:append w w w w)) ;; any number of parameters! ⇒ 0x6c9850[8 4 127 4 127 4 127 4 127]
Notice however that vector:append
doesn’t deep-clone elements:
(e1:define vv (vector:make 2)) (e1:toplevel (vector:set! vv 0 (tuple:make 1 2 3))) (e1:toplevel vv) ⇒ 0x240d7f0[2 0x2433d80[1 2 3] 127] (e1:toplevel (vector:append vv (vector:make 2))) ⇒ 0x1bad9f0[4 0x2433d80[1 2 3] 127 127 127] ;; shares 0x2433d80 with vv
If ε₀ characters are just fixnums, as you may guess strings are just vectors of fixnums:
(e1:toplevel "aa") ⇒ 0x16032b0[2 97 97] ;; two characters: #\a and #\a (e1:toplevel (string:append "aa" "bb")) ;; string:append is just an alias ⇒ 0x15d7a00[4 97 97 98 98] (e1:toplevel (vector:append "aa" "bb")) ;; this works just as well ⇒ 0x16c0520[4 97 97 98 98] (e1:toplevel (string:length "foobar")) ;; an alias as well ⇒ 6
Fixnums have a wide enough range to cover all Unicode code points, which is good, but I don’t want to get into the UTF/UCF encoding craziness: so I always use this internal one-fixnum-per-character representation. The implementation uses the nice GNU libunistring (http://www.gnu.org/software/libunistring) by Bruno Haible for input and output.
(e1:toplevel (string:write "Hello there!\n")) ⇒ ;; zero results ⊣ Hello there!
As a temporary limitation of guile+whatever
, you shouldn’t
expect to see output until you also write a final newline character.
ε₁ vectors are not resizable: re-allocating a vector entails changing its address in memory, hence its identity. You can implement resizable vectors, if you want them, by adding a level of indirection: the data structure pointer refers a single-cell buffer pointing to a vector like the one above, which you can replace on resize. This is what I actually did for hash tables, which are implemented as a vector which must be able to accommodate more elements without raising the fill factor over a certain threshold.
This idea of one-element buffers adding a level of indirection for mutable structures is generally useful. I call box11 such a one-element buffer. I’ve defined procedures to allocate, lookup and update boxes:
(e1:define b1 (box:make 10)) (e1:toplevel b1) ⇒ 0x799280[10] ;; just a one-element buffer (e1:toplevel (box:get b1)) ⇒ 10 (e1:toplevel (box:set! b1 45)) ⇒ ;; zero results (e1:toplevel (box:get b1)) ⇒ 45 (e1:toplevel b1) ⇒ 0x799280[45] ;; same address. Here it's important!
I’ve hinted at boxes maintaining an “identity” for a vector which
can be replaced with a resized version. That’s quite easy, if you
keep in mind that box:make
just allocates a box holding the
word you pass, be it a fixnum or a pointer — but the content
is not recursively cloned:
(e1:toplevel v) ⇒ 0x72ec30[10 127 127 324 127 127 127 127 127 127 127] ;; v is 0x72ec30 (e1:define b2 (box:make v)) ;; make a box pointing to the vector (e1:toplevel b2) ⇒ 0x79a5e0[0x72ec30[10 127 127 324 127 127 127 127 127 127 127]] ;; same vector: 0x72ec30 (e1:toplevel (box:get b2)) ⇒ 0x72ec30[10 127 127 324 127 127 127 127 127 127 127] (e1:toplevel (vector:set! (box:get b2) 0 4)) ;; update the vector (e1:toplevel (box:get b2)) ⇒ 0x72ec30[10 4 127 324 127 127 127 127 127 127 127] ;; same (now updated) vector (e1:toplevel v) ⇒ 0x72ec30[10 4 127 324 127 127 127 127 127 127 127] ;; notice the 4 (e1:toplevel (box:set! b2 w)) ;; replace the pointer in the box (e1:toplevel b2) ⇒ 0x79a5e0[0x70c240[2 4 127]] ;; same box, different content (e1:toplevel v) ⇒ 0x72ec30[10 4 127 324 127 127 127 127 127 127 127] ;; still 0x72ec30
Exercise: implement resizable-vector:make
, resizable-vector:get
, resizable-vector:set!
, resizable-vector:length
, resizable-vector:resize!
, using boxes and vectors.
I always try to be precise when speaking about pointers and shared
data, because the idea is important for equality. The
procedure whatever:eq?
corresponds to eq
in Common Lisp
or Emacs Lisp and eq?
in Scheme12: it’s an equality by identity.
Again, we follow the Scheme naming convention according to which
a name ending in ‘?’ identifies a procedure returning a boolean.
Let’s look at how whatever:eq?
behaves. Unboxed objects are easy to compare:
(e1:toplevel (whatever:eq? 7 3)) ⇒ 0 ;; the word 7 is different from the word 3 (e1:toplevel (whatever:eq? 3 3)) ⇒ 1 ;; the word 3 is equal to the word 3 (e1:toplevel (whatever:eq? 97 #\a)) ⇒ 1 ;; the fixnum 97 is the fixnum 97
Notice that whatever:eq?
is an ordinary procedure, so its receives its
parameters already evaluated (as usual, call-by-value left-to-right):
(e1:toplevel (whatever:eq? (fixnum:+ 2 2) 4)) ⇒ 1 ;; the word 4 is equal to the word 4
In case of boxed objects, whatever:eq?
only looks at the two pointers:
(e1:define t1 (tuple:make 1 2 3)) (e1:toplevel t1) ⇒ 0x15e5200[1 2 3] (e1:define t2 (tuple:make 1 2 3)) (e1:toplevel t2) ⇒ 0x183a0c0[1 2 3] ;; same content as t1 (e1:toplevel (whatever:eq? t1 t2)) ⇒ 0 ;; 0x15e5200 is different from 0x183a0c0: not the same object
What happens if we compare a boxed object with a fixnum having the same value as the pointer? Let’s see:
(e1:toplevel t1) ⇒ 0x15e5200[1 2 3] ;; t1 is 0x15e5200 #x15e5200 ;; use Guile to print the address in decimal ⇒ 22958592 (e1:toplevel (whatever:eq? 22958592 t1)) ⇒ 1 ;; the fixnum is "the same" as the pointer
So the answer is that whatever:eq?
doesn’t distinguish between
pointers and non-pointers when comparing.
This issue is actually deeper than it looks. I’ve already made clear that in ε, differently from Lisp, objects doesn’t carry their “type”: an integer, a boolean or a character are represented in the exact same way. However, you might wonder how the system can print its nice object “dumps” distinguishing pointers from non-pointers, also writing buffers with the correct number of elements.
The answer is that, even if there are no types,
guile+whatever
represents boxedness tags: for
each ε object the runtime keeps track of its fixnum-vs.-pointer
nature, and also associates a word containing its length to each
buffer. This is not guaranteed to happen in all runtimes: later we
will see an example of a more efficient runtime which doesn’t store
this information. With that runtime, of course, it won’t be possible
to dump objects using our color notation, or in any other way:
numbers and pointers are indistinguishable, in the general case.
Even if it’s possible to represent boxedness tags without too much overhead, I like the idea of not depending on them, particularly when compiling for very small targets where code size counts; for this reason, ε permits you to use them if you prefer so, but you can do away with them if you want to build something really lean and minimal. Fixnums may be more narrow on efficient runtimes representing boxedness tags: a good implementation reserves one bit for this information within each fixnum/pointer.
There are procedures to access boxedness tags, which always fail in
runtimes not representing them: boxedness:fixnum?
,
boxedness:buffer?
and boxedness:buffer-length
. If you
want your program to run on all runtimes, don’t use them.
boxedness:fixnum?
returns a non-0
value if and
only if its parameter is a fixnum, which is to say a non-pointer:
(e1:define n 10) (e1:define b (box:make 10)) (e1:toplevel (boxedness:fixnum? n)) ⇒ 1 (e1:toplevel (boxedness:fixnum? b)) ⇒ 0
Conversely, boxedness:buffer?
returns a non-0
value if and
only if its parameter is a pointer:
(e1:toplevel (boxedness:buffer? n)) ⇒ 0 (e1:toplevel (boxedness:buffer? b)) ⇒ 1
boxedness:buffer-length
returns the length of the buffer
pointed by its argument. You shouldn’t ever pass it a non-pointer:
guile+whatever
will fail with an error message, but as usual
efficient runtimes may just crash:
(e1:toplevel (boxedness:buffer-length (buffer:make 4))) ⇒ 4 (e1:toplevel (boxedness:buffer-length b)) ⇒ 1 ;; one element (e1:toplevel (boxedness:buffer-length n)) error→ size of non-pointer ;; likely crash, on efficient runtimes
At the cost of being obnoxious let me stress again that
boxedness:fixnum?
, boxedness:buffer?
and
boxedness:buffer-length
are only available on runtimes which
represent boxedness tags: calling any of them on a runtime without
boxedness tags yields a failure situation, or again a crash.
Lists are really nothing new: you just obtain them by chaining
conses13, which is to say two-element buffers, by convention
right-deep, like in Lisp. By convention the empty list is
0
, which can’t be confused with a pointer on modern
machines14,
where memory addresses never have very low values.
More formally, we could state that (by induction) a list is
either the empty list 0
or a cons containing an element
on the left and another list on the right.
You can use 0
and cons:make
or tuple:make
to make
lists, one cons at a time:
(e1:toplevel 0) ⇒ 0 ;; the empty list (e1:toplevel (cons:make 10 (cons:make 20 0))) ⇒ 0x14670b0[10 0x14670f0[20 0]] ;; a list containing 10 and 20
Of course the system doesn’t force you to nest on the right side: for example we wouldn’t call a “list” this left-deep structure, which is the mirror image of the previous example:
(e1:toplevel (cons:make (cons:make 0 20) 10))
⇒ 0x14a0050[0x14a0010[0 20] 10]
Even if the left-deep nested pair above is still a perfectly valid
memory data structure, the ε₁ convenience syntax and
procedures work on ordinary lists, which are right-deep and terminated
with 0
. When used on non-list structures, predefined
functions for lists may fail or give unexpected results.
Given a list, you can check if it’s empty or obtain its head
(first element) and tail (list of all the elements except the
first) with the procedures list:null?
,
list:head
and list:tail
:
(e1:define my-list (cons:make 100 (cons:make 200 0))) (e1:toplevel (list:null? my-list)) ⇒ 0 (e1:toplevel (list:null? 0)) ⇒ 1 ;; the empty list is empty (e1:toplevel (list:head my-list)) ⇒ 100 (e1:toplevel (list:tail my-list)) ⇒ 0x7ddb50[200 0]
list:head
and list:tail
don’t allocate new buffers: they
just return what’s contained in the given cons. In other words, they are
accessors:
(e1:toplevel my-list) ⇒ 0x7ddc50[100 0x7ddb50[200 0]] ;; 0x7ddb50, as the tail above (e1:toplevel (list:tail my-list)) ⇒ 0x7ddb50[200 0] ;; 0x7ddb50 again
list:head
and list:tail
are very fast (they are
essentially calls to buffer:get
with index 0
or
1
), but as usual in ε₁ they don’t check for errors:
you shouldn’t ever call them on an empty list — and using them on
something different than a cons which is part of a list is probably a
bad idea.
(e1:toplevel (list:head 0))
error→ buffer:get on a non-buffer --- or simply crash
(e1:toplevel (list:tail (tuple:make 1 2 3)))
⇒ 2 ;; not a list! This isn't an error, but using list:tail on non-lists is confusing
I’ve defined several utility procedures working on lists. One of them is
the one-parameter procedure list:iota
(name inspired by APL,
thru Guile and MIT Scheme) returns a list holding all fixnums from
0
included to the argument excluded.
(e1:toplevel (list:iota 0)) ⇒ 0 ;; the empty list (e1:toplevel (list:iota 2)) ⇒ 0x14a8100[0 0x14a80a0[1 0]] (e1:toplevel (list:iota 10)) ⇒ 0x14b02d0[0 0x14b0270[1 0x14b0230[...]]]
The REPL didn’t print the complete structure, because we hit the depth
limit. We can raise or eliminate the limit by calling
set-whatever-dump-maximum-depth!
, which is currently a
Guile procedure — which means that we can’t use
e1:toplevel
. We can give the procedure either a natural
number, of #f
to mean “no limit”:
(set-whatever-dump-maximum-depth! #f)
(e1:toplevel (list:iota 10))
⇒ 0x14b85d0[0 0x14b8570[1 0x14b8530[2 0x14b84f0[3 0x14b8490[4 0x14b8430[5 0x14b83d0[6 0x14b8370[7 0x14b8310[8 0x14b82b0[9 0]]]]]]]]]]
A generalization of list:iota
is list:range
, a
procedure taking two fixnum parameters and returning a list of
fixnums, from the the first to the second, both included — or an
empty list if the first parameter is greater than the second:
(e1:toplevel (list:range 10 25))
⇒ 0x12701d0[10 0x11ec730[11 0x11ec6d0[12 0xde34d0[13 0x105d290[14 0x105d230[15 0x11a82e0[16 0x891af0[17 0x1208e70[18 0x90e010[19 0x90dfb0[20 0x65ed00[21 0x65eca0[22 0x76c900[23 0x6d1830[24 0x7f7830[25 0]]]]]]]]]]]]]]]]
The e1:length
procedure takes a list and returns its length, as a fixnum:
(e1:toplevel (list:length (list:iota 10000))) ⇒ 10000 ;; the first 10000 naturals are 10000 in number (e1:toplevel (list:length 0)) ⇒ 0 ;; the empty list has zero elements
Re-defining list utility procedures is a good programming exercise.
Let’s implement mylength
, our own version of
list:length
.
mylength
will be a recursive procedure doing a case analysis on
its parameter. You’ll need a conditional. ε₁ has a good
variety of Lisp-style conditionals, including:
e1:when
, e1:unless
;
e1:if
;
e1:cond
, e1:case
.
Syntax and semantics follow Scheme conventions, apart from the usual ‘e1:’ prefix.
Our length procedure has only two cases: empty list, or non-empty
list. The length of an empty list is zero, and the length of a
non-empty list is one plus the length of its tail. We can compute
“one plus” with either the two-parameter fixnum:+
(giving
it 1
as one parameter) or with the one-parameter fixnum:1+
.
As a naming convention, I sometimes use “plural” variable names such
as as
, bs
, xs
and ys
for list objects.
(e1:define (mylength xs) (e1:if (list:null? xs) 0 (fixnum:1+ (mylength (list:tail xs)))))
Does it work?
(e1:toplevel (mylength (list:iota 100)))
⇒ 100
It seems to work. But if we test it with a bigger list, the thing fails:
(e1:toplevel (mylength (list:iota 1000000))) ;; dangerous: don't try this error→ some strange error or crash
I was careful to make list:iota
tail-recursive but
mylength
is clearly not (the recursive call occurs in a
non-tail position, as the argument of a fixnum:1+
call). Since
mylength
consumies an unbounded quantity of stack space
proportional to the list length, it isn’t really usable on large
arguments.
It’s easy to redefine mylength
to be tail-recursive:
(e1:define (mylength xs) (mylength-acc xs 0)) (e1:define (mylength-acc xs acc) (e1:if (list:null? xs) acc (mylength-acc (list:tail xs) (fixnum:1+ acc))))
The new mylength
can compute the length of any list in
constant space:
(e1:toplevel (mylength (list:iota 1000000)))
⇒ 1000000
As a further example, let’s compute the last element of the given list. There are three cases: an empty list (on which we fail), a one-element list, (of which we know the last element), or a list with two or more elements:
(e1:define (mylast xs) (e1:cond ((list:null? xs) (e1:error "mylast of empty list")) ((list:null? (list:tail xs)) (list:head xs)) (else ;; #t works as well, like t in Common Lisp/Emacs Lisp (mylast (list:tail xs)))))
The only recursive call is already in tail position.
You should think of e1:myerror
as a procedure which potentially
crashes the system or brings it into an unrecoverable “failure
state”. However, before crashing, e1:error
will at least
print an error message. If you want to optimize at the cost of being
even more unforgiving than e1:error
, you can remove the first
e1:cond
case:
(e1:define (mylast xs) (e1:cond ((list:null? (list:tail xs)) (list:head xs)) (else (mylast (list:tail xs)))))
A two-way e1:cond
may look better as an e1:if
:
(e1:define (mylast xs) (e1:if (list:null? (list:tail xs)) (list:head xs)) (mylast (list:tail xs)))
We may want to factor away the two calls to list:tail
, even if
in practice they aren’t that expensive and the implementation is not
yet particularly efficient15. The idea, of course, is
using a let
block. ε₁ has Lisp-style e1:let
and e1:let*
:
(e1:toplevel (e1:let ((a 1) (b 2)) (fixnum:+ a b))) ⇒ 3 ;; e1:let* would behave just as e1:let here (e1:define g 100) (e1:toplevel (e1:let ((g 10) (b g)) ;; this sees the global g (fixnum:+ g b))) ⇒ 110 (e1:toplevel (e1:let* ((g 10) (b g)) ;; this sees the local g (fixnum:+ g b))) ⇒ 20
Using a block, the fast but unfriendly version of mylast
becomes:
(e1:define (mylast xs) (e1:let ((tail (list:tail xs))) (e1:if (list:null? tail) (list:head xs) (mylast tail))))
If you want to do exercises, you can re-implement in ε₁ any of the list procedures in core.e and epsilon1.scm, using recursion. You’ll see that the source files are divided into “sections”, delimited by a line full of semicolons plus another comment line containing a title. Look at sections containing the word “List” in their names.
When testing your code, you’ll probably want to use the ε₁ form
list:list
, which resembles Lisp’s list
:
(e1:toplevel (list:list 1 (fixnum:+ 10 20) 435 -2)) ⇒ 0xcfbbc0[1 0xcfc000[30 0x840650[435 0xe59640[-2 0]]]] (e1:toplevel (mylast (list:list 1 2 3 4 5))) ⇒ 5
list:list
may be convenient, but it’s nothing very deep: it’s
just a macro which expands to an expression using list:cons
as
many times as needed, with the correct nesting. Even without knowing
anything about macros, we can check this with the Guile debugging
procedure meta:macroexpand
(notice the quote):
(meta:macroexpand '(list:list 1 (fixnum:+ 10 20) 435 -2)) ⊣ [call list:cons 1₆₈₆₀₀ [call list:cons [call fixnum:+ 10₆₈₆₀₁ 20₆₈₆₀₂]₆₈₆₀₃ [call list:cons 435₆₈₆₀₄ [call list:cons -2₆₈₆₀₅ list:nil₆₈₆₀₆]₆₈₆₀₇]₆₈₆₀₈]₆₈₆₀₉]₆₈₆₁₀
What’s that strange notation? Well, for the first time you’re
looking at ε₀ expressions, which I intentionally made visually
distinct, using brackets. In fact when an ε₁ macro call is
completely expanded, it always yields its result as an ε₀
expression, ready to be executed or compiled. Even if you don’t
know ε₀ yet, you can already read this: the expression
consists in nested procedure calls, using as arguments either fixnum
literals, or list:nil
; list:nil
is just a global
variable, defined as 0
. The subscript numbers are
handles, unique identifiers attached to each ε₀
expression. list:cons
is just an alias of cons:make
.
Just to have a peek at ε₀, let’s look at the definitions of some simple procedures, with their bodies already expanded to ε₀ expressions:
(meta:print-procedure-definition 'list:length) ⊣ Formals: (x) ⊣ [call list:length-acc x₂₆₆ 0₂₆₇]₂₆₈ (meta:print-procedure-definition 'list:length-acc) ⊣ Formals: (x acc) ⊣ [if x₂₆₉ ∈ {0} then acc₂₇₀ else [call list:length-acc [call list:tail x₂₇₁]₂₇₂ [call fixnum:1+ acc₂₇₃]₂₇₄]₂₇₅]₂₇₆
The only conditional in ε₀ is a slightly unusual “[if
e ∈ constants then e else e]
”; the reason is
efficiency: such conditionals are easy to nest and compile using jump
tables or comparison trees, which may be an important optimization for
some styles of programming. See how small these handles are? The list
length procedures were defined very early in the bootstrap process.
You don’t really need to do anything with ε₀ at this time: the only thing you should remember is that each piece of ε₁ code is always translated into ε₀ before execution, compilation, or even procedure definition. The expanded ε₀ code will usually be longer and less human-friendly than the corresponding ε₁ version, but much easier to execute and analyze.
Now that you’ve seen meta:print-procedure-definition
, we can use it
to look at how list:null?
works:
(meta:print-procedure-definition 'list:null?) ⊣ Formals: (list) ⊣ [call whatever:zero? list₂₅₈]₂₅₉
whatever:zero?
computes what you’d expect, but its definition may look unusual:
(meta:print-procedure-definition 'whatever:zero?) ⊣ Formals: (x) ⊣ [call e1:not x₂₅₈]₂₅₉
e1:not
is also interesting, by the way:
(meta:print-procedure-definition 'e1:not) ⊣ Formals: (condition) ⊣ [if condition₅₀₈₈ ∈ {0} then 1₅₀₈₇ else 0₅₀₈₉]₅₀₉₀
Now, can you explain why whatever:zero?
and list:null?
are essentially aliases for e1:not
?
[To do: fill this]
Look again at our formal definition of lists: by induction, a list is either the empty list or a cons containing an element and another list.
This data structure definition by alternative cases which are allowed to be recursive is a useful idea, more general than lists. For example, let’s say that we want to define binary trees. We might say that “by induction, a binary tree is either empty or a non-empty triple containing a left (sub-)tree, a root element, and a right (sub-)tree”.
This kind of structure definition is popular in the functional programming community where it’s known as sum or sum-of-products — a “product” being an n-uple, and a “sum” being a union of disjoint sets where each element of the union conceptually carries a tag representing the origin set.
Differently from most functional languages and as usual in
ε₁, the system won’t force you to respect any “typing”
constraint: for example storing something different from a tree in the
left
field of a tree is confusing, but you can do it if you
want.
Like I did for records, I provide a form to automatically generate the needed constructor and accessor procedures, given a sum definition.
Here’s a definition in ε₁ for our sample binary tree, having
two cases: the empty
case with no fields, and the
non-empty
case having fields named left
, root
and
right
:
(e1:toplevel (sum:define tree (empty) (non-empty left root right))) ⊣ Defining the procedure tree-empty... ⊣ Defining the procedure tree-empty?... ⊣ Defining the procedure tree-empty-explode... ⊣ Defining the procedure tree-non-empty... ⊣ Defining the procedure tree-non-empty-make-uninitialized... ⊣ Defining the procedure tree-non-empty-explode... ⊣ Defining the procedure tree-non-empty-explode-from-second-element... ⊣ Defining the procedure tree-non-empty-get-left... ⊣ Defining the procedure tree-non-empty-with-left... ⊣ Defining the procedure tree-non-empty-set-left!... ⊣ Defining the procedure tree-non-empty-get-root... ⊣ Defining the procedure tree-non-empty-with-root... ⊣ Defining the procedure tree-non-empty-set-root!... ⊣ Defining the procedure tree-non-empty-get-right... ⊣ Defining the procedure tree-non-empty-with-right... ⊣ Defining the procedure tree-non-empty-set-right!... ⊣ Defining the procedure tree-non-empty?...
Of course trees are just buffers and fixnums. We can look at their
memory representation by building two small samples: an empty tree,
and a non-empty tree having 42
as its root and empty
left and right subtrees:
(e1:toplevel (tree-empty)) ⇒ 0 (e1:toplevel (tree-non-empty (tree-empty) 42 (tree-empty))) ;; 42 ⇒ 0x12376e0[0 42 0] ;; / \
The representation in memory is efficient, and very similar to what
we already saw about lists: the empty tree is 0
, and
non-empty trees point to three-element buffers containing in order
left subtree, root, and right subtree.
Of course we can build trees of any size:
(e1:define t (tree-non-empty (tree-non-empty (tree-empty) 1 (tree-empty)) ;; 2
2 ;; / \
(tree-non-empty (tree-empty) 3 (tree-empty)))) ;; 1 3
(e1:toplevel t) ;;/ \/ \
⇒ 0x1021e80[0xfe0040[0 1 0] 2 0xfdff20[0 3 0]]
Since sums usually have more than one case, it’s useful to query the case of a given object: of course you shouldn’t use tree procedures on anything but trees, but for any given sum all cases (in this case empty and non-empty) are always safe for case-querying:
(e1:toplevel (tree-empty? (tree-empty))) ⇒ 1 ;; the empty tree is in fact empty (e1:toplevel (tree-non-empty? (tree-empty))) ⇒ 0 ;; the empty tree isn't non-empty (e1:toplevel (tree-non-empty? t)) ⇒ 1
If you are sure that the object is the appropriate case of a sum you can use accessors, which are essentially the same as for records:
(e1:toplevel (tree-non-empty-get-right t)) ⇒ 0xfdff20[0 3 0] (e1:toplevel (tree-non-empty-get-root t)) ⇒ 2 (e1:toplevel (tree-non-empty-get-root (tree-non-empty-get-left t))) ⇒ 1 ;; the root of the left subtree of t
Since you can’t resize existing buffers or update unboxed objects, you can’t in general mutate an object from a case to another. However in ε₁ you can always mutate sum fields, and you get handy procedures for that:
(e1:toplevel (tree-non-empty-set-right! t (tree-empty)))
;; t's left subtree is now empty
(e1:toplevel t)
⇒ 0x1021e80[0xfe0040[0 1 0] 2 0]
Like for records, you can make copies with a different field:
(e1:toplevel (tree-non-empty-with-left t (tree-empty))) ⇒ 0x1a202a0[0 2 0] (e1:toplevel t) ⇒ 0x1021e80[0xfe0040[0 1 0] 2 0] ;; t didn't change
Well-designed recursive sums lend themselves particularly well to recursive programming, since case analysis tends to follow the structure of sum cases, with recursive calls occurring on recursive substructures.
As a simple example, let the height of a tree be zero for empty trees; and for non-empty trees, let it be one plus the height of the tallest subtree (left or right). Easy enough:
(e1:define (height t) (e1:if (tree-empty? t) 0 (e1:let ((left-height (height (tree-non-empty-get-left t))) (right-height (height (tree-non-empty-get-right t)))) (fixnum:1+ (e1:if (fixnum:> left-height right-height) left-height right-height)))))
It works:
(e1:toplevel (height t))
⇒ 2 ;; the left subtree has height one, the right one is empty
height
was a little clumsy to write. We can make the definition much
nicer if we recognize that the inner e1:if
above is just the
computation of the maximum between two fixnums.
There is already a fixnum:max
procedure; but even if you forgot
about that, you could redefine a maximum procedure for fixnums by yourself:
(e1:define (my-max a b) (e1:if (fixnum:> a b) a b))
And so, thanks to procedural abstraction, we can refactor
height
to be much more readable:
(e1:define (height t) (e1:if (tree-empty? t) 0 (fixnum:1+ (fixnum:max (height (tree-non-empty-get-left t)) (height (tree-non-empty-get-right t))))))
Sums work nicely, but they potentially hide a tricky representation problem. Think once more about case-querying; how can the system distinguish one case from another, for our trees? Sums must work on every runtime, so case-querying procedures can’t rely on boxedness tags.
What distinguishes an empty tree from a non-empty tree? More generally, what distinguishes a boxed sum case from an unboxed case, and one case from another?
(e1:toplevel (sum:define s1 (a) (b) (c))) ⊣ ;; [procedures are automatically defined, as usual]
The s1
sum has three cases: a
, b
and c
,
all with zero fields. In practice an s1
object is nothing more
than an enum
type in C: every s1
is a
, or b
, or
c
; nothing more. The implementation is also just as trivial:
(e1:toplevel (s1-a)) ⇒ 0 (e1:toplevel (s1-b)) ⇒ 1 (e1:toplevel (s1-c)) ⇒ 2
So cases with no arguments can always be represented as unboxed objects, distinct from one another. The constructors for such cases always return the same results when called multiple times:
(e1:toplevel (s1-a)) ⇒ 0 ;; again (e1:toplevel (s1-a)) ⇒ 0 ;; no pointers: there is really only one a
Notice also that 0
is the representation of both the
a
case of s1
, and of the empty
case of
tree
: it’s impossible to distinguish them from one another,
even you can always tell apart different cases of the same sum.
Let’s define a sum having more than one case with fields:
(e1:toplevel (sum:define s2 (a m n) (b q))) ⊣ ;; [automatic procedure definitions]
An s2
may be either an a
, containing two fields m
and n
, or a b
, containing one field q
. Ok, but
how are s2
’s represented in memory?
(e1:toplevel (s2-a 100 200)) ⇒ 0x19ba510[0 100 200] (e1:toplevel (s2-a 100 200)) ⇒ 0x19c0410[0 100 200] ;; different pointer! (e1:toplevel (s2-b 500)) ⇒ 0x19c5f70[1 500]
Of course both cases are boxed now; but the interesting difference is
how the first work of each object is now a tag identifying the
case: either 0
for a
, or 1
for
b
. Of course you don’t need to remember the presence of the
tag if you use the automatically-generated accessors for working with
fields:
(e1:toplevel (s2-b-get-q (s2-b 600))) ⇒ 600 ;; you don't need to remember that q is the second field (e1:toplevel (s2-a? (s2-b 700))) ⇒ 0 (e1:toplevel (s2-b? (s2-b 800))) ⇒ 1
Sums only need to contain tag words when they have more than one boxed case.
Unboxed cases can always be distinguished from pointers even without
boxedness tags, because they are small numbers: no modern
system will allocate a heap buffer at an address such as 0
,
1
, 2
or even 1024
, which is still a reasonable
upper bound on the number of cases, even in pathological situations.
Thanks to this fact, we can afford a much more efficient
representation for cases with no fields.
We have already said that lists are a sum: indeed, epsilon1.scm contains the definition:
(e1:toplevel (sum:define list:list (nil) (cons head tail)))
Of course there are convenience aliases for common operations; for
example list:head
, which we’ve already shown, is easier to
write than the automatically-generated name
‘list:list-cons-get-head’; however the underlying representation
exactly follows our description above: one unboxed case,
0
, plus one boxed case with two fields. Tag words
aren’t needed, so list conses can be represented
in a compact way
in memory, using only two words per cons. Since some lists may
contain a very large number of elements, saving one word per element
can make quite a
difference for performance.
An open sum is a sum to which you can add more cases later on. Since there’s no way to tell how many boxed cases will be needed in the end, all boxed cases have to contain a tag word; apart from this slight inefficiency, open sums work just like non-open sums.
Let’s define an open sum named os
. You can use form
sum:define-open
(which of course is defined as a macro as well)
just like sum:define
:
(e1:toplevel (sum:define-open os
(x)
(y a)))
⊣ ;; automatic procedure definitions...
(e1:toplevel (os-y 68))
⇒ 0x901640[1 68]
We gave our sum os
two cases named x
and y
.
Let’s add one more, z
:
(e1:toplevel (sum:extend-open os (z a))) ⊣ ;; automatic procedure definitions... (e1:toplevel (os-z 6)) ⇒ 0x8a7608[2 6] (e1:toplevel (os-y 6)) ⇒ 0x8a5bd0[1 6] ;; "old" cases still work
We can keep adding cases even after we’ve started to make instances,
as you already saw. Let’s add two more, t1
and t2
:
(e1:toplevel (sum:extend-open os (t1 a b c) (t2))) ⊣ ;; automatic procedure definitions... (e1:toplevel (os-t1 10 20 30)) ⇒ 0xcdc3d8[3 10 20 30] (e1:toplevel (os-y 7)) ⇒ 0x8a30e8[1 7]
At this point you should be ready to do some simple ε₁ programming, trying to come up with an implementation before reading my code.
If we accept to rely on boxedness tags, we can have a structural
equality procedure similar to equal
in Common Lisp and Emacs
Lisp or to equal?
in Scheme. Boxedness tags are necessary,
since we can’t know the object shape in advance, and if we are to
dereference pointers to compare corresponding buffers, we also need to
recognize them as pointers, and obtain buffer lengths.
Defining the procedure whatever:equal?
is a good programming
exercise. It only relies on Lisp-style if
, cond
,
let
, and
, not
(all available in ε₁ with
the prefix ‘e1:’), plus trivial arithmetics (such procedures are
already available for fixnums, with the prefix ‘fixnum:’).
Our whatever:equal?
is a typical recursive procedure of two
arguments, performing case analysis. There are three possible cases:
whatever:eq?
;
Even if we haven’t written the helper procedure
whatever:buffer-equal?
yet, we can already define the main
procedure:
(e1:define (whatever:equal? a b) (e1:let ((fixnum-a (boxedness:fixnum? a)) (fixnum-b (boxedness:fixnum? b))) (e1:cond ((e1:and fixnum-a fixnum-b) (whatever:eq? a b)) ((e1:and (e1:not fixnum-a) (e1:not fixnum-b)) (whatever:buffer-equal? a b)) (else #f))))
Two buffers are equal when they have the same length, and all corresponding elements are equal.
There are several ways of defining whatever:buffer-equal?
, but
the easiest is probably using another helper procedure (this time
recursive), checking the array content from a given index to the end.
However, we don’t even need to look at the elements if the two buffers
have different lengths:
(e1:define (whatever:buffer-equal? pointer-1 pointer-2) (e1:let ((length-1 (boxedness:buffer-length pointer-1)) (length-2 (boxedness:buffer-length pointer-2))) (e1:if (whatever:eq? length-1 length-2) ;; lengths are fixnums (whatever:buffer-equal-from-length? pointer-1 pointer-2 0 length-1) #f)))
And now, finally, the recursive helper function. It returns #f
immediately if it finds a difference at some point, in which case it’s
useless to look at the rest. If the procedure reaches the buffer
ends without having found a difference, it concludes that the buffers
were equal — whatever:buffer-equal-from-to?
is only used on
buffers of the same length, so we don’t have to worry about reaching the end
on just one of the two buffers.
Notice that whatever:buffer-equal-from-to?
recursively calls
whatever:equal?
when comparing buffer elements: this is
reasonable, because the two corresponding elements can themselves be
any combination of fixnums and pointers.
(e1:define (whatever:buffer-equal-from-length? pointer-1 pointer-2 from length) (e1:cond ((whatever:eq? from length) #t) ((whatever:equal? (buffer:get pointer-1 from) (buffer:get pointer-2 from)) (whatever:buffer-equal-from-length? pointer-1 pointer-2 (fixnum:1+ from) length)) (else #f)))
We can now test whatever:equal?
:
(e1:toplevel (whatever:equal? (tuple:make 1 (tuple:make 43 56)) (tuple:make 1 (tuple:make 43 56) 1))) ⇒ 0 (e1:toplevel (whatever:equal? (tuple:make 1 (tuple:make 43 56) 1) (tuple:make 1 (tuple:make 43 56) 2))) ⇒ 0 (e1:toplevel (whatever:equal? (tuple:make 1 (tuple:make 43 56) 1) (tuple:make 1 (tuple:make 43 7) 1))) ⇒ 0 (e1:toplevel (whatever:equal? (tuple:make 1 (tuple:make 43 56) 1) (tuple:make 1 (tuple:make 43 56) 1))) ⇒ 1
If you want to do another similar exercise by yourself you can write a deep-cloning procedure, a hashing procedure, or a lexicographic comparison procedure.
You could also try to improve whatever:equal
so that it never
loops on cyclic structures, but that’s much more difficult
(Hint: you need two associative data structures using pointers as keys
and sets of pointers as data, to keep track of tentative and proved
equalities).
The ε₁ form e1:value
resembles Lisp’s quote
,
and serves to tell the system that we are interested in some immediate
constant as a data structure; it is mainly useful for symbols:
(e1:value foo)
evaluates to the symbol foo
as a data
structure, which of course isn’t the same as the variable named
foo
. On the other hand (e1:value 42)
works exactly
like 42
in ε₁.
Let’s have a look at the symbol foo
as a data structure:
(e1:toplevel (e1:value foo))
⇒ 0x1484770[0x9a60b0[3 102 111 111] 0 127 0 0 0 0 0 0 0 0]
It’s a large boxed data structure — don’t worry, you won’t need to
remember all fields, or their positions.
The first field is the
string 0x9a60b0[3 102 111 111]
, containing its
length followed by the characters 102, 111 and 111; which is to say "foo"
, the symbol name.
The other fields
contain the symbol value as a global, a procedure, a macro, and other
such information. Since foo
isn’t the global name of anything,
there is nothing very interesting to see: we just have
0
in fields which otherwise would usually point to boxed data.
The 127
fixnum is actually irrelevant, and could be any
other value; that field holds the value associated to foo
as a
global variable, if any: the previous field is a flag saying whether
such value exists, and in this case it’s 0
, because
there’s no global variable named foo
. The flag is necessary
since any value, boxed or unboxed, including 0
or
127
, is a potentially valid value for a global
variable.
Of course named symbols are unique after interning, like in Lisp.
Therefore we can check whether two symbols are equal with
whatever:eq?
— in practice by comparing their pointers,
which is a very fast operation:
(e1:toplevel (e1:value foo)) ;; same symbol name as before... ⇒ 0x1484770[0x9a60b0[3 102 111 111] 0 127 0 0 0 0 0 0 0 0] ;; ...same pointer! (e1:toplevel (whatever:eq? (e1:value foo) (e1:value foo))) ⇒ 1 ;; 0x1484770 is equal to 0x1484770
Now let’s define a global named ‘foo’, then look at the symbol again:
(e1:define foo 671)
(e1:toplevel (e1:value foo))
⇒ 0x1484770[0x9a60b0[3 102 111 111] 1 671 0 0 0 0 0 0 0 0]
What happened shouldn’t be surprising at this point: the symbol now
has a global value, so the flag changed from 0
to 1
; and the field following it changed from
127
, which was unused with the flag set to
0
, to the appropriate value 671
.
Of course the symbol field holds its global value. Formal parameters or local variables named ‘foo’ don’t affect it:
(e1:define (bar foo) (fixnum:+ foo foo)) (e1:toplevel (bar 57)) ⇒ 114 (e1:toplevel (e1:let* ((a 10) (foo (+ a 2))) foo)) ⇒ 12 (e1:toplevel (e1:value foo)) ;; nothing changed in foo ⇒ 0x1484770[0x9a60b0[3 102 111 111] 1 671 0 0 0 0 0 0 0 0]
To make things more interesting, now let’s define a procedure
named ‘foo’. Our procedure will be a trivial zero-argument
constant function, always returning 82
.
(e1:define (foo)
82)
(e1:toplevel (e1:value foo))
⇒ 0x1484770[0x9a60b0[3 102 111 111] 1 671 0 0xd74d98[1 68979 82] 0 0 0 0 0 0]
You can immediately notice that the global binding to
671
survived: this means that a symbol can hold a
global non-procedure and a global procedure at the same time,
like in Common Lisp and Emacs Lisp, but differently from Scheme:
(e1:toplevel (foo)) ⇒ 82 (e1:toplevel foo) ⇒ 671
This possibility of mapping different global entities (non-procedures, procedures, macros, ...) to the same name is a natural consequence of how symbols work.
You can also see that our (simple) procedure doesn’t look very hard to
read in the data structure: what changed is just the new buffer
0xd74d98[1 68979 82]
. Can we understand it? It
turns out that yes, we can read it quite easily. Since the buffer
contains the fixnum 82
, you may guess that the buffer
at 0xd74d98
represents the procedure body; indeed it does.
Look for the section named “Expressions as an open sum type” in epsilon1.scm. You’ll find this code:
(e1:toplevel (sum:define-open e0:expression (variable handle name) (value handle content) (bundle handle items) (primitive handle name actuals) (let handle bound-variables bound-expression body) (call handle procedure-name actuals) (call-indirect handle procedure-expression actuals) (if-in handle discriminand values then-branch else-branch) (fork handle procedure-name actuals) (join handle future)))
The section title says it all: ε₀ expressions are an open sum. The sum cases you see above are everything which remains after macroexpanding and transforming; some cases which are added later are always rewritten away before execution or compilation, until you get only ε₀ expressions as specified above; there is really nothing more.
[Move this: Let me stress again how simple ε₀ expressions are: at ten cases total, this is much simpler than any mainstream language including “small” languages such as Scheme and SML, and much smaller than ε₁ as well.]
I won’t explain every case now; you can already understand several of them. You should notice that all cases are boxed, and they all contain a handle as their first element: you already saw handles before, printed as subscript numbers by the debugging facility.
Now you have all the information you need to understand
0xd74d98[1 68979 82]
as an e0:expression
sum: it has a tag word 1
,
hence it’s the second case from the beginning, value
; it
contains a handle
with value 68979
, and a
content
with value 82
.
You can check this reasoning with meta:print-procedure-definition
:
(meta:print-procedure-definition 'foo)
⊣ Formals: ()
⊣ 82₆₈₉₇₉
Let’s redefine the procedure foo
as something slightly more
complex: let’s make it the identity function, taking one
parameter and returning it unchanged. Now, to make our dump a little
easier to read, I’ll use foo
as the formal parameter
name as well.
(e1:define (foo foo) foo)
You shouldn’t worry about foo
being associated at the same time
with a global, a procedure and a parameter: foo
as occurs in
the body can’t be a procedure reference, and in ε₁ (like in
any other reasonable language) a parameter binding takes precedence
over a global binding.
(e1:toplevel foo) ⇒ 671 ;; our non-procedure global (e1:toplevel (foo 3)) ⇒ 3
Everything works. Let’s look at the symbol again:
(e1:toplevel (e1:value foo))
⇒ 0x1484770[0x9a60b0[3 102 111 111] 1 671 0xf44178[0x1484770 0] 0xf44638[0 69017 0x1484770] 0 0 0 0 0 0]
There are two differences with respect to the previous version:
671
there is the buffer
0xf44178[0x1484770 0]
instead of
0
;
0xf44638[0 69017 0x1484770]
.
The first difference shows that a symbol field contains a list of
formal parameters, as symbols. When the procedure had zero parameters
the list was empty, now it’s a one-element list containing
0x1484770
, which is foo
itself.
The second change is even easier to follow: the procedure body is now
the first case (tag 0
) of e0:expression
, named
variable
: and the variable is 0x1484770
—
once more, the symbol foo
. Finally the new expression got a
fresh handle, 69017
.
(meta:print-procedure-definition 'foo) ⊣ Formals: (foo) ⊣ foo₆₉₀₁₇
Within the e0:expression
sum, variables are always represented
as symbols, and sequences (be they variables, values or expressions)
as lists. Of course all sub-expressions are e0:expression
sums: the e0:expression
sum is recursive.
As another example, let’s see how a slightly more complicated expression
translates to ε₀. The e1:bundle
form serves to
return multiple results, like values
in Scheme and Common Lisp:
(e1:define (my-procedure) (e1:bundle 60 70)) (e1:toplevel (my-procedure)) ⇒ 60 ⇒ 70
Good, my-procedure
works. Let’s look at its body:
(e1:toplevel (e1:value my-procedure))
⇒ 0xb56ab8[0xb3d3b0[12 109 121 45 112 114 111 99 101 100 117 114 101] 0 127 0 0xaf14e8[2 75167 0xaf14b8[0x1a597c8[1 75165 60] 0xaf3bd8[0xaf3bb8[1 75166 70] 0]]] 0 0 0 0 0 0]
We got the information we wanted, but we saw the whole symbol as well.
If we only want the procedure body, we can use the procedure
state:procedure-get-body
, which takes a symbol and returns the
body of the procedure named after it. Actually
state:procedure-get-body
is a simple selector looking up the
given buffer at the appropriate index — notice that 0
isn’t a valid value for any e0:expression
case, so a
0
result represents the absence of a procedure
binding; we don’t need a separate flag as for non-procedure bindings.
(e1:toplevel (state:procedure-get-body (e1:value my-procedure))) ⇒ 0xaf14e8[2 75167 0xaf14b8[0x1a597c8[1 75165 60] 0xaf3bd8[0xaf3bb8[1 75166 70] 0]]] (e1:toplevel (state:procedure-get-body (e1:value this-is-not-a-procedure-name))) ⇒ 0 (e1:toplevel (e1:value this-is-not-a-procedure-name)) ⇒ 0x43d05e0[0x43cf7f0[28 116 104 105 115 45 105 115 45 110 111 116 45 97 45 112 114 111 99 101 100 117 114 101 45 110 97 109 101] 0 127 0 0 0 0 0 0 0 0] ;; the body field is 0
So we’ve seen that the bundle expression indeed contains a list of value expressions. Let’s check:
(meta:print-procedure-definition 'my-procedure) ⊣ Formals: () ⊣ [bundle 60₇₅₁₆₅ 70₇₅₁₆₆]₇₅₁₆₇
Thanks to their structure, it’s easy to work on expressions with recursive procedures: notably it’s easy to write an ε₀ interpreter in ε₁, and in fact you can find one in core.e — actually written in a very restricted subset of ε₁, very close to ε₀. Much in the same way, it’s not hard to write a compiler in ε₁ which translates ε₀ into native code — and I did that as well: see compiler.e.
In order for a compiler to work, it has to access all global definitions. As you saw, global procedures and non-procedures for a given symbol are easy to find; but where do you find all interned symbols? The answer, of course, is in the symbol table.
It may look a little frightening because of its size, but in the end the symbol table (a hash table mapping symbol names as strings into symbols) is just a data structure like any others, made of fixnums and pointers. Its written dump is very large, and not by accident: starting from the symbol table you can reach every symbol, and symbols point to their global definitions as globals and procedures... Since procedure formals and bodies contain symbols, you’ll see many “yellow” pointers as well.
(e1:toplevel symbol:table)
⇒ 0x946950[0x946990[4096 1993 0xc095b0[0xc09670[0xa1d780[0] 0xa1d700[0xa1d780 0 127 0 0 0 0 0 0 0 0]] 0xc095f0[0xc09630[0x980980[10 99 111 110 100 105 116 105 111 110 115] 0x980900[0x980980 0 127 0 0 0 0 0 0 0 0]] 0]] 0xc09530[0xc09570[0xb7dfa0[10 98 111 117 110 100 45 102 111 114 109] 0xb7df20[0xb7dfa0 0 127 0 0 0 0 0 0 0 0]] 0] 0 0 0 0xc094b0[0xc094f0[0xa17200[14 115 117 109 58 100 101 115 99 114 105 112 116 111 114] 0xa8fad0[0xa17200 0 127 0xa90200[0xa17110[0xa17190[7 116 114 105 118 105 97 108] 0 127 0 0 0 0 0 0 0 0] 0xa90240[0xa16fd0[0xa17050[17 99 97 115 101 115 45 115 101 120 112 114 101 115 115 105 111 110] 0 127 0 0 0 0 0 0 0 0] 0]] 0xa8fb50[4 18816 0xa901c0[0xa8fc80[0xa8fd00[3 95 57 52] 0 127 0 0 0 0 0 0 0 0] 0] 0xa900f0[5 18815 0x95ca20[0x95cd00[11 98 117 102 102 101 114 58 109 97 107 101] 0 127 0x95ccc0[0x95cbc0[0x95cc40[10 101 108 101 109 101 110 116 45 110 111] 0 127 0 0 0 0 0 0 0 0] 0] 0x95caf0[3 188 0x95ca20 0x95cb40[0x95cb80[0 187 0x95cbc0] 0]] 0 0 0x95caa0[1 1 0 0 4] 0 0 0] 0xa90140[0xa90180[1 18814 2] 0]] 0xa8fba0[4 18813 0 0xa8ff20[5 18806 0x95c0b0[0x95c410[18 98 117 102 102 101 114 58 105 110 105 116 105 97 108 105 122 101 33] 0 127 0x95c350[0x954760[0x9547e0[6 98 117 102 102 101 114] 0 127 0 0 0 0 0 0 0 0] 0x95c390[0x954640[0x9546c0[5 105 110 100 101 120] 0 127 0 0 0 0 0 0 0 0] 0x95c3d0[0x95b270[0x95b2f0[11 110 101 119 45 101 108 101 109 101 110 116] 0 127 0 0 0 0 0 0 0 0] 0]]] 0x95c180[3 201 0x95c0b0 0x95c1d0[0x95c310[0 198 0x954760] 0x95c210[0x95c2d0[0 199 0x954640] 0x95c250[0x95c290[0 200 0x95b270] 0]]]] 0 0 0x95c130[3 0 1 0 9] 0 0 0] 0xa8ff70[0xa900b0[0 18803 0xa8fc80] 0xa8ffb0[0xa90070[1 18804 0] 0xa8fff0[0xa90030[0 18805 0xa17110] 0]]]] 0xa8fbf0[4 18812 0 0xa8fd50[5 18810 0x95c0b0 0xa8fda0[0xa8fee0[0 18807 0xa8fc80] 0xa8fde0[0xa8fea0[1 18808 1] 0xa8fe20[0xa8fe60[0 18809 0xa16fd0] 0]]]] 0xa8fc40[0 18811 0xa8fc80]]]] 0 0 0 0 0 0]] 0] 0 0xc09430[0xc09470[0xb12540[11 115 116 97 116 101 58 101 114 114 111 114] 0xb124c0[0xb12540 0 127 0 0 0 0 0 0 0 0]] 0] 0 0 0 0 0xc093b0[0xc093f0[0xb12220[11 115 116 97 116 101 58 97 112 112 108 121] 0xb121a0[0xb12220 0 127 0 0 0 0 0 0 0 0]] 0] 0 0 0 0 0 0 0 0 0 0xc09230[0xc09270[0xc09330[10 112 114 105 109 105 116 105 118 101 63] 0xc092b0[0xc09330 0 127 0 0 0 0 0 0 0 0]] 0] 0xc091b0[0xc091f0[0xb14850[27 115 116 97 116 101 58 97 100 100 45 112 114 105 109 105 116 105 118 101 45 115 121 109 98 111 108 115] 0xb13a50[0xb14850 0 127 0xb147d0[0x9c59d0[0x9c5a50[6 115 97 108 105 115 116] 0 127 0 0 0 0 0 0 0 0] 0xb14810[0x963680[0x963700[3 97 99 99] 0 127 0 0 0 0 0 0 0 0] 0]] 0xb13ad0[7 2757 0xb14790[0 2738 0x9c59d0] 0xb14750[0 0] 0xb14710[0 2739 0x963680] 0xb13b30[4 2756 0xb146d0[0xb0cc70[0xb0ccf0[12 102 105 114 115 116 45 115 121 109 98 111 108] 0 127 0 0 0 0 0 0 0 0] 0] 0xb14570[5 2742 0x6b2d50[0x65cb40[8 99 111 110 115 58 99 100 114] 0 127 0x6f91a0[0x957840[0x9578c0[4 99 111 110 115] 0 127 0 0 0 0 0 0 0 0] 0] 0x61e700[5 250 0x957670[0x957950[12 99 111 110 115 58 103 101 116 45 99 100 114] 0 127 0x957910[0x957840 0] 0x9576f0[5 235 0x954460[0x9548c0[10 98 117 102 102 101 114 58 103 101 116] 0 127 0x954840[0x954760 0x954880[0x954640 0]] 0x954530[3 193 0x954460 0x954580[0x954720[0 191 0x954760] 0x9545c0[0x954600[0 192 0x954640] 0]]] 0 0 0x9544e0[2 1 0 0 7] 0 0 0] 0x957740[0x957800[0 233 0x957840] 0x957780[0x9577c0[1 234 1] 0]]] 0
;;; [...] this is just the beginning. Try it!
There are procedures which walk the symbol:table
structure and
return a list containing all the symbols bound to procedures or
to non-procedures; these are handy to get “global” information about
the program state:
(e1:toplevel (state:global-names)) ⇒ ;; [huge output] (e1:toplevel (state:procedure-names)) ⇒ ;; [different huge output]
e1:define
is just a macro!Many of the details above aren’t important in practice: you won’t need
to remember the field order in a symbol object, or to parse
an ε₀ expression data structure at a glance; you certainly
don’t need to remember such encodings by heart. But what is important
is that all of this can be done, by accessing reflective
data as ordinary data structures: in principle you could define a
complicated procedure using only buffer:make
, buffer:get
and buffer:set!
. In fact e1:define
is nothing magic:
it is itself a macro, which can define a procedure by
macroexpanding the procedure definition as provided by the user until
it reduces to a piece of ε₀ code, than sets some symbol
field. The result of macroexpanding an e1:define
use will be a
large ε₀ expression which when executed builds
another ε₀ expression, to be associated to a symbol with
buffer:set!
.
All of this should explain why no case of ε₀ expressions
looks like a global definition: indeed, global definitions
aren’t in the core language at all: there’s no need for them if we
can mutate symbol fields. By mutating such fields at run time, we can
even write self-modifying programs in a natural way: e1:define
expressions can occur in any context where another ε₁
expression can occur, even in deeply nested code: differently from
most languages ε₁ has no separate “toplevel forms”, but
only expressions: and expression syntactically include one another.
As a slightly less trivial example of a global definition, let
month->days
return the number of days in the given month (as a
0-based index), for a non-leap year. We can use a Lisp-style
e1:case
form (which compares by identity):
(e1:define (month->days month-index) (e1:case month-index ((8 3 5 10) ;; 30 days hath September, April, June and November 30) ((1) ;; …excepting February alone… 28) (else ;; …all the rest have thirty-one 31)))
Of course the e1:case
form isn’t in ε₀: rather
it’s a macro which gets rewritten into simple ε₀ conditionals:
(meta:print-procedure-definition 'month->days) ⊣ Formals: (month-index) ⊣ [let [_416] be month-index₇₅₃₅₁ in [if _416₇₅₃₅₂ ∈ {8, 3, 5, 10} then 30₇₅₃₅₃ else [if _416₇₅₃₅₄ ∈ {1} then 28₇₅₃₅₅ else 31₇₅₃₅₆]₇₅₃₅₇]₇₅₃₅₈]₇₅₃₅₉
Notice that we’ve never called month->days
yet: e1:define
performed the expansion once and for all at definition time.
The non-procedure case of e1:define
is similar:
e1:define
macroexpands what the user supplied, obtains
an ε₀ expression, evaluates it, and sets (another) symbol
field.
We’ve looked at the expansion result, which is to say
ε₀ expressions; now, without looking at the expansion
process itself, it’s time to focus on the source of this
translation: what exactly is it that e1:define
(and
e1:toplevel
as well) eventually rewrites to an ε₀
expression?
The answer is an s-expression: for example (e1:define qwe
(fixnum:+ 1 2 3 4))
has to turn, somehow, the s-expression
(fixnum:+ 1 2 3 4)
into an expression: but (fixnum:+ 1 2
3 4)
is just a data structure: in Lisp you would call it a
“list”, containing a “symbol” and four “numbers”; in
ε₁, to show more clearly that this is a (potentially)
syntactic object, different from the lists we saw before, we call it
an s-list (pronounced: ess-list). This s-list happens to
contain an s-symbol and four s-fixnums.
We say that ε₁ types can be injected into s-expressions:
for example fixnums are injected as s-fixnums, booleans as s-booleans,
the empty list as the empty s-list, and so on.
If you want to see the expansion of an s-expression into an
expression, you can use the Guile procedure meta:macroexpand
(it’s a Guile procedure: mind the quote, and don’t use
e1:toplevel
). You can see, for example, that fixnum:+
can accept any number of arguments in ε₁, but when rewritten
in ε₀ fixnum:+
calls become simple (and efficient)
two-argument calls, possibly nested:
[To do: shall I hint at how this works? Later, I’d say]
(meta:macroexpand '(fixnum:+ 1 2 3 4)) ⊣ [call fixnum:+ [call fixnum:+ [call fixnum:+ 1₇₅₃₇₃ 2₇₅₃₇₄]₇₅₃₇₅ 3₇₅₃₇₆]₇₅₃₇₇ 4₇₅₃₇₈]₇₅₃₇₉ (meta:macroexpand '(fixnum:+)) ⊣ 0₇₅₃₈₀ (meta:macroexpand '(fixnum:+ 20)) ⊣ 20₇₅₃₈₁
This is a fundamental difference between ε and Lisp: ε is not homoiconic; which is to say, in ε the data structures used to represent syntax (s-conses, s-symbols, s-fixnums, the empty s-list) are distinct from expressions: one notable feature of Lisp dialect is that data structures and expressions is the lack of this distinction.
Quoting and quasiquoting, in ε₁, yield an expression evaluating to the given s-expression; after evaluation, the result is the s-expression as supplied. Let’s see:
(e1:toplevel 100) ⇒ 100 ;; 100 as a fixnum (e1:toplevel '100) ⇒ 0xb44cf8[2 100] ;; 100 as an s-fixnum (e1:toplevel '120) ⇒ 0xb3f030[2 120] ;; 120 as an s-fixnum
You may already have guessed why an s-fixnum is represented as a boxed
object, containing a word which precedes the actual fixnum value:
s-expressions are a sum — an open sum, in fact16.
2
happens to be the tag associated to the s-fixnum case
of s-expressions. Other types injected into s-expressions have each
their own tag:
(e1:toplevel '()) ⇒ 0x1a46830[0 0] ;; empty s-list (e1:toplevel '#t) ⇒ 0xb5b170[1 1] (e1:toplevel '#f) ⇒ 0x1743368[1 0] ;; all s-booleans have the same tag (e1:toplevel '"abcdefghijkl") ⇒ 0x1671b68[5 0xb973a8[12 97 98 99 100 101 102 103 104 105 106 107 108]] (e1:toplevel 'abcdefghijkl) ⇒ 0xb65100[6 0x1701f98[0x1702e40[12 97 98 99 100 101 102 103 104 105 106 107 108] 0 127 0 0 0 0 0 0 0 0]]
We call s-conses the injection of conses, in this case meant in the restricted sense of pairs of s-expressions only:
(e1:toplevel (cons:make 100 200)) ⇒ 0x19c6aa8[100 200] ;; not an s-expression (e1:toplevel '(100 . 200)) ;; an s-cons of two s-fixnums ⇒ 0x91a4a0[3 0x953cb8[0x91dc30[2 100] 0x920d18[2 200]]]
An s-list is either the empty s-list or an s-cons whose right side is an s-list. We call s-car and s-cdr the left-hand side and right-hand side of a cons; by our definition of s-cons, both are always s-expressions.
(e1:toplevel '(1 . (2 . ()))) ;; an s-cons which is also an s-list ⇒ 0x92a1c8[3 0x99ade0[0x9a0a50[2 1] 0x99adc0[3 0x99a3c8[0x9a0a70[2 2] 0x99a3a8[0 0]]]]] (e1:toplevel '(1 2)) ;; just another way of writing '(1 . (2 . ())) ⇒ 0x916c00[3 0x916be0[0x91aa68[2 1] 0x916bc0[3 0x91aac8[0x91aa88[2 2] 0x91aaa8[0 0]]]]] (e1:toplevel '(quux 2)) ;; this happens to map to an ε₀ expression… ⇒ 0x16f0aa8[3 0x16f0ab8[0x16f0b68[6 0x164c0f8[0x174f320[4 113 117 117 120] 0 127 0 0 0 0 0 0 0 0]] 0x16f0ac8[3 0x16f0b08[0x16f0b28[2 2] 0x16f0b18[0 0]]]]] (e1:toplevel '(e1:if)) ;; …this doesn't, but it's still a valid s-expression ⇒ ;; [huge structure: e1:if is bound to a macro, indirectly referring many symbols]
You’ve already seen that
the empty s-list has tag 0
,
s-booleans have tag 1
,
s-fixnums 2
,
s-strings 5
and
s-symbols 6
.
But of course you don’t
have to remember such trivialities by heart: there are case-querying procedures for
injected types. Case-querying procedures work on any
s-expression, but of course not on non-s-expression ε data:
(e1:toplevel (sexpression:cons? '0)) ⇒ 0 ;; the 0 s-fixnum isn't an s-cons (e1:toplevel (sexpression:fixnum? '0)) ⇒ 1 ;; the 0 s-fixnum is an s-fixnum (e1:toplevel (sexpression:fixnum? '#f)) ⇒ 0 ;; the #f s-boolean isn't an s-fixnum (e1:toplevel (sexpression:boolean? '#f)) ⇒ 1 ;; the #f s-boolean is in fact an s-boolean (e1:toplevel (sexpression:symbol? 'foobar)) ⇒ 1 (e1:toplevel (sexpression:cons? 7)) error→ failure or crash: there's no s-expression tag in 7
There are also selector procedures for s-conses substructures; we call s-car the left element of an s-cons, and s-cdr its right element. Selectors are defined for s-car, s-cdr and their compositions up to length four (for example the s-caddr is the s-car of the s-cdr of the s-cdr). Mutators are also available for destructively changing the s-car and s-cdr of a given s-cons.
(e1:toplevel (sexpression:car '(2 . 3))) ⇒ 0x103e7b8[2 2] ;; 2 as an s-fixnum (e1:toplevel (sexpression:cdr '(2 . 3))) ⇒ 0x1031178[2 3] ;; 3 as an s-fixnum (e1:toplevel (sexpression:cdddr '(1 2 3))) ⇒ 0x73a1c0[0 0] ;; empty s-list (e1:define sc '(1 . 2)) (e1:toplevel sc) ⇒ 0x6ff890[3 0x6ff870[0x6ff830[2 1] 0x6ff850[2 2]]] (e1:toplevel (sexpression:set-car! sc '5)) (e1:toplevel sc) ⇒ 0x6ff890[3 0x6ff870[0x6ff830[2 5] 0x6ff850[2 2]]] ;; the s-car changed (e1:toplevel (sexpression:set-cdr! sc sc)) (e1:toplevel sc) ⇒ 0x6ff890[3 0x6ff870[0x6ff830[2 5] 0x6ff890]] ;; circular s-cons: (5 . (5 . (5 . …)))
S-expression constructors are actually injection procedures or injectors, useful for converting ordinary untagged data into s-expressions;
(e1:toplevel (sexpression:inject-boolean #t)) ⇒ 0x33d26e0[1 1] (e1:toplevel (sexpression:inject-fixnum 42)) ⇒ 0x33c8020[2 42] (e1:toplevel (sexpression:inject-symbol (e0:value blah-blah))) ⇒ 0x3405c90[6 0x33de740[0x33de230[9 98 108 97 104 45 98 108 97 104] 0 127 0 0 0 0 0 0 0 0]]
sexpression:cons
is a convenience procedure constructing an s-cons
holding the two given s-expressions.
It’s your responsibility to ensure that s-conses only contain
s-expressions. If you put something else in an s-cons field (for
example if you forget an injector call), you’ll obtain a data
structure which will likely cause problems later:
(e1:toplevel (sexpression:cons '1 '#f)) ⇒ 0x3440d40[3 0x3440dd0[0x3440ef0[2 1] 0x3440e60[1 0]]] (e1:toplevel (sexpression:cons '1 #f)) ;; say we forget the quote ⇒ 0x33d3660[3 0x33d41a0[0x33d4230[2 1] 0]] ;; not an s-cons: 0 isn't an s-expression
Of course ordinary ε data don’t carry any type information, so it’s your responsibility to use the correct injector:
(e1:toplevel (sexpression:inject-boolean 0))
⇒ 0x3410340[1 0] ;; a boolean s-expression #f
Ejection procedures or ejectors return the content of a
given s-expression. The universal ejector sexpression:eject
works on any s-expression case; case-specific ejectors check that the
s-expression case is correct, then call e1:error
on case mismatch, or
behave like the universal ejector otherwise.
(e1:toplevel (sexpression:eject '1)) ⇒ 1 (e1:toplevel (sexpression:eject-fixnum '1)) ⇒ 1 (e1:toplevel (sexpression:eject '#t)) ⇒ 1 ;; just the same! (e1:toplevel (sexpression:eject-fixnum '#t)) error→ fail or crash
Ejecting an s-cons yields a pair of s-expressions: the pair itself has no tag, but the left and right side are the same objects contained in the original s-cons, tag and all.
(e1:define my-s-cons '(() . #f)) (e1:toplevel my-s-cons) ⇒ 0x34503c0[3 0x3450380[0x3450300[0 0] 0x3450340[1 0]]] (e1:toplevel (sexpression:eject-cons my-s-cons)) ⇒ 0x3450380[0x3450300[0 0] 0x3450340[1 0]] ;; same pointers
I’ve already said that quoting expands to an ε₀ expression building the given s-expression as a constant data structure. Now you can understand the expansion of '(1 2)
:
(meta:macroexpand ''(1 2)) ;; two quotes: you're studying '(1 2) ⊣ [call sexpression:cons [call sexpression:make 2₆₉₀₈₂ 1₆₉₀₈₃]₆₉₀₈₄ [call sexpression:cons [call sexpression:make 2₆₉₀₈₅ 2₆₉₀₈₆]₆₉₀₈₇ [call sexpression:make 0₆₉₀₈₈ 0₆₉₀₈₉]₆₉₀₉₀]₆₉₀₉₁]₆₉₀₉₂
It macroexpands to an ε₀ expression building the s-expression (1 2)
. We can confirm this by looking at the result of the evaluation of '(1 2)
in ε₁:
(e1:toplevel '(1 2))
⇒ 0x346ba10[3 0x346b9d0[0x346b890[2 1] 0x346b990[3 0x346b950[0x346b8d0[2 2] 0x346b910[0 0]]]]]
Indeed it’s the s-expression (1 2)
, also written as (1
. (2 . ()))
: an s-cons holding the s-fixnum 1
and another
s-cons, holding the s-fixnum 2
and the empty s-list.
Notice that quoting is different from e1:value
: for example, '42
yields
the s-fixnum 42
, while (e1:value 42)
yields the fixnum 42
:
(e1:toplevel '42) ⇒ 0x819cb30[2 42] (e1:toplevel (e1:value 42)) ⇒ 42
At this point quasiquoting shouldn’t be very surprising to you,
even if it’s slighly less convenient than in Lisp because of the need
for injectors. For example `(1 ,(sexpression:inject-fixnum 2))
macroexpands to an ε₀ expression building the s-expression
(1 2)
,
which can also be written as
(1 . (2 . ()))
:
(e1:toplevel `(1 ,(sexpression:inject-fixnum 2)))
⇒ 0x348be00[3 0x348bdc0[0x348ba40[2 1] 0x348bd00[3 0x348bcc0[0x348bb40[2 2] 0x348bc00[0 0]]]]]
Of course the thing gets more interesting if we use non-constants in the unquoted parts:
(e1:define my-s-boolean '#t) (e1:toplevel `(1 . ,my-s-boolean)) ⇒ 0x7e92380[3 0x7e92340[0x7e921c0[2 1] 0x7e859a0[1 1]]] ;; the s-expression (1 . #t) (e1:define my-s-list '(#f)) (e1:toplevel my-s-list) ⇒ 0x7edd230[3 0x7edd1f0[0x7edd170[1 0] 0x7edd1b0[0 0]]] ;; the s-expression (#f) or (#f . ()) (e1:toplevel `(,my-s-list)) ⇒ 0x7ef9fe0[3 0x7ef9fa0[0x7edd230[3 0x7edd1f0[0x7edd170[1 0] 0x7edd1b0[0 0]]] 0x7ef9ee0[0 0]]] ;; the s-expression ((#f)) or ((#f . ()) . ()) (e1:toplevel `(,@my-s-list)) ;; unquote splicing ⇒ 0x7f070f0[3 0x7f070b0[0x7edd170[1 0] 0x7f06ff0[0 0]]] ;; the s-expression (#f), again
Don’t forget to ensure that your unquoted values are always s-expressions. If you omit the needed injection or quoting, you’ll get a non-s-expression data structure, which will likely to cause problems later:
(e1:toplevel `(,(sexpression:inject-fixnum 1))) ⇒ 0x817e3d0[3 0x817e390[0x817e210[2 1] 0x817e2d0[0 0]]] ;; the s-expression (1) (e1:toplevel `(,'1)) ⇒ 0x818a6d0[3 0x818a690[0x818a510[2 1] 0x818a5d0[0 0]]] ;; the same: '1 yields an s-expression (e1:toplevel `(,1)) ⇒ 0x8195290[3 0x8195250[1 0x8195190[0 0]]] ;; not a valid s-expression: 1 isn't an s-fixnum!
This will be unsurprising for you at this point: quoting and quasiquoting are not magic notations at the core of the language: they are defined in ε₁ as well, as macros. S-expressions are not the core data structure, being built as they are on buffers and fixnums.
It would be possible to build a different personality, completely replacing s-expression-based syntax with something else, possibly using ε₁ for the implementation. I wouldn’t accept such a change in the official version of ε without convincing evidence of some gain in extensibility, but you can certainly attempt that if you want: it’s free software; play with it.
There are two reasons to have s-expressions in ε₁:
The first point is obvious: you use s-expressions for writing ε₁ programs, and — you haven’t seen any definitions yet, but it’s easy to guess — macros to manipulate them. S-expressions are quite a nice way of encoding syntax: in particular they are not ambiguous to parse, they are reasonably compact, and they lend themselves to automatic manipulation. Quasiquoting makes it easy to write macros in which only some sub-structures are computed; s-expression syntax is very regular and easy to learn, even if it looks “different” from conventional languages.
The second point is more interesting, because it shows a fundamental
difference between ε and Lisp: in ε objects have no
runtime types: 0
may be a fixnum, a sum, a boolean, the
empty list; but if you decide to use s-expression for some data
structure, then you gain the flexibility of Lisp (or other
dynamically-typed languages), only where you want it:
s-expression tags are a form of runtime type representation.
As you saw ε’s s-expressions aren’t very efficient: even s-fixnums have to be allocated on the heap, and there is no unboxed case at all. I might decide to change this in the future, ideally by optimizing some more generic pattern in the representation of sums. In any case, making s-expressions more efficient would affect the readability of dumps — which might be still acceptable; Forth, for example, follows an untyped model in which there is really no way of distinguishing a pointer from a non-pointer: even without any notion of boxedness tags, Forthers are happy with the power of their language. They may be right, and destroying our nice colored dumps may be acceptable; I’m currently not strong enough at Forth for judging myself in a competent way.
Speaking of printed representations, s-expressions also have the
advantage of being easy to print. The elements of the
s-list (0 #f ())
, for example, are all printed differently,
which may be useful. At the present time ε₁ has no
s-expression printer, even if Jérémie Koenig has kindly contributed a
nice pretty-printer, that I still have to integrate.
S-expressions are also easy to parse, and I’ll have to add that support as well; I have a working prototype written in another language, to translate by machine into ε₁, or re-implement. Implementing this s-expression frontend is the only step needed before dropping Guile as a dependency and let ε be completely self-hosting. In this case the scanner, which will have to be extensible, is way more complex than the parser: here’s a complete attributed grammar to recognize s-expressions; it’s an LL(1) grammar, which I can easily code by hand as a recursive descent parser like I already did in the prototype.
Just in case it weren’t obvious this is an attributed grammar, not ε₁ code:
s-expression ::= atom { atom } | ( rest { rest } | prefix s-expression { s-cons(lookup(prefix), s-cons(s-expression, ())) } rest ::= ) { () } | . s-expression ) { s-expression } | s-expression rest { s-cons(s-expression, rest) }
Even in ε₁’s current state without a working parser and
printer we can still play with s-expressions in a relatively
comfortable way. The trick is using the Guile conversion
procedures I’ve written to translate between ε₁’s and
Guile’s incompatible s-expressions:
guile-sexpression->sexpression
and
sexpression->guile-sexpression
.
Since we’re speaking about Guile procedures, we’ll have to call them
out of e1:toplevel
; and since they are procedures and
not macros, we need to quote their arguments when appropriate
in Scheme.
(guile-sexpression->sexpression 15) ⇒ 0x52e0f20[2 15] ;; not unlike quoting 15 in ε₁… (e1:toplevel '15) ⇒ 0x52e8800[2 15] ;; …see? (guile-sexpression->sexpression 'rhfgkjsdfg) ⇒ 0x52f2220[6 0x52ea9d0[0x52ea450[10 114 104 102 103 107 106 115 100 102 103] 0 127 0 0 0 0 0 0 0 0]] (e1:toplevel 'rhfgkjsdfg) ⇒ 0x52f2220[6 0x52ea9d0[0x52ea450[10 114 104 102 103 107 106 115 100 102 103] 0 127 0 0 0 0 0 0 0 0]]
In a similar vein to s-expression conversion procedures,
I also defined conversion procedures for ε₁ data different from
s-expressions. For example, guile-sexpression->whatever
(I
should probably rename it) returns a given Guile datum converted into its ε₁
counterpart, as long as it’s unboxed:
(guile-sexpression->whatever 67) ⇒ 67 (guile-sexpression->whatever #\C) ⇒ 67 ;; big C is Unicode code point 67 (guile-sexpression->whatever #f) ⇒ 0
Converting in the other direction requires a different procedure per
Guile s-expression case; and of course the procedure parameter has to
be an ε₁ object, which may force us to use e1:toplevel
:
(whatever->guile-boolean (e1:toplevel 1)) ⇒ #t (whatever->guile-fixnum (e1:toplevel 1)) ⇒ 1
In practice it’s particularly useful to convert symbols between Guile and ε₁, since dumps easily get big for globally-bound symbols:
(symbol->guile-symbol (e1:toplevel (e1:value symbol:table))) ⇒ symbol:table (guile-symbol->symbol 'aaaaaa) ⇒ 0x536050[0x3271170[6 97 97 97 97 97 97] 0 127 0 0 0 0 0 0 0 0] (guile-symbol->symbol 'aaaaaa) ⇒ 0x536050[0x3271170[6 97 97 97 97 97 97] 0 127 0 0 0 0 0 0 0 0] ;; the same
Converting lists from ε₁ to Guile is easy if we accept having
Guile lists of ε whatevers. If we want to convert the elements
as well, we have to use Guile map
(called mapcar
in other Lisps):
(list->guile-list (e1:toplevel (list:list 1 2 3 4 5)))
⇒ (1 2 3 4 5)
(map whatever->guile-fixnum (list->guile-list (e1:toplevel (list:list 1 2 3 4 5))))
⇒ (1 2 3 4 5)
Going the other way essentially forces use to use some mapping,
because we can’t have whatevers referring arbitrary Guile
s-expressions — I’ve intentionally forbidden it, because it would
not be useful and would make some bugs very difficult to catch at
bootstrapping time. So guile-list->list
, counter-intuitively,
takes two parameters: the Guile list, and an element conversion
procedure:
(guile-list->list '(1 2 3) guile-sexpression->whatever)
⇒ 0x3270ad0[1 0x3270a90[2 0x3270a50[3 0]]]
You can see which procedures are available in bootstrap/scheme/conversion.scm.
Of course the very need for this translation between Guile
and ε₁ data will go away once I write a proper frontend for
ε.
On the other hand the distinction between ε₁ untagged data
and ε₁ s-expressions is useful, and will remain. It’s
perfectly reasonable to think of making a different personality
without such a distinction — essentially a Lisp — but such
decisions don’t belong in ε₁, which I want to remain
low-level enough to be efficient, and to be extensible into various
directions without being committed to limiting choices.
[To do: quoting and quasiquoting: refer e1:value
as mentioned before]
[To do: fill this]
Let’s define the Fibonacci series, in its naïve exponential-time form typical of micro-benchmarks:
(e1:define (fibo n) (e1:if (fixnum:< n 2) n (fixnum:+ (fibo (fixnum:- n 2)) (fibo (fixnum:1- n)))))
Check that it works, paying attention to how long this takes to compute:
(e1:toplevel (fibo 32))
⇒ 2178309
Now let’s try something cool. We want to freeze the current global state
(including globals and procedures) and unexec it into an image which,
when execed, will compute (fibo 32)
and print the result followed
by a newline. Easy:
(e1:toplevel (e1:unexec "/tmp/u" (fixnum:write (fibo 32)) (string:write "\n"))) ⇒ ;; zero results
This was fast: clearly the system didn’t compute (fibo
32)
this time. In fact it just created the file /tmp/u, which
will perform the slow computation when it’s run, starting from the
global state we’ve saved — which, crucially, includes the definition
of fibo
.
Open a new terminal, without closing the guile+whatever
session. From the new terminal enter the epsilon-trunk
directory, then try execing:
./bin/epsilon-image-interpreter-smob /tmp/u ⊣ 2178309
This was about as slow as the interactive system above: in fact
epsilon-image-interpreter-smob represents ε data
structures using Guile SMOBs, just as guile+whatever
.
But we can exec the same unexeced image with a more efficient runtime:
./bin/epsilon-image-interpreter-tagged /tmp/u ⊣ 2178309
This was much faster. But we can do even better17:
./bin/epsilon-image-interpreter-untagged /tmp/u ⊣ 2178309
epsilon-image-interpreter-tagged represents boxedness tags, but epsilon-image-interpreter-untagged doesn’t. This implies that when execing an unexeced image with epsilon-image-interpreter-untagged, you can’t unexec again; when execing with epsilon-image-interpreter-tagged (or epsilon-image-interpreter-smob), you can.
Let’s try (back on the interactive system):
(e1:toplevel (e1:unexec "/tmp/u2" (string:write "Now unexecing into /tmp/u3\n") (e1:unexec "/tmp/u3" (fixnum:write (fibo 32)) (string:write "\n")))) ⇒ ;; zero results
Verify that you have no /tmp/u3. Exec /tmp/u2 with
epsilon-image-interpreter-tagged, and /tmp/u3 will
appear. Exec that, and it will compute fibo
. Notice that you
can use epsilon-image-interpreter-untagged on /tmp/u3,
because that doesn’t need boxedness tags; but if you try
running
./bin/epsilon-image-interpreter-untagged /tmp/u2
you’ll get an error.
[To do: unexec limitations: external state such as open files, pointer identity]
ε compilers will have the same interface as unexec. They are still a little clumsy to use, so I won’t cover them here. However as soon as the interface matures compiling will feel a lot like unexecing, except that it will be slower at generating, and much faster at executing.
[To do: fill this]
[To do: fill this]
[To do: fill this]
[To do: fill this]
[To do: fill this]
[To do: fill this]
[To do: fill this]
At this point you should be able to understand essentially all of core.e, in which ε₀ is defined using itself along with reflective structures, and much of epsilon1.scm, in which ε₁ features are defined one by one. You will not understand the reasons behind some apparently gratuitous complications without reading my thesis (they mostly have to do with bootstrapping from Guile), but the intent should be clear.
epsilon1.scm is a case where you can feel very clearly how the language expressivity accelerates: the beginning is very clumsy, but after each extension is defined, it can be used to define the next one; at the end of the file, ε₁ has become pretty powerful.
Even compiler.e is not overly difficult, except for some kludgish solutions in the frontend part.
[To do: You’ve learned... That was actually simple: all the difficulty was in understanding the “big picture” rationale, and in distinguishing ε from Guile, which is just a transient implementation problem. Using ε₁ itself is easy.]
[To do: I can write another tutorial on syntactic abstraction if there is interest; but later. Let me do some development as well]
[To do: ε₁ doesn’t feel very minimalistic]
You’re welcome to subscribe to epsilon’s mailing list at
https://lists.gnu.org/mailman/listinfo/epsilon-devel and ask
questions publicly. At the present time there is no -bug
or
-help
mailing list, so epsilon-devel
is appropriate for
any kind of discussion about epsilon. I may activate other lists if
there is more traffic in the future.
[Move this: ε₁’s syntax is encoded by s-expressions, as in Lisp dialects (there is a crucial difference, but we will not see it now), to make syntactic extension easier.]
[Move this: [Maybe: whatevers can only refer other whatevers]]
[Move this: final unexec possibly without GC]
[Move this: means of abstraction, means of combination]
[Move this: Every ε₁ form you saw (everything with the ‘e1:’ prefix) is defined as a macro]
[Move this: Pattern-matching is on sums, tuples and records; not on s-expressions, like
Guile’s (ice-9 match)
module.]
[Move this: A footnote about addresses and moving GCs: it’s clear that the current GC is not moving, since we’ve assumed that object address don’t “spontaneously” change. This is definitely useful for debugging; however I might relax this constraint in the future. Currently, I do use addresses as keys in associative structures sometimes, which of course is only reliable with non-movable addresses.]
[Move this: e1:define
is in ε₁: we don’t strictly need
to have it predefined. in fact, we can simulate its behavior.]
[Move this: No error handling in ε₁: you should prevent errors, not recover from them.]
[Move this: No pointer arithmetics]
[Move this: S-expressions are for dynamic typing: just like boxedness tags, you can use them if you want]
[Move this: Show something which fails, near the beginning.]
[Move this: tail calls]
[Move this: Say "detour" instead of "digression"]
[Move this: more data structures: hint at hashes. I also implemented association lists18, and in fact hash buckets are alists.]
[Move this: I’m already requiring a CSS-capable browser, so I can write “λ-calculus” with a real “λ”.]
[Move this: I’ve heard some Forthers call this kind of reflection facility “browsing”. I don’t know how standard this term is.]
— Luca Saiu, 2013-08-23 12:54 (last update: 2023-09-29 17:57)
Tags: english, epsilon, gnu, guile, hacking, tutorial |
Next post | Previous post |
You might want to go to
the
main blog index
(feeds for every post:
Atom 1.0,
RSS 2.0)
or to my web site
https://ageinghacker.net
.
Luca Saiu |
The opinions I express here are my own and do not
necessarily reflect the beliefs or policies of my
employer or for that matter of anyone else. In case you
felt that the public statement of my thoughts threatened
your warm sense of security and your emotional stability
feel free to leave at any time.
You might be interested in my web site
|
Copyright © 2009, 2011-2014, 2017, 2018, 2021-2024 Luca Saiu
Verbatim copying and redistribution of this entire page are permitted
in any medium without royalties, provided this notice is preserved.
This page was generated by trivialblog.
trivialblog is
free software,
available under the
GNU GPL.
Tag icon copyright information is available
in this file.
See http://www.colorforth.com/cf.htm. Mr Moore was writing about conventional software as compared to colorForth, his minimal Forth dialect. ε and colorForth are very different systems, but they share much in terms of philosophy and rationale. I believe both to be local optima in the language design space, following McCarthy’s intuition of http://www-formal.stanford.edu/jmc/lisp20th.html.
Some research operating systems are based on micro-kernels on which several different libraries libraries emulate several different operating system APIs, at the same time: each library is called a “personality”.
As an example of a similar initiative, the SRFI effort (http://srfi.schemers.org/) in the Scheme community is working well: that may be the closest thing yet to Steele’s vision of a maintainer coordinating users’ improvements. And it’s very loose as an organization, which is good.
The first one, from 2001, was only my second compiler. My early attempts look a little embarrassing now.
See http://www.gnu.org/software/guile/manual/html_node/Defining-New-Types-_0028Smobs_0029.html. My implementation of the whatever “type” (static typing proponents are already throwing a fit at this point) is very inefficient; even sticking to SMOBs, there are better solutions than mine. I stuck with something simple; there’s no point in optimizing now, because I’ll soon get rid of the “whatever” runtime.
If you search for my name in recent Usenet messages, you’ll find a cool sub-project of mine, still in the planning stage.
In some cases using the macro may
not actually be necessary, but recognizing all such cases requires
knowledge of the implementation. Always using
e1:toplevel
at the top level for ε₁ forms is never an
error.
This syntax description is very informal: body
forms may be zero or more, in either case. A more correct definition
should actually use dot notation: since parameters are zero or more
and body expressions are zero or more, we can express the two cases as
(e1:define x . expressions)
and (e1:define
(x . formals) . expressions)
. I’m trying to be
friendly to less expert Lispers, avoiding dot notation in this first
introduction.
I’ll not cover multiple results here.
Before functional programming fundamentalists jump at my throat: I’m not saying that it’s a good idea to use mutation everywhere; I like mostly functional programming myself. But in some cases, imperative mutation provides just the right kind of modularity, as shown for example in SICP: http://mitpress.mit.edu/sicp/full-text/sicp/book/node53.html.
I have some experience of purely functional programming: I implemented an entire canonical LR(1) parser generator in the second implementation of ε, which was purely functional. It was very painful to write, and the result didn’t come out particularly beautiful, simple or more maintainable. It was definitely longer than an alternative with assignments could’ve been.
Back to the main point, competent programmers don’t need to be protected from themselves. If you want a bondage language, I don’t think you’ll like ε₀ and ε₁.
Scheme’s SRFI 111 specifies a feature with essentially the
same interface, even if not based on exposed untyped buffers:
http://srfi.schemers.org/srfi-111/.
Boxes are also similar to ML’s ref
s, of course
without the type restrictions. I’ve used a different name because I
don’t like unnecessary abbreviations,
and the word “box” is just as long as ref
.
According to the specifications, Scheme systems aren’t technically forced to implement eq?
as an equality by identity. But I don’t want to be anal, and we all know that in practice eq?
just compares two words. As for ε, by talking about data structures at a low level with pointers, buffers and fixnums, I can afford the luxury of stating what really happens in the implemented system.
I often use the name “cons” to speak of generic pairs, as it’s common in the Lisp jargon. In the ML and Haskell communities, “conses” are just used in lists. But since there’s absolutely no difference between the two kinds of pairs in ε₁, I adopt the Lisp convention.
I can also support older or less conventional machines such as VAXen or the target of my cool sub-project, which violate this rule: The idea is to have the heap start at some fixed address higher than 0: every number lower than the minimum will always be a non-pointer for our purposes, even independently of boxedness tags.
The current implementation of local variable is naïve, and the code will actually run slower after this change. However, that’s hardly the point. I will make the implementation efficient in the future.
Only conceptually for the time being, for reasons of convenience while bootstrapping. But the implementation will probably change soon.
epsilon-image-interpreter-tagged can be easily optimized, and made nearly as fast as epsilon-image-interpreter-untagged.
But their implementation will change in the future: they can be more efficient, and efficiency is important when a lot of elements are alive — this is the case of many hash tables, containing a large number of (small) buckets.