Unithorpe is a small interpreted
programming language and its virtual machine.
The driving idea is to use a single unicode character
to name each variable, function, namespace,
builtin operator, etc. in the language.
All data consists of Unicode characters and arrays,
whose slots hold Unicode characters or other arrays.
- Thingy
-
All values in Unithorpe are thingies.
A thingy is either a Unicode Character (in the
16-bit code range 0..65535) or an array.
- Integers
-
Unicode Characters can also be used as short unsigned
integers, in the range 0..65535. Whether a value is a
character or an integer depends on how it's used.
- NUL Character
-
The NUL character, or the integer 0, is used for initial
values of uninitialized thingies.
- Arrays
-
Arrays are composed of 65536 slots, each containing a thingy --
either an integer or a reference to another (or the same) array.
The slots are indexed by Characters.
Arrays initially contain NUL characters (0 integers)
in all 65536 slots.
All possible characters can be used for indices to any array.
Three primitive operations are defined on arrays:
Create an array, Get the value at an index,
and Set the value at an index.
Arrays are reference-counted: they go away when there
are no more references to them,
so no operation to explicitly delete an array is required.
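The three primitive operations can be sketched in Python as follows. This is only an illustrative model, not the Unithorpe implementation: the class name ThorpeArray and the dict-based sparse storage are assumptions made for the example, and Python's own memory management stands in for the reference counting described above.

```python
NUM_SLOTS = 65536  # every array has one slot per 16-bit character


class ThorpeArray:
    """Sparse stand-in for a 65536-slot array; unset slots read as 0 (NUL)."""

    def __init__(self):
        # Create: only non-zero slots are stored (an assumption of this sketch).
        self._slots = {}

    def get(self, index):
        """Get: return the thingy stored at a character index."""
        if not 0 <= index < NUM_SLOTS:
            raise IndexError("index must be a 16-bit character")
        return self._slots.get(index, 0)

    def set(self, index, value):
        """Set: store an integer or an array reference at a character index."""
        if not 0 <= index < NUM_SLOTS:
            raise IndexError("index must be a 16-bit character")
        if isinstance(value, int) and value == 0:
            self._slots.pop(index, None)  # storing NUL frees the slot
        else:
            self._slots[index] = value


a = ThorpeArray()
a.set(ord("x"), 42)  # index by the character 'x'
a.set(7, a)          # a slot may refer to another array, or to the same one
```

Every index from 0 to 65535 is valid from the start, so reading a slot that was never written simply yields 0.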
- Strings
-
Conventionally, strings of unicode characters in the range 1..65535
are represented by arrays. The values of the array slots are
all characters (not array references), beginning
with index 0. Strings may be
of length 0 to 65534. At least one NUL character
pads the rest of the array.
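The string convention can be illustrated with a small Python sketch. The helper names to_thorpe_string and from_thorpe_string are hypothetical, and an array is modelled here as a dict from slot index to character code in which missing slots read as NUL.

```python
MAX_STRING_LENGTH = 65534  # limit stated in the text above


def to_thorpe_string(text):
    """Return a slot -> code dict for text; missing slots read as NUL (0)."""
    if len(text) > MAX_STRING_LENGTH:
        raise ValueError("strings hold at most 65534 characters")
    slots = {}
    for i, ch in enumerate(text):
        code = ord(ch)
        if not 1 <= code <= 65535:
            raise ValueError("characters must be in the range 1..65535")
        slots[i] = code  # characters start at index 0
    return slots


def from_thorpe_string(slots):
    """Read characters from slot 0 up to the first NUL."""
    chars = []
    i = 0
    while i < 65536 and slots.get(i, 0) != 0:
        chars.append(chr(slots[i]))
        i += 1
    return "".join(chars)


hello = to_thorpe_string("hello")
```

Because NUL ends the string, NUL itself cannot appear inside one, which is why the character range starts at 1.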
- Global Variables
-
One array known as the Global Array always exists.
It is the starting point for all programs and data.
Some global variables are initialized for you with the builtin
bytecode operations.
The global array serves many different
purposes: it holds bytecodes, scripts,
namespaces, arrays, objects, strings,
formal parameters, locals, temporary variables, and any
other kind of data.
It is the user's responsibility
to decide which slots will be used for what purpose.
- Scripts
-
Programs are made of Unithorpe Scripts,
which are Strings that begin
with the character ';'.
- Local Variables
-
Immediately following the initial ';' are the names
of local variables for the script,
if any, up to
a space character.
The local variables are also used as formal parameters.
The first local variable is an in-out
parameter; the remaining are in parameters.
When the script is called, the variables named after the
command name (its arguments) are bound, in order, to the local variables
of the script. Any extra local variables with no
calling variable bound to them are initialized to 0.
Local variables actually live in the global array.
Before a script is called, the existing values in slots
which will be local variables
are pushed onto a stack, and restored when the
script returns. These pushed values will be unavailable while
they are pushed.
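The save, bind, and restore sequence might look like the following Python sketch. Several things here are assumptions made for illustration: the global array is a dict keyed by one-character names, the script body is a Python callable, and the in-out parameter is modelled by copying the first local's final value back to the caller's first variable after the saved slots are restored.

```python
def call_script(global_array, local_names, arg_names, body):
    """Bind arguments to locals, run body, then restore the saved slots.

    Missing names in global_array read as 0. Returns the final value of
    the first local (the in-out parameter), or None if there are no locals.
    """
    # Read the caller's argument values before any slot is overwritten.
    arg_values = [global_array.get(name, 0) for name in arg_names]
    # Push the current contents of every slot that becomes a local.
    saved = [(name, global_array.get(name, 0)) for name in local_names]
    # Bind in order; locals without a matching argument start at 0.
    for i, name in enumerate(local_names):
        global_array[name] = arg_values[i] if i < len(arg_values) else 0
    body(global_array)
    result = global_array[local_names[0]] if local_names else None
    # Pop the saved values, then hand the in-out result to the caller.
    for name, value in reversed(saved):
        global_array[name] = value
    if local_names and arg_names:
        global_array[arg_names[0]] = result
    return result


def add_one(g):
    """Hypothetical script body: Z = a + 1, clobbering the temporary t."""
    g["Z"] = g["a"] + 1
    g["t"] = 99


globals_ = {"x": 9, "a": 5, "Z": 100, "t": 7}
call_script(globals_, ["Z", "a", "t"], ["x", "a"], add_one)
```

After the call, the caller's Z and t hold their old values again and only x has changed.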
- Command Fetch
-
After the ';', the local variables, and the space character
come the commands of the script. Simple commands begin
with a non-space character, which is used to index into
the global array. This is called the "command fetch."
What is found there determines what
happens next.
There are four cases:
- If it is a character, it is treated as a "bytecode."
Bytecodes name builtin primitive operations, defined later.
- If it is a reference to an array with ';' in slot 0,
it is a Unithorpe script to be executed.
- If it is a reference to an array with integer 0 in slot 0,
it is a namespace. The next character in the script
is an index into this array, where command fetch is
repeated. A command may traverse several namespaces this way.
- If it is a reference to an array with another array
in slot 0, then the array is an "object" and the
array referenced by slot 0 is the "class".
The next character in the script is an index into the
class array, where command fetch is repeated.
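The four cases can be sketched as a small dispatcher in Python. This is a minimal model under stated assumptions: characters are ints, arrays are dicts from integer index to value with missing slots reading as 0, and the names classify and command_fetch are hypothetical. The text does not say what happens when slot 0 holds some other character, so the sketch reports that case as "undefined".

```python
def classify(value):
    """Name the command-fetch case for a value found in an array slot."""
    if isinstance(value, int):
        return "bytecode"
    head = value.get(0, 0)
    if isinstance(head, dict):
        return "object"
    if head == ord(";"):
        return "script"
    if head == 0:
        return "namespace"
    return "undefined"  # not covered by the four cases in the text


def command_fetch(global_array, script, pos):
    """Follow namespaces and objects from script[pos] to a command.

    Returns (kind, value, next_pos, receiver). receiver is the object
    whose class supplied the command, or None.
    """
    table, receiver = global_array, None
    while True:
        value = table.get(ord(script[pos]), 0)
        pos += 1
        kind = classify(value)
        if kind == "namespace":
            table = value                       # repeat the fetch in the namespace
        elif kind == "object":
            receiver, table = value, value[0]   # repeat the fetch in the class
        else:
            return kind, value, pos, receiver


strlen_script = {0: ord(";")}
namespace = {ord("f"): strlen_script}
klass = {ord("m"): strlen_script}
obj = {0: klass}
g = {ord("+"): ord("+"), ord("N"): namespace, ord("$"): obj}

kind, value, next_pos, receiver = command_fetch(g, "Nf xa", 0)
```

Here the command "Nf" walks through namespace N to reach the script stored under f, and next_pos points at the first character after the command name.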
- Command Arguments
-
The characters after the command fetch characters are arguments
to the command. Arguments stop at the first space
or NUL character,
but a grouping character causes inclusion of all characters
up to the closing character of the group.
- Grouping Characters
-
Grouping characters come in pairs, and include
{ } [ ] ( ) < > ` '
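A possible argument scanner is sketched below in Python. It assumes a group ends at the next occurrence of its closing character, so nested groups of the same kind are not handled, and it treats the backquote and quote as an opening and closing pair. The function name split_arguments is hypothetical.

```python
# Opening character -> closing character, as listed above.
GROUPS = {"{": "}", "[": "]", "(": ")", "<": ">", "`": "'"}


def split_arguments(script, pos):
    """Split the arguments starting at script[pos].

    Returns (arguments, next_pos). Each argument is either a single
    character or an (opener, group_text) tuple. Scanning stops at the
    first space or NUL that is outside a group.
    """
    args = []
    while pos < len(script) and script[pos] not in (" ", "\0"):
        ch = script[pos]
        if ch in GROUPS:
            end = script.find(GROUPS[ch], pos + 1)
            if end < 0:
                raise ValueError("unterminated group starting at %d" % pos)
            args.append((ch, script[pos + 1:end]))  # spaces inside are kept
            pos = end + 1
        else:
            args.append(ch)
            pos += 1
    return args, pos


# Arguments of a while command; position 1 is just after the '@'.
args, next_pos = split_arguments("@{ ,taZ }t{ +ZZ1 } ;", 1)
```

The spaces inside the braces do not end the argument list, which is what lets a group carry whole commands.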
- Pseudothorpe
-
Until Unicode tools are available, the script input
to the interpreter accepts a backslash followed by any two
ASCII characters as one 16-bit character code, whose high
8 bits are the first ASCII character and whose low 8
bits are the second ASCII character. This three-ASCII-character
sequence, called pseudothorpe, creates a single character code
in the range 256..65535.
For instance, these pseudothorpe
operators are used as if each was a single
unicode character: \eq \ne \lt \le \gt \ge
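The encoding can be sketched in Python as below. The function name decode_pseudothorpe is hypothetical, and rejecting non-ASCII characters after the backslash is an assumption of this sketch rather than stated behaviour.

```python
def decode_pseudothorpe(source):
    """Turn script input into a list of 16-bit character codes.

    A backslash and the two ASCII characters after it become one code:
    the first character supplies the high 8 bits, the second the low 8.
    """
    codes = []
    i = 0
    while i < len(source):
        if source[i] == "\\":
            pair = source[i + 1:i + 3]
            if len(pair) < 2 or any(ord(c) > 127 for c in pair):
                raise ValueError("backslash needs two ASCII characters")
            codes.append((ord(pair[0]) << 8) | ord(pair[1]))
            i += 3
        else:
            codes.append(ord(source[i]))
            i += 1
    return codes


codes = decode_pseudothorpe("\\eqZab")  # the command \eqZab
```

The six input characters of \eqZab decode to four character codes: one for the operator and one for each argument.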
- Builtin Bytecodes [unithorpe version 0.1]
-
These bytecodes are initially assigned to their own slot in the
global array. The initial character names the bytecode,
and other characters are arguments.
Arguments which are single characters name global
variables for inputs or outputs. Conventionally,
an output argument comes before other arguments.
Arguments which are "blocks" begin with
a grouping character, and end with the next occurrence
of the matching grouping character. Arguments end with
a space or NUL that is not in a block.
Command | Usage | Mnemonic | Description
! | !Z | anew | set variable Z to a new array containing all 0s
, | ,Zai | aget | set Z to the value in slot i of array a
. | .aix | aput | set slot i of array a to the value in variable x
; | ; | return | return from the script
+ | +Zab | plus | set Z to the sum of a and b
- | -Zab | minus | set Z to the difference of a and b
* | *Zab | times | set Z to the product of a and b
? | ?c{block1} | if | if c is not 0, then do block1
? | ?c{block1}{block2} | ifelse | if c is not 0, then do block1, else do block2
@ | @a{block1}b{block2} | while | repeatedly test variables (like a or b) or do blocks, jumping out when a test variable is 0
\eq | \eqZab | equals | set Z to 1 if a equals b, to 0 otherwise
\ne | \neZab | not_equals | set Z to 0 if a equals b, to 1 otherwise
\lt | \ltZab | less_than | set Z to 1 if a < b, to 0 otherwise
\le | \leZab | less_equals | set Z to 1 if a <= b, to 0 otherwise
\gt | \gtZab | greater_than | set Z to 1 if a > b, to 0 otherwise
\ge | \geZab | greater_equals | set Z to 1 if a >= b, to 0 otherwise
- Initial Global Values
-
Each of the above command characters ! , . ; + - * ? @ \eq \ne \lt \le \gt \ge is set in the global array to its own
bytecode value, which is the same as the command character.
Also the slots indexed by
characters '0' '1' '2' '3' '4' '5' '6' '7' '8' & '9'
are set to
the integers 0 1 2 3 4 5 6 7 8 & 9, respectively.
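As a rough illustration, the initial contents of the global array could be built like this in Python. The dict model (missing slots read as 0) and the helper names pseudo and initial_globals are assumptions made for the example.

```python
def pseudo(pair):
    """16-bit code for a two-character pseudothorpe name such as 'eq'."""
    return (ord(pair[0]) << 8) | ord(pair[1])


def initial_globals():
    """Return the initial global array as a dict; other slots read as 0."""
    g = {}
    for ch in "!,.;+-*?@":
        g[ord(ch)] = ord(ch)  # each bytecode sits in its own slot
    for pair in ("eq", "ne", "lt", "le", "gt", "ge"):
        g[pseudo(pair)] = pseudo(pair)
    for digit in range(10):
        g[ord("0") + digit] = digit  # slot '0' holds 0, ..., slot '9' holds 9
    return g


g = initial_globals()
```

This is why a command such as +Z00 can use '0' as an operand: the slot named '0' already holds the integer 0.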
- Inequality
-
Ordering and equality on integers (characters) is the
natural unsigned order. All integers are less than all
array references. Array references have an arbitrary total
ordering. Only references to the same array are equal.
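One way to realise this ordering is sketched below in Python. Integers are ints and an array is any other object. Using id() for the arbitrary total order on arrays is an assumption of this sketch, and it is only consistent while the arrays being compared are alive.

```python
def sort_key(thingy):
    """Key that puts integers first in natural order, then arrays."""
    if isinstance(thingy, int):
        return (0, thingy)
    return (1, id(thingy))  # arbitrary but consistent order by identity


def less_than(a, b):
    """Model of \\lt: 1 if a < b, else 0."""
    return 1 if sort_key(a) < sort_key(b) else 0


def equals(a, b):
    """Model of \\eq: 1 if a equals b, else 0."""
    return 1 if sort_key(a) == sort_key(b) else 0


x, y = {}, {}                    # two distinct arrays
biggest_int_first = less_than(65535, x)
same_array = equals(x, x)
different_arrays = equals(x, y)
```

Two arrays with identical contents still compare unequal, because equality is on the reference and not on the slots.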
- Booleans
-
Notice that the "if" and "while" operators treat
0 as false, and everything else as true.
- Namespaces
-
Elaborate...
- Objects
-
Elaborate... (Object is bound to '$')
- Examples
-
This script computes the length of a string:
;Zat +Z00 @{ ,taZ }t{ +ZZ1 } ; strlen
It begins with ';' to mark it as a Unithorpe script.
The local variable Z will be used to output the result.
The local variable a will be the string input.
The local variable t will be a temporary.
The command +Z00 means set Z to the sum of integers 0 and 0,
that is, to 0.
The command @{ ,taZ }t{ +ZZ1 } is a while loop,
which will get slot Z of array a and put it in t;
will break if t is NUL; will increment Z; and repeat.
Finally the ';' is the return command. Whatever is in Z
will be returned.
Everything following the final ';' is a comment, since it can
never be reached.
If the above script were bound to slot 'L' in the global array,
the following script would create an array,
set the first three slots to '8' '8' '8', and find its
string length (3) into variable x.
; !a *n68 +nn8 .a0n .a1n .a2n Lxa
First it makes a new array and puts it in a.
Then it puts 6*8=48 into n.
Then it adds another 8 to n, to make 56, the value of ASCII '8'.
Then it writes that to slots 0, 1, and 2 of the array a.
Then it calls script L (the strlen above) on array a,
with result output in variable x.
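For readers who want to trace the two scripts, here is a line-by-line Python mirror. It is not an interpreter: each Unithorpe command is hand-translated and shown in a comment, and the array is modelled as a dict in which missing slots read as 0.

```python
def strlen(a):
    """Mirror of ';Zat +Z00 @{ ,taZ }t{ +ZZ1 } ;' on a dict-backed array."""
    z = 0 + 0                # +Z00
    while True:              # @
        t = a.get(z, 0)      # { ,taZ }
        if t == 0:           # t       leave the loop when t is NUL
            break
        z = z + 1            # { +ZZ1 }
    return z                 # ;


a = {}                       # !a
n = 6 * 8                    # *n68
n = n + 8                    # +nn8
a[0] = n                     # .a0n
a[1] = n                     # .a1n
a[2] = n                     # .a2n
x = strlen(a)                # Lxa
```

The arithmetic works because the slots named '6' and '8' initially hold the integers 6 and 8, so n ends up holding the character code of '8'.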
- Loading a Program
-
Elaborate...
- Input & Output Primitives
-
TBA.
- Implementation Suggestions
-
Thingies should be (in C) unsigned long.
If the value is less than 65536, it's a character value.
Otherwise it's a pointer to an array.
Arrays could be always 65536 thingies long, or they could
be a structure that grows when needed, with all slots
beyond its actual size behaving like 0. Other sparse
representations could also be used.
Arrays should be reference-counted or garbage-collected.
Some structure is needed to implement the stack of
saved values when a script is called.
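The growing representation can be sketched as follows, in Python rather than C for brevity. The class name GrowableArray and the allocated helper are hypothetical. The sketch assumes indices are already known to be in 0..65535.

```python
class GrowableArray:
    """List-backed array that grows on demand; slots past the end read as 0."""

    def __init__(self):
        self._items = []

    def get(self, index):
        """Slots beyond the allocated size behave like 0."""
        return self._items[index] if index < len(self._items) else 0

    def set(self, index, value):
        """Grow just far enough to hold index, unless storing NUL past the end."""
        if index >= len(self._items):
            if isinstance(value, int) and value == 0:
                return  # writing NUL past the end changes nothing
            self._items.extend([0] * (index + 1 - len(self._items)))
        self._items[index] = value

    def allocated(self):
        """Number of slots actually stored (for illustration only)."""
        return len(self._items)


arr = GrowableArray()
arr.set(5, 9)      # grows to 6 slots
arr.set(1000, 0)   # no growth: the slot already reads as 0
```

This keeps short strings and small namespaces cheap while still letting any of the 65536 indices be used.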
- Standard Library
-
We could use libraries for strings, collections, bignums, ...
- Bithorpe
-
A higher-level language "Bithorpe" is planned. It should
be interpreted by unithorpe code, or translated down into
unithorpe. Where Unithorpe is register-based, Bithorpe
should be expression-based, with binary operators.
- TODO in future versions
-
- =Za assigns a to Z
- !Z{string value}
creates new array initialized to "string value".
- consider {abcde} creates new array with string "abcde"
for any input variable.
(Then is the ! operator obsolete? use =Z{} instead?)
- 'Z{c}
assigns literal character 'c' to Z
- work on objects and classes
- arithmetic & boolean operators
- use real unicode instead of pseudothorpe for \eq \ne ...
- consider a Special Form (MACRO), perhaps a special argument
that gets an array of the rest of the parameter values.
Also a dynamic local variable creator.
- unithreads
- Unicode GUI
- JIT compiler, precompiler
- SemiThorpe
-
SemiThorpe is a semi-normal semi-unithorpish language
which compiles into Unithorpe code for the
Unithorpe Virtual Machine.
(thanksgiving day, 2004)
Source
This C source is very incomplete and untested
and just might hurt your computer or your head: