David.carybros.com-minimal Instruction Set


7/28/2019 David.carybros.com-minimal Instruction Set

    david.carybros.com http://david.carybros.com/html/minimal_instruction_set.htm

    minimal instruction set

When I was working on this, I thought it was pretty creative and unique. Later I found out that I'd been stuck in a ``Turing Tar Pit''. Still, this looks like a more reasonable instruction set than some of the really ugly things that have been developed by other people who have also wasted a lot of time in a ``Turing Tar Pit''.

Describes a microprocessor instruction set developed by David Cary that packs 2 instructions into 8 bits of RAM. As far as I know, this is *the* Minimal Instruction Set (for a single-processor Von Neumann machine). In other words, I know of no other (Turing-complete) instruction set that has as many or fewer distinct instructions (DI).

2003-01-04:DAV: I just discovered ``A Minimal CISC'' by Douglas W. Jones http://www.cs.uiowa.edu/~jones/arch/cisc/ that has fewer distinct instructions (DI): only 8, so 5 of them can pack into a 16 bit word.

    The instructions are:

    1. NOP: No operation.

    2. DUP: Duplicate the stack top. This is the only way to allocate stack space.

    3. ONE: Shift the stack top left one bit, shifting one into the least significant bit.

    4. ZERO: Shift the stack top left one bit, shifting zero into the least significant bit.

5. LOAD: Use the value on the stack top as a memory address; replace it with the contents of the referenced location.

6. POP: Store the value from the top of the stack in the memory location referenced by the second word on the stack; pop both.

7. SUB: Subtract the top value on the stack from the value below it, pop both and push the result.

8. JPOS: If the word below the stack top is positive, jump to the word pointed to by the stack top. In any case, pop both.

... any constant can be pushed on the stack by a DUP followed by 16 ONE or ZERO instructions. Zero may be pushed on the stack by the sequence DUP DUP SUB; negation may be done by subtracting from zero; addition may be done by subtracting a negated value, and pushing zero prior to pushing an unconditional branch address allows an unconditional branch.
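To make those tricks concrete, here is a minimal sketch (my own, not Jones's code) of the arithmetic subset of the 8-instruction machine above. It assumes 16-bit words and models the stack as a Python list; LOAD, POP, and JPOS are omitted since they need memory and a program counter.

```python
# Sketch of the DUP/ONE/ZERO/SUB subset of Jones's "Minimal CISC".
# 16-bit word size is an assumption for illustration.

WORD_BITS = 16
MASK = (1 << WORD_BITS) - 1

def step(stack, op):
    """Apply one instruction to the stack, in place."""
    if op == "NOP":
        pass
    elif op == "DUP":
        stack.append(stack[-1])
    elif op == "ONE":                        # shift left, shifting in a 1
        stack[-1] = ((stack[-1] << 1) | 1) & MASK
    elif op == "ZERO":                       # shift left, shifting in a 0
        stack[-1] = (stack[-1] << 1) & MASK
    elif op == "SUB":                        # second-from-top minus top
        t = stack.pop()
        stack[-1] = (stack[-1] - t) & MASK
    else:
        raise ValueError("unknown op: " + op)

def run(stack, ops):
    for op in ops.split():
        step(stack, op)
    return stack

# DUP DUP SUB pushes a zero (given anything already on the stack):
assert run([7], "DUP DUP SUB") == [7, 0]
# negation = subtract from zero:
assert run([0, 7], "SUB") == [(-7) & MASK]
# DUP then 16 ONE/ZERO instructions pushes any constant, e.g. 5 = 0b101:
assert run([0xFFFF], "DUP " + "ZERO " * 13 + "ONE ZERO ONE") == [0xFFFF, 5]
```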

    See also its counterpart, "The Ultimate RISC" by Douglas W. Jones http://www.cs.uiowa.edu/~jones/arch/risc/

    The "Whitespace" programming language is similar to this "Minimal CISC". http://compsoc.dur.ac.uk/whitespace

The "Path" language http://pathlang.sourceforge.net/ has an interesting 2D layout, and a near-minimal instruction set (expressed in single characters):

+ - increment the current memory cell

- - decrement the current memory cell


    } - go to the next memory cell

    { - go to the previous memory cell

, - input an ascii character from stdin into the current memory cell

. - output an ascii character from the current memory cell into stdout

    / \ - Turn (unconditional branch)

^ < > V - Turn if current memory cell is not equal to 0

    ! - jump next symbol

$ - start here heading right

    # - end here

any other character including spaces - do nothing

2002-12-07:[FIXME: move ``Turing Tar Pit'' information here.] computer_architecture.html#tarpit

    contents: [FIXME:]

    See also

computer_architecture.html#misc for other attempts at a Minimal Instruction Set Computer (MISC). Some of them have even been produced in silicon.

What is the minimum number of instructions for a Turing-complete von Neumann machine?

    Here David Cary pushes the MISC idea to ugly extremes. [FIXME: need a catchy name for this architecture]

"whenever you excessively constrain any parameter, something else has got to give." -- Don Lancaster.

I've put together yet another MISC instruction set. I've squeezed it down to 11 instructions. I don't think I've painted myself into a corner yet (I hope). If I could just squeeze out 3 more instructions, I could pack 5 of them into a 16 bit cell.

One advantage of having few instructions -- you can document them in a reasonably-sized email, rather than needing an entire book to document the details of scads of instructions.

There are 2 very different ways of counting the "size" of an instruction set.

We could count the number of different instruction mnemonics (NM) mentioned in the assembler documentation. This method is subjective, since different assemblers may describe the same machine with different numbers of mnemonics (for example, one assembler may require programmers to type "LOAD dest TO PC" to get the effect of an immediate jump. Another assembler may require programmers to use an additional mnemonic "JUMP TO dest" to get exactly the same bit pattern in the final executable.)

Processors with smaller NM are likely to have regular, orthogonal instruction sets (the same addressing modes apply to all instructions), which makes them easier to program than processors with larger NM.

The minimum number of mnemonics NM is 1: the "move" instruction used by TTA http://www.rdrop.com/~cary/html/computer_architecture.html#tta . Since you know what the instruction will be in a TTA architecture, it takes zero bits to specify -- but TTA still requires a bunch of bits per instruction to select registers and/or addressing modes.

Another way of counting the "size" of an instruction set is to count all possible distinct instructions (DI): all valid variations of "source register", "destination register", and "addressing mode". Using this count, TTA has lots more than 1 instruction.

This is a more objective count: if the longest instruction has b bits, then the number DI is equal to 2^b minus the number of invalid instructions of that length.

Processors with smaller DI are likely to need fewer bits b to specify each instruction. Processors with fewer bits per instruction

pack more instructions into a given number of bits of RAM, and

execute more instructions per second with a given RAM bandwidth.
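The relationship between DI and bits per instruction can be sketched numerically. This quick sanity check is my own arithmetic, not from the original page: b = ceil(log2(DI)), and a cell of C bits holds floor(C / b) instructions.

```python
# Sanity check: minimum bits per instruction and instructions per cell.
import math

def bits_needed(di):
    """Minimum bits b such that 2**b >= di distinct instructions."""
    return math.ceil(math.log2(di))

def instructions_per_cell(di, cell_bits):
    return cell_bits // bits_needed(di)

assert bits_needed(27) == 5                  # F21: 27 DI fit in 5 bits
assert instructions_per_cell(8, 16) == 5     # Jones's Minimal CISC: 5 per 16-bit word
assert instructions_per_cell(16, 8) == 2     # 16 DI: 2 instructions per 8 bits
```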

Unfortunately, this advantage is partially (perhaps completely) cancelled out by the fact that processors with smaller DI often require *more* instructions (more RAM and/or more cycles) to implement a given piece of functionality than processors with larger DI.

Nevertheless, it is an interesting intellectual challenge to wonder: What is the minimum number of bits bmin required per instruction? What is the minimum number of distinct instructions DI for a Turing-complete von Neumann machine?

The shBoom(tm) microprocessor from Patriot Scientific Corporation http://www.ptsc.com/ http://www.circuitcellar.com/articles/misc/tom-92.pdf packs instructions into 8 bits.

So we know DImin is 256 or less.

The "Itty Bitty Stack Machine" http://www.ittybittycomputers.com/IttyBitty/IBSM.htm is very similar to the F21, with most instructions packed into 5 bits. (A total of about 54 instructions.) (This instruction set is allegedly "fast, that is, it should be capable of emulation at a raw speed not slower than 10% of the host native hardware.")

The "F21" Forth engine by Chuck Moore (F21 STACK PROCESSOR CPU DESCRIPTION http://pisa.rockefeller.edu:8080/MISC/F21.specs) has 27 distinct instructions, each one packed into 5 bits (neglecting the bits that follow "#" and branches).

So we know DImin is 27 or less.

Clive Sinclair http://www.cdworld.co.uk/zx2000/clive.html claims that he has designed a CPU with only "16 principle instructions", but he doesn't list any details.

Dr Neil Burgess mentions an "ultraRISC processor, that has only 15 instructions" http://www.acue.adelaide.edu.au/leap/discipline/eng/Burgess.html but he doesn't list any details.

Can a CPU really be designed to have 16 or fewer distinct instructions DI, such that one can pack 2 instructions into 8 bits of RAM? 8 instructions into a 32 bit word? Or are these people counting NM, not DI?

Starting with the elegant 27 instruction set for the "F21" Forth engine by Chuck Moore (F21 STACK PROCESSOR CPU DESCRIPTION http://pisa.rockefeller.edu:8080/MISC/F21.specs), and eliminating instructions that could be emulated by (sometimes lengthy) combinations of the other instructions, David Cary managed to get a (very ugly) 16 instruction set.

    "# push pop A! A@ T=0 @A !A xor and + com 2/ dup over drop". (16 instructions as of 1998)

    Can you think of a more "elegant" (yet still "complete") instruction set of 16 or fewer distinct instructions DI ?


Turing complete cellular automata are proof that a Turing Machine can be built without *any* distinct "instructions".

Hint: You soon get to the stage "Every instruction in this set is essential. If I eliminate *this*, I won't be able to do *that*, no matter how many of the remaining instructions I string together". While you can't simply eliminate an instruction from this set, sometimes you can replace 3 instructions from a set with 2 completely different instructions that, given the remaining instructions, can still do *that*. It helps to assume you have some scratchpad RAM, since the fewer internal registers you have, the fewer instructions you need to shuffle things back and forth between those registers.

Myron Plichota and vic plichota have developed a much more elegant set of 16 instructions they call qUark ../mirror/quark.txt

With a bit of a challenge from "vic plichota", and many ideas from Myron Plichota, I've developed an instruction set with only 13 instructions: "# ! + xor nand 2/ push pop dropR swap dup T=0 nop" (13 instructions as of 1999-02-22). Programmer's model: there is an instruction pointer P and 2 stacks, the "return stack" and the "data stack". The top of the data stack is called T and the second on the data stack is called S; the top of the return stack is called R.


I've managed to whittle it down even more with a ``conditional skip if arithmetic result not zero'' idea from Alan Grimes. Instruction summary: "# ! + xor nand 2/ push pop toA Afrom call nop" (12 instructions as of 1999-03-22)

I've managed to whittle it down even more: "# ! + xor nand 2/ push popA AT call nop" (11 instructions as of 1999-03-24)

I *think* this is still functionally complete, and can still do everything that any other Turing-complete CPU can do. It just takes a *lot* more instruction cycles to do most things than most CPUs.

2000-01-05:DAV: I've just stumbled across "BF: An Eight-Instruction Turing-Complete Programming Language which was invented by Urban Mueller solely for the purpose of being able to create a compiler that was less than 256 bytes in size, for the Amiga OS."

    More BF details:

A very clean instruction set, although it takes far more instruction cycles than even my 11 instructions to do even the most trivial things. I wonder if we could redefine the "input" and "output" op-codes ...
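BF's eight instructions (> < + - . , [ ]) are easy to interpret. Here is a small sketch of my own (the 30,000-cell tape and wrapping byte cells are common conventions, not taken from the text above):

```python
# Tiny interpreter for the eight BF instructions.
def bf(code, inp=b""):
    tape, ptr, pc, out, it = [0] * 30000, 0, 0, [], iter(inp)
    # precompute matching [ ] brackets
    stack, match = [], {}
    for i, c in enumerate(code):
        if c == "[":
            stack.append(i)
        elif c == "]":
            j = stack.pop()
            match[i], match[j] = j, i
    while pc < len(code):
        c = code[pc]
        if c == ">": ptr += 1
        elif c == "<": ptr -= 1
        elif c == "+": tape[ptr] = (tape[ptr] + 1) % 256
        elif c == "-": tape[ptr] = (tape[ptr] - 1) % 256
        elif c == ".": out.append(tape[ptr])
        elif c == ",": tape[ptr] = next(it, 0)
        elif c == "[" and tape[ptr] == 0: pc = match[pc]
        elif c == "]" and tape[ptr] != 0: pc = match[pc]
        pc += 1
    return bytes(out)

# multiply 3 * 2 into the next cell, then output it as a raw byte:
assert bf("+++[>++<-]>.") == bytes([6])
# echo one byte of input:
assert bf(",.", b"A") == b"A"
```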

** Programmer's model

"# ! + xor nand 2/ push popA AT call nop" (11 instructions as of 1999-03-24) [perhaps more Forth-like names would be: >R toR instead of push; A> Afrom instead of AT; RA ?? instead of popA]


    2 push-down stacks, the "return stack" and the "data stack".


    P = the instruction pointer (program counter).

R = the top value of the return stack. The return stack stores subroutine return addresses, address pointers, and occasional data.

A = a register to hold temporary data (during "dup", "swap", "drop", and "pop").

T = the top value of the data stack (which holds data values and occasional address pointers).

    S = the second value on the data stack

All registers are the same length, the length of a memory address.

[FIXME: An implementation needs to choose:

sizeof_address (in bits)

sizeof_data_word (in bits) (must be less than or equal to sizeof_address)

depth of data stack (in addresses)

depth of return stack (in addresses)

(Moore fixed (sizeof_address) = (sizeof_data_word + 1 bit). What other choices would be "interesting"???)]

[A slow, minimal-gate implementation needs stack pointers that point to S and R in RAM. Faster processors can keep most or all of the stack on-chip ... What should the processor do when stacks overflow or underflow?]

** Acronyms and notation: cell: the implementation's native integer size, the number of bits read and written at once. [ ... ] indicates that everything inside the brackets is contained in a single cell. ( ... ): parameter-stack diagram, T to far right, then S. < ... >: return-stack diagram, R on far right. |: pipe char used to indicate an "either-or" choice.

Subroutines are documented with the initial (to the left of the "-") and final (to the right of the "-") state of the data stack and the return stack.

    ** External interface summary

Data on the external memory bus always comes from T or from "external devices", and always goes to "external devices" or T or the instruction latch. All data in memory is accessed only on cell boundaries (i.e., only 1 whole cell is read or written at a time).

Addresses on the memory bus always come from P or R (or perhaps from external devices while the CPU is not using the bus).

Since instructions are always only 4 bits each, instructions are packed into "cells", as many as will fit (e.g., 3 instructions per 12 bit cell on some machines, or 4 instructions per 16 bit cell, 5 instructions per 20 bit cell, or 8 instructions per 32 bit cell on other machines). (Does it make any sense for "cells" to be bigger or smaller than sizeof_data_word, the size of a single word of memory?) A minimum of 3 instructions per cell is needed for the ugly hack that works around the lack of a proper "load" command.

Any opcode ``# !R+ + xor nand 2/ push popA AT call nop'' can occur in any slot.
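The 4-bit packing can be sketched in a few lines. The opcode numbering below and the first-slot-in-the-high-bits layout are my assumptions for illustration; the text does not fix a bit layout.

```python
# Sketch: pack 4-bit opcodes into one cell-sized integer.
# Opcode numbers are arbitrary (assumed, not from the original page).
OPCODES = {"#": 0, "!R+": 1, "+": 2, "xor": 3, "nand": 4, "2/": 5,
           "push": 6, "popA": 7, "AT": 8, "call": 9, "nop": 10}

def pack(cell_bits, mnemonics):
    """Pack a space-separated instruction string into one cell."""
    slots = cell_bits // 4
    ops = mnemonics.split()
    assert len(ops) <= slots, "too many instructions for this cell"
    ops += ["nop"] * (slots - len(ops))     # cnop-style padding
    word = 0
    for m in ops:                           # first-executed slot in high bits
        word = (word << 4) | OPCODES[m]
    return word

assert pack(16, "# push call nop") == 0x069A   # 4 instructions per 16-bit cell
assert pack(12, "# push call") == 0x069        # 3 instructions per 12-bit cell
```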

[FIXME: How are interrupts handled? computer_architecture.html#interrupt]

    It seems that the instructions execute so quickly that the bottleneck is the speed of the RAM.

    ** Memory access instructions:

Surprisingly, only 1 memory read instruction is needed for both in-line literals and for data everywhere else in memory:

// ( -- x )
@P+ // load data in RAM at [P], push it onto T, then increment P.
load // ditto
# // ditto

For example, when the CPU executes a cell that starts with 3 "#" instructions, then the following 3 cells are pushed onto the data stack (the second one ends up on S, the third one ends up on T), then the remaining instructions in the cell are executed, and then the 4th following cell is loaded into the instruction latch and executed.

// ( ... -- ... data1 data2 data3 )
[ # # # ... ] [data1] [data2] [data3] [more instructions ...] ...

The assembler pseudo-op "#(value)" keeps track of implementation details (word size, register size, sign-extension, the current instruction word, where the next literal/instruction cell will be loaded from, etc.). "#(value)" expands to "#" (or perhaps "# com" or "# unsigned com") and packs the (possibly complemented) literal value into the next available (sizeof_data_word) cell, such that when that code is executed, the desired value gets loaded into T.

To load data that is *not* an in-line literal, somehow calculate the address and get it onto R, then pack these instructions into a single cell: "swapPR load swapPR".

    There is also only one write instruction:

// ( x -- ) < address -- address+1 >
!R+ // pop T, store popped value to RAM at [R], then increment R.
store // ditto

(??? should we use A for the address instead of R, a la the F21?) (Hardware *might* be a bit simpler if P is used for *all* addresses to the memory bus, replacing !R+ with !P+.)

2 operand arithmetical instructions

The 2 operand arithmetical instructions always pop 2 items off the data stack (S and T) and push the result back onto the data stack (into T).


+ ( n1 n2 -- n1_+_n2 ) add S to T
xor ( n1 n2 -- n1_bitxor_n2 ) bitwise exclusive-or S to T
nand ( n1 n2 -- ~(n1_bitand_n2) ) bitwise nand S to T

One way to write software to implement multiple-precision arithmetic is to use the most significant bit of T as the carry bit, extending precision in (sizeof_address - 1) bit chunks.
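That carry idea can be sketched in a high-level language (Python here, with a 16-bit word assumed): keep data in 15-bit chunks so a single 16-bit add can never overflow, and read the carry out of the most significant bit of the result.

```python
# Sketch of multiple-precision add with the carry in the MSB.
# 16-bit words / 15-bit chunks are an assumption for illustration.
CHUNK_BITS = 15
WORD_MASK = (1 << 16) - 1
CHUNK_MASK = (1 << CHUNK_BITS) - 1

def add_mp(a_chunks, b_chunks):
    """Add two equal-length little-endian lists of 15-bit chunks."""
    out, carry = [], 0
    for a, b in zip(a_chunks, b_chunks):
        s = (a + b + carry) & WORD_MASK
        carry = s >> CHUNK_BITS          # the "carry bit" lands in the MSB
        out.append(s & CHUNK_MASK)
    return out, carry

# 0x7FFF + 1 overflows a single chunk into the carry:
assert add_mp([0x7FFF], [1]) == ([0], 1)
# ...and in a two-chunk number the carry propagates into the next chunk:
assert add_mp([0x7FFF, 0], [1, 0]) == ([0, 1], 0)
```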

1 operand arithmetical instructions

2/ ( x -- x/2 ) arithmetic shift right of T (keep sign), and skip next instruction if T was not zero. (Next instruction only executed if T was (and still is) 0.)
nop does nothing

    [can "nop" be eliminated ?]

This "2/" is the only conditional instruction in the entire set.

The skipped instruction is typically "call" or "nop" or another ``2/''. For example, to shift 3 bits to the right, do "2/ 2/ 2/ nop". (I leave it as an exercise to the reader to show this sequence always has the net effect of unconditionally shifting T three times to the right).
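The skip rule alone can be sketched with a hypothetical run() helper of my own that models only T and the skip flag (no stacks, no memory); "call" and "nop" are treated as placeholders with no effect, just to show which instructions get executed.

```python
# Sketch of the "2/" conditional-skip rule in isolation.
def run(T, ops):
    """Return (final T, list of instructions actually executed)."""
    executed, skip = [], False
    for op in ops:
        if skip:
            skip = False
            continue
        executed.append(op)
        if op == "2/":
            was_zero = (T == 0)
            T >>= 1                  # Python's >> on ints is arithmetic
            if not was_zero:
                skip = True          # next instruction runs only if T was 0
    return T, executed

# T nonzero: the instruction after "2/" is skipped...
assert run(4, ["2/", "call", "nop"]) == (2, ["2/", "nop"])
# ...but with T zero, the conditional "call" is reached:
assert run(0, ["2/", "call", "nop"]) == (0, ["2/", "call", "nop"])
```

This is the "2/ call" idiom mentioned next: the call happens only when T was zero.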

[It may make hardware simpler (allow interrupts at end of every cell) if the programmer model always includes a virtual "nop" at the end of every cell, i.e., even if a 2/ is the last instruction of a cell, the first instruction of the next cell is executed unconditionally. This means the compiler must insert or delete a ``nop'' to make the ``2/'' do what the programmer expects:

2/ call --> [... nop] [2/ call ...]
2/ nop call --> [... 2/ ] [call ... ]
2/ 2/ --> [... 2/ ] [2/ ... ] ]

[Other options: perhaps make "2/", if the result is not zero, skip *all* the remaining instructions of the current cell. Perhaps make "2/", if the result is not zero, skip precisely 3 instructions: those following it in this cell and perhaps the start of the next cell. This delay slot could be filled with "AT push popA", reducing the need for the "nop" instruction.]

Stack manipulation instructions

push ( x -- ) < -- x > pop T, push onto R
popA ( -- ) < x -- > pop R, push onto A
dropR ( -- ) < x -- > // alias for popA
AT ( -- x ) push a copy of A onto T (leaving A unchanged).

    Data in any register in the list (T R A T) can be moved in a single cycle to the register to its right.

"popA" is a hint to the reader that the value of A will soon be used with AT; "dropR" is a hint to the reader that the value of A is now irrelevant.


Branches (change-of-program-flow instructions)

There is only 1 branch instruction, only one way to modify the value of P (but lots of different aliases to make the intent of the programmer clear):

// various aliases for the same branch instruction bit pattern
swapPR ( -- ) < future_P -- past_P > // swap P with R.
call
branch
return
exit
; // pronounced "exit"
jmp

P typically (but not always) points to the cell following the cell from which the currently-executing instructions were taken.

Once a cell is loaded into the instruction latch for execution, P is incremented, and then *all* of the instructions in that cell are executed (possibly modifying P), from the first to the last. The CPU never skips the remaining instructions in a cell (not even for the call instruction).

If a "call" is immediately followed by a "#" instruction, then that distant value at [P] is loaded, *not* the data in the cell immediately following the cell currently being executed.

After *all* instructions in a cell have executed, the next cell of instructions is read from this new value of [P] into the instruction latch, P is incremented, and then all the instructions in that cell are always executed (possibly modifying P). Whatever value of P exists immediately after they are all executed is the address where the next instruction cell will be loaded from.

Sequences that "temporarily" change P must be very careful to restore P before the end of that cell.

All branch destinations are to (the first instruction of) a particular cell, not to some (other) instruction inside that cell.

** prefetch implementation

Prefetch may be implemented on some chips. It has no effect on the programmer model; all implementations act "as if" there was no prefetch.

Typically, while the instructions in one cell are executed, the next cell is being pre-fetched. That pre-fetched cell will in turn be executed unless (a) the current cell mentions "load" (#), which diverts the pre-fetched cell to T instead of the instruction latch, increments P, and starts pre-fetching the following cell, or (b) the current cell modifies P with "swapPR". The CPU will then flush the ``next cell'' it speculatively pre-loaded, then start pre-fetching the next instruction where P now points.

[A strange alternative: Instead, a chip *could* stick with the absolute simplest thing to do with "call" -- merely swap R and P, and not worry about the consequences.

    This has a *major* impact on the programmer's model.

This means that pre-fetched cells are *never* ``wasted''. The consequences are that after a cell with one ``call'' and no ``load'' instructions is fetched, the cell at the following address is prefetched and will be executed unconditionally (it is in the "branch delay slot" of a "delayed branch" a la Sparc ???), while prefetch loads the first cell of that subroutine. (Note that there is *both* a call delay slot *and* a return delay slot.)

Confusing, difficult-to-explain things happen if one tries to mix "call" followed by several "#"s in the same cell. The first "#" loads data from the cell immediately following the cell being executed, while the following "#"s load data from distant cells.

Other strange things happen if you put multiple call instructions in consecutive cells.

If you do this, then the macro for @R+ expands to:

: @R+
// next 4 instructions must be in same cell.
swapPR
@P+ // puts a wasted prefetched value in T, starts loading desired value
swapPR
@P+ // puts the desired value in T, starts loading next instruction
cnop
[ cnop ] // this cell is skipped
swap // get the wasted value on top
drop // get rid of it.
; ]

[ We might be able to have a minimal implementation that packs only 1 instruction per cell, feeding out of a 4 bit RAM, if we change the instruction buffer so it pre-fetches and holds not only the currently executing instruction, but also the 2 following instructions. The ugly hack to fake a "load [R]" would work like this: while executing the source code "swapPR swapPR # push call", the instruction buffer would look like this:

1. ``swapPR swapPR #'' (P and R get swapped; then data at [P++] is read)

2. ``swapPR # (data)'' (P and R get swapped; instruction at [P++] is read)

3. ``# (data) push'' (data shunted to T, instruction buffer overwritten with NOP; instruction at [P++] is read)

4. ``nop push call'' (etc.)

    every instruction triggers 1 read cycle to get the next instruction and shifts the instruction queue down one. ]

If we are feeding out of a 4 bit RAM but we want the register size to be N times larger, then we could prefetch N+1 following instructions (confused yet?). Then the sequence would be (for N=3, so T = 3*4 bits = 12 bits):

    1. ``swapPR nop nop swapPR #'' (P and R get swapped; then data at [P++] is read)

    2. ``nop nop swapPR # (data)'' (nop; then data at [P++] is read)

    3. ``nop swapPR # (data) (data)'' (nop; then data at [P++] is read)

    4. ``swapPR # (data) (data) (data)'' (P and R get swapped; instruction at [P++] is read)

5. ``# (data) (data) (data) push'' (data shunted to T, instruction buffer overwritten with NOPs; instruction at [P++] is read)

    6. ``nop nop nop push call'' (etc.)

    ** The compiler/assembler

The assembler pseudo-op ``cnop'' fills the rest of the current cell with (zero or more) NOPs, sufficient to align the next instruction with the start of the next cell. This alignment is necessary when you want that next instruction to be the destination of some branch in the code. (This is implied by labels at the start of a subroutine, and the "then" and "else" of an "if-then" statement.)

The compiler automatically generates inline code whenever the word being compiled would take *less* space inline than the subroutine call. (-- idea from Myron Plichota)

Should the compiler choose to actually generate a call, ``... subroutine_label ...'' expands to something like this:

[ ... # push call ][ subroutine_label ]
[ ... ]
...
subroutine_label:
... // instructions to actually do something
[ ... return dropR ]

To reduce space slightly, the compiler may choose to compact lists of subroutine calls (after making sure that none of the subroutines will interfere too much with the return stack) from

[ ... # push call ][ name_1 ]
[ ... # push call ][ name_2 ]
[ ... # push call ][ name_3 ]
[ ... # push call ][ name_4 ]
[ ... return dropR ]

to

[ # # # # push push ][ name_1 ][ name_2 ][ name_3 ][ name_4 ][ push push jump dropR ]

or to

[ # push # push ][ name_4 ][ name_3 ][ # push # push jump dropR ][ name_2 ][ name_1 ]

    (This is a generalization of "tailbiting").

To minimize space (but not time) in long sequences of only subroutine calls, some compilers may use direct threading. data_compression.html#program_compression Here is an example of a fully re-entrant direct-threaded subroutine list interpreter:


[ ... # push call ][ subroutine_list_interpreter ]
[ sub1 ][ sub2 ][ sub3 ][ sub4 ][ continue ]

continue: // ( ) < oldP p_subroutine_list >
[ dropR dropR ... ]
...
subroutine_list_interpreter: ( -- ) < p_subroutine_list -- >
[ swapPR # swapPR push call cnop ] // = [ @R! push call cnop ]
[ # jump dropR cnop ][ subroutine_list_interpreter ]

This subroutine list interpreter assumes these subroutines don't modify the top 2 items on the return stack; it's ok if they use deeper items for parameters.

Once you have this subroutine list interpreter, you can collapse other lists of subroutine calls to 2 full cells + partial cells of overhead, plus 1 more cell per subroutine call.
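The direct-threading idea itself is easy to sketch in a high-level language: the "subroutine list" is just a list of callables and the interpreter walks it. The subroutine names below are illustrative only, not from the original page.

```python
# Sketch of direct threading: a list of callables walked by an interpreter.
def interpret(subroutine_list, state):
    """Run each threaded subroutine in order against a shared state."""
    for sub in subroutine_list:
        sub(state)
    return state

def inc(s):      # hypothetical subroutine: increment a counter
    s["x"] += 1

def dbl(s):      # hypothetical subroutine: double it
    s["x"] *= 2

assert interpret([inc, dbl, inc], {"x": 3}) == {"x": 9}
```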

The compiler converts ``... if ... true_stuff ... then ...'' statements to something like

// last 5 instructions must all be in same cell
[ ... # push 2/ jmp dropR push dropR ][ then ]
[ ... true_stuff ... ]
...
then:
...

The compiler converts ``... if ... true_stuff ... else ... false_stuff ... then ...'' statements to something like

// last 5 instructions must all be in same cell
[ ... # push 2/ jmp dropR push dropR ][ else ]
[ ... true_stuff ... ]
...
[ ... # push jmp dropR ][ then ]
else:
[ ... false_stuff ... ]
...
then:
...

A minimum implementation (only 3 instructions per cell) is forced to use this alternate form:


// last 2 instructions must be in same cell
[ ... ... # push 2/ jmp ][ else ]
[ dropR push dropR ... true_stuff ... ]
...
[ ... # push jmp dropR ][ then ]
else:
[ dropR push dropR ... false_stuff ... ]
...
then:
[ ... ]
...

A smarter compiler might have a special ``else 0 then'' case for code like ``... if ... true_stuff ... else 0 then ...'', compiling to

[ ... # push 2/ jmp dropR ] // ( refrain from dropping T )
[ (then) ]
[ push dropR ... true_stuff ... ] // T was non-zero; drop it
...
then:
... // if T was 0, now T is still 0.

(Note that the "2/" never drops T, even in the case where T=0 and the CPU actually executes that next instruction. I suppose a really smart compiler might be able to refrain from dropping T in certain unusual situations when we later want to use T/2 in the true_stuff and/or 0 in more complicated false_stuff.)

The compiler converts ``... DO ... LOOP ...'' into [FIXME]. DO: loop initialization ... LOOP: increment-by-1 loop. +LOOP: variable increment loop ???

Some processors have the only conditional instruction be a "conditional return if 0" instruction. Another way of implementing ``... if ... true_stuff ... then ...'' is to use the "conditional-return" style:

[ ... # push call ][ subroutine_label ]
[ dropR ... ]
...
subroutine_label:
[ ... condition ... ]
[ ... 2/ return push dropR ] // conditional return
[ ... true_stuff ... ]
[ ... return ] // unconditional return

A "conditional return" implementation of ``... if ... true_stuff ... else ... false_stuff ... then ...'' looks something like


// could calculate some of condition here
[ ... # push call ][ (else) ]
...
else:
// < return_address >
// could calculate some of condition here
[ ... # push call ][ (then) ]
// < return_address >
[ ... false_stuff ... ]
[ ... return ] // unconditional return
...
then:
// < ... return_address else_address >
[ ... condition ... ]
// either return to "else:",
// or fix up return stack so
// we can do true_stuff
// and return all the way to
// original call.
[ ... 2/ return drop dropR ]
// < return_address >
[ ... true_stuff ... ]
...
[ ... return dropR ] // unconditional return

// some common macros
// using the instruction set
// "# !R+ + xor nand 2/ push popA AT call nop" (DAV 1999-03-24)

    // pseudo-primitives: pop, dup, drop, swapTR, swap.

// pop: take top value off R stack, push onto T stack.
: pop ( -- x ) < x -- >
popA AT ; // always inlined

    : pop_dup ( -- x x ) < x -- >_[[ popA AT AT _]] ; // 3 cycles, always inlined

    // dup: make a duplicate copy of T, push into T and S: dup ( x -- x x )push pop_dup ; // 4 cycles, always inlined

    : drop ( x -- ) < -- >push dropR ; // always inlined

    // swapTR: swap T with R: swapTR ( a -- b ) < b -- a >popA push AT ; // 3 cycles, always inlined

    : swap (... a b --... b a ) swap T with Spush swapTR pop // 6 cycles

    : 0 ( -- 0 )dup dup xor ;


: 0 ( -- 0 )
_[[ AT AT _]] xor ; // 3 cycles, always inlined

// The A register and ``popA'' and ``AT'' instructions
// may be reserved exclusively for these pseudo-primitives to use.
// Then use "push, pop, dup, drop, swap, swapTR, dropR, 0" macros
// as if they were the primitive instructions.
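These pseudo-primitives are easy to sanity-check with a small simulator. The following Python sketch is my own model, not part of the original design (the class and method names are invented for illustration): it implements only the popA/AT/push/dropR primitives and builds the macros above out of them.

```python
# Hypothetical model of the two-stack machine: a data stack (top is T, next
# is S), a return stack (top is R), and the single A register.

class Machine:
    def __init__(self):
        self.data = []   # data stack; data[-1] is T
        self.ret = []    # return stack; ret[-1] is R
        self.a = None    # the A register

    # primitive instructions
    def push(self):  self.ret.append(self.data.pop())  # T -> R
    def popA(self):  self.a = self.ret.pop()           # R -> A
    def AT(self):    self.data.append(self.a)          # copy A -> T
    def dropR(self): self.ret.pop()                    # discard R

    # pseudo-primitives, built exactly as the macros above
    def pop(self):    self.popA(); self.AT()                         # : pop  popA AT ;
    def dup(self):    self.push(); self.popA(); self.AT(); self.AT() # push pop_dup
    def drop(self):   self.push(); self.dropR()                      # : drop  push dropR ;
    def swapTR(self): self.popA(); self.push(); self.AT()            # : swapTR  popA push AT ;
    def swap(self):   self.push(); self.swapTR(); self.pop()         # : swap  push swapTR pop ;

m = Machine()
m.data = [1, 2]
m.swap()
print(m.data)   # [2, 1]
```

Tracing `swap` by hand matches the macro expansion: `push` moves T to R, `swapTR` exchanges the old S with the saved T through A, and `pop` brings the result back.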

    // stack manipulation

: over ( ... b a -- ... b a b )
push dup pop swap ;

: over ( ... b a -- ... b a b ) // 13 cycles if inlined
push dup swapTR pop ;

: rot ( ... a b c -- ... b c a )
push swap pop swap ;

: rot ( ... a b c -- ... b c a )
push push swapTR pop swapTR pop ;

    // arithmetical manipulation

// com: complement: invert all bits of T.
: com ( x -- (-x-1) ) // or equivalently ( x -- -(x+1) )
dup nand ; // 5 cycles

: -1 ( -- -1 )
0 com ; // 8 cycles

: -1 ( -- -1 )
0 0 nand ; // 7 cycles

: negate ( x -- -x )
#(-1) + com ; // 13 cycles, or 7 cycles + 1 RAM cycle

: 1 ( -- +1 )
#(1) ; // 1 cycle + 1 RAM cycle

// too much work for no gain over a straightforward literal:
: 1
-1 dup + com ; // 17 cycles

: 1
_[[ AT AT _]] xor // ( -- 0 )
_[[ AT AT _]] xor // ( 0 -- 0 0 )
nand // ( 0 0 -- -1 )
push
_[[ popA AT AT // ( -1 -- -1 -1 ); A=-1
+ // ( ... -1 -1 -- ... -2 )
AT AT _]]
+ // ( ... -2 -1 -1 -- ... -2 -2 )
nand ; // ( ... -2 -2 -- ... +1 )
// 16 cycles

: and
nand dup nand ;

: or
over com and xor ; // clever sequence from Chuck Moore

: or
com swap com nand ;

: or // 11 cycles


push push
pop_dup nand
pop_dup nand
nand ;

// subtract T from S:
: - ( ... a b -- ... (a-b) )
negate + ;
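The com/and/or/negate/- macros above all lean on standard two's-complement identities. Here is a quick Python check of those identities — an illustration I added, not part of the original, assuming 32-bit wraparound words:

```python
# Check the identities behind the macros, on 32-bit wraparound words.
MASK = 0xFFFFFFFF

def nand(a, b):  return ~(a & b) & MASK
def com(x):      return nand(x, x)              # : com  dup nand ;
def and_(a, b):  return com(nand(a, b))         # : and  nand dup nand ;
def or_(a, b):   return nand(com(a), com(b))    # : or  com swap com nand ;  (De Morgan)
def negate(x):   return com((x + MASK) & MASK)  # : negate  #(-1) + com ;  i.e. -x = ~(x-1)
def sub(a, b):   return (a + negate(b)) & MASK  # : -  negate + ;

print(sub(10, 3))            # 7
print(or_(0b1010, 0b0110))   # 14
```

The `negate` line is the interesting one: adding -1 then complementing gives `~(x-1)`, which equals `-x` in two's complement, so subtraction really can be built from `+`, `nand`, and a literal.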

// 0<> logical buffer: 0->0, nonzero->(-1).
// 0= logical not: 0->(-1); nonzero->0.
// (Would it be better to map false to +1 rather than -1 ?)
// (Do we really use less power by using +1 rather than -1 ?)

// simple and straightforward
: 0= ( 0 -- -1 ) | ( nonzero -- 0 )
if 0 else -1 then ;

: 0<> ( 0 -- 0 ) | ( nonzero -- -1 )
if -1
else 0 // take advantage of special speedup for ``else 0 then''
then ;

: 0= ( 0 -- -1 ) | ( nonzero -- 0 )
0<> com ;

: 0<> ( 0 -- 0 ) | ( nonzero -- -1 )
0= com ;

// doesn't work now -- only works if "swap" is really a primitive
//: 0<> ( 0 -- 0 ) | ( nonzero -- -1 )
// #(-1) swap
// // x=0 | x!=0
// 2/ swap // ( ...1 0 -- ...0 1 ) | ( ...1 x -- ...1 x/2 )
// drop // ( ...0 1 -- ...0 ) | ( ...1 x/2 -- ...1 )

// somewhat more bizarre programming styles
: 0= ( 0 -- -1 ) | ( nonzero -- 0 )
-1 swap
if dup
else 0 // take advantage of special speedup for ``else 0 then''
then
xor ;

    : 0= ( 0 -- -1 ) | ( nonzero -- 0 )


1 push 0 push // ( x -- x )
_[[ popA // ( x -- x ), A=0
// x=0 | x!=0
2/ popA // < 1 -- >, A=1 | < 1 -- 1 >, A=0
AT AT _]] // ( 0 -- ...0 1 1 ) < -- > | ( x/2 -- ...x/2 0 0 ) < 1 -- 1 >
2/ dropR drop // ( 0 1 1 -- ...0 1 ) < -- > | ( ...x/2 0 0 -- ...x/2 0 ) < 1 -- >
push push dropR pop // ( ...0 1 -- 1 ) < -- > | ( ...x/2 0 -- 0 ) < -- >
;

: 0<> ( 0 -- 0 ) | ( nonzero -- -1 )
-1 push
if
pop ret // force a return
else 0
then // take advantage of special speedup for ``else 0 then''
dropR ;

which the compiler expands to

: 0<> ( 0 -- 0 ) | ( nonzero -- -1 )
[ # push # push 2/ jmp dropR ]
[ -1 ]
[ iszero ]
nonzero:
[ push dropR popA AT ret ]
iszero:
[ dropR ret ]

// even more bizarre versions:

: 0<> ( 0 -- 0 ) | ( nonzero -- -1 )
[ 2/ ret cnop ]
nonzero:
[ push dropR # ret cnop ]
[ -1 ]

: 0= ( 0 -- -1 ) | ( nonzero -- 0 )
// this version must *not* be in-lined
#(nonzero) push
2/
swapPR push dropR # dropR ret // these 6 instructions *must* be in the same cell
zero:
[ -1 ]
nonzero:
[ 0 ]

which the compiler expands to

: 0= ( 0 -- -1 ) | ( nonzero -- 0 )
[ # push 2/ swapPR push dropR # dropR ret ]
[ nonzero ]
[ -1 ]
nonzero:
[ 0 ]

// A smart compiler might take any phrase of the form
// ``... if #(a) ... true_stuff ... else #(b) ... false_stuff ... then''
// and compile it to


[ ... # push 2/ jmp dropR push dropR # ][ else ][ a ]
[ ... true_stuff ... ]
...
[ ... # push jmp dropR ][ then ]

else:
[ b ]
[ ... false_stuff ... ]
...

then:
...

: 1+ ( ... a -- ... (1+a) )
com -1 + com ; // 19 cycles

: 1+ ( ... a -- ... (1+a) )
#(1) + ; // 2 cycles + 1 RAM cycle

: 1- ( ... a -- ... (a-1) )
-1 + ; // 9 cycles

// R--: subtract one from R
: R-- ( -- ) < x -- (x-1) >
-1 pop + push ;

    // other macros

// "R@+", "fetch": load word from RAM to T at address in R.
// Note that both swapPR *must* be in the same cell to avoid wild jumps.
: R@+ ( -- value ) < source_address -- (1+source_address) >
_[[ swapPR load swapPR _]] ;

// "@", "fetch": load word from RAM at address in T.
: @ ( source_address -- value ) < -- >
push R@+ dropR ;

// "!", "poke": put value in T into RAM at address S
// ( ... value dest_address -- )
: !
push !R+ dropR ;

: = ( n1 n2 -- (n1==n2) )
xor 0= ;
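The ``xor 0='' trick works because a xor b is zero exactly when every bit of a matches the corresponding bit of b. A tiny Python illustration (mine, not the author's), using the Forth convention that true is -1:

```python
# Equality via xor: the xor of two words is zero iff they are bit-identical.
def equals(a, b):
    x = a ^ b                     # xor
    return -1 if x == 0 else 0    # 0=  (Forth true is -1, false is 0)

print(equals(7, 7))   # -1
print(equals(7, 9))   # 0
```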

    HEX


// from plichota
: MIN- ( -- 80000000 )
80000000 ;

: signbit ( -- 80000000 )
80000000 ;

// from plichota
: MAX+ ( -- 7FFFFFFF )
7FFFFFFF ;

// if T is strictly negative, return true.
// from plichota
: 0< ( x>=0 -- 0 ) | ( x<0 -- -1 )
signbit and 0<> ;

: > ( n1 n2 -- flag=n1>n2 )
- 0< ;

: < ( n1 n2 -- flag=n1<n2 )
swap > ;

: >= ( n1 n2 -- flag=n1>=n2 )
swap [FIXME]

: copyword ( source, dest )
#(dest) #(source) @ ! ;

: copyword ( dest -- 1+dest ) < source -- 1+source >
@R+
push swapTR store pop ;

// Note that both swapPR *must* be in the same cell to avoid wild jumps.
// this requires cell size to be at least 6 instructions
// 9 cycles + 4 RAM cycles
: copy2words ( source -- 2+source ) < dest -- 2+dest >
push
_[[ swapPR
load store
load store
swapPR _]]
pop

// ( dest -- 4+dest ) < source -- 4+source >
: copy4words
(copyword copyword copyword copyword)

// ( -- ) < source dest -- 4+source 4+dest >
// Note that both swapPR *must* be in the same cell to avoid wild jumps.
// this requires cell size to be at least 6 instructions
// 13 cycles + 8 RAM cycles
: copy4words (
_[[ swapPR
load load load load
swapPR _]]
_[[ popA
store store store store
AT _]]
push


_Stack Computers_ by Philip Koopman, 1989, mentions these words:

// ?DUP: conditionally duplicate T if it is non-zero
: ?DUP ( 0 -- 0 ) or ( x -- x x )
dup if dup then ;

(Chuck Moore doesn't like ?DUP: ``1x Forth'' by Charles Moore, April 13, 1999

    http://www.ultratechnology.com/1xforth.htm

    )

U< ( U1 U2 -- FLAG )
Return a true FLAG if U1 is less than U2 when compared as unsigned integers.

U> ( U1 U2 -- FLAG )
Return a true FLAG if U1 is greater than U2 when compared as unsigned integers.

U* ( N1 N2 -- D3 )
Perform unsigned integer multiplication on N1 and N2, yielding the unsigned double-precision result D3.

U/MOD ( D1 N2 -- N3 N4 )
Perform unsigned integer division on D1 and N2, yielding the quotient N4 and the remainder N3.

    http://www.ultratechnology.com/1xforth.htm

How to implement multi-precision operations ? Since there are no condition codes, the carry flag must be pushed onto the data stack as a logical value.

_Stack Computers_ by Philip Koopman, 1989, mentions these words:

RLC ( N1 CIN -- N2 COUT )
Rotate left through carry N1 by 1 bit. CIN is carry-in, COUT is carry-out.

RRC ( N1 CIN -- N2 COUT )
Rotate right through carry N1 by 1 bit. CIN is carry-in, COUT is carry-out.

UNORM ( ... EXP1 U2 -- ... EXP3 U4 )
Floating point normalize of unsigned 32-bit mantissa.

ADC ( ... N1 N2 CIN -- ... N3 COUT )
Add with carry. CIN and COUT are logical flags on the stack.
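Since the machine has no condition codes, an ADC-style word carries the flag as an ordinary stack value. A hedged Python sketch of that style (my illustration; `adc` and `d_add` are invented names, using the Forth convention of -1 for true):

```python
# Multi-precision add with the carry kept as a logical value on the stack.
MASK = 0xFFFFFFFF  # 32-bit words

def adc(n1, n2, cin):
    """( n1 n2 cin -- n3 cout ): add with carry-in; carry-out as a -1/0 flag."""
    total = n1 + n2 + (1 if cin == -1 else 0)
    n3 = total & MASK
    cout = -1 if total > MASK else 0
    return n3, cout

def d_add(a_lo, a_hi, b_lo, b_hi):
    """Add two 64-bit values held as (low word, high word) pairs."""
    lo, carry = adc(a_lo, b_lo, 0)
    hi, _ = adc(a_hi, b_hi, carry)
    return lo, hi

print(d_add(0xFFFFFFFF, 0, 1, 0))  # (0, 1): low word wraps, carry ripples up
```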

// Store the double-precision value D1 at the two memory
// words starting at ADDR.
// [Is most-significant or least-significant word on T ?]
: D! ( ... D1 ADDR -- )

// Drop the double-precision integer D1.
: DDROP ( D1 -- )
drop drop ;

// Duplicate double-precision integer D1 on the stack.
: DDUP ( D1 -- D1 D1 )
over over ;

D+ ( D1 D2 -- D3 )
Return the double-precision sum of D1 and D2 as D3.

D@ ( ADDR -- D1 )
Fetch the double-precision value D1 from memory starting at address ADDR.

DNEGATE ( D1 -- D2 )
Return D2, which is the two's complement of D1.

// Swap the top two double-precision numbers on the stack.
: DSWAP ( D1 D2 -- D2 D1 ) ( ... a b c d -- ... c d a b )
push swap push swap // ( ...c a )
pop pop


// ( ...c a b d )
swap push swap // ( ...c d a )
pop

I ( -- N1 )
Return the index of the currently active loop.

I' ( -- N1 )
Return the limit of the currently active loop.

J ( -- N1 )
Return the index of the outer loop in a nested loop structure.

LEAVE ( -- )
Set the loop counter on the return stack equal to the loop limit to force an exit from the loop.

S->D ( N1 -- D2 )
Sign extend N1 to occupy two words, making it a double-precision integer D2.

SP@ (fetch contents of data stack pointer)
SP! (initialize data stack pointer)
RP@ (fetch contents of return stack pointer)
RP! (initialize return stack pointer)
MATCH (string compare primitive)
ABORT" (error checking & reporting word)
+LOOP (variable increment loop)
/LOOP (variable unsigned increment loop)
CMOVE (string move)

note to assembly language programmers and compiler writers

The compiler must ensure that A is in a don't-care state at the start and end of every cell in order for interrupt routines to be free to use A without saving it.

The compiler can enforce this rule by only allowing the use of the ``popA'' and ``AT'' opcodes in the subroutines "0", "dup", "pop", "swap", "swapTR" (and perhaps a few others), and making sure that when these subroutines are inlined, cell breaks don't occur at the "wrong" place. (The dropR opcode, even though it is the same as popA, can be allowed anywhere A is already in a don't-care state, since it leaves A in a (different) don't-care state.)

Perhaps we need a special "non-breaking space" notation so the programmer can indicate that certain instructions (those sensitive to A, and certain other ones following ``2/'') must be packed into a single cell. (If cells contain N instructions, ``cnop'' forces the next N instructions into the same cell; but when I really wanted the next 3 instructions to stay together, and there were 4 empty slots remaining in the current cell, ``cnop'' is a bit wasteful.)

note to optimizing compiler writers:


A peephole optimizer usually needs to eliminate do-nothing "popA AT push" sequences in straight-line code, since the only effect is to change A, and usually A is in a "don't care" state.

In particular, the "pop swap" macro sequence (in the "over" macro) expands to "pop push swapTR pop", and the obviously do-nothing subsequence "pop push" is further expanded to the do-nothing sequence "popA AT push". Perhaps it would be simplest to immediately replace the "pop swap" sequence with the faster sequence "swapTR pop" and then expand *that* into "popA push AT popA AT".

Similarly, "swap push" expands to "push swapTR pop push", with the same do-nothing subsequence "pop push"; it seems simple to make the compiler smart enough to immediately replace "swap push" with "push swapTR".
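The peephole rule above can be sketched in a few lines of Python (my illustration, not the author's compiler; `expand` and `peephole` are invented names, and the expansion table covers only the macros discussed):

```python
# One-level macro expansion plus a peephole pass that cancels the
# do-nothing "popA AT push" sequence (its only net effect is to change A,
# which is safe to drop while A is in a don't-care state).

def expand(ops):
    table = {"pop": ["popA", "AT"],
             "swap": ["push", "swapTR", "pop"]}
    out = []
    for op in ops:
        out.extend(table.get(op, [op]))
    return out

def peephole(ops):
    out = []
    for op in ops:
        out.append(op)
        if out[-3:] == ["popA", "AT", "push"]:
            del out[-3:]   # net no-op on the stacks; only A changed
    return out

print(peephole(expand(["pop", "swap"])))  # ['swapTR', 'pop']
```

The result, `swapTR pop`, is exactly the faster replacement for `pop swap` suggested above; the cancellation falls out of the peephole rule without a special case.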

Macros to emulate each of the F21 instructions

(The "A" register is simulated as a location in RAM, _A.)

else (unconditional jump): #(dest) push _[[ jmp dropR _]] cnop
T=0(dest) : dup #(dest) push _[[ 2/ jmp dropR push dropR _]] cnop
call(dest) : #(dest) push call cnop dropR
C=0(dest) : #(MSbset) nand dup nand T=0(dest) cnop
; (return) : return cnop cnop

    subroutine start: // no entry sequence needed.

: @R+ // fetch, address in R, then increment R
cnop // next 3 instructions must be in same cell.
swapPR
@P+ // do the load
swapPR // restore P
;

@A+ :
#(_A) // get address of _A (&A)
dup push push @R+ dropR // fetch value of A: (A) < &A >
push @R+ // fetch what A points to: (data) < A+1 &A >
pop ! dropR // update A with new value. (data) < -- >

# : #

@A :
#(_A) // get address of A (&A)
push @R+ dropR // fetch value of A: (A) < -- >
push @R+ // fetch what A points to: (data) < A+1 >
dropR //

!R+ : ! // store, address in R, increment R

!A+ :
#(_A) // get address of A (&A data)
dup push push @R+ dropR // fetch value of A: (A data) < &A >
push ! // pop T, store it at [A]: ( -- ) < A+1 &A >
pop ! dropR // update A with new value. (data) < -- >

!A :
#(_A) // get address of A (&A data)
push @R+ dropR // fetch value of A: (A data) < -- >
push ! // pop T, store it at [A]: ( -- ) < A+1 >
dropR //

com : dup nand
2* : dup +
2/ : 2/


+* : // add S to T if T0 is one: DUP 1 AND IF OVER + THEN
dup
#(1) and
if
over +
then

+* : // add S to T if T0 is one: DUP 1 AND IF OVER + THEN
dup
#(1)
and // expands to ``nand dup nand''
if
over // expands to ``push dup swapTR pop''
else 0
then
+

which expands to (with a good compiler on a 4-instruction-per-word machine)

[ push popA AT AT ] // dup
[ # push # nand ] // set up for ``if'', #(1), ``nand''
[ (then) ]
[ (1) ]
[ push popA AT AT ] // ``dup''
[ nand 2/ jmp dropR ] // ``nand'', ``if''
[ push dropR push push ] // finish ``if'', start ``over'': ``push push''
[ popA AT AT nop ] // continue ``over'': ``dupR''
[ popA push AT nop ] // ``swapTR''
[ popA AT nop nop ] // finish ``over'': ``pop''
then:
[ + ]

+*R : // add R to T if T0 is one: DUP 1 AND IF pop dup push + THEN
[ push popA AT AT ] // dup
[ # push # nand ] // set up for ``if'', #(1), ``nand''
[ (then) ]
[ (1) ]
[ push popA AT AT ] // ``dup''
[ nand 2/ jmp dropR ] // ``nand'', ``if''
[ push dropR nop nop ] // finish ``if''
[ popA AT AT push ] // get copy of R: ``pop dup push''
then:
[ + ] // add either 0 or a copy of R to original T

// add step on return stack
+*_return : // add Rnext to Rtop if Rtop0 is one
// pop DUP 1 AND IF pop dup push + THEN push
[ popA AT AT # ] // pop dup #(1)
[ (1) ]
[ nand push nop nop ] // start of ``and''
[ popA AT AT nand ] // end of ``and''
[ # 2/ jmp dropR ] // ``if''
[ (then) ]
[ push dropR nop nop ] // finish ``if''
[ popA AT AT push ] // get copy of R: ``pop dup push''
then:
[ + push ] // add either 0 or a copy of Rnext to original Rtop


xor : xor
and : nand dup nand
+ : +
pop : popA AT
A@ : #(_A) push @R+ dropR
dup : dup
over : push dup pop swap
over : push dup swapTR pop
which expands to
push push _[[ popA AT AT _]] _[[ popA push AT _]] _[[ popA AT _]]
push : push
A! : #(_A) push ! dropR
nop : nop
drop : push popA

** Ways to expand instruction set and make less ugly

-- delete the "A" register, and replace the 2 instructions "popA" and "AT" with 4 instructions "dup", "swap", "drop", and "pop". (I think this actually simplifies the hardware)

-- Add a proper "load" instruction (perhaps @R+) so we don't have to have this ugly hack of playing around with the program counter (which seems to do bad things with the instruction prefetch).

[FIXME:] 2000-07-23: After further thought, this seems like a very good thing. Replace "!R++" and "@P++" with "@R++" and "!A++". Don't have time to go through and see if this is really superior and update all the examples. Advantages: Now we have a real load and a real store instruction.

You may be asking, "OK, now we can *read* from an address in R, and we can *write* to an address in A -- but how do we get literal values into R ?". err... ummm... we can't. Other than painstakingly mathematically assembling them a bit at a time. Unless... we have a special function at memory address 0.

So normal code, whenever it needs some literal value, does

dup dup xor // get a zero
push // push that zero onto the return stack
swapPR // call the special function at address 0.
60321 // some literal value, embedded in the code
dropR // get rid of the 0.
... // program continues from this point.

This is a classic technique in 6502 assembler (and probably other assembly languages) -- embed literal values, parameters to a subroutine, in the next memory cells after the subroutine call. Then the subroutine can access them using the value on the return stack:

[0 org]
@R++ // grab the constant in the caller's code stream
swapPR // return from subroutine.
dropR // optionally get rid of the 0 in the branch delay.

With this modification, code is *not* forced to be executed an entire 3-instruction bundle at a time.
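The return-address-literal trick can be modeled in Python (my sketch, not the original; the memory layout and addresses 100-102 are invented for illustration, and 60321 is the literal from the example above):

```python
# The caller embeds a literal in the cell after its call; the routine at
# address 0 fetches it *through the return address* and then bumps that
# address so execution resumes past the literal.
mem = {0: "literal-fetch routine", 100: "call 0", 101: 60321, 102: "..."}

def fetch_literal(return_stack, data_stack):
    addr = return_stack.pop()       # caller's resume address: points at the literal
    data_stack.append(mem[addr])    # @R++ : load the embedded constant
    return_stack.append(addr + 1)   # resume *after* the literal

rs, ds = [101], []
fetch_literal(rs, ds)
print(ds, rs)   # [60321] [102]
```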


We don't need to call this subroutine *every* time -- I could imagine that we could have special versions of most subroutines that have an extra "@R++" tacked on the end, to get that literal value set to go.

-- replace the ``swapPR'' jack-of-all-trades branch (it's not as ugly as earlier versions with a ``T=0'' jack-of-all-trades branch) with normal "call" (``:'') and "return" (``;'') instructions (call: P->R, T->P) (return: R->P). The original F21 MISC had 5 branch instructions with several addressing modes each, but they run much faster.

-- delete "nand" and replace with "and" and "com".

Strange variations and ideas (ways to make the instruction set *more* ugly):

maybe make a completely different instruction the conditional instruction ? (Alan Grimes gave me the idea that we can make the jump instruction unconditional, if we design one or more of the arithmetical instructions so that, if they evaluate to zero, they execute the following "jump" instruction (typically either this unconditional jump instruction or NOP).)

Would it be better to have a "-" subtract replace the "+" add ? If one does have a subtract, is it better to have f- or to have r- ? They can emulate each other, of course:

: f- ( ... S T -- ... (S-T) )
swap r- ;

: r- ( ... S T -- ... (T-S) )
swap f- ;

: negate
0 r- ;

: negate
0 swap f- ;

: +
negate r- negate ; // 5 cycles

: +
negate f- ; // 4 cycles

: +_+ ( ... a b c -- ... (a+b+c) )
negate r- r- negate ; // 6 cycles

: +_+ ( ... a b c -- ... (a+b+c) )
negate f- negate f- ; // 8 cycles
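These emulations are easy to spot-check. A Python illustration I added (taking r- as the primitive, with plain integers standing in for stack cells):

```python
# Forward subtract f- ( S T -- S-T ) vs. reverse subtract r- ( S T -- T-S ),
# and addition rebuilt from subtraction alone.
def r_minus(s, t):  return t - s           # primitive, say
def f_minus(s, t):  return r_minus(t, s)   # : f-  swap r- ;
def negate(x):      return r_minus(x, 0)   # : negate  0 r- ;   (push 0, then r-)

def add_via_r(a, b):                       # : +  negate r- negate ;
    return negate(r_minus(a, negate(b)))

def add_via_f(a, b):                       # : +  negate f- ;
    return f_minus(a, negate(b))

print(add_via_r(5, 7), add_via_f(5, 7))    # 12 12
```

Both reductions confirm the cycle-count comparison above: the f- version needs one fewer word (a - (-b) directly), while the r- version pays an extra negate.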

    (??? would it be better to make ``2/'' *unsigned* shift-in-zeros ?)

make the jump instruction conditional, like this:

There is only 1 branch instruction, only one way to modify the value of P; it is also the only conditional instruction.

T=0 conditional branch: conditionally swap P with R.

(??? should this also drop T when T is zero ? When you *know* it is zero, why bother keeping it ?)

(otherwise skip the "jump" and continue).

+* : // add S to T if T0 is one: DUP 1 AND IF OVER + THEN
dup #(1) nand dup nand ( x -- lsb x )
#(continuelabel) push T=0 cnop // (1 x) < continuelabel >
drop over +
continuelabel: // (x) < old_address >
dropR

Once a cell (pointed to by [P]) is loaded for execution, all of the instructions in that cell are executed, from the first to the last. The CPU never skips the remaining instructions in a cell (not even for the branch instruction T=0). The CPU never starts executing somewhere in the middle of a cell -- all branch destinations are to (the first instruction of) a particular cell, not to some (other) instruction inside that cell. Once a cell has been loaded from a particular address into the instruction latch for execution, P is incremented.


Typically the instructions in that cell are executed while the contents of the next address are being loaded; that next cell will in turn be executed unless (a) the current cell mentions "load" (#), which diverts the next cell to T instead of the instruction latch, increments P, and starts loading the following cell; or (b) the current cell modifies P with "T=0". (We *could* just flush the "next cell" we just loaded, then load where P now points, then start executing along this new branch.)

Can "A" be thought of as part of a stack, so that a minimal-gate implementation only needs to hold P, T, the instruction latch, and pointers to the 2 stacks ? (What other internal registers are there ?) If A is part of the T S stack, then it needs to be copied when the stack shrinks (i.e., when we have 2-operand instructions or we have "push"). Perhaps making it part of the R stack will be easier. If A is part of the R stack, then "swapPR" is OK, "popA" is trivial (decrement the pointer to A in the R stack), "AT" is OK, and "push" is a bit tricky -- it needs to copy the old value of A to the next higher location, then move T to the old location of A.

It may make the hardware simpler for the "conditional skip next instruction" to only work *inside* a cell; i.e., after the end of every cell is an implied "nop", and if the conditional skip instruction is the last instruction in a cell, only that implied "nop" is skipped, not the 1st instruction of the next cell. Does this *really* make the hardware any simpler ?

One needs either "dropT" or "dropR"; the other can be synthesized:

: dropT
push dropR ;

: dropR
pop dropT ;

Picking "dropR" seems to reduce branch and call overhead.

One needs either "swap" or "over"; the other can be synthesized:

: over
push dup pop swap ;
>R dup >A R> A> ;

: swap
over push push drop pop pop ;
push popA push AT pop ;
>A >R A> R> ;

A "2-register" branch instruction can replace both call and return:

...
load(label)
SWAP(P,T)
cnop
(next instruction executed after subroutine returns)
dropT
...
label: (start of leaf subroutine)
push
...
pop
SWAP(P,T)
cnop

(6 instructions, not counting cnop). Or


alternatively

...
load(label)
push
SWAP(P,R)
cnop
(next instruction executed after subroutine returns)
dropR
...
label: (start of leaf subroutine)
...
SWAP(P,R)
cnop

(5 instructions, not counting cnop) (6 instructions if we replace "dropR" with "pop dropT")

A "3-register" branch instruction can replace both call and return.

call-like "3-register" branch:

...
load(label)
P->R, T->P
cnop
(next instruction executed after subroutine returns)
dropR
...
label: (start of leaf subroutine)
...
pop
P->R, T->P
cnop

(5 instructions, not counting cnop) (6 instructions if we replace "dropR" with "pop dropT")

"Go To" style jumps can be coded

#(label) push
P->R, T->P // jump
...
label:
dropR // drop the unused address from R
...

Conditional "Go To" style jumps can be coded

#(label) push
conditional(T=0 ? P->R, T->P)
...
label:
dropR // drop the unused address from the R stack
...


return-like "3-register" branch:

...
load label
push
P->T, R->P // call subroutine
cnop
(next instruction executed after subroutine returns)
dropT
...
label: (start of leaf subroutine)
push // save the return address on the R stack
...
P->T, R->P
cnop

(6 instructions, not counting cnop) The return-like "3-register" branch seems to always be inferior to the call-like "3-register" branch, because the call sequence requires that extra "push" instruction.

One could just make P the top of the return stack, making R not directly accessible ... one must be much more careful not to change P accidentally when temporarily using R.

Then push and pop do dual purpose as call and return.

Then "Go To" style jumps can be coded

#(label) push

with matching "come from" locations coded

label:
pop pop drop push //

...or alternatively

#(label) pop drop push

with no special code needed at the "come from" location.

Conditional "If (stuff) Then (thenpart) endif (continuation)" style structures are coded

(stuff) // leaves result on T
// T=0 conditionally drops either P (nonzero) or R (zero) ???
#(label2) push T=0 cnop

label:
(thenpart)

label2:
(continuation)

"If (stuff) Then (thenpart) else (elsepart) endif (continuation)":

(stuff) // leaves result on T
// T=0 conditionally drops either P (nonzero) or R (zero) ???
#(label2) push T=0 cnop
label:
(thenpart)
pop drop // get rid of current value of P
#(label3) push // get new value of P

label2:
(elsepart)

label3:
(continuation)

Subroutine calls can be coded


# (com) push ...

while subroutines themselves can be coded

// the return address is on the R stack
...
pop drop

I played with this (1998) and got the following instruction summary.

The 15 instruction codes are:

Code : Description ( Traditional Forth, where A is a variable )

@P+ : fetch, address in P, push data onto T, P++ ( P @ @ 1 P +! )
!P+ : store, address in P, pop data from T, P++ ( P @ ! 1 P +! )

push : pop T, push into R ( >R )
pop : pop R, push into T ( R> )

T=0 : if T0 to T19 all zero, nop; else drop P. (note: ignores MSbit of T !)

A! : pop T into A ( A ! )
A@ : push A into T ( A @ )

xor : exclusive-or S to T ( XOR )
nand : nand S to T ( AND COM )
+ : add S to T ( + )

2/ : shift T, T20 to T19 (T20 unchanged) ( 2/ )
dup : push T into T ( DUP )
over : push S into T ( OVER )
drop : pop T ( DROP )

nop : ( NOP )

The next instruction fetch begins as soon as no memory instructions (@P+ !P+) or possible branches (push pop T=0) are pending.

// fetch, address in R, increment R
@R+ : pop @P+ over push push drop pop

// store, address in R, increment R
!R+ : pop over !P+ push drop

@ : push @P+ pop drop // fetch, replacing address in T with data.
! : push !P+ pop drop // store, address in T, data in S, removing both.

Memory load instructions:

One *could* read everything through a single register, @P+ and !P+, but then references to variables in data RAM require careful manipulation to put the variable address in P, do the load, and restore the next-instruction address to P before the end of the cell:

swapPR
!P+
swapPR


One *could* access everything through a single register, @R+ and !R+, but then references to in-line constants require that the sequence

swapPR
@R+
swapPR

finish before the end of the cell.

It would be nice to *allow* a prefetch mechanism to work properly:

As soon as an instruction is loaded into the instruction latch and starts to execute, the next word starts to be loaded into a prefetch buffer.

If the instruction finishes with no memory access, we wait for the word to finish loading, then copy the prefetch buffer to the instruction latch. If the instruction has a # literal load, we wait for the word to finish loading, then copy the prefetch buffer to T, and start loading the *next* word. If the instruction has a data-memory access ... fastest would be to wait for the word to finish loading, somehow keep it in some other buffer, and start another memory cycle to load the requested item to T. But a simpler circuit would just interrupt and throw away the current cycle, start a new cycle to load the item to the buffer like all memory accesses do, and when that is done move it to T and re-start the cycle to load the next instruction. (This may be faster if memory is *very* slow and can be cancelled quickly.)

The only case where this prefetch doesn't help is a calculated jump; then one must interrupt and throw away the current cycle, start a new cycle to load the new P location, and when that is done move it from the buffer to the instruction latch.

We might get away with *not* having a "delayed branch" for normal calls (#(dest) push call) if we could force them into the sequence

- execute current instruction while loading next cell
- when you hit #, wait for cell to finish loading
- don't pre-fetch any more cells until we're ready to put this new address on the data bus
- load new cell from [dest] into instruction register and continue from there. (but this still gives a delayed branch for the return).

Using @P+ (rather than @R+) as the only memory access instruction seems to better reflect the state of the "simpler circuit" prefetch mechanism.

If one tries to replace the unconditional SWAP(P,R) instruction with a conditional cSWAP(P,R) instruction, then it becomes more difficult to pack a data memory load sequence into a single cell. One needs something like

// < address -- address+1 >
// ( -- [address] )
dup dup xor // push 0 to T
// all in one word
cSWAP(P,R) // always swaps, since T is zero
@P+ // do the load
swap // get the 0 back on top
cSWAP(P,R) // restore P
// end of word
dropT // get rid of the 0.

OK, packing 4 instructions into a single


word isn't too difficult. Although it *does* seem like a lot of work (8 instructions) to do a pretty fundamental operation. (It would only be 4 instructions if we add an unconditional SWAP(P,R) instruction; if we instead add a @R+ instruction, it would only take 1 instruction.) Something simple like "copy one word from this address to that address" becomes

// < source dest -- source++ dest++ >
// ( -- )
dup dup xor // push 0 to T
// all in one word
cSWAP(P,R) // always swaps, since T is zero
@P+ // do the load
swap // get the 0 back on top
cSWAP(P,R) // restore P
// end of word
dropT // get rid of the 0.
pop
swap // now < dest > ( value source++ )
!R+
push

Since you never have multiple branches inside a single cell, maybe you could separate the flow-control and perhaps the "#" literal instruction, putting them in a special field. The remaining normal sequential instructions, since there are fewer of them, might fit into fewer bits in the remaining fields of the cell. This may be a slippery slope leading to a VLIW CPU rather than a MISC CPU. But it avoids the confusion of "What if there are multiple flow-control instructions in a word ?" Basically, I'm thinking something like a group of bits to indicate which type of cell this is:

(a) this entire cell is a literal; read it into T and go on to the next word.
(b) this cell is a normal sequential list of instructions; read them into the instruction latch, execute them, and go on to the next word.
(c) when you're done with the instructions in this cell, do a conditional branch or call of type XXX.
(d) when you're done with these instructions, return.

At the time I'm writing this, that is 4 different possibilities, which requires 2 bits, but it eliminates 3 instructions from my opcode list. (Just a single bit could be used to discriminate between sequential v. branching cells ...)

I don't know if I like the semantics of the conditional branch instruction T=0. Do we want it to test *all* of T ? Everything except the most significant bit of T (the carry bit) ? Do we want it to take the value T and conditionally (based on S, so we'll call it S=0) either move it to push down P (P->R, T->P) or just throw the value away ? Do we want T=0 to conditionally pop the value off P (R->P), or do nothing ? Do we want T=0 to conditionally *replace* the value P with a *copy* of the next value below it on the stack, or do nothing ? Do we want the zero value to be popped or not ?

Is this signed 2/ better than unsigned ? (I think so, since it's much easier to synthesize an unsigned shift -- generate a mask and strip off the shifted-in one bits -- than it is to have the unsigned shift be primitive and try to synthesize a signed shift.)
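That claim is easy to demonstrate: with a sign-extending 2/ as the primitive, the unsigned shift is just the signed shift with the replicated sign bit masked back off. A Python sketch I added (assuming 21-bit words, matching the T20/T19 description above):

```python
# Synthesizing an unsigned (logical) right shift from a signed
# (arithmetic) 2/ primitive, on 21-bit words.
WIDTH = 21
MASK = (1 << WIDTH) - 1
SIGN = 1 << (WIDTH - 1)   # T20, the bit that 2/ leaves unchanged

def signed_shr(x):        # primitive 2/: arithmetic shift right by one
    return ((x >> 1) | SIGN) & MASK if x & SIGN else x >> 1

def unsigned_shr(x):      # synthesized: strip the shifted-in copy of the sign
    return signed_shr(x) & ~SIGN & MASK

x = 0b100000000000000000010            # sign bit set
print(bin(signed_shr(x)))              # 0b110000000000000000001
print(bin(unsigned_shr(x)))            # 0b10000000000000000001
```

Going the other direction — building a sign-extending shift out of an unsigned primitive — needs a conditional OR of the mask back in, which is why the signed primitive is the cheaper choice.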