Gracana's Projects

RC2018/09 - Forth Machine


SEP 9 - LEARNING

Every time I dig into the various FRISC papers, I discover something new. In my ALU post I complained about missing information. Well, it's not missing, I only had to look harder: I just learned that only the 0, 1, Z, and N condition codes are valid following logic operations. Carry and overflow don't make sense for logic operations, so they're not valid. Good! I had figured that must be the case, but it's nice to have confirmed it.

I also learned more about the instruction cycle. It turns out that there is no point in splitting apart the ALU/stackop and register-write parts of the instruction execution. They can occur concurrently, which means I can use the four/six cycle instruction sequence discussed earlier. That's good, because as I just learned, instruction fetches are done as two distinct operations. In phase A (or first cycle, in my case) the next-instruction address is calculated and written to the A-bus. In phase B (or second cycle), the instruction is read from memory on the B-bus.

Knowing that, I can now explain how it's possible to fetch the correct instruction when the instruction address depends on a concurrently-executing branch instruction: The condition flag is set by a previous instruction, so the branch choice depends on a single bit, and the choices are the immediate literal field of the branch instruction, or the incremented program counter. That logic is simple and has minimal delay, so it's easily completed in time to write the address to the A-bus, as described above.

How does that work if an ALU instruction is executed and the program counter is written in phase B / on the second cycle? I couldn't figure that one out; it seemed like you must have to add an additional memory access cycle, but the documentation was clear that branches didn't take extra time. Well, according to The Architecture of the SC32 Forth Engine, "The PC cannot be explicitly loaded by an instruction even though this possibility is implied[...]". I guess that's cleared up, then.


SEP 9 - STACKS

Implementing the stack memories has been quite difficult! The stacks are the heart of the FRISC, so their behavior requires a good understanding of everything that connects to them. I think I have it now:

I now have a stack module that implements all of the necessary functions:

File: alu.v
module stackmem (
        input wire clk,
        input wire read,
        input wire write,
        input wire push,
        input wire pop,
        input wire [1:0] sel,
        input wire signed [31:0] in,
        output reg signed [31:0] out,
        output reg signed [31:0] tos,
        output reg [3:0] stack_ptr
);

        integer i;
        reg signed [31:0] m [0:15];
        wire [3:0] next_ptr;
        wire [3:0] sel_ptr;
        wire [3:0] increment;

        assign increment = (push == 1) ? 1 :
                           (pop == 1) ? -1 :
                           0;
        assign next_ptr = stack_ptr + increment;
        assign sel_ptr = next_ptr - sel;

        initial begin
                stack_ptr = 0;
                tos = 0;
                out = 0;
                for (i = 0; i < 16; i = i + 1) m[i] = 32'd0;
        end

        always @(posedge clk) begin
                if (read) begin
                        out <= m[sel_ptr];
                end
                else begin
                        stack_ptr <= next_ptr;
                        if (write) begin
                                m[sel_ptr] <= in;
                                if (sel == 0) tos <= in;
                        end
                        else tos <= m[next_ptr];
                end
        end
endmodule

File: alu_tb.v
module stackmem_tb;
        reg clk;
        reg read;
        reg write;
        reg push;
        reg pop;
        reg [1:0] sel;
        reg signed [31:0] in;
        wire signed [31:0] out;
        wire signed [31:0] tos;
        wire [3:0] stack_ptr;

        stackmem ds (clk, read, write, push, pop, sel, in, out, tos, stack_ptr);

        initial begin
                clk = 0;
                read = 0;
                write = 0;
                push = 0;
                pop = 0;
                sel = 2'd0;
                in = 32'd0;

                status();
                stackow(0, 32'd1823, 1, 1, 0);
                stackow(0, -32'd16990, 1, 1, 0);
                stackow(0, 32'd0, 1, 1, 0);
                stackow(0, 32'd400, 1, 1, 0);
                stackr(0);
                stackr(1);
                stackr(2);
                stackr(3);
                stackow(0, 0, 0, 0, 1);
                stackow(0, 0, 0, 0, 1);
                stackow(0, 0, 0, 0, 1);
                stackow(0, 0, 0, 0, 1);

                $finish;
        end

        always
                #1 clk = ~clk;

        task status;
                integer i;
                begin
                        $display("in: %0d out: %0d tos: %0d ptr: %0d",
                                in, out, tos, stack_ptr);
                        $write("stack:");
                        for (i = 0; i < 16; i = i + 1) begin
                                if (i == ds.stack_ptr)
                                        $write(" t:%0d", ds.m[i]);
                                else
                                        $write(" %0d", ds.m[i]);
                        end
                        $display("\n");
                end
        endtask

        // stack read
        task stackr;
                input [1:0] addr;
                begin
                        read = 1;
                        write = 0;
                        push = 0;
                        pop = 0;
                        sel = addr;
                        in = 32'd0;

                        @(posedge clk);
                        @(negedge clk);

                        $display("read reg #%0d:", sel);
                        status();
                end
        endtask

        // stack op & stack write
        task stackow;
                input [1:0] addr;
                input signed [31:0] val;
                input w;
                input d;
                input u;
                begin
                        read = 0;
                        write = w;
                        push = d;
                        pop = u;
                        sel = addr;
                        in = val;

                        @(posedge clk);
                        @(negedge clk);

                        if (push) $write("push");
                        if (pop) $write("pop");
                        if (write) $write(" write %0d", val);
                        $display(":");
                        status();
                end
        endtask
endmodule

Execution:
$ iverilog -o stackmem stackmem_tb.v stackmem.v && ./stackmem
in: 0 out: 0 tos: 0 ptr: 0
stack: t:0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

push write 1823:
in: 1823 out: 0 tos: 1823 ptr: 1
stack: 0 t:1823 0 0 0 0 0 0 0 0 0 0 0 0 0 0

push write -16990:
in: -16990 out: 0 tos: -16990 ptr: 2
stack: 0 1823 t:-16990 0 0 0 0 0 0 0 0 0 0 0 0 0

push write 0:
in: 0 out: 0 tos: 0 ptr: 3
stack: 0 1823 -16990 t:0 0 0 0 0 0 0 0 0 0 0 0 0

push write 400:
in: 400 out: 0 tos: 400 ptr: 4
stack: 0 1823 -16990 0 t:400 0 0 0 0 0 0 0 0 0 0 0

read reg #0:
in: 0 out: 400 tos: 400 ptr: 4
stack: 0 1823 -16990 0 t:400 0 0 0 0 0 0 0 0 0 0 0

read reg #1:
in: 0 out: 0 tos: 400 ptr: 4
stack: 0 1823 -16990 0 t:400 0 0 0 0 0 0 0 0 0 0 0

read reg #2:
in: 0 out: -16990 tos: 400 ptr: 4
stack: 0 1823 -16990 0 t:400 0 0 0 0 0 0 0 0 0 0 0

read reg #3:
in: 0 out: 1823 tos: 400 ptr: 4
stack: 0 1823 -16990 0 t:400 0 0 0 0 0 0 0 0 0 0 0

pop:
in: 0 out: 1823 tos: 0 ptr: 3
stack: 0 1823 -16990 t:0 400 0 0 0 0 0 0 0 0 0 0 0

pop:
in: 0 out: 1823 tos: -16990 ptr: 2
stack: 0 1823 t:-16990 0 400 0 0 0 0 0 0 0 0 0 0 0

pop:
in: 0 out: 1823 tos: 1823 ptr: 1
stack: 0 t:1823 -16990 0 400 0 0 0 0 0 0 0 0 0 0 0

pop:
in: 0 out: 1823 tos: 0 ptr: 0
stack: t:0 1823 -16990 0 400 0 0 0 0 0 0 0 0 0 0 0

One more thing...

I updated the code in the ALU post. The ALU now produces overflow and carry flags properly.


SEP 6 - PIPELINE

I recently started to write the stack module, but I got sidetracked by the complexity of the instruction cycle, so let's look at that instead.

Consider this diagram:

FRISC 3 Pipeline
ϕA ϕB
Instruction F F R AW
Instruction F F R AW
Load/Store
Instruction
F F R AW M M
Instruction F F R AW
Instruction F F R AW
Cycle 1 2 3 4 5 6 7

ϕA and ϕB represent the two instruction phases. "F" is fetch, "R" is register read, "A" is ALU operation, and "W" is register write.

A fetch takes one cycle, and instruction execution takes one cycle. During the first half of the execution cycle, operand registers are read. During the second half of the execution cycle, the ALU operates and the result is written back to a register. Fetch and execution cycles happen concurrently, so the next instruction is fetched while the current instruction executes.

In the case of a load/store instruction an extra cycle is needed for accessing system memory, because the memory bus would otherwise be in use by the instruction fetch logic.

This is a reasonable design for a 1980s ASIC, but for my purposes I really don't like how different work is done on each phase of the clock. Having some logic sensitive to the positive edge of the clock and some sensitive to the negative edge is confusing and introduces tricky timing and routing issues on an FPGA.

What I'd rather do is split up every instruction into four cycles, like this:

Four/Six Cycle Instructions
Instruction F F R AW
Instruction F F R AW
Load/Store
Instruction
F F R AW M M
Instruction F F R AW
Instruction F F R AW
Cycle 1 2 3 4 5 6 7 8 9 10 11 12 13 14

Now, instead of having different instruction phases, we just have normal cycles. Instruction fetch takes the first two cycles. Registers are read in the third cycle. During the fourth cycle, the ALU operates and registers are written. Everything is sensitive to the same edge of the clock, but now the CPU is doing half the work in each cycle, so we can double the clock rate to keep up.

But wait... There's still that "AW" cycle, where the ALU operates and registers are written, which seems a little lopsided. That's a lot of work to do in one cycle. I could split that cycle in two, like this:

Six/Nine Cycle Instructions
Instruction F F F R A W
Instruction F F F R A W
Load/Store
Instruction
F F F R A W M M M
Instruction F F F R A W
Instruction F F F R A W
Cycle 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21

Now we need a clock that's six times faster than the original clock to do the same amount of work as the original processor, but that shouldn't be a problem. In fact, with instruction execution split into three equal parts, it may be possible to raise the clock speed even higher.


SEP 3 - ALU

Today I wrote a mostly-complete description of the ALU module. In doing so, I realized that the FRISC documentation I have is lacking in detail. For instance, I'd like to know what is supposed to happen if an ALU instruction specifies a logical operation and requests the flag register to be updated with arithmetic status information. Is this illegal, and if so, how are illegal instructions handled? Or is it just nonsensical? Should the flag retain the last sensible value? In my implementation, I've chosen the easy solution: if you ask for garbage, you'll get garbage. No exceptions.

File: alu.v
module alu (
	input wire signed [31:0] a, b, // operands
	input wire signed [6:0] op, // opcode
        input wire cin, // carry in
	output reg signed [31:0] y, // result
        output reg c, // carry flag
        output wire v, // overflow flag
        output wire z // zero flag
);

        assign z = ~|y;
        assign v = y[31] ^ c;
	
	always @* begin
		case (op[6:0])
                        7'h15: // 0001 0101 - A nand B
                        begin
                                y = !(a & b);
                                c = 1'bx;
                        end
                        7'h17: // 0001 0111 - A or not(B)
                        begin
                                y = a | !b;
                                c = 1'bx;
                        end
                        7'h1D: // 0001 1101 - not(A) or B
                        begin
                                y = !a | b;
                                c = 1'bx;
                        end
                        7'h1F: // 0001 1111 - A or B
                        begin
                                y = a | b;
                                c = 1'bx;
                        end
                        7'h20: // 0010 0000 - 0
                        begin
                                y = 32'd0;
                                c = 1'bx;
                        end
                        7'h21: // 0010 0001 - not(A)
                        begin
                                y = !a;
                                c = 1'bx;
                        end
                        7'h22: // 0010 0010 - -1
                        begin
                                y = -32'd1;
                                c = 1'bx;
                        end
                        7'h23: // 0010 0011 - A
                        begin
                                y = a;
                                c = 1'bx;
                        end
                        7'h24: // 0010 0100 - not(B)
                        begin
                                y = !b;
                                c = 1'bx;
                        end
                        7'h2C: // 0010 1100 - B
                        begin
                                y = b;
                                c = 1'bx;
                        end
                        7'h2F: // 0010 1111 - A xor B
                        begin
                                y = a ^ b;
                                c = 1'bx;
                        end
                        7'h41: // 0100 0001 - not(A) + Cin
                                {c, y} = !a + $signed({1'b0, cin});
                        7'h43: // 0100 0011 - A + Cin
                                {c, y} = a + $signed({1'b0, cin});
                        7'h44: // 0100 0100 - not(B) + Cin
                                {c, y} = !b + $signed({1'b0, cin});
                        7'h45: // 0100 0101 - not(A) + not(B) + Cin
                                {c, y} = !a + !b + $signed({1'b0, cin});
                        7'h46: // 0100 0110 - not(B) - not(Cin)
                                {c, y} = !b - $signed({1'b0, !cin});
                        7'h47: // 0100 0111 - A - B - not(Cin)
                                {c, y} = a - b - $signed({1'b0, !cin});
                        7'h49: // 0100 1001 - not(A) - not(Cin)
                                {c, y} = !a - $signed({1'b0, !cin});
                        7'h4B: // 0100 1011 - A - not(Cin)
                                {c, y} = a - $signed({1'b0, !cin});
                        7'h4C: // 0100 1100 - B + Cin
                                {c, y} = b + $signed({1'b0, cin});
                        7'h4D: // 0100 1101 - B - A - not(Cin)
                                {c, y} = b - a - $signed({1'b0, !cin});
                        7'h4E: // 0100 1110 - B - not(Cin)
                                {c, y} = b - $signed({1'b0, !cin});
                        7'h4F: // 0100 1111 - A + B + Cin
                                {c, y} = a + b + $signed({1'b0, cin});
                        7'h55: // 0101 0101 - A and B
                        begin
                                y = a & b;
                                c = 1'bx;
                        end
                        7'h57: // 0101 0111 - not(A) and B
                        begin
                                y = !a & b;
                                c = 1'bx;
                        end
                        7'h5D: // 0101 1101 - A and not(B)
                        begin
                                y = a & !b;
                                c = 1'bx;
                        end
                        7'h5F: // 0101 1111 - A nor B
                        begin
                                y = !(a | b);
                                c = 1'bx;
                        end
                        7'h6F: // 0110 1111 - A xnor B
                        begin
                                y = a ~^ b;
                                c = 1'bx;
                        end
			default:
                        begin
                                y = 32'bx;
                                c = 1'bx;
                        end
		endcase
	end
endmodule

I wrote a testbench for the ALU, but it really needs to be fleshed out. At least it shows that things are working:

File: alu_tb.v
module alu_tb;
        reg signed [31:0] a, b;
        reg [6:0] op;
        reg cin;
        wire signed [31:0] y;
        wire cout;
        wire v;
        wire z;

        alu a1 (a, b, op, cin, y, cout, v, z);

        initial begin
                $display("     op cin    a         b",
                         "       co    y     v z");
                $monitor("%4t %h: %b, %h, %h => %b %h %b %b",
                        $time, op, cin, a, b, cout, y, v, z);

                a = 0;
                b = 0;
                cin = 0;
                op = 7'h1F; // or
                #10

                a = 32'hF0A30000;
                b = 32'h000AEEB1;
                cin = 0;
                op = 7'h1F; // or
                #10

                a = 32'h01;
                b = 32'h01;
                cin = 0;
                op = 7'h4F; // A + B + Cin
                #10

                a = -32'd1;
                b = -32'd1;
                cin = 0;
                op = 7'h4F; // A + B + Cin
                #10

                a = -32'd1;
                b = -32'd1;
                cin = 0;
                op = 7'h4F; // A + B + Cin
                #10

                a = 32'h7FFFFFFF;
                b = 32'h1;
                cin = 0;
                op = 7'h4F; // A + B + Cin
                #10

                a = 32'd50;
                b = 32'd25;
                cin = 0;
                op = 7'h47; // A - B - !Cin
                #10

                a = 32'h80000000;
                b = 32'd1;
                cin = 0;
                op = 7'h47; // A - B - !Cin
                #10

                $finish;
         end
endmodule

Execution:
$ iverilog -o alu alu_tb.v alu.v && ./alu
     op cin    a         b       co    y     v z
   0 1f: 0, 00000000, 00000000 => x 00000000 x 1
  10 1f: 0, f0a30000, 000aeeb1 => x f0abeeb1 x 0
  20 4f: 0, 00000001, 00000001 => 0 00000002 0 0
  30 4f: 0, ffffffff, ffffffff => 1 fffffffe 0 0
  50 4f: 0, 7fffffff, 00000001 => 0 80000000 1 0
  60 47: 0, 00000032, 00000019 => 0 00000018 0 0
  70 47: 0, 80000000, 00000001 => 1 7ffffffe 1 0

Hooray.


SEP 1 - INTRODUCTION

I've always liked the Forth programming language. It's expressive and high-level while remaining easy to implement and very light on resources, and the bottom-up approach to building Forth programs is great for anyone who likes to build things from scratch: sketch out your whole design, then start your implementation by designing and testing the lowest-level words, and build up from there. Best of all, Forth systems are interactive and (usually) hosted on their target, so there's no compile-link-run cycle and there's no difference between your development system and your target system.

Because of these strengths, Forth was a good fit for the resource-constrained computers of its era, and it found a niche as a language and environment for embedded systems in astronomy and space applications. The special requirements of those applications created a demand for low-power, rad-hard processors that could execute Forth programs quickly, and the result was a variety of Stack Computers. These machines closely matched Forth's stack-based model of computing, and they were optimized to execute Forth's branch-heavy code with minimal overhead.

Sounds pretty neat, where do I get a stack processor?

Well... I once saw one pop up on eBay. I think you could probably convince Harris Intersil Renesas to sell you a batch of RTX2010s, but it's not something you can sample. ESA isn't using theirs anymore, but its location makes it difficult to salvage. I probably should have jumped on the eBay acution when I saw it.

There are other people who find stack machines interesting, so there are a number of modern FPGA-based designs to choose from, perhaps the most popular of which is James Bowman's J1 Forth CPU. However, I think it'd be cool to make my own, and I have my eye on a particular design...

The Plan

For the 2018/09 RetroChallenge, I will be working on an HDL re-implementation (for simulation/emulation and FPGA synthesis) of the John Hopkins University/Applied Physics Laboratory FRISC 3, a 32-bit stack processor that can execute most Forth primitives in a single cycle.

References


Valid HTML 4.01 Strict