YASEP news


Monday 22 September 2008

A suitable HW platform for CPU design ?

My order from Lextronic.fr has been delivered: I purchased a couple of "FOX VHDL" boards from http://www.lextronic.notebleue.com/P1792-platine-dextension-fox-vhdl.html

The FoxVHDL board (image courtesy of Acme Systems, Italia)

This small board has all I need for YASEP, and only that: an A3P250 FPGA and a couple of 256K×16 SRAMs with 12 ns access time. More information can be found at http://www.acmesystems.it/?id=120 and a new version of the board will appear soon (without the VGA connector, which takes a lot of space)! Even at 100 Euros each, with only 1MB of memory, this is an excellent board with a lot of potential!

The only problem: the SRAM has "only" 12 ns of access time, while YASEP aims at 100 MHz. So either I run at only 64 MHz (too bad) or I find 8 ns versions of the SRAM chips. I've put a lot of effort into the second option, without success so far. The 12 ns parts are so much easier to get!
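As a back-of-the-envelope check of the figures above (the I/O and setup overhead below is my assumption, picked so the numbers match the 64 MHz figure; it is not a measured value):

```python
# Rough SRAM-limited clock estimate: the cycle time must cover the
# SRAM access time plus FPGA I/O and register setup overhead.
# The ~3.6 ns overhead is an assumption, not a datasheet number.
def max_freq_mhz(access_ns, overhead_ns=3.6):
    """Highest clock (MHz) whose period covers access + overhead."""
    return 1000.0 / (access_ns + overhead_ns)

print(round(max_freq_mhz(12)))  # 12 ns parts -> about 64 MHz
print(round(max_freq_mhz(8)))   # 8 ns parts  -> about 86 MHz
```

So under this assumption, even the 8 ns parts would not reach 100 MHz in a single access cycle, which is why a custom board with synchronous SRAM looks attractive.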

I could try to play tricks with the pipeline gates (doing "wave pipelining" in the memory chip) but it's very risky. Well, it would be better to make my own board... with those crazy 200 MHz synchronous SRAM chips that I also found recently. But that will come later: it's better to use the Fox VHDL boards first, write the VHDL code, and only later make a custom board.

In the meantime, I should get accustomed to using my shiny new hot air rework station. BGA chips will soon be solderable :-)

Monday 18 August 2008

New register organisation

The architecture of YASEP is very unorthodox. It is a living experiment and evolves in many unexpected directions.

However, one known uncertainty has always been how to implement the instruction fetch mechanism. The memory queues have been a guideline, but no organisation has been tested and validated yet. Back when VSP emerged from the chaos of my brain, I wanted to use one of four queues to fetch instructions, and to indicate the current queue in the 2-bit CQ register. This idea was already implemented in the RCA 1802 processor (the 4-bit P register), but it adds some overhead (and YASEP's instruction stream was never meant to be ultracompact).

Funny : I find more and more common traits between YASEP and 1802 :-)

The CQ register (just like the COSMAC's P register) also slows down the core, since a whole cycle is needed to fetch the opcode from the queue. This goes against the idea of a pipelined processor: a pipeline is the implementation of a sequential principle, and sequence occurs a lot in an instruction flow.

However, the availability of several queues as potential pre-cooked jump destinations (address as well as corresponding data) is very interesting, so this remains in the YASEP architecture. A jump instruction with a direct immediate address remains possible, but with some (future and planned) architectures there is a risk of high execution latency.

I recently came to the conclusion that a compromise between the completely weird and the classical approaches is necessary.

So I keep the memory queues, but the first one is modified and assigned to the instruction pointer and a status register. I had sworn that I would never do that, but I'm forced to admit that, in a sense, and in the current situation (where no cache can support parallel memory accesses), something "looking like that" is necessary. And I'll do my best to avoid the inherent traps!

First, why do I need registers #0 and #1 to hold these values? In the currently planned first implementation, I can use a bank of 512 registers, or 32 banks of 16 registers. This means that context swapping can be very fast (1 major cycle), and I need to save a lot of information at once. If this information were stored in the SR space (as previously planned), more cycles would be needed to save/restore the "whole" context. So the best place to store this critical information is in the register set itself. I could have chosen to create another parallel register bank, but that would consume too much memory. The availability of the "Current/Next IP" is also very useful for computing addresses in position-independent code.
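The banked organisation boils down to trivial index arithmetic; here is a sketch (the function name is mine, this is not actual YASEP logic):

```python
# 512 physical registers seen as 32 banks of 16: a context switch
# only changes the bank number, so the whole 16-register context
# is swapped in one step instead of being saved word by word.
def phys_reg(bank, reg):
    """Physical index of architectural register `reg` in `bank`."""
    assert 0 <= bank < 32 and 0 <= reg < 16
    return (bank << 4) | reg

print(phys_reg(0, 15))   # last register of bank 0
print(phys_reg(31, 15))  # last physical register, index 511
```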

So the new register map is :

0h: IP (replaces A0)
1h: ST (replaces D0)
2h: A1 \ Q1
3h: D1 /
4h: A2 \ Q2
5h: D2 /
6h: A3 \ Q3
7h: D3 /
8h: A4 \ Q4
9h: D4 /
Ah: A5 \ Q5
Bh: D5 /
Ch: R0
Dh: R1
Eh: R2
Fh: R3
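The same map, written down as a small lookup (a hypothetical helper that mirrors the table above; it is not part of any toolchain):

```python
# YASEP register map after the reorganisation: Q1..Q5 are
# address/data pairs, registers Ch..Fh are general purpose.
REGISTER_MAP = {
    0x0: "IP",  0x1: "ST",   # replace A0/D0
    0x2: "A1",  0x3: "D1",   # Q1
    0x4: "A2",  0x5: "D2",   # Q2
    0x6: "A3",  0x7: "D3",   # Q3
    0x8: "A4",  0x9: "D4",   # Q4
    0xA: "A5",  0xB: "D5",   # Q5
    0xC: "R0",  0xD: "R1",
    0xE: "R2",  0xF: "R3",
}

def queue_of(reg):
    """Queue number (1..5) of a queue register, or None otherwise."""
    return (reg >> 1) if 0x2 <= reg <= 0xB else None
```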

Second: what does the Status Register contain? Of course, I avoid storing carry flags and such. But I can't avoid the auto-update bits of the 5 remaining queues. They use 2×5 bits, and 6 bits are still unassigned (for how long?). The two bits of each queue encode the following:

00 : no update
10 : post-incrementation
11 : post-decrementation

Two of the queues can implement a normal stack (LIFO), and two additional bits represent this ability. So Q4 and Q5 have the following properties in the Status Register:

bit N   : update on/off
bit N+1 : update up/down
bit N+2 : stack on/off (pre/post modification)
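The bit twiddling above can be sketched as follows. The concrete bit positions are my assumption (Q1's 2-bit field at bit 0, one field per queue), since the post only says "bit N" without fixing N:

```python
# Per-queue auto-update codes from the table above.
NO_UPDATE, POST_INC, POST_DEC = 0b00, 0b10, 0b11

def set_queue_mode(sr, queue, mode):
    """Write the 2-bit update code of queue 1..5 into the SR word."""
    shift = 2 * (queue - 1)          # assumed field position
    return (sr & ~(0b11 << shift)) | (mode << shift)

def queue_mode(sr, queue):
    """Read back the 2-bit update code of queue 1..5."""
    shift = 2 * (queue - 1)
    return (sr >> shift) & 0b11
```

The stack on/off bits of Q4 and Q5 would follow the same pattern, one extra bit per field.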

Third: these registers are not really real registers. They are "shadowed" registers, with a physical instance copied somewhere else. This is necessary because the register set can't have enough ports, and these two specific registers are critical and accessed every cycle. Their incorporation in the register map makes them easily remanent through context switches and IRQs, as well as easily alterable (without going through get/put instructions), but the register bank is updated only when these registers are accessed. Some new datapaths must be reserved for them.

sigh...

This means that most of the opcode map (the part with the jump instructions) must be redesigned.

re-sigh...

Thursday 7 August 2008

106MHz !

For a while, I was happy with the idea that YASEP would work with a standard SDRAM chip running at 133 MHz: the core would run at 66 MHz and a 16-bit datapath would provide 32 bits in 2 cycles. A good fit. The first synthesis attempts for the ADD/SUB unit (with a standard grade A3P250) gave something like 60 to 70 MHz with a plain dumb 33-bit add/sub function (32 bits of result and one carry output). But I was not satisfied.

I have recently found synchronous SRAM chips that run at 100 and 200 MHz, with 18-bit and 36-bit datapaths, in capacities from 128KB to 2MB. That made me think a lot: 100 MHz would obviously be better than 66 or 50 MHz. However, achieving a 50% speed increase is FAR from easy. I have been busy on this matter since the end of June. More than one month of dumb, repetitive, error-prone work!

First, I needed an add/sub unit that I could control completely. I have not found anything close, and the "default" add/sub created by Synplify was always faster by a significant margin. So... I analysed this add/sub unit, gate by gate. More than 300 gates were transcribed by hand from the schematic output of the Actel software!

Another constraint is that the adder MUST be portable and easily modifiable (*sigh*). Using the VHDL output of Synplify was not possible because it uses Actel-specific mapped instances. I decided that "plain text" was better, so that 1) I could modify the netlist more easily and 2) another FPGA with a proper synthesizer would not be stuck with 3-input gates (in a world where most FPGAs use 4-input LUTs). As a result, after about one month of effort, I finally got a big file full of "NetA <= NetB xor (NetC or NetD)"-like lines.

Well, nobody is perfect, and in the 300+ gates written by hand, more than 10 errors were found. The first ones were spotted by a full re-check of the whole schematic. Painful once again, and some naughty errors were still there. I finally found a method to locate the probable position of an error, using the synthesizer as a "formal verification tool" (comparison against a working add/sub) or as an "oracle", with clever "bit tickling" techniques similar to what a cryptographer would apply to an unknown "black box". After about a week, I finally got my dear, long-awaited optimised add/sub netlist, with a depth of 9 logic layers.

But it was not over! A lot of cleanup and preparation was necessary for the next step. A lot of logic simplifications were found, possibly breaking some clever synthesis techniques, but also curing some of Synplicity's weaknesses. "Bubble-pushing" allowed a clear (mental) view of the netlist, and some critical datapaths were relieved through gate duplication and other adaptations. The maximum logic depth was reduced to 8 layers and the propagation delays were homogenised. After adding a first pipeline gate (just after the first logic layer), P&R said that I could run the add/sub unit at around 77 MHz.

Now, that's promising but insufficient. Reaching 100 MHz requires a pipeline gate after the 6th logic layer. So there are 2 layers of gates after the pipeline barrier, and some room for muxes and setup & hold for writing the result to the registers. After some more editing effort, Synplicity announced an estimated 101 MHz, and P&R said 106 MHz! I had set the bar at 110 MHz, but 6 MHz is enough margin for me. I mean: I can safely run at 100 MHz.

Conclusion: YASEP can be "superpipelined" when needed (the added pipeline gates take room and draw power, which is not always necessary), and using a decent SiO2 process will give higher frequencies! (The A3P250 is a 130 nm part, and even at 0.35 µm, a pure ASIC would be faster.) A pipeline stage 5 logic layers deep, with 3-input gates and a mean fanout of 3, is not extraordinary today, but it's challenging (the Cray-3 had 4 logic layers). Yet I'm still confident that the pipeline depth will not explode; 4 stages is still possible. What bothers me is that clock gating was messed up by Synplicity, so the power draw is going to be a concern.

Another important thing is that the YASEP architecture is not changed. I realised that I could safely bite into the margin of the last pipeline stage ("write back to the register") without affecting the execution sequence. A bypass network would be possible but is not necessary (too much control logic would be needed).

Anyway, with a 100 MHz rating, one can use 12, 24, 24.576, 25 or 48 MHz crystals/oscillators to feed the PLL, and run from 96 to 100 MHz internally. With a 4-stage pipeline, 4 threads can execute simultaneously (at 24-25 MHz each), providing a peak performance of 100 MOPS. Imagine what this would yield on the latest 40 nm FPGAs or ASICs!
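A quick check of the crystal list above (the integer PLL multipliers are my assumption; I only verify that each crystal lands in the 96-100 MHz window):

```python
# For each candidate crystal, pick the integer PLL multiplier
# closest to a 100 MHz target and compute the resulting core clock.
def pll_output(crystal_mhz, target_mhz=100.0):
    """Core frequency with the nearest integer multiplier."""
    mult = round(target_mhz / crystal_mhz)
    return crystal_mhz * mult

for f in [12.0, 24.0, 24.576, 25.0, 48.0]:
    print(f, "MHz ->", pll_output(f), "MHz")
```

All five candidates indeed land between 96 and 100 MHz, matching the range quoted above.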

However, memory bandwidth and latencies are going to be the main bottlenecks again. But I think that I have found a solution...

Tuesday 5 August 2008

More flexibility and options for YASEP

After some concertation with me, myself and my other instances, we came to the conclusion that it would be almost costless to make a 16-bit version of the YASEP architecture. In fact, only a few modifications are necessary to adapt the current 32-bit core to 16-bit operation. I can even foresee a 16-bit "compatibility mode" where a 16-bit program/thread executes on a 32-bit core.

What is the difference? Essentially, besides the smaller registers (smaller numbers can be computed), the memory access method needs to be adapted. Pointers will be limited to 16 bits, so instead of "segments", I'll use 4KB pages (protected memory will be available as a logical extension). The 64KB addressable range is split into 16 pages, and each page's base address can be configured (on 4KB boundaries). A 16-bit core can then access up to 256MB of RAM (the limit of YASEP).
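The paging scheme above can be sketched as follows (the table layout and helper name are hypothetical; note that 16-bit physical page numbers are enough, since 256 MB / 4 KB = 65536 pages):

```python
# A 16-bit virtual address splits into a 4-bit page index and a
# 12-bit offset; each of the 16 table entries holds the physical
# 4 KB page number that the page is mapped to.
PAGE_BITS = 12
PAGE_SIZE = 1 << PAGE_BITS            # 4 KB

def translate(page_table, vaddr):
    """Map a 16-bit address through the 16-entry page table."""
    index = (vaddr >> PAGE_BITS) & 0xF
    offset = vaddr & (PAGE_SIZE - 1)
    return (page_table[index] << PAGE_BITS) | offset

# Example: map virtual page 3 onto physical page 0x1234.
table = [0] * 16
table[3] = 0x1234
print(hex(translate(table, 0x3ABC)))
```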

Such a small YASEP is suitable for smaller FPGAs and for cases where horsepower is even less necessary. For "efficient" implementations, the pipeline remains the same and several threads will be interleaved as before; the speed and code density will be similar. Remove the pipeline gates and you get a slower but smaller core for typical microcontroller applications.

The structures and the instruction set remain untouched. The SHH instruction will be useless (or not), but it's apparently the only exception. The same tools will be used to generate the executable bytestream, and the same source code could be used for 16-bit or 32-bit targets (to some extent).

The physical addresses and registers are the same too, so 16-bit and 32-bit "threads" can coexist/coexecute. The byte addressability works with the same principles (it's just implemented a bit differently).

It all looks promising and I'm updating my VHDL code now.

Another big modification will be the support for "short" immediate fields (Imm4) in place of the SRC (register number) field. I'll see about that later.
