YASEP news

Thursday 1 January 2009

Barrel Shifter : SHL16 ready

Hello and Happy New Year Everybody !

I took some time to work on the next major building block of the YASEP16 execution unit : the shift/rotate unit is now ready in 16-bit flavour.

I am concentrating on YASEP16 for now because it is smaller and marginally faster, and consumes less bandwidth. It fits easily in the A3P250 and its 6K 3-input tiles, though I don't know yet how many tiles will be needed in the end.

SHL_16 uses about 220 tiles, and Actel's place&route estimates that the pipelined version of the unit runs at 140MHz. This is slightly faster and smaller than ASU_ROP2, which performs Add/Sub and boolean operations (115 MHz and about 350 tiles). The overall ALU (ASU_ROP2 + SHL + IE) is going to take roughly 700 tiles, or 1/8th of the A3P250's surface. The speed looks satisfactory, as I intend to clock the thing at 96MHz on the ACME boards (64MHz * 1.5 with the PLL).

Overall, the following operations are ready for the 16-bit flavor :

  • ASU : ADD, SUB and compares as side effects.
  • ROP2 : AND/OR/XOR/NAND/NOR/XNOR/ANDN/ORN as well as comparison for equality (XOR followed by an OR reduction tree)
  • SHL : SHR/SHL/ROR/ROL/SAR
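
As an illustration only, here is a small behavioural model in Python of the five shift/rotate operations and of the equality test listed above. It is a sketch and not the YASEP16 netlist: the function names `shift16` and `equal16` are hypothetical, and taking the shift amount modulo 16 is an assumption that the post does not confirm.

```python
MASK16 = 0xFFFF

def shift16(op, value, amount):
    """Behavioural model of a 16-bit shift/rotate (not the gate-level unit)."""
    value &= MASK16
    n = amount & 15                      # assumption: amount taken modulo 16
    if op == "SHL":                      # logical shift left
        return (value << n) & MASK16
    if op == "SHR":                      # logical shift right, zero fill
        return value >> n
    if op == "SAR":                      # arithmetic shift right, sign fill
        fill = (MASK16 << (16 - n)) & MASK16 if value & 0x8000 else 0
        return (value >> n) | fill
    if op == "ROL":                      # rotate left
        return ((value << n) | (value >> (16 - n))) & MASK16
    if op == "ROR":                      # rotate right
        return ((value >> n) | (value << (16 - n))) & MASK16
    raise ValueError("unknown operation: " + op)

def equal16(a, b):
    """Equality as described above: XOR, then OR-reduce the 16 result bits."""
    diff = (a ^ b) & MASK16
    any_bit = 0
    for i in range(16):                  # a real tree would do this in 4 levels
        any_bit |= (diff >> i) & 1
    return any_bit == 0
```

For example, `shift16("SAR", 0x8001, 1)` keeps the sign bit set while `shift16("SHR", 0x8001, 1)` clears it.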

The next part to be developed is the IE (Insert/Extract) unit, for the loads and stores of bytes into a half-word. Stay tuned...

Note : some P&R runs give slightly higher working frequencies, but I reserve 15 or 20% of margin, since I expect that putting all the units together will need even more MUX2 all over the place, longer wires etc., resulting in slower operation. Furthermore, this is only YASEP16 so far, and the 32-bit flavor will double the design's size...

Thursday 7 August 2008

106MHz !

For a while, I was happy with the idea that YASEP would work with a standard SDRAM chip running at 133 MHz. So the core would run at 66MHz and a 16-bit datapath would provide 32 bits in 2 cycles. Good fit. The first synthesis attempts for the ADD/SUB unit (with a standard grade A3P250) gave something like 60 to 70MHz with a plain dumb 33-bit add/sub function (32 bits of result and one carry output). But I was not satisfied.
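
To make the "33-bit add/sub function" concrete, here is a minimal behavioural sketch in Python: 32 bits of result plus one carry output. The name `addsub33` is hypothetical, and the carry convention for subtraction (carry set when there is no borrow, as in a plain two's complement adder) is an assumption, not something stated above.

```python
MASK32 = 0xFFFFFFFF

def addsub33(a, b, subtract):
    """Return (32-bit result, carry out) for a + b or a - b."""
    a &= MASK32
    b &= MASK32
    if subtract:
        # Two's complement: a - b == a + (~b) + 1
        total = a + (b ^ MASK32) + 1
    else:
        total = a + b
    # Bit 32 of the 33-bit sum is the carry output.
    return total & MASK32, (total >> 32) & 1
```

With this convention, `addsub33(3, 5, True)` wraps around to `0xFFFFFFFE` with the carry cleared, which signals a borrow.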

I have recently found Synchronous SRAM chips that run at 100 and 200MHz, with 18-bit and 36-bit datapaths, in capacities from 128KB to 2MB. That made me think a lot : 100MHz would be better than 66MHz or 50, obviously. However, achieving a 50% speed increase is FAR from easy. I have been busy on this matter since the end of June. More than one month of dumb, repetitive, error-prone work !

First, I needed an Add/Sub unit that I could control completely. I have not found anything near that, and the "default" add/sub created by Synplify was always faster by a significant margin. So... I have analysed this add/sub unit, gate by gate. More than 300 gates were transcribed by hand from the schematic output of the Actel software !

Another constraint is that the adder MUST be portable and easily modified (*sigh*). So using the VHDL output of Synplify was not possible because it uses Actel-specific mapped instances. I decided that "plain text" was better, so that 1) I could modify the netlist more easily 2) another FPGA with a proper synthesizer would not be stuck with 3-input gates (in a world where most FPGAs use 4-input LUTs). As a result, after about one month of efforts, I finally got a big file full of "NetA <= NetB xor (NetC or NetD)"-like lines.

Well, nobody is perfect and in the 300+ gates written by hand, more than 10 errors were found. The first ones were spotted by a full re-check of the whole schematic. Painful once again, and some naughty errors were still there. I finally found a method to locate the probable location of an error, using the synthesizer as a "formal verification tool" (comparison against a working add/sub) or as an "oracle", and clever "bit tickling" techniques similar to what a cryptographer would do with an unknown "black box". After about a week, I finally got my dear, long awaited optimised add/sub netlist with a depth of 9 logic layers.
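
The idea of comparing a hand-written netlist against a working reference can be shown with a toy simulation. This is only an analogy in Python, not the synthesizer-based method described above: the 4-bit ripple-carry adder, the injected fault and the names `adder4_gates` and `differing_bits` are all made up for the example.

```python
def adder4_gates(a, b, faulty=False):
    """4-bit ripple-carry adder written gate by gate, netlist style."""
    carry = 0
    result = 0
    for i in range(4):
        ai = (a >> i) & 1
        bi = (b >> i) & 1
        p = ai ^ bi                  # propagate
        g = ai & bi                  # generate
        if faulty and i == 2:
            g = ai | bi              # injected "transcription error": OR instead of AND
        s = p ^ carry                # sum bit
        carry = g | (p & carry)      # carry into the next bit
        result |= s << i
    return result | (carry << 4)     # bit 4 is the carry output

def differing_bits(candidate, reference, width=4):
    """Exhaustively compare two adders; return the output bits that ever disagree."""
    bad = set()
    for a in range(1 << width):
        for b in range(1 << width):
            diff = candidate(a, b) ^ reference(a, b)
            bad |= {i for i in range(width + 1) if (diff >> i) & 1}
    return bad

def reference(a, b):
    return a + b                     # the known-good "working add/sub"
```

Calling `differing_bits(lambda a, b: adder4_gates(a, b, faulty=True), reference)` reports only bits 3 and 4, so the lowest wrong bit points at the carry logic just below it. The same narrowing-down applies to a larger netlist, with many more gates to suspect.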

But it was not over ! A lot of cleanup and preparations were necessary, in order to prepare the next step. A lot of logic simplifications were found, possibly breaking some clever synthesis techniques, but also curing some of Synplicity's weaknesses. "Bubble-pushing" allowed a clear (mental) view of the netlist and some critical datapaths were solved through gate duplications and other adaptations. The maximum logic depth was reduced to 8 layers and the propagation delays were homogenized. After adding a first pipeline gate (just after the first logic layer), P&R said that I could run the add/sub unit at around 77MHz.

Now, that's promising but insufficient. Reaching 100MHz requires a pipeline gate after the 6th logic layer. So there are 2 layers of gates after the pipeline barrier, and some room for muxes and Setup&Hold for writing the result to the registers. After some more editing efforts, Synplicity announced an estimated 101MHz, and P&R said 106MHz ! I had set the bar at 110MHz but 6MHz is enough margin for me. I mean : I can safely run at 100MHz.
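
A back-of-the-envelope check of these numbers, written as a short Python sketch. It assumes that every logic layer costs the same delay and it ignores register overhead and routing, which is a strong simplification; `stage_depths` and `ns_per_layer` are hypothetical helper names.

```python
def stage_depths(total_layers, cuts):
    """Split a chain of logic layers; a cut at k is a register after layer k."""
    edges = [0] + sorted(cuts) + [total_layers]
    return [hi - lo for lo, hi in zip(edges, edges[1:])]

def ns_per_layer(freq_mhz, layers):
    """Clock period in ns divided by the depth of the longest stage."""
    return 1000.0 / freq_mhz / layers

# 8 layers with one register after layer 1: the longest stage has 7 layers.
one_cut = stage_depths(8, [1])
# Adding a register after layer 6: the longest stage drops to 5 layers.
two_cuts = stage_depths(8, [1, 6])

slow = ns_per_layer(77.0, max(one_cut))     # 77 MHz reported by P&R
fast = ns_per_layer(106.0, max(two_cuts))   # 106 MHz reported by P&R
```

Under these assumptions both cases come out a little under 1.9 ns per layer, so the two P&R figures are consistent with each other.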

Conclusion : YASEP can be "superpipelined" when needed (the added pipeline gates take room and draw power, which is not always necessary) and using a decent SiO2 process will give higher frequencies ! (the A3P250 is in 130nm and even at 0.35µm, a pure ASIC will be faster). A 5-layer-deep pipeline stage with 3-input gates and a mean fanout of 3 is not extraordinary today, but it is challenging (the Cray3 had 4 logic layers). Yet, I'm still confident that the pipeline depth will not explode, 4 stages is still possible. What bothers me is that clock gating was messed up by Synplicity, so the power draw is going to be a concern.

Another important thing is that the YASEP architecture is not changed. I realised that I could safely bite into the margin of the last pipeline stage ("Write back to the register") without affecting the execution sequence. A bypass network would be possible but is not necessary (too much control logic would be needed).

Anyway, with a 100MHz rating, one can use 12, 24, 24.576, 25 or 48MHz quartz crystals/oscillators to feed the PLL, and run from 96 to 100MHz internally. With a 4-stage pipeline, 4 threads can execute simultaneously (at 24-25MHz each) providing a peak 100MOPS performance. Imagine what this would yield on the latest 40nm FPGAs or ASICs !
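
The arithmetic behind these figures can be sketched in a few lines of Python. The PLL multipliers below are assumptions chosen so that each reference lands in the 96 to 100MHz range, and the strict round-robin issue order is one possible reading of "4 threads execute simultaneously", not a description of the actual YASEP sequencer.

```python
# Reference frequency in MHz -> assumed PLL multiplier
PLL_OPTIONS = {12.0: 8, 24.0: 4, 24.576: 4, 25.0: 4, 48.0: 2}

def core_mhz(ref_mhz):
    """Internal clock obtained from a given quartz/oscillator frequency."""
    return ref_mhz * PLL_OPTIONS[ref_mhz]

def per_thread_mhz(core, threads=4):
    """Each thread gets one issue slot every `threads` cycles."""
    return core / threads

def issue_order(cycles, threads=4):
    """Thread number that issues an instruction on each clock cycle."""
    return [cycle % threads for cycle in range(cycles)]

for ref in sorted(PLL_OPTIONS):
    core = core_mhz(ref)
    print(ref, "MHz ->", core, "MHz core,", per_thread_mhz(core), "MHz per thread")
```

With the 25MHz reference this gives a 100MHz core and 25MHz per thread, and the four threads together issue one operation per cycle, hence the peak of 100MOPS.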

However, memory bandwidth and latencies are going to be the main bottlenecks again. But I think that I have found a solution...