106MHz !
By whygee on Thursday 7 August 2008, 08:06 - VHDL - Permalink
For a while, I was happy with the idea that YASEP would work with a standard SDRAM chip running at 133 MHz. So the core would run at 66MHz and a 16-bit datapath would provide 32 bits in 2 cycles. Good fit. The first synthesis attempts for the ADD/SUB unit (with a standard grade A3P250) gave something like 60 to 70MHz with a plain dumb 33-bit add/sub function (32 bits of result and one carry output). But I was not satisfied.
I have recently found Synchronous SRAM chips that run at 100 and 200MHz, with 18-bit and 36-bit datapaths, in capacities from 128KB to 2MB. That made me think a lot : 100MHz would be better than 66MHz or 50, obviously. However, achieving 50% of speed increase is FAR from easy. I have been busy on this matter since the end of june. More than one month of dumb, repetitive, error-prone work !
First, I needed an Add/Sub unit that I could control completely. I have not found anything near that, and the "default" add/sub created by Synplify was always faster by a significant margin. So... I have analysed this add/sub unit, gate by gate. More than 300 gates were transcribed by hand from the schematic output of the Actel software !
Another constraint is that the adder MUST be portable and easily modified (*sigh*). So using the VHDL output of Synplify was not possible because it uses Actel-specific mapped instances. I decided that "plain text" was better, so that 1) I could modify the netlist more easily 2) another FPGA with a proper synthesizer would not be stuck to 3-input gates (in a world where most FPGA use 4-input LUTs). As a result, after about one month of efforts, I finally got a big file full of "NetA <= NetB xor (NetC or NetD)"-like lines.
Well, nobody is perfect and in the 300+ gates written by hand, more than 10 errors were found. The first ones were spotted by a full re-check of the whole schematic. Painful once again and some naugthy errors where still here. I finally found a method to locate the probable location of an error, using the synthesizer as a "formal verification tool" (comparison against a working add/sub) or as an "oracle", and clever "bit tickling" techniques similar to what a cryptographer would do with an unknown "black box". After about a week, I finally got my dear, long awaited optimised add/sub netlist with a depth of 9 logic layers.
But it was not over ! A lot of cleanup and preparations were necessary, in order to prepare the next step. A lot of logic simplifications were found, possibly breaking some clever synthesis techniques, but also curing some of Synplicity's weaknesses. "Bubble-pushing" allowed a clear (mental) view of the netlist and some critical datapaths were solved through gate duplications and other adaptations. The maximum logic depth was reduced to 8 layers and the propagation delays were homogenized. After adding a first pipeline gate (just after the first logic layer), P&R said that I could run the add/sub unit at around 77MHz.
Now, that's promising but unsufficient. Reaching 100MHz requires a pipeline gate after the 6th logic layer. So there are 2 layers of gates after the pipeline barrer, and some room for muxes and Setup&Hold for writing the result to the registers. After some more editing efforts, Synplicity announced an estimated 101MHz , and P&R said 106MHz ! I had set the bar at 110MHz but 6MHz is enough margin for me. I mean : I can safely run at 100MHz.
Conclusion : YASEP can be "superpipelined" when needed (the added pipeline gates take room and draw power, which is not always necessary) and using a decent SiO2 process will give higher frequencies ! (the A3P250 is in 130nm and even at 0,35u, a pure ASIC will be faster). A 5-layers deep pipeline stage with 3-input gates and a mean fanout of 3 is not extraordinary today, but it's a challenging (the Cray3 had 4 logic layers). Yet, I'm still confidend that the pipeline depth will not explode, 4 stages is still possible. What bothers me is that clock gating was messed up by Synplicity, so the power draw is going to be a concern.
Another important thing is that the YASEP architecture is not changed. I realised that I could safely bite in the margin of the last pipeline stage ("Write back to the register") without affecting the execution sequence. A bypass network could be possible but is not necessary (too much control logic would be needed).
Anyway, with a 100MHz rating, one can use 12, 24, 24.576, 25 or 48MHz quartzs/oscillators to feed the PLL, and run from 96 to 100MHz internally. With a 4-stage pipeline, 4 threads can execute simultaneously (at 24-25MHz each) providing a peak 100MOPS performance. Imagine what this would yield on the latest 40nm FPGAs or ASICs !
However, memory bandwidth and latencies are going to be the main bottlenecks again. But I think that I have found a solution...