For a while, I was happy with the idea that YASEP would work with a standard
SDRAM chip running at 133MHz. So the core would run at 66MHz and a 16-bit
datapath would provide 32 bits in 2 cycles. Good fit. The first synthesis
attempts for the ADD/SUB unit (with a standard grade A3P250) gave something
like 60 to 70MHz with a plain dumb 33-bit add/sub function (32 bits of result
and one carry output). But I was not satisfied.
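To make the "33-bit add/sub function" concrete, here is a minimal behavioural sketch in Python (not the VHDL that was actually synthesized). The function name `addsub33` and the carry convention for subtraction (carry out = 1 means "no borrow") are assumptions made for illustration only.

```python
MASK32 = (1 << 32) - 1

def addsub33(a, b, sub):
    """Behavioural model: returns (32-bit result, carry out).

    Assumption: subtraction is done as a + not(b) + 1, so the carry
    output is 1 when no borrow occurs.
    """
    a &= MASK32
    b &= MASK32
    if sub:
        b ^= MASK32                      # one's complement of b
    total = a + b + (1 if sub else 0)    # carry-in of 1 completes the two's complement
    return total & MASK32, (total >> 32) & 1
```

A model like this is only useful as the "known good" side of a comparison; it says nothing about gate count or logic depth.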
I have recently found Synchronous SRAM chips that run at 100 and 200MHz,
with 18-bit and 36-bit datapaths, in capacities from 128KB to 2MB. That made me
think a lot: 100MHz would be better than 66MHz or 50MHz, obviously. However,
achieving a 50% speed increase is FAR from easy. I have been busy with this
matter since the end of June. More than one month of dumb, repetitive,
error-prone work!
First, I needed an Add/Sub unit that I could control completely. I did not
find any existing design that came close: the "default" add/sub created by
Synplify was always faster by a significant margin. So... I analysed this
add/sub unit, gate by gate. More than 300 gates were transcribed by hand from
the schematic output of the Actel software!
Another constraint is that the adder MUST be portable and easily modified
(*sigh*). So using the VHDL output of Synplify was not possible because it uses
Actel-specific mapped instances. I decided that "plain text" was better, so
that 1) I could modify the netlist more easily and 2) another FPGA with a proper
synthesizer would not be stuck with 3-input gates (in a world where most FPGAs
use 4-input LUTs). As a result, after about one month of effort, I finally got
a big file full of "NetA <= NetB xor (NetC or NetD)"-like lines.
Well, nobody is perfect and in the 300+ gates written by hand, more than 10
errors were found. The first ones were spotted by a full re-check of the whole
schematic. Painful once again, and some naughty errors were still there. I
finally found a method to locate the probable location of an error, using the
synthesizer as a "formal verification tool" (comparison against a working
add/sub) or as an "oracle", and clever "bit tickling" techniques similar to
what a cryptographer would do with an unknown "black box". After about a week,
I finally got my dear, long awaited optimised add/sub netlist with a depth of 9
logic layers.
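The comparison idea can be sketched in software. The following Python example is only an illustration under stated assumptions: it uses plain simulation rather than the synthesizer, an 8-bit ripple adder instead of the real netlist, and a deliberately injected wrong gate. All names (`suspect_add`, `locate_mismatch`, `tickle`) are hypothetical.

```python
import random

WIDTH = 8  # small width for illustration; the real unit has 32 bits + carry

def reference_add(a, b):
    """Known-good model: WIDTH result bits plus carry out."""
    return (a + b) & ((1 << (WIDTH + 1)) - 1)

def suspect_add(a, b):
    """Gate-level ripple adder with one deliberately wrong gate at bit 5."""
    carry, out = 0, 0
    for i in range(WIDTH):
        x, y = (a >> i) & 1, (b >> i) & 1
        # Injected transcription error: OR instead of XOR on bit 5 only.
        s = ((x | y) ^ carry) if i == 5 else (x ^ y ^ carry)
        carry = (x & y) | (carry & (x ^ y))
        out |= s << i
    return out | (carry << WIDTH)

def locate_mismatch(dut, ref, trials=2000, seed=1):
    """Count, per output bit, how often dut and ref disagree on random inputs."""
    rng = random.Random(seed)
    counts = [0] * (WIDTH + 1)
    for _ in range(trials):
        a, b = rng.getrandbits(WIDTH), rng.getrandbits(WIDTH)
        diff = dut(a, b) ^ ref(a, b)
        for bit in range(WIDTH + 1):
            counts[bit] += (diff >> bit) & 1
    return counts

def tickle(dut, ref, a, b):
    """Flip each bit of a in turn; list the flips that expose a mismatch."""
    return [i for i in range(WIDTH)
            if dut(a ^ (1 << i), b) != ref(a ^ (1 << i), b)]
```

The output bits with non-zero mismatch counts point to the faulty cone of logic, and `tickle` then narrows down which inputs trigger it. On a real netlist the errors would of course not be this well behaved.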
But it was not over! A lot of cleanup and preparation was necessary before
the next step. Many logic simplifications were found,
possibly breaking some clever synthesis techniques, but also curing some of
Synplicity's weaknesses. "Bubble-pushing" allowed a clear (mental) view of the
netlist and some critical datapaths were fixed through gate duplication and
other adaptations. The maximum logic depth was reduced to 8 layers and the
propagation delays were homogenized. After adding a first pipeline gate (just
after the first logic layer), P&R said that I could run the add/sub unit at
around 77MHz.
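Counting logic layers in a plain-text netlist can be automated. Here is a rough Python sketch, assuming every "Net <= expression" line maps to exactly one gate layer and that the netlist is purely combinational (no loops). The net names in the usage example are made up.

```python
import re
from functools import lru_cache

KEYWORDS = {"and", "or", "xor", "not", "nand", "nor", "xnor"}

def parse_netlist(text):
    """Map each assigned net to the list of nets its expression reads."""
    deps = {}
    for line in text.splitlines():
        if "<=" not in line:
            continue
        target, expr = line.split("<=", 1)
        names = re.findall(r"[A-Za-z_]\w*", expr)
        deps[target.strip()] = [n for n in names if n.lower() not in KEYWORDS]
    return deps

def logic_depth(deps):
    """Longest chain of assignments from a primary input to each net."""
    @lru_cache(maxsize=None)
    def depth(net):
        if net not in deps:          # never assigned: primary input
            return 0
        return 1 + max((depth(d) for d in deps[net]), default=0)
    return {net: depth(net) for net in deps}

# Hypothetical 2-bit fragment, written in the same style as the netlist file.
EXAMPLE = """
P0 <= A0 xor B0
G0 <= A0 and B0
C1 <= G0 or (P0 and Cin)
S1 <= A1 xor B1 xor C1
"""
```

Calling `logic_depth(parse_netlist(EXAMPLE))` gives the depth of each net, which helps spot the paths worth shortening or the layer where a pipeline gate could go.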
Now, that's promising but insufficient. Reaching 100MHz requires a pipeline
gate after the 6th logic layer. So there are 2 layers of gates after the
pipeline barrier, and some room for muxes and Setup&Hold for writing the
result to the registers. After some more editing efforts, Synplicity announced
an estimated 101MHz, and P&R said 106MHz! I had set the bar at 110MHz but
6MHz is enough margin for me. I mean: I can safely run at 100MHz.
Conclusion: YASEP can be "superpipelined" when needed (the added pipeline
gates take room and draw power, which is not always necessary) and using a
decent SiO2 process will give higher frequencies! (The A3P250 is in 130nm and
even at 0.35µm, a pure ASIC will be faster.) A 5-layer-deep pipeline stage with
3-input gates and a mean fanout of 3 is not extraordinary today, but it's
challenging (the Cray-3 had 4 logic layers). Yet, I'm still confident that the
pipeline depth will not explode: 4 stages is still possible. What bothers me is
that clock gating was messed up by Synplicity, so the power draw is going to be
a concern.
Another important thing is that the YASEP architecture is not changed. I
realised that I could safely bite into the margin of the last pipeline stage
("Write back to the register") without affecting the execution sequence. A
bypass network could be possible but is not necessary (too much control logic
would be needed).
Anyway, with a 100MHz rating, one can use 12, 24, 24.576, 25 or 48MHz
crystals/oscillators to feed the PLL, and run from 96 to 100MHz internally. With
a 4-stage pipeline, 4 threads can execute simultaneously (at 24-25MHz each)
providing a peak 100MOPS performance. Imagine what this would yield on the
latest 40nm FPGAs or ASICs!
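The clock arithmetic above can be checked with a few lines of Python. This is just a sketch: it assumes integer-only PLL multipliers (a real PLL may offer finer ratios), and the function name `pll_options` is made up.

```python
def pll_options(refs_mhz, fmax_mhz=100.0, threads=4):
    """For each reference clock, return (ref, multiplier, core MHz, MHz per thread).

    Assumption: the PLL multiplies by an integer, and we pick the largest
    multiplier that keeps the core clock at or below fmax_mhz.
    """
    rows = []
    for ref in refs_mhz:
        mult = int(fmax_mhz // ref)
        core = ref * mult
        rows.append((ref, mult, core, core / threads))
    return rows

for ref, mult, core, per_thread in pll_options([12, 24, 24.576, 25, 48]):
    print(f"{ref:>7} MHz x{mult} -> {core:.3f} MHz core, {per_thread:.3f} MHz per thread")
```

With these assumptions every listed reference lands between 96 and 100MHz, and each of the 4 threads gets between 24 and 25MHz.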
However, memory bandwidth and latencies are going to be the main bottlenecks
again. But I think that I have found a solution...