YASEP news


Saturday 28 November 2015

Progress with the SHL unit

Work continues on the discrete YASEP, with frequent log posts on hackaday.io.

In one post, the SHL group's instructions have been reordered :

  • 0 : SHR
  • 1 : SHRO
  • 2 : ROR
  • 3 : SAR
  • 4 : SHL
  • 5 : SHLO
  • 6 : ROL
  • 7 : BSWAP
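For illustration, here is a quick software model of the reordered group (Python, not part of the YASEP tools ; the function and variable names are mine). The semantics of SHRO and SHLO ("shift & OR") are not detailed in this post, so the sketch leaves them out ; the other six follow the usual definitions.

```python
# Hypothetical software model of the reordered SHL group (YASEP32 width).
W = 32
MASK = (1 << W) - 1

def shl_group(op, val, n):
    val &= MASK
    n &= W - 1
    if op == 0:                        # SHR  : logical shift right
        return val >> n
    if op == 2:                        # ROR  : rotate right
        return ((val >> n) | (val << (W - n))) & MASK
    if op == 3:                        # SAR  : arithmetic shift right
        sign = -(val >> (W - 1))       # replicate the sign bit
        return (((sign << W) | val) >> n) & MASK
    if op == 4:                        # SHL  : logical shift left
        return (val << n) & MASK
    if op == 6:                        # ROL  : rotate left
        return ((val << n) | (val >> (W - n))) & MASK
    if op == 7:                        # BSWAP : reverse the byte order
        return int.from_bytes(val.to_bytes(W // 8, "big"), "little")
    raise NotImplementedError("SHRO/SHLO semantics not modeled here")
```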

Then the structure of the SHL unit has been refined and a detailed diagram is drawn:

There is still some room for extra features and the I/E unit (Insert/Extract) might benefit from the unused inputs of the last MUX4 layer :-)

Thursday 13 February 2014

Flag polarity

RISC does not like status flags, but they were determined to be a necessary evil for the YASEP. Yet something more important than scheduling has confused the flags : their values. They have often created more problems than they solved, because I keep remembering the wrong ideas about them, or I forget that I changed a detail.

The Carry flag

In the beginning, it's very simple : the carry flag is set to 1 if the result of an addition generates a carry (overflows). It's valid both for signed and unsigned, thanks to the magic of 2's complement.

Then comes the subtraction : a little electronic quirk made me choose not to complement the borrow flag when there is a borrow. So the flag is set when there is no borrow. It's unusual but it saves maybe half a nanosecond, and it's hidden by the symbolic treatment of the assembler with the "BORROW" condition.
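To make the convention explicit, here is a little Python model of both cases (16-bit, names are mine) : the carry of an addition, and the uncomplemented borrow-out of a subtraction computed as a + ~b + 1.

```python
# Sketch of the carry convention described above (16-bit example).
# Addition    : carry = 1 when the sum overflows.
# Subtraction : the raw adder carry-out is kept uncomplemented,
#               so carry = 1 means NO borrow.
W = 16
MASK = (1 << W) - 1

def add_with_carry(a, b):
    s = (a & MASK) + (b & MASK)
    return s & MASK, s >> W            # (result, carry)

def sub_no_borrow_flag(a, b):
    # a - b computed as a + ~b + 1 ; the carry-out is left as-is,
    # which is exactly the "electronic quirk" of the post
    s = (a & MASK) + ((~b) & MASK) + 1
    return s & MASK, s >> W            # carry = 1 <=> no borrow
```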

Comparisons and MIN/MAX are even more confusing and I never know what to expect or how to come up with the right thought process... The fact that the operands can often be swapped does not help !

(to be continued)

The Equal flag

This one just changed polarity, again. So now it's simple : it is set to 1 when the operands are equal.

The value's calculation is a bit subtle : it reuses the ROP2's XOR layer, but since CMP also performs a SUB in parallel, the SND operand is negated, so it's actually a XORN.
Additionally, the reduction used to be done by an OR, and now it's done by an AND (otherwise it wouldn't work).
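A tiny Python sketch (my own naming, 16-bit for brevity) shows why the AND reduction is required once the SND operand is negated : the XORN of two equal words is all ones, so only an AND of all the bits yields the Equal flag.

```python
# Model of the Equal flag datapath : CMP negates the SND operand for the
# parallel SUB, so the XOR layer actually computes XORN (= XNOR), and the
# bit-wise reduction must be an AND for the flag to mean "equal".
W = 16
MASK = (1 << W) - 1

def equal_flag(fst, snd):
    xorn = (fst ^ (~snd)) & MASK       # XNOR of the two operands
    return 1 if xorn == MASK else 0    # AND-reduce all the bits
```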

The documentation at http://yasep.org/#!doc/reg-mem#equal is updated with a new diagram. This time, I should not forget the subtleties anymore...

Sunday 9 February 2014

Zero becomes Equal

I'm currently reviewing all the documentation of the YASEP. Going back to what I've done in the past, sometimes in a rush, and looking with the perspective of experience, gives me the opportunity to correct details that have become confusing in practice.

One example is the "Zero flag". It was created without thinking it through, during a frantic coding session.

It is affected only by the CMP instructions. The flag's value is the direct result of XORing both operands, then ORing all the bits. It uses the ROP2 unit's circuits, not the ALU's and it's independent from the Carry.

Why was it called "Zero" ? Maybe by habit. In fact it does not test if the operands are clear, even though it could. It tests if both operands are equal : it should be renamed to EQUAL / EQ and it would become coherent with the NEQ/EQ condition keywords. So let's do this !

Wednesday 5 February 2014

The yasep2014 milestone

With the resolution of the auto-update flags problem, other fundamental issues were ripe to be addressed. And they are now solved ! So here is the overview of the next major revision of the YASEP architecture :

The update fields are not the only ones that changed. The immediate fields have been harmonised and the condition and destination fields have been swapped, which reduces the instruction decoder's complexity a bit (fewer wires and MUXes).

I'm also changing the rules that govern the assembly language's syntax. This will put the immediate values always after the opcode, which will save some code by dropping the Ri/RRi/RI/RRI forms that are seldom used. Less code makes better code :-)

New memory-handling instructions will also appear, like BSWAP, and a pair of "shift & OR" opcodes for bitstream extraction.

The VHDL code also needs a full rewrite, one that also offers debugging and testing through an auxiliary port.

These changes are quite deep and require a full review of the whole code base, but it was necessary and now seems to be the right time to do things right, before new features are added on top. The changes will take a lot of time to mature and will not be published on the main site for a while. Stay tuned !

Friday 31 January 2014

Definition of the auto-update fields

(edit 20140207: some stupid typos crept into the tables)

(edit 2014-02-08 : dropping all the pre-modifications)

The instruction set of the YASEP architecture is finally frozen, after years of fine-tuning and exploration !

In August 2013, during a discussion with JCH, I came up with a new encoding for the 4 remaining bits of the extended instructions that were reserved for register auto-updates. I've been struggling with the one big shortcoming of the architecture : the very limited range of Imm4, particularly for conditional relative jumps. I had hacked a few tricks but none were really satisfying.

JCH pointed to some auto-update codes that didn't make sense in combination with other flags, and that's how he found a way to get 2 more bits for SI4/Imm4.

I tried to simplify the system down to a few simpler codes, following these principles :

  • A little reminder : when a D register (a memory access register) is referenced, it's the corresponding A register that gets updated, according to the size of the accessed word (1, 2 or 4 bytes). Otherwise, A registers are incremented by 2 or 4 bytes (depending on the datapath width, 16 or 32 bits) and R registers are incremented by 1. It's not very orthogonal but quite efficient.
  • any register may be post-incremented or post-decremented with one instruction (handy for string/vector code)
  • There must be "room" for 2 Imm bits and it should not break existing compiled code (NOP=0000)
  • Any of the 4 register fields may be affected
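The post-update rule in the first reminder bullet can be sketched like this (Python, register names as strings ; the parameter names and defaults are my assumptions) :

```python
# Hedged sketch of the post-increment rule :
# - referencing a D register updates the paired A register by the
#   access size (1, 2 or 4 bytes),
# - A registers step by the datapath width in bytes (2 or 4),
# - R registers step by 1.
def post_increment(reg, datapath_bits=32, access_bytes=4):
    if reg.startswith("D"):            # memory data register :
        target = "A" + reg[1:]         # the paired address register moves
        step = access_bytes
    elif reg.startswith("A"):
        target, step = reg, datapath_bits // 8
    else:                              # plain R register
        target, step = reg, 1
    return target, step
```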

The important trick that JCH found is that the Imm/Reg field invalidates certain auto-updates and frees some bits. In particular, it makes no sense to update SI4 when this source operand is immediate, so SI4 is associated with NOP in certain cases.

There is very little room and I had to make some compromises. For example, the CND field can't be updated when other registers are. Pre-incrementations are also avoided (see why at the bottom). It's not possible to increment one register and decrement another.

The resulting format provides Imm6 and one post-update for all extended instructions, and one to three post-updates when no immediate is present.
  • iRR instructions use 2 bits to encode Imm6 along with 2 bits for updates :

    00  NOP
    01  SND+
    10  DST+
    11  CND- (this helps loops)
  • RRR instructions use 4 bits to encode more complex updates :

            00          01              10    11
    00      NOP         SND+,SI4+,DST+  SI4-  SI4+
    01      SND-,SI4-   SND+,SI4+       SND-  SND+
    10      DST-,SI4-   DST+,SI4+       DST-  DST+
    11      DST-,SND-   DST+,SND+       CND-  CND+
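The table transcribes directly into a lookup, for example in Python (how the 4 bits split into row and column bits is my assumption, not the actual encoding) :

```python
# Transcription of the RRR update table : 2-bit row, 2-bit column.
RRR_UPDATES = {
    (0b00, 0b00): [],                          # NOP
    (0b00, 0b01): ["SND+", "SI4+", "DST+"],
    (0b00, 0b10): ["SI4-"],
    (0b00, 0b11): ["SI4+"],
    (0b01, 0b00): ["SND-", "SI4-"],
    (0b01, 0b01): ["SND+", "SI4+"],
    (0b01, 0b10): ["SND-"],
    (0b01, 0b11): ["SND+"],
    (0b10, 0b00): ["DST-", "SI4-"],
    (0b10, 0b01): ["DST+", "SI4+"],
    (0b10, 0b10): ["DST-"],
    (0b10, 0b11): ["DST+"],
    (0b11, 0b00): ["DST-", "SND-"],
    (0b11, 0b01): ["DST+", "SND+"],
    (0b11, 0b10): ["CND-"],
    (0b11, 0b11): ["CND+"],
}

def decode_rrr_update(field4):
    # Assumed split : bits 3:2 select the row, bits 1:0 the column.
    return RRR_UPDATES[(field4 >> 2) & 3, field4 & 3]
```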
- The big advantage of this encoding is that it increases code density for a lot of very common sequences : stack manipulation, string/vector processing, counters... Code density increase does not always mean faster execution but it helps. Different microarchitectures might implement these flags with different approaches (serial or parallel)

- There are several drawbacks as well : the encoding favors density over decoding ease (but what can we do with only 4 bits ?). The new encoding also breaks Imm4 and a new assembler must be recoded from scratch (the current one is aging and its flexibility has been stretched to its limits).

- In the end, it is progress :

  • Code density will increase again (maybe 20%).
  • Auto-updates are an optional feature but we have freed 2 Imm6 bits for general consumption. This will benefit all the YASEPs out there (which must be updated, fortunately there are not a lot yet ;-D ). Post-update is a first level of compatibility, and pre-update is more difficult to implement so it's a second level (less expected to be available).
  • This helps extend the range of PC-relative conditional jumps
  • This solves the limitation of Shift/rotate operations
  • No more unused bits in the instructions !

- Some questions remain :

  • It makes sense to update the A register of a D register that has just been written to (to update the destination for the next write, in a string-copy sequence for example). What about the case where an instruction writes to an R register with post-increment ? What is the priority ? Auto-updates were initially meant for address registers only but later extended ; should this be restricted again ? If so, would that break even more symmetry and create more complexity ?

Right now, the priority is to rewrite the assembler/disassembler and keep the simulator and VHDL up-to-date. My work system is in a bad state and it will take time to get everything back in order.


Why no pre-increment or pre-decrement ?


Pre-modifications are removed because they break the very important rule that an instruction must not trap (or be able to trap) in the middle of the execution pipeline.
In the case of pre-modifying an address register, as in MOV -D1, R1, the validity of the new address in A1 is known only after it has been computed, but there is no way to gracefully stop the instruction in the middle, or even to restart it. The proper way is to fold the -D1 into a previous instruction that uses A1 or D1, or simply to emit a short ADD -1 A1 instruction before the actual move to R1.

Remember : all the operands must be directly ready for use (at decode stage) before the instruction can proceed to execution stage.

The previous table was :

            00          01    10    11
    00      NOP         +SI4  SI4+  SI4-
    01      SND+,SI4+   +SND  SND+  SND-
    10      DST+,SI4+   +DST  DST+  DST-
    11      DST+,SND+   +CND  CND+  CND-
The new table uses the 4 pre-inc entries for 2-post-decrement and 3-post-increment.

Sunday 14 July 2013

More about the IPC instructions

Jean-Christophe sent me an interesting email loaded with questions about the IPC instructions. Before I address them, I felt that I should provide some background in the previous post about "limiting YASEP32's threads code sizes to 16 bits". Go read it now !

Done ? OK. And now, the (translated) email.

> Concerning the 3 instructions IPC, IPE and IPR of the YASEP, I have read that you designed them with the HURD's needs in mind. However, I'm not sure I see how it solves the problem.

This story started long ago, in the F-CPU era, and the encounter with the HURD team at RMLL2002 in Bordeaux. They were trying to solve the problem of slow inter-server calls that crippled the efficiency of their system. That's more than 10 years ago...

Since then, many things have evolved and the question is quite a bit different, now that I can redesign the WHOLE computing platform, not even being forced to "run Linux" or "run the HURD". I make YASEP run whatever I need or want and I don't care as much about others' whims. But the question remained.

The IPC instructions solve one part of the problem of switching fast to another thread. These instructions make sense when the YASEP is implemented with a large register bank that can hold 8, 16 or 32 thread contexts. In this case, and if the called thread is preloaded, the switch is almost instantaneous.

Of course, you can't limit a system to run only 8, 16 or 32 contexts. This is only suitable for a SMT architecture (see : "barrel processor") but software size could grow beyond that. My Linux laptop is running about 177 tasks right now, and only a few actually use the CPU. So, for large implementations, the YASEP must store the actual, full thread ID as well as a smaller, 5-bit ID for the cache. Or 6 bits, if you want to emulate Ubicom.
Add some associative memory and you implement the cache mechanism. And if your code calls a thread that is not already loaded in the CPU register bank, you "get hit by the miss", but "this should not happen too often". Embedded systems that don't need 32 simultaneous threads can just use the 5-bit ID directly (and trap if the thread ID is too large).
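Here is a rough Python model of that mechanism (class name, slot allocation and the FIFO eviction policy are my assumptions, not the YASEP's specification) :

```python
# Hypothetical thread-ID cache : a large implementation keeps full thread
# IDs, while the register bank only holds 32 contexts addressed by a
# 5-bit slot ID.
from collections import OrderedDict

class ThreadContextCache:
    def __init__(self, slots=32):                 # 5-bit slot ID => 32 slots
        self.slots = slots
        self.map = OrderedDict()                  # full thread ID -> slot ID
        self.misses = 0

    def lookup(self, thread_id):
        if thread_id in self.map:                 # hit : context preloaded
            return self.map[thread_id]
        self.misses += 1                          # "get hit by the miss"
        if len(self.map) >= self.slots:
            _, slot = self.map.popitem(last=False)  # evict oldest context
        else:
            slot = len(self.map)
        self.map[thread_id] = slot
        return slot
```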

For the rest of this post, bear in mind that I am not a microkernel specialist. There are even several points of view about them and I will only speak about mine, despite having never created a full operating system. At least I know what features and behaviour my OS will have.

> If these instructions are meant to call routines in safe code sections (TCB) it might be a good, flexible solution.

That was the initial purpose : call shared libraries and fast system calls. Later, it evolved with the addition of some additional security checks. It was possible because the YASEP is not created to use classic paged memory only.


> But I understand that the HURD's problem was to provide a platform where same-level users (less privileged than the machine or the admin) could run their own servers and exchange services safely. One typical example being User A creating a USB file system and letting User B access files on the USB flash dongle. B must call functions (readdir, read, write, etc.) in A's server.

There the problem is not totally solved by the IPC instructions but they help by making the context switch faster. However the big problem is the transfer of data across memory addressing spaces, and that requires a later, deeper analysis and smart design. It's way too early for this now.


> What happens when the server does not return to the client ? This could happen if an error or bug occurs in the server, or if it is malicious. There is a need of a mechanism that lets the caller get his control back but it complicates the design of the servers, that could be interrupted at any moment (which is not impossible or desired).

Reliable code MUST be resilient. And code by definition may be interrupted at any point. Exception handling is an integral feature of high-level languages and low-level systems are naturally prone to failures : flash memory errors (worn out ?), file system capacity saturated, network down for any reason, USB cable that is removed without notice... And bugs happen.

Even critical sections could fail. This is why I consider an instruction design (for http://yasep.org/#!ISM/CRIT) where you can check if the critical section has been interrupted or not. In that case, you re-start the critical section. It should be more resilient and safer than blindly disabling interrupts. Expect things to fail instead of relying on assumptions.

> One of the underlying questions is : who "pays" for the resources that are necessary to run the request ? The client, who lends his resources (and that must get them back from the server in case of a failure) or the user who provides the server (who then risks being DDOSed)...

My opinion is that the requester must pay for the resources to run the request, for example by providing CPU time, access rights and memory, that are necessary to complete the request. Of course, certain protections are necessary to prevent abuses or to limit the effects of bugs.

How the resources are accounted is another important thing to define, along with how to easily transfer data blocks or credentials between servers or their instances. For example, if a server is called by User 1, it should not be able to share this information with the instance that services User 2. I'm interested in hardware-based solutions that speed up and simplify software design, without turning into a mess like the iAPX432 :-)

A crippled YASEP ? It's for your own good...

The YASEP is my playground so I can experiment freely with computer science and architecture. It lets me consider the whole software and hardware stack, not just some pieces here and there.

A recent conversation about the IPC system (Inter Process Call) reminded me of a characteristic that I once considered. It is quite extreme, but it opens interesting possibilities : limit the code addresses in YASEP32 to 16 bits. That makes 32K half-words or about 20K instructions per thread. Of course, there would be no absolute limit on the number of threads.

But why cripple the architecture ?

Because the YASEP is meant to be an embedded CPU, running a microkernel OS, and here is why it is possible and desirable.

 - The imm16 field is available to most opcodes and it provides the best code density.

 - Embedded systems and microcontrollers usually run small applications. There is little chance that 20K instructions get exhausted by a single thread, for a single purpose. Once you reach 10K instructions, you have already included libraries, system management code, and of course tons of data, and all of these can (and should) be kept separated.

 - Microkernels spread, divide and separate system functions, using "servers". My idea would be to push this approach and mentality down to the user software level. And there is no way to coerce SW developers to NOT write bloatware without "some sacrifices".

 - GPUs already have such an approach/organisation (small chunks of code for discrete functions) for hardware and software reasons. "computational kernels" are quite short.

 - In the "microkernel" spirit, shared libraries become a special type of server. In POSIX systems, the flat address space of each thread maps to each shared library, and memory paging flips the cards in the background. Huge pointers (and wide arithmetic) are necessary. In an early YASEP32 implementing multiple threads, shared libraries become possible without adding memory paging. A bit like a hardened uClinux. Code (and the return stack) would be a little bit more compact (no 32-bit constants for code addresses) and reaching external resources would just require knowing the "extension code", the ID of the thread that provides the feature in a "black box".

 - Such an approach is possible because "IPC is cheap" now. In the "flat linear space" systems used in POSIX, the burden of protection is on the paged memory (TLB) and the fault handlers, which can take hundreds of cycles to switch to a different task... But the YASEP's IPC is marginally more complex than a CALL.

 - Only code would have limited address ranges. Data pointers are not affected. I have programmed and used MS-DOS systems in 16-bit environments and the constant problem was how to access data, not code. Of course there are certain applications that need more than 20K instructions, but they need to be split ; the PCs of those times had "overlays" and other kludgy mechanisms that I don't want to revive.

 - Modularity. Design software so its components are interchangeable, hence a bug here has less chances to affect code there. Or hot-swap a module/server/lib while the system is running. Microkernels can do it, why not applications ?

 - Safety is increased by the extra degree of separation between any "sufficiently large chunk of bugs^Wcode" to prevent or reduce both accidental and malicious malfunction.

But for this approach to be realistic, cheap IPC is not enough. Sharing data, fast and safely, between "servers" is another key !

TBD (to be designed)

Monday 24 September 2012

Virtual Load and Store

 

The Instruction Set Manual is nearing completion.

As if things weren't serious enough, I'm now reviewing the Load and Store instructions. And since there is none, I removed them.

Confused ? It's normal : the YASEP has no load or store operation. But there were opcodes named after them, in the hope they would make people feel more comfortable. Which was not the brightest idea ever.

So I removed them and kept the aliases I had created : the Insert and Extract opcodes. As the name says, they insert and extract bytes or half-words (8 or 16 bits).

So far the YASEP had these opcodes : IB IH IHH ESB EZB ESH EZH

End of story ? NO !

 

Register set ports

The review of the Instruction Set Manual unearthed some concerns I had for a while : the "Load" and "Store" instructions get auxiliary data from other implicit registers. Which can create quite an electronic mess... And it's not flexible enough. It was created for the sake of load and stores, using the "register pair" system where the Address register provides its LSB to the shifting unit, so it knows how much alignment is necessary.

While reviewing the ESB / EZB instructions, I noticed that they were one-read, one-write instructions (in FORM_RR or FORM_IR). What a waste of coding space ; let's make them RRR or IRR instructions and get rid of the implicit operand that must be fetched in the other registers. Easy, and architecturally much better.

But the Insert instructions (IB and IH) are another beast, they need 3 operands and a destination. The first operand is the register that contains the data to insert, the second is the word that will receive the data, and the 3rd operand is the implicit shift count that comes from the corresponding A register if a D register is written...

It's not cool. First there is this rule, that I thought was critical, of having only two read ports for the register set. Second, splitting the register set for performing parallel reads is a trick that can backfire mercilessly later. Going 3reads-1write would be great... At what cost ?

But wait, the YASEP is already 3reads-1write because there are the conditions to read. The microYASEP implements this by passing both halves of the extended instruction through the 2-reads register set. And there is one result that remains unused during the second cycle... That's it !
 
So in practice, despite the limitations, the microYASEP is a 4-reads 1-write engine. 1 read for condition, 3 reads for operands and the destination register is one of the operands. Great. Now let's create the new flags "READ_DST3" and "IMM_LSB" and change a lot of the code that is already around... Lots of work indeed.
 
 

The return of the carry flag

The YASEP has a carry flag and a zero flag so let's put them to good use.

One concern of the YASEP32, when inserting and extracting unaligned half-words, is the case where the offset is 3 bytes : only one byte is available in this case. This means that the result is partial and does not work as expected. The simulator sends an error message but that's not very handy. The assembly code that detects this case could take at least a pair of instructions, then there is the leftover byte to align...

The little stroke of genius is to change the carry flag when such a situation occurs. Then, for the user's code, it's just a matter of a few conditional instructions to increment the pointer and store the remaining byte at the new address (or just skip the alignment sequence). Something like

; Unaligned write :
; memory is pointed to by A2:D2,
; the code stores the 16 bits located in R1

IH A2 R1 D2        ; set the carry flag if A2's LSB are 11

EZB 1 R1 R2 CARRY  ; if out-of-word then extract
                   ; the high byte to a temporary register
ADD 1 A2 CARRY     ; point to the next byte
IB R2 A2 D2 CARRY  ; now D2 points to the next word
                   ; and R1's high byte is inserted
                   ; in D2's lower byte
It's a bit far-fetched but it's still RISC.
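The same sequence can be modeled in software, for example in Python (little-endian word layout assumed, the function names are mine, not YASEP mnemonics) :

```python
# Hedged model of the unaligned 16-bit insert : at byte offset 3 of a
# 32-bit word, only the low byte fits, the "carry" is raised, and the
# leftover high byte goes to the next word, like the sequence above.
def ih_word(word, value16, byte_off):
    """Insert a 16-bit value into a 32-bit word at byte_off (0..3).
    Returns (new_word, carry) ; carry = 1 when byte_off == 3."""
    if byte_off < 3:
        shift = 8 * byte_off
        mask = 0xFFFF << shift
        return (word & ~mask | ((value16 & 0xFFFF) << shift)) & 0xFFFFFFFF, 0
    # out-of-word : only the low byte fits in this word
    return (word & 0x00FFFFFF) | ((value16 & 0xFF) << 24), 1

def store16(words, addr, value16):
    """Whole sequence : IH, then the conditional fix-up (EZB/ADD/IB)."""
    w, off = addr >> 2, addr & 3
    words[w], carry = ih_word(words[w], value16, off)
    if carry:                                  # leftover high byte goes
        words[w + 1] = (words[w + 1] & 0xFFFFFF00) | ((value16 >> 8) & 0xFF)
    return carry
```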

 

Is IHH still needed ?

The instruction "Insert Halfword High" is a special case of IH that was created to supplement MOV so the user could overwrite the high half-word of a register. The intended use was this:

 ; put 12345678h in R1
MOV 5678h R1 ; LSB
IHH 1234h R1 ; MSB

A special opcode was needed because IH would get the shift from one of the A registers, or use 0, yet we needed to shift by 2 bytes...

Now we have a new IH that gets the shift amount from a register or an immediate field so things would be great in theory. In practice, there can be only one immediate number so IH is ruled out.

What we need is a MOV instruction that can shift the immediate value by 16 bits. So here we have it : MOVH. The instruction sequence is modified a bit :

 ; put 12345678h in R1
MOVH 1234h R1 ; MSB
OR 5678h R1   ; LSB

YASEP2013 looks better and better...
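The MOVH + OR pair is easy to model in Python (function names are mine) : MOVH loads an immediate shifted left by 16 bits, and OR then fills the low half-word.

```python
# Model of the MOVH + OR constant-building sequence.
def movh(imm16):
    return (imm16 & 0xFFFF) << 16      # MOVH 1234h R1 ; MSB

def build_const(hi16, lo16):
    r1 = movh(hi16)
    r1 |= lo16 & 0xFFFF                # OR 5678h R1 ; LSB
    return r1
```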

Friday 10 August 2012

The Zero status flag

Things are getting quite messy in the architecture now. Sorry for the long rant !


It started with the Carry flag. I have turned the problem around in every sense and from every perspective, and couldn't find a proper way to deal with it. The Golden Rule of all modern architectures is : no f****** carry flag or status register. MIPS has an overflow trap, F-CPU has a 2-reads-2-writes instruction. But the YASEP can't afford this luxury...


I thought I solved it with a dirty trick : storing the carry flag in the Program Counter's Least Significant Bit (PC's bit #0, to make the number odd or even). But it was really too ugly (I want this critical bit for other purposes later) so I moved the carry bit somewhere else, out of the register set.

It's still messy but... OK, it still works. For example, if an exception occurs, it's still easy for the same HW thread to save and restore the Carry flag's value. It's still longer than the single byte needed to save all the flags of an x86 CPU (PUSHF) but hey, we're RISC aren't we ?...

; save Carry to R1 :
mov 0 R1
mov 1 R1 CARRY
; restore carry from R1 :
add -1 R1

This works for a few entangled reasons. The carry flag is updated by only a handful, adder-specific instructions. This means that it would not be updated by mundane MOV instructions, for example, if the register set is manually saved or restored. The PC's bit #0 can also host the carry flag when it is automatically saved in hardware for an exception handler (that is, if I totally forget the fact that it's SMT and the handler could simply execute from another hardware thread).
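A quick Python check (16-bit model, names are mine) of why "add -1 R1" brings the Carry back : adding -1 (0xFFFF) overflows exactly when R1 holds 1, so the adder's carry-out reloads the saved flag.

```python
# Model of the Carry save/restore trick above (16-bit datapath).
W = 16
MASK = (1 << W) - 1

def save_carry(carry):
    return 1 if carry else 0           # mov 0 R1 ; mov 1 R1 CARRY

def restore_carry(r1):
    s = (r1 & MASK) + ((-1) & MASK)    # add -1 R1
    return s >> W                      # the adder's carry-out
```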

And before the carry stuff, there was no such flag in the early YASEP/VSP, true to the Holy Dogma. There were carry-generating versions of ADD and SUB, where ADDC and SUBB would set the destination to 0 or 1 depending on the result's overflow. The problem is that unlike most RISC architectures, the YASEP does not have a lot of available registers. ADD and SUB are often used to compare values' magnitudes and the results require a temporary register. Since the YASEP has only 5 "normal" registers, it means that 20% of that space gets screwed up. Compare that to 3% for MIPS...

ADDC and SUBB have thus been replaced by CMPU and CMPS around 2009. They work by adjusting sign bits, subtracting, and not writing the result, so a register is saved. The result goes to the carry flag, which is read by a new condition code.

...

OK, so now the YASEP has a Carry flag. It has found a nice place in the condition codes map. But wait, there is another condition bit available...

At first, I thought that it could be useful as an "overflow" flag (that is : if the result's sign differs from the operands' signs). But to this day, I have yet to find a case where it is useful. Furthermore, the CMPS and CMPU instructions already deal with the operands' types.

Another useful bit would be a flag that indicates whether a critical section (opened by the CRIT opcode) had been aborted. This is the quintessential flag because it is totally context-dependent and write-only. No need to save it or restore it : if a trap occurs inside a critical section, just flush the bit and hopefully the critical section will be restarted by the application.

But I got lazy. Or, instead, I programmed real code and found that I would be happy with a flag that indicates if the last result was zero.

I got lazy and wrote this in the first microYASEP VHDL code :

if WB_en='1' and FlagChangeCarry(int_opcode)='1' then
    Carry    <= Carry_out;
    FlagZero <= zero_out;
end if;

Yes, I realise just now that it gets updated at the same time as the Carry flag. Worse : I see only now that the zero flag gets its value not from the Adder's result, but from the binary difference of both operands.

zero_out <= '1' when ROP2_xor = (ROP2_xor'range => '1')
            else '0';

It's a nice trick because it doesn't increase the critical datapath for the adder : the big combining ANDN (with 16 or 32 input bits) is computed in parallel with the carry chain of the adder. But then it means that the Zero flag makes sense ONLY with the CMPU/CMPS instructions, not ADD or SUB... Which is a standard and expected behaviour in most other architectures :-/

But wait, in the above VHDL code, FlagZero is also updated for ADD and SUB ? Now this could possibly explain certain curious bugs I had...

...

OK, this is messy now. But it's not finished.

What I would ideally want is to have the zero flag updated any time a register is written, too. This corresponds to tying a big OR to the write bus of the register set. This potentially adds some significant latency to the pipeline. But... Is that useful ? If the result is written to a register then this register can be tested anyway with the usual conditions.

So really, the Zero flag is useful only for CMPU and CMPS, that indeed are the only computation instructions that don't write back the result. The VHDL code must be corrected.

And how/where can one save both the carry flag and the Zero flag ?

Saving one flag was already complex enough, saving two flags is still possible but longer.

; save Carry and Zero to R1 : 10 bytes
mov 0 R1
mov 1 R1 ZERO
add 2 R1 CARRY

; restore Carry from R1 : 6 bytes
add -2 R1
; restore Zero
and 1 R1
CMPU 1 R1

Wouldn't it be easier if the flags were accessible as a single normal register ? There are no registers left but Special Registers could do it.

; save Carry and Zero to R1 : 2 bytes
GET -1 R1

; restore Carry and Zero from R1 : 2 bytes
PUT R1 -1

But should it ? Moving data to the Special Registers is a slippery slope and when we start doing it, we want to apply this principle over and over and... it gets even more messy ! How many status flags will end up there and will this special register be called "status register" ?

Obviously I don't want this so I'll just avoid it and use the slow method for now.

Friday 27 July 2012

Nikolay's questions

I just received an email from another "hacker" who raised a lot of interesting questions. I answered by email and I share here some important ideas and insights.

Nikolay wrote :
> To be honest, I was more interested in YASEP16, because that would
> be a much harder task to solve compared to YASEP32 (with 32 bits it is
> generally much easier to define an instruction format, and if one sticks
> to fixed-length 32-bit instructions & an encoded instruction format,
> things will generally look at least acceptable, if not even good).

I don't see what is so hard. It just happens that if you can do more, you can do less. It's explained here :

http://yasep.org/#!doc/16-32

http://yasep.org/#!doc/forms

> Let me start few steps away - I enjoyed to see that you're finally
> pissed-off of using separate tools and sources, and you had started to
> play with an integrated model for the CPU, that's used for
> rtl/docs/assembler/disassembler. I think this is so major step, that I
> can't even find any cool words for this. My understanding was that
> lots of interesting CPU projects die too early because the burden of
> supporting all the separate tools in a compatible state for both the
> developers' usage and the community just quickly exhausts the people and
> they stop pushing forward. Having the model-based approach will
> hopefully leave more time for fun for the hobby-CPU designers :D.

I suppose that you refer to a hobby project in particular, right ? :-) In fact, building a whole, free, self-contained and EASY TO USE toolset was the main motivation : I think that the YASEP is a sub-standard architecture as it is now (yet it is being polished), but since it is so simple, I can also work on getting the tools RIGHT for a more "potent architecture" (whether totally new or dusted off from the archives...)

> Another interesting thing to see was that the YASEP16 & YASEP32
> partially share the instruction format. I was wondering whether the
> actual intention was to have forward binary compatibility with the
> YASEP32?

This part of the architecture is not well understood and I still struggle to explain it correctly. It is explained there http://yasep.org/#!doc/16-32

YASEP16 and YASEP32 are binary compatible on many levels (see the link above). They do not "partially share" the instruction format : the instruction format and decoder are 95% identical. Bus widths change and some 32-bit-specific instructions are invalid in 16-bit mode. That's all.

In fact, you could create a single CPU core that executes both 16-bit and 32-bit code with only minor alterations of the decoder. Switching from 16-bit to 32-bit mode should only affect memory addressing and organisation.

Furthermore, I once needed a MCU that would only handle 14-bit data : I see no real limitation that prevents me from making an N-bit CPU (N<=32) as long as the bus width is equal to or larger than the instruction address bus. YASEP16 then becomes just a particular subset of the architecture family.

> Now some questions for the instructions - I looked at the instruction
> formats and also at several instructions. Is there any difference
> between FORM_iR and FORM_Ri (used by GET & PUT)?

This is addressed in http://yasep.org/#!doc/forms#FORM_iR :

"This is just another way of writing iR when the instruction uses the immediate value as a destination address."

In other words, it's just to make the assembly language conform to rule 1 : the destination is always the last part. See http://yasep.org/#!doc/asm#asm_ila

Physically, iR is encoded as Ri :

http://yasep.org/#!ASM/asm#a?PUT%20A5%204 : PUT A5 4 => 4762h
http://yasep.org/#!ASM/asm#a?GET%204%20A5 : GET 4 A5 => 4742h

The immediate "4" stays in the same place. However, keeping a single writing rule makes the source operand easier to spot : its position does not depend on the opcode.

> According to basic math instructions - I saw that ADD & SUB generate
> Carry/Borrow, but I'm not sure whether these instructions actually use
> this flag. Generally it's useful when implementing arithmetic with
> integers wider than the CPU registers (and it's major pain in the ass
> when it's not available).

Carry/Borrow are used only by conditional instructions.

In practice I have seen that ADDC and SUBB are quite rare and I couldn't justify the added opcodes. Having a specific set of conditions, however, is far more interesting. So if you want to add with carry/borrow, it's a bit awkward but here is one way :

; R1 + R2 => R3:R4
ADD R1 R2 R3 ; generate the carry
ADD 1 R4 carry ; suppose R4 was cleared.

It's a bit unusual, but possible. All the necessary data are here, and if you want to do number crunching, use a more appropriate CPU :-) This one is for doing "control" stuff.
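The two-instruction sequence above can be modelled quickly. This is a Python sketch of the assumed semantics (16-bit registers, R3 holding the low word, R4 pre-cleared and incremented by the conditional ADD), not the YASEP toolchain :

```python
# Model of the add-with-carry idiom above, on 16-bit words.
# Assumption: R3 gets the low word, R4 (pre-cleared) gets the carry.
MASK16 = 0xFFFF

def add_wide(r1, r2):
    """Return (r3, r4) for the sequence: ADD R1 R2 R3 / ADD 1 R4 carry."""
    total = r1 + r2
    r3 = total & MASK16        # ADD R1 R2 R3 : low word, generates the carry
    carry = total >> 16        # 1 when the 16-bit addition overflowed
    r4 = 0                     # "suppose R4 was cleared"
    if carry:                  # ADD 1 R4 carry : conditional increment
        r4 = (r4 + 1) & MASK16
    return r3, r4

print(add_wide(0xFFFF, 0x0001))
```

Chaining more words would repeat the same pair of instructions per limb.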

> Other things for commenting - I was a little surprised to see the
> assembler format like "MOV Rx Ry" and I was wondering what influenced
> you when designing the ASM syntax?

Experience with several architectures :-) Before F-CPU I had already made a few assemblers for existing and hypothetical architectures. Simpler is better but expressiveness and consistency must not be forgotten : one must see the intent in the source code.

> SHR, SAR, SHL, ROR, ROL - I was pleasantly surprised to see a full set
> of shift operations on a small micro. I have also one question about
> the shift/rotate operations - I didn't see anywhere that the Carry
> flag is used/updated by these opcodes. Generally playing with the
> Carry is used when converting between receiving/transmitting 1-bit
> data (think for software SPI/I2C/UART). I suppose that your intention
> is to use instead logical operations to extract the LSB, like "AND 1
> R3" for example?

Shift/rotate don't take the carry. I can't remember how many times I had to clear the carry flag before doing a shift on "another widespread architecture". How annoying.

If you want to shift a bit out, there's an easy way :
ADD R1 R1 ; => carry !
OR 1 R2 CARRY;

Or you can :

SHL 1 R1
OR 1 R2 MSB1 R1

Or the other way :

SHR 1 R1
OR 1 R2 LSB1 R1
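
Looping the SHR + conditional-OR idiom gives a bit serialiser, as a software SPI/UART transmitter would use. This Python sketch is one plausible reading of the sequence (grab the low bit, then shift), not YASEP semantics :

```python
# Python model of the "shift a bit out" idiom above; the ordering
# (test the LSB, then SHR) is an assumed reading of the sequence.
def shift_out_lsb_first(r1, nbits):
    """Serialise nbits of r1, LSB first (software SPI/UART style)."""
    out = []
    for _ in range(nbits):
        out.append(r1 & 1)   # OR 1 R2 LSB1 R1 : grab the low bit...
        r1 >>= 1             # SHR 1 R1       : ...then shift it out
    return out

print(shift_out_lsb_first(0b1011, 4))
```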

Another method, if you want an arbitrary bit order, is to use a condition on R1's individual bits :
; count the number of bits set in R1's 16 LSB:
MOV 0 R2
ADD 1 R2  BIT0 0
ADD 1 R2  BIT0 1
ADD 1 R2  BIT0 2
ADD 1 R2  BIT0 3
ADD 1 R2  BIT0 4
ADD 1 R2  BIT0 5
ADD 1 R2  BIT0 6
ADD 1 R2  BIT0 7
ADD 1 R2  BIT0 8
ADD 1 R2  BIT0 9
ADD 1 R2  BIT0 10
ADD 1 R2  BIT0 11
ADD 1 R2  BIT0 12
ADD 1 R2  BIT0 13
ADD 1 R2  BIT0 14
ADD 1 R2  BIT0 15

(just an example, there are better and faster ways)
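The 16 conditional ADDs above boil down to the following Python model. The reading of the "BIT0 n" condition as "bit n of the tested register is set" is an assumption, matching the stated intent of counting set bits :

```python
# Model of the conditional-ADD popcount above (assumed condition
# semantics: "BIT0 n" tests whether bit n of the register is set).
def popcount16(r1):
    r2 = 0                    # MOV 0 R2
    for n in range(16):       # one "ADD 1 R2 BIT0 n" per bit
        if (r1 >> n) & 1:
            r2 += 1
    return r2

print(popcount16(0xAC35))
```

As the post says, there are faster ways (e.g. parallel bit-folding), but this mirrors the instruction-per-bit approach.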

Note that this feature makes sense in a microcontroller : its purpose is to handle bits. It could be unavailable in a more "streamlined" version because of pipeline delays, but you never know.

> CMPU, CMPS, UMIN, UMAX, SMIN, SMAX - just plain awesome. "Dude, this
> is not your grandfather's PIC!"

That's one way of seeing it. Notice that MIN/MAX are not available on certain implementations though. I can play with the inhibition of writeback to the destination register because I don't want to add a conditional MUX in the datapath : too much wire load, and that would slow the whole system down. But it can "partially work" with the inhibition trick on some operand combinations.

> The signed 4-bit immediate is the victim of the short instruction
> format :D.

I think I have found a good balance because short immediates appear very often. However, I wish I had room for 8-bit operands. Anyway, the whole thing remains orthogonal and (almost) simple. It's a compromise and there will always be annoying cases.

> I was joking before several days that if we don't have
> support for immediate values, we won't have any data to load because
> there's no way to create the data in RAM in the first place (which is
> not true of course, reading hard-wired constants from a special reg
> and/or incrementing/shifting/performing bit ops will still provide
> valuable way to enter data in the programs, but still... it sucks
> compared to the immediates. Actually, my opinion is that the immediate
> is always a victim of any ISA - when having fixed instructions you
> either sacrifice immediates for register addresses and more operands,
> or you sacrifice the operand count (and reuse one of the operands as
> both source & destination, doh), or you have multiple instruction
> formats that had to be decoded simultaneously in order to provide the
> needed flexibility (that's not cool, but it's inevitable price to
> pay).

Those are the same dilemmas for everybody :-)

> About the instruction condition codes - are these available only for
> the YASEP32? And also, are they emulated by the high-level assembler
> when generating machine code for YASEP16?

They are available for both YASEP16 and YASEP32. The datapath width is not the instruction width.

> Btw, I didn't find a description of all the programmer-visible
> registers, so I'm not sure what are these used for (they look somehow
> like memory index & memory window registers, but that's a wild guess).

I am working on the related page at this moment, both on the French and English versions to keep them in sync. An older version is here : http://yasep.org/yasep2009/docs/registers.html However, several things have changed : I'm adding "register parking", the numbering has changed (R1-R5 instead of R0-R4), plus other subtleties.

> About call/return functionality - I saw that you have done some
> preliminary work on it. I'm by no way expert on VHDL (I typically
> write for my hobby in Verilog), but I checked the RTL and to me it
> looks like that there's no other way to modify the PC register - it's
> not part of the general purpose register file, so instructions like
> CALL/RETURN will be inevitable, imho. Nevertheless, I would be happy
> if you can share your thoughts on this important topic.

It is very important and I resisted for a long while before adopting a particular solution. My main problem is that it's inherently a two-write operation. It is necessary to treat PC independently, which raises a lot of issues. But for now, it seems to work in the microYASEP pipeline. It may evolve; for example, I have not implemented "call with offset" (CALL2) because I believe it's a slippery slope, but it's "technically possible" so hey...

> PUT, GET - I'm not sure how these functions access the SFRs (they look
> unimplemented in the VHDL).

They were implemented for a prototype and the SR map has not yet been standardised.

> Btw, I would typically advise against
> separating the address space (separation in any form) like memory and
> I/O spaces - it's much more straight-forward to move & manipulate data
> around with unified instructions and to access memory-mapped
> peripherals (but in the same address space). Of course, the inevitable
> price for peripherals/SFRs is the address decoding - it's either
> partial & ugly, or full and expensive :).

I make a distinction because :

  • memory is meant for high-speed, bulk transfers that MAY be performed out of order and with latency. Memory is for data and instructions.
  • SRs are "serialising" and have immediate effect, which is critical for control. Memory mappings, inter-thread protection, configuration of peripherals etc. will be done there.

I hope this clears some misunderstandings :-)

Tuesday 8 November 2011

Register Parking

 

  Warning : this is partially deprecated since 2013-08-09, see http://yasep.org/#!doc/reg-mem#parking

As the YASEP architecture specifies, there are 5 normal registers (R1-R5) and 5 pairs of data/address registers (A1-D1, A2-D2...), and it's quite difficult to find the right balance between them : each application and approach requires a different optimal number of registers.

When more registers are needed (if you need R6 or R7), you could assign them to D1 and D2 for example. However, you then have to set A1 and A2 to a safe location, otherwise chaos could propagate in the software, since each write to these D registers also updates the memory. A similar situation appears if we use the Ax registers as normal registers : each write will trigger a memory read. And in paged/protected memory systems, this would kill the TLB...

This is now "solved" with today's system, which defines hardwired "parking" addresses and internal behaviour (this is still preliminary but looking promising).

  • "Parking" addresses are defined as "negative" addresses (that is : all the MSB are set to 1). This addressing range, at the "top" of the memory space, is normally not used, or used for special purposes, such as "fast constants" addressed by the short immediate values :
    MOV -7, A3 ; mem[-7] contains a constant or a scratch value,
    MOV D3,... ; the address fits in 3 bits
  • To keep the "parking" system compatible with non-parked versions, the addresses are defined globally for all software. They are easy to remember, as the following code shows :
    ; Park all the registers
    MOV -1, A1
    MOV -2, A2
    MOV -3, A3
    MOV -4, A4
    MOV -5, A5
    These will become macros or pseudo-instructions.
  • The internal numbering of the registers is changed to ease hardware implementation. There is a direct match between the binary register number and the binary code of the address (bits 1 to 3) :

    park address   binary   reg.bin   reg.number   register
         -1         1111     1111         15          A1
         -2         1110     1101         13          A2
         -3         1101     1011         11          A3
         -4         1100     1001          9          A4
         -5         1011     0111          7          A5
  • Architecturally, it does not change much. The Data registers are "cached" by the register set. What the hardware parking system adds is just an inhibition of the "data write" signal that would occur normally each time the core writes to a D register.
  • Aliasing : No alias detection is expected. If A4/D4 writes to -2, D2 is not updated. Otherwise it would mean that the result bus could write to 5 registers in parallel, which is not reasonable.
  • Thread backup and restoration : the register set contains the cached version of the memory, it must be refreshed when a thread is restored (swapped in). If the Ax register matches a parked address, the memory doesn't need to be fetched to refresh the cache. Another solution is to save the Dx register through another Ax/Dx, so there is nothing to test during restoration (but memory read cycles could not be spared).
  • This system, where "parking" is defined by an auxiliary value (that is inherently preserved through context switches), is "cleaner" than a more radical approach where "status bits" (one per A/D pair) park the registers. The advantage of the radical approach is that two registers can be parked at once (instead of one) but it gets harder to use from a compiler or from user software (you can play with pointers in C or Pascal easily, though you won't be able to define which pair is used). On top of that, adding status/control bits is usually a nightmare.
In the end, it's not very complex (not as much as it seems). The hardware price is a few logic gates that detect the parking addresses and inhibit memory writes. For the software writer, it just means more registers on demand, and it will work whether the YASEP has the parking hardware or not. You CAN have R6, R7 or R8, but then you'll have to restrict data access and give up A1/D1, A2/D2 and A3/D3. You make the choice !
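The parking rules can be condensed into a few lines. This Python sketch models the assumed semantics (it is not the RTL) : parking addresses are the "negative" addresses -1..-5, the cached register number follows the table of this post, and the "data write" signal is inhibited for parked addresses :

```python
# Model of the parking rules (assumed semantics, not the actual RTL).
WIDTH = 16

def to_signed(addr):
    addr &= (1 << WIDTH) - 1
    return addr - (1 << WIDTH) if addr >> (WIDTH - 1) else addr

def is_parked(addr):
    """True when addr is one of the five parking addresses -1..-5."""
    return -5 <= to_signed(addr) <= -1

def park_register(addr):
    """Internal number of the register cached at parking address addr."""
    return ((addr << 1) | 1) & 0xF   # -1 -> 15 (A1), -2 -> 13 (A2) ...

def data_write_enable(a_reg_value):
    """The 'data write' signal is inhibited when Ax holds a parking address."""
    return not is_parked(a_reg_value)

print([park_register(a) for a in range(-1, -6, -1)])
```

Note how the register number is the address shifted left with bit 0 set, which is the "direct match between the binary register number and the binary code of the address (bits 1 to 3)" mentioned above.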

Sunday 8 May 2011

This little Least Significant Bit

(update : 2011/05/11)

I've been wondering since March of this year whether the Least Significant Bit (LSB) of the Next Instruction Pointer (NIP or NPC) could be put to better use.

The YASEP instructions are 16-bits aligned and the instruction addresses have their LSB cleared by convention. This bit is usually wasted in word-aligned byte-oriented computer architectures.

In the current YASEP architecture, this LSB holds the carry flag of ADD/SUB operations. It is the only status flag that I couldn't get rid of with the usual architectural tricks. As a reminder, instructions can check 3 conditions : a register is cleared, has its LSB cleared (odd/even) or its MSB (sign) cleared. Every condition can be negated, and a 4th condition serves as the "always" or "reserved" case. Reading the LSB and MSB is easy; checking for a cleared register is more costly. In some implementations, the register set has "shadow" bits with precomputed/cached "register is clear" bits. But otherwise, no dirty trick is employed.

The Carry bit is less easy to handle : it's a dynamic result that can't be reconstructed from the 16 or 32 bits of the registers, so it is not possible to restore it after a thread switch. It can't be added to the "condition cache" either, because it would have to be saved and restored (16 more bits to save ? Bleh...)

Here come the latest changes :

  • The carry bit is now "hidden", not available from the register set for computations (that would make other things more difficult). It exists as a bit that can only be tested via a specific condition code in the conditional instruction forms (certainly one that tests NIP).
  • The LSB of NIP is always cleared. However, when saving/restoring the state in memory, it will hold the carry bit. This is the only case when the two functions (carry and pointer) are mixed.
  • Writing a "1" to the LSB of NIP (other than for saving/restoring the state) triggers a trap. There are several uses :
  1. Breakpointing / tracing / debugging : inject a "1" in the LSB and you can see where the pointer is used.
  2. Safety : for example if the stack is corrupted, there is a chance that the LSB will be set and trigger the trap
In future iterations, this bit could be used for something more pertinent (such as a second instruction memory bank selector), so it must be carefully handled by programmers now.
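The save/restore rule above (the only case where the carry and the pointer are mixed) can be sketched as follows; this is a Python model of the assumed behaviour, not the core's logic :

```python
# Model of carry packing in the saved NIP (assumed behaviour).
# NIP's LSB is otherwise always 0 because instructions are
# aligned on 2-byte boundaries.
def save_state(nip, carry):
    assert nip & 1 == 0           # instruction addresses are 16-bit aligned
    return nip | (carry & 1)      # the carry rides in the spare LSB

def restore_state(saved):
    return saved & ~1, saved & 1  # (nip, carry)

print(restore_state(save_state(0x1234, 1)))
```

Any other write of a "1" into that LSB would, per the rules above, trigger the trap.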

Saturday 20 November 2010

Fast and secure InterProcess Communications

(post version : 20110108)

(update : 20110515 : environment inheritance)

Recently (2010/11/20) I found the critical elements that solve a crucial problem that the Hurd team submitted to me in ... 2002. It took time and many attempts but I think that the YASEP is a great place to experiment with this idea and prove its worth.

The Hurd uses a lot of processes to separate functions, enforce security and modularise the operating system. It relies on "Inter Process Communication" (IPC) such as message passing, and this is snail-slow on x86 and most other architectures.

The YASEP uses hardware threads, a concept close, but not identical, to the processes of an operating system. And these last days I have found what was missing : the "execution context" ! So with the YASEP, a process is a hardware thread (a set of registers and special registers) associated with an execution context (the memory mapping, the access rights etc.)

Repeat after me : a process is a thread in a context.

This distinction is necessary because threads are activated for handling interrupts, operating system functions, library function calls and communication between the programs. It's a major feature of the processor which should provide functionalities that go beyond a mere microcontroller...

So IPC is necessary to make a decent OS, and it requires several hardware threads (threads can be interleaved at the hardware level to provide concurrency and better performance) and several contexts (for the operating system, device drivers, libraries, interrupt handlers...). The processor state can jump at will from one to the other with much less latency than a usual CPU.

The antagonistic requirements are as follows :

  1. A process must be able to call code from another context FAST, as fast as possible.
  2. The mechanism must be totally SAFE and SECURE.
  3. The physical implementation must be SIMPLE.

Simple and fast go hand in hand (ask Seymour Cray. Oh, wait, too late...). In the YASEP, communication takes place with a restricted variant of the function call instruction. Function calls are difficult to "harden", and other architectures usually provide more specific instructions for IPC or system calls. These are quite simple to implement in a CISC architecture like x86 because microcode can do whatever is required... But they are slow because several dependent memory fetches must be performed (read the access rights table, then find the address of the code to execute, whatever...)

The YASEP is a RISC-inspired architecture and requires a new approach. What I have found requires just 3 new opcodes :
  1. IPC : InterProcess Call
  2. IPE : InterProcess Call Entry
  3. IPR : InterProcess Call Return
Since the YASEP has a bank of several threads in the register set, the context switch is a matter of a few cycles only. One way to further reduce the execution time is to pre-calculate the destination address of the called code : no call table or things that require several chained/dependent memory accesses. In order to obtain the jump address, a thread must register itself in the called process and obtain the context number and the effective address. The calling thread can then modify its own code (update the constants) or variables to make the proper IPC later. Here is how simple it gets :
     IPC R1, R2    ; call context number R2 at address R1
IPC 1234h, R2 ; call context number R2 at immediate address 1234
Security is a bigger beast, and just changing the TID (Thread ID) value is not a good method. The first big problem is that any code can call any context at ANY address, so a security mechanism is required to block unwanted calls from succeeding. The policies could be arbitrarily complex (depending on the OS strategies) and don't belong in hardware (unlike x86); a software-based authorisation system is preferred (like MIPS !). This is the role of the IPE instruction :
  1. IPE provides the Thread ID and Process ID of the calling thread (it's a kind of GET). From this, the callee can choose to accept or refuse the call, provide a specific service or even choose to not check at all. Any software can create its own policy, call by call !
  2. IPE is NECESSARY for the IPC instruction to complete. If IPC points to an instruction that is NOT IPE, an error is triggered. This prevents all applications from jumping anywhere in any code.
  3. Each thread can restrict the range of callable addresses so calls can't enter data sections. This is the role of additional registers.
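The IPC -> IPE contract of points 2 and 3 can be summarised as a small software model. All the names here (ipc, TrapError, the memory map) are hypothetical; this is only a sketch of the assumed check, not the proposed hardware :

```python
# Model of the IPC -> IPE check (assumed semantics): the call only
# completes when the target is an IPE instruction inside the callee's
# registered entry range; anything else traps.
class TrapError(Exception):
    pass

def ipc(memory, target, entry_lo, entry_hi):
    """memory maps addresses to mnemonics; entry_lo/hi model the
    callee's callable-range registers (point 3 above)."""
    if not (entry_lo <= target <= entry_hi):
        raise TrapError("call outside the callee's entry range")
    if memory.get(target) != "IPE":
        raise TrapError("IPC target is not an IPE instruction")
    return target  # control transfers to the IPE

mem = {0x1000: "IPE", 0x1002: "ADD"}
print(hex(ipc(mem, 0x1000, 0x1000, 0x1FFF)))
```

The acceptance policy itself (point 1) then runs as ordinary callee code after the IPE.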
When the thread calls code from another context at the right address, the register set is preserved (not touched) so the transmission of parameters takes no effort. However several new issues appear.

For example, how can a thread access data from the calling context ? The proposed solution is to give each Address register an attribute : the context number. Upon call, the newly spawned process will modify the necessary attributes to access both the current and the calling process. This means that all the previous contexts must be kept in the processor (since inter-thread calls must be reentrant). Before the call, the calling process should mark the memory ranges it accepts to share with the called process (marking the range as "shareable"). This way, no data copy is necessary !

The return address and thread/process/context IDs must be managed by the CPU core itself to prevent tampering by the caller or callee. This is the last point that needs some big work and HW real estate... A classic stack, with a stack pointer, stack base and stack limit, is a necessary hardware resource to add.

So let's sum up the added hardware :
  • Each context must be able to mark memory ranges as data-read and/or data-write by other threads. This can be indicated by flags for each page in the page table. How this can be restricted to certain threads (that are in the call stack) is still uncertain, a token scheme should be created where a permission can be passed to (and inherited from) another thread.
  • Each context has 2 registers that are compared to the called address to restrict unwanted calls.
  • Each process has a set of 3 registers for the IPC stack (pointer, base, limit). Pointer and limit are compared for equality upon call and pointer and base are compared during return.
  • There are also 5 new thread-private registers that determine the owner (thread number) of a pointer. They must be preserved in HW if the caller or callee are not trusting or trusted.
That makes about 10 new registers ! How this will be implemented is still uncertain. Maybe a hardcoded sequence of instructions will be streamed through the instruction decoder, unless everything is done in parallel in big enough chips. This reminds me that in the past I wanted to add "attributes" to the address generators of the VSP, with base/index/limit/stride; now there is the context number, which is a kind of "address space number" (ASN). We can finally merge these ideas, and in 16-bit code we can use ASNs like segments on x86 : one for executable data, one for the stack, several for data, and no opcode prefix is needed.

Whatever the implementation, we're going here from a system initially designed for libraries and system calls, extended to the next level : a micro-kernel oriented architecture where processes can share memory they own so others can work on it, with little overhead. Will the Hurd people be finally happy now ?

Saturday 4 April 2009

First details of the new "extended" long instruction

A previous post summarised the available "instruction forms", with or without an immediate field (4 or 16 bits), with 2, 3 or 4 register addresses. Here we look at the "long form" (32-bit) using the "extended" fields that add 2 register addresses, conditional (speculative) execution and pointer updates.

Let's now examine the structure of the 16 bits that are added to the basic instruction word :

  • One bit indicates if the source is Imm4 (it replaces the corresponding field in the basic instruction).
  • 2 bits indicate a condition (LSB, MSB, Zero, Always) and another bit negates the condition (the condition "never" will be used later but I'm not sure how).
  • 4 bits indicate which register is being tested
  • 4 bits indicate the destination register (replacing the src/dest field in the basic instruction)
  • 2 fields of 2 bits each encode the auto-update functions of one source register and the destination register (nop, post-inc, post-dec, pre-dec)

These fields are mostly orthogonal and can work in almost any combination. One can auto-update 2 registers (whether they are normal or belong to a memory access register pair), perform a 3-address operation and enable write-back depending on 7 conditions. It also preserves the availability of short immediate values, which further reduces code size. However, it can increase the core's complexity.
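For illustration, here is how the 16 extended bits could be unpacked. Only the field widths come from this post; the bit ordering chosen here (LSB first) is an assumption :

```python
# Unpacking the extended half-word into the fields listed above.
# Field widths are from the post; the bit layout is assumed.
def decode_ext(ext):
    fields = {}
    fields["imm4_flag"]  = ext & 1;       ext >>= 1   # source is Imm4
    fields["condition"]  = ext & 0b11;    ext >>= 2   # LSB/MSB/Zero/Always
    fields["negate"]     = ext & 1;       ext >>= 1
    fields["test_reg"]   = ext & 0b1111;  ext >>= 4   # register being tested
    fields["dest_reg"]   = ext & 0b1111;  ext >>= 4   # destination register
    fields["src_update"] = ext & 0b11;    ext >>= 2   # nop/post-inc/post-dec/pre-dec
    fields["dst_update"] = ext & 0b11                 # same codes, destination
    return fields

print(decode_ext(0xFFFF))
```

The widths add up to exactly 16 bits (1+3+4+4+4), which matches the added half-word.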

One unexpected bonus is that this new architecture iteration is more compiler-friendly. At least, it's much less awkward or embarrassing.

One bit could have been saved : the imm4 flag could be merged into the auto-update field of a source register. However, this would increase the logic overhead and prevent the simultaneous use of auto-update AND imm4.

Stay tuned...

Yet another Instruction Set Architecture change

I wish it could stabilize soon, but at least movement is a sign of activity (or the reverse :-))

I was annoyed by the ASU operations :

  ADD, SUB, ADDS1, SUBS1, ADDS2, SUBS2, MIN, MAX

These instructions were the last ones that used the skip technique, which is being progressively dropped in favor of relative branches implemented by conditional add/sub to the PC register.

How is it possible to provide the same functionality without skip ? It's the same old question that decades of research have not yet answered definitively. The Carry Flag is the obvious solution, but I have just dropped the "status/mode register" in favor of another general-purpose register. So where can I find a stupid bit of room ?

The answer is there under my eyes : the LSB of the PC ...

OK OK I know it's ugly. But consider these aspects :

  • The PC points to the next instruction and never uses the LSB because all the YASEP instructions are aligned on 2-bytes boundaries.
  • Any write to the PC register modifies the bits 1 to 31. Bit 0 comes from the ASU's carry output.
  • We can declare that only the ASU operations (or context changes) can change the PC's LSB. All the other instructions can read and test it, so the information is easily available.
  • Since we dropped the 4 instructions that used skip, these "slots" can be filled by other instructions :
 CMPS, CMPU, SMIN, SMAX

CMPx are just like SUB but don't write the result back. I wish they could set the LSB of any register, but the current architecture doesn't allow this, so please keep the destination field set to PC when encoding the assembly instruction.

3 new instructions deal with signed comparison : CMPS, SMIN & SMAX. They were missing from the previous opcode maps, but the elimination of the skip instructions leaves enough room. I have to update the VHDL now...

  • Keeping the carry bit in the LSB of the PC can have a curious side effect : relative jumps with odd values will make the carry bit ripple into the other bits of the result, so the destination address written to the PC will depend on the value of the carry bit. In practice there is no speed or size advantage (compared to condition codes in the new opcode extension), but the possibility is there...
  • Clearing the carry flag is done with
  CMP Rx, Rx
  • Setting the carry flag is done with
  CMP -1, Rx

(or something like that)

Usually, I would end the post with something along the lines of "this is good and everybody is happy". Now, I feel a bit disappointed that the YASEP looks more like other architectures and has fewer distinguishing features. It is less groundbreaking and it will have to face the same problems as the others, on top of its inherent quirks. But it's still better than nothing and I do my best to keep the system rather coherent and orthogonal.

Tuesday 6 January 2009

Evolution of the instruction set

As the execution units mature and get integrated into one block, things become clearer, at least concerning the computation instructions. I'm currently focusing on the 16-bit flavour of the YASEP and I expect that the following will hold true for YASEP32.

The ALU16 is nearing completion, though feature creep is still rampant. But I have identified a bunch of instructions that will not change much in the future, and they are gathered here :

- ROP2 : AND, OR, XOR, ANDN, ORN, XNOR, NAND, NOR
- ASU : ADD, SUB, ADDS1, SUBB1, ADDS2, SUBS2, MIN, MAX
- SHL : SHR, SHL, ROR, ROL, SAR  + MUL : MUL8L, MUL8H, MULINIT
- IE : MOV, SB, LSB, LZB (16/32b) SH, SHH, LSH, LZH (32 bits only)

This nice and square table represents the large majority of the instructions in use, and it fits into 4 groups of 8 instead of the planned 8 groups. So...

This saves a bit, which is used to encode other addressing modes. In 2008 there were 2 modes : short mode (RR) and long mode (RRImm16). Now it is also possible to encode a short immediate in the short mode (RImm4, the register is replaced by a value), or to use another register as the destination in the long mode (but 12 bits are unused).

Yes, there are now 4 addressing modes and most code should see its binary size shrink ! Furthermore, the datapath complexity is not impacted and the 3-register version should reduce the number of cycles for a given portion of code.

How this affects usual code :

- add 1, r1 ==> r1 += 1

now takes 2 bytes instead of 4. The constant can range from -8 to +7.
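The -8 to +7 range comes from sign-extending the 4-bit immediate field. A quick Python sketch of that extension (standard two's complement, not YASEP-specific code) :

```python
# Sign extension of the 4-bit immediate field (two's complement),
# giving the -8..+7 range mentioned above.
def sext4(imm4):
    imm4 &= 0xF
    return imm4 - 16 if imm4 & 0x8 else imm4

print([sext4(v) for v in (0x0, 0x7, 0x8, 0xF)])
```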

- add r1, r2, r3 ==> r1 = r2 + r3

It takes 4 bytes as previously but it saves 1 clock cycle, compared to

- mov r2, r1
- add r3, r1

Note that the yasep.org site is not yet updated, I'll wait until things settle down.

Monday 18 August 2008

New register organisation

The architecture of YASEP is very unorthodox. It is a living experiment and evolves in many unexpected directions.

However, one known uncertainty has always been how to implement the instruction fetch mechanism. The memory queues have been a guideline but no organisation has yet been tested and validated. Back when the VSP emerged from the chaos of my brain, I wanted to use one of four queues to fetch instructions, and to indicate the current queue in the 2-bit CQ register. This idea was already implemented in the RCA 1802 processor (the 4-bit P register) but it adds some overhead (and the YASEP's instruction stream was never meant to be ultracompact).

Funny : I find more and more common traits between YASEP and 1802 :-)

The CQ register (just like the COSMAC's P register) also slows down the core, as a whole cycle is necessary to fetch the opcode from the queue. This goes against the idea of a pipelined processor : a pipeline implements a sequential principle, and sequences occur a lot in an instruction flow.

However, the availability of several queues as potential pre-cooked jump destinations (address as well as corresponding data) is very interesting, so this remains in the YASEP architecture. A jump instruction with a direct immediate address remains possible, but with some (future and planned) architectures there is the risk of a high execution latency.

I recently came to the conclusion that a compromise between the completely weird and the classical approaches is necessary.

So I keep the memory queues but the first one is modified and assigned to the instruction pointer and a status register. I had sworn that I would never do that, but I'm forced to admit that in a sense, and in the current situation (where no cache can support parallel memory accesses) something "looking like that" is necessary. And I'll do my best to avoid the inherent traps !

First, why do I need registers #0 and #1 to hold these values ? In the currently planned first implementation, I can use a bank of 512 registers, or 32 banks of 16 registers. This means that context swapping can be very fast (1 major cycle) and I need to save a lot of information at once. If this information were stored in the SR space (as previously planned), more cycles would be needed to save/restore the "whole" context. So the best place to store this critical information is in the register set itself. I could have chosen to create another parallel register bank but this would consume too much memory. The availability of the "Current/Next IP" is also very useful for computing addresses in position-independent code.

So the new register map is :

0h: IP (replaces A0)
1h: ST (replaces D0)
2h: A1 \ Q1
3h: D1 /
4h: A2 \ Q2
5h: D2 /
6h: A3 \ Q3
7h: D3 /
8h: A4 \ Q4
9h: D4 /
Ah: A5 \ Q5
Bh: D5 /
Ch: R0
Dh: R1
Eh: R2
Fh: R3
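For reference, the map above can be expressed as a small decode helper. The dictionary simply transcribes the table (with the Q4/Q5 pairs named A4/D4 and A5/D5 for consistency with Q1..Q3) ; the helper functions are my own naming, for illustration only.

```python
# Decoding of a 4-bit register field according to the map above.
REGISTER_MAP = {
    0x0: "IP", 0x1: "ST",
    0x2: "A1", 0x3: "D1",
    0x4: "A2", 0x5: "D2",
    0x6: "A3", 0x7: "D3",
    0x8: "A4", 0x9: "D4",
    0xA: "A5", 0xB: "D5",
    0xC: "R0", 0xD: "R1", 0xE: "R2", 0xF: "R3",
}

def is_queue_register(n):
    """Registers 2h..Bh are the address/data pairs of queues Q1..Q5."""
    return 0x2 <= n <= 0xB

def queue_number(n):
    """Queue index (1..5) owning a given queue register."""
    assert is_queue_register(n)
    return (n - 2) // 2 + 1
```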

Second: what does the Status Register contain ? Of course, I avoid storing the carry flags and such. But I can't avoid the auto-update bits of the 5 remaining queues. They use 2x5 bits, leaving 6 bits unaffected for now (for how long ?). These two bits per queue encode the following modes :

00 : no update
10 : post-incrementation
11 : post-decrementation

2 queues are able to implement a normal stack (LIFO) and 2 additional bits represent this ability. So Q4 and Q5 have the following properties in the Status Register:

bit N   : update on/off
bit N+1 : update up/down
bit N+2 : stack on/off (pre/post modification)
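A possible packing of these fields looks like this. The bit positions are an assumption of mine, chosen for illustration ; only the field widths come from the text (2 update bits per queue for Q1..Q5, plus one stack bit each for Q4 and Q5).

```python
# Hypothetical Status Register packing -- bit positions are assumed.
NO_UPDATE = 0b00
POST_INC  = 0b10
POST_DEC  = 0b11

def encode_sr(update_modes, q4_stack=False, q5_stack=False):
    """update_modes: dict {queue_number: mode} for queues 1..5."""
    sr = 0
    for q in range(1, 6):
        mode = update_modes.get(q, NO_UPDATE)
        sr |= mode << (2 * (q - 1))   # assumed: 2 bits per queue, Q1 at bit 0
    sr |= int(q4_stack) << 10         # assumed positions for the stack bits
    sr |= int(q5_stack) << 11
    return sr

def queue_mode(sr, q):
    """Extract the 2-bit auto-update mode of queue q (1..5)."""
    return (sr >> (2 * (q - 1))) & 0b11
```

This uses 12 of the 16 bits, which matches the count in the text : 10 update bits plus 2 stack bits, with 4 bits still free.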

Third: these are not real registers in the usual sense. They are "shadowed" registers, with a physical instance copied somewhere else. This is necessary because the register set can't have enough ports, and these 2 specific registers are critical and accessed every cycle. Their incorporation in the register map makes them easily persistent through context switches and IRQs, as well as easily alterable (without going through get/put instructions), but the register bank is updated only when these new registers are accessed. Some new datapaths must be reserved for them.

sigh...

This means that most of the opcode map (the part with the jump instructions) must be redesigned.

re-sigh...

Tuesday 5 August 2008

More flexibility and options for YASEP

After some deliberation with me, myself and my other instances, we came to the conclusion that it would be almost costless to make a 16-bit version of the YASEP architecture. In fact, only a few modifications are necessary to adapt the current 32-bit core to 16-bit operation. I can even foresee a 16-bit "compatibility mode" where a 16-bit program/thread executes on a 32-bit core.

What is the difference ? Essentially, besides the smaller registers (smaller numbers can be computed), the memory access method needs to be adapted. Pointers are limited to 16 bits, so instead of "segments" I'll use 4KB pages (protected memory becomes a natural extension). The 64KB addressable range is split into 16 pages, and each page can be given a base address (on a 4KB boundary). A 16-bit core can then access up to 256MB of RAM (the limit of YASEP).
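The translation is simple enough to sketch : the top 4 bits of a 16-bit address select one of 16 page registers, each holding a physical base aligned on a 4KB boundary inside the 256MB physical range. The function below is my own illustrative model, not the hardware datapath.

```python
# Sketch of the 16-bit paging scheme described above (assumed layout).
PAGE_SHIFT = 12                  # 4 KB pages
PHYS_LIMIT = 256 * 1024 * 1024   # 256 MB, the YASEP physical limit

def translate(page_table, vaddr):
    """page_table: 16 physical base addresses, each 4 KB-aligned."""
    assert 0 <= vaddr < 0x10000              # 16-bit virtual address
    page = vaddr >> PAGE_SHIFT               # top 4 bits: page number 0..15
    offset = vaddr & ((1 << PAGE_SHIFT) - 1) # low 12 bits: offset in page
    base = page_table[page]
    assert base % (1 << PAGE_SHIFT) == 0 and base < PHYS_LIMIT
    return base + offset
```

The 64KB window is thus 16 independent 4KB views into physical memory, which is why a 16-bit core can still reach the full 256MB.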

Such a small YASEP is suitable for smaller FPGAs and for cases where even less horsepower is necessary. For "efficient" implementations, the pipeline remains the same and several threads are interleaved as before ; the speed and code density will be similar. Remove the pipeline gates and you get a slower but smaller core for typical microcontroller applications.

The structures and the instruction set remain untouched. The SHH instruction will be useless (or not), but it's apparently the only exception. The same tools will generate the executable bytestream, and the same source code could be used for 16-bit and 32-bit targets (to some extent).

The physical addresses and registers are the same too, so 16-bit and 32-bit "threads" can coexist/coexecute. Byte addressability works on the same principles (it's just implemented a bit differently).

It all looks promising and I'm updating my VHDL code now.

Another big modification will be the support for "short" immediate fields (Imm4) in place of the SRC (register number) field. I'll see about that later.