YASEP news

To content | To menu | To search

Architecture

Entries feed

Tuesday 8 November 2011

Register Parking

As the YASEP architecture specifies, there are 5 normal registers (R1-R5) and 5 pairs of data/address registers  (A1-D1, A2-D2...) and it's quite difficult to find the right balance between both : each application and approach requires a different optimal number of registers.

When more registers are needed (if you need R6 or R7) then you could assign them to D1 and D2 for example. However you have to set A1 and A2 to a safe location otherwise chaos could propagate in the software. Another issue is that each write to the A registers will update the memory. A similar situation appears if we use the Ax registers as normal registers : each write will trigger a memory read. And in paged/protected memory systems, this would kill the TLB...

This is now "solved" with today's system, which defines hardwired "parking" addresses and internal behaviour (this is still preliminary but looking promising).

  • "Parking" addresses are defined as "negative" addresses (that is : all the MSB are set to 1). This addressing range, at the "top" of the memory space, is normally not used, or used for special purposes, such as "fast constants" addressed by the short immediate values :
    MOV -7, A3 ; mem[-7] contains a constant or a scratch value,
    MOV D3,... ; the address fits in 3 bits
  • To keep the "parking" system compatible with non-parked versions, the addresses are defined globally for all software. They are easy to remember, as the following code shows :
    ; Park all the registers
    MOV -1, A1
    MOV -2, A2
    MOV -3, A3
    MOV -4, A4
    MOV -5, A5
    These will become macros or pseudo-instructions.
  • The internal numbering of the registers is changed to ease hardware implementation. There is a direct match between the binary register number and the binary code of the address (bits 1 to 3) :

    park address  binary    reg.bin       reg.number   register
          -1             1111       1111              15              A1
          -2             1110       1101              13              A2
          -3             1101       1011              11              A3
          -4             1100       1001                9              A4
          -5             1011       0111                7              A5
  • Architecturally, it does not change much. The Data registers are "cached" by the register set. What the hardware parking system adds is just an inhibition of the "data write" signal that would occur normally each time the core writes to a D register.
  • Aliasing : No alias detection is expected. If A4/D4 writes to -2, D2 is not updated. Otherwise it would mean that the result bus could write to 5 registers in parallel, which is not reasonable.
  • Thread backup and restoration : the register set contains the cached version of the memory, it must be refreshed when a thread is restored (swapped in). If the Ax register matches a parked address, the memory doesn't need to be fetched to refresh the cache. Another solution is to save the Dx register through another Ax/Dx, so there is nothing to test during restoration (but memory read cycles could not be spared).
  • This sytem where the "parking" is defined by an auxiliary value (that is inherently preserved through context switches) is "cleaner" than a more radical approach where "status bits" (one per A/D pair) park the registers. The advantage of the radical approach is that two registers can be parked at once (instead of one) but it gets harder to use with a compiler or from user software (you can play with pointers in C or Pascal easily, though you won't be able to define which pair is used). On top of that, adding status/control bits is usually a nightmare
In the end, it's not very complex (not as much as it seems). The hardware price is a few logic gates that detect the parking addresses to inhibit memory writes. For the software writer, it just means more registers on demand and it will work whether the YASEP has the parking hardware or not. You CAN have R6, R7 or R8 but then you'll have to restrict data access and give up A1/D1, A2/D2 and A3/D3. You make the choice !

Sunday 8 May 2011

This little Least Significant Bit

(update : 2011/05/11)

I've been wondering since march of this year if the Least Significant Bit (LSB) of the Next Instruction Pointer (NIP or NPC) could be better used than now.

The YASEP instructions are 16-bits aligned and the instruction addresses have their LSB cleared by convention. This bit is usually wasted in word-aligned byte-oriented computer architectures.

In the current YASEP architecture, this LSB holds the carry flag of ADD/SUB operations. It is the only status flag that I couldn't get rid of with the usual architectural tricks. As a reminder, instructions can check 3 conditions : register is cleared, has its LSB cleared (odd/even) or MSB (sign) cleared. Every condition can be negated and a 4th condition serves as "always" or "reserved" case. Reading the LSB and MSB is easy, checking for a cleared register is more costly. In some implementation, the register set has "shadow" bits with precomputed/cached "register is clear" bits. But otherwise, no dirty trick is employed.

The Carry bit is less easy to handle : it's a dynamic result that can't be reconstructed from the 16 or 32 bits of the registers. It is not possible to restore it after a thread switch. It can't be added to the "condition cache" because it will have to be saved and restored (16 more bits to save ? Bleh...)

Here come the latest changes :

  • The carry bit is now "hidden", not available from the register set for computations (that would make other things more difficult). It exists as a bit that can only be tested via a specific condition code in the conditional instruction forms (certainly one that tests NIP).
  • The LSB of NIP is always cleared. However, when saving/restoring the state in memory, it will hold the carry bit. This is the only case when the two functions (carry and pointer) are mixed.
  • Writing a "1" to the LSB of NIP (other than for saving/restoring the state) triggers a trap. There are several uses :
  1. Breakpointing / tracing / debugging : inject a "1" in the LSB and you can see where the pointer is used.
  2. Safety : for example if the stack is corrupted, there is a chance that the LSB will be set and trigger the trap
In future iterations, this bit could be used for something else more pertinent (such as a second instruction memory bank selector) so it must be carefully handled by programers now.

Saturday 20 November 2010

Fast and secure InterProcess Communications

(post version : 20110108)

(update : 20110515 : environment inheritance)

Recently (2010/11/20) I found the critical elements that solve a crucial problem that the Hurd team submitted to me in ... 2002. It took time and many attempts but I think that the YASEP is a great place to experiment with this idea and prove its worth.

The Hurd uses a lot of processes to separate functions, enforce security and modularize the operating system. It uses "Inter Process Communication" (IPC) such as message passing and this is snail slow on x86 and most other architectures.

The YASEP uses hardware threads which is a concept close, but not identical, to the processes of an operating system. And these last days I have found what was missing : the "execution context" ! So with the YASEP, a process is a hardware thread (a set of registers and special registers) associated to an execution context (the memory mapping, the access rights etc.)

Repeat after me : a process is a thread in a context.

This distinction is necessary because threads are activated for handling interrupts, operating system functions, library function calls and communication between the programs. It's a major feature of the processor which should provide functionalities that go beyond a mere microcontroller...

So IPC is necessary to make a decent OS and it requires several hardware threads (threads can be interleaved at the hardware level to provide with concurrency and better performance) and several contexts (for the operating system, device drivers, libraries, interrupt handlers...). The processor state can jump at will from one to the other with much less latency than an usual CPU.

The antagonistic requirements are as follows :

  1. A process must be able to call code from another context FAST, as fast as possible.
  2. The mechanism must be totally SAFE and SECURE.
  3. The physical implementation must be SIMPLE.

Simple and fast go hand in hand (ask Seymour Cray. Oh, wait, too late...). In the YASEP, communication takes place with a restricted variant of the function call instruction. Function calls are difficult to "harden" and more generic and specific instructions are usually found in other architectures to provide IPC or system calls. These are quite simple to implement in a CISC architecture like x86 because microcodes can do whatever is required... But they are slow because several dependent memory fetches must be performed (read the access rights table then find the address of the code to execute, whatever...)

The YASEP is a RISC-inspired architecture and requires a new approach. What I have found requires just 3 new opcodes :
  1. IPC : InterProcess Call
  2. IPE : InterProcess Call Entry
  3. IPR : InterProcess Call Return
Since the YASEP has a bank of several threads in the register set, the context switch is a matter of a few cycles only. One way to further reduce the execution time is to pre-calculate the destination address of the called code : no call table or things that require several chained/dependent memory accesses. In order to obtain the jump address, a thread must register itself in the called process and obtain the context number and the effective address. The calling thread can then modify its own code (update the constants) or variables to make the proper IPC later. Here is how simple it gets :
     IPC R1, R2    ; call context number R2 at address R1
IPC 1234h, R2 ; call context number R2 at immediate address 1234
Security is a bigger beast and just changing the TID (Thread ID) value is not a good method. The first big problem is that any code can call any context at ANY address and a security mechanism is required to block unwanted calls from succeeding. The policies could be arbitrarily complex (depending on the OS strategies) and don't belong in hardware (unlike x86), a software-based authorisation system is preferred (like MIPS !). This is the role of the IPE instruction :
  1. IPE provides the Thread ID and Process ID of the calling thread (it's a kind of GET). From this, the callee can choose to accept or refuse the call, provide a specific service or even choose to not check at all. Any software can create its own policy, call by call !
  2. IPE is NECESSARY for the IPC instruction to complete. If IPC points to an instruction that is NOT IPE, an error is triggered. This prevents all applications from jumping anywhere in any code.
  3. Each thread can restrict the range of callable addresses so calls can't enter data sections. This is the role of additional registers.
When the thread calls code from another context at the right address, the register set is preserved (not touched) so the transmission of parameters takes no effort. However several new issues appear.

For example, how can one thread in a different context access data from the previous context ? The proposed solution is to provide an attribute to each Address register : the context number. Upon call, the newly spawned process will modify the necessary attributes to access to both the current and the calling process. Which means that all the previous contexts must be kept in the processor (since interthread calls must be reentrant). Before the call, the calling process should mark the memory ranges it accepts to share with the called process (marking the range as "shareable"). This way, no data copy is necessary !

The return address and thread/process/context IDs must be managed by the CPU core itself to prevent tampering by the caller or callee. This is the last point that needs some big work and HW real estate ... A classic stack, with a stack pointer, stack base and stack limit, are necessary hardware resources to add.

So let's sum up the added hardware :
  • Each context must be able to mark memory ranges as data-read and/or data-write by other threads. This can be indicated by flags for each page in the page table. How this can be restricted to certain threads (that are in the call stack) is still uncertain, a token scheme should be created where a permission can be passed to (and inherited from) another thread.
  • Each context has 2 registers that are compared to the called address to restrict unwanted calls.
  • Each process has a set of 3 registers for the IPC stack (pointer, base, limit). Pointer and limit are compared for equality upon call and pointer and base are compared during return.
  • There are also 5 new thread-private registers that determine the owner (thread number) of a pointer. They must be preserved in HW if the caller or callee are not trusting or trusted.
That makes about 10 new registers ! How this will be implemented is still uncertain. Maybe a hardcoded sequence of instructions will be streamed through the instruction decoder, unless everything is done in parallel in big enough chips. This reminds me that in the past, I wanted to add "attributes" to the address generators of the VSP, with base/index/limit/stride, now there is the context number that is some kind of "address space number" (ASN). We can finally merge these ideas and in 16 bits code, we can use ASNs like segments in x86 : one for executable data, one for the stack, several for data, and no opcode prefix is needed.

Whatever the implementation, we're going here from a system initially designed for libraries and system calls, extended to the next level : a micro-kernel oriented architecture where processes can share memory they own so others can work on it, with little overhead. Will the Hurd people be finally happy now ?

Saturday 4 April 2009

First details of the new "extended" long instruction

A precedent post has summarised the available "instruction forms", with or without immediate field (4 or 16-bits), with 2, 3 or 4 register addresses. Here we look at the "long form" (32-bit) using the "extended" fields that add 2 register addresses, conditional (speculative) execution and pointer updates.

Let's now examine the structure of the 16 bits that are added to the basic instruction word :

  • One bit indicates if the source is Imm4 (it replaces the corresponding field in the basic instruction).
  • 2 bits indicate a condition (LSB, MSB, Zero, Always) and another bit negates the result (The condition "never" will be used later but I'm not sure how).
  • 4 bits indicate which register is being tested
  • 4 bits indicate the destination register (replacing the src/dest field in the basic instruction)
  • 2 fields of 2 bits each encode the auto-update functions of one source register and the destination register (nop, post-inc, post-dec, pre-dec)

These fields are mostly orthogonal and can work in almost any combination. One can auto-update 2 registers (whether they are normal or belong to a memory access register pair), perform a 3-address operation and enable write-back depending on 97 conditions. It also preserves the availability of short immediate values, which further reduces code size. However it can increase the core's complexity.

One unexpected bonus is that this new architecture iteration is more compiler-friendly. At least, it's much less awkward or embarassing.

One bit could have been saved : the imm4 flag could be merged in the auto-update field for a source register. However this increases the logic overhead and prevents simultaneous use of auto-update AND imm4.

Stay tuned...

Yet another Instruction Set Architecture change

I wish it could stabilize soon, but at least movement is a sign of activity (or the reverse :-))

I was annoyed by the ASU operations :

  ADD, SUB, ADDS1, SUBS1, ADDS2, SUBS2, MIN, MAX

These instructions were the last ones that used skip technique, since it is progressively dropped in favor of relative branches by conditional add/sub to the PC register.

How is it possible to provide the same functionality without skip ? It's the same old question that decades of research has not yet answered definitively. The Carry Flag is the obvious solution but I have just dropped the "status/mode register" in favor of another general purpose register. So where can I find a stupid bit of room ?

The answer is there under my eyes : the LSB of the PC ...

OK OK I know it's ugly. But consider these aspects :

  • The PC points to the next instruction and never uses the LSB because all the YASEP instructions are aligned on 2-bytes boundaries.
  • Any write to the PC register modifies the bits 1 to 31. Bit 0 comes from the ASU's carry output.
  • We can declare that only the ASU operations (or context changes) can change the PC's LSB. All the other instructions can read it and test it, so the informations is easily available.
  • Since we dropped the 4 instructions that used skip, these "slots" can be filled by other instructions :
 CMPS, CMPU, SMIN, SMAX

CMPx are just like SUB but don't write the result back. I wish it could set the LSB of any register but the current architecture doesn't allow this, so please keep the destination field to PC when encoding the assembly instruction.

3 new instructions deal with signed comparison : CMPS, SMIN & SMAX. They were missing from the previous opcode maps but the elimination of the skip-instructions leaves enough room. I have to update the VHDL now...

  • Keeping the carry bit in the LSB of the PC can have a curious side effect : relative jumps with odd values will make the carry bit ripple to the other bits of the result, so the destination address that is written in the PC will depend on the value of the carry bit. In practice, there is no speed or size advantage (compared to condition codes in the new opcode extension) but the possibility is there...
  • Clearing the carry flag is done with
  CMP Rx, Rx
  • Setting the carry flag is done with
  CMP -1, Rx

(or something like that)

Usually, I would end the post with something along the lines of "this is good and everybody is happy". Now, I feel a bit disapointed that YASEP looks more like other architectures, and has less distinguishing features. It is less groundbreaking and it will have to face the same problems as the others, on top of its inherent quirks. But it's still better than nothing and I do my best to keep the system rather coherent and orthogonal.

Tuesday 6 January 2009

Evolution of the instruction set

As the execution units mature and get integrated as one block, things become clear, at least concerning the computation instructions. I'm currently focusing on the 16-bit flavour of YASEP and I expect that the following will hold true for YASEP32.

The ALU16 is nearing completion, though feature creep is still rampant. But I have identified a bunch of instructions that will not change much in the future, and they are gathered here :

- ROP2 : AND, OR, XOR, ANDN, ORN, XNOR, NAND, NOR
- ASU : ADD, SUB, ADDS1, SUBB1, ADDS2, SUBS2, MIN, MAX
- SHL : SHR, SHL, ROR, ROL, SAR  + MUL : MUL8L, MUL8H, MULINIT
- IE : MOV, SB, LSB, LZB (16/32b) SH, SHH, LSH, LZH (32 bits only)

This nice and square table represents the large majority of the used instructions, and this fits into 4 groups of 8 instead of the planned 8 groups. So...

This saves a bit that is used to encode other addressing modes. In 2008, there were 2 modes : short mode (RR) and long mode (RRImm16). Now, it is also possible to encode a short immediate in the short mode (RImm4, the register is replaced by a value), or use another register as a destination in the long mode (but 12 bits are unused).

Yes there are now 4 addressing modes and most code should feel their binary size shrink ! Furthermore, the datapath complexity is not impacted and the 3-registers version should reduce the number of cycles for a given portion of code.

How this affects usual code :

- add 1, r1 ==> r1 += 1

now takes 2 bytes instead of 4. The constant can range from -8 to +7.

- add r1, r2, r3 ==> r1 = r2 + r3

It takes 4 bytes as previously but it saves 1 clock cycle, compared to

- mov r2, r1
- add r3, r1

Note that the yasep.org site is not yet updated, I'll wait until things settle down.

Monday 18 August 2008

New register organisation

The architecture of YASEP is very unorthodox. It is a living experiment and evolves in many unexpected directions.

However, one known uncertainty has always been how to implement the instruction fetch mechanisms. The memory queues have been a guideline but no organisation has been tested yet, and validated. Back when VSP emerged from the chaos of my brain, I wanted to use one of four queues to fetch instructions, and to indicate the current queue in the 2-bit CQ register. This idea was already implemented in the RCA1802 processor (the 4-bit P register) but this adds some overhead (and YASEP's instruction stream was never meant to be ultracompact).

Funny : I find more and more common traits between YASEP and 1802 :-)

The CQ register (just like the COSMAC's P register) also slows down the core as a whole cycle is necessary to fetch the opcode from the queue. This goes against the idea of a pipelined processor, the pipeline being the implementation of a sequential principle and sequence occurs a lot in an instruction flow.

However, the availability of several queues as potential pre-cooked jump destination (address as well as corresponding data) is very interesting and this remains in the YASEP architecture. A jump instruction with a direct immediate adress remains possible but with some (future and planned) architectures, there is the risk of a high execution latency.

I recently came to the conclusion that a compromise between the completely weird and the classical approaches is necessary.

So I keep the memory queues but the first one is modified and assigned to the instruction pointer and a status register. I had sworn that I would never do that, but I'm forced to admit that in a sense, and in the current situation (where no cache can support parallel memory accesses) something "looking like that" is necessary. And I'll do my best to avoid the inherent traps !

First, why do I need the registers #0 and #1 to hold these values ? In the currently planned first implementation, I can use a bank of 512 registers, or 32 banks of 16 registers. This means that context swapping can be very fast (1 major cycle) and I need to save many informations at once. If these informations are stored in the SR space (as previously planned), more cycles are needed to save/restore the "whole" context. So the best place to store these critical informations is in the register set itself. I could have chosen to create another parallel register bank but this would consume too much memory. The availability of the "Current/Next IP" is also very useful for computing addresses in position-independent code.

So the new register map is :

0h: IP (replaces A0)
1h: ST (replaces D0)
2h: A1 \ Q1
3h: D1 /
4h: A2 \ Q2
5h: D2 /
6h: A3 \ Q3 
7h: D3 /
8h: A3 \ Q4
9h: D3 /
Ah: A3 \ Q5 
Bh: D3 /
Ch: R0
Dh: R1
Eh: R2
Fh: R3

Second: what does the Status Register contain ? Of course, I avoid the storage of carry flags and such. But I can't avoid the auto-update bits of the 5 remaining queues. They use 2x5 bits and 6 bits are unaffected yet (for how long ?). These two bits per queue represent the following codes :

00 : no update
10 : post-incrementation
11 : post-decrementation

2 queues are able to implement a normal stack (LIFO) and 2 additional bits represent this ability. So Q4 and Q5 have the following properties in the Status Register:

bit N   : update on/off
bit N+1 : update up/down
bit N+2 : stack on/off (pre/post modification)

Third: These registers are not really real registers. These are "shadowed" registers with a physical instance copied somewhere else. This is necessary because the register set can't have enough ports and these 2 specific registers are critical and accessed every cycle . So their incorporation in the register map makes them easily remanent through context switches and IRQs, as well as easily alterable (without going through get/put instructions) but the register bank is updated only when these new registers are accessed. Some new datapaths must be reserved for them.

sigh...

This means that most of the opcode map (the part with the jump instructions) must be redesigned.

re-sigh...

Tuesday 5 August 2008

More flexibility and options for YASEP

After some concertation with me, myself and my other instances, we came to the conclusion that it would be almost costless to make a 16-bit version of the YASEP architecture. In fact, only a few modifications are necessary to adapt the current 32-bit core for 16-bit operation. I can even foresee a 16-bit "compatibility mode" where a 16-bit program/thread can execute in a 32-bit core.

What is the difference ? Essentially, beside the smaller registers (smaller numbers can be computed), the memory access method needs to be adapted. There will be a limitation to 16-bit pointers, so instead of "segments", I'll use 4KB pages (protected memory will be logically available as a logical extension). The 64KB addressable range is split into 16 pages, each page can be configured for base address (on 4KB boundaries). A 16-bit core can then access up to 256MB of RAM (the limit of YASEP).

Such a small YASEP is suitable for smaller FPGAs and when horsepower is even less necessary. For "efficient" implementations, the pipeline remains the same and several threads will be interleaved as before. The speed and code density will be similar. Remove the pipeline gates and you get a slower but smaller core for typical microcontroller applications.

The structures and the instruction set remain untouched. The SHH instruction will be useless (or not) but it's apparently the only exception. The same tools will be used to generate the executable bytestream, the same source code could be used for 16-bit or 32-bit targets (to some extent).

The physical addresses and registers are the same too, so 16-bit and 32-bit "threads" can coexist/coexecute. The byte addressability works with the same principles (it's just implemented a bit differently).

It all looks promising and i'm updating my VHDL code now.

Another big modification will be the support for "short" immediate fields (Imm4) in the place of the SRC (register number) field. So i'll see later.