View unanswered posts | View active topics It is currently Thu Mar 28, 2024 12:33 pm



Reply to topic  [ 305 posts ]  Go to page Previous  1 ... 8, 9, 10, 11, 12, 13, 14 ... 21  Next
 74xx based CPU (yet another) 
Author Message
User avatar

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
Hi Ed,

Just a couple of examples of what the SELCC and SETCC instructions can do
Code:
int sftest( int x, int y, int a, int b )
{
  if ( a >= 8 )
  {
      x = x - a;
      y = - b;
  }
  return x+y;
}

CPU74
Code:
sftest:
   push   r4
   ld.w   [SP, 6], r2    // load b from the stack
   neg   r2, r2          // -b goes to R2
   ld.w   [SP, 4], r3    // load a from the stack to R3
   neg   r3, r4           // -a goes to R4
   cmp   r3, 7            // compare a with 7
   selgt   r2, r1, r1     // select -b or y depending on a>7 into R1 which is the new y
   mov   0, r2            // temporary constant zero
   selgt   r4, r2, r2      // select -a or 0 depending on a>7 into R2
   add   r2, r0, r0       // add R2 with x and store in R0 which is the new x
   add   r0, r1, r0       // add x to y
   pop   r4
   ret



Code:
int testX16(  int a, int b, int *c )
{
  *c = a==b ? 2 : 3;
  return a==b;
}


CPU74
Code:
testX16:
   cmp   r0, r1         // compare a with b and set flags
   mov   3, r0          // set temporary values in /same/ registers
   mov   2, r1
   seleq   r1, r0, r0   // select R0 to either 3 or 2 depending on comparison above
   ld.w   [SP, 2], r1  // load c pointer from the stack (this is as per the calling convention)
   st.w   r0, [r1, 0]  // store R0 to the address pointed by c
   seteq   r0           // the condition code is still valid at this time, so just set the return value R0 to 1 if the comparison at the start of the function resulted 'equal'.
   ret


The above code shows that no branch instructions have been used, despite the clearly explicit C code suggesting branches. By combining the SETCC and SELCC instructions in proper ways the compiler is able to avoid any branches altogether.

Other interesting examples is the management of 32 bits comparisons.

Code:
int cmpeq32(long a,  long b) {
  return a==b;

}
int cmpgt32(long a,  long b) {
  return a>b;
}


Code:
   .globl   cmpeq32
cmpeq32:
   ld.w   [SP, 4], r2   // load b high word into R2
   xor   r1, r2, r1      // xor with high word a
   ld.w   [SP, 2], r2   // load low word b into R2 (resuse register !)
   xor   r0, r2, r0      // xor with low word a
   or   r0, r1, r0      // or the two xor'ed resuls
   seteq   r0            // set result to 1 if zero (equal flag)
   ret

   .globl   cmpgt32
cmpgt32:
   ld.w   [SP, 2], r2   // load low word b into R2
   cmp   r0, r2          // compare with low word a
   setugt   r0          // set R0 to 1 if it was /unsigned/ greater than (low words comparison result)
   ld.w   [SP, 4], r2   // load high word b into R2 (reuse register)
   cmp   r1, r2          // compare with high word a
   setgt   r1            // set R1 to 1 if it was /signed/ "greater than" (high words comparison result)
   seleq   r0, r1, r0    // now select either the high word or low word comparison result depending on whether the high word comparison was "equal"
   ret


I think all examples are interesting but note that on the last example the complier even uses conveniently a different set of flags, "greater than" and "equal to", that come from the same compare instruction, which is quite remarkable (although possibly the same would be generated on a fully predicated system)

Also it's interesting to note that most of the above is just the way this compiler behaves, there's really not much explicitly done by me to get that kind of output code. My implementation around the compiler consists on telling the compiler that the CMP instruction (as well as the ADD, SUB, AND, OR, XOR) generate an 'implicit' result which is the Status Register. Similarly, the instructions (BRCC, SETCC, SELCC, ADDC) use such implicit result (the Status Register) as one of their operands. The compiler selects machine instructions based on what's available, and tries to create the best possible output by combining the given instructions. It's easier said than done, and in some cases you must 'help' the compiler (as it was the case of my previous example with the shifts), but the general idea is pretty always the same.

I suppose that a fully predicated system can potentially find additional opportunities to remove branches (after all every single instruction turns into a short branch if we look at it this way), and that's what the ARM architecture has. But alas, it uses 4 bits of the encoding only for that, which is only possible because instructions are 32 bit long. On the more restricted Thumb set predicates were removed, as the entire set must fit in 16 bit encodings...


Fri Aug 16, 2019 11:42 am
Profile

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
The select instruction is rather nice - does any other machine have it?


Fri Aug 16, 2019 12:12 pm
Profile

Joined: Wed Apr 24, 2013 9:40 pm
Posts: 213
Location: Huntsville, AL
Very nice indeed.

Not really possible with a two address machine, but your three address machine provides the opportunity to implement the trigraph, (cond ? true : false). This is one of my favorite constructs in C/C++ and Verilog.

It certainly appears to make the generated code free of a bunch of short branches which can typically be expected to sap performance due to pipeline flushes.

_________________
Michael A.


Fri Aug 16, 2019 1:55 pm
Profile
User avatar

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
BigEd wrote:
The select instruction is rather nice - does any other machine have it?

It's similar to one of the native LLVM IR (intermediate representation) instructions, which is why the compiler is very so happy with my backend having it.

- The LLVM IR has a Select instruction but it does not take any status register flags but the boolean value of a register to decide what to select. I.e instead of a set of condition codes as input, it has a register containing 0 or 1. The LLVM IR does not have condition codes or Status Register. The Compare instruction incorporates the condition as an immediate operand (for example less than) and sets true or false as the result. The Set instruction does not exist because that's just the output of the Compare instruction. The conditional Jump instructions also take a boolean value to determine whether to jump or not. It's all pretty higher level compared with actual hardware based processors.

Some real processors (I am aware of) having something similar to SET or SELECT instructions are the following:

- The MIPS and RISCV isas (are these real processor?) do not have Status Register, so all conditional execution is performed though instructions incorporating both comparison operands along with the comparison code. It's like folding CMP and BRCC instructions in a single instructions and removing the need for Status Flags. However they have a form of conditional Set and t MIPS has a conditional MOVE. Neither of them have a full Select because the conditional move is rather a predicated instruction than a select instruction.

- The X86-64 (X86-32 ?) have the more similar instructions. Condition Codes are set for most ALU operations as well as the CMP instruction, so obviously the processor has Status Register(s). This processor has Jcc (Jump Conditional), SETcc (Set conditional), and CMOVcc ( Move Conditional). The latter is not totally a genuine SELECT instruction because the x86 instructions only have 2 operands: the destination is constrained to be one of the source operands. This makes that in practice it acts as a predicated move, not an actual select.

So if one of the operands of my SEL instruction is the destination, these two instructions are pretty equivalent

(x86) CMOVGE R1, R2
(cpu74) SELGE R1, R2, R2

I found this that lists all the conditions and addressing modes of the x86 https://www.felixcloutier.com/x86/cmovcc.

Another problem of the constrained destination operand (same destination as one source) on the x86, is that all possible conditions and their opposites must be available, because you can't swap operands to account for an unsupported condition code. I could reduce the expression of condition codes to just 3 bits instead of the normally required 4 because I can play with swapping operands for the unsupported conditions.

Other than that, I am not aware of any other cpu that has a full SELECT instruction the way I implemented it or the way the LLVM IR implements it.


Fri Aug 16, 2019 2:54 pm
Profile

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
Great info - thanks!

Staying close to LLVM's model is an interesting tactical choice which reminds me of the minimal CPU constructed to support an easy port of C... ah yes, Jan Gray's xr16:
viewtopic.php?f=13&t=422


Fri Aug 16, 2019 3:09 pm
Profile
User avatar

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
BigEd wrote:
Great info - thanks!

Staying close to LLVM's model is an interesting tactical choice which reminds me of the minimal CPU constructed to support an easy port of C... ah yes, Jan Gray's xr16:
viewtopic.php?f=13&t=422


Hi Ed,

Thanks for sharing.

This quite scares me. I started by reading the 24 pages short articles series, and I do not understand anything beyond page 2, and there's 22 pages left. I just hope I do not need anything of that :shock: or my project will become stalled just shortly after I'm done with the simulator...


Fri Aug 16, 2019 4:02 pm
Profile

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
If in doubt, sleep on it, and read it again! (You might well not need any of it.)


Fri Aug 16, 2019 4:18 pm
Profile
User avatar

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
BigEd wrote:
If in doubt, sleep on it, and read it again! (You might well not need any of it.)

I'm a software guy, you know... so my hardware goals are pretty simple because just doing something in hardware is already a goal. I just essentially want to get it working reliably, with an as clean as possible design, but not much more. No pipelining or other complexities. I keep the contact of forum member "Drass", who successfully implemented a very nice and powerful 74xx based 6502. He explained me the tools and methods he used, which were not that sophisticated for what I can tell, and yet he managed to obtain what I regard as an outstanding design. That makes me think that I will not need to actually enter into the world of FPGAs, which is what could set me back, I think. He also taught me some essential concepts I needed to get started. I'm sure he will continue answering my questions and assessing me as per his available time. And I purchased some books too on basic CPU design, that I mostly understand. So I should not be that worried, I suppose. Interesting times ahead for me.


Fri Aug 16, 2019 5:30 pm
Profile
User avatar

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
As I'm working on the simulator and realising in a more direct way what's required to execute all instructions, I am now pondering whether my earlier decision to avoid multiple shift instructions was actually the best choice.

In effect, I have been introducing special instructions to try to overcome the lack of multiple shifts and as a compiler aid to generate denser code. The instructions that were specially introduced for that are the following:

Code:
bswap    // Swaps the high and low bytes of a word.
sextw    // Sign extends word.


The "bwap" instruction helps with long shifts because the instruction can be combined with zext or sext instructions along with the single bit shift instructions. That one would not be required if long shifts are implemented because it is not used by the compiler elsewhere.

The "sextw" instruction is in reality a 15 bit arithmetic shift right, so it would not be required if long shifts are implemented.

These instructions help, but shift amounts of up to 7 bits or up to 14 bits are still expensive, because they still require a large number of single shift instructions in a row.

The positive thing about these two instructions is that they can be implemented in hardware (I think) relatively easily, compared with full word shifts that would require a 4 stages barrel shifter.

Both the above instructions belong to the T10 instruction type, (two register operands, source and destination). Unfortunately, if they are to be replaced by constant shift instructions, there are no available slots for that, I.e., no free slots with a constant immediate to express the shift amount (even considering that only 4 immediate bits would be required for that). So from the point of view of instruction encoding, having long constant shifts is also difficult.

Alternatively, I might create a set of shift instructions to make multiple of two shifts. For example shifts by 1, 2, 4 and 8, so they can be combined by the compiler to optimise long shifts in no more than 4 instructions (actually a sort of software barrel shifter). They could be coded in the T9 type (same source and destination), but again there's not room for that many instructions unless I widen the current 10 bit opcode range, which I do not feel right.

There's yet another alternative, which is leaving the "swapb" (possibly the "sextw" too) instruction, and incorporating "swapn" and "sextn", "zextn", with the "n" suffix standing to 'nibble' (4 bits). That is, instructions that perform nibble swaps and extensions. That's only 3 additional instructions which just fit in the available slots, and they can be combined by the compiler to generate optimised shifts of any size with no more than 4 instructions.

Anyway, some food for thought.


Mon Aug 19, 2019 10:52 am
Profile
User avatar

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
I have the simulator almost complete, just needs some code clean up and final touches (which might take longer than expected, as usual).

The following is the CPU diagram that I consider for the implementation.
The first picture is the full diagram with control lines (or most of them anyway), and the the second one is the basic diagram without control lines

FULL
Attachment:
CPU74DiagramFull.png
CPU74DiagramFull.png [ 60.76 KiB | Viewed 3484 times ]


BASIC
Attachment:
CPU74Diagram.png
CPU74Diagram.png [ 45.32 KiB | Viewed 3484 times ]


The acronyms are as follows:

PAR : Program memory address register
MAR : Data memory address register
MDR : Data memory data register
PC : Program counter
Rx : General purpose register
SP : Stack pointer
IR : Instruction register
MIR : Microinstruction register
SR : Status register

On the diagrams:
- Pale blue lines represent control lines or control logic
- Arrows ending in a white triangle represent buffer or multiplexer controlled transfers
- Arrows ending in a solid black triangle are clock edge controlled transfers

According to this diagram, unless I missed something:
- Instructions involving data memory access, either reads or writes, use 2 cycles
- Instructions using long immediate, 'imml' in the diagram, use 2 cycles
- Unconditional jumps, or conditional jumps taking the branch, use 2 cycles
- The read from program memory instruction uses 2 cycles
- Push, pop from the stack, use 2 cycles
- Call instructions use 3 cycles
- Ret instruction uses 3 cycles
- All the remaining instructions are executed in 1 cycle

The CPU prefetches the Instruction Register, this means that the PC is always one word ahead of the currently executing instruction. The pre-fetching enables loads of long immediates -in the next word- without having to use additional clock cycles, and it should reduce the total time required to execute instructions. The pre-fetchig also implies that all instructions affecting the PC such as jump, call, return, require one additional cycle, but the reduced execution time (faster clock rate) should compensate for that.

I was considering at some time to incorporate a pipelining stage between the Decoder and the ALU. This would enable increasing the clock frequency, but the overall cycles per instruction would slightly worsen, or at best remain essentially the same, so I decided to avoid such complication because increased clock frequencies would have their own separate challenges, which I guess I'd rather leave for a future project.

Joan

Edit: typo in one of the diagrams


Last edited by joanlluch on Mon Aug 26, 2019 5:17 pm, edited 1 time in total.



Thu Aug 22, 2019 9:48 am
Profile
User avatar

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
I just updated the github repo with the source code of the Simulator https://github.com/John-Lluch/CPU74 I tried two slightly different approaches at instruction decoding. First one was a 10 bit hash table fed with the 10 most significative bits of the instructions with no more operations. This is in the file named "machine10b.swift". The second approach is the one that I currently favour. It's made of a pre-decoding pass all done on simple logic gates that converts the 10 bit wide instruction encoding into a 7 bit opcode. This 7 bit opcode is then fed to a table to determine the control signals, or rather the function to be executed in the simulator. This is in file "machine.swift".


Sat Aug 24, 2019 3:05 pm
Profile
User avatar

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
As I am playing with the simulator I start to realise things that went unnoticed before, or that I forgot in the middle of the process.

The most stinging one is that during the early definition of the instruction set, I correctly and consciously avoided the definition of instructions that would perform ALU operations with long immediates. Indeed, such instructions are performance killers because of the pre-fetching nature of the processor. The prefetching implies that:
- Instructions and their embedded fields are available very early at the start of the clock cycle, thus saving total cycle time and enabling faster clock rates.
- Long immediates in the next instruction word are also available on the same cycle, but at a much later time, because they are obtained from program memory.

The above implies that long immediates should never be used in time consuming operations such as the ALU.

As said, I was fully aware of this constraint at the beginning of my design, but for some reason I forgot it at some point and then added the indirect long immediate addressing, which are performance killers based on the above.

I think that now I have two solutions:
- First one is just allow more cycles for the offending instructions through additional microcode steps. Although this may work, I kind of feel that's an inelegant solution. It just defeats the whole purpose of a RISC processor, which essentially is executing simple instructions in a shorter time.
- The second approach is just removing these instructions and let the compiler use long immediate loads to registers when required. I'm not totally happy with that either because that will increase register pressure, but I think that this is more coherent with the initial goals.

So, now removing these instructions is not as easy as just removing them... It has some implications on the way the compiler should generate function prologue/epilog code or stack frame accesses because there may be cases were temporary registers are required, and this happens after physical register allocation has already run, which is a tricky thing to solve...

Anyway, it has been today a rather productive afternoon on the simulator, and I guess it's time now for some rest before I take a final decision.

(I updated the CPU diagrams that I posted earlier, with more detail in the decoder section and my newly proposed treatment of the long immediates. The change doesn't affect the already posted text, so that's left unchanged)


Mon Aug 26, 2019 6:19 pm
Profile

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
Interesting predicament. Would it be enough to set aside a single register for constructing long immediates? That would presumably have minimal effect on register pressure.


Mon Aug 26, 2019 6:31 pm
Profile

Joined: Wed Apr 24, 2013 9:40 pm
Posts: 213
Location: Huntsville, AL
Joan:

I don't think that pure RISC should be an absolute goal. It's been noted in many cases that RISC carries with it some issues that are sometimes best solved with some slight variations. I think that you're better off balancing the work your compiler and the HW implementation of your architecture have to do. Some warts are a natural part of most ISAs.

Even the most RISCy of modern commercial architectures, the ARM processors, have deviated considerably from a pur RISC architecture. Most readers of your architecture would not really notice or care about the deviation from a pure RISC architecture that you describe above. They will certainly appreciate the balance that you appear to have achieved with the instruction set and the code generation of the compiler.

For a project like yours, some compromises are more than acceptable, they are almost a requirement, in order to complete the project in a reasonable amount of time and effort. The project can't become another job. However, since this is your project, you're entitled to make it as "pure" as you want it to be. :)

To follow up on Ed's suggestion, perhaps you can consider the Operand register found in the Inmos Transputer. It functions to construct wider immediate values to be used by the following instruction. (In the Transputer, it is also used to extend the instruction set's opcode space.) I used such a mechanism in the implementation of my MiniCPU FPGA core. It uses a Positive Prefix instruction, and a Negative Prefix instruction. These two instructions build up a 16-bit constant in the operand register, K, 4-bits at a time. The register is automatically cleared on the following non-prefix instruction. An immediate field in the following non-prefix instruction will complete the construction of a 16-bit constant.

The assembler can easily expand an instruction with an immediate value into 1,2, or 3 prefix instructions followed by the immediate mode instruction. This approach creates a variable length instruction format, but it preserves the single cycle execution of each instruction, and most immediate constants will be either 4-bits or 8-bits in length. Your architecture already supports the realization that most immediate values are small. Extending it with a Transputer like Operand register adds a small wart, but its a RISCy wart. :)

_________________
Michael A.


Tue Aug 27, 2019 2:09 am
Profile
User avatar

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
Hi Michael, thanks for your comment

I didn't mean to imply that I want a "pure" RISC processor, and I agree that some 'deviations' are acceptable and even desirable. In fact, I have instructions with an operand in the second word, which some RISC processors don't even allow in order to simplify pipelining. I refer to the rather academically defined MIPS and RISC-V processors. I suppose we would have to define what RISC actually means but I suppose these two processors are the ones that adhere to it to the fullest.

You are making a good point about the Inmos Transputer approach that you adopted. I wasn't aware of this approach in the exact way that you described with the use of the 'operand' register, and that's something that made me think... The above mentioned "pure" RISC processors have instructions to load "half" words into the upper and lower halves of registers, so two consecutive instructions load the full word. The goal is the same: to keep constant length encodings while being able to use long immediates. Since instructions that perform ALU operations with regular registers must be there anyway I suppose that the 'operand' register is avoided in these cases, but I understand that the 'operand' register approach saves instructions (and cycles)

If you think on it, my approach is even less RISCy than any of the above, because I allow a full word immediate in the second instruction word. I could have implemented a couple of instructions to load half immediate words into the upper and lower half of registers, but I decided to avoid them after learning about the way the 6502 fetches operands, which I thought I could adopt. The same operative is then used for the immediate indirect addressing instructions.

Now, (I could say unfortunately) there's yet another approach that has returned to my mind, and it is in fact my very first approach when I still had a lot to investigate. I refer to drop altogether the 3-operand instructions, and switch to a purely 2-operand system, but with 16 registers instead of just 8. That suddenly would increment the available encoding slots, with consistent 8 bit immediate lengths for all instructions requiring them. If all immediates are now 8 bits, then the introduction of the 'operand' register behaviour for the upper half word, would be a great benefit, as that would turn 8 bit immediates into 16 bit ones by just prepending an instruction!. The 'operand' register might even be one of the regular ones, and I might use a status register 'flag' to indicate when it is in use, instead of clearing the register when it is not, so that such register can be used normally when no immediate fields are involved :shock:


Last edited by joanlluch on Tue Aug 27, 2019 11:18 am, edited 1 time in total.



Tue Aug 27, 2019 10:42 am
Profile
Display posts from previous:  Sort by  
Reply to topic   [ 305 posts ]  Go to page Previous  1 ... 8, 9, 10, 11, 12, 13, 14 ... 21  Next

Who is online

Users browsing this forum: AhrefsBot and 9 guests


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Search for:
Jump to:  
cron
Powered by phpBB® Forum Software © phpBB Group
Designed by ST Software