Last visit was: Fri Jun 02, 2023 9:14 am
|
It is currently Fri Jun 02, 2023 9:14 am
|
74xx based CPU (yet another)
Author |
Message |
joanlluch
Joined: Fri Mar 22, 2019 8:03 am Posts: 328 Location: Girona-Catalonia
|
Ok, So these are several compilation examples after the changes that I described on my previous post Example 1: ( the negation) Code: int test( int a ) { return !a; } CPU74 (The compiler will just generate an equal to zero SET instruction) Code: test: ; @test ; %bb.0: ; %entry cmp r0, #0 ; encoding: [0x28,0x00] seteq r0 ; encoding: [0xa2,0x00] ret ; encoding: [0x02,0x00] Example 2 (The testing for less than zero) Code: int test( int a ) { return a < 0; } CPU74 (I now generate a regular SET instruction for this case. No more need for fancy workarounds) Code: test: ; @test ; %bb.0: ; %entry cmp r0, #0 ; encoding: [0x28,0x00] setlt r0 ; encoding: [0xb2,0x00] ret ; encoding: [0x02,0x00] Example 3: Code: int test( int a, int b ) { return a > b; } CPU74 (just straight code again for this case) Code: test: ; @test ; %bb.0: ; %entry cmp r0, r1 ; encoding: [0x00,0x48] setgt r0 ; encoding: [0xbe,0x00] ret ; encoding: [0x02,0x00] Example 4: Code: int test( int a, int b ) { return a <= b; } CPU74 (Less than or equal is not supported so both the compare operands and condition are swapped) Code: test: ; @test ; %bb.0: ; %entry cmp r1, r0 ; encoding: [0x00,0x41] setge r0 ; encoding: [0xb6,0x00] ret ; encoding: [0x02,0x00] Example 5: Code: int test( int a ) { return a <= 3; } CPU74 (Less than or equal is not supported so the constant is added 1 and the condition replaced by a supported one. Edge cases are obviously taken into account) Code: test: ; @test ; %bb.0: ; %entry cmp r0, #4 ; encoding: [0x28,0x20] setlt r0 ; encoding: [0xb2,0x00] ret ; encoding: [0x02,0x00] More complex examples: Example 6: Code: int test( int a, int b, int c, int d ) { return a > b && c > d; } CPU74 (Code is comparatively much shorter than ARM-Thumb as it takes a lot of branching for ARM to complete that) Code: test: ; @test ; %bb.0: ; %entry cmp r2, r3 ; encoding: [0x00,0x5a] setgt r2 ; encoding: [0xbe,0x02] cmp r0, r1 ; encoding: [0x00,0x48] setgt r0 ; encoding: [0xbe,0x00] and r0, r2, r0 ; encoding: [0x8a,0x10] ret Example 7: (The dreaded pointer case) Code: int test( char *c ) { return c && *c == 'A'; } CPU74 (the compiler recognises the dangerous situation and expands the comparison into a branch instead) Code: test: ; @test ; %bb.0: ; %entry cmp r0, #0 ; encoding: [0x28,0x00] breq &.LBB0_2 ; encoding: [0xc0,0x00] ; %bb.1: ; %land.rhs ld.zb [r0, #0], r0 ; encoding: [0x50,0x00] cmp r0, #65 ; encoding: [0x2a,0x08] seteq r0 ; encoding: [0xa2,0x00] ret ; encoding: [0x02,0x00] .LBB0_2: mov #0, r0 ; encoding: [0x20,0x00] ret ; encoding: [0x02,0x00] Example 8: (Two assignations in a single if clause) Code: int sftest( int x, int y, int a, int b ) { if ( a >= 8 ) { x = x - a; y = - b; } return x+y; } CPU74 (The compiler decided that branching is still not worth in this case. Also note that the same condition code is used in two different SEL instructions despite a subtraction in the middle. This is correct because subtraction is not longer affecting comparison flags. Code: sftest: ; @sftest ; %bb.0: ; %entry mov #0, r4 ; encoding: [0x20,0x04] sub r4, r3, r3 ; encoding: [0x85,0x1b] cmp r2, #7 ; encoding: [0x28,0x3a] selgt r3, r1, r1 ; encoding: [0xbc,0xc9] sub r4, r2, r2 ; encoding: [0x85,0x12] selgt r2, r4, r2 ; encoding: [0xbc,0xa2] add r2, r0, r0 ; encoding: [0x80,0x80] add r0, r1, r0 ; encoding: [0x80,0x08] ret ; encoding: [0x02,0x00] The above function can be manually optimised in C code by writing it like this: Example 9: Code: int sftest2( int x, int y, int a, int b ) { if ( a >= 8 ) return x - a - b; return x+y; } CPU74 (This produces shorter code, and still no jumps) Code: sftest2: ; @sftest2 ; %bb.0: ; %entry add r1, r0, r1 ; encoding: [0x80,0x41] sub r0, r2, r0 ; encoding: [0x84,0x10] sub r0, r3, r0 ; encoding: [0x84,0x18] cmp r2, #7 ; encoding: [0x28,0x3a] selgt r0, r1, r0 ; encoding: [0xbc,0x08] ret ; encoding: [0x02,0x00] Introducing some pointer arithmetic causes the compiler to resort to branches for code generation. Example 10: Code: void sftest3( int *x, int *y, int a, int b ) { if ( a >= 8 ) { *x = *x - a; *y = - b; } } CPU74 (The compiler chose to generate a branch instruction for that case, but automatically reversed the branching condition to prevent excessive branching) Code: sftest3: ; @sftest3 ; %bb.0: ; %entry cmp r2, #8 ; encoding: [0x28,0x42] brlt &.LBB1_2 ; encoding: [0xd0,0x00] ; %bb.1: ; %if.then ld.w [r0, #0], r4 ; encoding: [0x40,0x04] sub r4, r2, r2 ; encoding: [0x85,0x12] st.w r2, [r0, #0] ; encoding: [0x60,0x10] mov #0, r0 ; encoding: [0x20,0x00] sub r0, r3, r0 ; encoding: [0x84,0x18] st.w r0, [r1, #0] ; encoding: [0x60,0x01] .LBB1_2: ; %if.end ret ; encoding: [0x02,0x00] I can say that at least for these snippets of code, the CPU74 beats all other architectures. I hope you all find it interesting, and I'm ready to test any code snippet that you may want to try to push the compiler into a stressful situation. Thanks for watching
|
Sat Apr 20, 2019 4:17 pm |
|
 |
BigEd
Joined: Wed Jan 09, 2013 6:54 pm Posts: 1754
|
Certainly is interesting. As you say, there's no +1, but rest assured, there are people reading.
BTW, I have just a little concern about your workaround for missing conditions: > For Branch and Set instructions with constant comparisons I just add 1 to the constant at compile time and replace the unsupported conditions by 'less than' or 'unsigned less than' respectively, which are both supported.
I haven't quite the clarity of mind to think it through, but are there edge cases near 0 or -1 where you can't add 1 without changing the meaning?
|
Sat Apr 20, 2019 5:06 pm |
|
 |
MichaelM
Joined: Wed Apr 24, 2013 9:40 pm Posts: 213 Location: Huntsville, AL
|
I too am reading along.
_________________ Michael A.
|
Sat Apr 20, 2019 5:08 pm |
|
 |
joanlluch
Joined: Fri Mar 22, 2019 8:03 am Posts: 328 Location: Girona-Catalonia
|
BigEd wrote: Certainly is interesting. As you say, there's no +1, but rest assured, there are people reading.
BTW, I have just a little concern about your workaround for missing conditions: > For Branch and Set instructions with constant comparisons I just add 1 to the constant at compile time and replace the unsupported conditions by 'less than' or 'unsigned less than' respectively, which are both supported.
I haven't quite the clarity of mind to think it through, but are there edge cases near 0 or -1 where you can't add 1 without changing the meaning? Hi Ed, thanks for your comment. As a matter of general information, the machine 'CMP' instruction in most architectures do not know in advance how other instructions will latter interpret the comparison results, so they just set the Z C V S status flags in the typical way to represent comparison results. I posted the meaning of the several combinations of these flags just a few posts ago, but I assume this is quite a known subject. In my architecture, and many others, instruction embedded constants are interpreted as signed values that must be Sign Extended to the machine register size before they are used by the instruction. There are 8 encoding bits available for the constant in the compare instruction. So as a signed value it can go physically from -128 to +127, although the actual comparison is performed as a 16 bit subtraction. Considering this, 0 or -1 are not edge cases, because these constants can be added 1 without exceeding the range. The first edge case is when there's not enough room on the Constant Comparison instruction for a big constant. As said, there are 8 encoding bits available for the constant. If the constant (as a signed integer) does not fit in such 8 bits, then an intermediate register must be used, which is previously loaded with a full 16 bit value, and Register to Register Compare instruction is used instead. Once we resorted to use a Register to Register Comparison, there's no longer the need to increment a constant. For Register to Register comparisons the strategy is to swap the operands on the comparison instruction and to use swapped test conditions in any subsequent instructions that make use of the comparison results. The operator and condition swapping does not create any further edge cases. I hope this makes sense. I have actually tested all edge cases that I could think of and I found it to generate correct code (fingers crossed)
Last edited by joanlluch on Sat Apr 20, 2019 7:15 pm, edited 1 time in total.
|
Sat Apr 20, 2019 7:04 pm |
|
 |
BigEd
Joined: Wed Jan 09, 2013 6:54 pm Posts: 1754
|
Ah, thanks, yes, the extra width you have will avoid any potential problem.
BTW, I mean to say, that's an interesting idea to have a pair of status registers. But I did expect to see a V in the arithmetic register.
|
Sat Apr 20, 2019 7:06 pm |
|
 |
joanlluch
Joined: Fri Mar 22, 2019 8:03 am Posts: 328 Location: Girona-Catalonia
|
BigEd wrote: Ah, thanks, yes, the extra width you have will avoid any potential problem.
BTW, I mean to say, that's an interesting idea to have a pair of status registers. But I did expect to see a V in the arithmetic register. Yes, I suppose it can be added if I find that it's required. I must admit that I still have some confusion about how to deal with (or what to do) with the Arithmetic Status Register. I can clearly see the carry flag in use for type-promoted arithmetic calculations, let's say we want to ADD two 32 bit integer values, this shall be done by adding the lower 16 bit words together then use the ADDC (ADD with carry) instruction for the two higher words. Similarly, we may have to use the carry flag in some way to compute promoted-type shifts. But beyond that, I'm confused about the actual utility of the remaining flags in the context of only arithmetic operations. That's something I still need to look into.
|
Sat Apr 20, 2019 7:43 pm |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 1947 Location: Canada
|
Nit-picking a bit here. I just noticed in the instruction formats P4:T10 that the immediate constant is in a different position than for formats T7 to T9. It might be better to keep the constant in the same place in order to save some hardware multiplexors. Also there is an op field for add/sub sp. Since only a single add instruction is really needed (assuming a sign extended constant), this format could be used for other instructions.
_________________Robert Finch http://www.finitron.ca
|
Sun Apr 21, 2019 3:13 am |
|
 |
joanlluch
Joined: Fri Mar 22, 2019 8:03 am Posts: 328 Location: Girona-Catalonia
|
Hi Rob, Thank you for that observation. You are totally right!. However, there's not an easy way to move such 'constant' fields all on the same place because the instruction encoding is already very compressed. If I move the constant on the T10 template to the position of the T7-T9 templates, then I kill a lot of encoding slots up to the point that the current instruction set does no longer fit in the remaining ones. Another possibility is to move all constants to the right, and then allow variations on the register encoding positions. So instead of having always the registers in the same places, we have always the constant in the same places (But not both at the same time) However, maybe it's still possible to have always both the constants and the registers in the same exact places, by allowing the instruction encoding to be distributed along the free positions, ie, not in contiguous places, as they are now. Not sure if this can still guarantee the absence of duplicates. My guess, is that having it all is not possible because otherwise companies like AMD would have already figured out a way. Look for instance to the ARM-Thumb encoding. It has a quite very well designed encoding, all constants are more or less on the same places, but still the 3th, 11th and 12th patterns have the register in an unique place. I would have almost the same situation, if I just move all the constants in T7 though T9 to the least significant bits. The key think for me to ask is this: what configuration might be easier to decode? As this is not in my field of expertise I have a difficult time to try to figure it out by myself... About the ADD, SUB instructions, what I have seen in some architectures is that they tend to interpret the constants as unsigned values. Normally, signed constants are used on load instructions and addressing mode offsets, but unsigned ones are used on sub/add instructions. We could certainly eliminate the constant sub instruction, but if we keep the same constant field length and remove the sub instruction, this reduces by half the available range of constant values (either in the positive or in the negative zone). However, this brings another interesting possibility. What if we use ADD for signed, positive and negative values, and use the SUB with reversed operands?. This would make possible expressions like K - Rn, as well as Rn + K, where K could be either possitive or negative. Currently, only Rn - K or Rn + K are possilbe!
You do not have the required permissions to view the files attached to this post.
|
Sun Apr 21, 2019 9:45 pm |
|
 |
BigEd
Joined: Wed Jan 09, 2013 6:54 pm Posts: 1754
|
My feeling is that splitting up the mode bits is a better choice if it allows you to put registers and constants in fixed positions. That's because the mode bits inevitably go into some random decode, whereas the registers and constants go into some kind of regular mux trees. I've a vague recollection that RISC-V made an effort to make the instruction format fields very regular and static. See for example https://www.csl.cornell.edu/courses/ece ... rv-isa.txt
|
Mon Apr 22, 2019 8:20 am |
|
 |
joanlluch
Joined: Fri Mar 22, 2019 8:03 am Posts: 328 Location: Girona-Catalonia
|
BigEd wrote: My feeling is that splitting up the mode bits is a better choice if it allows you to put registers and constants in fixed positions. That's because the mode bits inevitably go into some random decode, whereas the registers and constants go into some kind of regular mux trees. I've a vague recollection that RISC-V made an effort to make the instruction format fields very regular and static. See for example https://www.csl.cornell.edu/courses/ece ... rv-isa.txtThanks for your comment. I guess I should try to do as you suggest (splitted mode bits). Even if I'm unable to fully get it consistent, some improvement should be possible. This subject is trickier on 16 bit instruction architectures than 32 bit ones because of course you have less encoding slots. The RISC-V is apparently 32 bit, which may have helped. The regular ARM (I.e not the ARM-Thumb) is also a 32 bits with very stable fields among instructions: http://vision.gel.ulaval.ca/~jflalonde/cours/1001/h17/docs/arm-instructionset.pdf. In this case, with the exception of one undefined instruction pattern and the ignored bits on the software interrupt instruction which do not affect decoding, and the branch instruction, all instructions use 4 bits for predicates, 8 bits for opcode plus options, and the rest are rather consistently placed fields containing registers or offsets or additional options. My point is that the same company (ARM) had apparently a harder time to reproduce such nicety on its 16 bit based Thumb architecture.
|
Mon Apr 22, 2019 9:46 am |
|
 |
BigEd
Joined: Wed Jan 09, 2013 6:54 pm Posts: 1754
|
Good point about 16 bits being more cramped. In a recent update on the OPC machines we could see that more complex decode was costing us clock speed. Of course, a more pipelined architecture has a chance to absorb some of the cost of decode.
|
Mon Apr 22, 2019 10:21 am |
|
 |
joanlluch
Joined: Fri Mar 22, 2019 8:03 am Posts: 328 Location: Girona-Catalonia
|
BigEd wrote: Good point about 16 bits being more cramped. In a recent update on the OPC machines we could see that more complex decode was costing us clock speed. Of course, a more pipelined architecture has a chance to absorb some of the cost of decode. I am following the OPC thread with interest. Actually I recently posted a question that you contributed to answer effectively. My currently stablished goal is to build a 3 stage pipelined processort: Fetch; Decode; Execute/Load/Store. Branches are the only kind of instructions that get affected by this level of pipelining. That's in part why I have put some effort in removing excessive branching generation by the compiler. I'm still unsure if this is a too big demand for a 74xx cpu, but in case this is possible, it means that I should have some time for decoding, at least as much as it takes an ALU operation or a Load/Store memory operation.
Last edited by joanlluch on Mon Apr 22, 2019 4:01 pm, edited 1 time in total.
|
Mon Apr 22, 2019 3:27 pm |
|
 |
joanlluch
Joined: Fri Mar 22, 2019 8:03 am Posts: 328 Location: Girona-Catalonia
|
On an unrelated subject, I just found a bug on the ARM-Thumb backend. The bug can be shown in a relatively straightforward way, so it is quite surprising because this is on an industry standard compiler for the ARM architecture. I have been testing very large number of arguments passed to functions to force SP offsets beyond the limits of the embedded immediate fields. For the ARM the compiler can generate code like this: ARM-Thumb Code: ldr r1, [sp, #564] @ encoding: [0x8d,0x99] adds r0, r0, r1 @ encoding: [0x40,0x18] ldr r1, [sp, #568] @ encoding: [0x8e,0x99] adds r0, r0, r1 @ encoding: [0x40,0x18] ldr r1, [sp, #572] @ encoding: [0x8f,0x99] adds r0, r0, r1 @ encoding: [0x40,0x18] ldr r1, [sp, #576] @ encoding: [0x90,0x99] adds r0, r0, r1 @ encoding: [0x40,0x18] ldr r1, [sp, #580] @ encoding: [0x91,0x99] adds r0, r0, r1 @ encoding: [0x40,0x18] Well, if you look at the offsets, they are WAY above the instruction set limits. Also the encoding is totally messed up. The offending instruction corresponds to pattern 11 on the ARM-Thumb instruction summary that I posted earlier. The immediate value should not be greater than 8 bits for ARM-Thumb. (Other architectures seem to handle that right) We might consider that SP offsets greater than 127 are extremely rare, and that's possibly right, (I am passing more than 200 arguments to a function to cause that), but I would rather have the compiler crashing with a software exception than generating incorrect code. So that's what I did for my CPU74 implementation. I don't attempt in this case to expand into alternative instructions because as said a range of -128 to +127 seems more than enough, unless the C programmer was crazy.
|
Mon Apr 22, 2019 3:53 pm |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 1947 Location: Canada
|
Quote: On an unrelated subject, I just found a bug on the ARM-Thumb backend. Is it possible that the large offsets are handled by the assembler spitting out additional code? In my assembler it supports constants that are too large for the particular instruction type by loading them first into a temp register, then using register addressing mode. That way the compiler doesn't have to worry about the size of the constants. For instance the BEQI instruction only supports an eight bit field, but the compiler can output a value much larger than eight bits. The assembler handles it.
_________________Robert Finch http://www.finitron.ca
|
Tue Apr 23, 2019 3:30 am |
|
 |
joanlluch
Joined: Fri Mar 22, 2019 8:03 am Posts: 328 Location: Girona-Catalonia
|
robfinch wrote: Quote: On an unrelated subject, I just found a bug on the ARM-Thumb backend. Is it possible that the large offsets are handled by the assembler spitting out additional code? In my assembler it supports constants that are too large for the particular instruction type by loading them first into a temp register, then using register addressing mode. That way the compiler doesn't have to worry about the size of the constants. For instance the BEQI instruction only supports an eight bit field, but the compiler can output a value much larger than eight bits. The assembler handles it. I understand what you say, but it doesn't work like that on the LLVM infrastructure. Backends can optionally produce output as human readable assembly files or as object files for a linker. This is the last pass of the code generation, and there's not an assembler pass. You must implement classes to enable both kinds of output, and be sure the machine instruction descriptions have all required information for both possible outputs. The key thing to know is that by the time the output is generated (assembly or object file) all the instruction selection, register allocation, and target dependent optimisations are already performed. In fact, it would not be possible for an assembler to expand instructions with the incorporation of temporary registers because that would certainly break all the previous register allocation work and optimisations performed by the backend. It may not even be any available register at the time an assembler realises that it may require one more. To further elaborate on this point, the case is that a function in the ARM backend to check for the limits of immediate fields in this particular scenario and to perform the required code expansions, is implemented, but it just fails to perform a valid check in this particular case. The bug concerns an incorrect checking of the particular architecture being compiled, and therefore it fails for the ARM-Thumb architecture that has smaller immediate fields than the regular 32 bit ARM instruction set. I may have missed something fundamental here, or I may have read the code in a wrong way (I have not even run the debugger through it, because it doesn't concern the CPU74 backend), but this is totally what it looks to me.
|
Tue Apr 23, 2019 7:48 am |
|
Who is online |
Users browsing this forum: CCBot and 0 guests |
|
You cannot post new topics in this forum You cannot reply to topics in this forum You cannot edit your posts in this forum You cannot delete your posts in this forum You cannot post attachments in this forum
|
|