View unanswered posts | View active topics It is currently Thu Mar 28, 2024 9:54 am



Reply to topic  [ 305 posts ]  Go to page Previous  1, 2, 3, 4, 5, 6, 7, 8 ... 21  Next
 74xx based CPU (yet another) 
Author Message
User avatar

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
Ok, So these are several compilation examples after the changes that I described on my previous post

Example 1: ( the negation)
Code:
int test( int a )
{
  return !a;
}
CPU74 (The compiler will just generate an equal to zero SET instruction)
Code:
test:                                   ; @test
; %bb.0:                                ; %entry
   cmp   r0, #0                  ; encoding: [0x28,0x00]
   seteq    r0                      ; encoding: [0xa2,0x00]
   ret                             ; encoding: [0x02,0x00]


Example 2 (The testing for less than zero)
Code:
int test( int a )
{
  return a < 0;
}
CPU74 (I now generate a regular SET instruction for this case. No more need for fancy workarounds)
Code:
test:                                   ; @test
; %bb.0:                                ; %entry
   cmp   r0, #0                  ; encoding: [0x28,0x00]
   setlt    r0                      ; encoding: [0xb2,0x00]
   ret                             ; encoding: [0x02,0x00]


Example 3:
Code:
int test( int a, int b )
{
  return a > b;
}
CPU74 (just straight code again for this case)
Code:
test:                                   ; @test
; %bb.0:                                ; %entry
   cmp   r0, r1                  ; encoding: [0x00,0x48]
   setgt    r0                      ; encoding: [0xbe,0x00]
   ret                             ; encoding: [0x02,0x00]


Example 4:
Code:
int test( int a, int b )
{
  return a <= b;
}
CPU74 (Less than or equal is not supported so both the compare operands and condition are swapped)
Code:
test:                                   ; @test
; %bb.0:                                ; %entry
   cmp   r1, r0                  ; encoding: [0x00,0x41]
   setge    r0                      ; encoding: [0xb6,0x00]
   ret                             ; encoding: [0x02,0x00]


Example 5:
Code:
int test( int a )
{
  return a <= 3;
}
CPU74 (Less than or equal is not supported so the constant is added 1 and the condition replaced by a supported one. Edge cases are obviously taken into account)
Code:
test:                                   ; @test
; %bb.0:                                ; %entry
   cmp   r0, #4                  ; encoding: [0x28,0x20]
   setlt    r0                      ; encoding: [0xb2,0x00]
   ret                             ; encoding: [0x02,0x00]


More complex examples:

Example 6:
Code:
int test( int a, int b, int c, int d )
{
  return a > b && c > d;
}
CPU74 (Code is comparatively much shorter than ARM-Thumb as it takes a lot of branching for ARM to complete that)
Code:
test:                                   ; @test
; %bb.0:                                ; %entry
   cmp   r2, r3                  ; encoding: [0x00,0x5a]
   setgt    r2                      ; encoding: [0xbe,0x02]
   cmp   r0, r1                  ; encoding: [0x00,0x48]
   setgt    r0                      ; encoding: [0xbe,0x00]
   and   r0, r2, r0              ; encoding: [0x8a,0x10]
   ret     


Example 7: (The dreaded pointer case)
Code:
int test( char *c )
{
  return c && *c == 'A';
}
CPU74 (the compiler recognises the dangerous situation and expands the comparison into a branch instead)
Code:
test:                                   ; @test
; %bb.0:                                ; %entry
   cmp   r0, #0                  ; encoding: [0x28,0x00]
   breq   &.LBB0_2                ; encoding: [0xc0,0x00]
; %bb.1:                                ; %land.rhs
   ld.zb   [r0, #0], r0            ; encoding: [0x50,0x00]
   cmp   r0, #65                 ; encoding: [0x2a,0x08]
   seteq    r0                      ; encoding: [0xa2,0x00]
   ret                             ; encoding: [0x02,0x00]
.LBB0_2:
   mov   #0, r0                  ; encoding: [0x20,0x00]
   ret                             ; encoding: [0x02,0x00]


Example 8: (Two assignations in a single if clause)
Code:
int sftest( int x, int y, int a, int b )
{
  if ( a >= 8 )
  {
      x = x - a;
      y = - b;
  }
  return x+y;
}

CPU74 (The compiler decided that branching is still not worth in this case. Also note that the same condition code is used in two different SEL instructions despite a subtraction in the middle. This is correct because subtraction is not longer affecting comparison flags.
Code:
sftest:                                 ; @sftest
; %bb.0:                                ; %entry
   mov   #0, r4                  ; encoding: [0x20,0x04]
   sub   r4, r3, r3              ; encoding: [0x85,0x1b]
   cmp   r2, #7                  ; encoding: [0x28,0x3a]
   selgt    r3, r1, r1              ; encoding: [0xbc,0xc9]
   sub   r4, r2, r2              ; encoding: [0x85,0x12]
   selgt    r2, r4, r2              ; encoding: [0xbc,0xa2]
   add   r2, r0, r0              ; encoding: [0x80,0x80]
   add   r0, r1, r0              ; encoding: [0x80,0x08]
   ret                             ; encoding: [0x02,0x00]


The above function can be manually optimised in C code by writing it like this:
Example 9:
Code:
int sftest2( int x, int y, int a, int b )
{
  if ( a >= 8 )
    return x - a - b;
  return x+y;
}
CPU74 (This produces shorter code, and still no jumps)
Code:
sftest2:                                ; @sftest2
; %bb.0:                                ; %entry
   add   r1, r0, r1              ; encoding: [0x80,0x41]
   sub   r0, r2, r0              ; encoding: [0x84,0x10]
   sub   r0, r3, r0              ; encoding: [0x84,0x18]
   cmp   r2, #7                  ; encoding: [0x28,0x3a]
   selgt    r0, r1, r0              ; encoding: [0xbc,0x08]
   ret                             ; encoding: [0x02,0x00]


Introducing some pointer arithmetic causes the compiler to resort to branches for code generation.
Example 10:
Code:
void sftest3( int *x, int *y, int a, int b )
{
  if ( a >= 8 )
  {
      *x = *x - a;
      *y = - b;
  }
}
CPU74 (The compiler chose to generate a branch instruction for that case, but automatically reversed the branching condition to prevent excessive branching)
Code:
sftest3:                                ; @sftest3
; %bb.0:                                ; %entry
   cmp   r2, #8                  ; encoding: [0x28,0x42]
   brlt   &.LBB1_2                ; encoding: [0xd0,0x00]
; %bb.1:                                ; %if.then
   ld.w   [r0, #0], r4            ; encoding: [0x40,0x04]
   sub   r4, r2, r2              ; encoding: [0x85,0x12]
   st.w   r2, [r0, #0]            ; encoding: [0x60,0x10]
   mov   #0, r0                  ; encoding: [0x20,0x00]
   sub   r0, r3, r0              ; encoding: [0x84,0x18]
   st.w   r0, [r1, #0]            ; encoding: [0x60,0x01]
.LBB1_2:                                ; %if.end
   ret                             ; encoding: [0x02,0x00]


I can say that at least for these snippets of code, the CPU74 beats all other architectures.

I hope you all find it interesting, and I'm ready to test any code snippet that you may want to try to push the compiler into a stressful situation.

Thanks for watching


Sat Apr 20, 2019 4:17 pm
Profile

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
Certainly is interesting. As you say, there's no +1, but rest assured, there are people reading.

BTW, I have just a little concern about your workaround for missing conditions:
> For Branch and Set instructions with constant comparisons I just add 1 to the constant at compile time and replace the unsupported conditions by 'less than' or 'unsigned less than' respectively, which are both supported.

I haven't quite the clarity of mind to think it through, but are there edge cases near 0 or -1 where you can't add 1 without changing the meaning?


Sat Apr 20, 2019 5:06 pm
Profile

Joined: Wed Apr 24, 2013 9:40 pm
Posts: 213
Location: Huntsville, AL
I too am reading along.

_________________
Michael A.


Sat Apr 20, 2019 5:08 pm
Profile
User avatar

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
BigEd wrote:
Certainly is interesting. As you say, there's no +1, but rest assured, there are people reading.

BTW, I have just a little concern about your workaround for missing conditions:
> For Branch and Set instructions with constant comparisons I just add 1 to the constant at compile time and replace the unsupported conditions by 'less than' or 'unsigned less than' respectively, which are both supported.

I haven't quite the clarity of mind to think it through, but are there edge cases near 0 or -1 where you can't add 1 without changing the meaning?

Hi Ed, thanks for your comment.

As a matter of general information, the machine 'CMP' instruction in most architectures do not know in advance how other instructions will latter interpret the comparison results, so they just set the Z C V S status flags in the typical way to represent comparison results. I posted the meaning of the several combinations of these flags just a few posts ago, but I assume this is quite a known subject.

In my architecture, and many others, instruction embedded constants are interpreted as signed values that must be Sign Extended to the machine register size before they are used by the instruction. There are 8 encoding bits available for the constant in the compare instruction. So as a signed value it can go physically from -128 to +127, although the actual comparison is performed as a 16 bit subtraction. Considering this, 0 or -1 are not edge cases, because these constants can be added 1 without exceeding the range.

The first edge case is when there's not enough room on the Constant Comparison instruction for a big constant. As said, there are 8 encoding bits available for the constant. If the constant (as a signed integer) does not fit in such 8 bits, then an intermediate register must be used, which is previously loaded with a full 16 bit value, and Register to Register Compare instruction is used instead. Once we resorted to use a Register to Register Comparison, there's no longer the need to increment a constant. For Register to Register comparisons the strategy is to swap the operands on the comparison instruction and to use swapped test conditions in any subsequent instructions that make use of the comparison results. The operator and condition swapping does not create any further edge cases.

I hope this makes sense. I have actually tested all edge cases that I could think of and I found it to generate correct code (fingers crossed)


Last edited by joanlluch on Sat Apr 20, 2019 7:15 pm, edited 1 time in total.



Sat Apr 20, 2019 7:04 pm
Profile

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
Ah, thanks, yes, the extra width you have will avoid any potential problem.

BTW, I mean to say, that's an interesting idea to have a pair of status registers. But I did expect to see a V in the arithmetic register.


Sat Apr 20, 2019 7:06 pm
Profile
User avatar

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
BigEd wrote:
Ah, thanks, yes, the extra width you have will avoid any potential problem.

BTW, I mean to say, that's an interesting idea to have a pair of status registers. But I did expect to see a V in the arithmetic register.

Yes, I suppose it can be added if I find that it's required.
I must admit that I still have some confusion about how to deal with (or what to do) with the Arithmetic Status Register. I can clearly see the carry flag in use for type-promoted arithmetic calculations, let's say we want to ADD two 32 bit integer values, this shall be done by adding the lower 16 bit words together then use the ADDC (ADD with carry) instruction for the two higher words. Similarly, we may have to use the carry flag in some way to compute promoted-type shifts. But beyond that, I'm confused about the actual utility of the remaining flags in the context of only arithmetic operations. That's something I still need to look into.


Sat Apr 20, 2019 7:43 pm
Profile

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Nit-picking a bit here.
I just noticed in the instruction formats P4:T10 that the immediate constant is in a different position than for formats T7 to T9. It might be better to keep the constant in the same place in order to save some hardware multiplexors. Also there is an op field for add/sub sp. Since only a single add instruction is really needed (assuming a sign extended constant), this format could be used for other instructions.

_________________
Robert Finch http://www.finitron.ca


Sun Apr 21, 2019 3:13 am
Profile WWW
User avatar

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
Hi Rob,
Thank you for that observation. You are totally right!. However, there's not an easy way to move such 'constant' fields all on the same place because the instruction encoding is already very compressed. If I move the constant on the T10 template to the position of the T7-T9 templates, then I kill a lot of encoding slots up to the point that the current instruction set does no longer fit in the remaining ones.

Another possibility is to move all constants to the right, and then allow variations on the register encoding positions. So instead of having always the registers in the same places, we have always the constant in the same places (But not both at the same time)

However, maybe it's still possible to have always both the constants and the registers in the same exact places, by allowing the instruction encoding to be distributed along the free positions, ie, not in contiguous places, as they are now. Not sure if this can still guarantee the absence of duplicates.

My guess, is that having it all is not possible because otherwise companies like AMD would have already figured out a way. Look for instance to the ARM-Thumb encoding. It has a quite very well designed encoding, all constants are more or less on the same places, but still the 3th, 11th and 12th patterns have the register in an unique place. I would have almost the same situation, if I just move all the constants in T7 though T9 to the least significant bits.

Attachment:
Thumb.png
Thumb.png [ 228.65 KiB | Viewed 5067 times ]


The key think for me to ask is this: what configuration might be easier to decode? As this is not in my field of expertise I have a difficult time to try to figure it out by myself...


About the ADD, SUB instructions, what I have seen in some architectures is that they tend to interpret the constants as unsigned values. Normally, signed constants are used on load instructions and addressing mode offsets, but unsigned ones are used on sub/add instructions. We could certainly eliminate the constant sub instruction, but if we keep the same constant field length and remove the sub instruction, this reduces by half the available range of constant values (either in the positive or in the negative zone). However, this brings another interesting possibility. What if we use ADD for signed, positive and negative values, and use the SUB with reversed operands?. This would make possible expressions like K - Rn, as well as Rn + K, where K could be either possitive or negative. Currently, only Rn - K or Rn + K are possilbe!


Sun Apr 21, 2019 9:45 pm
Profile

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
My feeling is that splitting up the mode bits is a better choice if it allows you to put registers and constants in fixed positions. That's because the mode bits inevitably go into some random decode, whereas the registers and constants go into some kind of regular mux trees.

I've a vague recollection that RISC-V made an effort to make the instruction format fields very regular and static. See for example https://www.csl.cornell.edu/courses/ece ... rv-isa.txt


Mon Apr 22, 2019 8:20 am
Profile
User avatar

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
BigEd wrote:
My feeling is that splitting up the mode bits is a better choice if it allows you to put registers and constants in fixed positions. That's because the mode bits inevitably go into some random decode, whereas the registers and constants go into some kind of regular mux trees.

I've a vague recollection that RISC-V made an effort to make the instruction format fields very regular and static. See for example https://www.csl.cornell.edu/courses/ece ... rv-isa.txt

Thanks for your comment. I guess I should try to do as you suggest (splitted mode bits). Even if I'm unable to fully get it consistent, some improvement should be possible.

This subject is trickier on 16 bit instruction architectures than 32 bit ones because of course you have less encoding slots. The RISC-V is apparently 32 bit, which may have helped. The regular ARM (I.e not the ARM-Thumb) is also a 32 bits with very stable fields among instructions:http://vision.gel.ulaval.ca/~jflalonde/cours/1001/h17/docs/arm-instructionset.pdf. In this case, with the exception of one undefined instruction pattern and the ignored bits on the software interrupt instruction which do not affect decoding, and the branch instruction, all instructions use 4 bits for predicates, 8 bits for opcode plus options, and the rest are rather consistently placed fields containing registers or offsets or additional options. My point is that the same company (ARM) had apparently a harder time to reproduce such nicety on its 16 bit based Thumb architecture.


Mon Apr 22, 2019 9:46 am
Profile

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
Good point about 16 bits being more cramped. In a recent update on the OPC machines we could see that more complex decode was costing us clock speed. Of course, a more pipelined architecture has a chance to absorb some of the cost of decode.


Mon Apr 22, 2019 10:21 am
Profile
User avatar

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
BigEd wrote:
Good point about 16 bits being more cramped. In a recent update on the OPC machines we could see that more complex decode was costing us clock speed. Of course, a more pipelined architecture has a chance to absorb some of the cost of decode.

I am following the OPC thread with interest. Actually I recently posted a question that you contributed to answer effectively. My currently stablished goal is to build a 3 stage pipelined processort: Fetch; Decode; Execute/Load/Store. Branches are the only kind of instructions that get affected by this level of pipelining. That's in part why I have put some effort in removing excessive branching generation by the compiler. I'm still unsure if this is a too big demand for a 74xx cpu, but in case this is possible, it means that I should have some time for decoding, at least as much as it takes an ALU operation or a Load/Store memory operation.


Last edited by joanlluch on Mon Apr 22, 2019 4:01 pm, edited 1 time in total.



Mon Apr 22, 2019 3:27 pm
Profile
User avatar

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
On an unrelated subject, I just found a bug on the ARM-Thumb backend. The bug can be shown in a relatively straightforward way, so it is quite surprising because this is on an industry standard compiler for the ARM architecture. I have been testing very large number of arguments passed to functions to force SP offsets beyond the limits of the embedded immediate fields. For the ARM the compiler can generate code like this:

ARM-Thumb
Code:
ldr   r1, [sp, #564]          @ encoding: [0x8d,0x99]
   adds   r0, r0, r1              @ encoding: [0x40,0x18]
   ldr   r1, [sp, #568]          @ encoding: [0x8e,0x99]
   adds   r0, r0, r1              @ encoding: [0x40,0x18]
   ldr   r1, [sp, #572]          @ encoding: [0x8f,0x99]
   adds   r0, r0, r1              @ encoding: [0x40,0x18]
   ldr   r1, [sp, #576]          @ encoding: [0x90,0x99]
   adds   r0, r0, r1              @ encoding: [0x40,0x18]
   ldr   r1, [sp, #580]          @ encoding: [0x91,0x99]
   adds   r0, r0, r1              @ encoding: [0x40,0x18]

Well, if you look at the offsets, they are WAY above the instruction set limits. Also the encoding is totally messed up. The offending instruction corresponds to pattern 11 on the ARM-Thumb instruction summary that I posted earlier. The immediate value should not be greater than 8 bits for ARM-Thumb. (Other architectures seem to handle that right)

We might consider that SP offsets greater than 127 are extremely rare, and that's possibly right, (I am passing more than 200 arguments to a function to cause that), but I would rather have the compiler crashing with a software exception than generating incorrect code. So that's what I did for my CPU74 implementation. I don't attempt in this case to expand into alternative instructions because as said a range of -128 to +127 seems more than enough, unless the C programmer was crazy.


Mon Apr 22, 2019 3:53 pm
Profile

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Quote:
On an unrelated subject, I just found a bug on the ARM-Thumb backend.
Is it possible that the large offsets are handled by the assembler spitting out additional code? In my assembler it supports constants that are too large for the particular instruction type by loading them first into a temp register, then using register addressing mode. That way the compiler doesn't have to worry about the size of the constants. For instance the BEQI instruction only supports an eight bit field, but the compiler can output a value much larger than eight bits. The assembler handles it.

_________________
Robert Finch http://www.finitron.ca


Tue Apr 23, 2019 3:30 am
Profile WWW
User avatar

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
robfinch wrote:
Quote:
On an unrelated subject, I just found a bug on the ARM-Thumb backend.
Is it possible that the large offsets are handled by the assembler spitting out additional code? In my assembler it supports constants that are too large for the particular instruction type by loading them first into a temp register, then using register addressing mode. That way the compiler doesn't have to worry about the size of the constants. For instance the BEQI instruction only supports an eight bit field, but the compiler can output a value much larger than eight bits. The assembler handles it.

I understand what you say, but it doesn't work like that on the LLVM infrastructure. Backends can optionally produce output as human readable assembly files or as object files for a linker. This is the last pass of the code generation, and there's not an assembler pass. You must implement classes to enable both kinds of output, and be sure the machine instruction descriptions have all required information for both possible outputs.

The key thing to know is that by the time the output is generated (assembly or object file) all the instruction selection, register allocation, and target dependent optimisations are already performed. In fact, it would not be possible for an assembler to expand instructions with the incorporation of temporary registers because that would certainly break all the previous register allocation work and optimisations performed by the backend. It may not even be any available register at the time an assembler realises that it may require one more.

To further elaborate on this point, the case is that a function in the ARM backend to check for the limits of immediate fields in this particular scenario and to perform the required code expansions, is implemented, but it just fails to perform a valid check in this particular case. The bug concerns an incorrect checking of the particular architecture being compiled, and therefore it fails for the ARM-Thumb architecture that has smaller immediate fields than the regular 32 bit ARM instruction set. I may have missed something fundamental here, or I may have read the code in a wrong way (I have not even run the debugger through it, because it doesn't concern the CPU74 backend), but this is totally what it looks to me.


Tue Apr 23, 2019 7:48 am
Profile
Display posts from previous:  Sort by  
Reply to topic   [ 305 posts ]  Go to page Previous  1, 2, 3, 4, 5, 6, 7, 8 ... 21  Next

Who is online

Users browsing this forum: No registered users and 7 guests


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Search for:
Jump to:  
cron
Powered by phpBB® Forum Software © phpBB Group
Designed by ST Software