 74xx based CPU (yet another) 
Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
BigEd wrote:
(I think my instinct might be to update the flags on a register-register move)
I regard this as a wrong design decision on some early, and not that early, processors. The VAX-11 is the most surprising example of this. Having all instructions update the flags looks orthogonal, even a cool thing to do, but in practice it prevents a lot of compiler optimisations. Just think of the simple case of a processor working on data wider than its native width, which requires inserting register-saving move instructions after a compare but before a branch. With flag-altering moves this is not possible.

Some processors take this subject to the extreme and even dedicate a bit in the instruction encoding to tell whether the flags are altered or not. On the ARM, for example, you can turn any logical or arithmetic instruction into a non-flag-altering instruction just by clearing a bit in the instruction encoding. Or you can turn any instruction into a conditional instruction by using its 4-bit condition code field. It's quite flexible: not only can all instructions be made conditional, but most instructions can also be turned into flag-generating instructions (a.k.a. compares)!
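A sketch of what that looks like in ARM-style assembly (illustrative only, not taken from any particular program):

```asm
        SUBS  r2, r0, r1      @ S bit set: r2 = r0 - r1, flags updated
        SUB   r2, r0, r1      @ same operation, S bit clear: flags untouched
        ADDEQ r3, r3, #1      @ conditional ADD: executes only if Z flag is set
        CMP   r0, r1          @ effectively a SUBS with the result discarded
```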


Mon Nov 02, 2020 1:30 pm
Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1783
Yes, my instinct could easily be wrong! (I can imagine a machine which can only do register to register through the ALU, needing an OR with zero, or an ADD with zero, in which case we get the flags updated, unless we have an ARM-like mechanism to avoid it.)

But: should a compiler be doing much moving of data between registers? Perhaps it should, but it seems a bit like a relabelling problem to me.


Mon Nov 02, 2020 1:59 pm
BigEd wrote:
Yes, my instinct could easily be wrong! (I can imagine a machine which can only do register to register through the ALU, needing an OR with zero, or an ADD with zero, in which case we get the flags updated, unless we have an ARM-like mechanism to avoid it.)

But: should a compiler be doing much moving of data between registers? Perhaps it should, but it seems a bit like a relabelling problem to me.

Good observations again.

Most modern architectures do not rely on ALU operations for small constant loads, so flag setting is avoided by using these instructions; they are just moves. In addition, it is also fairly common to have some kind of non-flag-altering ADD or SUBtract, generally used in the context of address calculation. The 68000 for example has an instruction named LEA, which stands for Load Effective Address, supporting all the addressing modes that normally load memory but producing the address instead of the memory contents. It's ultimately just a non-flag-altering ADD. The x86 has that too, and I have a LEA instruction in my architecture as well.

By this I mean that even if processors do not go as far as the ARM (which is probably a bit over-engineered), they still have at least one ADD instruction (whatever its name) that does not change the status flags.

About compilers: yes, they try to avoid moving data between registers if possible, but small constants moved into registers happen all the time. Instruction sets featuring 3 register operands also help to avoid register-to-register moves.


Mon Nov 02, 2020 2:11 pm

Joined: Mon Oct 07, 2019 2:41 am
Posts: 593
I guess it depends on the CPU design: general registers modify flags, index registers do not.
I/O instructions may also have strange operation flags.
I don't use flags with my TTL design; I just test a register on BCC or SCC. The microcode needs
to jump anyway, so having flags would not make any difference in speed. If one used conditional logic
like pc = pc + (flags ? 1 : offset) then flags would be faster.
All the code I remember, like Tiny C, tested the AC rather than using flags. You also need
a branch range of about 2K instructions for programs on a small machine.
Ben.


Mon Nov 02, 2020 4:50 pm
(Oh, good point about LEA - I was never sure if that was just a bit of syntactic sugar, but if it's actually flag-free arithmetic that makes sense!)


Mon Nov 02, 2020 5:23 pm
Hi Ed,

BigEd wrote:
(Oh, good point about LEA - I was never sure if that was just a bit of syntactic sugar, but if it's actually flag-free arithmetic that makes sense!)

Compilers for the x86, with all its architectural weirdness, often make use of the 'lea' instruction to optimise register usage or to produce interesting side effects.
Code:
long f(long x) {
  return x * 5;
}

Code:
LEA EAX, [EAX * 4 + EAX]   ;EAX = EAX + EAX * 4
ret

Funny, isn't it? The above single instruction performs a multiplication by 5. And there's also the following:
Code:
LEA EAX, [EAX * 2 + EAX]   ;EAX = EAX * 3
LEA EAX, [EAX * 4 + EAX]   ;EAX = EAX * 5
LEA EAX, [EAX * 8 + EAX]   ;EAX = EAX * 9

But that's not all, because you can also add a literal constant in the same instruction. In fact, any expression of the form
Code:
Base + (Index * Scale) + Displacement
where Base and Index are registers, Scale is one of 1, 2, 4 or 8, and Displacement is any constant, can be expressed in a single x86-64 instruction using the 'lea' syntax, and of course compilers are fully aware of that!


Last edited by joanlluch on Mon Nov 02, 2020 7:32 pm, edited 1 time in total.



Mon Nov 02, 2020 7:24 pm
Yikes! (But then x86 is uber-CISC...)


Mon Nov 02, 2020 7:31 pm
Well, don't say that too quickly. The ARM has that too, except that you can't mix the Scale with the Displacement in the same instruction. In this case they just didn't give it a name other than ADD: by not setting the flag-generating bit, the ADD behaves exactly like the x86 LEA.


Mon Nov 02, 2020 7:53 pm
Hi Ed,

BigEd wrote:
(Oh, good point about LEA - I was never sure if that was just a bit of syntactic sugar, but if it's actually flag-free arithmetic that makes sense!)


Just to make the point about flag-modifying and non-flag-modifying 'adds', I prepared an example on the CPU74 architecture to show the difference. The first scenario has the 'lea' instruction disabled/removed from the compiler. The second one is the normal one:

Code:
int testX16( int a, int b, int *c )
{
  *c = a==b ? 2+a : 3+b;
  return a==b;
}


CPU74 (lea instructions disabled)
Code:
   .globl   testX16
testX16:
   add   SP, -2, SP
   st.w   r4, [SP, 0]
   mov   r0, r3
   add   r3, 2, r3
   mov   r1, r4
   add   r4, 3, r4
   cmp.eq   r0, r1
   selcc   r3, r4, r0
   st.w   r0, [r2, 0]
   setcc   r0
   ld.w   [SP, 0], r4
   add   SP, 2, SP
   ret


CPU74 (lea instruction enabled)
Code:
   .globl   testX16
testX16:
   cmp.eq   r0, r1
   lea   r1, 3, r1
   lea   r0, 2, r0
   selcc   r0, r1, r0
   st.w   r0, [r2, 0]
   setcc   r0
   ret


Recall that 'lea' is just a normal 'add' but it does not affect status register flags.

Notice the big difference. In the first scenario the compiler used normal flag-altering 'add' instructions. This prevents instruction reordering optimisations, which results in more register pressure and even the creation of a stack frame to save used registers.

In the second case, the compiler starts by comparing the function arguments a and b. Then it speculatively adds the constant values to them without modifying the flags, and selects one of the results based on the flags, which are still valid from the compare. Next, the result is stored to the memory pointed to by argument c, still without modifying the flags, and finally the still-valid flags are used again to produce the function result.

It can hardly get more efficient than that. Compared with the first version, generated by the same compiler with just one instruction removed from the set, it seems that having non-flag-altering adds is a big advantage.


Fri Nov 06, 2020 7:04 pm
Thanks - that's an impressive demonstration!


Fri Nov 06, 2020 9:06 pm
I took a further step in optimising the instruction set by slightly changing the semantics of some instructions.

The ISA already specifies that all word memory accesses must be aligned, which means they all happen at even memory addresses. This brings the opportunity to encode all immediate fields as scaled by a factor of two, by shifting them right and discarding the least significant bit (which is always zero). This automatically doubles the reach of such instructions without having to prefix them.

In addition, I also changed the semantics of the BaseRegister plus IndexRegister indirect addressing mode, so that the IndexRegister is interpreted as already scaled by a factor of two for word access instructions.

These semantics are very common in modern processors featuring that addressing mode, including the ARM, the x86 and virtually all RISC processors out there.

The only reason I didn't incorporate this long ago is that I thought it would add a significant performance hit in hardware, because it requires inserting a 'shift' unit in the data path to the ALU. However, I recently realised that in this case it is very cheap, because it can be done with 74CBT analog switches, which add a negligible propagation delay once they are pre-selected, and the selection can be done right from the decoder before register data gets there.

The compiler and the assembler have already been updated for this, and only the simulator remains pending in my pipeline, so not a lot of tests have been performed yet. But I can show some code generation implications. The following are a couple of examples of compiled code before and after:
Code:
int (*funcListx[6])() = {f1, f2, f3, f4, f5, f6};

int swTest2( int a, int b, int i )
{
  return funcListx[i]( a, b );
}

CPU74 Before
Code:
swTest2:
   add   r2, r2, r2
   ld.w   [r2, &funcListx], r2
   call   r2
   ret

CPU74 After
Code:
swTest2:
   mov   &funcListx, r3
   ld.w   [r3, r2], r2
   call   r2
   ret

The previous example is a dispatch table where a function is called based on an index. There are not a lot of differences in this case, but I like this example because it clearly shows the point.

- In the first case, the compiler used the trick of placing the function table address in the immediate field of a register + offset addressing mode. The base register is then used as the index, and must be multiplied by two in order to point to the relevant table element. The same could be done by loading the address into a first register and using a second one as the index, but that would have required one more instruction and two registers instead of just one.

- In the second case, the compiler avoids any tricks and plainly uses the scaling property of the register + register indirect addressing. The table base address is loaded into register R3, and register R2 is used as the index. Thanks to the implied scaling, there's no need to multiply it by 2.

The clearest example of the semantics difference, however, is this simple one:
Code:
int test33b( int *ss, int i)
{
  return ss[i];
}

CPU74 before
Code:
   .globl   test33b
test33b:
   add   r1, r1, r1
   ld.w   [r0, r1], r0
   ret

CPU74 After
Code:
   .globl   test33b
test33b:
   ld.w   [r0, r1], r0
   ret

In the second case, as per the updated semantics, register R1 is scaled by 2 by the load word instruction, so there's no need to multiply it by 2 beforehand.

The next example shows what happens in a loop involving word memory accesses.
Code:
void pep (int *dst, const int *src, unsigned int len)
{
    for ( int i=0 ; i<len ; ++i )
      dst[i] = src[i];
}

CPU74 Before
Code:
   .globl   pep
pep:
   addx   SP, -2, SP
   st.w   r4, [SP, 0]
   cmp.eq   r2, 0
   brcc   .LBB1_3
   mov   0, r3
.LBB1_2:
   ld.w   [r1, r3], r4
   st.w   r4, [r0, r3]
   addx   r3, 2, r3
   sub   r2, 1, r2
   brncc   .LBB1_2
.LBB1_3:
   ld.w   [SP, 0], r4
   addx   SP, 2, SP
   ret

CPU74 After
Code:
   .globl   pep
pep:
   addx   SP, -2, SP
   st.w   r4, [SP, 0]
   cmp.eq   r2, 0
   brcc   .LBB1_3
   mov   0, r3
.LBB1_2:
   ld.w   [r1, r3], r4
   st.w   r4, [r0, r3]
   addx   r3, 1, r3
   cmp.eq   r2, r3
   brncc   .LBB1_2
.LBB1_3:
   ld.w   [SP, 0], r4
   addx   SP, 2, SP
   ret

In the first case, the compiler uses a decrementing loop induction variable for the loop count, and a second one, incrementing by 2, for the memory access index, so two induction variables are required. The second case is simpler, as there's a single incrementing loop induction variable which is reused as the memory access index. It appears that, thanks to compiler optimisations, the first case ends up identical in terms of register usage and cycle count, so we could conclude that both semantics for the load instruction are equally good. However, the optimisations needed in the second case are lighter, almost nonexistent, as the compiler basically just needs to follow the source code, and this is an advantage in itself: the compiler is less constrained and can apply more complex optimisations that may help in complex scenarios, as I have already found for the "queens" code in the "some C compilers" thread, which is now shorter and potentially faster.

I hope to be able to post the "queens" example with somewhat improved results after I get them...

I think this change is a big improvement that makes the ISA look more 'professional' to me (if this word is even applicable to this project, that is). Now it is time to upgrade the software simulator before I move to hardware logic design again.

Joan

[EDIT: I forgot to mention that I renamed the "lea" instruction to "addx", so I think it has now a clearer meaning]
[EDIT: use of English]


Sat Nov 07, 2020 7:21 pm
One other option is to reserve that bit as long/short data if you move to a 32 bit CPU.
Ben.


Sun Nov 08, 2020 5:08 am
oldben wrote:
One other option is reserve that bit as long/short data if you move to 32 bit cpu.
Ben.

Hi Ben, that's a possibility, but moving to a 32 bit architecture while still supporting 16 bit loads/stores would also require all BaseRegister + IndexRegister addressing instructions to support 16 bit accesses too, and there are not enough encoding slots for those extra instructions unless immediate field sizes are reduced in other instructions.

Another possibility is simply not supporting 16 bit loads/stores directly. This would not be anything new: the ARM1 didn't support 16 bits either, just 8 bits and 32 bits. In practice, the lack of 16 bit support is not an unsolvable problem, because the compiler can be instructed to always align 16 bit variables to 32 bit boundaries.

- 16 bit global vars can be aligned and padded to 32 bit boundaries, so the compiler can safely read or write 32 bits.
- 16 bit struct fields can be aligned and padded to 32 bit boundaries, likewise.
- The stack can be aligned to 32 bits for all accesses, so the above applies to local vars as well.
- More difficult is the access of 16 bit array elements. This can simply be forbidden, or the compiler can be allowed to use 32 bit elements transparently to the user.

The first DEC Alpha processors didn't even support 8 bit accesses, everything was 32 bits, but that made it quite inefficient to work with character strings and byte sized data. I think that supporting 8 and 32 bits, but not 16 bits, is a more convenient tradeoff, and for the CPU74 instruction set it would just be a move from 16 to 32 bits with essentially the same instruction set and addressing modes.

Another important requirement for an efficient 32 bit processor is the availability of arbitrary-shift instructions, i.e. a barrel shifter in the processor. Compilers love them, and in the 32 bit case they can't get by without one. This would just imply replacing the current sets of 1 bit and 8 bit shift instructions with a single set carrying a small 5 bit immediate to specify up to 32 bit shifts, plus another set with the shift amount specified in a register.

A different subject is that moving to a 32 bit architecture implies almost doubling the number of components, or at least the register file, and losing performance, particularly if we still maintain the 16 bit ALU and internal buses. Ideally the ALU should be converted to 32 bits too...

Anyway, something to consider in the future...


Sun Nov 08, 2020 10:26 pm
I have updated the software simulator with the latest ISA changes; this also includes all the bit field and instruction rearrangements that I've been posting while developing the Logisim model. I am happy to report that at least a basic set of instructions is already working correctly. I used the following as the first basic test program.

It's essentially a "Hello World" program that also prints a number:

Code:
void myprintstr( char *str )
{
  while ( *str )
    putchar( *str++ );
}

void printnum( unsigned int num )
{
  int factors[] = {10000, 1000, 100, 10, 1};
  for ( int i=0 ; i<(sizeof factors)/2 ; ++i )  /* sizeof is in bytes; ints are 2 bytes on CPU74 */
  {
    char ch = '0';
    while ( num >= factors[i])
    {
      ch = ch + 1;
      num = num - factors[i];
    }
    putchar( ch );
  }
}

int main()
{
  myprintstr( "Hello world!\n" );
  printnum( 12345 );
  putchar( '\n' );
}


After compiling - assembling - simulating, this is the result

Code:
/Users/joan/Documents-Local/Relay/CPU74/Simulator/DerivedData/Simulator/Build/Products/Debug/c74-sim
Hello world!
12345
Executed instruction count: 330
Total cycle count: 457
Elapsed simulation time: 0.00870718 seconds
Calculated execution time at 1MHz : 0.000457 seconds
Calculated execution time at 8MHz : 5.7125e-05 seconds
Calculated execution time at 16MHz: 2.85625e-05 seconds
Program ended with exit code: 0


Sun Nov 08, 2020 10:44 pm
16 bit access could be done as a macro for the few cases where you need to fake a structure
from a byte array, like a disk directory structure.
#define read16(x) ((unsigned char)*(x) | ((unsigned char)*((x)+1) << 8))
Ben.


Mon Nov 09, 2020 2:16 am