AnyCPU - View topic - 74xx based CPU (yet another)

Page 18 of 21

[ 305 posts ]

Go to page Previous 1 ... 15, 16, 17, 18, 19, 20, 21 Next

74xx based CPU (yet another)

Author	Message
joanlluch Joined: Fri Mar 22, 2019 8:03 am Posts: 328 Location: Girona-Catalonia	Re: 74xx based CPU (yet another) BigEd wrote: (I think my instinct might be to update the flags on a register-register move) I regard this as a wrong design decision on some early, and not that early, processors. The VAX-11 is the most surprising example of this. Having all instructions updating flags looks orthogonal, even a cool thing to do, but in practice this prevents o lot of compiler optimisations. Just think on the simple case of a processor working on wider than native data widths, requiring the insertion of register-saving move instructions just after a compare but before a branch. With flag altering moves this is not possible. Some processors bring that subject to the extreme, and they even specify a bit in the instruction encoding to tell whether flags are altered of not. On the ARM, for example. You can turn any logical or arithmetic instruction into a non-flag-altering instruction by just setting a bit on the instruction encoding. Or you can turn any instruction into a conditional instruction by using a 4 bit condition code in it. It's quite flexible, not only all instructions can be turned into conditional instructions, but also most instructions can be turned into flag generating instructions (a.k.a compares) !
Mon Nov 02, 2020 1:30 pm

BigEd Joined: Wed Jan 09, 2013 6:54 pm Posts: 1836	Re: 74xx based CPU (yet another) Yes, my instinct could easily be wrong! (I can imagine a machine which can only do register to register through the ALU, needing an OR with zero, or an ADD with zero, in which case we get the flags updated, unless we have an ARM-like mechanism to avoid it.) But: should a compiler be doing much moving of data between registers? Perhaps it should, but it seems a bit like a relabelling problem to me.
Mon Nov 02, 2020 1:59 pm

joanlluch Joined: Fri Mar 22, 2019 8:03 am Posts: 328 Location: Girona-Catalonia	Re: 74xx based CPU (yet another) BigEd wrote: Yes, my instinct could easily be wrong! (I can imagine a machine which can only do register to register through the ALU, needing an OR with zero, or an ADD with zero, in which case we get the flags updated, unless we have an ARM-like mechanism to avoid it.) But: should a compiler be doing much moving of data between registers? Perhaps it should, but it seems a bit like a relabelling problem to me. That's good observations again. Most modern architectures do not rely on ALU operations for small constant loads, so flag setting is prevented by using these instructions, they are just moves. In addition, it is also fairly common to have some kind of non-flag-altering ADD or SUBtract, which are generally used in the context of address calculation. The 68000 for example has an instruction named LEA, which stands for Load Effective Address, supporting all the addressing modes that normally load memory, but resulting in an address instead of the memory contents. It's in definitive just a non-flag.altering ADD. I think X86 has that too, and I have a LEA instruction in my architecture too. By this I mean that even if processors do not go that far as the ARM (as this is probably a bit over engineered) they still have at least an ADD instruction (whatever the name) that does not change the status flags. About compilers, yes they try to avoid moving data between registers if possible, but small constants moved into registers happen all the time. Instruction sets featuring 3 register operands also help to scape register to register moves.
Mon Nov 02, 2020 2:11 pm

oldben Joined: Mon Oct 07, 2019 2:41 am Posts: 814	Re: 74xx based CPU (yet another) I guess it depends on the cpu design, general registers modify flags, index registers do not. I/O instructions also may have strange operation flags. I don't use flags with my TTL design, just test a register on BCC or SCC. The micro code needs to jump so having flags would not make any changes in speed. If one used conditional logic like pc = pc + (flags? 1: # offset) then flags would be faster. All the code I remember like Tiny C, tested the AC rather than using flags. You need also about a branch of 2K instructions for programs on a small machine. Ben.
Mon Nov 02, 2020 4:50 pm

BigEd Joined: Wed Jan 09, 2013 6:54 pm Posts: 1836	Re: 74xx based CPU (yet another) (Oh, good point about LEA - I was never sure if that was just a bit of syntactic sugar, but if it's actually flag-free arithmetic that makes sense!)
Mon Nov 02, 2020 5:23 pm

joanlluch Joined: Fri Mar 22, 2019 8:03 am Posts: 328 Location: Girona-Catalonia	Re: 74xx based CPU (yet another) Hi Ed, BigEd wrote: (Oh, good point about LEA - I was never sure if that was just a bit of syntactic sugar, but if it's actually flag-free arithmetic that makes sense!) Compilers for the x86, with all the architecture weirdness, make often use of the 'lea' instruction to optimise register usage or to produce interesting side effects. Code: long f(long x) { return x * 5; } Code: LEA EAX, [EAX * 4 + EAX] // this is EAX = EAX + EAX * 4 ret Funny isn't it?, the above single instruction performs a multiplication by 5. And there's also the following Code: LEA EAX, [EAX * 2 + EAX] ;EAX = EAX * 3 LEA EAX, [EAX * 4 + EAX] ;EAX = EAX * 5 LEA EAX, [EAX * 8 + EAX] ;EAX = EAX * 9 But that's not all, because you can also add a literal constant on the same instruction. In fact, any expression of the form Code: Base + (Index * Scale) + Displacement where Base, Index are registers, Scale is one of 0, 1, 2, 4, 8, and Displacement is any constant, can be expressed in a single X86-64 instruction using the 'lea' syntax, and of course compilers are fully aware of that! Last edited by joanlluch on Mon Nov 02, 2020 7:32 pm, edited 1 time in total.
Mon Nov 02, 2020 7:24 pm

BigEd Joined: Wed Jan 09, 2013 6:54 pm Posts: 1836	Re: 74xx based CPU (yet another) Yikes! (But then x86 is uber-CISC...)
Mon Nov 02, 2020 7:31 pm

joanlluch Joined: Fri Mar 22, 2019 8:03 am Posts: 328 Location: Girona-Catalonia	Re: 74xx based CPU (yet another) Well, don't say that too quick. The ARM has that too, except that you can't mix the Scale with the Displacement on the same instruction. In this case they just didn't use any different name than just ADD. By not specifying the flags generating bit, the ADD behaves exactly as the x86 LEA.
Mon Nov 02, 2020 7:53 pm

joanlluch Joined: Fri Mar 22, 2019 8:03 am Posts: 328 Location: Girona-Catalonia	Re: 74xx based CPU (yet another) Hi Ed, BigEd wrote: (Oh, good point about LEA - I was never sure if that was just a bit of syntactic sugar, but if it's actually flag-free arithmetic that makes sense!) Just to make the point about flag-modifying and non-flag-modifing 'adds', I prepared an example on the CPU74 architecture to show the difference. The first scenario is with 'lea' instructions disabled/removed from the compiler. The second one is the normal one: Code: int testX16( int a, int b, int c ) { c = a==b ? 2+a : 3+b; return a==b; } CPU74 (lea instructions disabled) Code: .globl testX16 testX16: add SP, -2, SP st.w r4, [SP, 0] mov r0, r3 add r3, 2, r3 mov r1, r4 add r4, 3, r4 cmp.eq r0, r1 selcc r3, r4, r0 st.w r0, [r2, 0] setcc r0 ld.w [SP, 0], r4 add SP, 2, SP ret CPU74 (lea instruction enabled) Code: .globl testX16 testX16: cmp.eq r0, r1 lea r1, 3, r1 lea r0, 2, r0 selcc r0, r1, r0 st.w r0, [r2, 0] setcc r0 ret Recall that 'lea' is just a normal 'add' but it does not affect status register flags. Notice the big difference. In the first scenario the compiler used normal flag-altering 'add' instructions. This prevents instruction reordering optimisations which results in more register pressure and even the creation of a stack frame to save used registers. On the second case, the compiler starts by first comparing the function arguments a, b. Then speculatively adds constant values to them without modifying the flags, then selects one of them based on the flags that are still valid from the compare. Next, the result is stored in memory pointed by argument c, still without modifying flags, and finally the still valid flags are used again to provide the function result. It can't be more efficient than that. Compared with the first version, generated by the same compiler with just one instruction removed from the set, it looks that having non-flag-altering adds is a big advantage.
Fri Nov 06, 2020 7:04 pm

BigEd Joined: Wed Jan 09, 2013 6:54 pm Posts: 1836	Re: 74xx based CPU (yet another) Thanks - that's an impressive demonstration!
Fri Nov 06, 2020 9:06 pm

joanlluch Joined: Fri Mar 22, 2019 8:03 am Posts: 328 Location: Girona-Catalonia	Re: 74xx based CPU (yet another) I made a further step on the optimisation of the instruction set by slightly changing the semantics of some instructions. The isa already specifies that all word memory accesses must be aligned, that means all word memory accesses happen in even memory addresses. This brings the opportunity to encode all immediate fields as scaled by a factor of two by shifting them right and discarding the least significant bit (which is always zero). This automatically doubles the reach of such instructions without having to prefix them. In addition, I also changed the semantics of the BaseRegister plus IndexRegister indirect addressing mode by interpreting that the IndexRegister already comes scaled by a factor of two in case of word access instructions. The above semantics are very common in modern processors featuring the said addressing mode, including the ARM, the X86 and virtually all Risk processors out there. The only reason why I didn't incorporate this long ago is because I thought this would add a significant performance hit on the hardware. This is because that requires the insertion of a 'shift' unit in the data path to the ALU. However, I have recently realised that in this case this is very cheap because it can be done with 74CBT analog switches, which add a negligible propagation delay once they are pre-selected, which can be done right from the decoder before register data gets there. The Compiler and the Assembler has been already updated for that, and only the Simulator remains waiting on my pipeline, so there's really not a lot of tests been performed yet. But I can show some code generation implications of this. The following are a couple of examples of compiled code before and after Code: int (funcListx[6])() = {f1, f2, f3, f4, f5, f6}; int swTest2( int a, int b, int i ) { return funcListx[i]( a, b ); } CPU74 Before Code:* swTest2: add r2, r2, r2 ld.w [r2, &funcListx], r2 call r2 ret CPU74 After Code: swTest2: mov &funcListx, r3 ld.w [r3, r2], r2 call r2 ret The previous example is a dispatch table where a function is called based on index. There's not a lot of differences in this case, but I like this example because it clearly shows the point. - On the first case example, the compiler used the trick of placing the function table address as the immediate field of a register + offset addressing mode. The base register is then used as index and must be multiplied by two in order to correctly point the relevant table element. The same could be done by just loading the address on the first register and using a second one as the index, but that would have required one more instruction and two registers instead of just one. - On the second case, the compiler just avoids any tricks and plainly uses the scaling property of the register+register indirect addressing. In this case the base table address is loaded into register R3, and register R2 is used as index. Thanks to the implied scaling, there's no need to multiply it by 2. The most clear example of the semantics difference is this simple one however: Code: int test33b( int ss, int i) { return ss[i]; } CPU74 before Code:* .globl test33b test33b: add r1, r1, r1 ld.w [r0, r1], r0 ret CPU74 After Code: .globl test33b test33b: ld.w [r0, r1], r0 ret In the second case, as per the updated semantics, register R1 is scaled by 2 on the load word instruction, so there's no need to multiply it by 2 beforehand. The next example shows what happens in a loop involving word memory accesses. Code: void pep (int dst, const int src, unsigned int len) { for ( int i=0 ; i<len ; ++i ) dst[i] = src[i]; } CPU74 Before Code: .globl pep pep: addx SP, -2, SP st.w r4, [SP, 0] cmp.eq r2, 0 brcc .LBB1_3 mov 0, r3 .LBB1_2: ld.w [r1, r3], r4 st.w r4, [r0, r3] addx r3, 2, r3 sub r2, 1, r2 brncc .LBB1_2 .LBB1_3: ld.w [SP, 0], r4 addx SP, 2, SP ret CPU74 After Code: .globl pep pep: addx SP, -2, SP st.w r4, [SP, 0] cmp.eq r2, 0 brcc .LBB1_3 mov 0, r3 .LBB1_2: ld.w [r1, r3], r4 st.w r4, [r0, r3] addx r3, 1, r3 cmp.eq r2, r3 brncc .LBB1_2 .LBB1_3: ld.w [SP, 0], r4 addx SP, 2, SP ret In the first case, the compiler uses a decrementing loop induction variable for the loop count, and and a second one incrementing by 2 variable for the memory access index. So two induction variables are required. The second case is simpler, as there's a single incrementing loop induction variable which is reused for memory access index. It looks that thanks to compiler optimisations on the first case, the result in terms of register usage and cycle count is identical, so we could conclude that both semantics for the load instruction are equally good. However, compiler optimisations on the second case are lighter, almost inexistent, as it basically just needs to follow the source code, and this is an advantage in itself. The advantage is that the compiler is less constrained to apply more complex optimisations that may help in complex scenarios, as I have already found for the case of the "queens" code in the "some C compilers", which is shorter and potentially faster. I hope to be able to post the "queens" example with somewhat improved results after I get them... I think this change is a big improvement that makes the isa look more 'professional' to me, (if this word is even applicable to this project, that is). Now it is time to upgrade the software simulator before I move to hardware logic design again. Joan [EDIT: I forgot to mention that I renamed the "lea" instruction to "addx", so I think it has now a clearer meaning] [EDIT: use of English]
Sat Nov 07, 2020 7:21 pm

oldben Joined: Mon Oct 07, 2019 2:41 am Posts: 814	Re: 74xx based CPU (yet another) One other option is reserve that bit as long/short data if you move to 32 bit cpu. Ben.
Sun Nov 08, 2020 5:08 am

joanlluch Joined: Fri Mar 22, 2019 8:03 am Posts: 328 Location: Girona-Catalonia	Re: 74xx based CPU (yet another) oldben wrote: One other option is reserve that bit as long/short data if you move to 32 bit cpu. Ben. Hi Ben, that's a possibility, but moving to 32 bits architecture while still supporting 16 bit load/stores would also require that all baseRegister + IndexRegister addressing instructions would support 16 bit accesses too, and there's not enough encoding slots for that extra instructions unless immediate field sizes are reduced in other instructions. Another possibility is just not directly supporting 16 bit load/stores. This would not be anything new, because the ARM1 didn't support 16 bits either, but just 8 bits and 32 bits. In practice, the lack of 16 bit support is not an unsolvable problem because the compiler can be instructed to always align 16 bit variables to 32 bit boundaries. - 16 bit global vars can be aligned and padded to 32 bit boundaries, so the compiler can safely read or write 32 bits. - 16 bit struct fields can be aligned and padded to 32 bit boundaries, so the compiler can safely read or write 32 bits. - The stack can be aligned to 32 bits for all accesses, so the above access conditions apply to local vars as well - More difficult is the access of 16 bit array elements. This can just be forbidden or allow the compiler to use 32 bit elements in a transparent way to the user. The first DEC Alpha processors didn't even support 8 bit accesses, everything was 32 bits, but it was possibly very inefficient to work with character strings and byte sized data. I think that supporting 8 and 32 bits, but not 16 bits, is a more convenient tradeoff, and for the CPU74 instruction set it would just be a move from 16 to 32 bits, with essentially the same instruction set and addressing modes. Another important requisite for an efficient 32 bit processor is the availability of any-shift instructions, i.e a barrel shifter in the processor. Compilers love them, and in the case of 32 bits they can't get without it. This just would imply the replacement of the current sets of 1 bit and 8 bit shift instructions by one single set with a small 5 bit immediate to specify up to 32 bit shifts, and another set with shift amounts specified in a register. A different subject is that moving to a 32 bit architecture implies almost doubling the number of components, or at least the register file, and loosing performance, in particular if we still maintain the 16 bit alu and internal busses. Ideally the alu should be converted to 32 bits too... Anyway, something to consider in the future...
Sun Nov 08, 2020 10:26 pm

joanlluch Joined: Fri Mar 22, 2019 8:03 am Posts: 328 Location: Girona-Catalonia	Re: 74xx based CPU (yet another) I have updated the software simulator with the latest isa changes, this also includes all the bit fields and instructions rearrangement that I've been posting while developing the logisim model. I am happy to post that at least a basic set of instructions is already working correctly. I used this as the first basic test program It's essentailly a "Hello World" program that also prints a number : Code: void myprintstr( char str ) { while ( str ) putchar( str++ ); } void printnum( unsigned int num ) { int factors[] = {10000, 1000, 100, 10, 1}; for ( int i=0 ; i<(sizeof factors)/2 ; ++i ) { char ch = '0'; while ( num >= factors[i]) { ch = ch + 1; num = num - factors[i]; } putchar( ch ); } } int main() { myprintstr( "Hello world!\n" ); printnum( 12345 ); putchar( '\n' ); } After compiling - assembling - simulating, this is the result Code:* /Users/joan/Documents-Local/Relay/CPU74/Simulator/DerivedData/Simulator/Build/Products/Debug/c74-sim Hello world! 12345 Executed instruction count: 330 Total cycle count: 457 Elapsed simulation time: 0.00870718 seconds Calculated execution time at 1MHz : 0.000457 seconds Calculated execution time at 8MHz : 5.7125e-05 seconds Calculated execution time at 16MHz: 2.85625e-05 seconds Program ended with exit code: 0
Sun Nov 08, 2020 10:44 pm

oldben Joined: Mon Oct 07, 2019 2:41 am Posts: 814	Re: 74xx based CPU (yet another) 16 bit access could be done as macro for the few cases you need to fake a structure from a byte array like a disk directory structure. read16(x) ((char) x+(char)(x+1)<<8) Ben.
Mon Nov 09, 2020 2:16 am

Page 18 of 21

[ 305 posts ]

Go to page Previous 1 ... 15, 16, 17, 18, 19, 20, 21 Next

74xx based CPU (yet another)

Who is online