View unanswered posts | View active topics It is currently Mon Aug 19, 2019 11:57 am



Reply to topic  [ 159 posts ]  Go to page Previous  1 ... 5, 6, 7, 8, 9, 10, 11  Next
 74xx based CPU (yet another) 
Author Message
User avatar

Joined: Fri Mar 22, 2019 8:03 am
Posts: 126
Location: Girona-Catalonia
BigEd wrote:
Interesting - and very positive - that you can get a compiler to make good use of the long life of a flag..

Indeed, and I will show a simple example in a following post where this feature is quite appealing.

BigEd wrote:
There are two kinds of flags, or three if you could distinguish the use of C for shifting and the use of C for arithmetic.

Making a distinction between the Shift Carry and the Add/Sub carry is something that I consider. At this point I believe it's not necessary, but I may be wrong. I will figure it out when I get expanded type shifts working.

BigEd wrote:
It might be that having the different sets of flags in different registers is crucial for the compiler.

This is an interesting observation. In fact, the compiler is not necessarily aware of the contents or even the meaning of the individual flags. The only thing that the compiler knows is that some instructions use them and that some instructions modify them. This information is used to determine which instructions must go before which others, and to perform reordering optimisations, for example to reduce register usage pressure. Along the compilation stages all instructions are represented as a linked graph based on their operands and results.The treatment of Status Flags is not different than operands or results.

The ADD instruction for example requires 2 operands and produces 2 results. The 2 results are the sum, and the carry. The ADDC instruction requires 3 operands (the last one is the carry in) and produces 2 results, the sum and the carry out.

To implement 32 bit addition I set the Carry Out of the ADD instruction to a explicit physical register (which is in fact the ASR status register, but it could be any abstractly created register). I also set the ADDC Carry In in to the same register. The compiler then knows that these two instructions are linked through this register, but this does not prevent other instructions to be added in between, as long as this register is not modified.

So to expand on your comment, yes that's correct. And furthermore, you could even implement a separate abstract register for every individual Flag, such as the N Z C V individual flags in the 6502, and use them independently only in the instructions that actually use some of them. So for example, if a particular instruction is known to never change the C flag, even if it can change the Z flag, then this instruction can be safely inserted between instructions that would use only C.


Fri May 24, 2019 4:26 pm
Profile
User avatar

Joined: Fri Mar 22, 2019 8:03 am
Posts: 126
Location: Girona-Catalonia
After so much time looking at the generated assembly code, I realised that I could possibly reduce both overall register pressure and code size, without compromising or even improving performance, by making minor changes on the calling convention and relaxing the register preservation policy.

These are the changes:

- Two registers, R0 and R1, instead of three, are now used for function arguments, any additional arguments are passed in the stack.

- Registers R0 and R1 can be used as return values, through R3 can be modified by functions without guarantee of being preserved. This means that functions are only required to preserve registers from R4 to R7. Previously, I preserved all registers from R3 to R7.

This is the current register usage as per the calling convention:

R0 : Used as Argument and Return value (if required)
R1 : Used as Argument and Return value (if required)
R2 : Not preserved by the callee
R3 : Not preserved by the callee
R4 : Preserved by the callee
R5 : Preserved by the callee
R6 : Preserved by the callee
R7 : Preserved by the callee, Frame pointer (if required)

The improvement comes from the fact that the compiler always attempts to make as much use of registers as possible and to let them live for as long as possible. This is theoretically good, but this often comes at a price: registers must be preserved by the callee and push/pop instructions must be inserted during prologue/epilogue emission. By reducing the usage of registers as arguments, and relaxing the policy of register preservation, we get the double benefit of freeing more registers, which also do not generally need to be saved on the stack.


Sat May 25, 2019 10:23 pm
Profile
Online

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1225
An interesting tradeoff!


Sun May 26, 2019 6:55 am
Profile
User avatar

Joined: Fri Mar 22, 2019 8:03 am
Posts: 126
Location: Girona-Catalonia
I've made some further improvements on the compiler code generation.

I added a collection of 4 instructions to perform add/subs with general purpose registers and the SP. Previously, only a single instruction to add to the SP was available. Now, all combinations are possible:
Code:
add SP, Rd, SP
sub SP, Rd, SP
add SP, Rd, Rd
sub SP, Rd, Rd
(Recall that the destination register is always the one towards the right hand side)

I added and/or logical instructions with constant immediate:
Code:
and Rd, K, Rd
or Rd, K, Rd
These take a single register and perform or/and immediate on the same register. The immediate is only 6 bit long, but I fond it's still useful for performing bit tests and other logical operations on the lower bits.

Also added a special instruction to perform a 16 bit sign extent of a register to a second register:
Code:
sextw Rs, Rd
In particular the instruction sets all ones on Rd if Rs is negative, or zero otherwise. It's also equivalent to a 15 bit Arithmetic Shift Right. I call it "Sign Extend Word". Previously, I used the following sequence of instructions to perform the same:
Code:
cmp Rs, 0
setlt Rd
neg Rd
I decided to create the new instruction because I found it is frequently used in 32 bit arithmetic operations. (More on that latter)

I removed all the 'Shift' with Carry instructions. They were intended to implement 32 bit or larger shifts, but I found that they are not that useful in most cases. Particularly, the CPU74 only supports one bit shifts. Shifts of larger amount must be coded anyway with the help of tricks such as byte swaps and sign/zero extensions. For shifts of greater amount the limitation remains and the availability of 'Shift' with Carry does not improve anything. Instead, the compiler generates suitable instruction sequences with logical operations to get the same results. (More on that latter)

To fit the new instructions in the highly constrained encoding space, I reduced the immediate field for conditional branch from 10 bits to 9 bits and refactored the entire instruction set (again). 9 bits still allow for conditional jumps forward or backward of 255 normal instructions. This is a total range of 1 K bytes if we consider that program memory addresses are expressed in 16 bit words. I have yet to implement the 'expand branches' pass, but I don't expect that it will have much work to do in practice (.i.e replacing immediate jumps by long jumps)

As part of the instruction encoding refactoring, I also moved encoding fields to more convenient places, which should make the hardware decoder slightly simpler. For example, the total bit range for instruction opcodes is now only 11 bits (as opposed to 12), so that will require one less wire from the instruction register to the decoder PLA


Tue Jun 04, 2019 7:24 pm
Profile
User avatar

Joined: Fri Mar 22, 2019 8:03 am
Posts: 126
Location: Girona-Catalonia
Following some examples that show the new improvements on the compiler:

The variable sized objects example that I posted before now compiles like this:
Code:
void test ( int *a, int *x, int *y );
void callTest( int siz0, int siz1)
{
  int arr0[siz0];
  int a=22;
  int arr1[siz1];

  test( &a, arr0, arr1 );
}
CPU74
Code:
callTest:
   push   r7
   mov   SP, r7
   sub   SP, 2, SP
   mov   r1, r2
   mov   r0, r1
   lsl   r1
   sub   SP, r1, r1
   mov   r1, SP
   mov   22, r0
   st.w   r0, [r7, -2]
   lsl   r2
   sub   SP, r2, r2
   mov   r2, SP
   mov   r7, r0
   sub   r0, 2, r0
   call   test
   mov   r7, SP
   pop   r7
   ret

Compared with the earlier version (found here http://anycpu.org/forum/viewtopic.php?f=23&t=583&start=90#p4489) The output assembly is now shorter by 4 instructions. The main difference is the use of 'sub SP, r2, r2' and 'sub SP, r1, r1' instructions that were not available before. Also the reduced register pressure enables the compiler to generate the entire function with only 3 registers usage (r0 to r2), so there's no need to save them on the stack.


Tue Jun 04, 2019 7:56 pm
Profile
User avatar

Joined: Fri Mar 22, 2019 8:03 am
Posts: 126
Location: Girona-Catalonia
Another compiler improvement is the avoidance of 'compare' instructions when it's known that the Status Register contains the right flags. For example, consider the following code that involves a 32 bit comparison
Code:
int cmpeql(long a,  long b) {
  return a==b;
}
Code:
cmpeql:
   ld.w   [SP, 4], r2      // load higher 16 bits of b
   xor   r1, r2, r1       // xor with higher 16 bits of a (result zero if equal)
   ld.w   [SP, 2], r2     // load lower 16 bits of b
   xor   r0, r2, r0       // xor with lower 16 bits of a (result zero if equal)
   or   r0, r1, r0       // or the two partial results (result zero if both zero) <- this is the comparison result, SR gets updated
   seteq   r0            // use the SR to set the comparison result
   ret
The seteq instruction does not require a previous 'cmp' instruction, as it was formerly the case, because the previous 'or' instruction already sets the Status Register.

A more complex example is this one:
Code:
int testX(  long a, long b, long *c )
{
  *c = a==b ? 2 : 3;
  return a==b;
}
Code:
testX:
   ld.w   [SP, 6], r2    // load c (this is a 16 bit pointer)
   mov   0, r3            // load constant 0
   st.w   r3, [r2, 2]    // store constant 0 in upper 16 bits of *c
   ld.w   [SP, 4], r3    // load higher 16 bits of b
   xor   r1, r3, r1       // xor with higher 16 bits of a (result zero if equal)
   ld.w   [SP, 2], r3    // load lower 16 bits of b
   xor   r0, r3, r0       // xor with lower 16 bits of a (result zero if equal)
   or   r0, r1, r0       // or the two partial results (result zero if both zero) <- this is the comparison result, SR gets updated
   mov   3, r1            // load constant 3 - does not affect SR
   mov   2, r3            // load constant 2 - does not affect SR
   seleq   r3, r1, r1     // select constant based on the SR - SR still not affected!
   st.w   r1, [r2, 0]    // store the selected constant in lower 16 bits of *c - does not affect SR
   seteq   r0             // now use the SR (again) to set the comparison result
   ret                  // return with the result in R0

The comments in the assembly code describe what it does. Note that both the 'seleq' and 'seteq' instructions in the code rely on the same value of the Status Register generated by a previous 'or' instruction.

The code above shows that the relevant instruction modifying the SR (in this case an 'or') does not need at all to be just before a 'set', 'sel' or 'branch' instruction. As long as any in-between instructions will not affect the SR, they can be inserted. This is great flexibility for compiler optimizations.

In addition, the code above shows other two interesting optimizations: First, the upper 16 bits of *c are set to zero earlier in the function because whatever is the result, they will always come to zero. This is a measure aimed at reducing register pressure; Second the 'a==b' code is not computed twice, but only once. In this particular example, the partial result is not even stored in a register, as the unaffected Status Register is reused to set the return value.

I never cease to find compilers amazing.


Tue Jun 04, 2019 10:09 pm
Profile
Online

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1225
joanlluch wrote:
I never cease to find compilers amazing.

Hear hear!


Wed Jun 05, 2019 8:26 am
Profile
User avatar

Joined: Fri Mar 22, 2019 8:03 am
Posts: 126
Location: Girona-Catalonia
I tried to compile some essential runtime functions in order to figure out whether it's worth to implement them in assembly or I just leave them as c sources. The first obvious try is the 'memcpy' function. This is the most basic form of the memcpy function:

Code:
__attribute__((builtin))
void *simple_memcpy (void *dst, const void *src, unsigned int len)
{
  char *d = dst;
  const char *s = src;
  for ( int i=0 ; i<len ; ++i )
    *d++ = *s++;

  return dst;
}

While compiling this, I realised that the compiler wants to favour down-counting semantics. That is, for the number of required loop iterations the compiler would use a register that would be decremented on each iteration until it is zero. Well, that's quite a common practice, and it generally works well, but this assumes that it would not add any overhead on the loop body, which may not be always the case. For example, the counting-down register can not be used as the index of the loop body, so if the architecture does not support post-increment load/stores then an additional ascending counter and extra instructions are needed

My architecture gets particularly affected by this, and I was not happy with the code generated by the compiler, which consisted in an indirect load/store pair (2 instructions) plus 2 more instructions to increment the two pointers, plus the descending counter (one more instruction) to know when to stop.

Ideally, the compiler should generate a register indexed load/store pair using the index both as a the memory index and counter, for a total of 3 instructions instead of the 5 discussed in the previous paragraph.

I managed to achieve that with some help of the LLVM community mailing-lists, but I must say that it was harder than I expected. I am not going to explain in detail how I managed to change the compiler behaviour, but in summary it consisted in tweaking the 'cost' function of the "Loop Strength Reduce" optimisation pass. Incidentally, the Intel x86 architecture also suffers from the same exact problem because of lack of post-increments, and it's indeed the only compiler target with a custom 'cost' function, which I used to learn what was required.

So at the end of the day, the compiler produced this nice output for the simple_memcpy function:
CPU74
Code:
   .globl   simple_memcpy
simple_memcpy:
   push   r4
   mov   0, r3
   jmp   Lbl_BB1_2
Lbl_BB1_1:
   ld.sb   [r1, r3], r4
   st.b   r4, [r0, r3]
   add   r3, 1, r3
Lbl_BB1_2:
   cmp   r2, r3
   brne   Lbl_BB1_1
   pop   r4
   ret

Note that the loop body is just 3 instructions and it can't be shorter than that.

[edited: use of English]


Sat Jun 08, 2019 4:28 pm
Profile
User avatar

Joined: Fri Mar 22, 2019 8:03 am
Posts: 126
Location: Girona-Catalonia
I have not posted in a while, but I made some progress, summarised below:

Switched to LLVM 9.0 for the Compiler. Until recently, I have been using the LLVM 7.0.1 release for this. I decided to upgrade after I realised that a loop strength reduction optimisation that I was proposing to the LLVM community was already in place in the soon to be released version 9. So rather than trying to improve on an old version I just switched to the one currently in development. LLVM 9.0 is not officially released yet but it’s already available through gitHub. The switch took longer than I anticipated because there were some regression bugs that prevented the official code from compiling in Xcode. It looks that most LLVM developers use 'Ninja', so xCode is not that popular among them. But I'm familiar with xCode and I will continue using it because I'm comfortable with it. Thankfully, somebody looked at the LLVM 9.0 issues with Xcode and solved the problems.

I also started working on the Assembler. My first intention was to code it in C or C++, but after having worked half of my career on Objective-C, I finally decided to give a chance to Apple's "Swift" programming language. This also took some time because I had to learn a language I never used before. I can give now an opinion on Swift, and to be honest, it has some aspects that I do not totally appreciate. I mean, it's ok for quick development because it's kind of a scripting language that gets compiled. I'm also familiar with the foundation APIs because they are the same Objective-C ones which has helped. However, I miss the C like low level features of Objective-C such as explicit pointer arithmetic, and others that have been removed from the language. This makes Swift too verbose for performing simple low-level instructions.

Anyway, I suppose I can say I have now reached one of my early milestones, with the compiler nearly complete, only pending to be tested in real situations, for which I need the assembler and then the CPU simulator.

I pushed the assembler and the instruction set docs to a gitHub repo for reference:

https://github.com/John-Lluch/CPU74

The assembler is coded around a top down descendant parser to generate an internal representation of the assembly file, which will then be converted into binary machine code on a second pass. I chose to do so instead of playing with regexs because I am not that used to them, and I sort of of qualify them best for interpreted languages, rather than compiled ones.

It will also support multiple input ".s" files that will be assembled together into a single executable file, simulating what a "linker" would do. It's not an "industry standard" way to do it, but it will still allow me to implement library source files that I will not have to copy every time into normal program files. I hope this makes sense.

Joan


Fri Jul 12, 2019 10:02 am
Profile
Online

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1225
An excellent adventure! It's helpful to others, I'm sure, to have got the XCode environment working.

BTW, in your docs in github you see to have nine registers: R0 to R8 - maybe a typo?


Sat Jul 13, 2019 9:19 am
Profile
User avatar

Joined: Fri Mar 22, 2019 8:03 am
Posts: 126
Location: Girona-Catalonia
BigEd wrote:
An excellent adventure! It's helpful to others, I'm sure, to have got the XCode environment working.

BTW, in your docs in github you see to have nine registers: R0 to R8 - maybe a typo?

Hi Ed,

thanks for catching that. It’s indeed an error! I mean to have 8 general purpose registers + SP + PC + Status.
I also lately figured out that I do not really need to have physically separated registers for SR and ASR flags. This was convenient for the compiler implementation because that made flag dependant instruction reordering easier, but I can just arrange all the flags in a single 8 bit register for the real thing.

I will correct that latter today.

(Currently at the “World Roller Games” event in Barcelona, as my daughter enrolled as a volunteer)

Joan


Sat Jul 13, 2019 11:18 am
Profile
User avatar

Joined: Fri Mar 22, 2019 8:03 am
Posts: 126
Location: Girona-Catalonia
I updated now the incorrect instruction set doc. I think it's correct now.


Sun Jul 14, 2019 2:37 pm
Profile
User avatar

Joined: Fri Mar 22, 2019 8:03 am
Posts: 126
Location: Girona-Catalonia
I have been working a bit more on the assembler and pushed an update to the gitHub repo that I linked before.

The assembler is not yet finished but I suppose it's already good time to briefly describe the main classes and functions. I have not based it on existing code, but I coded it my own way from scratch. I hope there's not something extremely wrong on the approach I took.

So the assembler classes are split into four source flies plus the "main" file

* main *

This just initialises two objects, a 'console' and a 'c74' instance. The 'console' object deals with the terminal command line. The 'c74' object just keeps an object instance of the 'Assembler' class which in turn contain an array of sources in an internal format.

* Console *

This contains classes to read the command line and to store the information in it, such as the list of assembly files and the executable file name. The class is also ready to be easily expanded to support command line options that may be required at some time in the future

* PrimitiveParser *

This is a crude implementation of a base class for a computer language parser. It essentially contains functions to parse common appearances in source files, such as tokens, integers, or strings. It is the base class for the 'SourceParser'. I adapted it by reducing the number of functions and adding some convenience ones for the particular needs of the assembler

* SourceParser *

Contains the implementation of a top-down (recursive) parser for the assembler. I'm aware that doing this way is possibly a bit overkill because I suppose an assembler can be implemented in terms of just reading lines alone, but anyway this is the way I ended doing it. The purpose of the parser is to convert text source files in an internal representation that is then processed to machine code bytes.

* Asm *

This file contains classes to keep the internal representation of the source assembly file. It also includes the 'Assembler' class which is the one that joins it all together to produce the final output

* InstInfo *

This is possibly the more interesting set of classes for the assembler. It contains the encoding definition of all the machine instructions as well as a list (actually a hash table) of all of them, for quick identification and opcode generation. The hash table (or 'Dictionary' in swift nomenclature) uses the internal instruction representation as the input 'key' and returns the machine instruction type as well as the instruction encoding for that type. The full machine instruction encoding is generated in the 'getMachineInstruction' function in a very straightforward way by just creating an instance of the given type with operands. The final step is updating any unresolved symbols with the actual relative offsets or absolute addresses. This also happens in a rather straightforward way in the 'assemble' function of the 'Assembler' class.

Joan


Sat Jul 20, 2019 5:18 pm
Profile
User avatar

Joined: Fri Mar 22, 2019 8:03 am
Posts: 126
Location: Girona-Catalonia
I just pushed the last commit to the git repo (https://github.com/John-Lluch/CPU74

The assembler was essentially finished, except for minor details and a couple of remaining bugs to be fixed, but after testing a lot of source code on the compiler and the assembler, I just decided that I wanted to try yet another iteration of the instruction set.

So the NEW instruction set is now uploaded to the Doc sections of the mentioned git repository. Differences are as follows:

- I reduced the total number of registers by one, and made the SP to belong to the general purpose set. So besides the PC and the Status Register, I have still 8 general purpose registers, but one of them is the SP (R7 is in fact the SP). Most special SP instructions that were required before have been removed in favour of general register operations. This also results in a more orthogonal and simpler instruction set, with only 10 encoding patters instead of the previous 13, and more encoding slots for new instructions.

To compensate for the one-less register I also made the following changes:

- Added more instructions that use the second word as immediate operand. Particularly, it is now possible to perform load/stores with large immediate offsets, which avoid the use of intermediate registers to load the offsets. In addition, the T10 pattern is ready for the future inclusion of ALU instructions with word sized immediates, these can be added easily if proved beneficial. The T7, T8, and T9 patterns can also be extended with more instructions by expanding their secondary 'op' fields

- Instruction embedded immediates for ALU operations are all now 8 bit long, instead of the previous 6 bit for some of them. There's also better encoding consistency.

- Conditional branches can now use 10 bit offsets, instead of just 9 bit. The number of bits available for branch offsets had to be reduced when the 'set and 'select' instructions were added, and this has been reverted now.

- Far subroutine calls can now be performed to a full address in the second word, rather than having to resort to register indirect calls for that

- There's no longer any ALU operation that will force the use the same register as source and destination (except some instructions with embedded immediates)

The changes are more significant than they may appear, specially because of the removal of the SP as an special register. The downside is that I had already a compiler and assembler tested for the previous instruction encoding, and now I will have to fix them for the new instructions and encodings. The assembler is pretty easy to fix, but the compiler will require entire parts to be revised and reimplemented in order to manage the upgraded instruction set. This will cause a project delay, but after all I'm doing it all for learning and fun and I know that I will not live comfortably if I do not re-make the compiler. I could just go ahead with the already working instruction set rather than stepping backwards and re-make the compiler, but suppose this is just the way I am.

Joan


Wed Jul 31, 2019 10:33 pm
Profile
Online

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1225
An interesting evolution! And sounds like it's all improvement too - arguably you have lost a register, but I think you've gained a lot in exchange for that. And presumably the instruction set now would support a programming idiom with dual stacks?

(Did you see the conversation on 6502.org about Callback Tables? It might be interesting to see how that works, in a given architecture.)


Thu Aug 01, 2019 7:11 am
Profile
Display posts from previous:  Sort by  
Reply to topic   [ 159 posts ]  Go to page Previous  1 ... 5, 6, 7, 8, 9, 10, 11  Next

Who is online

Users browsing this forum: No registered users and 1 guest


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Search for:
Jump to:  
cron
Powered by phpBB® Forum Software © phpBB Group
Designed by ST Software