 74xx based CPU (yet another) 

Joined: Fri Mar 22, 2019 8:03 am
Posts: 126
Location: Girona-Catalonia
BigEd wrote:
On the Callback Tables thread, the key is perhaps the wish to have this facility:
> JSL (JumpTable, X)
so there's a table of addresses somewhere, and you'd like to index into them and use the indexed one as the address to call as a subroutine. One possibility is to push things on the stack and return, but that's only one possible way to do it. It's an indexed dispatch table, but rather than being a computed GOTO it's a computed GOSUB.


Hi Ed,

As promised, here is an example of compiled source involving a subroutine indexed dispatch table.

Source code:
Code:
// Define some functions

int add(int a, int b) {return a+b;}
int sub(int a, int b) {return a-b;}
int and(int a, int b) {return a&b;}
int or(int a, int b) {return a|b;}
int xor(int a, int b) {return a^b;}

// Define an array of function pointers.

int (*funcList[6])() = {and, sub, and, or, xor};

// Define a function that calls one of the above depending on index 'i'

int swTest2( int a, int b, int i )
{
  return funcList[i]( a, b );
}


This gets compiled for the CPU74 architecture (already using the newest instruction set) like this:

CPU74
Code:
   .text
   .file   "main.c"
# ---------------------------------------------
# add
# ---------------------------------------------
   .globl   add
add:                 // Entry point for function 'add'
   add   r1, r0, r0
   ret

# ---------------------------------------------
# sub
# ---------------------------------------------
   .globl   sub
sub:                   // Entry point for function 'sub'
   sub   r0, r1, r0
   ret

# ---------------------------------------------
# and
# ---------------------------------------------
   .globl   and
and:                 // Entry point for function 'and'
   and   r1, r0, r0
   ret

# ---------------------------------------------
# or
# ---------------------------------------------
   .globl   or
or:
   or   r1, r0, r0
   ret

# ---------------------------------------------
# xor
# ---------------------------------------------
   .globl   xor
xor:
   xor   r1, r0, r0
   ret

# ---------------------------------------------
# swTest2
# ---------------------------------------------
   .globl   swTest2
swTest2:                // On entry, register r0 contains 'a' and register r1 contains 'b'
   ld.w   [SP, 2], r2   // As per the ISA calling convention, the third argument is passed on the stack, so this gets the index 'i' into register r2
   lsl   r2, r2         // Shifts left once (multiply by 2) because pointers are 16 bits
   ld.w   [r2, &funcList], r2     // The compiler used one of the new long indexed addressing instructions to load the relevant function address into r2 based on the table
   call   r2            // Call to the address contained in r2. Registers r0, and r1 already contain the function arguments so the compiler does not need to do anything special
   ret                  // As per the ISA, function return values go in r0, so the result is already where it should be for returning

# ---------------------------------------------
# Global Data
# ---------------------------------------------
   .data
   .globl   funcList
   .p2align   1
funcList:              // This is the beginning of the table. It's already initialised at compile time in this case
   .short   and       // So position zero (index 0) contains the address of function 'and', per the C initialiser
   .short   sub       // The position one (index 1) contains the address of function 'sub'
   .short   and       // so on...
   .short   or
   .short   xor
   .short   0                            // I guess this is padding, because the array in the C code was defined with 6 elements but only 5 are used


I manually added some comments to the generated code to indicate what's going on.

Similar code is generated automatically for switch statements that can be optimised this way; of course, "jmp" is used instead of "call" in that case.

Joan


Mon Aug 12, 2019 7:39 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1225
Thanks for the example! That's very clear.


Mon Aug 12, 2019 9:03 pm

Joined: Fri Mar 22, 2019 8:03 am
Posts: 126
Location: Girona-Catalonia
I corrected some bugs in the assembler and also pushed some code examples to the git repo, under this sub-directory:

https://github.com/John-Lluch/CPU74/tree/master/Test-Examples

This includes a first version of the runtime libraries for the cpu74, currently implemented in C. The relevant files are named "system.c" and "system.s"; the ".s" file is the compiled version of the former.

The library functions that I have implemented so far are the following:
Code:
void *memcpy (void *dst, const void *src, unsigned int len);
int __ashrhi3 (sint16_type a, int amount);
uint16_type __lshrhi3 (uint16_type a, int amount);
sint16_type __ashlhi3 (sint16_type a, int amount);
sint32_type __ashrsi3 (sint32_type a, int amount);
uint32_type __lshrsi3 (uint32_type a, unsigned int amount);
sint32_type __ashlsi3 (sint32_type a, int amount);
uint16_type __mulhi3 (uint16_type a, uint16_type b);
uint16_type __udivhi3 (uint16_type num, uint16_type den);
uint16_type __umodhi3 (uint16_type num, uint16_type den);
sint16_type __divhi3 (sint16_type a, sint16_type b);
sint16_type __modhi3 (sint16_type a, sint16_type b);
uint32_type __mulsi3 (uint32_type a, uint32_type b);
uint32_type __udivsi3 (uint32_type num, uint32_type den);
uint32_type __umodsi3 (uint32_type num, uint32_type den);
sint32_type __divsi3 (sint32_type a, sint32_type b);
sint32_type __modsi3 (sint32_type a, sint32_type b);

These library calls implement non-constant shifts and basic arithmetic for 16- and 32-bit values, particularly multiplication, division and remainder. I found most of the code (or variations of it) on the internet, and basically adapted existing 32-bit processor versions to my 16-bit needs. Unfortunately, it was not possible to look at the 64-bit functions of 32-bit processors to create my own 32-bit functions, because 32-bit processors already have basic multiplication and other special instructions that I don't have. So my 32-bit functions, especially the 32-bit division, are very unoptimised. I'm still searching for 16-bit processor versions of 32-bit functions implemented in C. So far no luck...

All functions are direct outputs from the compiler using the -Oz flag (optimise for size). I can obtain slightly faster versions using more aggressive compiler flags, but then the code size increases and I don't like that. There's no manual assembly tweaking. I think my priority now should be testing, so I'll have the chance later to hand-code these functions in assembly if that proves beneficial.

Finally, I also pushed "testfile.c" and related files, which are basically a set of user functions that will either call the library functions or produce inline code.

The file "testfile.log" is the actual log file output by the assembler, which I invoked to assemble "testfile.s" and "system.s" together. It's interesting to look towards the end of the log file and see the opcodes the assembler generated.

https://github.com/John-Lluch/CPU74/blob/master/Test-Examples/testfile.log

At this time I have the compiler working (minus a few known bugs), the assembler working, and the ability to generate assembler log files to track down any instruction encoding errors.

So I guess it's now time to begin thinking about the cpu74 emulator, which should help verify that everything is OK (software-wise). For that, I suppose I have two options.

- The first option is to implement the emulator just as a direct opcode interpreter, so that the machine code gets executed but nothing (or very little) of the machine architecture is factored in.

- The second option is to implement the emulator, as much as possible, in terms of the actual hardware architecture, so that it does what actually happens in the processor hardware.

The first option is much easier, but it provides fewer opportunities for testing. The second option is potentially more useful, but I may end up with something that doesn't resemble the hardware that much after all, even if it works fine as a software emulation. So I'm still undecided about what to do...

Joan


Wed Aug 14, 2019 5:02 pm

Joined: Wed Apr 24, 2013 9:40 pm
Posts: 169
Location: Huntsville, AL
Joan:

I recommend the second option. I've used an approach like it in my Python model of the M65C02A. It is not an exact match to the approach used in my FPGA RTL, because I did not build a microprogram interpreter; however, I did build the model in the manner that the RTL implements the instructions. This approach has helped me work out some of the few remaining issues with the RTL model. In other words, it enabled me to determine how to add the last few unimplemented capabilities that I want for my 16-bit extensions in a more HW-oriented manner.

I still have a few additional elements like virtual memory that I've not implemented in the Python model, but all of the other features have been implemented and tested. So I'm confident that I can get those features into the HW without too much trouble.

I am also able to test my Pascal compiler in the emulator, and make note of those instruction sequences that could be better implemented with dedicated instructions. Although I've not added the capability to the py65 environment yet, I expect to add an instruction histogram to provide data for further instruction set tuning.

_________________
Michael A.


Wed Aug 14, 2019 6:31 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1225
In CPU teams I've seen both kinds of emulators written. The more fine-detailed one is a pain to keep up to date as the implementation changes, and is eventually abandoned, but at an early stage it's useful for architectural explorations.


Wed Aug 14, 2019 9:10 pm

Joined: Fri Mar 22, 2019 8:03 am
Posts: 126
Location: Girona-Catalonia
Hi Michael

Thanks for your reply. Your suggestion on the emulator is actually the one that (in principle) appeals most to me. But I'm aware that a 100% accurate hardware emulation is not possible; not even a 50% accurate implementation is practical, I suppose. Maybe I can get to something that decodes instructions in software in a way that can be extrapolated to hardware, and executes instructions by theoretically generating the same control signals that the real thing would. I won't go so far as to emulate the ALU, because I don't think that's really necessary: it's something I should be able to test in Logisim as a unit, and then with real ICs plugged into breadboards, or so I think.

I'm interested in learning more about your project, but I can't seem to find the proper links, not even on the 6502.org forum. I found information on Py65 by searching on Google, but I'm not sure whether what I found is your work or you are just using parts of it. I didn't manage to find anything about your Pascal compiler either.

About Pascal: it is a language I remember with great nostalgia. I think it was the first serious computer language that I learnt (possibly after Basic), and I wrote a lot of code in Borland's Turbo Pascal many years ago. That compiler was amazingly fast (but it only gave a single source code error per build, lol). Also, the entire APIs of the original Apple Macintosh were written in a beautiful mix of Pascal and assembly. I sometimes think it's a pity that the language eventually lost traction in favour of the much worse looking and oddly named language "C", but I suppose every historic event has its reasons. Btw, I also recall that the VAX-11 Pascal compiler was EXTREMELY slow: it took 3-4 hours to compile 1000 lines of source, whereas a Fortran or C program of the same size might take "only" 15 minutes or so. Times have really changed...


Fri Aug 16, 2019 8:35 am

Joined: Fri Mar 22, 2019 8:03 am
Posts: 126
Location: Girona-Catalonia
BigEd wrote:
In CPU teams I've seen both kinds of emulators written. The more fine-detailed one is a pain to keep up to date as the implementation changes, and is eventually abandoned, but at an early stage it's useful for architectural explorations.

Hi Ed, I see your point and I will possibly end up doing a mix of the two; please see my reply to Michael above. What I'm most interested in is the decoding of the actual instructions into microinstruction opcodes, and the generation of control signals. As part of that, I might implement a black box that executes programs based on the microinstruction opcodes alone, or optionally on the control signals alone, so that I can execute programs before I implement the control signal box (if that makes sense).


Fri Aug 16, 2019 9:01 am

Joined: Fri Mar 22, 2019 8:03 am
Posts: 126
Location: Girona-Catalonia
I will start working on the simulator soon.

In the meantime, I realised that there are a number of occasions where 15-bit shifts are required. I chose not to implement hardware shifts, so shifts are really expensive for this processor. Unfortunately, the LLVM compiler seems to have a total predilection for using them; I suppose that's because most modern architectures have them embedded in the processor, where they are supposed to be cheap. I have already adopted some strategies, such as swapping bytes and using 'extend' instructions for shift amounts of 8 or more, but the fairly common 15-bit shifts remained relatively expensive. So I just wanted to show what the compiler now generates for some fixed amounts. See comments below:
Code:
// 8-bit amount shifts
int asr_8bit( int a ) { return a >> 8; }
unsigned lsr_8bit( unsigned a ) { return a >> 8; }
unsigned lsl_8bit( unsigned a ) { return a << 8; }

// 15-bit amount shifts
int asr_15bit( int a ) { return a >> 15; }
unsigned lsr_15bit( unsigned a ) { return a >> 15; }
unsigned lsl_15bit( unsigned a ) { return a << 15; }


CPU74
Code:
   .text
   .file   "main.c"
# ---------------------------------------------
# asr_8bit
# ---------------------------------------------
   .globl   asr_8bit
asr_8bit:
   bswap   r0, r0    // swap bytes
   sext   r0, r0    // sign extend byte
   ret

# ---------------------------------------------
# lsr_8bit
# ---------------------------------------------
   .globl   lsr_8bit
lsr_8bit:
   bswap   r0, r0    // swap bytes
   zext   r0, r0    // zero extend byte
   ret

# ---------------------------------------------
# lsl_8bit
# ---------------------------------------------
   .globl   lsl_8bit
lsl_8bit:
   zext   r0, r0    // zero extend byte
   bswap   r0, r0    // swap bytes
   ret

# ---------------------------------------------
# asr_15bit
# ---------------------------------------------
   .globl   asr_15bit
asr_15bit:
   sextw   r0, r0    // sign extend word
   ret

# ---------------------------------------------
# lsr_15bit
# ---------------------------------------------
   .globl   lsr_15bit
lsr_15bit:
   cmp   r0, 0       // compare with zero
   setlt   r0        // set to 1 if negative
   ret

# ---------------------------------------------
# lsl_15bit
# ---------------------------------------------
   .globl   lsl_15bit
lsl_15bit:
   and   r0, 1, r0      // test the least significant bit (this sets or clears the Status Register zero flag)
   mov   0, r0          // set constant zero in a temporary register
   mov   -32768L, r1    // set constant 0x8000 in a temporary register (most significant bit set to 1)
   seleq   r1, r0, r0   // select 0x8000 or zero depending on whether the test above was true
   ret

The tweak requiring the most code is the 15-bit logical shift left. I have given it some thought and I think there's no easier way to achieve it with the existing instructions. I think the usual approach, in the absence of arbitrary-amount shift instructions, is either repeating a number of one-bit shift instructions or placing one in the body of a loop, so the solutions above should be better...


Fri Aug 16, 2019 9:27 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1225
If the result of lsl15 is to examine bit0 and produce either zero or 0x8000, how about something like this:
Code:
   mov   -32768L, r1    // set constant 0x8000 in a temporary register (most significant bit set to 1)
   and   r0, 1, r0      // test for the least significant bit (this will set or reset the Status Register zero flag)
   addeq r1, r1, r1     // double 0x8000 to transform it to zero
This assumes you have addeq...


Fri Aug 16, 2019 9:37 am

Joined: Fri Mar 22, 2019 8:03 am
Posts: 126
Location: Girona-Catalonia
Oops, I realised I had the operands reversed in the compiler when generating the SEL instruction above. That's corrected now, and the correct code is generated for the lsl_15bit function:
CPU74
Code:
# ---------------------------------------------
# lsl_15bit
# ---------------------------------------------
   .globl   lsl_15bit
lsl_15bit:
   and   r0, 1, r0
   mov   -32768L, r0
   mov   0, r1
   seleq   r1, r0, r0
   ret

I forgot to mention that in real programs the code above tends to be slightly shorter, because the compiler may be able to reuse register constants from some previous computation, especially the quite common zero constant, so there's no need to reload it if it's already available in a register.


Fri Aug 16, 2019 9:42 am

Joined: Fri Mar 22, 2019 8:03 am
Posts: 126
Location: Girona-Catalonia
BigEd wrote:
If the result of lsl15 is to examine bit0 and produce either zero or 0x8000, how about something like this:
mov -32768L, r1 // set constant 0x8000 in a temporary register (most significant bit set to 1)
and r0, 1, r0 // test for the least significant bit (this will set or reset the Status Register zero flag)
addeq r1, r1, r1 // double 0x8000 to transform it to zero
This assumes you have addeq...

Hi Ed, I think we crossed posts.
I don't have the 'addeq' instruction. I assume this is some sort of conditional add? Or do you intend to play with the carry somehow? I do have the addc (add with carry) instruction though...


Fri Aug 16, 2019 9:47 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1225
Indeed, I was imagining that your seleq was part of a more general predicated instruction set... which I could easily have checked!


Fri Aug 16, 2019 9:48 am

Joined: Fri Mar 22, 2019 8:03 am
Posts: 126
Location: Girona-Catalonia
BigEd wrote:
Indeed, I was imagining that your seleq was part of a more general predicated instruction set... which I could easily have checked!

I actually gave that a thought, indeed. But it turns out that this consumes a lot of encoding slots because, you know, I need 3 bits for the condition codes. The SEL instruction alone uses 3 bits for the condition code and 3x3=9 bits for the registers; that's 12 bits out of 16 consumed by a single instruction, which can't be overlapped by other instruction encodings. So that's really tight, and it's the reason I have only 3 types of conditional instructions: SELCC (the one just discussed), BRCC (conditional branch, which also consumes a lot due to the embedded PC offset field), and SETCC (which is fairly light, because it only requires the destination register besides the condition code).


Fri Aug 16, 2019 9:56 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1225
I wonder if using just two bits for predication would be any use? I'm sure that not all codes are used equally often.
For example, you might have ALWAYS, NOTZERO, CARRYSET, NEGATIVE - or some other set.

Of course, this sort of cherry-picking might not play well with a compiler. And it does still consume two bits.


Fri Aug 16, 2019 10:05 am

Joined: Fri Mar 22, 2019 8:03 am
Posts: 126
Location: Girona-Catalonia
BigEd wrote:
I wonder if using just two bits for predication would be any use? I'm sure that not all codes are used equally often.
For example, you might have ALWAYS, NOTZERO, CARRYSET, NEGATIVE - or some other set.

Of course, this sort of cherry-picking might not play well with a compiler. And it does still consume two bits.

Believe it or not, these are all things that I weighed at some point, and I'm aware that at least some of the "single page" processors use that approach. But I found that in order to have a complete instruction set capable of addressing memory bytes, with signed/unsigned arithmetic and a relatively direct implementation of all the features available in C, I either needed a longer base instruction encoding (such as 32 bits), like modern RISC processors, or a variable-sized encoding, like old CISC and not-so-old processors, if I was to encode some sort of predication. At some point I found that implementing the SELCC and SETCC instructions was very compiler friendly and already saved a lot of branching in practice. I will post some pieces of compiled code using those instructions in a next post that are pretty amazing. So far I think that's the best trade-off between instruction density and execution speed, but of course I have never designed an instruction set before, so there are surely many things that I could have missed or overlooked.


Fri Aug 16, 2019 10:23 am