


 74xx based CPU (yet another) 

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1225
Are you certain you're dealing with Thumb, and not with Thumb-2? The latter has the full ARM instruction set available, I think, where the former does not. That is, Thumb-2 allows for mixed instruction streams, whereas Thumb has more of a mode bit for each routine. Some ARMs support only Thumb and no full ARM, IIRC, so a Thumb-only ARM can't even fall back to full ARM at the granularity of a routine.


Tue Apr 23, 2019 8:02 am

Joined: Fri Mar 22, 2019 8:03 am
Posts: 124
Location: Girona-Catalonia
BigEd wrote:
Are you certain you're dealing with Thumb, and not with Thumb-2? The latter has the full ARM instruction set available, I think, where the former does not. That is, Thumb-2 allows for mixed instruction streams, whereas Thumb has more of a mode bit for each routine. Some ARMs support only Thumb and no full ARM, IIRC, so a Thumb-only ARM can't even fall back to full ARM at the granularity of a routine.

This is the reported list of 'registered' targets according to the llc tool (llc is the program that takes LLVM Intermediate Representation and converts it to target machine code):
Code:
[iMac-Alumini-2:~] joan% llc -version
LLVM (http://llvm.org/):
  LLVM version 7.0.1
  DEBUG build with assertions.
  Default target: x86_64-apple-darwin16.7.0
  Host CPU: penryn

  Registered Targets:
    aarch64    - AArch64 (little endian)
    aarch64_be - AArch64 (big endian)
    amdgcn     - AMD GCN GPUs
    arm        - ARM
    arm64      - ARM64 (little endian)
    armeb      - ARM (big endian)
    avr        - Atmel AVR Microcontroller
    bpf        - BPF (host endian)
    bpfeb      - BPF (big endian)
    bpfel      - BPF (little endian)
    cpu74      - CPU74 [experimental]
    hexagon    - Hexagon
    lanai      - Lanai
    mips       - Mips
    mips64     - Mips64 [experimental]
    mips64el   - Mips64el [experimental]
    mipsel     - Mipsel
    msp430     - MSP430 [experimental]
    nvptx      - NVIDIA PTX 32-bit
    nvptx64    - NVIDIA PTX 64-bit
    ppc32      - PowerPC 32
    ppc64      - PowerPC 64
    ppc64le    - PowerPC 64 LE
    r600       - AMD GPUs HD2XXX-HD6XXX
    sparc      - Sparc
    sparcel    - Sparc LE
    sparcv9    - Sparc V9
    systemz    - SystemZ
    thumb      - Thumb
    thumbeb    - Thumb (big endian)
    x86        - 32-bit X86: Pentium-Pro and above
    x86-64     - 64-bit X86: EM64T and AMD64
    xcore      - XCore
[iMac-Alumini-2:~] joan%

There are a lot because I selected "all" when I built the whole thing on my computer. Of course, the "cpu74" entry on the list was a later addition. As you can see, there are several ARM-related architectures supported. I chose "Thumb" for my tests. I'm not sure which Thumb version it is, but I noticed that it always produces 2-byte instruction encodings (i.e. I have encountered no 4-byte instructions so far).

On a second look at what's going on, I think it may not be a bug after all, and the generated code is correct!
I now realise that all the offsets in the code that I posted earlier are 4-byte aligned. This implies two things: first, that all the arguments for that architecture are passed in 4-byte-aligned memory positions; and second, that the immediate field of the instruction may already assume that this is always the case.
So I looked again into the ARM Thumb data sheet and found the following about the instruction in question, copied and pasted literally from the data sheet:
Quote:
L=0, Add unsigned offset (255 words, 1020 bytes) in Imm to the current value of the SP (R7). Store the contents of Rd at the resulting address.
L=1, Add unsigned offset (255 words, 1020 bytes) in Imm to the current value of the SP (R7). Load the word from the resulting address into Rd.
NOTE: The offset supplied in #Imm is a full 10-bit address, but must always be word-aligned (ie bits 1:0 set to 0), since the assembler places #Imm >> 2 in the Word8 field.

So the assembly code actually shows a 10-bit address offset, which can reach up to 1020 bytes, but the encoding is two bits shorter and conveniently stored in an 8-bit immediate field. Conclusion: correct code!!
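That encoding rule can be sketched in C (a minimal illustration of the quoted datasheet note, not production code):

```c
#include <assert.h>
#include <stdint.h>

/* Encode a byte offset into the 8-bit Word8 field: the assembler
   stores Imm >> 2, so the offset must be word-aligned (bits 1:0 zero)
   and can reach at most 255 words = 1020 bytes. */
static uint8_t encode_sp_offset(uint16_t byte_offset) {
    assert(byte_offset <= 1020);        /* 255 words max */
    assert((byte_offset & 0x3) == 0);   /* must be word-aligned */
    return (uint8_t)(byte_offset >> 2);
}

/* Decode: the CPU shifts the field back up to recover the byte offset. */
static uint16_t decode_sp_offset(uint8_t word8) {
    return (uint16_t)word8 << 2;
}
```

So the full 10-bit offset round-trips through the 8-bit field as long as the two alignment bits are zero.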

This is interesting, and it just gave me the idea of doing the same for my architecture. I use 16-bit-aligned function arguments, so I should be able to interpret the 8-bit offset as words (instead of bytes) and double the range of accessible addresses :D
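A quick sketch of that idea, assuming 16-bit words and an 8-bit field interpreted as a word count (the function name is mine, not a CPU74 mnemonic):

```c
#include <stdint.h>

/* If stack objects are 16-bit aligned, an 8-bit immediate interpreted
   as a word count reaches twice as far as a byte count:
   0..255 words = 0..510 bytes. */
static uint16_t sp_relative_address(uint16_t sp, uint8_t imm_words) {
    return sp + ((uint16_t)imm_words << 1);  /* words -> bytes */
}
```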


Tue Apr 23, 2019 9:02 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1225
Aha! It's most often the case, when one finds a bug in a compiler, that one hasn't... but sometimes one has.


Tue Apr 23, 2019 9:07 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 918
Location: Canada
Quote:
The key thing to know is that by the time the output is generated (assembly or object file), all the instruction selection, register allocation, and target-dependent optimisations have already been performed. In fact, it would not be possible for an assembler to expand instructions by incorporating temporary registers, because that would certainly break all the previous register allocation work and optimisations performed by the backend. There may not even be any register available at the time the assembler realises that it needs one more.
I didn't realize I was doing something so unusual with the FT64 assembler. A register (r23) is specifically reserved for the assembler's use, so it can output additional instructions to support the requested mode of an instruction. It doesn't interfere with the compiler's register allocation because it's hidden from the compiler (the compiler simply isn't allowed to use that register). Expanding an immediate is done relatively infrequently, and on a modern CPU architecture with register renaming it doesn't matter that the same register is reused in consecutive expansions. I'm not sure about the compiler's ability to select and optimize instructions; the simple compiler I'm using doesn't do many manipulations. It does substitute shifts for multiplies and divides where possible. I think the compiler would be largely unaffected, since the additional instructions output are no different than if the ISA supported large immediate constants. All the compiler can do to optimize an immediate value is to place it in a register, and that kind of optimization isn't affected by the assembler's implementation.
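The reserved-register expansion described here can be sketched roughly as follows. The mnemonics and the 8-bit immediate range are illustrative assumptions, not actual FT64 syntax:

```c
#include <stdio.h>

/* Hedged sketch of the trick: the assembler owns a scratch register
   (r23) that the compiler is never allowed to allocate.  When an
   immediate does not fit the instruction's field, the assembler emits
   an extra instruction that builds it in r23 and rewrites the
   operation to use r23.  Returns the number of instructions emitted. */
static int emit_add_imm(char *out, size_t n, int rd, int rs, long imm)
{
    if (imm >= -128 && imm <= 127) {            /* fits: one instruction */
        snprintf(out, n, "add r%d, r%d, #%ld\n", rd, rs, imm);
        return 1;
    }
    /* too big: load the constant into the reserved register first */
    snprintf(out, n, "ldi r23, #%ld\nadd r%d, r%d, r23\n", imm, rd, rs);
    return 2;
}
```

Because r23 never appears in compiler-allocated code, the expansion cannot clobber a live value.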

_________________
Robert Finch http://www.finitron.ca


Tue Apr 23, 2019 10:46 am

Joined: Fri Mar 22, 2019 8:03 am
Posts: 124
Location: Girona-Catalonia
Well, I suppose there may be several ways to get to a similar result, but generally speaking, preventing a compiler from using an existing register could be considered a less than optimal approach if it doesn't bring additional advantages. In my case, I don't have plenty of registers, so better to use them all if possible.

The compilers that I developed many years ago (and again not that long ago, in the form of a JIT compiler from a Ruby-like language to RPN code, integrated into an iOS app) only replaced multiplications and divisions by power-of-two constants with shifts, folded constant expressions into their values at compile time, and removed adds and subtracts of zero and multiplications by one. But that was all. My compilers were essentially top-down recursive-descent parsers that generated machine or assembly code as they went. At most they required a second custom pass to compute relative jump or call offsets, or to resolve forward declarations. I never really went down the route of an assembler, as I directly produced some sort of executable file that the target machine was able to run.

Commercial compilers are, however, at a completely superior level. For example, this document lists all the optimisation passes that the LLVM compiler can perform: https://llvm.org/docs/Passes.html (look at the "Transform Passes" section). Some of them are relatively unimportant, but others make a big difference. The desired optimisation level is usually specified when invoking the compiler with the -O1, -O2, -O3 command line options, but it's not always clear which passes are included with each option. There's also a special -Os option that aims for the shortest possible code. I always choose -Os because I found that it produces the best possible code for a target with only a 64k addressable word space. -Os is meant to include all optimisations that result in shorter code while discarding the ones that would enlarge it, such as -loop-unroll and -loop-unswitch. The -O3 option produces unbelievably large code, but it's supposed to be the fastest on highly pipelined processors.

The LLVM project started as a ground-up re-implementation of the GNU GCC compiler, driven by GCC's numerous shortcomings. Clang is the C/C++ front end of LLVM, and it remains fully compatible with GCC. Earlier versions even required GCC to compile them. Currently, LLVM/Clang can obviously compile itself, but not at the beginning.

From my experience using both GCC and Clang, I can tell that Clang is much superior to GCC. Around 2012 or so, I worked at my own company as an iOS developer. At the time, the default compiler for iOS app development was GCC. Eventually, Apple decided to switch to LLVM/Clang, and that made a huge difference. Even if both compilers probably generate similar-quality code, compilation times on Clang suddenly dropped to one third, and compiler error messages suddenly became spot-on, fully informative, and reported in the right place. This was in contrast to the sluggishness of GCC on big projects, and its always undecipherable and totally misplaced compiler errors. LLVM is also able to target several architectures with the same base code just by switching an option (as I have repeatedly shown here), while GCC requires building a separate compiler for every target.

So, regrettably or not, thanks to the open source community, mature top-of-the-range compilers have now become a commodity. For this project, I decided to try a custom LLVM backend rather than implementing my own compiler as I did in the past. Now that I have almost overcome the quite steep initial learning curve, I'm very happy to have adopted this approach.

EDIT:

About compilers: back in the late 80's I purchased what is usually referred to as the "dragon book": "Compilers: Principles, Techniques, and Tools" by Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman. There are more modern books, but this one is the classic and covers the most common optimisations. I found that it's still available on Amazon, and it's definitely a very interesting read if you are into this kind of subject.

Another interesting read is "Advanced Compiler Design and Implementation" by Steven S. Muchnick. This 900-page book is more recent and focuses exclusively on compiler optimisations. It makes you realise how complex it all can become if you aim for really aggressive optimisations, and how much of a 'black art' the implementation of such optimisations is.

Finally, "Engineering a Compiler" by Keith Cooper and Linda Torczon seems to be the most up-to-date, state-of-the-art book on compilers. But I don't have it in my library, so I can't really comment.


Tue Apr 23, 2019 1:35 pm

Joined: Fri Mar 22, 2019 8:03 am
Posts: 124
Location: Girona-Catalonia
In order to help instruction decoding, I have moved all the constant fields in the instructions to the right (least significant bits). Now, when picking an immediate field, I only have to sign-extend it to 16 bits starting from its number of encoding bits. This indirectly causes register fields to vary among instructions, but I balanced it all, and I think it's a small price to pay for easier constant decoding. So the instruction formats for the new encodings look like this (sorry, I think I have posted a similar table too many times already :roll: )
Attachment:
CPU74InstrSetV4.png
CPU74InstrSetV4.png [ 222.88 KiB | Viewed 2170 times ]
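The variable-width sign extension just described can be sketched in C (a minimal illustration, assuming the field is right-aligned as stated):

```c
#include <stdint.h>

/* Sign-extend a right-aligned immediate field of 'bits' width to a
   full 16-bit value: XOR with the sign bit, then subtract it. */
static int16_t sext16(uint16_t field, int bits) {
    uint16_t sign = 1u << (bits - 1);
    return (int16_t)((field ^ sign) - sign);
}
```

For a 6-bit field, 0x3F decodes to -1 and 0x1F to +31; the same routine serves every field width, which is the point of right-aligning them.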

The interesting part is that I have already started to figure out the instruction decoder.
I created an EasyEDA schematic with what would be the first version of the decoder. It's attached as a pdf file:
Attachment:
Schematic_InstrDecoder_Sheet-1_20190426154736.pdf [69.15 KiB]
Downloaded 58 times

This is only a conceptual version, and thus it is not meant to be complete, although I think it should lack very little. It starts from the Instruction Register, which is supposed to already contain the current instruction. The decoding works as follows:

(1) The instruction is decoded simultaneously into its useful parts for further processing. These parts are:
"Opcode", "Register Selection", "Condition Selection", and "Immediate Value".
All four parts are 'decoded' for every instruction, but only the "Opcode" has a relevant meaning in all cases. The other parts are only used by the CPU when they have an actual meaning for the instruction at hand.

(2) Instructions are converted into a unique 7-bit microcode (0..127) by the "Opc Decoder". Currently, slightly fewer than 64 instructions are defined, so decoding could be done in only 6 bits, but I chose 7 bits because I'm already very near the 6-bit limit and I may add a couple of instructions at any time. The actual opcode decoding is performed by an ATF22V10 PLD: the upper 12 bits of the instruction are converted into a 7-bit microcode by the PLD's internal table. To show the concept, I attach the PLD file as txt. The actual decoding is implemented as a conversion table from the 12-bit native instruction encoding, with don't-cares, into a concrete 7-bit microcode. Note that the PLD code shown does not yet support all instructions:
Attachment:
InstDecoder.txt [2.5 KiB]
Downloaded 53 times
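The PLD's mask-and-match behaviour can be modelled in C like this. The patterns and microcode values below are invented for illustration; the real table is in the attached PLD file:

```c
#include <stdint.h>

/* Each entry matches the upper 12 bits of the instruction against a
   pattern with don't-cares (mask bits = 0) and yields a 7-bit
   microcode, much as a PLD product term would. */
struct decode_entry { uint16_t mask, pattern; uint8_t microcode; };

static const struct decode_entry table[] = {
    { 0xF00, 0x100, 0x01 },   /* hypothetical: 0001 xxxx xxxx -> uop 1 */
    { 0xFC0, 0x240, 0x02 },   /* hypothetical: 0010 01xx xxxx -> uop 2 */
};

static int decode_opcode(uint16_t instr) {
    uint16_t upper12 = instr >> 4;   /* upper 12 bits of the 16-bit word */
    for (unsigned i = 0; i < sizeof table / sizeof table[0]; i++)
        if ((upper12 & table[i].mask) == table[i].pattern)
            return table[i].microcode;
    return -1;  /* no product term matched */
}
```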


(3) The "Register Decoder" uses another ATF22V10 PLD to extract the register fields from the native instruction encoding. Input pin 13 of the "Reg Decoder" is fed with an output of the "Opc Decoder" to choose between the two possible patterns of register encoding. This is required because T6 to T9 in the instruction summary now have a non-regular pattern for register encoding. The "Register Decoder" will output garbage for instructions with fewer than 3 registers, but this is unimportant because the processor will only make use of the registers relevant to the instruction being executed.

(4) The "Condition Selector" is just an extraction of the condition code field for instructions containing it. As before, it will contain garbage for any other instruction, but that's not an issue.

(5) The "Imm Decoder" takes one of the 4 possible representations of instruction-embedded constants and converts it into a sign-extended 16-bit value. As this is actual input data for the ALU, a couple of 74xx541 tri-state buffers have been incorporated into the schematic above, but they may be placed elsewhere, or a different approach may be chosen to select the ALU inputs. The "Imm Decoder" takes two inputs from the "Opcode Decoder" to know what to do, i.e. which of the 4 immediate constant forms it must select.

Another interesting aspect of the schematic is the micro-instruction decoder. Some instructions require more than one microcode iteration (and clock cycle) to complete. Instead of using a counter to keep track of this, I am using a 'linked list':

(1) First, the instruction opcode is fed into the "Micro Instruction Decoder" through a multiplexer made from a couple of 74xx157s. The multiplexer initially selects the Opcode Decoder output. The "Micro Instruction Decoder" computes the control signals required for the current micro-instruction. The microcode can be regarded as a row number in a table where all the control signals are put in columns. This is consistent with previous approaches from others who used a PROM or EEPROM for the actual decoding. I am using PLDs instead, but that's not conceptually different. The main difference is that the "next" microcode is part of the data stored in said columns.

(2) As discussed earlier, all microcodes coming from the "Opcode Decoder" are 7 bits. This means that they use at most the first 128 rows (0..127) of the Micro Instruction Decoder. There are no additional rows reserved in that space for multi-cycle instructions. Instead, the Micro Instruction Decoder provides the row (if any) for the next microcode of a multi-cycle instruction in the upper 128..255 row space, and that row number is stored into the "Micro Instruction Register" at the next clock edge.

(3) The microcode selector, made from the 74xx157 multiplexers, decides, based on the most significant bit of the Micro Instruction Register, whether to pass the contents of the Micro Instruction Register or a new opcode from the "Opcode Decoder" to the Micro Instruction Decoder. If bit 7 is 1, then the last instruction has not finished, so we must feed the Micro Instruction Decoder with the contents of the Micro Instruction Register. The process repeats until all micro-steps of a multi-cycle instruction are completed. For the last step (or for single-step instructions), the 'next' microcode can simply be all zeros. This selects the microcode coming from the opcode decoder for the next cycle.
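The linked-list sequencing in steps (1) to (3) can be modelled in C roughly like this. The row contents and opcodes are invented for illustration; the real table lives in the PLDs:

```c
#include <stdint.h>

/* Rows 0..127 are entry points selected by the 7-bit opcode; rows
   128..255 hold follow-on steps for multi-cycle instructions.  Each
   row stores its 'next' row; next == 0 means the instruction is done
   and the opcode decoder drives the next cycle. */
struct urow { uint8_t next; /* ...plus control-signal fields... */ };

static struct urow rom[256] = {
    [0x05] = { 0    },   /* hypothetical single-cycle instruction   */
    [0x0A] = { 0x80 },   /* hypothetical two-cycle: continue at 128 */
    [0x80] = { 0    },   /* final step of the two-cycle instruction */
};

/* Run one instruction starting at 'opcode'; return cycles consumed. */
static int run_instruction(uint8_t opcode) {
    uint8_t mir = 0;     /* Micro Instruction Register */
    int cycles = 0;
    do {
        uint8_t row = (mir & 0x80) ? mir : opcode;  /* the 74xx157 mux */
        mir = rom[row].next;    /* latched at the next clock edge */
        cycles++;
    } while (mir & 0x80);       /* bit 7 set: instruction not finished */
    return cycles;
}
```

A counter never appears: the chain of 'next' fields is the whole sequencing mechanism.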

I hope it makes sense.


Fri Apr 26, 2019 3:57 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 918
Location: Canada
Thank you for the book references. The book I started out with was 'Compiler Design in C' by Holub. It takes a more hands-on coding approach and has source code for a working compiler and several other tools. In my youthful foolishness, I manually keyed in the code rather than sending away for it. This was before the internet.

_________________
Robert Finch http://www.finitron.ca


Sat Apr 27, 2019 2:10 am

Joined: Fri Mar 22, 2019 8:03 am
Posts: 124
Location: Girona-Catalonia
robfinch wrote:
Thank you for the book references. The book I started out with was 'Compiler Design in C' by Holub. It takes a more hands-on coding approach and has source code for a working compiler and several other tools. In my youthful foolishness, I manually keyed in the code rather than sending away for it. This was before the internet.

Hi Rob, compilers are a fascinating subject, aren't they? I guess I must be older than you, because my first reference book was this one: https://en.m.wikipedia.org/wiki/Algorithms_%2B_Data_Structures_%3D_Programs by the Swiss inventor of the Pascal language, Niklaus Wirth. This book introduced me deeply to computers and software design. In the last couple of chapters, the author proposed a fully functional interpreter for a Pascal-like language, including complete source code written in Pascal. He named that language PL/0. Only a few years later I wrote my first compiler, in the C language, on a VAX-11 computer, but before that I had written math expression interpreters in Fortran. Unfortunately, I think the original book chapters about parsers and compilers were removed in a more recent, updated adaptation of the book by another author. The book that I had (and still have somewhere, in a pretty discoloured form) wasn't even the English version, because at the time I hadn't even started learning English and didn't understand it, go figure. That was many years ago...


Sat Apr 27, 2019 7:37 am

Joined: Fri Mar 22, 2019 8:03 am
Posts: 124
Location: Girona-Catalonia
While working more on the compiler, I realised that a couple of instructions were missing that were required to fully compile things. Contrary to what I said before, I now support offsets of any size for objects stored on the stack. I will show in another post what changes I made (still in progress) to the compiler to support such large stack objects. I decided to make these changes after I realised that big objects on the stack are not that uncommon after all. The most common scenario is possibly the declaration of local arrays in functions. These can easily exceed the maximum immediate offset available in the instruction set (-128 to +127). Until now, exceeding this limit stopped the compiler. The instruction set had the following problems:

- Lack of byte loads/stores for objects on the stack, SP-based with immediate offset. Only word access was available. So I have now added instructions to load/store bytes on the stack, in addition to the already existing word instructions.

- Lack of instructions to add/subtract a value in a register to/from the SP. Only immediate add/sub was available. This prevented stack frame adjustments above 128 bytes, which may be required for large objects. So these instructions were added as well.

The instruction set is already very tight on instruction slots; there are virtually none free. So I had to decide on some trade-offs to make room for the missing instructions. This is what I did:

- I removed the unsigned byte load with immediate offset instruction. The compiler will now generate a signed load followed by a zero extend when a zero-extended load is required. This single instruction wasted several slots because it sat next to the two word load/store instructions and the other two byte load/store instructions: 5 possibilities in total, which required a 3-bit field in the instruction. This has now been reduced to 4 possibilities, which only require 2 encoding bits. In this case, instead of using the extra slot for more instructions, I decided to extend the size of the immediate to 6 bits instead of just 5. So the following instruction (and friends) now has a 6-bit immediate field.
Code:
ld.w [#K, Rs], Rd  ; #K is now 6 bits long


- I noticed that the Conditional Set instruction was a significant waste of slots. It had several don't-care bits that, after some general rearrangement, could be used to gain more available slots. The new instruction patterns have essentially been carved out of what was wasted by that single instruction's arrangement.

The changes to the instruction set patterns are several and significant, and I have taken great advantage of BigEd's recommendation about splitting the instruction opcodes while keeping the operand fields as constantly placed as possible. The differences are as follows:

- Added the instructions referred to above.
- 6-bit offset instead of 5 for the load/store immediate instructions.
- 5-bit op fields instead of 4 for the single-register instruction pattern: room for 32 eventual instructions instead of 16.
- 5-bit op fields instead of 4 for the zero-operand instruction pattern: room for 32 eventual instructions instead of 16.
- 4-bit op fields instead of 3 for the two-register-operand instruction pattern: room for 16 eventual instructions instead of only 8.
- 4 additional free slots, currently unused, for a type of instruction with an 8-bit immediate but no explicit register. This could be used, for example, to implement instructions that use that immediate value in some implicit way (software interrupts?).

So finally, the instruction set now looks like this:
Attachment:
CPU74InstrSetV5.png
CPU74InstrSetV5.png [ 234.49 KiB | Viewed 1943 times ]

As described, the instruction opcodes are now split across the instruction encodings in an attempt to enforce constant placement of immediate fields, registers, and condition codes. The instruction decoder schematic that I posted before should have no issue decoding this, as the decoding principle remains the same. I think this is a real improvement on what I had before, because there's more flexibility to add instructions thanks to the more efficient encoding.

Joan


Wed May 08, 2019 10:53 pm

Joined: Fri Mar 22, 2019 8:03 am
Posts: 124
Location: Girona-Catalonia
I made some advances on the compiler implementation relating to stack frames and function argument passing / return values. They can be summarised as follows:

- The first arguments are passed in up to 3 registers, R0 to R2, as long as they fit. For example, 3 ints will be passed in R0, R1, R2 in order (recall that ints are 16 bits). Arguments smaller than 16 bits, such as the char type, are extended to a 16-bit type and passed the same way.

- The remaining arguments, if any, are passed on the stack. Simple-type arguments are laid out little-endian, and arguments are located from lower to higher stack addresses. That is, they appear at ascending memory addresses in the same order as they were specified.

- Non-simple-type arguments, such as structs, are always passed on the stack. If more arguments are specified, registers R0 to R2 may still be used to complete the list. Structs are passed by value, as the C specification dictates. Naturally, the programmer is able to change that behaviour by using pointers.

- Simple-type return values are returned in registers R0 to R2; otherwise, implicit references are created and passed as (hidden) additional arguments.

- Structs are returned by means of an implicit reference. The caller creates a frame slot the size of the struct in the calling function and passes its address as the first argument (in register R0). The callee just needs to update the fields pointed to by the passed-in reference. This is consistent with the C specification, and many compilers do just that.

This is the example of struct passing and returning that I posted before, but now showing the compiler's enhanced behaviour.
Code:
struct A
{
  char l[3];
  char m;
  char n;
};

struct B
{
  char w[2];
  char y;
  char z;
};

struct B convert( struct A a )
{
  struct B b;
  b.y = a.m;
  b.z = a.n;
  return b;
}

void convert2( struct B *b, struct A *a)
{
  b->y = a->m;
  b->z = a->n;
}

int callConvert2()
{
  struct A a = { { "ab" }, 3, 4 };
  struct B b;
  convert2( &b, &a );
  return b.y+b.z;
}

 
This gets compiled as follows:

CPU74 code
Code:
convert:                                ; @convert
; %bb.0:                                ; %entry
   ld.sb   [SP, #5], r1
   st.b   r1, [r0, #2]
   ld.sb   [SP, #6], r1
   st.b   r1, [r0, #3]
   ret

convert2:                               ; @convert2
; %bb.0:                                ; %entry
   ld.sb   [r1, #3], r2
   st.b   r2, [r0, #2]
   ld.sb   [r1, #4], r1
   st.b   r1, [r0, #3]
   ret

callConvert2:                            ; @callConvert
; %bb.0:                                ; %entry
   mov   #7, r0
   ret
In the above assembly code, struct 'a' is passed by value to the 'convert' function, so it's accessed through SP. The return location is implicitly passed as the first argument in R0. The generated code is just a simple sequence of loads/stores through an intermediate register. The 'convert2' function is equivalent, but uses pointers instead.

Interestingly, the "callConvert2" function gets completely optimised away by the compiler frontend, leaving only a return of the compiler-calculated result.

To force the compiler to actually generate something longer for the calling function, I must declare 'convert2' as external. So this is what I get for the callConvert2 function:

CPU74 Code
Code:
callConvert2:                           ; @callConvert2
; %bb.0:                                ; %entry
   push   r3
   sub   SP, #10, SP
   mov   SP, r3
   add   r3, #5, r3
   mov   &.LcallConvert2.a, r1
   mov   r3, r0
   mov   #5, r2
   call   &memcpy
   mov   SP, r0
   add   r0, #1, r0
   mov   r3, r1
   call   &convert2
   ld.sb   [SP, #3], r0
   ld.sb   [SP, #4], r1
   add   r1, r0, r0
   add   SP, #10, SP
   pop   r3
   ret

I instructed the frontend to generate 'memcpy' calls when a hardcoded copy would result in more than 2 pairs of loads/stores. This can be adjusted to a higher value, but so far I am favouring code density over ultimate speed, so I find this to be ok.


Mon May 13, 2019 8:15 am

Joined: Wed May 15, 2019 1:17 am
Posts: 21
Joan, I've just quickly browsed through this thread. If you haven't already done so, have you considered doing a first implementation of your CPU in a language like VHDL or Verilog?

The advantages are: it will be simulated at the logic level, you will be able to generate waveforms and see what is happening, and you can simulate propagation delays too.

Also, you can then modify the code so that it can be synthesized on an FPGA development board, and get it to run at real hardware speeds. FPGA boards are ridiculously cheap these days. For a 16-bit CPU, something like the TinyFPGA BX would be perfectly suitable.

And, if you still want to make a TTL version, you can do what I did and model each TTL device, and this helps to confirm that the design will work when you build the TTL version.

Just a thought. Cheers, Warren

Edit: Can you also post the latest version of CPU74InstrSetVnn.pdf?!


Thu May 16, 2019 3:26 am

Joined: Fri Mar 22, 2019 8:03 am
Posts: 124
Location: Girona-Catalonia
Hi Warren,

This is the current/latest version of the full instruction set.
Attachment:
CPU74InstrSetV5.pdf [61.71 KiB]
Downloaded 45 times

About FPGAs: they are something I have always regarded as too complex for me, but I suppose this is just prejudice with no objective reasoning behind it. After all, virtually everybody who has ever designed a TTL circuit eventually switches to FPGAs for convenience. I don't know what I will end up doing, but so far the idea of a bunch of TTL-packed PCBs all connected to a backplane, which may even need fans to keep them cool, appeals to me. I suppose it's a matter of nostalgia. I learned all the basics of computers, compilers, and beyond by playing on a VAX/750 computer, and that machine was amazingly made of huge PCBs absolutely packed with DIL 74xx ICs. Its older sister, the VAX/780, ran at an amazing 5 MHz despite no pipelining and a totally RISC-less architecture.

For testing, my current plans are still software-based: my own emulator for conceptual testing of the instruction set and compiler, and the venerable "logisim" software to check the hardware units, along with spreadsheets to manually figure out the critical paths. This doesn't mean I'm closing the door on FPGAs; it's just that I currently don't know anything about them.

Joan


Thu May 16, 2019 2:08 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1225
I'm all for simulation, by whatever means. As for FPGAs, there are a couple of things which are difficult in an FPGA: you generally can't have on-chip tristate busses, and you're generally advised not to use transparent latches or any technique other than synchronous design.

Transparent latches can be very useful to juggle timing slack from one part of a circuit to another, but the resulting system can be much harder to analyse.
https://logicsense.wordpress.com/2011/0 ... -stealing/


Thu May 16, 2019 4:21 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1225
(But there are also great advantages to FPGAs: rapid re-implementation when the design changes, high performance, no limitations on which kinds of logic function you can use.)


Thu May 16, 2019 4:23 pm

Joined: Fri Mar 22, 2019 8:03 am
Posts: 124
Location: Girona-Catalonia
After some more work on the compiler's stack frame code, I think I now have a fully working implementation. I finally decided to remove all limitations on stack sizes, so it's now possible to pass arguments of any size by value, such as big structs.

In the process, I found a bug in LLVM that affected 'memcpy' optimisations for targets not supporting misaligned memory accesses. This affected the CPU74 target, but also the MSP430 and MIPS16 targets. The bug consisted of a wrong computation of the optimal alignment when replacing memcpy with direct load/store instructions. This resulted in weird code generated at a later stage, which dealt with the misaligned loads/stores by inserting shifts/swaps/ors, thus creating significantly suboptimal code.

Just as a matter of possible interest, this is the compiler output for the same source code, before and after the bug fix:
Code:
struct AA
{
  char n;
  char m;
  char j;
};

extern void convertA( struct AA *a);

void callConvertA()
{
  struct AA a = {1, 2, 3};
  convertA( &a );
}

CPU74 Before Bug Fix
Code:
callConvertA:                           # @callConvertA
# %bb.0:                                # %entry
   sub   SP, 4, SP
   mov   &.LcallConvertA.a, r0
   ld.sb   [r0, 2], r1
   st.b   r1, [SP, 2]
   ld.sb   [r0, 0], r1
   zext   r1, r1
   ld.sb   [r0, 1], r0
   zext   r0, r0
   bswap   r0, r0
   or   r0, r1, r0
   st.w   r0, [SP, 0]
   mov   SP, r0
   call   &convertA
   add   SP, 4, SP
   ret

CPU74 After Bug Fix
Code:
callConvertA:                           # @callConvertA
# %bb.0:                                # %entry
   sub   SP, 4, SP
   mov   &.LcallConvertA.a, r0
   ld.sb   [r0, 2], r1
   st.b   r1, [SP, 2]
   ld.sb   [r0, 1], r1
   st.b   r1, [SP, 1]
   ld.sb   [r0, 0], r0
   st.b   r0, [SP, 0]
   mov   SP, r0
   call   &convertA
   add   SP, 4, SP
   ret

In the first case, the compiler generates 3 single-byte loads from the constant pool, but only 2 stores to the stack. The second stack store is 2 bytes wide, so the individual bytes get combined with zext/bswap/or.

In the second case, 3 single-byte load/store pairs are generated instead, which results in smaller code (although possibly not much faster, due to the extra store).

The above could still be improved by using one 1-byte load/store pair and one 2-byte load/store pair, but this is all internal LLVM code and I have yet to figure out how to make it happen.
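That ideal lowering for the 3-byte copy looks like this in plain C, under the assumption that both buffers are 2-byte aligned (a stand-in for what the backend would emit as ld.w/st.w plus ld.sb/st.b):

```c
#include <stdint.h>
#include <string.h>

/* One 2-byte load/store pair plus one 1-byte pair for a 3-byte copy,
   instead of three byte-sized pairs.  memcpy of a fixed 2 bytes is the
   portable C way to express the aligned word access. */
static void copy3(uint8_t *dst, const uint8_t *src) {
    uint16_t w;
    memcpy(&w, src, 2);       /* ld.w  [src, #0] */
    memcpy(dst, &w, 2);       /* st.w  [dst, #0] */
    dst[2] = src[2];          /* ld.sb/st.b for the remaining byte */
}
```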

Joan


Thu May 16, 2019 9:39 pm