


 74xx based CPU (yet another) 

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
I've been doing some work on the LLVM compiler and posted it to the LLVM community. It turns out that this compiler is highly optimised for big architectures such as x86 and ARM, but it sometimes feels over-engineered for the small 8 and 16 bit guys. There are two things that are not as good as they could be when compiling for such small architectures:

1 - The compiler has a strong bias towards "optimising" source code patterns into shifts. For example, comparisons with power-of-2 numbers will be optimised away by replacing the compare instruction with a shift. Sign extensions and sign comparisons are also replaced by shifts, and there are a lot of other cases. This is fine if the target machine can shift by several positions in a single instruction, but it is inefficient and results in expensive code if you only have single-bit shift instructions.
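Just to illustrate the kind of rewrite I mean, here is a hand-written example (not actual compiler output) of a power-of-2 comparison being turned into a shift:
Code:
/* Hand-written illustration of the shift bias described above,
   not actual compiler output. */
int isSmall( unsigned x )
{
  return x < 64;            /* compare against a power of 2 */
}

/* What the optimiser tends to produce, written back as C: the compare
   is gone, but on CPU74 the shift costs 6 single-bit shift instructions. */
int isSmall_asShift( unsigned x )
{
  return (x >> 6) == 0;
}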

Unfortunately, the compiler does not provide any useful hooks to relax this tendency, so I proposed a series of patches to the core compiler classes to at least allow targets to reverse it. In case there's some interest, these are the links:

https://reviews.llvm.org/D69116
https://reviews.llvm.org/D69120
https://reviews.llvm.org/D69326

So far only the first one has been accepted, but of course I have them committed to my local copy.

2 - The second issue is the optimisation of loops with loop-invariant execution count:

If the compiler can figure out a formula that gives the same result as a loop, then the loop is entirely replaced by that formula. This seems clever and desirable, but alas, the formula can contain multiplications, divisions and other stuff that is expensive on 8 and 16 bit CPUs.

This is an example:
Code:
int countHundred( int num )
{
  int count = 0;
  while ( num >= 100) { count++ ; num = num - 100; }
  return count;
}


CPU74
Code:
   .globl   countHundred
countHundred:
   cmp.lt   r0, 100L
   mov   0, r1
   brcc   .LBB1_2
   sub   r0, 100, r0
   mov   100, r1
   call   @__udivhi3
   lea   r0, 1, r1
.LBB1_2:
   mov   r1, r0
   ret

As can be seen, the loop is completely gone, replaced by a DIVIDE instruction, which is then converted into a library call by the CPU74 backend. This is not desirable because the division can turn out to be a lot more expensive than the original loop.
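For reference, reading the generated assembly back into C, the optimised version is roughly equivalent to this (my own reconstruction):
Code:
/* Hand reconstruction of what the exit-value replacement produces,
   derived from the assembly above. */
int countHundred_closedForm( int num )
{
  if ( num < 100 )
    return 0;                                /* the brcc .LBB1_2 path             */
  return (unsigned)(num - 100) / 100 + 1;    /* sub, call @__udivhi3, lea r0,1,r1 */
}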

Fortunately, in this case there's a general compiler command-line option that disables the "replacement of exit values", and thus the above code can be turned into this more preferable version:

Code:
   .globl   countHundred
countHundred:
   mov   r0, r1
   mov   0, r0
.LBB1_1:
   cmp.lt   r1, 100L
   brcc   .LBB1_3
   sub   r1, 100, r1
   lea   r0, 1, r0
   jmp   .LBB1_1
.LBB1_3:
   ret

In this case the loop in the original source code is fully preserved, resulting in better code for the CPU74 architecture (and for the MSP430 and AVR, by the way).

I hope that I now have the compiler fully 'domesticated', but I can't tell for sure, because the more progress I make by testing things on the simulator, the more I discover about the compiler's aggressiveness, which is sometimes more than I would have liked.


Thu Oct 24, 2019 7:32 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
Very good - especially as you see benefit for commercial CPUs as well as our homebrew ones!


Thu Oct 24, 2019 7:48 am

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
My post above was about my 'adventures' with the compiler, but the purpose of it is of course to be able to use the C language for the CPU74.

So following Ken's (monsonite) progress on his "Suite-16", I decided to try the same thing and program "putchar" and "printnum" routines. That's an excellent exercise, and it also unveiled a couple of bugs that I had to fix, so that was good. This is what I came up with:

I created "systemio.h" and "systemio.c" files with the following code.

"systemio.h" file:
Code:
inline void putchar( char ch ) { *(char*)0xffff = ch; }
void myprintstr( char *str );
void printnum( unsigned int num );

"systemio.c" file:
Code:
#include "systemio.h"

void myprintstr( char *str )
{
  while ( *str )
    putchar( *str++ );
}

void printnum( unsigned int num )
{
  int factors[] = {10000, 1000, 100, 10, 1};
  for ( int i=0 ; i<(sizeof factors)/2 ; ++i )   /* 5 iterations: sizeof counts bytes and int is 2 bytes on CPU74 */
  {
    char ch = '0';
    while ( num >= factors[i])
    {
      ch = ch + 1;
      num = num - factors[i];
    }
    putchar( ch );
  }
}

"main.c" file
Code:
#include "systemio.h" 

int main()
{
  myprintstr( "Hello world!\n" );
  printnum( 12345 );
  putchar( '\n' );
}


Compiled code results in this:

"systemio.s"
Code:
   .text
   .file   "systemio.c"
# ---------------------------------------------
# myprintstr
# ---------------------------------------------
   .globl   myprintstr
myprintstr:
.LBB0_1:
   ld.sb   [r0, 0], r1
   cmp.eq   r1, 0
   brcc   .LBB0_3
   st.b   r1, [-1L]      # this is putchar
   lea   r0, 1, r0
   jmp   .LBB0_1
.LBB0_3:
   ret

# ---------------------------------------------
# printnum
# ---------------------------------------------
   .globl   printnum
printnum:
   mov   0, r1
.LBB1_1:
   cmp.eq   r1, 5
   brcc   .LBB1_6
   mov   48, r2
   add   r1, r1, r3
   ld.w   [r3, &.L__const.printnum.factors], r3
.LBB1_3:
   cmp.ult   r0, r3
   brcc   .LBB1_5
   sub   r0, r3, r0
   lea   r2, 1, r2
   jmp   .LBB1_3
.LBB1_5:
   st.b   r2, [-1L]      # this is putchar!
   lea   r1, 1, r1
   jmp   .LBB1_1
.LBB1_6:
   ret

# ---------------------------------------------
# Global Data
# ---------------------------------------------
   .section   .rodata,"a",@progbits
   .p2align   1
.L__const.printnum.factors:
   .short   10000
   .short   1000
   .short   100
   .short   10
   .short   1


"main.s"
Code:
   .text
   .file   "main.c"
# ---------------------------------------------
# main
# ---------------------------------------------
   .globl   main
main:
   mov   &.L.str, r0
   call   @myprintstr
   mov   12345L, r0
   call   @printnum
   mov   10, r0
   st.b   r0, [-1L]    # this is putchar!
   mov   0, r0
   ret

# ---------------------------------------------
# Global Data
# ---------------------------------------------
   .section   .rodata.str1.1,"aMS",@progbits,1
.L.str:
   .asciz   "Hello world!\n"


The "main" function first prints "hello world" and then a number. After assembling and running with the simulator the output is this:
Code:
/Users/joan/Documents-Local/Relay/CPU74/Simulator/DerivedData/Simulator/Build/Products/Debug/c74-sim
Hello world!
12345
Program ended with exit code: 0


So it worked !!

For now, the inlined "putchar" function just writes a character to physical address '-1' (actually 0xffff). This can be programmed directly in C (isn't the C language fantastic?). The inlined function gets efficiently compiled into just a single store instruction to that address. Of course, the Stack Pointer is now initialised to a slightly lower address to avoid data corruption.

The "myprintstr" function just traverses a C string while calling "putchar" until it finds the string terminator. It's pretty straightforward.

The "printnum" function is implemented as a loop that subtracts powers of 10 in order to find the actual decimal digits. It's highly inspired on the "Suite-16" function that I mentioned earlier. To avoid using division or repeated code, the power-of-10 values are placed in an array, that the compiler conveniently stores in the 'constant data' section.

The real CPU may map the upper addresses for I/O purposes, I have not fully decided that yet, but for now the simulator just intercepts any write to address 0xffff to produce the actual output of the "putchar" function.
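Just to show the idea, the intercept can be as simple as a check in the simulator's memory-write path. This is only a sketch (the real c74-sim code is organised differently, and the names below are made up):
Code:
#include <stdio.h>
#include <stdint.h>

#define IO_PUTCHAR_ADDR 0xFFFFu    /* hypothetical name for the output address */

static uint8_t mem[0x10000];       /* flat 64K data memory for the sketch */

/* Byte store as the simulator might implement it: writes to 0xffff are
   diverted to the host's stdout instead of being stored in memory. */
void storeByte( uint16_t addr, uint8_t value )
{
  if ( addr == IO_PUTCHAR_ADDR )
    putchar( value );              /* host-side output */
  else
    mem[addr] = value;
}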


Thu Oct 24, 2019 1:55 pm

Joined: Mon Aug 14, 2017 8:23 am
Posts: 157
Hi Joan,

This is great progress - congratulations, and I am honoured that my "work" has been a source of inspiration to you.

I learned fairly early on that the first things to get working on an unfamiliar system (after the Hello World or blinkenled) are putchar and getchar closely followed by printnum and getnum.

With these basic routines debugged and working you can very quickly make progress. It's also a sufficiently good test to try out your instruction set and highlight any weaknesses.


Fri Oct 25, 2019 10:34 pm

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
I now have a first schematic of the instruction decoder. It is based on the architecture, instruction set and CPU diagram that I posted on my GitHub repo.
This is the schematic PDF file:

Attachment:
InstDecoder.pdf [46.4 KiB]
Downloaded 204 times


The program memory is not there yet, but the schematic starts from the Instruction Register (IR), which is shown at the top left.
There are three modules:
(1) The instruction register and primary decoder on the top left.
(2) The instruction opcode decoder on the top right.
(3) The immediate constant decoder on the bottom.

It works this way:

First, a couple of 74xx138 decoders, as well as NAND, OR and NOT gates, provide the initial bits that will be used for the next decoding stage. This is all parallel and takes less than 10 ns with 74AC logic.

(2) The instruction opcode decoder (top right) determines the instruction encoding type with simple logic. There are three possible encoding types (see CPU74InstrSetV9.pdf). The relevant instruction encoding is always 5 bits, which are taken from the appropriate range of the IR. For each type, two additional bits are added as the least significant ones. The result is a unique 7-bit representation for every single instruction, which is buffered onto the MR bus. The 7-bit representation is then input to several 16V8 PALs that generate the required control signals. In the schematic there's provision for up to 27 control signals. I expect that not all of them will be necessary, but this can easily be extended or reduced by simply adding or removing 16V8 chips.

Some instructions may require additional cycles to complete. This is handled by the MicroInstruction Register (MIR). This register is fed by the current instruction with an internal 4-bit opcode (5 bits are not really necessary in this case) that acts as the next microcode instruction. On the next clock cycle, these 4 (5) bits are combined to form the unique 7-bit representation. In addition, bit 7 of the MIR register is stored as a flag that indicates whether the IR buffers should be disabled and the MIR used instead.

Instructions can have any number of execution cycles; additional cycles are created by simply chaining microcodes through the MIR register. I found this procedure easier and less memory demanding than implementing a counter, especially if a large EEPROM were used for decoding instead of 16V8 PALs.
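In C-like behavioural terms, the decode path boils down to something like this. It is only a sketch: the field-extraction helpers are placeholders, since the exact bit ranges depend on the encoding type, and the mapping of the MIR bits into the 7-bit code is simplified:
Code:
#include <stdint.h>

/* Behavioural sketch only. opcodeField5(), encodingType() and microCode7()
   are placeholders for the primary decoding done by the '138s, gates and
   MIR wiring; their exact bit assignments are not spelled out here. */
extern unsigned opcodeField5( uint16_t ir );   /* 5-bit opcode field, range depends on type */
extern unsigned encodingType( uint16_t ir );   /* 2 bits identifying the encoding type      */
extern unsigned microCode7( uint8_t mir );     /* 7-bit code formed from the MIR opcode     */

uint8_t MIR;      /* MicroInstruction Register, loaded from the current instruction      */
int     useMIR;   /* flag from MIR bit 7: disable the IR buffers and use the MIR instead */

/* Produce the unique 7-bit code that drives the 16V8 control PALs */
unsigned controlCode7( uint16_t ir )
{
  if ( useMIR )
    return microCode7( MIR );                             /* extra cycle of a multi-cycle instruction */
  return (opcodeField5( ir ) << 2) | encodingType( ir );  /* 5 opcode bits + 2 type bits              */
}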

The total MAX estimated time for opcode decoding is in the neighbourhood of 45 ns, using 74AC logic and ATF16V8C PLDs, but this needs to be looked at more carefully. This time could be reduced by eliminating all pre-decoding logic and feeding the raw 9-bit instruction opcodes into more capable PLDs, but I thought that this would defeat the purpose of designing something that could reasonably have been made before the mid 80's.


(3) The immediate constant decoder (bottom) takes the relevant bits of the IR, based on the raw IR opcodes, and takes into account whether the value must be zero- or sign-extended. The four possible cases are buffered onto the IM bus, and eventually taken out to the ALU-LEFT bus.

The Prefix mechanism is also implemented as part of the immediate constant decoder. It works like this:

A register (the Pfix register) stores 11 bits of the previous instruction's immediate field. This happens on every cycle as the clock signal arrives, and for ALL instructions. If the previous instruction was a pfix, a 1 is additionally clocked into bit 15 of the Pfix register, or a 0 otherwise. During the current immediate decoding, this bit is used to select whether we want the value on the IM bus (non-prefixed instruction) or the combination of that value with the value stored in the Pfix register (a prefixed instruction).
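Expressed as a small C model, only to make the data flow explicit (the straight 11+5 concatenation in the prefixed case is a simplification of the real merge):
Code:
#include <stdint.h>

/* Behavioural model of the prefix mechanism. The 11+5 bit concatenation
   in the prefixed case is a simplification of the actual wiring. */
static uint16_t pfixReg;   /* bits 10..0: previous immediate, bit 15: "previous was pfix" */

uint16_t immediateOut( uint16_t imBus, int curIsPfix, uint16_t curImm11 )
{
  uint16_t result;

  /* Combinational part: bit 15 of the Pfix register selects the source */
  if ( pfixReg & 0x8000 )
    result = (uint16_t)(((pfixReg & 0x07FF) << 5) | (imBus & 0x1F));   /* prefixed     */
  else
    result = imBus;                        /* plain value, already zero/sign extended   */

  /* Clocked part: happens on every cycle, for ALL instructions */
  pfixReg = (uint16_t)((curIsPfix ? 0x8000 : 0x0000) | (curImm11 & 0x07FF));

  return result;
}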

The estimated decoding time for immediates is about 50 ns (the worst cases are the I2 immediate and the I1 sign extension, which require one more gating level than the other cases). This means that the immediate value will be available on the ALU-LEFT bus slightly after the new control signals are out. I hope this doesn't turn out to be a problem.

--
Any comments or recommendations are welcome.

Thanks.


Mon Oct 28, 2019 6:09 pm

Joined: Mon Oct 07, 2019 2:41 am
Posts: 585
joanlluch wrote:
I have now a first schematic of the instruction decoder. It is based on the architecture, instruction set, and CPU diagram that I posted on my github repo.
This is the schematics pdf file:

The total MAX estimated time for opcode decoding is in the neighbourhood of 45 ns, using 74AC logic and ATF16V8C PLDs, but this needs to be looked at more carefully. This time could be reduced by eliminating all pre-decoding logic and feeding the raw 9-bit instruction opcodes into more capable PLDs, but I thought that this would defeat the purpose of designing something that could reasonably have been made before the mid 80's.

--
Any comments or recommendations are welcome.

Thanks.


I like 22V10s over 16V8s or 20V8s simply because you have the extra two output pins, as well as the extra output terms. 22V10s are from about 1982, and Signetics 82S100s (50 ns) are from about 1975. The other advantage is that you can select active-high or active-low outputs and have a true reset of the D registers. The common PALs from the 80's, while useful, may need creative thinking to handle odd requirements as they tended to have active-low outputs.
Using LS TTL has the advantage of slow edge rates, so a simple 2-sided PCB can be used rather than a multilayer one with faster chips.
Having a version of your docs in just B&W might be useful for people like me who only have B&W ink in the printer. Colour comes out as light grey.

The current version of my 20 bit CPU (1976) has a total of 18 TTL chips, including 3 22V10Cs that emulate 82S100s, in the control PCB. The TTL logic for the data path uses 9 TTL chips per 4-bit slice plus 2 glue chips.


Tue Oct 29, 2019 2:06 am

Joined: Mon Oct 07, 2019 2:41 am
Posts: 585
I have just been simplifying my FPGA 18 bit CPU. The LS TTL (1976?) version uses 25 LS chips for the control section, with 2 256x8 PROMs for state decode and 1 32x8 PROM for 74181 decoding, giving a total of 27 chips. The 1973/1974 TTL version would be about 5 chips more (32 total), using 4 256x4 PROMs and 1 32x8 PROM, plus minor glue-logic changes for the control PCB. The ALU PCB is 35 TTL chips + 1 glue chip. The display PCB has the clock, reset and front panel interface.
72-pin card edge connectors (.156") are used for the motherboard. Tristate logic buffers will be used for the bus, so late 1973/early 1974 would have had the TTL logic/PROMs out in the marketplace. Core-style memory at ~1.7 us (6 MHz/5).


Sat Nov 09, 2019 11:25 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
I do encourage you to start a thread for your adventures, oldben. (Don't worry about audience - everyone here will see it. We're sharing a small space.)

Edit: I notice you do have a thread here
which I split off from another conversation. Feel free to re-title that one, to extend it, or to start a new one.


Sun Nov 10, 2019 8:19 am

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
I worked to make this architecture as fast and compiler-friendly as possible. I implemented a compiler backend, an assembler and a simulator, which showed compact executables requiring few instructions to perform general programming tasks. I also ran clock-cycle based custom benchmarks which showed that the CPU74, if properly built, should vastly outperform not only a regular 6502 and any 8 bit CPU of the time, but also a 32 bit VAX-780 of the early 80's.

As mentioned in the opening post, one of my initial goals was running 'Basic' and 'Space Invaders' on it. That was ok, but it now looks to me that just doing that would feel rather disappointing given the on-paper potential of this architecture. Given how the compiler work has turned out, it seems it would be much nicer to port an early version of UNIX and run it on top of the CPU74. I feel this is what this architecture deserves now.

Furthermore, to make UNIX happier, the current Harvard-architecture CPU74, which already runs at nearly 1 cycle per instruction, could be turned into a von Neumann architecture, keeping almost the same 1 cycle per instruction but with faster clock rates, by incorporating a 3-stage pipeline in a similar way to the original ARM1 processor.

But that's of course a major goal, possibly more challenging than I am currently able to accomplish, looking carefully at my time/energy/skills. To this I must add the uncharted territory (for me) of actually building/testing/troubleshooting PCB circuits with tiny parts on them, which I have zero experience with.

So after some thought, I decided to put this project in stand-by mode, and to initiate a more realistic one for me. This project is now at the perfect stage to do so, because the architecture is fully defined, the compiler is fully tested and working, and the assembler and simulator tools are available. This means that the project can be continued at any time in the future without any disruption, when I am ready for it.

For my new project, I will move to the opposite side in raw performance, but hopefully a reasonable one for its kind: I will attempt a RISC relay-based computer. I suppose this conceivably still qualifies as "anycpu", and I will be happy to share it as a new thread in this forum.

Joan


Fri Nov 29, 2019 12:12 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
Ah yes, a relay machine would be very appropriate!

When it comes to your doubts about building hardware, I'd urge you to look at logic simulation. There are many ways to tackle it - standard HDLs such as Verilog and VHDL, or non-standard higher level languages, or a graphical construction - but it does mean you can learn something about your design's cycle-by-cycle behaviour and performance without needing a soldering iron. And then, depending on what you choose to do, you might well be able to program your design onto an FPGA and run it at speed. If all looks well, you can perhaps reconsider building it with ICs on a PCB.


Fri Nov 29, 2019 3:05 pm

Joined: Wed Nov 20, 2019 12:56 pm
Posts: 92
BigEd wrote:
When it comes to your doubts about building hardware, I'd urge you to look at logic simulation. There are many ways to tackle it - standard HDLs such as Verilog and VHDL, or non-standard higher level languages, or a graphical construction - but it does mean you can learn something about your design's cycle-by-cycle behaviour and performance without needing a soldering iron. And then, depending on what you choose to do, you might well be able to program your design onto an FPGA and run it at speed. If all looks well, you can perhaps reconsider building it with ICs on a PCB.


I'll second that - there's nothing quite like watching your design fly for real on an FPGA!

My FPGA experience is only as a hobbyist, and mostly with Altera parts, but the Altera/Intel software (Quartus) allows you to mix VHDL, Verilog and schematics in a project. It's also perfectly possible to design entirely in the schematic editor and then have the software convert the schematic to VHDL or Verilog, which you can then run in a simulator.
I must also confess that I've only properly explored simulation in the last couple of months. Instead, for the last several years, I've been using the SignalTap logic analyser to snoop on the actual project while it runs on the FPGA.


Fri Nov 29, 2019 10:59 pm

Joined: Mon Oct 07, 2019 2:41 am
Posts: 585
I use the time-honoured trial-and-error programming method for my FPGA stuff.
Race conditions have always hampered my development; it works only once in a blue moon with that compile option. I use AHDL rather than the other stuff, since I can never figure out what is header information and what is compiled into logic. Altera has all the TTL macros I need, so I can simulate TTL logic designs and then someday build them. I keep reading how all the early computers were 36 bits, even ones using weird parts like relays or parametrons. I have a working 18 bit design, but I'd like to figure out how to shoehorn in a 36 bit accumulator rather than an 18 bit one.
Early Computers http://museum.ipsj.or.jp/en/computer/dawn/index.html


Sat Nov 30, 2019 4:42 am

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
Thanks to all for your suggestions. It seems that I will have to look at FPGAs at some point... But the thing is that I would still need to port a lot of the software that I would want to see running on it. I think there's no point in having a working CPU if I can't run meaningful software on it. By meaningful, I mean according to the capabilities of that CPU, not just a few algorithms. I already have a simulator, so I could theoretically start porting software for it even before the actual physical thing exists, but that's a considerable task. Defining and running the architecture on a different software tool, or on a physical FPGA, may help with the hardware, but it does not change my mind much, because as said I already have a simulator.
I feel that in order to get the maximum potential out of this project/architecture, a small team would be required. That's why I ultimately decided to take a break and try a relay computer instead, which is far less demanding in software. I mean, the relay computer does not require an operating system to be 'meaningful', as the most I can expect from it is to run a scientific calculator. I hope this makes sense.


Sun Dec 01, 2019 7:01 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
It certainly does make sense, and it's always down to each of us to decide what the scope of a project is, and when to step away.

I do enjoy collaborations, and sometimes they come together as a consequence of showing a project coming into shape.

(On the FPGA front, I'd only want to note that I mention it as an alternative to PCBs, chips, sockets, and solder. Both of them are implementation techniques, and you rightly point out that once something is implemented it starts to become clear that a toolchain and some software is the next logical step. An emulator is another kind of implementation, of course, and can even be a valid end-goal.)


Mon Dec 02, 2019 2:02 pm

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
OK, after a while of inactivity I decided to look at implementing a Logisim model of this processor.

I started with the ALU, it looks like this:

Attachment:
ALU.png
ALU.png [ 190.44 KiB | Viewed 792 times ]


It's basically a typical carry-lookahead implementation of a 16-bit adder, preceded by logical units that can conveniently be configured to produce the ALU logical functions as well as the propagate and generate signals for the carry-lookahead (CLA) units. This means that once we have the carries for each bit, only one XOR gate per bit is required to perform arithmetic. This is conceptually identical to the workings of the famous combination of 74181 ALU and 74182 CLA units, and highly inspired by Dieter's documents about ALU designs on the 6502.org forum.
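In other words, per bit the structure is just this (a trivial C model of one bit slice, only to state the idea):
Code:
/* One bit slice: the logical unit produces P and G, the CLA network
   produces the carry, and a single XOR then forms the sum bit. */
typedef struct { unsigned p, g; } PG;

PG pg( unsigned a, unsigned b )
{
  PG r;
  r.p = a ^ b;    /* propagate */
  r.g = a & b;    /* generate  */
  return r;
}

unsigned sumBit( PG s, unsigned carryIn )
{
  return s.p ^ carryIn;   /* the single XOR per bit mentioned above */
}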

Since 74182 chips went out of production some time ago, my CLA unit is designed around an ATF16V8B PLD, which is strictly used as a GAL16V8 PLA device. The CLA unit schematic is shown in the next picture. The logic functions are directly borrowed from Wikipedia (https://en.wikipedia.org/wiki/Lookahead_carry_unit, https://en.wikipedia.org/wiki/Carry-lookahead_adder). Five CLA units are required because they are arranged in 2 levels:

Attachment:
CarryLookAhead.png
CarryLookAhead.png [ 66.9 KiB | Viewed 792 times ]


As opposed to the 74181/74182 implementations, this CLA generates lookahead carries up to bit 4. This reduces the propagation delay for the final carry bit, as it is known 3 gate delays earlier than the result. This in turn helps to have the V flag early, which will reduce the total time required for compare instructions, which I anticipate at this point will have the longest critical path.
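For reference, these are the standard lookahead equations from the Wikipedia pages linked above, written as a C model of a single 4-bit CLA unit (five instances of this, in two levels, cover the 16 bits; note that c4, the carry out of the group, is produced here too, as just described):
Code:
/* One 4-bit carry-lookahead unit, using the standard equations from the
   Wikipedia pages linked above. */
typedef struct { unsigned c1, c2, c3, c4, pg, gg; } Cla4;

Cla4 cla4( const unsigned p[4], const unsigned g[4], unsigned c0 )
{
  Cla4 r;
  r.c1 = g[0] | (p[0] & c0);
  r.c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0);
  r.c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c0);
  r.c4 = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1]) | (p[3] & p[2] & p[1] & g[0])
              | (p[3] & p[2] & p[1] & p[0] & c0);
  r.pg = p[3] & p[2] & p[1] & p[0];                          /* group propagate */
  r.gg = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
              | (p[3] & p[2] & p[1] & g[0]);                 /* group generate  */
  return r;
}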

The Z flag is generated by a number of NOT gates in open-collector configuration. This saves one or two gate delays compared with cascaded ORs or NOR/AND combinations, but I'm not sure how well this approach works in practice, so any suggestions are welcome.

The planned functionality of the ALU is depicted in the first schematic. The ALU performs a different function depending on the select signals PS, GS, CS, as shown. This covers most of the required functions for the CPU74 instruction set, with the following remarks:
- Increment/decrement functions are not required because push/pop instructions were removed from the instruction set.
- Shift Left instructions are missing from the architecture because the compiler generates ADD or ADDC instructions instead.
- Shift Right instructions as well as Sign/Zero Extend, and Byte Swap instructions will be implemented outside of the ALU


Mon Sep 14, 2020 6:54 pm