


 [ 45 posts ]  Go to page Previous  1, 2, 3
 Some C compilers 

Joined: Wed Nov 20, 2019 12:56 pm
Posts: 78
roelh wrote:
Hi all, my C compiler is now integrated into the Kobold K2 online Javascript assembler/simulator.

On a single browser page you can now compile, assemble and simulate the C code, and download a file to program into the Kobold.

See it live at http://www.enscope.nl/simas_kobold2/ !


Very impressive - nice work!
It looks like you can now cope with the working set not fitting within registers - how did you approach register allocation / spillage?


Wed Mar 11, 2020 10:57 pm

Joined: Mon Oct 07, 2019 1:26 pm
Posts: 43
Thank you Robinson! About the register allocation:

The workspace holds a maximum of 16 locals in RAM. At a function call, the workspace pointer WP is incremented by 32, so the called function
has 16 fresh local variables (but for leaf functions the locals are in zero page, which saves the instructions needed to change WP).
There are 8 registers (4 data and 4 address). PC (A0) and WP (A1) are address registers, and one register is needed for the return address, which leaves 5 registers for parameter passing. So the maximum number of parameters is 5. This could be extended by having another parameter-passing convention for parameter 6 and higher; for instance, they could be placed directly into the next workspace.

Calculations normally are done in register D3. This is also the register for function results and for the first parameter of a function call. Within the code generator, there is a function that returns a new register to use (and reserves it), but it is currently only used one level deep. Address calculations are mostly done in D2 and A2.

For expressions, the code generator will first generate code for the deepest branch of the expression. This minimizes the number of intermediate results. But if intermediate results must be stored, this is generally done in the workspace.

The code generator does several other optimizations that will be described later.

There are no compare instructions in the Kobold instruction set; a subtract must be used (that is, an add of the two's complement), so the register being compared would normally be changed by the instruction. But there are also ADD instructions that deliver the result in an address register, like:

MOVC (var),D3 ; variable to D3 and complement it
ADDI 45,D3,A3 ; add and increment. Comparison result comes in carry.

This can be used for comparison. The D3 value is not changed, and the result in A3 is simply ignored.

You can have a look at the code generator here: http://www.enscope.nl/simas_kobold2/generator.js


Thu Mar 12, 2020 8:46 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1666
Nice - can confirm, 3 is prime, and 9 isn't. (You might find that a fast-forward function will help a lot - allow for multiple steps between GUI updates. Ken's calculator simulator runs a few hundred(?) steps at normal speed and then speeds up. Visual6502 takes a parameter to adjust the fast-forward button's step length.)


Thu Mar 12, 2020 9:29 am

Joined: Mon Oct 07, 2019 1:26 pm
Posts: 43
Hi all, I did a benchmark test to check my C compiler in combination with my computer Kobold K2, against a few known systems.

You can find my full test description here:
https://hackaday.io/project/167605-kobold-k2-risc-ttl-computer/log/180963-how-fast-is-this-thing-anyway

Ran the Sieve of Eratosthenes (this mainly tests the speed of loops and writes to arrays). Result: 3.3 sec

6502 - The 6.25 MHz Kobold K2 running the sieve in C is more than 10 times faster than a 2 MHz 6502 programmed in optimized assembler.

68000 - The 6.25 MHz Kobold K2 running the sieve in C beats all C, Ada and Pascal compilers (available in 1983) targeting an 8 MHz 68000!

8086 - The 6.25 MHz Kobold K2 running the sieve in C beats all C compilers targeting a 10 MHz 8086!

Ran the recursive Fibonacci algorithm (this mainly tests the speed of recursive function calls and returns). Result: 5 sec

8086 - The 6.25 MHz Kobold K2 running Fibonacci in C beats all C compilers targeting a 10 MHz 8086 by a factor of 3 or more!


Sun Jul 19, 2020 5:17 pm

Joined: Mon Oct 07, 2019 1:26 pm
Posts: 43
The source code of the C compiler and editor/assembler/simulator can be found here:

https://hackaday.io/project/167605/files

And the short description here:

https://hackaday.io/project/167605-kobold-k2-risc-ttl-computer/log/182395-c-compiler-assembler-simulator-source-code

Perhaps some of you dare to look at it. I would not be surprised if you saw a resemblance to a certain Italian food...


Wed Aug 19, 2020 8:07 pm

Joined: Sun Oct 14, 2018 5:05 pm
Posts: 7
roelh wrote:
Hi all, I did a benchmark test to check my C compiler in combination with my computer Kobold K2, against a few known systems.

You find my full test description here:
https://hackaday.io/project/167605-kobold-k2-risc-ttl-computer/log/180963-how-fast-is-this-thing-anyway

Ran the Sieve of Eratosthenes (this mainly tests the speed of loops and writes to arrays). Result: 3.3 sec

6502 - The 6.25 MHz Kobold K2 running the sieve in C is more than 10 times faster than a 2 MHz 6502 programmed in optimized assembler.

68000 - The 6.25 MHz Kobold K2 running the sieve in C beats all C, Ada and Pascal compilers (available in 1983) targeting an 8 MHz 68000!

8086 - The 6.25 MHz Kobold K2 running the sieve in C beats all C compilers targeting a 10 MHz 8086!

Ran the recursive Fibonacci algorithm (this mainly tests the speed of recursive function calls and returns). Result: 5 sec

8086 - The 6.25 MHz Kobold K2 running Fibonacci in C beats all C compilers targeting a 10 MHz 8086 by a factor of 3 or more!


Hi,

At the risk of straying off-topic - but I'm always looking for something to compare my own system to - I gave this a go...

Firstly, my system is a BCPL program compiled into a 32-bit VM/bytecode interpreter running on a 16 MHz, 16-bit CPU (65C816) with an 8-bit data bus, so it's never going to be optimal. However, assuming my interpretation (and translation) of the code is correct and you're doing 10 iterations of the inner loop, I get 13.014 seconds. Not as fast as your C version, but I'm pleasantly surprised it's on par with the AVR at the same clock speed.

Code:
GET "libhdr.h"
GET "sys.h"

MANIFEST
{
  size       = 8190
  iterations =   10
}

LET start () = VALOF
{
  LET flags = getvec (size + 1)
  LET tStart, tEnd = ?,?
  LET count = 0

  writes ("Starting*n")
  UNLESS flags THEN
  {
    writes ("Out of memory*n")
    RESULTIS 1
  }

  tStart := sys (Sys_cputime)
  FOR iter = 1 TO iterations DO
  {

    FOR i = 0 TO size DO
      flags!i := TRUE

    FOR i = 0 TO size DO
    {
      LET prime, k = ?, ?

      IF flags!i THEN
      {
        prime := i + i + 3
        k := i + prime
        WHILE k <= size DO
        {
          flags!k := FALSE
          k := k + prime
        }
        count := count + 1
      }
    }
  }
  tEnd := sys (Sys_cputime)

  writef ("Time: %d ms*n", tEnd - tStart)
  writef ("  %d  primes found.*n", count)

  freevec (flags)

  RESULTIS 0
}

Code:
Starting
Time: 13014ms
  18990 primes found.


Cheers,

-Gordon


Thu Aug 20, 2020 1:04 pm

Joined: Mon Oct 07, 2019 2:41 am
Posts: 292
My bengol compiler accepts a simple version of C, with some code changes for porting purposes.
I get about 10 seconds on my 20-bit, 1.33 µs core-memory computer. 1899 primes.
Looking at the BYTE benchmarks, very few tests have the same speed and bus width.
A 4 MHz Z80/IBM PC is about the same speed.
Ben.




Thu Aug 20, 2020 3:36 pm

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
OK, although I'm a bit late to join, I also accepted the challenge and tried it with my CPU74 architecture and compiler.

This is the test C code
Code:
#include "systemio.h"
#include "system.h"

#define true 1
#define false 0
#define size 8190
#define sizepl 8191
char flags[sizepl];
int main() {
    int i, prime, k, count, iter;
    myprintstr("10 iterations\n");
    for (iter = 1;iter <= 10;iter ++) {
        count=0 ;
        for (i = 0;i <= size;i++)
            flags [i] = true;
        for (i = 0;i <= size;i ++) {
            if (flags[i]) {
                prime = i + i + 3;
                k = i + prime;
                while (k <= size) {
                    flags[k] = false;
                    k += prime;
                }
                count = count + 1;
            }
        }
    }
    putchar('\n'); printnum(count);
    myprintstr(" primes\n\n");
}


Resulting assembly code:

[CPU74 assembly]
Code:
   .globl   main
main:
   add   SP, -8, SP
   st.w   r4, [SP, 6]
   st.w   r5, [SP, 4]
   st.w   r6, [SP, 2]
   st.w   r7, [SP, 0]
   mov   &.L.str, r0
   call   @myprintstr
   mov   1, r6
   mov   8191L, r5
   mov   0, r7
.LBB0_1:
   mov   &flags, r0
   mov   1, r1
   mov   r5, r2
   call   @memset
   mov   3, r1
   mov   3, r2
   mov   1, r4
   mov   0, r0
   mov   0, r3
.LBB0_2:
   and   r4, 255L, r4
   brcc   .LBB0_7
   add   r3, r3, r4
   add   r4, r3, r4
   lea   r4, 3, r4
   cmp.ugt   r4, 8190L
   brcc   .LBB0_6
   mov   r2, r4
.LBB0_5:
   st.b   r7, [r4, &flags]
   add   r4, r1, r4
   cmp.ult   r4, 8191L
   brcc   .LBB0_5
.LBB0_6:
   lea   r0, 1, r0
.LBB0_7:
   lea   r3, 1, r3
   cmp.eq   r3, 8191L
   brcc   .LBB0_9
   lea   r1, 2, r1
   lea   r2, 3, r2
   ld.sb   [r3, &flags], r4
   jmp   .LBB0_2
.LBB0_9:
   lea   r6, 1, r6
   cmp.eq   r6, 11
   brncc   .LBB0_1
   mov   10, r1
   st.b   r1, [-1L]
   call   @printnum
   mov   &.L.str.1, r0
   call   @myprintstr
   mov   0, r0
   ld.w   [SP, 0], r7
   ld.w   [SP, 2], r6
   ld.w   [SP, 4], r5
   ld.w   [SP, 6], r4
   add   SP, 8, SP
   ret

# ---------------------------------------------
# Global Data
# ---------------------------------------------
   .section   .rodata.str1.1,"aMS",@progbits,1
.L.str:
   .asciz   "10 iterations\n"

   .comm   flags,8191,2
.L.str.1:
   .asciz   " primes\n\n"


As you can see, the compiler is clever enough to figure out that the initialisation loop can be replaced by a convenient call to memset. For information, this is my implementation of memset. It's optimised for the word size of the processor (16 bits), but not even written in assembler:

[memset function]
Code:
__attribute__((builtin))
void *memset (void *dst, int val, unsigned int len)
{
  if ( !((int)dst & 1) )
  {
    int *d = dst;
    int xval = (val<<8) | (val&0xff);
    for ( int i=0 ; i<(len&0xfffe) ; i+=2 )
      *d++ = xval;
      
    if ( len & 1 )
      *(char*)d = (char)val;

    return dst;
  }

  char *d = dst;
  for ( int i=0 ; i<len ; ++i )
    *d++ = (char)val;

  return dst;
}


And now the results on the CPU74 simulator, which, I should note, is cycle accurate.

[Simulation results]
Code:
/Users/joan/Documents-Local/Relay/CPU74/Simulator/DerivedData/Simulator/Build/Products/Release/c74-sim
10 iterations

01899 primes

Executed instruction count: 2187612
Total cycle count: 2801131
Elapsed simulation time: 3.87819 seconds
Calculated execution time at 1MHz : 2.80113 seconds
Calculated execution time at 8MHz : 0.350141 seconds
Calculated execution time at 16MHz: 0.175071 seconds
Program ended with exit code: 0


Well, I guess this beats them all by an astronomical margin!!


Last edited by joanlluch on Fri Aug 21, 2020 8:22 am, edited 1 time in total.



Fri Aug 21, 2020 6:48 am

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
As a continuation of my previous post, I also accepted the Fibonacci challenge, and this is what I got for my CPU74 architecture:

[Source code]
Code:
int fib(int x)
{
   if (x > 2)
      return fib(x-1) + fib(x-2);
   else
      return 1;
}

Code:
#include "systemio.h"
int main()
{
    int iter, result;
    myprintstr("10 iterations\n");
    for (iter = 1;iter <= 10;iter ++)
    {
       result = fib(24);
    }
   
      printnum( result );
      putchar('\n');
}


Of course, I had to place the 'main' and 'fib' functions in separate files to prevent the compiler from optimizing away the entire loop. If both functions are placed in the same file, only a single call to 'fib' is performed, regardless of the loop variable. By compiling the functions separately, the compiler has no clue about possible side effects of the fib function, so it calls it 10 times as per the source code. I had to do that to show a more comparable result.

So this is the compiled code:

[CPU74]
Code:
   .globl   fib
fib:
   add   SP, -4, SP
   st.w   r4, [SP, 2]
   st.w   r5, [SP, 0]
   mov   r0, r4
   cmp.lt   r4, 3
   mov   1, r5
   brcc   .LBB0_3
   mov   1, r5
.LBB0_2:
   mov   r4, r0
   sub   r0, 1, r0
   call   @fib
   add   r0, r5, r5
   sub   r4, 2, r4
   cmp.gt   r4, 2
   brcc   .LBB0_2
.LBB0_3:
   mov   r5, r0
   ld.w   [SP, 0], r5
   ld.w   [SP, 2], r4
   add   SP, 4, SP
   ret

Code:
   .globl   main
main:
   add   SP, -2, SP
   st.w   r4, [SP, 0]
   mov   &.L.str, r0
   call   @myprintstr
   mov   10, r4
.LBB0_1:
   mov   24, r0
   call   @fib
   sub   r4, 1, r4
   brncc   .LBB0_1
   call   @printnum
   mov   10, r0
   st.b   r0, [-1L]
   mov   0, r0
   ld.w   [SP, 0], r4
   add   SP, 2, SP
   ret

# ---------------------------------------------
# Global Data
# ---------------------------------------------
   .section   .rodata.str1.1,"aMS",@progbits,1
.L.str:
   .asciz   "10 iterations\n"


Now, the results are the following:

Code:
/Users/joan/Documents-Local/Relay/CPU74/Simulator/DerivedData/Simulator/Build/Products/Release/c74-sim
10 iterations
46368
Executed instruction count: 9096884
Total cycle count: 13160704
Elapsed simulation time: 17.5367 seconds
Calculated execution time at 1MHz : 13.1607 seconds
Calculated execution time at 8MHz : 1.64509 seconds
Calculated execution time at 16MHz: 0.822544 seconds
Program ended with exit code: 0


It's still significantly faster than anything else, and by a very long stretch! According to Roelh's posted figures for the 10 MHz 8086, the CPU74 at just 8 MHz is 10 times faster than the 10 MHz 8086. Even at only 1 MHz, the CPU74 is able to beat the 10 MHz 8086 for all the tested compilers. That's a very impressive result, much better than I had ever imagined.


Fri Aug 21, 2020 8:21 am

Joined: Mon Oct 07, 2019 2:41 am
Posts: 292
To be fair, I think a comparison to a 68000 would be a better choice.
The Intel CPUs have to access variables from memory, so they are slower.
Ben.


Sat Aug 22, 2020 8:49 pm

Joined: Sun Oct 14, 2018 5:05 pm
Posts: 7
I tried the fib test too, but was too embarrassed to post the results... [1]

The cintcode VM that the BCPL compiler targets is a bytecode for a load/store architecture with a 2-deep register stack, so almost every single operation is memory-to-register or register-to-memory. Running the 32-bit VM on a 16-bit CPU with an 8-bit data bus is somewhat sub-optimal...

-Gordon

[1] ok, here it is:
Code:
Time: 37102ms
  Result was 46368


So 37 seconds to do 10 iterations of this calculation.

The BCPL code is:
Code:
LET fib (x) = VALOF
{
  TEST x > 2 THEN
    RESULTIS fib (x - 1) + fib (x - 2)
  ELSE
    RESULTIS 1
}


And the bytecode (which I commented) looks like:
Code:
// Entry to:   fib
  20: L1:
  20:     L2            ; regB := regA ; regA := 2
  21:    JLE  L3        ; Jump to L3 if <= 0
  23:    LM1            ; regB := regA ; regA := -1
  24:    AP3            ; Add local var 3 into A
  25:     LF  L1        ; B := A ; Load addr of L1 into A
  27:     K4            ; Call function, reserve 4 stack locations
  28:    SP4            ; Store A into stack loc. 4.
  29:     LM    2       ; B := A ; Load -2 into A
  31:    AP3            ; Add stack 3 into A
  32:     LF  L1        ; B := A ; Load addr of L1 into A
  34:     K5            ; Call function, reserve 5 stack locs.
  35:    AP4            ; B := A ; A := stack 4
  36:    RTN 
  37: L3:
  37:     L1            ; B := A ; A := 1
  38:    RTN 

Note that when a function is called, the first argument is in A
but also in stack location 3. 2nd argument is in stack 4, etc.
Reg B is popped into A and stack 3 as part of the K call sequence.


I think the compiler is doing a good job of the code generation - the bytecode VM is more or less the "ideal" machine for this, but interpreting it on a 65816 is the crux here. If only I had a 32-bit CPU that could do the B := A operation internally in a single cycle, for example... currently, with 2 x 65816 LDA/STA instructions in 16-bit mode, it takes 20 machine cycles. (Which is a reason I'm lurking here - hello, and sorry for slightly hijacking the thread. I'm finding I like Retro-BCPL more than I like the 65816, so maybe one day I'll either find a CPU that's a better fit for this bytecode, or adapt some existing core - you never know!)

-G.


Sun Aug 23, 2020 9:50 am

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
oldben wrote:
To be fair, I think to compare to a 68000 would be a better choice.
The intel cpu's have to acess variables from memory, so it is slower.
Ben.
I strongly believe that the 68000 processor will rank no better than the 8086 for 16-bit data. The 68000 uses many cycles to do its thing, from 4 to more than 20 for some instructions. It has the advantage of having more general-purpose registers than the 8086, which may help reduce some memory-access overhead, but at the end of the day my guess is that there shouldn't be that much difference (on algorithms performed with 16-bit arithmetic).

I think I have posted this elsewhere, but there are a lot of reasons why the CPU74 is so fast; here's a summary:

The CPU74 is designed as a Harvard architecture, a pure load/store architecture, with enough registers to avoid a lot of memory accesses. Instruction fetching and initial decoding of the next instruction happen simultaneously with the execution of the current instruction. Thanks to that:

- Most instructions are executed in 1 cycle.
- Data memory reads or writes use 2 cycles.
- Unconditional jumps and conditional taken jumps use 2 cycles,
- and calls and returns take 3 cycles.

On the other hand, the instruction set is very optimised for compilers to take full advantage of it, with instructions such as:

- conditional moves and selects that help to avoid branching, and promote the insertion of speculative code
- careful selection of the instructions that depend on status register flags, which gives the compiler freedom to reorder instructions while the status flags may still be used one or two instructions later (this is something that, surprisingly, the original designers of CISC processors failed to understand; the VAX-11, for example, failed completely to get this right, whereas the 68000 instruction set does a much better job in this regard)

In summary, it's the combination of the Harvard architecture, the LLVM compiler, and the very compiler friendly instruction set that makes it so fast.

To me, the most interesting benchmark against my CPU74 architecture would be the original ARM-V1 from 1985, or even the ARM-V2 processor, at comparable clock frequencies. This is a link to the description of the original architecture: https://en.wikichip.org/wiki/acorn/microarchitectures/arm1. In effect, the CPU74 borrows a lot of the ideas of the ARM1, and they are architecturally very similar, except that the former is a Harvard architecture with simultaneous fetching and execution (exactly 1 cycle per basic instruction), and the latter is a 3-stage pipelined von Neumann architecture (effectively requiring only 1 clock cycle for most sequential instructions).

Among the processors available before or up to 1985, I'm pretty sure that the only one that would beat the CPU74 is the ARM1 (at the same clock frequency). But as noted before, it would be really nice to have actual data with known algorithms to make the comparison. I would really like to have that... It is also interesting to note that the ARM2 processor was released in 1987 with a clock frequency of only 8 MHz, and I believe it must have been a beast compared with anything else of the same size available at the time. Still, I suspect that one possible issue for the original ARM1 was the availability of quality compilers that really knew how to squeeze the best out of it.

Joan

[Edit:]
I just found this list of instructions-per-second data for different processors: https://en.wikipedia.org/wiki/Instructions_per_second. As a matter of comparison, the Fibonacci test execution for the CPU74 resulted in 9096884 instructions executed in 13160704 cycles, which is 0.6912 instructions per cycle in that particular case. This means it would give 5.53 MIPS at 8 MHz or 11.06 MIPS at 16 MHz, versus the average 4 MIPS of the ARM2 at 8 MHz. But of course the ARM is 32 bits; this is just raw instruction-execution data, and for specific algorithm benchmarks the compiler and the instruction set have a strong influence.


Sun Aug 23, 2020 10:35 am

Joined: Mon Oct 07, 2019 2:41 am
Posts: 292
I sped up my CPU by using a faster clock. Looking at my CISC design, I found
I could add more decoding don't-cares and simplify to 2 logic levels into the ROM
lookup tables. ROM tables are 20 x 256 bits. I can get a 420 ns ALU cycle using TTL -
74LS and a few 74S parts for a 20-bit ALU. That is the best I can design in TTL, on paper
or in FPGA emulation.
The next project may be a 24-bit register-file CPU, if I can find dual-port memory.
Ben.


Sun Aug 23, 2020 8:46 pm

Joined: Mon Oct 07, 2019 2:41 am
Posts: 292
Time to bump the thread. I have a new ISA for my CPU. It runs at 5.3 MHz, ~1976 vintage. 16-bit bus, 32-bit ints.
I get ~15 seconds for the sieve.


Sun May 09, 2021 8:11 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1666
Probably the wrong thread - this one is called "Some C compilers"


Sun May 09, 2021 8:54 pm