View unanswered posts | View active topics It is currently Thu Mar 28, 2024 10:39 am



Reply to topic  [ 45 posts ]  Go to page 1, 2, 3  Next
 Some C compilers 
Author Message

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
.
(well, the air turned blue there for a while as I lost a previous draft of this...)

Over on stardot, davidb just found an upcoming presentation for a C compiler project, and that led me to an interesting writeup about another project, and I thought it might be good to have some collected links to C compiler projects, inevitably in various states of progress.

First then, Rui Ueyama's interesting story of building a C compiler over 40 days. Even an experienced C programmer might find the language is more complicated than they thought.

Second, I should mention two existing threads here:
- C64 Compiler by Rob, a more-than-C compiler targeting several cores, including FT64 and OPC6
- Ide68k C compiler an IDE found by Dajgoro

And now, here are some links to repositories:
- bbc-c "C compiler for the BBC Micro series of micros" mentioned above (written in python)
- 8cc, the 40-day compiler mentioned above.
- LCC: A Retargetable C Compiler: Design and Implementation
- TCC: Tiny C Compiler by Francis Bellard
- cc65 of course, for the 6502
- Puppeh's gcc-6502 work in progress (and support tools including a libc)
- bcompiler, bootstrapping "a tiny compiler for a toy programming language somewhat reminiscent of C and Forth"
- C in four functions as an educational demonstration (good discussion here.)
- legacy cc - "earliest versions of the very first c compiler"
- selfie, an "educational software system of a tiny self-compiling C compiler, a tiny self-executing MIPS emulator, and a tiny self-hosting MIPS hypervisor."



Once upon a time, there was going to be a huge space-trading-exploration online multiuser game which would involve programming the ship's computer. The architecture was called DCPU-16 and was a bit similar to a 6502. Several people started working on compilers for this architecture - their work might be interesting as starting points.
- dcpu16 target for LLVM (my fork of a vanished project)
- LLVM backend for dcpu-16 processor - a spiritual successor to the above
- DCPUB, "a language similar to B"
- TenC, "A high-level, C-like language for the DCPU-16"
- py-dcpu-c-compiler C to DCPU-16 compiler written in python.
- DCPU-TCC "Super Awesome DCPU C Compiler based on TCC"
- d16cc "A C compiler for notch's DCPU16"
- 0x10c-Compiler "Effort to create a compiler for 0x10c"


Thu Aug 31, 2017 10:34 am
Profile

Joined: Thu Aug 24, 2017 8:52 pm
Posts: 4
Hi,

Here is another C compiler that might belong on your list:

https://github.com/alexfru/SmallerC or http://hackaday.io/project/5569-smaller-c

-Xark


Tue Sep 05, 2017 8:41 pm
Profile

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
Thanks Xark - that looks pretty interesting, as it targets x86, MIPS or trillek RISC, and is self-hosting.


Tue Sep 05, 2017 8:43 pm
Profile

Joined: Thu Aug 24, 2017 8:52 pm
Posts: 4
You are very welcome. I will also mention that a friend of mine Valentin ported this compiler to his custom Sweet32 FPGA CPU (RISC-ish). It worked quite well. Code (for VHDL CPU and compiler port) are at https://github.com/Basman74/Sweet32-CPU

-Xark


Wed Sep 06, 2017 4:10 am
Profile

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
Sweet!


Wed Sep 06, 2017 7:52 am
Profile

Joined: Wed Nov 20, 2019 12:56 pm
Posts: 92
vbcc is well worth a look - I've been creating a new backend for it recently, and the interface is really nice. The compiler itself is mature and well-tested, having been widely used on mc68000-based platforms in the 90s. There's even a "generic risc" backend which can be used as a starting point for a new one.
http://www.compilers.de/vbcc.html

(The license doesn't allow commercial use or distribution of modified versions of the compiler without written permission.)


Sat Dec 07, 2019 9:13 pm
Profile

Joined: Mon Oct 07, 2019 1:26 pm
Posts: 46
Hi, for my new Kobold computer I would like to have a c compiler (but that might be a long road).
In the past I made a Pascal compiler in assembler for the Z80, and over the last few years I did spend some time making a web-based C compiler, but I did not yet finish it.

When modifying an existing C compiler to support a new instruction set, the general approach is to choose a C compiler and change its code generator.
The problem is, that after this is done, you are stuck with the C compiler that was chosen. Also, making a code generator might not be easy, depending on the interface between the parser and code generator of the chosen compiler. It might also be that the parser does optimizations that are not effective for your instruction set, or even might make things worse.

It would be good to have a standard interface between parser and code generator, especially for hobbyists and single-person companies.

I have a proposal for this interface. It is based on JSON syntax, so also well equipped for a web-based Javascript compiler.

Here it is: Format for C compiler code generator

This is a first draft, and all details are not yet in place.

I hope that it is clear that after a single JSON-read action, you will have the whole program (or a single function) in memory, as a
structured tree (AST). Perhaps a few actions, that were previously done by the parser, must now be done in the code generator.

Also, this approach makes the parser totally independent from the target CPU.

Please let me know what you think !


Last edited by roelh on Fri Dec 20, 2019 8:33 pm, edited 2 times in total.



Thu Dec 12, 2019 3:10 pm
Profile

Joined: Mon Oct 07, 2019 2:41 am
Posts: 585
I think keep small. Is self compiling on the host machine possible? What about floating point and C libraries?


Fri Dec 13, 2019 12:38 am
Profile
User avatar

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
roelh wrote:
Hi, for my new Kobold computer I would like to have a c compiler (but that might be a long road).
[...]
Also, this approach makes the parser totally independent from the target CPU.

Please let me know what you think !


Hi Roelh,

The conversion of C code to JSON and back to a memory graph in the way you propose, certainly would help to avoid the C parser and would enable custom backends to start from that. However, in my opinion it is a relatively small step that still requires a lot of custom work on the target code generator. Although the conversion from plain C to JSON is an interesting exercise, it is only a relatively direct transformation that does not take into account code semantics, and it is highly dependent on the language (C language in this case). In other words, it is not anything else than encapsulating plain C, into a JSON string, and of course the target code generators would still require a JSON parser (although this is quite straightforward)

In my opinion, it would be more interesting to produce a canonicalised, medium-level, intermediate representation form (IR) instead, which would be both language independent and target independent. This is the approach taken by modern compilers such as gcc and llvm. The IR form is largely target independent but fully expresses the semantics of the original C code (or any source language)

As I worked on llvm, I can post an example from what I learned. Consider the following C code and equivalent llvm IR representation:

C source code
Code:
void testfunction(int x)
{
  if (x > 0) {
    printfunction(x);
  }
}


IR Representation
Code:
define void @testfunction(i16 %x)  {
entry:
  %cmp = icmp sgt i16 %x, 0
  br i1 %cmp, label %if.then, label %if.end

if.then: 
  %call = tail call i16 bitcast (i16 (...)* @printfunction to i16 (i16)*)(i16 %x)
  br label %if.end

if.end:             
  ret void
}


As it stands, the llvm produced IR code is expressed as a human readable plain text file, which requires a dedicated IR parser on the backend implementation to create a memory graph (DAG) from it. But that's just an example based on what llvm offers: the same IR can be generated in a much more efficient binary form that backends can use in nearly the same way. Nothing prevents you from creating your own JSON encapsulated version of it.

The point that I want to show is that in my opinion, starting from IR (or a JSON encapsulated version of it) to produce a custom target compiler, is a better approach than starting from JSON encapsulated C. Once the IR code is available, creating a backend from it (or code generator as to refer to it), is easier because the IR is no longer tied to C source code, it may already incorporate a number of "target independent" optimisations such as at least constant expression folding and dead code removal, which targets do not need to care about. Since the IR already resembles assembly language, the backend implementation is mostly a matter of applying the required transformations and optimisations to it, and convert it into actual target CPU instructions, which require fewer steps than starting from C.

[EDIT]: Just to complete this post, I want to post the resulting assembly code from the above IR that my llvm backend implementation for the CPU74 architecture produces:

CPU74
Code:
   .globl   testfunction
testfunction:
   cmp.lt   r0, 1
   brcc   .LBB0_2
   call   @printfunction
.LBB0_2:
   ret
In this case, the transformations to the IR just involved the reversal of the condition code to skip the "if.then" code and jump directly to the "if.end" label when the reverse condition is met. Also, since the CPU74 architecture does not directly support "signed less than or equal", the reverse condition is canonicalised to a "signed less than" condition to conform with the available instructions.


Fri Dec 20, 2019 8:50 am
Profile

Joined: Mon Oct 07, 2019 1:26 pm
Posts: 46
Hi Joan,

thank you for your reaction.

But I think I made something not very clear. The proposed JSON format has been processed by the parser. All checks that a C compiler should do are allready done. So the JSON format is (almost ?) on the same level as the IR format that you mention.

As I do several things web-based (like the assembler), the code generator would at first probably be
made in Javascript, and that ignited the idea to have JSON as intermediate format.
I already had a half-working C compiler, strangely enough programmed in PHP, that provided an intermediate format. This week I've been
busy translating this intermediate format to the JSON format, and thats 75% done.
A javascript code generator will then generate assembler.

In the end, I would like to have the compiler running on the homebuilt CPU itself. Both the PHP parser and the Javascript code generator must then be converted to C, and after the first compile this should theoretically be selfhosting. Also another parser can be used (I like the 8cc compiler because it seems quite small and simple) but then there is the question how to interface to the IR code. That's a question that you have solved for the LLVM compiler, but I have the feeling that the LLVM compiler is much too big to run on a small computer.


My current PHP translates C as follows (example):

Code:
int ar[10];

void printfunction (int u, char c)
{    }

void f(int i){
char t;
i = ar[i] + 1;
}

void testfunction(int x)
{
  if (x) {
    printfunction(x, 'a');
  }
}


Code:
{"function": "printfunction","par": { "u": "int","c": "char"} ,"return": "void",
"mod": [ ],"body":
[ { "var": { } } ,
  ]
}

{"function": "f","par": { "i": "int"} ,"return": "void",
"mod": [ ],"body":
[ { "var": { "t": "char"} } ,
  [ "i","=",[ [ "ar",":","i"] ,"+",1] ] ]
}

{"function": "testfunction","par": { "x": "int"} ,"return": "void",
"mod": [ ],"body":
[ { "var": { } } ,
  { "if": "x",
    "then": [ [ "printfunction",[ "x",97] ] ] } ]
}


It delivers a list of functions, and also a list of global variables and types (both not shown).


Fri Dec 20, 2019 8:19 pm
Profile
User avatar

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
Hi Roelh,

Yes LLVM is a monster. Definitely only suitable for cros-compiling purposes as long as small processors are concerned .

Back on subject, I don’t see a marked difference between the C code and your JSON string, at least on your proposed example. I would be interested to see what’s the output if you replace if(x) by if(x!=0) on your example. The C code semantics would be identical, so the output should be identical too, right? however I assume that the output will just include the ‘!=‘ embedded in the JSON string as it is present in the C code (please correct me if I’m wrong). This is not what a canonical IR representation would do.

I’m also unsure if I managed to fully explain the implications of a IR representation. I’m not dismissing your approach, just saying that it appears to be a very different one. My point is that your JSON string is still very high level (and similar to C) as it does not include actual relatively low level instruction representations that would resemble assembly language. Your output only appears to work at the stage of the parser, not the compiler. For example, on all processors the if statement requires a comparison or similar instruction to tell whether a branch should execute or not. Just look at the IR example I posted earlier. It contains such a compare instruction and assembly language like labels to jump to. The following step to transform IR to target assembly code is minimal in this particular example because the IR is already relatively low level. However, such low level representation appears to be missing on your JSON representation, which would require a different kind of work on the target code generator (?). I hope I made myself clearer now.

In any case, please keep the good work, as I’m already learning a lot from your previous builds!

Joan


Fri Dec 20, 2019 11:27 pm
Profile

Joined: Mon Oct 07, 2019 1:26 pm
Posts: 46
Hi Joan,

It seems the LLVM IR output is some generalized assembly output. This probably assumes a load-store type cpu with a lot of registers, where all operations are between registers.

This is not suitable at all for small cpu's like Kobold. Kobold has 6 available registers (PC and WP can not be used freely), only two can address memory, and there are only a few possibilities for register-to-register operations.

To do proper code generation for simple processors like Kobold, the code generator module has to have a high-level view. That is what the JSON structure provides.


Ok, so I made a simple C compiler (2400 lines PHP) that outputs the JSON intermediate language. And I made a Javascript code generator (600 lines) that produces assembly for the Kobold K2. Both the C compiler and code generator are not yet finished.

You can try the compiler-generator combination online at http://www.enscope.nl/compiler/. It contains links to my C language manual, to the JSON language specification, and to the Javascript code generator. There is an integrated editor with syntax highlighting and matching parenthesis indication.

The C-compiler starts with a demo program for function call, function pointers, pointers, structures, arrays. I also handles IF statements. You can paste the result in the Kobold assembler http://www.enscope.nl/simas_kobold2/ and have it assembled. It will not run yet, because there is no start-up code and there are no support functions.

The code generator can at this moment only handle expressions that do not exceed the available registers. There are also some other limitations that must be solved.


Sat Dec 28, 2019 2:25 pm
Profile
User avatar

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
Hi Roelh,

That's pretty awesome, particularly because it's all online ! (and must have been a lot of fun, too)

The LLVM IR assumes 'infinite' number of registers, or rather, every instruction line that produces a result is assigned to a new variable (not actually a physical register). It's the responsibility of the backend to use the available physical registers, or to create memory spills if not enough are available, or to implement the memory addressing modes. It's also true that the more available physical registers and the more general purpose they are, the less code needs to be overridden from the base classes. So yes, that means that generating code for small cpus can get harder with LLVM than starting from scratch by parsing a C file and generating code right from it. On the positive side, if the cpu has at least some minimal set of features which LLVM is happy with, it is possible to get high quality, very optimised compiled code.


Sat Dec 28, 2019 6:11 pm
Profile
User avatar

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
Hi Roelh,

just as a matter of fun, I picked the code you posted as the example for your compiler (with a minor syntax difference) and compiled it using my CPU74 LLVM backend, and this is what I got:

Code:
int sum (int num1, int num2)
{
    return num1+num2;
}
int (*f2p) (int, int);
int n;

void f()
{ int op1, op2;
    op2 = sum(op1,n);
    f2p = sum;
    //Calling a function using function pointer
    op1 = (*f2p)(10, 13);
}

// demo structure member, array element, and if statement

char ar[8];

struct tt {
  int r1    ;
  int r2;
};

int i;

void test(){
struct tt* p;
if (i==5)
  { i = p-> r2; }
else ar[6] = i;
}


CPU74
Code:
# ---------------------------------------------
# sum
# ---------------------------------------------
   .globl   sum
sum:
   add   r1, r0, r0
   ret

# ---------------------------------------------
# f
# ---------------------------------------------
   .globl   f
f:
   mov   &sum, r0
   st.w   r0, [&f2p]
   ret

# ---------------------------------------------
# test
# ---------------------------------------------
   .globl   test
test:
   ld.w   [&i], r0
   cmp.eq   r0, 5
   brcc   .LBB2_2
   st.b   r0, [&ar+6]
.LBB2_2:
   ret

# ---------------------------------------------
# Global Data
# ---------------------------------------------
   .comm   n,2,2
   .comm   f2p,2,2
   .comm   i,2,2
   .comm   ar,8,2


Well, as you can see, most of the original C code is nowhere in the assembly output because it is never executed or it does not produce an useful result !

So I made some changes to the original code to get some actual code generation actually happening: basically the f() function now returns a value, and the struct pointer is passed as an argument on the test() function:

Code:
int sum (int num1, int num2)
{
    return num1+num2;
}
int (*f2p) (int, int);
int n;

int f()
{ int op1, op2;
    op2 = sum(op1,n);
    f2p = sum;
    //Calling a function using function pointer
    op1 = (*f2p)(10, 13);
    return op1;
}

// demo structure member, array element, and if statement

char ar[8];

struct tt {
  int r1    ;
  int r2;
};

int i;

void test(struct tt* p){
if (i==5)
  { i = p-> r2; }
else ar[6] = i;
}


The result is this:

CPU74
Code:
# ---------------------------------------------
# sum
# ---------------------------------------------
   .globl   sum
sum:
   add   r1, r0, r0
   ret

# ---------------------------------------------
# f
# ---------------------------------------------
   .globl   f
f:
   mov   &sum, r0
   st.w   r0, [&f2p]
   mov   23, r0
   ret

# ---------------------------------------------
# test
# ---------------------------------------------
   .globl   test
test:
   ld.w   [&i], r1
   cmp.ne   r1, 5
   brcc   .LBB2_2
   ld.w   [r0, 2], r0
   st.w   r0, [&i]
   ret
.LBB2_2:
   st.b   r1, [&ar+6]
   ret

# ---------------------------------------------
# Global Data
# ---------------------------------------------
   .comm   n,2,2
   .comm   f2p,2,2
   .comm   i,2,2
   .comm   ar,8,2


Pretty good, isn't' it?
(Both the call to the pointer function, and the address of the array element are folded as constant expressions, so there's no call to 'sum', and no indexed memory access to 'ar' (&ar+6 is a physical memory location, not an indexed addressing mode))


Sat Dec 28, 2019 6:49 pm
Profile

Joined: Wed Nov 20, 2019 12:56 pm
Posts: 92
roelh wrote:
Ok, so I made a simple C compiler (2400 lines PHP) that outputs the JSON intermediate language. And I made a Javascript code generator (600 lines) that produces assembly for the Kobold K2. Both the C compiler and code generator are not yet finished.


Wow - that is very cool! Nice work!

Quote:
The code generator can at this moment only handle expressions that do not exceed the available registers. There are also some other limitations that must be solved.


I'll watch with interest as this progresses - I'm especially interested in how much trouble you have dealing with object lifespans and register spilling.

Like joanlluch I couldn't resist running your test program through my own pet project's toolchain - in my case it's a vbcc backend for my EightThirtyTwo CPU. VBCC has a function that backend writers can use for debugging, which dumps a human-readable version of its intermediate representation - so first, for comparison, here's what it looks like for your program:
Code:

sum:
   allocreg r2
   add int M0+-8(FP)(num2),M0+-4(FP)(num1)->M0+r2(0x178ff40)[S]
   set-return int [tgt-addressing-mode]
   freereg r2

f:
   nop
   push int M0+_n(n) size=4
   push int M0+0(FP)(op1) size=4
   call function M0+_sum(sum) size=8 => sum
   move pointer #M0+_sum(sum)->M0+_f2p(f2p) size=4
   push int #I13 size=4
   push int #I10 size=4
   call function M0+_sum(sum) size=8 => sum

test:
   nop
   compare int M0+_i(i),#I5
   bne L5
   allocreg r2
   add-int-to-pointer int M0+0(FP)(p),#I4->M0+r2(0x1792110)[S] ptype=pointer
   move int [tgt-addressing-mode]->M0+_i(i) size=4
   freereg r2
   bra L6
L5
   convert char M0+_i(i)->M6+_ar(ar) from int
L6


The code itself probably looks ridiculously verbose - and yes, there's plenty of room for optimisation - but each instruction is only a single byte, so the code footprint's actually not that bad:

Code:
_sum:
   stdec   r6
   mt   r2
   stdec   r6

   li   IMW0(12)
   ldidx   r6
   mr   r2

   li   IMW0(8)
   ldidx   r6
   add   r2

   mt   r2
   mr   r0

   ldinc   r6
   mr   r2
   ldinc   r6
   mr   r7


_f:
   stdec   r6
   stdec   r6

   ldinc   r7
   .int   _n
   ldt
   stdec   r6

   li   IMW0(4)
   ldidx   r6
   stdec   r6

   ldinc   r7
   .int   _sum
   exg   r7

   li   IMW0(8)
   add   r6

   ldinc   r7
   .int   _f2p
   mr   r1

   ldinc   r7
   .int   _sum

   st   r1

   li   IMW0(13)
   stdec   r6

   li   IMW0(10)
   stdec   r6


   ldinc   r7
   .int   _sum
   exg   r7

   li   IMW0(8)
   add   r6

   ldinc   r6
   ldinc   r6
   mr   r7

_test:
   stdec   r6
   mt   r2
   stdec   r6
   stdec   r6


   ldinc   r7
   .int   _i

   ldt
   mr   r1

   li   IMW0(5)
   sgn
   cmp   r1


   cond   NEQ

   li   IMW1(PCREL(l5)-1)
   li   IMW0(PCREL(l5)-0)
      add   r7

   ld   r6
   mr   r2

   li   IMW0(4)
   add   r2

   ldinc   r7
   .int   _i

   mr   r1

   ld   r2

   st   r1

   li   IMW1(PCREL(l6)-1)
   li   IMW0(PCREL(l6)-0)
   add   r7
l5: #


   ldinc   r7
   .int   _ar + 6

   mr   r1

   ldinc   r7
   .int   _i

   byt

   ldt

   stbinc   r1

l6: #
   ldinc   r6
   ldinc   r6
   mr   r2
   ldinc   r6
   mr   r7


Sat Dec 28, 2019 8:00 pm
Profile
Display posts from previous:  Sort by  
Reply to topic   [ 45 posts ]  Go to page 1, 2, 3  Next

Who is online

Users browsing this forum: No registered users and 2 guests


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Search for:
Jump to:  
cron
Powered by phpBB® Forum Software © phpBB Group
Designed by ST Software