


 EightThirtyTwo 

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
BigEd wrote:
Congratulations! It sort of looks vaguely like the feedback from llvm would push you towards a more CISCy machine - is that so?

Although this project is using a different compiler, based on my relatively recent experience in implementing an LLVM backend, I would rather say the opposite.

LLVM is happy with pure load/store architectures that have plenty of general-purpose registers, a fully orthogonal 3-operand instruction set, only the essential ALU ops and addressing modes, and embedded immediates that are as large as possible.

According to that, CISC architectures (or even some so-called RISC designs, like the AVR or the MSP430) don't really play well with LLVM because of (1) their exceptions to full orthogonality, (2) the special use of some registers, like the 68000 Address Registers or the AVR X, Y and Z registers, and (3) the presence of special instructions that are difficult for the compiler to use at their full potential, or at all.

I would say that LLVM would even have some trouble producing fully optimised code for the epitome of CISC, the VAX-11 architecture, even though it is fully orthogonal: some of its addressing modes would be difficult to use, and memory-to-memory operations would have to be handled in explicit ways, possibly missing optimisation opportunities or creating more register pressure than necessary for an architecture that was never meant to be a 'load/store' one. Moreover, maybe as much as a third of the available instructions would never be used by the compiler.
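
To illustrate the load/store point, here is a rough sketch (my own, not output from any real backend) of how a memory-to-memory update has to be decomposed on the kind of machine LLVM prefers, versus what a VAX-style machine could express in a single instruction:

Code:
/* A memory-to-memory update: both operands live in memory. */
void add_to(int *dst, const int *src)
{
    *dst += *src;
    /* On a load/store, 3-operand RISC the compiler must emit something
       like (hypothetical mnemonics):
           load  r1, [src]
           load  r2, [dst]
           add   r2, r2, r1
           store [dst], r2
       A VAX-style ISA could express the whole thing as one
       memory-to-memory ADD with two operand specifiers, but a compiler
       built around the load/store model will rarely find it.          */
}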

The above is all based on my own experience with implementing the LLVM CPU74 backend, of course, so others may have a more general view.

Joan


Mon Dec 09, 2019 9:07 am

Joined: Wed Nov 20, 2019 12:56 pm
Posts: 92
joanlluch wrote:
Although this project is using a different compiler, based on my relatively recent experience in implementing an LLVM backend, I would rather say the opposite.

LLVM is happy with pure load/store architectures that have plenty of general-purpose registers, a fully orthogonal 3-operand instruction set, only the essential ALU ops and addressing modes, and embedded immediates that are as large as possible.


So... pretty much the opposite of my CPU, then! Thanks for the insight into LLVM - I'm glad I didn't pour significant time into learning about it for this project.

Quote:
According to that, CISC architectures (or even some so-called RISC designs, like the AVR or the MSP430) don't really play well with LLVM because of (1) their exceptions to full orthogonality, (2) the special use of some registers, like the 68000 Address Registers or the AVR X, Y and Z registers, and (3) the presence of special instructions that are difficult for the compiler to use at their full potential, or at all.


vbcc may have an advantage here: the 68000 was one of its first (if not the very first) target platforms, so the idea that different registers might have different capabilities has been taken into account from the start.

Quote:
The above is all based on my own experience with implementing the LLVM CPU74 backend, of course, so others may have a more general view.


Much of it echoes what I'm discovering with the vbcc backend, actually - it's easy to generate working code, but much less easy to generate code that's efficient and makes good use of CPU features. The biggest obstacle is the lack of registers - and given that shortage, anything that makes accessing the stack simpler is a valuable CPU feature. Beyond that, the only real opportunity to make good use of CPU features is when emitting code blocks for inlined memcpy and suchlike.
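
For what it's worth, the kind of inlined copy block I mean looks roughly like this in C (a hand-written rendering of the pattern, not literal vbcc output):

Code:
#include <stdint.h>

/* A small, fixed-size, word-aligned copy expanded into straight-line
   word loads and stores: no call overhead, no loop counter, and the
   backend is free to schedule the loads and stores as it likes.      */
static void copy16(uint32_t *dst, const uint32_t *src)
{
    dst[0] = src[0];
    dst[1] = src[1];
    dst[2] = src[2];
    dst[3] = src[3];
}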

I've re-written a few compiled routines in hand-written assembler and brought their size down to about 30% of the compiled version, so there's plenty of scope for optimisation, but I doubt I'll be able to bring the compiled output down by more than about 20-25% from where it is now.


Mon Dec 09, 2019 11:53 am

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
I would say that the actual instruction set has a considerable influence on what can be optimised by hand, and to what extent, compared with compiler output. I think that compiler-generated code for some architectures such as RISC-V can hardly be improved, because there's simply not much to be done beyond what the compiler already does. Others, such as the 6502, are totally different: compiler-generated code for it can't be good compared with human-written assembly code. Possibly the 68000 is somewhere in the middle, unless the compiler is specifically designed for it.

My experience with the LLVM compiler is that it aggressively attempts to perform a lot of supposed "target independent" optimisations that may not be desirable or appropriate for simple targets, including the generation of speculative code to avoid branches; the emission of code fragments involving multiple-shift instructions as a replacement for virtually anything that has a power of two in it; the emission of end-of-loop operations such as a multiply or divide as a replacement for actual iterations; and others. This behaviour is highly detrimental on architectures with cheap branching and expensive shifts or arithmetic. Such targets must explicitly reverse all of that, because there are not enough compiler hooks to prevent these (supposedly) 'target independent' optimisations, and in my opinion it is the reason why gcc still remains a better compiler option for AVR, MSP430, and many of our favourite 'old' targets (the latter are not even supported by LLVM to begin with).
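
Two concrete examples of the kind of rewrite I mean, sketched in C (my own reconstruction of the pattern, not literal LLVM output):

Code:
/* 1. Multiplication by a power of two turned into shifts. */
unsigned scale(unsigned x)
{
    return x * 8u;
    /* A generic optimiser rewrites this as x << 3, which is a win on a
       machine with a barrel shifter but a loss when shifts cost one
       cycle per bit and a hardware multiply is a single cycle.        */
}

/* 2. Speculative, branch-free code in place of a cheap branch. */
int clamp_at_zero(int x)
{
    if (x < 0)
        return 0;
    return x;
    /* A generic optimiser may turn the branch into straight-line code,
       e.g. x & ~(x >> 31) on a 32-bit target with arithmetic shifts,
       trading a cheap conditional branch for an expensive 31-bit shift
       on a simple machine.                                            */
}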


Mon Dec 09, 2019 1:19 pm

Joined: Wed Nov 20, 2019 12:56 pm
Posts: 92
joanlluch wrote:
I would say that the actual instruction set has a considerable influence on what can be optimised by hand, and to what extent, compared with compiler output.


Yes indeed - and again the number of available registers is a big factor too. When registers are limited it becomes much more important to be smart about the order in which operations occur; the compiler tends to reach for the easy option of just shoving things on the stack.
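
As a toy illustration of what I mean by ordering (my own example, nothing to do with the real vbcc allocator):

Code:
int sum_of_products(int a, int b, int c, int d, int e, int f)
{
    /* Evaluating one product at a time keeps at most two values live:
           t = a*b;  t += c*d;  t += e*f;
       Computing all three products before doing any additions keeps
       three partial results live at once, which on a register-starved
       target means at least one of them is spilled to the stack.      */
    return a*b + c*d + e*f;
}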

Quote:
My experience with the LLVM compiler is that it aggressively attempts to perform a lot of supposed "target independent" optimisations that may not be desirable or appropriate for simple targets, including the generation of speculative code to avoid branches; the emission of code fragments involving multiple-shift instructions as a replacement for virtually anything that has a power of two in it;


Yeah, that would be bad for EightThirtyTwo - I have a single-cycle multiply (the FPGA has embedded multipliers, so I might as well use them) but my shifter only shifts one bit per cycle, so I actually optimise in the opposite direction there, replacing shifts with multiplies where I can!
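
Concretely, for a constant left shift the rewrite is just this identity (shown in C):

Code:
#include <stdint.h>

/* On 832 a shift costs one cycle per bit but a multiply is a single
   cycle, so a constant left shift is cheaper expressed as a multiply:
       x << 5  is the same as  x * (1u << 5), i.e. x * 32,
   for unsigned x (and for signed x as long as nothing overflows).    */
static uint32_t shift_left_5(uint32_t x)
{
    return x * 32u;        /* instead of x << 5 */
}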

Quote:
and in my opinion it is the reason why gcc still remains a better compiler option for AVR, MSP430, and many of our favourite 'old' targets (the latter are not even supported by LLVM to begin with).


Some of them, sadly, may not be supported by gcc for much longer - VAX and AVR are both under threat. I don't believe anyone's stepped up yet to port either backend to the new condition-code scheme (a decidedly non-trivial task), and the old one is deprecated in gcc 10 and scheduled for removal in gcc 11. M68k was under threat too, but thankfully someone's updated it.


Mon Dec 09, 2019 11:04 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
(Sorry about my llvm/vbcc confusion upthread!)

Reading your latest blog, Alastair, I'm quite intrigued by the combination of the small register set and the tmp register. It seems almost like an accumulator, in the sense of being an implicit register in some (many?) instructions.

I see you're running at 133MHz - on what hardware platform is that? I don't think I've seen hobby FPGA CPUs clocked quite so fast.


Sun Jan 12, 2020 9:03 pm

Joined: Wed Nov 20, 2019 12:56 pm
Posts: 92
BigEd wrote:
I'm quite intrigued by the combination of the small register set and the tmp register. It seems almost like an accumulator, in the sense of being an implicit register in some (many?) instructions.


Yes, in fact my original intention was for it to be an accumulator in the sense of being the target register for arithmetic operations. As it turns out, I found that I liked the feel of the instruction set better if the results of arithmetic operations went to the nominated registers instead. The vast majority of instructions either read from or write to tmp - I think the only ones that don't are the "cond"itional instruction, and the sgn, byt and hlf modifiers.

Quote:
I see you're running at 133MHz - on what hardware platform is that? I don't think I've seen hobby FPGA CPUs clocked quite so fast.


I'm using an Altera DE2 board, quite old now and hardly state-of-the-art - but it does have the fastest speed grade of the Cyclone II chip. A standalone EightThirtyTwo with just block RAM and a UART tops out at a shade under 150MHz before it no longer meets timing! A full SoC with graphics, sound, interrupts and SDRAM still just meets timing at 133MHz.

My eventual target platforms have a Cyclone III or Cyclone 10LP at the regular speed grade, however. They can still just about cope with the standalone CPU at 133MHz - which is good because one of my goals was to be able to integrate this into existing projects without having to add extra clocks; it's fast enough to run on the same clock as an SDRAM controller.

Most of the other CPU projects I've seen top out anywhere from 25MHz to 90MHz - though many of them will do much more work per cycle than 832.


Sun Jan 12, 2020 10:34 pm

Joined: Wed Nov 20, 2019 12:56 pm
Posts: 92
I finally bit the bullet and wrote an assembler, linker and disassembler to go along with the existing emulator and simulations.
As a result my compiled executables are about 10% smaller, and there's a nice speed boost, too.

The linker was a particularly interesting challenge, since I wanted to support linker relaxation: 832 can chain 'li' instructions to load a value into a register, so the number of bytes required to resolve a reference can vary depending on the address. Until now I was using a fixed worst-case encoding; the linker can now use an appropriately sized reference and adjust the addresses of all subsequent symbols and sections accordingly.
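
For anyone curious, the relaxation pass boils down to something like this (a simplified sketch in C, not the actual linker code; the data structures and the 6-bits-per-'li' encoding are made up for illustration):

Code:
#include <stddef.h>

/* A toy model of the relaxation pass: every reference reserves some
   bytes for a chain of 'li' instructions that loads a symbol's
   address.  Shrinking one reference slides everything after it down,
   which may in turn let other references (and their targets) shrink,
   so the pass iterates to a fixed point.                              */

struct sym { size_t addr; };               /* absolute address         */

struct ref {
    size_t offset;                         /* where the li chain sits  */
    struct sym *target;                    /* symbol being loaded      */
    size_t size;                           /* bytes currently reserved */
};

/* Illustrative encoding: assume each chained li carries 6 bits. */
static size_t li_bytes_needed(size_t value)
{
    size_t n = 1;
    while (value >>= 6)
        n++;
    return n;
}

static void relax(struct ref *refs, size_t nrefs,
                  struct sym *syms, size_t nsyms)
{
    int changed = 1;
    while (changed) {
        changed = 0;
        for (size_t i = 0; i < nrefs; i++) {
            size_t need = li_bytes_needed(refs[i].target->addr);
            if (need >= refs[i].size)
                continue;
            size_t delta = refs[i].size - need;
            refs[i].size = need;
            /* Everything located after this reference slides down. */
            for (size_t j = 0; j < nrefs; j++)
                if (refs[j].offset > refs[i].offset)
                    refs[j].offset -= delta;
            for (size_t k = 0; k < nsyms; k++)
                if (syms[k].addr > refs[i].offset)
                    syms[k].addr -= delta;
            changed = 1;
        }
    }
}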

http://retroramblings.net/?p=1355


Sat Feb 08, 2020 4:32 pm