 Notes on the OPC5 [now OPC6] - a one-page-CPU, 16 bits 

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
Just one more thing worth noting: the one-word instructions shorten the code that a branch has to span, which lets us make more use of the limited displacement of the one-word versions of the branches. And all of that code density improvement should also help our small instruction cache to hide the latency of the multi-cycle byte-wide RAM we have.


Wed Aug 02, 2017 9:33 pm
Profile

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Yes. I find them very useful for the compiler.
Okay, I removed the ',0' in most places so the code should be shorter now.

I think it will be difficult (but not impossible) for the compiler to use the inc / dec instructions with the PC. The problem is the branch range is too short. That makes it necessary to count the number of instructions generated then go backwards and patch the generated code with an inc instruction. To be on the safe side the compiler may use only 1/2 the range because it doesn't know how many words the instructions will actually assemble into. I suppose I could have it assemble the code on the fly then count the number of words, but sheesh.
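
To make the "only 1/2 the range" rule concrete, here is a rough Python sketch. It is not taken from the compiler sources: the function name is made up, and `max_disp` (the largest displacement, in words, that a one-word inc/dec on the PC can encode) is a parameter that would have to come from the OPC6 spec. Since an instruction assembles to either one or two words, assuming two words each is the safe bound.

```python
# Hypothetical helper, not part of the C64 compiler.
# max_disp: largest displacement in words that a one-word inc/dec
# on the PC can encode (assumption: look this up in the OPC6 spec).

def short_branch_is_safe(instr_count, max_disp, worst_words_per_instr=2):
    """Return True if a short branch over instr_count instructions must fit.

    The compiler counts instructions, not words, so every instruction
    is assumed to assemble to its worst-case size.
    """
    return instr_count * worst_words_per_instr <= max_disp


if __name__ == "__main__":
    # 15 is only an illustrative displacement limit here.
    for n in (3, 7, 8):
        print(n, short_branch_is_safe(n, max_disp=15))
```

Anything that fails this test would fall back to the two-word mov to r15; assembling on the fly would recover the other half of the range at the cost of the extra pass.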

The compiler is about 24,000 lines of code (or 364 pages). Definitely exceeds the one page challenge.

_________________
Robert Finch http://www.finitron.ca


Thu Aug 03, 2017 4:09 am
Profile WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
After the big complaint, it took me all of ten minutes to modify the code to use the inc / dec instructions for pc branches :)
Here is sample output from some simpler functions.
Code:
   code
_abs:
   #    return a < 0 ? -a : a;
               cmp     r8,r0
            pl.inc     r15,TestAbs_4-PC
               not     r1,r8,-1
               inc     r15,TestAbs_5-PC
TestAbs_4:
               mov     r1,r8
TestAbs_5:
               mov     r15,r13
_min:
   #    return a < b ? a : b;
               cmp     r8,r9
            pl.inc     r15,TestAbs_11-PC
               mov     r1,r8
               inc     r15,TestAbs_12-PC
TestAbs_11:
               mov     r1,r9
TestAbs_12:
               mov     r15,r13
_max:
   #    return a > b ? a : b;
               cmp     r8,r9
            mi.inc     r15,TestAbs_18-PC
               cmp     r8,r9
             z.inc     r15,TestAbs_18-PC
               mov     r1,r8
               inc     r15,TestAbs_19-PC
TestAbs_18:
               mov     r1,r9
TestAbs_19:
               mov     r15,r13
_minu:
   #    return a < b ? a : b;
               cmp     r8,r9
            nc.inc     r15,TestAbs_24-PC
               mov     r1,r8
               inc     r15,TestAbs_25-PC
TestAbs_24:
               mov     r1,r9
TestAbs_25:
               mov     r15,r13
   rodata
   extern   _minu
   extern   _abs
   extern   _min
   extern   _max

_________________
Robert Finch http://www.finitron.ca


Thu Aug 03, 2017 4:38 am
Profile WWW

Joined: Tue Feb 10, 2015 7:07 am
Posts: 52
Rob,

This is absolutely fantastic work you are doing here.

As soon as you feel ready, I'd love to try some simple examples on a real system, even if it's not completely finished and there are bugs.

We have two real OPC6 systems available to use:
- one that has 16K words of RAM and runs on a variety of standalone FPGA boards
- the other has 64K words of RAM (~56K useable) and is a Co Processor for the BBC Micro

Dave


Thu Aug 03, 2017 7:30 am
Profile

Joined: Tue Feb 10, 2015 7:07 am
Posts: 52
robfinch wrote:
Dave, you're welcome to try the compiler at any time, but I don't guarantee the code will work. It might be a few days yet before I've run things in the simulator.
It's built on Windows as a console app using the MS Visual Studio 10 Express (free) edition. The target machine was set to X86. It looks like it might be a 32 bit app.
It also needs FPP.exe (or something called fpp.exe) which it shells out to perform the pre-processing.


I found where you are working so I can follow along:
https://github.com/robfinch/Cores/tree/ ... 20-%20OPC5
https://github.com/robfinch/Cores/tree/ ... re/fpp/src

Have you, or anyone else, ever tried building FPP or C64 on Linux?

Dave


Thu Aug 03, 2017 9:56 am
Profile

Joined: Sat Aug 05, 2017 6:57 pm
Posts: 26
Following a tip-off from BigEd about the OPC project - which looks pretty good! - I've had a look into porting the PLASMA VM to it. I've created a new thread (viewtopic.php?f=3&t=426) for it, to try to avoid disrupting this one too much.


Sat Aug 05, 2017 8:05 pm
Profile

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
It's very exciting to see not one but two HLL endeavours for the OPC!

For convenience, I've made a new thread for the C64 purpose - hope that's good for everyone.
viewtopic.php?f=3&t=427
"Porting the C64 compiler to target OPC5 (or OPC6)"

(Rob, you can edit the topic title by editing the Subject of the head post)


Sun Aug 06, 2017 5:01 pm
Profile

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
Just to note (some of) the latest updates to the OPC6 line:
- addition of push and pop instructions
- addition of memory access trace to the python emulator
- addition of BYTE directive and local labels to the assembler
- assembler will accept 'inc' with a negative offset and emit a 'dec' instruction, for convenience
- improved error handling in the assembler
- a separate assembler for byte-orientated assembly
- a library of routines to support high level languages
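
The inc-to-dec convenience in the list above can be sketched in a few lines of Python. This is only an illustration of the idea, not the assembler's actual code; the function name is invented, and range checking of the offset is left out.

```python
# Hypothetical illustration of the 'inc with a negative offset' rule.
# Not the real assembler code; no range check on the offset.

def pc_adjust(offset):
    """Return (mnemonic, magnitude) for a signed offset."""
    if offset < 0:
        # A negative inc is emitted as a dec of the same magnitude.
        return ("dec", -offset)
    return ("inc", offset)


if __name__ == "__main__":
    print(pc_adjust(3))
    print(pc_adjust(-4))
```

This is what lets compiler output use a single form such as inc r15,label-PC for both forward and backward branches.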


Mon Aug 14, 2017 4:02 pm
Profile

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
A long post on performance... hope it's of interest!

Here's Revaldinho's sketch of the pipeline for OPC6:
Quote:
This is basically how it works for arithmetic:
    Fetch0
    Fetch1 (if operand present)
    EAD (if operand present and non r0 src reg)
    Exec
... so, many of the one word instructions need only Fetch0 and Exec. Some two word instructions need all 4 states and others can skip Effective Address to be just 3.

Fetch0 and Exec can overlap except when r15 was altered by the instruction and when an interrupt occurs.

This doesn't cover load and store type instructions, which always go through EAD. Reads then go through RDMEM before finishing in Exec. Writes go to a different WRMEM state and skip Exec.

The only other state available is INT which is where to go on completion of the current instruction if an interrupt is taken.
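
As a reading aid, the state sequence described in the quote can be written down as a small Python function. This is a sketch of the description only, not of the Verilog: the `kind` labels are invented here, and it deliberately ignores predication, the INT state, and the Fetch0/Exec overlap.

```python
# Sketch of the OPC6 state sequence as described in the quote above.
# The 'alu' / 'load' / 'store' labels are made up for this example.

def states(kind, has_operand=False, src_is_r0=True):
    """Return the list of states an instruction passes through."""
    seq = ["Fetch0"]
    if has_operand:
        seq.append("Fetch1")          # second instruction word
    if kind == "alu":
        if has_operand and not src_is_r0:
            seq.append("EAD")         # operand must be added to a register
        seq.append("Exec")
    elif kind == "load":
        seq += ["EAD", "RDMEM", "Exec"]
    elif kind == "store":
        seq += ["EAD", "WRMEM"]       # writes skip Exec
    else:
        raise ValueError("unknown kind: " + kind)
    return seq


if __name__ == "__main__":
    print(states("alu"))
    print(states("alu", has_operand=True, src_is_r0=False))
    print(states("store"))
```

A one-word store comes out as Fetch0, EAD, WRMEM, which matches the F0 EA WR shown for STO in the pipeline diagrams further down.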


Revaldinho collected some stats on programs running on the OPC6, with particular reference to branching and predication. And we had a bit of a conversation, in which I played the part of the person who didn't quite get the picture and he took the part of the person who understood what was going on.

We looked at a short code sequence where we have some conditional code. Question is, how does this code look, for size and speed, when written using branches, predicates, or the new short branch? Here's the executive summary. Counting only the optionally executed STO and MOV instructions, we can compare two machines and two tactics:
Code:
                          Instr    Cycles   Cycles
                          Words    (Exec)   (Skip)
OPC5LS predication          3        6         3
OPC5LS branching            5        8         3
OPC6   predication          2        5         2
OPC6   branching            3        6         2


Diving into the details, here's the code, from the pi spigot program, in the earliest form, for the OPC5ls:
Code:
0000               pdcloop:
0000  0aa2         sub r2,r10         # update pointer to next predigit
0002  3a01 0009    cmp r1,r0,9        # is predigit=9 (ie would it overflow if incremented?)
0004  4620       z.sto r0,r2          # store 0 to predigit if yes (preserve Z)
0005  500f 0000  z.mov pc,r0,pdcloop  # loop again to correct next predigit
0007  1401 0001    add r1,r0,1        # if predigit wasnt 9 fall thru to here and add 1


And here's a pipeline diagram for the crucial part, in OPC5ls instructions using predication:
Code:
CMP       F0 F1 :EX       :       for Z=0 we nop next two instructions
Z.STO           :F0       :       one cycle
Z.MOV           :   F0 F1 :       two cycles
ADD             :         : F0

CMP       F0 F1 :EX               :       for Z=1 we execute next two instructions
Z.STO           :F0 EA WR         :       three cycles
Z.MOV           :         F0 F1 EX:       three cycles
SUB             :                 : F0


If instead we use a branch:
Code:
CMP       F0 F1 :EX                :                      Z=0/NZ=1
NZ.MOV          :F0 F1 EX          :
ADD             :                  : F0 ...

CMP       F0 F1 :EX                      :       Z=1/NZ=0
NZ.MOV          :F0 F1                   :
STO             :      F0 EA WR          :
MOV             :               F0 F1 EX :
SUB             :                        : F0 ...


Switching now to OPC6, using predication and using INC for a relative instead of absolute branch:
Code:
CMP       F0 F1 :EX    :               Z=0
Z.STO           :F0    :
Z.INC           :   F0 :
ADD             :      : F0

CMP       F0 F1 :EX            :       Z=1
Z.STO           :F0 EA WR      :
Z.INC           :         F0 EX:
SUB             :              : F0


..and finally rewriting for OPC6 using short branches instead of predication:
Code:
CMP       F0 F1 :EX    :                      Z=0/NZ=1
NZ.INC          :F0 EX :
ADD             :      : F0

CMP       F0 F1 :EX                :       Z=1/NZ=0
NZ.INC          :F0                :
STO             :   F0 EA WR       :
INC             :            F0 EX :
SUB             :                  : F0 ...
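
For anyone wanting to check the executive summary table against the four diagrams, here is a small Python tally. The per-instruction word and cycle counts are simply read off the diagrams in this post (columns between the colons); the data layout and names are my own, so treat it as a cross-check rather than anything official.

```python
# Each entry: (mnemonic, words, cycles when the guarded code executes,
# cycles when it is skipped). Values are read off the pipeline diagrams;
# the structure itself is just an illustration.
SEQUENCES = {
    "OPC5LS predication": [("z.sto", 1, 3, 1), ("z.mov", 2, 3, 2)],
    "OPC5LS branching": [("nz.mov", 2, 2, 3), ("sto", 1, 3, 0), ("mov", 2, 3, 0)],
    "OPC6 predication": [("z.sto", 1, 3, 1), ("z.inc", 1, 2, 1)],
    "OPC6 branching": [("nz.inc", 1, 1, 2), ("sto", 1, 3, 0), ("inc", 1, 2, 0)],
}


def totals(seq):
    """Sum words, execute-path cycles and skip-path cycles."""
    words = sum(w for _, w, _, _ in seq)
    exec_cycles = sum(e for _, _, e, _ in seq)
    skip_cycles = sum(s for _, _, _, s in seq)
    return words, exec_cycles, skip_cycles


if __name__ == "__main__":
    for name, seq in SEQUENCES.items():
        print(name, totals(seq))
```

Note that in the branching variants the branch itself is cheaper on the execute path (it is not taken) and dearer on the skip path (it is taken and needs its Exec).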


More from our conversation - guess who is speaking here:
Quote:
Stats for a single potentially skipped instruction would favour predication more, but I think I always had it in mind that 2-3 instructions was about the right limit for predicated sequences in OPC5. OPC6 has definitely changed things by making short branches much cheaper in code and cycles. Even so predication wins in this one example, and in this case execution of the predicates represents reiterating through a loop. So in the OPC5 case that's a 2 cycle saving each time around the loop and even in the OPC6 case there's still a cycle saving each time. For the 'skip' case, exiting the loop, both predication and branching match in cycle count in both machines for skipping over 2 instructions. In all cases the predicated code is smaller (significantly for OPC5) so I'm sure that with more stats gathering we could compute a better 'wastage' figure for those opcode bits.


About that 'wastage' figure: I'd recklessly said
Quote:
We've got a short pipeline, and yet we've got predication. We can explain ourselves: we had instruction bits to spare, and we wanted a really simple decoder, so predicated mov to r15 was almost a free way to gain branches. Any other predicated instruction is a bonus. But those "spare" instruction bits which we've used for predication are in a sense poor value and might help to explain our lower code density.

That said, if 38% of instructions are predicated... that means 62% of the time we're "wasting" 25% of our instruction bits. And, perhaps most interesting, the majority of our predicated instructions are not short branches - so we're getting value there. If we removed predication, we'd need a lot more branches to deal with those instructions.


and had been corrected:
Quote:
I went for predication for regularity first of all - remember the size of the original OPC5 instruction set - so, yes, the 'free' branching was definitely a feature and regularity is the key to the OPC-ness of the machine.

As you say, without predicates we would definitely incur the cost of more branch instructions in the code.

BTW our 'wastage' is less than 25%. We waste 3 bits in a 16bit word, and use 1 out of 8 predicate combinations as an opcode bit anyway. So I'd say that's 2.875/16 ~ 18% ! :)


(That number is subject to revision, of course!)
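
The arithmetic behind that figure, spelled out as a tiny Python sketch. The function and parameter names are invented; the inputs (3 predicate bits, a 16-bit word, and one eighth of a bit credited back for the predicate combination that is reused as an opcode bit) are exactly the ones in the quote.

```python
# Reproduces the estimate quoted above; names are hypothetical.

def wastage(word_bits=16, predicate_bits=3, reclaimed=1 / 8):
    """Fraction of each instruction word counted as 'wasted' on predication."""
    return (predicate_bits - reclaimed) / word_bits


if __name__ == "__main__":
    print("with the 1/8 credit: %.2f%%" % (100 * wastage()))
    print("no credit at all:    %.2f%%" % (100 * wastage(reclaimed=0)))
```

Even without the credit, 3 bits out of 16 is 18.75%, so the earlier 25% was an overestimate either way.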

As a preamble to the above conversation, Revaldinho offered an analysis tool and some results:
Quote:
I've just checked in a little utility into the utils area - it's a histogram generator which you can run either on the static input from an assembler listing output or the dynamic instruction trace from the python emulator. e.g.
Code:
python3  ../utils/histogram.py  --dynamic --filename pi-spigot-rev.trace
python3  ../utils/histogram.py  --static --filename pi-spigot-rev.lst

Here's the output from the dynamic run on my OPC6 pi spigot code. (NB I've broken out any instructions which use PC as a destination separately - nothing magic about the actual operation of these, but I thought it'd be instructive to see how often things like INC, DEC etc are used as branches as opposed to simple arithmetic).

Code:
Dynamic Instruction Usage from pi-spigot-rev.trace

All Instructions

           adc       4032 : *****************************************************************
           add       3342 : ******************************************************
   dec[dst=pc]       2335 : **************************************
           inc       2277 : *************************************
           cmp       2034 : *********************************
           sub       2016 : *********************************
           mov        680 : ***********
           lsr        411 : *******
           dec        405 : *******
   mov[dst=pc]        395 : *******
           jsr        264 : *****
           sto        154 : ***
            ld        132 : ***
   inc[dst=pc]         12 : *
            in         10 : *
           and         10 : *
           out         10 : *
          halt          1 : *

Instructions using predication

   dec[dst=pc]       2335 : **************************************
           sub       2016 : *********************************
           adc       2016 : *********************************
           add        411 : *******
   mov[dst=pc]        132 : ***
           inc        126 : ***
   inc[dst=pc]         12 : *
           sto          6 : *
           jsr          3 : *

Predicate usage

             c       4569 : *****************************************************************
            nz       2353 : *********************************
             z        129 : **
            nc          6 : *


Instruction Summary by Type

All instructions          :      18520
Predicated instructions   :       7057 (38.1%)
Jumps                     :        395
Short Branches            :       2347


...and for comparison, the same stats from Ed's translation of the original

Code:
Dynamic Instruction Usage from pi-spigot-bruce.trace

All Instructions

           add       7634 : *****************************************************************
           mov       6922 : ***********************************************************
   dec[dst=pc]       5311 : *********************************************
   inc[dst=pc]       5300 : *********************************************
           adc       3648 : *******************************
           cmp       1842 : ****************
   mov[dst=pc]        456 : ****
           jsr        341 : ***
           sbc        208 : **
           sto        135 : **
            ld        108 : *
           dec        108 : *
            in         10 : *
           and         10 : *
           out         10 : *
           xor          6 : *
          halt          1 : *

Instructions using predication

   dec[dst=pc]       5311 : *********************************************
   inc[dst=pc]       5300 : *********************************************
   mov[dst=pc]        114 : *

Predicate usage

            nz       5427 : *****************************************************************
            nc       5292 : ***************************************************************
             c          6 : *


Instruction Summary by Type

All instructions          :      32050
Predicated instructions   :      10725 (33.5%)
Jumps                     :        456
Short Branches            :      10611
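
A quick way to sanity-check the two summaries is to recompute them from the "Instructions using predication" counts. The Python below is not histogram.py, just a stand-alone sketch; in particular, treating "Short Branches" as inc/dec with dst=pc is my assumption about how the tool classifies them, though it does agree with both listings.

```python
# Counts copied from the two 'Instructions using predication' listings.
REV = {"dec[dst=pc]": 2335, "sub": 2016, "adc": 2016, "add": 411,
       "mov[dst=pc]": 132, "inc": 126, "inc[dst=pc]": 12, "sto": 6, "jsr": 3}
BRUCE = {"dec[dst=pc]": 5311, "inc[dst=pc]": 5300, "mov[dst=pc]": 114}


def summary(total, predicated_counts):
    """Return (predicated, percent of total, short branches)."""
    predicated = sum(predicated_counts.values())
    percent = round(100.0 * predicated / total, 1)
    # Assumption: a short branch is an inc or dec whose destination is pc.
    short = sum(n for op, n in predicated_counts.items()
                if op in ("inc[dst=pc]", "dec[dst=pc]"))
    return predicated, percent, short


if __name__ == "__main__":
    print(summary(18520, REV))
    print(summary(32050, BRUCE))
```

The interesting contrast is that in the first program only about a third of the predicated instructions are short branches, while in the second nearly all of them are.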



Sat Aug 26, 2017 4:30 pm
Profile

Joined: Sat Aug 05, 2017 6:57 pm
Posts: 26
Definitely interesting, thanks, although I don't think I've managed to get my head round this completely on the first read. I suspect once I come to actually optimise some code and play around with the analysis tools on that code it will come together a bit more. It's already a far cry from the simple cycle-counting I'm used to on the 6502. :-)


Sat Aug 26, 2017 9:45 pm
Profile

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
I think it comes out not too bad for OPC6. One cycle for a one-word instruction, two cycles for an r0-based two-word instruction, otherwise three cycles. Add a cycle for a memory access... OK, I'm guessing here!

For most purposes, we're probably not cycle-counting. So my advice would be
- remember to use shorter instructions where they are available
- use predication to skip one or two instructions, but branch around longer sequences


Sun Aug 27, 2017 8:36 am
Profile