AnyCPU :: View topic - Notes on the OPC5 [now OPC6]

Author:	BigEd [ Wed Aug 02, 2017 9:33 pm ]
Post subject:	Re: Notes on the OPC5 - a one-page-CPU, 16 bits
Just one more thing worth noting: the one-word instructions are handy for making branches shorter, which is handy for making more use of the limited displacement of the one-word versions of the branches. And all of that code density improvement should also help our small instruction cache to hide the latency of the multi-cycle byte-wide RAM we have.

Author:	robfinch [ Thu Aug 03, 2017 4:09 am ]
Post subject:	Re: Notes on the OPC5 - a one-page-CPU, 16 bits
Yes, I find that very useful for the compiler. Okay, I removed the ',0' in most places so the code should be shorter now. I think it will be difficult (but not impossible) for the compiler to use the inc / dec instructions with the PC. The problem is that the branch range is too short. That makes it necessary to count the number of instructions generated, then go backwards and patch the generated code with an inc instruction. To be on the safe side the compiler may use only half the range, because it doesn't know how many words the instructions will actually assemble into. I suppose I could have it assemble the code on the fly and then count the number of words, but sheesh. The compiler is about 24,000 lines of code (or 364 pages). Definitely exceeds the one page challenge.
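The conservative range check described in this post can be sketched in a few lines of Python. This is only an illustration: the function name and the short_range_words parameter are hypothetical, the post does not state the actual displacement limit, and the worst case of two words per instruction is an assumption based on the one-word and two-word instruction forms discussed in this thread.

```python
def can_use_short_branch(instructions_between, short_range_words,
                         max_words_per_instruction=2):
    """Decide conservatively whether a PC-relative inc/dec branch will reach.

    The compiler knows how many instructions lie between the branch and its
    target, but not how many words each one will assemble into, so it
    assumes the worst case for every instruction. With two words as the
    worst case, this is the same as using only half the range.
    """
    worst_case_words = instructions_between * max_words_per_instruction
    return worst_case_words <= short_range_words


# Illustrative numbers only: a made-up range of 8 words.
print(can_use_short_branch(3, 8))   # worst case 6 words: short branch is safe
print(can_use_short_branch(5, 8))   # worst case 10 words: fall back to a long jump
```

Assembling on the fly and counting the real word total would allow the full range to be used, at the cost of the extra pass the post complains about.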

Author:	robfinch [ Thu Aug 03, 2017 4:38 am ]
Post subject:	Re: Notes on the OPC5 - a one-page-CPU, 16 bits
After the big complaint, it took me all of ten minutes to modify the code to use the inc / dec instructions for pc branches. Here is sample output from some simpler functions.

Code:

	code
_abs:
	# return a < 0 ? -a : a;
	cmp	r8,r0
	pl.inc	r15,TestAbs_4-PC
	not	r1,r8,-1
	inc	r15,TestAbs_5-PC
TestAbs_4:
	mov	r1,r8
TestAbs_5:
	mov	r15,r13
_min:
	# return a < b ? a : b;
	cmp	r8,r9
	pl.inc	r15,TestAbs_11-PC
	mov	r1,r8
	inc	r15,TestAbs_12-PC
TestAbs_11:
	mov	r1,r9
TestAbs_12:
	mov	r15,r13
_max:
	# return a > b ? a : b;
	cmp	r8,r9
	mi.inc	r15,TestAbs_18-PC
	cmp	r8,r9
	z.inc	r15,TestAbs_18-PC
	mov	r1,r8
	inc	r15,TestAbs_19-PC
TestAbs_18:
	mov	r1,r9
TestAbs_19:
	mov	r15,r13
_minu:
	# return a < b ? a : b;
	cmp	r8,r9
	nc.inc	r15,TestAbs_24-PC
	mov	r1,r8
	inc	r15,TestAbs_25-PC
TestAbs_24:
	mov	r1,r9
TestAbs_25:
	mov	r15,r13

	rodata
	extern	_minu
	extern	_abs
	extern	_min
	extern	_max

Author:	hoglet [ Thu Aug 03, 2017 7:30 am ]
Post subject:	Re: Notes on the OPC5 - a one-page-CPU, 16 bits
Rob,

This is absolutely fantastic work you are doing here.

As soon as you feel ready, I'd love to try some simple examples on a real system, even if it's not completely finished and there are bugs.

We have two real OPC6 systems available to use:
- one that has 16K words of RAM and runs on a variety of standalone FPGA boards
- the other has 64K words of RAM (~56K usable) and is a Co Processor for the BBC Micro

Dave

Author:	hoglet [ Thu Aug 03, 2017 9:56 am ]
Post subject:	Re: Notes on the OPC5 - a one-page-CPU, 16 bits
robfinch wrote:

    Dave, you're welcome to try the compiler at any time, but I don't guarantee the code will work. It might be a few days yet before I've run things in the simulator. It's built on Windows as a console app using the MS Visual Studio 10 Express (free) edition. The target machine was set to X86. It looks like it might be a 32 bit app. It also needs FPP.exe (or something called fpp.exe), which it shells out to for the pre-processing.

I found where you are working so I can follow along:
https://github.com/robfinch/Cores/tree/ ... 20-%20OPC5
https://github.com/robfinch/Cores/tree/ ... re/fpp/src

Have you, or anyone else, ever tried building FPP or C64 on Linux?

Dave

AnyCPU http://anycpu.org/forum/

Notes on the OPC5 [now OPC6] - a one-page-CPU, 16 bits http://anycpu.org/forum/viewtopic.php?f=3&t=395	Page 2 of 2

Author:	SteveF [ Sat Aug 05, 2017 8:05 pm ]
Post subject:	Re: Notes on the OPC5 - a one-page-CPU, 16 bits
Following a tip-off from BigEd about the OPC project - which looks pretty good! - I've had a look into porting the PLASMA VM to it. I've created a new thread (viewtopic.php?f=3&t=426) for it, to try to avoid disrupting this one too much.

Author:	BigEd [ Sun Aug 06, 2017 5:01 pm ]
Post subject:	Re: Notes on the OPC5 - a one-page-CPU, 16 bits
It's very exciting to see not one but two HLL endeavours for the OPC!

For convenience, I've made a new thread for the C64 purpose - hope that's good for everyone.
viewtopic.php?f=3&t=427
"Porting the C64 compiler to target OPC5 (or OPC6)"

(Rob, you can edit the topic title by editing the Subject of the head post)

Author:	BigEd [ Mon Aug 14, 2017 4:02 pm ]
Post subject:	Re: Notes on the OPC5 - a one-page-CPU, 16 bits
Just to note (some of) the latest updates to the OPC6 line:
- addition of push and pop instructions
- addition of memory access trace to the python emulator
- addition of BYTE directive and local labels to the assembler
- assembler will accept 'inc' with a negative offset and emit a 'dec' instruction, for convenience
- improved error handling in the assembler
- a separate assembler for byte-orientated assembly
- a library of routines to support high level languages

Author:	BigEd [ Sat Aug 26, 2017 4:30 pm ]
Post subject:	Re: Notes on the OPC5 [now OPC6] - a one-page-CPU, 16 bits
A long post on performance... hope it's of interest!

Here's Revaldinho's sketch of the pipeline for OPC6:

Quote:

    This is basically how it works for arithmetic:

    Fetch0
    Fetch1 (if operand present)
    EAD (if operand present and non r0 src reg)
    Exec

    ... so, many of the one word instructions need only Fetch0 and Exec. Some two word instructions need all 4 states and others can skip Effective Address to be just 3.

    Fetch0 and Exec can overlap except when r15 was altered by the instruction and when an interrupt occurs.

    This doesn't cover load and store type instructions, which always go through EAD. Reads then go through RDMEM before finishing in Exec. Writes go to a different WRMEM state and skip Exec.

    The only other state available is INT which is where to go on completion of the current instruction if an interrupt is taken.

Revaldinho collected some stats on programs running on the OPC6, with particular reference to branching and predication. And we had a bit of a conversation, in which I played the part of the person who didn't quite get the picture and he took the part of the person who understood what was going on.

We looked at a short code sequence where we have some conditional code. Question is, how does this code look, for size and speed, when written using branches, predicates, or the new short branch?

Here's the executive summary. Counting only the optionally executed STO and MOV instructions, we can compare two machines and two tactics:

Code:

                      Instr   Cycles  Cycles
                      Words   (Exec)  (Skip)
OPC5LS predication      3       6       3
OPC5LS branching        5       8       3
OPC6 predication        2       5       2
OPC6 branching          3       6       2

Diving into the details, here's the code, from the pi spigot program, in the earliest form, for the OPC5ls:

Code:

0000             pdcloop:
0000  0aa2           sub r2,r10            # update pointer to next predigit
0002  3a01 0009      cmp r1,r0,9           # is predigit=9 (ie would it overflow if incremented?)
0004  4620           z.sto r0,r2           # store 0 to predigit if yes (preserve Z)
0005  500f 0000      z.mov pc,r0,pdcloop   # loop again to correct next predigit
0007  1401 0001      add r1,r0,1           # if predigit wasnt 9 fall thru to here and add 1

And here's a pipeline diagram for the crucial part, in OPC5ls instructions using predication:

Code:

CMP     F0 F1 :EX       :       for Z=0 we nop next two instructions
Z.STO         :F0       :       one cycle
Z.MOV         :   F0 F1 :       two cycles
ADD           :         : F0

CMP     F0 F1 :EX                :       for Z=1 we execute next two instructions
Z.STO         :F0 EA WR          :       three cycles
Z.MOV         :         F0 F1 EX :       three cycles
SUB           :                  : F0

If instead we use a branch:

Code:

CMP     F0 F1 :EX       :       Z=0/NZ=1
NZ.MOV        :F0 F1 EX :
ADD           :         : F0 ...

CMP     F0 F1 :EX                      :       Z=1/NZ=0
NZ.MOV        :F0 F1                   :
STO           :      F0 EA WR          :
MOV           :               F0 F1 EX :
SUB           :                        : F0 ...

Switching now to OPC6, using predication and using INC for a relative instead of absolute branch:

Code:

CMP     F0 F1 :EX    :       Z=0
Z.STO         :F0    :
Z.INC         :   F0 :
ADD           :      : F0

CMP     F0 F1 :EX             :       Z=1
Z.STO         :F0 EA WR       :
Z.INC         :         F0 EX :
SUB           :               : F0

...and finally rewriting for OPC6 using short branches instead of predication:

Code:

CMP     F0 F1 :EX    :       Z=0/NZ=1
NZ.INC        :F0 EX :
ADD           :      : F0

CMP     F0 F1 :EX                :       Z=1/NZ=0
NZ.INC        :F0                :
STO           :   F0 EA WR       :
INC           :            F0 EX :
SUB           :                  : F0 ...

More from our conversation - guess who is speaking here:

Quote:

    Stats for a single potentially skipped instruction would favour predication more, but I think I always had it in mind that 2-3 instructions was about the right limit for predicated sequences in OPC5. OPC6 has definitely changed things by making short branches much cheaper in code and cycles. Even so predication wins in this one example, and in this case execution of the predicates represents reiterating through a loop. So in the OPC5 case that's a 2 cycle saving each time around the loop and even in the OPC6 case there's still a cycle saving each time. For the 'skip' case, exiting the loop, both predication and branching match in cycle count in both machines for skipping over 2 instructions.

    In all cases the predicated code is smaller (significantly for OPC5) so I'm sure that with more stats gathering we could compute a better 'wastage' figure for those opcode bits.

About that 'wastage' figure: I'd recklessly said

Quote:

    We've got a short pipeline, and yet we've got predication. We can explain ourselves: we had instruction bits to spare, and we wanted a really simple decoder, so predicated mov to r15 was almost a free way to gain branches. Any other predicated instruction is a bonus. But those "spare" instruction bits which we've used for predication are in a sense poor value and might help to explain our lower code density.

    That said, if 38% of instructions are predicated... that means 62% of the time we're "wasting" 25% of our instruction bits. And, perhaps most interesting, the majority of our predicated instructions are not short branches - so we're getting value there. If we removed predication, we'd need a lot more branches to deal with those instructions.

and had been corrected:

Quote:

    I went for predication for regularity first of all - remember the size of the original OPC5 instruction set - so, yes, the 'free' branching was definitely a feature and regularity is the key to the OPC-ness of the machine.

    As you say, without predicates we would definitely incur the cost of more branch instructions in the code.

    BTW our 'wastage' is less than 25%. We waste 3 bits in a 16bit word, and use 1 out of 8 predicate combinations as an opcode bit anyway. So I'd say that's 2.875/16 ~ 18% !

(That number is subject to revision, of course!)

As a preamble to the above conversation, Revaldinho offered an analysis tool and some results:

Quote:

    I've just checked in a little utility into the utils area - it's a histogram generator which you can run either on the static input from an assembler listing output or the dynamic instruction trace from the python emulator. e.g.

    Code:

    python3  ../utils/histogram.py  --dynamic --filename pi-spigot-rev.trace
    python3  ../utils/histogram.py  --static --filename pi-spigot-rev.lst

    Here's the output from the dynamic run on my OPC6 pi spigot code. (NB I've broken out any instructions which use PC as a destination separately - nothing magic about the actual operation of these, but I thought it'd be instructive to see how often things like INC, DEC etc are used as branches as opposed to simple arithmetic).

    Code:

    Dynamic Instruction Usage from pi-spigot-rev.trace

    All Instructions

               adc       4032 : *****************************************************************
               add       3342 : ******************************************************
       dec[dst=pc]       2335 : **************************************
               inc       2277 : *************************************
               cmp       2034 : *********************************
               sub       2016 : *********************************
               mov        680 : ***********
               lsr        411 : *******
               dec        405 : *******
       mov[dst=pc]        395 : *******
               jsr        264 : *****
               sto        154 : ***
                ld        132 : ***
       inc[dst=pc]         12 : *
                in         10 : *
               and         10 : *
               out         10 : *
              halt          1 : *

    Instructions using predication

       dec[dst=pc]       2335 : **************************************
               sub       2016 : *********************************
               adc       2016 : *********************************
               add        411 : *******
       mov[dst=pc]        132 : ***
               inc        126 : ***
       inc[dst=pc]         12 : *
               sto          6 : *
               jsr          3 : *

    Predicate usage

                 c       4569 : *****************************************************************
                nz       2353 : *********************************
                 z        129 : **
                nc          6 : *

    Instruction Summary by Type

    All instructions          :      18520
    Predicated instructions   :       7057 (38.1%)
    Jumps                     :        395
    Short Branches            :       2347

    ...and for comparison, the same stats from Ed's translation of the original

    Code:

    Dynamic Instruction Usage from pi-spigot-bruce.trace

    All Instructions

               add       7634 : *****************************************************************
               mov       6922 : ***********************************************************
       dec[dst=pc]       5311 : *********************************************
       inc[dst=pc]       5300 : *********************************************
               adc       3648 : *******************************
               cmp       1842 : ****************
       mov[dst=pc]        456 : ****
               jsr        341 : ***
               sbc        208 : **
               sto        135 : **
                ld        108 : *
               dec        108 : *
                in         10 : *
               and         10 : *
               out         10 : *
               xor          6 : *
              halt          1 : *

    Instructions using predication

       dec[dst=pc]       5311 : *********************************************
       inc[dst=pc]       5300 : *********************************************
       mov[dst=pc]        114 : *

    Predicate usage

                nz       5427 : *****************************************************************
                nc       5292 : ***************************************************************
                 c          6 : *

    Instruction Summary by Type

    All instructions          :      32050
    Predicated instructions   :      10725 (33.5%)
    Jumps                     :        456
    Short Branches            :      10611
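
The executive summary table in the post above can be cross-checked against the pipeline description with a small Python model. This is a sketch under stated assumptions, not the project's emulator: the function and parameter names are made up for illustration, interrupts are ignored, Exec is assumed to hide under the next Fetch0 unless the instruction writes r15 (the PC), and a skipped predicated instruction is assumed to cost only its fetch states, as the diagrams show.

```python
def cycles(words, executed=True, kind="alu", writes_pc=False, src_is_r0=True):
    """Cycles an instruction occupies before the next Fetch0 can start."""
    fetch = words                        # Fetch0, plus Fetch1 for a two-word form
    if not executed:                     # failed predicate: only the fetch states run
        return fetch
    if kind == "store":                  # stores: EAD then WRMEM, no Exec
        return fetch + 2
    if kind == "load":                   # loads: EAD, RDMEM, then Exec
        return fetch + 2 + (1 if writes_pc else 0)
    ead = 1 if (words == 2 and not src_is_r0) else 0
    visible_exec = 1 if writes_pc else 0  # otherwise Exec overlaps the next Fetch0
    return fetch + ead + visible_exec


def sto(executed):
    return cycles(1, executed, kind="store")       # one-word STO


def jump(words, executed):
    return cycles(words, executed, writes_pc=True)  # MOV or INC with dst=pc


# (cycles when the STO is executed, cycles when it is skipped)
rows = {
    "OPC5LS predication": (sto(True) + jump(2, True), sto(False) + jump(2, False)),
    "OPC5LS branching": (jump(2, False) + sto(True) + jump(2, True), jump(2, True)),
    "OPC6 predication": (sto(True) + jump(1, True), sto(False) + jump(1, False)),
    "OPC6 branching": (jump(1, False) + sto(True) + jump(1, True), jump(1, True)),
}

for name, (exec_cycles, skip_cycles) in rows.items():
    print(f"{name:20s} exec={exec_cycles} skip={skip_cycles}")
```

Under these assumptions the model gives the same Exec and Skip figures as the summary table; the load case is included for completeness and is not exercised by this example.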

Author:	SteveF [ Sat Aug 26, 2017 9:45 pm ]
Post subject:	Re: Notes on the OPC5 [now OPC6] - a one-page-CPU, 16 bits
Definitely interesting, thanks, although I don't think I've managed to get my head round this completely on the first read. I suspect once I come to actually optimise some code and play around with the analysis tools on that code it will come together a bit more. It's already a far cry from the simple cycle-counting I'm used to on the 6502.

Author:	BigEd [ Sun Aug 27, 2017 8:36 am ]
Post subject:	Re: Notes on the OPC5 [now OPC6] - a one-page-CPU, 16 bits
I think it comes out not too bad for OPC6. One cycle for a one-word instruction, two cycles for an r0-based two-word instruction, otherwise three cycles. Add a cycle for a memory access... OK, I'm guessing here!

For most purposes, we're probably not cycle-counting. So my advice would be
- remember to use shorter instructions where they are available
- use predication to skip one or two instructions, but branch around longer sequences
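
The second piece of advice can be illustrated with a rough break-even calculation in Python. It is a sketch only, assuming OPC6, one-word instructions throughout, no interrupts, and the per-instruction costs read off the pipeline diagrams earlier in the thread; the names are hypothetical.

```python
# Assumed costs, taken from the OPC6 pipeline diagrams in this thread:
#   a skipped predicated one-word instruction still spends its Fetch0 cycle
#   a taken short branch spends Fetch0 + Exec, because it writes the PC
#   an untaken short branch spends Fetch0 only
TAKEN_BRANCH_CYCLES = 2
UNTAKEN_BRANCH_CYCLES = 1


def skip_cost_predicated(n):
    """Cycles spent stepping over n predicated one-word instructions."""
    return n


def skip_cost_branch(n):
    """Cycles spent branching around n instructions (independent of n)."""
    return TAKEN_BRANCH_CYCLES


def cheaper_when_skipping(n):
    """Which tactic costs fewer cycles when the n instructions are skipped."""
    predicated, branched = skip_cost_predicated(n), skip_cost_branch(n)
    if predicated == branched:
        return "tie"
    return "predication" if predicated < branched else "branch"


for n in range(1, 5):
    print(n, cheaper_when_skipping(n))
```

When the instructions are executed rather than skipped, the branching version always pays the extra UNTAKEN_BRANCH_CYCLES cycle, so predication is never slower for one or two instructions, while longer sequences start to favour the branch on the skip path.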

Page 2 of 2	All times are UTC