One Page Computing - roll your own challenge
BigEd
Joined: Wed Jan 09, 2013 6:54 pm Posts: 1808
|
Couple of questions for you Rob: - is JAL a jump-and-link? - is PC in your register file? - how do you return from a subroutine? (I see revaldinho has just added a trio of instructions to support subroutine calls, has dropped one bit of address, and still fits in a 9572. He's using JSR, LXA and RTS. There's still no stack, but LXA provides access to the link register, so all is well provided each subroutine preserves it. The link register is actually only the top few bits of the PC, the accumulator accounting for the bottom byte of the former PC, so even leaf subroutines need to do a dance before they can get started. See the test program for an illustration.)
|
Mon May 01, 2017 3:22 pm |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2231 Location: Canada
|
Quote: Couple of questions for you Rob: - is JAL a jump-and-link? - is PC in your register file? - how do you return from a subroutine?
JAL is indeed jump-and-link. The JAL instruction in this case stores the return address (the address of the next instruction) in the specified target register, while loading the PC with a new address, which is the sum of a register (Ra) and an immediate value. JAL is impressively powerful because it allows returning from a subroutine in addition to calling one. To call a subroutine, storing the return address in R62:
Code:
  JAL R62,my_routine
or
  JAL R62,my_routine[R0]
To return from a subroutine, jump to the address stored in R62. Any register may be used to hold a return address (any register may be used as the link register). The program counter is not located in the register file. This code is FPGA oriented as opposed to CPLD. The register file makes use of distributed memory in the FPGA, which eliminates the multiplexing that a flip-flop-based register file would require. Placing the program counter in the register file would require several multiplexers which are otherwise not needed. If the program counter value is needed, it can be loaded into a register by performing a JAL to the next instruction. I have to admit I haven't extensively tested the core yet. The above is just how it should work, assuming no errors in the code.
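To make the call/return mechanics concrete, here is a minimal Python sketch of the JAL semantics described above. The state layout, register count, and word-addressed single-word instructions are illustrative assumptions, not Rob's actual core:

```python
def jal(state, rd, ra, imm):
    # JAL rd, imm[ra]: rd <- address of the next instruction (the link),
    # then PC <- regs[ra] + imm. The PC lives outside the register file.
    # pc + 1 assumes word-addressed, single-word instructions.
    state["regs"][rd] = (state["pc"] + 1) & 0xFFFF
    state["pc"] = (state["regs"][ra] + imm) & 0xFFFF

state = {"pc": 100, "regs": [0] * 64}

# Call: JAL R62, my_routine[Ra] with regs[Ra] == 0 and immediate 200
jal(state, rd=62, ra=0, imm=200)
assert state["pc"] == 200 and state["regs"][62] == 101

# Return: JAL via the link in R62, discarding the new link into a scratch reg
jal(state, rd=61, ra=62, imm=0)
assert state["pc"] == 101
```

The same instruction thus serves as both call and return, which is why no dedicated RTS is needed.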
_________________Robert Finch http://www.finitron.ca
|
Mon May 01, 2017 4:55 pm |
|
|
BigEd
Joined: Wed Jan 09, 2013 6:54 pm Posts: 1808
|
Thanks Rob! I believe RISC-V also has a branch-and-link which can be used for either call or return.
|
Mon May 01, 2017 5:00 pm |
|
|
BigEd
Joined: Wed Jan 09, 2013 6:54 pm Posts: 1808
|
Revaldinho and I - mostly Revaldinho - have been making very interesting progress in one-page computing. In fact, we've almost finished our fourth major iteration, with thoughts on a fifth. You can see the current state of play at our minisite: https://revaldinho.github.io/opc/ (which also fits in one page, of course). I'd like to say a bit about the evolution of our thoughts, and the machines we've developed. I have to say, it's been great fun, and it even looks like the latest model might be a pleasure to program - a long way from one-instruction computing!

So, we've designed and simulated 4 models so far, of increasing complexity. The first two aim to fit in a small 9572 CPLD, and then we relaxed that constraint. All the OPCs now have a one-page macro assembler, verilog HDL, both python and javascript emulators, and one or two test programs.

- OPC1 was to fit a small CPLD, as an accumulator machine, and it worked out, but is quite cramped. It does have a link register and therefore a subroutine mechanism. It's 8 bits wide, with all instructions two bytes, and the address space is just 11 bits. That's 12 instructions with three addressing modes. It supports short addresses for a sort of zero-page. We started with 12 bits of address, but had to reduce it to use less logic as we added necessary instructions. This machine feels like a success - surprisingly functional.

- OPC2 was also for a small CPLD, an attempt to make a load-store register machine, but it only has room for two registers, and one of them starts looking quite like an accumulator. We had to reduce the address space to 10 bits. It's a CPU, but feels too limited. The hope here was that load-store would be simpler so we'd have more room, but the opposite was true.

Both the above small machines can do subroutine calls, by means of a link register, which means manually maintaining a return stack. Just like the old days.
- OPC3 is a simplistic translation of OPC1 to a 16 bit machine, and is now bigger than the smallest CPLD, although it fits in a 95144. Instructions are now 16 bits wide, still two words. Addresses are now 16 bits. It is of course a word-based machine, and zero page is now gone. It has 17 opcodes, which is to say 12 instructions, some of which have three addressing modes - this is the same as OPC1. There's plenty of room for improving this machine, with the wider instructions now available.

- OPC4 is a name reserved for a fleshed-out version of OPC3, which we haven't thought about at all. There are 11 bits unused in every instruction word, so there's lots of room for expansion in the encoding, so long as we have room in the one-page sources. However, we thought we could do better by changing direction a little.

- OPC5 is where it got really interesting. We dropped the CPLD idea, and now aim for a limit of 128 slices. We have a 16-bit machine, word-based, with 16 registers - R0 is zero and R15 is the PC. Instructions encode two register operands and an optional 16-bit word which is added to the second source register. The first source register is also the destination. All instructions are predicated. This gives us branches, jumps, subroutines, absolute and indexed indirect addressing. There's a length bit which allows the operand word to be absent, so this is a variable instruction length machine. Initially we had two predicate bits, which supported JNZ and JCS types of operation, and just four instructions: NAND, ADD, LOAD and STORE - which is, remarkably, enough. But then we improved that to 8: AND, OR, XOR, ROR, ADD, ADC, LOAD, STORE, and we also expanded the predicates to three bits, which now support JZ, JNZ, JCC, JCS, JCZ and JCCNZ types of operation. Revaldinho has written a multiplication routine and a division routine. The code looks pretty dense and performs pretty well.
As an early estimate, the machine presently fits in 60 slices and does 100MHz, so it's easily within 128 slices and limited by our one-page rule. Here are some code comparisons, for a simple 16-bit Fibonacci test program: OPC-5 seems to come out at 100MHz on some typical FPGA, at a first pass, so this is looking pretty good. With lots of registers, we find a lot of instructions are just a single word.

- nOPC6 then is the next step: build on the OPC5 architecture, fitting within 128 slices but relaxing the one page rule, sticking with a 16 bit word-based machine, probably with 16 registers but possibly more, adding more capabilities to improve performance and code density, and adding the necessary IRQ, NMI, and RDY handling. Maybe pipelined - if not, see nOPC7. It will fit the rules of Arlet's 8-bit competition with a simple half-speed implementation interfacing to byte-wide memory.
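As a sketch of the OPC5 execution model described above - every instruction predicated, the first source register doubling as destination, and an optional immediate word added to the second source - here is a toy Python interpreter. Instructions are shown pre-decoded as tuples because the real bit-level encoding lives in the one-page sources; PC auto-increment during fetch is omitted:

```python
def execute(regs, flags, mem, instr):
    # instr = (pred, op, rd, rs, imm); imm is None for a single-word
    # instruction (the "length bit"). rd is both the first source and
    # the destination. (Illustrative subset: add / ld / sto only.)
    pred, op, rd, rs, imm = instr
    if not pred(flags):                  # every instruction is predicated
        return
    ea = (regs[rs] + (imm or 0)) & 0xFFFF
    if op == "add":
        regs[rd] = (regs[rd] + ea) & 0xFFFF
    elif op == "ld":
        regs[rd] = mem[ea]
    elif op == "sto":
        mem[ea] = regs[rd]
    regs[0] = 0                          # R0 always reads as zero

always = lambda f: True
if_z   = lambda f: f["z"]                # a JZ-style predicate

regs, mem = [0] * 16, {}
regs[15] = 0x0010                        # R15 is the PC
# A conditional relative branch: add an offset to R15 when Z is set.
execute(regs, {"z": True}, mem, (if_z, "add", 15, 0, 0x20))
assert regs[15] == 0x0030
# With Z clear, the same instruction is skipped and the PC is untouched.
execute(regs, {"z": False}, mem, (if_z, "add", 15, 0, 0x20))
assert regs[15] == 0x0030
```

Because R15 is just another register, branches and jumps fall out of the ordinary predicated ALU operations, which is where much of the encoding economy comes from.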
|
Sun May 28, 2017 4:24 pm |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2231 Location: Canada
|
Ed & Revaldinho,
You've obviously been busy. I'm amazed that a reasonable CPU would fit into just a CPLD. It's interesting to see the evolution of the CPUs. It seems that once they're "finished" they magically expand after that. They just grow, and grow, and grow.
I've been mulling over using a CPU (Butterfly2) I designed (modelled after Jan Gray's XR16) a few years ago. It's only about 120 slices, but it uses a 16 bit bus. And it supports interrupts and single stepping. I can see where a bus multiplexer would drive the design over the limit. It's far outside the one page challenge, being about 1,000 lines of code.
_________________Robert Finch http://www.finitron.ca
|
Mon May 29, 2017 12:52 pm |
|
|
BigEd
Joined: Wed Jan 09, 2013 6:54 pm Posts: 1808
|
It'd be good to see a thread about your Butterfly2. Sounds like it's about the same amount of code and the same number of slices as Arlet's 6502. Of course, by necessity Revaldinho's verilog is quite dense. We think all these one-page pieces of code and HDL could be interesting for learning about CPU design, and about python, verilog, and javascript. They are not necessarily exemplary pieces of code, but they are short enough to print out and study, and they do just one thing.
It is interesting, and a bit of a surprise, how much function can be packed into one page.
|
Mon May 29, 2017 2:06 pm |
|
|
BigEd
Joined: Wed Jan 09, 2013 6:54 pm Posts: 1808
|
BigEd wrote: Here are some code comparisons, for a simple 16-bit Fibonacci test program: OPC-5 seems to come out at 100MHz on some typical FPGA, at a first pass, so this is looking pretty good. With lots of registers, we find a lot of instructions are just a single word.

Breaking news update: revaldinho has been rejigging the state machine for OPC5, and managed to improve the cycle counts - and at the same time, moved some of the logic around between states, and improved the clock speed too! So today, we're up from 59 to 68 slices, but the cycle count is down from 921 to 700, and the clock speed is up from 100MHz to 150MHz. And it's still a one-page machine. We haven't yet done a full-on synthesis. Because this is a 16-bit machine, we can imagine that a half-speed version could connect to an 8-bit wide memory and take two bites at every access - so we need 2x the performance of the 6502 in order to win in an 8-bit world, and indeed we've just about got that, but it looks like we win handsomely if we have a full width memory. For me, this is like the 68000 and the 68008, or indeed, like the 8086 and the 8088. Same architecture, but a half-width version for cost-constrained systems, which takes a bit of a performance hit.
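The half-width idea in that last paragraph can be sketched as a byte-wide front end: every 16-bit access becomes two byte accesses, so full-width memory buys roughly 2x on memory-bound code. A minimal Python model (little-endian byte order is my assumption, as is the flat byte array):

```python
def read_word(mem8, word_addr):
    # Two "bites" at a byte-wide memory for one 16-bit word (little-endian).
    lo = mem8[2 * word_addr]
    hi = mem8[2 * word_addr + 1]
    return lo | (hi << 8)

def write_word(mem8, word_addr, value):
    # The half-speed case: two byte cycles per 16-bit store.
    mem8[2 * word_addr] = value & 0xFF
    mem8[2 * word_addr + 1] = (value >> 8) & 0xFF

mem8 = bytearray(16)
write_word(mem8, 3, 0xBEEF)
assert read_word(mem8, 3) == 0xBEEF
assert mem8[6] == 0xEF and mem8[7] == 0xBE
```

A full-width memory collapses each of those pairs into one cycle, which is exactly the 68000-vs-68008 trade-off mentioned above.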
|
Sat Jun 03, 2017 10:39 am |
|
|
hoglet
Joined: Tue Feb 10, 2015 7:07 am Posts: 52
|
Hi Ed,
Do you think the OPC-5 instruction set is stable enough that it's worth starting to write a small machine code monitor?
Am I right in thinking this has yet to be deployed to real hardware?
Dave
|
Sat Jun 03, 2017 10:57 am |
|
|
BigEd
Joined: Wed Jan 09, 2013 6:54 pm Posts: 1808
|
Hi Dave, Yes, the OPC5 behaviour is set now - we had a quick exchange as to whether SUB would be better than ADD, but decided against it. So, these latest developments have been purely in the verilog implementation - the emulators and spec have been unchanged, which is a good sign of stability. But indeed no, as yet there's no implementation in real hardware. The latest development in emulation land was to add an output device at FE09 - that's meant to stand in for a UART, of course, with status/control at FE08 and I/O at FE09. So, an FPGA with the core, RAM, a monitor pre-loaded, and some kind of UART model should not be far out of reach. There is now a small suite of test programs which can also serve as examples. Indeed, as many as six. Ed
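A sketch of how an emulator might model that output device, with status/control at FE08 and data at FE09. The status bit assignment and always-ready behaviour are my assumptions; only the two addresses come from the post above:

```python
UART_STATUS, UART_DATA = 0xFE08, 0xFE09

class Memory:
    # 64K words of RAM with the UART stand-in mapped over two locations.
    def __init__(self):
        self.ram = [0] * 0x10000
        self.output = []                 # characters "transmitted" so far
    def read(self, addr):
        if addr == UART_STATUS:
            return 0x01                  # TX always ready in this toy model
        return self.ram[addr]
    def write(self, addr, value):
        if addr == UART_DATA:
            self.output.append(chr(value & 0xFF))
        else:
            self.ram[addr] = value

mem = Memory()
for ch in "Hi":
    if mem.read(UART_STATUS) & 0x01:     # poll-then-write, as a monitor would
        mem.write(UART_DATA, ord(ch))
assert "".join(mem.output) == "Hi"
```

Intercepting the addresses in the memory model keeps the CPU core itself unchanged, which is presumably why memory-mapped I/O fits the one-page budget.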
|
Sat Jun 03, 2017 11:13 am |
|
|
Garth
Joined: Tue Dec 11, 2012 8:03 am Posts: 285 Location: California
|
Amazing progress! How would the OPC5 be clocked? Do you hang a crystal on the outside and have a PLL onboard, or do you need to derive the final frequency outboard of the CPLD, whether 100MHz, 150MHz, or whatever? Or does it set its own frequency which may vary somewhat from one instruction to the next based on the length of the logic path for that instruction?
_________________http://WilsonMinesCo.com/ lots of 6502 resources
|
Sat Jun 03, 2017 4:43 pm |
|
|
Revaldinho
Joined: Tue Apr 25, 2017 7:33 pm Posts: 32
|
Garth wrote: Amazing progress! How would the OPC5 be clocked? Do you hang a crystal on the outside and have a PLL onboard, or do you need to derive the final frequency outboard of the CPLD, whether 100MHz, 150MHz, or whatever? Or does it set its own frequency which may vary somewhat from one instruction to the next based on the length of the logic path for that instruction?

Thanks Garth. I think the MHz number is just interesting at this point. It's good to keep track of how different changes affect the machine, but I haven't given a lot of thought to practical implementation issues while getting the basic CPU core working so far. It's definitely a fixed clock that's expected. As the code is today, the CPU assumes a single cycle synchronous memory, so what we might achieve with real memory in a physical implementation may well be much lower. I was going to postpone thinking about that for the next one, which won't strictly be an OPC any more, but will extend OPC5 into something that we can use to address Arlet's challenge. For that we will need an interrupt module, and some way of connecting to an 8 bit bus. Dave is keen to try out an FPGA implementation of the basic OPC5 using just block RAM as the memory interface, though, so that'll be interesting to follow.

OPC5 has turned out to be quite a surprising machine. Just when you think it's all done and not possible to fit anything more in 66 lines of code, it turns out that you can! The emulator and assembler have been stable for a while, but over the last week or two I've been able to rework the verilog a number of times to improve the average cycles per instruction. We have a small number of test cases now which I use as a regression suite and also as not-totally-scientific benchmarks: 32b unsigned division, multiplication and square root routines, as well as the simpler Fibonacci number generator and a small test program mainly intended to exercise assembler syntax.
This is the recent progression on OPC-5 in terms of cycle efficiency, i.e. all changes are in the inner workings of the RTL implementation only, with no alterations to the test programs themselves; all OPC5 netlists are fully binary compatible with each other.
Code:
  Cycles per Instruction
  3.20 - first working version with fixed state machine progression and
         predication checks only in Effective Address (EA) calculation state
  3.15 - do predicate checking in FETCH cycles
  3.07 - allow some instructions to skip EA state altogether
  2.48 - overlap FETCH0 and EXEC states, disable earlier optimisations
  2.43 - reimplement predication checks in both FETCH cycles again
  2.35 - allow instructions using r0 as source register to skip EA again
And that's the state of the current version in git, which is also the version which Ed posted about, and where our fib.s test case takes 700 cycles. But like I say, just when you think it's all done you find there's something else you can do. I have another version now which is quite a bit bigger in terms of FPGA slices than the current release. I will probably release it as an alternative implementation rather than as a replacement for the simpler one there now. The new one is still binary compatible with the others, and it still fits in 66 lines of code, although things are (even) less tidy now than they were. The major change in the new one is that it adds a second read port on the register file, allowing both register reads to be done in the same cycle if no immediate data is needed. That's the main cause of the increase in area from around 68 slices up to about 80. The big deal, though, is that this one is significantly more efficient in cycle counts. When code is tightly written to use streams of single word instructions, the machine needs just a FETCH and an EXEC cycle to execute each one, and the EXEC for one instruction can overlap the FETCH for the next. It's not much slower in MHz than the current implementation when running through the Xilinx tools, but there is a massive improvement in average cycles per instruction. For these benchmarks, average cycles per instruction is reduced from 2.35 to just 1.84, and the total cycle count for the fib.s test is down to 520 cycles. Maybe this one is a better place to start on Arlet's challenge, so long as the extra area doesn't prove to be critical. Rich
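A toy cycle-count model of the overlap Rich describes (purely illustrative, not the real OPC5 state machine): one FETCH cycle per instruction word plus one EXEC, with each EXEC hidden under the next instruction's FETCH, so a run of k single-word instructions costs about k+1 cycles instead of 2k:

```python
def cycles(instr_lengths_in_words, overlap=True):
    # Count cycles for a straight-line run of instructions, where each
    # entry gives the instruction's length in words (1 or 2 for OPC5).
    total = 0
    for words in instr_lengths_in_words:
        total += words + 1              # one FETCH per word, then one EXEC
        if overlap:
            total -= 1                  # EXEC overlapped with the next FETCH
    if overlap:
        total += 1                      # the final EXEC overlaps nothing
    return total

single_word_run = [1] * 10
assert cycles(single_word_run, overlap=False) == 20
assert cycles(single_word_run) == 11    # approaches 1 cycle per instruction
```

This is why cycles-per-instruction falls fastest on code written as streams of single-word instructions, as Rich notes; two-word instructions still pay for their extra operand fetch.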
|
Sun Jun 04, 2017 2:47 pm |
|
|
hoglet
Joined: Tue Feb 10, 2015 7:07 am Posts: 52
|
Revaldinho wrote: Dave is keen to try out an FPGA implementation of the basic OPC5 using just block RAM as the memory interface, though, so that'll be interesting to follow.
I've had some great fun over the last couple of hours bringing up an OPC5 based system on the Papilio One:
https://github.com/hoglet67/opc/blob/ma ... c5system.v
https://github.com/hoglet67/opc/blob/ma ... one/uart.v
https://github.com/hoglet67/opc/blob/ma ... am_2k_16.v
https://github.com/hoglet67/opc/blob/ma ... sevenseg.v
The only non-obvious bit is clocking the RAM block on the falling edge of the clock, to hide its output register. And here it is running: once the reset button is released, you can see it looping at 0x0058, which is correct. And on the serial port we have: [Mod: see attached image below]
Next tasks are:
- extend the one page UART to support receiving
- improve the compile/debug cycle time using data2mem to patch the monitor code into the bitstream
- make the monitor actually do something useful
I think for the Monitor I'll try to implement the feature set of Bruce Clark's Compact Monitor: http://biged.github.io/6502-website-arc ... m/cmon.htm
At the moment, all I have is some I/O routines: https://github.com/hoglet67/opc/blob/ma ... /monitor.s
I still find it amazing that the monitor is being compiled with a 66-line assembler! Dave
|
Sun Jun 04, 2017 5:49 pm |
|
|
BigEd
Joined: Wed Jan 09, 2013 6:54 pm Posts: 1808
|
That's brilliant, to see the OPC5 come to life!
|
Mon Jun 05, 2017 8:01 pm |
|
|
BigEd
Joined: Wed Jan 09, 2013 6:54 pm Posts: 1808
|
Just to note, I've branched off a thread for OPC5 activities, to leave room in this thread for other people's stories: Notes on the OPC5 - a one-page-CPU, 16 bits. (I'm not sure if it's the right thing to do, but I did it! OPC5 has itself branched out a bit and now there's an OPC6. So much for me saying it's set now.)
|
Sun Jul 23, 2017 9:31 am |
|
|
monsonite
Joined: Mon Aug 14, 2017 8:23 am Posts: 157
|
Hi Ed, It's barely 2 weeks since you left a comment on my blog - and now we have OPC6 running on our open source Lattice ICE40 based dev-board! It's amazing (to me) to see a fully working monitor programme and the means to calculate Pi - when 2 weeks ago I barely had an 8-bit counter running! This could only have been achieved because of the massive amount of work that Dave (Hoglet) has done over the last few days - and this is greatly appreciated.

Our ICE40 platform grew out of being inspired by Clifford Wolf's open source reverse engineering of the Lattice bitstream format, about 18 months ago. There are now a series of low cost FPGA boards based on the ICE40, taking full advantage of Clifford's "Project IceStorm". myStorm is kind of a "Field of Dreams" project. Alan (Wood) and I decided that if we built an inexpensive FPGA dev board - and made it available to hobbyists, makers and students at reasonable cost - then it would promote the use of open source FPGA hardware and software to a much wider community. "If you build it, they will come..."

So the current production board is called BlackIce - and sports a Lattice ICE40HX4K-TQ144. This is the only sensible ICE40 in an LQFP package, which makes board layout so much easier than BGA. BlackIce is laid out on a 2 layer board, designed in EagleCAD, so the seriously enthusiastic could have boards made by OSHPark - or similar - and then build their own FPGA board. I have built up three boards in this manner, but I really don't advise it - 0402 components are about the absolute limit of "human solderability". Better to spend £40 and buy a factory built board.

When we designed BlackIce, the idea was to load it up with as many useful features as possible, so we have an ICE40 FPGA closely coupled to a 256Kx16 10ns SRAM, which should appeal to cpu builders.
When we looked at other FPGA boards, they all seemed to use the FTDI dual port USB interface chip to provide an SPI bus to program the bitstream flash and also to provide a virtual com port for FPGA communications. The FTDI device was nearly $4.00 in volume - so we took the decision to spend our $4 wisely, and put a fairly useful ARM Cortex M4 microcontroller (STM32L433) on board to provide FPGA programming support. The STM32L433 could be programmed to run the OPC6 assembler, as both microPython and JS are available for this device:
https://micropython.org/ MicroPython
https://www.espruino.com/ JavaScript for STM32 micros

The STM32 is wired up to a series of Arduino headers - so most of its unused GPIO is available to the User, as are the range of peripherals on the STM32, including 12 bit ADCs and DACs. On the underside of the board is a microSD socket, accessible to both the FPGA and the ARM, which can be used for program storage, or even to store a series of bitstream files - whatever you fancy.

Dave (Hoglet) notified me this morning of the Digilent VGA adaptor - available for £8.99 from Farnell - as a double-wide pmod. This looks like it should work with the BlackIce board. http://uk.farnell.com/digilent/410-345/ ... dp/2768189

Additionally, when we laid the board out, Olimex had defined an FPGA expansion bus and a number of accessory boards, so we have an (unpopulated) footprint that will accept the Olimex FPGA expansion board connector.

To summarise, the OPC6 has been ported to the Lattice ICE40 ecosystem using open source tools, which brings forth a whole bunch of low cost FPGA dev boards. The OPC6 design used just over 10% of the available logic, and 56% of the block RAM. It should be mentioned that the HX4K is actually an 8K die - and with Clifford's toolchain the whole 8K of logic is available. If anyone is interested in trying a BlackIce board, we now have production quantities available, and they should contact me directly.
We have a users forum at mystorm.uk Ken
|
Fri Aug 25, 2017 2:25 pm |
|