TOYF (Toy Forth) processor
Author |
Message |
Hugh Aguilar
Joined: Sun Jul 23, 2017 1:06 am Posts: 93
|
I still think that my 65ISR design is pretty cool --- the 1-bit instructions are a great feature! The 65ISR is somewhat ambitious for a first effort at HDL though. A guy who teaches Verilog for Xilinx suggested that a load-and-store processor would be easier to implement. With this thought in mind, I designed the TOYF processor.
This has very simple instructions that execute in one clock cycle. It allows multiple (up to three) instructions to be packed into a single opcode so all of them execute concurrently in one clock cycle. This is similar to the MiniForth that I worked on previously at Testra (except that the MiniForth allowed up to five instructions to be packed into each opcode). Packing instructions boosts the speed and also reduces code bloat (RISC processors tend to have a lot of code bloat due to all the instructions being simple, as compared to a CISC processor in which each instruction does a lot).
The reason it is called TOYF (Toy Forth) is because it doesn't support interrupts that can happen after arbitrary instructions. Also, it is much more Forth-centric than the 65ISR because it has the data-stack and return-stack that Forth traditionally needs. The TOYF supports local variables and quotations that have access to the parent function's local variables despite the fact that the HOF that executes them has local variables of its own.
Hopefully the TOYF will be easy to implement. All of the instructions do something simple such as add one register to another. There are no addressing modes (the 65ISR has indexed addressing). I would expect that implementing such simple instructions should be pretty easy --- easy enough to be a first project for me --- I haven't really started studying Verilog yet though, so I don't really know anything about the subject.
Other than the MiniForth/RACE that I worked on previously, most Forth processors have been CISC --- they have a few instructions (typically 32 or 64) and they are big multi-cycle instructions (for example a ROT instruction to rotate three items on the data-stack). My TOYF is a RISC processor similar to what I worked on previously. This is an example of parallelization:
Code:
PREP_REG_COPY:      ; src dst cnt -- cnt    ; sets AX=src DX=dst
    pls
    ldr DX
    pls
    ldr AX
    nxt

REG_COPY:           ; needs AX=src DX=dst BX=cnt
                    ; moves a datum from (AX) to (DX) post-incrementing, and decrements the cnt
    mov AX,MA
    add 1,AX
    sub 1,BX
    mov BX,CF       ; CF set if BX is non-zero
    ldr LX
    mov DX,MA
    xch DX,BX
    str LX
    add 1,BX
    xch DX,BX
    mov -1,LX       ; this will cause NXT to back up and do the REG_COPY primitive again
    cad LX,IP       ; if CF then move IP + LX to IP
    nxt
Our REG_COPY takes 7 clock cycles per word copied, which is pretty fast considering how many group-M instructions we have. The instructions pack into opcodes as follows:
- MOV AX,MA and ADD 1,AX and SUB 1,BX all parallelize into one opcode.
- MOV BX,CF and LDR LX parallelize into one opcode.
- MOV DX,MA and XCH DX,BX parallelize into one opcode.
- STR LX and ADD 1,BX parallelize into one opcode.
- XCH DX,BX and MOV -1,LX parallelize into one opcode.
- CAD LX,IP doesn't parallelize with anything, so it is an opcode by itself.
- NXT doesn't parallelize with anything, so it is an opcode by itself.
REG_COPY can be used to do very fast block moves! The problem is that it requires that NXT rather than POL be used, so a large block move could delay the POLL code too long. Because of this, it is best to only copy small blocks of data.
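One way to sanity-check the seven-opcode breakdown is to simulate it. The Python sketch below is only an illustration under assumptions that are not spelled out in this post: LDR LX is taken to load LX from the address in MA, STR LX to store LX at the address in MA, XCH to swap two registers, and NXT to re-dispatch REG_COPY whenever CAD has backed IP up by one. Every instruction in an opcode reads a snapshot of the registers taken before that opcode started, which is how concurrent execution is modeled here.

```python
def run_reg_copy(memory, src, dst, cnt):
    """Simulate REG_COPY one packed opcode per clock cycle.

    Instruction semantics are assumptions made for this sketch.
    Returns (memory_after, cycles_used).
    """
    mem = list(memory)
    # IP = 1 means "pointing just past REG_COPY" in a one-entry thread (assumed model).
    reg = {"AX": src, "DX": dst, "BX": cnt, "MA": 0, "LX": 0, "CF": False, "IP": 1}

    def str_lx(s):
        mem[s["MA"]] = s["LX"]          # STR LX: store LX at address MA (assumed)
        return {}

    # Each instruction maps the pre-opcode snapshot `s` to the registers it writes.
    opcodes = [
        [lambda s: {"MA": s["AX"]},                  # MOV AX,MA
         lambda s: {"AX": s["AX"] + 1},              # ADD 1,AX
         lambda s: {"BX": s["BX"] - 1}],             # SUB 1,BX
        [lambda s: {"CF": s["BX"] != 0},             # MOV BX,CF
         lambda s: {"LX": mem[s["MA"]]}],            # LDR LX (assumed: load from MA)
        [lambda s: {"MA": s["DX"]},                  # MOV DX,MA
         lambda s: {"DX": s["BX"], "BX": s["DX"]}],  # XCH DX,BX
        [str_lx,                                     # STR LX
         lambda s: {"BX": s["BX"] + 1}],             # ADD 1,BX
        [lambda s: {"DX": s["BX"], "BX": s["DX"]},   # XCH DX,BX
         lambda s: {"LX": -1}],                      # MOV -1,LX
        [lambda s: {"IP": s["IP"] + s["LX"]} if s["CF"] else {}],  # CAD LX,IP
    ]

    cycles = 0
    while True:
        for opcode in opcodes:
            snapshot = dict(reg)        # all reads see the values from before this opcode
            for instruction in opcode:
                reg.update(instruction(snapshot))
            cycles += 1
        cycles += 1                     # NXT is the seventh opcode
        if reg["IP"] == 0:              # CAD backed up: NXT dispatches REG_COPY again
            reg["IP"] = 1
        else:
            break
    return mem, cycles


result, cycles = run_reg_copy([10, 20, 30, 0, 0, 0], src=0, dst=3, cnt=3)
print(result, cycles)
```

Under these assumptions, copying three words takes three passes of seven opcodes each, which agrees with the 7 clock cycles per word quoted above.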
|
Fri Oct 20, 2017 5:26 am |
|
|
BigEd
Joined: Wed Jan 09, 2013 6:54 pm Posts: 1799
|
An unusual machine! Looks a bit VLIW to me - someone somewhere has to figure out which instructions can be bundled up together - is that the person writing the program, or a job done by the assembler? In your code snippet, I don't see any annotation about grouping, or any explicit NOP, so I suppose the assembler is doing the thinking?
|
Fri Oct 20, 2017 9:13 am |
|
|
Hugh Aguilar
Joined: Sun Jul 23, 2017 1:06 am Posts: 93
|
BigEd wrote: An unusual machine! Looks a bit VLIW to me - someone somewhere has to figure out which instructions can be bundled up together - is that the person writing the program, or a job done by the assembler? In your code snippet, I don't see any annotation about grouping, or any explicit NOP, so I suppose the assembler is doing the thinking?
Yes, the assembler rearranges the instructions and packs them into the opcodes with the goal of minimizing the NOPs that need to be inserted while still guaranteeing that the program does the same thing as it would if the instructions were compiled one per opcode in the same order that they appeared in the source-code. I wrote MFX (MiniForth cross-compiler) at Testra and my assembler did this --- so I know how to write it --- I'll just resurrect that old assembler for this TOYF processor.
Yes, it is VLIW --- although most VLIW processors have a much wider opcode --- Lockheed Martin has one that, iirc, is 128 bits wide. I could add another register (let's call it CX) and another field of instructions (let's call it group-C) --- if I did this, I would have to increase the opcode size --- I don't think this is necessary, but maybe something to consider in the future for a super-duper Toy Forth processor.
I have been told that the opcode size doesn't have to be a multiple of 8. It is possible to have a 20-bit opcode, for example. True?
|
Fri Oct 20, 2017 3:10 pm |
|
|
BigEd
Joined: Wed Jan 09, 2013 6:54 pm Posts: 1799
|
Ah, nice, your assembler looks after the packing and maintains the correctness, so the programmer doesn't have to worry.
Sure, you can make a memory system any width - but widths divisible by 8 or 9 make better use of the hardware you might buy. Usually 8. Or, your memory system could be a different width, and there will be some unused bits, or you'll need multiple reads to assemble a whole instruction. But for efficiency, you'd try to match things up.
For example, a 24-bit opcode would be a good fit for a 24-bit wide memory. Or, if you had an 8 or 16 bit wide memory, you'd need a buffer to marshall the instruction. Better than a buffer would be a cache, to hold several whole instructions ready for use. But complexity is going to increase.
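The "multiple reads to assemble a whole instruction" case can be sketched in a few lines of Python. This is an illustration only: the function name, the little-endian byte order, and the choice to round each opcode up to a whole number of bytes are assumptions, not anything taken from a real design.

```python
def fetch_opcode(mem8, index, width_bits):
    """Assemble one opcode from a byte-wide memory.

    Assumptions for this sketch: each opcode occupies a whole number of
    bytes, stored little-endian, and any bits above width_bits are unused.
    Returns (opcode, number_of_memory_reads).
    """
    bytes_per_opcode = (width_bits + 7) // 8
    base = index * bytes_per_opcode
    word = 0
    for i in range(bytes_per_opcode):
        word |= mem8[base + i] << (8 * i)       # one memory read per byte
    return word & ((1 << width_bits) - 1), bytes_per_opcode


memory = [0x56, 0x34, 0x12, 0xFF, 0xFF, 0xFF]
print(fetch_opcode(memory, 0, 24))   # 24-bit opcode: three reads, no wasted bits
print(fetch_opcode(memory, 1, 20))   # 20-bit opcode: three reads, top 4 bits unused
```

A 20-bit opcode in this layout costs the same three reads as a 24-bit one and leaves four bits of every third byte unused, which is the kind of mismatch described above.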
|
Fri Oct 20, 2017 3:48 pm |
|
|
Hugh Aguilar
Joined: Sun Jul 23, 2017 1:06 am Posts: 93
|
BigEd wrote: Ah, nice, your assembler looks after the packing and maintains the correctness, so the programmer doesn't have to worry.
The MFX assembler would assemble an instruction as far back in the machine-code as possible. It knew what registers each instruction needed and what registers each instruction modified. It wouldn't assemble the instruction as far back as the last instruction that modified a register it needed. It would look for an empty slot as far back as possible (empty because two instructions of the same group can't go in the same opcode). The assembler would also not compile back past a label.
The assembler would not only compile a machine-code program for the target processor, but it would meta-compile a Forth program for the host processor that simulated the machine-code program. This allowed the simulator to run fast --- there was no need to decode the opcodes and do a <SWITCH on each field. All the work was done at compile-time, so at run-time it was only necessary to EXECUTE the xt values for the functions that simulated each instruction.
I had two sets of registers: SRC and DST. When the instructions executed they would use the SRC register values and set the DST register values. After each opcode was simulated, all of the DST register values were moved to the SRC register set. This was because it often happens that you have two instructions executing concurrently, but one instruction reads a particular register and another writes to that register. On the actual processor, this happens concurrently. On the simulator on the host computer, this happens sequentially --- but, because there are two register sets, with SRC being read and DST being written, this works out --- the "same register" actually has two versions, for reading and writing.
My boss at Testra didn't expect this. He was expecting the assembly-language programmer to pack the instructions manually. Also, at Lockheed Martin they have the assembly-language programmer packing the instructions manually (they use a spreadsheet with each row being an opcode and each column being a field). This is a very bad idea! Quite a lot of assembly-language code was written for the MiniForth --- it would have been too labor-intensive for the assembly-language programmer to manually pack the instructions --- the project would not have been completed if so much tedious work had been required of the poor assembly-language programmer.
BigEd wrote: Sure, you can make a memory system any width - but widths divisible by 8 or 9 make better use of the hardware you might buy. Usually 8. Or, your memory system could be a different width, and there will be some unused bits, or you'll need multiple reads to assemble a whole instruction. But for efficiency, you'd try to match things up. For example, a 24-bit opcode would be a good fit for a 24-bit wide memory. Or, if you had an 8 or 16 bit wide memory, you'd need a buffer to marshall the instruction. Better than a buffer would be a cache, to hold several whole instructions ready for use. But complexity is going to increase.
Well, I want to decrease complexity. To a large extent, the goal of this design is to be as simple as possible to implement in Verilog because this would be my first-ever Verilog effort --- the 65ISR was too complicated --- the TOYF is supposed to be ultra-simple (that is why there are no interrupts).
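The "as far back as possible" packing rule can be sketched roughly as follows. This is not the MFX algorithm itself, just a minimal Python illustration of a greedy packer of that general kind. The group letters (A, B, M) assigned to each instruction, the register read/write sets, and the extra rule that a write may not move ahead of an earlier read or write of the same register are all assumptions made for the sketch.

```python
from collections import defaultdict

def pack(program):
    """Greedily pack instructions into opcodes, one slot per group.

    program is a list whose items are either the string "LABEL" or a tuple
    (name, group, reads, writes). Returns a list of dicts: group -> name.
    """
    opcodes = []
    last_write = defaultdict(lambda: -1)   # opcode index of the latest writer
    last_read = defaultdict(lambda: -1)    # opcode index of the latest reader
    floor = 0                              # nothing may be placed before a label
    for item in program:
        if item == "LABEL":
            floor = len(opcodes)
            continue
        name, group, reads, writes = item
        earliest = floor
        for r in reads:                    # must come after the value is written
            earliest = max(earliest, last_write[r] + 1)
        for w in writes:                   # may share an opcode with an earlier reader,
            earliest = max(earliest, last_write[w] + 1, last_read[w])  # not a writer
        k = earliest
        while k < len(opcodes) and group in opcodes[k]:
            k += 1                         # slot for this group is already taken
        if k == len(opcodes):
            opcodes.append({})
        opcodes[k][group] = name
        for r in reads:
            last_read[r] = max(last_read[r], k)
        for w in writes:
            last_write[w] = k
    return opcodes


# First six instructions of REG_COPY; the group letters are guesses.
program = [
    ("mov AX,MA", "M", {"AX"}, {"MA"}),
    ("add 1,AX",  "A", {"AX"}, {"AX"}),
    ("sub 1,BX",  "B", {"BX"}, {"BX"}),
    ("mov BX,CF", "B", {"BX"}, {"CF"}),
    ("ldr LX",    "M", {"MA", "MEM"}, {"LX"}),
    ("mov DX,MA", "M", {"DX"}, {"MA"}),
]
packed = pack(program)
for number, opcode in enumerate(packed, start=1):
    print(number, sorted(opcode.values()))
```

With these guessed groups, the first two opcodes come out the same as in the REG_COPY breakdown earlier in the thread: three instructions in the first opcode, then MOV BX,CF together with LDR LX.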
|
Fri Oct 20, 2017 7:33 pm |
|
|
BigEd
Joined: Wed Jan 09, 2013 6:54 pm Posts: 1799
|
As a small example, in the OPC project we made a 16 bit wide machine, and initially ran it with 16 bit wide memory. When we ported to an FPGA module which had only 8 bit wide memory, we needed at least a buffer. Dave [hoglet] wrote the code for that, and added in a very small code cache at the same time. In a little over 100 lines of verilog, we got something which could
- interface the different width memory
- accommodate the slow speed of the memory
- partially compensate for the speed and the width with a code cache
See here: https://github.com/revaldinho/opc/blob/ ... ntroller.v
|
Fri Oct 20, 2017 7:51 pm |
|
|
Hugh Aguilar
Joined: Sun Jul 23, 2017 1:06 am Posts: 93
|
Hugh Aguilar wrote: Well, I want to decrease complexity. To a large extent, the goal of this design is to be as simple as possible to implement in Verilog because this would be my first-ever Verilog effort --- the 65ISR was too complicated --- the TOYF is supposed to be ultra-simple (that is why there are no interrupts).
I have a minor update. I got rid of the STR IP instruction, and added a MOV AX,IP instruction. This will help the FULL code to quickly initialize some Forth code. I mostly added more text explaining the design decisions of the TOYF, and also providing more example code.
Explaining these design decisions got me to questioning those decisions. I am considering a major change in which the POLL code would only get executed when a function is called, but would not get executed after every primitive executes. This would allow a function to not execute POLL at all, assuming that it doesn't call any other functions but only executes primitives. This would allow DX to be used to hold data all the way through a function's execution. Currently, local variables are used to hold data all the way through a function's execution. They are pretty fast, but not as fast as a register would be.
Currently, DX can be used to pass data from one function to another. The function that sets DX ends in NXT so it doesn't execute the POLL code (that might clobber DX). Functions can end in POL though; when DX is no longer valid, it is okay to execute the POLL code.
The assumption with the current design is that the POLL code has to be called every 20 to 40 clock cycles, lest I/O data get lost. Is this a reasonable assumption? Executing POLL code only on function calls would likely result in a latency of 100 to 200 clock cycles. How fast is I/O? How fast is the clock on the processor (realistically)? I'm getting somewhat out of my depth, because I'm not familiar with the timing considerations of typical micro-controller applications.
Anyway, this update explains the current scheme in which the POLL code gets executed after most of the primitives, which works out to every 20 to 40 clock cycles.
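To put rough numbers on the latency question, the arithmetic can be sketched in Python. The 40 MHz clock and the 115200 baud serial link with 10 bits per byte are placeholder assumptions for the example only, not TOYF measurements or requirements.

```python
def poll_latency_us(cycles, clock_hz):
    """Worst-case time between POLL executions, in microseconds."""
    return cycles / clock_hz * 1e6

def uart_byte_time_us(baud, bits_per_frame=10):
    """Time for one serial byte to arrive (start bit + 8 data bits + stop bit)."""
    return bits_per_frame / baud * 1e6

CLOCK_HZ = 40_000_000   # placeholder clock rate, an assumption
BAUD = 115_200          # placeholder serial speed, an assumption

for cycles in (20, 40, 100, 200):
    print(f"{cycles:3d} cycles between polls = {poll_latency_us(cycles, CLOCK_HZ):.2f} us")
print(f"one serial byte arrives every {uart_byte_time_us(BAUD):.1f} us")
```

Under those placeholder figures, even a 200-cycle gap between polls is a small fraction of one byte time, so whether 20 to 40 cycles is really needed depends on which peripherals have to be serviced and how much buffering they have.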
|
Mon Oct 23, 2017 1:10 am |
|
|
Hugh Aguilar
Joined: Sun Jul 23, 2017 1:06 am Posts: 93
|
BigEd wrote: As a small example, in the OPC project we made a 16 bit wide machine, and initially ran it with 16 bit wide memory. When we ported to an FPGA module which had only 8 bit wide memory, we needed at least a buffer. Dave [hoglet] wrote the code for that, and added in a very small code cache at the same time. In a little over 100 lines of verilog, we got something which could - interface the different width memory - accommodate the slow speed of the memory - partially compensate for the speed and the width with a code cache See here: https://github.com/revaldinho/opc/blob/ ... ntroller.v
I read up on your OPC project, but I don't know enough about Verilog to have an opinion on it. Would you consider the OPC processor to be more or less difficult to implement than TOYF? I designed TOYF with the primary purpose of making it easy to implement in Verilog, but I don't know Verilog so I'm not the best judge of what would be easy to implement in Verilog.
The TOYF is similar to the MiniForth, and Testra managed to implement that on a Lattice isp1048 PLD in 1994 --- their goal was ease of implementation --- this was due to having such a limited platform to implement on. I'm not aware of anybody else implementing a processor on the Lattice PLD --- if it was easy, everybody would have been doing it.
|
Mon Oct 23, 2017 1:15 am |
|
|
BigEd
Joined: Wed Jan 09, 2013 6:54 pm Posts: 1799
|
Well, I'm sort of verilog-literate, but far from an excellent practitioner. The verilog for OPC was mostly written by revaldinho, who is something of a wizard. So everything looks easy once it's done by a wizard! I think once you've got the bones of a CPU fetching and executing instructions, adding more instructions isn't so difficult. Adding a new address mode might need more work. I would suggest you tackle the problem incrementally. If I were to try it, I'd be very tempted to start with something which already exists and hack it into a new shape. That PLD is pretty small, but I wonder if it's big enough to take the smallest 6502 implementation? There's a surprisingly large range of implementation sizes for the different 6502 implementations out there. http://forum.6502.org/viewtopic.php?f=10&t=1673
|
Mon Oct 23, 2017 8:07 am |
|
|
Hugh Aguilar
Joined: Sun Jul 23, 2017 1:06 am Posts: 93
|
BigEd wrote: I think once you've got the bones of a CPU fetching and executing instructions, adding more instructions isn't so difficult. Adding a new address mode might need more work. I would suggest you tackle the problem incrementally. If I were to try it, I'd be very tempted to start with something which already exists and hack it into a new shape.
You haven't read my TOYF.4TH document. I can't add any more instructions because all three groups are full --- I would have to have an opcode wider than 16 bits. I don't have any addressing-modes. This is a load-and-store design. None of the instructions have any operands. Perhaps it wasn't worthwhile to post the TOYF.4TH document here...
|
Mon Oct 23, 2017 3:21 pm |
|
|
barrym95838
Joined: Tue Dec 31, 2013 2:01 am Posts: 116 Location: Sacramento, CA, United States
|
Hugh Aguilar wrote: ... Perhaps it wasn't worthwhile to post the TOYF.4TH document here...
I know that you're an emotional guy, but please don't jump to any hasty conclusions, Hugh. I (and I'm sure several others) am following with quiet interest.
Mike B.
|
Mon Oct 23, 2017 3:26 pm |
|
|
BigEd
Joined: Wed Jan 09, 2013 6:54 pm Posts: 1799
|
I certainly did have a quick look through once. But what I was saying was something a little different: if your CPU has many instructions, as most have, I still think it's worth considering writing the verilog first with only a few of those instructions, and to add them incrementally, perhaps in groups if they fall naturally into groups. At the same time, you can write short test sequences, which first of all only use the few instructions you've implemented, and then get more varied as you implement more instructions.
But a fair point about addressing modes: if there's only one, then there's no natural grouping to fall out of that.
|
Mon Oct 23, 2017 4:30 pm |
|
|
Hugh Aguilar
Joined: Sun Jul 23, 2017 1:06 am Posts: 93
|
BigEd wrote: I certainly did have a quick look through once. But what I was saying was something a little different: if your CPU has many instructions, as most have, I still think it's worth considering writing the verilog first with only a few of those instructions, and to add them incrementally, perhaps in groups if they fall naturally into groups. At the same time, you can write short test sequences, which first of all only use the few instructions you've implemented, and then get more varied as you implement more instructions. But a fair point about addressing modes: if there's only one, then there's no natural grouping to fall out of that.
Sorry if I seemed testy. I get frustrated when I explain what I'm thinking and people don't grok it immediately.
I wrote an update on the document that explains out-of-order execution further. Also, I clarified how the assembler determines which opcode an instruction goes into --- my description earlier was over-simplified --- I had the assembly algorithm confused in my head (it has been over 20 years since I wrote the MFX assembler and the documentation is lost, so it is just in my head).
All of the instructions are needed for Forth. I can't start with a few instructions, and add new instructions as I progress, because there is no subset that will support a full Forth system. I gave some thought to this and I think I have the simplest design possible. This is similar to the argument that Francis Crick made in regard to RNA and DNA being designed. There is no smaller, simpler system that this system could have evolved from, but this system is too complex to have been blundered into.
This TOYF is intended to be the simplest Forth processor possible --- this is so it will be easy to implement in HDL, and so it will work on an inexpensive FPGA --- there is no point in designing a super-awesome processor that can only be implemented on a super-expensive FPGA.
|
Fri Oct 27, 2017 3:27 pm |
|
|
Hugh Aguilar
Joined: Sun Jul 23, 2017 1:06 am Posts: 93
|
BigEd wrote: That PLD is pretty small, but I wonder if it's big enough to take the smallest 6502 implementation? There's a surprisingly large range of implementation sizes for the different 6502 implementations out there. http://forum.6502.org/viewtopic.php?f=10&t=1673
You are talking about the Lattice isp1048 PLD that the MiniForth was implemented on. At the time (1994), I asked about implementing a 6502. My boss said this was not realistic. There is no support for addition on the PLD, so you have to write a function to do addition using half-adders and XOR. The 6502 relies heavily on addition with its indexed addressing modes, so it is a bad choice.
The PLD's best feature was that it had a lot of pins and a lot of connectivity, so it could support a Harvard Architecture processor with 16-bit address-bus and 16-bit data-bus for both code-memory and data-memory. Note that this was in the days prior to on-chip memory, so both code-memory and data-memory had to be external.
Note that my 65ISR doesn't really have indexed addressing. It has page,Y and bank,W addressing, but these don't require addition to be done. The 65ISR-chico could conceivably be built on a Lattice isp1048 PLD --- the instructions that do addition (such as ADC SBC etc.) would be horribly slow by modern standards, but this may not matter if the system is just processing I/O data --- somewhat of an academic exercise though, as the Lattice isp1048 PLD is obsolete now.
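For anyone unfamiliar with building addition out of half-adders, here is a minimal Python sketch of a 16-bit ripple-carry adder. It only illustrates the general technique; it is not the logic that was actually used on the isp1048, and the function names are made up for the example.

```python
def half_adder(a, b):
    """Add two bits: the sum is XOR, the carry is AND."""
    return a ^ b, a & b

def full_adder(a, b, carry_in):
    """Add three bits using two half-adders and an OR for the carry."""
    partial, carry1 = half_adder(a, b)
    total, carry2 = half_adder(partial, carry_in)
    return total, carry1 | carry2

def add16(x, y):
    """16-bit ripple-carry addition; returns (sum, carry_out)."""
    carry = 0
    result = 0
    for bit in range(16):
        s, carry = full_adder((x >> bit) & 1, (y >> bit) & 1, carry)
        result |= s << bit
    return result, carry


print(add16(0x1234, 0x0FFF))
print(add16(0xFFFF, 0x0001))
```

The carry has to ripple through all 16 stages in sequence, which gives some feel for why addition built this way would be slow compared with a device that has dedicated carry logic.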
|
Fri Oct 27, 2017 3:44 pm |
|
|
monsonite
Joined: Mon Aug 14, 2017 8:23 am Posts: 157
|
Hugh,
The Lattice ICE40 range of FPGAs is becoming popular - as a result of an open-source tool chain called Project IceStorm. There are several development boards that have recently become available - as a direct result of the emergence of the open source tools. The ICE40HX4K part is really a 7680 "8K" logic element die - that was artificially limited to 4K by the Lattice proprietary toolchain. They are not the biggest or fastest FPGAs - but they are low cost and ideal for implementing 8/16 bit cpus - up to about 40MHz usable clock frequency.
Dave Banks (Hoglet67) has successfully implemented 6502, Z80 and OPC6 processors on this device - plus complete machines including Acorn Atom, BBC Model B, CP/M machine and Jupiter Ace. The OPC6 processor used about 20% of the 960 available logic blocks. The BBC Model B computer was based on Arlet's verilog 6502 implementation - using 144 of the 960 blocks for the 6502 cpu. https://github.com/Arlet/verilog-6502
The complete machine with video generator etc used about 85% of the logic blocks - https://forum.mystorm.uk/t/bbc-model-b- ... ice/258/56
Speaking of Forth - you might wish to look at James Bowman's J1 Forth processor - which has also been ported to the Lattice ICE40 https://github.com/jamesbowman/j1
regards
Ken
|
Sat Oct 28, 2017 10:33 am |
|