TOYF (Toy Forth) processor
Author |
Message |
Hugh Aguilar
Joined: Sun Jul 23, 2017 1:06 am Posts: 93
|
monsonite wrote:
Hugh, The Lattice ICE40 range of FPGAs are becoming popular - as a result of an open-source tool chain called Project IceStorm. There are several development boards that have recently become available - as a direct result of the emergence of the open source tools. The ICE40HX4K part is really a 7680 "8K" logic element die - that was artificially disabled to 4K by the Lattice proprietary toolchain. They are not the biggest or fastest FPGAs - but they are low cost and ideal for implementing 8/16 bit cpus - up to about 40MHz usable clock frequency. Dave Banks (Hoglet67) has successfully implemented 6502, Z80 and OPC6 processors on this device - plus complete machines including Acorn Atom, BBC Model B, CP/M machine and Jupiter Ace. The OPC6 processor used about 20% of the 960 available logic blocks. The BBC Model B computer was based on Arlet's Verilog 6502 implementation - using 144 of the 960 blocks for the 6502 cpu. https://github.com/Arlet/verilog-6502
The complete machine with video generator etc used about 85% of the logic blocks - https://forum.mystorm.uk/t/bbc-model-b- ... ice/258/56

The MiniForth came out in 1995 on the Lattice isp1048 PLD --- this was not an FPGA --- since that time the MiniForth has been renamed RACE and implemented on a Lattice FPGA for a several-fold increase in speed.

monsonite wrote:
Speaking of Forth - you might wish to look at James Bowman's J1 Forth processor - which has also been ported to the Lattice ICE40 https://github.com/jamesbowman/j1

I'm aware of the J1 --- this is pretty primitive --- it doesn't have local variables. Bernd Paysan's B16 and the RTX-2000 also lack local variables.
|
Sun Oct 29, 2017 1:58 am |
|
|
Hugh Aguilar
Joined: Sun Jul 23, 2017 1:06 am Posts: 93
|
monsonite wrote: The Lattice ICE40 range of FPGAs are becoming popular - as a result of an open-source tool chain called Project IceStorm.
There are several development boards that have recently become available - as a direct result of the emergence of the open source tools.
The ICE40HX4K part is really a 7680 "8K" logic element die - that was artificially disabled to 4K by the Lattice proprietary toolchain.
They are not the biggest or fastest FPGAs - but they are low cost and ideal for implementing 8/16 bit cpus - up to about 40MHz usable clock frequency.

I have a major update on my TOYF design (attached). I now have a CX register. It is used for the counter in loops, or the node in a loop traversing a linked-list. Also, CX and DX have to be saved/restored by the I/O polling code --- they are expected to survive POL.

I made several other changes. I fixed the multiply so it would work (there was a bug previously) and changed the design so it is now 1 clock cycle per bit, which is 16 clock cycles total. At 40 MHz, this is 0.4 microseconds (16 cycles of 25 nanoseconds each) --- that should be adequate for a PID controller.

thanks for your interest --- Hugh
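
The timing figure can be checked with a short sketch. The function below is a generic shift-and-add multiplier that handles one multiplier bit per loop step; treating each step as one clock cycle is an assumption for illustration, not a description of the TOYF's actual micro-architecture, and every name in it is hypothetical. It keeps only the low 16 bits of the product, matching the 16-bit result this version of the design produces.

```python
CLOCK_HZ = 40_000_000  # the "about 40MHz" usable clock quoted above
WORD_BITS = 16


def shift_add_multiply(a, b):
    """Generic shift-and-add multiply, one multiplier bit per step.

    Returns (low 16 bits of the product, number of steps taken).
    Assumption: each step corresponds to one clock cycle.
    """
    mask = (1 << WORD_BITS) - 1
    a &= mask
    b &= mask
    product = 0
    steps = 0
    for bit in range(WORD_BITS):
        if (b >> bit) & 1:
            # Add the shifted multiplicand, keeping only the low 16 bits.
            product = (product + (a << bit)) & mask
        steps += 1
    return product, steps


product, steps = shift_add_multiply(123, 45)
elapsed_us = steps / CLOCK_HZ * 1e6  # microseconds for the whole multiply
print(product, steps, elapsed_us)
```

With 16 steps at 40 MHz the elapsed time comes to about 0.4 microseconds, which is where the "less than 1/2 microsecond" figure comes from.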
|
Mon Oct 30, 2017 4:14 am |
|
|
barrym95838
Joined: Tue Dec 31, 2013 2:01 am Posts: 116 Location: Sacramento, CA, United States
|
I don't fully grok all that you have shared yet, but I think I noticed that you sometimes hold the recently discarded TOS in DX. If my limited understanding is correct, would it be a) "easy" b) "difficult" c) "impossible" or d) "WTH are you talking about Mike?" to keep TOS in DX most of the time, especially between primitives?
Mike B.
|
Mon Oct 30, 2017 6:32 am |
|
|
Hugh Aguilar
Joined: Sun Jul 23, 2017 1:06 am Posts: 93
|
barrym95838 wrote: I don't fully grok all that you have shared yet, but I think I noticed that you sometimes hold the recently discarded TOS in DX. If my limited understanding is correct, would it be a) "easy" b) "difficult" c) "impossible" or d) "WTH are you talking about Mike?" to keep TOS in DX most of the time, especially between primitives?
Mike B.

TOS is in BX all of the time. In some cases, a primitive will leave data in DX rather than push it onto the stack. So, temporarily, TOS (top-of-stack) is in DX and the BX that is normally TOS is now SOS (second-of-stack). For example:

Without optimization, this compiles as:
Code:
    lit9        ; pushes BX to the stack in memory, loads 9 into BX
    plus        ; pulls the SOS from the stack in memory to a register, adds it to BX
    exit

With optimization, this compiles as:
Code:
    lit9_dx     ; loads 9 into DX
    fast_plus   ; adds DX to BX
So, the compiler has to be smart enough to compile code that uses DX when it can. This shouldn't be too difficult. It can be a traditional single-pass Forth compiler. This is just peephole optimization. The compiler first compiles LIT9. Then, when it finds that the next thing to compile is PLUS, it backs up and uncompiles the LIT9, compiles LIT9_DX instead, then compiles FAST_PLUS rather than PLUS.

VFX has an "analytic compiler." It compiles everything into a data-structure, then analyzes the data-structure and generates code from that in a second pass. It likely makes several passes over the data-structure. TOYF doesn't need an analytic-compiler. We only have one free register, which is DX. A simple single-pass compiler with peephole optimization should be adequate to generate reasonably good code.

I have never written an analytic-compiler. I have read about this and I think I know the basic idea, but don't have any experience. I don't want to delve into figuring out an analytic-compiler right now. The TOYF is a "Toy Forth," so I'm planning on a pretty simple straightforward compiler. If I ever decide to write a Forth compiler for the ARM Cortex or the dsPIC I will have to write an analytic-compiler in order to take advantage of all those registers. That would be a lot of work! I am avoiding all that work by inventing my own processor that doesn't require me to learn anything new (the TOYF is pretty similar to the MiniForth/RACE that I have experience on).

thanks for your interest --- Hugh
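
The back-up-and-recompile step described above can be sketched in a few lines. This is only an illustration of single-pass peephole optimization, not the planned TOYF compiler: the rule table, the function names, and the idea of holding compiled code as a list of primitive names are all assumptions. Only the LIT9/PLUS rewrite itself comes from the post.

```python
# Hypothetical rewrite table: (previous primitive, incoming primitive)
# maps to the pair that replaces them. Only this one rule is from the post.
PEEPHOLE_RULES = {
    ("lit9", "plus"): ("lit9_dx", "fast_plus"),
}


def compile_word(code, primitive):
    """Append one primitive, rewriting the previous one if a rule matches."""
    if code:
        rule = PEEPHOLE_RULES.get((code[-1], primitive))
        if rule is not None:
            code.pop()         # "uncompile" the previous primitive
            code.extend(rule)  # compile the DX-based pair instead
            return
    code.append(primitive)


def compile_definition(primitives):
    """Single pass over the source; looks back at most one primitive."""
    code = []
    for primitive in primitives:
        compile_word(code, primitive)
    return code


print(compile_definition(["lit9", "plus", "exit"]))
# prints ['lit9_dx', 'fast_plus', 'exit']
```

Because the compiler only ever inspects the last primitive it emitted, it stays single-pass; supporting more DX-based primitives would mean adding entries to the table rather than adding analysis passes.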
|
Mon Oct 30, 2017 5:13 pm |
|
|
Hugh Aguilar
Joined: Sun Jul 23, 2017 1:06 am Posts: 93
|
What is a "video generator"? Is that something similar to the VIC-II chip used in the venerable C64? You get double-buffered screens and sprites?

I wonder if my TOYF could become a game machine --- that would fit in well with the "Toy Forth" name. The major weakness of the TOYF is lack of interrupts. The POLL code is executed every time that POL is used to end a primitive (rather than NXT that just ends the primitive but doesn't poll the I/O). A game machine doesn't have a lot of I/O though.
1.) It needs to watch the clock so it can run the game at a smooth speed. This is pretty pedestrian though --- a heartbeat of maybe 100 milliseconds --- changing the screen much faster than this will just be a blur for the human viewer.
2.) It needs to poll the input device, which is likely a joystick. This is pretty pedestrian though --- the human can't change directions very quickly.
So, the lack of fast I/O support shouldn't be a problem.

edit: fix typo
|
Wed Nov 01, 2017 2:01 am |
|
|
Hugh Aguilar
Joined: Sun Jul 23, 2017 1:06 am Posts: 93
|
Hugh Aguilar wrote:
I wonder if my TOYF could become a game machine --- that would fit in well with the "Toy Forth" name. The major weakness of the TOYF is lack of interrupts. The POLL code is executed every time that POL is used to end a primitive (rather than NXT that just ends the primitive but doesn't poll the I/O). A game machine doesn't have a lot of I/O though. 1.) It needs to watch the clock so it can run the game at a smooth speed. This is pretty pedestrian though --- a heartbeat of maybe 100 milliseconds --- changing the screen much faster than this will just be a blur for the human viewer. 2.) It needs to poll the input device, which is likely a joystick. This is pretty pedestrian though --- the human can't change directions very quickly. So, the lack of fast I/O support shouldn't be a problem.

Actually, the TOYF won't work well in a game machine. We need high-speed interrupts for playing music. This would only be realistic if the 65ISR-chico was used as a coprocessor. The TOYF main-program could upload a music score (described in some easy-to-interpret code) to the coprocessor and it would actually play the music.

Anyway, I have a new design. I added support for division. I also added support for linked-lists and wrote a lot of code to support linked-lists --- this looks like it should be efficient --- I used linked-lists as my standard data-structure in the novice package and intend to do so here also.

I can really start on the assembler/simulator now --- I don't think there is any more that can be done to the design.

thanks for your interest --- Hugh
|
Wed Nov 08, 2017 2:07 am |
|
|
Hugh Aguilar
Joined: Sun Jul 23, 2017 1:06 am Posts: 93
|
Hugh Aguilar wrote:
Anyway, I have a new design.

I have a very minor upgrade. I just got rid of the ISR instruction and added the CNX instruction --- this saves one clock cycle inside of the POLL code.
|
Sat Dec 16, 2017 3:43 am |
|
|
Hugh Aguilar
Joined: Sun Jul 23, 2017 1:06 am Posts: 93
|
I have another upgrade (attached).
I added an EX register, and I also upgraded my 16x16 multiplication to generate a 32-bit product --- previously my 16x16 multiplication only generated a 16-bit product. I had not done this in the past because I was concerned that adding another register would cause the TOYF to require a bigger and more expensive FPGA. Now I decided to do this anyway. The full multiplication is pretty useful.
I still have very limited support for division. I can divide a 16-bit numerator by an 8-bit denominator for a 16-bit quotient and 8-bit remainder. This is adequate for converting 16-bit numbers into ASCII strings for display. I don't think division is common enough in most micro-controller applications that I want to provide hardware support for it.
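
To show why a 16-bit by 8-bit divide is enough for display conversion, here is a minimal sketch that models such a divide and uses it to produce digit strings. The function names are hypothetical, the numbers are treated as unsigned (an assumption), and the model says nothing about how the TOYF implements the divide internally.

```python
def div16_by_8(numerator, denominator):
    """Model a 16-bit / 8-bit unsigned divide.

    Returns (16-bit quotient, 8-bit remainder).
    """
    if not 0 <= numerator <= 0xFFFF:
        raise ValueError("numerator must fit in 16 bits")
    if not 1 <= denominator <= 0xFF:
        raise ValueError("denominator must be a non-zero 8-bit value")
    return numerator // denominator, numerator % denominator


def u16_to_ascii(value, base=10):
    """Convert an unsigned 16-bit number to text by repeated division."""
    if not 2 <= base <= 16:
        raise ValueError("base must be between 2 and 16")
    digits = "0123456789ABCDEF"
    out = []
    while True:
        value, remainder = div16_by_8(value, base)
        out.append(digits[remainder])  # least significant digit first
        if value == 0:
            break
    return "".join(reversed(out))


print(u16_to_ascii(65535))      # prints 65535
print(u16_to_ascii(255, 16))    # prints FF
```

The divisor is always the number base, which fits comfortably in 8 bits, and each remainder is a single digit, so the wider 32-bit numerator is not needed for this job.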
My goal is not to design the most powerful processor that I can --- my goal is to support motion-control (the PID algorithm) and keep the cost down as much as possible --- also, by making it a Forth processor it is more fun!
|
Thu Jan 11, 2018 9:26 pm |
|
|
Hugh Aguilar
Joined: Sun Jul 23, 2017 1:06 am Posts: 93
|
Hugh Aguilar wrote:
My goal is not to design the most powerful processor that I can --- my goal is to support motion-control (the PID algorithm) and keep the cost down as much as possible --- also, by making it a Forth processor it is more fun!

I have yet another upgrade (attached).

I provided more support for 32-bit arithmetic. I have instructions for shifting down the product --- this divides the product by unity assuming that unity is a power of 2 --- this makes the multiplication more useful. I also have support for adding and subtracting double-precision numbers --- still pretty slow though --- realistically, if anybody needs 32-bit numbers, they should just use the ARM Cortex that has 32-bit registers.

I moved the stacks down to the bottom of zero-page. This is useful in an FPGA that doesn't have 256 words (512 bytes) of RAM on-board --- if the FPGA has 128 words (256 bytes), the stacks will still be in internal RAM --- I want the TOYF to be reasonably efficient on very small inexpensive FPGA chips (it is likely that low cost will be the only advantage it has over other designs).

I added better support for working with byte data --- this could speed up string handling. I changed how literal values are loaded into AX. Now fewer instructions are needed. I purposely left some instructions undefined so they can be used for application-specific purposes.

I still don't have support for division with a 32-bit numerator --- this is going to be a big target for criticism --- I don't think division is very useful in micro-controllers though, so I'm not going to worry about it.

thanks for your interest --- Hugh
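
The "shift down the product" idea is ordinary fixed-point scaling, which a small sketch can make concrete. The choice of 2**8 for unity, the unsigned arithmetic, and the names below are assumptions for illustration only; the post does not say which power of 2 is used or how signed values are handled.

```python
UNITY_SHIFT = 8            # assumption: unity = 2**8 = 256
UNITY = 1 << UNITY_SHIFT   # the fixed-point representation of 1.0


def fixed_mul(a, b):
    """Multiply two unsigned 16-bit fixed-point values.

    The full 32-bit product is formed first, then shifted down by
    UNITY_SHIFT, which divides it by unity. The result is truncated
    to 16 bits.
    """
    product32 = (a & 0xFFFF) * (b & 0xFFFF)
    return (product32 >> UNITY_SHIFT) & 0xFFFF


three = 3 * UNITY   # 3.0 in fixed point
half = UNITY // 2   # 0.5 in fixed point
print(fixed_mul(three, half) / UNITY)  # prints 1.5
```

Without the 32-bit intermediate product the high bits would be lost before the shift, which is why the full 16x16 multiply and the shift-down instructions are useful together for PID-style scaling.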
|
Sun Jan 21, 2018 4:54 am |
|
|
Hugh Aguilar
Joined: Sun Jul 23, 2017 1:06 am Posts: 93
|
BigEd wrote:
Quote: I don't actually know what an LE is, except that it is a measure of FPGA usage. I have also heard the term LUT used as a measure of FPGA usage, but I don't know what that means exactly either.
To within a small factor they are comparable. See here and links within. In general all you care about is whether you fit on the FPGA you have, or which FPGA to choose. Any reasonable CPU, I would think, should fit on any reasonable FPGA these days.
Quote: Can anyone make an estimate as to how many LEs the TOYF would require?
It will very much depend on how the HDL is coded. The various HDLs for 6502 have resulted in implementations ranging from 500 to 3000 LUTs. There's a good chance your CPU could fit in the same range. But I think there's no HDL, or block diagram, for it?

I don't know anything about any HDL, and I don't know how to make a block diagram (although I have at least seen them, so I know the basic idea).

I have a new version of the TOYF that is about twice the efficiency of the previous version attached to the previous post. The NEXT code is one clock cycle faster. The EXIT code is faster. Quotation calls are now the same speed as function word calls (this required the introduction of a new 5-bit register). I have 32/16 division now, and multiplication is faster. Linked lists are faster and cleaner as EX is used as the current node pointer. Copying blocks of data is faster. I now have logic instructions (and, ior, xor) that work with the CF and AX. There are many improvements throughout. Everything is faster!

Despite doubling the efficiency, I have reduced the complexity. I now have two group-A and two group-B instructions undefined that could be used for application-specific purposes. Also, group-M no longer needs an ALU, which should reduce the FPGA resource-usage somewhat. Group-A needs a 32-bit ALU and group-B needs a 16-bit ALU.

You described a range of 500 to 3000 LUTs for the 6502, which is huge --- saying that the TOYF would be comparable doesn't really tell me anything. Also, my TOYF is very different from the 6502. The TOYF is 16-bit and the 6502 is 8-bit. The 6502 is a CISC with all instructions taking multiple clock cycles. The TOYF has every instruction taking exactly one clock-cycle, and it has up to three instructions executing concurrently.

Are there any other FPGA processors discussed on this forum that are similar to the TOYF? Are there any designs on http://www.opencores.org or anywhere else with open HDL source-code that are similar to the TOYF? If I had something similar, I might be able to use that HDL as a starting-point.

I actually like the 6502 --- my 65ISR design is derived from the 6502 --- but the TOYF is different in every way. This J68 is more like the 6502 than the TOYF --- both the 68000 and the 6502 are from the 1980s --- the TOYF is derived from the MiniForth that came out in 1995 (I wrote the development system for the MiniForth).
|
Thu Mar 01, 2018 7:49 am |
|
|
BigEd
Joined: Wed Jan 09, 2013 6:54 pm Posts: 1799
|
Thanks for reposting over here! (It might be worth editing down the other one to just the j68 bit at the end.)
Indeed, 500 to 3000 LUTs is a huge range - but it's what actually happened, with different implementations of the same CPU, so it demonstrates a point, which is that a CPU doesn't have a well-defined size. Another point is that even 3000 LUTs is not large these days: even the cheap 4k Lattice FPGAs might be big enough for that.
There's no easy answer, until someone writes some HDL (or, arguably, expresses the design in a schematic, in an FPGA IDE.)
Have you written an emulator for TOYF? Have you tabulated what happens on a per-clock cycle basis? It feels to me you need some degree of refinement, beyond a description of the instruction set and register file, to get a handle on how complex the CPU actually would be.
Edit: I see now you mention that every instruction is a single clock cycle. I'm not sure any other CPU manages that - how do branches work? If it's a pipelined machine, what does the pipeline look like? I did very quickly skim your text file yesterday, but I confess I'm not likely to put a lot of effort in to study the machine, at least without some more clues about what's going on. Others might, of course: you will have an audience here, even if it's mostly silent.
|
Thu Mar 01, 2018 7:58 am |
|
|
Hugh Aguilar
Joined: Sun Jul 23, 2017 1:06 am Posts: 93
|
BigEd wrote:
Have you written an emulator for TOYF? Have you tabulated what happens on a per-clock cycle basis? It feels to me you need some degree of refinement, beyond a description of the instruction set and register file, to get a handle on how complex the CPU actually would be.

No, but I need to get started on the development software. I think my design is pretty much settled (I said that before though, and since then I have more than doubled the efficiency). Most of the development software would be ported from MFX (my assembler/simulator/compiler for the MiniForth). MFX was a single-pass assembler --- if I upgrade to a multi-pass assembler I will get better packing of the instructions into the opcodes, but the assembler will be more complicated.

BigEd wrote:
Edit: I see now you mention that every instruction is a single clock cycle. I'm not sure any other CPU manages that - how do branches work?

The only way to change the PC is to do the NEXT code at the end of each primitive. This depends on IP --- normally IP points to the next cell in the Forth threaded-code, so execution proceeds sequentially through a Forth function --- IP can be changed inside of a primitive though, so a branch will be made to somewhere else. BRANCH and 0BRANCH are simple examples of this. We also have all the linked-list code that traverses a linked-list using primitives that modify IP internally.

BigEd wrote:
If it's a pipelined machine, what does the pipeline look like?

The TOYF is not pipe-lined. It is Harvard Architecture, so it can obtain the next opcode from code-memory even if the current opcode is accessing data-memory, because code-memory and data-memory are totally separate and work in parallel.
My assumption here is that the FPGA has a lot of pins and a lot of connectivity inside, so it can have two data-buses and two address-buses --- so a Harvard Architecture is possible --- this assumption was true on the Lattice isp1048 PLD that the MiniForth was implemented on in 1994, so I would expect it to still be true with the modern FPGAs. With Harvard Architecture you get good speed, but you avoid all of the complexity of a pipe-lined system --- for this to work though, you need a lot of pins and a lot of connectivity (that is why 1980s processors weren't Harvard Architecture). I don't know of any reason why von Neumann Architecture would be used, except for a shortage of pins on the chip --- von Neumann Architecture is a kludge for the 1980s chips that had a shortage of pins --- Harvard Architecture is both more efficient and less complicated.
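
The way branches fall out of NEXT and IP, as described in this post, can be sketched with a tiny threaded-code interpreter. This is a generic model of Forth threaded code, not the TOYF instruction set: the `VM` class, the primitive names, and the use of absolute cell indexes as branch targets are all assumptions for illustration.

```python
class VM:
    """Toy threaded-code machine: cells are primitives or inline operands."""

    def __init__(self, program):
        self.program = program
        self.ip = 0          # index of the next cell of threaded code
        self.stack = []
        self.running = True

    def fetch(self):
        """Read the cell IP points at, then advance IP."""
        cell = self.program[self.ip]
        self.ip += 1
        return cell

    def run(self):
        while self.running:
            primitive = self.fetch()  # NEXT: the only place control transfers
            primitive(self)
        return self.stack


def lit(vm):
    vm.stack.append(vm.fetch())       # inline operand follows the primitive


def dup(vm):
    vm.stack.append(vm.stack[-1])


def dec(vm):
    vm.stack[-1] -= 1


def branch(vm):
    vm.ip = vm.fetch()                # unconditional: overwrite IP


def zero_branch(vm):
    target = vm.fetch()
    if vm.stack.pop() == 0:           # branch only when the popped value is 0
        vm.ip = target


def halt(vm):
    vm.running = False


program = [
    lit, 3,           # cells 0-1: push the loop count
    dec,              # cell 2: loop body, count down by one
    dup,              # cell 3: copy the count for the test
    zero_branch, 8,   # cells 4-5: leave the loop when the count is 0
    branch, 2,        # cells 6-7: otherwise go around again
    halt,             # cell 8
]
result = VM(program).run()
print(result)  # prints [0]
```

No primitive jumps anywhere by itself; `branch` and `zero_branch` only change `ip`, and the next NEXT picks up execution from the new location.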
|
Thu Mar 01, 2018 8:37 am |
|
|
BigEd
Joined: Wed Jan 09, 2013 6:54 pm Posts: 1799
|
Indeed, you get a lot of connectivity inside an FPGA. You also get RAM - 64k bytes is not uncommon - and for some purposes on-chip RAM might be enough.
|
Thu Mar 01, 2018 9:37 am |
|
|
Hugh Aguilar
Joined: Sun Jul 23, 2017 1:06 am Posts: 93
|
BigEd wrote:
Indeed, you get a lot of connectivity inside an FPGA. You also get RAM - 64k bytes is not uncommon - and for some purposes on-chip RAM might be enough.

Well, I would expect to put code-memory inside the FPGA for speed --- that would be non-volatile though --- the TOYF can address up to 32KW (64KB) of code-memory.

As for data-memory, the TOYF can address up to 64KW (128KB). Only 32KW can be used for Forth threaded-code, and it can be non-volatile, although RAM would be better because RAM would allow traditional interactive development which is one of Forth's best features. The entire 64KW can be used for data, and this would mostly be RAM, although some can be non-volatile such as for look-up tables. The very minimum RAM needed is 128 words --- the data-stack and return-stack are each 32 words and are in the lower 128 words of data-memory, which leaves room for I/O ports and some global variables.

I want the TOYF to require only a very inexpensive FPGA chip --- as I said before, cost is often the only criterion that people have for choosing a processor --- the TOYF does need to be powerful though, so it can out-perform the myriad low-cost processors available. More powerful than the MSP430 and less expensive than the ARM Cortex --- plus, being an FPGA it can be customized with application-specific instructions (I have two group-A and two group-B instructions purposely left undefined).
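
The memory figures above can be cross-checked with a little arithmetic. The address widths below are inferred from the stated sizes (32KW suggests 15 address bits, 64KW suggests 16) and are an assumption, as are the constant names; the post gives only the totals and the stack sizes.

```python
BYTES_PER_WORD = 2        # the TOYF is a 16-bit design
CODE_ADDRESS_BITS = 15    # assumption inferred from "32KW" of code-memory
DATA_ADDRESS_BITS = 16    # assumption inferred from "64KW" of data-memory

code_words = 1 << CODE_ADDRESS_BITS
data_words = 1 << DATA_ADDRESS_BITS
code_bytes = code_words * BYTES_PER_WORD
data_bytes = data_words * BYTES_PER_WORD

MIN_RAM_WORDS = 128       # smallest useful internal RAM, per the post
DATA_STACK_WORDS = 32
RETURN_STACK_WORDS = 32

# Words left over for I/O ports and global variables in the minimum setup.
free_words = MIN_RAM_WORDS - DATA_STACK_WORDS - RETURN_STACK_WORDS

print(code_words, code_bytes)   # prints 32768 65536
print(data_words, data_bytes)   # prints 65536 131072
print(free_words)               # prints 64
```

In the minimum configuration the two stacks take half of the 128 words, leaving 64 words for I/O ports and globals.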
|
Fri Mar 02, 2018 12:07 am |
|
|
BigEd
Joined: Wed Jan 09, 2013 6:54 pm Posts: 1799
|
See this Survey of FPGA dev boards for boards from $50 to $200. The chips themselves are of course less: Xilinx's LX9 has 64k on board and is about £14.
|
Fri Mar 02, 2018 8:22 am |
|