TOYF (Toy Forth) processor
Hugh Aguilar
Joined: Sun Jul 23, 2017 1:06 am Posts: 93
robfinch wrote: I don't think it's necessarily the parallel processing, but rather the logic complexity that causes problems.
It seems to me that parallel processing and logic complexity are pretty much the same thing. As the TOYF design stands, a lot of registers connect to other registers --- it is not quite a matter of "everything connecting to everything," because some registers aren't used very much --- but many registers do connect to many other registers, in the sense that there is an instruction to move data from one to the other (often with some operation done on the way). Hypothetically, let's say I had a design in which AX connected to all the other registers, but none of those registers connected to each other, so everything goes through AX. In this case, the instructions would never parallelize. Even though you can have up to 3 instructions per opcode all executing concurrently, dependencies would always force the instructions to be done sequentially --- you would always have 1 instruction per opcode --- the whole point of having a VLIW would be defeated! Connectivity makes parallelization possible.
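To make the connectivity argument concrete, here is a small Python sketch of the hazard check a VLIW instruction-packer has to make. The register names and the (dests, srcs) instruction format are hypothetical illustrations, not the actual TOYF encoding: instructions can share an opcode only if no slot writes a register that another slot reads or writes.

```python
def can_pack(slots):
    """slots: list of (dests, srcs) pairs, each a set of register names.
    Instructions can issue in the same VLIW opcode only if no slot
    writes a register that a different slot reads or writes."""
    for i, (dests_i, _) in enumerate(slots):
        for j, (dests_j, srcs_j) in enumerate(slots):
            if i != j and dests_i & (dests_j | srcs_j):
                return False  # write/write or write-then-read hazard
    return True

# Independent moves parallelize; everything-through-AX code cannot:
print(can_pack([({"AX"}, {"BX"}), ({"CX"}, {"DX"})]))  # True
print(can_pack([({"AX"}, {"BX"}), ({"CX"}, {"AX"})]))  # False: CX<-AX depends on AX<-BX
```

With a hub-and-spoke design where every value passes through AX, every pair of instructions trips this check, so the packer degenerates to one instruction per opcode.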
Wed Oct 31, 2018 4:59 am
BigEd
Joined: Wed Jan 09, 2013 6:54 pm Posts: 1799
It's not quantity, it's complexity, which is the question. I've no idea whether or not you'll have trouble fitting your design: until you code it up and synthesise it, no-one can know. You can get some idea of complexity by drawing out the necessary connections. There's a world of difference between 4 registers and a single 4-register file. That's one reason why Arlet's core is by far the smallest 6502 implementation.

But when you say "Some of the registers are not used very much" I feel you haven't yet got the idea. It's not about how often something is used, it's about how the data will travel when it is used. It feels like you are tabulating the state but not (yet) investigating the connectivity between bits of state.

(Here's the MiniForth aka RACE machine. Quote: "The RACE is a 16 bit RISC processor that executes code at 25 MIPs. Most Forth primitives take from 4 to 8 cycles, so Forth runs at about 4 MIPs. Code operators were devised so they could be combined to build efficient Forth primitives, and make best use of the PLD's limited resources. This meant some things were done in unconventional ways. Functions like anding, counting, and shifting are easily done in one cycle, but arithmetic functions had to be broken into two parts. In the first part the operands are half-added using an XOR instruction that takes one cycle. In the second part a special instruction is executed four times to propagate the carry through all 16 bits." We see there were some microarchitectural compromises to get this to fit on the PLD.)
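The RACE's two-part addition can be modelled in a few lines of Python. This is only a sketch of the technique (XOR half-add, then iterated carry propagation), not the actual PLD logic, which propagated the carry in four fixed steps of 4 bits each:

```python
def half_add(a, b):
    # XOR is the bitwise sum without carries; AND picks out the carry bits
    return a ^ b, (a & b) << 1

def add16(a, b):
    s, c = half_add(a, b)
    while c & 0xFFFF:              # keep folding the carries back in
        s, c = half_add(s, c & 0xFFFF)
    return s & 0xFFFF

print(hex(add16(0x7FFF, 0x0001)))  # 0x8000
```

In the worst case (a carry rippling across all 16 bits, as above) the loop runs many times; the RACE's special instruction propagated the carry faster, needing only four executions for any 16-bit sum.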
Wed Oct 31, 2018 8:38 am
Hugh Aguilar
Joined: Sun Jul 23, 2017 1:06 am Posts: 93
BigEd wrote: It's not quantity, it's complexity, which is the question. [...] It feels like you are tabulating the state but not (yet) investigating the connectivity between bits of state. (Here's the MiniForth aka RACE machine. [...]
We see there were some microarchitectural compromises to get this to fit on the PLD.)

When I said, "Some of the registers are not used very much," I meant that not many instructions access those registers, and they aren't directly connected to many other registers (there is no MOV to or from the other registers). I didn't mean "how often" they are used, in the sense of how often those few instructions would execute during the run-time of typical programs.

I'm not really clear on how to investigate the connectivity between bits of state. I also still don't know what a "block" is. Can you show me how this was done for other processors?

I think I know what you mean by: "There's a world of difference between 4 registers and a single 4-register file." Over on the Parallax forum they will proudly tell you that the P2 has 512 registers. Realistically, it has a 512-word internal RAM (a "register file") with easy access, similar to the zero-page on the 65c02, for which a one-byte operand is needed rather than the usual two-byte operand. Most likely the P2 has only 2 or 3 actual registers that it uses for everything --- I don't know anything about the P2 internal workings --- and this is likely proprietary anyway, so nobody can examine it. A register file like this isn't necessarily a bad thing. The P2 enthusiasts say that having 512 "registers" makes assembly-language easy --- this is likely true --- the assembly-language programmer is saved from moving data into and out of a handful of registers, as is usually done. The TOYF has several goals in mind, but making the assembly-language programmer's life easy isn't one of them.

This is a rough estimate of difficulty for 16-bit processors: the MiniForth is about 4x as difficult as the TOYF, which is about 4x as difficult as the MSP430, which is about 4x as difficult as the MC68000. The MC68000 had 32-bit registers, so it was as easy as a 32-bit processor even though it was actually 16-bit hardware.
I'm pretty sure the MC68000 and Z80 both had a register file and were both micro-coded --- by comparison, the 6502 was hard-wired, which is why it had only a very few 8-bit registers. Does the MSP430 have a register file?

I could get rid of the HF register. I previously stored the LF on the return-stack during the execution of a quotation. I added HF to hold the LF because this is faster than accessing memory, and I want quotations to be fast.

The EX register was provided to make multiplication and division fast, and it is also used to make linked-list traversal fast. Before I had EX, my multiplication had a 16-bit product rather than a 32-bit product, which may be adequate for the PID algorithm, but is pretty hokey. Also, the linked-list traversal held the node pointer on the return-stack, which involved slow memory access. I'd be pretty dubious of getting rid of the EX register.

The IV register was added in this latest release to support IRQs (IV is a pointer to the ISR). The IRQs aren't very good though, because they just interrupt the EXIT primitive, so they may have a lot of latency. I said in the document that IV is 16-bit. This isn't actually true. It is only 2 bits: 1111,1111,xx00,0000 This allows for 4 IRQs.

Edit: fixed a mistake in describing IV
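Reading the bit pattern 1111,1111,xx00,0000 literally, the four possible ISR entry points can be enumerated as follows. This is only a sketch of one reading of the description (the xx field selecting among four 64-word-aligned vectors near the top of memory), not anything stated in the TOYF document itself:

```python
def isr_address(irq):
    """IV is effectively 2 bits wide (the 'xx' field); the other
    14 bits are the constants 1111,1111,__00,0000."""
    assert 0 <= irq <= 3
    return 0xFF00 | (irq << 6)

print([hex(isr_address(n)) for n in range(4)])
# ['0xff00', '0xff40', '0xff80', '0xffc0']
```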
Thu Nov 01, 2018 2:13 am
BigEd
Joined: Wed Jan 09, 2013 6:54 pm Posts: 1799
BTW, I'm not saying you have too many registers, or that your architecture will be difficult to implement. I'm saying it's very difficult to evaluate that, given only descriptions of the behaviour.

For a 6502 block diagram, see for example Ruud's page: http://www.baltissen.org/newhtm/ttl6502.htm We see there that there are three principal busses in the CPU, and they don't all connect to everything. We see for example that S is connected differently from X and Y. Indeed, S can be read and re-written in the same cycle, whereas X and Y cannot: it has two ports. We see that the ALU can only write to A.

You're right that the Z80 has a register file - it's actually split, so PC and R+I have different connectivity to the main section. There are a couple of useful diagrams and a lot of information on Ken Shirriff's site here: http://www.righto.com/2014/10/how-z80s- ... -down.html
Thu Nov 01, 2018 7:36 am
Tor
Joined: Tue Jan 15, 2013 10:11 am Posts: 114 Location: Norway/Japan
Hugh Aguilar wrote: Most likely the P2 has only 2 or 3 actual registers that it uses for everything --- I don't know anything about the P2 internal workings --- this is likely proprietary anyway, so nobody can examine it.

The P1 - which is similar to the P2 when it comes to "cog" memory (512 32-bit longs, aka "registers", or simply 2KB of local memory) - was released by Parallax as open source (after re-implementation in Verilog) a couple of years back. The P2 design itself has been an open process; Chip (the designer) has been discussing the internal design of the P2 with forum members all the way, in extreme detail.

Update: links, as per BigEd's request:
https://www.parallax.com/microcontrolle ... pen-source
https://github.com/JacGoudsmit/P1V (there are various user-maintained repos - Jac tries to keep a 'merged' version which supports various FPGA boards)

For the P2 development there's too much to list - the P2 forum on forums.parallax has a huge number of threads. There's e.g. a single thread about FPGA test versions of the P2; that thread is 150 pages alone.
Thu Nov 01, 2018 1:14 pm
BigEd
Joined: Wed Jan 09, 2013 6:54 pm Posts: 1799
So, drifting a little off topic here, but I think it's interesting and hopefully useful. The Parallax P1 is an 8-core machine where each core has a 32-bit CPU and a 512-word memory, and there's a larger shared memory accessed in round-robin fashion. Distinct instructions are used to access local memory or global memory. There are no named registers: all data operations are on memory or on memory-mapped I/O. So, we read here: Quote: "There are two schools of thinking: One (that's me and the Data Sheet!) says: there are 512 registers in a COG. The other school (that's the rest of the world) says: There are no registers at all in a COG, except 16 I/O registers memory mapped to addresses 496 till 511."
and this seems reasonable to me. Don't get caught up in language; just see that there are no conventional programmer-visible registers and the closest, fastest addressable state is in the local memory.

I'm reminded of the very early machines, where there'd usually be an accumulator, but there might also be an identified Memory Address register and a Memory Buffer register. These are still found in conventional CPUs, but we no longer name them and no longer regard them as part of the programmer's model. We name the registers you can see, not the ones you can't. There is a difference these days between architecture and microarchitecture. (As an example, the 6800 has a temporary register, and the Z80 has the barely-accessible WZ register pair. The 6502, at the microarchitecture level, has the Internal Data Latch, the Data Output Register, the AI and BI input registers to the ALU, the ADD adder hold register, and the ABL and ABH address bus registers - none of them programmer-visible and only of interest to the implementer.)
Thu Nov 01, 2018 4:19 pm
Hugh Aguilar
Joined: Sun Jul 23, 2017 1:06 am Posts: 93
BigEd wrote: I'm reminded of the very early machines, where there'd usually be an accumulator, but there might also be an identified Memory Address register and a Memory Buffer register. These are still found in conventional CPUs, but we no longer name them and no longer regard them as part of the programmer's model. We name the registers you can see, not the ones you can't. There is a difference these days between architecture and microarchitecture. (As an example, the 6800 has a temporary register, and the Z80 has the barely-accessible WZ register pair. The 6502, at the microarchitecture level, has the Internal Data Latch, the Data Output Register, the AI and BI input registers to the ALU, the ADD adder hold register, and the ABL and ABH address bus registers - none of them programmer-visible and only of interest to the implementer.)
I have exposed MA to the programmer. In most processors, any "address register" can be used as an indirect pointer to memory. For example, in the 8086, BX, BP, etc. can all be used as the base-register for an indirect load. In the TOYF, only MA can be used as an address. LDR LX is used to load through (MA). We also have STR AX etc. that store through (MA).

I don't need anything like the Z80's WZ register-pair. On the Z80, addresses were 16-bit but the data-bus was 8-bit, so a temporary register was needed. The example given was JMP, which needed to load the destination-address into WZ and then move WZ to PC, because PC can't be loaded one byte at a time. In the TOYF, all the registers are 16-bit. In some cases you have smaller sizes, such as 5-bit, but all the other bits are constants, so it acts like a 16-bit register.

I have two ALUs, for the A processor and the B processor (the M processor doesn't need one). The A processor's ALU uses either AX or CX:AX as the destination. Various other registers are used as the source. The B processor's ALU uses BX as the destination, and uses CF to catch the carry. Various other registers are used as the source. This might be a problem. I might need to always use the same register for the source. If so, this will complicate the TOYF significantly.
Fri Nov 02, 2018 4:12 am
Hugh Aguilar
Joined: Sun Jul 23, 2017 1:06 am Posts: 93
Hugh Aguilar wrote: I have two ALUs, for the A processor and the B processor (the M processor doesn't need one). [...] This might be a problem. I might need to always use the same register for the source. If so, this will complicate the TOYF significantly.

I was studying my design last night. I think I could get rid of the need for the ALU in the B processor. It is primarily provided to allow ADC and SBC to be in group B, but these are mostly for D+ and D-, which aren't commonly needed. If I did this, then the only ALU would be in the A processor. That should simplify the implementation somewhat. Performance will be somewhat degraded, but not by a lot.

Is it true that upgrading from a 16-bit opcode to an 18-bit opcode is easy on an FPGA? If the code-memory is internal to the FPGA, then 18-bit should be no problem. How much code-memory is realistic? The design addresses up to 32KW. FPGAs don't actually have this much internal memory though, do they?
Sat Nov 03, 2018 8:38 pm
BigEd
Joined: Wed Jan 09, 2013 6:54 pm Posts: 1799
One of the (reasonable cost) FPGAs I've seen has 64kbyte, so yes, that would give you 32kW at 18 bits.
Sat Nov 03, 2018 8:50 pm
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2187 Location: Canada
Quote: How much code-memory is realistic? The design addresses up to 32KW.

Most recent FPGAs have at least that amount of memory. For instance, the XC7A35 (the second smallest part in the series) has 225kB of block RAM. More memory and logic resource can be had with larger FPGAs.

Quote: Is it true that upgrading from a 16-bit opcode to an 18-bit opcode is easy on an FPGA?

One thing to keep in mind is the software and tools that are used. It looks easy to upgrade to wide opcodes, but then the software has to take that into consideration. I've tried a couple of oddball-sized opcodes, 36 bits for instance, and there were a lot of issues with support tools. Sticking to a multiple of a byte will make things easier.
Robert Finch http://www.finitron.ca
Sat Nov 03, 2018 9:10 pm
Hugh Aguilar
Joined: Sun Jul 23, 2017 1:06 am Posts: 93
robfinch wrote: Quote: How much code-memory is realistic? The design addresses up to 32KW. Most recent FPGAs have at least that amount of memory. For instance, the XC7A35 (the second smallest part in the series) has 225kB of block RAM. More memory and logic resource can be had with larger FPGAs.

Code-memory has to be non-volatile. I was expecting something like an EPROM that can only be programmed externally, which is what the MiniForth had. Something like FLASH that can be programmed internally would not be useful. I don't have any instructions that can write to code-memory. I doubt that this is even possible in a Harvard-architecture system, because the code-memory address-bus and data-bus are busy every cycle reading in the next opcode. I was expecting data-memory to be RAM, but something like FLASH would be interesting, as this would allow persistence of data from one run to the next.

It would be pretty cool to have both code-memory and data-memory inside of the FPGA --- that would leave a lot of pins available outside of the chip for use as I/O ports. I remember that the Lattice isp1048 PLD had a lot of external pins, which is why it could have both code-memory and data-memory external. The MiniForth fit in the isp1032, but there weren't enough pins left over for I/O ports, so it would have just been an academic exercise with no practical purpose.

robfinch wrote: Quote: Is it true that upgrading from a 16-bit opcode to an 18-bit opcode is easy on an FPGA? One thing to keep in mind is the software and tools that are used. It looks easy to upgrade to wide opcodes, but then the software has to take that into consideration. I've tried a couple of oddball-sized opcodes, 36 bits for instance, and there were a lot of issues with support tools. Sticking to a multiple of a byte will make things easier.

There are no support tools available for a VLIW! I have to write everything myself. This should not be a problem --- been there, done that, with the MFX --- my TOYF assembler should be pretty similar. I described the algorithm for generating out-of-order machine-code in my document.

I didn't get my question answered: How much of a problem is it to have two ALUs? Is it worthwhile to get rid of the ALU in the B processor? This is possible, but some operations would be somewhat less efficient.
Sun Nov 04, 2018 3:49 am
BigEd
Joined: Wed Jan 09, 2013 6:54 pm Posts: 1799
You can use on-chip RAM, initialised from EEPROM at power-up and with no write mechanism, so it acts like ROM.
An ALU in itself probably isn't very large (unless it has a barrel shifter or multiply capability) - you need to be thinking about interconnectivity, either by drawing a picture or by trialling an implementation. The level of abstraction of the behaviour of the machine is a long way removed from the level of abstraction where routing congestion happens.
Sun Nov 04, 2018 8:48 am
Hugh Aguilar
Joined: Sun Jul 23, 2017 1:06 am Posts: 93
BigEd wrote: You can use on-chip RAM, initialised from EEPROM at power-up and with no write mechanism, so it acts like ROM.
An ALU in itself probably isn't very large (unless it has a barrel shifter or multiply capability) - you need to be thinking about interconnectivity, either by drawing a picture or by trialling an implementation. The level of abstraction of the behaviour of the machine is a long way removed from the level of abstraction where routing congestion happens.
I don't have either a barrel-shifter or a multiply. For multiplication I have instructions to support a partial multiplication, and these get done iteratively, 16 times. This is pretty fast, as the instructions parallelize, so it is one clock-cycle per iteration. The same is true of division: I have instructions to support a partial division that gets done repeatedly.

On the PLD back in 1994, registers were expensive. This is why the MiniForth had so few registers. Nowadays it seems that registers aren't so expensive anymore.

One way to reduce the connectivity would be to add another register. I currently have DX, which is used for multiple purposes (it is a general-purpose register). I could add another register and dedicate it to one of the purposes that DX is currently used for. This would result in less "routing congestion" around DX. This might result in a faster implementation too.

I could upgrade to an 18-bit opcode, giving myself a new field of 2 bits. This field would have 4 instructions (one of them has to be NOP though). The 3 instructions might be adequate for supporting the new register. Having another field in the opcode would result in more parallelization being done, which would result in a faster implementation --- the whole point of a VLIW is to pack as many instructions into a single opcode as possible, so they all execute in one clock cycle.

How expensive are registers? Is adding another register a good way to simplify the processor?
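A 16-step partial-multiply loop of the kind described can be sketched in Python. The register names and exact step semantics here are guesses from the description (one partial product per clock, accumulating into a CX:AX pair), not the real TOYF instructions:

```python
def mul_step(cx_ax, dx):
    """One partial-multiply step: if the low bit of AX is set, add the
    multiplicand DX into the high half (CX), then shift CX:AX right.
    (Real hardware would catch CX's temporary 17th bit in a carry.)"""
    cx, ax = cx_ax >> 16, cx_ax & 0xFFFF
    if ax & 1:
        cx += dx
    return ((cx << 16) | ax) >> 1

def mul16x16(a, b):
    cx_ax = a            # AX starts with the multiplier, CX with 0
    for _ in range(16):  # 16 iterations, one per clock cycle
        cx_ax = mul_step(cx_ax, b)
    return cx_ax         # CX:AX now holds the full 32-bit product

print(mul16x16(0xFFFF, 0xFFFF) == 0xFFFF * 0xFFFF)  # True
```

Each step consumes one bit of the multiplier and shifts the growing product down, which is why the full 32-bit product appears after exactly 16 iterations.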
Mon Nov 05, 2018 1:29 am
BigEd
Joined: Wed Jan 09, 2013 6:54 pm Posts: 1799
Registers are cheap. Whether they will help you is a question of connectivity. I can't guess at the answer, and you shouldn't guess either!
Mon Nov 05, 2018 6:40 am
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2187 Location: Canada
Quote: Nowadays it seems that registers aren't so expensive anymore.

Yeah, registers are cheap now, at least in FPGAs. They have RAM resources available. One way to cut down on the interconnect is to use RAM resources for registers. A RAM has all the multiplexing built into it, so it doesn't take connection resources; otherwise discrete registers can take a lot of wiring. In RISC-V they switched from having registers defined explicitly to using a register-file component from a PLD vendor in order to clean up some of the interconnect which was slowing the core down.

FPGAs often have multiplier blocks available that could be used to implement a multiply instruction, so it's possible to implement a multiply instruction easily. The series 7 from Xilinx has DSP blocks set up for MAC operations with a 25x18 multiplier. I'm sure other vendors like Altera have similar offerings. Since a 64x64 multiplier takes 18 stages (clock cycles) optimally, I've been thinking about offering only a 24x16 multiply operation for the FT64 ISA that is single-cycle. A good part of a full multiply could be done in 18 clock cycles with separate instructions.
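Building a wide multiply out of narrow multiplier blocks amounts to summing shifted partial products. A Python sketch of the idea, using 16x16 pieces for readability rather than the 25x18 DSP geometry mentioned above:

```python
MASK16 = 0xFFFF

def mul32_via_16(a, b):
    """32x32 -> 64-bit multiply composed from four 16x16 -> 32-bit
    multiplies, each small enough for one narrow multiplier block."""
    a_lo, a_hi = a & MASK16, a >> 16
    b_lo, b_hi = b & MASK16, b >> 16
    return ((a_hi * b_hi) << 32) \
         + ((a_hi * b_lo) << 16) \
         + ((a_lo * b_hi) << 16) \
         + (a_lo * b_lo)

print(mul32_via_16(0xDEADBEEF, 0x12345678) == 0xDEADBEEF * 0x12345678)  # True
```

In hardware the four partial products can be issued as separate instructions and summed with shifted adds, which is the kind of multi-instruction full multiply described in the post.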
Tue Nov 06, 2018 1:43 am