TOYF (Toy Forth) processor
Author |
Message |
Hugh Aguilar
Joined: Sun Jul 23, 2017 1:06 am Posts: 93
|
robfinch wrote:
Quote: Nowadays it seems that registers aren't so expensive anymore.
Yeah, registers are cheap now, at least in FPGAs. They have RAM resources available. One way to cut down on the interconnect is to use RAM resources for registers. A RAM has all the multiplexing built into it, so it doesn't take connection resources. Otherwise discrete registers can take a lot of wiring. In RiSC-V they switched from having registers defined explicitly to using a register file component from a PLD vendor in order to clean up some of the interconnect which was slowing the core down. FPGAs often have multiplier blocks available that could be used to implement a multiply instruction, so it's possible to implement one easily. The series 7 from Xilinx has DSP blocks set up for MAC operations with a 25x18 multiplier. I'm sure other vendors like Altera have similar offerings. Since a 64x64 multiplier takes 18 stages (clock cycles) optimally, I've been thinking about offering only a 24x16 multiply operation for the FT64 ISA that is single-cycle. A good part of a full multiply could be done in 18 clock cycles with separate instructions.
I was assuming discrete registers, just like on the MiniForth built on the Lattice isp1048 PLD. I know that FPGAs have internal RAM now, but if I use that then I would expect to have multi-cycle instructions. There would be microcode that would load a memory-address register with the pseudo-register's address, then read or write to it just like any memory location. I currently have nine 16-bit registers, four 5-bit registers, and three 1-bit registers. This is pretty conservative register usage. If I'm going to use the internal RAM for pseudo-registers, then I might as well go whole-hog like the Propeller, which has 512 32-bit registers, or the RISC-V, which has 32 32-bit registers. That is a totally different design path. There is no point in me going down that well-trodden path.
Some guy on another forum was recently telling me that my TOYF design is nonsense, and if I want to be innovative I should just go with the RISC-V like everybody else who's claiming to be innovative these days. This reminds me of when the Chrysler PT Cruiser came out and became hugely popular, with all of the many buyers explaining that they wanted to be "distinctive." AFAIK, the Chrysler PT Cruiser was just a Chrysler Neon with a different body --- not very innovative --- ISTM that most "innovation" is just a collection of old ideas with new packaging (of course, my TOYF is based on the MiniForth that came out in 1995, so it is not exactly a cutting-edge design either). What exactly is distinctive about the RISC design? I suppose the idea is that data is loaded into registers, operated on between registers, and stored back into memory --- but you never have register-to-memory operations like on the Z80 or MC68000 --- by that criteria, my TOYF is a RISC (my purpose is to have only single-cycle instructions like on the MiniForth). Why is this significant though? If you have micro-code, and multi-cycle instructions, why not just do a register-to-memory operation like on the Z80 or MC68000? What is the point of limiting yourself to load and store?
|
Tue Nov 06, 2018 5:09 am |
|
|
BigEd
Joined: Wed Jan 09, 2013 6:54 pm Posts: 1796
|
Just a thought, that final paragraph might be better as a new thread. There's lots to say, and it isn't necessarily tightly coupled with a TOYF discussion, although of course you could use TOYF as some kind of example. (It's not a great example, because it's a design without an implementation, and so there's much that's not known about how well it might work.)
|
Tue Nov 06, 2018 8:45 am |
|
|
Hugh Aguilar
Joined: Sun Jul 23, 2017 1:06 am Posts: 93
|
Hugh Aguilar wrote: What exactly is distinctive about the RISC design? I suppose the idea is that data is loaded into registers, operated on between registers, and stored back into memory --- but you never have register-to-memory operations like on the Z80 or MC68000 --- by that criteria, my TOYF is a RISC (my purpose is to have only single-cycle instructions like on the MiniForth). Why is this significant though? If you have micro-code, and multi-cycle instructions, why not just do a register-to-memory operation like on the Z80 or MC68000? What is the point of limiting yourself to load and store?
BigEd wrote: Just a thought, that final paragraph might be better as a new thread. There's lots to say, and it isn't necessarily tightly coupled with a TOYF discussion, although of course you could use TOYF as some kind of example. (It's not a great example, because it's a design without an implementation, and so there's much that's not known about how well it might work.)
I don't really care enough about discussing the distinction between RISC and CISC to start a thread on the question. That is almost as boring as discussing the distinction between Democrats and Republicans, and equally useless in regard to understanding how the world works. These are false dichotomies that don't really explain anything. It has become increasingly obvious that the TOYF is not going to work on an FPGA. Connectivity was a major problem in the 20th century, and it is still a problem in 21st century, and will be exactly the same problem in the 22nd century in the unlikely scenario that people are still building processors then. Routing is an intractable problem because you can only try out a tiny tiny percentage of the possible routing configurations --- you just have to hope that your heuristics lead to a working configuration, but as the woman in the "Predators" movie famously said: "Hope is not a strategy!" --- most likely, the MiniForth routed on the Lattice isp1048 PLD mostly by luck and this isn't going to happen again. A guy on the 6502 forum was briefly interested in implementing the TOYF. He was highly disturbed however to learn that the instructions in the three fields were supposed to execute concurrently. You can't just do them in sequence, because you might have one instruction in one field that loads a register and another instruction in another field that reads the same register. His solution was to have shadow registers complementing all of the registers. He would be using a register file (internal RAM), so doubling the number of registers (actually pseudo-registers) was not a problem. This would allow the three instructions in each opcode to be done sequentially. The writes would be into the shadow registers, and the reads would be from the main registers, then afterward the shadow-register values would be moved over to the main registers. 
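The shadow-register scheme described above can be sketched in a few lines of Python. This is only an illustrative model, not code from MFX or from any TOYF implementation: the names `mov` and `execute_opcode` are made up for the sketch, and registers are plain dictionary entries. Every instruction reads the main registers and writes the shadow copy, and the shadow copy becomes the main set once all fields have run.

```python
def mov(src, dst):
    """Hypothetical helper: build a 'MOV src,dst' instruction (source first, as in TOYF syntax)."""
    def instruction(main_regs):
        # Reads only ever see the main (pre-opcode) register values.
        return dst, main_regs[src]
    return instruction


def execute_opcode(main_regs, instructions):
    """Run the fields of one opcode one after another while imitating parallel execution."""
    shadow = dict(main_regs)          # shadow set starts as a copy of the main set
    for instruction in instructions:
        dst, value = instruction(main_regs)
        shadow[dst] = value           # writes go to the shadow set only
    return shadow                     # the shadow set becomes the new main set


regs = {"AX": 1, "BX": 2}
# Two MOVs packed into one opcode: each reads the other's old value.
regs = execute_opcode(regs, [mov("AX", "BX"), mov("BX", "AX")])
print(regs)
```

Because neither MOV sees the other's write, the pair exchanges the two registers, which a naive one-after-the-other execution on a single register set would not do.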
So, the opcode would be executed in a ponderous sequential manner taking several clock cycles, but it would simulate the execution of the instructions in parallel all in a single clock cycle. I found this idea to be quite humorous. This technique with shadow-registers is the same technique that I used in 1994 in the simulator for the MiniForth that ran under MS-DOS (my MFX was written in 32-bit UR/Forth running under an MS-DOS "extender"). So, he would essentially be simulating parallel-processing with sequential-processing on the FPGA in the same way that I simulated the parallel-processing of the MiniForth with sequential-processing on the Pentium for testing purposes. The problem is that simulating parallel-processing with sequential-processing is very slow, so the TOYF would likely have less than 4 Forth MIPS (the MiniForth was doing 4 Forth MIPS in 1994, which was adequate for it to out-perform the MC68000 that the competition was using). My read on the situation is that FPGAs are not capable of parallel processing. They only do sequential processing. The FPGA is more powerful than the PLD of the 1990s because it has a lot more resources available, especially internal RAM. A processor such as the Propeller can have 512 32-bit "registers" because the FPGA has that much RAM (2KB) inside of it. The FPGA likely has a lot more RAM than that internally. The reason why the Propeller has "cogs" is so they can have 8 sets of 512 32-bit pseudo-registers, which comes to a whopping 16KB register-file. The FPGA is doing everything sequentially though. It is not doing parallel processing such as was done by the PLD running the MiniForth a quarter of a century ago. If a person is going to design a processor for an FPGA, the trick is to have a lot of pseudo-registers to take advantage of all the internal RAM in the FPGA. The Propeller has 512 32-bit pseudo-registers, which is quite a lot. 
The Propeller needs a 32-bit opcode to accomplish this (two 9-bit fields for the source and destination pseudo-registers, plus 14 bits to specify what is to be done). Having 32-bit opcodes is going to result in very bloated programs though. The Propeller-2 has its XBYTE though, which is provided to allow fast execution of byte-code, so I think the designers' plan was that most programs would be written in an HLL that generated byte-code and there would be relatively few primitives written in actual machine-code. A different strategy is to have 32 32-bit pseudo-registers, which would require two 5-bit fields, leaving 6 bits left over in a 16-bit opcode to specify what is to be done (you get 64 instructions). Having 16-bit opcodes is going to result in machine-code roughly half the size. OTOH, you only get 32 registers, so you are going to have to move data in and out of main-memory more, which will increase the size of the program (and slow it down). Anyway, I have no interest in processors such as the Propeller or RISC-V --- my background is in writing MFX for the MiniForth --- I would like to do something similar now, but VLIW processors such as the MiniForth no longer exist. BTW: My 65ISR was another bad design because I assumed that the goal was to minimize how many registers I had (similar to the MC6805 of yesteryear), whereas the goal is actually to take advantage of the FPGA's internal RAM by having a gigantic register-file (the Propeller likely wins the prize for having the biggest register-file of all big register-files).
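The field arithmetic in this post (9 + 9 + 14 = 32 bits, versus 5 + 5 + 6 = 16 bits) can be checked with a small Python sketch. The field order and the names `WIDE`, `NARROW`, `pack` and `unpack` are assumptions made for illustration; this is not the real Propeller or RISC-V encoding.

```python
# Field widths as (operation, destination, source); the order is an assumption.
WIDE = (14, 9, 9)     # 512 pseudo-registers need 9-bit fields
NARROW = (6, 5, 5)    # 32 pseudo-registers need 5-bit fields


def pack(values, widths):
    """Pack the field values into one opcode word, first field in the top bits."""
    word = 0
    for value, width in zip(values, widths):
        if not 0 <= value < (1 << width):
            raise ValueError(f"{value} does not fit in {width} bits")
        word = (word << width) | value
    return word


def unpack(word, widths):
    """Split an opcode word back into its fields."""
    values = []
    for width in reversed(widths):
        values.append(word & ((1 << width) - 1))
        word >>= width
    return list(reversed(values))


print(sum(WIDE), sum(NARROW))          # total opcode widths in bits
print(1 << NARROW[0])                  # number of distinct operations in the 16-bit format
print(unpack(pack((5, 3, 17), NARROW), NARROW))
```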
|
Tue Nov 06, 2018 7:11 pm |
|
|
Tor
Joined: Tue Jan 15, 2013 10:11 am Posts: 114 Location: Norway/Japan
|
I'm no FPGA expert, but as I understand it FPGAs can indeed do parallel processing. Anyway, I only wanted to comment about the Propeller: > A processor such as the Propeller can have 512 32-bit "registers" because the FPGA has that much RAM (2KB) inside of it. That is not correct - the Propeller design is not related to FPGAs at all, the Propeller (aka P1 now) was designed and implemented directly into silicon, many years ago. It is only lately that it was re-implemented in Verilog so that it could be open-sourced. The new P2 (which just reached silicon samples) was designed in Verilog from the start, and tested in FPGAs, but its design was inherited from the P1. The 512 'registers' come from there, not from FPGA constraints. Same for cogs. The 9 bits used to address COG RAM (or 'registers') was simply a design decision by Chip Gracey - he had to divide the bits of the instructions into sections, and that's the compromise he reached. Also - the test images of the P2 testing in FPGA could definitely run the cogs in true parallel, the Propeller is I/O oriented and needs the parallel execution in order to do what it does with I/O pins.
|
Tue Nov 06, 2018 7:27 pm |
|
|
BigEd
Joined: Wed Jan 09, 2013 6:54 pm Posts: 1796
|
Yes, there are a few misconceptions and mistaken conclusions there...
|
Wed Nov 07, 2018 9:53 am |
|
|
Hugh Aguilar
Joined: Sun Jul 23, 2017 1:06 am Posts: 93
|
robfinch wrote:
Quote: Nowadays it seems that registers aren't so expensive anymore.
Yeah, registers are cheap now, at least in FPGAs. They have RAM resources available. One way to cut down on the interconnect is to use RAM resources for registers. A RAM has all the multiplexing built into it, so it doesn't take connection resources. Otherwise discrete registers can take a lot of wiring. In RiSC-V they switched from having registers defined explicitly to using a register file component from a PLD vendor in order to clean up some of the interconnect which was slowing the core down.
The TOYF is really not going to work with a register-file. Almost all of the registers can be read by an instruction in one group and written by an instruction in another group, so if those two instructions are in the same opcode they have to be read and written in parallel. This isn't going to work with RAM. The whole point of a VLIW is that instructions can parallelize, and most of the time when they parallelize you have the same register being read and written. There are a few exceptions. For example, SS (data-stack pointer) is only read or written by group-M instructions. RS (return-stack pointer) seems like it would be the same, except that we have the ADD RS,AX instruction that is in group-A (necessary because we have an ALU in group-A but not in group-M). Theoretically, this could parallelize with a group-M instruction such as PLR that also accesses RS, although in practice this never happens. I don't understand why the concept of a register-file in RAM would be mentioned during a discussion of the TOYF, which is a VLIW. This seems to be totally in contradiction to how VLIWs work. Am I missing something here??? I remember being pretty baffled by the concept of parallelization when I started at Testra, but I bluffed my way through that until I grokked the idea, which took a couple of weeks.
I thought of the idea of using shadow registers in my simulation, although I have since realized that this idea is well known. It was my expectation from the beginning that all the registers would be discrete. I had very few registers initially, but added some over time to support features such as fast linked-list traversal and multiplication and division. Anyway, I have a new upgrade (file attached). This is a somewhat simplified version. I got rid of a lot of instructions for the purpose of reducing connectivity. For example, I had this previously:
Code:
add IP,AX ; move AX + IP to AX
add LX,IP,AX ; move LX + IP to AX
I got rid of ADD LX,IP,AX because that seems like it requires a lot of connectivity. Now I have to do this instead:
MOV LX,AX
ADD IP,AX
This takes 2 clock cycles rather than 1, so it is slightly less efficient; this is a bummer because it is in the branch primitives that need to be fast, but I suppose I can live with that. I got rid of a lot of other instructions also, typically causing various primitives to be 1 clock cycle slower. I still have NOT AX that is only marginally useful and could be discarded if necessary. I added a few instructions, such as FLG BX,CF, that are only marginally useful. This could be done instead with:
NOT BX,CF
NOT CF
So, FLG BX,CF could be discarded if necessary. These two instructions, NOT AX and FLG BX,CF, are the only two that could still be chopped without losing any major features. That is pretty much it though --- I've really trimmed it to the bone. I also fixed a few bugs I found in the example code throughout the document. I added another feature. I have a RND AX instruction now. I also got rid of BYT AX and replaced it with this:
Code:
byt AX,BX ; move AX -and- $FF to BX
Previously I said that my plan was for the TOYF to be used in motion-control. I still think this is a good plan. Motion-control works well with a paced-loop, and I have support for 1-bit data so the TOYF can be used like a PLC. I added the RND AX instruction though, for the purpose of allowing the TOYF to be used as a video-game machine. I remember enjoying game-programming when I was 18, 19, etc. I think teenagers would like a game-machine that they can program themselves. I also think those first-person shooter games are a very negative influence on society. Sitting in front of a television for hours pretending to shoot people is going to mess with teenagers' heads (I've even seen adults do this, and they were the type of adults that I would not offer employment to in the plumbing business where I work). It would be much better to have teenagers focus more on writing their own games, and also getting back to games like Super Mario World (no small green dinosaurs were harmed in the making of this video game). I said this in the document:
Code:
RND_BIT: ; seed-adr -- [0,1] ; depends upon MA=BX
    ldr LX
    mov LX,AX
    rnd AX
    flg AX,CF
    mov CF,BX ; BX= low bit of AX
    nxt AX ; update 16-bit seed

RND_BYTE: ; seed-adr -- [0,255] ; depends upon MA=BX
    ldr LX
    mov LX,AX
    rnd AX
    rnd AX
    rnd AX
    rnd AX
    rnd AX
    rnd AX
    rnd AX
    rnd AX
    byt AX,BX ; BX= low 8 bits of AX
    nxt AX ; update 16-bit seed
RND_BIT and RND_BYTE aren't adequate for encryption. They are provided for use in games. The TOYF uses a paced-loop which should work for games --- 100 Hz (10 ms) would update the display fast enough to look like motion. The major problem is that there is no low-latency IRQ that can be used to generate sound. A separate sound chip would be needed for music. It may be necessary to use a coprocessor to pony up the TOYF (the TOYF may not be able to start itself on power-up without help). The coprocessor (a C8051, for example) could also act as the sound chip. The TOYF would squirt a "sound file" over by way of the UART. The coprocessor would interpret this sound-file to play a tune by toggling a speaker. The sound quality should be comparable to the video games of the late 1980s (probably not at the level of the C64's 6581 SID though). The 1 MHz Apple-II could do speech synthesis (somewhat robotic, but understandable). A 72 MHz C8051 should be significantly better. The Super Nintendo (SNES) used a 3.58 MHz Ricoh 5A22 (a 65c816 derivative). It came out in 1990 and went kaput about 10 years later. The SNES was programmed in assembly-language (using self-modifying code for speed), and supported some pretty cool games. The TOYF should be an order of magnitude faster, and you get to program in Forth. High-school students might like it. :-)
Of course, it could be argued that a PRNG isn't a big part of video-games --- it is somewhat useful though.
Code:
rnd AX ; move bit-1 -xor- bit-2 -xor- bit-4 -xor- bit-15 to bit-0 | shift left bits 0..14 to bits 1..15 (discarding bit-15)
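For anyone who wants to try the generator on a PC, here is a rough Python model of RND AX as described in the comment above. It assumes the feedback bit is computed from the old value of AX, in parallel with the shift; the function names `rnd` and `rnd_byte` are just for this sketch and are not part of the TOYF document.

```python
def rnd(ax):
    """One RND AX step on a 16-bit value (model only, taps taken from the old AX)."""
    feedback = ((ax >> 1) ^ (ax >> 2) ^ (ax >> 4) ^ (ax >> 15)) & 1
    shifted = (ax << 1) & 0xFFFF      # bits 0..14 move to 1..15, bit 15 is discarded
    return shifted | feedback


def rnd_byte(seed):
    """Mimic RND_BYTE: eight RND steps, then return (low 8 bits, updated seed)."""
    for _ in range(8):
        seed = rnd(seed)
    return seed & 0xFF, seed


value, new_seed = rnd_byte(1)
print(value, new_seed)
```

Note that a seed of 0 stays at 0 forever in this model, so the seed in memory would have to be initialized to something nonzero.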
The hard part with a video-game is the video --- this has to either be done in the FPGA which seems difficult, or you get an external video chip which seems expensive.
|
Wed Nov 07, 2018 6:31 pm |
|
|
BigEd
Joined: Wed Jan 09, 2013 6:54 pm Posts: 1796
|
The RAMs on FPGAs are clocked, so they embody a pipeline stage, so you can read out the old value and write back in the new value. They are also (often) dual ported so you can read two different registers. If you have a rather small register set then it might feel inefficient to use up an 18kbit RAM for the purpose - but it still might be a good idea, especially if you had nothing better to do with that RAM block. There's another technique which I think was used on the Firepath CPU, invented by Acorn, developed by Element 14 and subsequently owned by Broadcom: if you need more read or write ports, you can double up the register file. You have two files, and you write the same data to both. Now you have twice the number of read ports. I'm not sure, but it doesn't feel like you get any increase in the number of write ports.
Anyhow, I think we're mixing up architecture and microarchitecture to some extent. If your CPU is in an FPGA, and is using external RAM, there's a fair chance the CPU can clock faster than the RAM. In such a situation, it is not a performance loss to take two clocks to perform an action. It's rather like the Z80, which tends to take four clocks per memory cycle: if the limiting factor is the memory cycle, then you get more done with less hardware if you clock the CPU quicker. In the case of the Z80, you get away with a 4-bit ALU. This is significantly smaller and therefore cheaper than an 8-bit ALU.
It might be worth taking a look at this page: https://minnie.tuhs.org/Programs/UcodeCPU/
You see an effort is made to draw a diagram which shows the functional units, their connectivity, and the clock boundaries.
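A small behavioural model may make the two ideas in this post more concrete. It is a sketch under assumptions, not vendor code: `ClockedRam` assumes the RAM is used in a read-before-write mode (a read in the same clock as a write returns the old contents), and `DuplicatedRegisterFile` is a made-up name for the "two files, same data written to both" trick.

```python
class ClockedRam:
    """One clocked RAM with a single read and a single write per clock (model only)."""

    def __init__(self, size):
        self.cells = [0] * size

    def clock(self, read_addr, write_addr=None, write_data=None):
        old_value = self.cells[read_addr]      # sampled before the write lands
        if write_addr is not None:
            self.cells[write_addr] = write_data
        return old_value


class DuplicatedRegisterFile:
    """Two copies written identically: still one write port, but two read ports."""

    def __init__(self, size):
        self.copy_a = ClockedRam(size)
        self.copy_b = ClockedRam(size)

    def clock(self, read_a, read_b, write_addr=None, write_data=None):
        value_a = self.copy_a.clock(read_a, write_addr, write_data)
        value_b = self.copy_b.clock(read_b, write_addr, write_data)
        return value_a, value_b


regfile = DuplicatedRegisterFile(16)
first = regfile.clock(0, 1, write_addr=0, write_data=7)   # reads see the old contents
second = regfile.clock(0, 1)                              # next clock sees the new value
print(first, second)
```

In this model a register can be read and overwritten in the same clock, with the reader getting the old value, which is the behaviour the TOYF opcodes rely on when one field reads a register that another field writes.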
|
Wed Nov 07, 2018 10:17 pm |
|
|
Hugh Aguilar
Joined: Sun Jul 23, 2017 1:06 am Posts: 93
|
BigEd wrote: The RAMs on FPGAs are clocked, so they embody a pipeline stage, so you can read out the old value and write back in the new value. They are also (often) dual ported so you can read two different registers. If you have a rather small register set then it might feel inefficient to use up an 18kbit RAM for the purpose - but it still might be a good idea, especially if you had nothing better to do with that RAM block.
The guy who was telling me that the TOYF is nonsense and I should go with the RISC-V, was primarily critical of my small set of registers, as compared to the 32 registers of the RISC-V. He was saying that a large register-file is the best way to take advantage of the FPGA resources (internal RAM) and a small set of registers is not taking advantage of the available resources. BigEd wrote: There's another technique which I think was used on the Firepath CPU, invented by Acorn, developed by Element 14 and subsequently owned by Broadcom: if you need more read or write ports, you can double up the register file. You have two files, and you write the same data to both. Now you have twice the number of read ports. I'm not sure, but it doesn't feel like to get any increase in the number of write ports.
It took me a few minutes to fathom your words, but I think I grok the idea: If you have 2 processors, then you have 2 sets of registers shadowing each other. For example, AX' and AX'' would be shadow registers of the same nominal register. Every instruction takes 2 clock cycles. A processor will first access AX' in one clock cycle and then access AX'' in the next clock cycle. If the 2 processors are offset by 1 clock cycle, then while one is accessing AX' the other is accessing AX'' and they don't clash. I have 3 processors though: A, B and M. This means that I would need 3 sets of registers. For example: AX', AX'' and AX'''. Considering that I have only a handful of 16-bit registers though, there is presumably plenty of RAM available for 3 sets of all the registers. Every instruction would take 3 clock cycles to execute. The good thing about this is that the TOYF could run at 3 times the speed of the data-memory. In my design, I require the MA (memory address) register to be set by one instruction and the memory access (such as with LDR LX or STR AX) to be done in a later instruction (usually the next instruction, although sometimes MA will hold a value for a while before the memory access is done). The code-memory, however, would have to be twice as fast. This is because you need to set the PC register (usually by incrementing it, but sometimes by transferring a register value into it) and read in the next opcode, all within those 3 clock cycles --- you don't get 6 clock cycles like you do with data-memory. To make this easier, let's say that each instruction takes 4 clock cycles to execute. We have 4 processors rather than just the 3 (A, B and M). If we have an 18-bit opcode, then the 4th processor could have 4 instructions. Or, more likely, the 4th processor is a phantom that doesn't do anything (it effectively does a NOP every time). In this case, the TOYF could run at 4 times the speed of the data-memory and at 2 times the speed of the code-memory.
That would make a lot more sense than trying to divide 3 clock cycles by 2.
|
Thu Nov 08, 2018 2:44 am |
|
|
Hugh Aguilar
Joined: Sun Jul 23, 2017 1:06 am Posts: 93
|
Hugh Aguilar wrote: I have 3 processors though: A B and M. This means that I would need 3 sets of registers. For example: AX' AX'' and AX'''. Considering that I have only a handful of 16-bit registers though, there is presumably plenty of RAM available for 3 sets of all the registers. Every instruction would take 3 clock cycles to execute. The good thing about this is that the TOYF could run at 3 times the speed of the data-memory. In my design, I require the MA (memory address) register to be set by one instruction and the memory access (such as with LDR LX or STR AX) to be done in a later instruction (usually the next instruction, although sometimes MA will hold a value for a while before the memory access is done). The code-memory however, would have to be twice as fast. This is because you need to set the PC register (usually by incrementing it, but sometimes by transferring a register value into it) and and read in the next opcode, all within those 3 clock cycles --- you don't get 6 clock cycles like you do with data-memory.
Actually, now that I think about it some more, this isn't really true. Accessing the register RAM (for one of: AX', AX'', AX''') takes 2 clock cycles, not 1. The 1st sets the micro-code MA for that small RAM and the 2nd does the read or write. So, every TOYF instruction has to take 6 clock cycles to execute, not 3. If we assume the FPGA is doing 120 MHz, then the TOYF instructions would be executing at 20 MHz. This isn't very fast. If we assume that the typical length of a Forth primitive is 10 opcodes, then we are only doing 2 Forth MIPS. Way back in 1994 the MiniForth was doing 4 Forth MIPS, so the TOYF would be half the speed of a processor built about a quarter of a century ago. Back in 1995 when the MiniForth came out, it out-performed the competitor's MC68000 board, and also cost less. The TOYF would likely be roughly comparable to a 20 MHz MSP-430 in speed, but would cost more (my understanding is that FPGA chips are somewhat pricey). This isn't going to set the world on fire --- this is going to be mostly ignored as nothing more than a toy. The SNES game-machine was popular in 1995 and it used a 65c816 variant running at 3.58 MHz. Those were pretty good games. Nintendo didn't allow the general-public (people like me) to program the SNES, but only had a few authorized programmers. I saw a YouTube video from a guy reverse-engineering the SNES and he said they were using self-modifying code, which is somewhat ugly. The TOYF could likely be used for a better game-machine than the SNES --- but the SNES was discontinued in 2003 --- something comparable to the SNES might become popular if the general-public is allowed to write their own games, but it might not as there aren't many teenagers who want to do that (nowadays people are discouraged from being creative thanks to one million internet trolls who will tell you that you are stupid). All in all, I don't think a register-file is going to work. The TOYF needs discrete registers, just like the MiniForth had.
Nobody is going to give the TOYF any consideration unless it is at least 2x the speed of a 20 MHz MSP-430, and actually 3x or 4x would be required by most people to get interested --- 80 MHz with the TOYF instructions executing in one clock cycle would provide 8 Forth MIPS --- that would be roughly comparable to the RACE processor (the MiniForth upgraded to an FPGA and given a name change). The MiniForth had two versions, 40 MHz (4 Forth MIPS) and 80 MHz (8 Forth MIPS), but only the 40 MHz version was used because the memory chips for the 80 MHz version were expensive and such blazing speed wasn't needed for the laser-etcher.
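The speed estimates in this post all follow the same rule of thumb: clock rate divided by clocks per instruction, divided by roughly 10 opcodes per Forth primitive. A throwaway Python check of the figures quoted (the function name `forth_mips` is made up, and the 10-opcode average is the assumption stated in the post):

```python
def forth_mips(clock_mhz, clocks_per_instruction, opcodes_per_primitive=10):
    """Rough Forth MIPS estimate from the clock rate and the cycles per instruction."""
    instruction_rate_mhz = clock_mhz / clocks_per_instruction
    return instruction_rate_mhz / opcodes_per_primitive


print(forth_mips(120, 6))   # register-file version: 6 clocks per instruction
print(forth_mips(80, 1))    # discrete registers, single-cycle instructions
print(forth_mips(40, 1))    # the 40 MHz MiniForth figure
```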
|
Thu Nov 08, 2018 5:57 am |
|
|
BigEd
Joined: Wed Jan 09, 2013 6:54 pm Posts: 1796
|
It may be that a register file makes no sense for TOYF, especially as there are so few registers.
But I think you haven't got the idea I was trying to explain. I think the term "shadow register" might be causing confusion. It's not important, but if you do want to understand, you may need to go around again.
It's surely not true that everyone needs diagrams in order to understand systems, but I certainly do, and pretty much every book has diagrams. I think the lack of diagrams for TOYF is a serious disadvantage for you.
How to make best use of an FPGA by designing an appropriate architecture is a very different question from how to make a good implementation of a chosen architecture, so the comparison with RISC-V should be read in that spirit.
|
Thu Nov 08, 2018 10:01 am |
|
|
Hugh Aguilar
Joined: Sun Jul 23, 2017 1:06 am Posts: 93
|
Tor wrote: the test images of the P2 testing in FPGA could definitely run the cogs in true parallel
You're telling me that all 8 cogs access hub memory in the same clock cycle??? RAM doesn't work like that. The cogs must certainly take turns accessing hub memory, which means they do so sequentially in a particular order. I haven't examined the P2, but I suspect what you have there are 8 processors with loose coupling. That is not necessarily a bad design. I might get into programming the P2, but I wouldn't want to believe that the cogs are accessing hub memory in parallel, because that can't be true, and sequential access is going to result in the usual problems (the "dining philosophers" and all that). The MiniForth had tight coupling. The five processors could access any particular register in the same clock. Each register could only be written to by one processor, but could be read from by one or more other processors in the same clock cycle. True parallel! This is where I'm going with the TOYF. This can be faked with shadow registers so you have sequential execution that looks like parallel execution, but I'm not interested in doing that. I think this is done primarily to make the programmer's model simpler, and avoid all that "dining philosopher" confusion. You are going to have multiple copies of the same data though, which seems wasteful of resources. Also, it is going to be slow because you are reading and writing the same data multiple times, and doing so sequentially, which seems like a lot of clock cycles burned for no good purpose.
|
Thu Nov 08, 2018 4:00 pm |
|
|
Tor
Joined: Tue Jan 15, 2013 10:11 am Posts: 114 Location: Norway/Japan
|
Hugh Aguilar wrote: Tor wrote: the test images of the P2 testing in FPGA could definitely run the cogs in true parallel
You're telling me that all 8 cogs access hub memory in the same clock cycle??? No, not at all, that's not what I was saying. The COGs operate in true parallel, with their own COG memory, but also with direct access to I/O pins - and that's in "true" parallel mode as well (accessing the same pin at the same time). The hub memory is different. That's where the name "Propeller" comes from - the cogs access hub memory in a round-robin fashion. Although the P2 does it differently from the P1. There's other memory in the P2 though, the third type is called LUT memory and in an earlier incarnation of the P2 it was possible for two cogs to access the LUT at the same time. Due to multi-ported RAM, of the type BigEd mentioned. In addition to that there's a lot of internal resources that are multi-ported to various degrees. I guess the point is that although an FPGA can't do everything in parallel, it can still do a lot in parallel - and is therefore different from the same thing executed in a software emulator.
|
Thu Nov 08, 2018 9:26 pm |
|
|
BigEd
Joined: Wed Jan 09, 2013 6:54 pm Posts: 1796
|
If you ever chance upon a diagram for the p2, please share! It sounds interesting. I did look at the forum thread linked earlier but didn't see what I wanted. (Hugh, it might help you to look into multi port RAM. Maybe study figure 2 here.)
|
Thu Nov 08, 2018 9:33 pm |
|
|
Tor
Joined: Tue Jan 15, 2013 10:11 am Posts: 114 Location: Norway/Japan
|
Well.. the hub access is sometimes called "the eggbeater" (the P2 variant). It's a bit like a "lazy Susan", the type of round table often found in Chinese restaurants. There's a central round table inside/on top of the 'outer' table, where all the food is placed along its rim. So the inner table (the HUB RAM) rotates, and the guests (the COGs) can grab food as it passes by. The main points are that a) everybody can access food (RAM) at the same time, just not the same food, and b) each guest can continue to access the "next" food item as the inner table keeps rotating, without delay. There's only an initial delay when you wait for the first food item you want to access. If what you then need happens to follow after that then you (and everybody else) can keep accessing, at the same time, without further delay.
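The rotating-table picture can be turned into a toy model in Python. This is only a sketch of the analogy, not the actual P2 hub logic: it assumes 8 cogs, 8 hub RAM slices, consecutive addresses spread across the slices, and a rotation of one slice per clock. The names `slice_for` and `wait_cycles` are invented for the example.

```python
NUM_COGS = 8   # assumed: one hub RAM slice per cog


def slice_for(cog, clock):
    """Which RAM slice is in front of this cog on this clock (model only)."""
    return (cog + clock) % NUM_COGS


def wait_cycles(cog, clock, address):
    """Clocks a cog must wait before the slice holding 'address' comes around."""
    wanted_slice = address % NUM_COGS          # assumed address-to-slice mapping
    return (wanted_slice - slice_for(cog, clock)) % NUM_COGS


# On any clock every cog faces a different slice, so all of them can access RAM at once.
print(sorted(slice_for(cog, 5) for cog in range(NUM_COGS)))

# After the initial wait, the next sequential address is available with no further delay.
first_wait = wait_cycles(2, 5, 100)
print(first_wait, wait_cycles(2, 5 + first_wait + 1, 101))
```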
|
Thu Nov 08, 2018 9:53 pm |
|
|
Hugh Aguilar
Joined: Sun Jul 23, 2017 1:06 am Posts: 93
|
BigEd wrote: The RAMs on FPGAs are clocked, so they embody a pipeline stage, so you can read out the old value and write back in the new value. They are also (often) dual ported so you can read two different registers. If you have a rather small register set then it might feel inefficient to use up an 18kbit RAM for the purpose - but it still might be a good idea, especially if you had nothing better to do with that RAM block.
You are telling me that it is possible to read and write to a pseudo-register in RAM in parallel in the same clock cycle? This is from my document:
Code:
DX_MINMAX: ; a b -- max ; sets DX= min
    xch BX,DX ; DX= b
    pls
    ldr LX ; LX= a
    mov LX,BX ; BX= a
    mov DX,AX ; AX= b
    mov 0,CF ; the TOYF's SBC needs CF set to 0 as done in the MC6800, not to 1 as done in the popular 6502
    sbc AX,BX ; CF= b(AX,DX) > a(BX) ; CF indicates that b(DX) > a(BX)
    not CF ; CF= b(AX,DX) <= a(BX) ; CF indicates that b(DX) is the minimum and a(BX) is the maximum
    mov LX,BX ; BX= a ; the SBC AX,BX screwed up BX, so BX has to be restored
    mov 1,AX ; AX= offset to branch to next primitive
    add IP,AX
    bcs ; if BX is maximum and DX is minimum, then we are done
SWAP_DX: ; a -- b ; needs DX=b, sets DX= a
    xch BX,DX
    nxt
            group-A   group-B   group-M
1st opcode: XCH BX,DX   PLS
2nd opcode: MOV DX,AX   MOV 0,CF   LDR LX ; MOV 0,CF moved all the way back here because it has no dependencies
3rd opcode: MOV LX,BX
4th opcode: MOV 1,AX   SBC AX,BX ; it is legal to use a register and store into that register concurrently
5th opcode: ADD IP,AX   NOT CF
6th opcode: MOV LX,BX   BCS ; BCS will terminate the primitive if BX>=DX
7th opcode: XCH BX,DX   NXT ; swap BX and DX so BX>=DX then terminate
In opcode #4 we have a group-A instruction writing a 1 into AX and a group-B instruction reading AX (the value is what was in AX already, not the 1 that is being written into AX in parallel). Is it possible to do this with RAM in one clock cycle? A lot of the time, parallelization involves reading and writing the same register in one clock cycle. Actually, all of the opcodes except #4 are exceptions in the above example --- still though, I would expect that reading and writing the same register is pretty common. In some cases, it is required:
Code:
These are combination instructions (have to be special-cased by the assembler because otherwise they won't pack together):
xch AX,LX ; MOV AX,LX and MOV LX,AX packed together ; also called XCH LX,AX
xch AX,BX ; MOV AX,BX and MOV BX,AX packed together ; also called XCH BX,AX
xch AX,DX ; MOV AX,DX and MOV DX,AX packed together ; also called XCH DX,AX
xch AX,EX ; MOV AX,EX and MOV EX,AX packed together ; also called XCH EX,AX
xan LX,AX ; AND LX,AX and MOV AX,LX packed together
xio LX,AX ; IOR LX,AX and MOV AX,LX packed together
xxo LX,AX ; XOR LX,AX and MOV AX,LX packed together
There are some others that can be done like this if needed.
|
Fri Nov 09, 2018 3:32 am |
|