AnyCPU http://anycpu.org/forum/ |
rfx32 http://anycpu.org/forum/viewtopic.php?f=23&t=1008 |
Page 1 of 1 |
Author: | robfinch [ Sun May 21, 2023 8:56 am ] |
Post subject: | rfx32 |
Been working on a project called rfx32 for a couple of days now. The goal behind rfx32 is to be much smaller than Thor. Rfx32 is a 32-bit machine. It is a two-way superscalar processor. Most instructions are 24 bits wide, which should give it good code density. They may have constant postfixes, which extend an instruction up to 80 bits if necessary. It is looking like at least one core will fit easily in the FPGA, and possibly two. |
Author: | robfinch [ Tue May 23, 2023 4:20 am ] |
Post subject: | Re: rfx32 |
The variable-length instructions put the PC increment on the critical timing path. Although the increment was very simple, apparently it was not simple enough. So, the instruction set has now been reworked to a fixed 32 bits in size, which makes the PC increment just the addition of a constant. Although fixed in size, instructions are byte-aligned. Code density is being traded off for a higher clock rate. Shooting for 50 MHz operation and one or more instructions per clock. A dual-core configuration used 99% of the LUTs in the device and, with fixes, is bound to exceed device capacity. So, for now a single core is used. Following the original source code, the data cache was triple-ported; it has now been made double-ported. There is issue logic on only the first six of eight ROB entries. Issue logic grows with each entry supported, and it is unlikely that the last instruction entered in the ROB would be ready to execute as soon as it is queued. There may be a register fetch required before the instruction can issue. |
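A rough behavioural sketch of the restricted issue window described above, in Python rather than HDL. The eight-entry ROB, six-entry issue window and two-way width come from the posts; the `RobEntry` fields, the oldest-first ordering and the `select_issue` name are assumptions made for illustration, not the actual rfx32 logic.

```python
from dataclasses import dataclass

ROB_ENTRIES = 8    # total reorder-buffer entries (from the post)
ISSUE_WINDOW = 6   # only the first six entries have issue logic (from the post)
ISSUE_WIDTH = 2    # two-way superscalar (from the first post)


@dataclass
class RobEntry:
    # All three fields are hypothetical; the real ROB layout is not described.
    valid: bool = False   # entry holds an instruction
    ready: bool = False   # operands are available
    issued: bool = False  # already sent to a functional unit


def select_issue(rob):
    """Pick up to ISSUE_WIDTH entries to issue this cycle.

    Assumes rob[0] is the oldest entry. Entries at index ISSUE_WINDOW
    and above are never considered, which models the missing issue logic.
    """
    picks = []
    for index, entry in enumerate(rob[:ISSUE_WINDOW]):
        if entry.valid and entry.ready and not entry.issued:
            picks.append(index)
            if len(picks) == ISSUE_WIDTH:
                break
    return picks


if __name__ == "__main__":
    rob = [RobEntry() for _ in range(ROB_ENTRIES)]
    for i in (1, 3, 6):
        rob[i] = RobEntry(valid=True, ready=True)
    print(select_issue(rob))  # entry 6 is ready but outside the window
```

In this model an instruction sitting in one of the last two slots simply waits until older entries retire and it moves into the window, which matches the reasoning that a freshly queued instruction is rarely ready anyway.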
Author: | BigEd [ Tue May 23, 2023 8:37 am ] |
Post subject: | Re: rfx32 |
Interesting development! So, this core must be about half the size of Thor? Fixed-length instructions sound like a win. What sort of cache or memory interface do you have, to allow the fetching of more than one instruction per clock? |
Author: | robfinch [ Tue May 23, 2023 2:04 pm ] |
Post subject: | Re: rfx32 |
Quote: Interesting development! So, this core must be about half the size of Thor?
Quote: Fixed-length instructions sound like a win.
Quote: What sort of cache or memory interface do you have, to allow the fetching of more than one instruction per clock?
Because an entire line is fetched, it does not matter where on the line instructions are; the core allows for byte alignment. But I have given some thought to having instructions nybble-aligned. I think it may make it harder to hack the instructions since it is less certain where they are. Nybble alignment would cost a displacement bit in branches, but there are enough bits available. |
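The cost of one displacement bit can be checked with a little arithmetic. The sketch below assumes a signed branch displacement counted in alignment units; the 16-bit field width is a made-up example, since the actual rfx32 branch format is not given in the thread.

```python
def branch_span_bytes(disp_bits, unit_bits):
    """Total address span (backward plus forward) covered by a signed
    displacement of disp_bits bits, where each step is unit_bits bits.

    unit_bits = 8 models byte alignment, unit_bits = 4 nybble alignment.
    """
    span_in_units = 1 << disp_bits          # 2**disp_bits distinct targets
    return span_in_units * unit_bits // 8   # convert bits to bytes


if __name__ == "__main__":
    DISP_BITS = 16  # hypothetical field width, for illustration only
    print(branch_span_bytes(DISP_BITS, 8))      # byte-aligned span
    print(branch_span_bytes(DISP_BITS, 4))      # nybble-aligned, same field
    print(branch_span_bytes(DISP_BITS + 1, 4))  # nybble-aligned, one more bit
```

With the same field width, nybble alignment halves the reachable span; widening the field by one bit restores it.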
Author: | oldben [ Tue May 23, 2023 9:14 pm ] |
Post subject: | Re: rfx32 |
Some early machines, like ATLAS, did virtual memory in hardware. Perhaps one needs to design around that, rather than follow the RISC idea of designing the ALU first and making it run fast. Once you have memory and IO working, then the ALU section can wait or sleep for needed data. To me all the Thor computers are Super Computers (1960's) and need to be designed as Cray would have done. Video graphics needs its bus bandwidth as well. Can video and floating point numbers share a high speed memory format? Do the modern video cards use the PC's main memory, or do they use a special memory? Ben. |
Author: | robfinch [ Wed May 24, 2023 3:25 am ] |
Post subject: | Re: rfx32 |
Quote: Once you have memory and IO working, then the ALU section can wait or sleep for needed data.
Quote: To me all the Thor computers are Super Computers (1960's) and need to be designed as Cray would have done.
Quote: Video graphics needs its bus bandwidth as well. Can video and floating point numbers share a high speed memory format? Do the modern video cards use the PC's main memory, or do they use a special memory?
In the case of the FPGA board in use, memory is shared between all devices, making it a bit of a bottleneck. There is a 64kB system cache between the DDR3 RAM and the rest of the system. The cache allows simultaneous reads by multiple devices. There are also streaming read caches for things like the frame buffer and audio. |
Author: | robfinch [ Wed May 24, 2023 3:30 am ] |
Post subject: | Re: rfx32 |
Added compare and divide operations to the ALU. Cannot get by without compare. It is looking like the core will run at 50 MHz, which is very good. According to the tools it misses timing by 500 ps for the total hold slack. It turns out to be a single signal for the timing clock for the UART, and missing by one 20 MHz clock cycle is not likely to affect UART communications. It is a small percentage of jitter in the timing. Single-thread performance, one of the goals, should be very good. In the past I have had trouble getting a superscalar core to run much past 20 MHz. Added hardware interrupt support. Copying much of the core from Gambit. The divider restarts whenever the input operands change, so it is constantly restarting as ALU instructions execute. It only gets to the done stage for a divide operation. Currently all ALU ops, including multiply but not divide, are done in a single clock cycle. |
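The restart-on-operand-change behaviour of the divider can be modelled at cycle level. This is only a sketch: the class name, the fixed cycle count and the use of unsigned integer division are assumptions, and the real divider's latency and interface are not given in the post.

```python
class RestartingDivider:
    """Cycle-level model of a divider that restarts when its inputs change."""

    def __init__(self, latency=32):
        self.latency = latency   # hypothetical number of cycles per divide
        self.operands = None     # operands latched at the last restart
        self.count = 0           # cycles completed since the last restart

    def clock(self, a, b):
        """Advance one clock with (a, b) on the inputs.

        Returns (done, quotient). quotient is None until done, and also
        None for division by zero, which this sketch does not model further.
        """
        if (a, b) != self.operands:
            # Inputs changed: latch them and start over.
            self.operands = (a, b)
            self.count = 0
            return (False, None)
        if self.count < self.latency:
            self.count += 1
        if self.count == self.latency:
            return (True, a // b if b != 0 else None)
        return (False, None)


if __name__ == "__main__":
    divider = RestartingDivider(latency=4)
    # Operands held steady, as during a divide instruction.
    for _ in range(5):
        done, quotient = divider.clock(7, 2)
    print(done, quotient)
    # Operands changing every cycle, as during a stream of other ALU ops.
    print(any(divider.clock(n, 3)[0] for n in range(10, 20)))
```

While other ALU instructions stream through, the inputs change every cycle and the model never reaches the done state; only a divide, which holds its operands, runs to completion.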
Author: | oldben [ Wed May 24, 2023 4:37 am ] |
Post subject: | Re: rfx32 |
Is the compare done as subtract, or a logic operation like the 7485? Can the flags be delayed to the next pipeline stage if that will speed things up? I assume software takes care of simplifying compares with constant zero, or bit masks like 0x80 or 0x8000 or 0x80000000. Do you have reverse subtract and reverse divide, for software that generates stack based code, or high level subroutine calls for Algol languages? |
Author: | robfinch [ Thu May 25, 2023 3:30 am ] |
Post subject: | Re: rfx32 |
Quote: Is the compare done as subtract, or a logic operation like the 7485?
Quote: Can the flags be delayed to the next pipeline stage if that will speed things up?
The toolset indicates that the result forwarding path from the output of the ALU back to its input is the critical timing limitation. Slower operations like multiply may be what is limiting the clock frequency. I think flag calculation is relatively fast.
Quote: Do you have reverse subtract and reverse divide, for software that generates |
Author: | robfinch [ Thu May 25, 2023 3:35 am ] |
Post subject: | Re: rfx32 |
According to the tools the max clock rate for the core is 58.8 MHz. 60 MHz turns out to be a typical max frequency for several other designs on the web that use result forwarding. I tried building for 66.67 MHz and timing was off by a couple of ns. This appears to be limited by the forwarding network: the path from the ALU output back to an ALU input. Since forwarding adds much to the performance of the machine, it cannot be removed. The system was built using the 50 MHz clock, which is conveniently the rate of the memory controller. I am tempted to try pipelining the multiplier to see if performance can be improved. That would make multiply a multi-cycle operation. |
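The frequencies quoted above can be turned into clock periods to see where the "couple of ns" comes from. The sketch uses only the numbers in the post; the helper name is made up for illustration.

```python
def period_ns(freq_mhz):
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / freq_mhz


if __name__ == "__main__":
    critical_path = period_ns(58.8)   # longest path implied by the max clock
    target_66 = period_ns(66.67)      # period available at 66.67 MHz
    target_50 = period_ns(50.0)       # period available at 50 MHz

    print(f"critical path      : {critical_path:.2f} ns")
    print(f"shortfall at 66.67 : {critical_path - target_66:.2f} ns")
    print(f"slack at 50        : {target_50 - critical_path:.2f} ns")
```

A 58.8 MHz limit implies a critical path of about 17 ns, roughly 2 ns longer than the 15 ns period at 66.67 MHz, and about 3 ns shorter than the 20 ns period at 50 MHz.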
All times are UTC |