Author |
Message |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2153 Location: Canada
|
Been working on a project called rfx32 for a couple of days now. The goal behind rfx32 is to be much smaller than Thor. Rfx32 is a 32-bit machine. It is a two-way superscalar processor. Most instructions are 24 bits, which should give it good code density. They may have constant postfixes which extend the instruction up to 80 bits if necessary. It is looking like at least one core will fit easily in the FPGA, and possibly two.
_________________Robert Finch http://www.finitron.ca
|
Sun May 21, 2023 8:56 am |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2153 Location: Canada
|
The variable length instructions put the PC increment on the critical timing path. Although the increment was very simple, apparently it was not simple enough. So, the instruction set has now been reworked to be a fixed 32 bits in size, and the PC increment is just the add of a constant. While fixed in size, instructions are byte-aligned. Code density is being traded off for a higher clock rate. Shooting for 50 MHz operation and 1 or more instructions per clock. A dual core configuration used 99% of the LUTs in the device and with fixes is bound to exceed device capacity. So, for now a single core is used. Following the original source code, the data cache was triple-ported; this has now been made double-ported.
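A rough sketch of why the fixed size helps, written in Python rather than HDL. The `length_of` function and the byte values are hypothetical, since the thread does not give the rfx32 length encoding. With fixed 32-bit instructions, both addresses in a two-way fetch are the PC plus a constant; with variable length, the second address has to wait for the first length decode.

```python
INSN_BYTES = 4  # fixed 32-bit instructions, as described in the post


def next_pcs_fixed(pc):
    """Return (first insn addr, second insn addr, next fetch addr).

    Every value is pc plus a constant, so all three can be formed in parallel.
    """
    return pc, pc + INSN_BYTES, pc + 2 * INSN_BYTES


def next_pcs_variable(pc, memory, length_of):
    """Same three addresses for a variable-length encoding.

    length_of is a hypothetical decoder mapping a first byte to a length in
    bytes. The second decode cannot start until the first one has finished,
    which is the serial chain that lands on the critical path.
    """
    len0 = length_of(memory[pc])
    pc1 = pc + len0
    len1 = length_of(memory[pc1])
    return pc, pc1, pc1 + len1
```

The variable-length version is only meant to show the dependency between the two decodes, not how any particular core implements it.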
There is issue logic on only the first six of eight ROB entries. Issue logic grows for each entry supported and it is unlikely that the last instruction entered in the ROB would be ready to execute as soon as it is queued. There may be a register fetch required before the instruction can issue.
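As an illustration of the restricted issue window, here is a minimal Python sketch. It assumes the ROB is held oldest-first, that "the first six" means the six oldest entries, and that at most two instructions issue per clock; the field names are made up for the example.

```python
ROB_ENTRIES = 8    # total reorder buffer entries (from the post)
ISSUE_WINDOW = 6   # only these entries have issue logic (from the post)
ISSUE_WIDTH = 2    # assumed: two-way issue to match the two-way core


def select_for_issue(rob):
    """Pick up to ISSUE_WIDTH entry indexes that may issue this clock.

    rob is a list of ROB_ENTRIES dicts ordered oldest first, each with the
    hypothetical boolean fields 'valid', 'ready' and 'issued'. Entries beyond
    ISSUE_WINDOW are never considered, even if they happen to be ready.
    """
    picked = []
    for idx, entry in enumerate(rob[:ISSUE_WINDOW]):
        if entry["valid"] and entry["ready"] and not entry["issued"]:
            picked.append(idx)
            if len(picked) == ISSUE_WIDTH:
                break
    return picked
```

The two youngest entries simply wait until older instructions retire and they move into the window, which is usually about when their operands are available anyway.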
_________________Robert Finch http://www.finitron.ca
|
Tue May 23, 2023 4:20 am |
|
|
BigEd
Joined: Wed Jan 09, 2013 6:54 pm Posts: 1796
|
Interesting development! So, this core must be about half the size of Thor?
Fixed length instructions sounds like a win. What sort of cache or memory interface do you have, to allow the fetching of more than one instruction per clock?
|
Tue May 23, 2023 8:37 am |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2153 Location: Canada
|
Quote: Interesting development! So, this core must be about half the size of Thor?
Yes, this core is significantly smaller than Thor, at least four times smaller. The current incarnation of Thor has SIMD-style vector registers, so the data path is much wider. Thor needs a large chip, which I am trying to find without breaking my budget.
Quote: Fixed length instructions sounds like a win.
I had thought that having variable length instructions would not make so much difference, but I now see it makes enough difference to keep things fixed in size. Cascading length decoders, at least in an FPGA, are too slow. I think it could be done if things were at the transistor level and did not involve table lookup and routing. It is being done for commercial processors. So, I am going back to Thor again to change the instruction set to a fixed-size one, likely 40 bits.
Quote: What sort of cache or memory interface do you have, to allow the fetching of more than one instruction per clock?
Fetching multiple instructions per clock is not too bad if the entire cache line is fetched. When an instruction is fetched, the entire cache line the instruction is on is fetched, plus the next one in case the instruction spans a cache line. Cache lines are long enough that several instructions (8) will fit. Eight or sixteen instructions are fetched at once, but only two are used. Although many instructions are fetched, the two desired ones are somewhere on the cache line, so they must be aligned to the right-hand side for processing. The cache design is one for variable length instructions, where the instructions could be up to 256 bits long. So, it is overkill for processing two fixed 32-bit instructions at a time. But I can just take the cache and plug it into different designs with only minor changes. I may adjust it at some point to reduce the footprint if possible. Because an entire line is fetched, it does not matter where on the line instructions are; the core allows for byte alignment.
But, I have given some thought to having instructions nybble aligned. I think it may make it harder to hack the instructions since it is less certain where they are. Nybble alignment would cost a displacement bit in branches, but there are enough bits available.
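The fetch-and-align step described above can be sketched in Python as follows. This assumes a 256-bit (32-byte) line, which is what eight 32-bit instructions per line implies, and little-endian byte order; the `cache` dictionary and the function name are made up for the example.

```python
LINE_BYTES = 32        # assumed: eight 32-bit instructions per cache line
INSN_MASK = 0xFFFFFFFF


def fetch_pair(cache, pc):
    """Return the two 32-bit instructions starting at byte address pc.

    cache maps a line number to LINE_BYTES bytes. The current line and the
    next one are joined so an instruction that spans the line boundary is
    still complete, then the window is shifted right so the wanted
    instructions sit in the low-order ("right-hand") bits.
    """
    line_no, offset = divmod(pc, LINE_BYTES)
    window = cache[line_no] + cache[line_no + 1]
    bits = int.from_bytes(window, "little") >> (8 * offset)
    insn0 = bits & INSN_MASK
    insn1 = (bits >> 32) & INSN_MASK
    return insn0, insn1
```

With nybble alignment the shift amount would be `4 * offset` for an offset counted in nybbles, which is where the extra branch displacement bit comes from.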
_________________Robert Finch http://www.finitron.ca
|
Tue May 23, 2023 2:04 pm |
|
|
oldben
Joined: Mon Oct 07, 2019 2:41 am Posts: 627
|
Some early machines, like ATLAS, did virtual memory in hardware. Perhaps one needs to design around that, rather than follow the RISC idea of designing the ALU first and making it run fast. Once you have memory and IO working, then the ALU section can wait or sleep for needed data. To me all the Thor computers are supercomputers (1960's) and need to be designed as Cray would have done. Video graphics needs its bus bandwidth as well. Can video and floating point numbers share a high speed memory format? Do modern video cards use the PC's main memory, or do they use a special memory? Ben.
|
Tue May 23, 2023 9:14 pm |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2153 Location: Canada
|
Quote: Once you have memory and IO working, then the ALU section can wait or sleep for needed
I would have to agree that memory can be a bottleneck and some work needs to be done to get it to go fast. The external memory comes into play at the cache interface. Between the CPU and the cache it is about as fast as it can get. As long as it is faster than the cycle time of the core it will not be on the critical timing path. External memory is so much slower than the core, even at 50 MHz, that caches are really needed. Although the DDR3 RAM is incredibly fast, it is shared with other devices in the system. I have found that during startup the ALU does indeed sit idle for extended periods while the cache loads. Part of the incentive for an OoO machine is that memory latency for loads and stores is hidden; other instructions can execute while a load is taking place.
Quote: To me all the Thor computers are supercomputers (1960's) and need to be designed as Cray would have done.
I am not sure about that. They have some more modern features, but a lot of the design probably has not changed since the 60's. I have studied more recent designs more thoroughly. I would think Thor is more influenced by designs like DLX, MIPS, Power ISA, RISC-V, etc.
Quote: Video graphics needs its bus bandwidth as well. Can video and floating point numbers share a high speed memory format? Do modern video cards use the PC's main memory, or do they use a special memory?
I believe most modern video uses 32-bit single precision or lower floats for performance. Graphics cards typically have their own memory. It has been a while since I skimmed through the Nvidia docs. In the case of the FPGA board in use, memory is shared between all devices, making it a bit of a bottleneck. There is a 64kB system cache between the DDR3 RAM and the rest of the system. The cache allows simultaneous reads by multiple devices. There are also streaming read caches for things like the frame buffer and audio.
_________________Robert Finch http://www.finitron.ca
|
Wed May 24, 2023 3:25 am |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2153 Location: Canada
|
Added compare and divide operations to the ALU. Cannot get by without compare. It is looking like the core will run at 50 MHz, which is very good. According to the tools it misses timing by 500 ps for the total hold slack. It turns out to be a single signal for the timing clock for the UART, and missing by one 20 MHz clock cycle is not likely to affect UART communications. It is a small percentage of jitter in the timing.
Single thread performance should be very good, one of the goals. In the past I have had trouble getting a superscalar core to run much past 20 MHz.
Added hardware interrupt support. Copying much of the core from Gambit.
The divider restarts whenever the input operands change. So, it is constantly restarting as ALU instructions execute. It only gets to the done stage for a divide operation. Currently all ALU ops except divide, including multiply, are done in a single clock cycle.
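The restart behaviour can be modelled with a small Python simulation. The post does not say which algorithm the divider uses, so this sketch assumes a one-bit-per-clock unsigned restoring divider; the class and field names are made up, and divide by zero is not modelled.

```python
class RestartingDivider:
    """Clock-by-clock model of a divider that restarts when operands change."""

    WIDTH = 32  # operand width in bits

    def __init__(self):
        self.a = None
        self.b = None
        self.count = 0
        self.quotient = 0
        self.remainder = 0
        self.done = False

    def clock(self, a, b):
        """Advance one clock with operand inputs a and b (unsigned, b != 0)."""
        if (a, b) != (self.a, self.b):
            # Operands changed: latch them and start over.
            self.a, self.b = a, b
            self.count = self.WIDTH
            self.quotient = 0
            self.remainder = 0
            self.done = False
            return
        if self.count > 0:
            # One restoring-division step, most significant bit first.
            bit = (self.a >> (self.count - 1)) & 1
            self.remainder = (self.remainder << 1) | bit
            self.quotient <<= 1
            if self.remainder >= self.b:
                self.remainder -= self.b
                self.quotient |= 1
            self.count -= 1
            self.done = self.count == 0
```

Holding the operands steady for one load clock plus 32 step clocks reaches `done`; feeding new operands every clock, as a stream of single-cycle ALU ops would, never does.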
_________________Robert Finch http://www.finitron.ca
|
Wed May 24, 2023 3:30 am |
|
|
oldben
Joined: Mon Oct 07, 2019 2:41 am Posts: 627
|
Is the compare done as subtract, or a logic operation like the 7485? Can the flags be delayed to the next pipeline stage if that will speed things up? I assume software takes care of simplifying compares with constant zero, or bit masks like 0x80 or 0x8000 or 0x80000000. Do you have reverse subtract and reverse divide, for software that generates stack based code, or high level subroutine calls for Algol languages?
|
Wed May 24, 2023 4:37 am |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2153 Location: Canada
|
Quote: Is the compare done as subtract, or a logic operation like the 7485?
Compare is done as a logic operation using Verilog's built-in '<', '>' and '==' operators.
Quote: Can the flags be delayed to the next pipeline stage if that will speed things up?
"Flags" are stored in general purpose registers. The toolset indicates that the result forwarding path from the output of the ALU back to its input is the critical timing limitation. Slower operations like multiply may be what is limiting the clock frequency. I think flag calculation is relatively fast.
Quote: Do you have reverse subtract and reverse divide, for software that generates
That is in the latest Thor ISA. Either operand may be an immediate value.
_________________Robert Finch http://www.finitron.ca
|
Thu May 25, 2023 3:30 am |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2153 Location: Canada
|
According to the tools, the max clock rate for the core is 58.8 MHz. 60 MHz turns out to be a typical max frequency for several other designs on the web that use result forwarding. I tried building for 66.67 MHz and timing was off by a couple of ns. This appears to be limited by the forwarding network: the path from the ALU output back to an ALU input. Since forwarding adds much to the performance of the machine, it cannot be removed. The system was built using the 50 MHz clock, which is conveniently the rate of the memory controller. I am tempted to try pipelining the multiplier to see if performance can be improved. That would make multiply a multi-cycle operation.
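For reference, the figures above line up when converted to clock periods. This is just arithmetic on the numbers quoted in the post, written out in Python:

```python
def period_ns(freq_mhz):
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / freq_mhz


max_period = period_ns(58.8)       # about 17.0 ns: longest path the tools report
target_period = period_ns(66.67)   # about 15.0 ns: the build that failed timing
system_period = period_ns(50.0)    # 20.0 ns: the clock actually used

shortfall = max_period - target_period   # about 2.0 ns, "a couple of ns"
margin = system_period - max_period      # about 3.0 ns of slack at 50 MHz
```

So the 66.67 MHz build would need roughly 2 ns shaved off the forwarding path, while the 50 MHz build has about 3 ns to spare.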
_________________Robert Finch http://www.finitron.ca
|
Thu May 25, 2023 3:35 am |
|