robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2205 Location: Canada
|
Mainly debugging today. A lot of it. Getting the register file source set properly for Tomasulo’s algorithm was a challenge. The sources need to be reset on a branch miss, which is slightly tricky to do. It still does not work 100%, but it works about 99.5% correctly; it is missing a rare corner case of some sort. Fixed some dependency logic. Fixing one bug caused the runs to change, revealing more bugs.
_________________Robert Finch http://www.finitron.ca
|
Thu Apr 07, 2022 7:31 am |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2205 Location: Canada
|
Ran into simulation issues. The core is complex enough that the simulator has trouble determining the order of events. So, a few judiciously placed #1’s (unit delays) helped resolve timing issues.
Changed the instruction fetch, decompress, and decode schedulers to simple find-first-one selection. Removed a couple of cross-bar components that were consuming a lot of LUTs and replaced them with a better use of the common data bus.
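As a rough illustration of find-first-one selection, here is a small Python sketch. It is not the core's HDL; the function name, the bit-vector encoding of the request lines and the six-entry width are assumptions made for the example. The scheduler simply picks the lowest-numbered entry whose request bit is set.

```python
def find_first_one(requests: int, width: int) -> int:
    """Return the index of the lowest set bit in `requests`, or -1 if none is set."""
    for i in range(width):
        if (requests >> i) & 1:
            return i
    return -1


# Hypothetical six-entry buffer in which entries 2 and 4 are requesting service.
ready = 0b010100
selected = find_first_one(ready, 6)
```

The appeal of this scheme is that it is a fixed-priority chain with no state, which tends to be cheap in LUTs; the trade-off is that higher-numbered entries can wait longer.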
The core just barely fits in the FPGA. The system build is being left to run over-night.
_________________Robert Finch http://www.finitron.ca
|
Sat Apr 09, 2022 5:55 am |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2205 Location: Canada
|
Reduced the size of the core to a more reasonable level by switching back to 64-bit wide registers. Fixed a few bugs and now the core runs at least until the LEDs activate in simulation. So, time to try it in the FPGA again.
Started working on a couple of other projects, the Black Widow and the Phoenix projects. Had a brief look at designing a 48/96-bit core. The appeal is the lower resource cost.
_________________Robert Finch http://www.finitron.ca
|
Sun Apr 10, 2022 3:07 am |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2205 Location: Canada
|
Trying the core in the FPGA reveals that it hangs due to a return to a bad address. It is supposed to return to $5F0 but returns to $600 instead. <- I got past this error by removing an offset adder from the return address. It was not being used. It should not have made a difference because zero was being added, but it did. Now the core works incorrectly in a different fashion: it seems to be skipping right past return statements without executing them. It works in simulation, so I am mystified. I did see a similar situation during simulation, though, which I believe was a simulator issue.
Latest Mod: the size of the sequence numbers used was reduced to six bits from 48 bits. This saved considerable hardware.
_________________Robert Finch http://www.finitron.ca
|
Mon Apr 11, 2022 3:14 am |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2205 Location: Canada
|
The core is close to working in an FPGA. Running post synthesis simulation works the same way as the core works in the FPGA. It is missing memory write cycles for some reason. Post synthesis and behavioural simulation are giving different results. Since it is memory writes that appear differently I am going to try replacing the vendor-based memory fifo with an in-house queue component. Ultimately the queue component is desirable because it allows for load bypassing which is not possible with a simple fifo. I was going to use the component anyway once things were working.
Found a bug in the memory operations in simulation. Sometimes the load or store was being performed multiple times. A transaction id was added to ensure that a load or store occurs only once.
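A minimal sketch of the transaction id idea, in Python rather than HDL. The class and field names are hypothetical, and the unbounded set of completed ids stands in for whatever bounded tracking the hardware actually uses.

```python
class MemoryPort:
    """Toy memory that performs each request at most once, keyed by transaction id."""

    def __init__(self):
        self.mem = {}
        self.done = set()    # transaction ids that have already been performed
        self.accesses = 0    # number of operations actually carried out

    def store(self, tid, addr, value):
        """Perform the store unless this transaction id was already seen."""
        if tid in self.done:
            return False     # replayed request: ignore it
        self.done.add(tid)
        self.mem[addr] = value
        self.accesses += 1
        return True


port = MemoryPort()
first = port.store(tid=1, addr=0x100, value=42)
replay = port.store(tid=1, addr=0x100, value=42)   # same id, so it is dropped
```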
Adding the queue component did not resolve the memory access bug when running in an FPGA. I dumped a number of signals and found a busy signal that is always active in the synthesised version, but not during simulation. So, I have a starting place to look for the bug.
_________________Robert Finch http://www.finitron.ca
|
Tue Apr 12, 2022 4:23 am |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2205 Location: Canada
|
Got rid of the busy flag that was stuck high all the time. Modified the way the execute stage handles multi-cycle arithmetic operations. They now go into an arithmetic queue instead of setting a busy indicator.
Found a signal next_decompress that had a bit trimmed off of it for post synthesis sim. This leads to REB entries not being decompressed. Have not been able to figure out why this bit got trimmed. It shows as not trimmed in the schematic.
Created a couple of modules from mainline code. Moved the scheduling logic to its own module. Since the next_decompress signal is part of the scheduling logic hopefully it will make things easier to debug.
_________________Robert Finch http://www.finitron.ca
|
Wed Apr 13, 2022 4:40 am |
|
|
BigEd
Joined: Wed Jan 09, 2013 6:54 pm Posts: 1803
|
robfinch wrote:
Latest Mod: the size of the sequence numbers used was reduced to six bits from 48 bits. This saved considerable hardware.
That's quite the reduction! It always makes me a bit nervous when counters are just big enough to count all outstanding things before wrapping around, but it must be that right-sizing is most efficient.
|
Wed Apr 13, 2022 7:17 am |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2205 Location: Canada
|
Quote:
That's quite the reduction! It always makes me a bit nervous when counters are just big enough to count all outstanding things before wrapping around, but it must be that right-sizing is most efficient.
Makes me nervous too. In this case the counters should never wrap around. The first input "count" is fixed at 63. Then things are decremented in the buffer. But since there are only six entries in the buffer, the count should never go below 63-6, or 57. The older entries end up getting overwritten by new ones starting at a count of 63 before they get below 57. I could try using a four-bit counter, but I might want to increase the size of the buffer at some point.
Spent some time working on the Phoenix project. For Thor, broke the multi-cycle ops out of the mainline into their own module. Next to move will be the regular ops. It is desirable to have the datapath as a module so that multiple instances of it may be used for a superscalar design. Moving the scheduling logic to its own module fixed the truncated signal issue, but things still do not work in the FPGA or functional simulation.
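The sequence-number bound discussed in this post can be sanity-checked with a toy model. This is only a sketch under assumptions that are not stated in the thread: every allocation first decrements all live entries by one, and the new entry then overwrites the oldest slot in round-robin order. The names SEQ_BITS, NUM_ENTRIES and lowest_sequence_seen are made up for the example.

```python
SEQ_BITS = 6
SEQ_MAX = (1 << SEQ_BITS) - 1   # 63, the count given to every new entry
NUM_ENTRIES = 6                 # buffer size mentioned in the post


def lowest_sequence_seen(allocations: int) -> int:
    """Return the smallest sequence number any live entry ever holds."""
    entries = [None] * NUM_ENTRIES
    lowest = SEQ_MAX
    for n in range(allocations):
        # Age every live entry by one.
        entries = [None if e is None else e - 1 for e in entries]
        live = [e for e in entries if e is not None]
        if live:
            lowest = min(lowest, min(live))
        # The new entry replaces the oldest slot (assumed round-robin).
        entries[n % NUM_ENTRIES] = SEQ_MAX
    return lowest


floor = lowest_sequence_seen(1000)
```

Under this model the floor settles at 63 - 6 = 57, far from zero, so a six-bit count does not wrap. The same arithmetic with a four-bit count gives 15 - 6 = 9, which is why a smaller counter would work only while the buffer stays small.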
_________________Robert Finch http://www.finitron.ca
|
Thu Apr 14, 2022 3:34 am |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2205 Location: Canada
|
Moving the ALU code out to its own module trimmed a full six minutes off of the synthesis time. The core size has also been reduced. Moving the code must have made the mainline simple enough that the code could be optimized better.
Moved the micro-code to its own module to allow duplication. Fetching two instructions at once means two micro-code instructions need to be fetched as well when active.
Modified the fetch scheduler so that it could schedule two instruction fetches at the same time.
Modding the core for hopefully two-way superscalar operation while at the same time keeping it runnable.
_________________Robert Finch http://www.finitron.ca
|
Fri Apr 15, 2022 4:42 am |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2205 Location: Canada
|
Radically altered branches. They now use general purpose registers as link registers, and the code address register file has been removed. It was getting to be too much of a pita to deal with the extra register set for a superscalar version. The branch instruction formats changed slightly, so the assembler had to be updated.
Also modified the set instructions to work in a manner like branches. They now use a compare method to perform the comparison. The compare method selects comparisons of different data types: integer, float, decimal float, posit and unsigned integer. Given five different comparison methods and six to eight different compare relationships, there are a lot of set instructions. The current set instructions set a value to 0 if the relationship is false; otherwise they set the value to Rc, which may be a small constant like one. So they always set a value. I was thinking of adding a group of set operations which only set to Rc if the condition is true; otherwise they act as a NOP. These two different types of set operations are like the ZSxx and CSxx instructions in the MMIX ISA, so I was also thinking of using those mnemonics.
Currently the core crashes because of a branch miss and a cache miss occurring at the same time. It does not go to the proper address when the miss is processed.
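The two flavours of set operation described in this post can be contrasted with a short Python sketch. The function names are hypothetical and the comparison result is passed in as a boolean, so the compare method itself is not modelled.

```python
def zero_or_set(cond: bool, rc: int) -> int:
    """ZSxx-like: always writes the target, Rc if the condition holds, else 0."""
    return rc if cond else 0


def conditional_set(cond: bool, rc: int, old_rt: int) -> int:
    """CSxx-like: writes Rc if the condition holds, otherwise leaves the target as it was."""
    return rc if cond else old_rt


rt = 99                                  # previous contents of the target register
zs_false = zero_or_set(False, 1)         # target is cleared
cs_false = conditional_set(False, 1, rt) # target is left untouched
```

One practical difference is that the conditional form needs the old target value as an extra source, which may matter for register renaming in an out-of-order core.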
_________________Robert Finch http://www.finitron.ca
|
Sun Apr 17, 2022 5:47 am |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2205 Location: Canada
|
Examining trace output, I was wondering why the I$ hit signal was pulsing low for a single clock cycle on a miss. I was thinking it should be low for about 15 to 20 clocks to load the cache line. Then I remembered there was a victim cache. The victim cache was not working properly. The cache line entry can be swapped with the victim cache entry in a single clock cycle. It was swapping invalid lines from the I$ and marking them valid. This ultimately resulted in a crash.
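A toy Python model of the swap, to show the property that was being violated. The Line record and the swap_with_victim function are hypothetical names, not taken from the core; the point is only that each valid bit has to travel with its own line rather than being forced to valid.

```python
from dataclasses import dataclass


@dataclass
class Line:
    tag: int
    data: bytes
    valid: bool


def swap_with_victim(cache_line: Line, victim_line: Line):
    """Swap the two lines in one step, carrying each valid bit with its line."""
    new_cache = Line(victim_line.tag, victim_line.data, victim_line.valid)
    # Forcing valid=True here would reproduce the kind of bug described above.
    new_victim = Line(cache_line.tag, cache_line.data, cache_line.valid)
    return new_cache, new_victim


stale = Line(tag=0x12, data=b"\x00" * 8, valid=False)   # invalid I$ line
saved = Line(tag=0x34, data=b"\xAA" * 8, valid=True)    # good victim entry
new_cache, new_victim = swap_with_victim(stale, saved)
```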
In theory the core can now retire two results per clock cycle. It never gets to though because the rest of the core does not support two operations per clock yet.
The cache miss, branch miss issue was resolved by always enabling the read port on the I$. It had been disabled during a miss to reduce power consumption. When disabled the I$ was outputting zeros.
_________________Robert Finch http://www.finitron.ca
|
Mon Apr 18, 2022 7:27 am |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2205 Location: Canada
|
There is some sort of forwarding issue in the core now. The register file was moved out to its own module and forwarding logic placed in the register file. The register file can accept two results per clock cycle.
Got rid of the CARRY instruction as being too difficult to implement in the superscalar core. Replaced it by allowing some instructions to produce a second target value in a limited subset of registers, r9 to r15. Some instructions, like ADD and SLL, can also make use of a third source operand from a limited subset of registers. This allows extended precision operations to be performed when needed. Got rid of the multiply-high instructions as the high order product bits can now be made available in a second target register. Divide also produces the remainder in a second target register.
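To illustrate how a second target value and a third source operand enable extended precision, here is a Python sketch using unsigned 64-bit arithmetic. The function names are invented for the example, and treating the second target of ADD as the carry-out is an assumption; the post only says that a second target value is produced.

```python
MASK64 = (1 << 64) - 1


def add_with_carry(a: int, b: int, carry_in: int = 0):
    """Return (low 64 bits, carry out); carry_in plays the role of the third source."""
    total = a + b + carry_in
    return total & MASK64, total >> 64


def mul_wide(a: int, b: int):
    """Return (low 64 bits, high 64 bits) of the unsigned product."""
    product = a * b
    return product & MASK64, product >> 64


def div_rem(a: int, b: int):
    """Return (quotient, remainder) as the two target values."""
    return a // b, a % b


def add128(a_lo, a_hi, b_lo, b_hi):
    """128-bit add built from two 64-bit adds chained through the carry."""
    lo, carry = add_with_carry(a_lo, b_lo)
    hi, _ = add_with_carry(a_hi, b_hi, carry)
    return lo, hi


result = add128(MASK64, 0, 1, 0)   # (2**64 - 1) + 1
```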
_________________Robert Finch http://www.finitron.ca
|
Tue Apr 19, 2022 8:47 am |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2205 Location: Canada
|
Got back to working some more on Thor2022; the core is constantly evolving. Changed the way vector registers work. They are now 512-bit wide (or wider) SIMD registers with multiple 64-bit lanes. This makes the core too large, so for now 128-bit wide registers which have just two lanes are being used. Made the number of lanes a configuration parameter. Previously vector registers were in an array and processed with a loop in the fetch stage. Now they are simply treated as a wide register.
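A small Python sketch of treating a vector register as one wide value made of 64-bit lanes, with the lane count as a parameter. The helper names are hypothetical, and a plain lane-wise add stands in for whatever operations the core actually supports.

```python
LANE_BITS = 64
LANE_MASK = (1 << LANE_BITS) - 1


def split_lanes(reg: int, num_lanes: int):
    """Break a wide register value into a list of 64-bit lanes, lane 0 first."""
    return [(reg >> (i * LANE_BITS)) & LANE_MASK for i in range(num_lanes)]


def join_lanes(lanes):
    """Pack a list of 64-bit lanes back into one wide register value."""
    reg = 0
    for i, lane in enumerate(lanes):
        reg |= (lane & LANE_MASK) << (i * LANE_BITS)
    return reg


def vadd(a: int, b: int, num_lanes: int = 2) -> int:
    """Lane-wise add; carries do not propagate from one lane into the next."""
    sums = [(x + y) & LANE_MASK
            for x, y in zip(split_lanes(a, num_lanes), split_lanes(b, num_lanes))]
    return join_lanes(sums)


a = join_lanes([LANE_MASK, 5])   # 128-bit register, two lanes
b = join_lanes([1, 7])
lanes_out = split_lanes(vadd(a, b, 2), 2)
```

Changing num_lanes from 2 to 8 models the 512-bit configuration without touching the rest of the code, which is roughly the benefit of making the lane count a parameter.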
Thinking about going back to the one result per instruction paradigm.
_________________Robert Finch http://www.finitron.ca
|
Wed Aug 10, 2022 6:10 am |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2205 Location: Canada
|
Added vector mask instructions to the ALU. These had not been implemented yet. Since the vector mask registers look a lot like scalar registers they have been lumped in with the scalar register file. The register file now contains 40 entries. Added a second class of ALU for the subset of vector instructions which operate on the entire vector register as a unit, as opposed to operating on individual elements. These include, for instance, the vector ‘slide’ instructions VSLLV and VSRLV.
_________________Robert Finch http://www.finitron.ca
|
Thu Aug 11, 2022 4:57 am |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2205 Location: Canada
|
Ran the FPGA and it crashed at address $0000C0 after performing several instructions and jumping to main(). I still have not determined why it would suddenly transfer to that address.
Latest modifications: back to using dedicated link and count registers again. The link and count registers ended up as an extension of the GPR file anyway when implemented. This frees up three general purpose registers. For branch instructions, the link registers are visible as r29 and r30 which are normally the frame pointer and global data pointer. It does not make sense to branch to these registers, so the link registers were substituted instead. This makes conditional return from subroutine possible. The code for r31 is already in use to represent the instruction pointer instead of the stack pointer.
Added a dedicated LDI, load immediate, instruction which allows loading immediates into the vector mask registers, link registers and count register in addition to the GPRs.
Having trouble getting some of the vector instructions like vector compress, VCMPRSS, coded properly. I can code a version, but when implementation is run the tools complain about multiple signal drivers. So, some more work is needed.
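For reference, here is one way to express a compress so that every destination lane is computed in exactly one place, which is the shape that usually avoids multiple-driver complaints when carried over to HDL. This is a Python sketch under assumptions: the exact semantics of VCMPRSS are not given in the thread, so the common definition is used (lanes selected by the mask are packed into the low lanes, in order), and filling the unused upper lanes with zero is a choice made here.

```python
def vcompress(lanes, mask: int):
    """Pack the lanes whose mask bit is set into the low end of the result."""
    n = len(lanes)
    out = []
    for dst in range(n):
        # Each destination lane gets its value from a single expression:
        # the source lane that is the dst-th selected lane, or zero if none.
        value = 0
        seen = 0
        for src in range(n):
            if (mask >> src) & 1:
                if seen == dst:
                    value = lanes[src]
                seen += 1
        out.append(value)
    return out


packed = vcompress([10, 20, 30, 40], 0b1010)   # lanes 1 and 3 selected
```

Writing it per destination costs more comparators than a per-source loop with a running write index, but it keeps one driver per output lane.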
Changed the nomenclature surrounding the set instructions. Now calling them compare, CMP, instructions. There are two sets of compare instructions, one set for common integer compares and a second set for floating point, decimal floating point, posit, or integer compares. The most common compare operations are 32-bit instructions, other compares use a 48-bit format.
Read up on the HPET timers.
Worked on the programmable interval timer. Upgraded the timer to 32 timers from four. One never can have enough hardware event timers. Added interrupt capability for the timers and a synchronization register. All timer controls may be updated at the same time from holding registers by writing a synchronization register. Thinking of adding a second PIT component for a total of 64 timers.
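The holding-register scheme can be sketched in Python as follows. The class and method names are hypothetical, only a single count value per timer is modelled, and treating the synchronization register as a per-timer select mask is an assumption; the post says only that writing it updates the timer controls at the same time.

```python
class TimerBank:
    """Toy model of timers with holding registers and a synchronization write."""

    def __init__(self, num_timers: int = 32):
        self.holding = [0] * num_timers   # staged values, written one at a time
        self.active = [0] * num_timers    # values the timers actually use

    def write_holding(self, index: int, value: int) -> None:
        """Stage a new value without disturbing the running timer."""
        self.holding[index] = value

    def write_sync(self, select_mask: int) -> None:
        """Copy holding to active for every selected timer in one step."""
        for i in range(len(self.active)):
            if (select_mask >> i) & 1:
                self.active[i] = self.holding[i]


bank = TimerBank()
bank.write_holding(0, 100)
bank.write_holding(3, 200)
before = list(bank.active)        # nothing has taken effect yet
bank.write_sync(0b1001)           # timers 0 and 3 update together
```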
_________________Robert Finch http://www.finitron.ca
|
Fri Aug 12, 2022 4:23 am |
|