Author |
Message |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2187 Location: Canada
|
As a simple means of keeping instruction arguments updated, I had the following code, which does an exhaustive search for new results that can be placed in argument registers. The code is very simple but not very hardware efficient; maybe not the brightest idea. It does an N x M search of hardware registers.
Code:
    for (n = 0; n < 64; n = n + 1) begin
      for (m = 0; m < 64; m = m + 1) begin
        if (!rob[n].iav && rob[n].ias.rid==m && rob[m].cmt2) begin
          rob[n].iav <= TRUE;
          rob[n].ia <= rob[m].res;
        end
        if (!rob[n].ibv && rob[n].ibs.rid==m && rob[m].cmt2) begin
          rob[n].ibv <= TRUE;
          rob[n].ib <= rob[m].res;
        end
        if (!rob[n].icv && rob[n].ics.rid==m && rob[m].cmt2) begin
          rob[n].icv <= TRUE;
          rob[n].ic <= rob[m].res;
        end
        if (!rob[n].idv && rob[n].ids.rid==m && rob[m].cmt2) begin
          rob[n].idv <= TRUE;
          rob[n].id <= rob[m].res;
        end
        if (!rob[n].itv && rob[n].its.rid==m && rob[m].cmt2) begin
          rob[n].itv <= TRUE;
        end
      end
    end
It is much more hardware frugal to simply search the outputs of the functional units as results are produced. It is still an N x M search, but it is much smaller because there are far fewer functional units than reorder buffer entries. In this case there are currently only four set up. It was necessary to create an abstraction for the functional unit output so that the outputs can be searched by index rather than testing each functional unit output separately.
Code:
    for (n = 0; n < 64; n = n + 1) begin
      for (m = 0; m < 4; m = m + 1) begin
        if (!rob[n].iav && rob[n].ias.rid==funcUnit[m].rid) begin
          rob[n].iav <= TRUE;
          rob[n].ia <= funcUnit[m].res;
        end
        if (!rob[n].ibv && rob[n].ibs.rid==funcUnit[m].rid) begin
          rob[n].ibv <= TRUE;
          rob[n].ib <= funcUnit[m].res;
        end
        if (!rob[n].icv && rob[n].ics.rid==funcUnit[m].rid) begin
          rob[n].icv <= TRUE;
          rob[n].ic <= funcUnit[m].res;
        end
        if (!rob[n].idv && rob[n].ids.rid==funcUnit[m].rid) begin
          rob[n].idv <= TRUE;
          rob[n].id <= funcUnit[m].res;
        end
        if (!rob[n].itv && rob[n].its.rid==funcUnit[m].rid) begin
          rob[n].itv <= TRUE;
        end
      end
    end
Added register renaming to the core. It now uses 128 physical registers for the scalar registers, with 64 available rename registers.
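To make the size difference between the two searches concrete, here is a small behavioral model in Python rather than HDL. It is only a sketch: the class and function names (`Operand`, `RobEntry`, `FuncUnitOut`, `broadcast`, `comparator_count`) are invented for illustration, and it assumes five tracked operands per entry (a, b, c, d, t) as in the code above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Operand:
    valid: bool = False   # models rob[n].iav, ibv, ...
    src_rid: int = -1     # models rob[n].ias.rid: the entry that produces the value
    value: int = 0        # models rob[n].ia, ib, ...


@dataclass
class RobEntry:
    # Five operands per entry (a, b, c, d, t), as in the posted code.
    operands: List[Operand] = field(default_factory=lambda: [Operand() for _ in range(5)])


@dataclass
class FuncUnitOut:
    rid: int   # reorder buffer id of the instruction whose result is on this output
    res: int   # the result value


def broadcast(rob: List[RobEntry], func_units: List[FuncUnitOut]) -> None:
    """Capture functional unit results into operands that are waiting for them."""
    for entry in rob:
        for fu in func_units:
            for op in entry.operands:
                if not op.valid and op.src_rid == fu.rid:
                    op.valid = True
                    op.value = fu.res


def comparator_count(entries: int, sources: int, operands_per_entry: int = 5) -> int:
    """Number of rid comparisons the nested loops unroll into."""
    return entries * sources * operands_per_entry


rob = [RobEntry() for _ in range(64)]
rob[3].operands[0].src_rid = 7            # entry 3, argument a, waits on entry 7
broadcast(rob, [FuncUnitOut(rid=7, res=42)])
```

Under these assumptions, `comparator_count(64, 64)` is 20480 comparisons for the exhaustive search and `comparator_count(64, 4)` is 1280 for the functional unit search, a factor of 16 fewer.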
_________________Robert Finch http://www.finitron.ca
|
Fri May 28, 2021 3:35 am |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2187 Location: Canada
|
Got back to working on the victim cache. It now takes two clock cycles to update the I$ from the victim cache, which is a heck of a lot better than updating from memory. There are only L1 caches in the core, no L2$. There is no real reason to use a second-level cache. The D$ is already using block RAM for storage of cache lines. It is 16kB, which is probably adequate for the system.
Tracked down the simulator issue to the following line of code:
It is the instruction pointer update, which happens as sequentially clocked logic. Note that a non-blocking assignment is used. I think the issue is that for some reason the simulator is treating this like a blocking assignment that takes place concurrently. If the ip were to change concurrently there would be an issue, because it could change whether or not there was a cache hit, which feeds back to control the ip. To confirm things I tried changing the line to this:
which keeps the ip clocked to the same value, and then the sim no longer quits with the error. But it also does not run the program, so I cannot tell if things are working.
_________________Robert Finch http://www.finitron.ca
|
Sat May 29, 2021 4:48 am |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2187 Location: Canada
|
Had a “duh” moment today. I implemented the I$ using LUT RAM followed by another stage to extract aligned instructions. It just occurred to me that the equivalent could be done using dual-port block RAM, which allows the two ports to be different sizes. The I$ uses about 5,000 LUTs, so it would be great to be able to use block RAM instead. I had originally chosen LUT RAM for its single-cycle access, but adding an instruction extractor adds an additional clock cycle, so it is not any faster than the block RAM. And the block RAM has a much larger capacity. I$ is now 32kB.
Moved the LEA instruction from having its own opcode at the root level to being an option for the indexed load operations. LEA is implemented only for the case of indexed loads and not register indirect with displacement loads as computing the indexed address is more complicated, due to index scaling. Otherwise, the ADD instruction can be used to calculate the address.
Adding some of the first vector instructions today. There was some difficulty figuring out what to do with the register file source signal. There cannot be a separate source signal for all 4096 vector elements. It is not practical to make a history backup of sources for branch miss processing. Decided to try tackling a couple of simpler vector instructions. The chosen ones were ABS and NOT. They are now coded and should work if there are no bugs. The register file source is dealt with by updating only at the end of a vector operation, and tracking only the register number, not all the individual elements. Another commit signal was added to indicate the end of a vector operation.
_________________Robert Finch http://www.finitron.ca
|
Sun May 30, 2021 4:20 am |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2187 Location: Canada
|
Added most of the remaining integer vector instructions. Synthesizing reveals the core size to be about 85,000 LUTs. This is too large for the xc7a100 device but will easily fit in an xc7a200. I have been wanting to use a core as part of a home-built machine using an xc7a100 device. The core will fit into an xc7a100 without the vector instructions, so I may end up using that.
Toying with the support of string instructions. Strings could easily be loaded into vector registers. 64 elements is large enough to load about a 190-character UTF21 string. ASCII strings up to 512 characters could be loaded into a vector register. Having the string in a vector register is interesting. It opens the possibility of string-specific instructions. There is already the BYTNDX instruction to find a byte in a scalar register. It could be extended to work for a vector register. Finding the BYTNDX of a zero byte computes the string length, for instance.
Learning new tricks. In the code there was a structure assignment
Code:
    rob[rob_exec-1] <= robo;
where rob is an array of structs and robo is a single struct. Since this gave a warning during synthesis about blocking/non-blocking assignments, I decided to re-write the code to assign structure elements individually, as in:
Code:
    rob[rob_exec-2'd1].wr_fu <= robo.wr_fu;
    rob[rob_exec-2'd1].takb <= robo.takb;
    rob[rob_exec-2'd1].cause <= robo.cause;
    rob[rob_exec-2'd1].res <= robo.res;
    rob[rob_exec-2'd1].cmt <= robo.cmt;
    rob[rob_exec-2'd1].cmt2 <= robo.cmt2;
    rob[rob_exec-2'd1].vcmt <= robo.vcmt;
The re-written code got rid of the warning message and also had a great side effect. The code synthesized to a size savings of 30,000 LUTs! 58,000 LUTs instead of 90,000! As I understand it, this occurs because without assigning individual struct members the synthesizer must copy across *all* the struct members, including ones that do not need to change. This generates a lot more hardware.
So, the trick to minimizing the logic generated is to not use structure assignments. Instead copy the fields individually. It looks like it is back to fitting in a xc7a100 *with* the vector operations.
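The string capacity figures mentioned earlier in this post can be checked with a little arithmetic. The sketch below is Python, not part of the core, and it assumes 64-bit vector elements (the post only gives the element count) and that "UTF21" means 21 bits per character. The function names are made up for illustration.

```python
VECTOR_ELEMENTS = 64   # from the post
ELEMENT_BITS = 64      # assumption: one vector element is a 64-bit register


def chars_packed_per_element(bits_per_char: int) -> int:
    """Characters that fit if each character must sit wholly inside one element."""
    return (ELEMENT_BITS // bits_per_char) * VECTOR_ELEMENTS


def chars_packed_across_register(bits_per_char: int) -> int:
    """Characters that fit if the register is treated as one long bit string."""
    return (ELEMENT_BITS * VECTOR_ELEMENTS) // bits_per_char


utf21_per_element = chars_packed_per_element(21)      # three per element
utf21_across = chars_packed_across_register(21)
ascii_chars = chars_packed_per_element(8)
```

With these assumptions, three 21-bit characters fit in each element for 192 characters per register (195 if characters may straddle elements), which agrees with the "about 190" figure, and 8-bit characters give the 512 quoted for ASCII.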
_________________Robert Finch http://www.finitron.ca
|
Mon May 31, 2021 2:32 am |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2187 Location: Canada
|
Finally got past the simulator issue. It took about four or five days. It turned out that the dependency list needed to be explicitly listed in one always block instead of allowing the simulator to resolve them.
Issues with the pipeline tonight. It insists on stuffing a zero for the instruction into the pipeline when there is a cache miss. I’ve been playing with pipeline enables with a variety of effects, none of them correct. It’s a puzzle at the moment.
Floating point support has been added. It cost about 7,000 LUTs.
_________________Robert Finch http://www.finitron.ca
|
Tue Jun 01, 2021 6:13 am |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2187 Location: Canada
|
Been adding support for a variety of instructions. String operations with vector registers should be a breeze. It should be possible to perform a concatenate using only five or six vector instructions and no loops.
Code:
    strlen:
      bytndx r1,v1,#0
    strcat:
      bytndx r1,v1,#0   ; find the null character in a vector register
      vsllv  v2,v2,r1   ; shift string to concatenate into position
      mtvl   r1         ; set the vector length to length of string
      sync              ; make sure vl updated
      mov    v2,v1      ; copy into the low part for string concat
      ; now v2 = v1 concat v2
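As a sanity check of the strlen and strcat sequences, here is a byte-level model in Python. It is a sketch under assumptions the post does not state: the vector register is treated as 512 bytes, the shift amount and the vector length are both counted in bytes, and bytes shifted past the top of the register are dropped. The helper names are hypothetical.

```python
VREG_BYTES = 512  # assumption: 64 elements of 64 bits each


def to_vreg(s: bytes) -> bytearray:
    """Load a string into a zero-filled vector register, leaving a terminator."""
    if len(s) >= VREG_BYTES:
        raise ValueError("string does not fit with its null terminator")
    return bytearray(s) + bytearray(VREG_BYTES - len(s))


def bytndx(v: bytearray, byte: int) -> int:
    """Model of BYTNDX on a vector register: index of the first matching byte, or -1."""
    return v.find(bytes([byte]))


def strcat(v1: bytearray, v2: bytearray) -> bytearray:
    """Model of the posted sequence; returns the new contents of v2."""
    n = bytndx(v1, 0)                      # bytndx r1,v1,#0
    if n < 0:
        raise ValueError("v1 has no null terminator")
    out = bytearray(n) + v2[:VREG_BYTES - n]   # vsllv v2,v2,r1 (shift up by n bytes)
    out[:n] = v1[:n]                       # mtvl r1 ; sync ; mov v2,v1 (copy n bytes)
    return out


result = strcat(to_vreg(b"abc"), to_vreg(b"def"))
length = bytndx(result, 0)                 # strlen of the concatenated string
```

In this model the result register starts with `abcdef` followed by a zero byte, so the final `bytndx` returns 6.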
Been thinking about how to support 128-bit decimal floating point when the registers and busses are only 64-bit. Pairs of registers could be used to hold 128-bit values.
_________________Robert Finch http://www.finitron.ca
|
Thu Jun 03, 2021 2:44 am |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2187 Location: Canada
|
Got the notion of storing the current vector step number in the instruction pointer’s most significant bits, making the step number part of the instruction pointer. This is so that it may easily be saved and restored for exception processing. Doing this also opens up the possibility of branching into the middle of a vector operation. I cannot think of why this would be useful, however. Scrapped that idea pretty quickly. Instead, the exceptioned step number is stored in the estep register. Vector instructions will be interruptible, with enough state being saved so that they may be resumed. There is not a lot of intermediate state that needs to be saved for vector instructions.
Thinking about exception stack frames and whether or not to stack them automatically.
It looks like vector instructions are being pushed onto the reorder buffer correctly. Got some timing for a simple test bench program and it looks good. After subtracting out the stalls the IPC is 0.86, or almost one instruction per clock cycle. This is very good for a scalar machine. Considering that the program does several stores that are about 10 cycles each, it can be seen that the memory access time is hidden.
Quote:
    ticks: 277
    executed: 178
    ifStalls: 70
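The 0.86 figure follows from the quoted counters if the instruction fetch stalls are subtracted from the tick count before dividing. A short Python check (the function name is just for illustration):

```python
def ipc(ticks: int, executed: int, stalls: int = 0) -> float:
    """Instructions per clock, optionally excluding stall cycles."""
    return executed / (ticks - stalls)


raw_ipc = ipc(ticks=277, executed=178)                  # stalls included
adjusted_ipc = ipc(ticks=277, executed=178, stalls=70)  # stalls subtracted
```

Here `adjusted_ipc` is 178 / 207, which rounds to 0.86, while the raw figure with stalls left in is about 0.64.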
_________________Robert Finch http://www.finitron.ca
|
Fri Jun 04, 2021 3:22 am |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2187 Location: Canada
|
Added a generic reduction ‘or’ module for the floating point multiply operation. Fixed up the vector shift instructions. The vector shifting instructions can treat the vector as a 4096-bit wide register which can be shifted left or right up to 4096 bits.
Fixing a major goof-up is pending. The issue is that vector instructions cannot make use of instruction modifiers, because vector instructions are pushed to the re-order buffer repeatedly, once for each step of the vector operation. The instruction modifier needs to be present for every step pushed, meaning that possibly several instructions need to be pushed to the re-order buffer as a group. Scalar instructions do not require instruction modifiers to live beyond their one use. Pushing instructions to the re-order buffer was far simpler when all instructions were 64-bit. I am now thinking of having multiple instruction sizes to make things simpler than using modifiers. Many vector instructions require the use of modifiers. The complicated solution is to record all the instruction modifiers for an instruction, then cycle through them again as vector step instructions are pushed. That means a buffer for the instruction modifiers is required, along with a count of the number of modifiers so that the push operation knows which buffered modifiers to push.
_________________Robert Finch http://www.finitron.ca
|
Sat Jun 05, 2021 4:13 am |
|
|
MichaelM
Joined: Wed Apr 24, 2013 9:40 pm Posts: 213 Location: Huntsville, AL
|
Rob:
Perhaps I don't understand the problem, but for my core, where multiple modifiers may be simultaneously present, I use a single-bit flag to represent each one. They remain set until the completion of the instruction that they are intended to modify. I only need 5 flags to represent all of the valid combinations of modifiers. Perhaps a scheme like this may be applicable to your multi-cycle / multi-step vector instruction. The flags representing the applied modifiers are pushed to your reorder buffer, and they remain set until the vector instruction clears them when it completes.
_________________ Michael A.
|
Sat Jun 05, 2021 12:06 pm |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2187 Location: Canada
|
Hi Michael, I have almost the same thing for modifiers. A flag is set by the modify instruction, which is then later reset by the following instruction. The issue had to do with re-issuing for vector instructions. I had things set up to simply feed the same instruction into the pipeline, but that was not good enough to include modifiers. (Modifiers need to read register ports again, so they also need to be redone.) To include modifiers, a backwards branch to the start of the modifier chain is required. I was trying to implement this at the decode stage, which was not a good place for it. Instead, this is now being done at the very start of the pipeline.
Whew! It was a ton of experimentation to find what I hope is the best way to handle vector instructions with modifiers. It was handled by decoding vector instructions and modifiers at the ifetch stage instead of the decode stage. 16 parallel decoders were required because all that is available at ifetch is the entire cache line, before it is muxed to a single instruction. Fortunately, detecting a vector instruction was made ridiculously easy. It is a vector instruction if bit 7 of the instruction is set. Detecting a modifier is slightly harder, but not much. If there is a vector instruction, the core simply does a high-speed branch backwards by the count of modifiers.
Code:
    if (is_modif) begin
      mod_cnt <= mod_cnt + 2'd1;
      ip <= ip + 4'd4;
    end
    else begin
      mod_cnt <= 3'd0;
      if (decven2 < vl) begin
        if (is_vector[ip[5:2]]) begin
          ip <= ip - {mod_cnt,2'b00};
          decven2 <= decven2 + 2'd1;
        end
        else begin
          decven2 <= 6'd0;
          ip <= ip + 4'd4;
        end
      end
      else begin
        decven2 <= 6'd0;
        ip <= ip + 4'd4;
      end
    end
Decoding the instruction at ifetch must be very fast and simple. The logic to manage the ip increment should also be short and sweet. Because branching is being done at the ifetch stage, there is no branch latency, making it very fast.
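To see what fetch sequence this ip logic produces, here is a cycle-by-cycle model in Python. It is a sketch, not the core's code: the program is a list of one-letter instruction kinds (`'M'` modifier, `'V'` vector, `'S'` scalar) at 4-byte spacing, register widths are ignored, and `fetch_trace` is a made-up name. Non-blocking assignment is modelled by computing each new value from the old ones.

```python
from typing import List, Tuple


def fetch_trace(program: List[str], vl: int, max_cycles: int = 1000) -> List[Tuple[int, str, int]]:
    """Return (ip, kind, decven2) for every fetch until ip runs off the program."""
    ip, mod_cnt, decven2 = 0, 0, 0
    trace = []
    while ip // 4 < len(program):
        if len(trace) >= max_cycles:
            raise RuntimeError("fetch did not terminate")
        kind = program[ip // 4]
        trace.append((ip, kind, decven2))
        if kind == "M":                       # is_modif
            mod_cnt, ip = mod_cnt + 1, ip + 4
        else:
            if decven2 < vl and kind == "V":  # is_vector[...]
                # branch back to the start of the modifier chain
                ip, decven2 = ip - 4 * mod_cnt, decven2 + 1
            else:
                decven2, ip = 0, ip + 4
            mod_cnt = 0
    return trace


trace = fetch_trace(["M", "M", "V", "S"], vl=2)
kinds = "".join(kind for _, kind, _ in trace)
```

For two modifiers ahead of a vector instruction and `vl` of 2, the model fetches `MMVMMVMMVS`: the modifier chain is replayed ahead of each pass over the vector instruction. Note that with the logic as posted the vector instruction is fetched `vl + 1` times, because the last pass is the one where `decven2 < vl` fails; how the core treats that final pass is not described in the post.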
_________________Robert Finch http://www.finitron.ca
|
Sat Jun 05, 2021 2:51 pm |
|
|
MichaelM
Joined: Wed Apr 24, 2013 9:40 pm Posts: 213 Location: Huntsville, AL
|
Rob:
While you're actively working this problem, I am not.
However, while thinking about it from an instruction restart perspective, i.e., virtual memory, my thinking was that I would store the PC of the first modifier in an instruction sequence. If I had to restart an instruction because of a page fault, I thought that this would be the easiest / simplest way of determining the address of the instruction sequence. I don't fault the instruction stream if the programmer utilizes multiples of the same instruction modifier; this would result in being unable to compute the number by which the current instruction address would need to be modified.
Thus, I think my planned solution solves the problem of restarting instructions with modifiers without resorting to adding additional fault logic to detect repeated use of modifiers, which I can't count with my simple flag registers.
_________________ Michael A.
|
Sat Jun 05, 2021 10:42 pm |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2187 Location: Canada
|
Quote: However, while thinking about it from an instruction restart perspective, i.e., virtual memory, my thinking was that I would store the PC of the first modifier in an instruction sequence. If I had to restart an instruction because of a page fault, I thought that this would be the easiest / simplest way of determining the address of the instruction sequence. I don't fault the instruction stream if the programmer utilizes multiples of the same instruction modifier; this would result in being unable to compute the number by which the current instruction address would need to be modified.
That is a good idea. I had not put thought to virtual memory. The core should be recording the instruction restart address somewhere. Dealing with modifiers is so much fun, I almost switched back to just using wider instruction formats. But I think modifiers are the way to go to keep good code density. There are also only two register read ports required.
Modified the core so that *all* instructions may execute out of order. Previously, simpler instructions executed in order in a single clock cycle once args were available. There is now a function to select the next instruction to execute based on the instruction's arguments being ready. This created pipelining issues where the same instruction would be executed twice in a row. It took me a while to iron out the details.
_________________Robert Finch http://www.finitron.ca
|
Sun Jun 06, 2021 8:13 am |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2187 Location: Canada
|
Trimmed three clock cycles off all cache load access by bypassing the TLB after the first access. The remainder of the burst addresses are found by simply incrementing the address bus. There is no need to translate after the first address because the addresses in a burst are sequential and will not cross page boundaries.
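The reasoning can be illustrated with a small Python model that translates only the first address of a burst and increments from there. Everything numeric in it is an assumption for the sake of the example and not taken from the core: a 4 kB page, a 64-byte cache line, four 16-byte beats per burst, and a burst that starts at the line-aligned address. The page table is a plain dictionary.

```python
from typing import Dict, List

PAGE_BITS = 12    # hypothetical page size of 4 kB
LINE_BYTES = 64   # hypothetical cache line size
BEAT_BYTES = 16   # hypothetical width of one burst access


def translate(page_table: Dict[int, int], vaddr: int) -> int:
    """Full translation: swap the virtual page number for the physical one."""
    offset = vaddr & ((1 << PAGE_BITS) - 1)
    return (page_table[vaddr >> PAGE_BITS] << PAGE_BITS) | offset


def burst_addresses(page_table: Dict[int, int], vaddr: int) -> List[int]:
    """Translate the first beat only, then increment the physical address."""
    base = vaddr & ~(LINE_BYTES - 1)
    first = translate(page_table, base)
    return [first + i * BEAT_BYTES for i in range(LINE_BYTES // BEAT_BYTES)]


page_table = {5: 9}                       # virtual page 5 maps to physical page 9
burst = burst_addresses(page_table, (5 << PAGE_BITS) + 0x7C4)
fully_translated = [translate(page_table, (5 << PAGE_BITS) + 0x7C0 + i * BEAT_BYTES)
                    for i in range(4)]
```

Because a line-aligned burst never leaves its page as long as the page size is a multiple of the line size, the incremented addresses match what a translation of every beat would give.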
Branches are being altered to trim two bits from the displacement field and use them to identify the data type for the branch. Otherwise, anything except integer branches would require a modifier to be used. Having the data type as part of the instruction means that usually only a single instruction word is required for a branch. The branch displacement is, however, reduced to 12 bits or ±2kB without a modifier. The modifier for branches increases the branch range to ±8MB.
An overflow area at the end of a data cache line was added to support unaligned data crossing a cache line boundary. The first part of the next cache line is stored in the overflow area. Then data may be accessed without needing to do two cache accesses for unaligned data at the end of a line. The cost is that an extra data packet needs to be loaded into the cache line. This is five memory accesses instead of four.
Only data addresses are being translated by the TLB. Instruction addresses go through untranslated.
Toying with the idea of expanding the core to 128-bits. The reason being I would like to be able to store graphics points in a register. Points have x,y,z components and are represented with 16.16 fixed point. It takes 96 bits to represent a point. I would like to see a graphics transform instruction that takes three points (for a triangle) as input and produces three transformed points as output. I am wondering if 16.5 fixed point format can be used to represent the point?
Experimentally, the core registers have been extended to 96-bits to allow the storage of graphics points. A point transform instruction was added which transforms a point from one location to another based on a coefficient matrix. It does this in only two clock cycles. About nine multiplies and a dozen adds.
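For reference, this is roughly what such a transform computes, written as a Python sketch in 16.16 fixed point. It assumes the coefficient matrix is a 3x3 matrix plus a translation vector, which the post does not spell out, and the packing order of x, y, z within the 96 bits is likewise a guess. All names are invented for the example.

```python
from typing import Sequence, Tuple

FRAC_BITS = 16
MASK32 = 0xFFFFFFFF


def to_fixed(x: float) -> int:
    """Convert to 16.16 fixed point."""
    return int(round(x * (1 << FRAC_BITS)))


def to_float(f: int) -> float:
    return f / (1 << FRAC_BITS)


def fx_mul(a: int, b: int) -> int:
    """16.16 multiply: full product, then drop the extra fraction bits."""
    return (a * b) >> FRAC_BITS


def transform(point: Sequence[int], matrix: Sequence[Sequence[int]],
              offset: Sequence[int]) -> Tuple[int, ...]:
    """Nine multiplies and nine adds: a 3x3 matrix product plus a translation."""
    x, y, z = point
    return tuple(fx_mul(row[0], x) + fx_mul(row[1], y) + fx_mul(row[2], z) + t
                 for row, t in zip(matrix, offset))


def pack_point(point: Sequence[int]) -> int:
    """Pack three 32-bit components into one 96-bit value (x in the low bits)."""
    x, y, z = point
    return ((z & MASK32) << 64) | ((y & MASK32) << 32) | (x & MASK32)


one, two, zero = to_fixed(1.0), to_fixed(2.0), 0
scale2 = [[two, zero, zero], [zero, two, zero], [zero, zero, two]]
p = [to_fixed(1.5), to_fixed(-2.25), to_fixed(4.0)]
moved = transform(p, scale2, [to_fixed(1.0), to_fixed(2.0), to_fixed(3.0)])
```

Scaling the point (1.5, -2.25, 4.0) by two and translating by (1, 2, 3) gives (4.0, -2.5, 11.0) in this model, and the packed point fits in 96 bits as three 32-bit fields.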
_________________Robert Finch http://www.finitron.ca
|
Mon Jun 07, 2021 3:24 am |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2187 Location: Canada
|
Made the instruction scheduler more sophisticated. It now prefers to choose branch instructions to execute out of a possible list of executable instructions. This is to minimize the number of wasted instruction loads that take place. The scheduler also chooses older instructions in preference to newer ones because they are more likely to cause dependency stalls.
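The selection policy can be sketched as a simple priority pick. This Python model is only an illustration: the `Candidate` fields and the `pick_next` function are hypothetical, age is represented as a plain number where larger means older, and the real scheduler's tie-breaking is not described in the post.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    rob_id: int        # reorder buffer entry of the instruction
    age: int           # larger means older (hypothetical encoding)
    is_branch: bool
    args_ready: bool   # all arguments valid, so the instruction may execute


def pick_next(candidates: List[Candidate]) -> Optional[Candidate]:
    """Prefer ready branches, then the oldest ready instruction."""
    ready = [c for c in candidates if c.args_ready]
    if not ready:
        return None
    # Tuples compare element by element: branch status first, then age.
    return max(ready, key=lambda c: (c.is_branch, c.age))


window = [
    Candidate(rob_id=0, age=9, is_branch=False, args_ready=True),
    Candidate(rob_id=1, age=8, is_branch=True, args_ready=False),
    Candidate(rob_id=2, age=3, is_branch=True, args_ready=True),
    Candidate(rob_id=3, age=5, is_branch=False, args_ready=True),
]
chosen = pick_next(window)
```

In this example the ready branch in entry 2 is chosen even though entry 0 is older; with no ready branch, the oldest ready instruction would win.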
_________________Robert Finch http://www.finitron.ca
|
Tue Jun 08, 2021 3:40 am |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2187 Location: Canada
|
Broke out the instruction scheduler into its own module. I was going to try to use an age matrix for scheduling, then I found out it was still patented, so the scheduler is implemented differently. Put together a test system including the any1 processor core for the FPGA. Now to write an assembler for the any1 instruction set. Started simulating the test system.
_________________Robert Finch http://www.finitron.ca
|
Wed Jun 09, 2021 4:56 am |
|