robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2187 Location: Canada
|
Moved the execute stage back out of the mainline. Two steps forward, one step backwards. It turns out that having it in the mainline made the core much larger, and it would no longer fit in the FPGA. Not sure how it affected synthesis, but the core was 50,000 LUTs larger. So, a slightly slower design fits.
Toying with the idea of allowing loads to use nybble aligned addresses.
With a little bit of work, the core can be configured for other sizes such as an 80-bit core. 80 bits is enough room to hold three 13.13 fixed-point values in a register, or three FP24 values.
_________________Robert Finch http://www.finitron.ca
|
Fri Jun 25, 2021 9:00 am |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2187 Location: Canada
|
Squeezed another branch displacement bit out of the instruction by noting that it does not make much sense to allow vector registers in the compare and branch instructions. Hence the bit used to distinguish vector and scalar registers could be re-purposed as a branch displacement bit. That gives 16 bits of displacement, for a range of ±144 kB.
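A quick check of the ±144 kB figure. This is only a sketch: it assumes the 16-bit displacement is signed and counts whole 36-bit (nine-nybble) instructions, which is a reading of the numbers rather than something stated in the post. The helper name is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: reach, in bytes, of a signed branch displacement.
 * Assumption: the displacement counts whole instructions of insn_bits bits. */
static int64_t branch_reach_bytes(unsigned disp_bits, unsigned insn_bits)
{
    assert(disp_bits >= 1 && disp_bits <= 32);
    /* Magnitude of the most negative displacement, in instructions. */
    int64_t insns = (int64_t)1 << (disp_bits - 1);
    /* Convert instructions to bits, then bits to bytes. */
    return insns * insn_bits / 8;
}
```

Under those assumptions, branch_reach_bytes(16, 36) comes to 32768 × 4.5 = 147456 bytes, which is 144 × 1024.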
Worked on the graphics accelerator today. Changed the bus master port from 64 to 128 bits and added some additional color depths.
|
Sat Jun 26, 2021 3:16 am |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2187 Location: Canada
|
Latest Results: The core can be seen in ILA sequencing through instructions, but it is not outputting the LED I/O address. It seems to be treating instructions as if they were NOPs. It does seem to be fetching the correct instructions, indicating that the I$ is likely working.
Latest Fixes: The NMI input to the core was left open. Not sure if this was an issue. There have been issues in the past with open signals defaulting to active after synthesis.
Latest Mods: Decoupled queue and decode. Decode now takes place sometime after queue. This allows an exec instruction to be implemented.
|
Sun Jun 27, 2021 5:39 am |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2187 Location: Canada
|
Latest Mods: Removed the stack-based call and return instructions. There was a bug in the call instruction that was taking some effort to identify. Code will now be slightly larger but will execute just as fast. Removed the exec instruction. The extra code that was piling up to support it made it not worth it.
Milestone: Got the execution run time up over 100 µs in sim of the boot ROM today.
|
Mon Jun 28, 2021 4:06 am |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2187 Location: Canada
|
Latest Mods: Modified the branch instruction modifier to specify only two bits for the target link register. This keeps link register choice consistent with the JAL and BAL instructions. It also allows bits to be re-purposed for more displacement bits. The modifier supports 19 additional bits now. The total number of branch displacement bits is now 35 with a modifier. The arg B field of the queue was added to the bypass matrix to allow the fourth register of an instruction to be bypassed properly.
Latest Additions: Added a watchdog timer to the instruction queue. If the queue’s execute pointer does not change for 512 cycles then an exception is generated.
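A minimal software model of that watchdog, for illustration only. The struct and function names are invented here, and restarting the count after a timeout is an assumption; the post only says that an exception is generated after 512 cycles without a change in the execute pointer.

```c
#include <stdbool.h>
#include <stdint.h>

#define WATCHDOG_LIMIT 512u  /* cycles without progress, from the post */

typedef struct {
    uint32_t last_exec_ptr;  /* execute pointer seen on the previous cycle */
    uint32_t idle_cycles;    /* consecutive cycles with no change */
} watchdog_t;

static void watchdog_reset(watchdog_t *w, uint32_t exec_ptr)
{
    w->last_exec_ptr = exec_ptr;
    w->idle_cycles = 0;
}

/* Call once per clock. Returns true on the cycle an exception is due. */
static bool watchdog_tick(watchdog_t *w, uint32_t exec_ptr)
{
    if (exec_ptr != w->last_exec_ptr) {
        /* The queue made progress, so start counting again. */
        w->last_exec_ptr = exec_ptr;
        w->idle_cycles = 0;
        return false;
    }
    if (++w->idle_cycles >= WATCHDOG_LIMIT) {
        w->idle_cycles = 0;  /* assumption: the count restarts after firing */
        return true;
    }
    return false;
}
```

With a stuck execute pointer this model fires once every 512 calls.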
Latest Bug Fixes: The multiply fast MULF instruction was flagged as a multi-cycle operation, but it is single cycle. Multiply fast immediate was performing an add instead of a multiply. In the assembler, the branch displacement fields were populated incorrectly, causing branches of more than 64 nybbles to go to the wrong address.
|
Tue Jun 29, 2021 3:23 am |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2187 Location: Canada
|
Did a lot of experimentation with the EXEC and MYST instructions. Then finally wrapped them up in #SUPPORT directives and disabled them. EXEC executes any instruction contained in a register. It is not a very performance-oriented instruction as it may stall the processor while waiting to determine the instruction register. The EXEC instruction added about 3% to the size of the core. MYST is similar to EXEC except that the registers are encoded in the MYST instruction instead of coming from the register containing the instruction. This makes it somewhat faster than EXEC. Still not a good performer. Using JIT code would probably beat the use of EXEC or MYST.
|
Thu Jul 01, 2021 3:40 am |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2187 Location: Canada
|
Latest additions: LDM and STM, load and store multiple registers. These instructions are useful at function entry and exit and for context switches. The front end of the core had to be modified. They look like ordinary load or store instructions except that a register list modifier is used with them. The modifier allows specifying x1 to x30 for the LDM / STM. Adding LDM / STM did not affect the size of the core very much. Surprisingly, the core was about 500 LUTs smaller with the instructions added. If loading or storing three or more registers, it is probably more efficient to use an LDM / STM.
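A rough behavioural sketch of LDM / STM. The actual encoding of the register list modifier is not described above, so the bit mask used here (bit n selects xn), the lowest-register-first ordering and the function names are all assumptions made for illustration.

```c
#include <stdint.h>

#define NUM_REGS 32

/* Hypothetical register list: bit n selects xn; only x1..x30 are allowed. */
#define REG_LIST_MASK 0x7FFFFFFEu

/* Store each selected register to consecutive slots, lowest register first.
 * Returns the number of slots written. */
static unsigned stm(const uint64_t regs[NUM_REGS], uint32_t reg_list,
                    uint64_t *mem)
{
    unsigned n = 0;
    reg_list &= REG_LIST_MASK;  /* drop x0 and x31 if they were requested */
    for (unsigned r = 1; r <= 30; r++)
        if (reg_list & (1u << r))
            mem[n++] = regs[r];
    return n;
}

/* Reload the same registers from consecutive slots, in the same order.
 * Returns the number of slots read. */
static unsigned ldm(uint64_t regs[NUM_REGS], uint32_t reg_list,
                    const uint64_t *mem)
{
    unsigned n = 0;
    reg_list &= REG_LIST_MASK;
    for (unsigned r = 1; r <= 30; r++)
        if (reg_list & (1u << r))
            regs[r] = mem[n++];
    return n;
}
```

A function prologue would use stm() with the list of callee-saved registers and the epilogue would use ldm() with the same list.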
|
Fri Jul 02, 2021 3:40 am |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2187 Location: Canada
|
Cannot get the core working in an FPGA. It seems to work beautifully in simulation. In the FPGA the proper instructions can be seen entering the pipeline thanks to ILA, but the core does not seem to execute the instructions. The STM instruction is one of the first instructions executed, present simply for testing in simulation. There is a burst of activity seen as the STM instruction is processed in the pipeline, but there are no writes to memory occurring. If no writes make it to memory then no commits will come back and the core will hang. The first hack to try is widening the write pulse, which is currently only a single clock, to two clock cycles. It might be getting missed due to timing issues, but I doubt it. I think the watchdog timeout is unsticking the core, as there seem to be about 512 cycles between bursts of activity.
|
Mon Jul 05, 2021 6:38 am |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2187 Location: Canada
|
A picture is worth 1000 words. After about 1000 ticks (500+ clock cycles) there is what looks suspiciously like a watchdog event happening.
|
Mon Jul 05, 2021 6:50 am |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2187 Location: Canada
|
Sketched out a 20-bit compressed instruction format, then investigated what it would take to implement. Re-arranged some of the store instructions to make room for a 20-bit opcode space. One issue is that the IP increment depends on the instruction length if there are compressed instructions. The length decode is a 128-to-one, four-wide multiplexer and it would be cascaded into the IP increment logic. I suspect it may be slow to do so.
Debugging: Doubling the width of the write pulse caused two back-to-back write cycles in simulation. This was expected to happen. But still no output from FPGA.
Put a vector in the execute stage to indicate which ‘if’s are being taken when running on the FPGA. The vector is dumped in ILA so the progress is visible. So far it is indicating ‘4’, which is an unimplemented instruction or an exception of some sort. Found that exception cause code 68h is present. This is not a cause code used in the system; it does, however, match the opcode for the instruction. So, it looks like an opcode is making it into the cause field. Something is amiss. I am just trying to create a different build of the system now to see if it’s build related or an issue with the FPGA.
|
Tue Jul 06, 2021 3:28 am |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2187 Location: Canada
|
Latest Additions: Coded up a storm for 20-bit compressed instructions. Moved several pieces of the execute logic out to tasks that could be shared with 20-bit instructions. All 20-bit instructions except branches branch backwards by four nybbles; this is to get to the next instruction, since the IP has already been incremented by nine. A backwards branch is about the least expensive means and also the lowest performance. Also added a full complement of shift / rotate instructions. Originally only left and right shift were supported. Now included are arithmetic shift right, and left and right rotate.
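The address arithmetic behind the backwards branch can be sketched as follows. It assumes the IP is a nybble address and that a full-size instruction is nine nybbles (36 bits), which is inferred from the increment of nine; the function and constant names are made up.

```c
#include <stdbool.h>
#include <stdint.h>

enum {
    FULL_INSN_NYBBLES = 9,       /* assumed 36-bit instruction */
    COMPRESSED_INSN_NYBBLES = 5  /* 20-bit instruction */
};

/* ip is a nybble address. The IP is always advanced by nine first; a
 * compressed, non-branch instruction then steps back by four nybbles. */
static uint64_t next_ip(uint64_t ip, bool is_compressed)
{
    uint64_t next = ip + FULL_INSN_NYBBLES;
    if (is_compressed)
        next -= FULL_INSN_NYBBLES - COMPRESSED_INSN_NYBBLES;
    return next;
}
```

The net effect is an advance of five nybbles for a compressed instruction, at the cost of treating it like a taken branch.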
|
Wed Jul 07, 2021 5:49 am |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2187 Location: Canada
|
Worked on getting the compiler up to speed. It had last been modified for RTF64 which uses condition registers. So, there were a lot of modifications required to compare and branch instructions throughout the compiler. The compiler can accept an option passed to a switch statement to generate “naked” switches. A naked switch omits the range testing code. This is meant only for code that is known to work to boost performance. It must be guaranteed that values will not fall outside of the proper range, otherwise a naked switch would likely cause a crash.
Normal Switch:
Code:
; switch(x) {
  ldo $t0,64[$fp]
  sge $t1,$t0,#1 ; x varies between 1 and 12
  sle $t2,$t0,#12
  and $t1,$t1,$t2
  beq $t1,TestSwitch_89
  sub $t0,$t0,#1
  sll $t0,$t0,#4
  ldo $t0,TestSwitch_116[$t0]
  jmp $t0
Naked Switch:
Code:
; switch(x; naked) {
  ldo $t0,64[$fp]
  sub $t0,$t0,#1
  sll $t0,$t0,#4
  ldo $t0,TestSwitch_144[$t0]
  jmp $t0
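In C terms the difference between the two listings is roughly the following. This is only a model: the table holds integers standing in for the jump targets, and the function names are invented.

```c
#define CASE_MIN 1
#define CASE_MAX 12

/* Checked dispatch: out-of-range values go to the default target. */
static int switch_checked(const int table[], int x, int default_target)
{
    if (!(x >= CASE_MIN && x <= CASE_MAX))  /* sge / sle / and / beq */
        return default_target;
    return table[x - CASE_MIN];             /* sub / sll / ldo / jmp */
}

/* Naked dispatch: no range test. A value of x outside 1..12 indexes
 * outside the table, which is why the caller must guarantee the range. */
static int switch_naked(const int table[], int x)
{
    return table[x - CASE_MIN];
}
```

The naked form saves the four range-test instructions per switch and nothing else, so it only pays off where the range is already guaranteed.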
Code for switch generation was modified to use a binary search for the case value if there are more than two case values. Otherwise, a linear search is used. If case values are densely packed then a table lookup is used.
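For the binary search case, the generated code would be a tree of compare and branch instructions, but the logic is the same as this run-time sketch. The struct and function names are hypothetical, and the integer target stands in for a case label.

```c
#include <stddef.h>

typedef struct {
    long value;   /* case constant */
    int  target;  /* stand-in for the case label */
} case_entry;

/* cases must be sorted by value in ascending order.
 * Returns the matching target, or default_target if there is no match. */
static int find_case(const case_entry *cases, size_t n, long x,
                     int default_target)
{
    size_t lo = 0, hi = n;  /* search the half-open range [lo, hi) */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (cases[mid].value == x)
            return cases[mid].target;
        if (cases[mid].value < x)
            lo = mid + 1;
        else
            hi = mid;
    }
    return default_target;
}
```

Sparse case values such as -5, 3, 40, 1000 and 70000 suit this scheme; a table covering that range would be mostly empty.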
|
Thu Jul 08, 2021 4:16 am |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2187 Location: Canada
|
Worked on the compiler some more. Changed the code generation of branches from the RTF64 style to suit ANY1.
|
Fri Jul 09, 2021 4:45 am |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2187 Location: Canada
|
There have been numerous fixes to the compiler. The latest addition was support for designators in array initializations. Some of the code for that is pretty scary and a little bit incomplete. It is now possible to code designators as in:
Code:
void TestArray(int aa)
{
    int z[20] = {[5...13]=5,[0]=0,1,2,3,4,[14...19]=6};
}
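What that initializer should produce can be modelled in plain C. The sketch follows the standard C rule that an initializer without a designator goes to the element after the one most recently initialized; it assumes CC64 applies the same rule to its range designators. The init_item type and apply_initializers() are invented for the example.

```c
#include <stddef.h>

typedef struct {
    int    has_designator;  /* nonzero if [first...last] was written */
    size_t first, last;     /* inclusive range; ignored without a designator */
    int    value;
} init_item;

/* Apply the items in source order. An item without a designator goes to
 * the slot after the last one written. Unmentioned slots stay zero.
 * Returns 0 on success, -1 if an item falls outside the array. */
static int apply_initializers(int *arr, size_t len,
                              const init_item *items, size_t n)
{
    size_t next = 0;
    for (size_t i = 0; i < len; i++)
        arr[i] = 0;
    for (size_t i = 0; i < n; i++) {
        size_t first = items[i].has_designator ? items[i].first : next;
        size_t last  = items[i].has_designator ? items[i].last  : next;
        if (first > last || last >= len)
            return -1;
        for (size_t k = first; k <= last; k++)
            arr[k] = items[i].value;
        next = last + 1;
    }
    return 0;
}
```

For the example above this gives z[0] to z[4] = 0, 1, 2, 3, 4, then z[5] to z[13] = 5 and z[14] to z[19] = 6.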
Structures were being entered into the global symbol table instead of the tag table. This led to structure definitions not being found sometimes. The interesting thing is that the compiler’s search facility is so powerful that it would find the structure definitions most of the time anyway. Trying to get the compiler to compile the following expression held me up for a while:
Code:
void (*_Atfuns[32])(void) = {0};
I think the goal is to initialize the first element of the array of pointers to zero. The CC64 compiler initializes the first element as specified then fills the remaining storage with zeros.
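That matches standard C, where elements without an explicit initializer are initialized as if they had static storage duration, so every remaining pointer ends up null. A small check, using a local array with the same shape as the declaration above (the function name is made up):

```c
#include <stddef.h>

/* Count the null entries in an array initialized with just {0}. */
static size_t count_null_entries(void)
{
    void (*funs[32])(void) = {0};  /* only the first element is explicit */
    size_t nulls = 0;
    for (size_t i = 0; i < sizeof funs / sizeof funs[0]; i++)
        if (funs[i] == NULL)
            nulls++;
    return nulls;
}
```

On a conforming compiler all 32 entries are null, not just the first.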
|
Sat Jul 10, 2021 7:47 am |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2187 Location: Canada
|
Latest Additions: Back in the instruction set are the LINK and UNLINK instructions used for subroutine linkage. They are not particularly fast but they are code dense. They expand out into a few of the more usual instructions. Added an SLL optimization. If the target of the SLL operation is an index register and scaling can be used, then the SLL instruction is removed and scaling is used instead.
Code fix pending: The compiler’s forcefit() routine, which coerces types to the larger of the two input types, needs to be fixed up. I am not sure why, but the forcefit() routine connects the source node to the destination node. This would not work, as the source and destination need to be kept separate. I am not sure what I was thinking at the time it was originally coded. The reason the compiler works is that forcefit() was always followed by code to link the source and destination operands, which made the operands separate again. So, there is extra dead code in the forcefit() routine that might be confusing to someone looking at the inner workings of the compiler.
|
Mon Jul 12, 2021 5:34 am |
|