 Qupls (Q+) 
Author Message

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2232
Location: Canada
Latest Bug Fixes
The write ack from the system scratchpad RAM for an ERC cycle was being generated too soon. This led to a response of all zeros from the RAM. An ERC write is a write that requests a response back once the write cycle has completed.

After running for 50us the data cache controller locked up. It hit a tranid that was the same as a tranid in one of its buffers for an older transaction, and assumed the transaction was done already. So, it never started a new one. The transaction command field is now used as a flag, being set to NONE once a transaction is complete, so the controller can tell if it is matching against an already completed transaction.
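A minimal behavioural sketch of the fix described above, in Python rather than the actual RTL. The slot structure, the command encodings and the function names are hypothetical; the only point illustrated is that a stale tranid in a slot whose command is NONE no longer counts as a match.

```python
from dataclasses import dataclass

CMD_NONE = 0    # hypothetical encoding: slot holds no live transaction
CMD_LOAD = 1
CMD_STORE = 2


@dataclass
class TranSlot:
    tranid: int = 0
    cmd: int = CMD_NONE


def is_in_flight(slots, tranid):
    """True only if a slot with this tranid still holds a live command."""
    return any(s.tranid == tranid and s.cmd != CMD_NONE for s in slots)


def complete(slots, tranid):
    """Mark the matching live transaction done by clearing its command."""
    for s in slots:
        if s.tranid == tranid and s.cmd != CMD_NONE:
            s.cmd = CMD_NONE    # the tranid itself is left stale on purpose
            return True
    return False
```

With this check a new transaction that happens to reuse an old tranid is started normally, because the old slot is recognised as already completed.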

_________________
Robert Finch http://www.finitron.ca


Thu Feb 15, 2024 3:12 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2232
Location: Canada
Milestone Reached:
SIM working to the point of setting LEDs in the boot code for the scalar machine. First build for FPGA: does not work.

_________________
Robert Finch http://www.finitron.ca


Thu Feb 15, 2024 9:56 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2232
Location: Canada
Spent most of the last few days updating the compiler.

The boot code did not update enough entries in the page table to cover the addresses used during startup. This led to a page fault hang while accessing memory.

Tried running the core in the FPGA board a few times. According to the logic analyzer it hangs right when it tries to write the LEDs. The address used to write the LEDs is incorrect. Since this works in SIM things must be close. At least it can be seen that the CPU starts up and fetches instructions.

The boot program looks like the following, which does not do anything more than flash the LEDs.
Code:
extern void SerialInit(void);
extern void SerialTest(void);

static const int pte[26] = {
   0x1EDF, 0x83000FFFFFFFFEDF,   /* LEDs */
   0x1EC0, 0x83000FFFFFFFFEC0,   /* text mode screen */
   0x1ED0, 0x83000FFFFFFFFED0,   /* Serial port */
   0x1EDC, 0x83000FFFFFFFFEDC,   /* Keyboard */
   0x1EE1, 0x83000FFFFFFFFEE1,   /* random number generator */
   0x1FF8, 0x82000FFFFFFFFFF8,   /* BIOS RAM */
   0x1FF9, 0x82000FFFFFFFFFF9,   /* BIOS RAM */
   0x1FFA, 0x82000FFFFFFFFFFA,   /* BIOS RAM */
   0x1FFB, 0x82000FFFFFFFFFFB,   /* BIOS RAM */
   0x1FFC, 0x83800FFFFFFFFFFC,   /* BIOS ROM */
   0x1FFD, 0x83800FFFFFFFFFFD,   /* BIOS ROM */
   0x1FFE, 0x83800FFFFFFFFFFE,   /* BIOS ROM */
   0x1FFF, 0x83800FFFFFFFFFFF    /* BIOS ROM */
};

integer another_var;

/* Display blinking LEDs while delaying to show CPU is working.
*/
private inline(0) void Delay3s(void)
begin
   integer* leds = 0x0FFFFFFFFFEDFFF00;
   integer cnt;
   
   for (cnt = 0; cnt < 300000; cnt++)
      leds[0] = cnt >> 17;
end

public void bootrom(void)
begin
   integer* pgtbl = 0xfffffffffff80000;
   integer* PTBR = 0xfffffffffff4ff20;
   integer* leds = 0xffffffffFEDFFF00;
   integer cnt, ndx;
   short integer* pRand;

   *PTBR = &pgtbl[0];
   pRand = 0xFFFFFFFFFEE1FD00;
   
   __sync(0x0FFFF);
   /* clear out page table */
   for (cnt = 0; cnt < 16; cnt++)
      pgtbl[cnt] = 0;
   for (cnt = 0; cnt < 26; cnt+= 2)
      pgtbl[pte[cnt]] = pte[cnt+1];
   __sync(0x0FFFF);
   leds[0] = -1;
   pRand[1] = 0;                  /* select random stream #0 */
   pRand[2] = 0x99999999;   /* set random seed value */
   pRand[3] = 0x99999999;
   Delay3s();
   SerialInit();
   SerialTest();
end

Note the assignment of values to pointers. I think this is illegal in a lot of programming languages.

_________________
Robert Finch http://www.finitron.ca


Wed Feb 21, 2024 2:27 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2232
Location: Canada
Latest Status:
Running the core in the FPGA hangs, but it works in simulation. However, post-synthesis simulation does the same thing as the hardware running on an FPGA board. So, there is some subtle difference. It hangs accessing an I/O device. I tried changing which device was being accessed and it did not make a difference. Functional simulation shows that a page fault occurred. This should not happen. The scratchpad RAM is accessed in the same manner as the other I/O, and access to the RAM is working.
So, virtual memory is disabled for now. The core can now clear the screen and flash LEDs!

Got a surprise looking at compiler output generated for interrupt routines. On entry all the registers are saved as specified by a register mask, but on exit not all the registers were loaded. The compiler detected the lack of use of some registers and elided the load instructions. The interrupt routine did not use any temporary registers so they did not get loaded at the end of the routine.

Latest Changes:
Made the TLB four-way associative.

Latest Additions:
Added a jump to subroutine instruction with an absolute address mode. The target range is 37 bits. This has the same format as the branch to subroutine instruction. There is already an indexed jump to subroutine instruction, but with a 24-bit displacement. Thinking of running systems without virtual memory, the address range supported by the jump instructions needs to be larger.

Added constant optimization of branches to the compiler. If two constants are being compared for a branch condition, then either an unconditional branch is output or no branch is output. Unconditional branches are faster than conditional ones because they are performed earlier in the pipeline. So, smaller and faster code results if constants are being compared. This happens fairly often at the start of loops with loop inversion. For the first iteration the value of the loop counter is known.
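A minimal sketch of the branch folding described above, written in Python rather than in the compiler's own source. The condition names, the tuple form of the instructions and the `bra` mnemonic are all hypothetical; only the decision (unconditional branch, no branch, or unchanged conditional branch) follows the description.

```python
import operator

# Hypothetical condition names mapped to Python comparisons.
CONDITIONS = {
    "eq": operator.eq, "ne": operator.ne,
    "lt": operator.lt, "le": operator.le,
    "gt": operator.gt, "ge": operator.ge,
}


def fold_branch(cond, a, b, target):
    """Return the instruction list for 'if (a cond b) goto target'.

    a and b are either ints (known constants) or strings (register names).
    """
    if isinstance(a, int) and isinstance(b, int):
        if CONDITIONS[cond](a, b):
            return [("bra", target)]         # always taken: unconditional branch
        return []                            # never taken: emit nothing
    return [("b" + cond, a, b, target)]      # not foldable: keep the conditional


# Inverted loop test on the first iteration, where the counter is known to be 0.
print(fold_branch("lt", 0, 10, "loop_body"))
```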

_________________
Robert Finch http://www.finitron.ca


Thu Feb 22, 2024 5:57 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2232
Location: Canada
Latest Bugfixes
Found a bug in the page table walker. The select lines were being shifted over too far, which would sometimes cause them to be set to zero, meaning the data may not have been retrieved in all cases. I think the select lines might be ignored by the scratchpad memory for reads. If a read cycle is taking place, the memory does not care which bytes are selected; they are all returned.

Latest Changes
Broke the page table walker down into three modules as part of refactoring for code cleanup.

_________________
Robert Finch http://www.finitron.ca


Fri Feb 23, 2024 6:38 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2232
Location: Canada
Latest Bugs
Block RAM/ROM was not being initialized correctly. A branch displacement was incorrect causing a branch to the wrong address and a crash. It appears specifying a .mem file to load did not work correctly. It appears to be a toolset bug, a Windows bug, or a hardware error on the workstation. So, the contents of the ROM were specified directly in the Verilog source code.

_________________
Robert Finch http://www.finitron.ca


Sat Feb 24, 2024 2:54 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2232
Location: Canada
Latest Bugfixes
Figured out a hang that was occurring after code ran in the FPGA for about three seconds. Millions of instructions executed, then it hung on a store operation. It turns out there was an I$ load happening at the same time, and the store needed to be retried. The retry code got left out when the sequential core was created. It is amazing that it worked for millions of load / store ops.

Got the page table walker working.
A startup condition was causing a false TLB miss to occur. The page table walker then could not walk the page table properly because it had not been set up yet. The reset PC was being placed on the bus for the TLB to translate before the default TLB entries were set up. There is a delay of about 15 clocks before the TLB is ready. Before that time the TLB would always generate a miss. A startup timer was used to prevent the TLB from translating the PC address until after the TLB had time to load defaults.
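A minimal sketch of the startup timer idea, as a Python behavioural model and not the actual Verilog. The class and method names are hypothetical, and the default of 15 clocks is only the approximate figure quoted above.

```python
class TlbStartupGate:
    """Holds off translation requests until the TLB has loaded its defaults."""

    def __init__(self, ready_after=15):
        # Assumed delay in clocks; the real value depends on the TLB design.
        self.remaining = ready_after

    def tick(self):
        """Advance one clock."""
        if self.remaining > 0:
            self.remaining -= 1

    def may_translate(self):
        """True once the startup delay has elapsed."""
        return self.remaining == 0
```

The reset PC would only be presented to the TLB once `may_translate()` returns true, so no miss is raised while the default entries are still loading.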

The data cache was locking up because the tran ids assigned to two different bus transactions were the same. The second transaction then never got processed, so the data cache thought the CPU's bus transaction was not complete. The solution was to scrap the automatic assignment of tran ids and manually hard-code a set.

_________________
Robert Finch http://www.finitron.ca


Mon Feb 26, 2024 9:21 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2232
Location: Canada
Decided to work on GBOoO version of Qupls today. Qupls is further along than Bigfoot. It is running in simulation. It currently hangs trying to write to address zero where there is no memory. For some reason arguments for the address generator are not being picked up properly. It is supposed to be writing to the stack. The very same arguments are picked up by other following instructions correctly. It must be something specific to the AGENs.

Queue slot #14 picks up the register value calculated in slot 12, but the memory op in slot #13 does not. The register is 35/4. Registers are interesting because they are represented as logical and physical registers. I use the notation logical/physical.

Code:
.Q  sn:70 12: -dd--- 0 -- 0 - a  4 Rt 35/   4=000ffffffeffffe0 ff Rs 35/ 259 v Ra 35/ 259=000fffffff000000 v Rb  0/   0=000fffffff000000 v Rc 35/ 259=0000000000000000 v I=ffffffffffffffe0 fff8142e.000 cp:0 ins=ffffe08f9f04 #
C.  sn:71 13: v--o-- 0 q- 0 - m 83 Rt  0/   0=xxxxxxxxxxxxxxxx ff Rs  0/   0 v Ra 35/   4=0000000000000000 v Rb  0/   0=0000000000000000 v Rc 30/1023=0000000000000000 v I=0000000000000000 fff81434.000 cp:0 ins=0000000f9e53 #
..  sn:72 14: vdd--- 0 -- 0 - a 15 Rt 30/ 516=000ffffffeffffe0 ff Rs 30/1023   Ra 35/   4=000ffffffeffffe0 v Rb  0/   0=0000000000000000 v Rc  0/   0=000ffffffeffffe0 v I=0000000000000000 fff8143a.003 cp:0 ins=0000000f9e0f #

_________________
Robert Finch http://www.finitron.ca


Mon Sep 09, 2024 5:43 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2232
Location: Canada
Bug fixes
Figured out the AGEN issue. It turned out not to be directly related to the AGEN; it was the register file. Outputs were shifted by a clock, causing port #1 output to appear on port #2, etc.

The argument valid signal was one clock cycle too soon. The register file had not been updated yet. This led to instructions issuing one cycle too soon. In a similar vein the address generator output valid was signalled one cycle too soon.

The renamer was too fancy, operating on a five-times clock. The idea was to accelerate searches for available registers. But it turns out it was sometimes assigning the same register to two different targets, so it did not work right. So it was switched back to operate on a one-times clock. For a short test run it does not make a difference. The renamer stalled only once in about 200 instructions.

Changes
Souped up the address generators. There was some logic for the address valid signal in the Qupls mainline that really belonged within the AGEN.

Stripped stomp logic out of the mainline and made it its own module. Added the capacity to stomp the vector expansion buffer.

*****

Tried to get a size estimate for the core by synthesizing it. But it did not synthesize properly. The size came back as something like 30kLUTs and it should really be much closer to 100kLUTs. The last size recorded was about 97kLUTs. I was hoping to beat that size.

_________________
Robert Finch http://www.finitron.ca


Tue Sep 10, 2024 4:22 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2232
Location: Canada
Finally got synthesis to return a sensible size estimate. 117k LUTs including all registers and 2 ALUs. Going to run with a reduced version to save some LUTs. 14 vector registers instead of 24 and just one ALU. Eight checkpoints. 91k LUTs.

To begin with, I left out some of the vector mask logic to keep things simple. So, I added it in. Hopefully the size of the core does not explode too badly.

Thinking about going with four ALUs instead of two ALUs and two AGENs, then doing AGEN with the ALU. A lot of the support logic is the same but the AGENs are much simpler than ALUs.

_________________
Robert Finch http://www.finitron.ca


Wed Sep 11, 2024 6:04 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2232
Location: Canada
Modified the address generators to support packed (compressed) or unpacked vector loads and stores. For vector memops the address calc has to add in a scaled version of the element number to the address.
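A minimal sketch of the element address calculation, in Python rather than RTL. The function and parameter names are hypothetical, and the meaning of "packed" is an assumption here: only mask-enabled elements occupy consecutive slots in memory, so the slot number is the count of enabled elements below the current one.

```python
def element_address(base, elem, elem_size, mask=None, packed=False):
    """Address of vector element `elem`.

    Unpacked: every element has its own slot, so the offset is elem * elem_size.
    Packed (assumed meaning): only mask-enabled elements occupy memory, so the
    slot index is the number of enabled mask bits below `elem`.
    """
    if packed and mask is not None:
        slot = bin(mask & ((1 << elem) - 1)).count("1")
    else:
        slot = elem
    return base + slot * elem_size


# Element 3 of a vector of 8-byte elements, unpacked and packed with mask 0b1010.
print(hex(element_address(0x1000, 3, 8)))
print(hex(element_address(0x1000, 3, 8, mask=0b1010, packed=True)))
```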

Moved some more of the AGEN logic to separate modules. Trying to break up the 7500 line mainline into smaller modules.

Simplified the vector expand. It used to limit expansion to the current vector length. Now, it just expands out to max vector length. The issue is that the vector length in use may not be known until the instruction executes. The length could be modified in the time between expansion and calculation.

Started adding code to support alternate fetch paths for branch misses. There is room in the fetch stage to fetch along an alternate branch path. It can be chosen at a later stage which instructions to process.

Size 64-bit: 81k LUTs
Size 128-bit: 107k LUTs

_________________
Robert Finch http://www.finitron.ca


Fri Sep 13, 2024 3:07 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2232
Location: Canada
Late night musings

It is late at night, time for bad ideas. I got to wondering how to implement the capabilities tag in memory. It would be nice if there were an extra bit available. Then I got to browsing my website and saw the ECC RAM. ECC RAM makes use of 16-bit memory to store an 11-bit byte and 5 check bits. Reducing this to a 10-bit byte would leave an extra bit available that could be used for the capability tag bit. It would “not take much” to redesign the memory system to use 10-bit bytes with ECC checking.
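The bit budget can be checked with a short calculation. This sketch assumes the ECC is a standard Hamming code extended with an overall parity bit (SECDED), which matches the 5 check bits quoted for an 11-bit byte; the function names are hypothetical.

```python
def secded_check_bits(data_bits):
    """Check bits for a Hamming SECDED code over `data_bits` data bits."""
    r = 0
    # Smallest r with 2**r >= data_bits + r + 1 gives single-error correction.
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r + 1    # one extra overall-parity bit adds double-error detection


def spare_bits(word_bits, data_bits):
    """Bits left over in a memory word after data and check bits."""
    return word_bits - data_bits - secded_check_bits(data_bits)


print(spare_bits(16, 11))   # 11-bit byte in a 16-bit word
print(spare_bits(16, 10))   # 10-bit byte in a 16-bit word
```

Under that assumption a 10-bit byte still needs 5 check bits, which leaves exactly one bit of the 16-bit word free for the tag.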

A new version of Q+ with 10-bit bytes (80-bit datapath) is not much larger (<10%) than the 64-bit version. Instructions would be 50 bits instead of 48 bits. 40-bit pointers could be used with a 40-bit capability field in the register. This is much more compact than using a 128-bit datapath.

Chose to stick with a 64-bit machine using 32-bit memory pointers. Capabilities can be encoded in 64-bits then.

******

Added more support for capabilities. Added a capabilities tag cache. It sits alongside the data cache. It has storage for 64k tags, enough to cover 512kB of RAM. Most of the capabilities instructions are performed by ALU #0. A few instructions require memory support.
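The coverage figure works out if each tag guards one 64-bit (8-byte) word. The sketch below makes that assumption explicit; the direct-mapped indexing and the function names are hypothetical and are not taken from the actual tag cache.

```python
NUM_TAGS = 64 * 1024    # tag count quoted above
BYTES_PER_TAG = 8       # assumption: one tag bit per 64-bit word


def coverage_bytes(num_tags=NUM_TAGS, bytes_per_tag=BYTES_PER_TAG):
    """Amount of RAM covered by the tag storage."""
    return num_tags * bytes_per_tag


def tag_index(addr, num_tags=NUM_TAGS, bytes_per_tag=BYTES_PER_TAG):
    """Index of the tag bit guarding the word that contains `addr`."""
    return (addr // bytes_per_tag) % num_tags


print(coverage_bytes() // 1024, "kB")
```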

_________________
Robert Finch http://www.finitron.ca


Sat Sep 14, 2024 6:50 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2232
Location: Canada
Finally got the system build out to the implementation stage to get an idea of timing. It takes about 2 hours for the system to build, the FPGA being about 95% full.
It turns out that a 200MHz 6x clock for register access is too fast. I half expected that. A rough estimate of the max is about 142 MHz. Using a 100MHz clock instead gives a CPU clock of 16.667MHz.
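The clock figures above follow from dividing the register file clock by the access ratio. A small sketch of that arithmetic (the function name is hypothetical):

```python
def cpu_clock_mhz(regfile_clock_mhz, ratio=6):
    """CPU clock when the register file runs `ratio` times faster than the CPU."""
    return regfile_clock_mhz / ratio


for f in (200, 142, 100):
    print(f"{f} MHz register clock -> {cpu_clock_mhz(f):.3f} MHz CPU clock")
```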

What was on the critical timing path was not what I expected. It turned out to be the response buffers for the SoC bus. I had coded the response buffers to do a priority search for the highest priority response to place on the bus. It turns out that performing the search in a single clock cycle with combinational logic results in 80 (!) logic levels. That is way too many for the desired 33.3MHz clock. But since the clock is being reduced to 16.67MHz it may work. It is still something that should be improved.
The issue is that there are 32 buffers to search (4 buffers each for 8 channels). The search really needs to be split up across multiple clock cycles and then pipelined. Alternately a cascade of response buffers could be used that contain fewer channels.
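One way to picture the split is the Python behavioural model below. It is not the RTL and not necessarily how the buffers ended up being restructured; the function names and the group size of 4 are assumptions. It searches each group of buffers independently first (a stage that could be registered), then searches across the groups, and returns the same index as a plain rotating linear search.

```python
def rotating_search(valid, start):
    """Linear search: first valid index at or after `start`, wrapping around."""
    n = len(valid)
    for i in range(n):
        idx = (start + i) % n
        if valid[idx]:
            return idx
    return None


def two_stage_search(valid, start, group_size=4):
    """Same result as rotating_search, split into two smaller searches.

    Assumes len(valid) is a multiple of group_size.
    """
    n = len(valid)
    groups = n // group_size
    g0, off = divmod(start % n, group_size)

    # Stage 1: first valid entry in each group, found independently per group.
    first = []
    for g in range(groups):
        chunk = valid[g * group_size:(g + 1) * group_size]
        first.append(next((i for i, v in enumerate(chunk) if v), None))

    # The starting group is special: only entries at or after `off` count first.
    chunk0 = valid[g0 * group_size:(g0 + 1) * group_size]
    upper = next((i for i in range(off, group_size) if chunk0[i]), None)
    if upper is not None:
        return g0 * group_size + upper

    # Stage 2: rotate across the groups, ending with the wrapped-around part
    # of the starting group (any hit there is necessarily below `off`).
    for k in range(1, groups + 1):
        g = (g0 + k) % groups
        if first[g] is not None:
            return g * group_size + first[g]
    return None
```

Each stage looks at far fewer inputs than a single 32-entry search, at the cost of an extra clock if a register is placed between the stages.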

27 levels of logic in the ALU path. 7ns logic, 14 ns route. All ALU operations go through a mask stage as a last step to set the outputs for the selected lanes of execution. Added a register in the ALU datapath before the mask stage to try and reduce the number of logic levels. The ALU now claims to be done one cycle before it actually is, but I do not think it matters as it is pipelined. It can still safely start the next operation. Results appear in the register file one cycle later.

The next worst timing offender is the RAT with 19 logic levels. 4ns logic, 10 ns route. I think reducing the number of logic levels in the RAT will be much more difficult.

The in-memory representation of capabilities is a compressed version of the capability. Using the compressed version of the capability means that the capability logic must uncompress, calculate, then re-compress the capability. I am toying with the idea of using un-compressed capabilities in the CPU for performance reasons. But that means having quite wide registers. 32-bit capabilities would need 128-bit registers. One thought is to use vector registers to represent the capability.

_________________
Robert Finch http://www.finitron.ca


Sun Sep 15, 2024 4:13 am WWW

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1808
80 logic levels is a lot... is that in any sense a linear search, which might be replaced by a divide-and-conquer search?


Sun Sep 15, 2024 7:04 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2232
Location: Canada
Quote:
80 logic levels is a lot... is that in any sense a linear search, which might be replaced by a divide-and-conquer search?

Yes, it is a simple linear search. I thought the toolset might be able to render it using a tree or something.

I do not think it is easily made into divide and conquer. I had thought of maybe priority sorting the inputs as they came in; I cannot think of a simple fix. The buffer is designed to cycle around, starting the search at the next buffer location every clock cycle so that none of the buffers are starved for access. I could maybe remove the priority aspect as it is not being used, but it still has to search for buffers.

I split the buffer into three buffers, two half-size ones feeding the third, so it costs an extra clock cycle. But I think it is only about 40 logic levels now. It does not show up on the timing report anymore. It may poke its ugly head up again though. My five/six-times clock for the register file writes limits the CPU to about 17MHz. (450MHz/17MHz = about 26 logic levels). I am not that worried about the clock rate. Trying to squeeze a mega-LUT design into 100k LUTs is bound to impact it. I am more interested in seeing if it will run at all.

Trying to get something to run at 100MHz allows only about four to five logic levels to work with. That means a really simple scalar machine which can execute < 1 instruction per clock. 50 MHz is a more realistic target. I think OPC was pretty fast. A good scalar pipelined machine likely gets an IPC around 0.75, or about 37 MIPS at 50MHz - that is an optimistic estimate running out of caches / BRAMs. Assuming there is real DRAM to interface to, things are likely much slower. Running at 20 MHz with an IPC of about 2 is about the same performance level. Timing on the core will likely be 20 MHz or better.
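The rough numbers in this post can be reproduced with two one-line formulas. The sketch takes the 450MHz-per-logic-level rule of thumb used earlier in this post as given; the function names are hypothetical and the results are estimates, not measurements.

```python
def logic_level_budget(clock_mhz, per_level_mhz=450):
    """Approximate logic levels that fit in one cycle at `clock_mhz`."""
    return per_level_mhz / clock_mhz


def mips(clock_mhz, ipc):
    """Millions of instructions per second for a given clock and IPC."""
    return clock_mhz * ipc


print(f"{logic_level_budget(17):.1f} levels at 17 MHz")
print(f"{logic_level_budget(100):.1f} levels at 100 MHz")
print(f"{mips(50, 0.75)} MIPS scalar, {mips(20, 2)} MIPS at 20 MHz with IPC 2")
```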

Adding regs into the ALU datapath improved things, other adjustments helped too. According to the tools it meets 17MHz timing now. The five times clock could maybe be boosted to 120 MHz giving a 24 MHz CPU clock.
Multiplexing the write ports did not limit the timing. The RAT table has about 20 logic levels to it, and it shows up on the timing report.

_________________
Robert Finch http://www.finitron.ca


Sun Sep 15, 2024 11:24 am WWW