 Qupls (Q+) 

Joined: Mon Oct 07, 2019 2:41 am
Posts: 914
good idea. minus 0 as the reset fault.:)
Can you tag the internal registers for address and data, where a zero in address is a fault or null pointer?


Mon Jan 19, 2026 10:13 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2483
Location: Canada
Quote:
good idea. minus 0 as the reset fault.:)
Can you tag the internal registers for address and data, where a zero in address is a fault or null pointer?

I made the registers a byte wider so that tags or flags could be added to them, but this is not implemented. There is a bit reserved to indicate a pointer in a register. It could be combined with the value to indicate a fault.

The MSI interrupt controller was outputting an NMI all the time. This occurred because a state of no interrupts was indicated with the same code as an NMI (with one more top bit set).

Pretty much decided to drop predicates and the PRED modifier from the ISA. It turns out to be tricky to implement with multiple threads present. I think I got it implemented but sheesh. Predicates are adding a fair bit of logic to the design. It is a bit disproportionate to the value. There could be multiple active predicates in the pipeline for each thread and everything needs to be tracked.
A predicate fault occurred during testing. The fault occurs when the commit pointer is sitting at an instruction waiting for a predicate and there are no longer any active predicates. It is a hardware issue. I decided not to spend time debugging.

The design may clock faster without the predicates. Branches automatically predicate instructions anyway if they are in a short branch shadow.

For some reason the exception flag on the ROB entry was being set during enqueue. This caused every instruction to raise an exception. The flag was supposed to be set only if there was a decode exception.

Spent about an hour figuring out why the reset jump was jumping to $FFFFC000 instead of $FFFF8000 like it should. Looked at the assembler encoding, instruction decoding, etc. Then I remembered that the memory file used to load the ROM was still pointing to the file for 2025. I switched this to the 2026 version to fix things.

Milestone:
The core is jumping to the reset address, so it is executing the first jump instruction after a reset now. And it is loading up instructions in the pipeline. Still more work to do…

_________________
Robert Finch http://www.finitron.ca


Tue Jan 20, 2026 5:03 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2483
Location: Canada
Bug Fixes
There was no valid signal associated with an I$ miss address. This led to cache lines for the default miss address being fetched repeatedly. This hurt performance, which was obvious in the sim trace as long gaps between instructions. A valid miss address signal was added to rectify this.

Micro-ops were not marked as valid when translated, resulting in all micro-ops being treated as NOP operations. No instructions were executing except branches performed in the extract stage. (Branches in the extract stage operate on raw instructions, not micro-ops.)

It seems that NOPs are not always indicated as such on the decode bus. I am not sure where the issue is. But since they are normally marked done right away and do not get dispatched, a missing NOP indicator causes the machine to hang. I was wondering whether to add dispatch logic for this case. Eventually, I figured this one out.

A couple of the test conditions for instruction dispatchability were incorrect. This meant instructions would not dispatch, causing the machine to hang. Dispatchability conditions are pre-computed before they are used.

Replaced uses of the old 'instruction' structure with the newer micro-op structure in a few places. Some of the older code that had been ported over did not use micro-ops. I should really go through all the code and ensure there are no references to the old structure.

The super-fast, super compact instruction dispatcher from the other day did not work. It required a fix; I am not sure how timing is impacted.

_________________
Robert Finch http://www.finitron.ca


Fri Jan 23, 2026 3:50 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2483
Location: Canada
The logic that chooses the ready RSE (reservation station entry), originally written as a function, was rewritten as a module. It did not work as a function, always returning -1, meaning no RSE was selected.

I found a better way to handle flow control dependencies. All instructions are marked as depending on the stream they are associated with. It only matters for instructions that have dependencies like stores for instance. When the state of the stream resolves to a known value, then the instructions for that stream are unmarked as dependent. I think this is more efficient than the previous mechanism that searched backwards through the ROB for flow control dependencies. To do this a stream state array had to be added along with some state management.
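
A rough behavioral sketch of the stream-dependency idea described above, in Python rather than the core's HDL. The names (StreamTracker, stream_dep, can_issue_store) and the three-state encoding are assumptions made for illustration, not the actual Qupls signals.

```python
from dataclasses import dataclass
from enum import Enum


class StreamState(Enum):
    UNKNOWN = 0   # controlling branch not yet resolved
    VALID = 1     # stream is on the correct path
    INVALID = 2   # stream was mispredicted


@dataclass
class Entry:
    op: str
    stream: int
    stream_dep: bool  # True while the entry still waits on its stream


class StreamTracker:
    """Hypothetical model: one state per stream plus a flag per ROB entry."""

    def __init__(self, num_streams):
        self.state = [StreamState.UNKNOWN] * num_streams
        self.rob = []

    def enqueue(self, op, stream):
        # Every instruction is marked dependent on its stream at enqueue,
        # unless the stream state is already known.
        dep = self.state[stream] is StreamState.UNKNOWN
        entry = Entry(op, stream, dep)
        self.rob.append(entry)
        return entry

    def resolve(self, stream, valid):
        # When the stream resolves, unmark all of its entries in one pass.
        # No backwards search through the ROB is needed.
        self.state[stream] = StreamState.VALID if valid else StreamState.INVALID
        for entry in self.rob:
            if entry.stream == stream:
                entry.stream_dep = False

    def can_issue_store(self, entry):
        # Only instructions such as stores actually look at the flag.
        return (not entry.stream_dep
                and self.state[entry.stream] is StreamState.VALID)
```

The point of the sketch is that the per-entry flag is cleared by a broadcast keyed on the stream number, which is what replaces the backwards ROB search.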

A case statement in the LSQ (load store queue) was coded incorrectly. I must have been drunk at the time.

Results do not seem to be written to the register file. I found one or two bugs related to this but have not found the panacea yet. There are so many bugs it is hard to know where to start.

_________________
Robert Finch http://www.finitron.ca


Sat Jan 24, 2026 5:06 am WWW

Joined: Mon Oct 07, 2019 2:41 am
Posts: 914
My debugging is old school, use the front panel.
If I can't read/write memory then back to the drawing board.
Then I check if halt works and other simple operations.
Then I run the basic bootstrap from ROM
with the current memory to load displayed on the front panel and a # is output on the serial port.
This uses only a few primitive basic instructions.
register operate, half word register indirect, jmp , jz ,half word immediate.
At this point I can use the bootstrap to run simple test programs.
later I can then burn the e-prom for the full bootstrap.
hardware debugging is easy, just sleep on it over night.
Software debugging is a pain as I often only can revise a few lines of code a day.


Sat Jan 24, 2026 11:53 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2483
Location: Canada
Quote:
hardware debugging is easy, just sleep on it over night.
Software debugging is a pain as I often only can revise a few lines of code a day.

I have a process where I debug what causes the machine to hang. Eventually longer and longer runs of instructions work. At some point it may be debugged well enough.


Immediate constants do not seem to make it through the pipeline resulting in zeros instead of the constant. This leads to all values being zero. I have not been able to track down yet what the issue is. Cache-lines seem to be passed down the pipeline correctly and constants are read from the cache-lines. There must be a decode issue of some sort but it is not obvious.

The bitmap for the function results queue select signal was not counter rotated. This led to results not being updated correctly and queues not being read in the right order, which eventually led to a hang. Fixing this allowed the machine to progress further.

The FLO144 / FLO288 (find last set bit) modules were messed up. This caused the same register rename to appear for two registers, causing a rename stall and a hang.
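
As a reference for what a find-last-one module should return, here is a small Python model. It assumes "last" means the highest-numbered set bit and that -1 signals "no bit set"; pick_two_free is a made-up helper showing why a broken FLO could hand out the same rename register twice.

```python
def find_last_one(value, width):
    """Index of the most significant set bit in a width-bit value, or -1."""
    value &= (1 << width) - 1       # ignore bits above the module width
    return value.bit_length() - 1   # bit_length() of 0 is 0, giving -1


def pick_two_free(free_map, width):
    """Hypothetical use: choose two distinct free registers from a bitmap."""
    first = find_last_one(free_map, width)
    if first < 0:
        return -1, -1
    # Mask out the first choice so the second pick cannot repeat it.
    second = find_last_one(free_map & ~(1 << first), width)
    return first, second
```

A model like this can be run against the simulation output of the 144-bit and 288-bit modules to spot mismatches.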

_________________
Robert Finch http://www.finitron.ca


Sun Jan 25, 2026 6:26 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2483
Location: Canada
I found one issue with constants being zero which turned out to be a software issue. The assembler was encoding constants incorrectly; it was off by one bit. This caused the constant one to appear to be a zero when decoded by the CPU.
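
To illustrate how an off-by-one-bit field position turns the constant one into zero, here is a minimal Python sketch. The field position and width (IMM_LSB, IMM_BITS) are invented for the example and are not the real Qupls encoding.

```python
IMM_LSB = 17    # hypothetical: lowest bit of the immediate field
IMM_BITS = 15   # hypothetical: width of the immediate field
IMM_MASK = (1 << IMM_BITS) - 1


def encode_imm(imm, lsb=IMM_LSB):
    """Place the immediate into an instruction word at the given bit position."""
    return (imm & IMM_MASK) << lsb


def decode_imm(word):
    """Extract the immediate the way the decoder expects it."""
    return (word >> IMM_LSB) & IMM_MASK


good = decode_imm(encode_imm(1))                    # assembler and decoder agree
bad = decode_imm(encode_imm(1, lsb=IMM_LSB - 1))    # assembler one bit too low
```

With the field one bit too low, the single set bit of the constant one falls just below the decoder's field, so it reads back as zero, while larger constants come back halved.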

I found a second issue with micro-op translation. The raw micro-op was not completely transferred to the cooked version. This caused constants and other information to be omitted. Fixing this issue shows micro-ops being propagated down the pipeline correctly, but still did not fix the zero constants.

All the instructions making it through the front-end were marked as stomped on. In the micro-op translation the code that marks unused micro-ops was not working correctly, and ended up marking all micro-ops unused.

Some of the fields were not correctly assigned at the data input to the functional results queue. This led to ‘X’s in values and incorrect values.

Finally got the constant one to appear as an immediate in the re-order buffer.

Some results almost make it to the register file now. They are coming out of the queue as ‘00000000x’ instead of ‘000000001’. (They do make it to the register file now).

Decoding for oddball instructions was not complete. This showed up as a commit count of zero. The total instruction count was also zero. I had purposely left this decode as something to be done at a later date…

The instruction count is not working. I am mystified as to why. It is very simple logic. The only thing I could think of is that there is a name conflict of some sort. I called the count ‘I’, so I have renamed it to ‘TotInsn’. How can this code go wrong? cmtcnt is 4, do_commit is 1, irst is zero, and the clock is running. The count is stuck at zero when it should be incrementing by cmtcnt.
Code:
// Total instructions committed.
always_ff @(posedge clk)
if (irst)
   TotInsn <= 40'd0;
else begin
   if (do_commit)
      TotInsn <= TotInsn + cmtcnt;
end
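
For comparison against the sim trace, the expected behavior of the counter above can be modelled in a few lines of Python. The 40-bit width is an assumption taken from the 40'd0 reset literal; step is a made-up name.

```python
TOT_MASK = (1 << 40) - 1   # assumed 40-bit counter, per the 40'd0 literal


def step(tot_insn, irst, do_commit, cmtcnt):
    """Value of TotInsn after one rising clock edge."""
    if irst:
        return 0
    if do_commit:
        return (tot_insn + cmtcnt) & TOT_MASK   # wraps like the register would
    return tot_insn


# Three commit cycles with cmtcnt = 4, starting from reset.
trace = [0]
for _ in range(3):
    trace.append(step(trace[-1], irst=False, do_commit=True, cmtcnt=4))
```

With the inputs quoted in the post (cmtcnt 4, do_commit 1, irst 0) the count should advance by four on every clock.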

_________________
Robert Finch http://www.finitron.ca


Mon Jan 26, 2026 4:33 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2483
Location: Canada
After all the fixes that have gone in recently, I decided to try re-synthesizing the core. It turns out to be too big now by about 20%. I do not think there is an easy way to fix this. I decided to shelve the core for now and work on other things.

The instruction count issue from the other day seems to possibly be an issue with the core size or complexity. There is nothing wrong with the code, but it does not work in simulation. This has left me wondering what else does not work.

One thought I had was to re-write the core as maybe a two-wide core without the reservation stations and just using the re-order buffer to buffer things. It might be small enough then.

The other thing worked on today was the memory controller. The test bench for it was improved, and the memory controller ports were connected up to WISHBONE bridges, as it is desired to use the WISHBONE bus to interface to the core. The core currently works using an asynchronous bus. A project was specially created just for the testing. The tools were not working well with the core in place in another project.

The WISHBONE bridge was modified slightly to accommodate the memory controller core. It used TID values of 1 and 2 to perform transactions, and the memory controller needs to see a TID of zero for streamed memory accesses. Normally zero is an invalid TID used when the bus is supposed to be vacant. A TID of zero is a good value to use as it is somewhat meaningless for a burst access.

After a couple of minor fixes the core seems to work at least in simulation.

The next project will likely be a traffic generator to test the memory controller in the FPGA. The rf68000 project or something similar may be used.

_________________
Robert Finch http://www.finitron.ca


Wed Jan 28, 2026 12:15 pm WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2483
Location: Canada
I have been working on an MMU for Qupls and other projects. It has kept me busy for days. I started working on it to try and reduce the size and simplify it. I veered off into working on a hash-table based MMU for a bit.

Got the basic page MMU TLB working in simulation, and combined it with a hardware table walker.

Qupls4 needs a triple ported MMU, one for instructions, two for data. My thought was to make a single ported MMU then replicate it three times. Otherwise all of the MMU components would need to be triple ported and that is complex to do and test (the current MMU works that way).

To support extra large page sizes, the TLB and table walker are replicated for each page size. I am not sure this is a good approach. There are currently two supported page sizes, 8kB and 8MB. With two TLBs per MMU and three MMUs, there are six TLBs to manage. Setting things up so that they look like a single unit will be tricky. Software may be a bit messy. But if all the components can be treated uniformly that may help.

The TLBs for both page sizes try to translate the virtual address concurrently. If either TLB has a translation then that translation is used for the physical address and the translation miss is cancelled in the other TLB. (Needs more work). If both TLBs have a translation, then the translation for the larger page size is used, other translations are ignored. If neither TLB has a translation, they both attempt to look up a translation from their page tables.
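
The selection rule described above can be sketched in Python as follows. The TLBs are modelled as plain dictionaries from virtual page number to physical page number, which is a simplification; translate and the shift constants are names chosen for the example.

```python
SMALL_SHIFT = 13   # 8 kB pages
LARGE_SHIFT = 23   # 8 MB pages


def translate(vaddr, small_tlb, large_tlb):
    """Return the physical address, or None if both TLBs miss.

    small_tlb / large_tlb: dict mapping virtual page number -> physical
    page number (a stand-in for the real set-associative TLBs).
    """
    large_vpn = vaddr >> LARGE_SHIFT
    small_vpn = vaddr >> SMALL_SHIFT
    # Both lookups happen concurrently in hardware; the larger page wins.
    if large_vpn in large_tlb:
        offset = vaddr & ((1 << LARGE_SHIFT) - 1)
        return (large_tlb[large_vpn] << LARGE_SHIFT) | offset
    if small_vpn in small_tlb:
        offset = vaddr & ((1 << SMALL_SHIFT) - 1)
        return (small_tlb[small_vpn] << SMALL_SHIFT) | offset
    return None   # neither hit: both table walkers would start here
```

Cancelling the miss in the other TLB when one of them hits is not modelled; the sketch only covers which translation is chosen.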

Normally, translations for multiple page sizes are stored in the same set of page tables. However, with the TLBs acting independently they will need to use separate translation tables. The PTEs are set up so that the same table can be used for all translations.

A lot of work and head scratching. Re-inventing the wheel again.

_________________
Robert Finch http://www.finitron.ca


Wed Feb 04, 2026 8:21 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2483
Location: Canada
Added a bunch of cores to the Memory-Cores repository on GitHub.

These are mainly cores being developed for use with Qupls and other projects.

Included are a TLB, PTW, and MMU.

Been busy testing the TLB and PTW (page table walker).
Attachment:
TLB Address Timing.png

Attachment:
TLB Miss Timing.png



_________________
Robert Finch http://www.finitron.ca


Sun Feb 08, 2026 11:24 pm WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2483
Location: Canada
Still working on this...

Made a 64-bit MMU co-processor to do table walking. It has an instruction set geared towards the task. It is a little bit like OPC. Instructions are OP Rd,Rs1,Rs2,Imm. 12 registers, r0 to r11 where r0=0 always. No support for interrupts. The program is expected to sit in a waiting loop looking for the miss signal as a trigger. It should be more flexible than a dedicated SM (state machine).

It has only a 10-bit instruction pointer which is enough to address 4kB of ROM. Jumps and branches are absolute. The table walking program will not be more than a couple dozen lines of code.
The instructions are stored in BRAM that is dual ported so the main CPU can modify the program code and variables.
There is an eight-entry IP stack.
Timing is 200 MHz with instructions executing in a single clock cycle except for loads, stores and jumps. Jumps take two clocks. The timing should be comparable to a dedicated state machine.

The page table walk software looks something like:
Code:
start:
idle_state:
   wait r7                           # wait for a miss signal or command
   load_config                     # load registers r1 to r6 with config
   jltz r7,process_cmd         # jump if command present
   jeqz r4,read_level0

read_leveln:
   calc_pte_index r7
   calc_pte_adr r8,r2,r7
   load r2,[r8]               # r2 = fetch PTE
   jgez r2,page_fault      # check for valid page
   shl r8,r2,$10               # get bit 53 status
   jltz r8,read_level0a   # if a shortcut page...
   and_ppn r2,r2               # extract PPN
   shl r2,r2,r3               # r2 = address of page table
   djmp r4,read_leveln      # go down a level

read_level0:
   calc_pte_index r7
   calc_pte_adr r8,r2,r7
   load r3,[r8]         # r3 = fetch PTE
   jgez r3,page_fault   # check for valid page
read_level0a:
   build_vpn r4,r7,r5,r6   # r4 = TLBE high
   move r2,r7            # r2 = read_adr = miss_adr >> LOG_PAGESIZE
   and r1,r2,$1ff      # r1 = read_adr masked for 512 TLB entries
   xor r2,r2,r2         # r2 = way to read (0) (LRU)
   jsr set_tlbe
   jmp idle_state

page_fault:
   copy_miss_info_to_args
   set_page_fault
   jmp idle_state

# Parameters:
#      r1 = entry number
#      r2 = way to get
# Returns:
#      r1 = TLBE low (PTE)
#      r2 = TLBE high

get_tlbe:
   build_entry_no r10,r1,r2,$30
   store r10,tlb_base_adr+0x20   # set entry number field
   nop                                       # wait a bit
   load r1,tlb_base_adr+0x00      # get low order TLBE
   load r2,tlb_base_adr+0x08      # get high order TLBE
   ret

# Parameters:
#      r1 = entry number
#      r2 = way to set
#      r3 = TLBE low (PTE)
#      r4 = TLBE high
# Returns:

set_tlbe:
   store r3,tlb_base_adr+0x00   # set low order TLBE
   store r4,tlb_base_adr+0x08   # set high order TLBE
   build_entry_no r10,r1,r2,$31
   store r10,tlb_base_adr+0x20   # set entry number field
   ret

#

process_cmd:
   shl r7,r7,$1         # test bit 62
   jltz r7,.get_pte
   shl r7,r7,$1
   jltz r7,.set_pte
   jmp idle_state
.get_pte:
   get_entry_no         # r1 = entry number, r2 = way
   jsr get_tlbe
   copy_tlbe_to_args   # set TLBE in arg area
   jmp idle_state
.set_pte:
   get_entry_no
   copy_args_to_tlbe   # get TLBE into r3,r4
   jsr set_tlbe
   jmp idle_state
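
The read_leveln / read_level0 loop in the listing above can be modelled roughly in Python as below. Bit 63 as the valid bit and bit 53 as the shortcut bit are read off the jgez / shl comments in the listing; the 10-bit index per level, the 8-byte PTE size and the 40-bit PPN mask are assumptions made for the sketch.

```python
PAGE_SHIFT = 13            # 8 kB pages
INDEX_BITS = 10            # assumed: 8 kB table / 8-byte PTE = 1024 entries
PTE_SIZE = 8               # assumed 64-bit PTEs
VALID_BIT = 1 << 63        # jgez on the PTE means "not valid"
SHORTCUT_BIT = 1 << 53     # "bit 53 status" in the listing
PPN_MASK = (1 << 40) - 1   # hypothetical PPN field


class PageFault(Exception):
    pass


def walk(memory, root_adr, miss_adr, levels):
    """Walk the tables for miss_adr; return (pte, level) of the final PTE.

    memory: dict mapping byte address -> 64-bit PTE value.
    """
    table = root_adr
    for level in range(levels - 1, -1, -1):
        shift = PAGE_SHIFT + level * INDEX_BITS
        index = (miss_adr >> shift) & ((1 << INDEX_BITS) - 1)
        pte = memory.get(table + index * PTE_SIZE, 0)
        if not pte & VALID_BIT:
            raise PageFault(f"invalid PTE at level {level}")
        if level == 0 or pte & SHORTCUT_BIT:
            return pte, level          # leaf or shortcut (large) page
        table = (pte & PPN_MASK) << PAGE_SHIFT   # go down a level
```

The TLB update (set_tlbe) and the command handling are left out; the sketch only covers the descent through the tables.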

_________________
Robert Finch http://www.finitron.ca


Tue Feb 10, 2026 1:24 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2483
Location: Canada
I have been working on a co-processor to handle TLB misses. The state machine for a TLB miss is just complex enough that I think it is better to have this done by a custom processor. There are a few contemporary processors that work this way. Realized it is really a system control processor (SCP) that I am working on.

I got the bright idea of adding the Qupls4 Copper audio/video co-processor to the MMU TLB miss co-processor. Why have two separate co-processors if you can get away with one? The Copper is a co-processor modelled after the Copper processor in the Amiga which had just four instructions (WAIT, SKIP, MOVE, and JUMP).

I added a couple of additional instructions to the co-processor instruction set to make it more flexible. So, it can branch based on video scan position (a substitute for the SKIP instruction). It can work as a generic processor now (a couple of important instructions like ADD were added); but the basic intent is that it processes video frame and TLB miss interrupts. There is a WAIT instruction that puts the co-processor in a lower power mode until either a TLB miss occurs or the video frame starts. A couple of the instructions (AND64 and STOREI) need to be aligned at odd addresses, as they are followed by 64-bit constants which need even addresses. STOREI (store immediate) is equivalent to the MOVE instruction in the Copper.

Currently the video frame interrupt is a higher priority than the TLB miss (it gets checked for first if they occur concurrently), but the TLB miss is capable of interrupting the frame processing code as the frame processing code may be lengthy. I have some qualms about this. Two separate co-processors may work better but are also more costly.

Each interrupt type has some of its own registers swapped with GPRs when an interrupt occurs and swapped back on an IRET. It has very fast interrupt servicing. From the WAIT instruction it takes only three clock cycles to get to the interrupt service routine, including saving the address on an internal stack and swapping registers. Two cycles is the required wakeup time for the BRAM.

The amount of local memory the co-processor has was increased to 32kB, mainly as some of the audio/video processing code may get complex. With the additional complexity and size of the processor, the timing has degraded to about 150 MHz, which is still plenty fast for the destination system.

_________________
Robert Finch http://www.finitron.ca


Wed Feb 11, 2026 7:11 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2483
Location: Canada
More work on the co-processor. Got it going in simulation. Wrote an assembler for it so code could be generated to test in sim. Most of the assembler work was deleting lines of code from the Qupls4 assembler.
I ended up turning the co-processor into more of a sequential machine. There were just too many timing issues; it was getting to be like a finger puzzle. I was having trouble getting it to simulate correctly. It definitely will not work if it does not simulate. It now takes an average of about 2.5 clocks per instruction. Still not too bad. In the simulation run, the TLB miss is handled in about 84 clock cycles, which I think is pretty normal. Just two table levels are enough for a 32-bit address space.
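
The two-level figure can be checked with a little arithmetic, sketched here in Python. It assumes 8 kB pages, 8-byte PTEs, and that each page table occupies exactly one page; levels_needed is a made-up helper.

```python
def levels_needed(va_bits, page_shift=13, pte_size=8):
    """Number of page table levels needed to cover va_bits of address space.

    Assumes each table fills one page, so a table holds
    2**page_shift / pte_size entries.
    """
    index_bits = page_shift - (pte_size.bit_length() - 1)   # 13 - 3 = 10
    remaining = max(va_bits - page_shift, 0)                 # bits above the offset
    return max(1, -(-remaining // index_bits))               # ceiling division


two_level_reach = 13 + 2 * 10   # address bits covered by two levels
```

Under those assumptions two levels cover 33 address bits, one more than a 32-bit space needs.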

The page table walking routine takes up just over 256 bytes. I may end up reducing the amount of memory locally available.

I may end up using the co-processor to test things. I like to see stuff running in an FPGA as opposed to just simulation.

I broke the co-processor into a couple of more modules, and that tanked the size of it. It is now much larger. It is about 7500 LUTs.

The first eight registers are stacked with the IP on an interrupt OR subroutine call (using a 530-bit wide stack). The return instruction has a mask indicating which registers to restore. (So, return does both ordinary and interrupt returns). I have written the TLB miss handler so that it uses just eight registers; that way there is no need to save and restore things, it’s automatic.
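
A minimal Python model of the stacked call and masked return described above. The mask polarity (a set bit means "restore this register"), the register count of 12 and the return address of IP + 1 are assumptions; Copro, call and ret are names made up for the sketch.

```python
class Copro:
    """Toy model of the wide call stack with a masked return."""

    STACKED = 8   # the first eight registers are saved with the IP

    def __init__(self, num_regs=12):
        self.regs = [0] * num_regs
        self.ip = 0
        self.stack = []

    def call(self, target):
        # One wide push: return address plus a snapshot of r0..r7.
        self.stack.append((self.ip + 1, tuple(self.regs[:self.STACKED])))
        self.ip = target

    def ret(self, restore_mask):
        # One wide pop: bit i of the mask selects whether ri is restored.
        ret_ip, saved = self.stack.pop()
        for i in range(self.STACKED):
            if (restore_mask >> i) & 1:
                self.regs[i] = saved[i]
        self.ip = ret_ip
```

An interrupt return would use a mask of all ones, while a subroutine returning a value in, say, r1 would leave that bit clear so the result survives the return.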

_________________
Robert Finch http://www.finitron.ca


Fri Feb 13, 2026 3:05 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2483
Location: Canada
Added bitmap instructions to the co-processor. These instructions test, set, clear, or change a single bit in memory. I do not usually have these instructions in the ISA as they are not frequently used. They look like other load / store instructions except that the index register specifies a bit location relative to the displacement. The range of indexing is reduced by a factor of 64. Sample:
Code:
; Free a page of memory
free_page:
   loadi %r4,$131072               ; number of bits in PAM
   load %r1,cmd_arg1               ; get page number to free
   jge %r1,%r4,cmd_ok            ; ignore bad request
   bmclr %r0,PAM[%r1]            ; clear the bit in the PAM
   jump cmd_ok

Bitmap instructions replace about a half dozen other instructions including a bit testing loop.

Added page allocation map (PAM) functions to the copro’s software. These functions allocate or free a page of memory using a bitmap of allocated physical pages. The bitmap is maintained in the co-processor's internal memory.
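
What the PAM functions do can be sketched in Python along these lines. The 131072-bit size comes from the free_page sample; the 64-bit word layout, the lowest-free-bit allocation order and the class and method names are assumptions, not the copro's actual code.

```python
PAM_BITS = 131072          # number of bits in the PAM, from the sample
WORD_MASK = (1 << 64) - 1


class PageAllocMap:
    """Bitmap of allocated physical pages, one bit per page."""

    def __init__(self, nbits=PAM_BITS):
        self.nbits = nbits
        self.words = [0] * (nbits // 64)

    def test(self, bit):
        return (self.words[bit >> 6] >> (bit & 63)) & 1

    def set(self, bit):
        self.words[bit >> 6] |= 1 << (bit & 63)

    def clear(self, bit):          # what bmclr does to one bit
        self.words[bit >> 6] &= ~(1 << (bit & 63)) & WORD_MASK

    def alloc_page(self):
        """Return the lowest free page number and mark it, or None if full."""
        for i, word in enumerate(self.words):
            if word != WORD_MASK:
                free = ~word & WORD_MASK
                bit = (free & -free).bit_length() - 1   # lowest clear bit
                page = i * 64 + bit
                self.set(page)
                return page
        return None

    def free_page(self, page):
        """Clear the page's bit; out-of-range requests are ignored."""
        if page < 0 or page >= self.nbits:
            return False
        self.clear(page)
        return True
```

The boolean result of free_page is only there to make the range check visible; the sample routine simply jumps to cmd_ok in both cases.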

Worked on the TLB, adding flush operations. Found that an address mux was replicated four times by accident. (It was inside a generate block and should not have been). Fixing this cut the size of the TLB in half.
The extra muxes required for flushes slowed down the TLB a few MHz.

_________________
Robert Finch http://www.finitron.ca


Sat Feb 14, 2026 5:03 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2483
Location: Canada
Drifted around today. Made a copy of the co-processor and started adding code from the AVIC project from 2017! I was thinking it may be just as fast to do things like line-draws and audio playback from the co-processor. Contemplating making a system on the chip somewhat like an Amiga or Atari ST.

AVIC is an audio / video controller like some of the micros from yesteryear.

_________________
Robert Finch http://www.finitron.ca


Sun Feb 15, 2026 5:26 am WWW