 Qupls (Q+) 

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2501
Location: Canada
Took a break from the usual to make a Collatz conjecture machine which can process hundreds of numbers in parallel. It should be able to run at 250 MHz. Now trying to figure out how to display results. I have the calculation organized as a 24x24 matrix of calculators. So, I may display the done status of each calculation as a block graphic on the screen. Once all 24x24 calcs are done, the screen will clear and begin with the next set of numbers. This should repeat up until the last specified number to be tested.
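
For anyone wanting to picture the display before the hardware is sorted out, here is a minimal software model of one 24x24 batch. It is only a sketch, not the hardware: the function names, the row-major assignment of numbers to cells, and the one-Collatz-step-per-cycle timing are all assumptions.

```python
ROWS, COLS = 24, 24  # matrix size from the post


def collatz_steps(n):
    """Number of Collatz steps for n (n >= 1) to reach 1."""
    steps = 0
    while n != 1:
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        steps += 1
    return steps


def run_batch(first):
    """Step counts for ROWS*COLS consecutive numbers starting at `first`.

    Cell (r, c) handles first + r*COLS + c (row-major, an assumed layout).
    """
    return [[collatz_steps(first + r * COLS + c) for c in range(COLS)]
            for r in range(ROWS)]


def done_bitmap(steps_matrix, cycle):
    """Done status of every cell after `cycle` steps: '#' done, '.' busy."""
    return ["".join("#" if s <= cycle else "." for s in row)
            for row in steps_matrix]


if __name__ == "__main__":
    batch = run_batch(1)
    for line in done_bitmap(batch, 20):
        print(line)
```

Printing `done_bitmap` for increasing cycle counts gives a rough idea of how the block graphic would fill in before the screen clears for the next set of numbers.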

The rusty old rf6847 video display generator (VDG) is being used. The done status bits are connected to the character bitmap input, and an external character generator is selected. I had to modify the rf6847 to accept external sync inputs; normally it generates the sync signals itself. The display is not quite right yet: vertical timing is off, resulting in a flickering screen. Horizontal timing seems okay.

_________________
Robert Finch http://www.finitron.ca


Mon Feb 16, 2026 4:56 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2501
Location: Canada
More work on the co-processor. This time integrating graphics acceleration. Graphics acceleration includes a hardware blitter and hardware cursors. The screen is 400x300 as it needs to fit into BRAM with memory available for other things.

The co-processor can now write to the screen memory (screen clear is successful). Point plot acceleration does not work yet; it is the current item being worked on. Because the processor only handles 64-bit data and pixels are 16-bit, there is no easy way to plot pixels in software, hence the point plot accelerator. The accelerator plots a point at the current graphics position in the current color.
Graphics commands are posted by the processor to a graphics command queue. The queue allows longer running operations to be run without stalling the graphics commands coming from the processor. The command queue seems to work.

Operation of the core is somewhat tricky since it is a co-processor capable of running a program, yet at the same time it acts as a state machine to perform accelerated graphics operations. In the instruction fetch stage the graphics command queue is checked to see if it is empty. If it is not empty, then graphics processing is triggered. This acts like an interrupt. It is effectively an interrupt routine performed by hardware. (Hardware based hardware interrupt - HHI). Vertical sync and TLB misses are also detected at the ifetch stage. A trick behind the graphics acceleration is that it only performs a few states at a time before returning to the ifetch stage. This is so that not very many cycles are lost before an interrupt is serviced. Because graphics operations are sent to a command queue it should be possible to perform graphics operations in an interrupt subroutine. Any outstanding operation will complete first though.

_________________
Robert Finch http://www.finitron.ca


Wed Feb 18, 2026 5:51 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2501
Location: Canada
The pixel plot function works!

I keep forgetting that the co-ordinates are 16.16 fixed point and the graphics ends up at the top left of the screen co-ordinate 0,0. Things are slowly coming together. Drawing filled rectangles almost works. It works except that every other pixel is drawn, and it alternates between lines, resulting in a checkerboard appearance. I have yet to figure out the cause of this. Filled rectangle drawing is done by the blitter, so that is almost working.
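
The 16.16 slip is easy to reproduce in software. A plain integer written to a 16.16 register is read as a tiny fraction, so its integer part is zero and the pixel lands at 0,0. The sketch below shows that, assuming unsigned co-ordinates and a linear 400-pixel-wide frame buffer; the helper names and the addressing formula are illustrative, not taken from the core.

```python
FRAC_BITS = 16          # 16.16 fixed point: 16 integer bits, 16 fraction bits
SCREEN_WIDTH = 400      # screen is 400x300 per the earlier post


def to_fixed(x):
    """Convert a non-negative number to an unsigned 16.16 value."""
    return int(round(x * (1 << FRAC_BITS))) & 0xFFFFFFFF


def pixel_of(fixed):
    """Integer (pixel) part of a 16.16 value."""
    return fixed >> FRAC_BITS


def pixel_index(x_fixed, y_fixed, width=SCREEN_WIDTH):
    """Linear pixel index, assuming a simple row-major frame buffer."""
    return pixel_of(y_fixed) * width + pixel_of(x_fixed)


if __name__ == "__main__":
    # Forgetting the conversion: both co-ordinates truncate to zero.
    print(pixel_index(100, 50))
    # With the conversion the point lands where intended.
    print(pixel_index(to_fixed(100), to_fixed(50)))
```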

Loop counts to display things on the screen are not being honored. There is a loop to 15,000 to display random pixels on the screen, but only about 50 pixels show up. Triangles do not show up on the screen yet. The test program runs all the way through, but no triangles are displayed.

Hardware cursor logic is in place but the cursors do not show up yet.

_________________
Robert Finch http://www.finitron.ca


Thu Feb 19, 2026 5:51 am

Joined: Mon Oct 07, 2019 2:41 am
Posts: 922
A good test might be the game of life.


Fri Feb 20, 2026 12:25 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2501
Location: Canada
Quote:
A good test might be the game of life.
Yeah, that would be a good test. I need to get the hardware working a little bit better first.

Little by little things improve.

Not sure what I did, but the checkerboard pattern for filled rectangles is fixed. I think it had to do with the data input latching. The input data latch buffer was not wide enough, a holdover from the 32-bit data version of the core.
In theory the blitter destination channel may now be used to zero out memory (any constant may be set).

Figured out a couple of reasons why loops were not being honored. The first was loading the loop count register with a number too large to fit in the immediate field of the instruction. There was a ‘loadi %r5,$25000’; the 25000 is treated as a negative number because the immediate field is only 15 bits. The second was in the state machine for the decrement-and-jump instruction. The register was updated twice, once in the branch state and a second time in a writeback state, so the register was decremented by two on each pass.
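
The immediate overflow can be checked with a few lines of arithmetic. This is a sketch assuming the 15-bit field is sign-extended in the usual two's complement way; the post does not say whether the assembler reads `$25000` as decimal or hex, so both readings are shown, and both come out negative.

```python
IMM_BITS = 15  # width of the immediate field


def fits(value, bits=IMM_BITS):
    """True if value is representable as a signed `bits`-bit immediate."""
    return -(1 << (bits - 1)) <= value < (1 << (bits - 1))


def sign_extend(value, bits=IMM_BITS):
    """Value the hardware would see after truncating to `bits` and sign-extending."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


if __name__ == "__main__":
    for name, v in (("decimal 25000", 25000), ("hex 25000", 0x25000)):
        print(name, "fits:", fits(v), "seen as:", sign_extend(v))
```

A signed 15-bit field covers -16384 to 16383, so a loop count of 15,000 fits but 25,000 does not.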

Found one issue with the triangle draw code. The divider load signal was stuck active causing the divider not to work.

Eight of the thirty-two hardware cursors now appear on the screen. IDK why the other 24 do not appear. They could be transparently colored (colors being chosen randomly). Greatly reduced the number of clock cycles consumed fetching cursor data. Done by assuming that cursor data is available within two clock cycles, so the normal data latching state is not used. Also, the fetch machine was revised so that it only fetches data for cursors that are displayed. With 32 cursors displayed it takes about 70 CPU clocks during the horizontal sync period. This works out to about 28 video clocks.

Text blitted characters show up on the screen, but the location seems somewhat random. The issue with location has me a bit mystified as the character box looks correct, the correct width and height. The width and height are included in the address calculation so it is at least partially working. Also, the character display is not correct. It would be nice to be able to dump messages to the screen.

_________________
Robert Finch http://www.finitron.ca


Fri Feb 20, 2026 4:24 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2501
Location: Canada
Fixing timing issues in the core. The graphics processing state was being skipped over (a task was not called) for some reason. So, I re-wrote the logic a little bit to try and ensure the task is always called. There is better handshaking of the task invocation.

Point plot is busted. It is always plotting to the top left corner of the screen, obviously a co-ordinates issue. I do not think there is an issue with the logic, but rather physical issues. One issue turned out to be software.
Wrote tasks to take care of memory reads and writes. This ensures they are done in a consistent manner. It also reduces the number of LOC. Memory writes seem to not always work. So, I increased the delay before an ack is generated on a write to give the write more time.

The character textblit command code was defined as zero for convenience in displaying characters. It turns out that at startup the command fifo was reading a zero at the output, causing a textblit operation to occur to the upper left corner of the screen. So, a blank character appeared at this location after reset. So, the command code for character blit was changed. Command zero is now a NOP operation.

_________________
Robert Finch http://www.finitron.ca


Tue Feb 24, 2026 10:48 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2501
Location: Canada
Not having much luck getting things to work well. There seemed to be an issue with instructions in the BRAM. So, thinking it may be a bad BRAM cell, I made the instruction memory triple-modular redundant. It did not make a difference.

There is a jump to the reset address occurring when an attempt is made to perform a graphics draw operation. I can see this in the integrated logic analyzer. But, the jump is not in the program. I checked the .mem file used to load the FPGA’s BRAM and it is correct. No jump. So, where / how this jump is being inserted is somewhat mystical. I tried numerous approaches to remedy this, but no luck yet. It is almost as if it is being hacked.

Branches are now instruction pointer relative. The IP can address up to a megabyte of memory, but the address field of the instruction is only 15 bits. To reach further, a value must be loaded into a register and a register indirect jump performed.

64-bit constant support was added to the OR and XOR instructions. Support for 32-bit constants was also added. Rather than adding instructions, two reserved values of the immediate field indicate 32-bit and 64-bit constants. If the immediate field of the instruction is 15’h4001 then a 32-bit constant follows. If the immediate field of the instruction is 15’h4000 then a 64-bit constant follows. Immediates can now be aligned to 32-bit address boundaries (previously they had to be aligned to 64-bit addresses).
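
A rough software model of the immediate decode, for illustration only. The two marker values come from the description above; everything else is an assumption: the function name, that ordinary immediates are sign-extended, that the constant is delivered as 32-bit words, and that the low word of a 64-bit constant comes first.

```python
IMM32_MARK = 0x4001  # 15'h4001: a 32-bit constant follows
IMM64_MARK = 0x4000  # 15'h4000: a 64-bit constant follows


def sign_extend(value, bits):
    """Two's complement sign extension of a `bits`-wide field."""
    sign = 1 << (bits - 1)
    return ((value & ((1 << bits) - 1)) ^ sign) - sign


def resolve_immediate(imm15, following_words):
    """Return (constant, number of extra 32-bit words consumed).

    `following_words` holds the 32-bit words after the instruction.
    Wide constants are returned as raw bit patterns (no extension).
    """
    if imm15 == IMM32_MARK:
        return following_words[0] & 0xFFFFFFFF, 1
    if imm15 == IMM64_MARK:
        lo, hi = following_words[0], following_words[1]  # assumed low word first
        return ((hi & 0xFFFFFFFF) << 32) | (lo & 0xFFFFFFFF), 2
    return sign_extend(imm15, 15), 0


if __name__ == "__main__":
    print(resolve_immediate(0x0005, []))
    print(hex(resolve_immediate(0x4000, [0x89ABCDEF, 0x01234567])[0]))
```

If ordinary immediates are sign-extended, the two markers are the values -16384 and -16383, so those two constants can only be expressed in the long form.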

I chose to switch over to working on the rf68000.

_________________
Robert Finch http://www.finitron.ca


Wed Feb 25, 2026 2:36 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2501
Location: Canada
Worked on Qupls4 again.

Updated the micro-op translator to use explicit field naming rather than implicit positional fields. It is more text (LOC) but it is also more likely to work correctly. It also avoids issues when field sizes change. With named fields the synthesizer works out the offsets, so there is no need to be aware of them during coding.

Still pondering how to reduce the size of the RAT. I may try and make it into an option, but without the RAT there needs to be more dependency checking. The RAT is 53 kLUTs for four ways.

Also thinking about increasing the register size and data-path.

Studied capabilities some more and noted that they do not provide fine-grained access rights. Each permission is just a single bit, which is yea or nay. For example: can load or cannot load. I would like the capabilities to have access levels applied, which changes things. For example, can load if access level > 50, so things would not be black and white. But with about 16 (or more) different permissions, providing multiple bits for each one results in a lot of bits. 256 access levels for each permission (a byte for the access level) would require 128 bits. I had thought that to do this would require at least 256-bit registers. This may be handled with vector registers.
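
To make the bit budget concrete, here is a sketch of 16 permissions with one byte of required access level each, packed into a single 128-bit value. The field layout, the permission numbering (load as permission 0), and the helper names are all hypothetical; only the 16 x 8 = 128 arithmetic and the "access level > 50" style of check come from the paragraph above.

```python
NUM_PERMS = 16    # about 16 different permissions
LEVEL_BITS = 8    # a byte per permission gives 256 access levels
PERM_LOAD = 0     # hypothetical index for the load permission


def pack_levels(levels):
    """Pack one required level per permission into a single integer."""
    if len(levels) != NUM_PERMS:
        raise ValueError("expected one level per permission")
    word = 0
    for i, lvl in enumerate(levels):
        if not 0 <= lvl < (1 << LEVEL_BITS):
            raise ValueError("level out of range")
        word |= lvl << (i * LEVEL_BITS)
    return word


def required_level(word, perm):
    """Extract the required level for one permission."""
    return (word >> (perm * LEVEL_BITS)) & ((1 << LEVEL_BITS) - 1)


def allowed(word, perm, access_level):
    """Permission is granted when the access level exceeds the requirement."""
    return access_level > required_level(word, perm)


if __name__ == "__main__":
    cap = pack_levels([50] + [0] * (NUM_PERMS - 1))
    print(NUM_PERMS * LEVEL_BITS, "bits of permissions")
    print(allowed(cap, PERM_LOAD, 51), allowed(cap, PERM_LOAD, 50))
```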

_________________
Robert Finch http://www.finitron.ca


Fri Mar 13, 2026 5:40 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2501
Location: Canada
Floating-point on the mind for the past couple of days. The double-precision reciprocal estimate of my float modules has been updated for 9-bit or better accuracy. It also handles special cases better (NaNs, zero, infinity, powers of two). Takes about 550 LUTs. I figure the table for nine-bit accuracy does not take up enough additional room to worry about. It was a challenge to get sub-normal numeric results accurate. Picked infinity minus one LSB and zero plus one LSB among the special test cases. Trying to hit the extremes.

The float modules are in common to several projects I am working on, really a project in their own right.

_________________
Robert Finch http://www.finitron.ca


Thu Mar 26, 2026 4:08 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2501
Location: Canada
With the latest batch of changes for Qupls it looks like it will just fit on the FPGA. No room for a DRAM controller though. But should be able to get something working using the scratchpad RAM.

I have been updating the documentation to reflect changes to the ISA design. There is no longer a separate set of vector registers. Instead, the number of GPRs has increased to 128.

This was done as instructions are now mapped 1:1 to micro-ops. Eliminating the micro-op ISA expansion and micro-op queue saved a large number of LUTs.

I managed to reduce the size of the RAT significantly, at the cost of some performance. It is organized differently. There are sets of eight rename registers for each pair of ISA registers. The processor will stall more often as it cannot pick from all available rename registers.

_________________
Robert Finch http://www.finitron.ca


Mon Mar 30, 2026 3:23 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2501
Location: Canada
Updated the docs.

Got rid of the ‘V’ field in the instructions which was selecting whether a register spec was a vector or scalar register. The ‘V’ field is now hidden in the micro-op and set depending on instruction decoding. The opcode and specific instruction determine if a register is a vector or scalar register. It turned out that in many cases the assignment is fixed and it was wasteful to explicitly specify it in the instruction. For instance, a ‘SET’ type instruction always sets a scalar destination register. The Rs3 field is most often a scalar register because it is used as the vector mask.

Removing the ‘V’ field made four more bits available in the instruction which were rapidly used to specify more registers. 128 registers can now be specified. For vector instructions the first register of a group of registers needed to hold the vector is specified.

A lot of changes had been made to the documentation. In a classic blunder, I removed all the vector instructions. Then changed my mind. Much to my dismay, it turned out that I had not recently updated the docs in GitHub. I searched around and managed to find a backup document with most of the material that I removed so it could be restored.
I had removed the vector material thinking that micro-ops would be removed from the design. Then I looked at the code.

Changed the way micro-ops are loaded. An instruction can be translated into up to 16 micro-ops. With four instructions processed every clock cycle that is 64 micro-ops. The code was packing the micro-ops together so there would be no NOPs, and that turns out to require a lot of LUT resources. The synthesized code was 116,000 LUTs.
Instead of packing the micro-ops, which are something like 300 bits wide, the code was changed to leave the micro-ops in place. A side array of indexes into the micro-ops is created and packed instead. The index is only 7 bits, about 40x smaller. This resulted in a much smaller footprint, just a few kLUTs. Hopefully it will be possible to use micro-ops. Synthesis is still running with the changes.
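
The idea of packing indexes rather than the micro-ops themselves can be sketched in software. This is only a model of the data movement, not the RTL: the slot layout (16 slots per instruction, unused slots marked `None`) and the function names are assumptions, and the 300-bit width is the rough figure quoted above.

```python
SLOTS_PER_INSN = 16     # up to 16 micro-ops per instruction
INSNS_PER_CLOCK = 4     # four instructions per clock
TOTAL_SLOTS = SLOTS_PER_INSN * INSNS_PER_CLOCK
MICRO_OP_BITS = 300     # approximate micro-op width
INDEX_BITS = 7          # index width


def pack_indexes(slots):
    """Return indexes of the used slots, in program order.

    `slots` has TOTAL_SLOTS entries; None marks an unused (NOP) slot.
    The wide micro-ops themselves are never moved.
    """
    return [i for i, uop in enumerate(slots) if uop is not None]


def fetch_packed(slots, indexes):
    """Read the micro-ops back through the packed index array."""
    return [slots[i] for i in indexes]


if __name__ == "__main__":
    slots = [None] * TOTAL_SLOTS
    slots[0], slots[1] = "a0", "a1"   # instruction 0 expands to two micro-ops
    slots[16] = "b0"                  # instruction 1 expands to one
    slots[48] = "d0"                  # instruction 3 expands to one
    idx = pack_indexes(slots)
    print(idx, fetch_packed(slots, idx))
    print("bits shuffled:", TOTAL_SLOTS * MICRO_OP_BITS, "vs", TOTAL_SLOTS * INDEX_BITS)
```

Only the narrow index array goes through the compaction network; a micro-op is read once through its index when it is needed.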

_________________
Robert Finch http://www.finitron.ca


Thu Apr 02, 2026 2:29 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2501
Location: Canada
Workin, workin, workin on the FPU…

Seems like a lot of hardware is required to support a static round mode. The round mode needs to be fed to the FP reservation stations, and it must come from the re-order buffer or the FP control register. Searching the ROB for the round mode will really burn up the LUTs.
I am tempted just to stipulate that an FP sync instruction must be used whenever the static round mode is changed. Switching the static rounding mode would then be quite slow (20+ cycles) as the pipeline would need to be flushed of FP instructions. I suppose the sync could be hidden in the instruction to set the round mode (FRM).

Re-arranged the structure of Qupls4 a little bit to reduce the number of functional units.

Moved IMUL from its own unit to being part of the FPU. Moved crypto functions into SAU #1. Moved IDIV and ISQRT under the TRIG functions.

Made the result queues optional, and now write all functional unit outputs directly to the register file.

Added a passthru mode to the function unit result queues. This allows the queue to be in place in the code without doing anything. This is used when the number of write ports to the register file is the same as the number of function units. There is no need for a queue in that case.

Execute stage now looks like:
Attachment:
Qupls4_exec_stage.png



_________________
Robert Finch http://www.finitron.ca


Fri Apr 03, 2026 2:15 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2501
Location: Canada
More work re-arranging the execute stage. Got things down to eight basic pipelines, by stacking units vertically. Each pipeline writes to the register file rather than having a dynamically selected pipeline writing as it was before.

Synthesized the core and it is about 10% too large.

Converted the neural-network code (for neural net acceleration) to use single precision floating-point instead of fixed point. Conversion was not too bad. A simple state machine was added to control the flow of things. The floating-point version is five times slower and larger than the fixed point. The neural-net code is too large to incorporate into the minimal version of the core.

Coded a double-precision float reciprocal square-root approximation as micro-ops. It provides a really rough estimate, good to four bits of accuracy, but it only takes seven clock cycles to calculate. This approximation can be fed through some N-R iterations to improve accuracy as needed.
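
For reference, the standard Newton-Raphson step for a reciprocal square root is y' = y * (1.5 - 0.5 * x * y * y), and each pass roughly doubles the number of good bits. The sketch below shows the refinement only; `rough_rsqrt` is a stand-in that fakes a four-bit estimate by perturbing the exact value, and is not the micro-op sequence.

```python
import math


def rough_rsqrt(x, rel_err=2.0 ** -4):
    """Stand-in for a rough estimate: exact 1/sqrt(x) with about 4 good bits."""
    return (1.0 / math.sqrt(x)) * (1.0 + rel_err)


def refine_rsqrt(x, y, iterations):
    """Apply Newton-Raphson steps for 1/sqrt(x) starting from estimate y."""
    for _ in range(iterations):
        y = y * (1.5 - 0.5 * x * y * y)
    return y


def rel_error(x, y):
    """Relative error of y as an estimate of 1/sqrt(x)."""
    return abs(y * math.sqrt(x) - 1.0)


if __name__ == "__main__":
    x = 2.0
    y0 = rough_rsqrt(x)
    for n in range(5):
        print(n, rel_error(x, refine_rsqrt(x, y0, n)))
```

Starting from four bits, about four iterations are enough to reach double-precision accuracy in this model.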

Attachment:
Qupls4_exec_stage_2.png



_________________
Robert Finch http://www.finitron.ca


Sun Apr 05, 2026 2:19 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2501
Location: Canada
Strange coincidences. I went to update the floating-point control and status register docs, and I could not find the docs on the register anywhere recent. It is gone from all kinds of copies of documents as if someone wrote a script to remove it. I found a copy from like 3 years ago minus updates.

It is a coincidence, as I went to update it because I thought it was poorly done. Having studied some of the more recent machines, I was going to switch it to be more similar to ARM or RISC-V.

_________________
Robert Finch http://www.finitron.ca


Mon Apr 06, 2026 12:45 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2501
Location: Canada
Put some more work into the config utility.
Attachment:
ExecConfig.png


It was not until I went to update the GUI config utility that I realized I should make the pairs of execute pipelines identical, then have options to remove components. That allows building the system with ½ the pipelines present to reduce the size.

Green text items are enabled by default.
Attachment:
Qupls4_exec_stage_3.png


Note that for some modules a well-balanced system would have only a single module of that type as the instructions are rarely executed. For instance, there is very little need for two ISQRT modules. The purpose of providing two is to allow an entire pipeline to be removed without losing functionality. It is the same case for divide and trig functions and others.

There are a lot of ALUs as the ALU performs common operations: ADD, SUB, CMP, AND, OR, XOR, FCMP, FNEG, FABS, MOVE.

Only one branch unit can be enabled.



_________________
Robert Finch http://www.finitron.ca


Tue Apr 07, 2026 10:48 am