


 [ 241 posts ]  Go to page Previous  1 ... 12, 13, 14, 15, 16, 17  Next
 Qupls (Q+) 

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2430
Location: Canada
Compare instruction formats and 2r1w instruction formats were worked on today. Instructions may independently specify the operation size (8, 16, 32, or 64 bits for integer instructions) and the size of a constant (10, 50, 90, or 130 bits). The odd constant sizes are due to the use of 40-bit instruction words coupled with the ability to encode up to 10 bits in the 2r1w instruction. The constants will be extended or truncated to the operation size.
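
A rough sketch of where the odd constant sizes come from: 10 bits in the instruction itself plus zero to three extra 40-bit words. The function names are hypothetical, and sign extension is an assumption here, since the post only says "extended or limited".

```python
INSTR_BITS = 40    # instruction word width from the post
INLINE_BITS = 10   # constant bits carried inside the 2r1w instruction

def constant_sizes(max_extra_words=3):
    """Constant widths available with 0..max_extra_words extra words."""
    return [INLINE_BITS + INSTR_BITS * n for n in range(max_extra_words + 1)]

def fit_constant(value, const_bits, op_bits):
    """Sign-extend (assumed) or truncate a const_bits value to op_bits."""
    value &= (1 << const_bits) - 1
    if value >> (const_bits - 1):          # top bit set: negative
        value -= 1 << const_bits
    return value & ((1 << op_bits) - 1)

print(constant_sizes())                    # [10, 50, 90, 130]
print(hex(fit_constant(0x3FF, 10, 32)))    # 0xffffffff (-1 extended to 32 bits)
```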

An instruction may, for instance, specify a 90-bit constant but only a 32-bit operation, which would allow the upper constant word to “hide” an instruction.
The 2r1w formats may substitute a constant for either register. There are also instructions that directly support a constant field of 20 bits, assuming one register (Rs1) and one constant are in use.

Branches are going to be based on a bit test of a vector from a compare result. The compare result is stored in a GPR. There will also be branches that test a register value directly: 0 (false), 1 (true), <0, or >0. The branch displacement is 20 bits.
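
A minimal sketch of the compare-then-bit-test idea. The bit positions (EQ, LT, GT), the byte-unit displacement and the 5-byte fall-through are all assumptions for illustration; the post does not fix any of them.

```python
# Hypothetical bit positions within the compare result vector.
EQ, LT, GT = 0, 1, 2
INSTR_BYTES = 5    # one 40-bit instruction word
DISP_BITS = 20     # conditional branch displacement width from the post

def compare(a, b):
    """Build a compare result vector, as would be written to a GPR."""
    result = 0
    if a == b:
        result |= 1 << EQ
    if a < b:
        result |= 1 << LT
    if a > b:
        result |= 1 << GT
    return result

def branch_on_bit(pc, cmp_result, bit, displacement):
    """Return the next PC: the target if the tested bit is set, else fall through."""
    lo, hi = -(1 << (DISP_BITS - 1)), (1 << (DISP_BITS - 1)) - 1
    if not lo <= displacement <= hi:
        raise ValueError("displacement does not fit in 20 bits")
    if (cmp_result >> bit) & 1:
        return pc + displacement
    return pc + INSTR_BYTES

flags = compare(2, 5)                       # only the LT bit is set
print(branch_on_bit(100, flags, LT, 40))    # 140 (taken)
print(branch_on_bit(100, flags, EQ, 40))    # 105 (not taken)
```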

_________________
Robert Finch http://www.finitron.ca


Wed Oct 29, 2025 3:05 am WWW

Joined: Mon Oct 07, 2019 2:41 am
Posts: 880
A branch to subroutine would be handy with 20 bit offset.


Sat Nov 01, 2025 12:01 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2430
Location: Canada
Quote:
A branch to subroutine would be handy with 20 bit offset.

Unconditional branches and branch to subroutine have a 30-bit displacement.
Unconditional absolute jumps and calls have either a 30-bit or 70-bit address.
Conditional relative branches are stuck at 20-bits.

I have been looking at Qupls2024 version and thinking of modifying that instead. It had 64-bit instructions which would be reduced to 40-bits.
I am also looking at the Stark, which has 32-bit instructions, and am tempted to modify that.
Not 100% sure which direction I am going ATM.

_________________
Robert Finch http://www.finitron.ca


Sun Nov 02, 2025 12:42 pm WWW

Joined: Mon Oct 07, 2019 2:41 am
Posts: 880
Have you considered adding virtual machines like P-CODE, BCPL and Algol type languages?
Many instructions are CISC types simply because they do more work to get an EFA.


Mon Nov 03, 2025 8:02 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2430
Location: Canada
Quote:
Have you considered adding virtual machines like P-CODE, BCPL and Algol type languages?
Many instructions are CISC types simply because they do more work to get an EFA.


EFA means?
I am not sure what hardware support would be good for virtual machines. There is a memory indirect jump that might be used for the NEXT function of an interpreter.

I decided to innovate on Qupls2024 creating Qupls2026. Instructions are switched to 48-bit to keep some of the properties of Qupls2024 but with better code density. 48-bit instructions allow three register reads and one write per instruction and also specifying two operations per instruction. This allows a four-wide machine to execute up to eight operations per cycle.
Inline constants are limited to 48-bits though. It takes a couple of instructions to build a 64-bit constant in a register.

I am just iterating through all of the 2024 documentation updating everything to 48-bits and adding tweaks where needed. There are over 500 pages of docs.

Hopefully there is some RTL code I can use from Qupls2024.

_________________
Robert Finch http://www.finitron.ca


Wed Nov 05, 2025 1:58 pm WWW

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1862
(more work to compute an Effective Address?)


Wed Nov 05, 2025 2:40 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2430
Location: Canada
Added branch iterators to the conditional branch instructions. The iterator follows the branch instruction with an additional instruction word containing a constant and an iteration operation. The Rs1 register of the branch may then be modified (iterated) according to the iteration op which is one of ADD,SUB,MUL,AND,OR,XOR,ASL,LSR,ASR,ROL,ROR.

A conditional branch instruction with an iterator looks like:

Code:
BEQ Rs1, Rs2, label : ADD 10   ; adds 10 to Rs1 each time through loop


Being able to shift or rotate allows the use of ring counters for iterations.
MUL may be useful for functions that use exponentiation or algorithms that back-off exponentially.

The neat thing about an iterator is that it is fused with the branch instruction when performed and does not increase the dynamic instruction count.
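
A small model of the fused branch plus iterator. The ordering (compare first, then always apply the iteration op to Rs1), the 64-bit register width and the use of a not-equal branch are assumptions; the post only lists the ops.

```python
MASK = (1 << 64) - 1   # assumed 64-bit registers

def _signed(x):
    return x - (1 << 64) if x >> 63 else x

# Iteration ops from the post; shift and rotate counts are assumed to be 0..63.
ITER_OPS = {
    "ADD": lambda r, c: (r + c) & MASK,
    "SUB": lambda r, c: (r - c) & MASK,
    "MUL": lambda r, c: (r * c) & MASK,
    "AND": lambda r, c: r & c,
    "OR":  lambda r, c: r | c,
    "XOR": lambda r, c: r ^ c,
    "ASL": lambda r, c: (r << c) & MASK,
    "LSR": lambda r, c: r >> c,
    "ASR": lambda r, c: (_signed(r) >> c) & MASK,
    "ROL": lambda r, c: ((r << c) | (r >> (64 - c))) & MASK,
    "ROR": lambda r, c: ((r >> c) | (r << (64 - c))) & MASK,
}

def branch_with_iterator(rs1, rs2, op, const):
    """Not-equal branch fused with an iterator: returns (taken, new Rs1)."""
    taken = rs1 != rs2
    return taken, ITER_OPS[op](rs1, const)

def run_loop(rs1, rs2, op, const, limit=1000):
    """Count how many times the branch is taken before it falls through."""
    taken_count = 0
    for _ in range(limit):
        taken, rs1 = branch_with_iterator(rs1, rs2, op, const)
        if not taken:
            break
        taken_count += 1
    return taken_count

print(run_loop(0, 50, "ADD", 10))      # 5
print(run_loop(1, 1 << 4, "ROL", 1))   # 4 (ring counter walking one bit)
```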

_________________
Robert Finch http://www.finitron.ca


Thu Nov 06, 2025 10:57 pm WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2430
Location: Canada
Playing with the CORDIC functions today. It seems that the standard approach is to calculate the CORDIC to the number of bits of the floating-point type in use. It takes something like 140 clocks to get the fsincos calculated to double precision.

I would like to be able to get an approximation of CORDIC functions to a specific number of bits based on the number of iterations performed. The required accuracy in bits would be specified as part of the instruction to trim the number of clocks required.

So, if only 15 bits of accuracy are required, fcosd(angle,bits precision) would return the cosine as a double accurate to ‘bits precision’.
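
To illustrate the iterations-versus-accuracy trade, here is a sketch of rotation-mode CORDIC with the iteration count as a parameter. It uses Python floats rather than the fixed-point datapath hardware would use, and the name cordic_sincos is made up; each iteration gives roughly one more bit.

```python
import math

def cordic_sincos(angle, iterations):
    """Rotation-mode CORDIC for angle in [-pi/2, pi/2] radians.

    Returns (sin, cos). The residual angle after n iterations is at most
    atan(2**-(n-1)), so accuracy grows by about one bit per iteration.
    """
    # Pre-scale by the CORDIC gain so the result needs no final multiply.
    k = 1.0
    for i in range(iterations):
        k /= math.sqrt(1.0 + 2.0 ** (-2 * i))
    x, y, z = k, 0.0, angle
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0         # rotate toward zero residual
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * math.atan(2.0 ** -i)       # would be a small ROM in hardware
    return y, x

for n in (8, 15, 40):
    s, c = cordic_sincos(0.5, n)
    print(n, abs(c - math.cos(0.5)))        # error shrinks as n grows
```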

*****

Added a couple of load instructions (FLDH, float load half, and FLDS, float load single) that perform NaN boxing of lower precision floating-point values to the register width. NaN boxing is similar to a signed load operation, except that the extended bits are always set to one instead of copying the sign bit. (Setting all the upper bits yields a NaN at the higher precision.) For Qupls the sign bit is also copied to the most significant bit of the register so the sign of the NaN reflects the sign of the number.
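
A sketch of what an FLDS-style load would leave in a 64-bit register under the description above. The helper names are hypothetical and the 64-bit register width is assumed.

```python
import struct

def nan_box_single(value):
    """Box a single-precision value into a 64-bit register image."""
    bits32 = struct.unpack("<I", struct.pack("<f", value))[0]
    boxed = 0xFFFFFFFF00000000 | bits32        # upper bits all ones
    sign = bits32 >> 31
    # Qupls twist from the post: register MSB mirrors the value's sign.
    return (boxed & 0x7FFFFFFFFFFFFFFF) | (sign << 63)

def store_single(reg):
    """A lower precision store just writes the low 32 bits."""
    return reg & 0xFFFFFFFF

def as_double(reg):
    """Reinterpret the 64-bit register image as a double."""
    return struct.unpack("<d", struct.pack("<Q", reg))[0]

print(hex(nan_box_single(1.0)))    # 0x7fffffff3f800000
print(hex(nan_box_single(-1.0)))   # 0xffffffffbf800000
```

Read as a double, either register image is a NaN, which is the point of the boxing.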

There are no corresponding float store instructions as lower precision stores simply copy the lowest bits to memory. Regular store instructions can be used to do this.

I may add load and store instructions that automatically convert between different precisions, but do not see a lot of need for them. Values can be converted using FCVTxx instructions like FCVTD2S to convert a double to a single value.

_________________
Robert Finch http://www.finitron.ca


Fri Nov 07, 2025 11:48 pm WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2430
Location: Canada
Working on exceptions and interrupts today. When an exception occurs it is processed at the current operating mode. The exception escalation register is checked, and if the escalate bit is set for the exception, then the exception is triggered at the next operating level, otherwise it is processed at the current level. So, if an exception happens at the app level the escalation bit pattern may redirect it to secure mode.
There is a separate 16-entry base exception table for each operating mode. There is a bit for each exception type at each operating mode. This takes 64-bits.
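
A sketch of the escalation lookup. Sixteen exception types in a 64-bit register implies four operating modes; the bit layout (mode * 16 + cause) and the idea that escalation repeats until a clear bit or the top mode is reached are assumptions.

```python
NUM_CAUSES = 16   # entries in each base exception table
NUM_MODES = 4     # 64 bits / 16 exception types (inferred)

def escalate_bit(reg, mode, cause):
    """Escalate bit for one exception type at one operating mode (assumed layout)."""
    return (reg >> (mode * NUM_CAUSES + cause)) & 1

def handling_mode(reg, mode, cause):
    """Mode that ends up processing the exception."""
    while mode < NUM_MODES - 1 and escalate_bit(reg, mode, cause):
        mode += 1
    return mode

# Escalate cause 3 all the way from mode 0 (app) to mode 3.
reg = sum(1 << (m * NUM_CAUSES + 3) for m in range(3))
print(handling_mode(reg, 0, 3))   # 3
print(handling_mode(reg, 0, 4))   # 0 (no escalate bit, handled where it occurred)
```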

*****

Scrapped branch iterators and incrementing/decrementing branches and replaced them with the ability to use cache-line constants in place of registers when performing a branch comparison. The branch iterator was not any more code dense than just using another ordinary instruction before the branch. It would also be tricky to implement and somewhat confusing.

*****

Switched the instruction fetch stage of Qupls4 to fetch five instructions at a time instead of four. There is a maximum of 10 instructions that would fit into a cache-line. This is not divided evenly by a power of two. The issue is how far the machine advances on any given cycle. With a four-wide fetch and a max of 10 instructions, it would advance 4,4,2. Making Qupls4 a five-wide fetch allows it to advance 5, 5. Instructions are converted into micro-ops. From the micro-op stage forward the machine is four-wide.
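
The advance pattern can be checked with a few lines, assuming fetch never crosses a cache-line boundary (the function name is made up):

```python
def fetch_groups(per_line, width):
    """Instructions fetched on each cycle while walking one cache line."""
    groups = []
    remaining = per_line
    while remaining > 0:
        n = min(width, remaining)
        groups.append(n)
        remaining -= n
    return groups

print(fetch_groups(10, 4))   # [4, 4, 2]
print(fetch_groups(10, 5))   # [5, 5]
```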

Here are some slides outlining Qupls4 (Qupls2026). These slides started out from the StarkCPU.
Attachment:
Slide1.JPG

Attachment:
Slide2.JPG

Attachment:
Slide3.JPG

Attachment:
Slide4.JPG

Attachment:
Slide5.JPG

Attachment:
Slide6.JPG



_________________
Robert Finch http://www.finitron.ca


Sun Nov 09, 2025 2:07 am WWW

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1862
Thanks for the diagrams! Really helps to see what's going on.


Sun Nov 09, 2025 8:25 am

Joined: Sat Oct 04, 2025 10:54 am
Posts: 25
Yes the slides are really helpful, thank you!


Sun Nov 09, 2025 9:49 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2430
Location: Canada
Worked on the constant decode RTL code to get an idea if things would work. Some of the code was inherited from the StarkCPU.

Most instructions where a source register can be used allow a constant to be substituted instead. There are five constant sizes: 6,8,16,32, and 64-bits. Six-bit constants are encoded directly in the instruction. Other constants appear on the cache-line. Thinking about omitting the 8-bit constant size.

Constants encoded on the cache-line consume a minimum of 16 bits. (An 8-bit constant uses 16 bits of storage).

There can be up to three constants per instruction. The architecture does not prevent the usage of constants for all source operands. One can code ADD R1,1,2,3 – adding 1,2 and 3 together as constants. The compiler will likely strip out operations where the result can be calculated at compile time.

Constants can be used with floating-point instructions like FMA as well. The CPU will convert from lower precision constants to the precision of the operation size. This allows a small constant like 1.0 to be encoded using only 6-bits while used with a double precision operation.

_________________
Robert Finch http://www.finitron.ca


Mon Nov 10, 2025 2:46 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2430
Location: Canada
Modifying constants in the ISA again. It is early and the design is not very stable yet. Modifications are the result of some experimentation trying to implement constants.

Modified where on the cache-line constants are stored. This is to simplify the PC increment. With constants at the end of the cache-line the increment was a bit scary. Now they are placed inline with instructions in “zones”.

Came up with the idea of constant zones which are embedded in the instruction stream inline following instructions that need constants. A constant zone has its own opcode which the CPU uses to mark the area as non-executable. Up to four constant zones may be concatenated together to form a 160-bit zone.

Constants in a zone are multiples of 16-bits in size. Constant sizes are now 16,32,48, and 64 bits.

The size of a constant zone can vary according to the needs of the instruction. Instructions may reference multiple constants in the zone. 160 bits is large enough for two 64-bit constants with room to spare.

There may be some wasted space in the zone if only a single 16-bit constant is needed. However, since there are instructions supporting embedded constants it is much more likely that a 32-bit or larger constant would be needed.
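
A rough sketch of laying constants out in a zone. It treats all 160 bits as payload (the zone opcode overhead is not stated in the post, so it is ignored here), assumes signed constants, and uses hypothetical names.

```python
ZONE_BITS = 160                 # four concatenated zones
ALLOWED = (16, 32, 48, 64)      # constant sizes, multiples of 16 bits

def smallest_size(value):
    """Smallest allowed width that holds value as a signed number (assumed)."""
    for bits in ALLOWED:
        if -(1 << (bits - 1)) <= value < (1 << (bits - 1)):
            return bits
    raise ValueError("constant needs more than 64 bits")

def pack_zone(values):
    """Return ([(bit offset, width), ...], unused bits) for one zone."""
    layout = []
    used = 0
    for v in values:
        bits = smallest_size(v)
        if used + bits > ZONE_BITS:
            raise ValueError("zone full")
        layout.append((used, bits))
        used += bits
    return layout, ZONE_BITS - used

print(pack_zone([2**62, -2**62]))   # ([(0, 64), (64, 64)], 32)
print(pack_zone([1000]))            # ([(0, 16)], 144)
```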

_________________
Robert Finch http://www.finitron.ca


Tue Nov 11, 2025 1:45 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2430
Location: Canada
Got rid of the immediate versions of the SET instructions. This freed up 10 more opcodes resulting in about 35 free opcodes.

The SET instructions are not used very often and were consuming a lot of opcode space. Since immediate constants can be substituted for register values, they were also redundant. The only benefit they offered was a larger constant encoded directly in the instruction.

Got the number of opcodes in use down to about 80. There are not really 80 different instructions though, as some of the opcodes are used to represent different precisions for the same operation. The instruction set could almost fit into 64 instructions if the precision were represented by a field in the instruction. But I think I am going to leave it as is.

The dual operation instructions appear to be working out at least while working in assembler. They are not used that often, but maybe often enough.
‘nand_or’ is almost a bitfield deposit instruction. Used a couple of times unexpectedly.

_________________
Robert Finch http://www.finitron.ca


Tue Nov 11, 2025 11:59 pm WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2430
Location: Canada
Oscillating between using 40 or 48 bits for instructions. Put a lot of work into revising the design for 40-bit instructions. Briefly forayed into the concept of instruction fusing. Then decided I liked 48-bits better.

Things impacted:

Branch displacement: At 40 bits there is room for only 12 bits for the displacement. Ideally the displacement should be over 16 bits in size. 48-bit instructions allowed a 20-bit displacement which was ideal. However, probably < 10% of branches would need more than 12 bits, meaning 40-bits would work okay. It would significantly reduce the code density.

Load / Store displacements: at 40-bits there is room for only a 9-bit displacement. At 48-bits 17-bits were possible. Since most load / stores are for stack variables a 9-bit displacement may be adequate. Once again 40-bits has better code density and only minor hiccups in terms of performance. Part of this is driven by the desire to support scaled-indexed addressing and keeping the number of addressing modes simple. There is just one address mode in the design.

There is more shifting in the instruction fetch for 40-bit instructions compared to 48-bit. 40-bit instructions need byte alignment whereas 48-bitters can get away with 16-bit alignment.

It is probably only about 5 to 10% of instructions that would benefit from a 48-bit size instead of a 40-bit instruction. 48-bits is only about 20% larger though. The extra instructions for cases 40-bit cannot handle eat into this difference. When the difference in code size is less than 10% maybe it does not matter. Using 48-bits reduces the number of instructions executing which is good for performance.
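
The trade-off can be put into numbers with a simple model: assume every instruction that does not fit in 40 bits costs exactly one extra 40-bit instruction. That cost model is an assumption, not something measured.

```python
def size_ratio_48_vs_40(extra_fraction):
    """Code size with 48-bit instructions relative to 40-bit instructions.

    extra_fraction: share of instructions needing one extra 40-bit
    instruction in the 40-bit encoding (assumed cost model).
    """
    size40 = 40 * (1 + extra_fraction)
    return 48 / size40

for f in (0.0, 0.05, 0.10, 0.20):
    print(f, round(size_ratio_48_vs_40(f), 3))
# 0.0 -> 1.2, 0.05 -> 1.143, 0.1 -> 1.091, 0.2 -> 1.0
```

Under this model the two encodings break even only when 20% of instructions need the extra word; at 5 to 10% the 48-bit code is still roughly 9 to 14% larger.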

Although very complex, this is also a hobby-class processor, so it is best to keep things simple. That is why it was decided to go bigger than 32 bits: instruction formats and decoding can be simpler with a larger instruction available.

I want the processor to have forward upgradability. Lots of opcodes are left available.

Added floating-point fused dot product - x = a*b + c*d
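
A sketch of why fusing matters, assuming "fused" here means the usual single rounding at the end (the post does not spell this out). Exact rational arithmetic stands in for the wide internal datapath; the function names are hypothetical.

```python
from fractions import Fraction

def fused_dot2(a, b, c, d):
    """a*b + c*d computed exactly, then rounded once to double."""
    exact = Fraction(a) * Fraction(b) + Fraction(c) * Fraction(d)
    return float(exact)

def unfused_dot2(a, b, c, d):
    """Same expression with a rounding after each multiply and the add."""
    return a * b + c * d

a, b = 1.0 + 2.0 ** -30, 1.0 - 2.0 ** -30   # a*b = 1 - 2**-60 exactly
c, d = -1.0, 1.0
print(unfused_dot2(a, b, c, d))   # 0.0 (a*b rounds to 1.0 first)
print(fused_dot2(a, b, c, d))     # -8.673617379884035e-19, i.e. -2**-60
```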

_________________
Robert Finch http://www.finitron.ca


Thu Nov 13, 2025 11:59 am WWW