 Qupls (Q+) 

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Made this pretty block diagram of Q+.
Attachment: Qupls.png (Block Diagram of Q+)

Still working on branches.
Got ENTER and LEAVE working with the compiler and assembler.

_________________
Robert Finch http://www.finitron.ca


Sat Jan 27, 2024 5:56 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1783
oh thanks - I do like a block diagram! A very good way to get a handle on things.


Sat Jan 27, 2024 9:07 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Latest Changes:
Discovered the CHK instruction did almost the same thing as the SYS instruction, so got rid of the SYS instruction and rearranged the opcodes a little bit. CHK invokes an exception handler if the check fails, while SYS always invokes an exception handler. Using a CHK that always fails does the same thing. The CHK instruction accepts an argument specifying the exception cause code. The CHK instruction was also made more powerful, allowing it to do privilege level checks, and the stack canary check too.
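
The relationship between CHK and the removed SYS can be modelled roughly as below. This is only a behavioural sketch in Python, not Qupls code: `chk`, `sys_call`, and the cause-code values are made-up names for illustration, and the real instruction's operand encoding is not shown here.

```python
def chk(check_passes, cause):
    """Return the exception cause to raise, or None when the check passes."""
    # CHK only invokes the handler when its check fails.
    return None if check_passes else cause


def sys_call(cause):
    """The old SYS behaviour: a CHK whose check always fails."""
    return chk(False, cause)
```

For example, `chk(True, 5)` yields no exception, while `sys_call(7)` always yields cause 7.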

I had a heck of a time figuring out why branches were going to the wrong address. Then I realized, while looking at the vasm code, that I had tentatively modified the branch displacement calculation to support 36-bit instructions instead of 40-bit ones, and forgot to switch the calculation back.

*****
Still working on branches.

Unconditional branches are performed at the extract stage of the pipeline, as soon as the instruction can be decoded. This is for performance: they take effect at least two cycles sooner than they would if performed in the execute stage.
However, a branch miss from an earlier instruction can occur during execute at the same time as an unconditional branch is being done in the extract stage. The earlier instruction should take precedence.

The extract stage branch needs to stomp on the instructions following it in the fetch and align stages. The conditional branch executing from the execute stage needs to stomp on all the instructions in the pipeline.
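
The precedence and stomp rules described here can be sketched as follows. This is a simplified model under stated assumptions, not the actual RTL: only the stages named in the post are listed, and `pick_redirect` and `stages_to_stomp` are hypothetical helper names.

```python
# Only the stages named in the post; the real pipeline has more stages
# between extract and execute, and those would be stomped as well.
STAGES = ["fetch", "align", "extract", "execute"]


def pick_redirect(extract_target, execute_miss_target):
    """Choose this cycle's redirect; the older execute-stage branch wins."""
    if execute_miss_target is not None:
        return ("execute", execute_miss_target)
    if extract_target is not None:
        return ("extract", extract_target)
    return None


def stages_to_stomp(source):
    """Stages holding instructions younger than the redirecting branch."""
    return STAGES[:STAGES.index(source)]
```

When both redirects arrive in the same cycle, `pick_redirect` returns the execute-stage target, since the extract-stage branch is itself younger and gets stomped.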

Now throw micro-code into the mix. There need to be “on and off ramps” to get into and out of micro-code. Micro-code is entered at the extract stage, triggered by the decode of a macro instruction. Since there is a pipeline delay after decode, one and only one instruction from extract needs to be stomped on after the decode of a macro instruction. A positive edge detector is used to do this. Micro-code exits by performing a branch back to ordinary code. The front of the pipeline, which was stalled/locked for the micro-code, needs to be unlocked, otherwise the branch cannot execute.
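
A positive edge detector of the kind mentioned above turns a level that stays high for several clocks into a single one-clock pulse. A minimal cycle-by-cycle model, with `PosEdgeDetector` as a made-up name and no claim to match the actual signal names in the core:

```python
class PosEdgeDetector:
    """Emit True for exactly one clock when the input goes from low to high."""

    def __init__(self):
        self.prev = False  # input level seen on the previous clock

    def clock(self, level):
        pulse = level and not self.prev
        self.prev = level
        return pulse
```

Feeding it a "macro instruction decoded" level that stays high for two clocks produces a pulse only on the first, which is what limits the stomp to one instruction.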

It is tricky to figure out which line of code is affecting things. I have had it almost working several times now. I thought it was working, but then ran into failing cases when running longer code. ATM it is stomping on too many instructions after a branch.

_________________
Robert Finch http://www.finitron.ca


Tue Jan 30, 2024 5:53 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Slow day. Spent some time looking at FT833, a 65816/65832 compatible core, with the thought of some potential updates.

There were some extra bits in the CHARNDX instruction, so it was made more powerful. It will now perform a masking operation on the character array so that different classes of characters can be found as opposed to just matching on a specific character.
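
As a rough illustration of the masked match, the sketch below searches a byte array for the first character that equals a target after both are ANDed with a mask. The function name `charndx`, the argument order, and the -1 "not found" result are assumptions; the post does not describe the instruction's operand layout.

```python
def charndx(chars, target, mask=0xFF):
    """Index of the first byte c with (c & mask) == (target & mask), else -1."""
    for i, c in enumerate(chars):
        if (c & mask) == (target & mask):
            return i
    return -1
```

With the default mask this matches one specific character. With `mask=0xF0` and `target=0x30` it matches any byte in the range 0x30 to 0x3F, which covers the ASCII digits (plus a few punctuation characters).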

_________________
Robert Finch http://www.finitron.ca


Wed Jan 31, 2024 4:08 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Latest Changes:
Made branches more powerful; they are now capable of incrementing a register and branching if the condition is met. There was an extra unused bit in branch instructions. Also added logical and bitwise branches for branch-and, branch-or, branch-nand, and branch-nor. These instructions come from an earlier CPU core. There is some weirdness, as it is possible to increment a floating-point value if a float value is in a register.
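
The increment-and-branch behaviour can be modelled as below. The ordering (increment first, then compare) is an assumption inferred from the compiler using `iblt` in place of an `add` followed by `blt`; the Python function and register dictionary are purely illustrative.

```python
def iblt(regs, ra, rb):
    """Increment regs[ra], then report whether 'branch if ra < rb' is taken."""
    regs[ra] += 1
    return regs[ra] < regs[rb]


def count_iterations(limit):
    """Run a do-while style loop closed by iblt and count the body executions."""
    regs = {"s0": 0, "s1": limit}
    count = 0
    while True:
        count += 1  # loop body
        if not iblt(regs, "s0", "s1"):
            break
    return count
```

For a limit of 3 the loop body runs three times, the same as the separate increment and branch would give.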

Latest Fixes:
Several fixes got branches working much better. Branches are horrendously slow, though, because they are being worked on with branch prediction disabled. A tight three-instruction loop with a memory store is taking about 40 clocks per iteration.

The instruction group at the branch target was being queued twice. The pipeline was being stopped until the IP matched the miss IP. The check needed to be against the next IP value not the current one.

Stomping at the fetch stage needed to be made sticky as it was only combinational logic. The stomp signal was disappearing too soon causing instructions that should have been stomped on to make it to enqueue.
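
The difference between a combinational stomp and a sticky one can be sketched as a small clocked model. The `clear` input is an assumption standing in for whatever condition ends the stomp in the real core (for example, the redirected fetch arriving); the class name is hypothetical.

```python
class StickyStomp:
    """Hold a stomp request until it is explicitly cleared."""

    def __init__(self):
        self.held = False

    def clock(self, stomp_now, clear):
        """Advance one clock and return the stomp level the stage sees."""
        if stomp_now:
            self.held = True
        elif clear:
            self.held = False
        return self.held
```

A one-clock `stomp_now` pulse keeps the output high on the following clocks until `clear` is seen, so instructions arriving a cycle late are still stomped.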

The register map in the RAT was not being propagated forward to the next checkpoint properly. This meant that incorrect values were being compared in branch instructions causing loops to not iterate properly. Checkpoint increment and restore seem to be working now. Setting a branch checkpoint does cause the machine to stall while the checkpoint RAM is being updated. Things are this way ATM to allow outstanding updates to the RAT to take place in the current checkpoint before it is switched. Something left for future improvement.
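
A much-simplified model of carrying the register map forward at a branch checkpoint is shown below. It is a sketch under assumptions, not the Qupls RAT: the class, its methods, and the circular checkpoint numbering are all made up, and the real design's stall while the checkpoint RAM updates is not modelled.

```python
class CheckpointedRat:
    """Toy register alias table with one map per checkpoint."""

    def __init__(self, num_arch_regs, num_checkpoints):
        identity = list(range(num_arch_regs))
        self.maps = [identity.copy() for _ in range(num_checkpoints)]
        self.current = 0

    def rename(self, arch_reg, phys_reg):
        self.maps[self.current][arch_reg] = phys_reg

    def lookup(self, arch_reg):
        return self.maps[self.current][arch_reg]

    def branch_checkpoint(self):
        """Move to the next checkpoint, copying the current map forward."""
        nxt = (self.current + 1) % len(self.maps)
        # Without this copy, lookups after the branch would see stale mappings.
        self.maps[nxt] = self.maps[self.current].copy()
        saved = self.current
        self.current = nxt
        return saved  # checkpoint to restore on a branch miss

    def restore(self, checkpoint):
        self.current = checkpoint
```

After `branch_checkpoint()` a lookup still returns the mapping made before the branch, and `restore()` discards any renames made on the wrong path.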

Milestone:
Finally, got a loop to work. At least it looped the expected number of times. Verifying that everything is correct, though, is not that simple. The debug dump of the register file is not very accurate and shows the register value as zero when it should be 40. If I manually look up the value in the register file, it is the correct value that it should be at the end of the loop. The reason the file dump is not very accurate is the CPU's register renaming. As registers are renamed, it hides valid values from previous calculations. The display is only occasionally accurate. That, combined with the use of multiple register files controlled by a live value table, makes things challenging.

Status:
Now the CPU core hangs after about 10 microseconds / 150 instructions, on a memory access that is supposed to be stomped on. It looks like part of the stomp worked, but it messed up the done state for the instruction. It never makes it to done, so the CPU hangs on that instruction.

_________________
Robert Finch http://www.finitron.ca


Thu Feb 01, 2024 2:58 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Added some documentation for I/O devices.

Broke the 400-page barrier in the Qupls book.

Working on getting a load / store bug fixed. Have also given some thought to providing a configuration option to not use virtual addressing.
Removing the virtual addressing aspect of the core would reduce its size, and probably improve performance. A modern OS relies on virtual addressing though.

Also re-read through some of the documentation from the ForwardCom project by Agner Fog. https://github.com/ForwardCom

_________________
Robert Finch http://www.finitron.ca


Fri Feb 02, 2024 10:04 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Latest Changes:
Moved the link registers to register file locations 44 to 47 out of the r0 to r31 space. Subroutine link registers do not need to be treated like GPRs. This gives two more registers for temporaries.

Modified the shifted immediate instructions to shift by multiples of 21 bits instead of 20. The issue was that shifting by 20 bits only allowed a 63-bit constant to be generated in three instructions. Shifting by one more bit allows a 64-bit constant to be generated in the same number of instructions.
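
The arithmetic behind this change can be checked with a small sketch. The immediate field width is not given in the post; 23 bits is assumed here only because it is the width that reproduces the 63-bit figure, and `constant_bits` is a made-up helper.

```python
IMM_WIDTH = 23  # assumed field width, implied by the 63-bit figure above


def constant_bits(shift_step, imm_width=IMM_WIDTH, num_instructions=3):
    """Bits covered when instruction k places its immediate at bit k * shift_step."""
    return shift_step * (num_instructions - 1) + imm_width
```

Under that assumption, `constant_bits(20)` is 63, one bit short of a full register, while `constant_bits(21)` is 65, enough to cover all 64 bits with three instructions.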

Latest Additions:
Added memory indirect subroutine call and jump. This instruction can load the low order instruction pointer bits from a table. It may load a wyde, tetra, or octa value. This is a micro-coded instruction.
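
One way to picture the table lookup: the loaded value replaces the low-order bits of the instruction pointer and the remaining upper bits are kept. The sketch below assumes little-endian table entries and that the upper bits come from the current IP; neither detail is stated in the post, and `indirect_target` is a hypothetical name.

```python
SIZES = {"wyde": 2, "tetra": 4, "octa": 8}  # entry size in bytes


def indirect_target(ip, mem, entry_addr, size):
    """Replace the low-order IP bits with a table entry of the given size."""
    nbytes = SIZES[size]
    low = int.from_bytes(mem[entry_addr:entry_addr + nbytes], "little")
    mask = (1 << (8 * nbytes)) - 1
    return ((ip & ~mask) | low) & 0xFFFFFFFFFFFFFFFF
```

With a wyde entry holding 0x1234 and an IP of 0xABCD0000, the target becomes 0xABCD1234.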

Added a check for unimplemented instructions. They should now generate the unimplemented instruction exception.

Added the BSET, BMOV, BFND, and BCMP instructions as micro-code. BSET sets a block of memory to a value in a register, decrementing the loop counter as it goes. BMOV moves a block of memory. BFND finds a value in memory. BCMP compares two memory areas. The block functions move bytes according to a stride value. For instance, it is possible to set every fourth byte of memory. The stride amount can be set to zero, which allows BMOV to stream a block of memory to an I/O device. The block find function searches according to the same conditions as a signed or unsigned branch. BCMP is a little more limited, but can compare equal, not equal, and signed and unsigned less than, or less than or equal.
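
The strided behaviour of BSET and BMOV can be sketched on a Python `bytearray`. This is an illustrative model only: the function signatures are invented, and having separate source and destination strides for BMOV is an assumption made so that the zero-stride streaming case can be shown.

```python
def bset(mem, addr, value, count, stride=1):
    """Store a byte `count` times, stepping the address by `stride` each time."""
    while count > 0:
        mem[addr] = value & 0xFF
        addr += stride
        count -= 1  # loop counter is decremented as the operation proceeds


def bmov(mem, src, dst, count, src_stride=1, dst_stride=1):
    """Copy `count` bytes; a zero destination stride keeps writing one address."""
    while count > 0:
        mem[dst] = mem[src]
        src += src_stride
        dst += dst_stride
        count -= 1
```

`bset(mem, 0, 0xAA, 4, stride=4)` sets every fourth byte, and `bmov` with `dst_stride=0` writes each source byte to the same location, as streaming to a device register would.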

Added some code to make micro-code interruptible. In theory it should work but it will be a while before it gets tested.

Ignored the load / store bug for a day hoping it would go away :)

_________________
Robert Finch http://www.finitron.ca


Sun Feb 04, 2024 5:09 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Latest Changes:
Got rid of the extra stack pointers for interrupts. Interrupts now default to using the machine stack pointer.

Vectors are no longer loaded from the vector table at reset or for exceptions. A simple jump to the vector is done to invoke an exception handler. So, there must be a branch instruction at the vector location now instead of an address. The stack pointer is no longer loaded from the vector table. It must be initialized in the boot routine.

The Qupls utility to generate micro-code was updated.

Latest Fixes:
The load store bug was due to using the incorrect pipeline stage register specs for the address generator. Things work much better now.

In a classic blunder the author displayed the incorrect micro-code pointer along with instructions in the debug dump. It was off by a pipeline stage. This made debugging the branches more challenging. The dump was not showing correctly which instructions were being stomped on.

Latest Bugs Identified:
The first group of micro-code instructions after a branch to a macro-instruction is being stomped on. For example, branching to the ENTER macro instruction caused the first part of the micro-code to be stomped on. Several fixes were tried without much luck. To get the CPU working in spite of this bug, micro-code NOPs were added as the first group of instructions in the micro-code.

Compiled the Fibonacci function using the Arpl compiler with the same optimization settings for RISC-V and Qupls:
RISC-V: 31 instructions, 120 bytes (approx, needs a couple of bug fixes)
Qupls: 13 instructions, 65 bytes

Most of the code savings were due to the use of the subroutine linkage instructions, ENTER and LEAVE. For a small routine these make up a significant portion of the code. Otherwise the code is almost the same.

Arpl code for Fibonacci:
Code:
integer Fibonacci(integer n)
begin
   integer x;
   integer f0,f1,f2;
   
   f0 = 0;
   f1 = 1;
   for (x = 0; x < n; x++) begin
      f2 = f0 + f1;
      f0 = f1;
      f1 = f2;
   end
   return (f0);
end


riscv-code
Code:
   .sdreg   3
_Fibonacci:
  sub sp,sp,32
  sd fp,[sp]
  mv fp,sp
  sd lr0,8[fp]
  sub sp,sp,72
  sd s0,[sp]
  sd s1,8[sp]
  sd s2,16[sp]
  sd s3,24[sp]
  sd s4,32[sp]
  ld s0,-8[fp]
  ld s1,0[fp]
  ld s2,-24[fp]
  ld s3,-16[fp]
  ld s4,-32[fp]
; f0 = 0;
  mv s3,r0
; f1 = 1;
  add s2,r0,1
; for (x = 0; x < n; x++) begin
  mv s0,r0
  bge s0,s1,.00015
.00014:
; f2 = f0 + f1;
  add s4,s3,s2
; f0 = f1;
  mv s3,s2
; f1 = f2;
  mv s2,s4
.00016:
  add s0,s0,1
  blt s0,s1,.00014
.00015:
; return (f0);
  mv a0,s3
.00013:
  ld s0,[sp]
  ld s1,8[sp]
  ld s2,16[sp]
  ld s3,24[sp]
  ld s4,32[sp]
  mv sp,fp
  ld fp,[sp]
  add sp,sp,40
  jal r0,lr0
  bra .00013


Qupls Code:
Code:
_Fibonacci:
  enter 5,32
  ldo s1,32[fp]
  ldo s4,-32[fp]
; f0 = 0;
  mov s3,r0
; f1 = 1;
  ldi s2,1
; for (x = 0; x < n; x++) begin
  mov s0,r0
  bge s0,s1,.00015
.00014:
; f2 = f0 + f1;
  add s4,s3,s2
; f0 = f1;
  mov s3,s2
; f1 = f2;
  mov s2,s4
.00016:
  iblt s0,s1,.00014
.00015:
; return (f0);
  mov a0,s3
.00013:
  leave 5,8

_________________
Robert Finch http://www.finitron.ca


Tue Feb 06, 2024 3:52 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Latest Changes:
Revamped the entire instruction set. Took a piece from the ANY-1 ISA. Dedicated two bits in every single instruction to indicate the presence of a vector instruction. The instruction extractor uses this *before* the decode stage to expand out vector instructions into multiple scalar instructions. Lots of documentation updates required.

Heavily modified the front end of the CPU. Added two pipeline substages to instruction extract to expand out vector instructions, and then buffer them, allowing them to be selected a group at a time from the buffer. All this to implement vector instructions without using micro-code.
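
The expansion step can be pictured as turning one vector instruction into a run of scalar ones, one per element, which are then buffered. The sketch below is a loose model with invented field names (`vector`, `element`) and a caller-supplied vector length; the real extractor's encoding and register numbering are not described in the post.

```python
def expand(instr, vector_len):
    """Expand one vector instruction into per-element scalar instructions."""
    if not instr.get("vector"):
        return [instr]  # scalar instructions pass through unchanged
    return [dict(instr, vector=False, element=i) for i in range(vector_len)]


def fill_buffer(instrs, vector_len):
    """Expand a stream of instructions into the buffer groups are picked from."""
    buffer = []
    for instr in instrs:
        buffer.extend(expand(instr, vector_len))
    return buffer
```

A stream of one scalar and one vector instruction with a vector length of 4 produces five buffered scalar instructions.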

Moving away from using micro-code where possible.

No bug fixes, lots of bug creation.

_________________
Robert Finch http://www.finitron.ca


Wed Feb 07, 2024 8:21 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Worked on bug fixes and getting things back to the status quo. The core does not work quite as well as it should; there are zeros coming through for instructions.

Latest Changes:
Switched back to just a single bit to indicate a vector instruction, but then added a bit to each register spec to indicate if it’s a vector register or not. The issue was that adding a two-bit vector format code to the instruction made the instructions too big to fit into 40 bits. So, I decided to make the instructions even larger and added another bit to each register spec for sign control. That makes seven bits for each register spec, with four per instruction. This should be like the sign control on the My66000 discussed on comp.arch. Instructions are now 48 bits in size. But they pack a lot of power. There are only about 16 different formats, including several weirder, rarely used formats for the CHK instruction and CSR instructions.

_________________
Robert Finch http://www.finitron.ca


Thu Feb 08, 2024 7:04 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Latest Additions:
Put together the Qupls scalar machine, QuplsSeq, which is a non-superscalar implementation of the Qupls ISA. Many components developed for the superscalar version were simply reused for the scalar version. It is a much smaller implementation, at about 23,000 LUTs, or ¼ the size. Planning on using it to test components. It builds much faster than Qupls.
Also put together an alternate FPU based on an exponent and a two's complement significand.

Did some research into using polynomials to calculate functions.

Code:
// Evaluate polynomial using Horner's method.
// coeff[0] is the highest-order coefficient; coeff[len-1] is the constant term.

double polynomial(double x, double* coeff, integer len)
begin
   double poly;
   double xp;
   integer n;
   
   xp = x;
   poly = coeff[0];
   for (n = 1; n < len; n = n + 1) begin
      poly = coeff[n] + poly * xp;
   end
   return (poly);
end

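// Evaluate coeff[0] + coeff[1]*x + coeff[2]*x^3 + coeff[3]*x^5 + ...
// Odd powers of x are accumulated directly here rather than in Horner's form.
double polynomial_odd(double x, double* coeff, integer len)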
begin
   double poly;
   double xx,xp;
   integer n;
   
   xp = x;
   xx = x * x;
   poly = coeff[0];
   for (n = 1; n < len; n = n + 1) begin
      poly = poly + coeff[n] * xp;
      xp *= xx;
   end
   return (poly);
end
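
As a usage illustration, here is a Python port of the two routines above, applied to a sine approximation. The Taylor-series coefficients in `SIN_COEFF` are an assumption chosen for the example; a real implementation would more likely use minimax coefficients over a reduced argument range.

```python
import math


def polynomial(x, coeff):
    """Horner's method; coeff[0] is the highest-order coefficient."""
    poly = coeff[0]
    for c in coeff[1:]:
        poly = c + poly * x
    return poly


def polynomial_odd(x, coeff):
    """coeff[0] + coeff[1]*x + coeff[2]*x**3 + coeff[3]*x**5 + ..."""
    poly = coeff[0]
    xp = x        # current odd power of x
    xx = x * x    # step between successive odd powers
    for c in coeff[1:]:
        poly += c * xp
        xp *= xx
    return poly


# Taylor series for sin(x): x - x^3/3! + x^5/5! - x^7/7!
SIN_COEFF = [0.0, 1.0, -1.0 / 6, 1.0 / 120, -1.0 / 5040]
```

`polynomial_odd(0.5, SIN_COEFF)` agrees with `math.sin(0.5)` to better than 1e-8, and `polynomial(2.0, [1.0, 2.0, 3.0])` evaluates x^2 + 2x + 3 at x = 2.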

_________________
Robert Finch http://www.finitron.ca


Sun Feb 11, 2024 8:17 am

Joined: Mon Oct 07, 2019 2:41 am
Posts: 593
http://simh.trailing-edge.com/papers.html
"how vax lost its poly" may be of use in what to avoid.


Sun Feb 11, 2024 11:12 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Quote:
http://simh.trailing-edge.com/papers.html
"how vax lost its poly" may be of use in what to avoid.
Interesting reading.

*****

Creating a scalar version of the core was a great idea. I managed to find and fix numerous bugs within a matter of hours.
The scalar core is much slower than the superscalar one. I estimate the superscalar one to be over twice as fast at the same clock rate. The scalar core is a simple state machine with about seven states to execute an instruction.

Bug Fixes:
The wrong bit field was being used by branches to determine when to increment or decrement. This was in both the register decoder and the ALU decoder.

There were several fixes to the assembler which did not get completely updated for the 48-bit instruction format. There were a couple of places where the instruction size was still set at five bytes. Fields for constants also had to be rearranged slightly in some cases.

Milestone:
Fibonacci working using scalar version.

I think I will work with the scalar version for a while. Shelving the superscalar version. I spent about a week trying to get the branch stomp logic working and it is turning into not fun.

_________________
Robert Finch http://www.finitron.ca


Mon Feb 12, 2024 5:18 am

Joined: Mon Oct 07, 2019 2:41 am
Posts: 593
Would this version be better able to sync to the cache?
The cache is so important nowadays, would it be better to design timing around that
rather than the ALU pipelines?
PS: it is 3 am here. I have my best ideas now. :)


Mon Feb 12, 2024 10:57 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Quote:
Would this version be better able to sync to the cache?
The cache is so important nowadays, would it be better to design timing around that

It uses the very same caches as the superscalar version. The instruction cache just returns a cache line specified by the program counter, so it does not care what processing the CPU does with the data. The superscalar version reads multiple instructions from the cache line while the scalar core just reads one instruction. The data cache is similar; it just reads or writes a cache line based on a memory pointer. The scalar version is simpler because the memory ops are not queued by the processor. I put a chunk of work into the caches to make them usable by different CPUs, hopefully at high performance and without much modification.
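
The shared-cache arrangement can be pictured as one routine that pulls instructions out of a returned cache line, with the scalar core asking for one and the superscalar core asking for several. The 48-bit instruction size comes from earlier in the thread; the 64-byte line, little-endian packing, and the `read_instructions` helper are assumptions, and instructions that straddle two lines are ignored.

```python
INSTR_BYTES = 6  # 48-bit instructions


def read_instructions(line, offset, count):
    """Return up to `count` whole instructions from `line`, starting at `offset`."""
    out = []
    while len(out) < count and offset + INSTR_BYTES <= len(line):
        out.append(int.from_bytes(line[offset:offset + INSTR_BYTES], "little"))
        offset += INSTR_BYTES
    return out
```

The cache side is identical in both cases; only `count` differs, for example 1 for the scalar core and 4 for a four-wide superscalar front end.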

I've lost track of time again myself :) I keep a regular meal schedule though.

_________________
Robert Finch http://www.finitron.ca


Tue Feb 13, 2024 10:24 am