View unanswered posts | View active topics It is currently Tue Apr 16, 2024 7:54 am



Reply to topic  [ 159 posts ]  Go to page Previous  1, 2, 3, 4, 5, 6, 7 ... 11  Next
 ANY-1 
Author Message

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
In the interest of reducing LUT requirements the scalar registers could be accessed as vector element #63 and the number of supported vector elements reduced by one to 63. Setting things up this way removes the need to multiplex between vector and scalar registers and instead uses the register file memory to do the multiplexing.

During decode, the current vector element number (CEN) would be set to 63 for a scalar register or zero for a vector register. Once all the current vector element number registers are greater than or equal to the value in the vector length register, then the instruction is complete.

Pondering the implementation of data type conversion instructions. With four base data types and roughly four sizes of data there could be potentially hundreds or thousands of different conversions required. Obviously only the most commonly used ones would be done in hardware.

_________________
Robert Finch http://www.finitron.ca


Mon Apr 26, 2021 4:11 am
Profile WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
A consideration for an out-of-order machine is renamed registers. Perhaps it would be good to trim additional registers from the number of elements to support register renaming. If the last four elements were using as rename registers that would allow for 256 physical registers.
-> reduced the size of vector registers down to 64x56 to allow eight renames of each scalar register. If more than eight renames are needed simultaneously then the machine will stall until a register becomes available. 512 registers from the register file are not really needed, but the register designators are needed meaning those register otherwise can not be used.

Pondering how to implement the instruction set as an out-of-order superscalar machine. Since it is desirable that vector instructions be interruptible some means of recording the state of the vector operation is required. The machine should be able to resume a vector operation after an interrupt. I think this could be done using several bitmasks equal in width to the number of vector elements. As vector elements for an instruction are committed a bit in the bitmask is set corresponding to the element number. The bitmask is state that must be saved between interrupts.

_________________
Robert Finch http://www.finitron.ca


Tue Apr 27, 2021 3:37 am
Profile WWW

Joined: Sun Dec 20, 2020 1:54 pm
Posts: 74
I sounds very complex. I wonder how do you debug it?


Tue Apr 27, 2021 7:01 am
Profile

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
It is very complex, this project is really too big for a single individual. It might take a decade for it to be really “done”. I wonder what the number of man years for such a project would be. Modern / existing toolsets accelerate things. Some of the larger pieces of the project are:
Instruction Set Specification
Software emulator.
C / C++ compiler (assembler, linker)
RTL code for classic pipeline
RTL code for superscalar pipeline
Porting some test apps.

It is very complex, in part to keep the implementation small. Splitting up processing of vectors so that elements may be processed in a serial fashion leads to loops in the pipeline and a need to track individual elements. A simpler approach would be to process all vector elements at the same time using a classic five-stage risc pipeline. But that would require huge amount of resources.

The design is probably large enough that a classic pipeline is not going to execute faster than 40-50MHz in an inexpensive FPGA. The superscalar pipeline will top out about 30MHz in the same FPGA, so would offer only marginally better performance.

Debugging would be done very carefully, bit-by-bit. Lots of probes. Some parts of the project can be used to verify other parts as things must remain consistent between parts for the whole thing to work.

_________________
Robert Finch http://www.finitron.ca


Wed Apr 28, 2021 11:37 am
Profile WWW

Joined: Sun Dec 20, 2020 1:54 pm
Posts: 74
My "simple" CPU took 4-5 years just to be designed. I changed my mind a lot of times regarding the ISA and its implementation, but since I made independent circuits I could recycle the most ot tests , however I cannot really figure out how to manage a bigger project like this, which doesn't come from a pre-existing work but rather with everything made from scratch.

Is it your first vector project, right?

I cannot also figure out how to test out-of-order things. By formal verification it's rather impossible to achieve it, by classic test-benching there are too many circuits in too many combinations, so probably a project like this needs something new to me, perhaps a smart testing mechanism done on a machine that spends its time by automatically planning and executing testing stuff.

Perhaps this is the right job of an AI-assisted testing program.


Thu Apr 29, 2021 9:34 am
Profile

Joined: Mon Oct 07, 2019 2:41 am
Posts: 589
While have smaller desgin than most, I still find a good hard copy does wonders for debugging the weird errors.
I use printfile (windows program) to print two pages of text to per printed page. Ben.


Thu Apr 29, 2021 5:49 pm
Profile

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Quote:
Is it your first vector project, right?
I would probably classify it as the second or third. I have added vectors to a couple of instruction sets now, but not got around to getting them actually working. Thor/FT64 was probably the first that supported vector instructions in a superscalar fashion but it had serious errors in it.

Quote:
I cannot also figure out how to test out-of-order things.
It can be quite challenging at times. Especially once memory loads and stores are in a program.
I think one approach is to build something simpler, in-order, then compare the output results to the out-of-order version. I start with small tests, gradually getting more sophisticated. First thing to do: turn on a diagnostic LED. Later Fibonacci series calc and Sieve of Eratosthenes. One of the first things to do: test the I$ and fetch. It helps a lot to have a decent logic analyzer. I do lots of simulator dumps. One thing to do is to program a series of instructions for testing that should be executed effectively in-order on the out-of-order machine. This might be a bunch of single-cycle add instructions for instance. If the cpu can be "prevented" from executing out-of-order it may be easier to debug. It should schedule simple add's in series for instance. A simple program to generate a checksum for the rom, comparable to one generated by the software toolset, is another early test program. The stages of an out-of-order machine can be dumped in a manner similar to an in-order machine. One good dump is the instruction queue or window with head and tail indicators.

Added a load / store multiple instruction prefix to the instruction set. The desire was to have at least pair loading and storing instructions. The author also did not want to add additional load / store instruction formats. The LSM instruction allows up to seven registers to be specified which are substituted in succession as the target or source register for the memory operation.

_________________
Robert Finch http://www.finitron.ca


Fri Apr 30, 2021 4:17 am
Profile WWW

Joined: Sun Dec 20, 2020 1:54 pm
Posts: 74
For my current project, which is a simple multi-cycle architecture, I have hacked a FPGA-PCI board based on a small XC3S400 and attached 32+4 digital digital probes to a LA of which I reverse-engineered the USB protocol in order to have it operating on Linux.

I have also hacked ISE v14 to have it running and scriptable under Qemu/x86 and the HDL simulator able to make the process automated, so when it testes a HDL module on the simulator, if the test passes the testbech then it dumps signals into a session, then it compiles the VHDL into bit-stream, it uploads it into the FPGA over the jtag on the PCI itself, and it starts comparing results.

Code:
foreach test in test_list
        do
            test[i].is_Ok = (measured_on_FPGA(test[i]) =?= expected_from_simulator(test[i]));
        done


It's not the same problem we are discussing about, but it's cool and has some nice potential for testing stuff automatically.

I have this special PCI-FPGA card installed in an HPPA RISC computer, I still have to manually writes each testbench (scripting language + VHDL), so next step will probably be ... automatic the testbench process to automatically generate tests!

But first I need a "testing language", something to describe tests in an abstract way.

Probably "Chisel/SCALA" already has something similar

Quote:
The Chisel Community Conference China 2021 (CCC2021) is planned for June 25, 2021 at the ShanghaiTech University. CCC is an annual gathering of Chisel community enthusiasts and technical exchange workshop. With the support of the Chisel development community, this conference will bring together designers and developers with hands-on experience in Chisel from home and abroad to share cutting-edge results and experiences from the open source community and industry.

Session topics include and are not limited to
  • CPU Core (recommended but not restricted to RISC-V) implementations
  • SoC implementations
  • Verification <------------
  • Simulation
  • Synthesis
  • Education
  • Experience sharing
  • Types of manuscripts.


unfortunately I cannot run Java/Scala on HPPA because there is no JRE for such architecture. I need to buy a second hand x86 mac-mini to study it.


Fri Apr 30, 2021 3:55 pm
Profile

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Tonight’s quandary: ANY1 Memory Keys

Every physical memory page in the system is going to have a 20-bit key associated with it. In my rinky-dink system the memory keys are stored in block ram in an FPGA directly mapped to main memory since the amount of physical memory is small. The block ram is accessed in parallel to main-memory and the key can be verified while main memory access takes place. This system was storage space efficient. Only a single copy of the key was stored per memory page. This scheme will not work for larger memory systems. An obvious adjustment to the system is to store the keys in a table in main memory. But that would then require two memory accesses for each main memory access, way too slow. So, to boost the performance a key-cache is required. Then on a memory access check for a key-cache miss in addition to other memory access requirements. So, the core needs a register to record the location of the memory key table in memory. Key-cache entries need to be invalidated when a new key is set for a memory page. There also needs to be logic in the core to read the key table and update the cache on a key-cache miss. Fortunately, key-cache misses should be infrequent because they occur for memory pages.

Quote:
But first I need a "testing language", something to describe tests in an abstract way.

Probably "Chisel/SCALA" already has something similar

I started looking into Chisel/SCALA myself and found that manipulating fields of bits (concatonating etc) is a bit painful. But maybe its just I don't have enough experience with it.

Automated testing is good for regression testing when one wants to detect what changed from an existing system. I think the first time a test is built it can not be setup automatically, it must be manually setup. The then tests have to be adapted to updates in the system.

_________________
Robert Finch http://www.finitron.ca


Wed May 05, 2021 5:26 am
Profile WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Added the STPTR instruction, then removed it, then added it back and finally removed it. It is somewhat of a questionable instruction due to complexity added for the memory faults that might occur. The idea behind STPTR is to implement most of a write barrier for garbage collection and make a write barrier visible as just a single instruction. The write barrier involves multiple writes to memory at calculated addresses. This could be done effectively for a small system by writing in parallel to block rams covering the entire address range of memory. And STPTR maybe an effective instruction for smaller systems. However, when using a larger memory system an issue is created with multiple writes taking place; one of the writes may generate a memory fault.
A write barrier for garbage collection looks something like:
Code:
 
   ; Milli-code routine for garbage collect write barrier.
   ; Usable with up to 64-bit memory systems.
   ;
GCWriteBarrier:
STO   a0,[a1]            ; store the pointer value to memory at a1
SRL   a1,a1,#LOG_CARDSIZE      ; compute card address
STB   x0,[a1]            ; clear byte in card memory
SRL   a1,a1,#LOG_CARDSIZE      ; repeat for each table level
STB   x0,[a1]
SRL   a1,a1,#LOG_CARDSIZE
STB   x0,[a1]
SRL   a1,a1,#LOG_CARDSIZE
STB   x0,[a1]
SRL   a1,a1,#LOG_CARDSIZE
STB   x0,[a1]
SRL   a1,a1,#LOG_CARDSIZE
STB   x0,[a1]
SRL   a1,a1,#LOG_CARDSIZE
STB   x0,[a1]
JMP   ra1

The above shows a routine that supports a full 64-bit address range with seven levels of card tables (LOG_CARDSIZE = 8). The routine will typically be shorter for a reduced memory range.
The store operations will consume far more time than jumping to and returning from a milli-code routine. The overhead of routine call and return is probably acceptable when more levels are present.
Updating card memory may also be done with in-line code in a small system where there may only be a single table level to update, meaning only two writes take place and its only a couple of instructions. However, in a larger memory system multiple table levels will be present. It is desirable to have this code in-lined for performance, but the resulting code density may not be good. Hence, the idea of use of the STPTR instruction.

One gotcha is that the write barrier updating card memory must be done from a mode higher than user mode so that physical addresses are updated rather than virtual ones. That means that supervisor code or higher is managing the garbage collected objects.

A second consideration is the card memory begins at physical address zero and extends as far as needed to cover the whole address range with cards. (eg 262kB for a 512MB memory) This means that address range cannot be used for other things.

_________________
Robert Finch http://www.finitron.ca


Thu May 06, 2021 4:12 am
Profile WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Studying the tracking of the source of data, a piece to Tomasulo’s algorithm for a superscalar design. I started looking at the reset of the source id’s that happens during a branch miss as I found some of that logic a little harder to follow. Each register has associated with it a source id indicating where data for the register is coming from. It turns out the source id require six write ports and nine read ports. Fortunately, the source id is small - four bits in this case. When the register file is small, 32 entry for example, a brute force implementation of the source id field works. However, with 4096 registers, having that many fifteen ported registers just plain does not work in a simple fashion. It is too big and complex for the toolset to work out for the given device. So, some means of simplifying the design must be use for that many registers.
The design may end up being super-pipelined to reduce the size of the design. By time-domain-multiplexing (TDM) register writes, the size of register files can be reduced substantially. For instance, the main register file may need four write ports resulting in quite a large register file since there are 4096 entries. A TDM multiplex on the write would reduce this to a single port. The cost is in terms of performance in clock cycles. But those clock cycles are at a multiple of the core’s primary clock.

_________________
Robert Finch http://www.finitron.ca


Sat May 08, 2021 2:21 am
Profile WWW

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
> six write ports and nine read ports

Yikes! That would strain any implementation scheme...


Sat May 08, 2021 8:10 am
Profile

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Having done some scheming on the register source that may allow it to be implemented,
I ran into the following piece of code in the register validation logic:
Code:
   if (branchmiss) begin
      for (n = 1; n < AREGS; n = n + 1)
         rf_v[n] <= `VAL;
         for (m = 0; m < QENTRIES; m = m + 1)
            if (n==livetargetList[m])
               rf_v[n] <= rf_v[n];
   end

Which says if there is a branch miss, then all the registers are automatically marked valid, except the registers that appear in the live target list.
This is another ‘Yikes’ and a good argument for not doing a superscalar version in the manner I intended. With 4096 registers, I believe the above translates into 4096x16 12 bit comparators. I am liking the idea of an eight-register machine a lot right now :)
The impression I have now gotten is that an out-of-order superscalar machine is not a suitable target for a machine with lots of registers. The amount of logic becomes unreasonable when there are thousands of registers involved. Each vector element cannot be handled as if it were a separate register.
The above piece of code could be “fixed” by breaking the branch miss into a multi-cycle operation and processing fewer compares per cycle.
The key may be to process as if the machine has 128 registers instead of 4096. 128 registers should be much more manageable but will result in a more complicated design. Somehow the 64 elements of a vector register must be presented and processed. Vector registers may be partially valid and partially invalid as some elements may have been processed while others are still waiting. We do not want the machine to stall until all 64 elements of a vector are valid. Processing may occur as long as all the elements needed for a calculation are valid. This is not necessarily all the elements of the vector. This puts the kibosh on register file valid signal, if that signal only applies to the entire vector register.
So how is the issue of the huge number of registers solved in a modern cpu that supports vector type operations A guess is the issue is circumvented by processing all vector elements in parallel in the ALU / FPU, assigning an ALU to each element and having enough for all elements of a vector. This would work for vectors with small numbers of elements, otherwise it would be a lot of hardware.

_________________
Robert Finch http://www.finitron.ca


Sun May 09, 2021 4:56 am
Profile WWW

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
I wonder if this structure looks a little like a CAM: an address is broadcast to all the words, and where it matches some action is taken. A CAM is a nice primitive which matches well to circuitry - but not so well, perhaps, to HDL description or to FPGA fabric.


Sun May 09, 2021 6:03 am
Profile

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
It does look a little bit like a CAM. It could also be implemented with livetargets as a 4096-bit wide register. Then 16 4096:1 multiplexers would be required. It is a lot of logic either way.
I have avoided the use of CAM’s in an FPGA as they are resource intensive and slow. There were some CAM’s in use for memory management until it was discovered they used a huge amount of resource compared to a simple mapping table.
Things are setup at the moment to use 128 regs instead of 4096, and instead of a single bit rf_v, regfile valid (rf_v) is a 64-bit vector now. 1 bit for each vector element. There is still a correspondence of a valid bit per vector element of the register file. For scalar registers, only one bit of rf_V is significant.
Code was modified slightly to make use of 64-bit wide rf_v.
Code:
   if (branchmiss) begin
      for (n = 1; n < AREGS; n = n + 1)
         rf_v[n] <= {64{`VAL}};  // Inplace of just ‘VAL
         for (m = 0; m < QENTRIES; m = m + 1)
            if (n==livetargetList[m])
               rf_v[n] <= rf_v[n]; //rv_n[n] was a single bit, now an array
   end

There is code like this which says, if the vector element number is zero or it is a scalar instruction, and it queued with a target register, then set all valid bits for the register file to invalid.
Code:
   if (!branchmiss)
      case(slotvd)
      4'b0000:   ;
      4'b0001:
         if (venz || !slot_isvec[0]) begin
            if (queuedOn[0]) begin
               if (slot_rfw[0]) begin
                  if (!Rd[0][7])
                     rf_v [Rd[0]] <= {64{`INV}};
               end
            end
         end

Then when data commits to the register file, the individual bits for vector elements are set valid
Code:
always @*
begin
   for (n = 1; n < AREGS; n = n + 1)
   begin
      if (commit0_v && n=={commit0_tgt[RBIT:0]} && !rf_v[n][commit0_tgt[11:6]])
         regIsValid[n][commit0_tgt[11:6]] = ((rf_source[ {commit0_tgt[RBIT:0]} ][`RBITS] == commit0_id) || (branchmiss && (iq_source[ rob_id[commit0_id] ])));
end


_________________
Robert Finch http://www.finitron.ca


Mon May 10, 2021 2:01 pm
Profile WWW
Display posts from previous:  Sort by  
Reply to topic   [ 159 posts ]  Go to page Previous  1, 2, 3, 4, 5, 6, 7 ... 11  Next

Who is online

Users browsing this forum: No registered users and 3 guests


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Search for:
Jump to:  
cron
Powered by phpBB® Forum Software © phpBB Group
Designed by ST Software