 RTF64 processor 

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Spent the day working on the rtf64 processor. It is somewhat similar to a PowerPC, with four condition code registers and two return address registers separate from the main 32-entry register file. It uses a fixed 32-bit instruction format, and most instructions can update one of the condition code registers. Branches use absolute addressing with a 24-bit range. I quickly put together a non-overlapped pipelined version starting from the CS01 code, and spent about three days documenting, generating over 130 pages of docs for the project.

_________________
Robert Finch http://www.finitron.ca


Wed Sep 23, 2020 7:36 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
That's a lot of docs! Quick question: how does a pair of return address registers get used?


Wed Sep 23, 2020 8:06 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
I was wondering that myself. They do not get used at the same time. It is sometimes handy to have a depth of two for register-based routine returns without having to stack addresses: the first-level routine returns using ra0, then the next-level routine returns using ra1. Having two registers also allows an alternate return path to be specified in a function. Why one register would not be sufficient I do not know; could not the one register simply be reloaded with new values? I have seen a couple of RISC designs that allow for two return address registers, so I gather there must be some need for them. For example, PowerPC allows a return using the count register as an alternate to the normal link register, and the RISCV ABI reserves two registers for linkage use as well.

There is a pair of return address registers associated with each integer register set. Yes, the machine sports multiple integer-register sets (32 of them). So, there is an array of 64 subroutine link registers.

With all the features I was afraid the implementation would be huge, but with most things implemented it turns out to be only about 8,500 LUTs for the non-overlapped pipeline version.

_________________
Robert Finch http://www.finitron.ca


Wed Sep 23, 2020 4:07 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
Interesting - reading up on RISC-V I see the second link register, when it is used, enables an inner layer of subroutine call - what they call millicode - used specifically for function prologues and epilogues, where load-multiple and store-multiple instructions would be handy but, for a list of reasons, are not part of the RISC-V ISA.



Wed Sep 23, 2020 5:59 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Added memory key checking logic to the core. Checking the memory key adds yet another two cycles to the memory access time. The check has to be done on the final physical address, after segmentation and paging, and it must complete before the memory cycle starts. So it ends up taking about five clock cycles before the bus cycle begins.
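For illustration, a minimal C model of the kind of key check being added is below. The key width, the granule size, and the rule that a key of zero acts as a master key are assumptions for the sketch only, not taken from the RTF64 docs; the point is simply that the check needs the final physical address as its input.
Code:
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical key memory: one protection key per granule.
 * Granule size, key width and the "key 0 is a master key" rule are
 * assumptions for this sketch, not the RTF64 design. */
#define GRANULE_SHIFT 14
#define NUM_GRANULES  4096

static uint8_t key_mem[NUM_GRANULES];

/* The check can only run once the final physical address is known,
 * i.e. after segmentation and paging - hence the extra cycles. */
bool key_check(uint64_t phys_addr, uint8_t task_key)
{
    uint8_t region_key = key_mem[(phys_addr >> GRANULE_SHIFT) % NUM_GRANULES];
    return task_key == 0 || task_key == region_key;
}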

I also added a telescopic card memory, but I then had to alter it because it consumed too many block ram resources. I had set up a decent resolution of 128-byte cards, but I see in the textbook that card tables may be composed of cards up to 512 bytes in size, so it is possible to get by with ¼ of the memory requirements. The text describes how card memories are updated using software, and there were issues requiring atomic updates of the card table. I decided to try implementing it all in hardware so there should be no atomic update issues when I go to write a garbage collector.
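For reference, the software card marking the textbook describes usually amounts to a one-line write barrier like the sketch below. The 512-byte card size is the one mentioned above; the heap layout, names, and byte-wide marks are generic illustration, not the RTF64 hardware mechanism.
Code:
#include <stdint.h>
#include <string.h>

#define CARD_SHIFT 9                 /* 512-byte cards, the coarser size above */
#define HEAP_SIZE  (1u << 20)

static uint8_t heap[HEAP_SIZE];
static uint8_t card_table[HEAP_SIZE >> CARD_SHIFT];

/* Generic software write barrier: after storing a pointer into the heap,
 * mark the card covering the stored-to address so the GC rescans it.
 * Byte-wide marks avoid a read-modify-write, which is where the atomic
 * update issues with packed bit-per-card tables come from. */
static inline void write_barrier(void *slot, void *new_value)
{
    memcpy(slot, &new_value, sizeof new_value);   /* the actual store   */
    size_t offset = (uint8_t *)slot - heap;
    card_table[offset >> CARD_SHIFT] = 1;         /* dirty the card     */
}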

With all the features (paging ram, key memory, and card memory) about 2/3 of the block rams in the device are used. I may have to remove a feature later.

_________________
Robert Finch http://www.finitron.ca


Thu Sep 24, 2020 4:07 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Added bitfield ops and shift ops, which I missed in the initial coding. I also added a physical memory attribute (PMA) checker. The PMA check aborts an instruction fetch one clock cycle after it has begun if the check fails, because the check needs the physical address and it is not available until it is presented on the bus. PMA checking for loads and stores takes place at the same time the memory keys are checked - again after the physical address is present on the bus, but in that case before the bus cycle has started, since for loads and stores the address is available a cycle sooner.
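A rough C model of what the PMA check does is below. The regions and attribute bits are placeholders for the sketch, not the actual RTF64 memory map; the point is that the lookup keys off the physical address.
Code:
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical PMA map: a few physical address ranges with attribute flags.
 * Real entries (boot ROM, DRAM, I/O) would come from the SoC design. */
typedef struct {
    uint64_t base, limit;
    bool     executable, readable, writable;
} pma_region_t;

static const pma_region_t pma_map[] = {
    { 0xFFFC0000, 0xFFFFFFFF, true,  true,  false }, /* boot ROM (placeholder) */
    { 0x00000000, 0x1FFFFFFF, true,  true,  true  }, /* DRAM     (placeholder) */
    { 0xFFD00000, 0xFFDFFFFF, false, true,  true  }, /* I/O      (placeholder) */
};

/* Fetch-side check: needs the *physical* address, so it can only run after
 * translation - which is why a failing fetch is aborted a cycle after it
 * has begun rather than being blocked before it starts. */
bool pma_fetch_ok(uint64_t phys_addr)
{
    for (unsigned i = 0; i < sizeof pma_map / sizeof pma_map[0]; i++)
        if (phys_addr >= pma_map[i].base && phys_addr <= pma_map[i].limit)
            return pma_map[i].executable;
    return false;   /* unmapped physical space: abort the fetch */
}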

I've been hesitant to work on an assembler until the ISA is more stable. Maybe it is time to start working on it.

_________________
Robert Finch http://www.finitron.ca


Fri Sep 25, 2020 4:05 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
I am a fan of trace. I have added simple trace facilities in the form of pc history records to cores in the past. This time I got a bit more sophisticated, having read up on trace some. An 8kB fifo is used to record the history. https://github.com/riscv/riscv-trace-sp ... e-spec.pdf

Added a trace feature to the core. It has minimal intelligence to it, but it does use branch compression to economize on the ram storage for the trace: branch taken / not-taken status is recorded as single bits in a history record, except when there is another type of jump, in which case the entire address is stored. The current address is also recorded about every 200 branches (four history records) regardless, in case things get out of sync.
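The compression scheme is roughly the following, expressed as a software sketch. The 64-bit history word, the record layout, and the resync interval of four records are illustrative rather than the core's actual record format.
Code:
#include <stdint.h>
#include <stdio.h>

static uint64_t history;      /* packed taken/not-taken bits              */
static unsigned nbits;        /* bits collected in the current record     */
static unsigned records;      /* history records since the last address   */

static void emit_history(void)
{
    printf("HIST %016llx\n", (unsigned long long)history);
    history = 0;
    nbits = 0;
    records++;
}

static void emit_address(uint64_t pc)
{
    printf("ADDR %016llx\n", (unsigned long long)pc);
    records = 0;
}

/* Conditional branches contribute one bit to the packed history record. */
void trace_cond_branch(uint64_t pc, int taken)
{
    history |= (uint64_t)(taken != 0) << nbits;
    if (++nbits == 64) {
        emit_history();
        if (records >= 4)         /* resync with a full address every ~4 records */
            emit_address(pc);
    }
}

/* Other jumps cannot be inferred from a single bit, so store the whole target. */
void trace_indirect_jump(uint64_t target)
{
    if (nbits)
        emit_history();           /* flush the partial history first */
    emit_address(target);
}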
I managed to get enough of the assembler updated for rtf64 that the Fibonacci test program could be run in simulation. Lots of bugs yet.
Code:
; Fibonacci calculator RTF64 asm
; r1 in the end will hold the Nth Fibonacci number
  code 24 bits
  org     $FFFC0100

start:
  ldi     $t2,#$FD
  ldi     $t2,#$01      ; x = 1
  sto     $t2,$00

  ldi     $t3,#$10      ; calculates the 16th Fibonacci number (change here if you want to calculate another number)
  or      $t1,$t3,$x0   ; transfer y register to accumulator
  sub     $t3,$t3,#3    ; handles the algorithm iteration counting

  ldi     $t1,#2        ; a = 2
  sto     $t1,$04       ; stores a

loop:
  ldo     $t2,$04       ; x = a
  add     $t1,$t1,$t2   ; a += x
  sto     $t1,$04       ; stores a
  sto     $t2,$00       ; stores x
  sub.    $t3,$t3,#1    ; y -= 1
  bne     cr0,loop      ; jumps back to loop while y hasn't counted down to zero
  nop
  nop

_________________
Robert Finch http://www.finitron.ca


Sat Sep 26, 2020 4:46 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Added some R1 instructions to the core (com, not, neg, cntlz, cntlo, cntpop, tst) that were missed initially. I did a lot of work on the assembler today. The assembled code is looking better. The Fibonacci test program does not work quite right yet. It keeps doubling the value instead of generating the Fibonacci series.

_________________
Robert Finch http://www.finitron.ca


Sun Sep 27, 2020 3:28 am

Joined: Mon Oct 07, 2019 2:41 am
Posts: 585
Ahh... the Bugs Bunny error again.
While not part of the design, I like the idea of a B - A (CAD) instruction. A lot of simple
compilers push everything, and the subtract sense is reversed.
LD B S++ / NEG B / ADD A B could get replaced by CAD A S++
Ben.


Sun Sep 27, 2020 5:31 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
The external data bus size for rtf64 has been modified to be configurable as 32, 64, or 128 bits. When the bus size is 64 or 128 bits there is a micro-cache holding 2 or 4 instructions, respectively, so an external fetch does not have to occur for every instruction. The current test system is built around a cpu using 128-bit fetches and stores to make the best use of the dram. Well, the micro-cache got axed and replaced with an L1 cache.

Found two errors in the Fibonacci. One was in the test program itself: values were being written to overlapping memory regions. The second was forgetting to code the 2r form of the add instruction.

Latest addition to the instruction set is the LEA instruction. It might seem redundant given that an add instruction can do the same thing, but there are a couple of features that make LEA appealing. The indexed form of addressing scales the index register, and LEA copies this; without LEA another instruction or two would be required to compute the index scaling. LEA also implicitly means an address pointer is being manipulated. It might be convenient to track this in a future version of the core: some software cares about where pointers are, and a future version may have a pointer status associated with a register that would be set by an LEA operation. I also ran into a use for LEA with the CACHE command, which needs the final effective address in Rs1, so it might be necessary to calculate that address first.
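In terms of what it computes, LEA with the indexed form is just the scaled-index sum below (generic field names, not the RTF64 encoding); the win is doing it in one instruction instead of a shift plus an add, and having the result land in a register where something like the CACHE command can use it.
Code:
#include <stdint.h>

/* Effective address for the indexed form: base + (index << scale) + disp.
 * LEA writes this value to a register instead of starting a memory cycle. */
static inline uint64_t lea(uint64_t base, uint64_t index,
                           unsigned scale, int64_t disp)
{
    return base + (index << scale) + (uint64_t)disp;
}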

The I$ is physically indexed, so the CACHE command must travel all the way through the memory system to physical address generation to know which line to invalidate.

I am mulling over the idea of an ‘add-to-next’ instruction operation, Apollo mission style. What is needed is the ability to parameterize the memory access size from an external function: the code needs to load or store a byte, wyde, tetra, or octa sized data item, and I do not want a series of branches to set the access size. It might be better if a math operation could be performed on an instruction to do this.
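To make the idea concrete, here is a software sketch of the branch-free alternative being mulled over: arithmetically folding a size code into the following instruction word rather than selecting among four load variants with branches. The 2-bit size field at bit 30 and the assumption that the assembled load encodes the byte form are invented for this sketch only.
Code:
#include <stdint.h>

/* Hypothetical 32-bit instruction word with a 2-bit memory-size field
 * at bit 30: 0 = byte, 1 = wyde, 2 = tetra, 3 = octa.  Field position
 * and encoding are made up for illustration. */
#define SIZE_SHIFT 30

/* "Add-to-next" modelled in software: the size code from the previous
 * instruction is added into the following instruction word, so one
 * generic load serves all four access sizes with no branching.
 * Assumes the assembled load encodes the byte form (size field = 0). */
static inline uint32_t add_to_next(uint32_t next_insn, uint32_t size_code)
{
    return next_insn + ((size_code & 3u) << SIZE_SHIFT);
}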

_________________
Robert Finch http://www.finitron.ca


Mon Sep 28, 2020 4:41 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
I added stack-oriented subroutine call and return instructions. While they do violate the load/store paradigm a little bit, the instructions are code dense.

I did a comparison of RTF64 to RISCV for the same code ported over to RTF64. RTF64 uses 464 fewer bytes (116 fewer instructions), or is about 2.5% smaller - not enough of a difference to worry about. Compare and test instructions in RTF64 take up an additional 199 instructions compared to RISCV. This could be reduced considerably if I were willing to accept a condition code register as a return value from functions, since many of the test instructions are done on return values. It is hard to beat the compare-and-branch all-in-one instruction. RTF64 beats out RISCV in a couple of ways. One is that the encoding of immediate values is slightly more efficient in RTF64, and the presence of an indexed address mode allows slightly more efficient code when an address displacement cannot be encoded in a 12-bit field; indexed address mode gets used for about 3% of instructions. The second place RTF64 wins is with return instructions, which can encode a stack adjustment; it frequently takes one less instruction at the end of a routine in RTF64 compared to RISCV.
I made several measurements on different code; depending on the code, either RTF64 or RISCV was smaller, but not significantly so.
RTF64:
Code:
number of bytes: 19344.000000
number of instructions: 4836

CS01 (RISCV):
Code:
number of bytes: 19808.000000
number of instructions: 4952
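For what it is worth, the two sets of figures work out as below; both ISAs here are using fixed 4-byte instructions, so the byte and instruction ratios agree.
Code:
#include <stdio.h>

int main(void)
{
    /* figures quoted above for the same code on both targets */
    double rtf64_bytes = 19344, riscv_bytes = 19808;
    int    rtf64_insns = 4836,  riscv_insns = 4952;

    printf("byte difference: %.0f (%.1f%% smaller)\n",
           riscv_bytes - rtf64_bytes,
           100.0 * (riscv_bytes - rtf64_bytes) / riscv_bytes);
    printf("instruction difference: %d\n", riscv_insns - rtf64_insns);
    return 0;
}
/* prints: byte difference: 464 (2.3% smaller)
 *         instruction difference: 116 */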
 

_________________
Robert Finch http://www.finitron.ca


Tue Sep 29, 2020 3:53 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
The assembler has been modified to support source code libraries. It can “link” source code from a library file and strip out the unreferenced code and data. The ‘C’ ctype library has been hand-coded in assembler. It uses table lookup and bit masking for most of the functions. Functions look like:
Code:
public _islower:
  asl     $a0,$a0,#1
  ldwu    $a0,__ctyptbl[$a0]
  and.    $a0,$a0,#_LO
  rtl
endpublic

They are very short with no branches.
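A C rendition of the same table-lookup approach is below for comparison. The table contents and the _LO bit value are placeholders; the real __ctyptbl is whatever the library build generates.
Code:
#include <stdint.h>
#include <stdio.h>

#define _LO 0x0002   /* placeholder lowercase flag bit */

/* Toy classification table: 16 bits per character, as the ldwu implies.
 * Only 'a'..'z' get the _LO bit here so the example is self-contained. */
static uint16_t __ctyptbl[256];

static void init_ctyptbl(void)
{
    for (int c = 'a'; c <= 'z'; c++)
        __ctyptbl[c] = _LO;
}

/* Same shape as the assembly version: one table load, one mask, no branches. */
static int islower_(int c)
{
    return __ctyptbl[(unsigned char)c] & _LO;
}

int main(void)
{
    init_ctyptbl();
    printf("%d %d\n", islower_('g') != 0, islower_('G') != 0);   /* prints 1 0 */
    return 0;
}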

_________________
Robert Finch http://www.finitron.ca


Wed Sep 30, 2020 4:56 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
I have been working on a TLB as an alternative to the page mapping ram. The page mapping ram uses about 64 block rams and only supports 32 maps; the TLB uses eight block rams and supports a full 256 address spaces. It is a 1024-entry TLB (the smallest size workable with block ram) made four-way associative. Pages are 16kB in size.
For the address tables a simple arrangement is in use that is not very efficient in terms of memory usage, but it should work. There are 256 address spaces of 4096 pages max, which equates to 2MB of memory required to represent every possible mapping; that is where the table is inefficient, since being able to map every possible mapping is generally not required. Each entry is a 16-bit physical page number. The virtual page number, which is only 12 bits, plus the 8-bit ASID is used to index into the table. There is only a single level of table, so handling a TLB miss is relatively straightforward: there is only one load operation and one branch in the TLB miss handler, which is 22 instructions long.
Code:
  align   16 
TLBIRQ:
  csrrw   $t0,#CSR_BADADDR,$x0    ; get the bad address
  lsr     $t0,$t0,#LOG_PGSZ       ; convert to page number
  and     $t0,$t0,#4095           ; max number of virtual pages-1 (safety mask)
  asl     $t0,$t0,#1              ; convert to index
  csrrw   $t1,#CSR_ASID,$x0       ; get current ASID
  asl     $t1,$t1,#13             ; shift into position
  csrrw   $t2,#CSR_PTA,$x0        ; $t2 = page table address
  or      $t2,$t2,$t0             ; $t2 = pta + offset from virtual address
  ldw.    $t2,[$t2+$t1]           ; fetch physical page number
  bmi     .notAssigned            ; valid page number? (>0)
  asl     $t1,$t1,#43             ; $t1 = ASID in bits 56 to 63
  asl     $t0,$t0,#31             ; $t0 = virtual page number in bits 32 to 47
  or      $t2,$t2,$t1,$t0         ; $t2 = value to enter into TLB
  lsr     $t0,$t0,#32             ; $t0 = virtual page number in bits 0 to 15
  and     $t0,$t0,#$3FF           ; make an TLB index
  csrrw   $t1,#CSR_LFSR,$x0       ; get a random value
  and     $t1,$t1,#$C00           ; mask to way position
  or      $t1,$t1,$t0
  ldi     $t0,#1
  dep     $t1,$t0,#63,#0          ; set bit 63 for write
  tlbrw   $x0,$t1,$t2             ; update the TLB (should clear interrupt)
  rti
  ; Here the app has requested access to a page that isn't mapped.
  ; Trigger the app's exception handler.

.notAssigned:

The issue worked on momentarily is what to do when the page map is not a valid one. It seems to the author that the thing to do is trigger the app’s exception handler logic. This is not as easy as it sounds: it requires returning to the exception handler rather than doing a direct return, so the return address must be set properly. The implication is that there is an exception handler address; well, where is it? In a register of some sort. Exception handling is quite complex and the author is tempted to leave it to the software gurus; the author is more interested in the hardware. But he must prove it works.

_________________
Robert Finch http://www.finitron.ca


Thu Oct 01, 2020 3:01 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
October fools day, work was started on the compiler. It is virtually the same as it was for nvio3, which also used compare results registers. One difference is the size of the registers: nvio3 was 80-bit, and rtf64 is only 64-bit. Compiler output is looking pretty good. The rtf64 test system is being built; it takes about 2 hours to build.

I measured the CPI for the CPU in simulation; it worked out to an average CPI of about 14. :{ Then I made a cycle-by-cycle list of what is going on. It takes about five clock cycles just to fetch the instruction, and another three for register fetches. Some instructions can finish in about 10 clocks, but a memory access requires a pile more clock cycles. There are lots of register stages in the design, so the hope is that it will be able to clock at a high frequency. A CPI of 14 does not seem to compare well with the roughly 3 CPI of the superscalar, but it might if it can be clocked five times as fast; the superscalar core could only be pushed to about 30MHz in the FPGA. The hope is to break the 100 MHz clock barrier with this design.
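Taking the numbers above at face value, the throughput comparison works out roughly as follows; this is just the arithmetic, the achievable clock rate being the open question.
Code:
#include <stdio.h>

int main(void)
{
    /* figures from the post: ~14 CPI for this core, versus the earlier
     * superscalar at ~3 CPI and ~30 MHz */
    double mips_superscalar = 30.0e6  / 3.0  / 1e6;   /* ~10.0 MIPS */
    double mips_at_100mhz   = 100.0e6 / 14.0 / 1e6;   /* ~7.1 MIPS  */
    double mips_at_150mhz   = 150.0e6 / 14.0 / 1e6;   /* ~10.7 MIPS (5x 30 MHz) */

    printf("%.1f %.1f %.1f\n", mips_superscalar, mips_at_100mhz, mips_at_150mhz);
    return 0;
}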

The instruction set is a bit redundant because there are two approaches to branches: one is to use a compare operation and then branch based on the result, the second is to use a set operation and branch. When I ported the boot rom, which is hand-coded assembly, I used compare-then-branch sequences, and there are almost no set instructions in thousands of lines of code. Then I looked at the compiled code, and there are almost no compare operations because it uses set operations instead. Both sets and compares are not needed; one or the other would work just fine, but I can’t make up my mind.

I put some effort into having the compare and set operations merge results into the target CR register, but it turns out to be rarely used. The compiler does not use it and I am not sure I want to add it to the compiler. It is a feature I think I am going to remove.

_________________
Robert Finch http://www.finitron.ca


Fri Oct 02, 2020 5:18 am

Joined: Mon Oct 07, 2019 2:41 am
Posts: 585
I like the old-style memory; you knew just how fast your 6502 was.
An if statement always has a branch component. Can you have a prefetch instruction
to fetch the new ip address into a small tagged cache?
if a & 3 {
  prefetch offset tag
  load a
  load 3
  and
  beq tag
  ...
This is valid for simple logic. if *i++=foobar() is going to be slow no matter what you do.


Fri Oct 02, 2020 4:30 pm